
Semantic MediaWiki showing multiple unused instances of properties
Closed, Declined (Public)

Description

Author: vicente.aguilar

Description:
On some of our wikis, on Special:Properties some properties appear twice (see attachment), once with "0 uses" and once with "X uses" (X>0, the real usage data). When clicking on the Property:XYZ link they both go to the same article with the property definition, whether it has been defined or not (red link). Depending on which one of the two occurrences ("0 uses" or "X uses") appears first on Special:Properties, we don't get any values when #asking for this property. This happens for properties of any type.

Sometimes a full refresh (SMW_refreshData -ftpv, then another -v) fixes the issue, sometimes it doesn't.

This is difficult to reproduce: we have several wikis with the same versions (OS, MW, SMW), the same configuration and (partially) the same content, and the issue can appear on one but not the others. Content is exported/imported or synced using Special:Push; there is no copy-paste or manual sync.

ATM we're using MW 1.19.3, SMW 1.8.0.4 with CentOS 6.2 (PHP 5.3.3, MySQL 5.1.61).


Version: master
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48707

Details

Reference
bz48706

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 1:23 AM
bzimport set Reference to bz48706.
bzimport added a subscriber: Unknown Object (MLST).

vicente.aguilar wrote:

Duplicated properties screenshot

Attached:

Duplicated_Properties.png (42×257 px, 2 KB)

It happens when the edited article is accessed under two different names considered the same due to database collation and capitalisation.

vicente.aguilar wrote:

Well, that could be it because one of the things I noticed while trying to diagnose this issue and bug 48707 is that we have different charsets and collations on some of our wikis (created at different times with different MW versions, but if the default changes shouldn't it be updated when running update.php? Oh well, that's another issue anyway.)

I'll try to unify all our charset/collations when I get the time, it's something I wanted to look into anyway.

In any case, if this really was the origin of the duplicates: why does the issue sometimes not go away after a SMW_refreshData -ftpv? If I got your reasoning right, the dup gets to the DB the moment someone accesses an article with a different - but equal according to the collation - name. But right after a refresh, shouldn't everything be OK then?

the dup gets to the DB the moment someone accesses an article with a different - but equal according to the collation - name.

Not just accesses but edits.

SMW_refreshData.php -ftpv is wrong. It purges type and property pages but after that you should run SMW_refreshData.php -fv to rebuild the pages themselves.

Unknown Object (User) added a comment. May 23 2013, 2:14 PM

(In reply to comment #2)

It happens when the edited article is accessed under two different names considered the same due to database collation and capitalisation.

Are you talking about redirects? Normally, an article has one specific name so how can it be that I can access an article under two different names unless it is a redirect?

vicente.aguilar wrote:

(In reply to comment #4)

Not just accesses but edits.

Well, ok, but my point remains: right after a full refresh, shouldn't the DB be clean of dupes?

SMW_refreshData.php -ftpv is wrong. It purges type and property pages but after that you should run SMW_refreshData.php -fv to rebuild the pages themselves.

Yes, that's what we do, -ftpv and then -v.

vicente.aguilar wrote:

(In reply to comment #5)

Are you talking about redirects? Normally, an article has one specific name so how can it be that I can access an article under two different names unless it is a redirect?

No, he means DB collation, the way the DB (not MediaWiki, but MySQL) compares two strings. It has to do with different charsets and different languages, e.g. considering upper and lower case equal or not, ignoring accents and tildes, etc. All this is configured on a per-table and per-field basis.

http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html
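
To make the comparison behaviour above concrete, here is a minimal Python sketch that approximates an accent- and case-insensitive comparison. It is only an illustration: the helper names `fold` and `collation_equal` are made up for this example, and real MySQL collations follow their own rule tables, which may differ from this approximation.

```python
import unicodedata


def fold(text: str) -> str:
    """Build a rough accent- and case-insensitive comparison key.

    Assumption: this only approximates what a *_ci collation does;
    it is not MySQL's actual algorithm.
    """
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (umlauts, tildes, accents, ...).
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Ignore case.
    return without_marks.casefold()


def collation_equal(a: str, b: str) -> bool:
    """Compare two strings the way the approximated collation would."""
    return fold(a) == fold(b)


for name in ["Bar", "bär", "BAR", "Müller", "Muller"]:
    print(name, "->", fold(name))

print("byte-exact equal:", "Müller" == "Muller")
print("collation equal: ", collation_equal("Müller", "Muller"))
```

Under such a comparison the database treats the two spellings as the same key, even though any code that compares the raw strings still sees two different names.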

Unknown Object (User) added a comment. May 23 2013, 2:41 PM

(In reply to comment #7)

(In reply to comment #5)

Are you talking about redirects? Normally, an article has one specific name so how can it be that I can access an article under two different names unless it is a redirect?

http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html

So, according to the link above, would a [[Has property::Muffler]] annotation stored under a latin1_swedish_ci collation be understood as [[Has property::Müller]] after switching to a latin1_german2_ci collation?

Which would lead to [[Has property::Muffler]] and [[Has property::Müller]] for the same article?

vicente.aguilar wrote:

(In reply to comment #8)

So, according to the link above, would a [[Has property::Muffler]] annotation stored under a latin1_swedish_ci collation be understood as [[Has property::Müller]] after switching to a latin1_german2_ci collation?

That 1st example is about sorting, not comparison.

But yes, if MW/SMW is not doing any more checks and is relying only on the DB (which I can't tell, I haven't looked at the code that closely), depending on the collation Bar == bär == BAR.

Is that really the cause of this issue? I don't know. But my wikis do have a different charset/collation configuration so... maybe.

No, I don't think so. It's the collation applied to the article title that causes the duplication.

If you have an article called Müller, set any property [[has property::value]] on it, and then open http://your.site/wiki/Muller?action=edit, and your DB collation treats u and ü as the same (that is, it will not allow creating Muller if Müller already exists), then there will be two of each property, one for Müller and one for Muller, and both pages (one as a red link) will appear in any SMW query for those properties ({{#ask:[[has property::value]]|format=list}} will give Müller, Muller).

You don't even need to change DB collation.

Similar artifacts will be observable in MW logs: they will show the name under which the page was accessed (red if different), not the name under which it is stored.
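
A toy model of the scenario described above, assuming (this is not confirmed by the thread or by the SMW code) that the page lookup goes through a column with an accent- and case-insensitive collation while the semantic data is keyed byte-for-byte on the requested title. The table names `toy_page` and `toy_subject`, the `ai_ci` collation and the `save` helper are hypothetical; the sketch uses SQLite from Python, not the real MediaWiki or SMW schema.

```python
import sqlite3
import unicodedata


def fold(text):
    # Rough accent- and case-insensitive key (an approximation only).
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def ai_ci(a, b):
    # Collation callback: negative, zero or positive, like a comparator.
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("ai_ci", ai_ci)

# Hypothetical page table: titles compared with the insensitive collation.
conn.execute("CREATE TABLE toy_page (title TEXT COLLATE ai_ci UNIQUE)")
# Hypothetical semantic-subject table: titles compared byte-for-byte.
conn.execute("CREATE TABLE toy_subject (title TEXT UNIQUE)")


def save(requested_title):
    """Simulate saving an edit made under the requested title."""
    row = conn.execute(
        "SELECT title FROM toy_page WHERE title = ?", (requested_title,)
    ).fetchone()
    if row is None:
        conn.execute("INSERT INTO toy_page VALUES (?)", (requested_title,))
    # Assumption: the semantic data is stored under the requested name,
    # not under the name the page row was found with.
    conn.execute(
        "INSERT OR IGNORE INTO toy_subject VALUES (?)", (requested_title,)
    )


save("Müller")  # creates the page and its subject entry
save("Muller")  # finds the existing page, but adds a second subject entry

pages = [r[0] for r in conn.execute("SELECT title FROM toy_page")]
subjects = sorted(r[0] for r in conn.execute("SELECT title FROM toy_subject"))
print("pages:   ", pages)
print("subjects:", subjects)
```

In this model one page ends up with two subject entries, which would match the symptom of a red-linked twin showing up in query results. Whether SMW actually behaves this way would need to be checked against its code.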

Created attachment 12770
Another example of duplicated properties


Attached:

duplicated_properties.png (203×394 px, 53 KB)

I'm still seeing this issue on all the current master releases of SMW. I have duplication for a good number of properties as well.

http://wikiapiary.com/wiki/Special:Properties

Attaching a screenshot of the duplication for Has bot segment. If I can help with debugging I would be happy to do so.

(Sorry for two entries, didn't know I could put that with the image attachment.)

Just noting that this duplication does not get counted when SMWInfo is used to ask for properties. For example, on Special:Properties it shows 210 properties:

http://wikiapiary.com/w/index.php?title=Special:Properties&limit=500&offset=0

SMWInfo shows 169 and 160:

http://wikiapiary.com/w/api.php?action=smwinfo

Aklapper subscribed.

The Semantic MediaWiki developers requested in https://phabricator.wikimedia.org/T64114 to move their task tracking to https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues and to close remaining tasks in Wikimedia Phabricator. If you still face the problem reported in this task in a supported version of SMW, please feel free to transfer your report to https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues . We are sorry for the inconvenience.