$wgUseCategoryBrowser generates many dupes
Closed, Declined (Public)

Description

Author: srainwater

Description:
I turned on $wgUseCategoryBrowser and discovered that it displays a very large number of duplicate entries. I'm using this on a large wiki (Camera-Wiki.org) with several thousand pages and hundreds of categories. In some cases it displays the top-level category entry as many as 10 or 20 times, and many categories are displayed 3 to 5 times.

Seems like a simple fix to add code to filter out duplicates. If someone can point me to the appropriate piece of code I'd be happy to provide a patch.

Here's a typical display from the bottom of one page in our wiki:

Root category
Root category
Root category
Root category
Root category
Root category
Root category
Root category
Root category > Cameras
Root category > Cameras
Root category > Cameras > Cameras by first letter > B
Root category > Cameras > Cameras by first letter > C
Root category > Cameras > Medium format > 127 film
Root category > Companies > Camera makers
Root category > Countries > Italy
Root category > Countries > Italy > Bencini
Root category > Imaging media > Film > Film formats
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Special categories
Root category > Templates > Wiki > Flickr image
Root category > Templates > Wiki > Flickr image
Root category > Templates > Wiki > Flickr image
Root category > Templates > Wiki > Flickr image
Root category > Templates > Wiki > Hidden categories > Image by AWCam
Root category > Templates > Wiki > Hidden categories > Image by Dirk HR Spennemann
Root category > Templates > Wiki > Hidden categories > Image by Rick Soloway
Root category > Templates > Wiki > Hidden categories > Image by jgs4309976


Version: 1.18.x
Severity: normal

Details

Reference
bz33614

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:04 AM
bzimport set Reference to bz33614.

srainwater wrote:

Found a fairly trivial fix for this. In Skin.php, I wrapped the explode() call in array_unique(). The line was:

$tempout = explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) );

I changed it to:

$tempout = array_unique( explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ) );

The only drawback now is that it still displays hidden categories, which doesn't seem right. Probably a separate bug however.

Here's the current output from the same page as shown in the initial comment:

Root category
Root category > Cameras
Root category > Cameras > Cameras by first letter > B
Root category > Cameras > Cameras by first letter > C
Root category > Cameras > Medium format > 127 film
Root category > Companies > Camera makers
Root category > Countries > Italy
Root category > Countries > Italy > Bencini
Root category > Imaging media > Film > Film formats
Root category > Special categories
Root category > Templates > Wiki > Flickr image
Root category > Templates > Wiki > Hidden categories > Image by AWCam
Root category > Templates > Wiki > Hidden categories > Image by Dirk HR Spennemann
Root category > Templates > Wiki > Hidden categories > Image by Rick Soloway
Root category > Templates > Wiki > Hidden categories > Image by jgs4309976
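
For anyone who wants to see the effect without patching Skin.php, here is a minimal standalone sketch of the same idea. The $rendered string is a made-up stand-in for what drawCategoryBrowser() returns (the real output contains HTML links, not plain text), and the array_values() call is my addition to renumber the keys that array_unique() leaves behind.

```php
<?php
// Hypothetical stand-in for the newline-separated string returned by
// drawCategoryBrowser(); the real string contains HTML links.
$rendered = implode( "\n", array(
    'Root category',
    'Root category',
    'Root category > Cameras',
    'Root category > Cameras',
    'Root category > Special categories',
    'Root category > Special categories',
) );

// array_unique() keeps the first occurrence of each line and preserves
// the original keys, so array_values() renumbers the result from zero.
$tempout = array_values( array_unique( explode( "\n", $rendered ) ) );

print_r( $tempout );
```

Note that this only removes lines that are byte-for-byte identical after rendering.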

srainwater wrote:

Upon further thought, there's still redundancy here. For example:

If the page is in:

A > B > C > D

There's really no point in also displaying these lines:

A
A > B
A > B > C

As they're all included in D. They're not really paths to the given page anyway. Really what's wanted is a list of unique paths through the hierarchy to the given page. There's no need to provide additional paths to each point along the way. If that makes sense.
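
A rough sketch of that idea, operating on plain-text paths: drop every line that is only the leading part of a longer line. The function name dropPrefixPaths() and the ' > ' separator are assumptions for illustration; nothing like this exists in core.

```php
<?php
/**
 * Keep only paths that are not the leading part of another, longer path.
 * The ' > ' separator is assumed from the plain-text output in this report.
 */
function dropPrefixPaths( array $paths, $separator = ' > ' ) {
    $paths = array_values( array_unique( $paths ) );
    $result = array();
    foreach ( $paths as $candidate ) {
        $isPrefix = false;
        foreach ( $paths as $other ) {
            // Require the separator after the candidate so that
            // "A > B" is not treated as a prefix of "A > Bx".
            if ( strpos( $other, $candidate . $separator ) === 0 ) {
                $isPrefix = true;
                break;
            }
        }
        if ( !$isPrefix ) {
            $result[] = $candidate;
        }
    }
    return $result;
}

$leaves = dropPrefixPaths( array( 'A', 'A > B', 'A > B > C', 'A > B > C > D', 'A > E' ) );
print_r( $leaves );
```

One caveat: if a page is tagged with both a category and one of its ancestors, this would also hide the ancestor's own line, which may or may not be what is wanted.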

Adding patch keyword for solution in comment #1

If the page is in:

A > B > C > D

Well, the whole idea of the category browser is to put the article in category D and skip A, B, C :-b

array_unique() works there, but it only filters at display time. We should be able to filter before rendering, i.e. when building the category tree.

I've also noticed that the hidden categories are displayed regardless of the status of the Show Hidden Categories checkbox in user preferences. We need a way to actually hide the hidden cats.

ken wrote:

Thanks for this bug report! I thought for sure I had something in my wiki configured incorrectly.

I ended up hacking my 1.17 wiki to fix this. I replaced this line from includes/Skin.php:

$tempout = array_unique(explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ));

with this:

if ( $wgUser->getBoolOption( 'showhiddencats' ) ) {
    $tempout = array_unique( explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ) );
} else {
    $tempout = preg_grep( "/Hidden categories/", array_unique( explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ) ), PREG_GREP_INVERT );
}

ken wrote:

Sorry, I pasted the wrong line for the "original" line. The original line is this (it does not have array_unique in it):

$tempout = explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) );
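
To illustrate what the preg_grep() call with PREG_GREP_INVERT does, here is a standalone sketch with made-up lines. The $hiddenName variable is hypothetical, and wrapping it in preg_quote() is a precaution I added in case the name contains regex metacharacters.

```php
<?php
// Made-up breadcrumb lines standing in for the exploded browser output.
$tempout = array(
    'Root category > Cameras',
    'Root category > Templates > Wiki > Hidden categories > Image by AWCam',
    'Root category > Countries > Italy',
);

// Hypothetical: the name of the category that groups hidden categories.
$hiddenName = 'Hidden categories';
$pattern = '/' . preg_quote( $hiddenName, '/' ) . '/';

// PREG_GREP_INVERT returns the entries that do NOT match the pattern.
// preg_grep() preserves keys, so array_values() renumbers them.
$visible = array_values( preg_grep( $pattern, $tempout, PREG_GREP_INVERT ) );

print_r( $visible );
```

This only works if every hidden category sits below a category with that exact name.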

(In reply to comment #6)

Thanks for this bug report! I thought for sure I had something in my wiki
configured incorrectly.

I ended up hacking my 1.17 wiki to fix this. I replaced this line from
includes/Skin.php:

Glad to hear you got this working on your wiki.

(In response more to the patch keyword added by others than to your comment) We can't directly incorporate your code into core MediaWiki since there is no guarantee that the hidden category's name is actually "Hidden categories" (with i18n and all).

Ideally this filtering would be done when querying the db/building the list of categories, as opposed to after the fact.

Correct me if I am wrong, but wouldn't it be feasible to replace:

$tempout = explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) );

With this:

if ( $wgUser->getBoolOption( 'showhiddencats' ) ) {
    $tempout = array_unique( explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ) );
} else {
    $tempout = preg_grep( "/MediaWiki:Hidden-categories/", array_unique( explode( "\n", $this->drawCategoryBrowser( $parenttree, $this ) ) ), PREG_GREP_INVERT );
}

So that, instead of hard-coding the hidden category's name as "Hidden categories", you have it refer to the MediaWiki page where the name is actually set?

$tempout = preg_grep( "/MediaWiki:Hidden-categories/",

can't be used:

We can't directly incorporate your code into core MediaWiki
since there is no guarantee that the hidden category's name is
actually "Hidden categories" (with i18n and all).

preg_grep( "/" . preg_quote( wfMsgForContent( "MediaWiki:Hidden-categories" ), "/" ) ."/", ...

Would work, which is what I believe you were trying to get at (In theory anyways, I haven't tested it). However, I think it would be preferable to look for the cat_hidden prop in page_props table when doing the actual db query.
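
A minimal sketch of the flag-based alternative, independent of any category name. It assumes the list of hidden categories has already been fetched from the database; here $hiddenCategories is hard-coded, and dropHiddenPaths() is a hypothetical helper, not a core function.

```php
<?php
/**
 * Drop every path that passes through a category flagged as hidden.
 * $hiddenCategories is assumed to come from a database lookup.
 */
function dropHiddenPaths( array $paths, array $hiddenCategories, $separator = ' > ' ) {
    // Flip to get constant-time membership checks via isset().
    $hidden = array_flip( $hiddenCategories );
    $shown = array();
    foreach ( $paths as $path ) {
        $isHidden = false;
        foreach ( explode( $separator, $path ) as $segment ) {
            if ( isset( $hidden[$segment] ) ) {
                $isHidden = true;
                break;
            }
        }
        if ( !$isHidden ) {
            $shown[] = $path;
        }
    }
    return $shown;
}

$paths = array(
    'Root category > Cameras',
    'Root category > Templates > Wiki > Hidden categories > Image by AWCam',
    'Root category > Countries > Italy',
);
$shown = dropHiddenPaths( $paths, array( 'Image by AWCam' ) );
print_r( $shown );
```

In core this check would more naturally happen on the category titles before they are rendered to strings; the string version is only meant to show the logic.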

(In reply to comment #11)

preg_grep( "/" . preg_quote( wfMsgForContent( "MediaWiki:Hidden-categories" ),
"/" ) ."/", ...

Would work, which is what I believe you were trying to get at (In theory
anyways, I haven't tested it). However, I think it would be preferable to look
for the cat_hidden prop in page_props table when doing the actual db query.

That was what I was trying to get at. That way, it wouldn't matter what the hiddencat name actually was, as it would be defined correctly in all instances on that page anyway.

balano wrote:

I was also seeing duplicates in my small (1000 non-stubs) MW 1.18.2. I believe at least some of the duplicates are coming because the category browser drops the bottom level category off some entries. I've documented this behavior at

http://www.mediawiki.org/w/index.php?title=Help:Categories&stable=0&shownotice=1&fromsection=Adding_a_page_to_a_category#Adding_a_page_to_a_category

and I repeat it here:

(At least in MediaWiki 1.18.2) if a category is a subcategory of more than one parent, both hierarchies will be listed, but the tagged category will be stripped off all but one of these. This creates the potential for what appear to be duplicate entries if a category with multiple parents and one of its parents are both tagged on a page. For example suppose Maryanne is a subcategory of both Mary and Anne. If a page tags categories Maryanne and Anne then the Category breadcrumbs will show

Anne
Anne
Mary -> Maryanne

"Anne" appears to be duplicated, but what is meant is

Anne
Anne -> Maryanne
Mary -> Maryanne
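
The intended breadcrumbs for this example can be reproduced with a small standalone sketch that lists every root-first path ending at each tagged category. The $parents map and the pathsTo() function are hypothetical and only model the example; this is not how core builds the tree, and it assumes the category graph has no cycles.

```php
<?php
/**
 * List every root-first path that ends at $category.
 * $parents maps a category to its direct parents (hypothetical data shape).
 * Assumes the category graph contains no cycles.
 */
function pathsTo( $category, array $parents, $separator = ' -> ' ) {
    if ( empty( $parents[$category] ) ) {
        // A root category: the path is just the category itself.
        return array( $category );
    }
    $paths = array();
    foreach ( $parents[$category] as $parent ) {
        foreach ( pathsTo( $parent, $parents, $separator ) as $prefix ) {
            // Every parent path gets the current category appended,
            // so no path loses its bottom level.
            $paths[] = $prefix . $separator . $category;
        }
    }
    return $paths;
}

$parents = array( 'Maryanne' => array( 'Mary', 'Anne' ) );
$breadcrumbs = array();
foreach ( array( 'Maryanne', 'Anne' ) as $tagged ) {
    $breadcrumbs = array_merge( $breadcrumbs, pathsTo( $tagged, $parents ) );
}
sort( $breadcrumbs );
print_r( $breadcrumbs );
```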

This is an old issue. Is there any work on this?

The last update was back in June 2012, so no, nobody is looking at this. That last comment ( T35614#374586 ) explains how to reproduce the case.

Maybe a glitch in Title::getParentCategoryTree().