
PAGESINCATEGORY should differentiate between pages and subcategories
Closed, Resolved, Public

Description

Author: londenp

Description:
This magic word PAGESINCATEGORY, which is very useful, counts the number of articles and subcategories in a category.

I don't know if this is the intended functionality for this magic word, but it would help a lot if

a) <nowiki>{{Pagesincategory:category}}</nowiki> would not count the subcategories of a certain category
or
b) a new magic word were created, alongside the existing one, which would count only the number of articles in a certain category.

I did not find a bug/feature request about this, but if there is already one: sorry about this one.

Thanks


Version: unspecified
Severity: enhancement
URL: http://nl.wikibooks.org

Details

Reference
bz14237

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 10:11 PM)
bzimport set Reference to bz14237.

Wiki.Melancholie wrote:

*** Bug 13691 has been marked as a duplicate of this bug. ***

londenp wrote:

It seems that bug 13691 is not an exact duplicate of this bug, although it concerns the same magic word. That bug says it has been resolved, but then this bug is turned into a feature request for a new magic word, that is, option b) in the comment above.

Thanks

redekopmark wrote:

What about a new magic word {{ARTICLESINCATEGORY}} that would only display the number of mainspace pages in a category? This would be similar to the difference between NUMBEROFPAGES and NUMBEROFARTICLES.

I think ARTICLES should never be used in magic words. PAGES should be used instead.

I think the current behaviour is confusing. I could imagine PAGESINCATEGORY only reporting the number of pages in a category, excluding FILES and CATEGORIES. This implies there would be 4 magic words to report on either all category members (MEMBERSINCATEGORY), files in category (FILESINCATEGORY), categories in category (CATEGORIESINCATEGORY), and pages in category (PAGESINCATEGORY).

*** Bug 15645 has been marked as a duplicate of this bug. ***

I'm opposed to Siebrand's view. Pages for me are any pages, including subcategories, files, talk pages, articles, project pages. They only differ by the namespace in which they reside, and there are possibly many other namespaces (don't assume that all wikis will behave like Wikipedia).

If you want to have counts by namespace, then what would be needed is a two-parameter syntax like:

{{PAGESINCATEGORY:categoryname|namespace-id}}

to make the restriction (the same magic keyword can be used, to provide separate counts for each namespace).

or even possibly like
{{PAGESINCATEGORY:categoryname|namespace-id1|namespace-id2|...}}
if you want to include a list of several namespaces to include in the count.

The existing difference between NUMBEROFPAGES and NUMBEROFARTICLES does not rely on namespace differentiation but on statistical parameters (notably the page size, excluding included templates).

Introducing the term "member" will just add more confusion.

*** Bug 21822 has been marked as a duplicate of this bug. ***
*** Bug 25376 has been marked as a duplicate of this bug. ***

Duping both of those bugs to this. Implementation per comment 6 (or similar) would solve all of these bugs at once.

note that multiple parameters for the syntax I propose may be reduced to just one:
{{PAGESINCATEGORY:categoryname|restriction}}
where restriction may be:

  • "" : no namespace id at all, useful to add namespaces
  • "*" : all namespace ids (the default), useful to remove namespaces

followed by one or more of:

  • "+id" : add this namespace id to the current list
  • "-id" : remove this namespace id from the list

If the restriction does not start with "*", "+" or "-", then "+" is implied.
The namespace id could be either the numeric id, or a selector like "talk" to select all talk namespaces, and "subject" to select all subject namespaces.

The namespace id can then take the forms:

  • an integer, the raw namespace number
  • a name, a namespace name (converted to a namespace id, should recognize the synonyms, notably localized names or English names, or site-specific names)
  • "odd": all odd namespace ids (i.e. "talk" namespaces associated to any subject namespace)
  • "even": all even namespace ids (i.e. "subject" namespaces)

For example:

  • {{PAGESINCATEGORY:categoryname|*}} : equivalent to {{PAGESINCATEGORY:categoryname|}} and to {{PAGESINCATEGORY:categoryname}} (existing syntax)
  • {{PAGESINCATEGORY:categoryname|:}} : count only pages of the main namespace, that are members of the specified category name
  • {{PAGESINCATEGORY:categoryname|0}} : count only pages of the main namespace, that are members of the specified category name ; equivalent to {{PAGESINCATEGORY:categoryname|+0}}
  • {{PAGESINCATEGORY:categoryname|+project:+talk}} : count only pages of the "project:" namespace or of any talk namespace, that are members of the specified category name ; equivalent to {{PAGESINCATEGORY:categoryname|project:+talk}} (the leading "+" is implied)
  • {{PAGESINCATEGORY:categoryname|-talk}} : count all pages of any namespace excluding the talk namespaces (odd ids) that are members of the specified category name; equivalent to {{PAGESINCATEGORY:categoryname|*-talk}}
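
The restriction syntax above could be parsed roughly as follows. This is only a sketch of the proposal, not MediaWiki code: the NAMESPACES table is a hypothetical subset (using MediaWiki's default ids), the function names are made up for illustration, and an empty restriction is assumed to behave like "*", as in the first example.

```python
import re

# Hypothetical namespace table for illustration (ids follow MediaWiki defaults).
NAMESPACES = {
    0: "", 1: "talk", 2: "user", 3: "user talk",
    4: "project", 5: "project talk", 6: "file", 7: "file talk",
    14: "category", 15: "category talk",
}


def resolve(token):
    """Map one id token (number, name, or selector) to a set of namespace ids."""
    name = token.strip().rstrip(":").lower()
    if name in ("talk", "odd"):
        return {ns for ns in NAMESPACES if ns % 2 == 1}
    if name in ("subject", "even"):
        return {ns for ns in NAMESPACES if ns % 2 == 0}
    if name.isdigit():
        return {int(name)}
    for ns, label in NAMESPACES.items():
        if label == name:  # ":" alone resolves to the main namespace
            return {ns}
    raise ValueError(f"unknown namespace: {token!r}")


def parse_restriction(restriction):
    """Return the set of namespace ids selected by a restriction string."""
    text = restriction.strip()
    if text == "":
        return set(NAMESPACES)  # assumption: empty means no restriction
    if text[0] == "*":
        selected, text = set(NAMESPACES), text[1:]
    elif text[0] == "-":
        selected = set(NAMESPACES)  # "-talk" behaves like "*-talk"
    else:
        selected = set()
        if text[0] != "+":
            text = "+" + text  # a leading "+" is implied
    for sign, token in re.findall(r"([+-])([^+-]+)", text):
        ids = resolve(token)
        selected = selected | ids if sign == "+" else selected - ids
    return selected


print(sorted(parse_restriction("+project:+talk")))  # [1, 3, 4, 5, 7, 15]
print(sorted(parse_restriction("-talk")))           # [0, 2, 4, 6, 14]
```

Note that in this sketch the selector "talk" always means "all talk namespaces", so namespace 1 on its own has to be requested by number.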

The restriction can easily be implemented as WHERE clauses in the SQL select that will match the specified namespace ids, combined as a parenthetic list of 'OR id=value' (positive selections), followed by a list of exclusions with 'AND NOT id=value' (negative selections), and possibly with the "IN" operator if sets are available in the SQL syntax.
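
As an illustration of the WHERE clause idea, here is a minimal sketch that turns positive and negative selections into a parameterized SQL fragment using the "IN" operator. The function name namespace_where is hypothetical, and the column name member_namespaceid is the placeholder used in this comment thread, not an actual MediaWiki column.

```python
def namespace_where(include, exclude, column="member_namespaceid"):
    """Build a parameterized SQL condition and its bound values.

    `include` and `exclude` are sets of namespace ids; empty sets add no
    condition. Placeholders ("?") are used instead of inlined values.
    """
    clauses, params = [], []
    if include:
        marks = ", ".join("?" for _ in include)
        clauses.append(f"{column} IN ({marks})")
        params.extend(sorted(include))
    if exclude:
        marks = ", ".join("?" for _ in exclude)
        clauses.append(f"{column} NOT IN ({marks})")
        params.extend(sorted(exclude))
    # "1=1" keeps the fragment valid when there is nothing to restrict.
    return " AND ".join(clauses) or "1=1", params


print(namespace_where({0, 4}, set()))
# ('member_namespaceid IN (?, ?)', [0, 4])
print(namespace_where(set(), {1, 3}))
# ('member_namespaceid NOT IN (?, ?)', [1, 3])
```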

Some ideas about the SQL server-side cost of counting members in a specific category:

With the restrictions above, the SQL cost should be the same as (or better than) that of a select without the namespace restriction, because this is just a restriction of the existing syntax: it should never reduce the selectivity of the SQL query, and may in fact help to improve it.

However, this means that the existing restriction (for costly parser functions) should remain, because counting pages that are members of a category, independently of which namespace they belong to, may be costly in very populated categories, depending on how members of categories are indexed.

As this cost is effectively the cost of a:

SELECT COUNT(*) FROM categorymembers
WHERE category_pageid = $CATEGORYPAGEID
AND member_namespaceid = $CATEGORYNAMESPACEID

aggregate (note: I don't know the exact schema implementation, which varies across MediaWiki versions, so replace the table names and column names appropriately), one way to solve it would be to use:

SELECT 1 FROM categorymembers
WHERE category_pageid = $CATEGORYPAGEID
AND member_namespaceid = $CATEGORYNAMESPACEID
LIMIT 50

and then let the PHP code count the returned "1" rows. If there are 50 rows, the category is heavily populated and COUNT(*) may take time, so the function can be considered costly. If the cost limit is reached, just return this limit value to the page calling the function. Otherwise, either perform the same select with "SELECT 1" replaced by "SELECT COUNT(*)" (without the LIMIT clause) to return the exact value, or return the last known estimate from a separate caching aggregate table that is updated separately (using a maximum timestamp of validity). The cache avoids recomputing the same aggregate repeatedly when templated pages use this function and many users frequently view or edit various pages.
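
The probe-then-count idea can be sketched as follows, using Python and an in-memory SQLite database rather than PHP. The table and column names are the placeholders from this comment, not the real MediaWiki schema, and count_members is a hypothetical helper.

```python
import sqlite3

PROBE_LIMIT = 50  # tunable threshold, as in the LIMIT clause above


def count_members(conn, category_id, namespace_id, limit=PROBE_LIMIT):
    """Return (count, is_exact); large categories are capped at `limit`."""
    # Cheap probe: fetch at most `limit` rows.
    probe = conn.execute(
        "SELECT 1 FROM categorymembers "
        "WHERE category_pageid = ? AND member_namespaceid = ? LIMIT ?",
        (category_id, namespace_id, limit),
    ).fetchall()
    if len(probe) >= limit:
        return limit, False  # too populated: report the limit itself
    # Small category: the exact aggregate is affordable.
    (exact,) = conn.execute(
        "SELECT COUNT(*) FROM categorymembers "
        "WHERE category_pageid = ? AND member_namespaceid = ?",
        (category_id, namespace_id),
    ).fetchone()
    return exact, True


# Demo data: category 1 has 7 members, category 2 has 120.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorymembers ("
    "category_pageid INTEGER, member_namespaceid INTEGER)"
)
conn.executemany(
    "INSERT INTO categorymembers VALUES (?, ?)",
    [(1, 0)] * 7 + [(2, 0)] * 120,
)

print(count_members(conn, 1, 0))  # (7, True)
print(count_members(conn, 2, 0))  # (50, False)
```

In this sketch, a probe that returns fewer rows than the limit already gives the exact count, so the second query could be skipped; it is kept here only to follow the two steps as described.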

The value specified in the "LIMIT" clause above (here "50") may be tuned. This first check (for performance) may be removed completely, or removed when the SQL schema includes an index that precomputes aggregates for counting members in each specific category. In that case there is no need to perform a SELECT COUNT(*) aggregate, because the count is retrieved directly from a precomputed aggregate caching table. That table should be updated asynchronously, either as a batch or when the selective SELECT in the cache detects that the stored value is out of date, in which case it performs the SELECT COUNT(*) on the non-cached table just to update the caching table and its timestamp.
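
A caching aggregate table with a validity timestamp might look like the sketch below. Everything here is an assumption for illustration: the category_counts table, its columns, the cached_count helper and the one-hour max_age are invented, and the refresh happens inline rather than asynchronously.

```python
import sqlite3
import time


def cached_count(conn, category_id, namespace_id, max_age=3600, now=None):
    """Return the member count, recomputing it only when the cache is stale."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT member_count, updated_at FROM category_counts "
        "WHERE category_pageid = ? AND member_namespaceid = ?",
        (category_id, namespace_id),
    ).fetchone()
    if row is not None and now - row[1] <= max_age:
        return row[0]  # cached value is still within its validity window
    # Cache miss or stale entry: run the aggregate and store the result.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM categorymembers "
        "WHERE category_pageid = ? AND member_namespaceid = ?",
        (category_id, namespace_id),
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO category_counts VALUES (?, ?, ?, ?)",
        (category_id, namespace_id, count, now),
    )
    return count


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categorymembers (
        category_pageid INTEGER, member_namespaceid INTEGER
    );
    CREATE TABLE category_counts (
        category_pageid INTEGER, member_namespaceid INTEGER,
        member_count INTEGER, updated_at REAL,
        PRIMARY KEY (category_pageid, member_namespaceid)
    );
    """
)
conn.executemany("INSERT INTO categorymembers VALUES (?, ?)", [(1, 0)] * 3)

print(cached_count(conn, 1, 0, now=0))     # 3: computed and cached
conn.execute("INSERT INTO categorymembers VALUES (1, 0)")
print(cached_count(conn, 1, 0, now=10))    # 3: cached value still valid
print(cached_count(conn, 1, 0, now=4000))  # 4: cache expired, recomputed
```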

philippe.vigneau wrote:

I don't know if the index on the two columns (category_pageid, member_namespaceid) exists on the table categorymembers, but it seems to me that is the only thing that may be added in the database... performance can only be better...
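
For illustration, such a two-column index can be tried out on an in-memory SQLite database. This is a sketch with the placeholder table and column names from this thread (the index name is invented), and the query planner of the database actually used by a wiki may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorymembers ("
    "category_pageid INTEGER, member_namespaceid INTEGER)"
)
# Composite index on the two columns used in the WHERE clause.
conn.execute(
    "CREATE INDEX idx_category_namespace "
    "ON categorymembers (category_pageid, member_namespaceid)"
)

# Ask SQLite how it would execute the counting query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT COUNT(*) FROM categorymembers "
    "WHERE category_pageid = ? AND member_namespaceid = ?",
    (1, 0),
).fetchall()
plan_details = [row[3] for row in plan]  # fourth column holds the description
print(plan_details)
```

The printed plan should mention idx_category_namespace, meaning the count can be answered from the index without scanning the whole table.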

So when will this improvement be done?

I would love to see this one implemented. I was just looking up how to count files in a directory (excluding sub-directories) when I learned that you cannot. I was hoping to use that to allow the Commons template [[commons:Template:MetaCat]] to list metacategories (categories which should contain only other categories) that contain files.

test5555 wrote:

For files, see Bug 21822

A patch was committed as Gerrit change 12790.

successfully merged

You can use {{PAGESINCATEGORY:catname|subcats}} or {{PAGESINCATEGORY:catname|subcats|R}} or {{PAGESINCATEGORY:catname|R|subcats}}
to get the count of subcats in the category or with 'pages' to get the count of pages.

I don't know if PAGESINCATEGORY must be accurate. As it is costly, we probably don't need the exact count if we can just have an estimate that is exact below a maximum.

If COUNT(*) is costly in SQL (because the column is not indexed) so that it has to fetch each row and count those that are found, then a "LIMIT n" clause can avoid this cost by returning only this maximum.

For wikis, we generally don't need the exact count but only an estimate if there are more than about 400 pages in a category (for detecting those that are overpopulated), in which case the threshold can be set to LIMIT 401 (only if there's no selective index to perform a fast COUNT(*)).

Also, I note that {{PAGESINCATEGORY: name |R}} does not seem to trim whitespace in the given category name (this is a problem in templates where category names are generated by complex rules and may have optional parts separated by spaces). So it does not work like [[CATEGORY: name ]]: in such cases, it returns 0 even if the category is not empty and has many members.

Workaround: trim the name parameter using {{PAGESINCATEGORY:{{#titleparts: name }}|R}}.
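
The effect can be pictured with a toy model. This is not MediaWiki's parser code, only an illustration of the behaviour as reported in this comment; CATEGORY_COUNTS and pages_in_category are invented for the example.

```python
# Toy stand-in for a wiki's category table (hypothetical data).
CATEGORY_COUNTS = {"Name": 12}


def pages_in_category(raw_name, trim=False):
    """Look up a member count; unknown titles count as 0."""
    name = raw_name.strip() if trim else raw_name
    return CATEGORY_COUNTS.get(name, 0)


print(pages_in_category(" Name "))             # 0: the padded title matches nothing
print(pages_in_category(" Name ", trim=True))  # 12: trimming first finds the category
```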

Please create new tasks for new issues or new feature request. The task as written is closed.

The page count is not accurate. It gives the same numbers as shown on each category page.
PAGESINCATEGORY works with spaces; there is possibly another issue when using template arguments => new task

No, PAGESINCATEGORY does NOT work (and has NEVER worked) with spaces.

We still need the workaround {{PAGESINCATEGORY:{{#titleparts: name }}|R}} because this trimming is NOT performed, and {{PAGESINCATEGORY: Name |R}} then always returns 0 when {{PAGESINCATEGORY:Name|R}} (without the leading/trailing spaces in Name) gets the correct result.

This is an old bug that has been submitted multiple times and closed each time, even though it is very easy to reproduce on ALL wikis! My comment about this was a reminder of something that is constantly ignored, reported many times and never fixed, causing problems that are hard to find in many templates using PAGESINCATEGORY (notably in various navboxes used on Commons, where they are quite hard to isolate: PAGESINCATEGORY: Name is frequently used with #ifexpr:, instead of #ifexist: Name).

And it never makes sense to use the parameter of PAGESINCATEGORY without trimming it, given that no category name (or any other page name) can exist untrimmed.

And this is absolutely not related to "template arguments": the bug exists even outside any template. It's just that this bug frequently occurs within templates that test category names with variable parameter values, some of which may be empty (like optional prefixes/suffixes).