
Run updateArticleCount.php on all Wikisources and Wiktionaries
Closed, ResolvedPublic

Description

Author: pierre.beaudouin

Description:

  1. Wikisource has many articles without internal links. The article count should include pages without internal links.
  2. Some namespaces are included in the article count (102 = author, 104 = page, 106 = index), but the numbering is not identical across wikis.

For example :
On pl: 100=page, 102=index, 104=author
On it: 102=author, 108=page, 110=index
On fr: 102=author, 104=page, 112=index


Version: unspecified
Severity: major

Details

Reference
bz33253

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 12:00 AM
bzimport set Reference to bz33253.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

  1. Wikisource has many articles without internal links. The article count should include pages without internal links.

I thought this had been fixed by bug 24754 / bug 11868.
I don't think Wikisource needs the new $wgArticleCountMethod created after bug 26033, does it? And in any case it should be set to "comma", not "any", except perhaps some languages which don't use commas.
Cf. bug 27256. I can't find a bug for Wiktionary; perhaps the actual configuration has not been requested yet, or wasn't actually needed?

  2. Some namespaces are included in the article count (102 = author, 104 = page, 106 = index), but the numbering is not identical across wikis.

For example :
On pl: 100=page, 102=index, 104=author
On it: 102=author, 108=page, 110=index
On fr: 102=author, 104=page, 112=index

This is bug 29172: reopen if some namespaces are missing on some wikis.

After the configuration has been set or confirmed to be set correctly, the really missing piece here is running updateArticleCount.php on all Wikisources, whose counts are already utterly broken because of the ContentNamespaces change, mass deletions and so on.
Wiktionary (after r88113) and en.wikibooks/pt.wikibooks also need it, I suppose.

pierre.beaudouin wrote:

(In reply to comment #1)
Thanks for your comment. I didn't understand everything, and I don't know where the problem comes from.

But I still believe that the figures on http://stats.wikimedia.org and [[Special:Statistics]] for Wikisource are misleading.

E.g. the size of the French Wikisource database is decreasing!

There is also a problem with the number of active editors, article count...

The following script may not take into account the specificities of Wikisource (transclusion, namespaces):
http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/dumps/WikiCountsInput.pm?view=markup

(In reply to comment #2)

But I still believe that the figures on http://stats.wikimedia.org and [[Special:Statistics]] for Wikisource are misleading.

What's misleading and why on Special:Statistics?

E.g. the size of the French Wikisource database is decreasing!

There is also a problem with the number of active editors, article count...

The following script may not take into account the specificities of Wikisource (transclusion, namespaces):
http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/dumps/WikiCountsInput.pm?view=markup

Please open another bug for WikiStats. Erik Zachte has already worked on this and fixed some parts of it: I think Commons is ok, but AFAIK he hasn't updated the script to use ContentNamespaces rather than namespace 0 in all cases. I think he has other priorities now, but a bug report might be useful.

In the meantime, since no change to the article count method seems to be needed or requested, I'm changing the summary to make this a shell request for a maintenance script run.
Severity set to "major" because article count is an important feature and Special:Statistics is a heavily used page, currently completely incorrect on those wikis after the count method changes.

(In reply to comment #5)

Isn't this a WONTFIX?

Why should it?! The script updates the count to reflect the real value according to current rules.

This needs to wait until deployment of 1.19, which is where r88113 added $wgArticleCountMethod

(In reply to comment #7)

This needs to wait until deployment of 1.19, which is where r88113 added $wgArticleCountMethod

1.19 has been deployed for a while now. What's the status of this bug? I'm assuming the script has not been run on everything, since Veps Wikipedia (for example) is still reporting the wrong article count.

And now we're on MW1.20wmf2. Is this still waiting for anything in particular (apart from someone to act on it)?

[P.S. - Why was the URL set to "analytics"?]

Be careful what you wish for! Thanks for running the script on these wikis, but it has resulted in some *huge* changes in article counts, and some are very questionable. For example, the Nepali Wiktionary has "lost" 98% of its entries, falling from 4,821 down to 73. How is this Wiktionary counting articles? Link or comma method? (It clearly isn't using "any".)

And while most of the Wikisources have grown, as one would expect since more namespaces are now considered "content", the Ukrainian Wikisource has lost 57% of its text units, dropping from 4,563 down to 1,947. How could it have lost so many content pages?

More generally: I know how to check what namespaces count as content (API namespaces query), but how does one find out what article-count method a wiki is using?

Reopening this bug until this issue is cleared up. (I've only checked a few of the updated article counts so far, but I will check more and see if there's a systematic problem with certain types of languages, or what...)

I'd suspect quite a few haven't been updated configuration wise like they should have:

/**
 * Method used to determine if a page in a content namespace should be counted
 * as a valid article.
 *
 * Redirect pages will never be counted as valid articles.
 *
 * This variable can have the following values:
 * - 'any': all pages as considered as valid articles
 * - 'comma': the page must contain a comma to be considered valid
 * - 'link': the page must contain a [[wiki link]] to be considered valid
 * - null: the value will be set at run time depending on $wgUseCommaCount:
 *         if $wgUseCommaCount is false, it will be 'link', if it is true
 *         it will be 'comma'
 *
 * See also See http://www.mediawiki.org/wiki/Manual:Article_count
 *
 * Retroactively changing this variable will not affect the existing count,
 * to update it, you will need to run the maintenance/updateArticleCount.php
 * script.
 */
$wgArticleCountMethod = null;
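
As a rough illustration of the rules described in that comment, the following Python sketch applies the three methods to raw wikitext. It is a simplification under stated assumptions: the names `resolve_method` and `is_countable` are made up for this example, and the plain substring tests stand in for whatever check the real software performs.

```python
from typing import Optional


def resolve_method(method: Optional[str], use_comma_count: bool) -> str:
    """Resolve a null method as the comment describes: fall back on the comma flag."""
    if method is None:
        return "comma" if use_comma_count else "link"
    if method not in ("any", "comma", "link"):
        raise ValueError(f"unknown article count method: {method!r}")
    return method


def is_countable(text: str, is_redirect: bool,
                 method: Optional[str] = None,
                 use_comma_count: bool = False) -> bool:
    """Decide whether a page in a content namespace counts as an article.

    Assumption: substring tests on the wikitext approximate the real checks.
    """
    if is_redirect:
        return False  # redirects are never counted
    resolved = resolve_method(method, use_comma_count)
    if resolved == "any":
        return True
    if resolved == "comma":
        return "," in text
    return "[[" in text  # 'link'


print(is_countable("See [[Main Page]]", is_redirect=False))   # default resolves to 'link'
print(is_countable("No links, only a comma", is_redirect=False))
```

Note that a text test like this one treats `[[Category:Foo]]` as a link, whereas a check based on stored link records might not; which of the two the counting code uses is not settled by this sketch.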

So what would you recommend as far as finding out which wikis still/now need fixing? Since I'm just a "regular user", the best I can do is compare the on-wiki (/API) article counts before and after the running of the updateArticleCount.php script (I collect these numbers daily) with the official article counts listed at Wikistats (stats.wikimedia.org), once those are posted (in a few weeks). Unfortunately, the Nepali Wiktionary is one of the wikis that are not tracked (for whatever reason) at Wikistats (meaning there are no "official" article counts for it, only what the wiki itself reports). In any case, because of the sheer number of projects involved, I haven't actually made any such comparisons yet (I have been collecting some relevant data over the past few weeks, though).

Would it be possible to post the values of $wgArticleCountMethod and $wgUseCommaCount for every wiki? I know it's a lot to ask, but I assume there's a "quick" way of doing this on the command line...?

All wikis are still using the default link method, except pt.wikibooks and en.wikibooks, which filed a request (see above).
Entries on ne.wiktionary seem to have no links, categories or templates at all, so they wouldn't be counted with the normal method either. The only difference may be that interwiki links also used to be counted, but I'd consider that a bug. In any case, please open a new bug to get the method fixed/changed; this bug is indeed fixed.

(In reply to comment #13)

Would it be possible to post the values of $wgArticleCountMethod and $wgUseCommaCount for every wiki? I know it's a lot to ask, but I assume there's a "quick" way of doing this on the command line...?

'wgArticleCountMethod' => array(
    'default' => 'link',
    'enwikibooks' => 'comma',
    'ptwikibooks' => 'comma',
),

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
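
The array quoted above can be read as a lookup with a fallback. A minimal Python sketch of that reading, assuming only exact database names and the `default` key matter (the real settings file may support further grouping that this ignores); `resolve_setting` is a hypothetical helper, and `enwikisource` and `newiktionary` are used only as example database names:

```python
ARTICLE_COUNT_METHOD = {
    "default": "link",
    "enwikibooks": "comma",
    "ptwikibooks": "comma",
}


def resolve_setting(setting: dict, dbname: str) -> str:
    """Return the wiki-specific value, falling back to the 'default' entry."""
    return setting.get(dbname, setting["default"])


for wiki in ("enwikibooks", "enwikisource", "newiktionary"):
    print(wiki, resolve_setting(ARTICLE_COUNT_METHOD, wiki))
```

Under this reading, every Wikisource and Wiktionary falls through to `link`, which matches the statement that only the two Wikibooks projects were changed.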

That is a very helpful file! Thanks.

Actually, from what I've seen, the vast majority of ne.wiktionary entries do have categories. Same for uk.wikisource. Are you sure those would count as "links"? I don't remember...

In fact, based on a census of Special:AllPages (main namespace, hiding redirects) and a sample of 30 "Special:Random" pages (in main ns) checked for the presence of at least one link (assuming Category: links count) on each wiki:

  • ne.wiktionary = 28/30 * 4,937 = 4,608 estimated article count
  • uk.wikisource = 28/30 * 4,757 = 4,440 estimated article count

(Both wikis count only the main namespace as "content".)

Those estimates are really close to the respective counts before the update script was run: 4,821 and 4,563.
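
The extrapolation above is a sample proportion scaled to the page total. A small Python sketch of that arithmetic, using only figures quoted in this thread; the function names are illustrative, and a sample of 30 pages leaves a wide margin of error:

```python
def estimate_articles(linked_in_sample: int, sample_size: int, total_pages: int) -> int:
    """Scale the share of sampled pages that have a link to all non-redirect pages."""
    return round(linked_in_sample / sample_size * total_pages)


def percent_change(before: int, after: int) -> float:
    """Signed change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# (wiki, non-redirect pages in main ns, count before the script, count after)
wikis = [
    ("ne.wiktionary", 4937, 4821, 73),
    ("uk.wikisource", 4757, 4563, 1947),
]

for name, total, before, after in wikis:
    estimate = estimate_articles(28, 30, total)
    print(f"{name}: estimate {estimate}, before {before}, "
          f"after {after} ({percent_change(before, after):.1f}%)")
```

The estimates come out at 4,608 and 4,440, as stated above, and sit far closer to the pre-update counts than to the post-update ones.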

So... does this mean Category: links _used_ to count as links but don't anymore?

If so, this is going to affect a great many wikis. (And already has: 13 Wikisources and 24 Wiktionaries dropped below their latest significant article count milestone [in the sense of those tracked at m:Wikimedia_News] in the last 24 hours -- typically only a few wikis fall below milestones every _month_, across _all_ WMF projects.)

So, does anyone know if this has been discussed on-wiki anywhere, or on a mailing list?

Well, damn, there it is right there at [[mw:Manual:Article count]]: "...will be counted as an article in the statistics and the {{NUMBEROFARTICLES}} variable... if it contains at least one wiki link... or is categorized to at least one category."

So, this used to be the behavior, at least. Has it changed?

erikzachte wrote:

(In reply to comment #11)

More generally: I know how to check what namespaces count as content (API namespaces query), but how does one find out what article-count method a wiki is using?

Query does list namespaces which are in use, but not whether these count as content. Am I missing something?

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

If such a query exists, or such an attribute could be added, wikistats could use it to get that part of article counting up to date.

(In reply to comment #17)

Query does list namespaces which are in use, but not whether these count as content. Am I missing something?

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

In the results of that query:

<ns id="0" case="first-letter" content="" xml:space="preserve" />

The string 'content=""' indicates that this namespace counts as content. Here's another example from http://de.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces:

...
<ns id="0" case="first-letter" subpages="" content="" xml:space="preserve" />
...
<ns id="102" case="first-letter" canonical="Seite" content="" xml:space="preserve">Seite</ns>
...
<ns id="104" case="first-letter" canonical="Index" content="" xml:space="preserve">Index</ns>
...

I should point out that I'm really only _assuming_ this is true (about the 'content=""' string), since it seems to match (more or less) what I've been told about what namespaces count as content on what projects.
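
To make that reading of the `content` attribute concrete, here is a minimal Python sketch that extracts the IDs of namespaces carrying the attribute from an XML response of this shape. It rests on the same assumption stated above, namely that the presence of `content=""` marks a content namespace. The `<namespaces>` wrapper and the namespace 1 entry in the sample are illustrative, not copied from a real response, and `content_namespace_ids` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

# Sample shaped like the de.wikisource excerpt; ns 1 is an invented non-content entry.
SAMPLE = """
<namespaces>
  <ns id="0" case="first-letter" subpages="" content="" xml:space="preserve" />
  <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Diskussion</ns>
  <ns id="102" case="first-letter" canonical="Seite" content="" xml:space="preserve">Seite</ns>
  <ns id="104" case="first-letter" canonical="Index" content="" xml:space="preserve">Index</ns>
</namespaces>
"""


def content_namespace_ids(xml_text: str) -> List[int]:
    """Return the sorted IDs of all <ns> elements that have a 'content' attribute."""
    root = ET.fromstring(xml_text.strip())
    return sorted(
        int(ns.get("id"))
        for ns in root.iter("ns")
        if "content" in ns.attrib  # present even when its value is empty
    )


print(content_namespace_ids(SAMPLE))
```

For the sample above this prints `[0, 102, 104]`. The test is on the attribute's presence rather than its value, since the value is an empty string.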

Note, however, that there is quite a bit of variation in this. For example, when you, Erik, told me at [[m:Talk:Wikimedia News#Using Wikipedia Statistics to fill in gaps]] that "102 = Author, 104 = Page, 106 = Index" count as content on Wikisource, that's true about the English Wikisource, but not necessarily the others. Not all Wikisources even use the same namespace numbers for the same purposes: in the Estonian Wikisource, for example, 102 = Page, 104 = Index, and 106 = Author (and these are all marked as "content" in the API query results); and in the Turkish Wikisource, 100 = Author, and that's the only namespace other than main (ns0) marked as "content".
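
Because the numbering differs, a tool that hard-codes 102/104/106 would look in the wrong namespaces on some wikis. A Python sketch of a per-wiki lookup, populated only with the three examples given in this comment; the dictionary layout, the wiki keys and the helper name `namespace_for` are hypothetical:

```python
from typing import Optional

# Content namespaces per wiki, as described in this comment (ns0 = main).
CONTENT_NAMESPACES = {
    "en.wikisource": {0: "Main", 102: "Author", 104: "Page", 106: "Index"},
    "et.wikisource": {0: "Main", 102: "Page", 104: "Index", 106: "Author"},
    "tr.wikisource": {0: "Main", 100: "Author"},
}


def namespace_for(wiki: str, role: str) -> Optional[int]:
    """Return the namespace ID playing `role` on `wiki`, or None if it is not content there."""
    for ns_id, name in CONTENT_NAMESPACES[wiki].items():
        if name == role:
            return ns_id
    return None


for wiki in CONTENT_NAMESPACES:
    print(wiki, "Author ->", namespace_for(wiki, "Author"),
          "| Page ->", namespace_for(wiki, "Page"))
```

Looking up by role rather than by number keeps the per-wiki differences in data instead of in code, which is the property a counting script would need here.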

So does this mean not even Wikistats is counting the articles correctly?? [g]

As part of my investigation into the large shifts in "on-wiki" article counts alluded to above, I've started to fill in a large table at [[m:Talk:Wikimedia News#May 10 article count updates]] with some relevant info, including what namespaces are marked as "content" in the API results, how many non-redirect pages are (or appear to be, approximately) in each, and an estimate of what percentage of these should count as "articles" by the "at least one link" standard (plus a lot of other stuff -- note, BTW, that the table only contains wikis that passed or dropped below article-count "milestones").

I'm also in the process of downloading all the relevant database dumps that should allow me to calculate "exactly" many of the numbers in that table that are currently only estimates (in essence duplicating what I assume your script[s] do, Erik, but for dumps made just before and just after May 10th, not only at the end of the month).

lars wrote:

At http://meta.wikimedia.org/wiki/Wikisource
the Norwegian (no.) Wikisource is listed with 4,145 "good" pages,
which should be ten times larger if the "Side:" (Page) namespace was counted.
It should only be slightly smaller than the Swedish (sv.) Wikisource,
which has 46,815 "good" pages in the same table.

OTOH, notice that the "official" article count for s:no:, as of Mar 31, 2012, is only 2,392. http://stats.wikimedia.org/wikisource/EN/Sitemap.htm

The more I look into this, the more convinced I become that, unfortunately, *most* of the article counts, both on-wiki and based on dumps, are actually wrong by significant amounts.... but I can't be sure at this point exactly how widespread the problem is. When I get a clearer picture, I'll open a different bug about it.

erikzachte wrote:

Here is a list of 'content' namespaces collected via the API.
If this looks sensible I can use it from now on for wikistats.

http://stats.wikimedia.org/wikimedia/misc/StatisticsContentNamespaces.csv

BTW commons does not list 6 or 14.

erikzachte wrote:

Ah, the list of content namespaces is already available via

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

section 'wgContentNamespaces' => array

erikzachte wrote:

Ahem I overlooked comment 14, this php file was already mentioned.

http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

So how about the Wikisource and Wiktionary wikis: isn't $wgArticleCountMethod set to 'any' for those?

reedy@fenari:~$ mwscript eval.php enwikisource
print $wgArticleCountMethod
link

reedy@fenari:~$ mwscript eval.php enwiktionary
print $wgArticleCountMethod
link

Oops. Forgot to mention here that I've opened bug 37291 about updateArticleCount.php (or whatever code actually counts the articles) not counting correctly.