
updateArticleCount.php script is broken
Closed, Invalid (Public)

Description

In short: The updateArticleCount.php script is not counting articles correctly.

The evidence:

See the table I'm still filling out at [[m:User:Dcljr/Article counts]], which collects (way too many) statistics based on the official database dumps. (In particular, see the columns highlighted in pink, which show how far off the "on-wiki" article counts were from the actual dump-based article counts, both before and after the script was run.)

The longer version:

Ever since the resolution of bug 33253, which led to several wikis "losing" or "gaining" huge numbers of articles (according to their {{NUMBEROFARTICLES}} count), I've suspected very strongly that the updateArticleCount.php script is not counting articles correctly. Now I have firm evidence.

I wrote a Perl script to download and parse the relevant dumps from <dumps.wikimedia.org>, thereby counting articles "from scratch" based on the current "non-redirect with at least one wikilink" criterion (as well as some more and less generous criteria that I'm trying out for comparison). The results are being collected at the Meta page above.

I've started with the Wiktionaries whose article counts dropped the most (in terms of percentage), so the table is currently showing huge undercounts. I originally suspected that the wikis whose article counts gained the most would show significant overcounts, but the handful of checks I've made of such wikis (which haven't been added to the table yet) haven't shown this to be the case.

We Shall See...

Punchline: Someone needs to check the updateArticleCount.php script to see why it's undercounting articles.


Version: unspecified
Severity: normal
URL: http://meta.wikimedia.org/wiki/User:Dcljr/Article_counts

Details

Reference
bz37291

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 12:23 AM
bzimport set Reference to bz37291.
bzimport added a subscriber: Unknown Object (MLST).

BTW, I should point out that the undercounting cannot be because it's not considering all the "content" namespaces, because all Wiktionaries use only ns0 for content.

I see that the updateArticleCount.php script itself does very little. Instead, it relies on other code to actually count the articles. I followed the dependencies for a while, but eventually gave up before I found the actual code that does the counting. Someone more familiar with MW code will have to say where the problem lies...

Is your script available somewhere?

Maybe you could point out a small wiki with the count of your script, for comparing with the numbers provided by updateArticleCount for that wiki?

(In reply to comment #3)

Is your script available somewhere?

aka, what definition of "good" article are you using to say that the count is not "correct"?

See above: «based on the current "non-redirect with at least one wikilink" criteria»

If you haven't already, please see the Meta page I pointed to in my initial post: http://meta.wikimedia.org/wiki/User:Dcljr/Article_counts

That gives all the information I think someone would need to independently
check my counts.... (In fact, it might be a good idea for someone to try to
count the articles themselves without seeing my code first. My script is
currently not available anywhere, but I can put it up at Meta if it's really
necessary.)

I've posted stats for 12 Wiktionaries and 15 Wikisources so far (each for a
date before and a date after the running of the maintenance script). Take your
pick for which one(s) you want to check.

The exact definition I'm using for a "good" article is: "non-redirect in a content namespace with any kind of internal-style [[wikilink]]: page, category, image/file, interlanguage, or interwiki". AFAIK, that's the definition currently in use. Which pages contain each of these types of links are gleaned from the respective database dumps. For details, see the Meta page. For the Wiktionaries, I also show counts using three other sets of criteria (also explained at the Meta page).
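The definition above can be sketched in a few lines of Python. This is only an illustration, not the Perl script mentioned in this report: the `Page` record, the `count_good_articles` helper, the link-type labels, and the toy data are all hypothetical, and the sets of page IDs are assumed to have been extracted from the respective dumps beforehand.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Page:
    """Hypothetical minimal page record taken from a dump."""
    page_id: int
    namespace: int
    is_redirect: bool


# Link-type labels are assumptions chosen to mirror the definition above.
ALL_TYPES = ("page", "category", "file", "interlanguage", "interwiki")


def count_good_articles(
    pages: Iterable[Page],
    link_sources: Dict[str, Set[int]],
    link_types: Iterable[str],
    content_namespaces: frozenset = frozenset({0}),
) -> int:
    """Count non-redirects in a content namespace that have at least
    one link of an accepted type."""
    # Union of all page IDs that are the source of an accepted link.
    linked: Set[int] = set()
    for link_type in link_types:
        linked |= link_sources.get(link_type, set())
    return sum(
        1
        for p in pages
        if p.namespace in content_namespaces
        and not p.is_redirect
        and p.page_id in linked
    )


# Toy data (hypothetical, not taken from any real wiki).
pages = [
    Page(1, 0, False),  # content page with a page link
    Page(2, 0, False),  # content page with only a category link
    Page(3, 0, True),   # redirect, never counted
    Page(4, 0, False),  # content page with no links at all
    Page(5, 4, False),  # page outside the content namespaces
]
link_sources = {"page": {1, 3, 5}, "category": {2}}

generous = count_good_articles(pages, link_sources, ALL_TYPES)
strict = count_good_articles(pages, link_sources, ("page",))
```

Passing a different `link_types` tuple is one way to compare more and less generous criteria on the same dump data.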

Me: "AFAIK, that's the definition currently in use."

This only applies to the "link" article-count method, of course -- which all Wikisources and Wiktionaries are currently using, according to http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php (search for "wgArticleCountMethod").

(In reply to comment #6)

The exact definition I'm using for a "good" article is: "non-redirect in a
content namespace with any kind of internal-style [[wikilink]]: page, category,
image/file, interlanguage, or interwiki". AFAIK, that's the definition
currently in use.

I wouldn't be so sure. As I already told you, interwikis and/or category links may not be counted now (which makes sense, especially for interwikis).

OK... so I finally looked at r88113, which is apparently where all of this changed radically.

Let's be very precise here. There are many different kinds of wikilinks (a fact that has contributed greatly to confusion over this issue):

  1. page: e.g., [[link]] or [[Special:Statistics]]
  2. category: [[Category:English]]
  3. image/file: [[File:Yes.png]]
  4. interlanguage: [[:de:]] or [[de:]]
  5. interwiki: [[species:]]
  6. hidden: <!-- [[don't look at me]] -->
  7. deactivated: <nowiki>[[look at me]]</nowiki>

8-14. template-provided versions of, respectively, 1-7

Before r88113, 1-7 (in fact, _any_ instance of "[[") were all counted, but not 8-14. Afterwards, 1 and 8 are counted and no others. (Even though I can't check 8-14 with my script, checking for only type 1 links gave counts that matched {{NUMBEROFARTICLES}} on four wikis I tried it on. So there ya go.)
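The two rules can be contrasted with a small Python sketch. It is an illustration of the behaviour described in this comment, not the MediaWiki code: both function names are hypothetical, the namespace and redirect checks are left out, and the list of parsed page links is assumed to be supplied by the parser (it cannot be derived from the raw text here).

```python
from typing import List


def counted_before_r88113(raw_wikitext: str) -> bool:
    """Old rule as described above: any literal "[[" in the raw page source."""
    return "[[" in raw_wikitext


def counted_after_r88113(parsed_page_links: List[str]) -> bool:
    """New rule as described above: at least one page link (types 1 and 8)
    in the parsed output. The list is an assumed input from the parser."""
    return len(parsed_page_links) > 0


# (raw source, page links assumed to result from parsing)
examples = [
    ("See [[link]].", ["link"]),           # type 1: plain page link
    ("[[Category:English]]", []),          # type 2: category only
    ("<!-- [[don't look at me]] -->", []), # type 6: hidden in a comment
    ("{{some template}}", ["link"]),       # type 8: assumes the template emits a page link
]

results = [
    (counted_before_r88113(raw), counted_after_r88113(links))
    for raw, links in examples
]
```

The last example shows why the raw source is no longer enough: the page has no "[[" in it, yet it counts under the new rule once the template is expanded.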

Unfortunately, this means you can't tell anymore just from the raw page source whether a page will be an article or not (I mean, say, if it has a template on it but no page links); it must be parsed first.

Seems to me, this amounts to a fundamental change in the way articles are counted (the changes in article counts that have resulted are proof enough of this) that was only ever discussed beforehand by a handful of people in bug 11868 -- and nobody there seemed to actually be discussing _this_ particular counting method! (Brion, for example, stated that the new method would "overcount" articles, which is the opposite of what has happened!)

IOW, this "new" state of affairs (which, although over a year old at this point, has not yet propagated to projects beyond Wikisource and Wiktionary, because updateArticleCount.php hasn't been run on them) was not arrived at through any real consensus process. In fact, Nemo_bis, I see that's essentially what you said just 3 weeks before the changes were committed by IAlex (https://bugzilla.wikimedia.org/show_bug.cgi?id=24754#c1).

So, anyway... I guess this bug is finished, and I need to start a (now more informed) discussion about this on Meta....

(In reply to comment #9)

OK... so I finally looked at r88113, which is apparently where all of this
changed radically.

Wow, I wasn't aware of that.

Unfortunately, this means you can't tell anymore just from the raw page source
whether a page will be an article or not (I mean, say, if it has a template on
it but no page links); it must be parsed first.

For the verification purposes discussed here, you can use pagelinks.sql.gz though.

Yeah, I just realized that! [g]

For some reason I was thinking that the way my script was doing it would miss links provided by templates, but of course that's not true: what my script does is _exactly_ what the MW code itself does when not triggered by a page edit: it checks pagelinks.sql for the existence of links originating from the page in question!
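For anyone wanting to repeat the check, the pagelinks-based count might look like the following Python sketch. It uses an in-memory SQLite database with two reduced tables whose column names are modeled on the MediaWiki `page` and `pagelinks` tables; the reduced schema and the toy rows are assumptions, and loading the real page.sql.gz and pagelinks.sql.gz dumps is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced, assumed versions of the two tables; the real dumps have more columns.
conn.executescript(
    """
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_is_redirect INTEGER
    );
    CREATE TABLE pagelinks (pl_from INTEGER);
    """
)

# Toy rows (hypothetical): id, namespace, is_redirect
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 0, 0), (3, 0, 1), (4, 0, 0), (5, 4, 0), (6, 0, 0)],
)
# Page 1 has two outgoing links, so it appears twice but must count once.
conn.executemany(
    "INSERT INTO pagelinks VALUES (?)",
    [(1,), (1,), (3,), (5,), (6,)],
)

# Non-redirects in ns0 that are the source of at least one page link.
QUERY = """
    SELECT COUNT(*)
    FROM page
    WHERE page_namespace = 0
      AND page_is_redirect = 0
      AND EXISTS (SELECT 1 FROM pagelinks WHERE pl_from = page_id)
"""
article_count = conn.execute(QUERY).fetchone()[0]
```

Because the pagelinks table is filled from parsed output, links that come from templates are already included, which is why this check does not need the raw page source.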

I don't know what I was thinking....