site_stats.ss_good_articles and site_stats.ss_total_pages not synchronized with the real count
Closed, Invalid (Public)

Description

site_stats.ss_good_articles and site_stats.ss_total_pages do not match the corresponding counts obtained by querying the page table directly.

site_stats.ss_total_pages != count of all rows in page;
site_stats.ss_good_articles != count of pages in ns 0 that are non-redirects and not dead ends

Checked on random wikis. In all tested cases the site_stats values were higher than the corresponding query results.


Version: unspecified
Severity: trivial

Details

Reference
bz10834

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 21 2014, 9:53 PM)
bzimport set Reference to bz10834.
bzimport added a subscriber: Unknown Object (MLST).

Addendum: I tested some other wikis and got site_stats values lower than the query results, so the discrepancy does not go in a predictable direction.

robchur wrote:

Checked on which random wikis in what manner? Exact SQL used would be helpful to check we've got inconsistent data, rather than invalid assumptions about what the statistics represent. You've taken issues like possible replication lag into consideration, where applicable?

Tested on the toolserver a couple of minutes ago. Replication lag on s2 and s3 was within 0-4 seconds while these queries were running.

query #1> SELECT ss_total_pages, ss_good_articles FROM site_stats;
query #2> SELECT COUNT(*) AS totalpages FROM page;
query #3> SELECT COUNT(DISTINCT page_id) AS goodarticles FROM page LEFT JOIN pagelinks ON page_id = pl_from WHERE pl_from IS NOT NULL AND page_namespace = 0 AND page_is_redirect = 0;

Query #3 is based on the rule "good article = page in ns 0 AND not a redirect AND not a dead end". (The dead-end query was taken from http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SpecialDeadendpages.php?view=markup and slightly modified; it gives the same result as query #2 - exact query from svn.)

cswiki
+----------------+------------------+
| ss_total_pages | ss_good_articles |
+----------------+------------------+
|         187044 |            73934 |
+----------------+------------------+
+------------+
| totalpages |
+------------+
|     186931 |
+------------+
+--------------+
| goodarticles |
+--------------+
|        74123 |
+--------------+

cswikisource
+----------------+------------------+
| ss_total_pages | ss_good_articles |
+----------------+------------------+
|           4049 |             2523 |
+----------------+------------------+
+------------+
| totalpages |
+------------+
|       4049 |
+------------+
+--------------+
| goodarticles |
+--------------+
|         2464 |
+--------------+

skwiki
+----------------+------------------+
| ss_total_pages | ss_good_articles |
+----------------+------------------+
|         143111 |            73727 |
+----------------+------------------+
+------------+
| totalpages |
+------------+
|     143779 |
+------------+
+--------------+
| goodarticles |
+--------------+
|        73658 |
+--------------+
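
For anyone wanting to repeat this check, the three queries can be folded into a single comparison. The following is only a sketch under assumptions: it runs against an in-memory SQLite database with a toy subset of the page, pagelinks and site_stats tables (just the columns used in the queries above), and `compare_counts` is a hypothetical helper name. The LEFT JOIN ... IS NOT NULL form of query #3 is written here as a plain inner join, which selects the same rows.

```python
import sqlite3

# Toy stand-in for the MediaWiki tables; only the columns the queries in
# this report touch are modelled (an assumption, not the real schema).
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                   page_namespace INTEGER,
                   page_is_redirect INTEGER);
CREATE TABLE pagelinks (pl_from INTEGER);
CREATE TABLE site_stats (ss_total_pages INTEGER, ss_good_articles INTEGER);
"""

# Queries #1-#3 combined: stored counters next to the recomputed ones.
COMPARE = """
SELECT ss_total_pages,
       (SELECT COUNT(*) FROM page) AS totalpages,
       ss_good_articles,
       (SELECT COUNT(DISTINCT page_id)
          FROM page JOIN pagelinks ON page_id = pl_from
         WHERE page_namespace = 0 AND page_is_redirect = 0) AS goodarticles
  FROM site_stats
"""

def compare_counts(conn):
    """Return (stored - recomputed) for total pages and for good articles."""
    ss_total, total, ss_good, good = conn.execute(COMPARE).fetchone()
    return ss_total - total, ss_good - good

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Four pages: two ns-0 articles, one ns-0 redirect, one page in ns 1.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)",
                 [(1, 0, 0), (2, 0, 0), (3, 0, 1), (4, 1, 0)])
# Page 1 has two outgoing links, page 2 has none (a dead end).
conn.executemany("INSERT INTO pagelinks VALUES (?)", [(1,), (1,), (3,), (4,)])
# Deliberately stale counters.
conn.execute("INSERT INTO site_stats VALUES (5, 2)")

print(compare_counts(conn))
```

A positive difference means site_stats is higher than the recomputed value, a negative one means it is lower; both directions were observed in this report.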

A couple of suggestions I got on MediaWiki-General today:

Tim suggested invalid titles as a cause.
Duesentrieb said that, the last time he looked at the source, the stats only checked for the presence of the "[[" string.

I can't remember which wiki it was on, but a couple of times I also got site_stats.ss_good_articles > COUNT(*) WHERE page_namespace = 0 AND page_is_redirect = 0.

Live counts check for namespace, non-redirect, and '[['.

Re-initialized count checks only for namespace, non-redirect, and (I *think*) non-empty. It's not efficient to check for '[[' in text in a bulk query since text has to be loaded and decompressed separately.

Counts are re-initialized automatically more frequently now due to the checks for rolled-over or otherwise broken counts.

Counting pagelinks entries wouldn't necessarily give the same count as '[[' checks (interwikis, images, categories, or just plain invalid links).
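
To make the difference between these criteria concrete, here is a small sketch on made-up pages. Everything in it is an assumption for illustration: the `Page` class, the three check functions and the sample data are hypothetical and are not MediaWiki code; the `pagelinks` field simply stands for the number of pagelinks rows a page would have.

```python
from dataclasses import dataclass

@dataclass
class Page:
    title: str
    namespace: int
    is_redirect: bool
    text: str
    pagelinks: int  # toy value: number of pagelinks rows with pl_from = this page

def live_check(p):
    # Live count as described above: namespace, non-redirect, and '[[' in text.
    return p.namespace == 0 and not p.is_redirect and "[[" in p.text

def reinit_check(p):
    # Re-initialized count as (tentatively) described above: non-empty text.
    return p.namespace == 0 and not p.is_redirect and p.text != ""

def pagelinks_check(p):
    # Criterion used by query #3 in this report: at least one pagelinks row.
    return p.namespace == 0 and not p.is_redirect and p.pagelinks > 0

def counted(check, pages):
    """Titles of the pages that a given criterion would count."""
    return {p.title for p in pages if check(p)}

pages = [
    Page("Plain link", 0, False, "See [[Other page]].", 1),
    # A category link contains '[[' but is not stored in pagelinks.
    Page("Category only", 0, False, "Text. [[Category:Things]]", 0),
    # Links produced by a template: rows in pagelinks, but no '[[' in the text.
    Page("Template only", 0, False, "{{example-template}}", 2),
    Page("No links", 0, False, "Just text.", 0),
    Page("Redirect", 0, True, "#REDIRECT [[Plain link]]", 1),
]

for check in (live_check, reinit_check, pagelinks_check):
    print(check.__name__, sorted(counted(check, pages)))
```

In this toy data the '[[' check and the pagelinks check happen to count the same number of pages, but not the same pages, so even matching totals would not show that the two criteria agree.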

mashiah.davidson wrote:

the same for ruwiki

(In reply to comment #5)

> Live counts check for namespace, non-redirect, and '[['.
>
> Re-initialized count checks only for namespace, non-redirect, and (I *think*)
> non-empty. It's not efficient to check for '[[' in text in a bulk query since
> text has to be loaded and decompressed separately.
>
> Counts are re-initialized automatically more frequently now due to the checks
> for rolled-over or otherwise broken counts.
>
> Counting pagelinks entries wouldn't necessarily give the same count as '[['
> checks (interwikis, images, categories, or just plain invalid links).

So is this bug still an issue?

*** Bug 15746 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

Given that many Wiktionaries are putting a <!-- [[ --> into their page source for pages that only contain templates (which thus have links but no [[) to make them count, yes, it is most definitely an issue. (The English Wiktionary doesn't do this; instead it insists on manual links as template parameters rather than letting the template do the linking, resulting in templates that must check for every argument whether it is a valid page name and then link optionally.)

If the live count were changed to be outgoing links, then matters would be much improved - though removing the link/[[ restriction completely would be another acceptable solution.

Other proposals I have seen are for (optionally) counting {{ instead of [[, but I think this is unnecessarily complicated.
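
A short sketch of why the comment workaround has an effect, assuming the live check is a plain substring test as described earlier in this thread. The function names and the sample wikitext are hypothetical; the second function only illustrates the "count {{ as well" proposal and is not an existing option.

```python
def counted_by_bracket_check(wikitext: str) -> bool:
    # Plain substring test (assumption: no parsing, so text inside HTML
    # comments is matched as well).
    return "[[" in wikitext

def counted_by_bracket_or_brace_check(wikitext: str) -> bool:
    # Hypothetical variant of the proposal to accept '{{' in addition to '[['.
    return "[[" in wikitext or "{{" in wikitext

template_only = "{{example-template}}"
with_workaround = "{{example-template}}<!-- [[ -->"

print(counted_by_bracket_check(template_only))           # page is not counted
print(counted_by_bracket_check(with_workaround))         # comment makes it count
print(counted_by_bracket_or_brace_check(template_only))  # counted without the hack
```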

The issue was the count falling out of date, not what it should include.

(In reply to comment #12)

> The issue was the count falling out of date, not what it should include.

That was the issue of bug 15746 and similar bugs, but not of this one. The dropout in counting the other day is just half of the problem. This bug is about the general question of how to treat the counter.

(In reply to comment #13)

> (In reply to comment #12)
> > The issue was the count falling out of date, not what it should include.
>
> That was the issue of bug 15746 and similar bugs, but not of this one. The
> dropout in counting the other day is just half of the problem. This bug is
> about the general question of how to treat the counter.

Then the bug should actually say that in the summary or initial comment :)

(In reply to comment #11)

> If the live count were changed to be outgoing links, then matters would be much
> improved - though removing the link/[[ restriction completely would be another
> acceptable solution.

If the count were to use the link tables rather than some criterion based on the article text, refreshing this count would be a lot easier.
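
As a rough illustration of that point: with a link-table criterion the whole recount is a single statement that never touches page text. The sketch below is an assumption-laden toy (in-memory SQLite, a minimal subset of the tables, a hypothetical `refresh_good_articles` helper) and is not MediaWiki's actual maintenance code.

```python
import sqlite3

# Recount from the link tables only; no article text has to be loaded.
REFRESH = """
UPDATE site_stats
   SET ss_good_articles = (
       SELECT COUNT(*) FROM page
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND EXISTS (SELECT 1 FROM pagelinks WHERE pl_from = page_id))
"""

def refresh_good_articles(conn):
    """Recompute ss_good_articles and return the new stored value."""
    conn.execute(REFRESH)
    return conn.execute("SELECT ss_good_articles FROM site_stats").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                   page_namespace INTEGER,
                   page_is_redirect INTEGER);
CREATE TABLE pagelinks (pl_from INTEGER);
CREATE TABLE site_stats (ss_good_articles INTEGER);
""")
# Page 1 links out twice, page 2 has no links, page 3 is a redirect.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)",
                 [(1, 0, 0), (2, 0, 0), (3, 0, 1)])
conn.executemany("INSERT INTO pagelinks VALUES (?)", [(1,), (1,), (3,)])
conn.execute("INSERT INTO site_stats VALUES (7)")  # stale counter

print(refresh_good_articles(conn))
```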

(In reply to comment #2)

> Checked on which random wikis in what manner? Exact SQL used would be helpful
> to check we've got inconsistent data, rather than invalid assumptions about
> what the statistics represent.

In fact I don't see any problem. Closing as INVALID.

(In reply to comment #13)

> (In reply to comment #12)
> > The issue was the count falling out of date, not what it should include.
>
> That was the issue of bug 15746 and similar bugs, but not of this one. The
> dropout in counting the other day is just half of the problem. This bug is
> about the general question of how to treat the counter.

Then this doesn't seem the best place. A Meta discussion would probably be better. See also bug 24754, bug 26033.