
page view statistics for Wikinews seem to be wrong
Closed, Resolved (Public)

Description

The Russian Wikinews community found that stats.grok.se (which builds its page view statistics from the Wikimedia dumps at http://dumps.wikimedia.org/other/pagecounts-raw/) shows an incorrect number of page views.

For example, take this page:
https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика

If we open its statistics (http://stats.grok.se/ru.n/latest/Категория:Чемпионат_мира_по_футболу_2014/Статистика), we see 6 views in the last 30 days, although the page actually had many more views than that. Just look at the number of edits to the page:
https://ru.wikinews.org/w/index.php?title=Категория:Чемпионат_мира_по_футболу_2014/Статистика&action=history

I know that stats.grok.se is an external tool, but it works on WMF's raw data, and it seems that the raw data is prepared incorrectly.


Version: unspecified
Severity: normal
Whiteboard: u=Community c=General/Unknown p=0 s=2014-06-26

Details

Reference
bz67411

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:26 AM)
bzimport set Reference to bz67411.

So is there an indicator that the issue is with Wikimedia's data, and not with stats.grok.se processing it?

(In reply to Andre Klapper from comment #1)

So is there an indicator that the issue is with Wikimedia's data, and not
with stats.grok.se processing it?

Given the file format it would be very difficult for it to be incorrectly processed.

My only comments on this would be:

*The pageview stats are based on URL matching, not on the actual page, so depending on how the page was /reached/, pageviews may not appear.
*Direct comparisons with edit events aren't possible, because multiple edit events can be launched from a single pageview, and edit events themselves are excluded from the counter (wrong MIME type).

Aren't these stats based on sampling as well?

Actually I don't know; ErikZ can talk to that bit better than me. The large, aggregate breakdowns definitely are; I'm not sure if we do URL matching against those to produce the output here, or if the output here is based on the raw count.

(In reply to Oliver Keyes from comment #4)

Actually I don't know; ErikZ can talk to that bit better than me. The large,
aggregate breakdowns definitely are; I'm not sure if we do URL matching
against those to produce the output here, or if the output here is based on
the raw count.

Looking at some of the raw data - I see a lot of pages with 1 hit. If they were sampled I wouldn't expect that, so please ignore me :)

(In reply to Andre Klapper from comment #1)

So is there an indicator that the issue is with Wikimedia's data, and not
with stats.grok.se processing it?

stats.grok.se typically reads our files without problems.
webstatscollector (the one producing those files) is more hairy.
It has broken before, and non-Latin characters in particular have
caused issues.

I'll check the files.

However, the way things look ... I am not sure if something is
broken. Given that we're measuring against edits and how
webstatscollector is filtering, everything might just be fine^Wwithin
expectations.

Checking it nonetheless.

(In reply to Oliver Keyes from comment #2)

*Direct comparisons with edit events isn't possible because multiple edit
events can be launched from a single pageview, [...]

Right. And for example bots need not do a pageview (in
webstatscollector sense). They can edit right away.

and edit events themselves
are excluded from the counter (wrong MIME type)

webstatscollector does not care about MIME types, and counts requests
regardless of MIME types.

However, webstatscollector cares about "/wiki/" being in the URL.
Edits are typically made through the API or directly through
/w/index.php. Neither of those has "/wiki/" in the URL, so edits do
not get counted by webstatscollector.
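
To make the "/wiki/" rule concrete, here is a minimal sketch in Python. It is not webstatscollector's actual code: the function name is_counted is made up for illustration, and the real filter applies further rules that are not modelled here. It only shows why a normal page view passes while an edit through /w/index.php or the API does not.

```python
from urllib.parse import urlsplit


def is_counted(url: str) -> bool:
    """Hypothetical stand-in for the rule described above.

    A request counts only if "/wiki/" appears in the path of the URL.
    The query string is ignored on purpose, so a title passed as
    ?title=... to /w/index.php does not qualify.
    """
    return "/wiki/" in urlsplit(url).path


if __name__ == "__main__":
    for url in (
        "https://ru.wikinews.org/wiki/Example",
        "https://ru.wikinews.org/w/index.php?title=Example&action=edit",
        "https://ru.wikinews.org/w/api.php?action=edit",
    ):
        print(is_counted(url), url)
```

Under this simplified rule, only the first of the three example URLs would be counted.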

(In reply to Bawolff (Brian Wolff) from comment #3)

Aren't these stats based on sampling as well?

It's one of the few parts that is unsampled :-)
stats.grok.se is driven by

http://dumps.wikimedia.org/other/pagecounts-raw/

which is the output of webstatscollector, which consumes the full
unsampled firehose (well ... there is some packet loss).
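
For readers who want to check the files themselves, the following Python sketch parses one line of a pagecounts-raw file. It assumes the commonly documented layout of four space-separated fields (project code, percent-encoded title, request count, bytes transferred); verify that against the dump documentation before relying on it. The names PageCount and parse_line are hypothetical, and the sample line is invented.

```python
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    project: str     # e.g. "ru.n" for ru.wikinews.org
    title: str       # title with percent-encoding decoded
    views: int       # number of requests in the file's time slot
    bytes_sent: int  # bytes transferred (assumed fourth field)


def parse_line(line: str) -> PageCount:
    """Parse one line, assuming 'project title count bytes'."""
    project, title, views, bytes_sent = line.rstrip("\n").split(" ")
    return PageCount(project, unquote(title), int(views), int(bytes_sent))


if __name__ == "__main__":
    # Invented sample line, not taken from a real dump.
    print(parse_line("ru.n %D0%9A%D0%B0%D1%82 6 12345\n"))
```

A line that does not have exactly four fields raises ValueError in this sketch, which makes malformed input easy to spot.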

While the page has some properties that could explain away some of the
effects we're seeing, it turned out that since mid-April, SSL logs have
no longer been fed into webstatscollector (bug 67456), hence SSL traffic
did not get counted on stats.grok.se.

Across all projects, SSL traffic does not account for too much, but
for ru.wikinews.org SSL traffic seems more relevant.

Seeing if I can find more things.

I could not find any hiccups in the counting pipeline other than bug 67456.

(In reply to christian from comment #7)

While the page has some properties that would allow to explain away some
effects we're seeing, it turned out that since mid-April, SSL logs were
no longer fed into webstatscollector (bug 67456), hence SSL traffic did
not get counted on stats.grok.se.

Across all projects, SSL traffic does not account for too much, but
for ru.wikinews.org SSL traffic seems more relevant.


(In reply to christian from comment #7)

Seeing if I can find more things.

I could not find more things.

Checking with 7 consecutive days from the end of June, and adding the SSL
page counts by hand, page views for this page went from ~1/day up to
~14/day, which is more plausible.
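
The by-hand correction amounts to adding the SSL hits to the non-SSL hits day by day and averaging. A minimal Python sketch, with a hypothetical function name and placeholder numbers chosen only to mirror the rough ~1/day and ~14/day figures above (they are not the real per-day counts):

```python
def daily_average(http_views, ssl_views):
    """Mean views per day after adding SSL hits to non-SSL hits, day by day."""
    if len(http_views) != len(ssl_views):
        raise ValueError("need one value per day in both lists")
    totals = [h + s for h, s in zip(http_views, ssl_views)]
    return sum(totals) / len(totals)


if __name__ == "__main__":
    # Placeholder values for 7 days, not measured data.
    http_only = [1, 1, 1, 1, 1, 1, 1]
    ssl_only = [13, 13, 13, 13, 13, 13, 13]
    print(daily_average(http_only, ssl_only))
```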

Also bear in mind that webstatscollector counts a hit on a redirect only
for the source of the redirect, not for its target. So hits for

https://ru.wikinews.org/wiki/Чемпионат_мира_по_футболу_2014/Статистика

(which redirects to

https://ru.wikinews.org/wiki/Категория:Чемпионат_мира_по_футболу_2014/Статистика

) are counted only at

http://stats.grok.se/ru.n/latest30/Чемпионат_мира_по_футболу_2014/Статистика

and show considerably more page views already. But those numbers are
going to increase further, once SSL requests get fed into
webstatscollector again.
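
To estimate how often the target page was actually seen, the counts of its redirect sources have to be added to its own count. A minimal Python sketch of that bookkeeping follows; the function name, the redirect mapping, and the counts are all hypothetical, and neither stats.grok.se nor webstatscollector is claimed to offer such a feature.

```python
def views_including_redirects(counts, redirects, target):
    """Sum the target's own views and the views of every redirect to it.

    counts:    mapping of title -> counted views (by requested title)
    redirects: mapping of redirect source title -> target title
    """
    total = counts.get(target, 0)
    for source, destination in redirects.items():
        if destination == target:
            total += counts.get(source, 0)
    return total


if __name__ == "__main__":
    target = "Категория:Чемпионат_мира_по_футболу_2014/Статистика"
    source = "Чемпионат_мира_по_футболу_2014/Статистика"
    counts = {target: 6, source: 100}  # placeholder numbers
    redirects = {source: target}
    print(views_including_redirects(counts, redirects, target))
```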

SSL requests are being fed into webstatscollector again.