Webstatscollector reports numbers that are too high because it counts HTTP redirects
Closed, ResolvedPublic

Description

One of the longstanding issues with Webstatscollector is that it
counts redirects at the HTTP level.

So, for example, requesting:

  • a page with a lower case first letter [1],
  • a page from the desktop site on a mobile device [2],
  • a www.wikipedia.org/wiki/ path (first part is www, not a language) [3], or
  • Special:MyLanguage / Special:Random / Special:RandomRootPage / Special:RandomInCategory, or
  • namespace aliases, special page aliases and canonical special page names/namespace names

causes two requests to the caches, and webstatscollector counts both,
although only a single page is actually shown to the user.
As a result, the reported numbers are too high.
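
The double counting can be illustrated with a minimal sketch. It assumes a simplified log of (URL, HTTP status) pairs modeled on the wget transcripts [1]-[3] below; the names REQUESTS and count_requests are hypothetical, and this is not the webstatscollector code itself. The set of redirect status codes is the one named later in this thread.

```python
# Hypothetical, simplified request log: (requested URL, HTTP status).
# The entries mirror the three wget transcripts [1]-[3]; real cache logs
# carry many more fields.
REQUESTS = [
    ("http://en.wikipedia.org/wiki/main_page", 301),
    ("http://en.wikipedia.org/wiki/Main_page", 200),
    ("http://en.wikipedia.org/wiki/Main_Page", 302),
    ("http://en.m.wikipedia.org/wiki/Main_Page", 200),
    ("http://www.wikipedia.org/wiki/Main_Page", 301),
    ("http://en.wikipedia.org/wiki/Main_Page", 200),
]

# Status codes treated as HTTP-level redirects in this sketch.
REDIRECT_STATUSES = {301, 302, 303}


def count_requests(requests, skip_redirects):
    """Count requests, optionally ignoring HTTP-level redirects."""
    return sum(
        1
        for _url, status in requests
        if not (skip_redirects and status in REDIRECT_STATUSES)
    )


naive = count_requests(REQUESTS, skip_redirects=False)
fixed = count_requests(REQUESTS, skip_redirects=True)
print(naive, fixed)  # prints "6 3"
```

Three pages were shown to the user, but six requests reach the caches; skipping the redirect responses brings the count back to three.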

Since we're about to deploy a new webstatscollector anyway, and this
double counting should not be too hard to fix, let's get it fixed too.

(Note that redirects above the HTTP level are not affected. So for example

http://en.wikipedia.org/wiki/Michael_J_Fox

(no dot after the J) is, was and will be one request, although it shows
the content of

http://en.wikipedia.org/wiki/Michael_J._Fox

(dot after the J). Such redirects at Wiki level are not affected.)

[1]


christian@spencer jobs: 0 time: 13:13:36 // exit code: 0
cwd: ~
wget -O /dev/null 'http://en.wikipedia.org/wiki/main_page'
--2014-10-08 13:13:39-- http://en.wikipedia.org/wiki/main_page
Resolving en.wikipedia.org... 91.198.174.192
Connecting to en.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Main_page [following]
--2014-10-08 13:13:39-- http://en.wikipedia.org/wiki/Main_page
Reusing existing connection to en.wikipedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'

[ <=>                                                                                                                  ] 67,779      --.-K/s   in 0.1s

2014-10-08 13:13:39 (472 KB/s) - `/dev/null' saved [67779]

[2]


christian@spencer jobs: 0 time: 13:13:39 // exit code: 0
cwd: ~
wget -O /dev/null --user-agent 'iPhone' 'http://en.wikipedia.org/wiki/Main_Page'
--2014-10-08 13:13:44-- http://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org... 91.198.174.192
Connecting to en.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://en.m.wikipedia.org/wiki/Main_Page [following]
--2014-10-08 13:13:44-- http://en.m.wikipedia.org/wiki/Main_Page
Resolving en.m.wikipedia.org... 91.198.174.204
Connecting to en.m.wikipedia.org|91.198.174.204|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'

[ <=>                                                                                                                  ] 22,002      --.-K/s   in 0.05s

2014-10-08 13:13:44 (416 KB/s) - `/dev/null' saved [22002]

[3]


christian@spencer jobs: 0 time: 13:13:44 // exit code: 0
cwd: ~
wget -O /dev/null 'http://www.wikipedia.org/wiki/Main_Page'
--2014-10-08 13:13:49-- http://www.wikipedia.org/wiki/Main_Page
Resolving www.wikipedia.org... 91.198.174.192
Connecting to www.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Main_Page [following]
--2014-10-08 13:13:49-- http://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org... 91.198.174.192
Reusing existing connection to www.wikipedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'

[ <=>                                                                                                                  ] 67,565      --.-K/s   in 0.1s

2014-10-08 13:13:49 (471 KB/s) - `/dev/null' saved [67565]


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72102

Details

Reference
bz71790

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:54 AM
bzimport set Reference to bz71790.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to christian from comment #0)

Since we're about to deploy a new webstatscollector anyway, and this
double counting should not be too hard to fix, let's get it fixed too.

+1. https://meta.wikimedia.org/w/index.php?title=Research_talk:Page_view&oldid=10069001#Special_namespace_and_actual_problems (I'll miss stats for Special:MyLanguage, but that was a dirty trick).

Are we talking of 301 and 302 or something more?

(In reply to Nemo from comment #1)

I'll miss stats for Special:MyLanguage, [...]

Yup. I'll miss stats for Special:Random :-(

Are we talking of 301 and 302 or something more?

301, 302, and 303.

303 basically only affects bots on wikidata. But there, some requests [1]
see two 303s, before content gets sent.

[1]


christian@spencer jobs: 0 time: 16:34:01 // exit code: 0
cwd: ~
wget -O /dev/null --header='Accept: text/html' 'https://www.wikidata.org/entity/Q507970'
--2014-10-08 16:34:02-- https://www.wikidata.org/entity/Q507970
Resolving www.wikidata.org... 91.198.174.192
Connecting to www.wikidata.org|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Special:EntityData/Q507970 [following]
--2014-10-08 16:34:03-- https://www.wikidata.org/wiki/Special:EntityData/Q507970
Reusing existing connection to www.wikidata.org:443.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Q507970 [following]
--2014-10-08 16:34:03-- https://www.wikidata.org/wiki/Q507970
Reusing existing connection to www.wikidata.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'

[ <=>                                                                                                                  ] 81,443      --.-K/s   in 0.1s

2014-10-08 16:34:04 (593 KB/s) - `/dev/null' saved [81443]

I'm sure we can count special page requests separately if we want them...

Oh. Counting of Special pages won't change per se.

It only affects those Special pages that happen to come with 301, 302,
or 303 HTTP status codes.

So for example Special:Search, or Special:Export come with HTTP status
code 200. They'll still be counted as usual.
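
The per-page effect can be sketched as follows. The sample entries are hypothetical, and the 302 status shown for Special:Random is an assumption made for illustration; only the 200 status for Special:Search and Special:Export comes from the comment above.

```python
from collections import Counter

# Status codes that are no longer counted after the fix.
REDIRECT_STATUSES = {301, 302, 303}

# Hypothetical sample of (page title, HTTP status) pairs.
# The 302 for Special:Random is an assumption for illustration.
SAMPLE = [
    ("Special:Search", 200),
    ("Special:Export", 200),
    ("Special:Random", 302),
    ("Special:Search", 200),
]


def per_title_counts(entries):
    """Tally requests per title, skipping HTTP-level redirects."""
    counts = Counter()
    for title, status in entries:
        if status not in REDIRECT_STATUSES:
            counts[title] += 1
    return counts


counts = per_title_counts(SAMPLE)
# Counter returns 0 for titles that were never counted.
print(counts["Special:Search"], counts["Special:Export"], counts["Special:Random"])
# prints "2 1 0"
```

Titles that are only ever answered with a redirect drop out of the per-page stats entirely, while Special pages served with 200 keep their counts.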

Change 165351 had a related patch set uploaded by QChris:
Release fix that stops counting [uU]ndefined and redirects

https://gerrit.wikimedia.org/r/165351

Change 165631 had a related patch set uploaded by QChris:
Stop counting 301, 302, 303 HTTP status codes

https://gerrit.wikimedia.org/r/165631

Change 165725 had a related patch set uploaded by QChris:
Stop counting 301, 302, 303 HTTP status codes

https://gerrit.wikimedia.org/r/165725

Change 165748 had a related patch set uploaded by QChris:
[webstatscollector] Add condition to not count redirects

https://gerrit.wikimedia.org/r/165748

Change 165631 merged by jenkins-bot:
Stop counting 301, 302, 303 HTTP status codes

https://gerrit.wikimedia.org/r/165631

Change 165351 merged by Ottomata:
Release fix that stops counting [uU]ndefined and redirects

https://gerrit.wikimedia.org/r/165351

Change 165725 merged by QChris:
Stop counting 301, 302, 303 HTTP status codes

https://gerrit.wikimedia.org/r/165725

The fix was deployed on 2014-10-15 ~19:01 and is effective.

The last affected files are

http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141015-200000.gz [1]
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/projectcounts-20141015-200000 [1]
http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/pagecounts-20141015-190000.gz
http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/projectcounts-20141015-190000

The first files without the [uU]ndefined counts are

http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141015-210000.gz [1]
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/projectcounts-20141015-210000 [1]
http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/pagecounts-20141015-200000.gz
http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/projectcounts-20141015-200000

[1] When restarting collector and filter for the C implementation of
webstatscollector, there was a period (<2 minutes) during which the new
collector and the old filter were both running. Hence, during this
period a few redirects made it into the 20:00:00 file.
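
For consumers of the data, a minimal sketch of how to tell from a file name whether an hourly file still includes redirect counts. The cutoffs are taken from the file lists above; the function name counts_redirects and the file-name pattern are assumptions for illustration, not part of any official tooling.

```python
import re
from datetime import datetime

# Timestamps of the last affected hourly files per data set,
# taken from the file lists above.
LAST_AFFECTED = {
    "pagecounts-raw": datetime(2014, 10, 15, 20, 0, 0),
    "pagecounts-all-sites": datetime(2014, 10, 15, 19, 0, 0),
}

# Matches e.g. "pagecounts-20141015-200000.gz" or
# "projectcounts-20141015-200000" (also at the end of a full URL).
FILE_RE = re.compile(r"(?:pagecounts|projectcounts)-(\d{8}-\d{6})(?:\.gz)?$")


def counts_redirects(dataset, filename):
    """Return True if the hourly file still includes redirect counts."""
    match = FILE_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    return stamp <= LAST_AFFECTED[dataset]


print(counts_redirects("pagecounts-raw", "pagecounts-20141015-200000.gz"))
print(counts_redirects("pagecounts-raw", "projectcounts-20141015-210000"))
```

Per note [1], the pagecounts-raw 20:00:00 file contains only a few redirects, so this sketch treats it as affected to stay on the safe side.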

Are retroactive adjustments of stats.wikimedia.org pageview stats expected?

Change 165748 merged by Ottomata:
[webstatscollector] Add condition to not count redirects

https://gerrit.wikimedia.org/r/165748

From http://thread.gmane.org/gmane.science.linguistics.wikipedia.research/4526 , it seems information about this change didn't trickle down to consumers of the data. Is it documented anywhere?

https://wikitech.wikimedia.org/wiki/Analytics/Webstatscollector is marked obsolete but seems to have no replacement; mediawiki.org pages recently got some warnings to visit wikitech, without specifying which page(s) now host the respective information; https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw has no such information; https://wikitech.wikimedia.org/wiki/Analytics/Pageviews and https://meta.wikimedia.org/wiki/Research:Page_view are silent on the matter.

Is it documented anywhere?

Yes. It got announced/documented at least 6 times :-)

On the corresponding data set pages on wikitech (search for "71790"):

Additionally, see the analytics list announcements:

And the mentions in the "Adventures in Clusterland" reports (search for "71790"):

https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw has no such information

That page has the information. At the bottom of the page, there is a table of issues that has a row:

| * 	| 2014-10-15 ~19:02:30 | bug 71790 | Redirects have been counted

Thanks. I forgot to look in the table among outages etc.

I know it was announced, I just wonder where main features of the new data are expected to be described. Can/should I add information to https://wikitech.wikimedia.org/wiki/Analytics/Pageviews (which has an empty "Changes" section) or to https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw (which is linked from the dumps)?