
Space characters in [pagecounts-raw] titles
Closed, Resolved, Public

Description

Author: westand

Description:
Beginning Feb. 1, space characters began appearing in *some* article titles in the pagecount data at http://dumps.wikimedia.org/other/pagecounts-raw/.

This file is space-delimited, so this is breaking some parsing schemes. I understand that some of the internal logs were changing to a tab-delimited format, but this was not supposed to affect the pagecount stuff:

http://lists.wikimedia.org/pipermail/wikitech-l/2013-January/066007.html

http://en.wikipedia.org/wiki/Wikipedia:VPT#Format_Change_of_Page_View_Stats
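
For anyone whose parser is affected, one possible workaround is to split from both ends of the line instead of on every space. The sketch below assumes the four-field layout of the pagecounts files (project, title, view count, byte count); the function name `parse_pagecount_line` is hypothetical and not part of any Wikimedia tooling.

```python
def parse_pagecount_line(line):
    """Split one pagecounts line into (project, title, views, bytes_sent).

    Assumption: each line has four fields in the order
    project, title, view count, byte count.
    The title is taken to be everything between the first field and the
    last two, so embedded spaces are kept instead of breaking the split.
    """
    # First field: project code (everything up to the first space).
    project, rest = line.rstrip("\n").split(" ", 1)
    # Last two fields: the numeric counters, split off from the right.
    title, views, bytes_sent = rest.rsplit(" ", 2)
    return project, title, int(views), int(bytes_sent)
```

A line with fewer than four fields raises `ValueError`, which lets the caller decide whether to skip or log malformed lines.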


Version: unspecified
Severity: normal

Details

Reference
bz45178

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 1:40 AM
bzimport set Reference to bz45178.

Hey Andrew,

Thanks for reaching out! Yes, you are right: a couple of thousand titles contain spaces. This did indeed start after the tab introduction, but in an unexpected way.

Prior to the tab introduction, the title of the page would be truncated (because we used space as a delimiter), so incorrect or incomplete titles would show up in the dumps data. Now, with the introduction of the tab delimiter, we have really surfaced this bug.

The space is introduced because, under very rare conditions, the Nginx server does not encode the space as %20. So far I have only seen this happen when the request comes from Googlebot and the server response is 301 (Moved Permanently).

We tried to replicate the conditions so that we could fix our Nginx server configuration, but we have not yet been able to do so. We could add a function to webstatscollector (the software that generates the data) that replaces those spaces with %20, but I am worried that this will introduce performance regressions.
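
To illustrate what such a replace step would do to a title, here is a minimal sketch in Python. It is not the webstatscollector code and says nothing about the performance concern; the function name `encode_title_spaces` is hypothetical.

```python
def encode_title_spaces(title: str) -> str:
    """Replace literal spaces in a title with %20.

    Illustrative only: every other character is left untouched, so
    sequences that are already percent-encoded are not encoded again.
    """
    return title.replace(" ", "%20")
```

For example, `encode_title_spaces("Foo Bar")` returns `"Foo%20Bar"`, and a title without spaces is returned unchanged.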

My plan is:

  1. We will test webstatscollector with a replace function. If this all works, great, problem solved.
  2. If the replace function introduces a performance regression, then I will mark this bug as WONTFIX.

Rest assured, this affects only a very small set of articles, and those views are not real views in the first place, as they come from Googlebot.