
Merging hourly pagecount files has failed on most days for the past few weeks.
Closed, Resolved, Public

Description

Christian:

I just noticed that the November directory of the pagecounts-ez/merged files at:

http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-11/

looks wrong. Many of the files end in ".~" instead of ".bz2".
The timestamps also differ from previous months. For example, each of the files in

http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-09/

was created on the day following the date in its file name.
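
The two symptoms described above (a ".~" suffix instead of ".bz2", and a modification date other than the day after the date in the file name) can be checked mechanically. The following is only a sketch: `looks_healthy` is a hypothetical helper, and the file names and dates in `listing` are made up for illustration rather than copied from the real directory.

```python
from datetime import date, timedelta


def looks_healthy(name, file_date, modified):
    """Return True if the file is a finished .bz2 written the day after its date."""
    return name.endswith(".bz2") and modified == file_date + timedelta(days=1)


# Hypothetical entries: (file name, date in the name, last-modified date).
listing = [
    ("pagecounts-2013-09-01.bz2", date(2013, 9, 1), date(2013, 9, 2)),
    ("pagecounts-2013-10-24.~", date(2013, 10, 24), date(2013, 11, 20)),
]

# Collect every entry that shows at least one of the two symptoms.
suspicious = [name for name, day, mtime in listing if not looks_healthy(name, day, mtime)]
print(suspicious)
```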

P.S.: I noticed that the problem seems to have started in October:

http://dumps.wikimedia.org/other/pagecounts-ez/merged/2013/2013-10/

There the 2013-10-24 file is not a ".bz2", but ".~".
That date struck me. Although it is probably completely unrelated, we had (for the first time) a strange log line in the zero logs on that same day: the timestamp of a log line was mangled [1].
We're seeing such requests more and more these days.

[1]


qchris@stat1002 0 20:18:05
cwd: ~
zcat /a/squid/archive/zero/zero.tsv.log-20131024 | cut -f 3 | grep -C 5 201cp3011
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
2013-10-23T13:29:23
201cp3011.esams.wikimedia.org
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
2013-10-23T13:29:24
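
The same kind of mangled value can also be found without knowing the broken string in advance, by testing every value in the timestamp column against the expected format. This is a rough sketch, assuming the column always uses the `YYYY-MM-DDTHH:MM:SS` layout seen in the output above; `is_valid_timestamp` is a hypothetical helper and the sample values are taken from that output.

```python
from datetime import datetime


def is_valid_timestamp(value):
    """Return True if value parses as YYYY-MM-DDTHH:MM:SS."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
        return True
    except ValueError:
        return False


# Sample values from the third column shown above.
column = [
    "2013-10-23T13:29:23",
    "201cp3011.esams.wikimedia.org",
    "2013-10-23T13:29:24",
]

# Keep only the values that do not parse as a timestamp.
mangled = [value for value in column if not is_valid_timestamp(value)]
print(mangled)
```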


Version: unspecified
Severity: normal

Details

Reference
bz57851

Event Timeline

bzimport raised the priority of this task from to Needs Triage. (Nov 22 2014, 2:37 AM)
bzimport set Reference to bz57851.

Any idea of the impact of this issue? Is it a problem?

-Toby

Low impact. But I will fix it in the new year. In the meantime, monthly totals are extrapolated from the remaining days. And once it is fixed, all missing files can be recreated from the permanently stored raw data.
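
How the monthly totals are extrapolated is not spelled out in this thread. One plausible reading, shown here purely as an assumption, is proportional scaling: sum the days that exist and scale by the number of days in the month. `extrapolate_monthly_total` is a hypothetical name and the counts are invented.

```python
import calendar


def extrapolate_monthly_total(daily_counts, year, month):
    """Scale the sum of the available days up to the full month.

    Assumes simple proportional scaling; the real scripts may differ.
    """
    if not daily_counts:
        raise ValueError("no daily data available")
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(daily_counts.values()) * days_in_month / len(daily_counts)


# Invented counts for three available days of November 2013 (day -> views).
available = {1: 100, 2: 120, 3: 110}
estimate = extrapolate_monthly_total(available, 2013, 11)
print(estimate)
```

Under this assumption the estimate gets less reliable as more days go missing, since each remaining day carries more weight.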

It certainly is a problem for me; I used those files several times.
(E.g. to understand the data we're seeing, to understand webstatscollector,
to understand pageviews)

Of course, I can run the aggregations myself upon need, but that means a huge
delay and waste of time :-/

Besides, it is a public set of daily data that has not been updated for
~1.5 months :-(

Not even 2 hours past comment #4 and I would already have needed
the data again :-)

I've just been pointed towards bug #58316. As we do not see the \x
hits in the sampled logs, I would naturally use Erik's merged files
to see if the problem is webstatscollector related.
Falling back to doing it by hand. Meh.

Sorry, I did not realize you use it that often. I will look at it in the coming days.

(In reply to comment #6)

> Sorry, I did not realize you use it that often. I will look at it in the coming
> days.

Sorry, my point was not to mess with your scheduling. Not at all!
I just wanted to show that the data indeed gets used.

It's perfectly fine by me if we fix it early 2014.

The daily files come in as expected again.