Page MenuHomePhabricator

Echo logs broken since Jan 16, 2014
Closed, ResolvedPublic

Description

It appears that we stopped accurately logging most Echo events on January 16, 2014

See: http://ee-dashboard.wmflabs.org/dashboards/enwiki-features

SELECT MAX(timestamp) FROM Echo_6081131 WHERE wiki = 'enwiki' AND event_notificationType = 'reverted';

20140116190505

SELECT MAX(timestamp) FROM Echo_6081131 WHERE wiki = 'enwiki' AND event_notificationType = 'edit-thank';

20140116190354


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=60550

Details

Reference
bz60555

Related Objects

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:52 AM
bzimport set Reference to bz60555.

bsitu wrote:

There was no PHP code merged since January 1st. I also did a test against revision 6081131 locally and it seemed to be fine:
{"event":{"version":"1.5","eventId":2418,"notificationType":"edit-thank","notificationGroup":"positive","sender":"1","recipientUserId":2,"recipientEditCount":911,"eventSource":"history","deliveryMethod":"web","revisionId":5348},"schema":"Echo","revision":6081131,"clientValidated":true,"wiki":"git_my_wiki","recvFrom":"ThinkPad-X220","timestamp":1390958511,"webHost":"git-core","userAgent":"Mozilla/5.0\u0020(X11;\u0020Ubuntu;\u0020Linux\u0020x86_64;\u0020rv:16.0)\u0020Gecko/20100101\u0020Firefox/16.0"}

We should look at production raw log to identify the error.

We are definitely still collecting raw events but they are not written into the log DB.

zgrep 'edit-thank' all-events.log-20140128.gz | wc -l
9812

zgrep 'edit-thank' all-events.log-20140128.gz | grep enwiki | wc -l
3166

Looks like whatever change caused this was part of 1.23wmf10.

Traceback (most recent call last):

File "/usr/local/lib/python2.7/dist-packages/eventlogging-0.6_20131120-py2.7.egg/EGG-INFO/scripts/eventlogging-processor", line 61, in <module>
  validate(event)
File "/usr/local/lib/python2.7/dist-packages/eventlogging-0.6_20131120-py2.7.egg/eventlogging/schema.py", line 76, in validate
  jsonschema.Draft3Validator(schema).validate(capsule)
File "/usr/lib/python2.7/dist-packages/jsonschema.py", line 227, in validate
  :exc:`ValidationError``\s to signal failed validation.

ValidationError: Additional properties are not allowed (u'userAgent' was unexpected)

We can either bump the schema module to reflect the added field (in which case UAs will go into the DB) or we can remove the userAgent field from server-side events. Analytics can choose. I can assist w/recovering events from the raw logs.

A third option is to make validation not fail when there are additional fields I think (which I thought it was the case).

Let's talk about this today, of the two options removing the log of UA server side
should be the easiest one.

Side question: Where are logs for events?

(In reply to comment #6)

A third option is to make validation not fail when there are additional
fields
I think (which I thought it was the case).

Let's talk about this today, of the two options removing the log of UA server
side
should be the easiest one.

Side question: Where are logs for events?

Found logs.

Copying my reply to Nuria here, so that the notes are public.

  • where are validation traces failures

Validation failures trigger exceptions, which are written to standard error. Stderr is captured by Upstart, so the failures are at /var/log/upstart/eventlogging-processor.log.

  • why some events are affected but not others (I would expect all server side events to be affected equally)

Me too. But I don't remember if we pushed the change to both production branches at the same time. So it may be that one branch was generating events that were passing validation, and the other, not.

  • whether changing validation not to fail upon adding additional fields seems a good idea.

It beats ALTER TABLE ... ADD COLUMN random_field ...

The events need to adhere to the table definition. But it goes beyond that. This isn't simply a trade-off that comes with using a relational database as (one of several) storage backends; it's one of the core design decisions of the entire platform. The setup EventLogging replaced allowed free-form output and as a result the datasets used every possible ad-hoc format you can think of, with the result that the data was usually logged incorrectly and no one knew how to analyze it a week after it was collected because the "wisdom" of how to interpret it was lost.

If you program something that handles EventLogging data, you will see the simplicity that comes with knowing definitively that your input matches your expectations. You don't need to handle edge case after edge case.

Events that fail validation are still recorded in the raw logs, so they are not lost; they merely don't get first-class treatment.

I can see why bug 60550 might lead you to question this design. But let me tell you: this is probably the worst screw-up in the >1 year that this has been operational; the previous norm was to have the system running for months at a time without any intervention from me or anyone else. The mistake in this case was mine (I hacked the code to pop the 'userAgent' field locally, but then clobbered it by deploying an unpatched version on top of it).

In hindsight, I should have insisted on data sterility for the entirety of the process. When the validation failure has to do with the event object and not the capsule, the developer has the responsibility (and the means) for finding out, but this isn't the case with EventCapsule schema changes. But I really didn't expect this to take so long -- I thought we'd be done with the user-agent change within a day of deploying the patch at the most. If it doesn't happen now, it may be best to revert the change, drop the userAgent field from the capsule schema, and remove the column from the database -- it's not responsible of us to keep the system in this halfway state, and not surprising that it leads to mistakes and confusion.

swalling wrote:

This also seems to be impacting other schemas. ServerSideAccountCreation_5487345 has no data since 20140204000427 in the log db.

I am closing this bug as the recovery of ServerSide events is underway (per https://bugzilla.wikimedia.org/show_bug.cgi?id=60550) while the separate issue with validating new event types has been moved to https://bugzilla.wikimedia.org/show_bug.cgi?id=61698

[moving from MediaWiki extensions to Analytics product - see bug 61946]