
Don't accept data from automated bots in Event Logging
Closed, ResolvedPublic8 Estimated Story Points

Description

For example, from google:

My guess is this might be somehow related to framing.

However, it uses:

webHost : window.location.hostname

which in theory should work even in a framing scenario, unless the context of the JavaScript was somehow lost. On http://jsfiddle.net/596mX/ the top-level web host is http://jsfiddle.net, the frame host is http://fiddle.jshell.net, and it alerts the latter's hostname.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=55449

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:24 AM
bzimport set Reference to bz65508.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for bringing this up and for linking the other ticket. This is an issue we should take very seriously: people crunching the data will rarely remember to filter by webHost (assuming that filtering by wiki is sufficient), which leads to the inclusion of a lot of bogus events from test instances on labs, as well as legitimate but spurious events caused by users behind proxies or other factors.

I think we should try to enforce stricter validation of the webHost field, accepting events only from a list of known hostnames, and set up monitoring of events that fail validation on that field.
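A minimal sketch of the whitelist idea suggested above. The suffix list and the function name are illustrative assumptions, not the actual production configuration:

```python
# Sketch: validate the webHost field against a whitelist of known
# project hostnames before accepting an event. The suffixes below are
# illustrative examples, not the real production list.
KNOWN_HOST_SUFFIXES = (
    '.wikipedia.org',
    '.wikimedia.org',
    '.wiktionary.org',
    '.mediawiki.org',
)

def is_known_web_host(web_host):
    """Return True if web_host matches or ends with a known suffix."""
    if not web_host:
        return False
    host = web_host.lower()
    return any(host == s.lstrip('.') or host.endswith(s)
               for s in KNOWN_HOST_SUFFIXES)
```

Events failing this check (e.g. with webHost set to translate.googleusercontent.com) would be counted and monitored rather than silently inserted.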

Steven also suggested it could be related to Chrome's automatic translation feature.

My tests indicate that Chrome's translation feature still uses the original hostname in Chromium 34.0.1847.132 on Debian 7.5 (265804), but this behavior might vary by version.

kevinator set Security to None.
Milimetric claimed this task.
Milimetric subscribed.

We don't really know what's going on here; please update if this is still an issue.

Milimetric renamed this task from translate.googleusercontent.com in webHost for some client-side events to Don't accept data from automated bots in Event Logging.Dec 14 2015, 5:46 PM
Milimetric reopened this task as Open.
Milimetric updated the task description.
Nuria lowered the priority of this task from High to Low.Mar 13 2017, 4:00 PM
Nuria moved this task from Backlog (Later) to Wikistats on the Analytics board.
Nuria raised the priority of this task from Low to Medium.Apr 3 2017, 4:27 PM
Nuria raised the priority of this task from Medium to High.Apr 17 2017, 3:58 PM

We can do this work now that we are parsing the user agent for incoming EventLogging data.

We need to add a "self-identified" bot filter for all incoming data, using the same regex as the pageview code: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L59

It is unfortunate that this regex needs to be duplicated for both EL and pageview data.
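A sketch of what such a self-identified bot filter looks like. The alternation below is an illustrative subset of the kind of pattern used in the linked Webrequest.java, not the actual production regex:

```python
import re

# Illustrative subset of a spider/bot pattern; the real pattern lives
# in refinery's Webrequest.java and is considerably longer.
SPIDER_PATTERN = re.compile(r'bot|spider|crawler|http', re.IGNORECASE)

def is_self_identified_bot(user_agent):
    """Return True if the user agent string self-identifies as a bot."""
    return bool(user_agent and SPIDER_PATTERN.search(user_agent))
```

Crawlers typically include a token like "bot" or a URL in their user agent, which is what "self-identified" means here; this catches a Googlebot hitting the site but, as noted below, says nothing about JavaScript-level anomalies.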

This would take care of events sent by, say, a Google bot crawling the Android application, but it will not take care of other issues that have to do with JavaScript. The premise of the ticket is not entirely clear in this regard.

How do we keep track of bot-identified traffic?
Sending data to graphite

We can publish bot-identified events to a bot schema, similar to EventError, that is only pushed to hadoop (not to the db): https://meta.wikimedia.org/wiki/Schema:EventError
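A sketch of that routing decision, under the assumption that valid events and bot-identified events go to different streams; the topic names and function signature here are hypothetical:

```python
# Sketch: route bot-identified events to a Hadoop-only stream instead
# of the stream consumed into MySQL, mirroring how EventError events
# are handled. Topic names are hypothetical placeholders.
def route_event(event, is_bot):
    """Return the destination topic for an event (illustrative)."""
    if is_bot:
        # Bot events are kept for monitoring but never reach the db.
        return 'eventlogging_AutomatedRequest'
    return 'eventlogging_valid_mixed'
```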

Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria set the point value for this task to 8.

Change 350234 had a related patch set uploaded (by Fdans):
[eventlogging@master] Flag requests sent by spiders/bots using AutomatedRequest schema

https://gerrit.wikimedia.org/r/350234

Change 350235 had a related patch set uploaded (by Fdans):
[operations/puppet@production] Add AutomatedRequest to schema black list

https://gerrit.wikimedia.org/r/350235

Change 350234 merged by Ottomata:
[eventlogging@master] Mark events as bots if they self-identify

https://gerrit.wikimedia.org/r/350234

Change 352579 had a related patch set uploaded (by Fdans; owner: Fdans):
[eventlogging@master] Add handler for event filters

https://gerrit.wikimedia.org/r/352579

Change 350235 abandoned by Fdans:
Add AutomatedRequest to schema black list

Reason:
This is no longer needed since we've changed the approach for the task

https://gerrit.wikimedia.org/r/350235

Change 352582 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Add bot filter to mysql consumer

https://gerrit.wikimedia.org/r/352582

Change 352579 merged by Ottomata:
[eventlogging@master] Add handler for event filters

https://gerrit.wikimedia.org/r/352579

Change 352582 merged by Ottomata:
[operations/puppet@production] Add bot filter to mysql consumer

https://gerrit.wikimedia.org/r/352582

Change 355238 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced)

https://gerrit.wikimedia.org/r/355238

Change 355238 merged by Ottomata:
[operations/puppet@production] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced)

https://gerrit.wikimedia.org/r/355238

@Nuria @Tbayer is there anything we should announce before deploying this change?

@fdans: Yes, since this is going to affect the results of various queries (even though it's by improving their accuracy), people working with them should be notified. I think a quick note to Analytics-l would be justified.

Change 355482 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use is_not_bot filter function for eventlogging mysql consumer

https://gerrit.wikimedia.org/r/355482

Change 355482 merged by Ottomata:
[operations/puppet@production] Use is_not_bot filter function for eventlogging mysql consumer

https://gerrit.wikimedia.org/r/355482
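The merged change above applies an is_not_bot filter in the mysql consumer. A minimal sketch of how such a filter could look, assuming each event carries a parsed userAgent map with an is_bot flag (the field names are assumptions, not the actual schema):

```python
# Sketch: filter function applied by the mysql consumer so that events
# flagged as bots are not inserted into the log database.
# 'userAgent' / 'is_bot' field names are assumed for illustration.
def is_not_bot(event):
    """Return True if the event should be kept (i.e. not from a bot)."""
    ua = event.get('userAgent') or {}
    return not ua.get('is_bot', False)
```

This is also where the problem discussed below arises: server-side producers whose user agent parses as a bot would be filtered out along with real crawlers.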

How will this affect EventLogging calls made from PHP (which might need to be recorded whether the user used some bot framework or not)?

@Tgr: All calls go through varnish; there are no direct posts from PHP anymore (it has been a while), so they are all processed equally.


Except that requests initiated from the backend will be rejected because the user agent is MediaWiki (or something similar). I feel this hasn't really been thought through.

Note that even if EventLogging::logEvent forwarded the user agent (which it currently doesn't), filtering on it would still make no sense for schemas such as Pingback or CommandInvocation.

mysql:research@analytics-store.eqiad.wmnet [log]> select timestamp, count(*) from MediaWikiPingback_15781718 group by substr(timestamp, 1, 8) order by timestamp desc limit 50;
+----------------+----------+
| timestamp      | count(*) |
+----------------+----------+
| 20170524001855 |      125 |
| 20170523003847 |      148 |
| 20170522000909 |      140 |
| 20170521002601 |      131 |
| 20170520000349 |      118 |
| 20170519001419 |      179 |

Change 356423 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Revert "Use is_not_bot filter function for eventlogging mysql consumer"

https://gerrit.wikimedia.org/r/356423

Note that even if EventLogging::logEvent would forward the user agent (which it currently doesn't)

To recap from IRC. Server side events DO forward UA: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/includes/EventLogging.php#L67

However, there are 2 schemas that do not log using this method: MWPingback and CommandLine; of these, we think CommandLine is not used.

Change 356423 abandoned by Gergő Tisza:
Revert "Use is_not_bot filter function for eventlogging mysql consumer"

Reason:
I misdiagnosed the problem, it only affects a few schemas. There are more useful ways to handle it.

https://gerrit.wikimedia.org/r/356423

What about logging from the job queue (which could in theory happen for PageDeletion etc when some job creates/moves/deletes pages)? That will probably have a bot UA too. (It will probably be rare though and not sure whether the people owning the schemas would want to log it in the first place.)

Re IRC question: for the MediaWikiPingback schema, if the UA is not recorded, it won't be missed. All the information that could be possibly learned from it (MW version, or PHP version) is already included in the payload.

Change 356624 had a related patch set uploaded (by Fdans; owner: Fdans):
[eventlogging@master] Add is_mediawiki property to UA map

https://gerrit.wikimedia.org/r/356624

Change 356626 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Add exception for events tagged as coming from MW

https://gerrit.wikimedia.org/r/356626

Change 356624 merged by Nuria:
[eventlogging@master] Add is_mediawiki property to UA map

https://gerrit.wikimedia.org/r/356624

Change 357243 had a related patch set uploaded (by Ottomata; owner: Nuria):
[eventlogging@master] Simpler parsing of user_agent to asses whether 'mediawiki' is present

https://gerrit.wikimedia.org/r/357243

Change 357243 merged by Ottomata:
[eventlogging@master] Simpler parsing of user_agent to asses whether 'mediawiki' is present

https://gerrit.wikimedia.org/r/357243
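The change above replaces UA-map-based detection with a simpler substring check. A sketch of that idea, with the function name assumed for illustration:

```python
# Sketch: flag an event as MediaWiki-originated when the raw user agent
# contains the substring 'mediawiki' (case-insensitive), so server-side
# events are exempted from the bot filter.
def is_mediawiki(user_agent):
    """Return True if the user agent looks like MediaWiki itself."""
    return bool(user_agent) and 'mediawiki' in user_agent.lower()
```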

Change 356626 merged by Ottomata:
[operations/puppet@production] Add exception for events tagged as coming from MW

https://gerrit.wikimedia.org/r/356626

Closing; events are present in MediaWikiPingback from 20170606231658.