"data:" URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se
Closed, Declined (Public)

Description

In Analytics on 2014-06-03:

22:12:42 <Nemo_bis> O_o http://stats.grok.se/pt.q/top

On the above page (which currently shows “Most viewed articles in 201403”),
ranks 1, 2, 6, 7, 8, and 9 match

^[dD]ata:image/png;base64,iVBORw0K

This looks wrong, as they look like data scheme URLs.
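For context on why the pattern ends in iVBORw0K: that string is simply the base64 encoding of the first six bytes of the fixed PNG file signature, so every base64-encoded PNG starts with it. A minimal sketch that checks this and applies the pattern to titles (the helper name and the sample titles are made up for illustration):

```python
import base64
import re

# Pattern quoted in the report above.
DATA_TITLE = re.compile(r"^[dD]ata:image/png;base64,iVBORw0K")

# First six bytes of the eight-byte PNG signature (0x89 'P' 'N' 'G' CR LF).
PNG_SIGNATURE_PREFIX = b"\x89PNG\r\n"


def looks_like_png_data_uri(title: str) -> bool:
    """Return True if a page title matches the reported pattern."""
    return DATA_TITLE.match(title) is not None


# Six bytes encode to exactly eight base64 characters, with no padding.
prefix = base64.b64encode(PNG_SIGNATURE_PREFIX).decode("ascii")
print(prefix)

# Hypothetical titles, for illustration only.
print(looks_like_png_data_uri("Data:image/png;base64,iVBORw0KGgo"))
print(looks_like_png_data_uri("Some_ordinary_article"))
```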


Version: unspecified
Severity: normal

Details

Reference
bz66112

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:11 AM
bzimport set Reference to bz66112.
bzimport added a subscriber: Unknown Object (MLST).

Looking through the log files, we indeed see requests for [1]

http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0K[...]

so webstatscollector is doing the right thing :-/

Currently, this traffic amounts to ~500K requests per day.

Such requests go back as far as the oldest sampled log files we still
have (though there were fewer of them back then).

Requested URLs are mostly to eswiki (~58%), and ptwiki (~38%).

Referrers are either empty (~97%) or coming mostly from ptwiki (to a
lesser extent eswiki, enwiki).

User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for >98% of requests.
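For anyone who wants to reproduce that kind of share on their own sample, here is a rough sketch in Python. Two assumptions: Python's re module needs the literal "(" escaped (the pattern as quoted only works unchanged in syntaxes where an unescaped "(" is literal, such as POSIX basic regular expressions), and the function names are made up. The first two sample user agents are ones posted later in this thread; the third is a hypothetical non-matching one.

```python
import re

# Same pattern as quoted above, with the "(" escaped for Python's re module.
UA_PATTERN = re.compile(r"^Mozilla/5\.0 \(Windows NT [56]\.")


def is_windows_nt_5_or_6(user_agent: str) -> bool:
    """True for user agents that claim Windows NT 5.x or 6.x."""
    return UA_PATTERN.match(user_agent) is not None


def matching_share(user_agents) -> float:
    """Fraction of the given user agents that match the pattern."""
    user_agents = list(user_agents)
    if not user_agents:
        return 0.0
    hits = sum(1 for ua in user_agents if is_windows_nt_5_or_6(ua))
    return hits / len(user_agents)


sample = [
    "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/35.0.1916.114 Safari/537.36",
    # Hypothetical non-Windows user agent, for contrast only.
    "Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0",
]
print(matching_share(sample))
```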

Unwrapping the inline data from the URLs and looking at it, it seems
they are just images for UI chrome.

The images in the data uri scheme decode to images from VectorBeta like

VectorBeta/resources/typography/images/search-fade.png
VectorBeta/resources/typography/images/tab-break.png
VectorBeta/resources/typography/images/tab-current-fade.png
VectorBeta/resources/typography/images/portal-break.png

[1] Since they are just UI images, here are some concrete examples:

http://es.wikipedia.org/wiki/data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAuCAIAAABmjeQ9AAAARElEQVR42mVO2wrAUAhy/f8fz+niVMTYQ3hLKkgGgN/IPvgIhUYYV/qogdP75J01V+JwrKZr/5YPcnzN3e6t7l+2K+EFX91B1daOi7sAAAAASUVORK5CYII=

http://pt.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAuCAIAAABmjeQ9AAAARElEQVR42mVO2wrAUAhy/f8fz%2BniVMTYQ3hLKkgGgN/IPvgIhUYYV/qogdP75J01V%2BJwrKZr/5YPcnzN3e6t7l%2B2K%2BEFX91B1daOi7sAAAAASUVORK5CYII%3D

http://es.wikipedia.org/wiki/data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg==

http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg%3D%3D
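A sketch of how such a URL can be unwrapped, using the last two examples above: take everything after "base64,", undo the percent-encoding, and read the width and height from the PNG IHDR chunk. The function names are made up for illustration, and only the first 24 bytes of the image are decoded.

```python
import base64
import struct
from urllib.parse import unquote

# The third and fourth example URLs from above (same image, second one percent-encoded).
PLAIN_URL = (
    "http://es.wikipedia.org/wiki/data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND/"
    "73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg=="
)
ENCODED_URL = (
    "http://es.wikipedia.org/wiki/Data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAAQCAIAAABY/YLgAAAAJUlEQVQIHQXBsQEAAAjDoND/"
    "73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNAAAAABJRU5ErkJggg%3D%3D"
)


def base64_payload(url: str) -> str:
    """Return the base64 payload of a data-URI-shaped request URL."""
    _, _, encoded = url.partition("base64,")
    # unquote() (not unquote_plus) so that a literal "+" in the payload survives.
    return unquote(encoded)


def png_size(payload: str) -> tuple:
    """Read (width, height) from the IHDR chunk of a base64-encoded PNG."""
    # 32 base64 characters decode to 24 bytes:
    # 8 signature + 4 chunk length + 4 chunk type + 4 width + 4 height.
    header = base64.b64decode(payload[:32])
    if header[:8] != b"\x89PNG\r\n\x1a\n" or header[12:16] != b"IHDR":
        raise ValueError("payload does not start with a PNG header")
    return struct.unpack(">II", header[16:24])


print(base64_payload(PLAIN_URL) == base64_payload(ENCODED_URL))
print(png_size(base64_payload(PLAIN_URL)))
```

For this pair the two payloads are identical after unquoting, and the header describes an image 1 pixel wide and 16 pixels high, which fits the "UI chrome" reading.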

This looks like a browser/crawler bug where the client interprets data URIs as relative URLs because it does not understand the protocol (and has a weird default for unknown protocols).
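To illustrate the hypothesis (this is only a sketch of the suspected failure mode, not a reproduction of any actual client): a standards-aware resolver leaves a data: reference alone, while a client that treats it as a relative path ends up with exactly the URL shape seen in the logs. The page URL and the naive_resolve helper are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page the client is currently on.
PAGE = "http://es.wikipedia.org/wiki/Ejemplo"
# Truncated data URI, as in the report.
DATA_URI = "data:image/png;base64,iVBORw0K"


def naive_resolve(base: str, reference: str) -> str:
    """Hypothetical buggy client: treats every reference as a relative path."""
    directory = base.rsplit("/", 1)[0]
    return directory + "/" + reference


# A correct resolver recognises the "data" scheme and returns the reference unchanged.
print(urljoin(PAGE, DATA_URI))

# The buggy variant produces a request against the wiki itself.
print(naive_resolve(PAGE, DATA_URI))
```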

(In reply to christian from comment #1)

User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for >98% of requests.

Do you know which browsers these actually are? Does it have the MSIE or Trident token?

It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not support data URIs. However, my understanding is that it's supposed to just drop it; I've never heard it would send a bogus request (I could be wrong, though).

(In reply to Matthew Flaschen from comment #2)

Do you know which browsers these actually are? Does it have the MSIE or
Trident token?

If you could share the full user agent, either publicly or privately, that might be helpful.

(In reply to Matthew Flaschen from comment #2)

It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not
support data URIs. However, my understanding is that it's supposed to just
drop it; I've never heard it would send a bogus request (I could be wrong,
though).

This (old IE support) is also why we have a PNG fallback, which it's supposed to use.

Sadly, no. This is not an IE<=7 issue. That was the first impression yesterday as well :-(

(In reply to Matthew Flaschen from comment #2)

Do you know which browsers these actually are?

Yes. User Agents are for example (figured they are generic enough to post):

Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36

Does it have the MSIE or
Trident token?

Nope.
Affected browsers are mostly Firefox (~65%) and Chrome (~33%).
In old versions and (as exhibited above) also new versions.

It seems to be a "Windows with (Firefox or Chrome)" issue.

(In reply to christian from comment #5)

It seems to be a "Windows with (Firefox or Chrome)" issue.

Or a bot spoofing their user-agent to pretend to be such.

Which is fairly common. Even IE has started deliberately making ambiguous user agents because the devs have realised that people write special rules around IE UAs.

Is there anything interesting in the x_analytics field? I recall a problem with a similar range of browsers from Zero - attempts to DDoS the ISP-level packet inspection in Bangladesh.

(In reply to Matthew Flaschen from comment #6)

(In reply to christian from comment #5)

It seems to be a "Windows with (Firefox or Chrome)" issue.

Or a bot spoofing their user-agent to pretend to be such.

I checked that. While we of course cannot rule it out, it does
not seem too plausible to me.

The number of requests is following a strong weekly pattern.

For each day, the client IPs fall into between 200 and 500 different /24 IP groups.
(Basically all matching the country for the relevant wikis. So Brazilian IPs
fetching ptwiki, Venezuelan IPs fetching eswiki.)
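In case someone wants to redo the /24 grouping on their own extract, a minimal sketch follows. The function names are made up, and the sample addresses come from the reserved documentation ranges, not from the logs.

```python
import ipaddress
from collections import Counter


def slash24(ip: str) -> str:
    """Return the /24 network that an IPv4 address belongs to."""
    # strict=False lets us pass a host address and get its enclosing network.
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def count_groups(ips) -> Counter:
    """Count requests per /24 group."""
    return Counter(slash24(ip) for ip in ips)


# Hypothetical client IPs (documentation ranges only).
sample = ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.42"]
groups = count_groups(sample)
print(len(groups), "distinct /24 groups")
for network, hits in groups.most_common():
    print(network, hits)
```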

Sure. A /smart/ botnet still could implement a weekly pattern and grab many
relevant different IPs that are correctly geolocated.
But then ... a smart botnet would not misinterpret data uris. And even if they
did by accident, such a smart botnet would notice and fix it.

So I'd rule bots out.

(In reply to Oliver Keyes from comment #7)

Is there anything interesting in the x_analytics field?

No. X-Analytics is empty for all those requests.

For those who want to take a look themselves, there are prefiltered (from sampled-1000 stream) tsvs for May and June 2014 in

/home/qchris/data-uris

on stat1002 (the date in the file name corresponds to the date in the file name
of the sampled-1000 tsv files).

(In reply to christian from comment #1)

The images in the data uri scheme decode to images from VectorBeta like

VectorBeta/resources/typography/images/search-fade.png
VectorBeta/resources/typography/images/tab-break.png
VectorBeta/resources/typography/images/tab-current-fade.png
VectorBeta/resources/typography/images/portal-break.png

These images are also part of the core Vector skin, where
they sit at [mediawiki/core]/skins/vector/images.

Hmm. Worth CCing the typography peeps and seeing if there's something weird in the implementation?

The images listed also do not have SVG versions, so I wouldn't blame the SVG->PNG fallback mechanism.

We were missing test cases that would prove that CSSMin is not borking data: URIs generated by LESS mixins like .background-image(), so I added some in https://gerrit.wikimedia.org/r/#/c/137698/ just in case.

(In reply to Bartosz Dziewoński from comment #11)

These images are also part of the core Vector skin, [...]

*Facepalm*
I had core at an old commit :-(

Yup ... they can come from core as well :-) Thanks.

Probably not relevant, as the CSS should be interpreted as UTF-8
... but since I've been burnt by UTF-8 support on Windows a few times,
I checked the CSS of some prominent Wikipedias [1], and it seems that,
of them, only

eswiki [2]
ptwiki [3]
plwiki [4]

had CSS classes using characters beyond 7-bit ASCII.

However, while eswiki and ptwiki are the affected ones, plwiki does not
seem to be affected.

[1] arwiki cswiki dawiki dewiki elwiki enwiki eswiki fawiki fiwiki
frwiki hewiki idwiki itwiki jawiki kowiki nlwiki nowiki plwiki ptwiki
ruwiki svwiki trwiki ukwiki zhwiki

[2] eswiki:

.arquería
.astronomía
.béisbol
.canadá
.cómics
.comunicación
[...]

[3] ptwiki:

.page-Wikipédia_Esplanada_geral
.page-Wikipédia_Esplanada_propostas

[4] plwiki:

.page-Wikipedia_Strona_główna
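A rough sketch of how such a check could be scripted (not necessarily how the list above was produced): scan a stylesheet for class selectors and keep those containing characters beyond 7-bit ASCII. The regular expression is a simplification of real CSS syntax and may also pick up things like file extensions inside url(...); the sample stylesheet and the function name are hypothetical, with class names borrowed from the lists above.

```python
import re

# Simplified class selector: a dot, a non-digit word character, then word characters or hyphens.
# In Python 3, \w is Unicode-aware for str patterns, so letters like "é" or "ł" are included.
CLASS_SELECTOR = re.compile(r"\.([^\W\d][\w-]*)")


def non_ascii_classes(css: str) -> list:
    """Return the sorted class names in css that contain non-ASCII characters."""
    names = {match.group(1) for match in CLASS_SELECTOR.finditer(css)}
    return sorted(name for name in names if not name.isascii())


# Hypothetical stylesheet; only the class names are taken from the thread.
sample_css = """
.mw-body { color: black; }
.béisbol { color: red; }
.page-Wikipedia_Strona_główna h1 { display: none; }
"""
print(non_ascii_classes(sample_css))
```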

Need collaboration with Platform to work on this further.

Milimetric subscribed.

untagging analytics, stats.grok is totally unmaintained and outdated

no longer relevant