Page MenuHomePhabricator

zero.log contains duplicate host in logs
Closed, InvalidPublic

Description

yurik@stat1002:/a/squid/archive/zero$ zgrep orghttp zero.tsv.log-20140810.gz

produces a significant number of broken entries, whose URL is duplicated:

http://en.m.wikipedia.orghttp://en.m.wikipedia.org/
http://en.m.wikipedia.orghttp://en.m.wikipedia.org/favicon.ico
etc

The first entry is just the host with protocol, the second is identical protocol+host with the rest of the URL.

Stats in the last few days (I also ran it for the august of last year - similar numbers, so this is not a recent bug)

yurik@stat1002:/a/squid/archive/zero$ zgrep -c orghttp zero.tsv.log-201408*
zero.tsv.log-20140801.gz:183
zero.tsv.log-20140802.gz:215
zero.tsv.log-20140803.gz:167
zero.tsv.log-20140804.gz:170
zero.tsv.log-20140805.gz:144
zero.tsv.log-20140806.gz:235
zero.tsv.log-20140807.gz:272
zero.tsv.log-20140808.gz:259
zero.tsv.log-20140809.gz:169
zero.tsv.log-20140810.gz:173

Host distribution:
$ sort tmp | uniq --count

 1 cp1046.eqiad.wmnet
 4 cp1047.eqiad.wmnet
 8 cp1059.eqiad.wmnet
 6 cp1060.eqiad.wmnet
51 cp3011.esams.wikimedia.org
58 cp3012.esams.wikimedia.org
62 cp3013.esams.wmnet
57 cp3014.esams.wmnet
 4 cp4011.ulsfo.wmnet
 1 cp4012.ulsfo.wmnet
 4 cp4019.ulsfo.wmnet
 3 cp4020.ulsfo.wmnet

Version: unspecified
Severity: normal

Details

Reference
bz69371

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:36 AM
bzimport set Reference to bz69371.
bzimport added a subscriber: Unknown Object (MLST).

Looking into these. Will look to see if there is anything common to all urls with this string 'orghttp'. We should probably entertain the possibility of thee being actual real requests.

This problem is also present on sampled logs, mobile and api since records are kept on stats1002. That means since about May 2013. I imagine there must have been another bug fixed on this regard a while back as it seems real visible. Will try to do some searches on bugzilla.

If you compare the number of occurrences to the number of lines in the file, this seems to happen less than 0.025% of the time. This seems insignificant. Could this just be corrupt data to ignore?

kleduc@stat1002:/a/squid/archive/zero$ find . -name "zero.tsv.log-201408*" |xargs wc -l

 750507 ./zero.tsv.log-20140801.gz
 858969 ./zero.tsv.log-20140802.gz
 911424 ./zero.tsv.log-20140803.gz
 924049 ./zero.tsv.log-20140804.gz
 990494 ./zero.tsv.log-20140805.gz
1099479 ./zero.tsv.log-20140806.gz
1080449 ./zero.tsv.log-20140807.gz
1103330 ./zero.tsv.log-20140808.gz
1214806 ./zero.tsv.log-20140809.gz
1247318 ./zero.tsv.log-20140810.gz
1266218 ./zero.tsv.log-20140811.gz

Sure, we could ignore it, but i am worried if this could point to some bigger issue. Parsing all zero.tsv* files i noticed a large number of other strange items - highly broken URLs that still return miss/200 result, some of which are images. Please take a look at my home dir (i think its public) at

yurik@stat1002:~/zero-sms/scripts$ grep '/200' *.txt

Various IP addresses and random hosts (e.g. 0.facebook.com) keep appearing, and get resolved just fine by the backend, even though they clearly shouldn't.

My code does regex substitute on all URLs now: ^(https?://.+)\1 -> \1
But not counting those, there are 35000 other host parsing errors in the logs, with some big spikes (will attach graph of them).

Created attachment 16166
Graph of the bad host counts per day as detected in zero logs

Attached:

bad-hosts-in-zero-logs.PNG (572×1 px, 78 KB)

Parsing all zero.tsv* files i noticed a large number
of other strange items - highly broken URLs that still return miss/200 result.

I think we are missing issues here. I will address this comment but let's try to focus on 'orghttp' problem first.

The fact that "highly broken" urls return miss/200 doesn't seem that strange if we are caching negative image lookups on the edge-caches and returning 200 for those (negative lookup meaning an image request we do not have in origin). So, without knowing more about our caching infrastructure, I would not flag this as a problem. Now, let's get back to this once we have taken a deeper look at "orghttp" issue.

Various IP addresses and random hosts (e.g. 0.facebook.com) keep appearing,
and get resolved just fine by the backend, even though they clearly shouldn't.

There are requests with "0.facebook.com" on the User_Agent (facebook-bot) and requests with "facebook" on the referrer. I see nothing obviously wrong with either.

After there are (very few) requests like : http://zero.wikipedia.org/uc6.ucweb.com:80/?0.facebook.com/?zout=1&ucid=6883926810-f3e4e4a2 that seem to come from the same kind of phone.

I think we are missing issues here -> "I think we are MIXing issues"

Please take a look at our logging format on varninsh:

https://git.wikimedia.org/blob/operations%2Fpuppet/production/modules%2Fvarnish%2Ffiles%2Fvarnishncsa.default#L9

The interesting part: http://%{Host}i%U%q

%{Host}i : logs host header
%U request The request URL without any query string. Defaults to '-' if not known.
%q The query string, if no query string exists, an empty string.

Looking at patterns on urls with the "orghttp" string in zero logs I see little in common, the big majority of requests with faulty logging are for "text/html" content type, under http and they are cache misses (200/miss).

Summing up, somehow we are generating requests in the client like: "http://es.m.wikipedia.org/http://es.m.wikipedia.org/wiki/Wikipedia:Portada"

which, according to http are valid requests, those requests depending on the target project are responded in different manners, 404, 304, miss/200

Overall, we do no think this is a problem when it comes to login requests. It might be a problem on the client as those requests should not be generated but with the format we are login if requests like "http://es.m.wikipedia.org/http://es.m.wikipedia.org/wiki/Wikipedia:Portada"
come along they will be logged like: http://es.m.wikipedia.orghttp://es.m.wikipedia.org/wiki/Wikipedia:Portada

Need to look at IP ranges as I looked at languages and wikipedia's for geographic commonality and that might not be the best.

I take my prior comment back, this looks like a proxy issue not client issue. Data
below for requests that match "orghttp" in the month of August thus far in zero, mobile and sampled.

  1. zero

Most requests come from 41.202.219.* (41.202.219.66 for example) in Cameroon
<snipet>

 16 101.222.232.111
 34 101.222.224.76
 43 41.202.219.77
 57 41.202.219.76
 59 41.202.219.65
 66 41.202.219.79
 80 41.202.219.68
 84 41.202.219.71
 88 41.202.219.78
 93 41.202.219.75
 96 41.202.219.69
100 41.202.219.72
101 41.202.219.74
102 41.202.219.64
132 41.202.219.67
132 41.202.219.73
134 41.202.219.70
170 41.202.219.66

</snipet>

2.mobile
<snipet>

  12 67.205.74.205
  14 122.96.59.102
  20 41.202.219.74
  83 202.67.40.50
 104 202.67.41.51
4368 203.223.95.248 -> Bangladesh

</snipet>

3.sampled
<snipet>

  75 208.43.239.42
  89 194.116.187.25
 176 195.74.38.153
 264 180.179.40.100
 282 172.15.215.113
 784 202.67.40.50
 804 203.223.95.248
1036 202.67.41.51
3674 91.121.6.4

</snipet>

IPs with most issues in zero are not the most used IPs so, again, this points to a proxy issue.

Nuria, are you saying one of our proxies is causing this? Or is it some common proxy software that many carriers are using that sets incorrect HOST value when forwarding request?

Well, neither. By looking at the data looks to be caused by a proxy but I do not think is "common" software as the percentage data affected seems pretty small.

You can reproduce one of this requests by using telnet as follows. Please note "host" and "url below

[prompt]$ telnet 91.198.174.204 80
Trying 91.198.174.204...
Connected to mobile-lb.esams.wikimedia.org.
Escape character is '^]'.
GET http://fr.m.wikipedia.org/favicon.ico HTTP/1.1
Host: fr.m.wikipedia.org
Connection: close

HTTP/1.1 200 OK
Server: Apache
X-Content-Type-Options: nosniff
Expires: Fri, 15 Aug 2014 11:12:17 +0000
Content-Type: image/x-icon
X-Varnish: 2057974935 2057973396, 2628462505, 3056161189 3056101957
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Content-Length: 15086
Accept-Ranges: bytes
Date: Thu, 14 Aug 2014 11:13:43 GMT
Age: 86
Connection: close
X-Cache: cp1047 hit (1), cp3014 miss (0), cp3011 frontend hit (8)
Cache-Control: s-maxage=3600, max-age=3600
Set-Cookie: GeoIP=ES:Madrid:40.4086:-3.6922:v4; Path=/; Domain=.wikipedia.org