Page MenuHomePhabricator

Clarify zero markers in squid logs for IPs that should not get one
Closed, ResolvedPublic

Description

In the squid logs at /a/squid/archive/zero/zero.tsv.log-* on stat1002, we have zero markers for IPs that should not get one.

Examples

Run

zgrep 198278752 /a/squid/archive/zero/zero.tsv.log-20130731.gz
on stats1002.

The printed request holds a zero marker from Beeline Russia (zero=250-99), but the client IP is from Austria and not listed in Beeline Russia's IPs.

The request does not hold a Forwarded request.

The problem is not limited to Beeline Russia or Austrian IPs, as can be seen by running

zgrep 2215263992 /a/squid/archive/zero/zero.tsv.log-20130731.gz
on stat1002.


Version: unspecified
Severity: normal

Details

Reference
bz53806

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 1:53 AM
bzimport set Reference to bz53806.
bzimport added a subscriber: Unknown Object (MLST).

Prioritization and scheduling of this bug is tracked on Mingle card
https://mingle.corp.wikimedia.org/projects/analytics/cards/1137 (correct link)

From Mark Bergsma:

Hi Diederik,

On Sep 5, 2013, at 12:13 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:

We are seeing some odd tagging of Zero traffic where requests are being tagged as Zero traffic while it does not fall within the cidr range of a mobile carrier.

Example:

cp3011.esams.wikimedia.org 198278752 2013-07-30T06:47:59 0.000113249 178.115.XXX.XXX hit/200 345 GET http://ru.m.wikipedia.org/w/api.php?format=json&search=%D0%A1&action=opensearch&namespace=0&limit=15 - application/json; charset=utf-8 http://ru.m.wikipedia.org/wiki/%d0%a8%d0%b2%d0%b5%d1%86%d0%b8%d1%8f - Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; NOKIA; Lumia 610) de-AT zero=250-99

[snip]

P.S.: As the examples show, the issue is not simply the X-Forwarded-For header, as for example for [1], neither 66.54.XXX.XXX, nor 1.38.XXX.XXX is in the ranges of Aircell India, and the Beeline example above comes without a X-Forwarded-For setting.

[1] Here is a recent example for Aircell India:
cp1046.eqiad.wmnet 4331766631 2013-09-04T06:31:38 0.000135422 66.54.XXX.XXX hit/200 20 GET http://meta.m.wikimedia.org/wiki/Special:RecordImpression?result=hide&reason=empty&country=IN&uselang=en&project=wikipedia&db=enwiki&bucket=0&anonymous=true&device=unknown - image/png http://en.m.wikipedia.org/wiki/Sexy 1.38.XXX.XXX Mozilla/5.0 (Series40; Nokia206/03.58; Profile/MIDP-2.1 Configuration/CLDC-1.1) Gecko/20100401 S40OviBrowser/2.2.0.0.36 en zero=404-01

Thanks for flagging this. I'll investigate it when I get an opportunity.

I've already looked into the XFF header, which might be the culprit still. We (almost) always add an XFF header at the start of a request. However, unfortunately it's not being logged, as the logging program just checks what headers were received from the client, not what we add in VCL...

otto wrote:

Ok, so Adam and Christian and I just went through the VCL logic a bit. I don't think I have a full handle on all the steps varnish takes to tag things, but it doesn't seem like this particular issue is caused by a mistaken XFF problem. vcl_recv calls spoof_clientip (which conditionally replaces the client_ip with a value from XFF) BEFORE it calls tag_carrier. At this point, it doesn't actually matter if the client_ip was not set correctly by spoof_clientip. Whatever client_ip is set to, tag_carrier should use that to conditionally set the X-CS header.

We're checking examples where the client_ip in the varnishncsa logs does belong to the tagged X-CS acl. And we have no idea how that could happen.

Ja, there might be some weird logic happening in spoof_clientip that is not working with XFF properly, but that seems irrelevant to this problem.

How could X-CS not match the client_ip?

Looking further, we do not log the X-Analytics header of the request, but
the X-Analytics header of the response. Could it be that the caching
interferes here?

Doing some quick check for the 404-01 carrier (India) seems to
somewhat support this, as ~30% of hits come misattributed, while only
~1% of the misses do. It a least shows a really heavy bias that hits
are more affected than misses.

Could it be (wild guess) that upon a cache hit, varnish uses the
X-Analytics header of the cached document?

If the caching is not the issue... I am not sure if this is related,
but X-Analytics column in the logs is not always

zero=XXX-XX

We also have log lines where the X-Analytics column

zero=XXX-XX;mf-m=a
zero=XXX-XX;mf-m=b

and in the sampled-1000 (not limited to mobile or zero) stream we also
see some

mf-m=a
mf-m=b

. I could not find this 'mf-m' in the puppet repo. And each of those
'mf-m' log lines is for a 'pass/200' request.

mf-m=a and mf-m=b are set by the MobileFrontEnd and indicate whether the alpha (a) or beta (b) MobileFrontend are used. Remember, X-Analytics is a list of <key,value> pairs of settings.

Yuri has updated the mechanism by which traffic is assigned to Zero, so this may cause the symptom of this bug to go away.

Yuri -- can you let us know when we can check the data again to see if this is true?

Diederik believes Phaidon and Mark are still looking at this bug since it could be quite serious.

(In reply to comment #7)

Yuri -- can you let us know when we can check the data again to see if this
is true?

Yuri: Any estimated timeframe?

Current status (per Deiderik):

  1. Phaidon/Mark debugging varnish
  1. Yuri has the ball on the deploying the netmapper code.

What's the current status on this? I know that various folks have been working on it.

thanks,

-Toby

(In reply to comment #10)

What's the current status on this? I know that various folks have been
working
on it.

thanks,

-Toby

If the netmapper stuff doesn't get rolled into production and fix this issue before I get to it, I'm going to provide the suggested configuration for logging a few extra fields in Varnish logs on a particular node in the cluster instead of all nodes (it sounds like turning this on for all nodes could be a challenge for other log consumer jobs). I'll need help from Andrew on actually getting it up and running for one node.

Coordination of this bug is happening on the Ops list because neither Mark, Brandon nor Faidon are CC'ed on this bug. I sent this update to the Ops list yesterday. As everybody is CC'ed on the Ops thread lets have the discussion there.

I spoke with Brandon and Yuri and a quick update about the attempt to fix incorrect tagging of W0 traffic:

  • we have the new varnish packages (with netmapper rolled inside) on the mobile caches now, but not the other caches (where we don't need it at this time anyways)
  • Brandon is checking whether varnish is acting normal (so far so good it seems)

Next steps (assuming that Varnish will continue to act normal):

  • Brandon will make changes to the VCL config to replace the current hardcoded mobile ACLs with netmapper
  • finish the rollout of the upgraded varnish package, from 3.0.3-rc1-wm14 to -wm16, so that the netmapper software is included on all varnish boxes

Change 86079 had a related patch set uploaded by Faidon Liambotis:
varnish: remove X-CS/Analytics tag from vcl_fetch

https://gerrit.wikimedia.org/r/86079

Change 86079 merged by Faidon Liambotis:
varnish: remove X-CS/Analytics tag from vcl_fetch

https://gerrit.wikimedia.org/r/86079

Since the bug is closed, how did verify the problem actually has been solved?

Reopening, since (although the mistagged log lines went to almost zero) we do
not observe the expected drop in requests per carrier.

Additionally, we also see IPv6 addresses (without X-Forward-For) coming with
zero tags [1]. Such lines should not receive zero markers, as carriers do not
come with IPv6 addresses up to now. Those IPv6 zero lines are just a few per
day, but they might hint at a further issue in the logging infrastructure.

[1] Run for example on stat1002:

zgrep 3779252990 /a/squid/archive/zero/zero.tsv.log-20130928.gz

(In reply to comment #17)

Additionally, we also see IPv6 addresses (without X-Forward-For) coming with
zero tags [...]

Those were just IPv4 in IPv6 addresses.
Netmapper seems to be aware of them, while varnish acls were not.

Thanks Faidon for pointing that out!

Although there are no more mistagged lines since 2013-09-26 22:57,
there was some further discussion outside of bz, as we did not see the
expected change in number of total zero requests.

Neither a change in total zero requests per carrier.

Neither a change in the distribution of User-Agents, Accept-Language,
or X-Forwarded-For distribution per carrier.

This does not match the bug description of the zero tags being wrong.

Adam however came up with a different explanation: Maybe the zero tags
were not wrong, but the client ips were wrong. This would nicely explain
what we saw before and after the fixes. We're tagging the same traffic
as before, but this time with correct IP addresses. And code-wise, it
seems plausible as well, as the switch to netmapper in

1755cfe5d556f20acae6a9c3cb59e9b38e1327e3

removed the hacky spoof_clientip function.

Although the above commit has only been merged on 2013-09-26 23:22
(~25 minutes /after/ the last mistagged line), Adam's explanation
seems nevertheless plausible, as the fixes were tested manually on the
varnishes beforehand.

It seems the combination of Faidon's and Brandon's patches finally
fixed the bug for good, and allow to explain the data that we're seeing.

Thanks Faidon and Brandon!