SSL endpoints log %-encoded URLs as \x-encoded URLs
Closed, Resolved
Public

Description

When requesting %-encoded URLs like

https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4

(note: “https”) we get a log line for

http://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4

(%-encoded) from the cache, but the SSL endpoint additionally adds a
log entry using the URL

https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4

(\x-encoded).
The latter, \x-encoded URL cannot be fetched as logged, and it distorts the logs.
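
For illustration, the two forms differ only in the escape prefix, so a logged \x-encoded URL can be mapped back to its %-encoded form mechanically. The following is a minimal Python sketch of that mapping. The helper name `x_to_percent` is hypothetical and not part of any existing tooling, and the sketch assumes every `\xHH` sequence in the logged URL stands for exactly one escaped byte.

```python
import re

# Matches one \xHH escape as it appears in the SSL endpoint's log line.
_X_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def x_to_percent(url: str) -> str:
    """Rewrite each \\xHH escape as the equivalent %HH escape (hypothetical helper)."""
    return _X_ESCAPE.sub(lambda m: "%" + m.group(1).upper(), url)


logged = r"https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4"
print(x_to_percent(logged))
# prints https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4
```

This only repairs entries after the fact. It does not remove the duplicate entry that the SSL endpoint adds for each https request.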

I'd prefer that we have no \x-encoded URLs in our logs.

Should we:

  • try to fix the SSL endpoints so that they do not log distorted URLs, or
  • stop including the SSL endpoints in the udp2log log stream altogether? (Currently, each https request gets two entries in the log stream: one from the SSL endpoint and one from the responding cache.)


Version: unspecified
Severity: normal

Details

Reference
bz58876

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 2:20 AM
bzimport set Reference to bz58876.
bzimport added a subscriber: Unknown Object (MLST).

Actual request and log entries:

  • request:

christian@spencer 0 21:32:57
cwd: ~/tmp/encoding-test
LC_ALL=C wget https://ru.wikipedia.org/wiki/1092_год
--2013-12-22 21:34:14-- https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4
Resolving ru.wikipedia.org... 91.198.174.192
Connecting to ru.wikipedia.org|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `1092_\320\263\320\276\320\264.2'

[ <=>                                                            ] 80,537       510K/s   in 0.2s

2013-12-22 21:34:15 (510 KB/s) - `1092_\320\263\320\276\320\264' saved [80537]

  • Corresponding log entries from udp2log stream:

amssq57.esams.wikimedia.org 4663480343 2013-12-22T20:34:15 0.596358538 $WIKIMEDIA_IP miss/200 80537 GET http://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4 - text/html; charset=UTF-8 - $MY_IP Wget/ (linux-gnu) - -
ssl3002 454288692 2013-12-22T20:34:15.377 0.950 $MY_IP -/200 81682 GET https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4 NONE/wikimedia - - - Wget/%20(linux-gnu) - -
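
As a rough way to spot affected entries in a stream, the two sample lines above can be checked for `\xHH` escapes in the URL field. This is only a sketch in Python: the function name `has_x_encoded_url` is hypothetical, and it assumes the URL is always the ninth whitespace-separated field, which holds for the two lines above but is not confirmed for the log format in general.

```python
import re

# 0-based position of the URL in the two sample lines (an assumption, see lead-in).
URL_FIELD = 8
X_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{2}")


def has_x_encoded_url(line: str) -> bool:
    """Return True if the URL field of a log line contains a \\xHH escape."""
    fields = line.split()
    return len(fields) > URL_FIELD and bool(X_ESCAPE.search(fields[URL_FIELD]))


cache_line = (
    "amssq57.esams.wikimedia.org 4663480343 2013-12-22T20:34:15 0.596358538 "
    "$WIKIMEDIA_IP miss/200 80537 GET "
    "http://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4 "
    "- text/html; charset=UTF-8 - $MY_IP Wget/ (linux-gnu) - -"
)
ssl_line = (
    r"ssl3002 454288692 2013-12-22T20:34:15.377 0.950 $MY_IP -/200 81682 GET "
    r"https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4 "
    r"NONE/wikimedia - - - Wget/%20(linux-gnu) - -"
)

for line in (cache_line, ssl_line):
    print(line.split()[0], has_x_encoded_url(line))
```

In this sample only the entry from ssl3002 is flagged. The cache entry keeps the %-encoded URL.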

The problem does not show up on the sampled-1000, mobile, and zero streams, but
it is visible on unsampled streams that do not filter by host, for example
the edit stream and the webstatscollector output (and hence stats.grok.se).

The exposure of this problem through webstatscollector seems especially
problematic, as people are starting to add redirects for the non-existent but
seemingly requested \x-encoded URLs. :-/
(See bug 58316)

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/1351

Change 105449 had a related patch set uploaded by QChris:
Log correctly encoded url with parameters for nginx

https://gerrit.wikimedia.org/r/105449

Change 105449 merged by Ottomata:
Log correctly encoded url with parameters for nginx

https://gerrit.wikimedia.org/r/105449