
Requests with UTF-8 in the URL return an outdated page revision
Closed, Duplicate, Public

Description

Author: wikimedia-bugreports

Description:
pcap file showing the problem

I have noticed that if I request the URL http://de.wikinews.org/wiki/Nobelpreis_für_Physik_für_„die_Meister_des_Lichts“ using the text-mode browser Links, I get an old (outdated) revision of the page.

I have tracked this issue down and found that it is caused by Links sending the special characters in the URL unencoded, directly as 8-bit UTF-8, and not in %xy encoding. If I change the URL to use %xy encoding (http://de.wikinews.org/wiki/Nobelpreis_f%C3%BCr_Physik_f%C3%BCr_%E2%80%9Edie_Meister_des_Lichts%E2%80%9C), I get the current revision.
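To illustrate how the two spellings of the URL relate, here is a minimal sketch using Python's standard `urllib.parse.quote`. The variable names are illustrative only; the title string is the one from this report.

```python
from urllib.parse import quote

# Page title as sent raw (unencoded) by a client such as Links.
raw_title = "Nobelpreis_für_Physik_für_„die_Meister_des_Lichts“"

# quote() encodes the string as UTF-8 and writes every byte outside the
# unreserved set (letters, digits, "_", ".", "-", "~") as %XY.
encoded_title = quote(raw_title, safe="")

print(encoded_title)
# Nobelpreis_f%C3%BCr_Physik_f%C3%BCr_%E2%80%9Edie_Meister_des_Lichts%E2%80%9C
```

Both strings name the same page; they differ only in how the non-ASCII bytes are written on the wire.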

However, it seems that MediaWiki can actually handle requests with UTF-8 in the URL, but for some strange reason it returns an old page revision when requested that way.

I will attach a pcap trace which shows first a request using Links and then a request using Lynx (Lynx does the %xy encoding). You will notice the different page revisions returned.


Version: unspecified
Severity: normal

Attached:

Details

Reference
bz21027

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:49 PM
bzimport set Reference to bz21027.
bzimport added a subscriber: Unknown Object (MLST).

Looks like a problem with the Squid cache not being purged for that encoding.

Mark, do we know whether Squid normalizes percent-encoded chars vs raw chars in URLs when determining canonical URLs for caching?

MediaWiki redirects you to the canonical URL for not-quite-canonical page view URLs in order to ensure consistent caching, but I have the vaguest recollection that our detection is post-percent decoding so we're not necessarily doing that right already.

If Squid is caching them separately, then we might need to fix that up in MediaWiki to be more aggressive about the redirecting.

wikimedia-bugreports wrote:

Hmm, I've just noticed that Bugzilla seems to have a bug as well, as can be seen in my bug report. The URL does not get linked correctly; the trailing “ is missing...

I've just had a look at the code, and it seems that Squid does not canonicalize URLs w.r.t. percent-decoding. There is a function url_decode_hex() in url.c which supports this, but it's only used for Gopher (yay ;). I strongly suspect that it's caching them separately, so indeed MediaWiki may need to be adapted for that.
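The suspected failure mode can be sketched with a toy cache in Python. This is a hypothetical model, not Squid's actual implementation: `ToyCache`, `backend`, and the example path are assumptions made up for illustration. The cache is keyed on the exact request path, while the backend decodes the path before looking up the page.

```python
from urllib.parse import unquote

class ToyCache:
    """Hypothetical cache keyed on the exact request path, without normalization."""

    def __init__(self):
        self.entries = {}

    def fetch(self, path, backend):
        # A miss asks the backend, which decodes the path itself.
        if path not in self.entries:
            self.entries[path] = backend(unquote(path))
        return self.entries[path]

    def purge(self, path):
        self.entries.pop(path, None)

# Stand-in for the wiki: current revision number per decoded path.
revisions = {"/wiki/Bär": 1}

def backend(decoded_path):
    return f"{decoded_path} rev {revisions[decoded_path]}"

cache = ToyCache()
raw, encoded = "/wiki/Bär", "/wiki/B%C3%A4r"

# Both spellings are requested once, creating two separate cache entries.
cache.fetch(raw, backend)
cache.fetch(encoded, backend)

# The page is edited and only the percent-encoded URL is purged.
revisions["/wiki/Bär"] = 2
cache.purge(encoded)

stale = cache.fetch(raw, backend)      # still served from the old entry
fresh = cache.fetch(encoded, backend)  # refetched after the purge
print(stale)
print(fresh)
```

Under these assumptions the raw spelling keeps returning revision 1 while the encoded spelling returns revision 2, which matches the behaviour described in the report.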

wikimedia-bugreports wrote:

I'm not sure how redirecting could fix this, unless you want to redirect all URLs without percent-encoding to URLs with percent-encoding, which seems ugly? Technically both URLs are exactly the same, so in my opinion this should be fixed in Squid.

Krinkle set Security to None.
Krinkle removed a subscriber: wikibugs-l-list.

The canonical link issue is a subtask of T93550. However I think we may be able to solve this without redirects or canonical urls. For structurally different urls that are not predictable, a redirect is probably the appropriate solution to avoid stale cache entries.

However, different ways of encoding the same URL may be easier to deal with, especially since there can be infinitely many variations. E.g.

https://en.wikipedia.org/wiki/%50%65%61%72
is the same as:
https://en.wikipedia.org/wiki/Pear

Modern browsers sometimes even take the liberty to decode and normalise some of these client-side before making the request. It may be possible to normalise this in Squid and respond with the same cache object. Beware that some characters (like question mark and slash) have a different meaning depending on whether they are encoded or not. So those may be trickier.
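One possible shape for such a normalisation, as a rough Python sketch: decode escapes only for characters in the RFC 3986 unreserved set, percent-encode raw non-ASCII characters as UTF-8, and leave escapes such as %2F and %3F encoded so their meaning does not change. The function name `normalize_path` is hypothetical, and this is not the logic of any particular cache.

```python
import re
from urllib.parse import quote

# RFC 3986 "unreserved" characters: safe to write either way.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def _fix_escape(match):
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char                       # %50 -> P
    return "%" + match.group(1).upper()   # %2f -> %2F, stays encoded

def normalize_path(path):
    # Step 1: percent-encode raw non-ASCII characters as UTF-8 bytes.
    path = "".join(c if ord(c) < 128 else quote(c, safe="") for c in path)
    # Step 2: decode escapes of unreserved characters, uppercase the rest.
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, path)

print(normalize_path("/wiki/%50%65%61%72"))  # /wiki/Pear
print(normalize_path("/wiki/Bär"))           # /wiki/B%C3%A4r
print(normalize_path("/wiki/A%2fB"))         # /wiki/A%2FB
```

With this rule, `/wiki/Bär` and `/wiki/B%c3%a4r` map to the same key, while `/wiki/A%2FB` stays distinct from `/wiki/A/B`.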

Krinkle assigned this task to tstarling.

Since 2013, we have VCL code in place that normalises these characters in Wikimedia's caching infrastructure.

This means these URLs are considered identical, and visits to either URL serve the exact same response.

For example:

sh
curl -I 'https://de.wikinews.org/wiki/Nobelpreis_für_Physik_für_„die_Meister_des_Lichts“'
curl -I 'https://de.wikinews.org/wiki/Nobelpreis_f%C3%BCr_Physik_f%C3%BCr_%E2%80%9Edie_Meister_des_Lichts%E2%80%9C'
Run (1)
HTTP/2 200 
date: Sat, 04 Apr 2020 22:47:34 GMT
server: mw1264.eqiad.wmnet
last-modified: Sat, 21 Mar 2020 22:47:34 GMT
age: 2
x-cache: cp3052 miss, cp3060 hit/2
x-cache-status: hit-front
server-timing: cache;desc="hit-front"

HTTP/2 200 
date: Sat, 04 Apr 2020 22:47:34 GMT
server: mw1264.eqiad.wmnet
last-modified: Sat, 21 Mar 2020 22:47:34 GMT
age: 2
x-cache: cp3052 miss, cp3060 hit/3
x-cache-status: hit-front
server-timing: cache;desc="hit-front"
Run (2)
HTTP/2 200 
date: Sat, 04 Apr 2020 22:47:34 GMT
server: mw1264.eqiad.wmnet
last-modified: Sat, 21 Mar 2020 22:47:34 GMT
age: 5
x-cache: cp3052 miss, cp3060 hit/4
x-cache-status: hit-front
server-timing: cache;desc="hit-front"

HTTP/2 200 
date: Sat, 04 Apr 2020 22:47:34 GMT
server: mw1264.eqiad.wmnet
last-modified: Sat, 21 Mar 2020 22:47:34 GMT
age: 5
x-cache: cp3052 miss, cp3060 hit/5
x-cache-status: hit-front
server-timing: cache;desc="hit-front"