Opening a Wikipedia article results in attempt to download gzipped version
Closed, Resolved (Public)

Description

Author: pramodkrai

Description:
(1) Go to http://www.wikipedia.org/
(2) Type 'vi' in the search field
(3) Press 'ENTER' or click the 'Search' button

Current Result:

  • A file download is prompted.

Expected Result:

  • A page with information on 'vi' should be displayed.

Version: unspecified
Severity: major
OS: Windows XP
Platform: PC
URL: http://www.wikipedia.org/

Details

Reference
bz7098

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:27 PM
bzimport set Reference to bz7098.
bzimport added a subscriber: Unknown Object (MLST).

Please provide this file for analysis.

pramodkrai wrote:

File downloaded for 'vi' search in wikipedia

I have now attached the file that was prompted for downloading.

This file looks like gzipped HTML.

It's possible we've got another problem with double-gzipping, or some inconsistency in the
squids where incorrect headers get cached.

Can you confirm if possible:

  • Version of IE and operating system?
  • Does your ISP use an HTTP proxy server?
  • Does this happen only when logged out, or only when logged in, or some combination?

(Bah, first time I got an edit conflict on Bugzilla; here's my two cents anyway.)

The file you provided is a gzipped version of the correct HTML page for the
article "vi". Gzip compression is used as a "transfer encoding" for all
communication, to save bandwidth - normally, the browser should just uncompress
it and show it, without you noticing. For some reason, it apparently does not
receive or understand the Content-Encoding header that is used to indicate this.
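
To illustrate what the browser is expected to do here, a minimal Python sketch follows. The function name decode_body and the sample page are made up for illustration; this is not how any particular browser is implemented.

```python
import gzip


def decode_body(headers: dict[str, str], body: bytes) -> bytes:
    """Return the entity body with any gzip content coding removed."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    coding = normalised.get("content-encoding", "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # No recognised coding: the bytes pass through unchanged, which is what a
    # client ends up with when the header is missing or not understood.
    return body


page = b"<html><body>vi</body></html>"  # stand-in for the real article HTML
compressed = gzip.compress(page)

print(decode_body({"Content-Encoding": "gzip"}, compressed) == page)  # True
print(decode_body({}, compressed)[:2] == b"\x1f\x8b")  # True: still raw gzip data
```

In the second call the header is absent, so the client is left holding raw gzip bytes, which matches the file that was offered for download.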

This seems like a browser issue - what browser / operating system are you using?
Does it happen when you search for something else on http://www.wikipedia.org/?
Does it also happen when you search for vi directly in the Wikipedia? Does it
happen if you visit the vi page directly?

Side note: when trying with Firefox, I get the page fine, and it gets
transferred as gzip. With wget, I also get the page correctly, but it appears to be
served uncompressed. Why?...

The gzip-or-plain selection is based on (I believe) the User-Agent and/or the Accept header.
PHP has some magic for this in ob_gzhandler, which may or may not be documented:

http://dk2.php.net/ob_gzhandler

It's possible that this isn't 100% jibing with the Vary header we have, such that some IE's get
served the wrong thing. Or, there might be some bad proxies which end up storing the wrong
thing. Or, squid might be breaking. Mark's running some 2.6 squids experimentally, so there
might be 'new' adventures.

pramodkrai wrote:

I am using the following:

Browser:
Microsoft Internet Explorer
Version: 6.0.2900.2180.xpsp_sp2_gdr.050301-1519
Cipher Strength: 128-bit

OS:
Microsoft XP

And, my ISP does use an HTTP proxy server.

And, when trying with Mozilla Firefox, no issues.

Please try with Internet Explorer, bypassing the proxy (use a different ISP if
necessary).

For reference: what ISP is it?

pramodkrai wrote:

Please view Issue ID 7099 as well, seems to be related to this issue?

*** Bug 7100 has been marked as a duplicate of this bug. ***
*** Bug 7105 has been marked as a duplicate of this bug. ***

ayg wrote:

*** Bug 7107 has been marked as a duplicate of this bug. ***

ayg wrote:

To everyone who's come here because their bug has been marked as a duplicate of
this, please provide:

  1. The browser you're using (exact version, please).
  2. Whether it works using another browser.
  3. What ISP you're using, and (if you know) whether you're behind a proxy server.
  4. Whether this happens only when you're logged in, only when you're logged out,
     or some combination.
  5. Whether it happens every time you try loading the page, or only sometimes.

robchur wrote:

Need to check if anything's changed with respect to OutputPage, or if any
changes to it have only recently been synchronised; finding out if anything's
changed on the Squids configuration-wise would be another sensible move (we had
a recent upgrade - related, or not?)...more and more of these issues from
various people indicates something's buggered up big-time.

ayg wrote:

*** Bug 7111 has been marked as a duplicate of this bug. ***

pramodkrai wrote:

Hi all, in my case, the issue is nonexistent today, i.e. no more prompts for file
download. Did anybody fix it? Or did it get fixed due to some environment change on
my end? [However, Issue 7099 still exists on my IE.]

And, by the way, the remaining answers to Simetrical's questions:

(4) This used to happen when I was NOT logged in. Didn't test while logged in. [I
created a wiki account just today :-)]
(5) It used to happen every time I tried loading the page (I guess I tried only about
10 times though)

hno wrote:

I haven't looked into how the Wikimedia gzip module works in detail, but Squid
now supports content negotiation using ETag and If-None-Match to find which
cached entity variant (identity vs gzip encoding, Swedish vs English etc.) to
send to the client.

Which means that if the server is not sending correct ETags AND responds to
If-None-Match, or if the Vary header is not correctly filled out, then clients
may be given an incorrect entity variant.

Apache mod_gzip is an example having this problem where both the identity
encoded and gzip encoded variants carry the same ETag and the server responds to
If-None-Match on this ETag. There is a Squid directive to try to work around
such broken servers (broken_vary_encoding or something like that). Quite likely
it needs additional work as the matrix of broken servers out there is figured out.

Regards
Henrik
Squid-cache.org

hno wrote:

Thinking about it more: most likely If-None-Match isn't needed to trigger this. Just
sending incorrect ETags on Vary responses is most likely sufficient to get the cache
confused.
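
The failure mode described above can be sketched with a toy model in Python. This is only an illustration under the assumption that a cache indexes stored variants of a URL by their ETag; it is not Squid's actual implementation, and the class names, URL and ETag value are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    etag: str
    content_encoding: str  # "gzip" or "identity"
    body: bytes


class ToyVariantCache:
    """Toy cache that tells the variants of one URL apart by ETag only."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Variant] = {}

    def store(self, url: str, variant: Variant) -> None:
        # A later variant with the same ETag silently replaces the earlier one.
        self._store[(url, variant.etag)] = variant

    def lookup(self, url: str, etag: str) -> Optional[Variant]:
        return self._store.get((url, etag))


cache = ToyVariantCache()
# Both variants carry the same (hypothetical) ETag, as a broken origin would send.
cache.store("/wiki/Vi", Variant('W/"abc"', "identity", b"<html>...</html>"))
cache.store("/wiki/Vi", Variant('W/"abc"', "gzip", b"\x1f\x8b..."))

# A client that wanted the identity variant is handed the gzip one instead.
print(cache.lookup("/wiki/Vi", 'W/"abc"').content_encoding)  # gzip
```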

pramodkrai wrote:

Latest update from my side:

(a) I learnt just now that my LAN/ISP does use a Squid proxy server.
(b) A friend of mine, who is on the same LAN, is currently getting the issue but
I am NOT. (So what could explain this discrepancy, if we rule out the possibility
that one of us is bypassing the proxy server?)

ayg wrote:

*** Bug 7118 has been marked as a duplicate of this bug. ***

Indeed, a quick test with and without Accept-Encoding seems to indicate that
ETag is identical for both gzipped and cleartext responses.
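
A check along these lines can be sketched in Python. The function names are made up, the header values in the example are hypothetical, and fetch_headers is left uncalled so that nothing here depends on the live servers. Note that Python's http.client sends "Accept-Encoding: identity" when no value is supplied, so the "without" case is expressed as an explicit identity request.

```python
import urllib.request


def fetch_headers(url: str, accept_encoding: str) -> dict[str, str]:
    """Fetch a URL and return its response headers with lower-cased names."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": accept_encoding})
    with urllib.request.urlopen(request, timeout=10) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def same_etag_for_different_codings(plain: dict[str, str], gzipped: dict[str, str]) -> bool:
    """True when two differently encoded responses carry the same ETag."""
    coding_plain = plain.get("content-encoding", "identity")
    coding_gzip = gzipped.get("content-encoding", "identity")
    etag_plain, etag_gzip = plain.get("etag"), gzipped.get("etag")
    return coding_plain != coding_gzip and etag_plain is not None and etag_plain == etag_gzip


# Hypothetical header values, for illustration only.
plain = {"etag": 'W/"example"', "content-type": "text/html"}
gzipped = {"etag": 'W/"example"', "content-encoding": "gzip"}
print(same_etag_for_different_codings(plain, gzipped))  # True
```

Against a real server one would call fetch_headers(url, "identity") and fetch_headers(url, "gzip") and pass both results to the comparison.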

I have disabled sending of the ETag header, as the standard is vague about
whether MediaWiki or Squid is wrong w.r.t. W/ headers.

bryan wrote:

I was seeing this error yesterday on two different systems. It has cleared up today.

I did run some packet captures (which I didn't save, sorry). When I was seeing the issue, I would
receive the HTTP 200 OK packet, which included these two header tags:

Content-Encoding: gzip\r\n
Content-encoded entity body (gzip): 891 bytes -> 1775 bytes\r\n

... the gzip would follow, and my browser would ask if I wanted to download. Another machine
didn't have the problem (in the same time frame). But it would get an "HTTP 304 Not Modified"
header without the gzip tags.

As of today, all three machines, with no changes made to them, are receiving the HTTP 304 headers
and not wanting to download. The pages are displaying normally. So, disabling the ETag header
seems to have done the trick, and thanks!

hno wrote:

The standard is very clear if you ask me.

Content-Encoding is an entity header.

The gzipped body is an entity body.

Each unique entity (entity headers + entity body) must carry a unique ETag per
URL. (Two entities on different URLs may have the same ETag, but no two
different entities of the same URL can, now or in the future.)

It would be good if we could detect whether PHP's ob_gzhandler is actually
in gzipping-mode so that we can send a different ETag; otherwise do we
really need them at all? Could we just take this out in the mainline code?
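
As a rough sketch of what "a different ETag" per variant could look like, here is one possible scheme in Python. The function variant_etag and the "-gzip" suffix are assumptions for illustration; this is not MediaWiki code, and it presupposes that the caller already knows which content coding is being applied.

```python
def variant_etag(base_etag: str, content_encoding: str) -> str:
    """Derive a per-variant ETag by tagging the base value with the coding."""
    weak = base_etag.startswith("W/")
    opaque = base_etag[2:] if weak else base_etag
    opaque = opaque.strip('"')
    # Leave the identity variant untouched; mark every other coding.
    if content_encoding != "identity":
        opaque = f"{opaque}-{content_encoding}"
    prefix = "W/" if weak else ""
    return f'{prefix}"{opaque}"'


print(variant_etag('W/"abc"', "identity"))  # W/"abc"
print(variant_etag('W/"abc"', "gzip"))      # W/"abc-gzip"
```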

Since it's now off by default I'm going to go ahead and mark this FIXED,
though improved fixes are hypothetically possible.

robchur wrote:

*** Bug 7082 has been marked as a duplicate of this bug. ***

There is also a bug 16230 that might be related to this.

Reopening to dupe lots of stuff to it, as the issue has persisted even after the etag change.

*** Bug 16230 has been marked as a duplicate of this bug. ***
*** Bug 15457 has been marked as a duplicate of this bug. ***
*** Bug 15149 has been marked as a duplicate of this bug. ***

wikipara wrote:

There have been noticeably many of these reports on OTRS lately, so I went through the tech queue from 2008 until now and the other English queues with some keywords. Here:

OTRS#2009050710039921, got a file download dialog on article view
OTRS#2009050710002156, on article view IE shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2009050610056565, similar report with IE, "seen it so many times", went to find the information outside Wikipedia
OTRS#2009050410016432, on article view gets a popup to handle a "ZIP file" (gzip?), suspects phishing
OTRS#2009042810003331, got a file download dialog box on article view, suspects virus from Wikipedia
OTRS#2009041610017974, got a file download dialog box on article view
OTRS#2009041410062099, IE7 doesn't recognise the file type on article view, and upon saving the content turns out to be binary
OTRS#2009041010028241, on article view IE8 shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2009021410009418, ran Firefox in a configuration with altered Accept-Encoding, but got gzipped content anyway. Also reproduced with default wget.
OTRS#2009020210021659, on article view IE shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2009012910023699, IE showed a "File Download Security Warning" and didn't load the page
OTRS#2008112110022973, on article view IE7 shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2008111710013115, on article view IE shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2008100110060088, got a file download dialog on article view, thinks it might be an executable and is worried
OTRS#2008091510022622, sometimes on article view gets an archive file download dialog on Mac IE5
OTRS#2008091210024457, got a file download dialog with IE6 on Windows XP, saved file was binary
OTRS#2008082610024727, wikisource article view pops up a file download dialog
OTRS#2008081110037232, on article view IE shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2008070910020541, got a file download box on article view with IE6
OTRS#2008070810003562, got a "zipped file" that on inspection turned out to be the article gzipped
OTRS#2008052810008247, on article view IE6 shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped), suspect something malicious
OTRS#2008052110013244, got a file download dialog for an unknown file type on article view, size matches mentioned article gzipped
OTRS#2008050610017706, got a file download dialog on article view
OTRS#2008042310020921, on article view IE shows a File Download Security Warning for a file of Unknown File Type (with size matching the mentioned article gzipped)
OTRS#2008021910010382, got a file download dialog on article view

Maybe related:
OTRS#2008122510000693, "pretty sure it's a virus download page masked as antivirus software", could be a mirror site issue
OTRS#2008120810021916, "article cannot be opened"
OTRS#2008052610004898, "please stop sending popups on my computer"

How can people help sort this out?

wikipara wrote:

On the toolserver there's a tool that's accessed about 70 times per minute, and every time it fetches a page from the Amsterdam Squids without an Accept-Encoding header. When the gzipped page is stuck in the cache, it gets reported quickly as "gibberish content" and someone purges the cache of the page. Here are some mentions I found:

20090503 http://commons.wikimedia.org/w/index.php?title=User_talk%3AMagnus_Manske&diff=20993744&oldid=20861308
20090416 http://lists.wikimedia.org/pipermail/toolserver-l/2009-April/002034.html
20090106 http://toolserver.org/~bryan/TsLogBot/wikimedia-toolserver_2009-01-06.txt
20090102 http://jira.toolserver.org/browse/MAGNUS-103
20081006 http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Geographical_coordinates/Archive_22#GeoHack_biffed.3F
20080606 http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_40#GeoHack

*** Bug 19356 has been marked as a duplicate of this bug. ***

davidbalbert wrote:

Just adding my voice to the chorus. Here's the HTTP headers from a request and response where I can duplicate this problem:

GET /wiki/Benzatropine HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Accept-Encoding: identity;q=1.0, gzip;q=0, *;q=0
Host: en.wikipedia.org

HTTP/1.0 200 OK
Date: Tue, 30 Jun 2009 14:04:55 GMT
Server: Apache
X-Powered-By: PHP/5.2.4-2ubuntu5wm1
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
X-Vary-Options: Accept-Encoding;list-contains=gzip,Cookie;string-contains=enwikiToken;string-contains=enwikiLoggedOut;string-contains=enwiki_session;string-contains=centralauth_Token;string-contains=centralauth_Session;string-contains=centralauth_LoggedOut
Last-Modified: Tue, 30 Jun 2009 13:54:32 GMT
Content-Encoding: gzip
Content-Length: 18123
Content-Type: text/html; charset=utf-8
Age: 86264
X-Cache: HIT from sq25.wikimedia.org
X-Cache-Lookup: HIT from sq25.wikimedia.org:3128
X-Cache: MISS from sq36.wikimedia.org
X-Cache-Lookup: MISS from sq36.wikimedia.org:80
Via: 1.1 sq25.wikimedia.org:3128 (squid/2.7.STABLE6), 1.0 sq36.wikimedia.org:80 (squid/2.7.STABLE6)
Connection: close

A quick note, while the client's UA says Firefox 2, it's actually Apache's HttpClient java library (org.apache.commons.httpclient.HttpClient).

ezyang wrote:

David Albert, that doesn't tell us much because MediaWiki normally gzips content before sending it to the user, and the download results when "Content-Type: text/html; charset=utf-8" isn't respected. Do you know what headers came before this request? (although, I don't see a Keep-Alive, so it can't be the garbage gzip data from previous response problem).

davidbalbert wrote:

There were no headers before this request. Perhaps I'm not looking at the same problem, although on an initial read-through it looked very similar. The problem here is that the client is specifically requesting non-gzipped data:

Accept-Encoding: identity;q=1.0, gzip;q=0, *;q=0

and receiving gzipped data anyway.
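
For reference, here is a small Python sketch of how that header value reads under the HTTP/1.1 q-value rules as I understand them: q=0 means "not acceptable", "*" covers codings not listed explicitly, and identity stays acceptable unless it is excluded. The function names are made up, and treating a malformed q-value as unacceptable is my own choice.

```python
def parse_accept_encoding(value: str) -> dict[str, float]:
    """Parse an Accept-Encoding value into {coding: qvalue}."""
    weights: dict[str, float] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # a coding without an explicit q-value defaults to 1
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw.strip())
                except ValueError:
                    q = 0.0  # assumption: malformed q counts as unacceptable
        weights[coding.strip().lower()] = q
    return weights


def accepts(value: str, coding: str) -> bool:
    """Decide whether a content coding is acceptable for this header value."""
    weights = parse_accept_encoding(value)
    if coding in weights:
        return weights[coding] > 0
    if "*" in weights:
        return weights["*"] > 0
    # Codings not mentioned are unacceptable, except the default identity.
    return coding == "identity"


header = "identity;q=1.0, gzip;q=0, *;q=0"
print(accepts(header, "gzip"))      # False
print(accepts(header, "identity"))  # True
```

By this reading, a gzip-encoded response is not an acceptable answer to the request above.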

ezyang wrote:

This sounds like it might be an unrelated Squid cache bug. Why don't you file a separate bug?

What kind of client is that?

Of course, these headers are kind of standard, but usually clients would just omit 'gzip' entirely if they don't want gzip...

davidbalbert wrote:

Edward, I created Bug 19463 for this issue.

Domas, this is a custom client. Originally the Accept-Encoding string was simply identity. I added gzip;q=0 to see if being more specific helped solve the problem, but it did not.

rividh wrote:

Just got another one:

http://en.wikipedia.org/wiki/Taximeter

I wonder if it's significant that the link was just posted on Slashdot, so has probably just had a lot of traffic.

rividh wrote:

No idea if it's related, but I just got this error:

ERROR
The requested URL could not be retrieved

While trying to retrieve the URL: http://upload.wikimedia.org/wikipedia/en/2/2a/Tin_lamp_1930s.jpg_.jpg

The following error was encountered:

  • Unable to forward this request at this time.

This request could not be forwarded to the origin server or to any parent caches. The most likely cause for this error is that:

  • The cache administrator does not allow this cache to make direct connections to origin servers, and
  • All configured parent caches are currently unreachable.

Your cache administrator is nobody.

Generated Wed, 01 Jul 2009 17:51:12 GMT by sq13.wikimedia.org (squid/2.7.STABLE6)

Started on http://en.wikipedia.org/wiki/Tinsmith
clicked pic of lamp
clicked pic again to get highres image
got above error.

Rinse, repeat, same error.

*** Bug 19463 has been marked as a duplicate of this bug. ***

hno wrote:

This does not look like normal Squid behaviour.

Are there any Vary-related patches applied to the Wikimedia Squids?

I have a faint memory of some patches related to optimizing Accept-Encoding to avoid having to go to the backend on each new Accept-Encoding variant...

The "Unable to forward" error is completely unrelated to the Accept-Encoding issue.

adrian wrote:

Talk to Tim Starling about this stuff. Make sure the 2.7 Squid is patched with the Wikimedia patchsets or things won't work as well as you'd expect.

Bryan.TongMinh wrote:

Another occurrence: http://lists.wikimedia.org/pipermail/mediawiki-api/2010-November/002032.html

User does not specify Accept-Encoding: gzip, but nevertheless gets a gzipped response.

I cannot reproduce this now; I think ops is monitoring this.