
[dbzip2] English Wikipedia dump has the wrong size
Closed, Resolved (Public)

Description

Author: yeyiwang

Description:
The size listed on the web is 4.9 GB. The actual downloaded file size is 899 MB.

In WinXP/IE6, the file downloads, but only 899 MB of it; the download progress window states that the target size is 899 MB.

In Vista/IE6, the file cannot be downloaded at all; it complains about an HTTP header error.

In Vista/IE7, the file again downloads only 899 MB, even though the download progress window states that the target size is 4.9 GB.

wget likewise downloads only an 899 MB file.

899 MB is too small for an English Wikipedia article dump; the previous dump (July 2008) was around 3.8 GB.


Version: unspecified
Severity: major
URL: http://download.wikimedia.org/enwiki/20090610/enwiki-20090610-pages-articles.xml.bz2

Details

Reference
bz19242

Event Timeline

matthew.britton wrote:

(In reply to comment #0)

Previous dump (July 2008) was around 3.8GB.

The previous dump was actually from the week before, and there have been 5 since dumps were restarted in May, e.g. http://download.wikimedia.org/enwiki/20090610/

Are all of these affected? I don't have the bandwidth to find out.

yeyiwang wrote:

I obtained a dump successfully about a year ago.

I tried all the dumps currently available at http://download.wikimedia.org/enwiki and they had the same problem.

Note that you may have an old version of wget, which was known to have problems with files over 4 GB.

I have never attempted large files with IE, but recent versions should presumably work on an NTFS filesystem. (If you're downloading to a FAT32 filesystem, as many USB drives ship with by default, it will likely fail, but I think it should fail differently -- reporting an error at 2 GB or 4 GB rather than cropping off at 0.9 GB.)

Another possibility is that you're accessing the internet through a proxy which fails to handle large files properly. That would explain the Content-Length header being passed through (so you get a correct report of the 4.9 GB to come) while the intermediary craps out at the 0.9 GB 32-bit-wrapped limit.
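
As a rough sanity check of that theory (just a sketch, assuming the proxy keeps its byte count in a 32-bit counter; the 5227630350-byte figure is the Content-Length shown in the headers below), wrapping the advertised size modulo 2^32 lands close to the reported cut-off:

# Python 3 sketch: what a 32-bit byte counter makes of the advertised size
total = 5227630350              # Content-Length of the 20090610 dump
wrapped = total % 2**32         # value after wrapping at 32 bits
print(wrapped)                  # 932663054
print(round(wrapped / 2**20))   # ~889 (MiB) -- in the ballpark of the ~899 MB cut-off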

Tomasz, putting this one on your bench; it'd be good to double-check we haven't broken the server or something ;) but afaik it should be serving out fine.

Did a quick verify on OS X 10.5 using wget 1.11.4 and everything is showing up just like it should: 4.9 GB.

/opt/local/bin/wget -S http://download.wikimedia.org/enwiki/20090610/enwiki-20090610-pages-articles.xml.bz2
......
HTTP request sent, awaiting response...

HTTP/1.0 200 OK
Connection: keep-alive
Content-Type: application/octet-stream
Accept-Ranges: bytes
Content-Length: 5227630350
Date: Tue, 23 Jun 2009 02:24:22 GMT
Server: lighttpd/1.4.19

Length: 5227630350 (4.9G) [application/octet-stream]

Looking at http://tinyurl.com/ozafl2 shows the same correct Content-Length header being returned when the user agent is IE.

It also correctly downloads past 899 MB from my personal server, which is not running in the Wikimedia cluster.

I'll try this with IE after re-installing Windows in the next day or so, just to make sure it's not an issue with the browser, but otherwise I'm really suspecting a 32-bit proxy here.

Gene, could you post the content length you're seeing on the wget downloads by adding "-S"?

It would also be nice to know whether you're going through a 32-bit proxy, as Brion suggests. That could easily do it.
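
(If it helps, here is a minimal Python 3 sketch that fetches only the headers -- roughly the same information wget -S shows -- so the Content-Length arriving through the proxy can be compared against the 4.9 GB the server advertises; the URL is the one from the wget command above:)

import urllib.request

URL = ("http://download.wikimedia.org/enwiki/20090610/"
       "enwiki-20090610-pages-articles.xml.bz2")

# HEAD request: headers only, no body is downloaded.
req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.reason)
    print("Content-Length:", resp.headers.get("Content-Length"))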

yeyiwang wrote:

I guess the problem is related to the proxy server. I switched to a different proxy server and a download attempt via IE stopped at 918 MB.

I obtained wget 1.11.4 and ran it with the -S option. There was a hiccup at around 900 MB -- the connection got closed. Fortunately, wget was robust enough this time to reconnect. So far it is still running smoothly with ~3 GB downloaded:

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = c:\Program Files\GnuWin32/etc/wgetrc
--2009-06-23 10:21:24-- http://download.wikimedia.org/enwiki/20090618/enwiki-20090618-pages-articles.xml.bz2
Resolving download.wikimedia.org... 208.80.152.183
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Content-Length: 5258589574
Date: Tue, 23 Jun 2009 17:21:25 GMT
Content-Type: application/octet-stream
Server: lighttpd/1.4.19
Accept-Ranges: bytes

Length: 5258589574 (4.9G) [application/octet-stream]
Saving to: `enwiki-20090618-pages-articles.xml.bz2'

18% [=================> ] 963,624,164 763K/s in 21m 9s

2009-06-23 10:42:34 (741 KB/s) - Connection closed at byte 963624164. Retrying.

--2009-06-23 10:42:35-- (try: 2) http://download.wikimedia.org/enwiki/20090618/enwiki-20090618-pages-articles.xml.bz2
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response...

HTTP/1.1 206 Partial Content
Connection: close
Proxy-Connection: close
Content-Length: 4294965410
Date: Tue, 23 Jun 2009 17:42:36 GMT
Content-Range: bytes 963624164-5258589573/5258589574
Content-Type: application/octet-stream
Server: lighttpd/1.4.19
Accept-Ranges: bytes

Length: 5258589574 (4.9G), 4294965410 (4.0G) remaining [application/octet-stream]
Saving to: `enwiki-20090618-pages-articles.xml.bz2'

58% [+++++++++++++++++======================================> ] 3,086,259,252 773K/s eta 46m 20s
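
(Side note, a quick arithmetic check of the resumed request above -- a sketch, not part of the log: the 206 response's Content-Length is exactly the remaining byte count, and that remainder now happens to fit in 32 bits, which may be why it sails through the proxy.)

total = 5258589574          # full file size, from the Content-Range header
offset = 963624164          # byte at which the first attempt was cut off
remaining = total - offset
print(remaining)            # 4294965410, matches the 206 Content-Length
print(remaining < 2**32)    # True -- just under the 4 GiB boundary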

Thanks to all of you who have helped!

No problem. Let us know if anything else pops up.

moving product dbzip2 to product Wikimedia tools