
itwiki-20110130-pages-articles.xml.bz2 is corrupted
Closed, Resolved (Public)

Description

Author: kali

Description:
$ md5sum itwiki-20110130-pages-articles.xml.bz2
7eac57c7c521bf6f36e9a5d7ec476562 itwiki-20110130-pages-articles.xml.bz2

which is fine, according to http://dumps.wikimedia.org/itwiki/20110130/itwiki-20110130-md5sums.txt

but...

$ bunzip2 itwiki-20110130-pages-articles.xml.bz2

bunzip2: Data integrity error when decompressing.
Input file = itwiki-20110130-pages-articles.xml.bz2, output file = itwiki-20110130-pages-articles.xml

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bunzip2: Deleting output file itwiki-20110130-pages-articles.xml, if it exists.

$ bunzip2 -tvv itwiki-20110130-pages-articles.xml.bz2

itwiki-20110130-pages-articles.xml.bz2: 
  [1: huff+mtf rt+rld]
  [2: huff+mtf rt+rld]

[.... snip ....]

  [2510: huff+mtf rt+rld]
  [2511: huff+mtf rt+rld]
  [2512: huff+mtf data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
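
A matching MD5 only shows that the downloaded bytes equal the bytes on the server; it says nothing about whether the bzip2 stream itself is valid, which is consistent with the checksum passing while `bunzip2 -tvv` fails. A rough Python counterpart of that integrity test is sketched below, using only the standard `bz2` module. The function name `bz2_is_intact` and the chunk size are illustrative assumptions, not part of any dump tooling.

```python
import bz2


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the bzip2 file at `path` decompresses cleanly.

    Hypothetical helper: reads the whole file in chunks and discards the
    output, so memory use stays flat even for multi-gigabyte dumps.
    """
    try:
        with bz2.open(path, "rb") as fh:
            # Reading to end-of-file forces every block to be decoded,
            # which is where a corrupted block is detected.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid or corrupted data; EOFError: truncated stream.
        return False
    return True
```

Calling `bz2_is_intact("itwiki-20110130-pages-articles.xml.bz2")` on the file from this report would be expected to return `False`, since `bunzip2` already reports a CRC error for it.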


Version: unspecified
Severity: normal

Details

Reference
bz27064

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:20 PM)
bzimport set Reference to bz27064.

Got this, too. Thanks for reporting this.

Rerunning this job from the command line. It should be done in a couple of hours and I'll have a look. I've saved a copy of the old bad file elsewhere on the off chance that it's useful for comparison.

The new file looks normal afaict. Can you check it please?

kali wrote:

Indeed, my import script passed the download and unzip stage. Thanks a lot, and good luck with the broken file.

The bzip2 step appears to die partway through once in a while. I'm going to have to add a check for that. I've so far failed to reproduce it on my laptop (probably because the files I generate aren't large enough). I'll rerun that step so we have a good file in the meantime.
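
The thread does not say what the added check ended up looking like. One possible shape, purely as a sketch, is to re-read the compressed output right after writing it and compare a hash of the decompressed bytes with a hash of the input. The function name `compress_and_verify`, the use of SHA-256, and doing the compression in-process with Python's `bz2` module are all assumptions for illustration; the real dump job may well invoke an external compressor instead.

```python
import bz2
import hashlib


def compress_and_verify(src_path, dst_path, chunk_size=1 << 20):
    """Compress src_path to dst_path, then confirm the result round-trips.

    Hypothetical helper: returns True only if dst_path decompresses without
    error and yields exactly the bytes that were read from src_path.
    """
    # Hash the input while streaming it into the compressed file.
    written = hashlib.sha256()
    with open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            written.update(chunk)
            dst.write(chunk)

    # Re-read the compressed file from disk and hash what comes out.
    read_back = hashlib.sha256()
    try:
        with bz2.open(dst_path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                read_back.update(chunk)
    except (OSError, EOFError):
        # Corrupted or truncated output: treat as a failed check.
        return False
    return written.digest() == read_back.digest()
```

A job built this way could refuse to publish the file, and skip writing its checksum, whenever the function returns `False`. The verification pass costs a full decompression of the output.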

kali wrote:

I confirm the new eswiki file is OK.