
hewiktionary pages-articles.xml dump corrupted
Closed, Resolved, Public

Description

Author: iorsh

Description:
The file hewiktionary-20090423-pages-articles.xml (available as compressed download from http://download.wikimedia.org/hewiktionary/20090423/hewiktionary-20090423-pages-articles.xml.bz2) has corrupted data inside.

For example, some entries lack text altogether, e.g. "MP3" has empty <text> tag. The article is available at http://he.wiktionary.org/wiki/MP3, its content is obviously non-empty.

<page>

<title>MP3</title>
<id>11639</id>

<revision>

 <id>88117</id>
 <timestamp>2009-02-16T18:12:13Z</timestamp>
 <contributor>
  <username>Interwicket</username>
  <id>2170</id>
 </contributor>
 <minor/>
 <comment>iwiki +[[:ko:MP3]]</comment>
 <text xml:space="preserve"/>
</revision>

</page>
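
Empty <text> elements like the one above can be located mechanically. The following is a minimal Python sketch, not the reporter's script: it assumes a pages-articles dump with one revision per page, and the function names (`find_empty_pages`, `scan_dump`) are made up for illustration. A page can also be legitimately blank on the wiki, so each hit still needs to be compared with the live page.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for candidate in elem:
        if local_name(candidate.tag) == name:
            return candidate
    return None


def find_empty_pages(stream):
    """Yield (title, revision_id) for each page whose <text> is empty."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = child(elem, "title")
        revision = child(elem, "revision")
        text = child(revision, "text") if revision is not None else None
        if text is not None and not (text.text or "").strip():
            rev_id = child(revision, "id")
            yield (
                title.text if title is not None else None,
                rev_id.text if rev_id is not None else None,
            )
        elem.clear()  # drop the page's children to keep memory use low


def scan_dump(path):
    """Print every empty page found in a .xml.bz2 dump at the given path."""
    with bz2.open(path, "rb") as dump:
        for title, rev_id in find_empty_pages(dump):
            print(title, rev_id)
```

Calling `scan_dump("hewiktionary-20090423-pages-articles.xml.bz2")` would list the suspect titles together with their revision ids.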

Other entries have text belonging to *other* entries, e.g. the entry "דת" (http://he.wiktionary.org/wiki/דת) has text from the entry "ויסקי" (http://he.wiktionary.org/wiki/ויסקי).

<page>

<title>דת</title>
<id>11808</id>

<revision>

 <id>92454</id>
 <timestamp>2009-03-22T03:06:03Z</timestamp>
 <contributor>
  <username>Interwicket</username>
  <id>2170</id>
 </contributor>
 <minor/>
 <comment>iwiki +[[:th:דת]]</comment>
 <text xml:space="preserve">

וִיסְקִי

{{ניתוח דקדוקי|

כתיב מלא=ויסקי
הגייה='''vis'''ki
חלק דיבר=שם־עצם
מין=זכר
שורש=
דרך תצורה=
נטיות=

}}
[[תמונה:Scotch Whisky (aka).jpg|שמאל|ממוזער|184px|ויסקי]]

משקה [[אלכוהולי]], המופק על ידי זיקוק סוגים שונים של [[דגנים]] אשר עברו תהליך [[הלתתה]]. לאחר הזיקוק, מיושן הנוזל בחביות עץ אלון לפרק זמן משתנה. אחוז האלכוהול ברוב מותגי הוויסקי עומד על 40.

#:* ב[[יום הולדת|יום הולדתי]] שתיתי '''ויסקי''' לשוכרה.
#:* "איזה בן קיבוץ בא לבקר בעיר; נבוך עם התרמיל ובלוריתו המתנפנפת; הוא '''ויסקי''' טוב מוזג, תשתה בחור צעיר; ומחייך: "ספר, אז מה נשמע ברפת"" ("בלדה לעוזב קיבוץ", מילים: יענקל'ה רוטבליט)

מקור

מקור שמו של הוויסקי מגיע מהשפות הקלטיות. במקור נקרא המשקה "ויסקי באה" (uisge beatha באיות אירי או uisge baugh באיות סקוטי) שמשמעותו המילולית היא: "מי החיים".

תרגום

  • אנגלית: {{ת|אנגלית|whiskey}}

ראו גם

  • [[וודקה]]
  • [[טקילה]]

קישורים חיצוניים

{{מיזמים|ויקיפדיה=ויסקי|ויקישיתוף=Category:Whisky|שם ויקישיתוף=ויסקי}}

{{תבנית:משקאות חריפים}}

[[קטגוריה:משקאות]]

[[el:ויסקי]]

 </text>
</revision>

</page>


Version: unspecified
Severity: major

Details

Reference
bz18651

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:35 PM
bzimport set Reference to bz18651.

iorsh wrote:

It seems that these specific XML entries are OK in later dumps, but other entries are corrupt instead. I have some scripts that process pages-articles.xml and detect such corruption, but they need some manual assistance. Please let me know if you want examples of corrupted entries from a newer dump.

If you could share the scripts that you use to detect this, that would be great.

iorsh wrote:

Script which detects errors in database dumps

Attached:

iorsh wrote:

This script should be run as

./HeWiktionary_2_CulmusDic.pl hewiktionary-pages-articles.xml > hewiktionary-culmus.xml

where hewiktionary-pages-articles.xml is the dump in question. It will produce a series of reports of the form

...
Bad word in heading: כלכלן
Bad word in heading: זבד
Bad word in heading: חג
...

I made an effort to ensure that these reports refer to actual dump errors with high probability. If the first report you check does not show an error, try a few more. The example is from the 20090713 dump (http://download.wikimedia.org/hewiktionary/20090713/, file http://download.wikimedia.org/hewiktionary/20090713/hewiktionary-20090713-pages-articles.xml.bz2). Take any report and check the entry in the pages-articles.xml file which corresponds to a page with that name.

E.g. for "כלכלן" look for "<title>כלכלן</title>". You will find an XML entry for the page http://he.wiktionary.org/wiki/כלכלן, but the contents of the entry have nothing to do with the actual contents of the wiki page. I guess that the XML entry <text xml:space="preserve"> contents come from http://he.wiktionary.org/wiki/דינמיט.

The inner workings of the script are probably of no interest to you. It parses wiki pages and complains when a page seems too inconsistent with the usual Hebrew Wiktionary page template. It should mainly serve as a dump error detector.
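
For readers without access to the attachment, the general idea of flagging entries whose text does not fit their title can be illustrated with a much cruder check. The Python sketch below is not the Perl script described above: it is a hypothetical heuristic that assumes a dictionary entry normally mentions its own headword as a whole word somewhere in its wikitext, after vowel points (niqqud) and other combining marks are removed. The names `strip_marks`, `words` and `looks_mismatched` are invented here.

```python
import re
import unicodedata


def strip_marks(s):
    """Remove combining marks (e.g. Hebrew niqqud) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def words(s):
    """Return the set of whole words in s, ignoring punctuation and markup."""
    return set(re.findall(r"\w+", strip_marks(s)))


def looks_mismatched(title, text):
    """Return True if none of the title's words occur as a word in the text.

    Assumption: a genuine entry mentions its headword at least once.
    """
    title_words = words(title)
    return bool(title_words) and not (title_words & words(text))
```

Matching whole words rather than substrings matters: a short title can occur by accident inside a longer, unrelated word. The check is still only a hint, since inflected forms and attached prefixes will produce false alarms, so flagged pages need manual review as with the real script.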

abxabx wrote:

Posting again in correct bugreport.

Just noticed the same thing. The dump is broken at
http://download.wikimedia.org/plwiktionary/20100411/plwiktionary-20100411-pages-articles.xml.bz2.
Several entries exist with an empty text field. For example, here is the real
content of the [[pl:wikt:til]] article:
http://pl.wiktionary.org/w/index.php?title=til&oldid=1279710 while the dump
contains:

<page>
  <title>til</title>
  <id>7330</id>
  <revision>
    <id>1279710</id>
    <timestamp>2010-04-01T23:12:41Z</timestamp>
    <contributor>
      <username>Interwicket</username>
      <id>5613</id>
    </contributor>
    <minor />
    <comment>iwiki +[[:sv:til]]</comment>
    <text xml:space="preserve" />
  </revision>
</page>

There are also misplaced entries. The content of [[pl:wikt:ez]] is under
<title>bela</title> in this archive.

As of June 2010, we check the length of the revision content from the db against what we have in previous dumps (or what we think we are retrieving from the db); see http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=67324. Are people still seeing this issue?
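
A length check of this kind might look roughly like the Python sketch below. This is not the code from the linked revision. It assumes the expected length is stored as a UTF-8 byte count, and it assumes that text prefetched from a previous dump is tried first with the database as a fallback. `prefetched` and `fetch_from_db` are hypothetical stand-ins for those two sources.

```python
def byte_length(text):
    """Length of the text in UTF-8 bytes (assumed unit of the stored length)."""
    return len(text.encode("utf-8"))


def get_revision_text(rev_id, expected_len, prefetched, fetch_from_db):
    """Return revision text only if its length matches the recorded length.

    prefetched:    dict mapping rev_id -> text taken from a previous dump
    fetch_from_db: callable rev_id -> text (hypothetical database lookup)
    """
    candidate = prefetched.get(rev_id)
    if candidate is not None and byte_length(candidate) == expected_len:
        return candidate  # previous dump agrees with the recorded length
    text = fetch_from_db(rev_id)
    if byte_length(text) != expected_len:
        raise ValueError(
            f"revision {rev_id}: got {byte_length(text)} bytes, "
            f"expected {expected_len}"
        )
    return text
```

With a check like this, an empty <text> for a revision whose recorded length is non-zero is rejected instead of being written to the dump, and so is text of the wrong length taken from another page.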

Closing, since no further reports were submitted after the text length check was put in place and the underlying bug causing text content mismatch was fixed in mid 2010.