Spanish wikipedia XML dump problems
Closed, Resolved (Public)

Description

Author: elephantus_l

Description:
I downloaded two consecutive Spanish Wikipedia XML dump files (eswiki-20090504-pages-articles.xml.bz2 and, before that, eswiki-20090421-pages-articles.xml.bz2). When I imported the file into WikiTaxi, it showed a strange error on a large number of pages: the titles and the content of the pages were mixed up. That is, the title would be one thing and the text itself would obviously be from a different page (or a combination of two pages). So I looked into the original XML file itself, and this is what I found, for example:

<page>
  <title>Gómez Plata</title>
  <id>454035</id>
  <revision>
    <id>25156038</id>
    <timestamp>2009-03-28T06:38:04Z</timestamp>
    <contributor>
      <username>SajoR</username>
      <id>130444</id>
    </contributor>
    <minor />
    <comment>leve mejora</comment>
    <text xml:space="preserve">'''Montserrat Domínguez''' ([[Madrid]], [[1963]]) es una [[periodismo|periodista]] [[España|española]].

Considera que la primera obligación de un periodista es ser crítico con el poder y es optimista respecto a la situación actual del periodismo. Su trabajo le ofrece, en su opinión, &quot;un motor de vida&quot;.

Es aficionada a la [[lectura]] y a los viajes.

Biografía

Estudió [[Ciencias de la Información]] por la [[Universidad Complutense de Madrid]]. Posteriormente cursó un Master en Periodismo por la [[Universidad de Columbia]].

So the title of the page is Gómez Plata (a municipality in Colombia), but the page is about a Spanish journalist.

This didn't happen when I downloaded other Wikipedia dumps (en, de, nl, sv). Could someone please look into this problem? Thank you.


Version: unspecified
Severity: normal

Details

Reference
bz18694

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:32 PM
bzimport set Reference to bz18694.

I can indeed find this in eswiki-20090504-pages-articles.xml.bz2 but not in eswiki-20090504-pages-meta-current.xml, which is bizarre. A quick look at the code didn't reveal any big errors. This is going to take a bit more time to find. Thank you for testing the other dumps to see if this was happening there as well.

elephantus_l wrote:

Apparently the messed-up pages are those with a timestamp from approximately mid-January 2009 to mid- or late-April 2009. The pages older or younger than that aren't affected.

It doesn't seem to affect all articles within that time range, though, which makes it a bit hard to find other examples. Can you give me a list of 10
or so more that are still affected in the latest dump, 20090519? Gómez Plata got updated and is now dumping correctly.

*** Bug 19420 has been marked as a duplicate of this bug. ***

eswiki-20090702-pages-articles is still affected.

For instance,
[[MediaWiki:anonnotice]] has the content of [[Carlos Iglesias]].

[[Wikipedia:Portada]] has

<page>
  <title>Wikipedia:Portada</title>
  <id>2271189</id>
  <revision>
    <id>25284089</id>
    <timestamp>2009-04-02T13:56:29Z</timestamp>
    <contributor>
      <username>Muro de Aguas</username>
      <id>214907</id>
    </contributor>
    <minor />
    <comment>wapedia no es propiedad de wikipedia</comment>
    <text xml:space="preserve">#REDIRECT [[Plantilla:Ficha de militar]]</text>
  </revision>
</page>

whereas that revision is http://es.wikipedia.org/w/index.php?title=Wikipedia:Portada&diff=25284089&oldid=24586619

Has a clean dump* been done since the problem was detected?

*A dump not based on the previous one.

(In reply to comment #2)

Apparently the messed-up pages are those with a timestamp from approximately
mid-January 2009 to mid- or late-April 2009. The pages older or younger than
that aren't affected.

Could it be a slave whose autoincrement column became desynchronized?

enrique wrote:

Does eswiki-20090710-pages-articles.xml resolve this error?

No. But you can use the pages-meta-current to skip it.

enrique wrote:

Then this error will not have a solution?
Tomorrow I will try pages-meta-current to skip this error, but I would prefer to wait for a solution.

capellan2000 wrote:

(In reply to comment #8)

No. But you can use the pages-meta-current to skip it.

OK, it is not much of a problem to download 1GB instead of
758MB, but could someone find any hint in the PHP code that creates this backup that actually explains why this error is occurring only in the Spanish Wikipedia and not in the English version?
I have verified that the English version's database backup does not show these errors.

enrique wrote:

Hi, I downloaded pages-meta-current to skip this error, but I see many red links; in the official Wikipedia these links are blue.
Many of the templates are empty or uncategorised.

(In reply to comment #10)

(In reply to comment #8)

No. But you can use the pages-meta-current to skip it.

OK, it is not much of a problem to download 1GB instead of
758MB, but could someone find any hint in the PHP code that creates this
backup that actually explains why this error is occurring only in the Spanish
Wikipedia and not in the English version?

Sadly we've suffered a loss of a good chunk of our previously run snapshots so comparing to
those will be a bit hard. If I can catch the problem happening actively then it will be much
easier.

*** Bug 20114 has been marked as a duplicate of this bug. ***

*** Bug 19598 has been marked as a duplicate of this bug. ***

Ascander wrote:

Hi, this problem is still present in the backups of the Spanish Wikipedia. It is hard to calculate the number of articles affected, but they are among those last modified between January and April 2009. No affected articles have been observed so far outside this time range.

The affected articles (i.e. articles showing wrong content in the periodic backups, including the latest published one, eswiki-20100221-pages-articles.xml.bz2) seem to be the same across different backups.

Here are some examples of wrong content:

Article [[:es:Escudo de la Polinesia Francesa]] shows contents belonging to [[:es:Anexo:Gobernadores de Corrientes]]

Article [[:es:Pleurodema]] shows contents belonging to [[:es:Aviación virtual]]

Redirect [[:es:Candelilla]] points to [[:es:Eugène Scribe]] instead of [[:es:Euphorbia antisyphilitica]].

Redirect [[:es:Knut Schreiner]] points to [[:es:Euphorbia antisyphilitica]] instead of [[:es:Euroboy]]

Here is an upper bound on the number of pages affected (i.e. pages last updated between January and April 2009): 44193 articles/annexes, 5634 pages in other namespaces and 99067 redirects. Unlike articles, redirects are easy to check, and I can say that almost all of them show wrong content in the last backup.
For instance, these are the redirects updated in the first hour of March 1, 2009, and they are all wrong:

'Aeropuerto de Ontario' --> 'Agustín de Pedrayes'
'Corno' --> 'El aprendiz de brujo (Dukas)'
'Kenneth Burrell' --> 'Aeropuerto Internacional LA/Ontario'
'Oro amarillo' --> 'Emo'
'Claro de Luna (Beethoven)' --> 'oro'
'Claro de Luna (Maupassant)' --> 'Sonata para piano n.º 14 (Beethoven)'
'Claro de luna (Debussy)' --> 'Sonata para oboe y piano (Poulenc)'
'Idioma retorromance' --> 'Adrenalynn'
'Boubacar traoré' --> 'Suite bergamasque'
'Rodrigo Sepúlveda Lara' --> 'Claro de luna (astronomía)'
'Oxnard' --> 'Rodrigo Sepúlveda'
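
A minimal sketch of how the candidate set described above could be pulled out of a dump: it streams the XML, keeps pages whose revision timestamp falls inside the suspect window, and extracts the redirect target so it can be compared with the live site. The exact window boundaries, the function name suspect_pages, and the inclusion of the Spanish #REDIRECCIÓN keyword are assumptions made for illustration; the thread only gives approximate dates. It assumes one revision per page, as in pages-articles dumps.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed window: the thread only says "approximately mid-January to late April 2009".
WINDOW_START = datetime(2009, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2009, 5, 1, tzinfo=timezone.utc)

# Matches "#REDIRECT [[Target]]"; the Spanish keyword is included as an assumption.
REDIRECT_RE = re.compile(
    r"^\s*#(?:REDIRECT|REDIRECCI[OÓ]N)\s*\[\[([^\]|#]+)", re.IGNORECASE
)


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def suspect_pages(stream):
    """Yield (title, redirect_target_or_None) for pages edited inside the window."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = timestamp = None
        text = ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "timestamp":
                timestamp = node.text
            elif name == "text":
                text = node.text or ""
        elem.clear()  # release the page's children to keep memory use low
        if not timestamp:
            continue
        edited = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        edited = edited.replace(tzinfo=timezone.utc)
        if WINDOW_START <= edited < WINDOW_END:
            match = REDIRECT_RE.match(text)
            yield title, (match.group(1).strip() if match else None)
```

With a compressed dump, the stream could be opened with bz2.open(path, "rb") and passed to suspect_pages directly. The targets it yields still have to be checked against the live wiki by some other means; the sketch only narrows down which pages to look at.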

Ascander wrote:

This problem is vanishing.

As mentioned before, the affected pages are among those last edited between January 2009 and April 2009. With the help of [[:es:Usuario:Boticario]] and his bot CEM-bot, the pages with these characteristics were therefore reviewed, first for spelling and then, for the remainder, for cosmetic changes. There are still several pages and redirects that show wrong content in the last dump, with an upper bound of 38 redirects and 16260 pages.

All of the 38 remaining redirects contain the character "_" in their title and thus are not accessible through the site.

In order to finish with this problem, and unless someone proposes a better idea, I'll suggest that Boticario edit them by introducing a useless space at the end of the first line, or something equally useless and invisible.

For the record, here is the list of redirects that I don't know how to access (notice the underscores in their title):

  • [[:es:A _c]]
  • [[:es:A _d]]
  • [[:es:A _e _c]]
  • [[:es:Siglo_II_d._C.]]
  • [[:es:La_440]]
  • [[:es:S._I.]]
  • [[:es:580_a._C.]]
  • [[:es:589_a._C.]]
  • [[:es:588_a._C.]]
  • [[:es:585_a._C.]]
  • [[:es:584_a._C.]]
  • [[:es:582_a._C.]]
  • [[:es:594_a._C.]]
  • [[:es:600_a._C.]]
  • [[:es:559_a._C.]]
  • [[:es:556_a._C.]]
  • [[:es:550_a._C.]]
  • [[:es:558_a._C.]]
  • [[:es:555_a._C.]]
  • [[:es:551_a._C.]]
  • [[:es:546_a._C.]]
  • [[:es:529_a._C.]]
  • [[:es:528_a._C.]]
  • [[:es:526_a._C.]]
  • [[:es:525_a._C.]]
  • [[:es:522_a._C.]]
  • [[:es:521_a._C.]]
  • [[:es:520_a._C.]]
  • [[:es:510_a._C.]]
  • [[:es:515_a._C.]]
  • [[:es:K._O.]]
  • [[:es:Brasilia,_D._F.]]
  • [[:es:Brasilia, D._F.]]
  • [[:es:1200_a._C.]]
  • [[:es:500_a._C.]]
  • [[:es:Marina de EE._UU.]]
  • [[:es:Francis S._Collins]]

Look at http://es.wikipedia.org/w/index.php?title=Francis_S.%C2%A0Collins&diff=prev&oldid=25984099 It is not a space, it is a non-breaking space. This redirect was created by an indefinitely blocked vandal who used a non-breaking space (0xc2 0xa0) instead of the normal one (0x20).

For some reason action=view is treating the 160 space as a 32 one and thus doesn't find it.
We have 73 articles like that. Most were made by Rosarino, but also by Gunderson, Muro Bot, Wiki Winner and Jtspotau.
We should probably delete them.

I opened bug 22939 to handle the nbsp titles.
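
A small sketch of how titles like these could be spotted in a list of page titles. The function name find_odd_spaces is hypothetical; the only facts relied on are that U+00A0 is encoded in UTF-8 as the bytes 0xc2 0xa0 and that Unicode classifies it as a space separator (category Zs).

```python
import unicodedata

NBSP = "\u00a0"  # non-breaking space, UTF-8 bytes 0xc2 0xa0


def find_odd_spaces(title):
    """Return (index, Unicode name) for each space-like character that is not U+0020."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(title)
        if char != " " and unicodedata.category(char) == "Zs"
    ]


# The redirect from the thread, written with an explicit non-breaking space.
title = "Francis S." + NBSP + "Collins"
print(find_odd_spaces(title))
print(title.encode("utf-8"))
```

Running the same function over every title in a dump would list the pages that look normal on screen but cannot be reached through a URL built with ordinary spaces or underscores.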

steefy389 wrote:

It seems that not only eswiki is affected by this.

There is a report on dewiki Village Pump about this issue on dewikisource ([http://de.wikipedia.org/w/index.php?title=Wikipedia:Fragen_zur_Wikipedia&oldid=73896610#Dump]).

We check length of revision content from the db against what we have in previous dumps (or what we think we are retrieving from the db), as of June 2010 (http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=67324); are people still seeing this issue?
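
As a rough illustration of the kind of check described above (not the actual dump code): before reusing text taken from a previous dump, compare its length with the length recorded for that revision, and fall back to the database on a mismatch. The names choose_text and fetch_from_db are hypothetical, and measuring the length in UTF-8 bytes is an assumption.

```python
def text_length_matches(text, expected_bytes):
    """True if the UTF-8 byte length of text equals the recorded length (assumed unit)."""
    return len(text.encode("utf-8")) == expected_bytes


def choose_text(prefetched, expected_bytes, fetch_from_db):
    """Reuse text from a previous dump only if its length matches the record.

    prefetched:     text found in the previous dump, or None if absent
    expected_bytes: length recorded for this revision
    fetch_from_db:  hypothetical callable returning the text from the database
    """
    if prefetched is not None and text_length_matches(prefetched, expected_bytes):
        return prefetched
    return fetch_from_db()
```

A length comparison cannot prove that the text belongs to the revision, but it would catch most of the mix-ups reported here, since two unrelated pages rarely have exactly the same length.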

Closing, since no further reports were submitted after the text length check was put in place, and the underlying bug causing the text content mismatch was fixed in mid-2010.