XML import problem in Page namespace
Closed, Duplicate (Public)

Description

example XML file which generates an error

I receive the following error on wikisource.org when importing an XML file that contains pages from the Page namespace:

Import failed: Can't save non-default content model with $wgContentHandlerUseDB disabled: model is wikitext, default for Page:Ludwik Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/02 is proofread-page

However, the imported file has

<model>proofread-page</model>

for each revision in the imported file. The import of this page seems to be OK, but if the file contains more pages, only the first one (the one mentioned in the error message) is imported.
An example import file is enclosed.
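
To double-check what the export file declares, the `<model>` value of every revision can be listed before importing. This is a minimal sketch, not part of the original report: the sample XML is a hypothetical, heavily trimmed fragment (title, ids and schema version are made up), and `models_by_page` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed export fragment; real exports carry more fields.
SAMPLE_EXPORT = """\
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Page:Example.djvu/02</title>
    <revision>
      <id>1</id>
      <model>proofread-page</model>
    </revision>
    <revision>
      <id>2</id>
      <model>proofread-page</model>
    </revision>
  </page>
</mediawiki>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def models_by_page(xml_text):
    """Map each page title to the list of <model> values of its revisions."""
    result = {}
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title = None
        models = []
        for child in page:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                for field in child:
                    if local_name(field.tag) == "model":
                        models.append(field.text)
        result[title] = models
    return result


if __name__ == "__main__":
    for title, models in models_by_page(SAMPLE_EXPORT).items():
        print(title, models)
```

If every revision lists proofread-page here, the file itself is consistent and the mismatch has to come from the import side.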


Version: 1.23.0
Severity: normal

There is a regression in MediaWiki 1.26wmf9, as described in the comment below.

Details

Reference
bz62780

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 2:59 AM)
bzimport set Reference to bz62780.
bzimport added a subscriber: Unknown Object (MLST).
Ankry raised the priority of this task from Low to Medium. (Jun 16 2015, 1:00 AM)
Ankry subscribed.

A regression is observed in MediaWiki 1.26wmf9 (rMWda20be2f916f) + ProofreadPage (96d77aa) compared to the previous (initial) report:
The error message no longer appears, but the imported pages get a different content model (wikitext) than the original (proofread-page, the default model for the Page namespace). That is wrong, and the imported pages are unusable (no scan view available, no header/footer separation, no proofread status can be set).
I found no way to set the content model back to proofread-page. Any hints?
(Other than deleting the page and re-creating it from scratch, which is useless if you need to transfer ~500 pages with their proofread status history.)

Observed in a sourceswiki -> plwikisource transfer.

Examples:
https://pl.wikisource.org/wiki/Strona:Korczak_Janusz_-_O_gazetce_szkolnej.djvu/15 (classic import)
https://pl.wikisource.org/wiki/Strona:Korczak_Janusz_-_O_gazetce_szkolnej.djvu/005 (XML import)

Ankry set Security to None.
Ankry added a subscriber: Wieralee.

It seems that on Wikimedia wikis (I didn't observe this locally) the null edit with the import comment gets an incorrect "wikitext" content_model:

Freshly imported https://test2.wikipedia.org/wiki/Page:Ludwik_Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/02 :

(Test2 is now at 1.26wmf12 rMWf0490d82fd6c core, ProofRead at 7cb7a87).

MariaDB [test2wiki_p]> select rev_id, rev_parent_id, rev_len, rev_timestamp, rev_content_model, rev_content_format, rev_comment, rev_user_text 
    -> from revision where rev_page=63160 ;
+--------+---------------+---------+----------------+-------------------+--------------------+---------------------------------------+---------------+
| rev_id | rev_parent_id | rev_len | rev_timestamp  | rev_content_model | rev_content_format | rev_comment                           | rev_user_text |
+--------+---------------+---------+----------------+-------------------+--------------------+---------------------------------------+---------------+
| 157371 |             0 |     347 | 20120827153332 | NULL              | NULL               | /* Nieskorygowana */                  | Adam2         |
| 157372 |        157371 |     349 | 20120827153528 | NULL              | NULL               | /* Skorygowana */                     | Remedios44    |
| 157373 |        157372 |     349 | 20120827155328 | NULL              | NULL               | /* Uwierzytelniona */                 | Paelius       |
| 157374 |        157373 |     382 | 20121020174540 | NULL              | NULL               |                                       | Adam2         |
| 157375 |        157374 |     382 | 20150630212839 | wikitext          | NULL               | 4 revisions imported: testing for bug | Saper         |
+--------+---------------+---------+----------------+-------------------+--------------------+---------------------------------------+---------------+

as a result we get

MariaDB [test2wiki_p]> select * from page where page_id=63160 \G
*************************** 1. row ***************************
           page_id: 63160
    page_namespace: 104
        page_title: Ludwik_Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/02
 page_restrictions: 
      page_counter: 0
  page_is_redirect: 0
       page_is_new: 0
       page_random: 0.477007818101
      page_touched: 20150630212839
page_links_updated: 20121020174540
       page_latest: 157375
          page_len: 499
page_content_model: wikitext
1 row in set (0.01 sec)

Same file on a non-WMF test wiki, running core 8338476b8e0c57132a5b150f42ed47da95f370b9 and ProofRead at 7cb7a8790f349f0f481e0ef8785ff63f182b7611

mysql> select rev_id, rev_parent_id, rev_len, rev_timestamp, rev_content_model, rev_content_format, rev_comment, rev_user_text
    -> from revision where rev_page=228;
+--------+---------------+---------+----------------+-------------------+--------------------+-----------------------+---------------+
| rev_id | rev_parent_id | rev_len | rev_timestamp  | rev_content_model | rev_content_format | rev_comment           | rev_user_text |
+--------+---------------+---------+----------------+-------------------+--------------------+-----------------------+---------------+
|    546 |             0 |     347 | 20120827153332 | NULL              | NULL               | /* Nieskorygowana */  | Adam2         |
|    547 |           546 |     349 | 20120827153528 | NULL              | NULL               | /* Skorygowana */     | Remedios44    |
|    548 |           547 |     349 | 20120827155328 | NULL              | NULL               | /* Uwierzytelniona */ | Paelius       |
|    549 |           548 |     382 | 20121020174540 | NULL              | NULL               |                       | Adam2         |
|    550 |           549 |     382 | 20150629210914 | NULL              | NULL               | 4 wersje: jeszcze raz | CheckUserTest |
+--------+---------------+---------+----------------+-------------------+--------------------+-----------------------+---------------+
5 rows in set (0.00 sec)

mysql> select * from page where page_id=228 \G
*************************** 1. row ***************************
           page_id: 228
    page_namespace: 250
        page_title: Ludwik_Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/02
 page_restrictions: 
  page_is_redirect: 0
       page_is_new: 0
       page_random: 0.100802924531
      page_touched: 20150629210914
page_links_updated: 20121020174540
       page_latest: 550
          page_len: 382
page_content_model: proofread-page
         page_lang: NULL
1 row in set (0.00 sec)
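
The difference between the two result sets can be spelled out in a few lines. This is a sketch under one assumption: a NULL rev_content_model means "use the default model for the title", which is what the second wiki's output suggests (all NULL, page model proofread-page). The rows are copied from the two queries above; the helper names are made up for illustration.

```python
# (rev_id, rev_content_model) pairs copied from the query results above.
# None stands for SQL NULL.
TEST2_REVISIONS = [
    (157371, None),
    (157372, None),
    (157373, None),
    (157374, None),
    (157375, "wikitext"),
]
LOCAL_REVISIONS = [
    (546, None),
    (547, None),
    (548, None),
    (549, None),
    (550, None),
]

# Default model for the Page namespace, as named in the original error message.
NAMESPACE_DEFAULT = "proofread-page"


def effective_model(stored, default=NAMESPACE_DEFAULT):
    """Assumption: a NULL (None) stored model falls back to the namespace default."""
    return default if stored is None else stored


def mismatched_revisions(rows, default=NAMESPACE_DEFAULT):
    """Return the rev_ids whose effective model differs from the namespace default."""
    return [
        rev_id
        for rev_id, stored in rows
        if effective_model(stored, default) != default
    ]


if __name__ == "__main__":
    print("test2:", mismatched_revisions(TEST2_REVISIONS))
    print("local:", mismatched_revisions(LOCAL_REVISIONS))
```

Only the final revision on test2 (the one carrying the import comment) is flagged, and it is also page_latest, which would explain why page_content_model ends up as wikitext there.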

I don't know about your naming scheme in the Page: namespace under MediaWiki proper, but did the Page: namespace with <pagelist /> applied ever recognize page titles that have leading-zero prefixes? In other words:

Page:Foo.djvu/005 (wrong) -------> Page:Foo.djvu/5 (correct)
Page:Foo.djvu/015 (wrong) -------> Page:Foo.djvu/15 (correct)

https://test2.wikipedia.org/wiki/Page:Ludwik_Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/02 (wrong: a 0 precedes the 2)

https://test2.wikipedia.org/wiki/Page:Ludwik_Młynek_-_Narzecze_wilamowickie_(Wilhelmsauer_Dialekt._Dy_wymmysuaschy_Gmoansproch).djvu/2 (correct: no 0 prefix, just 2)

I do not recall titles with leading zeros ever being "valid" under the PRP extension, unless you set up the related Index: page numbering without using the <pagelist /> tag (but I've been wrong before :)

Maybe also worth mentioning:

One example uses the namespace number intended all along (ns-250), while the other uses the one that happened to be available at the time (ns-104).

Could it be that ns-250 is somehow "set" to strip leading zeros while ns-104 is not? Maybe the "stripping" is language dependent?

Regardless, everything I've known or seen to date indicates that leading zeros in <pagelist /> do get "stripped". In short:

<pagelist 001to024 ... /> comes out just like <pagelist 1to24 ... /> would.

Pages with "zero-prefixes" are used only in indexes with manual page listings (without a <pagelist /> tag). We avoid this scheme in newly created indexes. But this is in no way related to the import problem: page import is independent of index import. Note that index pages cannot be imported, because in most cases the index structure differs considerably between wikis (e.g. because of translated field names in the index template).

BTW, we found a workaround: when importing a page over an existing page (e.g. an empty one created just before the import), no imported revision has rev_content_model set to "wikitext", and after restoring the newest imported revision to the top everything works fine. But this method is ugly.

Err... you should copy or import at least the base Index's numbering assignment first, before you import any of the individual pages. Otherwise the "default" empty <pagelist /> is assumed to be the valid numbering progression. This is not the same as importing the entire Index: page itself, just the page numbering assignment.

In the given case, you probably edited the target Index: after the import was done? The way I see it, your source Index's numbering needed to be in place on the target before importing the individual pages. Without that in place, the default assumed at the target would be a blank <pagelist />, no? Isn't that why no next and previous arrows appeared upon initial creation?

Or think of it this way: resetting any page numbering assignment to a blank <pagelist /> equivalent on the source before generating the XML should ensure that the same straightforward blank <pagelist /> progression is created on the target.

If <pagelist /> is not in use at the source, then I guess you're pretty much stuck creating the pages first and importing over them on the target.

Just tested: the problem is in no way related to the index page:

  1. Index on sourceswiki: https://wikisource.org/wiki/Index:Livische_Grammatik_%28Sprachproben%29.pdf
  2. Index on plwikisource: https://pl.wikisource.org/wiki/Indeks:Livische_Grammatik_%28Sprachproben%29.pdf (created before import)
  3. A page imported via import tool: https://pl.wikisource.org/wiki/Strona:Livische_Grammatik_%28Sprachproben%29.pdf/1
  4. A page imported via XML import: https://pl.wikisource.org/wiki/Strona:Livische_Grammatik_%28Sprachproben%29.pdf/2

All imported pages were related to existing indexes on sourceswiki, and it does not seem to matter whether the target indexes use <pagelist/> or not (in the current cases all indexes use <pagelist/>). The import is broken somewhere else.

As you can see here:
https://pl.wikisource.org/w/index.php?title=Strona:Livische_Grammatik_%28Sprachproben%29.pdf/1&action=info
and here:
https://pl.wikisource.org/w/index.php?title=Strona:Livische_Grammatik_%28Sprachproben%29.pdf/2&action=info
both pages have the "wikitext" model.
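
The same check can be done in bulk through the API rather than by opening action=info for each page, since prop=info also reports the content model. This is a rough sketch, assuming the standard api.php endpoint of the wiki and the default JSON layout (pages keyed by page id). The sample response is hypothetical (page ids made up) and only mirrors what is reported above; the function names are illustrative.

```python
from urllib.parse import urlencode


def info_query_url(api_endpoint, titles):
    """Build a prop=info query URL for the given page titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


def content_models(response):
    """Extract {title: contentmodel} from a decoded prop=info JSON response."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("contentmodel") for page in pages.values()}


# Hypothetical decoded response, shaped like the default JSON output.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1001": {
                "title": "Strona:Livische Grammatik (Sprachproben).pdf/1",
                "contentmodel": "wikitext",
            },
            "1002": {
                "title": "Strona:Livische Grammatik (Sprachproben).pdf/2",
                "contentmodel": "wikitext",
            },
        }
    }
}

if __name__ == "__main__":
    print(info_query_url(
        "https://pl.wikisource.org/w/api.php",
        ["Strona:Livische_Grammatik_(Sprachproben).pdf/1",
         "Strona:Livische_Grammatik_(Sprachproben).pdf/2"],
    ))
    print(content_models(SAMPLE_RESPONSE))
```

Fetching the URL and decoding the JSON is left out on purpose so that the sketch runs without network access.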

In MediaWiki 1.27.0-wmf.3 (and probably in a few earlier versions) this problem no longer seems to exist.

Close this ticket, please.

matmarex claimed this task.

Perhaps this was the same issue as T91170 after all.

Yes, it was the same issue. I traced it down to Revision::newNullRevision but couldn't figure out further.