
Greatly improved Export and Import
Closed, Declined · Public

Description

A patch for MediaWiki 1.14.1

Hi!
In our company (Customized InformSystems, http://www.custis.ru/), we use MediaWiki for maintaining corporate knowledge bases, and I'm responsible for its support. Sorry in advance for my English :-)

I have greatly improved the MediaWiki XML export and import mechanism, and I want to submit a patch. The following features are added:

  • Support for exporting and importing UPLOADS - either attached inside the output file (which then becomes a multipart/related document with binary parts), or as URLs + SHA1 hashes inside the XML document, so that the destination wiki can download them by itself.
  • Support for selecting pages from category with subcategories.
  • Support for selecting pages from specific namespace.
  • Support for selecting pages modified after a specific date.
  • Support for selecting pages linked to selected ("pagelinks closure").
  • DVCS-like detection of "conflicts" during import, with an extended import report using 5 types of messages:
      • All revisions were previously imported. No local changes.
      • All revisions were previously imported. Page changed locally.
      • N revisions imported.
      • N revisions imported (new page).
      • N revisions imported (conflict: XX (import) and YY (local)).
  • With all these improvements, the import mechanism stays backwards compatible and can import dumps from MediaWikis without these improvements.
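The report types above can be sketched as a small classifier. This is an illustrative reconstruction, not the actual patch code; all names and the revision-ID representation are invented:

```python
# Hypothetical sketch of the DVCS-like import report described above.
# Revisions are identified here by opaque IDs; the real patch works on
# MediaWiki revision rows.

def classify_import(imported_ids, local_ids, new_page):
    """Classify the result of importing revisions into one page.

    imported_ids: revision IDs present in the dump
    local_ids:    revision IDs already present locally
    new_page:     True if the page did not exist before the import
    """
    local = set(local_ids)
    imported = set(imported_ids)
    new_revs = [r for r in imported_ids if r not in local]      # freshly imported
    local_only = [r for r in local_ids if r not in imported]    # local-only changes

    if not new_revs:
        if local_only:
            return "All revisions were previously imported. Page changed locally."
        return "All revisions were previously imported. No local changes."
    if new_page:
        return f"{len(new_revs)} revisions imported (new page)."
    if local_only:
        return (f"{len(new_revs)} revisions imported "
                f"(conflict: {len(new_revs)} (import) and {len(local_only)} (local)).")
    return f"{len(new_revs)} revisions imported."
```

A "conflict" here means both sides added revisions since the last exchange, mirroring how a DVCS detects divergent histories.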

Unfortunately (or fortunately? :-D), the patch requires a change to filerepo/LocalFile.php, because of an inconsistency: archived versions of uploads (oldimages) are stored with a timestamp in the file name which IS NOT EQUAL to the oldimage timestamp in the database. The timestamp in the file name is the timestamp of archiving, i.e. the timestamp of the NEXT revision of the upload. IMO this is inconsistent behaviour: oldimage file names end up depending on the order of import. So I'll submit a patch and a tool to convert the timestamps in the oldimage archive.
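The inconsistency can be illustrated with a small sketch (the dates and file name are invented):

```python
# MediaWiki stores archived uploads as "<timestamp>!<filename>", where
# <timestamp> is the moment of ARCHIVING (i.e. of the NEXT upload),
# not the oldimage row's own timestamp.
from datetime import datetime, timezone

def archive_name(ts, filename):
    return ts.strftime("%Y%m%d%H%M%S") + "!" + filename

uploaded = datetime(2010, 3, 1, 12, 0, 0, tzinfo=timezone.utc)   # oi_timestamp in the DB
archived = datetime(2010, 4, 2, 9, 30, 0, tzinfo=timezone.utc)   # when the next version arrived

current  = archive_name(archived, "Example.png")  # what MediaWiki does today
proposed = archive_name(uploaded, "Example.png")  # what the patch proposes
# current  depends on when (and in what order) later versions arrived;
# proposed is stable and matches the database row.
```

Because the current name embeds the archiving time, importing the same history in a different order produces different file names for the same revision.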

I think all these features are very useful for MediaWiki users, so I'm asking whether it is possible to incorporate these changes into the MediaWiki code. Right now they are written against 1.14.1, but I can port them to trunk if needed. The submitted patch also DOES NOT conform to your code style - for example, I hate adding spaces after opening and before closing braces - but if needed, I can change the patch to conform.

But I think you can start reviewing the patch anyway :) IMO it's very useful.


Version: unspecified
Severity: enhancement

attachment Y-001-import-export-images-and-conflicts.diff ignored as obsolete

Details

Reference
bz22881

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:08 PM
bzimport set Reference to bz22881.
bzimport added a subscriber: Unknown Object (MLST).

A maintenance tool to change oldimage filenames

attachment file-upload-renamer.php ignored as obsolete

Also, this patch fixes Bug 9108.
I'll add this bug as a See Also to all related bugs.

P.S.: If you want to see it in action, go to our external corporate wiki: http://lib.custis.ru/Special:Export or my own wiki: http://yourcmc.ru/wiki/Special:Export
The patch is applied to both of them.

I'd certainly advise porting to trunk, considering 1.15.2 is considered stable and 1.16 is not hugely far away :)

That increases the chances of getting it reviewed quickly.

No new functionality will be added to already released versions. Please provide a patch for trunk and reopen on submit.

Outdated patches are no reason to resolve INVALID. The patch can be updated to run against trunk.

I will definitely update the patch to run against trunk when I have time :) For now, you can try the live demonstration at the URLs above...

Patch for Trunk MediaWiki (svn70079)

Good news, everyone! :-)
I ported my patch to MediaWiki 1.17 trunk (svn revision 70079).
Can you give it some review, please?

attachment CustIS-Import-Export-svn70079.diff ignored as obsolete

Patch for trunk MediaWiki (svn 77332)

Updated the patch for trunk + added some checks and the ability to export very large files using a 1 MB buffer.

attachment CustIS-Import-Export-svn77332.diff ignored as obsolete

A few comments:

<font color=red>

is a big no-no, use <span class="error">

$dbw->immediateCommit();

immediate*() are deprecated, use just begin() and commit()

Bryan.TongMinh wrote:

Comments only on the filerepo part:

+ global $wgUser;
File repo should never ever ever ever use globals like $wgUser. Pass the user object as a function argument, similar to LocalFile::upload

+ $dstPath = $this->repo->getZonePath('public') . '/archive/' . $this->getHashPath() . $dstName;
[...]
+ $dstName = gmdate( 'YmdHis', wfTimestamp( TS_UNIX, $timestamp ) ) . '!' . $this->getName();

You are essentially duplicating this from LocalFile::publish. The archive name generation should be moved to a separate function, which is called by both publish() and recordOldUpload().
E.g. $dstName = $this->generateArchiveName( $timestamp );

I find it strange that you are doing path generation in recordOldUpload(). Is there a specific reason you are not doing that in the caller and pass the entire archive name? That would seem a bit more logical to me.

+ /* Original gmdate( 'YmdHis' ) is not corrent AT ALL! */
+ /* It gives an inconsistency: file name has one timestamp and database row has another. */
Correct, we should fix that anyway.

+ $props['timestamp'] = wfTimestamp( TS_MW, $timestamp );
You should use $dbw->timestamp( $timestamp )

I don't know the import/export code at all, so I won't comment on that.

In general, good work. You are well on your way to finally getting upload importing/exporting available.

Bryan.TongMinh wrote:

(In reply to comment #12)

+ /* Original gmdate( 'YmdHis' ) is not corrent AT ALL! */
+ /* It gives an inconsistency: file name has one timestamp and database
row has another. */
Correct, we should fix that anyway.

I retract that. The timestamp part of the filename is the timestamp when the file was moved into the archive, not the timestamp when the file was originally uploaded.

Thanks everybody for remarks :)

(In reply to comment #13)

I retract that. The timestamp part of the filename is the timestamp when the
file was moved into the archive, not the timestamp when the file was originally
uploaded.

Yeah, originally it is. But I personally don't see the practical sense of this. The time when the file was moved into the archive is usually equal to, or within 1-2 seconds of, the time when a new version of the same file was uploaded. So each version carries the timestamp of another one, and not always an accurate one at that.
I think versions should be independent and should not form such "linked lists".

(In reply to comment #12)

I find it strange that you are doing path generation in recordOldUpload(). Is
there a specific reason you are not doing that in the caller and pass the
entire archive name? That would seem a bit more logical to me.

Do you mean concatenating $dstPath in the caller instead of passing it as a parameter to recordOldUpload()? If so, recordOldUpload() uses $dstName as the value of the oi_archive_name DB column - that's the reason.

Rebase patch for new trunk MW (svn 82606)

Rebase patch to svn 82606 + add corrections from comment 12 by Bryan Tong Minh.
From now on, all these patches will be available at:
http://code.google.com/p/mediawiki4intranet/source/browse/?r=files#hg%2Fimport-export-patch

attachment MW4Intranet-ImportExportPatch-svn82606.diff ignored as obsolete

Also, I forgot to mention that in the new version of the patch, the insertion of empty "marker" revisions into imported pages is now disabled, because it leads to very fancy bugs - in particular, infinitely duplicated revisions without changes in the case of two-way ("vice versa"?) replication between two wikis. One wiki sees a change, imports it, and adds a newer marker revision; then the other wiki sees this revision in the export file, imports it, and so on...

Rebase to 85617 + fix "no history" for uploads + use sha1base36

attachment MW4Intranet-ImportExportPatch-svn85617.diff ignored as obsolete

Any progress on the review? I see there were some changes to Import.php and SpecialExport.php in trunk...

Bryan.TongMinh wrote:

I've started implementing the required FileRepo changes in r85635. I've heavily modified your patch to match our coding standards. I have moved most logic from LocalFile to OldLocalFile, because that means that we can use things like $this->getRel() etc.

I will next look at the multipart/related parts of the patch.

(In reply to comment #14)

Thanks everybody for remarks :)

(In reply to comment #13)

I retract that. The timestamp part of the filename is the timestamp when the
file was moved into the archive, not the timestamp when the file was originally
uploaded.

Yeah, originally it is. But I personally don't understand the practical sense
of this. The time when the file was moved into the archive is usually equal to
or 1-2 second relative to the time when a new version of the same file was
uploaded. So each version contains timestamp of another, moreover, not always
an accurate one.
I think versions should be independent, and should not form such "linked
lists".

Perhaps, but changing this may lead to unexpected breakage, which I don't feel inclined to search for.

Oh, thanks, sorry for the delay. I've seen your changes now.
One remark: I've also seen that base64 export was added. I understand this is for the sake of backwards compatibility, but it's not very optimal. Maybe leave it as an option?

Perhaps, but changing this may lead to unexpected breakage,
which I don't feel inclined to search for.

What, for example? Or is it so completely :) unexpected that you have no idea about it?


Also, I recently made some changes to the page selection mechanism. The use cases were:

  • deny export of pages marked with some category;
  • allow "incremental" export, i.e. correct export of pages which have changed since a given time.

So the changes were:

  • moved the selection of used images, templates and page links into the page selection box (and rewrote the code to optimize it - titles are no longer loaded one by one); export now exports ONLY the pages that are in the textbox, and link selection is done using "Add pages";
  • added a "NOT-category" filter for the page selection box ("Add pages:"), which is applied AFTER the selection of image, template and page links, and which applies to the full list in the textbox, not only to the added pages;
  • changed the behaviour of the modification date filter - it is now also applied after everything else, just like "NOT-category". This makes it possible to build page lists for incremental replication/export.
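A rough sketch of that selection order, with invented names and a toy in-memory page table; the point is that the NOT-category and modification-date filters run over the fully expanded list, after the "pagelinks closure":

```python
# Hypothetical sketch of the page-selection pipeline described above.
# pages: dict name -> {'links': set of linked pages,
#                      'categories': set of category names,
#                      'touched': last-modified stamp (here just an int)}

def select_pages(seed_pages, expand_links, not_category, modified_since, pages):
    selected = set(seed_pages)
    if expand_links:
        frontier = set(selected)
        while frontier:                      # "pagelinks closure"
            nxt = set()
            for p in frontier:
                nxt |= pages[p]['links'] - selected
            selected |= nxt
            frontier = nxt
    # The deny-category and date filters are applied AFTER link expansion,
    # over the whole list, not just the initially added pages.
    if not_category is not None:
        selected = {p for p in selected if not_category not in pages[p]['categories']}
    if modified_since is not None:
        selected = {p for p in selected if pages[p]['touched'] >= modified_since}
    return selected

pages = {
    'A': {'links': {'B'}, 'categories': set(),       'touched': 10},
    'B': {'links': set(), 'categories': {'Private'}, 'touched': 5},
    'C': {'links': set(), 'categories': set(),       'touched': 1},
}
```

Running the date filter last is what makes incremental export correct: a page pulled in through links is still dropped if it has not changed since the given time.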

All this can be seen on http://wiki.4intra.net/, for example.


I'll try to rebase the patch against the new trunk with your WIP.

Bryan.TongMinh wrote:

(In reply to comment #20)

Oh, thanks, sorry for the delay. I've seen your changes now.
One remark: I've also seen there is now base64 export added? I understand this
is for the sake of backwards compatibility, but it's not very optimal. Maybe
leave it as an option?

I don't understand. The base64 is backwards compatible if you parse it with an XML parser.

Perhaps, but changing this may lead to unexpected breakage,
which I don't feel inclined to search for.

What for example? Or is it completely :) unexpected so you have no idea about
it?

I have absolutely no idea what would break, but I know MediaWiki well enough that something will break.

I'll try to rebase the patch against the new trunk with your WIP.

That's fine. However, I don't think that I will have time to review and commit the patch, so somebody else should review it then.

Compatible, but it increases the size :)

I have absolutely no idea what would break

So you won't change it at all?

Conformance to the site style
http://www.mediawiki.org/wiki/Manual:Coding_conventions
is a requirement for core code contributions.

I'm not sure what the point is of this multipart/related document type: is there some client that can read it? The exports appear to be sent out with HTTP headers specifying Content-Type: application/xml, but with a body that is plainly not XML, rather it is this MIME message format with headers embedded in the body text.

It seems like zip would have been a better choice. The wording of the messages which enable this multipart format certainly don't warn the user that what they are going to get will be so exotic.
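To illustrate the point about ZIP: a ZIP container keeps the XML and the binary uploads as ordinary archive members that any standard tool or library can read, unlike an ad-hoc MIME body served as application/xml. A minimal sketch (the member names are invented, not from the patch):

```python
import io
import zipfile

# Build a ZIP-based dump in memory: the XML dump plus binary upload parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("Revisions.xml", "<mediawiki>...</mediawiki>")
    zf.writestr("uploads/Example.png", b"\x89PNG...")

# Any consumer can list and extract the parts with off-the-shelf tooling.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
    xml = zf.read("Revisions.xml").decode()
```

The XML member stays valid XML on its own, so existing dump parsers keep working; the binary parts live beside it instead of being spliced into the response body.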

What does this:

/*op-patch|TS|2010-04-26|HaloACL|SafeTitle|start*/

signify? It is wrapped around some strange-looking code. Is it meant to be there?

The 5000-page limit in SpecialExport::getPagesFromCategory() appears to have been removed without explanation. Also, SpecialExport::rgetPagesFromCategory() appears to have no recursion depth limit.

It's probably feasible to use the upload timestamp of the old image version to generate the archive name, but it would need to be done with a bit more care and analysis than I see here. For example, LocalFileRestoreBatch::execute() appears to need patching. Deployment would be difficult since it would break if MediaWiki is downgraded or run with multiple versions on the same database.

It would be simpler, and head off potential future problems with analysis of image filesystems, if the archive names could be changed to a completely different regime, adding some marker to show whether they were generated the old way or the new way.

(In reply to comment #24)

Conformance to the site style
http://www.mediawiki.org/wiki/Manual:Coding_conventions
is a requirement for core code contributions.

I'll run my modifications through the tool.

I'm not sure what the point is of this multipart/related document type: is
there some client that can read it? The exports appear to be sent out with HTTP
headers specifying Content-Type: application/xml, but with a body that is
plainly not XML, rather it is this MIME message format with headers embedded in
the body text.

It seems like zip would have been a better choice. The wording of the messages
which enable this multipart format certainly don't warn the user that what they
are going to get will be so exotic.

Oh, thanks. I hadn't thought about ZIP. I agree that multipart is very exotic. For some reason it was the first "archive" type I thought of.

What does this:

/*op-patch|TS|2010-04-26|HaloACL|SafeTitle|start*/

signify? It is wrapped around some strange-looking code. Is it meant to be
there?

No... It's from another patch (an access control extension). Sorry.

The 5000-page limit in SpecialExport::getPagesFromCategory() appears to have
been removed without explanation. Also, SpecialExport::rgetPagesFromCategory()
appears to have no recursion depth limit.

I think any hard-coded limits are bad, and these were hard-coded... I just removed them, as that seemed the easiest way.

A recursion limit is harmful for small/intranet MW installations - users often create deep category hierarchies and want to export them :) and there are relatively few pages, so performance isn't a concern. The 5000-page limit is harmful, for example, for exporting the full wiki content.

These limits should probably be parametrized via $wgSomething variables...

It's probably feasible to use the upload timestamp of the old image version to
generate the archive name, but it would need to be done with a bit more care
and analysis than I see here. For example, LocalFileRestoreBatch::execute()
appears to need patching. Deployment would be difficult since it would break if
MediaWiki is downgraded or run with multiple versions on the same database.

It would be simpler, and head off potential future problems with analysis of
image filesystems, if the archive names could be changed to a completely
different regime, adding some marker to show whether they were generated the
old way or the new way.

Very reasonable.

I'll change my modifications according to these remarks. Zip is really a very good idea :)

sumanah wrote:

Vitaliy, have you had time to revise your patch?

Sorry, I didn't see the comments - dnsrbl.net has gone mad and was blocking all email on my server :)

Not fully - I've only implemented ZIP support so far, which required reworking the writers/filters/readers, so now I have several questions for the devs:

  1. What was the original idea behind the Dump*Filters? In my opinion it's absolutely irrational to filter pages AFTER loading them from the DB, plus DumpFilters weren't actually used in export, so I've removed them - is that OK?
  2. Also, I don't truly understand the idea of having separate write* methods (writeOpenPage, writeClosePage, etc.) in DumpOutput. Was it supposed to allow easy switching of the underlying stream format from XML to something else? I think it's useless, as there is no other dump format by now :) so it isn't designed for the real needs of another format. So it's also removed :)
  3. Also, in the new version of the patch I always use temporary files to read/write dump archives, so support for streaming gzip/bzip2/etc. filters is also removed. But in my opinion that's not a big problem either - there's support for ZIP anyway.
  4. Currently, the import code doesn't restore recent changes for page edits/creations. I think it should restore them for consistency, and I've implemented that - is that OK?
  5. Also, I don't fully understand the purpose of setting revision callbacks to methods of self in WikiImporter :) what is that for?
  6. ImportStringSource - was this ever supposed to actually be used anywhere?

Patch for archive filenames

I see that changes to the import XML reader have happened in the meantime,
so I need to rebase my patch against trunk again.

Additionally, I still want to add archive support instead of (or in addition to) base64 encoding inside the XML, because the latter increases file size by about 33%, which could be noticeable even on medium installations :)
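The size figure follows from base64 itself: every 3 input bytes become 4 output characters, so the overhead is 1/3 ≈ 33% (before any line wrapping adds more). A quick check:

```python
import base64

# 76,800 bytes of arbitrary binary data (divisible by 3, so no padding).
payload = bytes(range(256)) * 300
encoded = base64.b64encode(payload)

# 3 input bytes -> 4 output characters, so the overhead is exactly 1/3 here.
overhead = len(encoded) / len(payload) - 1
```

With MIME-style line wrapping every 76 characters the overhead grows a bit further, which is why an archive container is more attractive for dumps full of images.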

For now, I just want to submit the patch for archive file names. I suggest a prepended 'T' letter, for the backwards compatibility concern Tim noted.
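A sketch of that marker idea (the regex details are invented for illustration): a leading 'T' lets code tell names generated under the new scheme apart from old-style ones:

```python
import re

# Old-style archive names embed the ARCHIVING timestamp; new-style names
# would embed the version's own upload timestamp and carry a 'T' marker.
OLD_RE = re.compile(r'^(\d{14})!(.+)$')      # <archiving ts>!<name>
NEW_RE = re.compile(r'^T(\d{14})!(.+)$')     # T<upload ts>!<name>

def parse_archive_name(archive_name):
    """Return (regime, timestamp, file name) for an archive file name."""
    m = NEW_RE.match(archive_name)
    if m:
        return ('new', m.group(1), m.group(2))
    m = OLD_RE.match(archive_name)
    if m:
        return ('old', m.group(1), m.group(2))
    raise ValueError("unrecognized archive name: " + archive_name)
```

This addresses Tim's downgrade concern: code that sees no 'T' knows the timestamp means archiving time, so mixed repositories remain interpretable.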

Attached:

Regarding "A maintenance tool to change oldimage filenames": that really needs rewriting to use our Maintenance classes.

Legacy code not using them isn't so bad, but adding new files that don't isn't a good way forward.

I agree - it needs changing for that 'T' letter anyway :)

ZIP support + rebase the patch to r104165

The new version of patch:

  • Rebased to r104165... (again and again...)
  • ZIP support, used instead of multipart/related by default. The old multipart/related is still supported, partly as an example of multiple archive formats
  • Added reporting of imported upload count
  • Recent changes are now restored during import (so the post-import state looks almost exactly the same as in the source wiki)
  • FileRepo part does not need changes anymore, thanks to Bryan Tong Minh! :)
  • Fixed a bug in trunk - sha1 was compared with sha1base36 in Import.php
  • A note about addPages() / addPagesExec() in SpecialExport: they're static because they are useful for MW extensions, for example for a BatchEditor extension we have.

I understand that the code maybe isn't perfect... but then, previous versions of MediaWiki don't have perfect code in the export/import part either. :)

And one more thing about backwards compatibility, ZIP, etc.: all efforts to maintain backwards compatibility with MW < 1.18 are useless anyway, because the old XML parser didn't allow differences in the element hierarchy - even if the only difference is a single added element, it bails out with strange errors.

So I still suggest archives as the primary "dump format", not the XML with base64-encoded parts :)

attachment Custis-Import-Export-r104165.diff ignored as obsolete

It would be nice to get it in for you before the patches go stale again!

Patch for svn 114100

Rebased the patch again...
Please review it, someone!

attachment MW4Intranet-ImportExportPatch-svn114100.diff ignored as obsolete

Updated maintenance tool

Attached:

Updated patch for svn 114100

Updated the patch (one small bugfix: use $title[0] instead of $this->title in the error message).

Attached:

sumanah wrote:

Vitaliy, just wanted to check in and suggest that you get a Git/Gerrit account (see https://www.mediawiki.org/wiki/Git/Workflow)! That'll make it easier to get your patches reviewed.

OK, I've pushed it to Gerrit.

  1. Change old image timestamps:

https://gerrit.wikimedia.org/r/#/c/34338/

  2. Quick fix for Bug 37209 - without it importing doesn't work:

https://gerrit.wikimedia.org/r/#/c/34339/

  3. Main part of import/export patch:

https://gerrit.wikimedia.org/r/#/c/34340/

dasch wrote:

When improving import, maybe somebody could consider looking at bug 22077.
Just an idea :)

By the way, I think I'll try to move these import/export changes into an extension.

The only precondition is this patch:
https://gerrit.wikimedia.org/r/#/c/34338/
(change old image timestamps to include their own version, not the archiving timestamp)

Most of the other changes can be easily extracted into an extension. I think it will be simpler to install and use, and it won't require a long review process for my codebomb :)

The only precondition is this patch:
https://gerrit.wikimedia.org/r/#/c/34338/
(change old image timestamps to include their own version, not the archiving timestamp)

It seems there is disagreement about how to solve a particular problem. I recommend continuing the conversation on Phabricator and clarifying the specific problem this would address. Depending on the scope of the current task (T24881), it might help to file a new subtask for the specific problem.

From a quick glance, the image/oldimage schema is highly outdated and needs work, but I think it would help to treat that as an orthogonal issue. The name of the file on disk should not matter from the perspective of export/import; it may be numerical, or hashed, or otherwise decided by the individual FileBackend.

Would it be possible to export based on the timestamp in the database (not the file on disk), and when importing, perform your duplication check based on that same timestamp (and content hash) as it exists in the database? (Instead of comparing files on disk.)
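A minimal sketch of that duplication check, with invented field names mirroring the oi_timestamp and oi_sha1 database columns; nothing here touches on-disk file names:

```python
# Hypothetical sketch of database-driven duplicate detection for imported
# file revisions: a version is a duplicate if a local version already has
# the same upload timestamp AND the same content hash.

def is_already_imported(incoming, local_versions):
    """incoming: dict with 'timestamp' and 'sha1' keys (as stored in the DB).
    local_versions: list of such dicts for the local file's history."""
    return any(v['timestamp'] == incoming['timestamp'] and
               v['sha1'] == incoming['sha1']
               for v in local_versions)
```

Keyed this way, the check is stable across wikis regardless of how each FileBackend chooses to name or hash files on disk.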

In my humble opinion, this task needs a problem description, plus it needs to be broken into smaller, isolated problems and related patches (that could depend on each other): see https://www.mediawiki.org/wiki/Gerrit/Code_review/Getting_reviews#Write_small_commits

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/34340/ changes ~1500 lines, which makes it basically unreviewable, so I propose abandoning it, as it has not seen updates for five years either. The patch author and task author was last active in September 2016.