
Merge items in Wikidata
Closed, Resolved (Public)

Description

There should probably be a special page for merging items. It should take two identifiers and return a new object, possibly after some manual operation on the items.


Version: unspecified
Severity: major
Whiteboard: u=dev p=0 c=backend
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=38790
https://bugzilla.wikimedia.org/show_bug.cgi?id=38795

Details

Reference
bz38664

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 1:00 AM
bzimport set Reference to bz38664.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 38682 has been marked as a duplicate of this bug.

I recall that we concluded that, while merging of items should be possible, splitting of items should not be.

However, I believe I have a use case: erroneously merged items. How could they be split back?

It would be nice to hear some of the arguments against splitting an item; it's not that uncommon to refine a page to be more specific. Merging is mostly for fixing an error condition (misnamed entities on the Wikipedias lead to duplicate new items) and only secondarily an editorial decision, while splitting seems to be mostly an editorial decision and only secondarily a fix for an error condition.

(In reply to comment #2)

> I recall that we concluded that, while merging of items should be possible,
> splitting of items should not be.
>
> However, I believe I have a use case: erroneously merged items. How could they
> be split back?

Merging items A and B means: transfer properties from A to B and turn A into a redirect. Simply reverting these two edits will un-merge the items.
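A minimal sketch of the merge and un-merge just described, assuming items are plain dictionaries with `sitelinks` and `claims` keys held in an in-memory `store`. The names `merge_items` and `unmerge` and the data layout are hypothetical, chosen only for illustration; this is not Wikibase code, and conflict handling is reduced to "the target's existing value wins".

```python
from copy import deepcopy


def merge_items(store, source_id, target_id):
    """Copy data from the source item into the target, then turn the
    source into a redirect. Returns snapshots of both items so that the
    two edits can be reverted later."""
    source = store[source_id]
    target = store[target_id]
    # Snapshots taken before either item is touched (assumed undo model).
    undo = {source_id: deepcopy(source), target_id: deepcopy(target)}

    for site, title in source["sitelinks"].items():
        # Assumption: an existing sitelink on the target is kept as is.
        target["sitelinks"].setdefault(site, title)

    for prop, values in source["claims"].items():
        merged = target["claims"].setdefault(prop, [])
        for value in values:
            if value not in merged:
                merged.append(value)

    store[source_id] = {"redirect": target_id}
    return undo


def unmerge(store, undo):
    """Revert both edits of a merge by restoring the saved snapshots."""
    store.update(deepcopy(undo))
```

Reverting is cheap here because the merge is modelled as exactly two edits, one per item.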

Splitting items will always be possible manually, but I think we should support it. Basically, you would see a screen with all the properties, where you can choose whether each property should be kept only in A, only in B, or in both. The same goes for the sitelinks.
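The per-entry choice on such a split screen could be modelled roughly as follows. This is only a sketch: `split_entries` and the `"A"` / `"B"` / `"both"` choice values are hypothetical names, and entries without an explicit choice are assumed to stay in A.

```python
def split_entries(entries, choices):
    """Distribute the entries of one item over two items.

    `entries` maps a key (property ID or site ID) to its value;
    `choices` maps a key to "A", "B", or "both".
    Keys without a choice default to "A" (an assumption of this sketch).
    """
    a, b = {}, {}
    for key, value in entries.items():
        where = choices.get(key, "A")
        if where not in ("A", "B", "both"):
            raise ValueError(f"unknown choice {where!r} for {key}")
        if where in ("A", "both"):
            a[key] = value
        if where in ("B", "both"):
            b[key] = value
    return a, b
```

The same function would be called once for the properties and once for the sitelinks.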

In most cases you use the merge feature because you discover an article that had no langlinks before but belongs to an existing group (item).

A simpler way to add a split feature would be to allow duplicating items (without sitelinks). Then you only have to move some of the sitelinks from the old item to the new item in a second step (split item = duplicate item + reconnect some sitelinks to the new item).
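As a rough illustration of "split item = duplicate item + reconnect some sitelinks", assuming the same kind of dictionary-based items in an in-memory `store` (the function names and the layout are hypothetical):

```python
from copy import deepcopy


def duplicate_item(item):
    """Return a copy of the item that carries no sitelinks."""
    clone = deepcopy(item)
    clone["sitelinks"] = {}
    return clone


def split_item(store, old_id, new_id, sites_to_move):
    """Step 1: duplicate the old item without sitelinks.
    Step 2: move the selected sitelinks from the old item to the copy."""
    new_item = duplicate_item(store[old_id])
    for site in sites_to_move:
        new_item["sitelinks"][site] = store[old_id]["sitelinks"].pop(site)
    store[new_id] = new_item
    return new_item
```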

The bug says merging of two items, but note that several items can be merged in a chain, two at a time, until all are merged.

Merging of two items also (I believe) implies merging of the history, or logging that one of the items is now a redirect but contains part of the history.

Merging histories would be horrible, the result utterly misleading. You can't do this with linear histories, that's an absolute no-no. We *could* maintain the history as a directed graph with multiple parents and children per revision. That would be cool, but it's really unrelated to Wikidata. This would also be useful for text.

Present merging of histories is horrible, but people expect it to happen. I don't know why; it's useless. All I want is a log entry that says "at this point continue there".

History and logging of such things could be done outside of Wikidata as you said, and probably should be. It's about the page history and should not be part of the item, which just happens to be stored on the page. And it would be really nice to be able to add such log entries after somebody has reused content by cut'n'paste.

Gah.. Probably need another bug for that... ;/

I'm poking at this now. Here is what I intend to do:

  • Modifications must always be based on the revision that is indicated as the base revision. I.e. if ApiModifyEntity gets a baserevid parameter, it must load that version of the entity and then modify that, not the current revision.
  • If the current revision is the base revision, or there is no base revision, then just save the new content as is (no conflict).
  • Otherwise, EditPage creates a patch by generating a diff from the base content to the new content (the content provided by the caller).
  • Then, a "clean" patch is generated by removing all changes from the patch that conflict with (are not applicable to) the current revision.
  • If the clean patch is different from the original patch, there is a conflict.
  • If there is a conflict but the current user was the only one to edit since the base revision, the conflict is ignored. Otherwise, saving is aborted.
  • If saving was not aborted, we now have either a clean patch, or a patch with conflicts against the user's own edits.
  • Apply that patch to the current revision to get a fresh version of the new content which has all the intended changes performed on top of the current revision instead of the base revision (in git terms, this is a rebase).

This should get us a clean result.
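The steps listed above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual EditPage or ApiModifyEntity code: content is a flat dictionary, a patch is a mapping from key to an (old value, new value) pair, "current revision is the base revision" is approximated by comparing content, and all function names are hypothetical. In the case where the only intervening edits are the user's own, the sketch applies the original (conflicting) patch, which is one reading of "the conflict is ignored".

```python
_MISSING = object()  # marks "key absent" inside a patch entry


class EditConflict(Exception):
    """Raised when saving has to be aborted."""


def make_patch(base, new):
    """Diff from base content to new content: key -> (old, new)."""
    patch = {}
    for key in base.keys() | new.keys():
        old, val = base.get(key, _MISSING), new.get(key, _MISSING)
        if old != val:
            patch[key] = (old, val)
    return patch


def clean_patch(patch, current):
    """Drop every change that is not applicable to the current revision,
    i.e. whose expected old value no longer matches."""
    return {
        key: (old, val)
        for key, (old, val) in patch.items()
        if current.get(key, _MISSING) == old
    }


def apply_patch(patch, content):
    """Apply the changes of a patch on top of the given content."""
    result = dict(content)
    for key, (_, val) in patch.items():
        if val is _MISSING:
            result.pop(key, None)
        else:
            result[key] = val
    return result


def rebase_edit(base, current, new, only_own_edits_since_base=False):
    """Return the content to save, rebased onto the current revision."""
    if base is None or base == current:
        return new  # no base revision, or nothing changed: no conflict
    patch = make_patch(base, new)
    clean = clean_patch(patch, current)
    if clean != patch:
        if not only_own_edits_since_base:
            raise EditConflict("edit conflicts with the current revision")
        clean = patch  # conflict only with the user's own edits: ignored
    return apply_patch(clean, current)
```

For example, if someone else changed the description while the user changed the label, `rebase_edit` keeps both changes; if both changed the description, it raises `EditConflict` unless the intervening edits were the user's own.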

Oops... confused this with bug 39836. Ignore comment #10 and #11. Unassigning for now.

On the project chat I suggested a 'Move' item be added to the link edit menu and this was to be as a local widget.

This would get us this merge function on an item by item basis which is probably better than doing an entire page.

(In reply to comment #13)

> On the project chat I suggested a 'Move' item be added to the link edit menu
> and this was to be as a local widget.

"move" would mean changing the ID of an item... how would that help?

> This would get us this merge function on an item by item basis which is
> probably better than doing an entire page.

Hm? But each item is described by a page? I don't understand your suggestion.

The 'Move' action would be on each sitelink 'edit' menu. Not on the whole page. I should have said "on a sitelink by sitelink basis"; not "on an item by item basis". My mistake.

To merge two items you would 'move' each sitelink on one page to the other page, one at a time.

Where only a few sitelinks need to be moved, you would move only those.

How it would work:

Sitelink: select 'Edit' then 'Move'

Submenu appears asking for the ID of the destination page.
Submenu also shows the Label and Description in the Sitelink language with option to select either or both to move also.

Select Save on submenu to delete the sitelink from this page and add it to the other page.
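A sketch of what saving that submenu might do, assuming dictionary-based items with `sitelinks`, `labels`, and `descriptions` keys. The function name, its parameters, and the rule that an existing sitelink on the destination blocks the move are all assumptions made for illustration; the language of the sitelink is passed in explicitly rather than derived from the site ID.

```python
def move_sitelink(store, source_id, target_id, site, lang,
                  move_label=False, move_description=False):
    """Delete one sitelink from the source item and add it to the target,
    optionally taking the label and/or description in the sitelink's
    language along."""
    source, target = store[source_id], store[target_id]
    if site in target["sitelinks"]:
        # Assumption: never silently overwrite a sitelink on the target.
        raise ValueError(f"{target_id} already has a sitelink for {site}")
    target["sitelinks"][site] = source["sitelinks"].pop(site)
    if move_label and lang in source["labels"]:
        target["labels"][lang] = source["labels"].pop(lang)
    if move_description and lang in source["descriptions"]:
        target["descriptions"][lang] = source["descriptions"].pop(lang)
```

Merging two items this way would mean calling the function once per sitelink.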

That would not be a merge operation but a move of individual entries in the item. A merge is rather high level, and each sitelink, and later on each statement, would be transferred. That would probably also imply some way to mark some entries as not to be transferred, and a way to set up the redirect.

It could be that a cut/copy/paste solution would be simpler, but it would also be very slow for large items. And then we need a way to change an item into a redirect.

Yes. It would be a move rather than a true merge.

Yes cut/copy/paste is very slow for items with a lot of links. This move function will help speed it up while we are waiting for a true merge function to be developed.

Do we need a redirect? Wikidata is CC0 so we don't have to keep a reference to the original contributor. As I have been merging via cut/copy/paste I have just asked for the item I merged from to be deleted with the justification "merged to Q????".

Mostly I just do this for pages with small numbers of links because it is such a pain to do for pages with lots of links. When I get this move tool I will do the pages with lots of links.

The redirect is to maintain link structure.

(In reply to comment #17)

> Do we need a redirect? Wikidata is CC0 so we don't have to keep a reference
> to the original contributor.

Yes, we do need the redirect. We don't need the redirect for *legal* reasons. But Wikidata is an entity base, that is, it provides stable identifiers for entities (concepts, data items). These IDs should stay valid indefinitely, otherwise Wikidata becomes a lot less useful to 3rd party applications.

soulkeeper.wikipedia wrote:

How many duplicates/merge candidates exist on Wikidata currently? I don't want to guesstimate, but I'm worried that the number is overwhelming.

In any case, the longer each merge candidate lingers, the more work it becomes. If we are not to be swamped by this, I believe we need to get some efficient merge tools, the sooner the better. IMHO.

Setting to major. We really need to move forward with this.

To sum up an internal discussion we had on this a couple of weeks back:

This is stalled because of all the edge cases it introduces. It seems impossible to foresee and cover them all in advance. So it's probably best to provide a very simple base line implementation, and observe and improve from there. The baseline should include:

  • A special page for turning one item into a redirect to another. Data in the item is, for now, lost (though of course still available via the page history). Actual merging of item data can be added later.
  • There should be no way to create or change a redirect via the API.
  • Any attempt to edit a redirect should cause an error. The only exceptions would be rollback/restore (but probably not undo) actions that remove the redirect-ness from the item.
  • Special:EntityData should automatically resolve redirects.
  • Item redirects should be presented to MediaWiki as redirects through the ContentHandler facility.
  • MediaWiki should be able to automatically resolve double redirects in the usual manner.

Quite a bit of desirable functionality and unclear edge cases remain, but the above should be easy enough to implement, and should provide a baseline for testing and tentative live use.
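To make the baseline behaviour concrete, here is a small in-memory sketch. It is not Wikibase or ContentHandler code; the `store` and `history` dictionaries and all function names are hypothetical. It shows an item being turned into a redirect (with the old data kept only in the history), edits to a redirect being rejected, lookups resolving redirects, and double redirects being collapsed.

```python
class RedirectError(Exception):
    """Raised on an attempt to edit a redirect or on a redirect loop."""


def make_redirect(store, history, source_id, target_id):
    """Turn an item into a redirect; its data survives only in history."""
    history.setdefault(source_id, []).append(store[source_id])
    store[source_id] = {"redirect": target_id}


def resolve(store, item_id):
    """Follow redirects until a real item is reached."""
    seen = set()
    while "redirect" in store[item_id]:
        if item_id in seen:
            raise RedirectError(f"redirect loop at {item_id}")
        seen.add(item_id)
        item_id = store[item_id]["redirect"]
    return item_id


def edit_item(store, item_id, changes):
    """Apply changes to an item; editing a redirect is an error."""
    if "redirect" in store[item_id]:
        raise RedirectError(f"{item_id} is a redirect and cannot be edited")
    store[item_id].update(changes)


def fix_double_redirects(store):
    """Point every redirect directly at its final target."""
    for entry in store.values():
        if "redirect" in entry:
            entry["redirect"] = resolve(store, entry["redirect"])
```

Restoring an old revision from `history` would be the way to remove the redirect-ness again, in line with the rollback/restore exception listed above.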

nzmoihue wrote:

I made a test version of merge.js using that API, see http://test.wikidata.org/wiki/MediaWiki:Gadget-Merge.js but I hope a gadget will be developed inside Wikibase itself, as merging is a very common action on Wikibase.

nzmoihue wrote:

Also, as I am on wikibreak, feel free to change/override/develop the merge.js gadget.

nzmoihue wrote:

Also interesting: what happened here? It seems the merge API didn't add anything to Q194 but did remove data from Q7: http://test.wikidata.org/w/index.php?title=Q7&action=history http://test.wikidata.org/w/index.php?title=Q194&action=history