Page MenuHomePhabricator

Merge the Wikidata ContentHandler branch
Closed, ResolvedPublic

Description

A new ContentHandler class as well as other modifications currently reside in the Wikidata branch here:

https://gerrit.wikimedia.org/r/#/q/status:merged+project:mediawiki/core+branch:Wikidata,n,z

This work needs to go through one or more review/iteration cycles, and then merged into master for deployment. This bug will be used to track the conversation and ownership. (Step 1: Tim will outline the steps that he believes need to happen prior to merge, which previously may have only been captured in IRC.)


Version: 1.20.x
Severity: normal

Details

Reference
bz38622

Related Objects

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:57 AM
bzimport set Reference to bz38622.

From Daniel Kinzler:

Things got a bit stuck for a bit over the last 4 weeks, because I was busy
traveling to Wikimania and having a look at New York. I have picked up work on
the branch last week, merged in the latest master and did some refactoring that
Tim requested (moved several functions from ContentHandlers to Content).

It would be really helpful to get this review done soon, so we can code the
extension against the actual current master. Also, several other projects, like
Geodata, could benefit a lot from the ContentHandler stuff.

Full message:
http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/62552

In June, I suggested a couple of changes, which don't seem to have been implemented yet:

  • Make $wgContentHandlers purely configuration, store cache separately
  • Use strings for the model and format instead of integers. Store as strings in the database except where they are equal to the page default, in which case null should be stored.

(In reply to comment #2)

In June, I suggested a couple of changes, which don't seem to have been
implemented yet:

Both have been implemented in June:

commit 906a1ba51f149d659e270597e4b964cc8c550357
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jun 25 23:30:51 2012 +0200

[bug 37746] string ids for content model and format.

The content model is stored as a varbinary(32), the format
as varbinary(64).

If the standard model resp. format is used, null is written
to the database instead of the actual id, saving space.

Change-Id: I32659b49a9ad3cb8ecae9019562cff7de42b65f9

commit 38828a93a17e63abb85442c90f04f6cac20034ec
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Thu Jun 14 12:33:16 2012 +0200

Use class-static var for storing ContentHandler singletons.

$wgContentHandlers will be used for configuration only, and
is no longer changed during operation.

It's no longer possible to supply instances directly in $wgContentHandlers,
it only takes class names now.

Change-Id: Ifafb08d8638eff0d6965cd92fadbd9071c74de37

I also moved some functionality from ContentHandler to COntent, as you requested:

commit 8d280dde9b405880d9d4a3a724c6b0872ef03dc1
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 23:52:34 2012 +0200

Moved getParserOutput to Content interface.

On Tim's request, this change moved getParserOutput() and getSecondaryDataUpdates()
from the ContentHandler to the Content interface.

Change-Id: Ia654aa8710a242ba5fe7a4eb528e6a6449035f59

commit c8e633f1e38d8fee671fc5453536adccaee616be
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 22:54:25 2012 +0200

moved getDeletionUpdates to Content interface

Change-Id: I1b46b1d663f8efd609aeb1b63cb07ee1a0a00c33

I can't find any of those revisions in the Gerrit web UI. I can see 38828a9 in the history of the Wikidata branch, but the other two aren't there. What command did you use to push them? Maybe you pushed them directly into refs/heads/Wikidata and then dereferenced them with a rebase.

Bah, I hate gerrit. I did not disconnect or rebase anything. When I do a fresh clone of core and checkout the Wikidata branch, these commit are there. I have no idea why Gerrit does not find them. I can't see them on gitweb either, which really confuses me.

I did push to the branch directly, yes. I was working on it by myself, and when I started, it was an SVN branch. So I just kept working like this. Perhaps it would be better to do all changes via gerrit from now on.

If you think that would be the way to go, just disable direct push. However, I may end up self-approving edits, because there's really nobody to review this stuff (well, except you I guess).

so bugzilla craft an URL to search the commit in Gerrit. If the commit was pushed directly to the repo (bypassing Gerrit), it does not know anything about the sha1.

(In reply to comment #6)

so bugzilla craft an URL to search the commit in Gerrit. If the commit was
pushed directly to the repo (bypassing Gerrit), it does not know anything about
the sha1.

Yea, but why doesn't gitweb find it? search doesn't turn it up, and I can't find it in the branche's log either. Confusing.

(In reply to comment #3)

(In reply to comment #2)

In June, I suggested a couple of changes, which don't seem to have been
implemented yet:

Both have been implemented in June:

commit 906a1ba51f149d659e270597e4b964cc8c550357
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jun 25 23:30:51 2012 +0200

[bug 37746] string ids for content model and format.

The content model is stored as a varbinary(32), the format
as varbinary(64).

If the standard model resp. format is used, null is written
to the database instead of the actual id, saving space.

Change-Id: I32659b49a9ad3cb8ecae9019562cff7de42b65f9

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commitdiff;h=906a1ba51f149d659e270597e4b964cc8c550357

commit 38828a93a17e63abb85442c90f04f6cac20034ec
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Thu Jun 14 12:33:16 2012 +0200

Use class-static var for storing ContentHandler singletons.

$wgContentHandlers will be used for configuration only, and
is no longer changed during operation.

It's no longer possible to supply instances directly in $wgContentHandlers,
it only takes class names now.

Change-Id: Ifafb08d8638eff0d6965cd92fadbd9071c74de37

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commitdiff;h=38828a93a17e63abb85442c90f04f6cac20034ec

I also moved some functionality from ContentHandler to COntent, as you
requested:

commit 8d280dde9b405880d9d4a3a724c6b0872ef03dc1
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 23:52:34 2012 +0200

Moved getParserOutput to Content interface.

On Tim's request, this change moved getParserOutput() and

getSecondaryDataUpdates()

from the ContentHandler to the Content interface.

Change-Id: Ia654aa8710a242ba5fe7a4eb528e6a6449035f59

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commitdiff;h=8d280dde9b405880d9d4a3a724c6b0872ef03dc1

commit c8e633f1e38d8fee671fc5453536adccaee616be
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 22:54:25 2012 +0200

moved getDeletionUpdates to Content interface

Change-Id: I1b46b1d663f8efd609aeb1b63cb07ee1a0a00c33

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commitdiff;h=c8e633f1e38d8fee671fc5453536adccaee616be

I'm having the same problem as Tim. I can see:

commit 906a1ba51f149d659e270597e4b964cc8c550357
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jun 25 23:30:51 2012 +0200

...and

commit 38828a93a17e63abb85442c90f04f6cac20034ec
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Thu Jun 14 12:33:16 2012 +0200

...but I can't see:

commit 8d280dde9b405880d9d4a3a724c6b0872ef03dc1
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 23:52:34 2012 +0200

...or:

commit c8e633f1e38d8fee671fc5453536adccaee616be
Author: daniel <daniel.kinzler@wikimedia.de>
Date: Mon Jul 23 22:54:25 2012 +0200

(In reply to comment #10)

I'm having the same problem as Tim. I can see:

Did you try any of my links? (from the BZ email notif not the web UI)

I just tried them all and they all worked.

(In reply to comment #10)

I'm having the same problem as Tim. I can see:

see *where*? Bugzilla generates links to Gerrit. Gerrit doesn't know these commits because they did not go in via Gerrit. They are in Git though, and in GitWeb (if you can find them - use jeremy's links).

To reiterate: to review this branch, forget about gerrit. just clone core and change check out the Wikidata branch. Then check the log for the changes i mentioned. If they are not there, let me know.

Alright, my apologies, now I see them. For whatever reason, my local copy didn't have the latter two revs, even though I had fetched (or so I thought). I nuked that repo, checked out a fresh copy, and now I see everything. I'll pester Tim to look at this.

I could see those revisions once I deleted the local branch and recreated it. Here's a few review comments for you.

There are still a lot of long lines. Pretty much every time there is a comment on the end of a line, it makes the line too long, so you should probably not do that.

What are the changes to DairikiDiff.php for? The code changes don't match the commit message.

In DifferenceEngine.php:

} elseif( Hooks::isRegistered( 'ArticleViewCustom' )

		&& !wfRunHooks( 'ArticleViewCustom', array( ContentHandler::getContentText( $this->mNewContent ), $this->mNewPage, $out ) ) ) { #NOTE: deperecated hook, B/C only

This is a common pattern. I suggest you provide a wrapper function for running legacy hooks so that code like this will be shorter and more legible. Something like:

ContentHandler::runLegacyHook( 'ArticleViewCustom', array( $this->mNewContent,
$this->mNewPage, $out ) )

with runLegacyHook() something along the lines of

function runLegacyHook( $hookName, $args ) {

foreach ( $args as $key => $value ) {
    if ( $value instanceof Content ) {
        $args[$key] = ContentHandler::getContentText( $value );
    }
}
return wfRunHooks( $hookName, $args );

}

#XXX: generate a warning if $old or $new are not instances of TextContent?
#XXX: fail if $old and $new don't have the same content model? or what?

These are good questions, and there are many more fixme comments like them throughout the Wikidata code. I can't approve it with these kinds of flaws remaining. If nothing will ever call this function with anything other than TextContent, then throwing an exception is the right thing to do. Diffing the serialized representation does not seem right to me, for non-text content.

#XXX: text should be "already segmented". what does that mean?

It means that $wgContLang->segmentForDiff() should have been run on it. The comment is an error introduced by me in r13029, you certainly should *not* segment the input of that function.

In Article.php:

$contentHandler = ContentHandler::getForTitle( $this->getTitle() );
$de = $contentHandler->createDifferenceEngine( $this->getContext(), $oldid, $diff, $rcid, $purge, $unhide );

Surely the content type used to get the diff engine should depend on the content of the revisions involved, not just the title. Same issue throughout commit 88f8ab8e90a77f1d51785ba63f2eac10599c3063.

Thanks for your comments Tim.

I have addressed some of the issues you mentioned in my recent commits to gerrit. I have self-approved some cosmetic changes, others are pending review - could you have a look?

https://gerrit.wikimedia.org/r/#/q/project:mediawiki/core+branch:Wikidata+is:open,n,z

In particluar, I have done the following:

  • cleaned up long lines
  • addressed, removed or reworded many FIXME and XXX comments
  • implemented ContentHandler::runLegacyHooks, and made use of it in several places, cleaning up crufty code.

The changes in DairikiDiff where made in order to make this class more flexible and reusable by by other kinds of content. I still think that would be nice, but it's not needed for Wikibase/Wikidata, since we ended up implementing our own Diff infrastructure.

So, this is what I'll do:

  • revert the changes to DairikiDiff
  • make sure the DifferenceEngine is created via the revision's ContentHandler, not the Title's.

The two "ALTER TABLE revision" patches should be merged into a single query, to improve the time taken to update. Same with the archive table. All of the schema changes can be in one patch file, there's no requirement for each field to be in its own file.

Following Rob's suggestion:

Someone could create a faux commit (with "DO NOT MERGE" on it) which
is not a merge commit, but a fully squashed, rebased single commit.
That would give a target for inline review comments. I'm not
volunteering to do that myself, but anyone could do that, and drop a
comment on bug 38622.

That faux commit is here:

https://gerrit.wikimedia.org/r/#/c/25736/

Hi Denny, thanks for the faux commit! Is there a merge commit ready to go for this yet, or is there more blocker fixing happening first?

(In reply to comment #19)

Hi Denny, thanks for the faux commit! Is there a merge commit ready to go for
this yet, or is there more blocker fixing happening first?

I think that's a question for Tim, really. I think we are basically good to go - I have not yet resolved everything he mentioned, but I would not consider any of these issues blockers in the hard sense. Anyway, I expect to be able to address most of the stuff by Monday.

Hi Daniel, let me ask the question another way: when are you planning on submitting a merge request? If *you* believe you've addressed all of the blockers, you should submit the merge request. If Tim disagrees, he'll put a -2, but don't wait for his (or anyone's) preapproval before submitting the merge request.

Hi Rob. Ok, we'll do that. I was sticking to the original plan of Tim merging the branch bypassing gerrit, but a merge request is probably helpful to get input from others as well.

Anyway, I expect that we'll do that early next week.

Closed older resolved bugs as verified.