
Review Wikibase Repo extension for deployment
Closed, ResolvedPublic

Description

Review the Wikibase repo extension for deployment.

It can be found in Git in the project mediawiki/extensions/Wikibase, in the repo directory.


Version: master
Severity: normal

Details

Reference
bz38822

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:10 AM
bzimport set Reference to bz38822.

Note: this code will only be used on the Wikidata site itself, not on client wikis like Wikipedias.

Assigning to Tim for now. We have some ideas of how to split the review work up a little differently, so we may change these around before reassigning.

* {
	/* defining Arial as default font working around problematic font metrics of Helvetica applied
	   in Firefox and Opera on Mac cutting off high characters like "Å" in some cases */
	font-family: Arial, sans-serif;
}

You should find some other way to fix this, global font choice is a matter for the skin, not the Wikibase extension.

57KB of site and language data is too much to load efficiently with every page view by embedding it in a <script> tag. It should be split off into a separate request, with an Expires header, using either the API or ResourceLoader (RL).
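For illustration, one way to do this with ResourceLoader would be a small module that emits the site data from its own cacheable request; the module and class names below are placeholders, not an actual implementation:

$wgResourceModules['wikibase.sites'] = array(
	// Hypothetical module: serves the site/language data separately from the page HTML.
	'class' => 'WikibaseSitesModule',
);

class WikibaseSitesModule extends ResourceLoaderModule {
	// Build the site details server-side and hand them to the client via mw.config,
	// so they are fetched in a separate, cacheable request instead of being inlined
	// into every page view.
	public function getScript( ResourceLoaderContext $context ) {
		$sites = array(); // placeholder: would be filled from the sites table
		return Xml::encodeJsCall( 'mw.config.set', array( 'wbSiteDetails', $sites ) );
	}
}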

It would be nice if it worked for sites other than Wikipedia. It seems a bit funny to implement so many layers of abstraction and then to hard-code the site name.

Global variable names should be prefixed with "wg", including configuration globals. For example, $wbStores should be $wgWBStores.

The performance issue I identified during core review, i.e. O(N^2) write queries for N link updates, seems to still be present. WikibaseCache.sql says "this cache is a shared table, so exists only once per master", but I couldn't find any actual implementation of that mechanism. EntityCacheTable does not override the getReadDb() method or use a database selector in getName().

ChangesTable also appears to lack any remote DB support, so it's hard to see how a change could propagate from one wiki to another. When I run pollForChanges.php on a client wiki, it just gives me an error. No doubt you would have tested that, so I'm probably just doing it wrong. But grepping for wfGetLB() doesn't give any hits, and that is the obvious way to connect to a remote DB.
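For reference, a minimal sketch of reading the repo's changes table from a client via wfGetLB() might look like this ('repowiki' and the change_id condition are placeholders):

// Get a load balancer for the (remote) repo wiki and a read connection from it.
$lb = wfGetLB( 'repowiki' );
$dbr = $lb->getConnection( DB_SLAVE, array(), 'repowiki' );

// Poll for changes newer than the last one this client has processed.
$res = $dbr->select( 'wb_changes', '*', array( 'change_id > ' . (int)$lastSeenId ), __METHOD__ );

$lb->reuseConnection( $dbr );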

Thanks for the feedback, Tim. You are mentioning 5 issues:

  1. 57KB of site and language data: yes, this has been on our todo list forever. I hope we get this moved to a separate resource next week.
  2. "Wikipedia" should not be hardcoded anywhere - where did you find this? Maybe as a default setting, or some such?
  3. global variables: will do. The convention changed a couple of times it seems, causing confusion. Is our main settings array acceptable as $egWBSettings, or would it become $wgWBSettings?

The last two points are really only relevant to the client, even though the code is in Wikibase/lib. These issues shouldn't block the deployment of the repo; some of the code, like the pollForChanges script, should probably be moved. Anyway:

  4. That we have to (potentially) update all Wikipedias after a single edit on Wikidata lies in the nature of the project, I would think. We are thinking about how to make this more efficient by batching updates. I'll try to prepare a writeup explaining how we currently envision the percolation of the changes.
  5. I have worked on remote DB support for ORMTable (and by extension ChangesTable) yesterday, see I261a2a31. I have not yet figured out, though, how to correctly set up an LBFactory_multi to test this. Can you help me with that? What would a simple setup for two masters (and no slaves) look like?

(In reply to comment #6)

Thanks for the feedback, Tim. You are mentioning 5 issues:

  1. 57KB of site and language data: yes, this has been on our todo list forever. I hope we get this moved to a separate resource next week.

  2. "Wikipedia" should not be hardcoded anywhere - where did you find this? Maybe as a default setting, or some such?

In ItemView.php:

/**
 * Returns a list of all the sites that can be used as a target for a site link.
 *
 * @static
 * @return array
 */
public static function getSiteDetails() {
	...
	if ( $site->getType() === Site::TYPE_MEDIAWIKI && $site->getGroup() === 'wikipedia' ) {

The autocomplete feature that this function services also references Wikipedia in a message (wikibase-error-autocomplete-connection).

There doesn't seem to be any way to populate the sites table with data other than the data that comes from meta.wikimedia.org; I had to patch Utils::insertDefaultSites() to set up my test instance.

populateInterwiki.php also unconditionally references Wikipedia.

  3. global variables: will do. The convention changed a couple of times it seems, causing confusion. Is our main settings array acceptable as $egWBSettings, or would it become $wgWBSettings?

I think $wg is the best convention, since if everything uses it, a configuration UI can drop the prefix. It's almost universal in extensions deployed to WMF; the only exception is the Contest extension, which is another one of Jeroen's projects.

The last two points are really only relevant to the client, even though the code is in Wikibase/lib. These issues shouldn't block the deployment of the repo; some of the code, like the pollForChanges script, should probably be moved. Anyway:

  4. That we have to (potentially) update all Wikipedias after a single edit on Wikidata lies in the nature of the project, I would think. We are thinking about how to make this more efficient by batching updates. I'll try to prepare a writeup explaining how we currently envision the percolation of the changes.

Aren't we talking about a deployment in October? It seems like a pretty basic feature to be starting so late.

  5. I have worked on remote DB support for ORMTable (and by extension ChangesTable) yesterday, see I261a2a31. I have not yet figured out, though, how to correctly set up an LBFactory_multi to test this. Can you help me with that? What would a simple setup for two masters (and no slaves) look like?

Here is my LocalSettings.php, if it helps:

http://paste.tstarling.com/p/drrHMe.html

Apologies for the accumulated cruft. It has configuration for various multi-wiki features. For multiple masters, it would be basically the same, except with $wgLBFactoryConf having:

'sectionsByDB' => array(
	'enwiki' => 's1',
),
'sectionLoads' => array(
	's1'      => array( 'local1' => 1 ),
	'DEFAULT' => array( 'local2' => 1 ),
),

It's possible to run multiple MySQL servers on the same host. There's a helper script for it called mysqld_multi:

http://dev.mysql.com/doc/refman/5.1/en/mysqld-multi.html

For MediaWiki, it's necessary to use different IP addresses rather than different ports to separate the instances.
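Put together, a minimal two-master configuration might look roughly like this (the serverTemplate and hostsByName values are placeholders, not my actual settings):

$wgLBFactoryConf = array(
	'class' => 'LBFactory_multi',
	'sectionsByDB' => array(
		'enwiki' => 's1',
	),
	'sectionLoads' => array(
		's1'      => array( 'local1' => 1 ),
		'DEFAULT' => array( 'local2' => 1 ),
	),
	'serverTemplate' => array(
		'dbname'   => 'wikidb',
		'user'     => 'wikiuser',
		'password' => 'secret',
		'type'     => 'mysql',
		'flags'    => DBO_DEFAULT,
	),
	'hostsByName' => array(
		// Two mysqld_multi instances on the same box, separated by IP as noted above.
		'local1' => '127.0.0.1',
		'local2' => '127.0.0.2',
	),
);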

I think $wg is the best convention, since if everything uses it, a configuration UI can drop the prefix.

Not that it's a big thing, but I just want to mention that Jeroen and I alone maintain at least 25 different extensions that do not use the 'wg' prefix, and I am sure there are a few more out there. So I don't think a configuration UI could or should ever easily work based on the 'wg' prefix.

(In reply to comment #7)

  1. "Wikipedia" should not be hardcoded anywhere - where did you find this?

Maybe as a default setting, or some such?

Ok, I filed what you mentioned as Bug 40594.

  4. That we have to (potentially) update all Wikipedias after a single edit on Wikidata lies in the nature of the project, I would think. We are thinking about how to make this more efficient by batching updates. I'll try to prepare a writeup explaining how we currently envision the percolation of the changes.

Aren't we talking about a deployment in October? It seems like a pretty basic feature to be starting so late.

Our current implementation works fine if you have one poll script per wiki. It uses $wgSharedTables for accessing the repo's wb_changes table, which only works if that's on the same server. So I'm now changing this to use the foreign wiki stuff.
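Concretely, the current per-wiki setup boils down to something like this in each client's LocalSettings.php ('repowiki' stands in for the repo's DB name):

// Make the repo's wb_changes table visible to this client via shared tables.
// This only works as long as client and repo are on the same database server.
$wgSharedDB = 'repowiki';
$wgSharedTables[] = 'wb_changes';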

This should be sufficient for a deployment with a handful of client wikis. A better solution is needed if we want to deploy the client stuff to all Wikipedias. That's what the writeup is about.

Here is my LocalSettings.php, if it helps:

http://paste.tstarling.com/p/drrHMe.html

Cool, thanks for the link!

Reedy also pointed me to https://noc.wikimedia.org/conf/db.php.txt, which gave me some idea of how this works.

One question about terminology though: Can you explain to me what "sections" and "groups" are, and how they relate to "clusters"?

(In reply to comment #9)

Our current implementation works fine if you have one poll script per wiki. It uses $wgSharedTables for accessing the repo's wb_changes table, which only works if that's on the same server. So I'm now changing this to use the foreign wiki stuff.

$wgSharedTables is outdated. I will add a deprecation warning.

One question about terminology though: Can you explain to me what "sections" and "groups" are, and how they relate to "clusters"?

A section is a collection of wiki databases. A shared database like centralauth is treated like a wiki in that it can be in a section.

A query group (sometimes abbreviated to "group") is the set of queries which come from a particular caller or a related set of callers, for example user contributions queries. Query group configuration allows such queries to be directed to particular slaves, to make efficient use of the RAM cache or to avoid having one feature overload the server used by another feature.

A cluster is a master DB server and its associated slaves, which are used by ExternalStoreDB for reading and writing article text data.
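As an illustration only, query group routing in an LBFactory_multi configuration (like the db.php linked above) is expressed per section; the group names and server names here are made up:

'groupLoadsBySection' => array(
	's1' => array(
		// Route expensive contributions/watchlist queries to dedicated slaves.
		'contributions' => array( 'db1001' => 1 ),
		'watchlist'     => array( 'db1002' => 1 ),
	),
),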

I believe the security+architecture review that Chris did, plus all of the architecture discussions we've had, are sufficient for a deployment. Please reopen if you feel we need additional review on any of these.