
Document how to use federated commonswiki and wikidatawiki databases
Closed, ResolvedPublic

Description

It's currently impossible to do joins between Wikipedia and Commons/Wikidata. The labs database servers should have copies of Commons and Wikidata just like at the Toolserver.


Version: unspecified
Severity: normal

Details

Reference
bz58802

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:16 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz58802.

Actually, what's missing here is documentation and not functionality:

Every shard that does not host the original database has a federated database link to commonswiki and wikidatawiki, but under the names 'commonswiki_f_p' and 'wikidatawiki_f_p' respectively.

Using those is functionally identical to using an actual local view, except that performance can be severely impacted if you do joins on non-indexed columns (which you should never do anyway).
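For illustration, a minimal sketch of such a cross-database join, run while connected to nlwiki.labsdb. It assumes that nlwiki_p exposes the usual image view and that img_name is indexed as in the standard MediaWiki schema; neither is confirmed in this thread, and the query itself is hypothetical.

```sql
-- Hypothetical example: files that exist both locally on nlwiki and on
-- Commons under the same name.
-- Assumption: img_name is indexed on both sides, so the join does not
-- fall into the slow non-indexed case described above.
SELECT local_img.img_name
FROM nlwiki_p.image AS local_img
JOIN commonswiki_f_p.image AS commons_img
  ON commons_img.img_name = local_img.img_name
LIMIT 10;
```

Qualifying every table with its database name (nlwiki_p, commonswiki_f_p) keeps the query independent of whichever database is currently selected with `use`.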

Keeping the bug open but renaming it to track the need for documentation instead.

You need to do more than document.

MariaDB [nlwiki_p]> connect nlwiki_p nlwiki.labsdb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Connection id: 1008694
Current database: nlwiki_p

MariaDB [nlwiki_p]> use commonswiki_f_p;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [commonswiki_f_p]> show tables;
+---------------------------+
| Tables_in_commonswiki_f_p |
+---------------------------+
| image                     |
| logging                   |
| logging_userindex         |
| page                      |
| revision                  |
| revision_userindex        |
| user                      |
+---------------------------+
7 rows in set (0.04 sec)

Should be 64 tables.

Those are the only tables for which queries are being used in joins in practice. Adding more is not an overly complicated operation, but will only be done as needed.

If it's not overly complicated, why not just do it? Having the tables in-place makes sure anyone who *does* want to use them *can*, without having to jump through hoops.

(In reply to comment #3)

Those are the only tables for which queries are being used in joins in practice. Adding more is not an overly complicated operation, but will only be done as needed.

Are you serious? Based on what practice? Just add the missing tables.

There is a maintenance cost associated with maintaining federated tables (inter alia, it requires review at any schema change). The fewer such tables in place, the less likely it is that a problem is introduced.

Also, joins between databases living in different slices should be rare and are almost always better done differently; requiring a request to add new tables ensures that there is an opportunity to review the proposed use case before it's implemented (to avoid having to force people to change their tools later).

The main reason why I have a Toolserver account is to do complicated cross database joins.

For example: Give all items on Wikidata that don't have a claim which link to an article on the Dutch Wikipedia using https://nl.wikipedia.org/wiki/Sjabloon:Taxobox .

Of course these are not run very often, let alone part of a tool. These results serve as input for a bot to do the actual work.

If not all tables are available, I can't do this anymore. Then I've lost the most important functionality of Toolserver/Toollabs.

The fact that it's hard to maintain doesn't impress me. When the WMF started the whole database endeavor and chose federation over copies, this implication should have been taken into account.

Copies would have been even worse. The vast majority of downtimes of the replication in toolserver were caused /by/ the multiple replication.

That said, could you give me the queries which you /do/ run? Like I said, if there are use cases that need support I will support them; but I'm not going to create maintenance overhead for hypothetical joins with commonswiki.module_deps just for the fun of completing a list of checkmarks.

(In particular, I am certain that many tables in wikidatawiki would be useful to add to federation, but I cannot guess which, nor would it be reasonable to preemptively include all of them.)

(In reply to comment #6)

There is a maintenance cost associated with maintaining federated tables (inter alia, it requires review at any schema change). The fewer such tables in place, the less likely it is that a problem is introduced.
[...]

But the review is already necessary for the change to the source, so there is no extra cost.

As this bug is a subset of bug #57876, closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 57876 ***