Page MenuHomePhabricator

use composite indexes for efficient term search
Closed, ResolvedPublic

Description

Currently, the wb_terms table has indexes on individual fields:

CREATE INDEX /*i*/wb_terms_entity_id ON /*_*/wb_terms (term_entity_id);
CREATE INDEX /*i*/wb_terms_entity_type ON /*_*/wb_terms (term_entity_type);
CREATE INDEX /*i*/wb_terms_language ON /*_*/wb_terms (term_language);
CREATE INDEX /*i*/wb_terms_type ON /*_*/wb_terms (term_type);
CREATE INDEX /*i*/wb_terms_text ON /*_*/wb_terms (term_text);
CREATE INDEX /*i*/wb_terms_search_key ON /*_*/wb_terms (term_search_key);

Since only a single index will be used for any given query, this is not helpful, and results in inefficient searches. We really need a composite key for each kind of query. As far as I remember, the kinds of queries we do are:

  • by term_language, term_term, and term_entity_type
  • by term_language, term_term, and term_type

"term" here may be an exact match, or a soft match - so we need indexes covering either. The term type has very low cardinality (3 or 4, iirc), so it can probably be admitted.

From this, I gather we need the following two composite indexes:

  • for exact matches: ( term_language, term_term )
  • for soft matches: ( term_language, term_search_key )

We also need to access terms by entity ID (especially for deleting and updating, so we also want an index on

  • ( term_entity_type, term_entity_id )

And of course, row_id will remain a primary key.


Version: unspecified
Severity: major
Whiteboard: backlog, termsearch

Details

Reference
bz45529

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:33 AM
bzimport set Reference to bz45529.
bzimport added a subscriber: Unknown Object (MLST).

What is the wb_terms table? I need a human-readable one line description (for non wikibase users) which will accompany it on the dumps page. It currently has the following placeholder: "This table is currently missing a description." See the description of the wb_items_per_site table at http://dumps.wikimedia.org/wikidatawiki/20130527/ for an example.

What is the wb_terms table? I need a human-readable one line description
(for non wikibase users) which will accompany it on the dumps page.

wb_terms associates labels, aliases and description in a given language with data items. It can be used for finding all labels and aliases for a given concept, or finding all concepts that use a given name (label or alias) in a given language. (In reply to comment #1)

afeldman wrote:

(In reply to comment #0)

Since only a single index will be used for any given query, this is not
helpful, and results in inefficient searches.

This isn't strictly true. Since the select queries in production against wb_terms aren't join queries, index merge optimizations are possible and are being utilized. I.e. a query containing WHERE (term_language='oc' AND term_type='label' AND term_entity_type='property') is executed via an intersection of two indexes - "Using intersect(wb_terms_entity_type,wb_terms_language)".

That said, index merging isn't as efficient as composite indexes, and there's one query case that isn't performing well at all.

It looks like there are four common select query types in production:

  1. /* Wikibase\TermSqlIndex::getTermsOfEntity */ select term_language, term_type, term_text from wb_terms where term_entity_id = ? and term_entity_type = ?
  1. /* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id from wb_terms where (term_language=? and term_search_key like ? and term_type=? and term_entity_type=?) or (term_language=? and term_search_key like ? and term_type=? and term_entity_type=?) limit ?
  1. /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type, term_type, term_language, term_text, term_entity_id FROM wb_terms WHERE (term_language=? AND term_type=? AND term_entity_type=?)
  1. /* Wikibase\TermSqlIndex::getMatchingTermCombination */ select terms?term_entity_type as terms?term_entity_type, terms?term_type as terms?term_type, terms?term_language as terms?term_language, terms?term_text as terms?term_text, terms?term_entity_id as terms?term_entity_id, terms?term_entity_type as terms?term_entity_type, terms?term_type as terms?term_type, terms?term_language as terms?term_language, terms?term_text as terms?term_text, terms?term_entity_id as terms?term_entity_id from wb_terms terms? inner join wb_terms terms? on ((terms?term_entity_id=terms?term_entity_id) and (terms?term_entity_type=terms?term_entity_type)) where (terms?term_language=? and terms?term_text=? and terms?term_type=? and terms?term_entity_type=?) and (terms?term_language=? and terms?term_text=? and terms?term_type=? and terms?term_entity_type=?) and (terms?term_entity_id <> ? or terms?term_entity_type <> ?) limit ?

Query type 3 above can have a large number of results returned and should probably include a limit.

I think your proposed indexes should be good, but please limit the length on term_search_key. Most queries use a few letters only ('abc%'), indexing term_search_key(6) should be good.

If the combination ( term_entity_type, term_entity_id ) is guaranteed to be unique, and it will be used to delete rows, define this index as unique to minimize locking.

Submit the changes (with drop index statements) as a regular schema change and let me know when it's available in gerrit for review.

Thanks for the analysis Asher!

If the combination ( term_entity_type, term_entity_id ) is guaranteed to be
unique, and it will be used to delete rows, define this index as unique to
minimize locking.

It's not unique. In fact, ( term_entity_type, term_entity_id, term_language ) isn't unique either (there can be multiple aliases in a language language on an entity).

Just noticed Wikibase\TermSqlIndex::getMatchingIDs queries taking over a minute in production - this needs to be fixed.

(In reply to comment #3)

I think your proposed indexes should be good, but please limit the length on
term_search_key. Most queries use a few letters only ('abc%'), indexing
term_search_key(6) should be good.

UTF-8 can take up to 4 bytes for some characters, so this limit would cover 1.5 chars for some languages.

we should really look into this.

While this is still in limbo, to keep production alive, last week I added this temporary index:

KEY tmp1 (term_language,term_type,term_entity_type,term_search_key)

This query form (2 in comment #3) was backing up to the point of pain on slaves:

/* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id
from wb_terms where (term_language=? and term_search_key like ? and
term_type=? and term_entity_type=?) or (term_language=? and term_search_key
like ? and term_type=? and term_entity_type=?) limit ?

Most examples were identical. Perhaps something could be cached.

(In reply to comment #3)

It looks like there are four common select query types in production:

  1. /* Wikibase\TermSqlIndex::getTermsOfEntity */ select term_language,

term_type, term_text from wb_terms where term_entity_id = ? and
term_entity_type = ?

This is getting all the terms defined for an entity. This is mostly used when determining which term entries need to be updated when an entity is saved.

  1. /* Wikibase\TermSqlIndex::getMatchingIDs */ select distinct term_entity_id

from wb_terms where (term_language=? and term_search_key like ? and
term_type=? and term_entity_type=?) or (term_language=? and term_search_key
like ? and term_type=? and term_entity_type=?) limit ?

This is used to find entities based on a user provided prefix. We should see LOTS of those, since they are used for find-as-you-type suggestions in the search box and in the entity selector.

  1. /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type,

term_type, term_language, term_text, term_entity_id FROM wb_terms WHERE
(term_language=? AND term_type=? AND term_entity_type=?)

This should be relatively rare: it's used to load all labels for all properties. The number of rows in the result should be the number of properties defined, which at the moment is about 1000. TermPropertyLabelResolver already caches the result in memcached.

There should be another variant of the Wikibase\TermSqlIndex::getMatchingTerms queries, which searches for a specific term_text:

3b) /* Wikibase\TermSqlIndex::getMatchingTerms */ SELECT term_entity_type,
term_type, term_language, term_text, term_entity_id FROM wb_terms WHERE
(term_language=? AND term_type=? AND term_entity_type=? AND term_text=? )

This is quite similar to case (2), but using term_text, not term_search_key.

  1. /* Wikibase\TermSqlIndex::getMatchingTermCombination */ select

terms?term_entity_type as terms?term_entity_type, terms?term_type as
terms?term_type, terms?term_language as terms?term_language, terms?term_text [...]

Used to find entities with a given combination of label and description, to avoid conflicts. This query is performed once per edit.

I suppose the query as such could be optimized - maybe using a subquery is more efficient here than a full join. Should be investigated.

To cover all these use cases, I'm considering the following indexes:

  • for TermSqlIndex::getMatchingIDs

CREATE INDEX /*i*/term_search ON /*_*/wb_terms (term_language, term_search_key(12), term_entity_type, term_type);

  • for TermSqlIndex::getTermsOfEntity and for the join in TermSqlIndex::getMatchingTermCombination

CREATE INDEX /*i*/term_entity ON /*_*/wb_terms (term_entity_type, term_entity_id, term_type);

  • TermSqlIndex::getMatchingTerms with or without given term_text, as well as for TermSqlIndex::getMatchingTermCombination

CREATE UNIQUE INDEX /*i*/term_identity ON /*_*/wb_terms (term_language, term_type, term_entity_type, term_text, term_entity_id);

Change 99660 had a related patch set uploaded by Daniel Kinzler:
(bug 45529) use composite indexes on wb_terms.

https://gerrit.wikimedia.org/r/99660

Pulled from db1021 with the old wb_% indexes after a few hours of activity:

+--------------+------------+---------------------+------------+

TABLE_SCHEMATABLE_NAMEINDEX_NAMEROWS_READ

+--------------+------------+---------------------+------------+

wikidatawikiwb_termstmp1905372478
wikidatawikiwb_termswb_terms_entity_id1284769088
wikidatawikiwb_termswb_terms_text365238
wikidatawikiwb_termswb_terms_language85815191
wikidatawikiwb_termswb_terms_search_key80851

+--------------+------------+---------------------+------------+

(tmp1 is roughly eqivalent to the proposed terms_search composite index)

Turns out we need to keep some indexes with term_entity_id, term_text, and term_search_key in left-most positions, plus avoid having massive indexes (22G already, more for the new composites) except where "Index Condition Pushdown" (ICP) is useful. Therefore I propose these indexes instead:

  • Some wb_terms queries not listed in the bug report use term_entity_id=N
  • which is good selectivity.

CREATE INDEX /*i*/terms_entity ON /*_*/wb_terms (term_entity_id);

  • When any wb_terms query includes a search on term_text greater than
  • four or five leading characters a simple index on term_text and
  • language is often better than the proposed composite indexes. Note
  • that it's fine to have term_language second as MariaDB still uses
  • the entire key length even with LIKE '...%' on term_text.

CREATE INDEX /*i*/terms_text ON /*_*/wb_terms (term_text, term_language);

  • Same idea as above for terms_search_key. Not so much traffic as
  • term_text but enough to be useful.

CREATE INDEX /*i*/terms_search_key ON /*_*/wb_terms (term_search_key, term_language);

  • The getMatchingIDs query is horrible when it has just one or two
  • leading characters on term_search_key. This index is slightly
  • different to the proposed term_search for better selectivity
  • while still allowing ICP for short string values.

CREATE INDEX /*i*/terms_search ON /*_*/wb_terms (term_language, term_entity_id, term_type, term_search_key(16));

The above indexes give reasonable performance on all wb_terms queries I can find and seems suitably generic enough for tables.sql.

However, for wikidatawiki in production it's still not fast enough for all fringe cases due to the sheer dataset size. So, on db1026 I've partitioned wb_terms based on hash(term_language). This:

  1. Lets the optimizer use partition pruning like a free first-stage index on term_language (because we use equality for term_language clauses). Reduces the number of rows read for many of our queries except those already using ICP.
  1. Allows terms_text and terms_search_key to drop the second term_language field for slightly smaller footprints with comparable performance due to (1). More stuff fits in the buffer pool for longer, basically.
  1. Speeds up writes to wb_terms on the slave because insert/update/delete only touches the smaller indexes in the relevant partition rather than the entire btrees (this is only a slight benefit, but I figure every bit to help reduce replag...)

CREATE TABLE wb_terms (
...

KEY terms_entity (term_entity_id),
KEY terms_text (term_text),
KEY terms_search_key (term_search_key),
KEY terms_search (term_language,term_entity_type,term_type,term_search_key(16))

) ENGINE=InnoDB PARTITION BY KEY (term_language) PARTITIONS 16

Note that I left term_language in terms_search despite the partitions because getMatchingIDs query still benefits from ICP when term_search_key is used with a poorly performing short prefix: LIKE 'a%'.

Partitions aren't suitable for tables.sql as we still say we support MySQL 5.0+, but I think it's worthwhile for this particular table on WMF production slaves.

(In reply to comment #10)

Pulled from db1021 with the old wb_% indexes after a few hours of activity:

Thank you for the thorough analysis and the helpful suggestions! I don't understand mysql's internals nearly as well as you, so I think I'll just go with your suggestions :)

Partitions aren't suitable for tables.sql as we still say we support MySQL
5.0+, but I think it's worthwhile for this particular table on WMF production
slaves.

Note that this is for the Wikibase extension, not core. We can and do have stronger requirements for the extension anyway. I'll think about whether to use partitioning always, or to make a special wmf version.

FTR, correction for my comment #10: I said:

"Note that I left term_language in terms_search despite the partitions because
getMatchingIDs query still benefits from ICP when term_search_key is used with
a poorly performing short prefix: LIKE 'a%'."

This is actually wrong for MariaDB 5.5. ICP is not being used inside the partition (though it apparently will be possible in the future). Looks like the speed-up for getMatchingIDs here is only due to the pruning and the smaller partitioned indexes generating fewer Handler_read% hits.

Change 99660 merged by jenkins-bot:
(bug 45529) use composite indexes on wb_terms.

https://gerrit.wikimedia.org/r/99660