
Incomplete population of wb_terms table
Closed, Resolved (Public)

Description

Due to many hours of replag, which is only going to get worse over the next 7-8 hours (meaning at least 12 hours of replag), I've cancelled the current run of rebuildTermsSearchKey.php.

Whilst trying to work out where to start again from:

mysql:wikiadmin@db35 [wikidatawiki]> select min(term_row_id) from wb_terms where term_search_key = '';
+------------------+
| min(term_row_id) |
+------------------+
|           247135 |
+------------------+
1 row in set (1 min 9.97 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select * from wb_terms where term_row_id > 247130 limit 10;
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
| term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text          | term_search_key    |
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
|      247131 |          41253 | item             | bn            | alias     | Movie theaters     | movie theaters     |
|      247132 |          41253 | item             | bn            | alias     | Movie house        | movie house        |
|      247133 |          41253 | item             | bn            | alias     | Exhibition         | exhibition         |
|      247134 |          41253 | item             | bn            | alias     | Film theatre       | film theatre       |
|      247135 |          41253 | item             | bn            | alias     |                    |                    |
|      247136 |          41253 | item             | bn            | alias     | সিনেমা             | সিনেমা             |
|      247137 |          41253 | item             | bn            | alias     | Film exhibitor     | film exhibitor     |
|      247138 |          41253 | item             | bn            | alias     | Matinee            | matinee            |
|      247139 |          41253 | item             | bn            | alias     | Picture house      | picture house      |
|      247140 |          41253 | item             | bn            | alias     | Moviegoer          | moviegoer          |
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
10 rows in set (0.05 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select min(term_row_id) from wb_terms where term_row_id > 247140 AND term_search_key = '';
+------------------+
| min(term_row_id) |
+------------------+
|           254476 |
+------------------+
1 row in set (15.35 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select * from wb_terms where term_row_id = 254476;
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
| term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key |
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
|      254476 |          41607 | item             | bn            | alias     |           |                 |
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
1 row in set (0.00 sec)

The term_text values in these rows show as a square box in my shell, but the resulting term_search_key is ''.

This makes manually finding a starting point difficult, as above. --only-missing would help, but it's still going to go through the process of finding all these rows that are apparently still '', attempting to repopulate them, and then finding the next one. This might take a while.

So my first point is, why is the term_search_key coming out as ''? Is this correct? If necessary, we can try and get the results dumped somewhere so we can work out what said character is. Or with the IDs above, you might be able to find out through the end user interface.
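For working out what the character is once a term_text value has been dumped, something like the following Python sketch could help. It is only an illustration: describe_chars is a hypothetical helper, and the sample string is made up, since the real character in rows 247135 and 254476 is not known here.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, general category, name) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            # Control characters have no name, so fall back to a placeholder.
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Made-up sample: a zero width non-joiner followed by a plain letter.
for entry in describe_chars("\u200cA"):
    print(entry)
# ('U+200C', 'Cf', 'ZERO WIDTH NON-JOINER')
# ('U+0041', 'Lu', 'LATIN CAPITAL LETTER A')
```

A category starting with "C" (for example Cc or Cf) would be consistent with the square box seen in the shell.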

I can/will start the script again when the replag is fixed. In the meantime, finding out if the above is right/wrong/we don't care would be useful.


Version: master
Severity: normal

Details

Reference
bz46867

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:40 AM
bzimport set Reference to bz46867.
bzimport added a subscriber: Unknown Object (MLST).

My guess is that the input is a Unicode control character that gets stripped in the normalization process. We shouldn't really accept that kind of thing as input, but apparently we do.

If this is true, the '' key is technically correct. But we could hack it to just use the original string in that case. Not sure what's the Right Thing here.
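To make the guess concrete, here is a minimal Python sketch of a normalizer that strips control and format characters, with the "use the original string" fallback as an option. This is not the Wikibase normalizer; search_key, its fallback parameter, and the choice of categories Cc and Cf are assumptions for illustration only.

```python
import unicodedata


def search_key(text, fallback=True):
    """Hypothetical normalizer: drop control/format characters, lower-case.

    With fallback=True, an input that normalizes to '' keeps its original
    text as the key instead of ending up empty.
    """
    kept = "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    key = kept.strip().lower()
    if key == "" and fallback:
        return text
    return key


print(repr(search_key("Movie theaters")))          # 'movie theaters'
print(repr(search_key("\u200c", fallback=False)))  # ''
print(repr(search_key("\u200c")))                  # '\u200c'
```

Without the fallback, a term consisting only of such characters always gets the key '', which would match what the queries above show.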

reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.
reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.
reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.

Part of the issue is that preg_replace apparently returns an empty string if it encounters a bad Unicode sequence anywhere in the input.
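The all-or-nothing effect can be illustrated outside PHP. The Python sketch below is an analogy, not the Wikibase code: strict_key mimics a normalizer that gives up on the whole string when any byte is invalid UTF-8, while tolerant_key drops the bad bytes first and normalizes what is left. Both function names and the sample bytes are made up.

```python
def strict_key(raw: bytes) -> str:
    """One invalid byte anywhere makes the whole key come out as ''."""
    try:
        return raw.decode("utf-8").lower()
    except UnicodeDecodeError:
        return ""


def tolerant_key(raw: bytes) -> str:
    """Drop invalid byte sequences first, then normalize the remainder."""
    return raw.decode("utf-8", errors="ignore").lower()


sample = b"Film theatre\xff"  # 0xFF can never appear in valid UTF-8
print(repr(strict_key(sample)))    # ''
print(repr(tolerant_key(sample)))  # 'film theatre'
```

Whether dropping the bad bytes silently is acceptable is a separate question; the point is only that one bad sequence need not wipe out the rest of the key.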

Related URL: https://gerrit.wikimedia.org/r/70139 (Gerrit Change I702e01b3f021bb2e86fb309e0d51db2a10475ac2)

Related URL: https://gerrit.wikimedia.org/r/70140 (Gerrit Change Iedd9cc3b56c0db2e5ed6c02a398d7c35b1c96a1b)

Change 70139 merged by Jeroen De Dauw:
(bug 46867) trim bad utf-8 sequences before normalizing.

https://gerrit.wikimedia.org/r/70139

Change 70140 merged by Denny Vrandecic:
(bug 46867) skip bad search keys and report them.

https://gerrit.wikimedia.org/r/70140
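The patch itself is not reproduced here, but the idea in the summary of change 70140 (skip bad search keys and report them) can be sketched roughly as below. Everything in this Python example is hypothetical: the function names, the stand-in normalizer, and the "\x07" character used in place of the unprintable term_text.

```python
def normalize(text):
    """Stand-in normalizer: keep printable characters, then lower-case."""
    return "".join(ch for ch in text if ch.isprintable()).strip().lower()


def rebuild_keys(rows, normalize_fn=normalize):
    """Process (row_id, term_text, term_search_key) tuples.

    Rows that already have a key are left alone. Rows whose new key would
    still be '' are skipped and collected for reporting rather than being
    written back as empty.
    """
    updated, skipped = {}, []
    for row_id, text, key in rows:
        if key != "":
            continue  # corresponds to --only-missing
        new_key = normalize_fn(text)
        if new_key == "":
            skipped.append(row_id)
        else:
            updated[row_id] = new_key
    return updated, skipped


rows = [
    (247134, "Film theatre", ""),
    (247135, "\x07", ""),  # stand-in for the unprintable term_text
    (247137, "Film exhibitor", "film exhibitor"),
]
updated, skipped = rebuild_keys(rows)
print(updated)  # {247134: 'film theatre'}
print(skipped)  # [247135]
```

Reporting the skipped IDs gives a list of rows to inspect by hand, instead of the same rows being picked up and "updated" on every run.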

Sam: please confirm that the issue is now solved, so we can set this to "verified".

Verified in Wikidata demo time July 17th