
[Task] Update the wb_terms table so it does not have a numeric entity id
Closed, Declined (Public)

Description

  • Update term class to not have a numeric entity id
  • Provide a migration script for wb_terms

Details

Reference
bz56711

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 2:15 AM)
bzimport set Reference to bz56711.
bzimport added a subscriber: Unknown Object (MLST).

Rationale: We are dropping the assumption that IDs will always be prefix+number. For the current code and use case (wikidata.org) this works fine, but we need to migrate away from it in order to support things like metadata storage on Commons.
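
To make the assumption concrete, here is a minimal Python sketch. It is not Wikibase code: the helper name, the regular expression, and the non-numeric example ID are all hypothetical. It shows that a numeric column can only hold IDs that split cleanly into a prefix and a number, and that any other ID form has nothing to store there.

```python
import re

# Hypothetical pattern for the "prefix+number" shape, e.g. "Q42" or "P31".
_PREFIX_NUMBER = re.compile(r"([A-Za-z]+)([1-9][0-9]*)")


def split_numeric_id(entity_id):
    """Split an ID such as "Q42" into ("Q", 42).

    Raises ValueError for any ID that is not prefix+number, which is
    exactly the case a numeric entity id column cannot represent.
    """
    match = _PREFIX_NUMBER.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a prefix+number id: {entity_id!r}")
    return match.group(1), int(match.group(2))
```

Calling `split_numeric_id("Q42")` succeeds, while a made-up ID such as `"commons:File_Example.jpg"` raises `ValueError`; storing the full ID as a string avoids the split altogether.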

I fixed compat with sqlite and several other issues. The tests now pass: https://gerrit.wikimedia.org/r/#/c/114490/

The commits are doing some stuff I don't like, though that can be fixed after we get rid of the main issue: the bad assumption in the table, which this commit fixes.

Springle wrote at https://gerrit.wikimedia.org/r/#/c/101197/:

This one still seems dangerous to me :-) I understand the reason for the change, however please do also consider:

  1. Have we done any real profiling of the new query forms against the production dataset? I'd really like to see how much of an impact this has on data and index disk usage, and more importantly on runtime memory usage. Happy to do this if a Dev can generate a few thousand samples of each query type...
  2. Would it be wise to keep a numeric entity id field as an interim step on the wikidata production dataset, so we can fail back if needs be? I.e., treat this as a denormalization step (which is /all it is/ for now) until #1 is assured? That might even make the migration less painful.
  3. VARBINARY(255) smells like an arbitrary size choice :-) Variable field widths really start to matter for large datasets as the server must convert it to fixed-width BINARY while working. If the choice /was/ arbitrary, can we arbitrarily choose to make this smaller from the get go?
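
The interim step suggested in the second point can be sketched as follows, using Python's built-in `sqlite3` module rather than the production MySQL setup. Everything here is an assumption for illustration: the reduced three-column schema, the name of the new column, and the type-to-prefix mapping. It is not the migration script from the Gerrit change.

```python
import sqlite3

# Assumed mapping from entity type to ID prefix.
PREFIXES = {"item": "Q", "property": "P"}


def add_full_id_column(conn):
    """Add a string ID column next to the numeric one and backfill it.

    The numeric column is left in place, so old queries keep working and
    the change can be rolled back by ignoring the new column.
    """
    conn.execute("ALTER TABLE wb_terms ADD COLUMN term_full_entity_id TEXT")
    for entity_type, prefix in PREFIXES.items():
        conn.execute(
            "UPDATE wb_terms SET term_full_entity_id = ? || term_entity_id "
            "WHERE term_entity_type = ?",
            (prefix, entity_type),
        )
    conn.commit()


def demo():
    """Build a tiny stand-in table, migrate it, and return both ID columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms ("
        "term_entity_id INTEGER NOT NULL, "
        "term_entity_type TEXT NOT NULL, "
        "term_text TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO wb_terms VALUES (?, ?, ?)",
        [(42, "item", "Douglas Adams"), (31, "property", "instance of")],
    )
    add_full_id_column(conn)
    return conn.execute(
        "SELECT term_full_entity_id, term_entity_id "
        "FROM wb_terms ORDER BY term_entity_id"
    ).fetchall()
```

Because both columns coexist after the backfill, the old and new query forms can be profiled side by side before the numeric column is dropped, which is the kind of comparison the first point asks for.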
Lydia_Pintscher removed a subscriber: Unknown Object (MLST). (Dec 1 2014, 2:29 PM)

Fundamental questions came up at today's sprint start: Do we still want this at all? We should try to get rid of the database table anyway, and the change is very expensive on such a large table, so we want to avoid it unless absolutely necessary.

thiemowmde renamed this task from "Update the wb_terms table so it does not have a numeric entity id" to "[Task] Update the wb_terms table so it does not have a numeric entity id". (Aug 13 2015, 4:25 PM)
thiemowmde updated the task description.
thiemowmde removed a project: Patch-For-Review.
thiemowmde set Security to None.
thiemowmde removed a subscriber: Wikidata-bugs.
Addshore subscribed.

The wb_terms table is being removed in T208425.

In preparation for that, I am simply going to close this task.