Page MenuHomePhabricator

[Task] EntityIdValues should be serialized as strings, not type/number structures.
Closed, ResolvedPublic

Description

Currently, EntityIdValues are represented in JSON as pairs of entitiy types and numeric ids. There is no easy for a client to make this to a prefixed id or URL. Instead, we should always use the prefixed ID form (e.g. "Q1234") to represent an entity ID externally (and probably also in the internal serialization).

Of course, we must keep accepting the old form as input. Otherwise, we would be unable to process serializations from existing revisions.


From T78294:
"unserializing the EntityIdValue (e.g. from memcached) from the entity type + numeric id format to an EntityId object has significant impact on performance and memory usage.

The unserialization uses the PropertyId::newFromNumber and ItemId::newFromNumber methods are memory intensive, with use of strtr (#5 in https://github.com/filbertkm/wb-profiling/blob/master/memory_own-itempurge-1.25wmf1.txt).

If we can somehow move away from entity type + numeric id format in EntityIdValue, I think that would be better."

https://github.com/wmde/WikibaseDataModel/issues/248

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenFeatureNone
OpenFeatureNone
DuplicateNone
ResolvedNone
ResolvedNone
ResolvedNone
DuplicateNone
InvalidLydia_Pintscher
OpenNone
OpenNone
StalledNone
OpenNone
OpenNone
OpenNone
ResolvedAddshore
Resolvedthiemowmde
ResolvedAddshore
InvalidNone
Resolvedthiemowmde
ResolvedLydia_Pintscher
Resolvedadrianheine
Resolvedthiemowmde

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:13 AM
bzimport set Reference to bz54085.
bzimport added a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Does the serialization change need to be in EntitySerializer.php?
Also what could possibly be a new serialization process?
I'd like to work on fixing this, I'm familiar with mediawiki code, but not much with wikidata in particular, a little hint would get me started.

Sorry for taking so long to reply to you Sumit. I don't know why this was tagged easy. This is a rather invasive change needing a lot of coordination in other areas. So probably not a good one to start with. Sorry! Have you looked at some other tickets already?

JanZerebecki lowered the priority of this task from High to Medium.May 16 2015, 5:32 PM
Jonas renamed this task from EntityIdValues should be serialized as strings, not type/number structures. to [Task] EntityIdValues should be serialized as strings, not type/number structures..Aug 15 2015, 12:44 PM
Jonas updated the task description. (Show Details)

We discussed this in todays story time. We decided this should be a blocker for …

…, erm, why are there so many tickets with the same intent? Anyway, we decided this must be considered to not break stuff twice. Why? Because instead of introducing a new property type (or even a new value type, which is what T101752 is about) it was suggested to make the existing EntityIdValue type more generic.

What is this about?

Look at https://www.wikidata.org/wiki/Special:EntityData/Q21632765.json and search for "numeric-id". That's what we want to get rid of. Here is the relevant JSON snippet, representing an EntityIdValue:

"value":{"entity-type":"item","numeric-id":184}

Original proposal: Turn into string

"value":"Q184"

Issues:

  • The moment we want it to support items from an external repository (e.g. Commons should use the wikidata.org items), this must change again into something like "value":"wikidata:Q184" or even "value":"http://www.wikidata.org/entity/Q184".
  • All ids must have a prefix then, even all internal ones, wasting massive amounts of bytes. Having ids with and without prefixes would be a bad hack, since this means : is not allowed in internal ids any more.
  • A string like wikidata:Q184 combines two facts in one field, violating a basic database design principle.

Todays proposal: Keep object

Note: I'm suggesting the new keys "repo" and "id" here. This can also be something else, e.g. "serialization".

"value":{"repo":"wikidata","id":"Q184"}

A format like this allows for a much more convenient migration path including a deprecation phase, which could look like this:

"value":{"repo":"wikidata","id":"Q184","entity-type":"item","numeric-id":184}

Advantages:

  • Makes it possible to add keys later, including full URIs and URLs (derived values, see T118860).
  • Migration is possible.
  • The repo key can be omitted for internal links, saving bytes.

Disadvantages:

  • Wastes bytes, compared to "value":"wikidata:Q184".
  • Users can mistakenly ignore the repo part (no matter if it's always there or omitted for internal links) and think this is a link to https://www.wikidata.org/wiki/Q184. This can't happen with URIs like wikidata:Q184.

Both proposals combined

"value":{"id":"wikidata:Q184"}

Issues:

  • Wastes bytes, compared to "value":"wikidata:Q184".
  • Again, it's not possible to omit the prefix.
  • We can not start with "id":"Q184" and add the prefix later, since this would be an other breaking change.

So, if I'm right, we need a task breakdown for this next, correct @thiemowmde?

We should have a decision on the format first. Personally I think a short discussion with @daniel, @adrianheine and me could work out.

+1 to proposal

Apparently there is an advantage to not fixing "something obviously bad" for over 3 years :)

@JeroenDeDauw, do you have a favorite of the three different proposals I collected?

Apparently there is an advantage to not fixing "something obviously bad" for over 3 years :)

Yea, that way, we have had more time to think about how to make this change backwards compatible -- which we wouldn't need if we had gotten it right in the first place.

Hah, I did not realize there are 3 serious proposals here, and thought only "Todays proposal: Keep object" was being proposed. That's the one that got the +1 from me. The combined one also seems reasonable.

I also support the "keep object" proposal, for compatibility reasons. It's ugly though, and wouldn't be needed if we hadn't started with a serialization that exposes internals. We are stuck with using an object here I guess, but we can move away from exposing the internals. I guess it's the best option we have right now.

@daniel, even with that we have many options. Use URIs like "wikidata:Q1", or full URLs, or split this into two elements? Which array keys to use for that? Personally I strongly suggest to keep the serialized JSON as short as possible because we have millions of these links in the database. My ideal solution (after a migration phase): "value":{"id":"Q184"}.

Quick summary of a discussion between Adrian, Thiemo, and me today:

Agreement:

  • Keep the object structure for entityid values
  • Add a new "id" key for the serialized entity id
  • { "id":"Q184", "entity-type":"item", "numeric-id":184 }
  • For internal storage, we can drop the old fields right away, and go to { "id":"Q184" }
  • { "id":"Q184" } should also become the default for API output at some point. We may continue to support "entity-type" and "numeric-id" as an option.

Note: we also want to support optional "url" and "uri" keys, but they should probably be treated as derived values, and not be part of the entity id value itself.


There are some questions that remain open regarding the support of external entity ids, for a federated setup. The changes outlined above can however be made without deciding the questions below.

The main question regarding external identifiers is: should a prefix that encodes the home repo of an entity be included in the id field? There are basically two options:

  1. { "id":"foo:Q184" } optionally expanded to { "id":"foo:Q184", "repo":"foo" }. Local IDs would have the form "Q184" without a prefix.
  2. { "id":"Q184" "repo":"foo" } optionally augmented with { "id":"Q184" "repo":"foo", "qname":"foo:Q184" }. The qname for a local ID would have the form mywiki:Q184, using a configurable prefix. An empty prefix could be allowed here, whih would lead to the form :Q184 for local entities.

Arguments for option(1):

  • The "id" field in API output has the same form as the expected input for API parameters an URLs.
  • We want to use prefixed IDs internally (nearly) everywhere where we currently have non-prefixed IDs. The notion of entity ID would be extended to include entities in other repos.
  • Clients that do not expect to see references to external entities would fail early, since they would not be able to parse prefixed IDs.
  • On a repo that is not federated with other repos, nothing changes, except that IDs become available as strings.

Arguments against (1):

  • The parser needs to detect whether the ID has a prefix or not.
  • IDs may not contain a colon (or a colon would need encoding or escaping).

Arguments for (2):

  • the "id" field never has a prefix, the "qname" field always has a prefix.
  • Clear distinction between "IDs" and "references".

Arguments against (2):

  • Clients that do not know about external entities may read the "id" field as being local, even if they are not.
  • IDs with the local prefix need to be accepted as input everywhere, including as parts of the URL.
  • What's the canonical form of the ID - with out without prefix? Should the canonical URI also contain the prefix?

@daniel's summary misses an option:

    1. { "id": "foo:Q184" }, but a local ID would still be { "id": "Q1" }. Main advantage: Repos like wikidata.org would not change at all when the software starts supporting such prefixed external IDs everywhere (not only in the data value we discuss here).
    2. Same, but local IDs would also be prefixed, e.g. { "id": "d:Q1" }. Advantage: The fact that IDs can be prefixed is explicit and obvious and can not be ignored. That's also a disadvantage: User code will break even when the repo the user cares about does not have external ids.
  1. { "id": "Q184", "repo": "foo" }. Main disadvantage: This changes the meaning of the existing "id" field.

I tend to agree that option 1.A is the best.

We continued discussing edge cases that involve colons in entity IDs. Solution: Prefix local IDs with a colon when they contain colons, e.g. { "id": ":Q1:2" }. We can and I think we should have support for this right away when we start working on this.

Change 299963 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Minimize EntityIdValue footprint when storing to the database

https://gerrit.wikimedia.org/r/299963

thiemowmde claimed this task.
thiemowmde removed a project: Patch-For-Review.

The issue this ticket describes was actually solved with https://github.com/wmde/WikibaseDataModel/pull/671. There are a few other patches like the linked https://gerrit.wikimedia.org/r/299963, but these are optional optimizations and cleanups.

Change 299963 abandoned by Thiemo Kreuz (WMDE):
Minimize EntityIdValue footprint when storing to the database

https://gerrit.wikimedia.org/r/299963