
EntityObject::equals should make a more consistent, strict comparison of entity IDs
Closed, Resolved, Public

Description

Right now EntityObject::equals( entity ) returns true if one entity has an ID set and the other entity's ID is null. If both entities have an ID set but the IDs differ, false is returned. This seems kind of strange.
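To make the oddity concrete, here is a minimal sketch of the behaviour described above. It is written in Python rather than the project's PHP, and the `Entity` class, its `labels` field and the sample values are hypothetical stand-ins, not the actual EntityObject code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Hypothetical stand-in for an entity; not the real EntityObject."""
    id: Optional[int] = None
    labels: dict = field(default_factory=dict)

    def equals(self, other: "Entity") -> bool:
        # Lenient ID check as described in this report:
        # IDs only count when both sides have one.
        if self.id is not None and other.id is not None and self.id != other.id:
            return False
        return self.labels == other.labels


stored = Entity(id=42, labels={"en": "Berlin"})
fresh = Entity(labels={"en": "Berlin"})          # no ID yet
other = Entity(id=7, labels={"en": "Berlin"})

print(stored.equals(fresh))   # True
print(fresh.equals(other))    # True
print(stored.equals(other))   # False
```

With this rule `fresh` equals both `stored` and `other`, yet `stored` does not equal `other`, so the comparison is not transitive. That is probably why it feels strange.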

A better solution could be to add a function comparing the content of two entities only, ignoring the ID entirely, e.g. EntityObject::same( entity ).


Version: unspecified
Severity: normal

Details

Reference
bz40295

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:56 AM
bzimport set Reference to bz40295.
bzimport added a subscriber: Unknown Object (MLST).

The intention was to allow an item that was already stored in the database (and thus has an ID) to be equal to a newly created "volatile" item that doesn't have an ID.

Using a different function for this kind of comparison is cleaner - I was trying to make this opaque to the caller, but that was probably a silly idea. So, I agree that we should go this route, but I would really like to have descriptive names. Between "equals" and "same", it's kind of non-obvious which does which. So, how about equals() for "total" equality and hasSameContent() for the version that ignores the ID?
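The proposed split could look roughly like the sketch below. It is Python rather than PHP, so `has_same_content` stands in for the suggested hasSameContent(); the `Entity` class and its `labels` field are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Entity:
    """Hypothetical stand-in for an entity; not the real Wikibase class."""
    id: Optional[int] = None
    labels: dict = field(default_factory=dict)

    def has_same_content(self, other: "Entity") -> bool:
        # Compares the content only and ignores the ID entirely.
        return self.labels == other.labels

    def equals(self, other: "Entity") -> bool:
        # "Total" equality: the IDs must match exactly (None only equals None)
        # and the content must match as well.
        return self.id == other.id and self.has_same_content(other)
```

A stored item and a fresh copy without an ID would then have the same content but would not be equal, and the caller picks whichever check it actually means.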

We could perhaps use "similarity"? "Equals" is somehow triple equality and "same" is somehow double equality. Similarity is used for correlation with higher order statistics. We could say that we use a correlation function (or functions) to measure the similarity, and that the function(s) run over a number of properties (or all of them). With a little thought we could make something fairly efficient that basically runs over a limited set of properties and tests whether they are the same. If enough of them are the same, the two entities are "similar". It is also possible to extend such similarity measures in several ways, for example by using the Levenshtein distance between strings used as properties instead of a double equality.
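A rough sketch of what such a measure might look like, assuming entities are reduced to a flat mapping from property name to string value. The function names, the `max_distance` cut-off and the scoring rule (fraction of properties that are close enough) are all assumptions for illustration; nothing like this is specified in the thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: dict, b: dict, max_distance: int = 1) -> float:
    """Fraction of properties whose values are within max_distance edits.

    A property missing from either side counts as a mismatch.
    """
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    close = sum(
        1 for k in keys
        if k in a and k in b and levenshtein(a[k], b[k]) <= max_distance
    )
    return close / len(keys)
```

Two entities could then be called "similar" when the score exceeds some threshold. Choosing that threshold is the application-dependent part.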

It's not as difficult as it seems.

(In reply to comment #2)

We could perhaps use "similarity"?

The notion of "similarity" and/or semantic proximity of items (topics) is very interesting and useful for information retrieval, natural language processing, etc.; but I don't think that it is what we need here. Similarity is a complex beast and how it should be defined depends heavily on what it is intended to be used for, so I think the notion should be defined on the application level.

The equality function under consideration here is primarily used to decide whether a new version of an item is the same as the previous one, i.e. it's used to check whether an edit is a "null edit" and should thus be omitted from the page history. That's a fairly low-level operation and should be pretty strict.

Hm... thinking about equals() vs. hasSameContent(): Note that we are free to define the equals function in Entity as we like, but the equals function in EntityContent is defined by the Content interface and used by WikiPage::doEditContent (which is called by EntityContent::save). It's used to determine whether the new content is the same as the previous content, in which case a "null edit" is triggered and no new revision is created in the database.

The reason I made equals lenient about missing IDs is this: if I want to store a new revision of an item, and to do so I construct a fresh Entity and EntityContent and then tell WikiPage to save it, the new content object may not yet have an ID attached. So equals() would fail without need when comparing the new version to the previous one.

Hm... this could also be solved by forcing the new content to have an ID. This should probably be done anyway, for consistency. And if it already has an ID that is different from the previous item's ID, the save should probably fail - I can't think of a valid reason to allow this.
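A minimal sketch of that idea, in Python rather than PHP: the save step adopts the previous ID when the new object has none, fails on a mismatch, and only then does a strict comparison to detect a null edit. The `Entity` class, the `save` function and the `history` list are hypothetical and only model the flow; they are not the WikiPage or EntityContent API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in; the generated == compares id and labels strictly."""
    id: Optional[int] = None
    labels: tuple = ()


def save(previous: Entity, new: Entity, history: list) -> bool:
    """Append a revision; return False for a null edit."""
    if new.id is None:
        # Force the new content to carry the ID of the item being edited.
        new = replace(new, id=previous.id)
    elif new.id != previous.id:
        # A different ID means the caller is saving over the wrong item.
        raise ValueError(f"ID mismatch: {new.id} != {previous.id}")
    if new == previous:
        return False  # nothing changed, so no new revision
    history.append(new)
    return True
```

With the ID forced before the comparison, the equality check itself no longer needs to be lenient about missing IDs.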

What you are talking about here is not equality, and it seems to me that it is not even sameness; it is similarity. If we call such tests equality we introduce confusion to the soup and will get into problems later.

(In reply to comment #5)

What you are talking about here is not equality, and it seems to me that it is not even sameness; it is similarity.

How is it not equality? The function is intended to determine whether the data structures are semantically equivalent with respect to our data model. To me, this is the definition of equality.

Think about the naming of the function.