
Incremental indexing (OAI) randomly skips events
Closed, Resolved · Public

Description

Analysis of the lucene-search-2 code indicates a possible explanation for reports of page updates occasionally being missed, causing the previous version of a page to persist in the index indefinitely.

The main loop of IncrementalUpdater fetches OAI records from MediaWiki, 50 pages at a time, and uses the "from" timestamp parameter to advance through the update list. After each batch of pages, it takes the date from the <responseDate> element as the next value of the "from" parameter (see the sketch after the list below). This has the following flaws:

  • responseDate is the time at which the response is generated. If there is replication lag, the most recent timestamp available on the chosen slave may be several seconds in the past, so a window of events as long as the replication lag will be skipped.
  • responseDate and the "from" parameter have one-second resolution, and the English Wikipedia sees about 5 edits per second at peak. Some events may therefore appear in the database with a timestamp that IncrementalUpdater has already processed, because they were committed later within the same second.
  • Using the revision timestamp instead of responseDate would be an improvement. However, rev_timestamp and up_timestamp are generated before the transaction is committed, and the time taken to complete the transaction is unpredictable, so the order of rev_timestamp or up_timestamp values in the replication log will typically not be monotonic. The approach would also be highly sensitive to clock skew between the Apache servers.
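
A minimal Java sketch of the timestamp-driven loop described above. The names (OaiClient, OaiBatch, listRecords, index) are hypothetical stand-ins, not the actual lucene-search-2 API; the sketch only illustrates where events are lost.

  import java.util.List;

  // Hypothetical stand-in for the OAI repository client used by IncrementalUpdater.
  interface OaiClient {
      // "from" has one-second resolution, like the OAI-PMH parameter.
      OaiBatch listRecords(String fromTimestamp);
  }

  // responseDate is the time the response was generated, not the timestamp
  // of the newest record actually returned.
  record OaiBatch(String responseDate, List<String> records) {}

  class TimestampDrivenUpdater {
      static void run(OaiClient client, String startFrom) {
          String from = startFrom;
          while (true) {
              OaiBatch batch = client.listRecords(from); // up to 50 pages per batch
              index(batch.records());
              // FLAW: advancing by responseDate skips
              //   - events still behind replication lag on the chosen slave, and
              //   - events committed later within the same one-second window.
              from = batch.responseDate();
          }
      }

      static void index(List<String> records) {
          // stand-in for pushing the updated pages into the Lucene index
      }
  }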

The obvious solution is to use the sequence number (resumptionToken) to advance through the update list, instead of the timestamp.
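
For comparison, a sketch of the token-driven loop under the same assumptions (hypothetical names again; in OAI-PMH, each incomplete ListRecords response carries a resumptionToken identifying the exact position in the sequence):

  import java.util.List;

  // Hypothetical stand-in client: the token marks the exact position in the
  // update sequence, so replication lag and same-second commits cannot cause skips.
  interface TokenOaiClient {
      TokenBatch listRecords(String resumptionToken);
  }

  record TokenBatch(String nextResumptionToken, List<String> records) {}

  class TokenDrivenUpdater {
      static void run(TokenOaiClient client, String initialToken) {
          String token = initialToken;
          while (true) {
              TokenBatch batch = client.listRecords(token);
              index(batch.records());
              // Advance strictly by sequence number: the next request resumes
              // exactly after the last record delivered in this batch.
              token = batch.nextResumptionToken();
          }
      }

      static void index(List<String> records) {
          // stand-in for pushing the updated pages into the Lucene index
      }
  }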


Version: unspecified
Severity: major

Details

Reference
bz45266

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 1:16 AM)
bzimport set Reference to bz45266.
bzimport added a subscriber: Unknown Object (MLST).

Setting the assignee to Ram and changing the status to ASSIGNED, as he is working on this.

I07d5c2cedcd7550505b53d380d13111bd83e3216 I31bde27a7a64e0d9fff340843f56fe6c6d8a322a

Ram: Both of the patches mentioned above have been merged. Is anything more needed here (if so, what?), or can this bug report be closed as RESOLVED FIXED?

ram wrote:

Andre: The merged Lucene code has not yet been deployed; we should wait until that happens and no further issues are reported before closing this.

(In reply to comment #5)

Andre: The merged Lucene code has not yet been deployed; we should wait until that happens and no further issues are reported before closing this.

Does anybody know where I can track this deployment, or when it is scheduled?

Tim?

Adding Chad and Nik to the cc for their insight.

Adjusting severity: this is not "normal" but at least major (or even "critical", since in practice we don't rebuild the index, so search data may stay "lost" for years). However, if more work is needed, I doubt it will happen now that the focus is on CirrusSearch.

Do we plan to deploy this? I only did a quick review of it but at that level it looks sane.

The OAI code went out long ago. I honestly can't remember whether we ever pushed out a fixed lsearchd or not.

debian/changelog makes it look like we never pushed these changes out to lsearchd.

demon claimed this task.
demon set Security to None.

I'm marking this resolved. As I said earlier, we long since made the changes to OAI and they went out. lsearchd got the changes in master, but we never deployed them. We don't care about lsearchd anymore and we're going to decommission it, so I consider this "resolved by getting rid of the broken thing."