Page MenuHomePhabricator

CirrusSearch: We think CirrusSearch is doing two page parses on a page update and don't like that
Closed, ResolvedPublic

Description

We think CirrusSearch is doing two page parses on a page update and don't like that. We should do something about that but we're not sure what yet.


Version: unspecified
Severity: normal

Details

Reference
bz56968

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:33 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz56968.

So I think I more or less tracked it down. It can happen in two places:

  1. When we're called in SearchUpdate during a page edit, we're not making use of the already parsed version of the page. This makes page edits slower. It should be basically a non-issue though if we fix bug 57316
  1. Page redirects, when calling updateFromTitleAndText() and then updateFromTitle(), we we're likely throwing away any parser output we have on hand.

To fix this once and for all, I think I've got a solution in mind: I'm going to write a class that handles grabbing parser output and the parser cache so Cirrus won't have to care.

At the very least, we'd only end up with one parse per page in a given process.

I like the idea. I think it'll be less important when we get Bug 572316 because there will be fewer paths to take into the update process.

Oh, I still advocate doing it every though we're doing bug 57316.

So, the situation is far better than it was now that bug 57316 is fixed. There's tons of refactoring for that and other things have improved this.

We pass ParserOutput around everywhere now, and don't stampede the cache like we used to. We still toss the EditPage's output like I said in comment 1 (1), but we don't care anymore since we're using the job queue (and passing the output through the job queue would be bad).

Comment 1 (2) is fixed now

Other than bulk indexing jobs (where we bump the batch size to 10-ish), we shouldn't be doing any more than 1 parse per job now, which was the goal.

Marking this FIXED since we don't have a MOSTLY-FIXED-BUT-DONT-CARE :)

(In reply to comment #4)

Comment 1 (2) is fixed now

Pressed enter too soon. Meant to say:

Comment 1 (2) is fixed now with the job queue as well. We still index recursively, but since it's in the job queue we're not doing multiple in-process parses.