
CirrusSearch shouldn't use SQL to figure out page weight
Closed, Resolved · Public

Description

CirrusSearch shouldn't use SQL to figure out page weight. The queries are too slow.


Version: unspecified
Severity: normal
Whiteboard: cirrus_reenable

Details

Reference
bz56798

Event Timeline

bzimport raised the priority of this task to Unbreak Now!. Nov 22 2014, 2:21 AM
bzimport set Reference to bz56798.

While I'm thinking about this - we should probably go about this by pushing the outbound links into Elasticsearch and then counting them on page update. That should be reasonably simple, but it will require a code push and an index rebuild to get the links in, then another code push.
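For illustration, the per-update count could be a single Elasticsearch _count request over whatever field holds the outbound links. A minimal sketch follows; the index name "wiki_content" and field name "outgoing_link" are assumptions for the example, not necessarily the real CirrusSearch mapping:

# Sketch: count a page's incoming links by asking Elasticsearch how many
# documents list it as an outgoing link. Index and field names are assumed.
import json
import urllib.request

def incoming_link_count(es_url, index, title):
    query = {"query": {"term": {"outgoing_link": title}}}
    req = urllib.request.Request(
        "%s/%s/_count" % (es_url, index),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

print(incoming_link_count("http://localhost:9200", "wiki_content", "Main_Page"))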

My only question is whether we continue to use the database method during initial recovery and population, or switch to a two-pass approach for both. I'm inclined to stick with the database-based method because it simplifies initial index loading a ton. Because we control it, we can make sure it doesn't run rampant. Also, it is less work, so we can get this out the door sooner. We can always go with the two-pass approach later, and we really won't have cost ourselves much in the meantime.

It might also have to use SELECT 1 ... LIMIT X, where X is some reasonable cap, to short-circuit the query.
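Roughly, the idea is to fetch at most X marker rows instead of scanning every row with COUNT(*), and treat "we got X rows back" as "at least X links". A sketch, assuming a MySQLdb-style cursor and the pagelinks columns (the cap of 1000 is arbitrary):

# Sketch of the short-circuit: fetch at most `cap` rows rather than
# counting everything. len() == cap means "cap or more incoming links",
# which is enough for a bounded weight. Connection setup omitted.
def clamped_link_count(cursor, ns, title, cap=1000):
    cursor.execute(
        "SELECT 1 FROM pagelinks"
        " WHERE pl_namespace = %s AND pl_title = %s LIMIT %s",
        (ns, title, cap),
    )
    return len(cursor.fetchall())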

Alternatively, one could page through the rows for the count (if this is for a population script). BacklinkCache::partition() already does that. Since maintenance scripts turn off DBO_TRX, each paging query would be a separate implicit server-side transaction, so row purging could happen between each SELECT, unlike with the single huge COUNT(*) SELECT we have now.

Something like this:

last_max = 0
while count > 0:

    SELECT MAX(pl_from) AS last_max, COUNT(*) AS count
    FROM (SELECT pl_from FROM pagelinks
          WHERE pl_namespace = $ns$ AND pl_title = $title$ AND pl_from > $last_max$
          ORDER BY pl_from LIMIT 10000) AS batch

?
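Spelled out as a runnable loop, that is roughly the following sketch (pagelinks column names and the batch size are assumptions; cursor setup omitted, and with DBO_TRX off each execute() runs as its own implicit transaction):

# Sketch: accumulate the incoming-link count in 10k-row batches so no
# single statement holds a long read view open.
def batched_link_count(cursor, ns, title, batch=10000):
    total, last_max = 0, 0
    while True:
        cursor.execute(
            "SELECT MAX(b.pl_from), COUNT(*) FROM ("
            "  SELECT pl_from FROM pagelinks"
            "  WHERE pl_namespace = %s AND pl_title = %s AND pl_from > %s"
            "  ORDER BY pl_from LIMIT %s) AS b",
            (ns, title, last_max, batch),
        )
        max_id, count = cursor.fetchone()
        total += count
        if count < batch:  # short batch (or empty) means we're done
            return total
        last_max = max_id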

We're sure we'll get rid of the SQL-based counting in the normal update case, but in the population/outage-recovery case (both in-process and job-queue based) I was thinking of keeping it (or modifying it as you suggest). The idea is that SQL-based counting will be right even if Elasticsearch is badly out of date, and it'll certainly be out of date in the population case. Without it we'd need a second pass at populating Elasticsearch to count the links, which just seems complicated/burdensome/nasty.

I had a look at BacklinkCache a while ago, but it looked like it was pulling all the backlinks into memory to count them. That didn't seem great.

BacklinkCache::getLinks() buffers the whole query result in RAM and iterates over it to build the Titles. BacklinkCache::partition() batches large queries and stores neither the full query result nor the Titles anywhere; it just keeps small arrays of ranges (and row-count integers).
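A generic sketch of that partition() access pattern (not the actual BacklinkCache implementation; pagelinks column names assumed): walk the backlinks in batches and keep only (start, end, count) tuples per batch, never the rows or Titles themselves.

# Memory stays O(number of batches), not O(number of links).
def partition_backlinks(cursor, ns, title, batch=1000):
    ranges, last = [], 0
    while True:
        cursor.execute(
            "SELECT MIN(b.pl_from), MAX(b.pl_from), COUNT(*) FROM ("
            "  SELECT pl_from FROM pagelinks"
            "  WHERE pl_namespace = %s AND pl_title = %s AND pl_from > %s"
            "  ORDER BY pl_from LIMIT %s) AS b",
            (ns, title, last, batch),
        )
        lo, hi, n = cursor.fetchone()
        if n == 0:
            return ranges
        ranges.append((lo, hi, n))
        last = hi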

Now that we've disabled CirrusSearch in production, we can actually do this without a full index rebuild.

So we still have to decide if we want to use BacklinkCache for maintenance scripts or use a two-pass build. I kind of prefer a two-pass build because that has the least impact on the database.

(In reply to comment #5)

So we still have to decide if we want to use BacklinkCache for maintenance
scripts or use a two-pass build. I kind of prefer a two-pass build because
that has the least impact on the database.

Yes, let's. I think operations will appreciate it ;-)