Page MenuHomePhabricator

uncategorized pages report speedup hack
Closed, InvalidPublic

Description

Author: bugzilla_wikipedia_org.to.jamesd

Description:
One approach to speed up uncategorized pages. Note
that it is NOT optimal - ideally there would be an
index for cur on cur_is_redirect. The lack of that
forces loading of the full record of cur - if there
was an index, we could take the value directly from
the index without reading the main record at all.
Better still, there would be an index on
cur_namespace, cur_is_redirect and cur_title. That
index could be used for many other queries, using
cur_is_redirect in (0,1) when it doesn't matter
whether it's a redirect or not. Field order makes a
speed difference in indexes and I haven't evaluated
whether cur_namespace or cur_is_redirect first is
faster.

See bug 250 for a periodically generated intermediate
table approach which would make this cheaper without
changing the cur table. This query could run against
that intermediate table. We'd want one index on cur
which contained every field in the intermediate table
or we'd want cur updates to update the intermediate
table directly (using insert delayed so they wouldn't
block due to MyISAM poor locking if a query was
running). With the text field removed from cur we
wouldn't need this index on cur - it's just to dodge
the time it takes to read all of the current full cur
record.

However, we don't have this yet, so:

  1. create a table with just the cur_id - will load id

from that index, so it's very fast. Will include
records with the wrong namespace. Get them in batches
of 50,000 and perform steps 2+3 for each batch.

  1. join that with categorylinks to find the ones

which have no category. Also very fast.

  1. check the resulting list against cur to eliminate

the redirects - has to read quite a large chunk of
cur. This one can be done in small chunks - the
sample here doesn't bother.
3b repeat 1/2/3 until all batches are in the
temporary table.

  1. finally the query we want

This requires setup and cleanup steps which aren't
present in the current framework which calls the
uncategorized pages code in

I don't recommend implementing this - use the
intermediate table approach instead. It's a lot less
messy than this hack.

/* 1. First build a list of cur records in namespace
0 using the namespace index */
/* can use limit 0,50000 then 51000,100000 etc. to
meet the batching */
/* requirement discussed later */
/* 607593 for en */
CREATE temporary TABLE cur_namespace0 (

cur_namespace0_id int(8) unsigned NOT NULL,
UNIQUE KEY id (cur_namespace0_id)

) TYPE=MyISAM PACK_KEYS=0;
insert into cur_namespace0 (cur_namespace0_id)
select cur_id from cur where cur_namespace=0;

/* 2. now join this with category links to get list
of cur records which */
/* need to be checked against cur records for not
is_redirect and to get title */
/* order by cur_id so the subsequent check of cur
will read cur in */
/* order by the primary/clustered key, the fastest
way */
CREATE temporary TABLE uncat_nocat (

uncat_nocat_id int(8) unsigned NOT NULL,
UNIQUE KEY id (uncat_nocat_id),

) TYPE=MyISAM PACK_KEYS=0;
insert into uncat_nocat (uncat_nocat_id)
SELECT cur_namespace0_id
FROM cur_namespace0 LEFT JOIN categorylinks ON
cur_namespace0_id=cl_from
WHERE cl_from IS NULL
order by cur_namespace0_id;
drop table cur_namespace0;

/* 3. now we must read cur itself to eliminate the
redirects */
/* this hurts - 438803 records to check for en. */
/* for en gets 176703 results in 4 min 17 sec on suda
*/
/* should really do this in batches of 50,000 or so
uncat_nocat records */
/* uncat_nocat rather than cur because it's OK to
scan uncat_nocat repeatedly */
/* but we gain nothing by using chunks if we use cur
chunks - want one */
/* and only one scan of cur, even if we do it in
pieces */
/* cur_id is used to avoid more costly check of
cur_title for */
/* duplicates, which we would have to do if it was
primary key. */
CREATE temporary TABLE nocat (

nocat_id int(8) unsigned NOT NULL,
nocat_namespace tinyint(2) unsigned NOT NULL

default '0',

nocat_title varchar(255) binary NOT NULL default '',
UNIQUE KEY id (nocat_id),
KEY title (nocat_title)

) TYPE=MyISAM PACK_KEYS=0;
insert into nocat (nocat_id, nocat_namespace,
nocat_title)
SELECT uncat_nocat_id, cur_namespace, cur_title from
cur, uncat_nocat
WHERE cur_id=uncat_nocat_id and cur_is_redirect=0;
drop table uncat_nocat;

/* 4. finally the query we are really after */
SELECT 'Uncategorizedpages' as type, nocat_namespace
AS namespace, nocat_title AS title, nocat_title AS
value
FROM nocat
ORDER BY nocat_title /*limit 50*/;

/* cleanup */
drop table nocat;


Version: unspecified
Severity: normal

Details

Reference
bz449

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 6:49 PM
bzimport set Reference to bz449.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

Bumped: Is this relevant with our new schema?

While UncategorizedPages still is a crappy query, absolutely none of the above applies (cur and friends don't even exist anymore). Closing as INVALID.