Page MenuHomePhabricator

Create XML category dumps
Closed, ResolvedPublic

Description

Author: Keisial

Description:
Dump category xml

dumpBackup can currently export the wiki pages, full or current. r40968 added support for the logs and bug 16082 allows to include the authorship list on current dumps.

But there's still one information needed for normal display not available yet on any xml: Categories.
Unless you are to render every article, obtaining full information about a Category involves merging data from four different table dumps.

The above patch addresses this by creating xml category dumps.


Version: 1.14.x
Severity: enhancement

attachment dump_category_xml.patch ignored as obsolete

Details

Reference
bz16176

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:17 PM
bzimport set Reference to bz16176.

Keisial wrote:

Structure of XML category dumps

attachment Formato técnico categorías.xml ignored as obsolete

Keisial wrote:

Structure of XML category dumps

Attached:

Keisial wrote:

Dump category xml

New version, since the older didn't always achieve what it should.
Still, it'd be nice to be able to get the hidden attribute in a more efficient way, specially if it worked with mysql 4.0. As it is now, <hidden> would be skipped on installs like wikipedia.

Attached:

Picking this bug up as I'm the most likely to push it along.

I'm pretty uneasy about running this complicated query with 4 tables and a subquery to boot, at least for full dumps. For manual exports of a small set of pages via Special:Export it might be just fine. Anyone with more query-fu than me want to weigh in?

sumanah wrote:

(In reply to comment #4)

Created attachment 5553 [details]
Dump category xml

Thanks for the patch. Would it be reasonable to put it into Gerrit?

Attached:

sumanah wrote:

Adding Asher for query-fu. :-)

ArielGlenn added a subscriber: Springle.

adding springle. what do you think about this?

Subqueries on modern MySQL or MariaDB are not so painful. The 4-way join with filesort is the biggest worry.

EXPLAIN select cl_from, cl_to, page_title, page_namespace, cl_sortkey, cl_timestamp, cat_id, cat_subcats, cat_pages, cat_files, pp_value
from page
inner join categorylinks on cl_from = page_id
left join category on cat_title = cl_to
left join page_props on pp_page = (
    select page_id from page where page_title = cl_to and page_namespace = 10
) and pp_propname = 'hiddencat'
order by cl_to;

*************************** 1. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: page                                                                                                                                        
         type: index                                                                                                                                       
possible_keys: PRIMARY                                                                                                                                     
          key: name_title                                                                                                                                  
      key_len: 261                                                                                                                                         
          ref: NULL                                                                                                                                        
         rows: 35013959                                                                                                                                    
        Extra: Using index; Using temporary; Using filesort                                                                                                
*************************** 2. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: categorylinks                                                                                                                               
         type: ref                                                                                                                                         
possible_keys: cl_from                                                                                                                                     
          key: cl_from                                                                                                                                     
      key_len: 4                                                                                                                                           
          ref: enwiki.page.page_id                                                                                                                         
         rows: 2                                                                                                                                           
        Extra:                                                                                                                                             
*************************** 3. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: category                                                                                                                                    
         type: eq_ref                                                                                                                                      
possible_keys: cat_title                                                                                                                                   
          key: cat_title                                                                                                                                   
      key_len: 257                                                                                                                                         
          ref: enwiki.categorylinks.cl_to                                                                                                                  
         rows: 1                                                                                                                                           
        Extra:                                                                                                                                             
*************************** 4. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: page_props                                                                                                                                  
         type: eq_ref                                                                                                                                      
possible_keys: PRIMARY,pp_propname_page,pp_propname_sortkey_page                                                                                           
          key: PRIMARY                                                                                                                                     
      key_len: 66                                                                                                                                          
          ref: func,const                                                                                                                                  
         rows: 1                                                                                                                                           
        Extra: Using where                                                                                                                                 
*************************** 5. row ***************************                                                                                             
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,enwiki.categorylinks.cl_to
         rows: 1
        Extra: Using index
5 rows in set (0.28 sec)

The patch includes a USE INDEX cl_sortkey but it isn't clear that helps much as it still does a huge filesort. Simpler and faster would be to ORDER BY page_id which Ariel said might be acceptable:

EXPLAIN select cl_from, cl_to, page_title, page_namespace, cl_sortkey, cl_timestamp, cat_id, cat_subcats, cat_pages, cat_files, pp_value
from page
inner join categorylinks on cl_from = page_id
left join category on cat_title = cl_to
left join page_props on pp_page = (
    select page_id from page where page_title = cl_to and page_namespace = 10)
and pp_propname = 'hiddencat'
order by page_id\G

*************************** 1. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: page                                                                                                                                        
         type: index                                                                                                                                       
possible_keys: PRIMARY                                                                                                                                     
          key: PRIMARY                                                                                                                                     
      key_len: 4                                                                                                                                           
          ref: NULL                                                                                                                                        
         rows: 35013987                                                                                                                                    
        Extra:                                                                                                                                             
*************************** 2. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: categorylinks                                                                                                                               
         type: ref                                                                                                                                         
possible_keys: cl_from                                                                                                                                     
          key: cl_from                                                                                                                                     
      key_len: 4                                                                                                                                           
          ref: enwiki.page.page_id                                                                                                                         
         rows: 2                                                                                                                                           
        Extra:                                                                                                                                             
*************************** 3. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: category                                                                                                                                    
         type: eq_ref                                                                                                                                      
possible_keys: cat_title                                                                                                                                   
          key: cat_title                                                                                                                                   
      key_len: 257                                                                                                                                         
          ref: enwiki.categorylinks.cl_to                                                                                                                  
         rows: 1                                                                                                                                           
        Extra:                                                                                                                                             
*************************** 4. row ***************************                                                                                             
           id: 1                                                                                                                                           
  select_type: PRIMARY                                                                                                                                     
        table: page_props                                                                                                                                  
         type: eq_ref                                                                                                                                      
possible_keys: PRIMARY,pp_propname_page,pp_propname_sortkey_page                                                                                           
          key: PRIMARY                                                                                                                                     
      key_len: 66                                                                                                                                          
          ref: func,const                                                                                                                                  
         rows: 1                                                                                                                                           
        Extra: Using where                                                                                                                                 
*************************** 5. row ***************************                                                                                             
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,enwiki.categorylinks.cl_to
         rows: 1
        Extra: Using index
5 rows in set (0.28 sec)

Also:

  • Full example SQL incase I've made bad assumptions translating the PHP would help :-)
  • Worth investigating if the subquery can somehow be moved from the join order into the field list (since only one pp_value field is needed, it could simulate a left join) -- that tends to make fewer chances for optimizer screw-ups.
  • The page_namespace = N moving into the outer query somehow would help, then ORDER BY page_title.

Category dumps are available in rdf format now; https://dumps.wikimedia.org/other/categoriesrdf/ Can you get the required data from these?

ArielGlenn claimed this task.

I haven't heard back from the reporter so I'm closing this ticket. If it turns out that more is needed, please feel free to re-open.