DBQ-122 Create interlanguage links for entire Wikipedia
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-122.
Summary: Create interlanguage links for entire Wikipedia
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: merl <mewikipedia@to.mabomuja.de>


From: Thomasmp <tm.petzold@gmail.com>

Date: Tue, 01 Feb 2011 11:36:07

Hi,

Please construct a two-way linking table for interlanguage links amongst all Wikipedia languages.

SQL query (per wiki): SELECT ll_lang, COUNT(*) AS c FROM langlinks GROUP BY ll_lang ORDER BY c DESC;

For each of the languages, repeat executing the same query.

Please generate the output in simple plain csv format.

If possible, please add two further columns beside each language pair, listing a) the total number of articles at the time of the query, and b) the total number of speakers.

Many thanks!
Thomas


Version: unspecified
Severity: major

Details

Reference
bz59375

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:26 AM
bzimport set Reference to bz59375.

From: merl <mewikipedia@to.mabomuja.de>

Date: Tue, 01 Feb 2011 19:19:56

Please note the following about what you requested:

  • column3=interwiki_count contains the number of all interwiki links from one project to one language, counted across ALL namespaces
  • column4=article_count contains the number of pages in content namespaces ONLY. This is ns 0 in most wikis, except arwiki

I don't know what you mean by "the sum total of number of speakers", so I left out this column. The column ll_lang is cut after 10 letters, so zh-classical becomes zh-classic.
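
As an aside: if you wanted the link count restricted to the content namespace instead of all namespaces, a query along these lines should work. This is only a sketch, assuming the standard MediaWiki langlinks/page schema and ns 0 as the content namespace:

-- count only links whose source page is in the main namespace (ns 0)
SELECT ll_lang, COUNT(*) AS interwiki_count_ns0
FROM langlinks
JOIN page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang
ORDER BY ll_lang;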

Here is the code I used to run this on SGE:

#! /bin/bash
# SGE job settings: name the job and reserve one SQL slot on each of the three database clusters
#$ -N DBQ-122
#$ -hard
#$ -l sqlprocs-s1=1
#$ -l sqlprocs-s2=1
#$ -l sqlprocs-s3=1
# space-separated list of all open, non-meta Wikipedia databases, without the "wiki_p" suffix
wikilist=`mysql -wBN -hsql-toolserver toolserver -e "SELECT GROUP_CONCAT(REPLACE(dbname,'wiki_p','') SEPARATOR ' ') FROM wiki WHERE family='wikipedia' AND is_closed=0 AND is_meta=0 ORDER BY dbname"`
# store the wiki's article count in @articles, then emit one CSV row per target language
SQL="SELECT ss_good_articles FROM site_stats INTO @articles;
     SELECT CONCAT('%WIKI%,', ll_lang, ',', COUNT(*), ',', @articles) AS csv FROM langlinks GROUP BY ll_lang ORDER BY ll_lang"
echo "wiki,interwiki,interwiki_count,article_count" > DBQ-122.txt
for runwiki in $wikilist
do
  echo "querying $runwiki"
  # database names use underscores, host names and language codes use hyphens
  runwiki2=`echo "$runwiki" | sed "s/_/-/g"`
  SQL2=`echo "$SQL" | sed "s/%WIKI%/$runwiki2/g"`
  mysql -wBN -h${runwiki2}wiki-p.rrdb ${runwiki}wiki_p -e "$SQL2" >> DBQ-122.txt
done

The output can be found at http://toolserver.org/~merl/sql/DBQ-122.txt
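
For anyone consuming that file, here is a minimal sketch of pulling out one language pair, assuming the four-column header written by the script above (the language codes are just examples):

# links from dewiki to French, plus dewiki's article count at query time
awk -F, '$1 == "de" && $2 == "fr" { print $3, $4 }' DBQ-122.txt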


From: Thomasmp <tm.petzold@gmail.com>

Date: Wed, 09 Feb 2011 23:40:32

Thanks, that looks great! What I meant by "the sum total of number of speakers" was the sum of primary and secondary speakers of each wiki's language.

Also, is it possible to run the same query for a specific time in the past? The above query was run on Feb 1, 2011; could you run it on the data from Feb 1, 2010, Feb 1, 2009, and Feb 1, 2008, for example?


From: Hoo man <hoo@online.de>

Date: Thu, 10 Feb 2011 15:53:50

The count of speakers of a language is hard to get, because there is no central API or database for that; it might be best to fetch those numbers yourself and add them by hand.
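
A sketch of what that could look like, assuming a hand-maintained speakers.csv (a hypothetical file with lang,speakers rows and no header) next to the DBQ-122.txt produced above:

# sort both files on the language-code column, then join them on it
tail -n +2 DBQ-122.txt | sort -t, -k2,2 > links.sorted
sort -t, -k1,1 speakers.csv > speakers.sorted
join -t, -1 2 -2 1 links.sorted speakers.sorted > DBQ-122-with-speakers.csv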

And no, the query can't be run for a time in the past, because we don't have the tables from back then. (OK, after an import of all the old DB dumps it might be possible, but that isn't really feasible, and not only because of the disk space it would need.)


From: Thomasmp <tm.petzold@gmail.com>

Date: Fri, 11 Feb 2011 00:05:46

Thanks for considering importing old dumps. I understand your concerns about disk space, but would it be possible to run the query at just two specific times in the past, 1/2/03 and 1/2/07 (thereby reducing the required data size considerably)? This would still be very valuable, thanks!


From: Hoo man <hoo@online.de>

Date: Fri, 11 Feb 2011 00:40:57

You may not have fully understood that! It would force us to load dumps of all databases, which just can't be done in a few hours, and probably not with the given resources at all.

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: bugreporter@to.mabomuja.de
CC list: bugreporter@to.mabomuja.de, hoo@online.de