
DBQ-137 statistics of different languages
Closed, Resolved · Public

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-137.
Summary: statistics of different languages
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>


From: Minn Seok Choi <MinnSeok.Choi@gmail.com>

Date: Wed, 20 Apr 2011 19:21:49

I am not sure whether it is possible to retrieve data from the Wikipedia databases. If it is, I would like to get the following variables for each of the Wikipedias in the list below:

A. total number of pages in each namespace (excluding redirects)
(1) the number of article pages (i.e. main namespace pages)
(2) the number of talk pages
(3) the number of user pages
(4) the number of user talk pages
(5) the number of Wikipedia pages
(6) the number of Wikipedia talk pages
(7) the number of file pages
(8) the number of file talk pages
(9) the number of template pages
(10) the number of template talk pages
(11) the number of portal pages
(12) the number of portal talk pages
(13) the number of help pages
(14) the number of help talk pages

B. total number of edits in each namespace (excluding redirects)
(15) the number of edits to article pages (i.e. main namespace pages)
(16) the number of edits to talk pages
(17) the number of edits to user pages
(18) the number of edits to user talk pages
(19) the number of edits to Wikipedia pages
(20) the number of edits to Wikipedia talk pages
(21) the number of edits to file pages
(22) the number of edits to file talk pages
(23) the number of edits to template pages
(24) the number of edits to template talk pages
(25) the number of edits to portal pages
(26) the number of edits to portal talk pages
(27) the number of edits to help pages
(28) the number of edits to help talk pages

C. size of each namespace in bytes (excluding redirects)
(29) the size of article pages (i.e. main namespace pages)
(30) the size of talk pages
(31) the size of user pages
(32) the size of user talk pages
(33) the size of Wikipedia pages
(34) the size of Wikipedia talk pages
(35) the size of file pages
(36) the size of file talk pages
(37) the size of template pages
(38) the size of template talk pages
(39) the size of portal pages
(40) the size of portal talk pages
(41) the size of help pages
(42) the size of help talk pages

D. URL for certain pages
(43) the URL of the community portal page (if available)
(44) the URL of the village pump (if available)
(45) the URL of the help desk
(46) the URL of the featured article portal

the Wikipedia list (68 languages)

en English
de German
fr French
pl Polish
it Italian
ja Japanese
es Spanish
ru Russian
pt Portuguese
nl Dutch
sv Swedish
zh Chinese
ca Catalan
no Norwegian (Bokmål)
uk Ukrainian
fi Finnish
vi Vietnamese
cs Czech
hu Hungarian
ko Korean
ro Romanian
id Indonesian
tr Turkish
da Danish
ar Arabic
eo Esperanto
sr Serbian
lt Lithuanian
sk Slovak
he Hebrew
ms Malay
bg Bulgarian
sl Slovenian
hr Croatian
et Estonian
simple Simple English
th Thai
eu Basque
nn Norwegian (Nynorsk)
el Greek
az Azerbaijani
la Latin
tl Tagalog
te Telugu
ka Georgian
sh Serbo-Croatian
be-x-old Belarusian (Taraškievica)
lv Latvian
jv Javanese
sq Albanian
bs Bosnian
is Icelandic
ta Tamil
an Aragonese
oc Occitan
bn Bengali
ml Malayalam
af Afrikaans
ur Urdu
zh-yue Cantonese
ast Asturian
yo Yoruba
wa Walloon
yi Yiddish
uz Uzbek
li Limburgian
ia Interlingua
szl Silesian
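
The script later in this thread addresses each wiki by a database name built from its language code, writing the hyphenated codes as `be_x_old` and `zh_yue` and appending `wiki_p`. A minimal Python sketch of that mapping follows, assuming that replacing hyphens with underscores is the only transformation needed. The function names `db_name` and `db_host` are illustrative, and the host pattern is simply copied from the script rather than verified.

```python
def db_name(lang_code: str) -> str:
    """Return the replica database name for a Wikipedia language code."""
    # Assumption: hyphens become underscores, as in 'be_x_old' and 'zh_yue'.
    return lang_code.replace("-", "_") + "wiki_p"


def db_host(lang_code: str) -> str:
    """Return the host name, following the pattern used in the thread's script."""
    # Assumption: the script applies the same underscore form to the host name.
    return lang_code.replace("-", "_") + "wiki-p.rrdb.toolserver.org"


for code in ("en", "be-x-old", "zh-yue"):
    print(code, db_name(code), db_host(code))
```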


Version: unspecified
Severity: major

Details

Reference
bz59394

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:28 AM
bzimport set Reference to bz59394.

From: Hoo man <hoo@online.de>

Date: Fri, 22 Apr 2011 18:39:33

The following are feasible: 1-14 and (maybe) 29-42.
Please confirm that this data alone is useful to you, and please give me the language codes (like en for English, sq for Albanian) for the languages above (I'm too lazy to look them up myself :P ).

From: Minn Seok Choi <MinnSeok.Choi@gmail.com>

Date: Sat, 23 Apr 2011 08:54:04

Thanks, Hoo man. 1-14 and 29-42 are useful for me. I updated my query request by adding the language codes, following your comment.


From: Hoo man <hoo@online.de>

Date: Sun, 24 Apr 2011 18:08:20

OK, fine, thanks for the language codes :)
Code (did it in PHP because I was once again too lazy for bash :P):

#!/bin/php
<?php
// Language codes in database-name form (be_x_old and zh_yue use underscores).
$langcodes = array('en', 'de', 'fr', 'pl', 'it', 'ja', 'es', 'ru', 'pt', 'nl', 'sv', 'zh', 'ca', 'no', 'uk', 'fi', 'vi', 'cs', 'hu', 'ko', 'ro', 'id', 'tr', 'da', 'ar', 'eo', 'sr', 'lt', 'sk', 'he', 'ms', 'bg', 'sl', 'hr', 'et', 'simple', 'th', 'eu', 'nn', 'el', 'az', 'la', 'tl', 'te', 'ka', 'sh', 'be_x_old', 'lv', 'jv', 'sq', 'bs', 'is', 'ta', 'an', 'oc', 'bn', 'ml', 'af', 'ur', 'zh_yue', 'ast', 'yo', 'wa', 'yi', 'uz', 'li', 'ia', 'szl');
// The results of all wikis are appended to this one file.
$file = '../public_html/dbq/dbq-137.txt';
foreach($langcodes as $lang) {
	// Per requested namespace: page count and summed page length in bytes, excluding redirects.
	$query = 'SELECT /* SLOW_OK */ \'' . $lang . '\' as lang, page_namespace, COUNT(*) as page_count, SUM(page_len) as namespace_size FROM page WHERE page_namespace IN(0,1,2,3,4,5,6,7,10,11,100,101,12,13) AND page_is_redirect = 0 GROUP BY page_namespace;';
	echo 'Executing "' . $query .'" on ' . $lang . "wiki_p\n";
	// Run the query against this wiki's replica and append the output to the result file.
	exec('mysql --host=' . $lang . 'wiki-p.rrdb.toolserver.org --database=' . $lang . 'wiki_p -e"' . $query . '" >> ' . $file);
}
?>
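
To see what the query returns without access to the replicas, here is a small Python sketch that runs the same aggregation on an in-memory SQLite table. The toy rows are invented for illustration. Only the column names (`page_namespace`, `page_len`, `page_is_redirect`) and the shape of the query come from the script above. The namespace labels assume the default MediaWiki numbering, with 100 and 101 as Portal and Portal talk. Those two are custom per-wiki namespaces, so the assumption may not hold on every wiki in the list.

```python
import sqlite3

# Requested namespaces; labels are an assumption based on default MediaWiki numbering.
NAMESPACES = {
    0: "article", 1: "talk", 2: "user", 3: "user talk",
    4: "Wikipedia", 5: "Wikipedia talk", 6: "file", 7: "file talk",
    10: "template", 11: "template talk", 12: "help", 13: "help talk",
    100: "portal", 101: "portal talk",
}


def namespace_stats(conn, lang):
    """Count pages and sum page_len per requested namespace, excluding redirects."""
    placeholders = ",".join("?" * len(NAMESPACES))
    sql = (
        "SELECT ? AS lang, page_namespace, COUNT(*) AS page_count, "
        "SUM(page_len) AS namespace_size FROM page "
        f"WHERE page_namespace IN ({placeholders}) AND page_is_redirect = 0 "
        "GROUP BY page_namespace ORDER BY page_namespace"
    )
    return conn.execute(sql, (lang, *NAMESPACES)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_namespace INTEGER, page_len INTEGER, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [
        (0, 1200, 0),
        (0, 800, 0),
        (0, 30, 1),     # redirect, excluded from the statistics
        (1, 500, 0),
        (8, 40, 0),     # namespace 8 was not requested, so it is filtered out
        (100, 2500, 0),
    ],
)

rows = namespace_stats(conn, "en")
for lang, ns, count, size in rows:
    print(lang, NAMESPACES[ns], count, size)
```

Each result row carries the language code, so the output of all wikis can be appended to one file and still be told apart.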

Result:
http://toolserver.org/~hoo/dbq/dbq-137.txt (plain text)
http://toolserver.org/~hoo/dbq/dbq-137.csv (Excel readable csv)
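
How the CSV file was produced is not stated in this thread. One way to do it is sketched below in Python, assuming the text file holds tab-separated `mysql` batch output with the column header repeated once per wiki, which is what the client prints when its output is redirected. The sample values and the function name `batch_output_to_csv` are made up for illustration.

```python
import csv
import io

# Column names as selected by the query in the script.
HEADER = ["lang", "page_namespace", "page_count", "namespace_size"]


def batch_output_to_csv(text: str) -> str:
    """Convert concatenated tab-separated output to CSV with a single header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and the header that is repeated for every wiki.
        if not line.strip() or fields == HEADER:
            continue
        writer.writerow(fields)
    return out.getvalue()


# Invented sample: two wikis, each preceded by its own header line.
sample = (
    "lang\tpage_namespace\tpage_count\tnamespace_size\n"
    "en\t0\t2\t2000\n"
    "lang\tpage_namespace\tpage_count\tnamespace_size\n"
    "de\t0\t1\t700\n"
)
print(batch_output_to_csv(sample))
```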


From: Minn Seok Choi <MinnSeok.Choi@gmail.com>

Date: Mon, 25 Apr 2011 19:53:25

Thank you so much, Hoo man.

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: hoo@online.de
CC list: hoo@online.de