split out from 36133
Add 300 000 Wikias from https://github.com/WikiTeam/wikiteam/raw/master/listsofwikis/mediawiki/wikia.com (taken from the API)
Version: unspecified
Severity: enhancement
Status | Assigned | Task
---|---|---
Resolved | Dzahn | T38291 Add 300 000 wikia wikis to stats table
Resolved | Dzahn | T61943 Fix all the Wikia stats
Daniel, given that Wikia doesn't provide the data (and even if they did, we could never be sure of getting it regularly), can we please do this and update the stats again? They're horribly old.
Isn't there a way to throttle API requests?
Can I help in some way to get it done (besides talking to Wikia which is a waste of everybody's time)?
The count of currently active wikis can be retrieved from:
http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=1000000&wkcountonly=1&wkactive=1
You can download the list with queries like the following, in at least two batches:
http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=500000&format=json
(this one is 14 MB).
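For reference, extracting the domain names from one of those JSON batches only takes a few lines of Python. The response shape used here (a `wkdomains` map keyed by wiki id, each entry carrying a `domain` field) is an assumption based on the query above, so check it against a real response first:

```python
import json

# Hypothetical sample of the wkdomains response shape -- verify against a
# real api.php?action=query&list=wkdomains&format=json response.
sample = json.loads("""
{
  "query": {
    "wkdomains": {
      "177": {"domain": "gta.wikia.com"},
      "490": {"domain": "muppet.wikia.com"}
    }
  }
}
""")

def extract_domains(response):
    """Pull the bare domain names out of a wkdomains API response."""
    return sorted(d["domain"] for d in response["query"]["wkdomains"].values())

print(extract_domains(sample))
# -> ['gta.wikia.com', 'muppet.wikia.com']
```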
@Nemo_bis it might take a couple weeks to run that update. we might first have to change the update script to do multiple requests at once. maybe using curl_multi ? http://php.net/manual/en/function.curl-multi-getcontent.php
IMHO it's not an issue if it takes a couple weeks to update the full table. After all, stats.wikimedia.org is updated monthly. :)
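If the update script is changed to do several requests at once, the idea looks roughly like the following (sketched in Python with a thread pool rather than PHP's curl_multi, just to illustrate; `fetch_stats` is a placeholder, not the real update code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_stats(domain):
    # Placeholder: a real version would hit the wiki's API over HTTP
    # (e.g. with urllib) and parse the statistics out of the response.
    return {"domain": domain, "good": 0}

domains = ["gta.wikia.com", "muppet.wikia.com", "es.pokemon.wikia.com"]

# Run a handful of requests concurrently instead of one at a time;
# keep the pool small so the API isn't hammered.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch_stats, domains))

print(len(results))  # 3; pool.map keeps the input order
```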
I extracted the list from http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=500000&format=json using jq and grep, and got 119675 wikis.
That's not right, you have to use at least three batches up to 1.5M and filter by wkactive=1.
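The three batch URLs (500k windows up to 1.5M, active wikis only) can be generated mechanically. Note that `wkfrom` is an assumption here, complementing the `wkto` parameter seen in the URLs above; if the API only supports `wkto`, the windows would overlap and need client-side filtering:

```python
BASE = "http://community.wikia.com/api.php"
STEP = 500_000

def batch_urls(upper=1_500_000):
    """Build wkdomains query URLs in 500k windows, filtered to active wikis."""
    urls = []
    for wkfrom in range(0, upper, STEP):
        urls.append(
            f"{BASE}?action=query&list=wkdomains"
            f"&wkfrom={wkfrom}&wkto={wkfrom + STEP}"   # wkfrom is assumed
            f"&wkactive=1&format=json"
        )
    return urls

for u in batch_urls():
    print(u)  # three URLs covering 0-500k, 500k-1M, 1M-1.5M
```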
Hm, well: for now I deleted the horribly old list and imported those ~120k. We also seem to run into a pagination limit here, at 99 pages; page 100 just shows the first page again:
http://wikistats.wmflabs.org/display.php?t=wi&s=good_desc&p=99
http://wikistats.wmflabs.org/display.php?t=wi&s=good_desc&p=100
I don't see any duplicates in the WikiTeam repo list (updated in 72d43634d566d4a2cc1601324b24b5eeac79eb13). I can update that list again if you want to use it.
$ wc -l wikia.com
410373 wikia.com
$ sort -u < wikia.com | wc -l
410373
IMHO pagination is not a huge issue as long as sorting works; I think it does, though the list is currently not including all the biggest wikis due to the partial import of the list. http://wikistats.wmflabs.org/display.php?t=wi&s=ausers_desc
@Nemo_bis feel like making a new list with all 300k in one file and nothing else except the names/subdomains? that would be helpful.
https://github.com/WikiTeam/wikiteam/blob/master/listsofwikis/mediawiki/wikia.py should do something right?
$ python wikia.py
Traceback (most recent call last):
  File "wikia.py", line 24, in <module>
    from wikitools import wiki, api
ImportError: No module named wikitools
Nevermind, there is already a generated list in the repo.
git clone https://github.com/WikiTeam/wikiteam.git
~/wikiteam/listsofwikis/mediawiki$ wc -l wikia.com
410373 wikia.com
we can use that.
imported! number of wikis:
before: 115212
now: 410703
now gotta run a looong update, i guess