
Add 300 000 wikia wikis to stats table
Closed, Resolved · Public

Description

Split out from bug 36133.

Add 300 000 Wikias from https://github.com/WikiTeam/wikiteam/raw/master/listsofwikis/mediawiki/wikia.com (taken from the API)


Version: unspecified
Severity: enhancement

Details

Reference
bz36291

Related Objects

Status      Subtype    Assigned    Task
Resolved               Dzahn
Resolved               Dzahn

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:27 AM
bzimport set Reference to bz36291.

Daniel, given that Wikia doesn't provide the data (and even if they did, we'd never be sure to get it regularly), can we please do this and update the stats again? They're horribly old.
Isn't there a way to throttle API requests?
Can I help in some way to get it done (besides talking to Wikia, which is a waste of everybody's time)?

The count of currently active wikis can be retrieved from:
http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=1000000&wkcountonly=1&wkactive=1
You can download the list with queries like the following, in at least two batches:
http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=500000&format=json
(this one is 14 MB).
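For reference, a minimal Python sketch of those two queries, using only the parameters visible in the URLs above; the response layout isn't documented in this task, so it just prints whatever comes back:

import json
import urllib.parse
import urllib.request

API = "http://community.wikia.com/api.php"

def query(extra):
    # Only parameters that appear in the two URLs above are used here.
    params = dict(extra, action="query", list="wkdomains", format="json")
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Count of currently active wikis (first URL above).
count = query({"wkto": 1000000, "wkcountonly": 1, "wkactive": 1})
print(json.dumps(count, indent=2))

# One ~14 MB batch of the domain list (second URL above).
batch = query({"wkto": 500000})
print("top-level keys in this batch:", list(batch))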

@Nemo_bis it might take a couple of weeks to run that update. We might first have to change the update script to do multiple requests at once, maybe using curl_multi? http://php.net/manual/en/function.curl-multi-getcontent.php
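For illustration, here is that "multiple requests at once" idea roughly sketched in Python rather than with PHP's curl_multi; the URLs are placeholders, not what the update script actually requests:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder URLs; the real update script builds one API URL per wiki.
urls = [
    "http://community.wikia.com/api.php?action=query&meta=siteinfo&format=json",
    "http://domainofheroes.wikia.com/api.php?action=query&meta=siteinfo&format=json",
]

def fetch(url):
    # Return (url, body), or (url, None) if the request fails.
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()
    except OSError:
        return url, None

# A small worker pool fetches several wikis at once while keeping the
# request rate modest (the throttling concern raised above).
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, "ok" if body else "failed")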

> @Nemo_bis it might take a couple of weeks to run that update. We might first have to change the update script to do multiple requests at once, maybe using curl_multi?

IMHO it's not an issue if it takes a couple weeks to update the full table. After all, stats.wikimedia.org is updated monthly. :)

Nemo_bis set Security to None.

I extracted the domains from http://community.wikia.com/api.php?action=query&list=wkdomains&wkto=500000&format=json using jq and grep and got 119675 wikis.

That's not right, you have to use at least three batches up to 1.5M and filter by wkactive=1.
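A sketch of that batched, active-only extraction in Python; the wkfrom parameter and the JSON layout are assumptions (only wkto appears in the URLs above), so adjust if the real API differs:

import json
import urllib.parse
import urllib.request

API = "http://community.wikia.com/api.php"
BATCH = 500000        # batch size used in the queries above
UPPER = 1500000       # "at least three batches up to 1.5M"

def get_batch(wkfrom, wkto):
    # wkfrom is an assumption; only wkto appears in this task.
    params = urllib.parse.urlencode({
        "action": "query", "list": "wkdomains", "format": "json",
        "wkactive": 1, "wkfrom": wkfrom, "wkto": wkto,
    })
    with urllib.request.urlopen(API + "?" + params) as resp:
        return json.load(resp)

domains = set()       # a set silently drops duplicate domains
for start in range(0, UPPER, BATCH):
    data = get_batch(start + 1, start + BATCH)
    # Assumed layout: {"query": {"wkdomains": {"<id>": {"domain": ...}}}}
    for entry in data.get("query", {}).get("wkdomains", {}).values():
        if isinstance(entry, dict) and entry.get("domain"):
            domains.add(entry["domain"].lower())

print(len(domains), "unique active domains")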

It also has duplicates. Example:

domainofheroes.wikia.com
domainofheroes.wikia.com

> That's not right, you have to use at least three batches up to 1.5M and filter by wkactive=1.

How? Well, for now I deleted the horribly old list and imported those ~120k. We seem to run into a limit here as well: with over 99 pages, page 100 is the first page again:

http://wikistats.wmflabs.org/display.php?t=wi&s=good_desc&p=99
http://wikistats.wmflabs.org/display.php?t=wi&s=good_desc&p=100

I don't see any duplicates in the WikiTeam repo list (updated in 72d43634d566d4a2cc1601324b24b5eeac79eb13). I can update that list again if you want to use it.

$ wc -l wikia.com
410373 wikia.com
$ sort -u < wikia.com | wc -l
410373

IMHO pagination is not a huge issue as long as sorting works; I think it does, though the list currently doesn't include all the biggest wikis because of the partial import. http://wikistats.wmflabs.org/display.php?t=wi&s=ausers_desc

Ah, yeah, there was just that one duplicate. Not sure how it got in there.

@Nemo_bis feel like making a new list with all 300k in one file, with nothing except the names/subdomains? That would be helpful.
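Stripping the existing wikia.com file down to bare subdomain names would only take a few lines anyway; a sketch assuming one plain domain per line (file names are illustrative):

# Turn "domainofheroes.wikia.com" lines into bare subdomain names.
with open("wikia.com") as src, open("wikia_subdomains.txt", "w") as out:
    for line in src:
        domain = line.strip().lower()
        if domain.endswith(".wikia.com"):
            out.write(domain[:-len(".wikia.com")] + "\n")
        elif domain:
            out.write(domain + "\n")   # keep anything unexpected as-is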

$ python wikia.py
Traceback (most recent call last):
  File "wikia.py", line 24, in <module>
    from wikitools import wiki, api
ImportError: No module named wikitools

Nevermind, there is already a generated list in the repo.

git clone https://github.com/WikiTeam/wikiteam.git

~/wikiteam/listsofwikis/mediawiki$ wc -l wikia.com
410373 wikia.com

we can use that.
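The import itself boils down to loading that file into the stats table; the real wikistats schema isn't shown in this task, so the following is only an illustrative sketch against a hypothetical SQLite table:

import sqlite3

# Hypothetical table and column names; the real wikistats schema differs.
conn = sqlite3.connect("wikistats.db")
conn.execute("CREATE TABLE IF NOT EXISTS wikia (prefix TEXT PRIMARY KEY)")

with open("wikia.com") as f:
    domains = sorted({line.strip().lower() for line in f if line.strip()})

# INSERT OR IGNORE skips the odd duplicate instead of aborting the import.
conn.executemany("INSERT OR IGNORE INTO wikia (prefix) VALUES (?)",
                 [(d,) for d in domains])
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM wikia").fetchone()[0], "rows")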

Imported! Number of wikis:

before: 115212

now: 410703

Now gotta run a loooong update, I guess.