Can you please make access statistics for the exports at https://www.wikidata.org/entity/Q1.json and similar available? The available formats are json, rdf, n3, xml and ttl.
Statistics split by format would be most useful.
Details
- Reference: bz62874
Subject | Repo | Branch | Lines +/-
---|---|---|---
EntityData usage tracking | analytics/limn-wikidata-data | master | +83 -2
Status | Assigned | Task
---|---|---
Invalid | None | T108931 [Epic] Improve metrics and statistics for wikidata
Duplicate | None | T117203 [WD] External usage KPI
Resolved | Addshore | T64874 [Story] Statistics for Special:EntityData usage
Event Timeline
I reckon prio 'high' is for assessment of requirements and doability. So without further ado some questions here:
Lydia, can you please explain in more detail what this is about? The link above just points to some structured data file, without any further explanation. Is this a data dump for one article? Just guessing.
Statistics split by format: do I understand correctly that you want as many monthly totals as there are formats, with no further granularity? (I hope so.)
Where to find those numbers? Is there a table or API log which stores API requests, that you know of? Or should we be looking at general traffic logs? We have 1:1000 sampled squid log reports (that would only work if API requests come in by the hundreds of thousands per month; also, those reports are more or less frozen in a partially functional state, as new infrastructure for traffic analysis is still expected to happen soonish).
Thanks for follow-up.
bingle-admin wrote:
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1491
(In reply to Erik Zachte from comment #2)
Lydia, can you please explain in more detail what this is about? The link
above just points to some structured data file, without any further
explanation. Is this a data dump for one article? Just guessing.
Yes. https://www.wikidata.org/entity/Q1.json is part of wikidata's linked data interface. Basically, https://www.wikidata.org/entity/<id>[.<format>] URLs allow access to the machine readable description of an entity in the given format. If the format is not given, content negotiation is applied.
In the end, these URLs are resolved to a redirect (303 and/or 302) to wiki/Special:EntityData with the appropriate parameters. E.g. the example above results in a redirect to https://www.wikidata.org/wiki/Special:EntityData/Q1.json
I suppose that is already counted, but only as a single count, not for each entity/format.
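For illustration only (a sketch, using the wmf.webrequest Hive table and fields that appear in the queries further down this thread), the redirect traffic generated by these /entity/ URLs could be spot-checked with something along these lines:
-- Sketch: how often /entity/ URLs are answered with each HTTP status on one
-- example day. Assumes the wmf.webrequest table described later in this task.
SELECT
  http_status,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/entity/.*$'
GROUP BY http_status
ORDER BY requests DESC
LIMIT 100;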
Statistics split by format. Do I understand correctly you want as many
monthly totals as there are formats, no further granularity (I hope so).
From Lydia's original description, I gather that we are not interested in per-entity counters, but only per-format ones. Considering that the format is not always given explicitly in the original URL, it would probably be easiest to look at requests for wiki/Special:EntityData/*.<format> and base the statistics on that.
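A minimal sketch of what such a per-format count could look like, assuming the wmf.webrequest Hive table and fields used in the queries further down this thread (illustration only, for a single example day):
-- Sketch: Special:EntityData requests per format suffix for one day.
-- Assumes the wmf.webrequest table described later in this task.
SELECT
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY requests DESC
LIMIT 100;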
Where to find those numbers? Is there a table or api log which stores api
requests, that you know of? Or should we be look at general traffic logs?
That's our question to you (and Dario, I guess). But this has nothing to do with the API. This is a special purpose URL path that gets resolved to a special page. So I guess looking at the general purpose web logs should work.
I've just had another person ask for this (a third-party researcher who wants to use it for a workshop). It'd be really great to get this published.
Adam: Can you get us an internal overview from hadoop?
Just as a sample, here is the breakdown by returned content type for a single hour:
Requests | Content type
---|---
6 | text/html; charset=UTF-8
2460 | application/rdf+xml; charset=UTF-8
2473 | application/vnd.php.serialized; charset=UTF-8
2479 | application/n-triples; charset=UTF-8
2487 | text/n3; charset=UTF-8
8820 | application/json; charset=UTF-8
36716 | text/turtle; charset=UTF-8

2015-11-10 01:00 UTC
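A per-hour breakdown like this can presumably be produced from the request logs with something along the following lines (a sketch only; it assumes the content_type field and hour partition of wmf.webrequest, and is not necessarily the exact query that was run):
-- Sketch: requests to Special:EntityData for one hour, grouped by the content
-- type that was returned. Assumes wmf.webrequest fields as used further down.
SELECT
  content_type,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 10 AND hour = 1
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY content_type
ORDER BY requests
LIMIT 100;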
Hi, quick questions on that:
Is the need regular, or would one-shot queries do?
Also, what level of aggregation? Is daily good?
Below is a Hive query that produces a daily aggregation over (what I thought were) interesting dimensions.
DISCLAIMER: these queries need to scan a BIG volume of data (about 500 GB per day), so let's discuss how to handle this if you need regular updates.
SELECT
  CONCAT(LPAD(year, 4, 0), '-', LPAD(month, 2, 0), '-', LPAD(day, 2, 0)) AS day,
  regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1) AS entity_format,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS special_entity_format,
  access_method,
  agent_type,
  http_status,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^(/entity/|/wiki/Special:EntityData/).*$'
GROUP BY
  year, month, day, access_method, agent_type, http_status,
  regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1),
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, entity_format, special_entity_format, access_method, agent_type, http_status
LIMIT 100000;
day | entity_format | special_entity_format | access_method | agent_type | http_status | count
---|---|---|---|---|---|---
2015-11-16 | | | desktop | spider | 200 | 2
2015-11-16 | | | desktop | spider | 301 | 345473
2015-11-16 | | | desktop | spider | 302 | 75
2015-11-16 | | | desktop | spider | 303 | 312186
2015-11-16 | | | desktop | spider | 400 | 21
2015-11-16 | | | desktop | spider | 503 | 2
2015-11-16 | | | desktop | user | 200 | 18
2015-11-16 | | | desktop | user | 301 | 1398
2015-11-16 | | | desktop | user | 302 | 38
2015-11-16 | | | desktop | user | 303 | 2714
2015-11-16 | | | desktop | user | 400 | 25
2015-11-16 | | | desktop | user | 429 | 2
2015-11-16 | | .json | desktop | spider | 200 | 719297
2015-11-16 | | .json | desktop | spider | 301 | 501004
2015-11-16 | | .json | desktop | spider | 304 | 10315
2015-11-16 | | .json | desktop | spider | 400 | 7
2015-11-16 | | .json | desktop | spider | 404 | 1777
2015-11-16 | | .json | desktop | spider | 503 | 4
2015-11-16 | | .json | desktop | user | 200 | 7675
2015-11-16 | | .json | desktop | user | 301 | 97
2015-11-16 | | .json | desktop | user | 302 | 10
2015-11-16 | | .json | desktop | user | 304 | 1017
2015-11-16 | | .json | desktop | user | 400 | 1
2015-11-16 | | .json | desktop | user | 404 | 2
2015-11-16 | | .json | desktop | user | 429 | 42
2015-11-16 | | .n3 | desktop | spider | 200 | 65982
2015-11-16 | | .n3 | desktop | spider | 301 | 1952
2015-11-16 | | .n3 | desktop | spider | 304 | 17417
2015-11-16 | | .n3 | desktop | spider | 404 | 13
2015-11-16 | | .n3 | desktop | spider | 503 | 1
2015-11-16 | | .n3 | desktop | user | 200 | 169
2015-11-16 | | .n3 | desktop | user | 302 | 12
2015-11-16 | | .nt | desktop | spider | 200 | 65717
2015-11-16 | | .nt | desktop | spider | 301 | 2045
2015-11-16 | | .nt | desktop | spider | 304 | 7927
2015-11-16 | | .nt | desktop | spider | 404 | 13
2015-11-16 | | .nt | desktop | spider | 503 | 1
2015-11-16 | | .nt | desktop | user | 200 | 203
2015-11-16 | | .nt | desktop | user | 302 | 15
2015-11-16 | | .org/resource/ | desktop | user | 400 | 4
2015-11-16 | | .php | desktop | spider | 200 | 65394
2015-11-16 | | .php | desktop | spider | 301 | 2048
2015-11-16 | | .php | desktop | spider | 304 | 7745
2015-11-16 | | .php | desktop | spider | 404 | 22
2015-11-16 | | .php | desktop | user | 200 | 168
2015-11-16 | | .php | desktop | user | 302 | 14
2015-11-16 | | .rdf | desktop | spider | 200 | 66840
2015-11-16 | | .rdf | desktop | spider | 301 | 2088
2015-11-16 | | .rdf | desktop | spider | 304 | 13343
2015-11-16 | | .rdf | desktop | spider | 404 | 17
2015-11-16 | | .rdf | desktop | spider | 503 | 4
2015-11-16 | | .rdf | desktop | user | 200 | 182
2015-11-16 | | .rdf | desktop | user | 302 | 10
2015-11-16 | | .ttl | desktop | spider | 200 | 867900
2015-11-16 | | .ttl | desktop | spider | 301 | 2069
2015-11-16 | | .ttl | desktop | spider | 304 | 17560
2015-11-16 | | .ttl | desktop | spider | 404 | 35
2015-11-16 | | .ttl | desktop | spider | 503 | 24
2015-11-16 | | .ttl | desktop | user | 200 | 181
2015-11-16 | | .ttl | desktop | user | 302 | 13
2015-11-16 | | .ttl | desktop | user | 304 | 1
2015-11-16 | .json | | desktop | spider | 301 | 9089
2015-11-16 | .json | | desktop | spider | 303 | 4551
2015-11-16 | .json | | desktop | user | 303 | 20
2015-11-16 | .org/resource/ | | desktop | user | 301 | 4
2015-11-16 | .org/resource/ | | desktop | user | 303 | 4
2015-11-16 | .ttl | | desktop | user | 303 | 2
Daily would be good.
Grouped by format.
We can probably ignore /entity/ as they should all redirect to Special:EntityData.
It would also be great to have this running (perhaps with all possible historical data; I think that is about a month's worth) by the end of this year!
@Addshore: Do you have access to cluster 1002 to run queries yourself? Timeline-wise, if you need this before the end of the year it might be faster if you start working on it while we help you get changes going.
@Nuria yes I do!
I should be able to do this but of course if there is any chance of your team doing it that would be great!
Otherwise this will probably be one of the later things on my list!
@Addshore: It is on our backlog, but we have several things before it so we cannot give an ETA. For now, I suggest that 1) you do some ad-hoc querying and get the data you need to meet your end-of-December deadline, and 2) we work together on oozification of this job later.
This is so our team doesn't block you and you can have your data for the dev summit; that is a different deliverable from having an Oozie job that calculates this data on a fixed interval.
Let me know if this sounds OK.
I have reduced the above query to only show what we want to look at:
SELECT
  CONCAT(LPAD(year, 4, 0), '-', LPAD(month, 2, 0), '-', LPAD(day, 2, 0)) AS day,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  agent_type,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 12 AND day = 22
  AND http_status = 200
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY year, month, day, agent_type,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, format, agent_type
LIMIT 100000;
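For the oozification mentioned above, a scheduled version would presumably just replace the hard-coded date with variables supplied by the job. A rough sketch, assuming Hive substitution variables (e.g. passed with hive -d or as Oozie parameters; the actual job may well look different):
-- Sketch: same aggregation as above, parameterised by date for a daily run.
-- ${year}, ${month}, ${day} are assumed to be supplied by the scheduler.
SELECT
  CONCAT(LPAD(${year}, 4, 0), '-', LPAD(${month}, 2, 0), '-', LPAD(${day}, 2, 0)) AS day,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  agent_type,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = ${year} AND month = ${month} AND day = ${day}
  AND http_status = 200
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY year, month, day, agent_type,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, format, agent_type
LIMIT 100000;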
Change 260576 had a related patch set uploaded (by Addshore):
EntityData usage tracking