Can you please make access statistics for the exports at https://www.wikidata.org/entity/Q1.json and similar available? The available formats are json, rdf, n3, xml and ttl.
Statistics split by format would be most useful.
Details
- Reference: bz62874
Subject | Repo | Branch | Lines +/-
---|---|---|---
EntityData usage tracking | analytics/limn-wikidata-data | master | +83 -2
Status | Assigned | Task
---|---|---
Invalid | None | T108931 [Epic] Improve metrics and statistics for wikidata
Duplicate | None | T117203 [WD] External usage KPI
Resolved | Addshore | T64874 [Story] Statistics for Special:EntityData usage
Event Timeline
I reckon prio 'high' is for assessment of requirements and doability. So without further ado some questions here:
Lydia, can you please explain in more detail what this is about? The link above just points to some structured data file, without any further explanation. Is this a data dump for one article? Just guessing.
Statistics split by format: do I understand correctly that you want as many monthly totals as there are formats, with no further granularity? (I hope so.)
Where to find those numbers? Is there a table or API log which stores API requests, that you know of? Or should we be looking at general traffic logs? We have 1:1000 sampled squid log reports (that would only work if API requests come in by the hundreds of thousands per month; also, those reports are more or less frozen in a partially functional state, as new infrastructure for traffic analysis is still expected to happen soonish).
Thanks for follow-up.
bingle-admin wrote:
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1491
(In reply to Erik Zachte from comment #2)
Lydia, can you please explain in more detail what this is about? The link
above just points to some structured data file, without any further
explanation. Is this a data dump for one article? Just guessing.
Yes. https://www.wikidata.org/entity/Q1.json is part of wikidata's linked data interface. Basically, https://www.wikidata.org/entity/<id>[.<format>] URLs allow access to the machine readable description of an entity in the given format. If the format is not given, content negotiation is applied.
In the end, these URLs are resolved to a redirect (303 and/or 302) to wiki/Special:EntityData with the appropriate parameters. E.g. the example above results in a redirect to https://www.wikidata.org/wiki/Special:EntityData/Q1.json
I suppose that is already counted, but only as a single count, not for each entity/format.
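For illustration only (a sketch, using the wmf.webrequest Hive table and fields that appear in the queries further down this thread), the redirect traffic generated by these /entity/ URLs could be spot-checked with something along these lines:
-- Sketch: how often /entity/ URLs are answered with each HTTP status on one
-- example day. Assumes the wmf.webrequest table described later in this task.
SELECT
  http_status,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/entity/.*$'
GROUP BY http_status
ORDER BY requests DESC
LIMIT 100;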
Statistics split by format. Do I understand correctly you want as many
monthly totals as there are formats, no further granularity (I hope so).
From Lydia's original description, I gather that we are not interested in per-entity counters, but only per-format ones. Considering that the format is not always given explicitly in the original URL, it would probably be easiest to look at requests for wiki/Special:EntityData/*.<format> and base the statistics on that.
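A minimal sketch of what such a per-format count could look like, assuming the wmf.webrequest Hive table and fields used in the queries further down this thread (illustration only, for a single example day):
-- Sketch: Special:EntityData requests per format suffix for one day.
-- Assumes the wmf.webrequest table described later in this task.
SELECT
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY requests DESC
LIMIT 100;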
Where to find those numbers? Is there a table or api log which stores api
requests, that you know of? Or should we be look at general traffic logs?
That's our question to you (and Dario, I guess). But this has nothing to do with the API. This is a special purpose URL path that gets resolved to a special page. So I guess looking at the general purpose web logs should work.
I've just had another person ask for this (a third-party researcher who wants to use it for a workshop). It'd be really great to get this published.
Adam: Can you get us an internal overview from hadoop?
Just as a sample, here is the breakdown by returned content type for a single hour:
Requests | Content type
---|---
6 | text/html; charset=UTF-8
2460 | application/rdf+xml; charset=UTF-8
2473 | application/vnd.php.serialized; charset=UTF-8
2479 | application/n-triples; charset=UTF-8
2487 | text/n3; charset=UTF-8
8820 | application/json; charset=UTF-8
36716 | text/turtle; charset=UTF-8

2015-11-10 01:00 UTC
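A per-hour breakdown like this can presumably be produced from the request logs with something along the following lines (a sketch only; it assumes the content_type field and hour partition of wmf.webrequest, and is not necessarily the exact query that was run):
-- Sketch: requests to Special:EntityData for one hour, grouped by the content
-- type that was returned. Assumes wmf.webrequest fields as used further down.
SELECT
  content_type,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 10 AND hour = 1
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY content_type
ORDER BY requests
LIMIT 100;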
Hi, quick questions on that:
Is the need regular, or would one-shot queries do?
Also, what level of aggregation? Is daily good?
Below is a Hive query that produces a daily aggregation over (what I thought were) interesting dimensions.
DISCLAIMER: these queries need to scan a BIG volume of data (about 500 GB per day), so let's discuss how to handle this if you need regular updates.
SELECT
  CONCAT(LPAD(year, 4, 0), '-', LPAD(month, 2, 0), '-', LPAD(day, 2, 0)) AS day,
  regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1) AS entity_format,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS special_entity_format,
  access_method,
  agent_type,
  http_status,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 11 AND day = 16
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^(/entity/|/wiki/Special:EntityData/).*$'
GROUP BY
  year, month, day, access_method, agent_type, http_status,
  regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1),
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, entity_format, special_entity_format, access_method, agent_type, http_status
LIMIT 100000;
day | entity_format | special_entity_format | access_method | agent_type | http_status | count
---|---|---|---|---|---|---
2015-11-16 | | | desktop | spider | 200 | 2
2015-11-16 | | | desktop | spider | 301 | 345473
2015-11-16 | | | desktop | spider | 302 | 75
2015-11-16 | | | desktop | spider | 303 | 312186
2015-11-16 | | | desktop | spider | 400 | 21
2015-11-16 | | | desktop | spider | 503 | 2
2015-11-16 | | | desktop | user | 200 | 18
2015-11-16 | | | desktop | user | 301 | 1398
2015-11-16 | | | desktop | user | 302 | 38
2015-11-16 | | | desktop | user | 303 | 2714
2015-11-16 | | | desktop | user | 400 | 25
2015-11-16 | | | desktop | user | 429 | 2
2015-11-16 | | .json | desktop | spider | 200 | 719297
2015-11-16 | | .json | desktop | spider | 301 | 501004
2015-11-16 | | .json | desktop | spider | 304 | 10315
2015-11-16 | | .json | desktop | spider | 400 | 7
2015-11-16 | | .json | desktop | spider | 404 | 1777
2015-11-16 | | .json | desktop | spider | 503 | 4
2015-11-16 | | .json | desktop | user | 200 | 7675
2015-11-16 | | .json | desktop | user | 301 | 97
2015-11-16 | | .json | desktop | user | 302 | 10
2015-11-16 | | .json | desktop | user | 304 | 1017
2015-11-16 | | .json | desktop | user | 400 | 1
2015-11-16 | | .json | desktop | user | 404 | 2
2015-11-16 | | .json | desktop | user | 429 | 42
2015-11-16 | | .n3 | desktop | spider | 200 | 65982
2015-11-16 | | .n3 | desktop | spider | 301 | 1952
2015-11-16 | | .n3 | desktop | spider | 304 | 17417
2015-11-16 | | .n3 | desktop | spider | 404 | 13
2015-11-16 | | .n3 | desktop | spider | 503 | 1
2015-11-16 | | .n3 | desktop | user | 200 | 169
2015-11-16 | | .n3 | desktop | user | 302 | 12
2015-11-16 | | .nt | desktop | spider | 200 | 65717
2015-11-16 | | .nt | desktop | spider | 301 | 2045
2015-11-16 | | .nt | desktop | spider | 304 | 7927
2015-11-16 | | .nt | desktop | spider | 404 | 13
2015-11-16 | | .nt | desktop | spider | 503 | 1
2015-11-16 | | .nt | desktop | user | 200 | 203
2015-11-16 | | .nt | desktop | user | 302 | 15
2015-11-16 | | .org/resource/ | desktop | user | 400 | 4
2015-11-16 | | .php | desktop | spider | 200 | 65394
2015-11-16 | | .php | desktop | spider | 301 | 2048
2015-11-16 | | .php | desktop | spider | 304 | 7745
2015-11-16 | | .php | desktop | spider | 404 | 22
2015-11-16 | | .php | desktop | user | 200 | 168
2015-11-16 | | .php | desktop | user | 302 | 14
2015-11-16 | | .rdf | desktop | spider | 200 | 66840
2015-11-16 | | .rdf | desktop | spider | 301 | 2088
2015-11-16 | | .rdf | desktop | spider | 304 | 13343
2015-11-16 | | .rdf | desktop | spider | 404 | 17
2015-11-16 | | .rdf | desktop | spider | 503 | 4
2015-11-16 | | .rdf | desktop | user | 200 | 182
2015-11-16 | | .rdf | desktop | user | 302 | 10
2015-11-16 | | .ttl | desktop | spider | 200 | 867900
2015-11-16 | | .ttl | desktop | spider | 301 | 2069
2015-11-16 | | .ttl | desktop | spider | 304 | 17560
2015-11-16 | | .ttl | desktop | spider | 404 | 35
2015-11-16 | | .ttl | desktop | spider | 503 | 24
2015-11-16 | | .ttl | desktop | user | 200 | 181
2015-11-16 | | .ttl | desktop | user | 302 | 13
2015-11-16 | | .ttl | desktop | user | 304 | 1
2015-11-16 | .json | | desktop | spider | 301 | 9089
2015-11-16 | .json | | desktop | spider | 303 | 4551
2015-11-16 | .json | | desktop | user | 303 | 20
2015-11-16 | .org/resource/ | | desktop | user | 301 | 4
2015-11-16 | .org/resource/ | | desktop | user | 303 | 4
2015-11-16 | .ttl | | desktop | user | 303 | 2
Daily would be good.
Grouped by format.
We can probably ignore /entity/ as they should all redirect to Special:EntityData.
It would also be great to have this running (perhaps with all possible historical data; I think that is about a month's worth) by the end of this year!
@Addshore: Do you have access to cluster 1002 to run queries yourself? Timeline-wise, if you need this before the end of the year it might be faster if you start working on it while we help you get changes going.
@Nuria yes I do!
I should be able to do this but of course if there is any chance of your team doing it that would be great!
Otherwise this will probably be one of the later things on my list!
@Addshore: It is on our backlog, but we have several things before it so we cannot give an ETA. For now, I suggest that 1) you do some ad-hoc querying and get the data you need to meet your end-of-December deadline, and 2) we work together on oozification of this job later.
This is so our team doesn't block you and you can have your data for the dev summit; that is a different deliverable from having an Oozie job that calculates this data on a fixed interval.
Let me know if this sounds OK.
I have reduced the above query to only show what we want to look at:
SELECT
  CONCAT(LPAD(year, 4, 0), '-', LPAD(month, 2, 0), '-', LPAD(day, 2, 0)) AS day,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  agent_type,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015 AND month = 12 AND day = 22
  AND http_status = 200
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY year, month, day, agent_type,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, format, agent_type
LIMIT 100000;
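For the oozification mentioned above, a scheduled version would presumably just replace the hard-coded date with variables supplied by the job. A rough sketch, assuming Hive substitution variables (e.g. passed with hive -d or as Oozie parameters; the actual job may well look different):
-- Sketch: same aggregation as above, parameterised by date for a daily run.
-- ${year}, ${month}, ${day} are assumed to be supplied by the scheduler.
SELECT
  CONCAT(LPAD(${year}, 4, 0), '-', LPAD(${month}, 2, 0), '-', LPAD(${day}, 2, 0)) AS day,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
  agent_type,
  COUNT(1) AS count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = ${year} AND month = ${month} AND day = ${day}
  AND http_status = 200
  AND normalized_host.project_class = 'wikidata'
  AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY year, month, day, agent_type,
  regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY day, format, agent_type
LIMIT 100000;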
Change 260576 had a related patch set uploaded (by Addshore):
EntityData usage tracking