
Make domas' pageviews data available in semi-publicly queryable database format
Closed, Resolved (Public)

Description

This doesn't seem to be tracked yet.
It's been discussed countless times in the past few years: for all sorts of GLAM initiatives and any other initiative to improve content on the projects, we currently rely on Henrik's stats.grok.se data in JSON format, e.g. https://toolserver.org/~emw/index.php?c=wikistats , http://toolserver.org/~magnus/glamorous.php etc.
The data in domas' logs should be available for easy querying on the Toolserver databases and elsewhere, but previous attempts to create such a DB led nowhere as far as I know.

I suppose this is already one of the highest priorities in the analytics team's plans for the new infrastructure, but I wasn't able to confirm that from the public documents; in any case, it needs to be done sooner or later.

(Not in "Usage statistics" aka "Statistics" component because that's only about raw pageviews data.)


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz42259

Related Objects

Event Timeline


Thanks milimetric.

Not sure I understand the question. Isn't the request path / query part of a URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (In practice it's almost always UTF-8, but I don't think you have any control over it.)

I'll be more explicit: in the past people have mentioned issues aggregating stats for UTF-8 titles vs. percent-encoded titles, if I remember correctly. I don't know if there are further issues, but it would be nice for the field to be normalised.

It's a good point. For the moment, we assume page titles are UTF-8 encoded, and we percent-decode titles coming from both the URL path and the query string. An additional trick is applied to query strings: we change spaces to underscores to match titles from the URL path.
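
To make that concrete, here is a minimal Python sketch of the normalization just described (percent-decode as UTF-8, then spaces to underscores for query-string titles); the function is illustrative, not the actual pipeline code:

```python
from urllib.parse import unquote

def normalize_title(raw, from_query_string=False):
    # Percent-decode, assuming UTF-8 as described above; invalid byte
    # sequences are replaced rather than raising.
    title = unquote(raw, encoding='utf-8', errors='replace')
    # Query-string titles use spaces, path titles use underscores, so
    # normalize the former to match the latter.
    if from_query_string:
        title = title.replace(' ', '_')
    return title

print(normalize_title('Albert%20Einstein', from_query_string=True))
# -> Albert_Einstein
```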

Hi all: I just wanted to add a +1 interest in this thread from the Wikipedia Library. In part, we would like to be able to pair pageviews with external links (our best proxy for the relative impact of citations from partner resources). In general, we are trying to figure out the best strategies for tracking a) links to partner resources, b) which editors created those links, and c) the relative visibility of partner resources via their presence on pages. Having a cube that makes it easy to connect external links or some other variable (like DOIs) to page views would be ideal. I have created a bug in our larger link metrics collection task at: https://phabricator.wikimedia.org/T102855?workflow=102064

Thanks @Sadads. I think I remember other people here doing analytics work on citations and references; @DarTar, am I imagining things? Is there another cube that would be useful for what @Sadads is talking about?

Sorry for the long pause between updates. Here's where we are with our quarterly goal to put up a pageview API.

  • We have programmed the RESTBase endpoints and are getting ready to submit a pull request today or Monday; that's ahead of schedule and we're happy about it.
  • We believe the hardware we're decommissioning from the Kafka work can be repurposed for the RESTBase / Cassandra cluster that will host the Pageview API. Ops hasn't approved this yet, but we're optimistic, which is good because these things can take the most time.
  • We have started the puppet work to configure this new RESTBase cluster, and that's ahead of schedule too.
  • We ran into some bugs and incomplete work in normalizing page titles. Some of these fixes are done and others are scheduled, and we think we can go back and fix all the data we'd be putting into the API.

In short, I think we're on or ahead of schedule overall with no known blockers.

My top request would be the inclusion of mobile views in something similar to stats.grok.se. Currently it doesn't include mobile views, which, according to metrics WikiProject Medicine has created, means missing over 50% of the actual page views.

https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_medical_pages

You can drill down into mobile views a little bit on: https://vital-signs.wmflabs.org/ (just click the data breakdowns button on the left).

As for the ongoing pageview API work, the current endpoints don't have a mobile breakdown because, with such low mobile traffic on some articles, you could identify that an editor is using a mobile device. We are still talking about whether to include mobile views at the project level. In most cases it's OK, but some projects also have very few pageviews, and there we would have the same problem of identifying that active editors are probably using mobile devices to edit.

We're a bit torn on what to do with the Vital Signs breakdowns. They're also suffering from the same problem, and we should remove them, especially when you consider the Zero site.

Releasing data is hard :)

> As for the ongoing pageview API work, the current endpoints don't have a mobile breakdown because, with such low mobile traffic on some articles, you could identify that an editor is using a mobile device. We are still talking about whether to include mobile views at the project level. In most cases it's OK, but some projects also have very few pageviews, and there we would have the same problem of identifying that active editors are probably using mobile devices to edit.
>
> We're a bit torn on what to do with the Vital Signs breakdowns. They're also suffering from the same problem, and we should remove them, especially when you consider the Zero site.

Never mind this, edits that happen on the mobile site are tagged as "mobile" and that data is available publicly anyway. So I'll file a task to change our endpoints to expose mobile / not mobile.

Quick update: we're talking with ops and making all the necessary preparations. The code is mostly done from our point of view, but these next few weeks will be a reality check in terms of hardware, network setup, etc. I've added the puppetization of the pageview API deployment as a blocking task to this. A lot of work ahead, but exciting times! :)

Quick question that we'd love some opinions on. We have two choices for how to go forward with the "top" endpoint and we're not sure what would be most useful to folks consuming this. I asked this on analytics-l so feel free to ignore this if you're participating there. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

  • {project} means en.wikipedia, commons.wikimedia, etc.
  • {access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?

Clearly the second. It gives more information, allows you to leverage Varnish, and does not change. (Linking to some statistics on a wiki page and having that link show something different every day would be rather confusing.) Convenience can be left to the frontend.
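
To illustrate the difference, a hypothetical sketch of the two URL styles (the helper names are mine, not part of any spec): under Choice 1 the same URL silently means something different every day, while under Choice 2 a pinned range yields a stable, cacheable response.

```python
from datetime import date, timedelta

def top_url_choice1(project, access, days_in_the_past):
    # The same URL returns different data tomorrow -> hard to cache or cite.
    return f'/top/{project}/{access}/{days_in_the_past}'

def top_url_choice2(project, access, start, end):
    # Explicit dates: the response for this URL never changes.
    return f'/top/{project}/{access}/{start.isoformat()}/{end.isoformat()}'

end = date.today()
print(top_url_choice1('en.wikipedia', 'all-access', 30))
print(top_url_choice2('en.wikipedia', 'all-access', end - timedelta(days=30), end))
```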

It's fantastic that this is almost done. Will this project also provide downloadable dump files, roughly equivalent to those at http://dumps.wikimedia.org/other/pagecounts-ez/merged/, or will that URL continue to be the main source of aggregated page view dumps?

> Will this project also provide downloadable dump files, roughly equivalent to those at http://dumps.wikimedia.org/other/pagecounts-ez/merged/, or will that URL continue to be the main source of aggregated page view dumps?

This API won't provide aggregated dumps; we'll still publish those at dumps.wikimedia.org. But we're currently trying to simplify that site, since the differences between the various data sets are confusing.

I know I owe everyone an update here. We've run into a bunch of tiny, annoying problems, and we're working through all of them; nothing too interesting for this list, I think.

The API will be deployed today or tomorrow, depending on the services team's availability. But at that point we'll still be backfilling some data, especially the per-article view counts. So the endpoints will be available to hit, with some data in there, but we won't announce it publicly. I'll post here as soon as we have some good news. If anyone's interested in the problems and details, ping me privately or on IRC or something.

Good News :)

The API has been launched. We're not announcing it widely yet because we haven't finished loading in all the data. The per-article data will take some time; the others should be ready relatively soon. However, I wanted to give folks on this list a heads-up so they can start writing code: the spec is final.

Find the docs here: https://wikimedia.org/api/rest_v1/?doc#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end

And here are some example queries:

https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2015/10/01

https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/2015100100/2015100200

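As a starting point for that code, a minimal Python sketch (using requests) that consumes the "top" example above; the items[0].articles response shape is what the endpoint returns, and the User-Agent value is just a placeholder:

```python
import requests

URL = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/top/'
       'en.wikipedia/all-access/2015/10/01')

resp = requests.get(URL, headers={'User-Agent': 'pageview-api-example'})
resp.raise_for_status()

# The ranked list lives under items[0]['articles'].
for entry in resp.json()['items'][0]['articles'][:10]:
    print(entry['rank'], entry['article'], entry['views'])
```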

Huge huge thanks to everyone involved:

  • The community for kicking our butt until we did this (@Magnus and @MZMcBride especially)
  • @Henrik for keeping stats.grok.se up as long as he has, hopefully this will be a welcome addition and maybe spark interest in a new version of stats.grok.se
  • Ops and Services teams for holding our hand throughout
  • Everyone on the analytics team, past and present, we all worked on this at some point

We'll have an official public announcement on the analytics list when all the data is loaded, and most likely a blog post soon after. Until then, let's keep this among people who need to know (to update code) and deserve to know (in general) :)

Fantastic news indeed!

Can't wait for the per-article data. At least, I now have a URL schema to code against :-)

Detail question: will per-article queries work with "out-of-bounds" dates? So, if my date range is 2015090100-2015093200 (or 3124, or 3123 for a 30-day month), will that work?

Congrats Dan and team -- nice to see this so close.

We should talk about moving the page view statistics from the wiki to this service when it's had a chance to bake some.

-Toby

First of all, I'll join the celebrations, this is absolutely fantastic. Huge thanks to everyone involved!

I'm looking forward to testing it with SuggestBot, since the bot delivers view data to en-wiki users every day. Reading through the documentation, I had a question about the date format when requesting views per day for articles: does it simply strip off the hour from the timestamp? Meaning, are start/end time specs of '2015102300' and '2015102301' equivalent? Not sure where to ask; maybe I should open a separate ticket for it? And as you can probably tell, I'm a bit eager to play around with this.

> Detail question: will per-article queries work with "out-of-bounds" dates? So, if my date range is 2015090100-2015093200 (or 3124, or 3123 for a 30-day month), will that work?

The timestamps are validated as real dates, so 2015093200 or 2015093100 will be invalid and will return a proper message explaining what's wrong.

Hours run from 00 to 23, so 2015100100 will include the first hour of 2015-10-01. If you want all of September at an hourly level, this is the correct range: 2015090100-2015093023.
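
A client can sidestep the 30-vs-31-day question entirely by computing the month's last day; a small illustrative sketch:

```python
import calendar

def hourly_range(year, month):
    # Hours run 00-23, so a full month at hourly granularity starts at
    # DD=01, HH=00 and ends at the month's last day, HH=23.
    last_day = calendar.monthrange(year, month)[1]
    return (f'{year:04d}{month:02d}0100',
            f'{year:04d}{month:02d}{last_day:02d}23')

print(hourly_range(2015, 9))  # ('2015090100', '2015093023')
```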

@Nettrom:

> question about the date format when requesting views per day for articles: does it simply strip off the hour from the timestamp? Meaning, are start/end time specs of '2015102300' and '2015102301' equivalent?

Actually, you have to pass 2015102300 if you want data for the 23rd. 2015102301 will give you a 404, since it specifies an hour for the daily level. This is ... confusing. I'm open to suggestions, but we may not want to mess with the URL structure too much once this is publicly launched.

> Not sure where to ask; maybe I should open a separate ticket for it?

Anyone can ask details like this in #wikimedia-analytics on freenode; someone should be able to answer there. Or on the analytics-l list.

> Actually, you have to pass 2015102300 if you want data for the 23rd. 2015102301 will give you a 404, since it specifies an hour for the daily level. This is ... confusing. I'm open to suggestions, but we may not want to mess with the URL structure too much once this is publicly launched.

The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002, but if you don't want to change the format, at least it should be a 301 instead of a 404.

Quick update: October has finished loading. We tried to optimize, but we couldn't get hourly-resolution per-article data to fit in Cassandra. Because of that, we're looking at Druid and Elasticsearch as replacements [1].

So at this point, people can query this data freely, and expect it to be reliable. Let us know if you have problems. We will continue to fill in all the rest of the data we have, back to May 2015, and we'll keep it up to date with new data.

[1] https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI/DataStore

It is really wonderful these metrics are now available :)

Has anyone started work on a user interface or can anyone suggest an easy way to visualise the results?

We haven't made an interface, sort of on purpose, to see what the level of interest is, etc. We're working pretty hard on the back-end to add more types of data and possible queries.

But the data that comes back is JSON and should be very easy to visualize with anything like d3, dygraphs, etc. I'm happy to help as a volunteer to write that kind of code, and I humbly suggest dashiki as a platform to build it with. Anyone who wants to work on this should open another task and cc me.
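
For example, a short hypothetical Python script that fetches a per-article daily series and plots it with matplotlib (just one option; d3 or dygraphs would work equally well in a browser); the article and date range are arbitrary:

```python
import requests
import matplotlib.pyplot as plt

url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
       'en.wikipedia/all-access/all-agents/Selfie/daily/2015100100/2015103100')

items = requests.get(url, headers={'User-Agent': 'pageview-api-example'}).json()['items']

# Each item carries a YYYYMMDDHH timestamp and a view count.
dates = [item['timestamp'][:8] for item in items]
views = [item['views'] for item in items]

plt.plot(dates, views)
plt.xticks(rotation=90)
plt.title('Daily views of "Selfie" on en.wikipedia, October 2015')
plt.tight_layout()
plt.show()
```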

Update:

  • I want to talk about the Pageview API and future Analytics Data APIs at this year's MediaWiki Developer Summit. I will cc some of you on this proposal: https://phabricator.wikimedia.org/T112956. Let's discuss there where we want to go next.
  • Marcel wrote a simple demo of what's possible to do with the API; we'll be showing that off soon.
  • We are getting ready to publish a blog post about the API.

In T44259#1748904, @Tgr wrote:

> The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002, but if you don't want to change the format, at least it should be a 301 instead of a 404.

I agree. If I'm trying to get daily numbers, then it makes sense to have the dates in YYYYMMDD format. I tried both that and with HH = 01 before figuring out that the only way to get it to work was to use HH = 00.

> In T44259#1748904, @Tgr wrote:
>> The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002, but if you don't want to change the format, at least it should be a 301 instead of a 404.
>
> I agree. If I'm trying to get daily numbers, then it makes sense to have the dates in YYYYMMDD format. I tried both that and with HH = 01 before figuring out that the only way to get it to work was to use HH = 00.

I'm happy to do this; just wondering whether you all think it would be too confusing to have different date formats depending on the values of the other parameters. It seems easier for humans and harder for machines, and this API leans slightly towards machines.

> I'm happy to do this; just wondering whether you all think it would be too confusing to have different date formats depending on the values of the other parameters. It seems easier for humans and harder for machines, and this API leans slightly towards machines.

Why not support both? Just interpret YYYYMMDD as meaning YYYYMMDD00.

> Why not support both? Just interpret YYYYMMDD as meaning YYYYMMDD00.

Makes sense, filed: https://phabricator.wikimedia.org/T118543
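
Until that task lands, the workaround is easy to do client-side; a sketch of the padding rule (the function name is mine):

```python
def pad_timestamp(ts):
    # Treat a bare YYYYMMDD as YYYYMMDD00, the form the API accepts
    # for daily granularity; pass YYYYMMDDHH through unchanged.
    if len(ts) == 8:
        return ts + '00'
    if len(ts) == 10:
        return ts
    raise ValueError(f'expected YYYYMMDD or YYYYMMDDHH, got {ts!r}')

assert pad_timestamp('20151001') == '2015100100'
assert pad_timestamp('2015100100') == '2015100100'
```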

Thanks, I'll manage on my own, once daily (or monthly) views are available on the new API. Or did I miss a mail and they already are?

@Magnus, the API is up and being used already, we just haven't announced it on a list yet. I have a draft email explaining some details that I'll send probably today or Monday to analytics-l, engineering, and wikitech.

Monthly pageviews aren't ready quite yet. But daily pageviews are stable, and filled back to October (with more data being added as we go): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Selfie/daily/2015010100/2015120100

I'm super duper excited to report that the API has been announced publicly on the wikitech, analytics, and engineering lists. Therefore I'm resolving this task. Feel free to stick around, share stories, etc. But if you want to talk about what's next for this API, head on over to T112956, where I've most likely already subscribed you :)

Thank you again to everyone on this thread. It means a lot to me to be able to move this project forward, and I'm excited to see where we want to go next.

What about unique viewers?

That requires gathering data in a different way. We don't really like the whole idea of fingerprinting at WMF, so we don't do that.

>> What about unique viewers?
>
> That requires gathering data in a different way. We don't really like the whole idea of fingerprinting at WMF, so we don't do that.

Sounds sensible, thanks.

> We haven't made an interface, sort of on purpose, to see what the level of interest is, etc. We're working pretty hard on the back-end to add more types of data and possible queries.
>
> But the data that comes back is JSON and should be very easy to visualize with anything like d3, dygraphs, etc. I'm happy to help as a volunteer to write that kind of code, and I humbly suggest dashiki as a platform to build it with. Anyone who wants to work on this should open another task and cc me.

Can you recommend a guide that gives baby steps for reusing the data in one of the tools you suggest? I'd be very happy to work with you on a tool as a beta tester, etc. (I'm not a programmer).

In T44259#1873040, @Mrjohncummings wrote:

> Can you recommend a guide that gives baby steps for reusing the data in one of the tools you suggest? I'd be very happy to work with you on a tool as a beta tester, etc. (I'm not a programmer).

We're about to put out a blog post. At the bottom of that I'm trying to have such a guide. If that's not rich enough I'll keep trying :)

@Mrjohncummings: please open another Phabricator task and assign it to me so I can reference it from the work I do.

@Milimetric great, thanks. I've written something here; I think I've made a bit of a pig's ear of the wording, so please change it to make sense.

https://phabricator.wikimedia.org/T121314

To keep the archives happy, this is the blog post announcing the release of the API: http://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/

@Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong), 17 hours after UTC midnight.

> @Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong), 17 hours after UTC midnight.

Probably same issue as T116286.

> when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong), 17 hours after UTC midnight.

@Slaporte: We experienced a cluster issue on January 4th, 2016, which slowed down our computation for the next two days. Everything is now back in order. Sorry for the inconvenience.

>> when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong), 17 hours after UTC midnight.
>
> @Slaporte: We experienced a cluster issue on January 4th, 2016, which slowed down our computation for the next two days. Everything is now back in order. Sorry for the inconvenience.

Glad that's resolved. Thanks for the update!

> @Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong), 17 hours after UTC midnight.

@Slaporte, data shows up "as soon as possible". In theory, the earliest it could show up is a couple of hours after the respective time period is finished (so at 02:00 UTC on day X+1, we should have day X ready). But that sometimes may be much slower if the cluster is overloaded, data gets lost and we have to restart jobs, etc. In general I haven't seen it take more than 24 hours, so if you see really long wait times beyond that, it might be worth reporting.
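
Since there's no fixed publication time, a consumer that needs yesterday's data might simply poll; a rough sketch, assuming the API returns 404 until a day's data has been loaded:

```python
import time
import requests

def fetch_when_ready(url, attempts=6, delay_seconds=3600):
    # Retry hourly rather than assuming data lands at an exact time;
    # a 404 here usually just means the day hasn't been loaded yet.
    for _ in range(attempts):
        resp = requests.get(url, headers={'User-Agent': 'pageview-api-example'})
        if resp.status_code == 200:
            return resp.json()
        time.sleep(delay_seconds)
    raise TimeoutError(f'no data after {attempts} attempts: {url}')
```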

Heh, funny, that's just a copy of my code from: https://github.com/mediawiki-utilities/python-mwviews/blob/master/mwviews/api/pageviews.py

I've seen a couple of better Python implementations, and there are also clients in R, JS, and more. This thing's heating up :)
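
For reference, basic usage of that client looks roughly like this; the names (PageviewsClient, article_views) come from that repository as of this writing and may have changed since, so treat this as a sketch rather than gospel:

```python
from mwviews.api import PageviewsClient

# The client wraps the REST endpoints shown earlier in this thread.
client = PageviewsClient(user_agent='pageview-api-example')
views = client.article_views('en.wikipedia', ['Selfie'],
                             granularity='daily',
                             start='20151001', end='20151031')
print(views)
```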

@Nemo_bis: Are there any actionables for analytics here? Seems that we can close this ticket, right?

> @Nemo_bis: Are there any actionables for analytics here? Seems that we can close this ticket, right?

This was already closed over 2 months ago.