
Collect enwiki clickstream data (we could use it to automatically fix links to disambiguation pages and more)
Closed, Resolved · Public · Feature

Description

Author: jasonspiro4

Description:
It looks like bug 4118 ("Semi-automatic disambiguation") won't be implemented. But you developers have access to server logs. Could you buy a tool to derive [[clickstream]] data from the enwiki logs, strip out the IP addresses from the reports, and then either post the data online or share it with people who request it? We then could use the data

  • to write a bot that will use it to automatically fix links to disambig pages (this is a separate idea that I can file a bug for later)
  • or for all sorts of other uses. (I don't know what clickstream data can be used for so I don't know what these possible uses are.)

Version: unspecified
Severity: enhancement

Details

Reference
bz12742

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:01 PM
bzimport set Reference to bz12742.
bzimport added a subscriber: Unknown Object (MLST).

As a privacy nutjob.... (I think you can guess where my comment is going)

jasonspiro4 wrote:

Sorry Bawolff. Based on some Google research I did in response to your comment, I found that the Wikimedia Foundation already decided last year to get some better analytics tools.[1] :) But remember that the Foundation has a privacy policy already. Also, they can do a few things if they so choose: they can limit who can see the data, and they can limit from whom they collect the data.

^ [1]. http://www.mediawiki.org/wiki/Analytics_upgrade

Also, if they decide to make clickstream data available to certain people (say, bot developers), they can further sanitize it by removing all records of clicks on user pages and user talk pages.
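The filtering step described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the `Click` record layout (referrer title, target title, count) and the function names `is_private` and `sanitize` are hypothetical, not an existing Wikimedia tool or data format.

```python
from typing import Iterable, Iterator, NamedTuple


class Click(NamedTuple):
    """Hypothetical record: one row per (referrer page, clicked page) pair."""
    referrer: str
    target: str
    count: int


# Namespace prefixes to drop. Titles are compared case-insensitively,
# with underscores treated as spaces.
PRIVATE_PREFIXES = ("user:", "user talk:")


def is_private(title: str) -> bool:
    """Return True if the title is in the User or User talk namespace."""
    normalized = title.replace("_", " ").strip().lower()
    return normalized.startswith(PRIVATE_PREFIXES)


def sanitize(clicks: Iterable[Click]) -> Iterator[Click]:
    """Yield only records where neither end of the click is a user page."""
    for click in clicks:
        if is_private(click.referrer) or is_private(click.target):
            continue
        yield click
```

A prefix filter like this only catches correctly spelled namespace prefixes, so it is one sanitization step rather than a complete anonymization scheme.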

I just CC'ed all five members of the analytics upgrade team to this bug, and assigned this bug to Howie Fung. I hope both of those actions were OK.

Maybe we should assign this to Rob Lanphier or Nimish.

"Based on some Google research I did in response to your comment, I found that the Wikimedia Foundation already decided last year to get some better analytics tools."

My main concern was giving out such data to everyone who could potentially want it. Bot developers are a wide group of people, of varying levels of competency. I wouldn't really want such a group to have access to such data unless it was very well anonymized. Such information could be sensitive. Say someone browsed through various articles on Wikipedia about sexual topics, followed by a browse through the Commons categories for sexual images (assuming such categories still exist after the recent controversies that I haven't really been following), followed by the user visiting his own user page (so one can identify who it is; if user pages aren't listed, perhaps followed by him accidentally making a typo and going to uer:<user name>/ whatever). That might be something that the user would not want to be published.

Anyways, I'm all for better analytics tools in general (I love the page stats), but we also have to be careful. Even anonymized data can be harmful to release if it's not done carefully (for example, [[AOL search data scandal]]).

howiewiki wrote:

I'm not sure the benefits of fixing the disambiguation issue outweigh the potential privacy concerns. Yes, we do want better analytics, but we should think about what clickdata we want to track and/or publish very carefully. E.g., we may consider applying click-tracking to some types of pages, but not others if that's possible.

Robla is managing the priority list of analytics related features, so I'm going to assign this to him.

Any other use cases for this data?

[mass-moving wikistats reports from Wikimedia→Statistics to Analytics→Wikistats to have stats issues under one Bugzilla product (see bug 42088) - sorry for the bugspam!]

So the only potential use case I've seen mentioned so far in this report is
"to write a bot that will use such data to automatically fix links to disambig pages".
Is that all?

However, better clickstream data is mentioned at https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Analytics

FYI -- came across this and I assume it can be closed. The clickstream dataset has existed for English Wikipedia for several years now and there's now a prototype interface and API for easier access to the data:

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Isaac claimed this task.