
Cleaning up of some (?) EventLogging schemata for Growth
Closed, Resolved, Public

Description

According to the discussion at

http://lists.wikimedia.org/pipermail/analytics/2014-July/002351.html

some EventLogging schemas need to be purged.


The names of the schemas are not yet fully clear, but at one point the OP
said:

we can probably just wholesale
remove the associated schemas listed at
https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_register#Schemas

Removal should happen before 2014-08-04, but (as discussed in private communication)
only after 2014-08-01.

I made it clear in private communication that we probably cannot meet that
deadline.

If I understood OP correctly, Sean will handle the database cleanup.

I pushed back on cleanup of raw logs.


Version: unspecified
Severity: normal
Whiteboard: u=Growth c=EventLogging p=0 s=2014-08-07

Details

Reference
bz68931

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 3:40 AM)
bzimport set Reference to bz68931.
Bug 68978 has been marked as a duplicate of this bug.

> I pushed back on cleanup of raw logs.

Steven clarified on-list that they have an agreement with legal to remove
the data. So we should do it.

On-list [1] Kevin said

Christian: before I prioritize it, can you scope out how much work
would be required?

The items that immediately come to mind are:

  • Clarify which schemas are meant to get purged.
  • Clarify how to handle future data (We're still seeing those events getting logged). We have no machinery in place to guard against data entering raw-logs.
  • Clarify whether or not purging EventLogging's “raw-logs” is sufficient (Since the relevant part of the data flow starts at the caches, it goes through both the udp2log and kafka pipelines)
  • Clarify if the event data got sent to universities (through udp2log forwards).
  • If the event data got sent to universities (see above item), clarify how to proceed there.
  • Get data removed from database (Either we get access, or we need to discuss with Sean or Ops)
  • Get data removed from all relevant files in vanadium:/var/log/eventlogging/...
  • Make sure the cleansed files from vanadium get rsynced over to stats1002 and stats1003.
  • If necessary (see 3rd item), remove the data from kafka consumers (Might be easier to just nuke current data, as we repaved Hadoop some days ago anyways)
  • If necessary (see 3rd item), remove the data from udp2log consumers (Not sure. Might turn out that effectively no udp2log filter is actually selecting this data)

Taking a quick look, it seems data-collection might have started in
April 2014.

The 2nd and 3rd item probably need more discussion with Steven
(probably also legal, as some items are costly).

As our team lacks the required access for most of those parts, we
either need to get access [2], or consume more Ops time (which
requires more preparations on our end).

As the above list has some “Clarify” and “If” items, it's
hard to give an estimate. If those items do not turn into much extra
work: maybe 1-2 weeks total wall-clock time. But most of this time
will be waiting time. So maybe one or two man-days of actual work.

[1] http://lists.wikimedia.org/pipermail/analytics/2014-August/002367.html

[2] I already applied when receiving Steven's first email, and Toby
approved. But those items just require three days waiting.

ahalfak said in private communication that he has finished the things he needed
to do, so we're good to get things moving from their end.

As discussed in private emails between Steven, Aaron and me, the request is
only for the following schemas:

SignupExpAccountCreationComplete
SignupExpAccountCreationImpression
SignupExpCTAButtonClick
SignupExpCTAImpression
SignupExpPageLinkClick
TrackedPageContentSaveComplete

Removal of future data is beyond the scope of this request.
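
For the file-based part of the cleanup, a filter over the six schemas above could look roughly like the sketch below. It is only an illustration: it assumes each log line is a single JSON object carrying the schema name in a "schema" key, which this task does not confirm for every file under vanadium:/var/log/eventlogging/. The function names are made up for the example.

```python
import json

# Schema names copied from the list in this task.
PURGED_SCHEMAS = frozenset({
    "SignupExpAccountCreationComplete",
    "SignupExpAccountCreationImpression",
    "SignupExpCTAButtonClick",
    "SignupExpCTAImpression",
    "SignupExpPageLinkClick",
    "TrackedPageContentSaveComplete",
})


def should_keep(line):
    """Return False if the line is an event of a schema to be purged.

    Assumption: one JSON object per line with a "schema" key.
    """
    try:
        event = json.loads(line)
    except ValueError:
        # Unparseable line: kept untouched here; a real purge might
        # prefer to collect such lines for manual review instead.
        return True
    if not isinstance(event, dict):
        return True
    return event.get("schema") not in PURGED_SCHEMAS


def filter_lines(lines):
    """Yield only the lines that survive the purge."""
    for line in lines:
        if should_keep(line):
            yield line
```

Writing the surviving lines to a new file and swapping it in afterwards, rather than editing in place, would leave the original available until the result has been checked.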

The tables to be purged from the log database are

SignupExpAccountCreationComplete_8539421
SignupExpAccountCreationImpression_8539445
SignupExpCTAButtonClick_8102619
SignupExpCTAButtonClick_8965028
SignupExpCTAImpression_8101716
SignupExpCTAImpression_8965023
SignupExpPageLinkClick_8101692
SignupExpPageLinkClick_8965014
TrackedPageContentSaveComplete_7872558
TrackedPageContentSaveComplete_8535426
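
As a sanity check before anything gets run against the database, the table list can be cross-checked against the requested schemas and turned into statements for review. The sketch below is hedged accordingly: it assumes the tables are dropped outright (the task only says "purged"), that the database is literally named `log`, and that MySQL-style backtick quoting applies. `drop_statements` is a hypothetical helper, not existing tooling.

```python
import re

# Schemas named in the request (from this task).
REQUESTED_SCHEMAS = {
    "SignupExpAccountCreationComplete",
    "SignupExpAccountCreationImpression",
    "SignupExpCTAButtonClick",
    "SignupExpCTAImpression",
    "SignupExpPageLinkClick",
    "TrackedPageContentSaveComplete",
}

# Tables listed in this task, named <Schema>_<revision>.
TABLES = [
    "SignupExpAccountCreationComplete_8539421",
    "SignupExpAccountCreationImpression_8539445",
    "SignupExpCTAButtonClick_8102619",
    "SignupExpCTAButtonClick_8965028",
    "SignupExpCTAImpression_8101716",
    "SignupExpCTAImpression_8965023",
    "SignupExpPageLinkClick_8101692",
    "SignupExpPageLinkClick_8965014",
    "TrackedPageContentSaveComplete_7872558",
    "TrackedPageContentSaveComplete_8535426",
]

TABLE_NAME = re.compile(r"^([A-Za-z]+)_([0-9]+)$")


def drop_statements(tables, schemas, database="log"):
    """Build one DROP TABLE statement per table, for human review.

    Raises ValueError if a table does not belong to a requested schema,
    so that a typo cannot silently target an unrelated table.
    The database name "log" is an assumption.
    """
    statements = []
    for table in tables:
        match = TABLE_NAME.match(table)
        if match is None or match.group(1) not in schemas:
            raise ValueError(f"unexpected table name: {table}")
        statements.append(f"DROP TABLE `{database}`.`{table}`;")
    return statements


if __name__ == "__main__":
    for statement in drop_statements(TABLES, REQUESTED_SCHEMAS):
        print(statement)
```

The script only prints the statements; whoever has access to the database (Sean or Ops, per the discussion above) would still decide how and when to execute them.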

On-list announcement about the upcoming purge is at

http://lists.wikimedia.org/pipermail/analytics/2014-August/002382.html