Page MenuHomePhabricator

Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish)
Closed, ResolvedPublic

Description

I was browsing this page (logged-in):
https://ba.wikipedia.org/wiki/Arthropoda

And noticed the following request is made on every single page:
https://meta.wikimedia.org/wiki/Special:RecordImpression?result=hide&reason=empty&country=NL&userlang=en&project=wikipedia&db=bawiki&sitename=%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F&bucket=1

What's up with that?

Similar issues are exhibited by Special:BannerRandom, Special:BannerLoader.


Version: unspecified
Severity: normal

Related Objects

StatusSubtypeAssignedTask
DuplicateNone
Resolvedawight
DeclinedNone
Resolved atgo
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedAndyRussG
ResolvedNone
Resolvedawight
ResolvedAndyRussG
ResolvedAndyRussG
Resolvedawight
ResolvedNone
ResolvedAndyRussG
ResolvedEjegg
ResolvedNone
Resolved ellery
Resolvedgreg
Resolved Nuria
Resolvedawight
Resolvedcoren
Resolvedawight
Resolvedawight

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 193777 merged by jenkins-bot:
Special:RecordImpression: include sample rate

https://gerrit.wikimedia.org/r/193777

After tonight's deployment, RecordImpression numbers are down by at least 50% even before the new code has worked through everyone's caches.
(sampled 1:100)
15 min ending 20150303-030001: 8402
15 min ending 20150224-030001: 23036

This is mostly due to 100-fold reduction of reason=empty calls, but probably reflects variation in deployed campaigns too.
reason=empty calls:
15 min ending 20150303-030001: 359 (125 POSTs from new code)
15 min ending 20150224-030001: 15125

Hi... here are the patches deployed for this:

Again, it's not done, but this should be better. There will still be times when you don't see a banner but S:RI is called, but it'll be less, especially just now. Note also that a bug in Firefox prevents the call from appearing in the Network tab. It's visible in Chrome, though.

Thanks, all!! :)

faidon raised the priority of this task from Medium to High.Jul 4 2015, 5:39 PM
faidon added subscribers: BBlack, Joe.

The wm2015register campaign caused this on appservers req/s (just a sample from one appserver):

appserver.png (373×747 px, 32 KB)

Varnish top RxURLs are now:

1root@cp1065:~# varnishtop -1bi TxURL|head -20 |cut -d\& -f1
2 837.00 TxURL /w/index.php?title=Special:RecordImpression
3 227.00 TxURL /w/api.php
4 133.00 TxURL /w/api.php?maxlag=5
5 121.00 TxURL /w/index.php?title=Special:RecordImpression
6 100.00 TxURL /w/api.php?continue=%7C%7C
7 82.00 TxURL /w/index.php?title=Special:RecordImpression
8 74.00 TxURL /w/index.php?title=MediaWiki:AjaxTranslation.js
9 71.00 TxURL /w/index.php?title=Special:RecordImpression
10 54.00 TxURL /w/index.php?title=Special:RecordImpression
11 50.00 TxURL /w/api.php?continue=%7C%7C
12 45.00 TxURL /w/index.php?title=Special:RecordImpression
13 32.00 TxURL /w/index.php?title=Special:RecordImpression
14 30.00 TxURL /w/index.php?title=Special:RecordImpression
15 26.00 TxURL /w/index.php?title=Special:Export
16 25.00 TxURL /w/index.php?title=Special:RecordImpression
17 23.00 TxURL /w/index.php?title=Special:RecordImpression
18 22.00 TxURL /w/index.php?title=Special:RecordImpression
19 21.00 TxURL /w/index.php?title=Special:RecordImpression
20 21.00 TxURL /w/api.php?inprop=protection
21 19.00 TxURL /w/index.php?title=Special:RecordImpression

I think graphs/numbers speak for themselves.

@Joe, @ori, @BBlack & myself all wasted our time with effects correlating to this and other issues today. Both @ori and myself have personally spent a considerable amount of our time over the years for S:RI. What needs to be done for it to die in a fire as the task's name suggests?

Ouch...! @faidon @Joe, @BBlack, @ori, really, really sorry about the wasted time! Fortunately we are very close to finishing the changes required to bury this. Here is the most important WIP patch:

https://gerrit.wikimedia.org/r/#/c/221759/

That patch also refactors basically all the banner controller code to make it modular and readable, in addition to supporting the new processes we need. Since it's not complete, I haven't added too many reviewers yet, but it will need review from several teams. Note that I was expecting a gradual shutting down of S:RI--that's why that patch does still support it. I imagine we can make the cutoff more drastic and sudden, though... Mainly that'd be a question of coordinating things with any community banners that need display numbers.

Many thanks in advance for ur thoughtful, incisive and scathing comments!!! ;P

Prioritizing dev effort

Please understand that CentralNotice is at once legacy software and a single point of possible failure.

Can you please explain what "legacy software" means specifically? Are there newer tools that are now being used to replace CentralNotice? I'm very confused about this.

Please also understand that FR's banner tests are important, since the more effective banners are, the less they have to be shown. We have to balance time spent fixing technical debt (like this), doing new things right, improving tests and CI, general maintenance, etc. All things considered, I think the balance we've struck so far is quite reasonable.

The comments below saying (yet again) to resolve this task, as titled, as soon as possible indicate that the balance your team is making is unreasonable.

As pointed out above in T45250#997918, the only reason this code is still live is that it's part of fundraising, but that's a pretty thin and ugly veil to wear.

Can you (@AndyRussG) or @atgo provide an exact date by which Special:RecordImpression will no longer be live on Wikimedia wikis?

We're hoping to shut off S:RI by the end of the calendar year. The system which will replace it is called banner history, and it's one of Fundraising Tech's top priorities, in active development. That said, expect not to see any sparks flying from this "die in a fire" task until we have the parallel infrastructure built which will replace it.

I really appreciate the tempered criticism we're receiving here--when I originally came up with the Special:RecordImpression malfeature three years ago, I duly walked to the other side of the office and painted my name in feeble capitals at the top of the "fire me" wall. At the time, we were replacing an even more egregious bit of oxygen-poor design work, and S:RI actually saved us a round trip for every page view... Glad this discussion is finally heating up, though. I'd love to see CentralNotice get the attention it deserves, since it's the sole reason we aren't yet another Google rebranded targeted video ad vomitorium.

My invitation to anyone subscribing here who feels feisty or proactive, is to help review or implement the banner history feature at T78089, rather than jabbing at our gangrenous cecum. The faster we can get that done, the more likely S:RI dies within the year. If you go over there, you'll notice all kinds of shiny new things. In the future, thanks to the banner history work, there will be a checkbox for limiting the number of times a banner is displayed to each individual. Currently, the poor admin would have to hunt down and transclude a bunch of obscure onwiki javascript, which will surely change in some incompatible way at an indeterminate time.

As an example of how life sucks, but could be better: the wm2015register campaign failed to follow any of the (underdocumented) best current practices in CentralNotice, such as throttling the amount of traffic to the campaign, limiting the targeted audience, or most importantly, if you're going to spam everyone looking at a wiki, limiting the number of impressions any one victim is subjected to. Under the upcoming banner history / campaign mixin regime, the admin would have checked the box that says "I'm not a sociopath or anything, please show my banner no more than 5 times".

In a world with more FR-tech developers and more months before December, I'd also like to add some tools to help admins be proportionate about how many impressions they're delivering relative to one another. In the wm2015register case, the problem is actually that we're spamming the bloody crap out of the entire world, that should have set off alarms before S:RI did. I'm not surprised that some app servers are heating up, and it's not even particularly informative to point to S:RI when we have a huge systemic issue with allowing people to spam each other in the first place.

/me crunches loudly on peanuts

This comment was removed by awight.

In the wm2015register case, the problem is actually that we're spamming the bloody crap out of the entire world, that should have set off alarms before S:RI did.
... when we have a huge systemic issue with allowing people to spam each other in the first place.

do we have a bug filed for making the right interface for that? or for
notifications for people that opt-in to know when banners hit certain
criteria? there was even a time not so long ago when a related special
page used to regularly timeout (I think it was something like
globalallocation; ah, I guess I'm right: {T55443}).

the rest of the abuse of CN problem (the users), I think belongs in a
new section at [[m:Talk:CentralNotice/Calendar]][0] rather than
phabricator and also should be mentioned on the centralnotice-admins
mailing list[1].

[0] https://meta.wikimedia.org/wiki/Talk:CentralNotice/Calendar
[1] https://lists.wikimedia.org/mailman/listinfo/centralnotice-admins

As an example of how life sucks, but could be better: the wm2015register campaign failed to follow any of the (underdocumented) best current practices in CentralNotice, such as throttling the amount of traffic to the campaign, limiting the targeted audience, or most importantly, if you're going to spam everyone looking at a wiki, limiting the number of impressions any one victim is subjected to. Under the upcoming banner history / campaign mixin regime, the admin would have checked the box that says "I'm not a sociopath or anything, please show my banner no more than 5 times".

In the wm2015register case, the problem is actually that we're spamming the bloody crap out of the entire world, that should have set off alarms before S:RI did. I'm not surprised that some app servers are heating up, and it's not even particularly informative to point to S:RI when we have a huge systemic issue with allowing people to spam each other in the first place.

If this is an unintentionally-bad campaign, can we get it fixed? Can someone shut off the campaign or fix its parameters? AFAICS, the load situation has not changed, and we're still experiencing a related deluge of S:RI hits, and we've had a huge spike of failing appservers that correlate with the request-load peak coming from this a few hours ago. Where can we go or who do we ask to get it corrected?

Change 222879 had a related patch set uploaded (by BBlack):
filter S:RI from wm2015register T45250

https://gerrit.wikimedia.org/r/222879

Change 222879 merged by BBlack:
filter S:RI from wm2015register T45250

https://gerrit.wikimedia.org/r/222879

Yes, the issues with that banner should be reported and discussed at https://meta.wikimedia.org/wiki/Talk:CentralNotice/Calendar

Mainly that'd be a question of coordinating things with any community banners that need display numbers.

I doubt any such banner exists. Either way, the community can serenely do without such numbers for now.

I've read T78089 and associated page and there is no indication on what makes it a blocker for immediate death of Special:RecordImpression.

Please understand that CentralNotice is at once legacy software and a single point of possible failure.

Can you please explain what "legacy software" means specifically? Are there newer tools that are now being used to replace CentralNotice? I'm very confused about this.

By legacy, I mean it's crufty and should be substantially updated or replaced (or a bit of both). Dive into the code and you'll see what I mean. :) Yes, its contributors include many folks I hold in high esteem!! But it has also accumulated unfinished improvements, support for features no longer in use, and extra complexity due to the need to support a lot more (scale- and featurewise) than originally planned. (BTW, the WIP patch mentioned above does, I think, fix a lot of client-side issues.)

Please also understand that FR's banner tests are important, since the more effective banners are, the less they have to be shown. We have to balance time spent fixing technical debt (like this), doing new things right, improving tests and CI, general maintenance, etc. All things considered, I think the balance we've struck so far is quite reasonable.

The comments below saying (yet again) to resolve this task, as titled, as soon as possible indicate that the balance your team is making is unreasonable.

Hmmm... Well, let's consider both the specific and the general. Certainly a different balance could be sought specifically for killing S:RI. Though I want to say that overall we're giving CN performance very high priority!! (For example, said patch also reduces the amount of code loaded for users in many cases--that's another issue other teams have brought up--and with T103695 we're exploring ways to get beyond the performance limits of the current system.) Upcoming changes should address S:RI, other performance issues, and a large chunk of technical debt, and provide the banner history feature requested by FR, all in one swoop!

I should also call out that having new folks on the FR team does not mean we can go a lot faster right away--pretty soon, though, I think! :)

It seems the more general issue related to this is: how tech priorities and goals are set overall. (Not something to look into much here... maybe worth taking elsewhere, though!)

@BBlack, that patch looks great, thanks for doing that! Also, with the currently deployed CN code, adding the following snippet to JS in any banner should reduce S:RI calls for that banner to 1% of what they'd otherwise be:

mw.centralNotice.onlySampleRI = true;

In the latest version of said WIP patch, I made S:RI sampling the default. With that, campaigns that need full impression numbers would have to explicitly turn sampling off to get them. It also simply turns off S:RI in some cases where it's currently called at the low, sampled rate. (Again, the thought behind this was a gradual shutdown of S:RI. And, again, it's a WIP--I still need to check with all parties that this is OK.)

Mainly that'd be a question of coordinating things with any community banners that need display numbers.

I doubt any such banner exists.

There have been community banners that used in-banner JS to hide banners in certain circumstances, and have asked for impression numbers.

I've read T78089 and associated page and there is no indication on what makes it a blocker for immediate death of Special:RecordImpression.

Just edited the task description of T90917 and commented there. Sorry if it wasn't clear!!

Thanks!!! :)

I doubt any such banner exists. Either way, the community can serenely do without such numbers for now.

Agreed. I'd like us to have the numbers anyway for postmortem analysis, but if it helps with the ops burden by all means do whatever is needed with the non-fundraising banners. AFAIK the only non-FR campaign admins who have asked for their impression numbers have been WMDE and @Jalexander.

If this is an unintentionally-bad campaign, can we get it fixed?

Great question. This is definitely not an abusive admin, it's a structural shortcoming in the software that the defaults are to spam the whole world. Unfortuntely, adding the code to limit banner impressions is currently so arcane that we've been too embarrassed to document the process. Or something like that. I tried to make the connection on the talk page, we'll see how that goes.

I've read T78089 and associated page and there is no indication on what makes it a blocker for immediate death of Special:RecordImpression.

It's a blocker in the sense that we need a functional replacement before we retire the old impression counting system. Impression numbers are a must-have for all fundraising banners, it's the only way we can normalize results and compare across campaigns. I'll find an appropriate milestone card and mark the dependency from this task. Seems like the dependencies are pretty clearly marked.

I doubt any such banner exists. Either way, the community can serenely do without such numbers for now.

Agreed. I'd like us to have the numbers anyway for postmortem analysis, but if it helps with the ops burden by all means do whatever is needed with the non-fundraising banners. AFAIK the only non-FR campaign admins who have asked for their impression numbers have been WMDE and @Jalexander.

Thanks!! :) Yeah, agreed about the second point... FWIW, I was thinking of the steward elections banner, which only showed to logged-in users with a minimum number of edits. Don't know how far their request for impression numbers ever got...

If this is an unintentionally-bad campaign, can we get it fixed?

Great question. This is definitely not an abusive admin, it's a structural shortcoming in the software that the defaults are to spam the whole world. Unfortuntely, adding the code to limit banner impressions is currently so arcane that we've been too embarrassed to document the process. Or something like that. I tried to make the connection on the talk page, we'll see how that goes.

Just added this to the doc on Meta. Arg! Really silly of me not to have done that before, really sorry about that! :( Also, I remember at one point it was decided that it wasn't necessary to make that the default, though I don't remember why now. I guess that was also a mistake!!!

/me sprints and skips back to the spaghetti refactory...

Have set onlySampleRI, and basic javascript/cookie to display up to 2 times per project but that only affect what someone see and not whether the HTML/CSS/JS get loaded in the first place. @Glaisher had set traffic to 50% earlier.

In T45250#1428586, @KTC wrote:

Have set onlySampleRI, and basic javascript/cookie to display up to 2 times per project

Looks great Thanks much

that only affect what someone see and not whether the HTML/CSS/JS get loaded in the first place.

Indeed... BTW that'll also be fixed in upcoming changes :)

There have been community banners that used in-banner JS to hide banners in certain circumstances, and have asked for impression numbers.

"Asked" doesn't mean "need". There are alternative ways to calculate impressions and I'm pretty sure whoever asked would gladly do without if only they knew it impacts performance for users.

What's the status of this? Are we any closer to the complete shutdown of S:RI?

In T45250#1473036, @ori wrote:

What's the status of this? Are we any closer to the complete shutdown of S:RI?

This refactor patch (mentioned above) is ready and waiting for review and smoke testing! (It's still marked "WIP" just so we don't merge until enuf eyes have seen it.)

Please see also T106577 for some considerations...

DStrine renamed this task from Special:RecordImpression should die in a fire to [EPIC] Special:RecordImpression should be used at a very low sample rate.Jan 26 2016, 8:12 PM
DStrine subscribed.

fr-tech has discussed this a few times. We have a few steps left. I edited the title per the team's instructions.

If fr-tech plans to dedicate a sprint to reduce the rate at which S:RI is called, that is well and good, and worthy of appreciation. But this task is not about that; it is about S:RI being a badly-designed solution that needs to be replaced or reworked so that it is a good design. The fact that S:RI abuses the concept of special pages to add an API endpoint with semantics that run counter to the design of the system in which it is embedded -- this is not improved by sampling.

If "die in a fire" is unhelpful or in poor taste, it can be rephrased.

ori renamed this task from [EPIC] Special:RecordImpression should be used at a very low sample rate to Replace or redo Special:RecordImpression.Jan 27 2016, 8:33 AM

@awight @AndyRussG pinging you directly for comments.

Oh hi! Sure, the current title is a good summary of what we'd like to do. A low sample rate is just one example of how to "redo" S:RI, and I agree that the scope of this task should remain broad. Perhaps "Replace or improve".

I already see two incompatible interpretations of "redo" in the comments above, so it appears that "die in a fire" was more descriptive.

Nemo_bis renamed this task from Replace or redo Special:RecordImpression to Replace Special:RecordImpression with something that is not a special page.Mar 1 2016, 9:30 AM
AndyRussG renamed this task from Replace Special:RecordImpression with something that is not a special page to Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish).Mar 18 2016, 7:54 PM
fgiunchedi subscribed.

I see "herald" has added "operations" and its transcript has been garbage collected (!) removing operations since it isn't clear what needs to be done here, please add back if relevant!

Herald keeps re-adding ops because traffic is still in the tag list. We have an interest in this because S:RI is very high on the list of high-rate uncacheable client fetches.

Change 193779 abandoned by AndyRussG:
WIP Special:RecordImpression: sample by default, custom rates

https://gerrit.wikimedia.org/r/193779

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

awight claimed this task.

All subtasks are closed—I think that means this epic can be closed as well? Congratulations on all the successful work!