Provide a better means of status update delivery in WMF error message
Open, Low, Public, Feature

Description

Author: martinp23

Description:
See also: T18043: Link to #wikimedia-tech from the WMF error message (this blocks it if anything)

Great change has taken place in #wikipedia with regard to opping practices.
It remains difficult to manage the channel during times of downtime, especially
with little or no support from sysadmins (if the channel gets particularly
hectic, while +m might not be warranted, it is impossible to read both the
channel and #wikimedia-tech).

A far better solution than Mike suggests is that the Wikimedia sysadmins go to
the effort of creating some easy, quick to update and accessible method of
telling users what is going on. Not many people use and are familiar with IRC,
and I'd expect that for 90% of people who see the "site is down" message,
their usual next step would be to (ironically!) visit wikipedia to see what IRC
means! It therefore serves very few users as a means of providing status
updates.

It would be relatively trivial for someone to create (yet another) IRC bot for
#wikimedia-tech which could write comments given to it to either a blog or
something like twitter (and thus an RSS feed). This would be accessible to
many many more users affected by Wikimedia downtime.
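
As a rough illustration of the bot idea above, here is a minimal sketch that picks status messages out of raw IRC lines and serialises them as an RSS 2.0 feed. It is only a sketch under stated assumptions: the `!status ` command prefix, the feed title and the `https://example.org/status` link are hypothetical and not taken from this report, and the network connection and publishing steps are left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical command prefix; nothing in this report specifies one.
STATUS_PREFIX = "!status "
CHANNEL = "#wikimedia-tech"

# Raw IRC message shape: ":nick!user@host PRIVMSG #channel :text"
PRIVMSG_RE = re.compile(
    r"^:(?P<nick>[^!\s]+)!\S+ PRIVMSG (?P<target>\S+) :(?P<text>.*)$"
)


def parse_status_line(raw_line):
    """Return (nick, message) for a status command sent to CHANNEL, else None."""
    match = PRIVMSG_RE.match(raw_line.rstrip("\r\n"))
    if match is None or match.group("target").lower() != CHANNEL:
        return None
    text = match.group("text")
    if not text.startswith(STATUS_PREFIX):
        return None
    message = text[len(STATUS_PREFIX):].strip()
    return (match.group("nick"), message) if message else None


def build_rss(updates):
    """Serialise (timestamp, nick, message) tuples as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Title, link and description are placeholders for illustration.
    ET.SubElement(channel, "title").text = "Status updates (sketch)"
    ET.SubElement(channel, "link").text = "https://example.org/status"
    ET.SubElement(channel, "description").text = "Updates relayed from IRC"
    for timestamp, nick, message in updates:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = message
        ET.SubElement(item, "description").text = f"Posted by {nick}"
        ET.SubElement(item, "pubDate").text = format_datetime(timestamp)
    return ET.tostring(rss, encoding="unicode")
```

A real bot would also need to restrict who may issue the command (for example to channel operators), which this sketch does not attempt.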

An IRC channel is no longer fit for purpose.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49476

Details

Reference
bz20079

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 21 2014, 10:51 PM)
bzimport set Reference to bz20079.
bzimport added a subscriber: Unknown Object (MLST).

Where would you place that status page so it doesn't get "slashdotted" during a Wikipedia outage?

A long time ago, there was an external page serving that purpose, which was taken down whenever
Wikipedia failed. Now Wikipedia traffic is orders of magnitude greater.

An appropriate place to put the messages could be the toolserver (if WM-DE is ok with it),
independent but nearby. However, it only makes sense if the source of the problem isn't in esams
itself!
I can't think of a scenario where the squids present the error message, the toolserver is not
accessible and which isn't trivially solved by rerouting to tampa. Nonetheless I feel there might
be an unsuspected problem there.

martinp23 wrote:

(In reply to comment #1)

Where would you place that status page so it doesn't get "slashdotted" during a
Wikipedia outage?

A long time ago, there was an external page serving that purpose, which was
taken down whenever Wikipedia failed. Now Wikipedia traffic is orders of
magnitude greater.

An appropriate place to put the messages could be the toolserver (if WM-DE is ok
with it), independent but nearby. However, it only makes sense if the source of
the problem isn't in esams itself!
I can't think of a scenario where the squids present the error message, the
toolserver is not accessible and which isn't trivially solved by rerouting to
tampa. Nonetheless I feel there might be an unsuspected problem there.

The toolserver did cross my mind. Alternatively, use a completely separate service such as a hosted blog or, as an increasing number of services do, use Twitter.

Alternatively, could the "Site down" notice be modified such that it draws a short status string from somewhere and presents it to users?
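
To make that last idea concrete, here is a minimal sketch of an error page that pulls a short status string from a pluggable source and falls back to a generic line if the source is unavailable or empty. Everything named here (`load_status`, `render_error_page`, the 200-character limit, the page markup) is hypothetical and for illustration only; it says nothing about how the real error page is generated.

```python
import html

MAX_STATUS_LENGTH = 200  # assumed limit so the notice stays short
FALLBACK = "No further information is available at the moment."


def load_status(fetch):
    """Call fetch() and return a short status string; never raise."""
    try:
        status = (fetch() or "").strip()
    except Exception:
        # The status source may well be down along with everything else.
        return FALLBACK
    if not status:
        return FALLBACK
    return status[:MAX_STATUS_LENGTH]


def render_error_page(fetch):
    """Build the notice, escaping the status so it cannot inject markup."""
    status = html.escape(load_status(fetch))
    return (
        "<h1>Site down</h1>\n"
        f"<p>Current status: {status}</p>"
    )
```

The important property is that a failing or slow status source must never make the error page itself fail, so in practice `fetch` would also need a short timeout.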

Linking to http://identi.ca/wikimediatech instead of Twitter would be more consistent with our values.
Still, I don't think the Server admin log is appropriate as a source of general status information.
Make a feed with #wikipedia-tech topic? :)

FT2.wiki wrote:

IRC is useful for many people. So are identi.ca and twitter, client (or app) and browser based. We should provide a few routes, not just one. There's no need not to tell people about IRC as one of those. If anything's up IRC will surely get to know of it. I was in #wikipedia on 24 May and demands weren't unreasonable, posted a message there now and then, people got the idea. Easy.

Suggestion:

<Standard and user-friendly generic error message>

If this persists more than a few minutes, the current status and updates can be viewed at:

  * IRC: <channel details>
         <http://web link> (web based)
  * identi.ca: <details>
  * Twitter: wikimedia-network-status
             <http://search.twitter.com/search?q=wikimedia-network-status> (web based)
  * Our external status pages: <list>

Almost-current versions of articles can be read from the following cache websites:

  * <list>

(In reply to comment #5)

IRC is useful for many people. So are identi.ca and twitter, client (or app)
and browser based. We should provide a few routes, not just one.

That's bug 16043. This bug asks for

some easy, quick to update and accessible method of
telling users what is going on.

http://status.wikimedia.org/ seems the way, but sysadmins need to decide how to update it with notices.

[Link to largely unhelpful discussion, just for historical purposes: http://thread.gmane.org/gmane.org.wikimedia.foundation/52853 .]

Guillaume, Sumana, Tilman, Matthew or whoever is responsible for this: do we have *any* location right now where users can expect to find information about (if not report) current outages and technical problems, which could be linked from the error page?

As of bug 16043 comment 24, the new (varnish) error page won't even have a link to IRC, while it would be nice if it gave some directions.
status.wikimedia.org doesn't give any updates; I think even Twitter would be better than nothing, but I don't remember https://twitter.com/wikimedia consistently reporting such information with less than a few hours' delay.
Perhaps https://wikitech.wikimedia.org/view/Server_admin_log would be a suitable target? It's both more open for posting (hence more complete) and "moderated" (by editing the wiki). It mostly contains obscure information, but during outages the top lines will probably be about what people are looking for; informative messages can easily be made in bold.

This issue has seen no progress in years... Can we find a simple solution, and who is in a position to make a decision on the topic?

Just post stuff /both/ to Identi.ca and Twitter (with the most popular accounts or hashtags) with a simple IRC-to-Identi.ca/Twitter bot. Won't hurt.

(In reply to comment #7 by Nemo)

Perhaps https://wikitech.wikimedia.org/view/Server_admin_log would be a
suitable target?

See comment 4 - It's likely too techy.

This request has some bikeshed potential - is there a defined scope for which kinds of issues users should be informed about? (I probably shouldn't ask this, to keep this focused.)

ksnider wrote:

Hello everyone,

Apologies for being late to this discussion.

Is the sort of information we are currently exposing at http://status.wikimedia.org the kind of information you are looking for? Or something else?

Thanks.

Hello Ken.

(In reply to comment #10)

Is the sort of information we are currently exposing at
http://status.wikimedia.org the kind of information you are looking for? Or
something else?

Something else. status.wikimedia.org reports only the worst cases of downtime (when sites are not even accessible), for some of the services. What's needed is information on whether the sites are functioning (e.g. up, down, read only, r/w but there's a fatal if you try to save, Europe cut off) and what's being done about it.
A recent example could be https://status.github.com/messages
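
The kinds of state listed above could be modelled roughly as follows. This is only a sketch to show the shape of the information being asked for: the type names, the state labels and the `summary` format are all hypothetical, and the list of states is just the set of examples given in this comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SiteState(Enum):
    """Example states taken from the comment above; not an official list."""
    UP = "up"
    DOWN = "down"
    READ_ONLY = "read only"
    SAVE_FAILING = "readable, but saving fails"
    REGION_CUT_OFF = "unreachable from a region"


@dataclass(frozen=True)
class StatusUpdate:
    state: SiteState
    action: str                   # what is being done about it
    region: Optional[str] = None  # e.g. "Europe" when only one region is hit

    def summary(self):
        """One-line, user-facing description of the current situation."""
        where = f" ({self.region})" if self.region else ""
        return f"Sites are {self.state.value}{where}. {self.action}"
```

For instance, `StatusUpdate(SiteState.READ_ONLY, "Database switchover in progress.").summary()` yields a single sentence that could be shown to users as-is.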

Change 97190 had a related patch set uploaded by Nemo bis:
Add Twitter account to Varnish's error page

https://gerrit.wikimedia.org/r/97190

(In reply to comment #12)

Change 97190 had a related patch set uploaded by Nemo bis:
Add Twitter account to Varnish's error page

https://gerrit.wikimedia.org/r/97190

I think this proposed change might mistakenly give the impression that the "wikimedia" Twitter account is used to provide site status information and it's definitely not, even during actual outages and issues.

(In reply to comment #13)

Comment 4 notes that Twitter is not really aligned with Wikimedia's open source values, though in the time since comment 4 was made, identi.ca no longer exists, I believe. :-/

Copy from gerrit comments:

Dzahn: didn't you mean https://twitter.com/wikimediatech instead of https://twitter.com/wikimedia ? ...snip...

I disagree with using the wikitech logs because most end users will not understand what they mean

eg: <p858snake|l> most end users will not know what "cp1002 hdd is full" or "fenari is in swap" or perhaps "exim is being stupid" means
<p858snake|l> or how that relates to why it's broke

(In reply to comment #13)

(In reply to comment #12)

Change 97190 had a related patch set uploaded by Nemo bis:
Add Twitter account to Varnish's error page

https://gerrit.wikimedia.org/r/97190

I think this proposed change might mistakenly give the impression that the
"wikimedia" Twitter account is used to provide site status information and
it's definitely not, even during actual outages and issues.

It's definitely been used for major outages, see e.g.:

https://twitter.com/Wikimedia/status/232469652691894272
https://twitter.com/Wikimedia/status/232519974663643136
https://twitter.com/Wikimedia/status/350485792956755968
https://twitter.com/Wikipedia/status/398888528039276544 (was retweeted by @wikimedia, too)

Since mid-2011, Twitter has been listed as a communications tool for such cases at
https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public .

Of course it's a matter of judgment how severe an incident needs to be to be reported on @wikimedia. Issues that don't affect a lot of users, or short outages, may indeed not be covered there. The wording in the patch ("You may be able to get further information in Wikimedia's <a href="https://twitter.com/wikimedia">Twitter feed</a>") should be sufficiently non-committal.

It would be interesting to have a guesstimate of how many views of that error page happen to coincide with something that would trigger an update of that Twitter handle, i.e. if it's mostly seen during major outages (a lot of views in rare events) or minor ones (fewer views but many more events).
Since IRC was removed, the error message no longer provides any way (however hard) to get really up-to-date information. I don't know, however, whether that's a goal; maybe not.

I don't think it's an issue if folks check out @wikimedia from an error message and find no updates there, as long as the message is worded accordingly. The proposed message in https://gerrit.wikimedia.org/r/#/c/97190/ already says "may be", which I think is sufficient, but we could also add "in case of ongoing outages" since we'll likely never tweet something for an intermittent site issue.

(In reply to comment #16)

I think this proposed change might mistakenly give the impression that the
"wikimedia" Twitter account is used to provide site status information and
it's definitely not, even during actual outages and issues.

It's definitely been used for major outages[...]

Yes, it has been used previously. But site outages and issues happen 24/7 and I can assure you we've had many outages and large site issues of varying severity that have gone unreported to Twitter. There's also the issue of tweets coming post-incident (see below).

[...] see e.g.:

https://twitter.com/Wikimedia/status/232469652691894272
https://twitter.com/Wikimedia/status/232519974663643136
https://twitter.com/Wikimedia/status/350485792956755968
https://twitter.com/Wikipedia/status/398888528039276544 (was retweeted by
@wikimedia, too)

A user visits Wikipedia and sees an error page. They refresh or come back a few minutes later and the site is back. In only one of the four cases mentioned here would there have been any useful information from Twitter. In three of the four cases, the message was put out after the site issue was resolved (e.g., "Site back to normal after problems affecting logged-in users."). Any user who saw the Wikimedia error message and clicked over to Twitter would not have been provided any useful information.

If we insist on including a link to Twitter, I think it might be better to include a link such as https://twitter.com/search?q=wikipedia+down. That's how a user can actually determine whether the site is having issues during an actual outage.

Otherwise we will simply be directing users to a feed (@wikimedia) of "check out this project on Wikisource" or "see the Commons image of the day" when the sites are inaccessible. That doesn't seem ideal to me.

Change 97190 abandoned by Faidon Liambotis:
Add Twitter account to Varnish's error page

Reason:
See T76560 for a broader effort.

https://gerrit.wikimedia.org/r/97190

Despite the generic title, T76560 doesn't seem to be concerned with anything which would help with this request (i.e. informing users better).

jeremyb edited subscribers, added: jeremyb; removed: jeremyb-phone.
Dereckson closed this task as Declined. (Edited, Feb 13 2017, 11:02 PM)
Dereckson subscribed.

Despite a claim of a specific burden noted in 2009, the channel hasn't requested such a change in recent years.

The issue is a symptom and will be solved with T129433 or a combo of https://status.wikimedia.org/ and an update stream.

The only solution offered was (twice) a Twitter change, a solution twice rejected.

But as there isn't anything specific for the #wikipedia channel to do, we can mark this one as resolved and focus on the root problem of refreshing the error pages, which is T129433.

T129433 isn't in any way a "root problem" and has nothing about giving accurate information to users.

Aklapper lowered the priority of this task from Medium to Low. (Apr 30 2017, 4:16 PM)
Aklapper changed the subtype of this task from "Task" to "Feature Request". (Feb 4 2022, 11:01 AM)