
VisualEditor: Editor does not load at all (ve.ui.TargetToolbar is undefined)
Closed, Resolved (Public)

Description

load.php request for ve.base (when it crashes)

At this time, on most Wikipedias that have just been switched to 1.22wmf19, VisualEditor is fatally broken.

The exception cannot be reproduced consistently, but trying 3 or 4 times will hit it.


Version: unspecified
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=51766

Attached:

req-vebase_crash.png (1×2 px, 546 KB)

Details

Reference
bz54935

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 2:38 AM)
bzimport set Reference to bz54935.

Created attachment 13434
load.php request for ve.base (when it does not crash)

Attached:

req-vebase_nocrash.png (1×2 px, 569 KB)

Created attachment 13435
load.php request for ve.core (when it crashes)

Attached:

req-vecore_crash.png (1×2 px, 551 KB)

Created attachment 13436
load.php request for ve.core (when it does not crash)

Attached:

req-vecore_nocrash.png (1×2 px, 565 KB)

The two ve.base requests are (naturally) served by different cache servers, but the Content-Length is identical and the Age values are close enough (the responses were sent out at different times, and my requests were a few seconds apart as well).

However, the ve.core response is significantly different (even though we're requesting the exact same URL, with the same timestamp from the startup module).

I think this is another case that shows how conceptually flawed our deployment system is (swapping out files and directories inside /a/common while Apache is still pooled and fully serving), or how we did not take this into account in the design of ResourceLoader (take your pick).

As a result I think one of the following 2 scenarios happened:

Scenario A:

  • sync deployment starts
  • sync to srv100 complete
  • user visits Wikipedia
  • user requests load.php?module=startup
    • load balancer picks srv100
  • user requests load.php?module=foo&version=123
    • (timestamp the user got from srv100)
    • load balancer picks srv200, which still runs the old code
    • this url is now cached on some servers
  • user requests load.php?module=bar&version=212
  • sync to srv200 complete
  • sync deployment ends

Imagine module bar depends on the (new version of) module foo. From this point on, users hitting srv100 for the module=foo request will keep getting the old version, which breaks scripts in unpredictable ways.
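
To make the sequence in scenario A concrete, here is a minimal toy model in Python. It is only a sketch: the Server and Cache classes, the version numbers 100 and 123, and the assumption that a server serves whatever code it has on disk regardless of the requested version are all illustrative, not taken from ResourceLoader itself.

```python
# Toy model of scenario A. All names (Server, Cache, fetch) are hypothetical;
# this is not ResourceLoader code.

class Server:
    def __init__(self, name, code_version):
        self.name = name
        self.code_version = code_version  # "old" or "new"

    def startup(self):
        # Assumption: only a synced server advertises the new timestamp (123).
        return {"foo": 123} if self.code_version == "new" else {"foo": 100}

    def load(self, module, version):
        # Assumption: the version parameter only affects caching; the server
        # serves whatever code it currently has on disk.
        return f"{module}:{self.code_version}"


class Cache:
    """Stand-in for the cache layer: responses are keyed by the full URL."""

    def __init__(self):
        self.store = {}

    def fetch(self, server, module, version):
        url = f"load.php?module={module}&version={version}"
        if url not in self.store:
            self.store[url] = server.load(module, version)
        return self.store[url]


srv100 = Server("srv100", "new")  # sync to srv100 complete
srv200 = Server("srv200", "old")  # sync to srv200 not done yet

cache = Cache()
version = srv100.startup()["foo"]            # user gets the new timestamp
first = cache.fetch(srv200, "foo", version)  # old code cached under the new URL

srv200.code_version = "new"                  # sync to srv200 completes
later = cache.fetch(srv100, "foo", version)  # cache hit, still the old code

print(first, later)
```

In this model the stale response outlives the deployment because nothing changes the URL it was cached under.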

Scenario B:

  • sync deployment starts
  • sync to srv100 complete
  • user visits Wikipedia
  • user requests load.php?module=startup
    • load balancer picks srv100
  • sync to srv200 is busy (started, but not complete)
  • user requests load.php?module=foo&version=123
    • (timestamp the user got from srv100)
    • load balancer picks srv200, which has mixed code, so some of the files being concatenated/minified are old and some are new
    • this url is now cached on some servers
  • sync to srv200 finishes
  • user requests load.php?module=bar&version=212
  • sync deployment ends

In this scenario, a dependency between different requests/modules doesn't even matter, because in our deployment system a server can be in a mixed state within itself (as opposed to a mixed state across the datacenter, as in scenario A).

Again, it is cached under the new timestamp.
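
The mixed state in scenario B can be sketched the same way. The file names and contents below are made up for illustration, and build_module is only a stand-in for concatenation/minification, not the real build step.

```python
# Hypothetical contents of one module's files before and after deployment.
OLD = {"a.js": "a_old", "b.js": "b_old", "c.js": "c_old"}
NEW = {"a.js": "a_new", "b.js": "b_new", "c.js": "c_new"}


def disk_during_sync(synced):
    """Files already synced are new; everything else is still old."""
    return {name: (NEW if name in synced else OLD)[name] for name in OLD}


def build_module(disk):
    """Concatenate the module's files in a fixed order."""
    return ";".join(disk[name] for name in sorted(disk))


# Only a.js has reached srv200 when the request for module foo arrives.
mixed = build_module(disk_during_sync({"a.js"}))
print(mixed)
```

A response built this way matches neither the old nor the new release, which is why a single module can be broken on its own, without any cross-module dependency.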

Contrary to what one might think, neither scenario resolves itself automatically in the next 5-minute window of the startup module.

The next time the startup module is generated, 5 minutes later, the max() timestamp of a module will still be the same.

Touching startup.js won't help either.

You'd have to artificially touch the individual module that got corrupted in cache, sync it, and hope the same race condition doesn't happen again.
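
A small sketch of why only touching the affected module helps, assuming (for illustration) that a module's version is the max() of its files' modification times. The file names and timestamp values are hypothetical.

```python
def module_version(mtimes):
    """Version of a module: the newest modification time among its files."""
    return max(mtimes.values())


# Hypothetical modification times for the two modules involved.
foo_files = {"foo.js": 123, "foo.css": 90}
startup_files = {"startup.js": 100}

before = module_version(foo_files)

# Regenerating the startup module 5 minutes later changes no file of foo.
after_regen = module_version(foo_files)

# Touching startup.js only bumps the startup module's own timestamp.
startup_files["startup.js"] = 500
after_startup_touch = module_version(foo_files)

# Touching a file that belongs to foo finally changes foo's version,
# and with it the URL the corrupted response was cached under.
foo_files["foo.js"] = 600
after_touch = module_version(foo_files)

print(before, after_regen, after_startup_touch, after_touch)
```

Under that assumption the cached URL for foo only changes in the last step, which matches the fix described here.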

Is this a dupe of bug 51766 or bug 51766?

Fixed by touching the JS file in production.

And yeah, this is another artefact of 43805, I think. :-(