
Large numbers are rendered differently depending on which server is rendering them
Closed, Resolved, Public

Description

Author: berendjanwever

Description:
Large numbers such as 82,000,000 get rendered as 8.2E+7 by one server and 82000000 by another. The former is wrong and makes parsing numbers in templates (such as {{val}}) impossible. A little investigation indicates that servers named "srv*" render it as 8.2E+7 and servers named "mw*" render it as 82000000. I've been told at the village pump that filing a bug here may help find the root cause and resolve it.

See also:
https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Village_pump_(technical)#large_numbers_are_rendered_differently_by_various_servers.2C_leading_to_number_formatting_errors.


Version: unspecified
Severity: major

Details

Reference
bz31259

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:49 PM)
bzimport set Reference to bz31259.

+assignee per roan's comments on en.wp

(In reply to comment #0)

Large numbers such as 82,000,000 get rendered as 8.2E+7 by one server and
82000000 by another. The former is wrong and makes parsing numbers in templates
(such as {{val}}) impossible. A little investigation indicates that servers
named "srv*" render it as 8.2E+7 and servers named "mw*" render it as 82000000.
I've been told at the village pump that filing a bug here may help find the
root cause and resolve it.

That's interesting.. It's suggesting the new builds are doing it right, but potentially old builds/reinstalls aren't.

py wrote:

Hey,

may I please have the number of the srv* server? *some* of those boxes have been upgraded to Lucid, so this is a very important data point.

Because upgrading all of the apaches will not be done at one time (for example, now), is there any way to make parsing either format possible?

(In reply to comment #2)

(In reply to comment #0)

Large numbers such as 82,000,000 get rendered as 8.2E+7 by one server and
82000000 by another. The former is wrong and makes parsing numbers in templates
(such as {{val}}) impossible. A little investigation indicates that servers
named "srv*" render it as 8.2E+7 and servers named "mw*" render it as 82000000.
I've been told at the village pump that filing a bug here may help find the
root cause and resolve it.

That's interesting.. It's suggesting the new builds are doing it right, but
potentially old builds/reinstalls aren't.

Hah, really? That would be very weird, you'd think it would've always been broken then.

(In reply to comment #3)

Hey,

may I please have the number of the srv* server? *some* of those boxes have
been upgraded to Lucid, so this is a very important data point.

Because upgrading all of the apaches will not be done at one time (for example,
now), is there any way to make parsing either format possible?

No, we should just fix whatever's outputting scientific notation to no longer output scientific notation.

Despite pcache potentially messing things up, I have confirmed that srv186 (which runs hardy) returns 8.2E+7 while srv286 (which runs lucid) returns 82000000. I confirmed this by running echo $wgOut->parse( '{{#expr:82000000}}' ); in eval.php. I tried both 1.17 and 1.18, but there's no difference.

Also, scientific notation is only used in very few cases. The rules seem to be like this:

  • If the number is below a million (10^6), never use scientific notation. This happens on both hardy and lucid
  • If the number is at or above 10^14, always use scientific notation. This also happens on both hardy and lucid (!!! so this was broken all along for numbers of 100 trillion or more)
  • If the number is between 10^6 and 10^14, only use scientific notation if there are /exactly/ two significant digits. This happens ONLY on hardy but NOT on lucid.

Some examples:

echo $wgOut->parse( '{{#expr:870000}}' ); // less than a million

<p>870000
</p>

echo $wgOut->parse( '{{#expr:8700000}}' ); // hardy: <1E14, two significant digits

<p>8.7E+6
</p>

echo $wgOut->parse( '{{#expr:8700000}}' ); // lucid: <1E14, two significant digits

<p>8700000
</p>

echo $wgOut->parse( '{{#expr:8720000}}' ); // three significant digits

<p>8720000
</p>

echo $wgOut->parse( '{{#expr:9000000}}' ); // one significant digit

<p>9000000
</p>
// Big numbers >1E14:

echo $wgOut->parse( '{{#expr:900000000000000}}' );

<p>9.0E+14
</p>

echo $wgOut->parse( '{{#expr:940000000000000}}' );

<p>9.4E+14
</p>

echo $wgOut->parse( '{{#expr:943000000000000}}' );

<p>9.43E+14
</p>

echo $wgOut->parse( '{{#expr:943250000000000}}' );

<p>9.4325E+14
</p>
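
The three rules and the examples above can be condensed into a small model. This is only a sketch of the behaviour observed in this comment, not PHP's or MediaWiki's actual formatting code; the function names and the `release` parameter are made up for illustration, and 10^14 itself is treated as already being in the "always scientific" range.

```python
def significant_digits(n: int) -> int:
    """Count the digits of a positive integer after dropping trailing zeros."""
    return len(str(n).rstrip("0"))


def uses_scientific(n: int, release: str) -> bool:
    """Model of the observed output; `release` is "hardy" or "lucid" (hypothetical parameter)."""
    if n < 10**6:
        return False  # both releases print plain digits
    if n >= 10**14:
        return True   # both releases switch to scientific notation
    # 10^6 <= n < 10^14: only hardy misbehaves, and only for exactly two significant digits
    return release == "hardy" and significant_digits(n) == 2


for number in (870000, 8700000, 8720000, 9000000, 900000000000000):
    print(number, uses_scientific(number, "hardy"), uses_scientific(number, "lucid"))
```

The loop walks through the same inputs as the eval.php examples, so the two columns can be compared directly against the output shown above.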

I'm pretty sure what is being described is the same bug as:

https://bugs.php.net/bug.php?id=43053

Which affected PHP versions 5.2.1 to 5.2.6

It appears that the old servers are running 5.2.4, while lucid moves to 5.3, at which point the bug has been fixed in PHP.

Is there anything holding up software updates on these machines? Considering we've got all this infrastructure for maintaining consistent software configurations, I'm a bit unclear on why we would still have a mix of different versions in production.

berendjanwever wrote:

As a side note: I assume that since PHP does not have infinite precision, templates that need to be able to parse really large numbers must be able to deal with scientific representation. In other words, 1E100 is never going to be rendered as "100000000000000....". Am I assuming correctly? Does anybody know the exact limit at which scientific notation will always be used by Wikipedia once all servers have been updated? I'd like to update the documentation on my {{val}} template so people who run into this limit understand what is going on.

Thanks!
BJ
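
For illustration, here is a minimal sketch of what "dealing with scientific representation" could look like for a consumer of the rendered number. It is written in Python rather than template syntax, the helper name `to_plain_digits` is hypothetical, and it is not how {{val}} works; it only shows that both renderings seen in this report can be reduced to the same digit string.

```python
from decimal import Decimal


def to_plain_digits(text: str) -> str:
    """Return `text` in plain positional notation.

    Accepts plain digits ("82000000"), grouped digits ("82,000,000")
    and scientific notation ("8.2E+7").
    """
    value = Decimal(text.replace(",", ""))
    # The "f" presentation type formats a Decimal without an exponent.
    return format(value, "f")


print(to_plain_digits("8.2E+7"))
print(to_plain_digits("82000000"))
```

Decimal is used instead of float so that very large values such as 1E100 are expanded exactly rather than going through binary floating point.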

(In reply to comment #6)

I'm pretty sure what is being described is the same bug as:

https://bugs.php.net/bug.php?id=43053

Which affected PHP versions 5.2.1 to 5.2.6

Good catch!

It appears that the old servers are running 5.2.4, while lucid moves to 5.3, at
which point the bug has been fixed in PHP.

That is correct. The lucid servers run PHP 5.3.2.
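
As a quick sanity check of the version argument, the affected range quoted from the PHP bug report (5.2.1 to 5.2.6) can be compared against the two versions mentioned here. This is a sketch only; the helper names are invented, and it assumes plain dotted numeric version strings.

```python
FIRST_AFFECTED = (5, 2, 1)  # range as quoted from PHP bug 43053 in this thread
LAST_AFFECTED = (5, 2, 6)


def parse_version(version: str) -> tuple:
    """Turn "5.2.4" into (5, 2, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str) -> bool:
    """True if `version` falls inside the affected range quoted above."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED


print(is_affected("5.2.4"))  # hardy servers
print(is_affected("5.3.2"))  # lucid servers
```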

(In reply to comment #7)

Is there anything holding up software updates on these machines? Considering
we've got all this infrastructure for maintaining consistent software
configurations, I'm a bit unclear on why we would still have a mix of different
versions in production.

Nothing is holding them up really, other than "it takes time". At first, we upgraded one, then two boxes, just to see if there were any issues (and of course there were). AFAIK Peter has already upgraded the entire image scaler cluster, and is currently chipping away at the general Apache cluster. But you can't just upgrade 100+ servers overnight. We also have to deal with the fact that a lot of Apaches run memcached and/or ES, so we can't have too many of those be down at the same time. This is all expected to be over soon, say in a week or two. But in the meantime we'll inevitably have a mix of hardy and lucid in production, and that mix will gradually shift to lucid until everything is upgraded. For more details, see notpeter's entries in the server admin log, or talk to him on IRC.

(In reply to comment #8)

As a side note: I assume that since PHP does not have infinite precision,
templates that need to be able to parse really large numbers must be able to
deal with scientific representation. In other words 1E100 is never going to be
rendered as "100000000000000....". Am I assuming correctly? Does anybody know
what the exact limit is going to be at which point scientific notation is
always going to be used by Wikipedia after all servers have been updated? I'd
like to update the documentation on my {{val}} template so people who run into
this limit understand what is going on.

As per comment 5, I have experimentally determined this limit to be 10^14 (100 trillion). I just double-checked this on a lucid server: 99999999999999 (that's 99,999,999,999,999) is untouched but 100000000000000 (that's 100,000,000,000,000) becomes 1E+14.
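
One plausible explanation for the 10^14 boundary, which nobody in this thread has confirmed, is that the value is a float printed with 14 significant digits (14 is PHP's documented default for its `precision` setting). The sketch below is a Python analogy using the C-style `%.14G` conversion, not PHP's actual code path; the function name is made up.

```python
def format_with_14_digits(value: float) -> str:
    """Format `value` with at most 14 significant digits, C "%G" style.

    "%G" switches to scientific notation once the decimal exponent
    reaches the requested precision, which is 14 here.
    """
    return "%.14G" % value


for n in (82000000, 99999999999999, 100000000000000, 943250000000000):
    print(n, "->", format_with_14_digits(float(n)))
```

Under this assumption the switch happens exactly between 99999999999999 and 100000000000000, matching the experiment above. Minor details differ from the PHP output in comment 5: Python prints 9E+14 where the examples there show 9.0E+14.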

(In reply to comment #9)

(In reply to comment #7)

Is there anything holding up software updates on these machines? Considering
we've got all this infrastructure for maintaining consistent software
configurations, I'm a bit unclear on why we would still have a mix of different
versions in production.

Nothing is holding them up really, other than "it takes time".

Maybe I should be a bit clearer here: we do have infrastructure for maintaining the same versions of packages everywhere etc., but it's not like we're upgrading a single package here. This is a full OS upgrade, and Peter is reimaging each server for that.

ES has been migrating out (per http://wikitech.wikimedia.org/view/External_storage) but as I understand this is not-quite-100%-done at this moment, so they still have to be watched out for?

Memcache still being in Apache space definitely indicates doing rolling upgrades to keep most of the cluster online at any given time, but should refill ok as long as each one that goes down has a replacement.

[In principle, apaches should hold no permanent data and should be able to act like they're netbooting or something; re-imaging *shouldn't* be any more unpleasant than just rebooting other than it being slower.]

To confirm: we expect this process to be done by mid-October 2011 under current work plans?

(In reply to comment #11)

ES has been migrating out (per
http://wikitech.wikimedia.org/view/External_storage) but as I understand this
is not-quite-100%-done at this moment, so they still have to be watched out
for?

It looks like Ben's started to rewrite the documentation to represent the future state, or something. I know that he's working on moving all the ES data off the Apaches onto a bunch of dedicated ES boxes, but right now the Apaches are still listed in db.php and the ES boxes aren't.

Memcache still being in Apache space definitely indicates doing rolling
upgrades to keep most of the cluster online at any given time, but should
refill ok as long as each one that goes down has a replacement.

That's about right, yeah. But it's a bit slow because Peter's being careful and rotating out memcached boxes 3 at a time. Also, each replacement memcached server starts with a blank slate, so that part of the cache needs to be rebuilt, which means replacing 20 of them at once is probably not a good idea. And the number of spares we have lying around isn't infinite either, of course.

[In principle, apaches should hold no permanent data and should be able to act
like they're netbooting or something; re-imaging *shouldn't* be any more
unpleasant than just rebooting other than it being slower.]

To confirm: we expect this process to be done by mid-October 2011 under current
work plans?

I *believe* so, but I'll ask Peter to comment here with an authoritative answer.

I hacked up a quick one-liner to check the number of upgraded servers, and it seems we have 97 upgraded machines and 108 non-upgraded ones.

py wrote:

Mid-October is a reasonable estimate. I'm doing these about 15-20 at a time, but the slow part is pushing out new mc.php versions gradually, so as to not bring down the site. But I'm in a rhythm, and the apaches that serve up pages should be done by mid-October, which will leave the smaller pools of bits and api apaches to do.

Lowering priority since this is an Ops issue, not something most developers can do.

Oh, and it is past mid-October, so maybe it is fixed?

py wrote:

Yes! This is fixed. Sorry for not noting sooner. Many bugs were closed by upgrading to lucid, and a couple slipped through the cracks.

All active apaches are upgraded. This includes all of the mw[0-9][0-9] boxxies and all srv* boxxies, with the exception of those with external stores which have been removed from the load-balancing pools, and thus are not serving anything up via apache.

(In reply to comment #15)

Yes! This is fixed.

Closing.