Page MenuHomePhabricator

getExpandedArguments should not count toward the 10-second Lua execution limit
Closed, ResolvedPublic

Description

Author: mmantel

Description:
Per discussion at http://en.wikipedia.org/wiki/Wikipedia:VPT#Lua_10-second_timeout_unusable_for_articles

If you transclude a Lua-powered template such as:

{{bananas| <... complicated parserfunctions ...> }}

the input parameter is apparently not computed until the Module behind the template accesses frame.args, which invokes Scribunto_LuaSandboxCallback::getExpandedArgument. The time spent preprocessing the parameters is thus counted against the 10-second time limit on Lua execution.

This violates my expectations as an editor. In my simplistic mental model of the parser, the template parameters are processed before any Lua code is invoked.

It has caused a problem on enwp, where {{hexadecimal}}, a simple decimal-to-hexadecimal conversion template, has been converted to Lua. In {{weather box}}, some complex parser functions are used, and the results are passed into {{hexadecimal}}, dozens of times. On pages with several weather boxes, Lua timeouts result. On [[Climate of Russia]], for instance, 8s of Lua time is spent in getExpandedArgument.

This could probably be fixed by converting {{weather box}} to Lua as well, but there are many templates that need conversion, and it seems unwise to distort editor priorities this way.

Per BJorsch in the linked VPT thread, "The counting of calls into PHP against the time limit in general is intentional, but there is a good case for making an exception for getExpandedArguments()."


Version: unspecified
Severity: normal
URL: https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=542013515#Lua_10-second_timeout_unusable_for_articles
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=64163

Details

Reference
bz45684

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:16 AM
bzimport added a project: Scribunto.
bzimport set Reference to bz45684.

I'm going to second this request. The current Lua timing system, which counts the time that the parser takes to expand parameters sent to Lua against the Lua run time will make it difficult to know whether Lua conversions will be effective for widely deployed templates.

Consider a case of two complex templates:

{{ templateA | ... }}

and

{{ templateB | ... }}

Suppose that each takes 0.25 seconds under the current parser system. Then a construction like:

{{ templateA | {{ templateB | ... }} }}

will take about 0.5 seconds per iteration. If these are frequently used formatting templates, this combination might appear 60 times on a single page, leading to very long render times (i.e. > 30 seconds).

Now we come along and find an easy Lua replacement for templateA:

{{ #invoke: fastLua | A | ... }}

That only take 0.01 seconds to execute in the typical case.

Our combination case

{{ #invoke: fastLua | A | {{ templateB | ... }} }}

now ought to render the page nearly twice as fast. However, in practice we find that fastLua | A can't be deployed to any page that heavily uses this construction, because the Lua time limit is getting charged for the time it takes to expand it's parameter {{ templateB | ... }}, and after ~40 of those the entire Lua clock limit is gone. This happens even though the actual time spent on Lua operations was only 0.4 seconds.

Hence, even though a fast solution can be found, the way the limits are constructed may make it impossible to deploy in some cases. Lua developers shouldn't have to worry about perverse cases like this. Slow downstream templates shouldn't be counted against Lua run times.

There has already been at least one case of having to develop a workaround to accommodate existing pages for a scenario just like the one described above, and there are other examples where it would occur if the Lua code presently being developed were moved into the live templates. I've seen in use examples where getExpandedArgument consumes over 96% of Lua run time.

To avoid such negative interactions, the natural solution is to stop charging Lua for the time the parser takes to generate the parameters being sent to Lua. The other alternative would be to remove the separate Lua clock and only rely on a full page rendering limit. Either way, I think we need something to be done to avoid this issue being a recurring problem during Lua conversions.

It's not especially simple to remove the time spent in getExpandedArgument() from the Lua usage timer, because stopping and restarting the timer causes it to be truncated to a fairly coarse resolution. If you stopped the timer during getExpandedArgument() calls, then expanding an argument once every millisecond or so would allow Lua to run forever. It's the same reason why the profiling timer is not stopped during PHP execution.

Maybe it would be possible to record the CPU usage delta by calling timer_gettime() at the start and end of the getExpandedArgument() call. But translating that discount into delayed signal delivery would need to be done with care, to avoid the same resolution issue when calling timer_settime().

Just increasing the CPU limit to 30 seconds or so would be easy and would mostly fix the problem, but that is not the request made in this bug report.

(In reply to comment #1)

The other alternative would be to remove the separate Lua clock and only
rely on a full page rendering limit.

Tying a CPU limit reduction with the Lua deployment allows us to take advantage of the attention Lua deployers are paying to performance and error generation. If we just introduced a 30s CPU limit for all parsing, that would break many existing articles.

Part of the reason we are introducing Lua is to eliminate long parse times which adversely affect user experience. The 10s limit reflects this goal.

Experience suggests that the applications will expand to fill the limit, whatever the limit is. For metatemplates, a typical optimisation practice is to start with something ridiculously slow and to optimise it until some large test case stops generating Squid errors. Thus we are left with parse times of around 45s. If we introduced a 10s preprocessor CPU limit when we first introduced templates, then parse times would now be around 7.5s instead (albeit with a reduced set of applications).

Requiring that everything be converted to Lua simultaneously in order to achieve <10s parse times is not, in my opinion, the worst possible solution. Weather boxes commonly cause complaints and Squid timeouts, despite their simple function. If I was setting Lua migration priorities, they would probably be first or second on the list.

One of the problems that Lua scripting was intended to solve was that the weather box rendering template should not take 50 seconds to render and there is no reason on Earth why it should render as nearly as slow as this. I mean, that's approximately 3 seconds per template. That's like processing one weather box is just as complex as transcoding ~10 seconds of Full HD video using x264, which is completely untrue and just blows my mind in a very negative sense of this word.

I care about the physical integrity of my mind and hence suggest English Wikipedia to use the tools given and redeem our Movement for the sins against the computing we have been habitually committing ever since template complexity started growing insanely.

I.e., I suggest WONTFIX.

mmantel wrote:

I respect the goal of the 10-second limit, but it seems to me that the time limit in combination with the getExpandedArgument issue is overconstraining the problem.

"Big templates" like {{weather box}} tend to use a lot of "small templates" like {{hexadecimal}} and {{rnd}}. Where do we begin in converting these to Lua? If we start with the big template, we'll have to make lots of expandTemplate calls to the small templates, which may easily take us over the 10-second limit. Instead we started with the small templates, but the inputs involve a lot of complex parser functions, so getExpandedArgument takes us over the limit.

The only option left is, as Tim says, that "everything be converted to Lua simultaneously". But how is that workable under the wiki model? Who is going to coordinate this mass simultaneous conversion? Will we work on hundreds of template sandboxes for months, and just pray that everything works perfectly together when they're all switched over?

(In reply to comment #2)

Maybe it would be possible to record the CPU usage delta by calling
timer_gettime() at the start and end of the getExpandedArgument() call. But
translating that discount into delayed signal delivery would need to be done
with care, to avoid the same resolution issue when calling timer_settime().

Brad, can you look at this? I'm not sure exactly what the interface would look like -- maybe a simple LuaSandbox::pauseUsageTimer() and LuaSandbox::resumeUsageTimer(), with most of the logic written in C.

luasandbox_update_usage() could adjust the returned usage based on the accumulated "pause" time. If the timeout timer expires with the pause counter non-zero, it would be restarted with an interval equal to the pause counter.

If the timeout timer expires during "pause", we will have to wait for the hook function to run before we can re-arm it, since timer_settime() is not signal-safe. The amount to add to the pause counter would be the time between the pause and the signal, minus the time between the unpause and the hook execution. See attached diagram.

It is possible to call timer_gettime() from the signal handler, so it's possible to do this calculation.

Created attachment 11886
Pause timing diagram

Apologies for the bad drawing. The two intervals labelled "discount" are meant to be the same length. I said above that the timer can be restarted in the hook, but actually, I guess it can be more easily restarted on unpause. The pause counter needs to be reset when the timer is restarted, whether it is restarted from the hook or the unpause function.

Attached:

pause-timing-diagram-v2.png (334×800 px, 15 KB)

Created attachment 11887
Overcomplicated pause timing diagram

What it would look like if the timer was only restarted in the hook, not in unpause. The important thing is to not accidentally give Lua an extra CPU time allowance equal to the time from the signal to the unpause, because that could be arbitrarily long.

Attached:

pause-timing-diagram-v1.png (334×800 px, 18 KB)

(In reply to comment #3)

One of the problems that Lua scripting was intended to solve was that the
weather box rendering template should not take 50 seconds to render and there
is no reason on Earth why it should render as nearly as slow as this. I mean,
that's approximately 3 seconds per template. That's like processing one
weather
box is just as complex as transcoding ~10 seconds of Full HD video using
x264,
which is completely untrue and just blows my mind in a very negative sense of
this word.

I think the concerns about migration complexity are valid. If you're replacing {{strlen}} with Lua, it's not really possible to know how long the arguments will take to expand, in all the thousands of existing callers. So there's no way the bug can reasonably be addressed by the developer of the relevant Lua module.

I don't want to increase the timeout to something approaching the length of time it takes to parse an existing long page, because I think that would lead to permanent negative consequences for user experience.

I don't want to exclude all PHP function calls from the usage timer, since that would open up DoS vulnerabilities, where the an expensive PHP function is called repeatedly in a Lua tight loop. I think excluding only getExpandedArgument() is a reasonable compromise. The results are cached by the PPFrame object, so a DoS attack against it would basically equivalent to one done with pure templates.

(In reply to comment #2)

It's not especially simple to remove the time spent in getExpandedArgument()
from the Lua usage timer, because stopping and restarting the timer causes it
to be truncated to a fairly coarse resolution.

Thanks for reminding me about that.

(In reply to comment #5)

Brad, can you look at this?

Yes, I will. Assigning the bug to me.

(In reply to comment #5)

If the timeout timer expires during "pause", we will have to wait for the
hook function to run before we can re-arm it, since timer_settime() is not
signal-safe.

According to the list from POSIX.1-2004[1] and POSIX.1-2008,[2] timer_settime() is signal-safe. Am I missing something?

[1]: http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03_03

Gerrit change 52568 adds it to luasandbox.
Gerrit change 52572 uses it in Scribunto.

(In reply to comment #10)

According to the list from POSIX.1-2004[1] and POSIX.1-2008,[2]
timer_settime()
is signal-safe. Am I missing something?

You are correct.