
[SMW] Patch: SMW_refreshData.php delay time * 100 causes high server load
Closed, Resolved · Public

Description

Patch resolving the "bug".

SMW_refreshData.php takes a delay parameter that is supposed to slow the refresh for each SMW ID. However, for some reason the delay is multiplied by 100 and applied only once per block of 100 IDs. That drives the server to 100% load while each block of 100 IDs is processed, even on an otherwise idle machine. In most of the cases we tested, it made otherwise idle servers sluggish to respond to even a single HTTP request.
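
To make the described behaviour concrete, here is a minimal PHP sketch. It is not the actual SMW_refreshData.php code; the option handling, the loop bounds, and the refreshSMWData() helper are illustrative stand-ins, and the delay is assumed to be given in milliseconds:

```php
<?php
// Illustrative sketch of the behaviour described above (not the real script):
// the -d delay is multiplied by 100 and only applied once per block of 100 IDs.
$options = array( 'd' => 100 );  // hypothetical: -d 100 (milliseconds)
$start   = 1;
$end     = 500;

function refreshSMWData( $id ) { // hypothetical stand-in for the store refresh call
	// ... rebuild semantic data for this SMW ID ...
}

$delay = isset( $options['d'] ) ? 100 * intval( $options['d'] ) : false;

for ( $id = $start; $id <= $end; $id++ ) {
	refreshSMWData( $id );
	// Sleep only after every 100th ID, for 100x the requested delay,
	// so the CPU runs flat out for the whole block in between.
	if ( $delay !== false && $id % 100 === 0 ) {
		usleep( 1000 * $delay );
	}
}
```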

If other refresh operations are in progress, or the MediaWiki job queue is running, the problem compounds until the server grinds to a halt and the refresh of the block of 100 IDs never completes. At that point the server won't even respond to SSH connection attempts for a reboot command, and it must be manually power-cycled. We tested this on a variety of servers, including a 2.5 GHz dual-core machine with 8 GB of RAM, before we realized the flaw was in SMW, not the hardware.

I don't know why the code prevents the delay from applying to each SMW ID, and the comments do not explain it. The fix was simple, and I have attached a patch. This is my first patch, so go easy on me if something isn't quite right. My text editor also removed some superfluous whitespace, but that shouldn't matter.
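
For comparison, here is a sketch of the kind of change the attached patch suggests. Again this is illustrative, not the patch itself, and reuses the same hypothetical names and millisecond assumption as the sketch above:

```php
<?php
// Illustrative sketch of the per-ID delay (not the attached patch itself):
// the * 100 factor is dropped and the sleep happens after every single ID.
$options = array( 'd' => 100 );  // hypothetical: -d 100 (milliseconds)
$start   = 1;
$end     = 500;

function refreshSMWData( $id ) { // hypothetical stand-in for the store refresh call
	// ... rebuild semantic data for this SMW ID ...
}

$delay = isset( $options['d'] ) ? intval( $options['d'] ) : false;

for ( $id = $start; $id <= $end; $id++ ) {
	refreshSMWData( $id );
	// Pause after each ID so the refresh never monopolises the CPU.
	if ( $delay !== false ) {
		usleep( 1000 * $delay );
	}
}
```

With the sleep applied per ID, the requested delay directly caps how fast the refresh consumes CPU, instead of alternating a long burst of full load with a long pause.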

I tested my changes on SMW 1.7.0.2, but the attached patch is against the version of SMW_refreshData.php included in SMW 1.7.1. We haven't upgraded to 1.7.1 to test yet, but the changes are so trivial that I doubt it would make any difference; SMW_refreshData.php appears to have no changes in 1.7.1 that would produce results different from what we tested.


Version: unspecified
Severity: critical

Attached:

Details

Reference
bz38136

Event Timeline

bzimport raised the priority of this task to Needs Triage. · Nov 22 2014, 12:53 AM
bzimport set Reference to bz38136.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for the patch. We should integrate this (I have not done so yet, but I wanted to give you a quick reply at least).

The original reason for using batches of 100 was that individual pages are usually so quick to process that a delay after each one seemed unnecessary. We can change this. However, if you have problems with 100 pages, you might already have problems with 1 page (often there are many more short/simple pages than "slow" pages). Maybe try running your update script with a larger nice value to keep it from blocking more important processes. This only has an effect if the problem is not in the database system (which applies the same priority to all queries). If your problems persist, especially if they are due to the MySQL part of the processing, then it would be nice to know why exactly your pages need so much CPU to refresh. We are currently looking into storage optimizations and are interested in testing their efficacy on sites that have one performance issue or another.

All of our refresh script testing was done at a nice value of 19. Each page takes at most a few seconds to process, with typical times of about 0.5 to 1.5 seconds. You may be correct that the problem is at the database level, but I'm not sure about that.

I don't think any single SMW feature or page is responsible. The MediaWiki job queue has similar effects, but it provides a --maxtime option that ends the run if it takes too long. I typically use --maxtime to give the job queue 1/3 to 2/3 of each minute (if there are jobs to run), depending on the visitor traffic load. For a casual SMW refresh, I use a 10 to 20 second delay after processing each ID, so the load is negligible over the 1 or 2 weeks that the refresh runs.

Let me know if there's anything I could do better with my patch for next time. I like commenting code in detail so it is self-documenting. Would it be all right if some of my patches are just code comments?

Comment on attachment 10816
Patch resolving the "bug".

[Correcting MIME Type and setting patch flag]