Page MenuHomePhabricator

deadlocks mass-deleting media files in categories
Closed, ResolvedPublic

Description

When mass-deleting media files that are in a category, quite often deadlocks are seen originating from "Article::updateCategoryCounts".

Mass deletion is a relative term here: having opened 7 delete tabs in the browser for Wikimedia Commons and deleting them as quickly as possible (say within 2-3 seconds), I may get 3 deadlock out of 7. Would be a great help for maintaince if this was somehow resolved.


Version: 1.22.0
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46086

Details

Reference
bz13921

Event Timeline

bzimport raised the priority of this task from to High.Nov 21 2014, 10:09 PM
bzimport set Reference to bz13921.
bzimport added a subscriber: Unknown Object (MLST).

Issue still there. I also ran into another deadlock sotuation: removing 4 older versions of a file quickly after eachother: “LocalFileDeleteBatch::doDBDeletes” gives: “1213: Deadlock found when trying to get lock; Try restarting transaction (10.0.0.231)”.

Perhaps this could use the jobqueue to keep a nice linear execution order.

(In reply to comment #2)

Perhaps this could use the jobqueue to keep a nice linear execution order.

Looking at what it does, that shouldn't be n..

ayg wrote:

Ugh, I hate databases. :(

(In reply to comment #5)

Is still still reproducible?

I don't do any mass deletions anymore at Commons, but let me quickly check if there are enough speedy deletion tags at the moment to give it a try.

(In reply to comment #6)

(In reply to comment #5)

Is still still reproducible?

I don't do any mass deletions anymore at Commons, but let me quickly check if
there are enough speedy deletion tags at the moment to give it a try.

Cannot reproduce.

(In reply to comment #5)

Is still still reproducible?

As said, I haven't tried with images, but I just reproduced it mass deleting pages in the MediaWiki namespace on nl.wikimedia.org (1.17wmf1 (r88299)). The error thrown came from RecentChange::save and was:

1213: Deadlock found when trying to get lock; try restarting transaction (10.0.6.49)”.

When updating twinkle to api now, I'm getting database query errors a lot (10 in a minute depending on the amount being processed) which seems to be related to this:

<?xml version="1.0"?><api servedby="srv255"><error code="internal_api_error_DBQueryError" info="Database query error" xml:space="preserve">

#0 /usr/local/apache/common-local/php-1.17/includes/db/Database.php(734): DatabaseBase->reportQueryError('Deadlock found ...', 1213, 'UPDATE page ...', 'Title::invalida...', false)
#1 /usr/local/apache/common-local/php-1.17/includes/db/Database.php(1349): DatabaseBase->query('UPDATE page ...', 'Title::invalida...')
#2 /usr/local/apache/common-local/php-1.17/includes/Title.php(2485): DatabaseBase->update('page', Array, Array, 'Title::invalida...')
#3 /usr/local/apache/common-local/php-1.17/includes/Article.php(4093): Title->invalidateCache()
#4 /usr/local/apache/common-local/php-1.17/includes/Article.php(3271): Article::onArticleDelete(Object(Title))
#5 /usr/local/apache/common-local/php-1.17/includes/api/ApiDelete.php(154): Article->doDeleteArticle('Deleted talk pa...')
#6 /usr/local/apache/common-local/php-1.17/includes/api/ApiDelete.php(79): ApiDelete::delete(Object(Article), 'a0b48ef8649b3b5...', 'Deleted talk pa...')
#7 /usr/local/apache/common-local/php-1.17/includes/api/ApiMain.php(657): ApiDelete->execute()
#8 /usr/local/apache/common-local/php-1.17/includes/api/ApiMain.php(339): ApiMain->executeAction()
#9 /usr/local/apache/common-local/php-1.17/includes/api/ApiMain.php(323): ApiMain->executeActionWithErrorHandling()
#10 /usr/local/apache/common-local/php-1.17/api.php(115): ApiMain->execute()
#11 /usr/local/apache/common-local/live-1.5/api.php(3): require('/usr/local/apac...')
#12 {main}

</error></api>

Just now I came across this again. Slightly different scenario -- mass deleting empty categories on Wikimedia Commons:

“RecentChange::save”: “1213: Deadlock found when trying to get lock; try restarting transaction (10.0.6.32)”.

Adding to comment 12: Two more methods the deadlock was reported in:

  • WikiPage::updateCategoryCounts
  • HTMLCacheUpdate::invalidateTitles

(In reply to comment #13)

Adding to comment 12: Two more methods the deadlock was reported in:

  • WikiPage::updateCategoryCounts
  • HTMLCacheUpdate::invalidateTitles

These are the only two I'm still seeing in the logs a lot.

  • Bug 46086 has been marked as a duplicate of this bug. ***

Log analysis and frequent reports on bug 46086 indicate that this has been happening more often since 1.21wmf11 was deployed. So, increasing priority.

Created attachment 12105
Exception rate time series

Rate of exceptions with WikiPage->doDeleteUpdates in the backtrace. Clearly something new is going on, possibly correlated with deployment of MW 1.21wmf11 to commonswiki.

Attached:

bug-13921-exception-rate.png (812×915 px, 21 KB)

The transaction length is certainly epic. Here's an indicative debug log trace from my local test wiki, just showing the transaction in question, annotated with hook calls:

http://paste.tstarling.com/p/spMggU.html

35 DB queries, 59 hook calls, 4 squid purges and some FileBackend operations. It's not hard to imagine something in there being slow.

What makes this tricky is the use of the $commit=false parameter to doDeleteArticleReal() by FileDeleteForm to attempt to roll back everything when a FileBackend operation fails. Of course, you can't unpurge Squid or unsend an IRC log line, but at least it does make an effort.

Moving things like squid purges and links updates to a DeferredUpdates job would be easy if it wasn't for this. You can't unqueue a DeferredUpdates job at present. Maybe you could clear the queue or somehow identify invalid jobs, but it would be pretty messy and heuristic. The use of onTransactionIdle() would suffer similar problems.

Ideally, WikiPage::doDeleteArticleReal() would be split up into a pre-commit function and a post-commit function, and FileDeleteForm would call the post-commit function after the file operations are successful.

WikiPage::doDeleteArticleReal() itself would continue to do both halves, for backwards compatibility, which would be fairly convenient in the short term since ApiDelete and maintenance/deleteBatch.php could continue to call it. But in the long term, those two callers would be best served by a new entry point that treated file and non-file deletions equivalently.

Assigning to Rob Lanphier for delegation.

Giving this to AaronSchulz to look at.

Related: https://gerrit.wikimedia.org/r/58820. Tim thinks this should help with about 80% of the problem.

Ariel: You silently reopened this - could you add a comment, e.g. how often / where this still can be seen? Thanks.

oops, I didn't mean to re-open. just to add myself. sorry! do what you need to do...