
Timeout when sending translation notification (yet again)
Closed, Declined | Public | PRODUCTION ERROR

Description

While sending out a translation notification on Meta, after hitting the "send notification to translators" button and waiting for a bit, I received this error page:

504 Gateway Time-out
nginx/1.1.19

The notification itself seems to have gone out fine, here's the log entry:

05:16, 10 February 2014 Tbayer (WMF) (talk | contribs) sent a notification about translating page Data retention guidelines; languages: all languages; deadline: none; priority: high; sent to 1469 recipients, failed for 0 recipients, skipped for 6 recipients

This bug continues the hallowed tradition of T43131: Timeout when sending translation notification and T57397: Timeout when sending translation notification (again). Both of those have been fixed, and they produced different kinds of error messages (not 504), which is why I'm filing a new bug rather than reopening one of them.

Details

Reference
bz61122

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:55 AM
bzimport set Reference to bz61122.
bzimport added a subscriber: Unknown Object (MLST).

Merged all tasks that are related to the same root issue - it is not scalable to do all the updates in a single user web request.
These are timing out because the extension attempts to insert each email and talk page edit job during the submission web request. It also tries to do the updates in the table for each user separately. On wikis with many translators, this means it could be hundreds (possibly thousands) of queries. MediaWiki also now automatically aborts transactions that take too long (and the allowed duration will probably be decreased in the future).

The best thing to do here would be to defer to the job queue and do all the updates in one job which submits all other jobs. We also probably shouldn't be (mis)using the user_properties table and instead should have the extension's own table so that we can easily do updates in batches.
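To illustrate the proposal above, here is a minimal, language-neutral sketch in Python. It is not MediaWiki's job queue API and not the extension's code: `Job`, `JobQueue`, `handle_web_request`, `chunked` and `BATCH_SIZE` are all hypothetical names invented for this example. The idea it shows is that the web request enqueues a single fan-out job, and that job later creates the per-batch email and talk page jobs outside the request.

```python
from collections import deque
from dataclasses import dataclass

BATCH_SIZE = 100  # assumed batch size, chosen only for illustration


@dataclass(frozen=True)
class Job:
    kind: str               # "fanout", "email" or "talkpage" (hypothetical kinds)
    recipients: tuple = ()


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class JobQueue:
    """In-memory stand-in for a real job queue."""

    def __init__(self):
        self._jobs = deque()
        self.delivered = []  # (kind, recipient) pairs, recorded for inspection

    def __len__(self):
        return len(self._jobs)

    def push(self, job):
        self._jobs.append(job)

    def run_all(self):
        """Process jobs until the queue is empty (what a job runner would do)."""
        while self._jobs:
            job = self._jobs.popleft()
            if job.kind == "fanout":
                # The fan-out job submits all other jobs, one pair per batch.
                for batch in chunked(job.recipients, BATCH_SIZE):
                    self.push(Job("email", tuple(batch)))
                    self.push(Job("talkpage", tuple(batch)))
            else:
                # A real worker would send mail or edit talk pages here.
                self.delivered.extend((job.kind, r) for r in job.recipients)


def handle_web_request(queue, recipients):
    """Do only constant work in the request: enqueue one fan-out job."""
    queue.push(Job("fanout", tuple(recipients)))
    return 1  # number of jobs enqueued during the request
```

With this shape, the work done inside the user's request no longer grows with the number of translators; the per-recipient work happens in batches when the queue is processed.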

Glaisher raised the priority of this task from Medium to High. May 22 2016, 10:13 AM
MarcoAurelio subscribed.

[WY2yRApAEKcAABSE9XgAAAAC] 2017-08-11 13:34:37: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"

That said, it seems the job is sent to the queue and performed. It'll send double/triple notifications if you reload the page. If you just ignore the exception the translation notifications will get delivered. Question is, all of them?

[WY2yRApAEKcAABSE9XgAAAAC] 2017-08-11 13:34:37: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"

[WY2yRApAEKcAABSE9XgAAAAC] /wiki/Special:NotifyTranslators   Wikimedia\Rdbms\DBTransactionSizeError from line 1177 of /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 5.2927534580231 second(s) in writes, exceeding the 3 limit.
referrer:	 https://meta.wikimedia.org/wiki/Special:NotifyTranslators

#0 [internal function]: Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges(Wikimedia\Rdbms\DatabaseMysqli)
#1 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1502): call_user_func_array(Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges;300, array)
#2 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1187): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges;300)
#3 [internal function]: Wikimedia\Rdbms\LoadBalancer->approveMasterChanges(array)
#4 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/lbfactory/LBFactory.php(183): call_user_func_array(array, array)
#5 [internal function]: Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod(Wikimedia\Rdbms\LoadBalancer, string, array)
#6 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(417): call_user_func_array(Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod;230, array)
#7 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/lbfactory/LBFactory.php(186): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod;230, array)
#8 /srv/mediawiki/php-1.30.0-wmf.13/includes/libs/rdbms/lbfactory/LBFactory.php(223): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#9 /srv/mediawiki/php-1.30.0-wmf.13/includes/MediaWiki.php(598): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#10 /srv/mediawiki/php-1.30.0-wmf.13/includes/MediaWiki.php(571): MediaWiki::preOutputCommit(RequestContext, Closure$MediaWiki::main;324)
#11 /srv/mediawiki/php-1.30.0-wmf.13/includes/MediaWiki.php(884): MediaWiki->doPreOutputCommit(Closure$MediaWiki::main;324)
#12 /srv/mediawiki/php-1.30.0-wmf.13/includes/MediaWiki.php(523): MediaWiki->main()
#13 /srv/mediawiki/php-1.30.0-wmf.13/index.php(43): MediaWiki->run()
#14 /srv/mediawiki/w/index.php(3): include(string)
#15 {main}

I wonder how MassMessage behaves when sending the job. There are lists with hundreds of subscribers, yet it never times out. Maybe the same behaviour should be used here to make the feature usable again? Ping @Legoktm for analysis.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM
Krinkle subscribed.

Closing as too old for a prod error to be usefully investigable; it also does not seem generic enough to point to an obvious general root cause that needs to be structurally changed somehow.