
Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16"
Closed, Resolved, Public

Description

I've discovered a single revision_id that, when requested from en.wikipedia.org, causes an "Internal Error" and crashes the API query. Through the API, the problem only happens when I request the revision content. The error is out of the ordinary, which is why I am reporting it.

Revision id: 186704908
Problem observed for: 3 days
Error Message:
LBFactory_Multi::newExternalLB: Unknown cluster "cluster16"
Backtrace:
#0 /usr/local/apache/common-local/wmf-deployment/includes/db/LBFactory_Multi.php(139): LBFactory_Multi->newExternalLB('cluster16', false)
#1 /usr/local/apache/common-local/wmf-deployment/includes/ExternalStoreDB.php(42): LBFactory_Multi->getExternalLB('cluster16', false)
#2 /usr/local/apache/common-local/wmf-deployment/includes/ExternalStoreDB.php(53): ExternalStoreDB->getLoadBalancer('cluster16')
#3 /usr/local/apache/common-local/wmf-deployment/includes/ExternalStoreDB.php(125): ExternalStoreDB->getSlave('cluster16')
#4 /usr/local/apache/common-local/wmf-deployment/includes/ExternalStoreDB.php(97): ExternalStoreDB->fetchBlob('cluster16', '3970273', false)
#5 /usr/local/apache/common-local/wmf-deployment/includes/ExternalStore.php(43): ExternalStoreDB->fetchFromURL('DB://cluster16/...')
#6 /usr/local/apache/common-local/wmf-deployment/includes/Revision.php(732): ExternalStore::fetchFromURL('DB://cluster16/...')
#7 /usr/local/apache/common-local/wmf-deployment/includes/Revision.php(920): Revision::getRevisionText(Object(stdClass))
#8 /usr/local/apache/common-local/wmf-deployment/includes/Revision.php(621): Revision->loadText()
#9 /usr/local/apache/common-local/wmf-deployment/includes/Revision.php(600): Revision->getRawText()
#10 /usr/local/apache/common-local/wmf-deployment/includes/Article.php(481): Revision->getText(2)
#11 /usr/local/apache/common-local/wmf-deployment/includes/Article.php(343): Article->fetchContent(186704908)
#12 /usr/local/apache/common-local/wmf-deployment/includes/Article.php(230): Article->loadContent()
#13 /usr/local/apache/common-local/wmf-deployment/includes/Article.php(832): Article->getContent()
#14 /usr/local/apache/common-local/wmf-deployment/includes/Wiki.php(493): Article->view()
#15 /usr/local/apache/common-local/wmf-deployment/includes/Wiki.php(70): MediaWiki->performAction(Object(OutputPage), Object(Article), Object(Title), Object(User), Object(WebRequest))
#16 /usr/local/apache/common-local/wmf-deployment/index.php(117): MediaWiki->performRequestForTitle(Object(Title), Object(Article), Object(OutputPage), Object(User), Object(WebRequest))
#17 /usr/local/apache/common-local/live-1.5/index.php(3): require('/usr/local/apac...')
#18 {main}
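The backtrace shows the lookup path: the revision's text row holds an external-store URL of the form DB://cluster16/3970273, which ExternalStoreDB splits into a cluster name and a blob id before asking the load-balancer factory for that cluster. The minimal Python sketch below illustrates that resolution step. It is not MediaWiki's code, and `KNOWN_CLUSTERS` is a hypothetical stand-in for the wiki's external-store configuration.

```python
from typing import NamedTuple

# Hypothetical stand-in for the configured external-store clusters;
# the real list lives in the wiki's load-balancer configuration.
KNOWN_CLUSTERS = {"cluster24", "cluster25"}


class UnknownClusterError(LookupError):
    """Raised when a URL names a cluster missing from the config."""


class BlobLocation(NamedTuple):
    cluster: str
    blob_id: str


def parse_external_url(url: str) -> BlobLocation:
    """Split 'DB://<cluster>/<id>'; anything after the cluster stays in the id."""
    scheme, sep, rest = url.partition("://")
    if scheme != "DB" or not sep:
        raise ValueError(f"not a DB:// external store URL: {url!r}")
    cluster, _, blob_id = rest.partition("/")
    if not cluster or not blob_id:
        raise ValueError(f"malformed external store URL: {url!r}")
    return BlobLocation(cluster, blob_id)


def resolve(url: str, known=KNOWN_CLUSTERS) -> BlobLocation:
    """Mirror the failure in the trace: an unconfigured cluster is fatal."""
    loc = parse_external_url(url)
    if loc.cluster not in known:
        raise UnknownClusterError(f'Unknown cluster "{loc.cluster}"')
    return loc
```

The point of the sketch is that the error is raised before any database is contacted: the URL is still stored in the text row, but nothing in the configuration tells the code where cluster16 lives.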

Example #1: http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_Fungi/fungus_articles_by_size&oldid=186704908

Example #2: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=186704908&rvprop=content|ids

Example #3 (no content, should work): http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=186704908&rvprop=ids|timestamp
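Examples #2 and #3 differ only in `rvprop`. Only the variant that asks for `content` has to load the revision text, and therefore only it reaches the external store. Here is a small Python sketch that builds both query URLs. The `https` endpoint and the helper name are assumptions for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def revisions_query_url(revid: int, props: list[str]) -> str:
    """Build an action=query&prop=revisions URL for one revision id."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(revid),
        # Multiple rvprop values are joined with '|', as in the examples;
        # urlencode percent-encodes the '|' as %7C.
        "rvprop": "|".join(props),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


with_content = revisions_query_url(186704908, ["content", "ids"])    # Example #2
metadata_only = revisions_query_url(186704908, ["ids", "timestamp"])  # Example #3
```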


Version: unspecified
Severity: normal

Details

Reference
bz24675

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:01 PM
bzimport set Reference to bz24675.
bzimport added a subscriber: Unknown Object (MLST).

Cluster 16 is marked as "Obsolete ex-fedora clusters"

I just tested Example #1 and Example #2. They both still fail as described above.

I hope I'm not acting out of step by re-opening this ticket, but this is still an issue.

It still fails the same way, except that now we only get the adorably clueless "Fatal exception of type MWException".

Roan, who knows enough about external storage and could investigate this?

Thanks for your detailed report, Aaron.

There's no cluster16 in the config. The revision timestamp is 2008-01-25T00:11:44Z. You can't simply disappear an ES cluster leaving references to it. What happened with its contents?
Given these are 2008 events, I'd pass this to Tim.

By the way, same error if you try to export it: you get an invalid XML file which ends with

<!doctype html>
<html><head><title>Internal error</title></head><body>
<div class="errorbox">[7aae7e6a] 2012-08-24 19:09:20: Fatal exception of type MWException</div>
<!-- Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. --></body></html>
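Because the failed export is plain XML with an HTML error page appended, it no longer parses. A cheap mechanical check is therefore to test the output for well-formedness. The sketch below uses Python's standard `xml.etree.ElementTree`. It loads everything into memory, so it is only an assumption that this suits small single-page exports. A multi-gigabyte export would need a streaming parser such as `iterparse` instead.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: str) -> bool:
    """Return True if the export text parses as XML, False otherwise."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # Truncated elements or an appended HTML error page end up here.
        return False
    return True
```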

(In reply to comment #4)

CC Tim.

Example number 3 works

http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_Fungi/fungus_articles_by_size&oldid=186704908

2012-10-21 15:44:32 mw39 enwiki: [6828b293] /w/index.php?title=Wikipedia:WikiProject_Fungi/fungus_articles_by_size&oldid=186704908   Exception from line 162 of /usr/local/apache/common-local/php-1.21wmf1/includes/db/LBFactory_Multi.php: LBFactory_Multi::newExternalLB: Unknown cluster "cluster16"
#0 /usr/local/apache/common-local/php-1.21wmf1/includes/db/LBFactory_Multi.php(181): LBFactory_Multi->newExternalLB('cluster16', false)
#1 /usr/local/apache/common-local/php-1.21wmf1/includes/ExternalStoreDB.php(42): LBFactory_Multi->getExternalLB('cluster16', false)
#2 /usr/local/apache/common-local/php-1.21wmf1/includes/ExternalStoreDB.php(55): ExternalStoreDB->getLoadBalancer('cluster16')
#3 /usr/local/apache/common-local/php-1.21wmf1/includes/ExternalStoreDB.php(143): ExternalStoreDB->getSlave('cluster16')
#4 /usr/local/apache/common-local/php-1.21wmf1/includes/ExternalStoreDB.php(108): ExternalStoreDB->fetchBlob('cluster16', '3970273', false)
#5 /usr/local/apache/common-local/php-1.21wmf1/includes/ExternalStore.php(74): ExternalStoreDB->fetchFromURL('DB://cluster16/...')
#6 /usr/local/apache/common-local/php-1.21wmf1/includes/Revision.php(934): ExternalStore::fetchFromURL('DB://cluster16/...')
#7 /usr/local/apache/common-local/php-1.21wmf1/includes/Revision.php(1135): Revision::getRevisionText(Object(stdClass))
#8 /usr/local/apache/common-local/php-1.21wmf1/includes/Revision.php(823): Revision->loadText()
#9 /usr/local/apache/common-local/php-1.21wmf1/includes/Revision.php(800): Revision->getRawText()
#10 /usr/local/apache/common-local/php-1.21wmf1/includes/Article.php(383): Revision->getText(2)
#11 /usr/local/apache/common-local/php-1.21wmf1/includes/Article.php(583): Article->fetchContent()
#12 /usr/local/apache/common-local/php-1.21wmf1/includes/actions/ViewAction.php(37): Article->view()
#13 /usr/local/apache/common-local/php-1.21wmf1/includes/Wiki.php(427): ViewAction->show()
#14 /usr/local/apache/common-local/php-1.21wmf1/includes/Wiki.php(304): MediaWiki->performAction(Object(Article))
#15 /usr/local/apache/common-local/php-1.21wmf1/includes/Wiki.php(553): MediaWiki->performRequest()
#16 /usr/local/apache/common-local/php-1.21wmf1/includes/Wiki.php(446): MediaWiki->main()
#17 /usr/local/apache/common-local/php-1.21wmf1/index.php(59): MediaWiki->run()
#18 /usr/local/apache/common-local/live-1.5/index.php(3): require('/usr/local/apac...')
#19 {main}

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=186704908&rvprop=content|ids

2012-10-21 15:45:33 mw70 enwiki: [7b1060ef] /w/api.php?action=query&prop=revisions&revids=186704908&rvprop=content|ids Exception from line 162 of /usr/local/apache/common-local/php-1.21wmf1/includes/db/LBFactory_Multi.php: LBFactory_Multi::newExternalLB: Unknown cluster "cluster16"

I have no idea how you managed to find this among the several hundred million revisions we have, but:

mysql:wikiadmin@db1055 [enwiki]> select rev_text_id from revision where rev_id = 186704908;
+-------------+
| rev_text_id |
+-------------+
|   185600705 |
+-------------+
1 row in set (0.08 sec)

mysql:wikiadmin@db1055 [enwiki]> select min(old_id), max(old_id) from text where old_text like 'DB://cluster16/%';
+-------------+-------------+
| min(old_id) | max(old_id) |
+-------------+-------------+
|   185600705 |   185600705 |
+-------------+-------------+
1 row in set (16 min 0.65 sec)
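The two queries above follow revision.rev_text_id to text.old_id, then scan text for URLs on the dead cluster. The sketch below replays them against a toy in-memory SQLite database. Only the revision id, text id and URL come from this task. The second row is hypothetical, and the schema keeps only the columns the queries touch. The trailing slash in the `LIKE` pattern keeps `cluster16` from also matching a name such as `cluster160`. `old_text` has no index in this toy schema (and presumably none on the real table either), so the `LIKE` reads every row. That is consistent with the 16-minute runtime at enwiki scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER);
CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT, old_flags TEXT);
-- 186704908 -> 185600705 is from the task; the (1, 1) row is hypothetical.
INSERT INTO revision VALUES (186704908, 185600705), (1, 1);
INSERT INTO text VALUES
  (185600705, 'DB://cluster16/3970273', 'utf-8,gzip,external'),
  (1, 'DB://cluster24/1', 'utf-8,gzip,external');
""")


def text_id_for_revision(rev_id: int):
    """Return rev_text_id for a revision, or None if it does not exist."""
    row = conn.execute(
        "SELECT rev_text_id FROM revision WHERE rev_id = ?", (rev_id,)
    ).fetchone()
    return row[0] if row else None


def text_ids_on_cluster(cluster: str) -> list[int]:
    """All text rows whose external URL points at the given cluster."""
    pattern = f"DB://{cluster}/%"  # no index on old_text: full scan
    rows = conn.execute(
        "SELECT old_id FROM text WHERE old_text LIKE ? ORDER BY old_id",
        (pattern,),
    )
    return [r[0] for r in rows]
```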
Krinkle renamed this task from Revision error on en.wikipedia.org, Internal error: Unknown cluster "cluster16" to Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16". Mar 26 2015, 3:24 AM
Krinkle added projects: DBA, acl*sre-team.
Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).
Krinkle subscribed.

So... where did that revision go? Can we scan the other clusters, perhaps?

I suspect that if we had this blob lying around before, it may have been lost during https://wikitech.wikimedia.org/wiki/External_storage/Srv-data-migration, if not earlier.

I don't have any particularly sane ways of finding out whether it exists but is orphaned, or just completely lost. Would the IDs have been created systematically? Where would information about the decommissioning of cluster16 be? In some private SVN commit that no one looks at anymore?
I guess you could theoretically look for all texts that look like the page at the revision before (not that much would've changed), and work out which ones are actually referenced in the text table or not. Probably more work than it's worth.
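The search suggested above has two filters: skip blobs that some revision already references, then rank the remaining blobs by similarity to the previous revision's text. The sketch below shows that idea with Python's `difflib`. The inputs (an id-to-text mapping of decompressed blobs and a set of referenced ids) and the 0.8 threshold are hypothetical. `SequenceMatcher` is roughly quadratic on long texts, so running this across a whole cluster of ~80 KB pages would be expensive, which fits the "more work than it's worth" assessment.

```python
import difflib


def rank_orphan_candidates(previous_text: str,
                           blobs: dict[int, str],
                           referenced_ids: set[int],
                           min_ratio: float = 0.8) -> list[tuple[float, int]]:
    """Return (similarity, text_id) pairs for unreferenced, similar blobs.

    blobs: hypothetical mapping of text id -> already-decompressed text.
    referenced_ids: text ids that some revision row already points at.
    """
    results = []
    for text_id, body in blobs.items():
        if text_id in referenced_ids:
            continue  # accounted for, so it cannot be the lost blob
        ratio = difflib.SequenceMatcher(None, previous_text, body).ratio()
        if ratio >= min_ratio:
            results.append((ratio, text_id))
    return sorted(results, reverse=True)  # most similar first
```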

> I suspect that if we had this blob lying around before, it may have been lost during https://wikitech.wikimedia.org/wiki/External_storage/Srv-data-migration, if not earlier.

I think it was earlier than that.

Does anyone know when cluster16 was decommissioned? Maybe we could restore the revision from a dump. Might it be in enwiki-20100312-pages-meta-history.xml? Based on what I could find on the dumps site, that's the oldest successful page-history dump made after the revision that still exists. However, it's not on Labs, and since I can't write to /public/dumps, I don't want to download it in Labs. @ArielGlenn?

> Maybe we could restore the revision from a dump. Might it be in enwiki-20100312-pages-meta-history.xml? Based on what I could find on the dumps site, that's the oldest existing page history dump made after the revision which was successful

I pulled the file to tin like this:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp mwdeploy@snapshot1001:/mnt/data/xmldatadumps/public/archive/enwiki/20100312/enwiki-20100312-pages-meta-history.xml.7z T26675.xml.7z
(snapshot1001:/mnt/data is from dataset1001.wikimedia.org:/data)

However:

krenair@tin:~$ 7z l T26675.xml.7z

[uninteresting stuff]

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2010-04-08 17:18:36 ..... 3918111615000  17008818440  T26675.xml

Looks like that would uncompress to 3.56 TiB (about 3.9 TB)... I only want one revision. Is there a host where this can be extracted and searched?
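The size figure comes straight from the 7z listing. The arithmetic below shows that it is in binary units (TiB), and that the archive compresses by a factor of roughly 230. That ratio is why streaming the decompressed output through a filter is far more practical than extracting it to disk.

```python
uncompressed = 3_918_111_615_000  # bytes, "Size" column of the 7z listing
compressed = 17_008_818_440       # bytes, "Compressed" column of the listing

tib = uncompressed / 2**40        # binary terabytes
tb = uncompressed / 10**12        # decimal terabytes
ratio = uncompressed / compressed

print(f"{tib:.2f} TiB ({tb:.2f} TB), about {ratio:.0f}x compression")
```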

@Halfak has agreed to help find this in the dump. We don't think zgrep will work with 7z.

You can do 7z e -so T26675.xml.7z | grep -C 100 "<id>186704908</id>".

Indeed, but the text may be more than 100 lines, so I was parsing to get the whole revision block. However, my parser ran into a chunk of invalid XML (so it thinks), so I'm looking into that. :S

I'll run your simple 7z grep in parallel to see if we can get this done. :)
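A fixed `grep -C 100` window can cut off a long revision. It can also match a page id or contributor id that happens to equal the revision id. The sketch below streams the dump line by line and returns the whole `<revision>` block whose own `<id>` matches. It assumes the dump layout where `<id>` is the first child line of `<revision>` and each opening and closing `<revision>` tag sits on its own line. `stream_7z` is a hypothetical helper that wraps the same `7z e -so` command used above.

```python
import subprocess
from typing import Iterable, Iterator, Optional


def find_revision_block(lines: Iterable[str], rev_id: int) -> Optional[str]:
    """Return the full <revision>...</revision> block whose own <id> is rev_id.

    Only the first <id> inside a <revision> is compared, so page ids and
    contributor ids (<contributor><id>...) cannot cause a false match.
    """
    target = f"<id>{rev_id}</id>"
    block: Optional[list] = None
    first_id: Optional[str] = None
    for line in lines:
        stripped = line.strip()
        if stripped == "<revision>":
            block, first_id = [line], None
            continue
        if block is None:
            continue  # outside any revision (siteinfo, page header, ...)
        block.append(line)
        if first_id is None and stripped.startswith("<id>"):
            first_id = stripped
        if stripped == "</revision>":
            if first_id == target:
                return "".join(block)
            block = None
    return None


def stream_7z(path: str) -> Iterator[str]:
    """Yield decompressed lines of a .7z dump without writing it to disk."""
    proc = subprocess.Popen(["7z", "e", "-so", path], stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, text=True,
                            encoding="utf-8", errors="replace")
    assert proc.stdout is not None
    try:
        yield from proc.stdout
    finally:
        proc.stdout.close()  # stops 7z early if the caller stops reading
        proc.wait()

# Usage (not run here): find_revision_block(stream_7z("dump.xml.7z"), 186704908)
```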

I couldn't find the revision with my parsing utility, so I also tried the grep strategy.

(3.4)[halfak@stat1003: ~]
$ 7z e -so /mnt/data/xmldatadumps/public/archive/enwiki/20100312/enwiki-20100312-pages-meta-history.xml.7z | grep -C 100 "<id>186704908</id>"
  
7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,16 CPUs)
  
Processing archive: /mnt/data/xmldatadumps/public/archive/enwiki/20100312/enwiki-20100312-pages-meta-history.xml.7z
  
Extracting  enwiki-20100312-pages-meta-history.xml
  
Everything is Ok
  
Size:       3918111615000
Compressed: 17008818540

That revision ID isn't there.

https://wikitech.wikimedia.org/wiki/Dumps/History suggests that this dump was missing a third of all revisions. A couple of days ago I also looked at stat1003:/mnt/data/xmldatadumps/public/archive/2010/2010-11/enwiki/20101011/enwiki-20101011-pages-meta-history.xml.bz2, but it seemed to be corrupt.

Does anyone know (even roughly) when cluster16 was decommissioned? Perhaps someone has some IRC logs somewhere?

Although actually, 2010-10 dumps would not be helpful considering this task was created in 2010-08.

Can we just fill it in with a copy of the previous revision, or with a text saying "revision lost"? This is an old bot edit; it's not like we would lose that much.

We could, probably; it's just nice not to have broken stuff... Or we could just hide the revision for technical reasons. There won't be any real data loss. Here is the diff between the revisions before and after it:

https://en.wikipedia.org/w/index.php?title=Wikipedia%3AWikiProject_Fungi%2Ffungus_articles_by_size&type=revision&diff=186924826&oldid=186464935

+1 to just marking the text as deleted and putting a placeholder in the text store. This is more of a technical problem than a practical one.

Halfak removed Halfak as the assignee of this task. Oct 1 2015, 5:27 PM

I might have a few tricks for recovering this revision. Give me a day or two and I'll see what I can do.

I'll let @Betacommand try to find the revision, but if that doesn't work out I'll insert a new text entry like SYSADMIN NOTE: Text of this revision has been lost, for details see https://phabricator.wikimedia.org/T26675, point the revision at that and set rev_deleted = 1 (DELETED_TEXT).
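`rev_deleted` is a bitfield, so "set rev_deleted = 1" only leaves other bits alone if the column was 0 beforehand. The sketch below is written with Python's `IntFlag`. It assumes MediaWiki's bit values (TEXT = 1, as stated above, plus COMMENT = 2, USER = 4, RESTRICTED = 8) and shows how to add the text bit while preserving any bits that are already set. The helper name is hypothetical.

```python
from enum import IntFlag


class RevDeleted(IntFlag):
    # Bit values of MediaWiki's rev_deleted column (DELETED_* constants).
    TEXT = 1
    COMMENT = 2
    USER = 4
    RESTRICTED = 8


def hide_text(current: int) -> int:
    """Set the TEXT bit without clearing bits that are already set."""
    return int(current | RevDeleted.TEXT)
```

In SQL, the equivalent bitwise form would be `rev_deleted = rev_deleted | 1`, which in MySQL is a no-op for rows where the bit is already set.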

@Krenair, it may also make sense to add a relevant row to the logging table.

krenair@tin:~$ mwscript eval.php enwiki
> echo ExternalStore::insertToDefault( gzdeflate( "SYSADMIN NOTE: Text of this revision has been lost, for details see https://phabricator.wikimedia.org/T26675" ) );
DB://cluster24/95690756
> exit
krenair@tin:~$ sql --write enwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3361705417
Server version: 5.5.34-MariaDB-1~precise-log mariadb.org binary distribution

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:wikiadmin@db1052 [enwiki]> select rev_text_id from revision where rev_id = 186704908;
+-------------+
| rev_text_id |
+-------------+
|   185600705 |
+-------------+
1 row in set (0.09 sec)

mysql:wikiadmin@db1052 [enwiki]> select * from text where old_id = 185600705;
+-----------+------------------------+---------------------+
| old_id    | old_text               | old_flags           |
+-----------+------------------------+---------------------+
| 185600705 | DB://cluster16/3970273 | utf-8,gzip,external |
+-----------+------------------------+---------------------+
1 row in set (0.03 sec)

mysql:wikiadmin@db1052 [enwiki]> update text set old_text = "DB://cluster24/95690756" where old_id = 185600705;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
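The `UPDATE` only changed `old_text`, so the row keeps `old_flags = utf-8,gzip,external`. In MediaWiki, the `gzip` flag means the payload is raw DEFLATE, as produced by PHP's `gzdeflate` and read back by `gzinflate`. That is why the placeholder was passed through `gzdeflate` before `insertToDefault`. The Python sketch below shows the round trip. `zlib` with `wbits=-15` gives raw DEFLATE without a header. Fetching the external URL is deliberately left out, and the sketch only handles utf-8 rows.

```python
import zlib


def php_gzdeflate(data: bytes) -> bytes:
    """Raw DEFLATE (no zlib/gzip header), like PHP's gzdeflate()."""
    comp = zlib.compressobj(wbits=-15)
    return comp.compress(data) + comp.flush()


def decode_text_row(blob: bytes, flags: str) -> str:
    """Undo the 'gzip' and 'utf-8' flags of a text row.

    'external' only says the bytes live behind a DB:// URL; `blob` is
    assumed to be the already-fetched payload.
    """
    flag_set = set(flags.split(","))
    if "gzip" in flag_set:
        blob = zlib.decompress(blob, -15)  # raw inflate, like gzinflate()
    if "utf-8" not in flag_set:
        raise ValueError("this sketch only handles utf-8 rows")
    return blob.decode("utf-8")
```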

I didn't mark it as deleted. I'm not sure if we should or not.

I guess we should update rev_len (currently 78946) and rev_sha1 (currently blank) as well?

Yeah, sorry for the delay in getting back to this. I have a dump from a few months after this, but it doesn't look like the revision is in it.

Removing this from my assigned list until someone answers my question.

jcrespo lowered the priority of this task from Medium to Low. Mar 9 2016, 2:27 PM

@Krenair: yes, then we can close the ticket.

Nemo_bis raised the priority of this task from Low to Medium.
krenair@terbium:~$ mwscript eval.php enwiki
> $rev = Revision::newFromId( 186704908 );

> $content = $rev->getContent();

> var_dump( $content->getSize() );
int(108)

> var_dump( Revision::base36Sha1( $content->getNativeData() ) );
string(31) "tgs3iu7ve32iaourcav4u522yr3nwt3"

> krenair@terbium:~$ sql --write enwiki

mysql:wikiadmin@db1052 [enwiki]> select rev_len, rev_sha1 from revision where rev_id = 186704908;
+---------+----------+
| rev_len | rev_sha1 |
+---------+----------+
|   78946 |          |
+---------+----------+
1 row in set (0.02 sec)

mysql:wikiadmin@db1052 [enwiki]> update revision set rev_len = 108, rev_sha1 = 'tgs3iu7ve32iaourcav4u522yr3nwt3' where rev_id = 186704908;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql:wikiadmin@db1052 [enwiki]> select rev_len, rev_sha1 from revision where rev_id = 186704908;
+---------+---------------------------------+
| rev_len | rev_sha1                        |
+---------+---------------------------------+
|     108 | tgs3iu7ve32iaourcav4u522yr3nwt3 |
+---------+---------------------------------+
1 row in set (0.00 sec)

mysql:wikiadmin@db1052 [enwiki]> exit
Bye