
OAIRepository fatal error stops search indexing
Closed, Resolved (Public)

Description

Author: rainman

Description:
This bug is related to bug 26203. OAIRepository returns a 500 error on certain queries on dewikisource, which causes the search indexer to get stuck.

Here are the exact queries and errors:

First query:
http://de.wikisource.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-03-17T21:09:10Z

Dec 2 14:48:29 10.0.2.227 apache2[32008]: PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 76 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/normal/UtfNormal.php on line 272

Second query:
http://de.wikisource.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:1009861

Dec 2 15:03:39 10.0.2.182 apache2[9827]: PHP Fatal error: Allowed memory size of 83886080 bytes exhausted (tried to allocate 26519414 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/OutputHandler.php on line 8

The second query/error was produced when I tried to skip some articles by bumping the from= parameter forward in time.


Version: unspecified
Severity: critical

Details

Reference
bz26304

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 11:20 PM
bzimport set Reference to bz26304.

rainman wrote:

Bump. Happened again, but now on en.wp:

http://en.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:414565305

I am going to try skipping some articles, but this really needs to be fixed.

rainman wrote:

And again on commons:

http://commons.wikimedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:53981669

Bumping priority on this as it is breaking search index updates at random times on multiple wikis.

*** Bug 28852 has been marked as a duplicate of this bug. ***

nagutokay wrote:

Search is not working at all this morning. Yesterday evening, results that had been missing for days started appearing, but this morning searches return no results at all.

rainman wrote:

This is a known problem unrelated to OAIRepository. We have only one search host (search11) that serves most low-traffic wikis (including commons), and it needs to be restarted every Sunday morning. That introduces around 5-10 minutes of downtime, since the host carries so many wikis and it takes time for all ~800 of them to be properly loaded into memory. Hopefully at some point we'll be able to spare two search hosts so we can eliminate this downtime.

Are you still seeing this? Brion thought some changes that were made may have mitigated this.

rainman wrote:

The URLs listed above now work. Closing as FIXED, will reopen if it occurs again.

Bug 28427 notes a memory usage issue in UtfNormal::quickIsNFCVerify (part of the Unicode normalization routines we run on a lot of input and output), which could perhaps be contributing. I was thinking there was a related fix, but that's actually bug 28146, which was fixed by chopping a run over one giant string into runs over smaller strings; that fix was specific to the calling code (in DjVu file handling), so it probably would not have an effect here.

More generally, hitting memory limits here could simply be due to handling too much data in one request, and the repository may need to change how it divides the data into chunks.

It's possible that the errors were occurring before, and no longer are, because the specific large pages that were coming over the channel are no longer in the queue.

nagutokay wrote:

It's back again! The search engine update lags by 48 hours now.

rainman wrote:

*** Bug 29679 has been marked as a duplicate of this bug. ***

(In reply to comment #11)

*** Bug 29679 has been marked as a duplicate of this bug. ***

Just to be sure, if this bug is fixed and Bug 29679 is a duplicate of this one, when should we expect the index to be updated on Portuguese Wikibooks?

rainman wrote:

Both commons and ptwikibooks should be updated by tomorrow. However, the bug still occurs. So, reopening.

rainman wrote:

If anyone is going to track this down and needs some fresh URLs that fail, do the following:

ssh searchidx2
grep "HTTP response code: 500" /a/search/log/log-all | tail

I have no server access, so I can't tell you exactly what's being recorded now, but when pulling some of the older links shown above (for enwiki & commons) using the testing password, I receive what at first looks like a legitimate OAI-PMH XML response. Partway through, after the end of one of the <record>...</record> elements, the Wikimedia error page suddenly shows up, like this:

... ^ OAI-PMH with embedded MediaWiki export XML

</page>

</mediawiki>
</metadata>
</record>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<title>Wikimedia Error</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="author" content="Mark Ryan, with translation by many people; see http://meta.wikimedia.org/wiki/Multilingual_error_messages"/>
<meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>

... v HTML error page

The OAI output, like Special:Export, disables the regular page output and possibly the regular output buffering as well, so it might be that some error is triggering that error page. (Have we modified PHP to output that error page directly on a fatal error, perhaps?)
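
For reference, the raw-output setup is roughly this pattern (a sketch of the usual Special:Export-style code, paraphrased from memory rather than quoted from the OAI extension):

// Sketch only: the typical MediaWiki "raw XML output" setup, not the extension's exact lines.
global $wgOut;
$wgOut->disable();       // skip the normal skin/page output
wfResetOutputBuffers();  // drop the regular output buffering/handlers
header( 'Content-Type: text/xml; charset=utf-8' );
// ...from here the OAI-PMH XML is echoed directly, so anything PHP prints on a
// fatal error lands in the middle of the stream.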

Taking a quick look over the current OAI extension code, I don't see any explicit use of UtfNormal::cleanUp(), though it will already be called implicitly by the xmlsafe() string-escaping wrapper in the WikiExport common code and by a few bits in OAIRepo itself.
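
For context, xmlsafe() is roughly this kind of thin wrapper (paraphrased from memory, not the exact Export.php source), which is where the normalization cost comes in implicitly:

// Paraphrased sketch of the xmlsafe() escaping wrapper -- not verbatim.
// Old page data may contain invalid UTF-8 that would break the XML output,
// so it gets normalized/cleaned before escaping.
function xmlsafe( $string ) {
	$string = UtfNormal::cleanUp( $string );  // Unicode normalization + invalid-byte stripping
	return htmlspecialchars( $string );       // XML-escape &, <, > and double quotes
}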

A single very large page with a lot of non-ASCII characters could perhaps eat up a lot of memory in one chunk on its own; it's also possible there's a memory leak I haven't noticed before.

Could try tweaking the number in OAIRepo::chunkSize() down -- this'll make it run fewer pages through each ListRecords call, and might help if there's a leak forcing memory usage up from record to record -- but it *won't* help if it's dying on an individual page record that is too big.

r91364 makes the ListRecords chunk size configurable via $oaiChunkSize (default is 50, the same as was previously hardcoded).

Should be straightforward to merge that to 1.17wmf1, push it out, and try lowering the value to see if it has any effect. A smaller chunk size will require more requests to get through the data, so it may slow things down a little, but if memory is leaking from record to record this may help.
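
As a usage sketch (where the setting lives -- LocalSettings.php vs. the cluster's CommonSettings.php -- is an assumption on my part; the variable name and the default of 50 come from r91364):

// Sketch: lower the ListRecords chunk size made configurable in r91364.
$oaiChunkSize = 40;  // fewer records per ListRecords response than the default 50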

Created attachment 8737
results of grep

results of grepping the logs

attachment tmp.txt ignored as obsolete

Sam, can you take this one on? Brion believes that this is ready to deploy, if I'm reading him right.

Has this not been deployed yet? Really should be...

Merged in r91364

Pushed out to cluster with default 50

Just trying to find where the logs are, and I'll try and reduce the number a bit

Still nothing since July 27th

Nothing since the 27th of July, but errors seem to have started again on the 13th.

Reduced chunksize to 40 just now

Will upload a more up-to-date list of just the recent errors.

Created attachment 8928
More recent error logs

Attached:

PHP fatal error in /usr/local/apache/common-local/php-1.17/extensions/OAI/OAIRepo_body.php line 881:
Call to a member function getUrl() on a non-object

Looks like a regression in r80770: it introduces the chance of a fatal error if we have an image row but don't succeed at loading the file object. (Not entirely sure how that happens, but that's where the error is.)

(In reply to comment #24)

PHP fatal error in /usr/local/apache/common-local/php-1.17/extensions/OAI/OAIRepo_body.php line 881:
Call to a member function getUrl() on a non-object

Seems like it might be worth live hacking in some debugging for the time being.

If we get false from wfFindFile, output to debug, and return ''?
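
Something along these lines, say -- a sketch of the suggested guard, not the actual change that later landed as r95169, and the $row->img_name field is a guess at what the image-row code passes in:

// Sketch of the proposed guard: wfFindFile() returns false when no file
// object can be loaded for the row, so bail out before calling getUrl().
$file = wfFindFile( $row->img_name );
if ( !$file ) {
	wfDebug( __METHOD__ . ": could not load file object for {$row->img_name}\n" );
	return '';
}
$url = $file->getUrl();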

r95169 on trunk, r95170 on 1.17wmf1 should resolve this. Has it been deployed?

Still needs merge to 1.18.

(In reply to comment #28)

r95169 on trunk, r95170 on 1.17wmf1 should resolve this. Has it been deployed?

Still needs merge to 1.18.

Yes, it has been deployed.

Did tag it for 1.18

r95169 applied to 1.18 with r95628

Nothing again since the 21st August....

nagutokay wrote:

The Wikimedia Commons search engine once again returns only stale results when searching! The most recent results returned are at least four days old, if not older. Comment 31 says "since 21st August"; I can't confirm that, but it looks likely. This would be a record-breaking duration for a bug on a major database!

rainman wrote:

The problem reported in Comment 32 seems to be unrelated to this bug; it was caused by a change in InitialiseSettings.php (enabling HTTPSExperiment) that broke lucene-search's parsing of that file, which is needed to get the full URLs of all wikis for incremental updates. It should be fixed now by a manual override in lsearch-global-2.1.conf.

Might want to ping roan if he doesn't already know

(In reply to comment #34)

Might want to ping roan if he doesn't already know

I noticed this comment but didn't really understand what Robert meant. Parsing of what broke exactly? Is the override in lsearch-global-2.1.conf a suitable permanent solution?

(In reply to comment #35)

(In reply to comment #34)

Might want to ping roan if he doesn't already know

I noticed this comment but didn't really understand what Robert meant. Parsing of what broke exactly? Is the override in lsearch-global-2.1.conf a suitable permanent solution?

Robert and I talked on IRC, which led to me fixing the real issue in r96287. This fix is now deployed.

[main] WARN org.wikimedia.lsearch.oai.IncrementalUpdater - Retry later: error while processing update for igwiki : Server returned HTTP response code: 500 for URL: http://ig.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:08:59Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ig.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:08:59Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
2011-09-20 19:03:26,440 [main] WARN org.wikimedia.lsearch.oai.IncrementalUpdater - Retry later: error while processing update for iiwiki : Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
2011-09-20 19:03:26,935 [main] WARN org.wikimedia.lsearch.oai.IncrementalUpdater - Retry later: error while processing update for ikwiki : Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
2011-09-22 19:43:55,264 [main] WARN org.wikimedia.lsearch.oai.OAIHarvester - I/O exception listing records: Server returned HTTP response code: 500 for URL: http://de.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:90200283
java.io.IOException: Server returned HTTP response code: 500 for URL: http://de.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:90200283

All of the above links work at present -- was this some transitory error?

(In reply to comment #38)

All of the above links work at present -- was this some transitory error?

It does seem that way. Do you think it's worth just closing this for the time being? There isn't much here that's actually actionable (i.e. "do X and Y will be fixed").

This is working now. Bumping priority and closing so that if it is re-opened it shows up more quickly.