Parse errors with extra (ideographic) spaces in query
Closed, Resolved · Public · PRODUCTION ERROR

Description

We're seeing
Nov 13 16:23:33 mw1059: PHP Warning: Search backend error during full_text search for '　　' after 29. Parse error on '　　': Encountered "<EOF>" at line 1, column 2. [Called from CirrusSearch\ElasticsearchIntermediary::failure in /srv/mediawiki/php-1.25wmf7/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php at line 97] in /srv/mediawiki/php-1.25wmf7/includes/debug/MWDebug.php on line 302

relatively commonly in production. We should figure out why. A search consisting only of spaces isn't supposed to reach the backend, because it can't be parsed.

Version: REL1_22-branch
Severity: normal

Details

Reference: bz73374

Subject	Repo	Branch	Lines +/-
trim idiographic whitespace too	mediawiki/extensions/CirrusSearch	wmf/1.26wmf5	+10 -3
trim idiographic whitespace too	mediawiki/extensions/CirrusSearch	master	+10 -3
Fix for I49ee1270: Tabs aren't actually a problem	mediawiki/extensions/CirrusSearch	master	+4 -1


Related Objects

Status	Subtype	Assigned	Task
Open	PRODUCTION ERROR	None	T94814 Fix: "Warning: Search backend error during .. took .." (tracking)
Resolved	PRODUCTION ERROR	EBernhardson	T75374 Parse errors with extra (ideographic) spaces in query

Event Timeline

• bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:55 AM

• bzimport added a project: CirrusSearch.

• bzimport set Reference to bz73374.

• bzimport added a subscriber: Unknown Object (MLST).

• Manybubbles created this task. Nov 13 2014, 4:29 PM

This is coming up all the time in the logs

In fact on fatalmonitor right now it takes up the majority of my screen.

Actually, if I alter my fork of fatalmonitor a bit, this becomes the top result.

@Manybubbles: They're U+3000 (IDEOGRAPHIC SPACE) rather than U+0020 (SPACE)
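The two characters render almost identically in a log viewer, so listing code points is the quickest way to tell them apart. A minimal sketch, assuming the raw query string is at hand; `describe_chars` is a hypothetical helper and is not part of CirrusSearch:

```python
import unicodedata


def describe_chars(text):
    """Return one 'U+XXXX NAME' entry per character in text."""
    # Control characters such as tab have no Unicode name, hence the default.
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
    ]


# An ideographic space followed by an ordinary ASCII space.
for line in describe_chars("\u3000 "):
    print(line)
```

For the query in the warning above, every entry would read U+3000 IDEOGRAPHIC SPACE.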

Ugh, I thought they were a tab. That explains why my patch did nothing to help.

• demon renamed this task from CirrusSearch: Track down relatively common exception to Parse errors with extra (ideographic) spaces in query. Apr 3 2015, 5:57 PM

• demon added a parent task: T94814: Fix: "Warning: Search backend error during .. took .." (tracking).

• demon set Security to None.

Change 204284 had a related patch set uploaded (by Chad):
Fix for I49ee1270: Tabs aren't actually a problem

https://gerrit.wikimedia.org/r/204284

gerritbot added a project: Patch-For-Review. Apr 15 2015, 3:37 PM

Here's a much more informative error message:

P574 query parse failure on ideographic spaces

[2015-04-29 06:27:24,055][DEBUG][action.search.type ] [elastic1001] [zhwiki_content_1415377727][0], node[OuuilxJ-SKGpToIE1xqiGA], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@22eda4f3] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [elastic1010][inet[/10.64.32.142:9300]][search/phase/query]
Caused by: org.elasticsearch.search.SearchParseException: [zhwiki_content_1415377727][0]: query[MatchNoDocsQuery],from[-1],size[-1]: Parse Failure [Failed to parse source [{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"simple_query_string":{"fields":["all.plain^1","all^0.5"],"query":"　","default_operator":"AND"}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["</span>"],"fields":{"title":{"type":"experimental","fragmenter":"none","number_of_fragments":1,"matched_fields":["title","title.plain"]},"redirect.title":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000},"no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"　","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}}},"suggest":{"text":"　","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"</em>"}}}},"stats":["suggest","degraded_full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"rescore_query":{"function_score":{"functions":[{"script_score":{"script":"log10((doc['incoming_links'].isEmpty() ? 0 : doc['incoming_links'].value) + 2)","lang":"groovy"}}]}},"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply"}}]}]]
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:660)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:516)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:488)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:257)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:688)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:677)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.query.QueryParsingException: [zhwiki_content_1415377727] Failed to parse query [　]
    at org.elasticsearch.index.query.QueryStringQueryParser.parse(QueryStringQueryParser.java:240)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
    at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
    at org.elasticsearch.search.highlight.HighlighterParseElement.parse(HighlighterParseElement.java:167)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:644)
    ... 9 more
Caused by: org.apache.lucene.queryparser.classic.ParseException: Cannot parse '　': Encountered "<EOF>" at line 1, column 1.
Was expecting one of:
    <NOT> ...
    "+" ...
    "-" ...
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:125)
    at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:882)
    at org.elasticsearch.index.query.QueryStringQueryParser.parse(QueryStringQueryParser.java:223)
    ... 15 more
Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered "<EOF>" at line 1, column 1.
Was expecting one of:
    <NOT> ...
    "+" ...
    "-" ...
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...
    at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:708)
    at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:590)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:275)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120)
    ... 17 more

Change 204284 abandoned by Chad:
Fix for I49ee1270: Tabs aren't actually a problem

https://gerrit.wikimedia.org/r/204284

EBernhardson claimed this task. May 11 2015, 8:36 PM

EBernhardson added a project: Discovery-Search (Current work).

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Not sure why Chad abandoned his patch; it was the right general approach if we just want to avoid searching for strings that contain only ideographic spaces. I disabled the existing trimming and verified that Elasticsearch gives a similar error when searching for a single blank space.

SearchPhaseExecutionException[Failed to execute phase [dfs], all shards failed; shardFailures {[21Oc-wh0RIONUF7JIrDkAA][cirrustestwiki_content_first][0]: SearchParseException[[cirrustestwiki_content_first][0]: from[-1],size[-1]: Parse Failure [Failed to parse source

When Elasticsearch sees the ideographic space within a standard query, it appears to treat it as plain whitespace. Content that uses the ideographic space as a separator can be found without typing it, and using the ideographic space to separate search terms works just as normal whitespace does.

php.net/trim documents that it strips a specific set of characters (" \t\n\r\0\x0B"), all of them ASCII. The incoming patch will simply trim the ideographic space in addition to the standard ASCII set.
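The approach described above can be sketched as follows. This is an illustration in Python, not the merged CirrusSearch change (which is PHP); `trim_query` and `is_searchable` are hypothetical names, and the ASCII set is the one quoted from php.net/trim above.

```python
# Characters that PHP's trim() strips by default, as quoted above.
ASCII_TRIM_CHARS = " \t\n\r\0\x0b"
# U+3000 IDEOGRAPHIC SPACE, the character behind the parse errors.
IDEOGRAPHIC_SPACE = "\u3000"


def trim_query(query: str) -> str:
    """Strip ASCII trim characters and ideographic spaces from both ends."""
    return query.strip(ASCII_TRIM_CHARS + IDEOGRAPHIC_SPACE)


def is_searchable(query: str) -> bool:
    """Hypothetical guard: only non-empty trimmed queries go to the backend."""
    return trim_query(query) != ""


if __name__ == "__main__":
    print(is_searchable("\u3000\u3000"))         # ideographic spaces only
    print(repr(trim_query("\u3000foo\u3000bar ")))
```

Only leading and trailing characters are removed. An ideographic space between two terms is left alone, which matches the observation that Elasticsearch already handles it as a separator.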

Change 210211 had a related patch set uploaded (by EBernhardson):
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210211

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. May 11 2015, 10:31 PM

Change 210211 merged by jenkins-bot:
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210211

Diffusion mentioned this in rMEXTb4416d23cfa8: Updated mediawiki/extensions Project: mediawiki/extensions/CirrusSearch…. May 12 2015, 7:43 PM

Change 210415 had a related patch set uploaded (by Chad):
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210415

Change 210415 merged by jenkins-bot:
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210415

• demon mentioned this in rECIR4d5da21ee3ab: trim idiographic whitespace too. May 12 2015, 7:52 PM

EBernhardson mentioned this in rECIRb47d7ff78b03: trim idiographic whitespace too. May 12 2015, 7:52 PM

EBernhardson closed this task as Resolved. May 12 2015, 9:33 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.

• demon moved this task from Untriaged to Resolved on the Wikimedia-production-error board. Jun 12 2015, 6:34 PM

Restricted Application added a project: Discovery-ARCHIVED. Jun 12 2015, 6:34 PM

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board. Sep 9 2015, 2:25 AM

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Dec 31 2015, 5:10 AM

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Aug 28 2019, 11:12 PM
