
Parse errors with extra (ideographic) spaces in query
Closed, Resolved | Public | PRODUCTION ERROR

Description

We're seeing
Nov 13 16:23:33 mw1059: PHP Warning: Search backend error during full_text search for '  ' after 29. Parse error on '  ': Encountered "<EOF>" at line 1, column 2. [Called from CirrusSearch\ElasticsearchIntermediary::failure in /srv/mediawiki/php-1.25wmf7/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php at line 97] in /srv/mediawiki/php-1.25wmf7/includes/debug/MWDebug.php on line 302

relatively commonly in production. We should figure out why. Searching for only spaces is not supposed to be possible, because such a query cannot be parsed.


Version: REL1_22-branch
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:55 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz73374.
bzimport added a subscriber: Unknown Object (MLST).
Krenair subscribed.

This is coming up all the time in the logs

In fact on fatalmonitor right now it takes up the majority of my screen.

Krenair triaged this task as High priority. Mar 26 2015, 4:21 AM

Actually, if I alter my fork of fatalmonitor a bit, this becomes the top result.

@Manybubbles: They're U+3000 (IDEOGRAPHIC SPACE) rather than U+0020 (SPACE)

Ugh, I thought they were a tab. That explains why my patch did nothing to help.
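
One way to check which whitespace character a logged query actually contains is to print its code points. The following is a minimal sketch in Python, not code from CirrusSearch; the helper name `describe_chars` is hypothetical.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    # Control characters such as tab have no Unicode name, so fall back
    # to a placeholder instead of raising ValueError.
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# An ideographic space, an ASCII space, and a tab look alike in a log line
# but have different code points.
for code, name in describe_chars("\u3000 \t"):
    print(code, name)
```

This prints `U+3000 IDEOGRAPHIC SPACE`, `U+0020 SPACE`, and `U+0009 <unnamed>`, which makes the difference between the three characters visible.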

demon renamed this task from CirrusSearch: Track down relatively common exception to Parse errors with extra (ideographic) spaces in query. Apr 3 2015, 5:57 PM
demon set Security to None.

Change 204284 had a related patch set uploaded (by Chad):
Fix for I49ee1270: Tabs aren't actually a problem

https://gerrit.wikimedia.org/r/204284

Here's a much more informative error message:

[2015-04-29 06:27:24,055][DEBUG][action.search.type ] [elastic1001] [zhwiki_content_1415377727][0], node[OuuilxJ-SKGpToIE1xqiGA], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@22eda4f3] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [elastic1010][inet[/10.64.32.142:9300]][search/phase/query]
Caused by: org.elasticsearch.search.SearchParseException: [zhwiki_content_1415377727][0]: query[MatchNoDocsQuery],from[-1],size[-1]: Parse Failure [Failed to parse source [{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"simple_query_string":{"fields":["all.plain^1","all^0.5"],"query":" ","default_operator":"AND"}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["</span>"],"fields":{"title":{"type":"experimental","fragmenter":"none","number_of_fragments":1,"matched_fields":["title","title.plain"]},"redirect.title":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000},"no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":" 
","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}}},"suggest":{"text":" ","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"</em>"}}}},"stats":["suggest","degraded_full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"rescore_query":{"function_score":{"functions":[{"script_score":{"script":"log10((doc['incoming_links'].isEmpty() ? 0 : doc['incoming_links'].value) + 2)","lang":"groovy"}}]}},"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply"}}]}]]
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:660)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:516)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:488)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:257)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:688)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:677)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.query.QueryParsingException: [zhwiki_content_1415377727] Failed to parse query [ ]
    at org.elasticsearch.index.query.QueryStringQueryParser.parse(QueryStringQueryParser.java:240)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
    at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
    at org.elasticsearch.search.highlight.HighlighterParseElement.parse(HighlighterParseElement.java:167)
    at org.elasticsearch.search.SearchService.parseSource(SearchService.java:644)
    ... 9 more
Caused by: org.apache.lucene.queryparser.classic.ParseException: Cannot parse ' ': Encountered "<EOF>" at line 1, column 1.
Was expecting one of:
    <NOT> ...
    "+" ...
    "-" ...
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...

    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:125)
    at org.apache.lucene.queryparser.classic.MapperQueryParser.parse(MapperQueryParser.java:882)
    at org.elasticsearch.index.query.QueryStringQueryParser.parse(QueryStringQueryParser.java:223)
    ... 15 more
Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered "<EOF>" at line 1, column 1.
Was expecting one of:
    <NOT> ...
    "+" ...
    "-" ...
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    <TERM> ...
    "*" ...

    at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:708)
    at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:590)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:275)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120)
    ... 17 more

Change 204284 abandoned by Chad:
Fix for I49ee1270: Tabs aren't actually a problem

https://gerrit.wikimedia.org/r/204284

Not sure why Chad abandoned his patch; it was the right general approach if we just want to avoid searching for strings that contain only ideographic spaces. I disabled the existing trimming and verified that Elasticsearch gives a similar error when searching for a single blank space:

SearchPhaseExecutionException[Failed to execute phase [dfs], all shards failed; shardFailures {[21Oc-wh0RIONUF7JIrDkAA][cirrustestwiki_content_first][0]: SearchParseException[[cirrustestwiki_content_first][0]: from[-1],size[-1]: Parse Failure [Failed to parse source

When Elasticsearch sees the ideographic space within a standard query, it appears to treat it as ordinary whitespace. Content that uses the ideographic space as a separator can be found without typing it in the query. Similarly, using the ideographic space to separate search terms works the way you would expect normal whitespace to.

php.net/trim documents that trim() strips a specific set of characters by default (" \t\n\r\0\x0B"), all of them ASCII. The incoming patch will simply trim the ideographic space in addition to that standard ASCII set.
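
The approach described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the merged CirrusSearch patch: the function names and the `is_searchable` guard are hypothetical, and `ASCII_TRIM_CHARS` simply mirrors the default character list documented at php.net/trim.

```python
# Characters that PHP's trim() strips by default, per php.net/trim.
ASCII_TRIM_CHARS = " \t\n\r\0\x0b"
IDEOGRAPHIC_SPACE = "\u3000"


def trim_ascii(term):
    """Emulate an ASCII-only trim: U+3000 survives at both ends."""
    return term.strip(ASCII_TRIM_CHARS)


def trim_with_ideographic(term):
    """Trim the ASCII set plus U+3000 from both ends of the term."""
    return term.strip(ASCII_TRIM_CHARS + IDEOGRAPHIC_SPACE)


def is_searchable(term):
    """Hypothetical guard: skip the backend when nothing is left to search."""
    return trim_with_ideographic(term) != ""


# An ASCII-only trim leaves a query of ideographic spaces non-empty,
# so it would still be sent to the search backend.
print(repr(trim_ascii("\u3000\u3000")))
# With U+3000 in the trim set the same query collapses to the empty string.
print(is_searchable("\u3000\u3000"))
```

Only leading and trailing characters are removed, so an ideographic space between two terms is left in place for the backend to handle. Python's `str.strip` works on code points; a byte-oriented trim would need to treat the three-byte UTF-8 encoding of U+3000 (`E3 80 80`) as a unit rather than as three independent bytes.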

Change 210211 had a related patch set uploaded (by EBernhardson):
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210211

Change 210211 merged by jenkins-bot:
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210211

Change 210415 had a related patch set uploaded (by Chad):
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210415

Change 210415 merged by jenkins-bot:
trim idiographic whitespace too

https://gerrit.wikimedia.org/r/210415

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM