Prefix Search: Would be nice if search engine could highlight the result rather than js
Open, Low, Public

Description

Sometimes we see weird results in the prefix search because Cirrus uses different matching rules than the jquery.suggestions library. In English, for example, Cirrus flattens high ASCII, so searching for "resume" will return "résumé". Cirrus is quite capable of highlighting the result properly, but it has no way to tell the front end what the result should look like.

I don't believe it would be practical to replicate Cirrus's logic on the front end because it can change and it is different for different wikis.
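
To make the mismatch concrete, here is a minimal Python sketch. It is not the actual Cirrus or jquery.suggestions code: `fold` and `naive_highlight_length` are hypothetical helpers, and stripping accents via Unicode decomposition is only an assumed stand-in for whatever folding the backend really applies.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip combining marks (a rough stand-in for backend folding)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_highlight_length(query: str, title: str) -> int:
    """Front-end style check: highlight only on a literal, case-insensitive prefix."""
    return len(query) if title.lower().startswith(query.lower()) else 0


query, title = "resume", "Résumé"
backend_matches = fold(title).startswith(fold(query))   # backend considers it a hit
frontend_length = naive_highlight_length(query, title)  # front end finds nothing to bold
print(backend_matches, frontend_length)  # prints: True 0
```

The backend returns the title as a match, but a literal prefix comparison on the client has nothing to highlight, which is the kind of odd result described above.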

Details

Reference
bz60976

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:03 AM
bzimport set Reference to bz60976.
bzimport added a subscriber: Unknown Object (MLST).

I don't care how you do this, but please do. I hate the core search suggestions module.

Core could totally also output match indices from the opensearch API (that shouldn't be incompatible with anything, but I haven't checked), naively by default (we could just implement the same logic as the JS module has now), with a hook override for better search extensions. Then we could apply bolding in the UI trivially based on these indices.
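
A rough Python sketch of that idea, under stated assumptions: `naive_match_ranges` plays the role of the default core logic (a literal prefix match, as the JS module does today), and `apply_bold` plays the role of the UI. Both names and the `(start, end)` range shape are hypothetical; nothing here is an existing MediaWiki API.

```python
import html
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) offsets into the title, end exclusive


def naive_match_ranges(query: str, title: str) -> List[Range]:
    """Default behaviour: a case-insensitive literal prefix match.

    Simplification: assumes lowercasing does not change string length.
    """
    if query and title.lower().startswith(query.lower()):
        return [(0, len(query))]
    return []


def apply_bold(title: str, ranges: List[Range]) -> str:
    """Wrap each range in <b> tags, escaping the title text as HTML."""
    parts = []
    cursor = 0
    for start, end in sorted(ranges):
        parts.append(html.escape(title[cursor:start]))
        parts.append("<b>" + html.escape(title[start:end]) + "</b>")
        cursor = end
    parts.append(html.escape(title[cursor:]))
    return "".join(parts)


ranges = naive_match_ranges("resum", "Resumen de acompañar")
print(apply_bold("Resumen de acompañar", ranges))
```

A search extension overriding the default would only need to return different ranges for the same title; the bolding step would stay unchanged.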

I'm glad it has bothered someone else too.

matmarex set Security to None.
matmarex removed a subscriber: Unknown Object (MLST).

So, can we make this happen? When the necessary information is somehow exposed via the action=opensearch API, I'll be happy to implement the JavaScript part of this.

If I understand correctly, the OpenSearch API follows a standard response format we shouldn't change. We can add it to a prefixsearch or search API module, however. Probably using offsets or substrings to indicate what to highlight.

Current format:

{
"query": "resum",
"results": [
  "Resumé",
  "Resumé (magazine)",
  "RESUMECHAR (CONFIG.SYS directive)",
  "Resumen de acompañar"
]
}

Current (incomplete) highlighting behaviour:

Screen_Shot_2015-01-08_at_21.19.45.png (356×504 px, 39 KB)

Proposed formats:

{
"query": "resum",
"results": [
  [ 5, "Resumé" ],
  [ 5, "RESUMECHAR (CONFIG.SYS directive)" ],
  [ 5, "Resumé (magazine)" ],
  [ 5, "Resumen de acompañar" ]
]
}

{
"query": "resum",
"results": [
  [ "Resum", "Resumé" ],
  [ "RESUM", "RESUMECHAR (CONFIG.SYS directive)" ], ..
]
}
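
For illustration, a small Python sketch of how a client might consume either proposed shape. `split_result` is a hypothetical helper, and mixing both shapes in one response is done here only to exercise both branches. It assumes, as both proposals do, that the highlighted part is always a prefix of the title.

```python
import json
from typing import List, Tuple, Union


def split_result(entry: List[Union[int, str]]) -> Tuple[str, str]:
    """Split a result into (highlighted prefix, remainder).

    The first element is either a match length or the matched substring.
    """
    marker, title = entry
    length = marker if isinstance(marker, int) else len(marker)
    return title[:length], title[length:]


response = json.loads("""
{
  "query": "resum",
  "results": [
    [5, "Resumé"],
    ["RESUM", "RESUMECHAR (CONFIG.SYS directive)"]
  ]
}
""")

pairs = [split_result(entry) for entry in response["results"]]
for highlighted, rest in pairs:
    print(repr(highlighted), repr(rest))
```

Either shape reduces to "how many leading characters of the title to highlight", so neither can express a match that starts somewhere other than the beginning of the title.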

Actually... Unless there are cases where the interpretation of unicode code points is different for one of the flattened characters, wouldn't it always simply be the length of the input string?

Except for namespace prefixes, as we allow normalisation/localisation of those.

No, the processing can cause the number of separate characters to change, for example æ↔ae, ß↔ss. (I was also under the impression that Cirrus ignored/downplayed non-word characters like '(' when displaying search suggestions, but it doesn't seem to now.)
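
A minimal Python sketch of why the input length is not enough. The `EXPANSIONS` table and both helpers are made up for illustration and are not Cirrus's real folding rules; the point is only that a folded character can cover more or fewer characters on the other side.

```python
EXPANSIONS = {"æ": "ae", "ß": "ss"}  # illustrative only, not Cirrus's real table


def fold_char(ch: str) -> str:
    """Lowercase one character and expand it if it has a multi-letter form."""
    ch = ch.lower()
    return EXPANSIONS.get(ch, ch)


def matched_title_length(query: str, title: str) -> int:
    """Return how many characters of `title` the folded `query` covers, or 0."""
    folded_query = "".join(fold_char(c) for c in query)
    consumed = ""
    for index, ch in enumerate(title):
        if len(consumed) >= len(folded_query):
            return index
        consumed += fold_char(ch)
        # Stop as soon as the folded title diverges from the folded query.
        if not (folded_query.startswith(consumed) or consumed.startswith(folded_query)):
            return 0
    return len(title) if consumed.startswith(folded_query) else 0


print(len("strasse"), matched_title_length("strasse", "Straße"))  # prints: 7 6
print(len("ae"), matched_title_length("ae", "Æsop"))              # prints: 2 1
print(len("æ"), matched_title_length("æ", "Aesop"))               # prints: 1 2
```

In all three cases, highlighting `len(query)` characters of the title would bold either too much or too little.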

It does that in full-text search, but prefix search includes them. It's supposed to be just the right kind of sloppy matching....

But, yeah, as much flexibility as possible would be best. We want to be able to properly handle whatever off-the-wall request comes in, and if the highlighting code makes any assumptions then it'll break. The best would be to accept either offset pairs to highlight or the string marked up with <em> tags or something similar. The <em> tags might be simplest because you could transform them on the client side into whatever you like, yet they'd still be simple to read right in the string. Simpler than offset pairs, at least.
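
As a sketch of the client-side transformation, here is some Python that turns an `<em>`-marked string into other markup or into offset pairs. The function names and the `highlight` class are hypothetical, and the sketch assumes the backend only ever emits literal, non-nested `<em>`/`</em>` pairs and that titles never contain those tags themselves.

```python
import html
import re
from typing import List, Tuple

EM_PATTERN = re.compile(r"<em>(.*?)</em>")


def em_to_html(marked: str,
               open_tag: str = '<span class="highlight">',
               close_tag: str = "</span>") -> str:
    """Swap <em> markers for other tags, HTML-escaping all title text."""
    parts = []
    cursor = 0
    for match in EM_PATTERN.finditer(marked):
        parts.append(html.escape(marked[cursor:match.start()]))
        parts.append(open_tag + html.escape(match.group(1)) + close_tag)
        cursor = match.end()
    parts.append(html.escape(marked[cursor:]))
    return "".join(parts)


def em_to_offsets(marked: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Recover the plain title plus (start, end) offset pairs, end exclusive."""
    plain = []
    ranges = []
    cursor = 0
    length = 0
    for match in EM_PATTERN.finditer(marked):
        before = marked[cursor:match.start()]
        inner = match.group(1)
        plain.extend([before, inner])
        ranges.append((length + len(before), length + len(before) + len(inner)))
        length += len(before) + len(inner)
        cursor = match.end()
    plain.append(marked[cursor:])
    return "".join(plain), ranges


print(em_to_html("<em>Resum</em>en de acompañar"))
print(em_to_offsets("<em>Resum</em>en de acompañar"))
```

The marked-up string carries the same information as offset pairs, so a client that prefers offsets can derive them, while the string itself stays readable.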

If I understand correctly, the OpenSearch API follows a standard response format we shouldn't change.

Can we not extend it? Like add another key, say 'matches', that would contain indexes of matched substrings in each suggestion result?

OpenSearch format is an array with an array inside. No string keys.

https://www.mediawiki.org/w/api.php?action=opensearch&search=ap&limit=4

[
    "ap",
    [
        "Apache configuration",
        "Apps/Commons",
        "Apps",
        "API/maintenance"
    ]
]

It seems we already extended it by adding a second and third array at the end for text extracts and URLs:

[
    "ap",
    [
        "Apache configuration",
        "Apps/Commons",
        "Apps",
        "API/maintenance"
    ],
    [
        "Apache is probably the webserver used most with MediaWiki.",
        "",
        "",
        "This page is to document activity related to the MediaWiki API. This is an ongoing activity, led by Sam Reed."
    ],
    [
        "https://www.mediawiki.org/wiki/Apache_configuration",
        "https://www.mediawiki.org/wiki/Apps/Commons",
        "https://www.mediawiki.org/wiki/Apps",
        "https://www.mediawiki.org/wiki/API/maintenance"
    ]
]

That doesn't scale well though.
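
To show what consuming this positional format involves, here is a Python sketch that zips the parallel arrays into per-result records. `parse_opensearch` is a hypothetical helper and the payload is a shortened copy of the response above.

```python
import json
from typing import Dict, List, Tuple


def parse_opensearch(payload: list) -> Tuple[str, List[Dict[str, str]]]:
    """Turn [query, titles, descriptions?, urls?] into a list of records."""
    query, titles, *extras = payload
    descriptions = extras[0] if len(extras) > 0 else [""] * len(titles)
    urls = extras[1] if len(extras) > 1 else [""] * len(titles)
    results = [
        {"title": title, "description": description, "url": url}
        for title, description, url in zip(titles, descriptions, urls)
    ]
    return query, results


payload = json.loads("""
[
    "ap",
    ["Apache configuration", "Apps"],
    ["Apache is probably the webserver used most with MediaWiki.", ""],
    ["https://www.mediawiki.org/wiki/Apache_configuration",
     "https://www.mediawiki.org/wiki/Apps"]
]
""")

query, results = parse_opensearch(payload)
print(query, results[1]["title"], results[1]["url"])
```

Each field is identified only by its position, so a further array for highlight data would be another slot every consumer has to know by index, which is presumably part of why this approach does not scale.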

On second thought, from a design and user experience point of view, do we even need the highlighting? I've rarely seen this kind of highlighting done in other search interfaces or autocompleted form fields. They just show the results.

I've played with it a bit locally and am liking it a lot. It feels a little wrong because we're so used to it. I'd like to consider ditching that logic altogether and just displaying the results as normal (linked) text.

Screen_Shot_2015-03-14_at_03.20.10.png (377×630 px, 45 KB)

Screen_Shot_2015-03-14_at_03.19.57.png (428×642 px, 44 KB)

Thoughts?

EBjune lowered the priority of this task from Medium to Low. Sep 27 2018, 5:17 PM
EBernhardson renamed this task from Prefix Search: Would be nice if php could highlight the result rather than js to Prefix Search: Would be nice if search engine could highlight the result rather than js. Sep 27 2018, 5:17 PM
MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely for the Search team to pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

RhinosF1 removed a project: Discovery-Search.
RhinosF1 subscribed.

Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.