
Text contained in a URL is not returned via fulltext search
Closed, Resolved (Public)

Description

PROBLEM DESCRIPTION

Searching for a term within a URL does not return a result even though a result
is returned if the same term is contained in some normal (non-URL) text.

STEP-BY-STEP DESCRIPTION TO REPRODUCE THE PROBLEM

  1. Create a ticket with the following content:

    https://commons.wikimedia.org/wiki/File:Test_carnival.jpg File:Test_carnival2.jpg

  2. Perform a fulltext search for Test_carnival2.jpg and notice that the created
     ticket is returned.

  3. Perform a fulltext search for Test_carnival.jpg and notice that the created
     ticket is NOT returned.

STATUS

This was originally reported upstream, but please see comment 5 there: http://bugs.otrs.org/show_bug.cgi?id=10393#c5


Version: wmf-deployment
Severity: normal

Details

Reference
bz64473

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 3:24 AM)
bzimport added a project: Znuny.
bzimport set Reference to bz64473.
bzimport added a subscriber: Unknown Object (MLST).

It seems likely that this is related to the WordLengthMax property in Ticket::SearchIndex::Attribute (Ticket -> Core::FulltextSearch). This is currently set to the default value of 30, a limit exceeded by almost all URLs.

As an example, a ticket with the following content:

https://de.wikipedia.org/wiki/File:1XXY.jpg
https://de.wi.org/wiki/File:2XXY.jpg
https://de.w.org/File:3XXY.jpg

is returned after a search for 3XXY, but not after one for 1XXY or 2XXY (Ticket# 2014052710014793).
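The behaviour in this example can be reproduced with a simplified model of a length-limited word index. This is only a sketch under assumptions, not OTRS's actual indexing code: it assumes the body is split on whitespace, that any word longer than the limit is dropped from the index, and that a search term matches as a substring of an indexed word. The names `index_words` and `search` are hypothetical.

```python
def index_words(text, word_length_max=30):
    """Return the lowercased whitespace-separated words that fit the length limit.

    Simplified model: words longer than word_length_max are not indexed at all.
    """
    return [w.lower() for w in text.split() if len(w) <= word_length_max]


def search(indexed_words, term):
    """Assumed matching rule: the term must occur inside some indexed word."""
    term = term.lower()
    return any(term in w for w in indexed_words)


body = """https://de.wikipedia.org/wiki/File:1XXY.jpg
https://de.wi.org/wiki/File:2XXY.jpg
https://de.w.org/File:3XXY.jpg"""

# The three URLs are 43, 36 and 30 characters long respectively.
url_lengths = [len(w) for w in body.split()]

default_index = index_words(body)                       # limit of 30
found_with_default = {t: search(default_index, t) for t in ("1XXY", "2XXY", "3XXY")}

larger_index = index_words(body, word_length_max=200)   # hypothetical higher limit
found_with_larger = {t: search(larger_index, t) for t in ("1XXY", "2XXY", "3XXY")}
```

Under these assumptions only the third URL, at exactly 30 characters, survives indexing with the default limit, which matches the search results observed above.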

Two things need to be checked. First, whether a rebuild of the fulltext db (RebuildFulltextIndex.pl) is feasible after increasing that limit. Second, if it is not, or before doing so, whether the value can simply be raised without re-indexing all existing articles, so that the bug is fixed at least for all new articles.

(This bug is not low-priority; the affected search is a critical feature for the permissions team. If they can't properly search tickets, and specifically file URLs, they can't find permission emails, and the corresponding files get deleted for copyright reasons.)

Upon consultation with Jeff Green, I've set WordLengthMax to 200. Indeed, the bug now appears to be fixed for all new tickets; see ticket#2014081510013908, which basically reproduces the above example.

Jeff is currently investigating the possibility of rebuilding the fulltext db.

The search index has been rebuilt (thanks, Jeff), so the fix now also applies to all existing tickets. => All done.
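The two stages described in the comments above (raising the limit first, rebuilding the index afterwards) can be illustrated with a toy index. This is a hedged sketch, not how OTRS stores its fulltext data: the class `FulltextIndex` and its methods are hypothetical, and it assumes words are computed once when an article is added and only recomputed on an explicit rebuild.

```python
class FulltextIndex:
    """Toy index: words are stored per article at indexing time."""

    def __init__(self, word_length_max):
        self.word_length_max = word_length_max
        self.bodies = {}  # article id -> original text
        self.words = {}   # article id -> words stored in the index

    def _tokenize(self, body):
        # Assumed rule: drop any whitespace-separated word above the limit.
        return [w.lower() for w in body.split() if len(w) <= self.word_length_max]

    def add_article(self, article_id, body):
        self.bodies[article_id] = body
        self.words[article_id] = self._tokenize(body)

    def rebuild(self):
        # Re-index every stored article using the current limit.
        for article_id, body in self.bodies.items():
            self.words[article_id] = self._tokenize(body)

    def search(self, term):
        term = term.lower()
        return sorted(
            article_id
            for article_id, words in self.words.items()
            if any(term in w for w in words)
        )


url = "https://de.wikipedia.org/wiki/File:1XXY.jpg"

index = FulltextIndex(word_length_max=30)
index.add_article(1, url)            # indexed under the old limit

index.word_length_max = 200          # limit raised, no rebuild yet
index.add_article(2, url)            # indexed under the new limit

before_rebuild = index.search("1XXY")
index.rebuild()
after_rebuild = index.search("1XXY")
```

In this model the article added before the change stays unfindable until `rebuild()` runs, which mirrors why the fix initially applied to new tickets only.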