
Transliterated umlauts in the search field won't resolve
Closed, ResolvedPublic

Description

Author: denisg

Description:
If I enter a term containing umlauts in the search field on the left but transliterate the
umlauts, the lookup fails and I am shown the search page unless a redirect exists for
the transliterated title. On the English Wikipedia such redirects exist most of the time. On the German
Wikipedia such redirects would not make much sense.

Examples:
Goedel (for Gödel) fails on de.wikipedia.org; on en.wikipedia.org it resolves correctly.
Godel resolves on the English Wikipedia too.


Version: unspecified
Severity: normal
Platform: PC
URL: http://de.wikipedia.org

Details

Reference
bz920

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 7:03 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz920.
bzimport added a subscriber: Unknown Object (MLST).

denisg wrote:

Would it be possible to resolve transliterated umlauts automatically to the correct character? It surely
wouldn't break anything.

afranke wrote:

Automatically adding the reverse-transliterated umlauts to the search results
would be desirable in my opinion, in particular on de.wikipedia.org.
For example, entering "kuenstliche intelligenz" in the search box there
came up with the movie "A.I. – Künstliche Intelligenz", but not with the
main entry http://de.wikipedia.org/wiki/K%C3%BCnstliche_Intelligenz
which I was only able to find via the entry for the "AI" acronym.

Ibn.Battuta.Wikipedien wrote:

It would be nice to cover more than just umlauts and more than just the German Wikipedia: the same (or worse) problem occurs on any Wikipedia that uses the Latin alphabet with special characters. That includes the Spanish, Portuguese, French, Scandinavian (...), Slavic (... ... ...) and Turkish languages, to name just the largest groups (with obviously many subgroups).

wikibugs wrote:

I agree with #3 and would add to it. It would be desirable to treat both the transliterated form of a special character and the plain, unaccented Latin letter from which it is derived as possible occurrences of that special character. For example, oe (common in Germany) or o (common in Sweden) for ö, or aa / a for å. I would even extend this mechanism to treat some groups of punctuation characters as a single character in search, for example different quotation marks " „ “ ” « », different dashes - – —, and different apostrophes ' ’ (see the German article "Germany’s next topmodel"; there is a redirect from the simple version, though).
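
The punctuation grouping suggested above could be sketched roughly as follows. This is only an illustration in Python: the table `PUNCTUATION_FOLDS` and the function `fold_punctuation` are hypothetical names, the character groups are the ones listed in this comment, and nothing here reflects how MediaWiki search actually works.

```python
# Hypothetical equivalence classes: each plain ASCII character stands in
# for the typographic variants listed in the comment above.
PUNCTUATION_FOLDS = {
    '"': "„“”«»",  # quotation marks
    "-": "–—",     # en dash and em dash
    "'": "’",      # typographic apostrophe
}

# Translation table mapping every variant to its plain representative.
FOLD_TABLE = str.maketrans(
    {
        variant: plain
        for plain, variants in PUNCTUATION_FOLDS.items()
        for variant in variants
    }
)


def fold_punctuation(text: str) -> str:
    """Replace typographic punctuation with its plain ASCII representative."""
    return text.translate(FOLD_TABLE)
```

If both the indexed title and the query were folded this way, "Germany's next topmodel" typed with a plain apostrophe would compare equal to the article title with the typographic one, without needing a redirect.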

wikibugs wrote:

*** Bug 7002 has been marked as a duplicate of this bug. ***

longthinker wrote:

This also applies to pinyin (the romanization of Chinese characters): for example, "wuji" will not find "wújí" (as in the German Wikipedia's article "Taiji"). Both notations are common, the former especially in printed books.

rainman wrote:

Fixed in Lucene Search 2. Accents are always stripped, and common transliterations are added as aliases (see Bug 7002).

So, searching for Goedel should find Kurt Gödel as the first hit on both en and de wiki.
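
The behaviour described above (accents stripped, common transliterations added as aliases) can be illustrated with a small Python sketch. It is not the Lucene Search 2 code: the `TRANSLITERATIONS` table and the functions `strip_accents`, `index_keys` and `matches` are hypothetical and exist only to show the idea, and exact key matching stands in for a real ranked search.

```python
import unicodedata

# Hypothetical table of common transliterations (assumption, not taken
# from Lucene Search 2).
TRANSLITERATIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "å": "aa"}


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks, e.g. ö -> o."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(title: str) -> set:
    """Return the lowercase keys under which a title would be findable."""
    # Compose first so that the table lookup sees precomposed letters.
    lowered = unicodedata.normalize("NFC", title).lower()
    transliterated = "".join(TRANSLITERATIONS.get(ch, ch) for ch in lowered)
    return {strip_accents(lowered), strip_accents(transliterated)}


def matches(query: str, title: str) -> bool:
    """True if the accent-stripped query equals one of the title's keys."""
    return strip_accents(query.lower()) in index_keys(title)


print(matches("Goedel", "Gödel"))  # transliterated alias
print(matches("Godel", "Gödel"))   # accent stripped
print(matches("wuji", "wújí"))     # pinyin tone marks stripped
```

Under these assumptions all three lookups succeed, which covers the Goedel / Godel examples from the original report and the pinyin case mentioned earlier in the thread.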