Multilingual search on project portals (e.g. www.wikipedia.org)
Open, Low, Public, Feature

Description

Author: mittnavnher

Description:
Hi. I'm very tired of constantly changing between Norwegian and English when I perform searches. It would be very nice if you added an option to display two (or more) search boxes at the front page.

Best regards from Norway


Version: unspecified
Severity: enhancement
See Also:
T49979: Show language selector in the body are of missing articles

Details

Reference
bz24767

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:07 PM
bzimport set Reference to bz24767.
bzimport added a subscriber: Unknown Object (MLST).

Abigor wrote:

Switching between Norwegian and English when doing a search.

Could you please explain something more?

I guess he wants to search multiple Wikipedias at the same time.

[[google:site:wikipedia.org]]? Do you mean in the www portal?
More useful is the request to search multiple projects in the same language at the same time, a feature which was enabled and soon disabled a while ago (there's a bug about it I believe).

Magnus, could wdsearch (http://magnusmanske.de/wordpress/?p=108) be adapted so that it can be integrated into a project portal? If so, we could start testing, e.g. on www.wikiquote.org.

Gerard.meijssen wrote:

WDsearch does work on its own in Reasonator. It can be configured to show more results than it does when implemented in the extended Wikipedia search.

So yes, it will work on e.g. wikiquote.org. In essence it will find whatever is available in Wikidata in any language.
Thanks,

GerardM

A multilingual full-text search would be handy, yes (though enabling wdsearch or something similar everywhere by default would be a big step in this direction already).

You may also want to see
https://de.wikipedia.org/wiki/Benutzer:Atlasowa/multilingual_search
with some thoughts and a lot of material and links about this and similar topics.

More useful is the request to search multiple projects in the same language at the same time, a feature which was enabled and soon disabled a while ago (there's a bug about it I believe).

Where do I find this case or just some more information about this previous feature and why it was disabled? I don't remember it at all, but I rarely visit www.wikipedia.org anyway.

Where do I find this case or just some more information about this previous feature and why it was disabled?

T46420: Restore interwiki (sister projects) results in search queries. It was restored on the Italian projects in the meantime, and is trivial to enable in other language subdomains.

My understanding is that searching every language for every query (as in the Italian case) would be quite expensive in terms of computing power.

Work on detecting the language in which the query was written, and searching the appropriate wiki, is here: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Presumably once the technology is working for searches within a wiki, it could be applied to portal searches as well.

Work on detecting the language in which the query was written, and searching the appropriate wiki, is here: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Presumably once the technology is working for searches within a wiki, it could be applied to portal searches as well.

The main problem with language detection is that languages can be very similar, and it's often easy to get them confused (especially on shorter strings like queries, and when you are limited to statistics rather than dictionaries, in order to keep things lightweight). The more languages you consider at once, the more likely you are to get an incorrect answer. The point of T121541 and its sub-tasks is to find what languages are regularly used on each wikipedia and find the set of languages to consider that give the best overall results. If you open up detection to 200+ languages, not only will it be slow, it will also be much more inaccurate. Spanish and Portuguese, Scots and English, Indonesian and Malay—and plenty of others—will be regularly mis-identified. Spanish and Portuguese may be unavoidable, but there are probably many times more English queries than Scots, for example, so maybe it's best not to consider Scots (we'd have to see what the data says—but that'd be my hypothesis at this point).
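To make the point about restricting the candidate set concrete, here is a minimal sketch. It assumes some detector has already produced per-language scores; the scores, the function name `pick_language`, and the candidate set are all made up for illustration and do not come from T121541 or from any real detector.

```python
from typing import Dict, Optional, Set


def pick_language(scores: Dict[str, float], candidates: Set[str]) -> Optional[str]:
    """Pick the best-scoring language, considering only the allowed candidates."""
    # Drop every language that is not in the per-wiki candidate set.
    allowed = {lang: score for lang, score in scores.items() if lang in candidates}
    if not allowed:
        return None
    # Highest remaining score wins.
    return max(allowed, key=allowed.get)


# Hypothetical detector output for a short English query that looks slightly
# more like Scots to a purely statistical model.
hypothetical_scores = {"sco": 0.51, "en": 0.49, "es": 0.0}

# With Scots excluded from the candidate set, English is chosen.
print(pick_language(hypothetical_scores, {"en", "es", "pt"}))
```

Which languages belong in the candidate set is exactly what the data analysis would have to decide; the sketch only shows the filtering step.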

So there is no panacea, but it would be possible to do the same analysis as in T121541, but on queries submitted through the portal. We could make some additional volume-based assumptions—such as, queries in the Georgian script are most likely Georgian, and not Mingrelian, and queries in Hebrew script are probably Hebrew and not Yiddish—to provide additional coverage for languages that don't show up in the sample, but are relatively easy to identify compared to what is in the sample.
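The script-based assumption could look roughly like this. It is only a sketch: the function name and the two-entry table are hypothetical, and mapping a script to a single language is precisely the volume-based simplification described above, not a real identification method.

```python
from typing import Dict, Optional

# Unicode block ranges mapped to the language assumed for that script.
# This encodes the volume-based assumption, not a linguistic fact.
SCRIPT_RANGES = [
    (0x10A0, 0x10FF, "ka"),  # Georgian block: assume Georgian, not Mingrelian
    (0x0590, 0x05FF, "he"),  # Hebrew block: assume Hebrew, not Yiddish
]


def guess_language_by_script(query: str) -> Optional[str]:
    """Return the language whose script covers the most characters, if any."""
    counts: Dict[str, int] = {}
    for ch in query:
        code_point = ord(ch)
        for low, high, lang in SCRIPT_RANGES:
            if low <= code_point <= high:
                counts[lang] = counts.get(lang, 0) + 1
    if not counts:
        return None  # no known script found; fall back to statistical detection
    return max(counts, key=counts.get)
```

A query written entirely in Latin script returns `None`, so this would only be a cheap first pass ahead of the statistical detection.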

That said, I don't think there's been much discussion of the original request, which seemed to be configuring two search boxes, one defaulting to English, and one defaulting to Norwegian. I don't know if it's a good idea, but it was what was requested.

Thanks for looking into this.

That said, I don't think there's been much discussion of the original request, which seemed to be configuring two search boxes

There is plenty of discussion about what's requested in this report, i.e. a method to search an entire project domain like wikipedia.org (there are some duplicate reports too, e.g. T104984). The point has been repeated so many times in so many discussions, and appears so obvious to anyone who works in more than one wiki, that nobody bothered to express it at length; let me know if more words are needed.

It's easy to find people who need to use "site:wikipedia.org" searches on Google for many use cases. One way is to search all wikis for said string, which is easier with mwgrep or similar. A sample: https://en.wikipedia.org/w/index.php?title=Special:Search&search="site%3Awikipedia.org"&fulltext=Search&profile=all

There is plenty of discussion about what's requested in this report, i.e. a method to search an entire project domain like wikipedia.org

It's been almost 6 years with no replies from the original reporter, so it's hard to say definitively, but I didn't read this as a request to search the entire wikipedia domain, or even to search two domains at once. I think this user speaks Norwegian and English, and regularly wants to search in Norwegian, and regularly wants to search in English, but doesn't seem to want to search in, say, Swahili, French, or Japanese on a regular basis. Hence the request for two configurable search boxes—one each for Norwegian and English—noting that others might want more.

If the user has set their "accept language" headers to Norwegian and English, they may be getting something more usable, as Norwegian will show up in the list around the globe on the portal page. Clicking that may be easier than finding Norwegian in the drop down.

Hey, there's an easier improvement to be had: put the user's accept languages at the top of the language drop down list! @debt, @JGirault, @Jdrewniak: is that something we could do in addition to pre-selecting their first accept language? That would be great compared to having to scroll through the whole list when your second favorite language is near the bottom. Would they be duplicated at the top, or moved? Not sure. But for this user, it would be much more convenient to have Norwegian and English right next to each other at the top of the list, I'm sure. (I know I'd be a bit happier if Spanish was right under English on my list!)
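A rough sketch of the reordering, assuming the "moved" variant rather than the "duplicated" one (that question is still open). The function names are hypothetical, the parsing handles only the simple `tag;q=value` form of the Accept-Language header, and regional subtags are collapsed to the primary language code; the portal's real code may work differently.

```python
from typing import List


def parse_accept_language(header: str) -> List[str]:
    """Return primary language codes from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        code = tag.strip().lower().split("-")[0]  # "en-US" becomes "en"
        if code and code != "*" and quality > 0:
            weighted.append((-quality, position, code))
    ordered: List[str] = []
    for _, _, code in sorted(weighted):  # highest q first, then header order
        if code not in ordered:
            ordered.append(code)
    return ordered


def promote_preferred(all_languages: List[str], preferred: List[str]) -> List[str]:
    """Move the user's preferred languages to the top, keeping the rest in order."""
    top = [code for code in preferred if code in all_languages]
    rest = [code for code in all_languages if code not in top]
    return top + rest


dropdown = ["de", "en", "es", "fr", "ja", "no", "sw"]
print(promote_preferred(dropdown, parse_accept_language("no,en;q=0.8")))
```

For the original reporter this would put Norwegian and English next to each other at the top of the list.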

Also, some sort of indication that setting your accept language header would improve the UI would be nice, for those who didn't know that and don't have it set. I'm not sure how to indicate that, though.

Finally, I'm not convinced that many users want to search across an entire domain at once. Some certainly do (esp. for research purposes), and I suspect some think they do, but wouldn't, once they see how many results they get that they can't read. In my review of queries that don't work out I often see minor typos in one language that turn out to be perfectly good words in another language that probably wasn't at all what was intended. And for English words that have been in the title of popular films, or are related to science and technology (e.g., matrix), you'll get hits in dozens of languages. "Search all my languages" would be nice, though!

Hey, there's an easier improvement to be had: put the user's accept languages at the top of the language drop down list! @debt, @JGirault, @Jdrewniak: is that something we could do in addition to pre-selecting their first accept language? That would be great compared to having to scroll through the whole list when your second favorite language is near the bottom. Would they be duplicated at the top, or moved? Not sure. But for this user, it would be much more convenient to have Norwegian and English right next to each other at the top of the list, I'm sure. (I know I'd be a bit happier if Spanish was right under English on my list!)

I think this is a great suggestion. And I think it wouldn't be much work. As someone who regularly searches in both English and French, it makes sense to me.

For more user satisfaction, I once floated the idea of replacing the current "search language picker" with a more robust "(Language) Settings" tab, where users could:

  • define the language of the webpage (though the page isn't localized much yet).
  • select one or more search languages.

It's basically what Google does, and I think it works best for users. I fall in the group of people searching in multiple languages, so selecting only one search language doesn't work best for me.

There is no user authentication on this front page so until we do something about this, these settings would be stored in a cookie or in the local storage, on a per browser basis... I think it's fine.

Nonetheless, as a first step, I'd definitely give your suggestion a try!

Also, some sort of indication that setting your accept language header would improve the UI would be nice, for those who didn't know that and don't have it set. I'm not sure how to indicate that, though.

The recurring problem we have on the front page is the lack of localization. There's a dilemma because it is also an advantage to be language agnostic, universally readable. (Someone using a public computer after another person changed the page language could quickly be put off.) Otherwise, an indication in the UI could start as simple as a small line in the footer saying: "This page is optimized per your browser language settings: English, French." (with a link on "browser language settings" to a mediawiki page explaining how to configure these, per browser).

Interesting ideas here that would need some A/B testing and community feedback. I'm adding it to our Portal backlog board.

put the user's accept languages at the top of the language drop down list!

That sounds entirely reasonable to me! The language picker on www.wikipedia.org already defaults to the user's most preferred language, so if the user has multiple preferred languages, putting the other one(s) at the top of the list makes sense to me.

However, I think we can go a step further. Picking from the language list, even with this improvement, can still be cumbersome, but I like the idea of leveraging browser-set languages to enable a 'multilingual' search feature. We could potentially enable multilingual search results within the typeahead; the screenshot below is an idea of what that could look like.

multilingual-search-suggestions.png (1×2 px, 471 KB)
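As a rough illustration of the multilingual typeahead idea, the sketch below builds one suggestion request per browser-set language. It assumes the standard MediaWiki `action=opensearch` API on each language subdomain; the function name and the default limit are hypothetical, and the portal's real typeahead may use a different endpoint. It only constructs the URLs and does not send any requests.

```python
from typing import Dict, List
from urllib.parse import urlencode


def suggestion_urls(query: str, languages: List[str], limit: int = 3) -> Dict[str, str]:
    """Build one opensearch request URL per language for the given prefix."""
    urls: Dict[str, str] = {}
    for lang in languages:
        params = urlencode({
            "action": "opensearch",
            "search": query,
            "limit": limit,
            "format": "json",
        })
        # Assumption: each language lives at <code>.wikipedia.org.
        urls[lang] = f"https://{lang}.wikipedia.org/w/api.php?{params}"
    return urls


for lang, url in suggestion_urls("matrix", ["en", "fr"]).items():
    print(lang, url)
```

The results from each language would then be merged into the single suggestion list shown in the screenshot, with a small per-language limit to keep the list short.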

replacing the current "search language picker" with a more robust "(Language) Settings" tab.

In general, I think of settings as a last resort for any UI. In this specific case, user settings have drawbacks that would limit their usefulness, as mentioned by @JGirault: they are not tied to a user account, they are set on a per-browser basis, Wikipedia has an existing language preference pane, etc. However, I think settings could be the only customization mechanism on mobile, where users (to my knowledge) can't set browser language preferences, and accept-language headers typically only send one language.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:00 AM
Aklapper removed a subscriber: wikibugs-l-list.