
Automatic stopwords for the 200+ languages without their own analyzer available
Open, Low, Public, Feature

Description

Split from bug 54022: for languages other than the 30 currently supported, rather than using the default analyzer bare we should probably use stopwords calculated automatically while we wait for custom ones to be made.
It seems the cutoff_frequency setting and the common_terms query may be used for this purpose.
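
For illustration, here is a rough sketch of what such a request body could look like, built in Python. It assumes the Elasticsearch 1.x query DSL, where the common terms query is spelled "common" and takes a per-field "cutoff_frequency". The field name "text", the 0.001 cutoff and the helper name build_common_terms_query are placeholders chosen for this example, not values taken from any real configuration.

```python
import json


def build_common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a request body using the Elasticsearch `common` (common terms) query.

    Terms whose frequency exceeds `cutoff_frequency` are treated as
    high-frequency (stopword-like) terms at query time, so no stopword
    list has to be configured per language. The default cutoff here is
    an arbitrary illustrative value.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                    # Require all of the rarer terms to match.
                    "low_freq_operator": "and",
                }
            }
        }
    }


if __name__ == "__main__":
    # "text" is a hypothetical field name.
    body = build_common_terms_query("text", "the history of the city")
    print(json.dumps(body, indent=2))
```

The point of this approach is that the split between common and rare terms is derived from the index itself, which is why it looks attractive for languages without a hand-made list.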

I'd say that this is currently low priority but should probably be done before expanding elasticsearch beyond the ~30 supported languages.


Version: master
Severity: enhancement
Whiteboard: Elasticsearch_1.1
See Also:
T56022
T68969

Details

Reference
bz54875

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:33 AM
bzimport set Reference to bz54875.
bzimport added a subscriber: Unknown Object (MLST).

I'm not sure this should be a hard requirement before expanding beyond the ~30 languages with built-in stop words. I certainly agree we should do it though.

I believe that is what nemo was referring to. The problem (right now) is that we use query string queries rather than term queries. For what we do, that makes a lot of sense. Anyway, query string queries don't play nicely with common terms queries yet. They could possibly be made to, but I'm not sure about that yet. It'd probably make more sense to make this change in Elasticsearch itself, and for us to just flip the switch to turn it on.

I don't know anything about implementation details but yes, that would seem the most elegant way to handle it from the small hints I gathered around. However, it may also be viable to automatically generate "standard" stopwords lists for each language, from what I understand.
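
As a rough idea of what generating such a list automatically could mean, here is a minimal sketch that picks the terms appearing in a large share of documents in a sample corpus. Everything in it is an assumption made for illustration: the function name automatic_stopwords, the simple regex tokenizer and the 50% default threshold are not taken from any existing tool, and a real implementation would need language-aware tokenization.

```python
import re
from collections import Counter


def automatic_stopwords(documents, min_doc_ratio=0.5):
    """Return terms that occur in at least `min_doc_ratio` of the documents.

    Uses document frequency (how many documents contain a term) rather
    than raw counts, so one long document cannot dominate the result.
    """
    token_re = re.compile(r"\w+", re.UNICODE)
    doc_freq = Counter()
    for doc in documents:
        # A set ensures each term is counted once per document.
        doc_freq.update(set(token_re.findall(doc.lower())))
    threshold = min_doc_ratio * len(documents)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)


if __name__ == "__main__":
    sample = ["the cat sat", "the dog ran", "a bird in the tree"]
    print(automatic_stopwords(sample, min_doc_ratio=0.9))
```

A list produced this way would depend heavily on the sample corpus, so it would be more of a stopgap than a replacement for a curated list.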

The upstream pull request linked above was closed in March 2014. Is there anything else which needs to be done upstream?

Restricted Application added a subscriber: Aklapper.
debt subscribed.

Removing the Discovery tags - it looks like this has been done for the most part.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM
Aklapper removed a subscriber: Manybubbles.