
Word count is wrong, does not recognize non-ASCII characters
Closed, Declined; Public

Description

Author: mr.heat

Description:
The following example counts 42 words. But I count 40 words.

http://de.wikipedia.org/wiki/Spezial:Artikelr%C3%BCckmeldungen_v5/Yellowstone-Nationalpark/04f917900607eb1692a1842b2b77d79c

I think the current count searches for words made only of the letters a to z. Because of this, a German word like "schönen" is counted as two words.

The best solution would be to use \p{L} instead of \w or [a-z] in the regular expression. Please note that this does not work in JavaScript.

http://www.regular-expressions.info/unicode.html
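
To illustrate the difference, here is a minimal Python sketch, not the extension's actual PHP code. The ASCII pattern is only a simplified stand-in for a byte-oriented word counter, and because Python's built-in `re` module has no `\p{L}`, the class `[^\W\d_]` is used as an approximation of "any Unicode letter". The function names are made up for this example.

```python
import re

# ASCII-only pattern: a simplified stand-in for a counter that only knows a-z.
ASCII_WORD = re.compile(r"[A-Za-z]+")

# Unicode-aware pattern: \w minus decimal digits and underscore, which is
# approximately "runs of letters". Used here in place of \p{L}+, which the
# standard re module does not support.
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def count_words_ascii(text: str) -> int:
    """Count runs of ASCII letters; non-ASCII letters split a word."""
    return len(ASCII_WORD.findall(text))


def count_words_unicode(text: str) -> int:
    """Count runs of Unicode letters."""
    return len(UNICODE_WORD.findall(text))


sample = "einen schönen Gruß"
print(count_words_ascii(sample))    # 4  ("einen", "sch", "nen", "Gru")
print(count_words_unicode(sample))  # 3
```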


Version: master
Severity: minor

Details

Reference
bz47733

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 22 2014, 1:43 AM)
bzimport set Reference to bz47733.

ß and ö are indeed the culprits.
PHP's native str_word_count is used, which isn't multibyte-safe.
However, a regex that matches letters (including those with diacritics) is not ideal either, since it would count words like "you're" or hyphenated words (and quite possibly other character combinations in other languages) as multiple words. That would just replace one bad solution with another sub-optimal one.
Perhaps we should split on whitespace, discard all tokens that contain no letters, and count what remains?
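
A rough sketch of that whitespace-based idea, written in Python rather than the extension's PHP; the function name is hypothetical and this is only one way the proposal could be read:

```python
def count_words_by_whitespace(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter."""
    tokens = text.split()  # splits on any run of whitespace
    # Keep a token only if some character in it is a (Unicode) letter.
    return sum(1 for token in tokens if any(ch.isalpha() for ch in token))


sample = "you're a well-known Straßenkünstler - 42 !"
print(count_words_by_whitespace(sample))  # 4
```

Here "you're" and "well-known" each count once, while "-", "42" and "!" are dropped because they contain no letters.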

Besides, the character length is wrong too, but switching strlen for mb_strlen should do the trick.
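
The length mismatch comes from counting bytes instead of characters. In PHP, strlen returns the number of bytes and mb_strlen the number of characters; the following Python sketch mimics the two behaviours for UTF-8 text (the helper names are invented for illustration):

```python
def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding (what a byte-based length reports)."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Number of code points (what a multibyte-aware length reports)."""
    return len(text)


print(byte_length("schönen"))  # 8, because "ö" takes two bytes in UTF-8
print(char_length("schönen"))  # 7
```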

mr.heat wrote:

Splitting on whitespace is not good because some users write like this,without spaces.The count will be wrong.

I don't think the word count is essential information. Pretty much every solution will be wrong for some language. Don't spend time on this. I suggest choosing one of these very simple solutions:

a) Remove the word count. Stick with the characters (but switch to mb_strlen, of course).

b) Don't change the code but change the message to "Approx. 42 words". Maybe add a max(min($count, 10), round($count / 10) * 10) function and make it "Approx. 40 words".
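
For what it's worth, the rounding expression in b) can be sketched in Python as follows. The function name is made up, and integer arithmetic is used for the rounding because Python's round() rounds halves to the nearest even number, whereas PHP's round() rounds halves away from zero:

```python
def approx_word_count(count: int) -> int:
    """Round a non-negative word count to the nearest ten, leaving small counts intact."""
    # Nearest multiple of ten, with halves rounded up (45 -> 50).
    rounded = (count + 5) // 10 * 10
    # min(count, 10) keeps counts below 5 from being rounded down to 0.
    return max(min(count, 10), rounded)


print([approx_word_count(n) for n in (3, 7, 42, 45)])  # [3, 10, 40, 50]
```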

Related URL: https://gerrit.wikimedia.org/r/62435 (Gerrit Change I84a72f3894fb19d2834719ebf253e33f2d436d8e)

I've removed it completely. I first decided to go with b), but even an approximate value makes less and less sense the more a language relies on multibyte characters (e.g. Chinese). Since it's a useless metric, I think it's best to remove it.

Change 62435 abandoned by Matthias Mullie:
(bug 47733) Word count is wrong, does not recognize non-ASCII characters

Reason:
AFT is unmaintained, these patches are not going to get reviewed

https://gerrit.wikimedia.org/r/62435

Jdforrester-WMF subscribed.

All development work on Article Feedback Tool v5 (and indeed, previous versions) is halted. The project is archived, so having open tasks is inappropriate. Consequently, I'm closing all tasks.