Description
Details
- Reference
- bz57669
Related Objects
- Mentioned In
- T135020: Spike (3h): How to do proper sentence detection for TextExtracts
- T115817: Template "lang" badly processed
- T113633: Spike: Alternative to TextExtracts for Popups, Gather, Read more
- T61641: Implement a reasonably elegant and non-labor-intensive means of describing/summarizing pages
- Mentioned Here
- T1319: RfC: Text extraction
Event Timeline
bingle-admin wrote:
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1460
H. P. Lovecraft: Against the World, Against Life
http://en.wikipedia.org/w/api.php?format=jsonfm&action=query&pageids=17545993&prop=extracts&exsentences=2&exintro&explaintext
Here I obtained a truncated sentence, because the dots in the name cause the sentence to be cut off early.
I have more comments on this bug.
This is just a guess that I have not yet tested on my local machine. Sorry, I am new to wikimedia-dev, and the process for reproducing the bug is still not clear to me.
My guess is that the API truncates the sentence simply by counting the dots ('.').
If so, a quick improvement might be this check:
if the character before the dot is a capital letter, or the word before it consists of a single capital letter, truncate at the next dot instead.
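The check proposed above could be sketched roughly as follows. This is only an illustration of the heuristic, not how TextExtracts actually works; the function names are hypothetical. It assumes a sentence ends at a period followed by whitespace or the end of the text, and it will wrongly skip a real sentence end such as "Plan B." as well as leave abbreviations like "ca." unhandled.

```python
import re

# A period counts as a candidate sentence end only when it is followed by
# whitespace or the end of the text (an assumption made for this sketch).
_PERIOD = re.compile(r"\.(?=\s|$)")


def _follows_initial(text, dot_index):
    """True if the period at dot_index comes right after a lone capital letter."""
    if dot_index == 0 or not text[dot_index - 1].isupper():
        return False
    # The capital must stand alone: at the start of the text, or preceded by
    # something that is not a letter (a space, or the period of "J.P.").
    return dot_index == 1 or not text[dot_index - 2].isalpha()


def first_sentences(text, count):
    """Return the first `count` sentences, ignoring periods after initials."""
    found = 0
    for match in _PERIOD.finditer(text):
        if _follows_initial(text, match.start()):
            continue  # looks like "H." or "P.", keep scanning
        found += 1
        if found == count:
            return text[:match.end()]
    return text  # fewer than `count` sentence ends were found
```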
bingle-admin wrote:
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1478
Abbreviations can also cause problems. Here "ca." for "circa" at Norwegian Bokmål Wikipedia:
Per 67841, blanking out instances of the title before searching for a cutoff point would improve many of these cases.
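As a rough illustration of the title-blanking idea, here is a minimal sketch. The helper name and the "period followed by whitespace" cutoff rule are assumptions made for the example, not the extension's real logic. Each occurrence of the title is replaced with filler of the same length, so the cutoff offset found in the masked text is still valid in the original text.

```python
import re


def first_sentence_masking_title(text, title):
    """Hypothetical helper: hide the title's periods, then cut at the first period."""
    # Same-length filler keeps character offsets identical in both strings.
    masked = text.replace(title, "_" * len(title))
    # Assumed cutoff rule: a period followed by whitespace or the end of the text.
    match = re.search(r"\.(?=\s|$)", masked)
    return text if match is None else text[:match.end()]
```

This only helps where the title appears verbatim; initials elsewhere in the sentence would still cut it short.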
(copy from the merged task:)
The hovercard for the German Wikipedia article "D. J. Caruso" (https://de.wikipedia.org/wiki/D._J._Caruso) only says "D. J.", e.g. when hovering over the link from https://de.wikipedia.org/wiki/Caruso
A similar problem can be seen at the link to "J.P. Morgan & Co." in https://en.wikipedia.org/wiki/J._P._Morgan_%28disambiguation%29
Reported/discussed at https://www.mediawiki.org/wiki/Topic:S6hl4q8uvi4ux10n
With an NLP toolkit and three lines of Python:
import json, nltk, requests
data = requests.get('https://en.wikipedia.org/w/api.php?format=json&action=query&pageids=17545993&prop=extracts&exchars=10000&exintro&explaintext').text
intro = next(iter(json.loads(data)['query']['pages'].values()))['extract']
print(nltk.sent_tokenize(intro)[0])
will give
H. P. Lovecraft: Against the World, Against Life (French: H. P. Lovecraft : Contre le monde, contre la vie) is a work of literary criticism by French author Michel Houellebecq regarding the works of H. P. Lovecraft.
We should find a place to store extracted summaries and set up one of the open-source toolkits to automatically process new revisions.
That stores the plaintext version of the article. I meant storing the definition (the first X sentences) extracted from that by some text processing tool (which is probably not written in PHP and runs as an external service, since there aren't any serious NLP libraries in PHP). This seems similar to the use case of storing lead image focus points, so the same solution could probably be used.
Neat. Do you know how well that works with languages other than English? I couldn't easily find out if NLTK supports anything else.
Ideally, we could use NLTK or similar for splitting into sentences and then just store the text with sentence end markers.
This issue also affects articles with dates in their extracts. For example:
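To make the "sentence end markers" idea concrete, here is a small sketch. The marker character and both function names are made up for illustration, and the input is assumed to be a list of sentences already produced by some tokenizer (NLTK or anything else). Once the markers are stored, serving the first N sentences is a plain string split with no language processing at request time.

```python
# Hypothetical marker: the Unicode paragraph separator, assumed not to occur
# in the extracted plain text.
SENTENCE_END = "\u2029"


def join_with_markers(sentences):
    """Store pre-split sentences as one string, each followed by a marker."""
    return "".join(s.strip() + SENTENCE_END for s in sentences)


def first_n_sentences(stored, n):
    """Recover the first `n` sentences from the stored, marked-up text."""
    parts = stored.split(SENTENCE_END)[:n]
    # The split leaves an empty trailing piece after the last marker; drop it.
    return " ".join(p for p in parts if p)
```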
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=2&titles=Who%20Let%20the%20Dogs%20Out%3F
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=2&titles=Berlin_Wall
@MaxSem: This seems to be fixed for me. I get the following extract ...
"H. P. Lovecraft: Against the World, Against Life (French: H. P. Lovecraft : Contre le monde, contre la vie) is a work of literary criticism by French author Michel Houellebecq regarding the works of H. P. Lovecraft. The English-language edition for the American and UK market was translated by Dorna Khazeni, and features an introduction by American novelist Stephen King."
... from https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&pageids=17545993&prop=extracts&exsentences=2&exintro&explaintext
The other examples seem to be fixed as well.
@kaldari reports this as fixed. The description is very vague, so if this is still a problem, please edit it with better replication steps when reopening.