
feature request: Text extraction from custom wiki markup
Closed, Declined (Public)

Description

Hi,

This is a very interesting project for DBpedia [1]. We already extract abstracts from articles (e.g. [2]), but up to now we have had to hack the MediaWiki core to get them [3].

Looking at the code, I noticed that to get section 0 you parse the whole page, which is a very expensive operation for us. We usually take only the piece of wiki markup we want to extract and use this API call on it:

api.php?format=xml&action=parse&prop=text&title=[...]&text=[...]

Our hacked MediaWiki engine then returns the result as clean text. As you probably guessed, title is used to resolve self-references such as {{PAGENAME}}, and text is the part of the page markup we want the text extracted from.
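For illustration, this is roughly what that call looks like from our side (a minimal sketch in Python using the requests library; the English Wikipedia endpoint and the helper name are just placeholders, we actually run this against our local MediaWiki installation):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # placeholder; we use our local mirror

def parse_fragment(title, wikitext):
    """Render a wikitext fragment; `title` resolves self-references like {{PAGENAME}}."""
    resp = requests.get(API, params={
        "action": "parse",
        "format": "json",
        "prop": "text",
        "title": title,      # page name, used for magic words such as {{PAGENAME}}
        "text": wikitext,    # the markup fragment to render
    })
    resp.raise_for_status()
    # Default (non-formatversion=2) result layout: parse -> text -> "*"
    return resp.json()["parse"]["text"]["*"]
```

The difference is that our patched core returns cleaned plain text instead of this HTML.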

So, to get to the point: is this feasible in your extension? With some guidance from your side, we can also work on this ourselves.

[1] http://dbpedia.org
[2] http://dbpedia.org/page/Berlin
[3] https://github.com/dbpedia/extraction-framework/wiki/Dbpedia-Abstract-Extraction-step-by-step-guide#wiki-prepare-mediawiki---configuration-and-settings


Version: unspecified
Severity: enhancement

Details

Reference
bz62209

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:56 AM
bzimport added a project: TextExtracts.
bzimport set Reference to bz62209.
bzimport added a subscriber: Unknown Object (MLST).
  1. If you specify &exintro, only the intro will be parsed.
  2. TE operates only on the HTML returned by the parser; doing anything with wikitext directly would essentially be a different extension. What do you mean by "custom wiki markup"?

Thanks,

I already saw the &exintro option, so just one question to understand it.

When I use this call: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Athens

Does this extension load the whole page, convert it to HTML, and then return the first section?
If not, this extension is perfect for our purpose and you can skip the rest :)

If yes, we would like to avoid loading the whole page, as it would slow down our extraction.
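For reference, here is the call above issued programmatically (a minimal sketch in Python with requests; parameters exactly as in the URL):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": 1,        # only the text before the first section heading
    "explaintext": 1,    # plain text instead of HTML
    "titles": "Athens",
})
page = next(iter(resp.json()["query"]["pages"].values()))
print(page["extract"])
```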

What we do so far is take the page's wiki markup up to the first section and feed it to the MediaWiki "parse" API call [1], which normally returns HTML. Then we hack into the MediaWiki core to return cleaned text instead.

So the request is to add "text" and "title" parameters to your API. When they are given, instead of parsing the page identified by the title, you would parse the "text" parameter ("title" would be used for magic words like {{PAGENAME}}), get the HTML, and clean it the same way you do now.
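To make the request concrete, the call might look something like this (purely hypothetical: the "text" and "title" parameters below do not exist in TextExtracts today, they are only the ones proposed above):

```python
# Hypothetical request illustrating the proposal; these parameters are NOT implemented.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "explaintext": 1,
    "title": "Berlin",                   # would resolve magic words like {{PAGENAME}}
    "text": "'''Berlin''' is the capital city of [[Germany]] ...",  # markup fragment to extract from
}
```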

Cheers,
Dimitris

[1] https://www.mediawiki.org/wiki/API:Parsing_wikitext#parse

(In reply to Dimitris Kontokostas from comment #2)

Does this extension load the whole page, convert it to HTML, and then return the first section?

Once again,

  1. If you specify &exintro, only the intro will be parsed.

Thanks again,

Still, is it possible to add these two parameters?
This setting works for us, but the text/title option would suit us better.

That way we only have to load the templates into the database and feed the text to the API; otherwise we need to load the whole dump.

If you agree to this request, we can work on this addition.

I don't think turning TE into yet another wikitext-parsing facility is the way we want it to evolve. You can do this trivially in your own infrastructure, though, using the ExtractFormatter class.