
API for external access ( SOAP, XML-RPC, REST... )
Closed, Resolved (Public)

Description

Author: mediazilla

Description:
There seems to be a need for some kind of API for accessing Wikipedia articles
from external programs (at least I need it ;-) ).

  • getting the data record of a given version, or of the current version, of a given article name

  • getting a list of version numbers matching an article name (this depends on #181).
  • getting a list of article names matching a search term

This API would reduce load on the Wikipedia servers, as no wiki markup would have to
be parsed before the data is delivered.


Version: unspecified
Severity: enhancement

Details

Reference
bz208

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 6:42 PM
bzimport set Reference to bz208.

mediazilla wrote:

  • getting a list of authors having worked on a given article (this has also been requested in [Wikitech-l] by Jakob Voss).

jeluf wrote:

Added a SOAP interface to HEAD.

Open todos:

  • Improve search to use searchindex instead of cur.
  • Change client.php so that it can't be called via HTTP, only from the CLI.
  • Check whether nusoap is UTF-8 clean.
  • Add a feature to limit the number of requests.

jeluf wrote:

Updated:

  • using searchindex
  • client.php can only be called using the CLI
  • upon first sight, nusoap seems to be UTF-8 clean; a search query for UTF-8 strings succeeded.

Todos:

  • limit number of requests per user per day

I would like to suggest something similar, but IMHO easier to use and implement
than SOAP (I already posted this on wikitech-l on Sept. 15 2004 and was directed
here). I hope the new features would make life easier for people who develop
bots and other tools that access Wikipedia, and also reduce traffic from such
tools. I also believe these to be fairly easy to implement.

The idea is to have an optional URL parameter ("format=" or such) that would
tell the software to return a page in a format different from the full-fledged
HTML. I would like to suggest formats for "real" pages and special pages
separately, as the requirements are different.

For articles, discussion pages, etc., support the following formats:

  • source - return the wiki-source of that page
  • text - return a plain text version, with all markup stripped/replaced (tables, text boxes, etc. do not have to be formatted nicely, but their content should be there)

For special pages and all automatically generated lists (categories, changes,
watchlist, whatlinkshere, etc.):

  • csv - return the list in CSV format
  • rss - return entries in the list as RSS items.

Additionally, for the normal "full HTML" view, provide a switch "plain" that
suppresses all sidebars, etc. and shows just the formatted text.

As to the implementation, I would suggest mapping the format name to the name of
a PHP class and loading it on demand. That way, new formats can be supported just
by placing an appropriate file in the PHP lib path.
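
A rough sketch of that dispatch idea (purely illustrative; the class and file names below are made up, not existing MediaWiki code):

// Hypothetical dispatcher: map ?format=xyz to a formatter class
// that is loaded on demand from the PHP include path.
$format = isset( $_GET['format'] ) ? preg_replace( '/[^a-z]/', '', $_GET['format'] ) : '';
if ( $format != '' ) {
    $class = 'Format' . ucfirst( $format );      // e.g. FormatSource, FormatText, FormatCsv
    require_once( $class . '.php' );             // a new format = a new file on the PHP lib path
    $formatter = new $class();
    print $formatter->render( $_GET['title'] );  // hand the requested page to the formatter
}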

But all this is pretty different from the original bug - so maybe I should file
a new one?

Thank you.

jeluf wrote:

require_once('nusoap.php');

$s = new soapclient( 'http://en.wikipedia.org:80/soap/' );
$r = $s->call( 'getArticle', array( 'Frankfurt' ) );
print $r->text;

I don't see how this is complicated or how parsing a CSV file is easier.
The approach of having a format= parameter for all pages is not realistic. It
would require a complete rewrite of MediaWiki, which is not designed with a
strict model/view/controller separation.

(In reply to comment #4)

  • source - return the wiki-source of that page

Already have this: action=raw
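
For example (article title chosen only as an illustration), the raw wikitext of a page can be fetched with:

http://en.wikipedia.org/w/index.php?title=Frankfurt&action=raw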

  • text - return a plain text version, with all markup stripped/replaced (tables, text boxes, etc. do not have to be formatted nicely, but their content should be there)

Potentially doable.

  • csv - return the list in CSV-format

Unlikely to be useful.

  • rss - return entries in the list as RSS items.

Already have this where supported: feed=rss (or feed=atom)
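
For example, the recent changes list can be retrieved as a feed with:

http://en.wikipedia.org/w/index.php?title=Special:Recentchanges&feed=rss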

Additionally, for the normal "full HTML" view, provide a switch "plain" that
suppresses all sidebars, etc. and shows just the formatted text.

Potentially doable.

But all this is pretty different from the original bug - so maybe I should file
a new one?

Please do.

avarab wrote:

*** Bug 1012 has been marked as a duplicate of this bug. ***

sfkeller wrote:

Thank you for taking over Bug 1012. But please don't forget to pay attention to
"get by category", as well as the fact that there are valid alternatives to SOAP,
such as HTTP GET with parameters (see the RESTful architectural style).

avarab wrote:

*** Bug 1012 has been marked as a duplicate of this bug. ***

To clarify: this bug is not specifically about the SOAP protocol.

Bug 1012 was specifically about the action=raw and Special:Export interfaces, which exist
for retrieving editable source text. Metadata that is not attached to the page text, such as
category memberships, is included in those interfaces only as the source markup which produces
those links; the requested things just don't fit into that interface.

This bug is a general one, and is on-topic for additional data fetch interfaces.

sfkeller wrote:

Regarding APIs: I vote for a continuation of the already existing "RESTful API"
instead of, or in addition to, SOAP.

Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using
it. It's exactly what comment #4 above suggests, and what I suggested in Bug 1012.
For the debate about "REST versus SOAP" and "RESTful SOAP" see
http://c2.com/cgi/wiki?RestArchitecturalStyle.

In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:

Our needs aren't simple; ideally a SOAP API would handle all sorts of stuff such
as getting edit histories, feeds, rendered HTML, stuff that links to page $1 and
so on. Also, there are standard APIs for most programming languages that
implement it.

Look at the needs of this API: there are almost only getter operations, and the
parameters are (so far) a single article, a search term, a category or an
author's name. The responses are either unstructured HTML, wiki text, RSS
or CSV(??) - or an error message.

Now compare these requirements with the pros of a RESTful implementation: HTTP
support is all that programming languages need. The cons of REST come into
play when there are complex objects in the involved operation (request and
response), that is, when encoding of non-string parameters is needed.
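
To illustrate how little a client needs for the plain-HTTP style, fetching an article's wikitext through the already-mentioned action=raw interface takes one line of PHP (a minimal sketch; error handling omitted, article title chosen only as an example):

// Plain HTTP GET - no SOAP toolkit required on the client side.
$text = file_get_contents( 'http://en.wikipedia.org/w/index.php?title=Frankfurt&action=raw' );
print $text;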

avarab wrote:

*** Bug 1233 has been marked as a duplicate of this bug. ***

avarab wrote:

(In reply to comment #6)

(In reply to comment #4)

Additionally, for the normal "full HTML" view, provide a switch "plain" that
suppresses all sidebars, etc. and shows just the formatted text.

Potentially doable.

It's very doable. I did an action=html thing the other day that dumped the HTML
of an article, and Tim also made a dumpHTML thing that does that, just not
through a web interface. The problem with it, however, was that it didn't allow
modification of the parser options, which is something an API like the one this
bug discusses should implement.

avarab wrote:

In comment #5 of Bug 1012 Ævar Arnfjörð Bjarmason wrote:

Our needs aren't simple; ideally a SOAP API would handle all sorts of stuff such
as getting edit histories, feeds, rendered HTML, stuff that links to page $1 and
so on. Also, there are standard APIs for most programming languages that
implement it.

Look at the needs of this API: there are almost only getter operations, and the
parameters are (so far) a single article, a search term, a category or an
author's name. The responses are either unstructured HTML, wiki text, RSS
or CSV(??) - or an error message.

Just so we're clear on this: I personally don't really care what we use, as long
as it's something that's designed in such a way that it can suit all our needs,
both current and potential ones, and does so through one interface, as well as
being widely supported by the most popular programming languages.

The problem with what you're suggesting is that it would basically be something
that would grow like a cancer over time (and I'm talking about the mix of RSS, CSV
and other formats; I haven't really looked into REST) and be hell to implement:
you'd have to switch between parsing CSV, RSS and probably some other things
rather than just using one format for everything.

avarab wrote:

Now, to contribute something useful other than flaming other people's choice of
APIs ;)

I made a special page extension the other day that was a basic proof of concept
of SOAP functionality. It used the nusoap_server class (see:
http://cvs.sourceforge.net/viewcvs.py/*checkout*/nusoap/lib/nusoap.php?rev=HEAD
) to parse requests and generate output. Unlike Jeluf's implementation (which is
no longer in CVS), it used internal MediaWiki functions to fetch the various
things rather than making its own SQL queries on the database.

I don't have the code at hand right now (it's on another computer that I can't
access at the moment), but for anyone interested in SOAP support, making a
new special page (Special:SOAP) shouldn't be that difficult.

Just remember to turn off the skin output with $wgOut->disable()
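
A rough outline of what such a special page could look like (a sketch only, not the lost code: the hook function, the example operation and the use of MediaWiki's Title/Article classes are illustrative assumptions; nusoap_server is the class from the nusoap.php linked above):

require_once( 'nusoap.php' );                 // provides the nusoap_server class

function wfSpecialSOAP() {
    global $wgOut, $HTTP_RAW_POST_DATA;
    $wgOut->disable();                        // suppress the normal skin output

    $server = new nusoap_server();
    $server->register( 'getArticle' );        // expose one example operation
    $server->service( $HTTP_RAW_POST_DATA );  // parse the SOAP request, send the response
}

// Example operation: fetch wikitext via MediaWiki's own classes, not raw SQL.
function getArticle( $title ) {
    $t = Title::newFromText( $title );
    if ( !$t ) {
        return '';
    }
    $article = new Article( $t );
    return $article->getContent();
}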

sfkeller wrote:

We agree on the points that the API should be easy to implement, supported by
programming languages and covering both current and potential needs.

Don't hype SOAP and misinterpret REST: the former is more heavyweight than the
latter and is 'restricted' to XML. REST doesn't imply any particular format (CSV,
RSS...); it's up to our design choices whether to use XML - and it's not
cancer-causing per se.

Neither approach resolves the need for a data model of the content to be
transferred. The choice is simply a matter of good evaluation and engineering.

davidco wrote:

(In reply to comment #16)

We agree on the points that the API should be easy to implement, supported by
programming languages and covering both current and potential needs.
Don't hype SOAP and misinterpret REST: the former is more heavyweight than the
latter and is 'restricted' to XML. REST doesn't imply any particular format (CSV,
RSS...); it's up to our design choices whether to use XML - and it's not
cancer-causing per se.

Neither approach resolves the need for a data model of the content to be
transferred. The choice is simply a matter of good evaluation and engineering.

I don't think I care what the API looks like either. When I posted my request (since absorbed into #208), I was
basically looking for a way to grab a page's content in HTML format without the surrounding elements (header,
sidebar, footer, etc).

The idea is that there could then be a "Print this Article" link and it would print only the article content and
not the whole page (which is unnecessary if you're simply trying to give someone reference material). Similarly,
an "E-Mail this Article" link could do the same thing.

Another feature could be to "Download Article as PDF" or "Download Article as Word Document"...

Any API would allow these operations...

Of course I do have an opinion and would use SOAP/XML. It's ubiquitous even if it has some historical flaws.

avarab wrote:

(In reply to comment #17)

The idea is that there could then be a "Print this Article" link and it would print only the article content and not the whole page (which is unnecessary if you're simply trying to give someone reference material). Similarly,

We already have that: just try printing any page (with Monobook turned on) in a
modern browser; only the @media print styles will be used (i.e. the sidebars and
other irrelevant content won't be printed).

an "E-Mail this Article" link could do the same thing.

Another feature could be to "Download Article as PDF" or "Download Article as Word Document"...

Any API would allow these operations...

Actually it wouldn't: any currently foreseeable API would allow access to the
current interface (or rather a subset of it) through alternative means; it's not
as if we'll suddenly have PDF generation.

avarab wrote:

(In reply to comment #11)

Wikimedia is already using REST; it's nothing new! Yahoo! and Amazon are using
it. It's exactly what comment #4 above suggests, and what I suggested in Bug 1012.

For those of you who have no idea what REST is, it's (to paraphrase a
dot.kde.org post) a design approach to web APIs. While SOAP tries to solve
everything inside SOAP itself, REST means you rely on existing mechanisms (URLs,
HTTP) as much as possible.

avarab wrote:

*** Bug 2037 has been marked as a duplicate of this bug. ***

d wrote:

I would like to request some functionality for this upcoming API - specifically:

  1. A method (a sequence of API calls) of enumerating and iterating over the entire content;
  2. A method of obtaining a list of objects that have been changed or are new since a particular time-stamp.

For an example of how (1) would be used, please see the Apache Commons VFS APIs, which look at a data source as a file
system - i.e. in a hierarchical fashion.

As I am a newcomer to this project and am ignorant of the way the data is organized, I will only venture to make a trivial
suggestion. It would likely be possible (and simple) to overlay a simple topic-based alphabetical hierarchy over all
content, so that at the first level the categorization is by the first letter of all topics or article titles, the next
level is based on the first two letters of the topic or article title, etc.:

A.B.C.D.E.F.G.H.I………..Z
AA.AB.AC.AD………………..AX
AAA.AAB.AAC.AAD………..AXX

Etc.

Again – as long as there is some way to retrieve a tree structure and walk through it, or an iterator over the
collection of unique content object identifiers in the repository, the API would let me accomplish the task I am
looking at.

I have put together some ideas and suggestions for a REST interface on Meta. See
here:

http://meta.wikimedia.org/wiki/REST

Please have a look and add your thoughts, if you like.

Copied from Bug 3365:
I would like to have a feature that allows me to load diffs, but just diffs. The <rc>
bot reports changes to articles on Wikipedia, and I have a working bot that processes
this. I can't load every diff for every Wikipedia edit, for practical reasons
(bandwidth and CPU). So if I could load just the diffs, I could process them and
check for obvious cases of sneaky vandalism. I don't believe it is hard to implement
this, since we already have a function showing the diff (with the rest of the page).

It should be machine friendly so as to use minimum bandwidth and processing time.
All I care about is what's removed and what's added.

wikipedia wrote:

I'd be interested in writing an implementation of WTTP (as described at
http://meta.wikimedia.org/wiki/WikiText_Transfer_Protocol). WTTP (or some other
RESTful approach) would be quicker to implement than a full-fledged SOAP
approach, and it seems like the 80% solution would be much more than the zero
percent we've got now.

sfkeller wrote:

I agree (cf. my comment #8). One should start with the simple case (diffs can come
later) and clarify the goal: a request for a single page, a range of pages, or all
of them (= bulk download?).

From the bot's perspective, the most needed features are bulk get and (bulk?) put; anything else can come later. The biggest
hurdles: we need a clear indication when the requested page(s) do not exist, and we need to make sure the user's login does not
get lost. These two issues have occurred numerous times with the present system. Beyond that, any page parsing can be done on
the client for the time being.

The first stage of this interface has been implemented. Ladies and gentlemen, I
proudly present the Query API: http://en.wikipedia.org/w/query.php

It supports multiple output formats and bulk data retrieval of many items that you
could previously only get by HTML scraping. Any comments are welcome on the API's
home page at http://en.wikipedia.org/wiki/User:Yurik/Query_API
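
For illustration, a bulk request against that interface looks roughly like this (parameter names recalled from the Query API's documentation; see the home page above for the authoritative syntax):

http://en.wikipedia.org/w/query.php?format=xml&what=content&titles=Frankfurt|Hamburg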

We had some of this for ages. More specific requests should be filed separately.

epriestley added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:20 AM