
index.php does not honor the variant param with action=raw
Closed, Invalid · Public

Description

Author: jhecking

Description:
Copied from the thread "MediaWiki API and Chinese language variants" on the mediawiki-api mailing list:

Summary:
When called with action=raw, index.php does not return the language variant specified with the variant parameter.

index.php?title=西恩塔&action=raw&variant=zh-cn and
index.php?title=西恩塔&action=raw&variant=zh-tw
should each return markup in the specified variant, but currently both return the same (zh?) variant.

Full Thread:

On 19/01/2008, Jan Hecking <jhecking@yahoo-inc.com> wrote:

To follow up on this somewhat old thread: I finally got around to
actually testing Paolo's suggestion of using the index.php API instead
of api.php. It turns out it doesn't actually work. While the index.php API
does have a variant parameter that allows selecting one of the Chinese
language variants (e.g. zh, zh-hk, zh-tw), it does not actually honor
this parameter when combined with action=raw. When returning the raw
wiki markup it always returns the same variant (zh?), no matter which
variant is specified.

$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=raw&variant=zh-cn" > zh-cn
$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=raw&variant=zh-tw" > zh-tw
$ diff zh-cn zh-tw

diff shows that the markup returned is identical. With action=view (the
default) the output is clearly different.

So it looks like there is actually no way to get the raw markup in
different language variants?

This is definitely a bug. I can verify that the variants work with action=render but not with action=raw. Please file a bug report on Bugzilla.
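
For example, repeating the earlier comparison but with action=render instead of action=raw (same URL-encoded title as above; the output file names are arbitrary):

$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=render&variant=zh-cn" > render-cn
$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=render&variant=zh-tw" > render-tw
$ diff render-cn render-tw   # output differs, since action=render honors the variant parameter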

Andrew Dunbar (hippietrail)

Thanks,

Jan

On 12/13/2007 6:45 PM, Jan Hecking wrote:

On 12/13/2007 11:28 PM, Paolo Liberatore wrote:

On Thu, 13 Dec 2007, Jan Hecking wrote:

On 12/13/2007 1:04 AM, Roan Kattouw wrote:

Jan Hecking schreef:

Hi,

Is it possible to retrieve content in different Chinese language
variants using the /w/api.php API? There doesn't seem to be a variant or
language parameter that would allow selecting a variant like "zh-tw" or
"zh-hk". Is there some other way to do this?

How is this done in the regular user interface, then?

I would like to know that as well. :)

My suspicion is that the user interface, i.e. the frontend servers, does
the conversion, which would mean that all users of the MediaWiki API
would have to replicate that work. That would severely limit the use of
the API for Chinese-language content, IMHO. But then I don't know much
about MediaWiki yet, and maybe I have just missed something obvious.

Thanks,
Jan

There is a "variant" parameter in
http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php
I believe it's only used for Chinese.
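
As noted above, the default action (view) does honor this parameter; with the URL-encoded title from the earlier curl test, the call looks like this:

$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&variant=zh-tw"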

Thanks for the reminder, Paolo! I hadn't considered index.php before
because I assumed it only returns the rendered HTML markup. But I now
see that there is an action=raw parameter which returns the raw wiki
markup that I'm looking for. However, this API has one other drawback: in
contrast to api.php, it doesn't have an option to resolve redirects
automatically. When calling api.php I was using the redirects parameter
to do so, but this doesn't seem to be supported by index.php when using
action=raw (only for action=view). That means I would potentially have
to make multiple calls to resolve redirects manually. Or is there a way
to avoid this?
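
One workaround, at the cost of an extra request per page, is to resolve the redirect first with api.php's query module (the redirects parameter mentioned above) and then fetch the raw text of the resolved title via index.php. A rough sketch, where the page titles are only placeholders:

# Step 1: resolve the redirect; the <redirects> element of the response
# names the target title.
$ curl -s "http://zh.wikipedia.org/w/api.php?action=query&titles=Some_redirect&redirects&format=xml"
# Step 2: fetch the raw wikitext of the resolved title.
$ curl -s "http://zh.wikipedia.org/w/index.php?title=Resolved_title&action=raw"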

Thanks,
Jan


Version: unspecified
Severity: normal

Details

Reference
bz12683

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:04 PM
bzimport set Reference to bz12683.
bzimport added a subscriber: Unknown Object (MLST).

jhecking wrote:

Could someone please at least confirm whether the action=raw method of the index.php API is supposed to work with the variant parameter? If not, then we will have to look into either doing the transcoding client-side or scraping the rendered HTML markup from the wiki servers instead of using the raw markup. Neither option looks very good. :(

action=raw is by definition raw. No variant processing would be applied to output.

(In reply to comment #2)

action=raw is by definition raw. No variant processing would be applied to
output.

I said that initially as well. But considering the alternatives (scraping action=render HTML or translating variants on the client side, both of which are evil), I think it would be best to add a variant parameter to action=raw, however inconsistent that may be. If no variant parameter is supplied, action=raw will still return the raw wikitext straight from the DB/cache, so no one using the action=raw interface the way it currently works will notice the difference.

Variant processing is for output, not raw source text. Performing variant processing on unprocessed source code is meaningless and would simply corrupt the code.

Since Brion says it doesn't belong at the index.php level, I think there are a few options:

  • Do we really need raw+variant? The variant stuff may well interact with templates or other wiki markup. Is there any reason people who want this can't get what they need from action=render? What is the use case?
  • If not index.php, we could add support to api.php (see the sketch after this list).
  • How independent is the variant processing code? Could we abstract a function for api.php that takes a string in one variant and transforms it into another variant?
  • What about something like {{variant:xxx}}? We have similar wiki-functions already.
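
Purely to illustrate the api.php option, a variant-aware call might look something like the line below. The action and parameter names are assumptions made for the sake of the sketch, not an interface that existed when this report was filed.

# Hypothetical: api.php returning page content with an explicit variant
# (illustrative only; no such parameter combination is being claimed here).
$ curl -s "http://zh.wikipedia.org/w/api.php?action=parse&page=%E8%A5%BF%E6%81%A9%E5%A1%94&variant=zh-tw&format=xml"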

jhecking wrote:

Hi Brion, Andrew,

Here is some background on the use case for which I think raw+variant is needed; if there is a better way to achieve the same thing, please let me know.

We have integrated Wikipedia content into our mobile search product. Wikipedia has very relevant content for a very large number of typical search queries. However, on the majority of the mobile devices our users have (particularly in emerging markets), this content is not easily accessible, since there is no mobile-specific version of the Wikipedia web site. So instead of linking directly to the Wikipedia article on <intl>.wikipedia.org, we use the raw markup to render a mobile-compatible version of the article within our mobile search product. We use the raw markup for this because we need to be able to render the content in different device-specific markup languages. Currently we only do this for articles from en.wikipedia.org, but we would like to apply this to other languages as well. However, if we cannot get the raw markup in the relevant language variant, we cannot do this.

Here are a few examples of how the integration looks:
Sample search results pages featuring Wikipedia content: http://us.m.yahoo.com/p/search?p=who+was+albert+einstein, http://us.m.yahoo.com/p/search?p=what+is+dynamite
Albert Einstein article in oneSearch: http://us.m.yahoo.com/p/search/wiki?displayurl=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FAlbert_Einstein&.done=http%3A%2F%2Fus.m.yahoo.com%2Fp%2Fsearch%3Fp%3Dwho%2Bwas%2Balbert%2Beinstein

Hope this clarifies the intended use case. Let me know your thoughts on this.

Thanks,
Jan

Conversion applies to output, not to raw source code. Early conversion would corrupt the markup (particularly for Latin<->Cyrillic, Latin<->Arabic, etc. variants).

jhecking wrote:

I'm not quite sure I follow: how would the conversion corrupt the wiki markup? The markup only consists of Latin characters, right? Those wouldn't be affected by the conversion, I assume. And if they are affected, then how come the conversion does not affect the final output, which is markup as well (HTML markup in this case)?

I guess I have to take a closer look at how the variant parameter works for Cyrillic/Arabic languages. Maybe it works differently than for Chinese.

Thanks,
Jan

Template names, magic keywords, tag names, HTML fragments, bla bla bla.

jhecking wrote:

As far as I can see, our alternatives are:

  1. scrape the HTML markup instead of the raw wiki markup,
  2. use the raw markup and duplicate the whole transcoding logic on our servers.

Am I missing anything?

Thanks,
Jan

  1. scrape the HTML markup instead of the raw wiki markup

This was always going to be easier anyway. From the URLs you posted earlier, it looks like all you need is the plain text. There is plenty of existing free code for extracting plain text from HTML. It's much easier than parsing wiki templates etc.
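
As a rough sketch only (a real implementation would use a proper HTML parser rather than a regular expression), the variant-converted text could be fetched with action=render, which does honor the variant parameter, and the tags stripped afterwards:

# Fetch variant-converted HTML and naively strip the tags.
$ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=render&variant=zh-tw" | sed -e 's/<[^>]*>//g'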