
API pretty printer should not include double quotes in hyperlinks
Closed, Resolved (Public)

Description

When accessing the api.php file with the action parameter set to "sitematrix", a list of sites comes up, as expected. However, the URL for each site listed appears like this:

http://en.wikipedia.org/"

when it should appear like this:

http://en.wikipedia.org/

In other words, the quotation mark is included with the URL. And when it loads, an error page comes up saying that you might have wanted "http://en.wikipedia.org/wiki/%22", and it redirects you there. The fix is very simple: exclude the quotation mark.


Version: 1.12.x
Severity: normal
URL: http://en.wikipedia.org/w/api.php?action=sitematrix

Details

Reference
bz13218

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:05 PM
bzimport set Reference to bz13218.

When my previous comment was recorded, the site formatted the URL correctly by removing the quotation mark from the address on line 5. However, just to reinforce my point, the API includes the quotation mark.

overlordq wrote:

I changed it to the extension SiteMatrix since that's separate from the API itself, but digging through the code of both I can't decide whether the error is in SiteMatrix or in formatHTML in ApiFormatBase.php

Since I don't have access to an input file, I can't really help past that.

overlordq wrote:

Should be able to replace lines 190&191 with something like:

$text = preg_replace('#(http|https|ftp|gopher)://[\w\-\.]+/#', '<a href="$0">$0</a>', $text);

since a whitelist is better than trying to list every possible invalid character.
But then again, I don't know how it'll handle the other charsets.

robert wrote:

Current behaviour is as expected. Perhaps your client is including the quotation mark in an automatically generated hyperlink; the URL is surrounded by quotation marks once, as expected. When using a client program to access the API in XML mode, you are expected to use a proper XML parser, in which case this would not be a problem. Resolving as WONTFIX.

Actually, not quite.
The issue isn't with the client, it's with the API's pretty print format.

As you'll see, this is the source output:
<a href="http://advisory.wikimedia.org/&quot;">http://advisory.wikimedia.org/&quot;</a>

The issue is that the pretty print format includes the quote inside the link when it shouldn't, because 90% of the formats wrap the URL inside quotes.

The ending characters for the various formats appear to be:
JSON/RAW: "
XML: " and <
PHP: "
WDDX: " and <
YAML: (whitespace)
So the regex should terminate links at ["<\s] when pretty printing, so that the generated links are valid.
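
A minimal sketch of that terminator rule, assuming the text has not been HTML-escaped yet. The function name linkifyRaw and the sample input are hypothetical, not taken from ApiFormatBase.php:

```php
<?php
// Hypothetical helper for illustration only (not the ApiFormatBase code).
// Links any URL in raw, not yet HTML-escaped output and stops the match
// at the first double quote, "<" or whitespace character.
function linkifyRaw(string $text): string
{
    return preg_replace(
        '#\b(?:https?|ftp|gopher)://[^"<\s]+#',
        '<a href="$0">$0</a>',
        $text
    );
}

// JSON/XML/PHP-style output: the closing quote ends the link.
echo linkifyRaw('url="http://en.wikipedia.org/" code="en"'), "\n";

// YAML-style output: the trailing whitespace ends the link.
echo linkifyRaw("url: http://en.wikipedia.org/\n");
```

In both calls the linked text is just http://en.wikipedia.org/, with the quote or newline left outside the anchor.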

But in general, I have a feeling that this kind of thing mainly comes from doing the pretty printing with a poor method.
The best thing to do would be to go the way of <source/>: find GeSHi files for our output formats and use them to pretty print the API when not using the actual data formats. After all, we shouldn't be treating pretty printing as if every data format used the same type of output format.

Though it looks like this is happening because the linking in the pretty printer was primarily meant to give the help page, which shows up by default, actual clickable links. The SiteMatrix is just the only thing in the API that uses an http:// prefix, and as a result it's being linked.

robert wrote:

Sorry, my bad. This is the API's fault, changing resolution.

Changing description and component to something more accurate, assigning to self.

Fixed in r31452. The regex did include " as a terminating character, but unfortunately htmlspecialchars() has already replaced all "s with &quot; by then.
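
For anyone reading later, the ordering problem can be reproduced with a short sketch. The sample input and both regexes below are assumptions made for illustration; they are not the code from ApiFormatBase.php or the r31452 diff:

```php
<?php
// Sample line as an XML formatter might emit it (assumed input).
$raw = '<site url="http://en.wikipedia.org/" code="en" />';

// Escaping runs first, so every " is already &quot; before any linking.
$escaped = htmlspecialchars($raw);

// The terminator class lists a literal ", but none is left in the text,
// so the match runs on through "&quot;" until the next space.
$buggy = preg_replace(
    '#(https?://[^" <\n]+)#',
    '<a href="$1">$1</a>',
    $escaped
);

// One possible repair (an assumption, not the actual r31452 change):
// match lazily and also stop in front of the escaped entities.
$fixed = preg_replace(
    '#(https?://.*?)(?=&quot;|&lt;|&gt;|[\s"<>])#',
    '<a href="$1">$1</a>',
    $escaped
);

echo $buggy, "\n", $fixed, "\n";
```

With this input, $buggy contains the same anchor reported earlier in the thread, with &quot; inside both the href and the link text, while $fixed ends the anchor before &quot;.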