Page MenuHomePhabricator

External link search and protocol-relative URLs
Closed, ResolvedPublic

Description

I'm pretty sure that *something* is gonna break when searching for protocol-relative external links, at least through the API. I haven't actually tried and I'm not sure what would break exactly, but the API code is scary enough to convince me that it won't make it through unscathed.

This is mostly just a reminder to myself to fix this soonish.


Version: 1.20.x
Severity: normal

Details

Reference
bz29854

Event Timeline

bzimport raised the priority of this task from to High.Nov 21 2014, 11:27 PM
bzimport set Reference to bz29854.

To my great pleasure, wfMakeUrlIndex does not appear to explode on these on trunk:

return wfMakeUrlIndex('http://foobar.com/baz/quux');

http://com.foobar./baz/quux

return wfMakeUrlIndex('//foobar.com/baz/quux');

//com.foobar./baz/quux

However it *does* explode on them in 1.17 and 1.18:

return wfMakeUrlIndex('//foobar.com/baz/quux');

PHP Notice: Undefined index: scheme in /var/www/rel1.18/includes/GlobalFunctions.php on line 2716
PHP Stack trace:
PHP 1. {main}() /var/www/rel1.18/maintenance/eval.php:0
PHP 2. eval() /var/www/rel1.18/maintenance/eval.php:79
PHP 3. wfMakeUrlIndex() /var/www/rel1.18/maintenance/eval.php(79) : eval()'d code:1
PHP 4. wfParseUrl() /var/www/rel1.18/includes/GlobalFunctions.php:2740
PHP Notice: Undefined index: scheme in /var/www/rel1.18/includes/GlobalFunctions.php on line 2718
PHP Stack trace:
PHP 1. {main}() /var/www/rel1.18/maintenance/eval.php:0
PHP 2. eval() /var/www/rel1.18/maintenance/eval.php:79
PHP 3. wfMakeUrlIndex() /var/www/rel1.18/maintenance/eval.php(79) : eval()'d code:1
PHP 4. wfParseUrl() /var/www/rel1.18/includes/GlobalFunctions.php:2740
./

Probably need to backport the updates to that function (I recall fixes related to file: URLs, which probably fixed this as a side effect).

However a bigger problem is that even if you can get them past the UI on the external links search page, this forces you to *know* whether you're looking for an http-only, an https-only, or a protocol-independent URL.

A possible way to make these work 'naturally' might be to actually change how we save external URLs into the table... instead of saving one entry without a protocol, we could save *two* entries -- one http and one https.

This would allow either to be searched for, but would also list both if fetching lists of external links for a given page. Which seems legit, to be honest. ;)

(In reply to comment #1)

Probably need to backport the updates to that function (I recall fixes related
to file: URLs, which probably fixed this as a side effect).

No, I deliberately fixed wfParseUrl() for prot rel URLs in r92024.

However a bigger problem is that even if you can get them past the UI on the
external links search page, this forces you to *know* whether you're looking
for an http-only, an https-only, or a protocol-independent URL.

A possible way to make these work 'naturally' might be to actually change how
we save external URLs into the table... instead of saving one entry without a
protocol, we could save *two* entries -- one http and one https.

This would allow either to be searched for, but would also list both if
fetching lists of external links for a given page. Which seems legit, to be
honest. ;)

I guess that would work, yeah.

reachouttothetruth wrote:

*** Bug 31721 has been marked as a duplicate of this bug. ***

This is very much a bug: at the moment neither Special:LinkSearch nor the API's list=exturlusage can find protocol-relative links at all, because trying to search for a link with an empty protocol defaults to searching for "http".

What is the desired solution?

  1. Allow [[Special:LinkSearch/example.com]] and prop=exturlusage&euquery=example.com&euprotocol= (or something similar) to search for protocol-relative links explicitly. This is easy to do, but as noted above requires the user to perform additional searches.
  2. When searching for "http", also search for protocol-relative. This is really hacky.
  3. Brion's solution, save entries with el_index "http" and with "https". The major drawback here is that it needs to fix the existing entries in externallinks when upgrading.

Personally, I'd prefer #3 if someone will actually do it in a timely manner (i.e. unlike bug 27480). #1 would be ok otherwise, at least until #3 is actually done.

(In reply to comment #4)

Personally, I'd prefer #3 if someone will actually do it in a timely manner
(i.e. unlike bug 27480). #1 would be ok otherwise, at least until #3 is
actually done.

#3 shouldn't require actually running refreshlinks, we should be able to find all protocol-relative entries and convert them pretty easily, without even accessing the page text.

I'll hopefully do this on Thursday or Friday.

Created attachment 9340
Possible patch to implement Brion's solution

(In reply to comment #5)

#3 shouldn't require actually running refreshlinks, we should be able to find
all protocol-relative entries and convert them pretty easily, without even
accessing the page text.

The complaint was just that that other bug requiring database updates seems to be being ignored. It could take just two SQL queries to fix the DB:

INSERT INTO externallinks (el_from,el_to,el_index) SELECT el_from, el_to, CONCAT('http:',el_index) FROM externallinks WHERE el_index LIKE '//%';
UPDATE externallinks SET el_index = CONCAT('https:',el_index) WHERE el_index LIKE '//%';

Avoiding killing the site while running those queries on Wikipedia is probably the hard part.

I tried my hand at writing the necessary patch; I copied doTemplatelinksUpdate for the MySQL database updating. There are probably things that could be done better. Even if the patch is good, someone would still have to write updaters for the non-MySQL databases.

Attached:

sumanah wrote:

Adding the "patch" and "need-review" keywords for a patch that needs review by developers. Brad, thank you for your patch.

(In reply to comment #6)

Created attachment 9340 [details]
Possible patch to implement Brion's solution

Thanks for the patch. I modified it a bit and applied it in r101951 and r101954.

I tried my hand at writing the necessary patch; I copied doTemplatelinksUpdate
for the MySQL database updating. There are probably things that could be done
better. Even if the patch is good, someone would still have to write updaters
for the non-MySQL databases.

I did end up changing that part quite a bit, yeah. I took the slow replication-aware code and moved it into a maintenance script. I also changed the queries a bit, they now use INSERT IGNORE and there's now two INSERT queries followed by a DELETE.

Attached:

(In reply to comment #8)

Thanks for the patch. I modified it a bit and applied it in r101951 and
r101954.

Of course I meant r102951 and r102954.