
URLs should be decoded before regexp matching
Open, Medium, Public, Bug Report

Details

Reference
bz32159

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:57 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz32159.
bzimport added a subscriber: Unknown Object (MLST).

The SpamBlacklist (SBL) extension searches for

/https?:\/\/+[a-z0-9_\-.]*(\bexample\.com\b)

That means SBL entries always start matching at the domain part of a URL. That is fine in itself, because Google links like the one mentioned above also include full URLs. The problem is that those URLs are percent-encoded (see [[w:en:Percent-encoding]]) and the SBL extension does no decoding. So

...?url=http%3A%2F%2Fwww.example.com

is not recognized as

...?url=http://www.example.com
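
To illustrate with a minimal standalone PHP sketch (not extension code; the pattern is the one quoted above):

<?php
// Minimal standalone sketch (not extension code): the quoted blacklist
// pattern misses the percent-encoded form but matches the decoded one.
$pattern = '/https?:\/\/+[a-z0-9_\-.]*(\bexample\.com\b)/';
$encoded = '...?url=http%3A%2F%2Fwww.example.com';

var_dump( preg_match( $pattern, $encoded ) );                 // int(0) - no match
var_dump( preg_match( $pattern, rawurldecode( $encoded ) ) ); // int(1) - match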

Solutions could be either

  1. changing the start of the regexp pattern from /https?:\/\/+[a-z0-9_\-.]*(/ to /https?(?i::|%3a)(?i:\/|%2f){2,}[a-z0-9_\-.]*(/, or
  2. decoding URLs before doing the regexp matching.

(The second option is better because it is more general; a sketch follows below.)
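
A minimal PHP sketch of option 2, using a hypothetical helper (this is not the actual patch):

<?php
// Hypothetical helper sketching option 2 (not the actual patch):
// percent-decode the text once, then apply the unchanged blacklist regex.
function sblTextMatches( string $text, string $regex ): bool {
    return (bool)preg_match( $regex, rawurldecode( $text ) );
}

A real implementation would probably check both the raw and the decoded text, since decoding can also change characters outside of URLs.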

anubhav: Mentioning the bug number in the commit message is highly welcome.

Change 57935 had a related patch set uploaded (by Platonides):
T34159: Decode urls before regexp matching in SpamBlacklist Extension

https://gerrit.wikimedia.org/r/57935

A similar (or actually the same) problem occurs with translate.google.[^\/]{2,5}/translate.
Right now, Google Translate URLs can be used to circumvent the SBL, and we don't know a good way to cope with that problem via the SBL or an edit filter.
It would be better if the SBL found the blocked URLs inside the Google Translate URL.
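
For example (hypothetical wrapper URL), decoding exposes the embedded target, so option 2 above would cover this case too:

<?php
// Hypothetical Google Translate wrapper URL: the blocked target only
// appears after percent-decoding the query parameter.
echo rawurldecode( 'https://translate.google.com/translate?u=http%3A%2F%2Fwww.example.com' );
// prints: https://translate.google.com/translate?u=http://www.example.com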

Aklapper changed the subtype of this task from "Task" to "Bug Report". Feb 15 2022, 9:39 PM
Aklapper removed a subscriber: wikibugs-l-list.