
Duplication of blacklisted links already in page text is possible
Open, Low, Public

Description

In his commit (linked below), Brion noted that when a spam link is already present on the page, it can be duplicated and thus added again.

I'm just filing this bug to note that behavior, and as a reminder to myself to come back to it, as I believe I may be able to eliminate it.


Version: unspecified
Severity: minor
URL: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/SpamBlacklist/SpamBlacklist_body.php?r1=34769&r2=34768&pathrev=34769
See Also:

Details

Reference
bz14114

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:09 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz14114.
bzimport added a subscriber: Unknown Object (MLST).

The trick is that we don't currently keep track of how _many_ times a given link is used on the page, either in the parser or the externallinks table. Without a count record, we can't easily track duplications.
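
A minimal sketch of what that missing record would look like, assuming a hypothetical helper (not part of the extension) that tallies how often each external URL appears in the expanded page text:

```php
<?php
// Hypothetical sketch only: neither the parser nor the externallinks table
// records how many times a URL appears, but a per-URL tally like this is
// what catching a duplicated link would require.
function countExternalLinks( string $expandedText ): array {
    // Crude URL matcher for illustration; the real extension builds its
    // regexes from the blacklist entries instead.
    preg_match_all( '!https?://[^\s<>"\'\]]+!i', $expandedText, $matches );

    // Map of URL => number of occurrences on the page.
    return array_count_values( $matches[0] );
}
```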

I thought we were filtering the spam urls with the EditFilter hook, or whatever it was named? That's basically what the hook is for.

The [[mw:Extension:ProtectSection|ProtectSection]] extension actually makes interesting use of that filter. It compares the set of protect tags before and after the edit and makes sure they are all still in the page with exactly the same content. That is basically what we're trying to do with the spam blacklist, except we would use the URL regex and ensure that no spam URLs appear in the "after" text that don't appear in the "before" text; and because the comparison grabs individual matches, extra copies of the same spam URL would be counted as extras.
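
For illustration, a minimal sketch of that before/after comparison; $blacklistRegex and the function name are hypothetical stand-ins, not the extension's actual code:

```php
<?php
// ProtectSection-style check applied to spam links: anything the blacklist
// regex matches more often in the new text than in the old text counts as
// an extra, so duplicating an already-present blacklisted link is caught too.
function newBlacklistHits( string $blacklistRegex, string $before, string $after ): array {
    preg_match_all( $blacklistRegex, $before, $b );
    preg_match_all( $blacklistRegex, $after, $a );

    $beforeCounts = array_count_values( $b[0] );
    $extras = [];
    foreach ( array_count_values( $a[0] ) as $url => $count ) {
        if ( $count > ( $beforeCounts[$url] ?? 0 ) ) {
            $extras[] = $url;
        }
    }
    return $extras;
}
```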

To avoid reparsing the page fifty times or getting false negatives/false positives, we check the actual parser results.
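
A rough sketch of checking the parser results instead of the raw text, assuming MediaWiki's ParserOutput::getExternalLinks(); the function itself is hypothetical. Because the recorded links come back keyed by URL, a second copy of an already-present link looks no different from the first, which is the duplication this bug describes:

```php
<?php
// Hypothetical: pull blacklisted URLs out of the links the parser actually
// recorded for the edited revision, instead of regexing the raw wikitext.
function blacklistedLinksInParse( ParserOutput $out, string $blacklistRegex ): array {
    $hits = [];
    // getExternalLinks() is keyed by URL, so duplicate occurrences collapse
    // into a single entry here -- the root of the behavior in this bug.
    foreach ( array_keys( $out->getExternalLinks() ) as $url ) {
        if ( preg_match( $blacklistRegex, $url ) ) {
            $hits[] = $url;
        }
    }
    return $hits;
}
```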

Ok, then I'm just confused as to what Spam Blacklist is actually trying to do.

I thought all we were trying to do was stop users from saving a page with extra spam links, which is what the EditFilter would do.

Define "with" -- that's the hard part! If you just do a regex on the raw text, you'll miss templates, things split over comments, bla bla bla. That's why it's become more and more complex over time, reparsing the text, then pulling data out of preparsed text to reduce the complexity and performance hit and increase reliability.

mike.lifeguard+bugs wrote:

changed summary

He7d3r set Security to None.