
Make AbuseFilter aware of different content models.
Closed, ResolvedPublic

Description

The AbuseFilter extension tries to detect vandalism using, among other things, pattern matching against page content. This is currently targeted at wikitext and does not work well, if at all, for non-text content like Wikidata items.

AbuseFilter needs to be adapted or extended to handle structured content smartly. It could define hooks which the Wikibase extension could then use to implement the desired behavior.


Version: unspecified
Severity: enhancement

Details

Reference
bz42064

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:01 AM
bzimport set Reference to bz42064.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

AbuseFilter needs to be adapted or extended

So shouldn't this be filed under the "AbuseFilter" component then?

(In reply to comment #1)

So shouldn't this be filed under the "AbuseFilter" component then?

Well, technically, a request to add the hook point we need would go into the "AbuseFilter" component, while the ticket for implementing the desired behavior would stay here. But I think we can omit the former.

I think this should be upgraded in priority in view of the spamming at http://wikidata-test-repo.wikimedia.de/wiki/Special:RecentChanges, which will surely start affecting the production wiki soon.

Could be broken down into

  • add hooks to intercept and extend the present implementation of AbuseFilter
  • replace non-functional vars with working versions for our content
  • extend the present set of vars (and methods) to accommodate our content

It would be best if Wikibase could run EditFilterMerged hooks on a text representation of the edit. This would also allow for ConfirmEdit to request a captcha solution if a link is added, etc.

You could also run AbortMove on a rename, and ArticleDelete on a delete, but I'm not sure how much that would help.

(In reply to comment #5)

It would be best if Wikibase could run EditFilterMerged hooks on a text
representation of the edit.

There is no actual text representation - we can fake one, but that feels a bit scary.

This would also allow for ConfirmEdit to request a
captcha solution if a link is added, etc.

I don't see a good way to integrate this with the wikibase UI. Have you tried editing on wikidata.org?

Anyway, we don't support free form links, so it's always well known whether a link is added or not. For captchas etc we'll really need a new solution, though - currently, editing is JS-based.

You could also run AbortMove on a rename, and ArticleDelete on a delete, but
I'm not sure how much that would help.

These hooks should be called as normal. Everything that operates on revisions and pages as a whole, not on the content, should Just Work with wikibase items.

(In reply to comment #6)

There is no actual text representation - we can fake one, but that feels a bit
scary.

Faking it in the short term would probably be best: that will keep you from being overrun with simple spam in the next few weeks, and any sysop can add regex rules to flag or block spam patterns.
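As a rough illustration of the short-term idea (fake a text representation, then match regex rules against it), here is a minimal Python sketch. It is not MediaWiki or Wikibase code: the entity shape, the `flatten_entity` helper, and the two patterns are assumptions made up for this example.

```python
import re

# Hypothetical entity shape: terms keyed by section, then by language code.
entity = {
    "labels": {"en": "Berlin", "de": "Berlin"},
    "descriptions": {"en": "Buy cheap pills at http://spam.example"},
}


def flatten_entity(entity):
    """Render an entity as plain text, one 'section/lang: value' line per term."""
    lines = []
    for section in sorted(entity):
        for lang, value in sorted(entity[section].items()):
            lines.append(f"{section}/{lang}: {value}")
    return "\n".join(lines)


# Illustrative rules of the kind a sysop might add; not real filter rules.
SPAM_RULES = [
    re.compile(r"https?://", re.IGNORECASE),
    re.compile(r"cheap\s+pills", re.IGNORECASE),
]


def matching_rules(text):
    """Return the patterns of all rules that match the given text."""
    return [rule.pattern for rule in SPAM_RULES if rule.search(text)]


flat = flatten_entity(entity)
hits = matching_rules(flat)
```

Sorting sections and language codes keeps the fake text stable between edits, so a rule that compares old and new text would see only the real change.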

Long term, it would be much better to have a custom way to write and check rules for free-form text that is not wikitext. This would also help the Article Feedback project. Unfortunately, development for that hasn't even been put onto the priority list for spam fighting tools at the moment.

I don't see a good way to integrate this with the wikibase UI. Have you tried
editing on wikidata.org?

I think it would work just fine. You would get an error on save, letting you know a captcha needs to be solved, and the ConfirmEdit api can be used to show the image to the user. It would take some UI work, but all the tools are available for it.

If you allow anonymous users to add text in wikidata with no captchas, or other ways to prevent bots outright, then I think you're going to have a very difficult time just addressing it from the detection / response side. I hope I'm wrong, but I think it would be worth spending some time designing now before we get flooded.

These hooks should be called as normal. Everything that operates on revisions
and pages as a whole, not on the content, should Just Work with wikibase items.

Good.

After a bit of testing (basically just prodding around) it seems like things sort of work during manual testing of filters, but hits don't get registered during ordinary updates. That could indicate that some important hook isn't run during our update cycle.

The hook EditFilterMerged has been moved to EditFilterMergedContent. That could mean that AF at present is defunct. Use MW_SUPPORTS_EDITFILTERMERGED to switch between alternate hooks.

There could be more changes missing.

(In reply to comment #9)

Use MW_SUPPORTS_EDITFILTERMERGED to switch
between alternate hooks.

actually, no. use MW_SUPPORTS_CONTENTHANDLER.
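To make the suggestion concrete, here is a small Python sketch of choosing the hook by feature flag. In PHP this would presumably be a `defined()` check on the constant; the `DEFINED_CONSTANTS` set and the `edit_filter_hook` function are stand-ins invented for this sketch, not part of MediaWiki.

```python
# Stand-in for the table of defined PHP constants; hypothetical for this sketch.
DEFINED_CONSTANTS = {"MW_SUPPORTS_CONTENTHANDLER"}


def edit_filter_hook(defined=DEFINED_CONSTANTS):
    """Pick the hook to register for, depending on ContentHandler support."""
    if "MW_SUPPORTS_CONTENTHANDLER" in defined:
        # Newer core: the Content-object aware hook is available.
        return "EditFilterMergedContent"
    # Older core: fall back to the text-based hook.
    return "EditFilterMerged"
```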

Change I628b7c2d: (Bug 42064) Added an additional onEditFilterMerged for entities [DO NOT MERGE]

This issue and the best approach to tackle it was discussed on wikitech-l: http://www.gossamer-threads.com/lists/wiki/wikitech/317401

I began sketching out how I think this could be implemented over the weekend. It would be great if all of the logic could be done in the existing hooks, since that is how AF expects the filter to be extended. To take Matt's example on wikitech-l, I'm pretty sure you could implement Bayesian filtering using the existing hooks.

But, as pointed out, Wikidata doesn't operate on an EditPage, which the current hook handler uses to get information for the rules (title, previous revisions, etc), and also to re-show the edit page if the filter rejects the change. So AF does need a new hook handler for doing its work without the EditPage.

How about AbuseFilter defines an interface for objects it can filter, and Wikibase can make any objects that need to be filtered implement that interface (either formally or informally), so AbuseFilter does the work, but the knowledge it needs about Wikibase can be generalized to other projects as well?

Hm... such an Interface would mean a hard dependency on AbuseFilter. And what information would it provide? AbuseFilter uses a wide variety of "variables" (I would call them "features"), plus a mechanism for only calculating these on demand. To provide full flexibility, the interface would have to be very generic, perhaps only defining the method getEditVars similar to the one in AbuseFilter.class.php.
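A very generic interface of that kind, with variables computed only on demand, might look roughly like the Python sketch below. The `LazyVars` holder, the `ItemContent` class, and the variable names `new_text` and `new_size` are assumptions for illustration; this is not how AbuseFilter.class.php is actually written.

```python
class LazyVars:
    """Named variables whose values are computed on first access and cached."""

    def __init__(self):
        self._loaders = {}
        self._cache = {}
        self.computed = []  # names that were actually computed, in order

    def set_lazy(self, name, loader):
        """Register a zero-argument callable that produces the variable."""
        self._loaders[name] = loader

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.computed.append(name)
        return self._cache[name]


class ItemContent:
    """Hypothetical filterable object exposing a single generic method."""

    def __init__(self, labels):
        self.labels = labels

    def get_edit_vars(self):
        variables = LazyVars()
        variables.set_lazy("new_text", lambda: "\n".join(self.labels.values()))
        # new_size depends on new_text, which is pulled in only when needed.
        variables.set_lazy("new_size", lambda: len(variables.get("new_text")))
        return variables
```

A filter that never reads `new_size` never pays for it, which is the point of calculating these on demand.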

But that doesn't solve the question of when and how to trigger AbuseFilter when saving an update to a Wikibase Entity. We still need an alternative to EditFilterMerged for that.

I started to work on this issue yesterday, and came up with the following idea:

  1. modify the EditFilterMergedContent hook to be more generic, so it can be called from EditEntity as well as from EditPage. The most important change would be to pass an IContextSource as the first parameter instead of the EditPage. That already provides most information needed, including the WikiPage object. Modifying hooks is of course not to be taken lightly, but EditFilterMergedContent was only introduced in 1.21 as a Content-object aware alternative to the now deprecated EditFilterMerged. No extension is using it yet (AbuseFilter was due to be changed to use the new hook anyway).
  2. make EditEntity call EditFilterMergedContent.
  3. Make AbuseFilter use EditFilterMergedContent instead of EditFilterMerged.
  4. Add hook points to AbuseFilter that allow other extensions to control how "variables" are extracted from the present Content object.
  5. make Wikibase provide handlers for these hooks that implement the variable extraction suitable for Wikibase Entities. A quick-and-dirty way would be to implement a hook function for generating fake wikitext from an Entity.
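The flow in the list above can be sketched in Python as a toy hook registry. Only the hook name EditFilterMergedContent comes from this discussion; the `ExtractFilterVars` hook name, the dict-based context and content, the `save` function, and the "spam" check are hypothetical placeholders, and real MediaWiki hook signatures differ.

```python
from collections import defaultdict

HOOKS = defaultdict(list)  # hook name -> list of handler callables


def register(name, handler):
    HOOKS[name].append(handler)


def run_hook(name, *args):
    """Call handlers in order; stop and return False once one returns False."""
    for handler in HOOKS[name]:
        if handler(*args) is False:
            return False
    return True


def wikibase_vars(content, variables):
    """Step 5: Wikibase-side handler that fakes text for entity content."""
    if content["model"] == "wikibase-item":
        variables["new_text"] = "\n".join(content["labels"].values())


def abuse_filter(context, content, status):
    """Steps 3 and 4: filter-side handler that works from context and content."""
    variables = {"user": context["user"], "title": context["title"]}
    if content["model"] == "wikitext":
        variables["new_text"] = content["text"]
    # Let other extensions supply variables for their own content models.
    run_hook("ExtractFilterVars", content, variables)
    if "spam" in variables.get("new_text", "").lower():
        status["errors"].append("abusefilter-disallowed")
        return False
    return True


register("ExtractFilterVars", wikibase_vars)
register("EditFilterMergedContent", abuse_filter)


def save(context, content):
    """Steps 1 and 2: any editor can run the same hook; no edit page is needed."""
    status = {"errors": []}
    ok = run_hook("EditFilterMergedContent", context, content, status)
    return ok, status["errors"]
```

Neither side imports the other here: the filter only knows the hook names, and the entity code only knows which variables it is expected to fill in.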

This architecture keeps the two extensions largely independent, though it requires Wikibase to have some knowledge about AbuseFilter. This kind of cross-extension knowledge seems to be unavoidable; we will probably also provide special hook handlers for MWSearch and perhaps some more extensions. It would be cleaner to have this "glue code" in a separate extension, but that would be annoying overhead.

What do you think, Chris? Does that work for you, or would you prefer using a specialized interface? I suppose instead of implementing a handler for an AbuseFilter::getEditVars hook, EntityContent could implement a function called getAbuseFilterVars, and AbuseFilter could check for that instead of calling the hooks... nicer from an OO perspective, but not How Things Are Generally Done in MediaWiki.
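For comparison, the two options could coexist along these lines. This Python sketch is only an assumption about how such a check might look: `collect_vars`, `get_abuse_filter_vars`, and both content classes are invented for the example.

```python
def collect_vars(content, hook_handlers):
    """Prefer a method on the content object; otherwise fall back to hooks."""
    method = getattr(content, "get_abuse_filter_vars", None)
    if callable(method):
        return method()
    variables = {}
    for handler in hook_handlers:
        handler(content, variables)
    return variables


class EntityContent:
    """Hypothetical content class that opts in by providing the method."""

    def get_abuse_filter_vars(self):
        return {"new_text": "label: Berlin"}


class PlainContent:
    """Hypothetical content class that knows nothing about filtering."""

    text = "hello"


def text_handler(content, variables):
    """Hook-style handler used for content without the method."""
    variables["new_text"] = content.text
```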

I would be fine with your proposal, with just a couple minor modifications. It's definitely more "How Things Are Generally Done in MediaWiki".

Part of me would like to use this as an opportunity to take steps towards reworking spam handling in core. But even if we change how we do spam handling in the future, I think your plan is simple enough that we're not going to code ourselves into a corner or waste too much time getting the protection for wikidata in now.

  1. modify the EditFilterMergedContent hook to be more generic, so it can be called from EditEntity as well as from EditPage. The most important change would be to pass an IContextSource as the first parameter instead of the EditPage. That already provides most information needed, including the WikiPage object. Modifying hooks is of course not to be taken lightly, but EditFilterMergedContent was only introduced in 1.21 as a Content-object aware alternative to the now deprecated EditFilterMerged. No extension is using it yet (AbuseFilter was due to be changed to use the new hook anyway).

I agree. May need to ask on wikitech-l to make sure no one has started relying on it, but it's probably fine.

  2. make EditEntity call EditFilterMergedContent.

Yep.

  3. Make AbuseFilter use EditFilterMergedContent instead of EditFilterMerged.

This would probably be "in addition to". Since Debian has chosen 1.19 for LTS, I'm hoping to keep AbuseFilter compatible as long as possible.

  4. Add hook points to AbuseFilter that allow other extensions to control how "variables" are extracted from the present Content object.

Yep.

  5. make Wikibase provide handlers for these hooks that implement the variable extraction suitable for Wikibase Entities. A quick-and-dirty way would be to implement a hook function for generating fake wikitext from an Entity.

Yep. It would be great if the EditFilterMergedContent handler could do most of the same handling that the EditFilterMerged handler currently does (it will have to, if we want article editing to call that hook, instead of the current one), but everything else would need to be provided as a variable from the hooks.

So it sounds like we have a way forward!

That was a bit too fast... still requires changes in Wikibase repo

I have now implemented the changes I suggested in #16:

  • Change I99a19c93: (bug 42064) Make EditFilterMergedContent more generic.
  • Change I5f5b4200: (bug 42064) Call EditFilterMergedContent from EditEntity.
  • Change Ibb9d4c9a: (bug 42064) AbuseFilter + EditFilterMergedContent.

The last one of these makes AbuseFilter use EditFilterMergedContent instead of EditFilterMerged, if support for the ContentHandler is present. Some changes had to be made, because EditFilterMergedContent does not provide an EditPage object.

One important change is how and when errors are reported and the edit form is shown. Instead of calling EditPage::spamPage, AbuseFilter now returns a Status object to the hook's caller, containing all the necessary information to show a message to the user.
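The benefit of returning a status instead of rendering the error directly can be shown with a small Python sketch. The `Status` class here is a simplified stand-in written for this example, not MediaWiki's Status class, and the link check and both caller functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    """Simplified stand-in: collects error keys for the caller to present."""

    errors: list = field(default_factory=list)

    def fatal(self, key):
        self.errors.append(key)

    def is_ok(self):
        return not self.errors


def filter_edit(text):
    """The filter reports what went wrong and leaves presentation to the caller."""
    status = Status()
    if "http://" in text:
        status.fatal("abusefilter-disallowed")
    return status


def edit_page_ui(text):
    """A form-based caller re-shows the edit form with the message."""
    status = filter_edit(text)
    if status.is_ok():
        return "saved"
    return "show edit form with: " + ", ".join(status.errors)


def api_ui(text):
    """A JS or API-based caller returns the same information as data."""
    status = filter_edit(text)
    return {"success": status.is_ok(), "errors": status.errors}
```

Because the filter no longer assumes an edit form exists, the same check can serve both a form-based editor and a JS-based one.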

Verified in Wikidata demo sprint 30

Per http://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Large_increase_in_spam , we badly need the filter to be able to catch spam in Wikidata items (and soon, properties as well).

AFAIK this is going to go live on Monday, February 4 with 1.21wmf9. See https://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap