Page MenuHomePhabricator

AbuseFilter not setting utf-8 flag
Closed, ResolvedPublic

Description

Author: bhartshorne

Description:
The wiki page http://wikitech.wikimedia.org/view/Text_storage_data shows the change in the number of text storage objects of each type over the last 18 months. Some of those object types are changing when they shouldn't be. As written in that page, "The rise in "object/historyblobcurstub" doesn't really make sense. The rise in "gzip,external/simple pointer" is concerning."

Find out why they're changing and fix it.

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:59 PM
bzimport set Reference to bz32478.
bzimport added a subscriber: Unknown Object (MLST).

object/historyblobcurstub entries should only exist on old stuff that referenced the MediaWiki 1.4-era "cur" table.

You should absolutely *never* see an increase in those objects unless something weird is happening like existing ones are getting duplicated somehow in the course of trying to merge data...

It should be possible to compare their contents to see if the new ones are unique or duplicates?

'gzip,external/simple pointer' sounds like the default that gets created when saving a new revision, so if there's editing activity and in the absence of anything else I'd expect those to go up over time.

Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that might be a bit worrying actually. :)

Shouldn't occur on new entries unless there's some config special case off the top of my head. Whether those entries are problematic or not depends on what the $wgLegacyEncoding setting is on the sites those blobs belong to.

(In reply to comment #2)

Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that
might be a bit worrying actually. :)

Shouldn't occur on new entries unless there's some config special case off the
top of my head. Whether those entries are problematic or not depends on what
the $wgLegacyEncoding setting is on the sites those blobs belong to.

This is interesting. I went to inquiry about a just-produced one (eswiki)

+----------+---------------------+

old_idold_flags

+----------+---------------------+

51956028utf-8,gzip,external
51956027utf-8,gzip,external
51956026utf-8,gzip,external
51956025utf-8,gzip,external
51956024utf-8,gzip,external
51956023gzip,external
51956022utf-8,gzip,external
51956021utf-8,gzip,external
51956020utf-8,gzip,external
51956019utf-8,gzip,external

+----------+---------------------+

It turns out it doesn't (apparently) have revision:

select rev_id, rev_text_id, old_flags from revision join text on (rev_text_id=old_id) where rev_id <= 51506304 order by rev_id desc limit 10;

+----------+-------------+---------------------+

rev_idrev_text_idold_flags

+----------+-------------+---------------------+

5150630451956026utf-8,gzip,external
5150630351956025utf-8,gzip,external
5150630251956024utf-8,gzip,external<--
5150630151956022utf-8,gzip,external<--
5150630051956021utf-8,gzip,external
5150629951956020utf-8,gzip,external
5150629851956019utf-8,gzip,external
5150629751956018utf-8,gzip,external
5150629651956016utf-8,gzip,external
5150629551956015utf-8,gzip,external

+----------+-------------+---------------------+

Special:Recentchanges doesn't show anything suspicious around those two entries.

There were three Abusefilter hits at that time, Especial:AbuseLog/1077805-1077807 and it more or less correlates with the number of gzip,external entries.

AbuseFilter is indeed storing items in text table, *and not setting utf-8 flag for them*.
So I think all of them will be AbuseFilter hits, which are utf-8 but not marked as such, not content in legacy encoding.

Yes, select afl_var_dump from abuse_filter_log order by afl_id desc confirms that suspicion.

Good catch, Platonides!

Looks like this is where AbuseFilter stores variable state dumps on filter matches so they can be checked out later.

AbuseFilter::storeVarDump() and AbuseFilter::loadVarDump() are manually using the text table and ExternalStore -- this probably should be using a common interface underneath Revision's use of the same.

Since it doesn't use the Revision code paths right now the var dumps won't be run through $wgLegacyEncoding conversion on load, but if they get refactored into a common code path, that conversion might start happening before a refactored AbuseFilter::loadVarDump gets its data back.

This should not be fatal but would at least cause those old entries to display incorrectly on some sites[1] where non-ASCII chars were contained in the text.

[1] from http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
'wgLegacyEncoding' => array(

'enwiki' => 'windows-1252',
'dawiki' => 'windows-1252',
'svwiki' => 'windows-1252',
'nlwiki' => 'windows-1252',

'dawiktionary' => 'windows-1252',
'svwiktionary' => 'windows-1252',

'default' => false,

),
// all other sites will not attempt conversion

A bigger worry is that batch recompression or other maintenance work might try to renormalize those entries in a way that AbuseFilter's code path doesn't recognize.

Switching to using a common code path would avoid having to worry about new data formats (or saving the wrong data format, as it currently does!) but we'd have to fix the old data entries or devise a workaround.

Fixing those should be easy enough. What about also adding to them an abusefilter flag?
That way at least maintenance scripts could easily recognise and skip them.

Daimona moved this task from Backlog to Internal bugs on the AbuseFilter board.
Daimona subscribed.

So now we have a great opportunity to fix this, which is T213006. The task outlines the tasks of a new script used to clean afl_var_dump and related, and since this is strongly related, we should handle it as well.
My only question is how to fix it. Should we add the 'utf-8' flag to every entry, or what else? Just let me know and I can amend the patch.

Daimona raised the priority of this task from Medium to High.Mar 30 2019, 11:48 AM

Blocking other high priority tasks.

JTannerWMF subscribed.

According to the Maintaners page, the Abusefilter is the responsibility of the Core Platform team.

So now we have a great opportunity to fix this, which is T213006. The task outlines the tasks of a new script used to clean afl_var_dump and related, and since this is strongly related, we should handle it as well.
My only question is how to fix it. Should we add the 'utf-8' flag to every entry, or what else? Just let me know and I can amend the patch.

Is AbuseFilter still not adding the utf8 flag or does this refer only to old data? If it's the latter then just adding the flag to old data is fine.

cc @Anomie && @tstarling

@mobrovac Yes, it is, nothing has changed on this side since 2011. Since the plan is to fix the DB storage format and add new back-compat code, we can also start adding the utf-8 flag as part of T213006. The point is: is adding the flag enough?

Since the plan is to fix the DB storage format and add new back-compat code, we can also start adding the utf-8 flag as part of T213006.

+1 to that, provided the flag is added only where there no other locales already set.

The point is: is adding the flag enough?

Good question. Intuitively, I'd say yes if we plan on running the back-filling script every now and again, but that's not sustainable. I'll defer the real, complete answer to @Anomie and @tstarling which know more about the the internal schemae used in AF.

I've looked at this again. To the best of my understanding, unconditionally add the flag to new rows, and adding it retroactively should be enough. If this is correct, I can amend the maintenance script to do so.

Change 538625 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Set the utf-8 flag for var dumps in the text table

https://gerrit.wikimedia.org/r/538625

Change 482499 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add a maintenance script to clean afl_var_dump

https://gerrit.wikimedia.org/r/482499

daniel lowered the priority of this task from High to Medium.Sep 30 2019, 2:57 PM

Change 538625 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Set the utf-8 flag for var dumps in the text table

https://gerrit.wikimedia.org/r/538625

Change 482499 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add a maintenance script to clean afl_var_dump

https://gerrit.wikimedia.org/r/482499