excessive autopatrol log entries on wikidata
Closed, Duplicate, Public

Description

Normally only new pages are available for patrolling or autopatrol, but I see many entries in the patrol log where bots create a new claim and this is marked as autopatrolled (since bots have the autopatrol flag). An example:

10:39, 19 April 2013 VIAFbot (talk | contribs | block) automatically marked revision 27230780 of page Colin McWilliam (Q5145398) patrolled

You can see in the history of the article,
http://www.wikidata.org/w/index.php?title=Q5145398&action=history
that this edit added a claim but certainly did not create the page. Here's the claim creation itself:
http://www.wikidata.org/w/index.php?title=Q5145398&oldid=27230780&diff=prev

The side effect of this is that the logging table is filled with these things. It's already up to almost 27 million log entries, the vast majority of them bots marking themselves as autopatrolled. In comparison, en wp has around 48.5 million log entries, and it's been running a whole lot longer with a much larger editor base.

If there is some compelling reason for having this patrol setup, then it should at least be documented in giant letters someplace obvious.

See also: T19237

Update: The wikidata logging table is now up to 606,370,402 rows, as of January 2018.
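To put the two figures above in perspective, here is a rough back-of-the-envelope sketch of the average growth rate. It assumes "almost 27 million" means 27,000,000 rows on 19 April 2013 and picks 10 January 2018 as the day of the January 2018 count; both are assumptions for illustration, not measured values.

```python
from datetime import date

# Figures quoted in the description. The exact start count and the end
# date are assumptions (the update only says "January 2018").
ROWS_APRIL_2013 = 27_000_000        # "almost 27 million"
ROWS_JANUARY_2018 = 606_370_402
START = date(2013, 4, 19)
END = date(2018, 1, 10)             # assumed day within January 2018


def average_daily_growth(rows_start, rows_end, start, end):
    """Return the mean number of rows added per day between two dates."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    return (rows_end - rows_start) / days


if __name__ == "__main__":
    rate = average_daily_growth(ROWS_APRIL_2013, ROWS_JANUARY_2018, START, END)
    print(f"about {rate:,.0f} new log rows per day on average")
```

Under these assumptions the table grew by several hundred thousand rows per day on average, which is the order of magnitude the rest of this discussion is concerned with.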

Details

Reference
bz47415

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:20 AM
bzimport set Reference to bz47415.
bzimport added a subscriber: Unknown Object (MLST).

For reference, http://bugzilla.wikimedia.org/41907 is the request for enabling RC patrol on wikidata

RC patrol is useful for anyone trying to fight vandalism since it immediately removes trusted or already checked edits.

I think it would be more useful to remove useless log entries like "automatically marked revision 27230780 of page Colin McWilliam (Q5145398) patrolled" which I doubt anyone checks or cares about.

pinkampersand.wikimedia wrote:

Echoing Lego's comment, if there's a way to turn off the log function for *automatic* patrolling, but not manual, that'd be great. In fact, even without any server-related concerns that'd be great, since a log entry accompanying every edit essentially makes the patrol logs impossible to navigate. (There are a handful of circumstances where it's important to know who patrolled a page, e.g. if an obviously vandalistic page has been marked as patrolled and you want to know who did that, so you can explain things to them or pull their rights if necessary.)

Related URL: https://gerrit.wikimedia.org/r/62785 (Gerrit Change Ic999454d001c38dea08746d1e8184f0163cb7330)

I'm not sure what the point is of disabling auto patrol logging entirely. That means patrolling tools will be unable to discover the log entry for a patrolled edit.

If "claim" should not be subject to auto-patrolling or patrolling, then that should be disabled instead.

Solving the "claim" auto patrol log problem, by disabling the logging for it entirely seems an odd way to solve the problem.

OK, I take your point. OTOH is there any real need for patrolling tools to discover lots and lots of log entries for autopatrolled bot entries? Maybe these can be excluded from the log and the rest left in.

pinkampersand.wikimedia wrote:

(In reply to comment #5)

I'm not sure what the point is of disabling auto patrol logging entirely. That
means patrolling tools will be unable to discover the log entry for a patrolled
edit.

I'm sorry, but I don't quite follow. When do you ever need to see the log entries for *auto*patrolled edits? Don't they just duplicate the page history?

[Speaking with my "Product Manager for Admin Tools" hat on.]

(In reply to comment #5)

I'm not sure what the point is of disabling auto patrol logging entirely.

The point is to save Wikidata from falling over because the DB can't scale. (Note, BTW, that the proposal is only to disable autopatrol logging for Wikidata, not other wikis; you can see the default setting for MW itself in the commit.)

That means patrolling tools will be unable to discover the log entry for
a patrolled edit.

Indeed. We have lost a lot of MW core functionality over the years because of our inability to design a system that can scale arbitrarily; this is not the first, and sadly won't be the last.

If "claim" should not be subject to auto-patrolling or patrolling, then that
should be disabled instead.

The fault is not with Wikibase (which uses the entirely-reasonable concept of letting wiki users edit things in the same way as on core MW), but with MW core's design not being thought-through in terms of scalability. We already know that the revisions table's growth is a problem; patrolling logs cause a second table to also be a problem.

Solving the "claim" auto patrol log problem, by disabling the logging for it
entirely seems an odd way to solve the problem.

I appreciate that this is disruptive for users of the patrolling logs, most notably the CVU tools, but this is a change made for site stability, and we must accept it.

(In reply to comment #8)

[Speaking with my "Product Manager for Admin Tools" hat on.]

(In reply to comment #5)

Solving the "claim" auto patrol log problem, by disabling the logging for it
entirely seems an odd way to solve the problem.

I appreciate that this is disruptive for users of the patrolling logs, most
notably the CVU tools, but this is a change made for site stability, and we
must accept it.

Can we set up a "rolling" log instead? Like have the log entries vanish after 30 days (recentchanges table length). After that point you can't tell whether the edit was patrolled or not, so it would be pointless to know who patrolled it.

This has the advantage of not breaking anything (hopefully), and being able to provide the necessary features that patrolling does, while still reducing the log table.
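A minimal sketch of what such a rolling log could look like, using an in-memory SQLite table. The `patrol_log` table, its columns, and the `prune_patrol_log` helper are hypothetical and only loosely modelled on a log table; this is not MediaWiki's actual schema or maintenance code.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # mirrors the recentchanges window mentioned above


def make_demo_db():
    """Create an in-memory table standing in for a patrol log (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE patrol_log ("
        " log_id INTEGER PRIMARY KEY,"
        " log_timestamp TEXT NOT NULL,"  # ISO 8601, second precision
        " rev_id INTEGER NOT NULL,"
        " is_auto INTEGER NOT NULL)"     # 1 for autopatrol, 0 for manual
    )
    return db


def add_entry(db, when, rev_id, is_auto):
    """Insert one patrol log row with a consistently formatted timestamp."""
    db.execute(
        "INSERT INTO patrol_log (log_timestamp, rev_id, is_auto) VALUES (?, ?, ?)",
        (when.isoformat(timespec="seconds"), rev_id, int(is_auto)),
    )


def prune_patrol_log(db, now, retention_days=RETENTION_DAYS):
    """Delete entries older than the retention window; return how many were removed."""
    cutoff = (now - timedelta(days=retention_days)).isoformat(timespec="seconds")
    # ISO 8601 strings in one fixed format sort chronologically.
    cur = db.execute("DELETE FROM patrol_log WHERE log_timestamp < ?", (cutoff,))
    db.commit()
    return cur.rowcount
```

Running `prune_patrol_log` periodically would keep the table bounded by roughly 30 days of activity, at the cost of losing older patrol history.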

(In reply to comment #9)

Can we set up a "rolling" log instead? Like have the log entries vanish after
30 days (recentchanges table length). After that point you can't tell whether
the edit was patrolled or not, so it would be pointless to know who patrolled
it.

This has the advantage of not breaking anything (hopefully), and being able
to provide the necessary features that patrolling does, while still reducing
the log table.

I'm hesitant to set up an entire separate logging structure just for patrolling. At that point we might as well just make patrolling a recentchange itself and log it to the recentchanges table.

(In reply to comment #8)

[Speaking with my "Product Manager for Admin Tools" hat on.]

(In reply to comment #5)

I'm not sure what the point is of disabling auto patrol logging entirely.

The point is to save Wikidata from falling over because the DB can't scale.
(Note, BTW, that the proposal is only to disable autopatrol logging for
Wikidata, not other wikis; you can see the default setting for MW itself in
the commit.)

The fault is not with Wikibase (which uses the entirely-reasonable concept of
letting wiki users edit things in the same way as on core MW), but with MW
core's design not being thought-through in terms of scalability. We already
know that the revisions table's growth is a problem; patrolling logs cause a
second table to also be a problem.

How come this is a problem new with Wikidata? We have close to 1,000 wikis with many thousands of wiki-admins, stewards, bots, reviewers, rollbackers, patrollers etc., all of whom make lots of edits that are autopatrolled.

(In reply to comment #7)

I'm sorry, but I don't quite follow. When do you ever need to see the log
entries for *auto*patrolled edits? Don't they just duplicate the page
history?

Yes, on a healthy wiki every revision would have a patrol entry at some point (either autopatrol or patrol by another user). This is nothing new.

I can imagine this being a scalability problem, but I don't see how that only becomes a problem now. And if it is, I imagine we'll need a solution for all other wikis as well (commons, enwiki, ..). Perhaps operations thinks that could be deferred to later, but if this is as important as some people make it seem, I imagine it is as much a problem elsewhere as for wikidata and we'll need a single solution for all very soon.

Is that worth boldly sacrificing the integrity of the database (log entries inconsistently missing for actions taken, when they are usually there for the same action by other users and on all other wikis)?

If "claim" should not be subject to auto-patrolling or patrolling, then that
should be disabled instead.

Solving the "claim" auto patrol log problem, by disabling the logging for it
entirely seems an odd way to solve the problem.

I appreciate that this is disruptive for users of the patrolling logs, most
notably the CVU tools, but this is a change made for site stability, and we
must accept it.

Maybe you misunderstood, but I don't see how this relates to the cited statement. I am suggesting that if "claim" creations should not be reviewed through the patrolling system, what's stopping Wikibase from preventing the patrol entry in the first place? Perform the creation like other unpatrollable actions (such as uploads, which create an unpatrollable recentchanges entry and no autopatrol entry).

I think it would be unfortunate if claims are not patrollable but since that seems already accepted, I'm merely suggesting we don't also disable logging for autopatrols outside this area (e.g. edits to regular pages, talk pages, categories, user pages, project pages etc.)

I am suggesting that if "claim" creations should not be reviewed
through the patrolling system, what's stopping Wikibase from preventing the
patrol entry in the first place?

Claim creation is a regular edit to an Item page. The RC entry is generated upon save, which is not under the control of the Wikibase extension. I suppose we could hack in and try to suppress patrolling based on some magic property of some edits. But I feel this introduces even more inconsistency (why do some edits require patrolling, and others don't?)

Furthermore, Claim creation/changes by users without the autopatrol right should still be patrolled, so suppressing patrolling for this type of edit is not desired.

Yes, on a healthy wiki every revision would have a patrol entry at some point
(either autopatrol or patrol by another user). This is nothing new.

This is indeed an expectation we would break. But I don't see how, why or where this assumption is important or even relevant. Do you have an example?

So here is (part of) why the situation is different on wikidata than anywhere else.

  1. Wikidata actually has more edits/sec than anywhere, including en wp.
  2. Almost all of those edits are autopatrolled and wind up in the log.
  3. On en wp a much tinier proportion of edits wind up in the log, since they don't use RCPatrol. The number of large projects with RCPatrol on and with lots of bot edits in a short period of time must be, well... one, and that's the one with the issue :-D

If we want RCPatrol to scale then we need to rethink the ever-expanding log; even a 30 day retention is better than what we have now. I still claim that bot edits being autopatrolled and then logged is a waste of resources.

See bug 17237 for a solution based on discussion from Amsterdam Hackathon 2013 between Daniel, Tim and Timo.

+1 from me for that approach. It covers all my concerns.

(In reply to comment #14)

See bug 17237 for a solution based on discussion from Amsterdam Hackathon
2013 between Daniel, Tim and Timo.

So this means that, if needed, we can proceed with this as a temporary hack in the Wikibase extension before we re-work master as part of the to-be-scheduled occasional MW core re-working that Tim agreed to (what are the Ops/growth issues and how quickly can we fix bug 17237?).

(In reply to comment #16)

So this means that, if needed, we can proceed with this as a temporary hack
in the Wikibase extension

Which temporary hack are you referring to? I'm only aware of Ic999454d, which makes logging autopatrol events optional in core. I think we can and should go ahead with that.

For now, the default should be to log autopatrolled events, and this should only be disabled for wikidata.org to avoid flooding the log. Once we have the patrolling info in the revision table, the log entries for autopatrol events are redundant, and might be turned off by default.
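As an illustration of the behaviour described above, here is a small sketch of a per-wiki switch that keeps manual patrol entries but skips autopatrol ones. The `PatrolLogger` class and the `log_autopatrol` setting name are hypothetical stand-ins; the real change is the core patch referenced as Ic999454d, whose implementation is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class PatrolLogger:
    """Toy model of a patrol log with an opt-out for autopatrol entries."""

    # Hypothetical setting name; defaulting to True keeps existing behaviour.
    log_autopatrol: bool = True
    entries: list = field(default_factory=list)

    def mark_patrolled(self, rev_id, user, auto):
        """Record a patrol action; return True if a log entry was written."""
        if auto and not self.log_autopatrol:
            # The revision still counts as patrolled; only the log row is skipped.
            return False
        self.entries.append({"rev_id": rev_id, "user": user, "auto": auto})
        return True


# Default wiki: every patrol action is logged.
default_wiki = PatrolLogger()
default_wiki.mark_patrolled(27230780, "VIAFbot", auto=True)

# A wiki with the switch turned off: only manual patrols reach the log.
quiet_wiki = PatrolLogger(log_autopatrol=False)
quiet_wiki.mark_patrolled(27230780, "VIAFbot", auto=True)
quiet_wiki.mark_patrolled(27230781, "SomePatroller", auto=False)
```

With the switch off, the log records who manually patrolled an edit, which is the case the earlier comments say matters, while bot autopatrols no longer add rows.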

(In reply to comment #16)

So this means that, if needed, we can proceed with this as a temporary hack
in the Wikibase extension before we re-work master as part of the
to-be-scheduled occasional MW core re-working that Tim agreed to (what are
the Ops/growth issues and how quickly can we fix bug 17237?).

The dump-related ops issues can be worked around for now with a functional, if not awesome, hack.

(In reply to comment #17)

(In reply to comment #16)

So this means that, if needed, we can proceed with this as a temporary hack
in the Wikibase extension

Which temporary hack are you referring to? I'm only aware of Ic999454d, which
makes logging autopatrol events optional in core. I think we can and should
go ahead with that.

That is indeed the temporary hack James was referring to (I was sitting next to him when he wrote that). It is temporary because as soon as the _bot and _patrolled fields are moved to the revision table we shall remove logging of autopatrol from core entirely as I'm pretty sure there is no longer an acceptable use-case for them (especially as long as they remain to be logged as the same log_type and log_action as non-auto patrols - ergo it will fix bug 25799). Keeping it around under a feature flag seems pointless and only encourages a bad user experience for patrollers.

For now, the default should be to log autopatrolled events, and this should
only be disabled for wikidata.org to avoid flooding the log. Once we have the
patrolling info in the revision table, the log entries for autopatrol events
are redundant, and might be turned off by default.

As James said, this is acceptable, assuming we've considered the feasibility of adding the patrolling info to the revision table soon enough for wikidata not to explode. That it will happen has pretty much been agreed on already; the open question is whether it is worth doing this temporary hack first (thus semi-permanently losing some data about events in the database) or whether it is feasible to get this revision table change through before the problem becomes critical for wikidata.

So, just to be clear, the plan is to commit a temporary hack into core just so a single WMF wiki can shorten their logs until somebody gets around to properly fixing the bug? That doesn't sound like the cleanest solution.

(In reply to comment #20)

So, just to be clear, the plan is to commit a temporary hack into core just
so a single WMF wiki can shorten their logs until somebody gets around to
properly fixing the bug? That doesn't sound like the cleanest solution.

As both James and I have said, we have a plan in place to address this in a way that is acceptable to us (software developers & product managers) and will not cause problems to users of wikidata and/or users active in the countervandalism network. In fact, it'll make things better and allow for other interesting new features.

However, since making major schema changes requires a significant amount of coordination, database switches, and whatnot, it is not in our hands to make that happen. This is mostly up to platform operations.

So, depending on whether our plan can be executed before wikidata explodes we will have to settle on an intermediate solution. The solution proposed in earlier comments before mine (disabling autopatrol logging on wikidata) is in my opinion not ideal, but it could be worse. I think it is acceptable if and only if it is temporary and only until we finish the larger schema change.

kaldari reopened this task as Open. Jan 10 2018, 2:35 AM
kaldari subscribed.

Reopening this. We merged a new config variable (4 years ago) to let us turn off autopatrol logging on Wikidata, but we never actually turned it off, AFAICT. The logging table on wikidata now has over 600 million entries and we can't add new logging functionality (T12331) for fear of making it worse. The decision to move patrolling records from recentchanges to revision may or may not be a good idea, but it's been 3 years since folks signed off on that and we haven't done it yet. At this point, we need to fish or cut bait rather than letting our tech debt block needed improvements. My suggestion would be to finish what we started and turn off autopatrol logging on wikidata (and possibly commons which is also ballooning). While it's true that one day we may move the patrol flag from recentchanges to revision and may lament that we can't backfill all the records, it's also possible that this won't happen for years (or ever) or that no one will care that all the gazillion patrol records from long ago can't be backfilled into revision (I know I won't). Thoughts?

kaldari renamed this task from creation of new claims (and perhaps other edits) can be (auto)patrolled on wikidata to excessive autopatrol log entries on wikidata. Jan 10 2018, 2:40 AM

It's already become an issue for dumps, see T181935, and while I've implemented a workaround by parallelizing the logs dump job, all such workarounds are temporary until the log table stops growing at this speed. See in particular https://phabricator.wikimedia.org/T181935#3808070