
Implement revision filter by namespace for big wikis in "miser mode"
Closed, ResolvedPublic

Description

Author: romaine.wiki

Description:
Since MediaWiki 1.18 it is no longer possible to select the appropriate namespace in which a user has made edits. We would certainly like that back!

Greetings - Romaine


Version: unspecified
Severity: normal

Details

Reference
bz31197

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 11:55 PM
bzimport set Reference to bz31197.
bzimport added a subscriber: Unknown Object (MLST).

This was done deliberately in r88025 by freakolowsky, and purports to have been requested by Domas.

+9001

The related commit in question is r88025, committed by freakolowsky on 14 May 2011: "hidden namespace select box if in wgMiserMode(requested by domas)"

How on Earth did the servers survive all these years with this feature being enabled -- gasp! -- even on the English Wikipedia? Apparently well enough. I used to use the namespace selection box on Special:Contributions very often and not having it hinders the usability of the whole special page very much, kinda like what r48735 did to Special:RecentChanges -- it was eventually reverted in r56334.

(In reply to comment #1)

This was done deliberately in r88025 by freakolowsky, and purports to have been
requested by Domas.

(In reply to comment #2)

How on Earth did the servers survive all these years with this feature being
enabled -- gasp! -- even on the English Wikipedia? Apparently well enough.

CC Domas so he can comment on this himself.

The special page gets really hard to use without this feature.
The rationale is very poor.
In case we really need to save DB resources, is this less useful and more expensive than "Only show edits that are latest revisions", for instance?
Also, it doesn't make much sense to me to use $wgMiserMode for this: if the documentation is correct, MiserMode just delays functioning of some query special pages, making them less up to date but not totally unusable. Another configuration variable should be used.

(In reply to comment #4)

Also, it doesn't make much sense to me to use $wgMiserMode for this: if the
documentation is correct, MiserMode just delays functioning of some query
special pages, making them less up to date but not totally unusable. Another
configuration variable should be used.

The documentation is wrong, then. $wgMiserMode has been used as a generic "don't do expensive DB queries" setting for a long time.

(In reply to comment #4)

Also, it doesn't make much sense to me to use $wgMiserMode for this: if the
documentation is correct, MiserMode just delays functioning of some query
special pages, making them less up to date but not totally unusable. Another
configuration variable should be used.

Our documentation is slightly outdated on that. It's currently along the lines of "Disable database-heavy services, so that they can be managed/controlled separately if desired (for services that permit it; for example, special page caching)."

I've asked Domas to provide a reason for this change. If we don't hear from him by the end of this week, then I'll suggest that the change be backed out.

Someone should probably get an EXPLAIN of the query against, say, enwiki for reference and log it here.

EXPLAIN won't do the problem justice.

The problem is that currently "show me 50 edits from namespace X" can read 50 database rows, or it can read all of the user contributions and return 0.

It is not possible to index this without denormalizing the dataset (page_namespace has to sit together with all revisions).

e.g., fetching 10 edits for Rambot reads:

mysql> show status like '%handler_read%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     |
| Handler_read_key      | 3     |
| Handler_read_next     | 9     |
+-----------------------+-------+
Now, 10 edits for Rambot, also verifying the page namespace (here 10, the Template namespace):

mysql> select * from revision join page on page_id=rev_page where rev_user_text='Rambot' and page_namespace=10 limit 10;
Empty set (1 min 53.16 sec)

mysql> show status like '%handler_read%';
+-----------------------+--------+
| Variable_name         | Value  |
+-----------------------+--------+
| Handler_read_first    | 0      |
| Handler_read_key      | 138423 |
| Handler_read_next     | 138448 |
+-----------------------+--------+

Fixing this would require an additional revision index and denormalization (page_namespace sitting together with every revision), which would prohibit cross-namespace renames.

Or we can allow multiple-minute queries. As a rule, users with large histories get way more scripted contributions checks :-)
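
To make the shape of the problem concrete, here is a sketch of the two queries being contrasted. The usertext_timestamp index on (rev_user_text, rev_timestamp) matches MediaWiki's revision table of the time; whether the optimizer picks exactly this plan is an assumption:

-- Fast: the index satisfies the LIMIT directly, so the scan stops
-- after about 10 index entries:
mysql> select * from revision where rev_user_text='Rambot'
    -> order by rev_timestamp desc limit 10;

-- Slow: page_namespace lives in the page table, so every candidate
-- revision must be joined and checked, one by one, until 10 matches
-- are found or the user's entire history is exhausted:
mysql> select * from revision join page on page_id=rev_page
    -> where rev_user_text='Rambot' and page_namespace=10 limit 10;

And a sketch of the denormalization described above (rev_namespace is a hypothetical column, not part of the actual schema):

mysql> alter table revision add column rev_namespace int not null default 0;
mysql> create index usertext_ns_timestamp
    -> on revision (rev_user_text, rev_namespace, rev_timestamp);
-- With the namespace on the revision row itself, the filtered query can
-- stop after 10 index entries; the cost is that moving a page to another
-- namespace must now rewrite every one of its revisions.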

(In reply to comment #9)
I won't pretend to completely understand what has been illustrated here, but it would help to put it into simpler words. Does this mean that this particular feature draws on excess resources that have a deleterious effect on some other aspect of the project, or does it mean it's inelegant?

This feature is very heavily used by editors and administrators on a daily basis (I cannot remember an editing day in the last 3 years when I have not used it), and thus it's not just a "nice" feature but one on which many of us rely.

(In reply to comment #10)

(In reply to comment #9)
I won't pretend to completely understand what has been illustrated here, but it
would help to put it into simpler words. Does this mean that this particular
feature draws on excess resources that have a deleterious effect on some other
aspect of the project, or does it mean it's inelegant?

This feature is very heavily used by editors and administrators on a daily
basis (I cannot remember an editing day in the last 3 years when I have not
used it), and thus it's not just a "nice" feature but one on which many of us
rely.

It means that depending on how many edits the targeted user has, and how many of those were in the requested namespace, the query may take anywhere between 1 millisecond and 2 minutes. This is because the database can efficiently retrieve all edits by a certain user, but then has to go through them one by one to check the namespace against the requested namespace until it has either reached the number of qualifying edits requested (usually 50) or reached the end of the list of edits. The latter may take a long time for users with many edits. In the example Domas gave, he asked for 10 edits by Rambot in the Template namespace, and it took almost 2 minutes to examine every single edit by Rambot in the entire history of Wikipedia and conclude there were zero edits matching his criteria. And at our scale, queries taking more than a second are already considered slow.

Do note, my example was something I came up with in a few seconds; I could easily have found a way more interesting edge case ;-)

@"mybugs.mail" - heh, that assumes that the page returns any, but if people query, they want to query something that is deeply buried.

(In reply to comment #13)

@"mybugs.mail" - heh, that assumes that the page returns any, but if people
query, they want to query something that is deeply buried.

Yep!

It would be good to provide some message for the user in that case.

danny.leinad wrote:

If bots are the main problem, could you exclude them from such queries? And maybe allow the query only for autoconfirmed users?

Many people are complaining that you removed this function: http://pl.wikipedia.org/wiki/Wikipedia:Kawiarenka/Kwestie_techniczne#MediaWiki_1.18_-_problemy - it was very useful.

bulwersator wrote:

"And at our scale, queries taking more than a second are
already considered slow."

So the solution is to fix this function (a sketch of such a bounded query follows below):
*look over the last n edits
*kill the query after 10 seconds
*cache the results

It is a very important function, and removing it "because it was slow" is a bad idea; it is similar to banning all interwiki bots due to their large number of edits.
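
A sketch of what such a bounded query could look like (the MAX_EXECUTION_TIME optimizer hint only exists from MySQL 5.7 on, so it is an anachronism for 2011, and the timestamp cut-off is an arbitrary stand-in for "last n edits"):

mysql> select /*+ MAX_EXECUTION_TIME(10000) */ *
    -> from revision join page on page_id=rev_page
    -> where rev_user_text='Rambot'
    -> and page_namespace=10
    -> and rev_timestamp > '20110101000000'  -- only scan a recent window
    -> limit 10;

-- The hint kills the SELECT after 10 seconds (10000 ms) instead of letting
-- it run for minutes; caching the result would then happen at the
-- application layer, outside SQL.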

ForoaW wrote:

I don't think I buy this. Only a minority of users use that feature; by default, most users don't do any filtering. Mainly administrators use that filtering tool, and until now, most of the time, response speed was acceptable. I use it many times a day, but rarely on the distant past, mainly to check bot operations. So I fail to understand why this suddenly became a problem. If the system collapses under the load, I have no problem with a smaller throttle or an exponential back-off being inserted, because we are generally checking the last hours on a regular basis anyway, and waiting a bit longer for the results still results in a time gain for us.

It's been explained why the feature is expensive, but not why it should be disabled, which is the point of the discussion.

There should be some criteria to disable features, otherwise they're disabled randomly. If there are server load problems, was this feature compared to other expensive features to evaluate cost vs. benefit and choose what to disable? Or does "queries taking more than a second are already considered slow" mean that all features which are able to generate queries longer than a second (or any other threshold) will be disabled?
If there are no server load problems, why should the feature be disabled?
Even if there are problems, can't they be resolved through other means and shouldn't their cost be compared to the benefit of the feature?

Last but not least, did someone prove that disabling the feature will actually reduce server load? Users need this (finding edits in specific namespaces), so if you don't allow this feature they'll just load the complete list of edits, 5000 at a time, and search for the namespace within the pages.
In your example: Rambot, 140,000 edits, 28 pages; for the first page I got "Served by mw30 in 14.588 secs", which makes a total of almost 7 minutes. That's page-serving time rather than query time, and I don't know if it's important, but perhaps it should be checked.

Just to reiterate what's been said here... I've been approached by a couple of admins from the English Wikipedia who are dismayed that this feature is gone; it's an important tool in their arsenal. Anything we can do to restore this functionality would be greatly appreciated. They've got a hard job and could use the help.

Thanks

Raising priority since this complaint is heard repeatedly on enwiki. This is a very visible issue.

Note: The namespace filter for revisions (like in Special:Contributions) was not removed from the MediaWiki software. It was changed to be hidden/disabled for large projects that run in so-called "miser mode", which prevents certain database queries that are too slow for such a large project.

Retitling bug to request implementation of namespace filter in such a way that it can even be run on "miser mode" wikis. Depending on the cost/benefit this may not be possible in the short term.

(In reply to comment #9)

Or we can allow multiple-minute queries. As a rule, users with large histories
get way more scripted contributions checks :-)

Is the problem the fact that the multiple-minute query uses too many resources, or is it just that you think people will not like it if the page takes so long to load? I suspect the former, but if it's the latter, it seems clear that people dislike not having the feature available even more.

It is only edge cases that take a lot of time (like Rambot), and the longer the query takes, the more difficult it would be to obtain the results via another approach.

The vast majority of invocations are quick (a few seconds). If we are really worried about preventing these worst-case queries, the feature could be disabled when:

*the subject's user_editcount is 100,000 or more, and
*the invoker doesn't have a new permission like noquerylimit (similar to noratelimit), which would be given to administrators and maybe rollbackers

Any serious user won't be running these obscure queries unless they actually do want the results, and they are prepared to wait, because the alternative is to step through 5000 edits at a time. (A sketch of the editcount pre-check follows below.)
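
A sketch of the proposed pre-check. user_editcount is a real column on MediaWiki's user table; the 100,000 cut-off and the noquerylimit right are this comment's suggestions, and the actual permission test would happen in application code rather than SQL:

mysql> select user_editcount from user where user_name='Rambot';
-- If the count comes back as 100,000 or more and the requester lacks the
-- proposed noquerylimit right, refuse the namespace filter up front
-- instead of running the potentially unbounded query.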

Indeed, would setting a reasonable edit count limit on the target user help as a short-term fix?

And if the problem is only what the user would like, then you could even set a default limit and allow the user to change the number. However, most usage of this tool would be relatively quick anyway - and it would help for so many purposes. For example:

  1. Some users keep track of still-relevant discussions by their contributions in specific namespaces (I do, for example, keep close track of my recent User talk: edits)
  2. At times, various user editing patterns need to be looked at. Since these editing patterns are sometimes namespace-based, we need to be able to keep track of them.

Note that in both of these cases, you don't usually need to look too far down to find 50 contributions in the relevant namespace.

bulwersator wrote:

And maybe it is slow. Maybe this feature is a significant load on Wikimedia's servers (unlikely). But the alternative for the user in the Rambot example is to load the entire editing history (in chunks of 5000 edits) and search for the namespace keywords.

It is probably a larger load on the server (rendering multiple pages), and for the user it requires more work - by multiple orders of magnitude.

bulwersator wrote:

And a hard limit of checking only the last n contributions is not a proper solution; I suggest making this limit configurable and also allowing searches over the entire contributions. noquerylimit is also a possible solution; wikis would be able to add it to all user groups.

danny.leinad wrote:

(In reply to comment #20)

raising priority since this complaint is heard repeatedly on enwiki. This is a
very visible issue.

Nice to hear that only enwiki is important to you :(

PS. I see requests here from representatives of various communities...

matthiasbecker1967 wrote:

If a user has several hundred edits a day, but one needs to filter out, for example, only the few talk page edits that user made, it won't make sense to deliver only the, say, three talk page edits within that user's last 200 edits. Such filtering is near worthless. I don't think it makes sense to filter like this. It should be restored to its former behaviour or shut down entirely.

Maybe we should save resources at other points, for example on features delivering cat images on user pages and other useless features, and *not* on features which are needed, especially in the bigger language versions. But for smaller communities with only a few sysops, who need to browse the recent changes several days back, the new behaviour is a hit below the belt. Administering and fighting vandalism just became harder.

I think the rationale was really poor. If there's an issue with the server it should be solved there and not by removing the feature.

(In reply to comment #29)

I think the rationale was really poor. If there's an issue with the server it
should be solved there and not by removing the feature.

It's been explained time and again that this isn't something that can *possibly* be fixed server-side, the feature is *inherently* slow.

meh, I guess we can enable it, and expect edge cases to be rare. if they are not, well, we can point people at this discussion or go back to 2005 and deploy query snipers ;-)

matthiasbecker1967 wrote:

Domas, no user wants the servers to crash (well, some do, but that's another case), and if things can be optimized they should be. But IMHO the solution cannot be, for example, the toolserver, as some guys on the German WP are currently trying to work out by putting together a tool which makes use of the API - because the toolserver isn't our most reliable system component, and it isn't the fastest.

bulwersator wrote:

"It's been explained time and again that this isn't something that can
*possibly* be fixed server-side, the feature is *inherently* slow."

In the first place: why is "it is/was slow" a valid reason to remove this? Is it using a major part of CPU/RAM/bandwidth/hard drive space/etc.?

Frankly, I saw the commit as a new feature in the first place; I never meant to cause a regression ;)
I saw it as a new feature being added and warned that it might be overly expensive in a large-scale environment.

As it was used before and wasn't killed in an emergency, I guess we can enable it in 1.18.

(In reply to comment #34)

As it was used before and wasn't killed in an emergency, I guess we can enable
it in 1.18.

Done in r99102, r99104.

I was sent here from a thread on mediawiki.org's API talk:Usercontribs page. On the English Wikipedia, in recent days, a simple usercontribs list request with two ucusers has changed from always responding immediately (sub-second) to always timing out. Here's the actual URL I recently tried:
GET /w/api.php?action=query&format=xml&list=usercontribs&ucuser=DavidBrooks%7CDavidBrooks-AWB&uclimit=20&ucnamespace=0&ucprop=title%7Ctimestamp%7Ccomment%7Cflags&continue= HTTP/1.1

The user (an IP) who commented on this suggested a relationship with the change on this thread. Am I seeing a real, permanent change? If so, I'll have to merge the results client-side.

ETA: now that I've actually read the title of this thread, should I just remove the namespace parameter and filter namespaces client-side?
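
For reference, the client-side merge described above would just mean issuing one request per ucuser and combining the results by timestamp on the client; these two URLs are nothing more than the quoted request with one user name removed from each:

GET /w/api.php?action=query&format=xml&list=usercontribs&ucuser=DavidBrooks&uclimit=20&ucnamespace=0&ucprop=title%7Ctimestamp%7Ccomment%7Cflags&continue= HTTP/1.1
GET /w/api.php?action=query&format=xml&list=usercontribs&ucuser=DavidBrooks-AWB&uclimit=20&ucnamespace=0&ucprop=title%7Ctimestamp%7Ccomment%7Cflags&continue= HTTP/1.1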

The user (an IP) who commented on this suggested a relationship with the change on this thread.

Your issue isn't really related to this task. But I suspect it'll be fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/461440, when that gets merged.

ETA: now that I've actually read the title of this thread, should I just remove the namespace parameter and filter namespaces client-side?

That would work.

Your issue isn't really related to this task. But I suspect it'll be fixed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/461440, when that gets merged.

It does look possible. Would it be helpful to offer the URL that I cited as evidence in that thread? And it's not exactly the same; that bug says that a query even with one user is slow. I find the difference between one user and two users to be completely binary: immediate versus never.

ETA: now that I've actually read the title of this thread, should I just remove the namespace parameter and filter namespaces client-side?

That would work.

Sadly it doesn't. I'll wait for the fix.