Page MenuHomePhabricator

'category remove' regression when category contains only files
Closed, ResolvedPublic

Description

(Creating bug from https://gerrit.wikimedia.org/r/#/c/149645/)

In May 2014 this changeset broke 'category.py remove' for categories with only files and no pages.
https://gerrit.wikimedia.org/r/#/c/130301/

It is also a significant performance regression as 'categoryinfo' requires the server to fetch all of the category members, which is slow, only to return counts, when category.py needs to iterate over the category members, so it must hit the server again.


Version: core-(2.0)
Severity: normal

Details

Reference
bz68705

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:42 AM
bzimport set Reference to bz68705.
bzimport added a subscriber: Unknown Object (????).

I'm confused: I don't see a possibility to get the counts without hitting the server. And the server must be hit at least once to get the pages or categories in the “worst case”.

So the previous handling without using 'categoryinfo' seems to be the better solution. Of course it could be written a bit more pythonic by simply using

articles = cat.articles()
if articles:

  1. remove them

else:

  1. "complain"

Even the change in 'CategoryTreeRobot' should be reverted because the database should be the fasted way to get the number of articles in a category.

By the way database: This seems the only way to avoid hitting the server in the “best case” (|articles| == 0). But 'CategoryMoveRobot' doesn't use the database and it's only used for read-only operations.

But the changes above line 644 (old)/642 (new) look fine to me.

Now of course I'm fairly new to the bot's internals, so my assessment is probably off.

Your assessment is exactly the same as mine. CC:ing the people involved in the previous changeset in case they can see some problem with our assessment.

I didnt +2 your changeset because it looks to me like the previous changeset needs to be mostly reverted, except for the formatting changes. Do you want to put up a 'revert' patch for review?

(Yes, the server must be hit at least once. Invoking cat.categoryinfo is the unnecessary performance regression, as the client logic needs all contents, so it isnt useful to ask the server to compute the counts when the client can do it.)

I believe the database is only used for read-only operations because it isnt a significant problem if the read-only operations use stale data. Write operations need to use fresh data to minimise errors.

cat.articles() is a generator and bool(cat.articles()) is always True.

You may try a simple performace test like:

import pwb, pywikibot as py
s = py.Site()
c = py.Category(s, 'Mann')
c.categoryinfo['pages']

456626

You'll get the result above immediately

g = c.articles()

len(set(g))

You may wait several minutes to get a result. I've given up after 5.

(In reply to xqt from comment #3)

cat.articles() is a generator and bool(cat.articles()) is always True.

You may try a simple performace test like:

import pwb, pywikibot as py
s = py.Site()
c = py.Category(s, 'Mann')
c.categoryinfo['pages']

456626

You'll get the result above immediately

I see the server doesnt spend much time working on this, as it loads the data from the category table.

https://www.mediawiki.org/wiki/Manual:Category_table

However it is another API call on the server, which adds more network latency (think someone in the jungle of Borneo on a 3G modem), server workload and delays (an issue on slower wikis), etc.

g = c.articles()

len(set(g))

You may wait several minutes to get a result. I've given up after 5.

len(set(g)) is obviously not the best way to determine if the generator has any results.

The best way is, I think, to process the generator with an else block to detect if the generator had no results. There are other tricks to instantaneously determine if the generator is empty.

Okay my mistake that 'bool(cat.articles)' return true if it's not None. But you could then do something like:

empty = True
for page in cat.articles:

empty = False

if empty:

  1. complain

Now I'm sure if this is slower (but it doesn't appear so). And maybe there is a more pythonic way. The for-else does not differ between an empty list and a non empty.

(In reply to Fabian from comment #5)

The for-else does not differ between an empty list
and a non empty.

This will work

for i, page in enumerate(generator, 1):
    ....
else:
    if 'i' not in locals():
        ...

(I dont like locals() being used)

There are many other ways to do it, including wrapping the generator in a generator class that allows peaking.