
testExturlusage takes forever on test.wikipedia
Closed, Resolved · Public

Description

testExturlusage uses:

for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5):
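
For context, a self-contained version of that call looks roughly like the following (a minimal sketch; the Site arguments and the print body are illustration, not part of the test itself):

import pywikibot

# test.wikipedia, the site this report is about (assumed local config).
mysite = pywikibot.Site('test', 'wikipedia')
# Ask for at most five User/User-talk pages linking to www.google.com.
for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5):
    print(link.title())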

This returns quickly on test.wikidata, as there is very little matching data:

https://test.wikidata.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com

All of the other Travis CI build platforms also return the five requested records in a reasonable time.

test.wikipedia, on the other hand, has a lot of matching data:

https://test.wikipedia.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com

On test.wikipedia, PageGenerator yields four results after a few API calls; after the fourth result, however, it has backed off to requesting data with geulimit=1, resulting in the following request/response sequence:

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'20'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 21}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'21'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 22}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'22'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 23}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'23'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 24}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

It then continues iterating, seemingly forever (I killed it after 10 minutes).


Version: core-(2.0)
Severity: normal

Details

Reference
bz72209

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:47 AM
bzimport added a project: Pywikibot-tests.
bzimport set Reference to bz72209.
bzimport added a subscriber: Unknown Object (????).

There are more results much further on; I do not know if you reached them.

Try different values of geuoffset (e.g. 10000) in the link below.
Between offsets 12000 and 12500 the results run out.

https://test.wikipedia.org/w/api.php?inprop=protection&geuprotocol=http&maxlag=5&generator=exturlusage&format=jsonfm&geuquery=www.google.com&prop=info|imageinfo|categoryinfo&meta=userinfo&indexpageids=&geulimit=5000&geuoffset=10000&action=query&geunamespace=2|3&iiprop=timestamp|user|comment|url|size|sha1|metadata&uiprop=blockinfo|hasmsg

{

"query-continue": {
    "exturlusage": {
        "geuoffset": 10500
    }
},
"warnings": {
    "exturlusage": {
        "*": "geulimit may not be over 500 (set to 5000) for users"
    }
},
"query": {
    "pageids": [
        "12828"
    ],
    "pages": {
        "12828": {
            "pageid": 12828,
            "ns": 2,
            "title": "User:\u05dc\u05e2\u05e8\u05d9 \u05e8\u05d9\u05d9\u05e0\u05d4\u05d0\u05e8\u05d8/monobook.js",
            "contentmodel": "javascript",
            "pagelanguage": "en",
            "touched": "2012-04-10T19:34:24Z",
            "lastrevid": 112424,
            "counter": "",
            "length": 4432,
            "protection": []
        }
    },
    "userinfo": {
        "id": 25083,
        "name": "Mpaa"
    }
}
}

Between offsets 12000 and 12500, only an empty response comes back:

{

"warnings": {
    "exturlusage": {
        "*": "geulimit may not be over 500 (set to 5000) for users"
    }
},
"query": {
    "userinfo": {
        "id": 25083,
        "name": "Mpaa"
    }
}

}

A possible strategy could be to increase new_limit when the code reaches this branch in api.py, line 1090:

else:
    # if query-continue is present, self.resultkey might not have been
    # fetched yet
    if "query-continue" not in self.data:
        # No results.
        return  # --> start to increase the counter here? the tricky part is to maintain the total number returned = count

It tries to fetch only the number of elements still needed to reach 5.
Once that number drops to 1, it stays there for 12000 queries ...

  • 500 500 5 5 0 **

[[test:User:Nip]]
[[test:User:TeleComNasSprVen]]

  • 500 500 5 3 2 **

[[test:User:MaxSem/wap]]

  • 500 500 5 2 3 **
  • 500 500 5 2 3 **
  • 500 500 5 2 3 **
  • 500 500 5 2 3 **
  • 500 500 5 2 3 **
  • 500 500 5 2 3 **

[[test:User:HersfoldCiteBot/Citation errors needing manual review]]

  • 500 500 5 1 4 **
  • 500 500 5 1 4 **
  • 500 500 5 1 4 **

......

(In reply to Mpaa from comment #2)

It tries to fetch only the number of elements still needed to reach 5.
Once that number drops to 1, it stays there for 12000 queries ...

But MW doesn't return one row, as requested?

Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and exponentially increase the new_limit until it finds data or reaches the end of the dataset.
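
To make that concrete, here is a minimal, self-contained sketch of such a strategy (illustrative only, not the patch that was eventually merged; the function name and parameters are invented for this example):

def next_limit(current_limit, remaining, rows_returned, api_limit=500):
    """Pick the limit for the next continuation request.

    If the last response carried rows, request only what is still
    needed; if it carried none (sparse data under miser mode), grow
    the window geometrically so the generator skims the sparse region
    quickly, capped at the server-side maximum.
    """
    if rows_returned > 0:
        return min(remaining, api_limit)
    return min(max(current_limit, 1) * 2, api_limit)

# Starting from the stuck state above (geulimit=1, 4 of 5 results found),
# the limit grows 2, 4, 8, ..., 500 instead of staying at 1:
limit = 1
for _ in range(10):
    limit = next_limit(limit, remaining=1, rows_returned=0)
    print(limit, end=' ')  # 2 4 8 16 32 64 128 256 500 500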

(In reply to John Mark Vandenberg from comment #3)

(In reply to Mpaa from comment #2)

It tries to fetch only the number of elements still needed to reach 5.
Once that number drops to 1, it stays there for 12000 queries ...

But MW doesn't return one row, as requested?

I meant that it will keep sending requests with geulimit=1, so to get to offset 12000 it will send 12000 requests.
It advances one row at a time, and each response contains just query-continue data:
{u'exturlusage': {u'geuoffset': 24}}
{u'exturlusage': {u'geuoffset': 25}}
...

Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and
exponentially increase the new_limit until it finds data or reaches the end of the dataset.

geulimit=1 says the client wants only 1 record.

The MW API isn't returning one record. It is moving the cursor forward by one and returning zero records.

It feels like MW is interpreting geulimit=1 as 'only look at one record, and return the data if it meets the request criteria'.

The API documentation explains it:

eunamespace         - The page namespace(s) to enumerate.
                      NOTE: Due to $wgMiserMode, using this may result in fewer than "eulimit" results
                      returned before continuing; in extreme cases, zero results may be returned
                      Values (separate with '|'): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ..
                      Maximum number of values 50 (500 for bots)
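
That behaviour is easy to reproduce against the live API. Below is a minimal sketch using the third-party requests library (the starting offset, the loop count, and the rawcontinue parameter, which asks newer MediaWiki versions for the old query-continue format shown in this report, are choices made for this example):

import requests

API = 'https://test.wikipedia.org/w/api.php'
params = {
    'action': 'query', 'format': 'json', 'rawcontinue': '',
    'list': 'exturlusage', 'euquery': 'www.google.com',
    'euprotocol': 'http', 'eunamespace': '2|3',
    'eulimit': 1, 'euoffset': 20,
}
for _ in range(5):  # a few iterations are enough to see the effect
    data = requests.get(API, params=params).json()
    rows = data.get('query', {}).get('exturlusage', [])
    print('offset', params['euoffset'], '->', len(rows), 'rows')
    cont = data.get('query-continue', {}).get('exturlusage')
    if not cont:
        break  # end of the dataset
    params['euoffset'] = cont['euoffset']

Each iteration moves the cursor forward by one and returns zero rows, exactly as described above.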

gerritadmin wrote:

Change 167438 had a related patch set uploaded by Mpaa:
api.py: increase api limits when data are sparse

https://gerrit.wikimedia.org/r/167438

gerritadmin wrote:

Change 167438 merged by jenkins-bot:
Increase limits in QueryGenerator when data are sparse

https://gerrit.wikimedia.org/r/167438