Page MenuHomePhabricator

DBQ-121 Sublist of the "List of pages of English Wikipedia with most interwikis"
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-121.
Summary: Sublist of the "List of pages of English Wikipedia with most interwikis"
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>


From: Helder <helder.wiki@gmail.com>

Date: Wed, 26 Jan 2011 19:54:22

Hi!

I would like to request two lists for use on Portuguese Wikipedia:

  • A list of 5000 pages of English Wikipedia with more interwikilinks;
  • The list of pages in the first list which doesn't have an interwiki to Portuguese Wikipedia (pt.wikipedia);

This request is analogous to <del>DBQ-91</del>, so I think it should be easy to generate.
Please, could you also inform how long does it takes to get such lists from database?

Thank you very much


Version: unspecified
Severity: major

Details

Reference
bz59374

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:26 AM
bzimport set Reference to bz59374.

From: Hoo man <hoo@online.de>

Date: Thu, 27 Jan 2011 15:38:07

The first query needed 6m58.933s to execute, the second 1m12.616s.
Further, the second only includes the page 2016, I don't know if that can be true, but I couldn't find an error in my Queries ![][1]

SQL:
(first query)

SELECT page.page_title as page FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000;

(second query)

SELECT page.page_title as page FROM (SELECT * FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000) as langlinks INNER JOIN page ON langlinks.ll_from = page.page_id LEFT JOIN (SELECT COUNT(*) as bool, ll_from FROM langlinks WHERE ll_lang = 'pt' GROUP BY ll_from) as tmp ON tmp.ll_from = langlinks.ll_from WHERE isnull(tmp.bool);

Results:
http://toolserver.org/~hoo/dbq/dbq-121.txt (first query)
http://toolserver.org/~hoo/dbq/dbq-121_2.txt (second query)

[1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Helder <helder.wiki@gmail.com>

Date: Thu, 27 Jan 2011 17:26:46

Hi!

Thank you Hoo man!

Since the second list has only one page, do you mind to expand it to some reasonable value above 5000 (10000 maybe?) so that we can get a list with more pages? If possible, could you also add a column with the number of interwiki links in the resulting list?


From: Hoo man <hoo@online.de>

Date: Thu, 27 Jan 2011 19:23:00

I reran the second query with the top 10,000 pages and it is displaying the number interwiki links now:
http://toolserver.org/~hoo/dbq/dbq-121_3.txt

Fell free to ask, if you want even more.


From: Helder <helder.wiki@gmail.com>

Date: Fri, 28 Jan 2011 10:31:15

Hi again!

It seems that the criteria to filter the list is still too restrictive. Maybe one of these alternatives (both?) could give us better results:

  • A list of en.wp pages with more than 20 interwiki links and no interwiki to pt.wp
  • A sublist of articles in the "list of 5000 en.wp pages with more interwikis" having an interwiki to a "pt.wp page whose size is less than 3kb" (instead of having no interwiki to 'pt'). In this case, it would be good to sort by size of Portuguese article, so that the community can focus on improving the small articles. If possible, this list should have 4 columns: English title, number of interwikis, Portuguese title, size of Portuguese article.

Are those lists still feasible?

Thanks for helping us


From: Hoo man <hoo@online.de>

Date: Fri, 28 Jan 2011 22:45:37

Both done ![][1]

SQL:
(first query)

SELECT COUNT(*) as links, page.page_title as page FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY langlinks.ll_from HAVING links > 20 ORDER BY links DESC;

(second query)
This one was a bit more complicated, I first had to pull the page list from enwiki_p in a user table, then dump that user table and import it to the server with ptwiki_p on. (MySQL doesn't support JOINS between different Servers yet)

This is the SELECT INSERT into the user table:

INSERT INTO u_hoo.dbq121 SELECT langlinks.ll_title FROM (SELECT * FROM langlinks INNER JOIN page on langlinks.ll_from = page.page_id WHERE page.page_namespace = 0 GROUP BY ll_from ORDER BY COUNT(*) DESC LIMIT 5000) as langlinks INNER JOIN page ON langlinks.ll_from = page.page_id LEFT JOIN (SELECT COUNT(*) as bool, ll_from FROM langlinks WHERE ll_lang = 'pt' GROUP BY ll_from) as tmp ON tmp.ll_from = langlinks.ll_from WHERE tmp.bool IS NOT NULL;

After the dump and import to sql 2:

SELECT page.page_len as page_size, dbq121.page_title as page FROM u_hoo.dbq121 INNER JOIN page ON dbq121.page_title = page.page_title WHERE page.page_len < 3072 AND page.page_namespace = 0 ORDER BY page.page_len DESC;

Results:
http://toolserver.org/~hoo/dbq/dbq-121_4.txt (first query)
http://toolserver.org/~hoo/dbq/dbq-121_5.txt (second query)

[1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Helder <helder.wiki@gmail.com>

Date: Tue, 01 Feb 2011 15:33:16

Thank you very much!

I just have one question: The last line of second list is
17 Alfa
but the article has 16 only characters:
"#REDIRECT [[Α]]\n"
as you can see here:
http://pt.wikipedia.org/w/index.php?title=Alfa&action=edit

Is there a bug making that value wrong? How should we interpret the numbers in that column?


From: Hoo man <hoo@online.de>

Date: Tue, 01 Feb 2011 15:51:52

The numbers are right, they just show the page size in bytes, which may make them vary. Further the additional "\n" at the end (which stands for a new line), is another char ![][1]

[1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Helder <helder.wiki@gmail.com>

Date: Tue, 01 Feb 2011 16:58:51

Yep, this is the problem: without the new line, there are only 15 characters in that wikitext, so it still doesn't seems to be correct. I've noticed this on larger pages as well, like the page "L" which has 3006 in the first column, but has 2892 characters (counted by putting
javascript:alert(jQuery('#wpTextbox1').text().length)
in address bar when editing the page:
http://pt.wikipedia.org/w/index.php?title=L&oldid=22987621&action=edit
)

But I think I understood: after checking the value of PAGESIZE:{{FULLPAGENAME}} for a page with "c" and other with "ç" I noticed this last character takes two bytes instead of only one. So, never mind... =)

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: hoo@online.de
CC list: hoo@online.de