
DBQ-104 List of mainspace articles less than 0.5kb in size in Tamil Wikipedia with date of creation
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-104.
Summary: List of mainspace articles less than 0.5kb in size in Tamil Wikipedia with date of creation
Issue type: Task - A task that needs to be done.
Priority: Minor
Status: Done
Assignee: EdoDodo <dodo.wikipedia@gmail.com>


From: Sodabottle <sodabottle@gmail.com>

Date: Fri, 24 Sep 2010 22:14:20

I am trying to create a list of Tamil Wikipedia articles that are less than 0.5kb in size. This is for a new wikiproject to develop stubs in ta.wiki.

The list is for mainspace articles only; redirects, templates and categories shouldn't be included.


Version: unspecified
Severity: minor

Details

Reference
bz59356

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:25 AM
bzimport set Reference to bz59356.

From: EdoDodo <dodo.wikipedia@gmail.com>

Date: Sat, 25 Sep 2010 08:57:45

I ran the following query:

SELECT page_id, page_len FROM page
WHERE page_len < 512 AND page_namespace = 0 AND page_is_redirect != 1
ORDER BY page_len

Then, I did some post-processing using the API and the page IDs to get the page titles properly encoded (the database did not return them encoded properly). I then combined the results of the API queries and that of the database query and exported them to a CSV file (UTF-8 encoded), which is attached. After that, I ran a bunch of regex find-and-replaces manually to get the data from the CSV into a wikitable, which is also attached. Feel free to use whichever format is more convenient for you.
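The post-processing described above could be sketched roughly as follows. This is not the script that was actually used; it only illustrates one way to look up titles by page ID through the MediaWiki `action=query` API and to emit both a CSV and a wikitable. The endpoint URL, the function names and the column layout are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint for Tamil Wikipedia; not stated in the thread.
API_URL = "https://ta.wikipedia.org/w/api.php"


def build_title_query(page_ids):
    """Build an action=query URL for a batch of page IDs.

    The API caps the number of IDs per request (50 for ordinary
    clients), so a caller would pass the IDs in batches.
    """
    params = {
        "action": "query",
        "pageids": "|".join(str(i) for i in page_ids),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def titles_from_response(response):
    """Map page ID -> title from a decoded JSON response.

    Deleted or unknown IDs come back without a "title" key and
    are skipped.
    """
    pages = response["query"]["pages"]
    return {int(pid): p["title"] for pid, p in pages.items() if "title" in p}


def to_csv(rows, titles):
    """Combine (page_id, page_len) rows with titles into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page_id", "title", "page_len"])
    for page_id, page_len in rows:
        writer.writerow([page_id, titles.get(page_id, ""), page_len])
    return buffer.getvalue()


def to_wikitable(rows, titles):
    """Render the same rows as a sortable wikitable."""
    lines = ['{| class="wikitable sortable"',
             "! Page ID !! Title !! Length (bytes)"]
    for page_id, page_len in rows:
        lines.append("|-")
        lines.append(f"| {page_id} || [[{titles.get(page_id, '')}]] || {page_len}")
    lines.append("|}")
    return "\n".join(lines)
```

The JSON returned by the URL from `build_title_query` would be decoded and passed to `titles_from_response`; writing the CSV text to a file opened with `encoding="utf-8"` keeps the Tamil titles intact.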

Only 150 pages were found, but that should give you a bit to work on :). Unfortunately, the size does not account for transcluded templates, so some pages are in fact not really that short (for example, your main page is second on the list, because it is made up of just three transcluded templates). The results are ordered by length, with the shortest ones first.

Anyway, good luck and let me know if there's anything else I can do for you.

From: Sodabottle <sodabottle@gmail.com>

Date: Sat, 25 Sep 2010 09:13:57

Thank you EdoDodo,

This is exactly what I was looking for. Now we can get to work on these. I am actually surprised at the low number of articles - the Wikimedia stats page (http://stats.wikimedia.org/EN/TablesWikipediaTA.htm) says that 18% of our articles are less than 0.5k in size (that figure is for May 2010, but I am sure we have not done anything to expand the stubs since then). So by our article count, there should be about 5000 articles that are < 0.5k. Any idea why this discrepancy occurs?
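To make the size of the gap concrete, the figures quoted above can be put side by side. The total article count below is not given in the thread; it is back-calculated from the 18% share and the estimate of about 5000, so treat it as an approximation.

```python
stats_share = 0.18     # share of articles under 0.5k per the stats page
expected_short = 5000  # Sodabottle's estimate derived from that share
found_short = 150      # rows returned by the database query

# Article count implied by the two quoted figures (an assumption).
implied_total = expected_short / stats_share

# Share of articles the query actually found below 512 bytes.
found_share = found_short / implied_total

print(f"implied total articles: {implied_total:.0f}")
print(f"share found by query:   {found_share:.2%}")
```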

regards
Sodabottle


From: EdoDodo <dodo.wikipedia@gmail.com>

Date: Sat, 25 Sep 2010 09:33:20

Hi,

Hmm... It is strange that there is such a large discrepancy between the two. It's only been 4 months after all, and the overall size of the wiki that was listed hasn't increased an awful lot (although, if a significant part of the size increase was focused on stubs, that would explain it). I'll ask someone to check my query later today and see if I've made any mistakes, but it looks all right to me.

EdoDodo


From: Guandalug <A.Meiske@nightstone.de>

Date: Sat, 25 Sep 2010 16:57:53

The query is not to blame in any case.

I just ran a COUNT(*) on it and found 145 - it seems somebody is very busy getting rid of short articles.

If the limit is set to 600 bytes, it's 216 articles. With 1024 bytes (1 kB), it would be a list of 1095 articles.
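A check like the one above can be reproduced by running the original query's filter as a `COUNT(*)` at several size limits. The sketch below does this against a tiny in-memory SQLite table that stands in for the real `page` table; the sample rows are invented, and it assumes the same strict `<` comparison as the original query.

```python
import sqlite3

# Same filter as the original query, with the size limit as a parameter.
COUNT_QUERY = """
SELECT COUNT(*) FROM page
WHERE page_len < ? AND page_namespace = 0 AND page_is_redirect != 1
"""


def count_short_pages(conn, max_len):
    """Count mainspace non-redirect pages shorter than max_len bytes."""
    return conn.execute(COUNT_QUERY, (max_len,)).fetchone()[0]


def demo_connection():
    """Create an in-memory stand-in for the page table (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, "
        "page_namespace INTEGER, page_is_redirect INTEGER, page_len INTEGER)"
    )
    rows = [
        (1, 0, 0, 100),   # short article
        (2, 0, 0, 550),   # between 512 and 600 bytes
        (3, 0, 0, 900),   # between 600 and 1024 bytes
        (4, 0, 1, 30),    # redirect, excluded
        (5, 10, 0, 40),   # template namespace, excluded
    ]
    conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", rows)
    return conn


conn = demo_connection()
for limit in (512, 600, 1024):
    print(limit, count_short_pages(conn, limit))
```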


From: Sodabottle <sodabottle@gmail.com>

Date: Sat, 25 Sep 2010 17:05:34

Thanks for the confirmation. I started working on the list; that's why the numbers are decreasing.

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: dodo.wikipedia@gmail.com
CC list: dodo.wikipedia@gmail.com, guandalug@nurfuerspam.de, sodabottle@gmail.com