Page MenuHomePhabricator

DBQ-99 Quality information from 100 articles
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-99.
Summary: Quality information from 100 articles
Issue type: Task - A task that needs to be done.
Priority: Trivial
Status: Done
Assignee: (none)


From: ACNiklas <wiki@bietenbeck.net>

Date: Thu, 26 Aug 2010 11:55:49

I have a list of 100 articles (the 100 most viewed articles in medicine) and want to extract the following informations:

  • last edit article
  • size of article
  • size discussion
  • number of editors article

Please help me write a sql-query as I don't know anything about the structure of wikipedia's tables.

I can supply the list of articles in probably any format. The result is also fine in any format.

It would be nice to store the query, so I can modify and rerun it later on.

Thanks,

Andreas


Version: unspecified
Severity: trivial

Details

Reference
bz59351

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:25 AM
bzimport set Reference to bz59351.

From: DaB. <dab@ts.wikimedia.org>

Date: Thu, 26 Aug 2010 12:02:24

SELECT page_title,page_len,rev_timestamp FROM page JOIN revision ON page_latest=rev_id WHERE page_namespace=0 AND page_title="XXXX" LIMIT 1;

for size-of-article and last-edit.


From: DaB. <dab@ts.wikimedia.org>

Date: Thu, 26 Aug 2010 12:03:46

SELECT page_title,page_len,rev_timestamp FROM page JOIN revision ON page_latest=rev_id WHERE page_namespace=1 AND page_title="XXXX" LIMIT 1;

for the size-of-discusssion and last-edit-on-discussion.


From: DaB. <dab@ts.wikimedia.org>

Date: Thu, 26 Aug 2010 12:08:39

SELECT page_title,COUNT![][1] AS number FROM (SELECT DISTINCT page_title,rev_user_text FROM page JOIN revision ON page_id=rev_page WHERE page_namespace=0 AND page_title="XXXX") AS dummy;

for the number-of-editors. You can change

AND page_title="XXXX"

to

AND page_title="XXXX" AND rev_user>0

to remove non-login-users.

[1]: https://jira.toolserver.org/images/icons/emoticons/star_yellow.gif

From: ACNiklas <wiki@bietenbeck.net>

Date: Thu, 26 Aug 2010 20:28:01

Thanks for the quick reply. I suppose I can substitute

AND page_title="XXXX"

with

AND page_title IN ('article001', 'article002'...).

Is the page_title URLEncoded (f%C3%BCr instead of für) or can I use all characters as they appear in wikipedia's titles?

Thanks again,

Andreas


From: DaB. <dab@ts.wikimedia.org>

Date: Thu, 26 Aug 2010 20:34:23

You can use all chars as long as your charset is correct (that's default for the mysql-command), but you have to replace spaces " " with underscores "_".


From: ACNiklas <wiki@bietenbeck.net>

Date: Sat, 28 Aug 2010 09:26:58

Ok, so I have now the following queries:

SELECT page_title,page_len,rev_timestamp FROM page JOIN revision ON page_latest=rev_id WHERE page_namespace=0 AND page_title IN (art1, art2..) LIMIT 1;

SELECT page_title,page_len,rev_timestamp FROM page JOIN revision ON page_latest=rev_id WHERE page_namespace=1 AND page_title IN (art1, art2..) LIMIT 1;

SELECT page_title,COUNT AS number FROM (SELECT DISTINCT page_title,rev_user_text FROM page JOIN revision ON page_id=rev_page WHERE page_namespace=0 AND page_title = IN (art1, art2..)) AS dummy;

(art1, art2 ...) is

("Vagina", "Vulva", "Borderline-Persönlichkeitsstörung", "Penis", "Lyme-Borreliose", "Multiple_Sklerose", "Analverkehr", "Klitoris", "Schamhaarentfernung", "Schwangerschaft", "Burnout-Syndrom", "Konfektionsgröße", "Hämorrhoiden", "Depression", "Schamlippe", "Orgasmus", "Schizophrenie", "Pfeiffer-Drüsenfieber", "Penis_des_Menschen", "Diabetes_mellitus", "Fisting", "Ejakulation", "Herz", "Weibliche_Ejakulation", "Lupus_erythematodes", "Thrombose", "Ethanol", "Autismus", "Herpes_Zoster", "Morbus_Crohn", "AIDS", "Syphilis", "Magnetresonanztomographie", "Erektion", "Warze", "Gicht", "Gold", "Bandscheibenvorfall", "Amphetamin", "Asperger-Syndrom", "Arthrose", "Body-Mass-Index", "Tuberkulose", "Schuhgröße", "Narzissmus", "Windpocken", "Kokain", "Bipolare_Störung", "Parkinson-Krankheit", "Homöopathie", "Epilepsie", "Büstenhalter", "Blutgruppe", "Hitzeschaden", "Demenz", "Kondom", "Hand-Fuß-Mund-Krankheit", "Empathie", "Gonorrhoe", "Herpes_simplex", "Lorenzos_Öl", "Schilddrüse", "Down-Syndrom", "Vaginalverkehr", "Blutdruck", "Leukämie", "Chlamydien", "Staphylococcus_aureus", "Anus", "Atopisches_Ekzem", "Bettwanze", "Lungenentzündung", "PH-Wert", "Fibromyalgie", "Suizid", "Meningitis", "Ibuprofen", "Hepatitis_B", "Nesselsucht", "Menstruationszyklus", "Adipositas", "Amyotrophe_Lateralsklerose", "Sperma", "Lungenembolie", "Geschlechtsorgan", "Hashimoto-Thyreoiditis", "Alkoholkrankheit", "Gesetzliche_Krankenversicherung", "Weibliche_Brust", "Malaria", "Schlaganfall", "Marihuana", "Leistenbruch", "Zirkumzision", "Prostata", "Glucose", "Anämie", "Fersensporn", "Laktoseintoleranz", "Klitorisvorhautpiercing")

What else do I have to do to get these queries executed?

Thanks,
Andreas


From: As <megeekz@arandomblog.co.cc>

Date: Fri, 03 Sep 2010 23:07:51

We can go ahead and execute the query for you. I'll go ahead and do it.


From: As <megeekz@arandomblog.co.cc>

Date: Fri, 03 Sep 2010 23:16:57

Query 1 has finished, doing query 2 right now.


From: As <megeekz@arandomblog.co.cc>

Date: Fri, 03 Sep 2010 23:34:41

All queries finished.

The results of queries 1 and 2 will be posted as a text file alone. I will also post query 3 as a text file as well, but it is close to 1.0M so I will also gzip it.

http://toolserver.org/~andrew/9403/ is where you may find the results of your queries.

Please tell me when to remove the results.


From: ACNiklas <wiki@bietenbeck.net>

Date: Sat, 04 Sep 2010 11:03:59

Thanks for runing these queries. However I noticed, that all articles with German special characters (äüöß) in their name have not produced any results ("Leukämie" for example). Do you think there is a way to fix this?

Andreas


From: As <megeekz@arandomblog.co.cc>

Date: Sat, 04 Sep 2010 15:04:02

I have re-run the query, please check again.


From: As <megeekz@arandomblog.co.cc>

Date: Sat, 04 Sep 2010 15:05:46

Actually, I'm still not getting the results. I'll try a different method.

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: (none)
CC list: wikimedia-bugzilla@dabpunkt.eu