Page MenuHomePhabricator

DBQ-89 Finding unlinked articles / finding empty but linked articles
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-89.
Summary: Finding unlinked articles / finding empty but linked articles
Issue type: Task - A task that needs to be done.
Priority: Minor
Status: Done
Assignee: merl <mewikipedia@to.mabomuja.de>


From: Läuterungsberg <Ligl@schulevonathen.de>

Date: Tue, 16 Mar 2010 21:33:48

a) Create a list of articles which do exist, but aren´t linked to.
b) Create a list of articles which do not have any content but are linked to (via a redlink), sorted by the number of redlinks to the article.

Both is meant for the english and the german wikipedia.


Version: unspecified
Severity: minor

Details

Reference
bz59341

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:24 AM
bzimport set Reference to bz59341.

From: merl <mewikipedia@to.mabomuja.de>

Date: Wed, 17 Mar 2010 15:41:59

Some existing tools:
a)

  • for dewiki

http://de.wikipedia.org/wiki/Special:LonelyPages
http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Verwaiste_Seiten/Arbeitsseite
http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Verwaiste_Seiten/Kategorien
http://toolserver.org/~merl/specialpages/dewiki/LonelyPages

  • for enwiki

http://de.wikipedia.org/wiki/Special:LonelyPages

b) http://toolserver.org/~magnus/missingtopics.php
http://de.wikipedia.org/wiki/Benutzer:P.Copp/Spezialseiten/Gew%C3%BCnschte_Artikel (an old version)


Which output format? (simply list/CSV, wiki-syntax, ...)
What do you mean with "(not) linked to"? I think you only want links included from article namespace (ignoring links from other namespaces)?


From: merl <mewikipedia@to.mabomuja.de>

Date: Fri, 19 Mar 2010 23:31:19

No response, so i'll run the query.
a) All articles that are not linked, embedded or redirected. I removed the disambiguationspages. But for dewiki you should use my tool, because my bot daily add these pages to the project page and those aren't lonely anymore.
b) All links from article namespace to not existing articles
Here my SGE script

#! /bin/bash
#$ -N DBQ-89
#$ -hard
#$ -l sqlprocs-s5=1
WIKI="dewiki"
SQL="
CREATE TEMPORARY TABLE temp (t_pid INT PRIMARY KEY, t_pt VARCHAR(255));
INSERT INTO temp
 SELECT page_id, page_title
  FROM ${WIKI}_p.page p
   LEFT JOIN ${WIKI}_p.pagelinks ON p.page_namespace = pl_namespace AND p.page_title=pl_title
   LEFT JOIN ${WIKI}_p.templatelinks ON p.page_namespace = tl_namespace AND p.page_title=tl_title
   LEFT JOIN ${WIKI}_p.redirect ON p.page_namespace = rd_namespace AND p.page_title=rd_title
  WHERE p.page_namespace=0 AND page_is_redirect=0 AND pl_namespace IS NULL AND tl_namespace IS NULL AND rd_namespace IS NULL;
CREATE TEMPORARY TABLE tempdab(dab_tltitle INT PRIMARY KEY);
INSERT INTO tempdab
 SELECT pl_title
  FROM ${WIKI}_p.page
   INNER JOIN ${WIKI}_p.pagelinks ON page_id=pl_from
  WHERE page_namespace=8 AND page_title='Disambiguationspage' AND pl_namespace=10;
CREATE TEMPORARY TABLE tempdel(td_pid INT PRIMARY KEY);
INSERT IGNORE INTO tempdel
 SELECT t_pid
  FROM temp
   INNER JOIN ${WIKI}_p.templatelinks ON t_pid=tl_from
   INNER JOIN tempdab ON tl_title = dab_tltitle
  WHERE tl_namespace=10;
DELETE a
 FROM temp a
  INNER JOIN tempdel b ON t_pid=td_pid;
SELECT REPLACE(t_pt,'_',' ')
 FROM temp
 ORDER BY t_pt;"
mysql -wBN -h$WIKI-p.db u_$USER -e "$SQL" > ${HOME}/public_html/sql/DBQ-89-${WIKI}-LonelyPages.txt
SQL="
CREATE TEMPORARY TABLE temp (pid INT, pt VARCHAR(255), plt VARCHAR(255), INDEX(pt), INDEX(plt));
INSERT INTO temp
 SELECT page_id, page_title, pl_title
 FROM ${WIKI}_p.page
  INNER JOIN ${WIKI}_p.pagelinks ON pl_from = page_id
 WHERE page_namespace = 0 AND page_is_redirect = 0 AND pl_namespace = 0;
SELECT CONCAT(REPLACE(plt,'_',' '),',',COUNT(*))
 FROM temp
  LEFT JOIN ${WIKI}_p.page ON page_namespace = 0 AND plt = page_title
 WHERE page_namespace IS NULL
 GROUP BY plt
 HAVING COUNT(*) > 300
 ORDER BY COUNT(*) DESC
 LIMIT 5000;"
mysql -wBN -h$WIKI-p.db u_$USER -e "$SQL"  > ${HOME}/public_html/sql/DBQ-89-${WIKI}-Redlinks.txt

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: bugreporter@to.mabomuja.de
CC list: bugreporter@to.mabomuja.de