DBQ-27 find enwiki articles that don't exist on enwiktionary
Closed, Resolved (Public)

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-27.
Summary: find enwiki articles that don't exist on enwiktionary
Issue type: Task - A task that needs to be done.
Priority: Minor
Status: Done
Assignee: norman james vondall <norm_vondall@yahoo.com>


From: Msh210 <m.hamm.1@alumni.nyu.edu>

Date: Wed, 04 Jun 2008 19:51:51

Could someone please find every enwiki ns:0 article [Foo] such that (1) en.wiktionary does not have [foo]; (2) enwiki's article [Foo], if not a hard redirect, contains the word "foo" in it somewhere in lowercase; and (3) if enwiki's article [Foo] is a redirect to [Bar], then the latter contains the word "foo" in it somewhere in lowercase? (This is, in short, a way to find all words, except proper nouns, that enwiki has articles on and enwiktionary doesn't.) If the following doesn't complicate matters too much, I'd rather have an additional restriction: (4) The title "Foo" contains a space (20) in it. Thanks much. (Note that this is not strictly a ts request, as dump analysis will suffice. I just haven't found anyone willing and able to analyze the dumps.)
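
For illustration, criteria (1) through (4) above could be sketched roughly as follows in Python. This is only a sketch under assumptions not stated in the request: the page data is already loaded into plain dicts, "in lowercase" is read as "with the first letter lowercased" (fully lowercasing the title would be a stricter alternative), and all names here (`lowercase_first`, `contains_lowercase`, `candidates`, `require_space`) are hypothetical.

```python
import re


def lowercase_first(title):
    """Map an enwiki title [Foo] to the Wiktionary-style title [foo]."""
    return title[:1].lower() + title[1:]


def contains_lowercase(title, text):
    """True if the title occurs in text as a whole phrase with a lowercase first letter."""
    needle = lowercase_first(title)
    if needle == title:
        # The title has no uppercase first letter, so the test is meaningless.
        return False
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return re.search(pattern, text) is not None


def candidates(articles, redirects, wiktionary_titles, require_space=True):
    """Yield enwiki titles that pass criteria (1) to (4).

    articles: dict of title -> wikitext for non-redirect ns:0 pages (assumed input)
    redirects: dict of redirect title -> target title (assumed input)
    wiktionary_titles: set of en.wiktionary entry titles (assumed input)
    """
    for title in sorted(set(articles) | set(redirects)):
        if require_space and " " not in title:
            continue  # criterion (4): title must contain a space
        if lowercase_first(title) in wiktionary_titles:
            continue  # criterion (1): en.wiktionary already has [foo]
        if title in redirects:
            text = articles.get(redirects[title], "")  # criterion (3)
        else:
            text = articles[title]  # criterion (2)
        if contains_lowercase(title, text):
            yield title


# Tiny made-up data set, purely for demonstration.
articles = {
    "Apple": "An apple is a fruit.",
    "Black hole": "A black hole is a region of spacetime.",
    "Event horizon": "An event horizon is a boundary.",
    "New York": "New York is a state.",
}
redirects = {"Dark star": "Black hole"}
wiktionary_titles = {"event horizon"}

result = list(candidates(articles, redirects, wiktionary_titles))
print(result)
```

As noted in the request, some proper nouns will still slip through such a filter; the lowercase test only removes most of them.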


Version: unspecified
Severity: minor

Details

Reference
bz59283

Event Timeline

bzimport raised the priority of this task from to Needs Triage. (Nov 22 2014, 2:20 AM)
bzimport set Reference to bz59283.

From: SQL <sxwiki@gmail.com>

Date: Wed, 13 Aug 2008 16:16:58

...I think this may be beyond what can realistically be done. Probably hundreds of thousands, if not millions of articles between the two, and, I'm not quite sure how we'd filter out proper nouns, without someone going over it line-by-line. Perhaps someone else has a better way to go about this, however.


From: Msh210 <m.hamm.1@alumni.nyu.edu>

Date: Wed, 13 Aug 2008 17:15:47

"I'm not quite sure how we'd filter out proper nouns, without someone going over it line-by-line." The original description described that, though perhaps it wasn't made clear enough: by making sure that the enwiki article has its title in its text, but in _lowercase_, you've gotten rid of most proper nouns. (And the rest can get through the filter; I don't care.)


From: CBM <cbm.wikipedia@gmail.com>

Date: Sat, 16 Aug 2008 23:45:46

The request requires scanning the wiki source code of each page. That cannot be done with a toolserver database query. It will have to be done with a database dump.
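
A rough sketch of what such a dump scan might look like, in Python, is below. It assumes a MediaWiki XML export where each `<page>` carries `<title>`, `<ns>`, an optional `<redirect title="..."/>`, and a `<text>` element inside `<revision>`; older dumps may lack some of these elements. The helper names `local` and `iter_pages` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Stream a dump and yield (title, ns, redirect_target_or_None, text) per page."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = redirect = None
        text = ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
            elif name == "redirect":
                redirect = child.get("title")
            elif name == "text":
                text = child.text or ""
        yield title, ns, redirect, text
        elem.clear()  # release the page's subtree to keep memory use down


# Made-up sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Black hole</title><ns>0</ns>
<revision><text>A black hole is ...</text></revision></page>
<page><title>Dark star</title><ns>0</ns><redirect title="Black hole" />
<revision><text>#REDIRECT [[Black hole]]</text></revision></page>
<page><title>Talk:Black hole</title><ns>1</ns>
<revision><text>chat</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
main_namespace = [p for p in pages if p[1] == 0]
print([p[0] for p in main_namespace])
```

For a real dump, the `io.BytesIO` stand-in would be replaced by an opened (and decompressed) dump file; streaming matters because the full enwiki dump does not fit comfortably in memory.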


From: MZMcBride <mzmcbride@gmail.com>

Date: Sun, 17 Aug 2008 00:16:23

This is not something that can be done with the Toolserver, as CBM noted. You'll likely need to find someone with a database dump in order to do what you're trying to do.

Perhaps try asking at http://en.wikipedia.org/wiki/Wikipedia:Bot_requests or http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical) ? Usually bot operators and people with database dumps overlap.

Resolved as declined.

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: (none)
CC list: b@mzmcbride.com, cbm.wikipedia@gmail.com, sxwiki@gmail.com, msh210+wmfbugzilla@gmail.com