Automatic links to internal wiki articles based on patterns
Open, Low, Public, Feature

Description

Author: jediarchives11

Description:
When a user is editing a page, could the system automatically check for related
articles? For example, a user adds the word "physics" but doesn't link it
because he doesn't know that there is an article about physics. With automatic
linking, the system would check the newly edited part of the article, see whether
any of the words match article names, and turn each matching word into a link to
that article.
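
A minimal sketch of the behaviour requested above, written in Python purely for illustration (it is not the extension's code). The function name `autolink` and the idea of passing article titles in as a plain list are assumptions; a real implementation would look titles up in the wiki's database.

```python
import re

# Matches either an existing [[...]] link (left untouched) or a plain word.
TOKEN = re.compile(r"\[\[.*?\]\]|[A-Za-z]+")

def autolink(text, titles):
    """Turn words that match an article title into wiki links."""
    # Case-insensitive lookup: lowercase word -> canonical article title.
    by_lower = {title.lower(): title for title in titles}

    def replace(match):
        word = match.group(0)
        if word.startswith("[["):
            return word  # already a link, do not nest another one
        title = by_lower.get(word.lower())
        if title is None:
            return word  # no article with this name
        # Use a piped link when the text's casing differs from the title.
        return f"[[{title}]]" if word == title else f"[[{title}|{word}]]"

    return TOKEN.sub(replace, text)
```

For instance, `autolink("She studied physics and math.", ["Physics"])` links only the word "physics" and leaves the rest of the sentence alone.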


Version: unspecified
Severity: enhancement

Details

Reference
bz2336

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 8:31 PM
bzimport set Reference to bz2336.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

changed subject from "automatic linking" to "automatic wikification"

avarab wrote:

Changed the product to MediaWiki extensions.

gregory.szorc wrote:

First draft of extension

First draft of automatic wikification extension. It needs some work in the
regular expression arena. It is designed to work with the 1.5 db layout.

attachment wikification.php ignored as obsolete

jediarchives11 wrote:

Wow. Someone decided to make this extension. Thanks. I'm not that knowledgeable about coding,
unfortunately, but I will try to help as much as I can. I've already read the code and, thanks
to your comments, have been able to understand most of it. I already have a couple of
questions that I hope will help. Does the extension support custom namespaces?
And how many SQL queries will this generate? I know my website provider has a limit on the
number of SQL queries per user per hour. Also, if you would like to test on a wiki (other than
the CWRU wiki that you run), you can use my wiki, which should be up soon.

gregory.szorc wrote:

The good news about this extension is that it only generates 1 SQL query. The
bad news about the SQL query is that it can be massive, depending on the size of
the article being saved. However, this giant query only gets executed when a
page is actually saved.
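
To illustrate why a single query can still grow large, here is a sketch in Python using an in-memory SQLite database as a stand-in for the wiki's database. The table and column names (`page`, `page_title`, `page_namespace`) are modelled on the 1.5 layout but should be treated as assumptions here, as should the helper name `existing_titles`. The query carries one placeholder per distinct word in the saved text.

```python
import re
import sqlite3

def existing_titles(conn, text):
    """Return the main-namespace titles that match any word in text."""
    # Article titles start with a capital letter, so normalise each word.
    words = sorted({w[:1].upper() + w[1:] for w in re.findall(r"[A-Za-z]+", text)})
    if not words:
        return set()
    # One placeholder per distinct word: long articles mean a long IN list.
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT page_title FROM page "
        f"WHERE page_namespace = 0 AND page_title IN ({placeholders})"
    )
    return {row[0] for row in conn.execute(sql, words)}
```

Because the words are passed as bound parameters rather than pasted into the SQL string, the query stays safe even when the article text contains quotes.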

There are currently some limitations to the extension. The primary limitations
are the poorly written regular expressions. As it stands, the replacement
regular expression is the worst. It will replace text, but will mess up
formatting in the process. In addition, the script does not yet support
namespaces other than the main namespace. This change should be trivial,
however. Functions for generating links to internal topics can be found in
'includes/Title.php' (I believe).

Also, automatic wikification, although it sounds cool, has some drawbacks. When
I ran it on some test articles on http://wiki.case.edu, it would convert common
words like "case" to links because "Case" is the shorthand name of my
university. Unless I am mistaken, the MediaWiki hook system does not allow you
to return the text from the pre-save hook (which this extension is) and have the
user verify it.

If this extension is to be used in production environments, it will need
attention from those with more experience with regular expressions than I have.
Once those problems are fixed, I will attend to fixing the other issues.

jediarchives11 wrote:

Sorry, but what exactly do you mean by "regular expressions"? Also, to reduce the size of the
query, would it be helpful to have the extension determine what is different between the new
and old versions before scanning for links? That won't add links to articles created since
the last time the entire article was scanned, so maybe not. And you're right, it might be
best to have readers check it before it adds the links. If that's not possible, something
else would have to be done.

gregory.szorc wrote:

A regular expression is a method to match text patterns. They are a very powerful tool. See http://en.wikipedia.org/wiki/Regular_expression for more info.

Finding a diff between versions and then doing the substitution would be very difficult. You would have to extract the old contents, run a search on the new terms, and somehow do a string replace on the
autowikified links only in the new text. The last part seems a bit challenging.
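
One possible way to approach the diff idea, sketched in Python with the standard `difflib` module: compare the old and new text token by token and link only the tokens that the diff reports as inserted or replaced. The function name `link_new_text` and the title list are assumptions for illustration, and this does not address the concern about articles created after the last full scan.

```python
import difflib
import re

def link_new_text(old, new, titles):
    """Link matching words, but only in the parts of new that differ from old."""
    by_lower = {title.lower(): title for title in titles}
    # Split into word and non-word runs so the text can be rejoined unchanged.
    old_tokens = re.findall(r"\w+|\W+", old)
    new_tokens = re.findall(r"\w+|\W+", new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        for token in new_tokens[j1:j2]:
            title = by_lower.get(token.lower())
            # Only touch text that was added or changed in this edit.
            if tag in ("insert", "replace") and title is not None:
                token = f"[[{title}]]" if token == title else f"[[{title}|{token}]]"
            out.append(token)
    return "".join(out)
```

With old text "The case was closed." and new text "The case was closed. Later he studied physics.", only "physics" becomes a link; the pre-existing word "case" is left as the earlier editor wrote it.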

In all honesty, I think it would be more beneficial to spend time writing thorough documentation on creating links than working on this extension. When it comes to creating content, humans will always be able
to do a better job than computers. Automatic wikification, although cool, will not always be perfect.

An alternative to investigate would be a tool run by experienced wiki users that scans articles for possible links and prompts whether to change the text into a link.
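
A rough sketch of such a suggest-then-confirm tool, in Python. Both function names and the tuple format `(offset, word, title)` are hypothetical; the point is only that finding candidates and applying them are separate steps, with a human choosing in between. For brevity it does not skip words that are already inside a link.

```python
import re

def suggest_links(text, titles):
    """List candidate links as (offset, word, title) without changing text."""
    by_lower = {title.lower(): title for title in titles}
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        title = by_lower.get(match.group(0).lower())
        if title is not None:
            suggestions.append((match.start(), match.group(0), title))
    return suggestions

def apply_links(text, accepted):
    """Apply only the suggestions a reviewer accepted."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, word, title in sorted(accepted, reverse=True):
        link = f"[[{title}]]" if word == title else f"[[{title}|{word}]]"
        text = text[:start] + link + text[start + len(word):]
    return text
```

A reviewer shown the suggestions for "In this case, physics helps." could accept "physics" and reject "case", and only the accepted one would be written back.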

jediarchives11 wrote:

I don't know why I didn't think of this before, but could an exclude list / key /
attribute / column / whatever be the solution to at least one problem? For
example, make the "case" article exempt from automatic wikification. This can be
done whatever way makes it easiest to code. This would eliminate one major
problem: words with multiple meanings being turned into links when they
shouldn't be.
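
The simplest form of this idea is an exclude list that is consulted before any matching happens. A sketch in Python, where the function name and the `exclude` parameter are assumptions for illustration:

```python
import re

def autolink_with_exclusions(text, titles, exclude):
    """Link words that match a title, skipping titles on the exclude list."""
    excluded = {name.lower() for name in exclude}
    # Drop excluded titles up front so they can never match.
    by_lower = {t.lower(): t for t in titles if t.lower() not in excluded}

    def replace(match):
        word = match.group(0)
        title = by_lower.get(word.lower())
        if title is None:
            return word
        return f"[[{title}]]" if word == title else f"[[{title}|{word}]]"

    return re.sub(r"[A-Za-z]+", replace, text)
```

With "case" on the exclude list, "A case study in physics" gains a link for "physics" only.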

jediarchives11 wrote:

I feel that if someone who knows more coding could work on this, it could be made
much better.

jporter wrote:

I will help with the regexes.

jediarchives11 wrote:

Not sure how the regex stuff is going, but I have another question. Is it possible to
run this just once through the database by running the file on the internet, or does
it have to be done when pages are saved? What would I have to change to get it to
work that way?

jediarchives11 wrote:

I've been working on getting this to run for all pages in the main namespace at
one time, and it has become very confusing and frustrating. If ANYONE can help
out that knows MediaWiki and PHP, their help would be greatly appreciated. Thanks.

jediarchives11 wrote:

Second draft of extension

The second draft fixed a bug in the first draft that would take out the space
before the word that is linked. Also, the extension does not seem to be
linking phrases, although it should. Hopefully I will figure out how to make
an exclude list soon.

attachment wikification.php ignored as obsolete

Didn't you guys think about the possibility that this autowikification tool
links to too many articles? Let's face it, en: wikipedia is a big one and there
are a lot of articles about lots of different stuff. The result of this can be
almost totally blue text. This is somewhat unwanted. On the other hand, small
wikis have few articles and this would barely help. All in all, I think it's
a good idea, but it needs human control, IMO. After all, let's not forget that
red links aren't bad in small wikis - they are, on the contrary, helpful and good
for the project, but red links aren't a part of this extension, so I'll shut up. :)

jediarchives11 wrote:

Yes, I never expected this to be used on en: wikipedia. It would be used on small to
medium wikis, mainly to add links to things that the author didn't know about. I do
agree, however, that there should be some human control. Maybe instead of
automatically adding the links, instead just suggesting them and allowing the user to
choose which to include.

Soon, I will upload a new version, one that now includes a way to exclude pages. For
example, my wiki's about page kept linking. With the exclude list, you can add the
word "about" as a word to exclude. You can also use this to keep the number of links
down.

Lastly, you're right, red links are good, but as I said, this extension doesn't do
anything with them.

jediarchives11 wrote:

Version 0.3

This new version includes a way to exclude words by modifying the $excludelist
array. Eventually, you will be able to set this in LocalSettings.php. Also,
linking phrases now works correctly.

Unfortunately, new bugs have been discovered. The extension will not link the
last word in the article, and words followed by periods or commas (e.g. "home,")
will not be handled correctly.
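
Both bugs are typical of matching on surrounding spaces. One common way around them, sketched here in Python rather than in the extension's own code, is to match on word boundaries (`\b`), which hold next to punctuation and at the end of the text. The function name and title list are assumptions; trying longer titles first also lets phrases win over single words.

```python
import re

def link_phrases(text, titles):
    """Link titles (single words or phrases) using word-boundary matching."""
    if not titles:
        return text
    by_lower = {title.lower(): title for title in titles}
    # Longest titles first so "Quantum physics" wins over "Physics".
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0)
        title = by_lower[found.lower()]
        return f"[[{title}]]" if found == title else f"[[{title}|{found}]]"

    return pattern.sub(replace, text)
```

Here "home," is linked without swallowing the comma, and a title that is the last word or phrase of the text is still matched.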

attachment wikification.php ignored as obsolete

Wiki.Melancholie wrote:

For the German Wikipedia there is a wikifier on:

http://217.160.138.71/development/wikipedia/wikify/

It works fine and could serve as an example for this request.

jediarchives11 wrote:

Unfortunately I don't speak German, so if there is anyone that could translate
this to English or provide the code (with English comments) here, that would be
great.

*** Bug 4886 has been marked as a duplicate of this bug. ***

robchur wrote:

A note on running this on all pages once; if well-written, then a wrapper around
the code could be provided in the form of a custom maintenance script which
could rip through all article pages.
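
The shape of such a wrapper might look like the following Python sketch. Everything here is hypothetical: `pages` stands in for whatever the script reads from the wiki, and `linker` is any function that takes a text and a list of titles and returns the linked text. Leaving a page's own title out of the candidates, so that a page does not link to itself, is an added assumption rather than something discussed in this thread.

```python
def wikify_all(pages, linker):
    """Run linker over every page and return only the pages that changed.

    pages maps each title to its text; linker(text, titles) returns new text.
    """
    changed = {}
    for title, text in pages.items():
        # Offer every other title as a link target, but not the page itself.
        candidates = [other for other in pages if other != title]
        new_text = linker(text, candidates)
        if new_text != text:
            changed[title] = new_text
    return changed
```

Returning only the changed pages keeps the number of saves, and therefore the load on the database, proportional to the pages that actually gained links.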

*** Bug 7015 has been marked as a duplicate of this bug. ***

xplosion2 wrote:

Would it be possible to add an option that restricts wikification to a
"whitelist", so that nothing else would be considered?

I mean the opposite of the list in posting #16: an $includelist.

Thanks for support!

jediarchives11 wrote:

Version 0.45

Attached:

sumanah wrote:

Adding "need-review" keyword to indicate extension awaits review. jediarchives11, you might want to check whether an extension like this already exists (look on mediawiki.org) - if it doesn't, you should probably update your extension to work with MediaWiki as it is now, and then follow these instructions: https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment

Removing patch-needs-review: this is never going to be deployed on WMF, for non-technical reasons. Technically, it works only with $wgDBprefix = 'wiki', uses raw SQL, and pegs the master with requests perfectly suitable for slaves.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:02 AM
Aklapper removed a subscriber: wikibugs-l-list.