
Run "refreshLinks.php --dfn-only" on all wikis periodically
Closed, Resolved, Public

Description

Author: beau

Description:
[[Special:Wantedfiles]] and [[Special:Wantedtemplates]] have been added recently. However, like all other Special:Wanted* pages they are pretty useless, because the *links tables contain lots of stale rows (some pages and categories are listed, but nothing actually links to them). Dead links (those coming from deleted/nonexistent pages) should be removed from the database.

What is more, [[Special:Wantedfiles]] lists files placed in the shared repository... Refreshing such pages is a waste of resources.


Version: unspecified
Severity: minor

Details

Reference
bz16112

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:20 PM
bzimport set Reference to bz16112.
bzimport added a subscriber: Unknown Object (MLST).

Same on dewiki and probably other projects. Changed summary to reflect that and added the shell keyword.

*** Bug 16603 has been marked as a duplicate of this bug. ***

Trevor -- couple quick notes on this:

This action can be done via an existing maintenance script:
php maintenance/refreshLinks.php --dfn-only

However, the implementation (deleteLinksFromNonexistent() in refreshLinks.inc) isn't currently feasible for Wikimedia's use because it's a potentially very slow query, which can mess with our DB replication and disrupt the site for users until the slave databases catch up.

Currently it's doing a single DELETE per table to clear out all matching rows. I'd recommend breaking this out into two parts for each table (a rough sketch follows the list):

  1. SELECT the relevant page ID numbers (those for which no page record exists).
  2. DELETE the matching rows from the link table, preferably in batches. (One at a time means it'll be very slow if there are many results; doing all at once means we might disrupt replication or hit SQL limits.)
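
For illustration, a rough sketch of that two-step approach for the pagelinks table (the table and column names come from the MediaWiki schema; the example IDs and chunk size are made-up assumptions, not what refreshLinks.php actually does):

-- Step 1: collect the page IDs that appear in pagelinks but have no page row.
SELECT DISTINCT pl_from
FROM pagelinks
LEFT JOIN page ON pl_from = page_id
WHERE page_id IS NULL;

-- Step 2: delete the matching link rows in chunks (the IDs below stand in for
-- one chunk of roughly 500 results from step 1), waiting for replication to
-- catch up between chunks.
DELETE FROM pagelinks
WHERE pl_from IN (12345, 12346, 12350);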

Once the function's cleaned up we can go ahead and run it on the live sites.

*** Bug 16895 has been marked as a duplicate of this bug. ***

This should now be possible to clear up, as the function has been rewritten in r45431 (http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=45431).

Citing Brion:
03:11 < brion> w00t
03:11 < brion> i'll poke over it tomorrow

An updated (and actually working) version was committed about four weeks ago [1]. The Wikimedia wikis have been updated, so this commit is available on the servers.
Has Brion, or any other administrator, had time to run the script?

[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/45721

mike.lifeguard+bugs wrote:

for r45721:

  • 23:59, 14 January 2009 Brion VIBBER (Talk | contribs | block) changed the status of this revision [removed: new added: ok]

So, it just needs to be run on shell. I don't know if Trevor will do that - could be assigned to wikibugs or someone who does shell requests.

Wiki.Melancholie wrote:

Just want to give two examples:

  • [[wikt:de:Spezial:Linkliste/Template:Französisch|Template:Französisch]]
  • [[wikt:de:Spezial:Linkliste/Template:Substantiv-Tabelle (Englisch)|Template:Substantiv-Tabelle (Englisch)]]

are both listed at [[wikt:de:Special:WantedTemplates]], and have been for a long time now. "Substantiv-Tabelle (Englisch)" is an extremely old example, like some others, just stuck. They all claim "1 link" or more exists, but there are none. Latest update: 06:24, 18 Sep 2009. We have been watching this on dewikt for quite some time now.

So the script mentioned above should either be run again (maybe regularly) or for the first time.
Please see Mike's comment #7:

So, it just needs to be run on shell. I don't know if Trevor will do that - could be assigned to wikibugs or someone who does shell requests.

*** Bug 21962 has been marked as a duplicate of this bug. ***

Can this please be executed any time soon?
For all wikis, and a special request for nl.wikipedia =)

Krinkle

I have refreshed links on nlwiki:

$ php refreshLinks.php --wiki nlwiki --dfn-only
Retrieving illegal entries from pagelinks... 0..100..200..300..312
Retrieving illegal entries from imagelinks... 0..100..110
Retrieving illegal entries from categorylinks... 0..100..200..243
Retrieving illegal entries from templatelinks... 0..100..200..300..400..500..600..700..800..900..1000..1100..1200..1300..1400..1500..1600..1700..1800..1900..2000..2100..2195
Retrieving illegal entries from externallinks... 0..10
$

As well as Wantedcategories, Wantedfiles and Wantedtemplates:

Wantedfiles got 5000 rows in 1m 33.69s
Wantedcategories got 12 rows in 22.64s
Wantedtemplates got 33 rows in 2m 34.86s

We still have to run the link refresher on all wikis.

This probably blocks bug 24480.

pdhanda wrote:

I think this was a request to run refreshLinks.php and updateSpecialPages.php more regularly on all wikis. Re-assigning to default assignee for someone to pick up.

(In reply to comment #13)

I think this was a request to run refreshLinks.php and updateSpecialPages.php more regularly on all wikis. Re-assigning to default assignee for someone to pick up.

It wouldn't be bad to do this once, though. We have had some errors for years now, e.g. bug 24480.

An alternative is to exclude non-existing pages from the count: bug 32395

(In reply to comment #16)

An alternative is to exclude non-existing pages from the count: bug 32395

A better approach than that bug would be to run the maintenance script "refreshLinks.php --dfn-only" regularly (weekly or monthly) before "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed the Wanted* special pages. In theory these ghost entries also affect other query pages like Special:MostLinkedPages, but with counts that large the ghost entries are hard to verify.

Is it possible to add that script to the existing cron job? Thanks.

(In reply to comment #17)

(In reply to comment #16)

An alternative is to exclude non-existing pages from the count: bug 32395

A better approach than that bug would be to run the maintenance script "refreshLinks.php --dfn-only" regularly (weekly or monthly) before "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed the Wanted* special pages. In theory these ghost entries also affect other query pages like Special:MostLinkedPages, but with counts that large the ghost entries are hard to verify.

Is it possible to add that script to the existing cron job? Thanks.

That sounds like the subject of a separate bug/ticket.

(In reply to comment #18)

(In reply to comment #17)

(In reply to comment #16)

An alternative is to exclude non-existing pages from the count: bug 32395

A better approach than that bug would be to run the maintenance script "refreshLinks.php --dfn-only" regularly (weekly or monthly) before "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed the Wanted* special pages. In theory these ghost entries also affect other query pages like Special:MostLinkedPages, but with counts that large the ghost entries are hard to verify.

Is it possible to add that script to the existing cron job? Thanks.

That sounds like the subject of a separate bug/ticket.

I am not sure, because either you have to run "refreshLinks.php --dfn-only" once on each wiki to fix this bug, or you add it to the cron job and wait until the cron job has run on each wiki, and then this bug and my comment are both addressed. But feel free to clone this bug, if necessary.

No action for weeks? Is it possible to get a decision on running it once versus running it in a cron job? And then either run it or update the cron job?

Thanks.

[[:w:it:Special:WantedTemplates]] has the same issue.
Today, 24 of the 50 most wanted templates have 0 (zero!) transclusions in live pages:

  • Template:Geobox statoNoQuadre (58 links)
  • Template:Geobox ISTAT (58 links)
  • Template:Geobox festivo (58 links)
  • Template:Geobox patrono (58 links)
  • Template:Geobox catasto (58 links)
  • Template:Geobox coordinate comuni (55 links)
  • Template:Geobox comuniSmall (55 links)
  • Template:Da aiutare mese (47 links)
  • Template:Da aiutare (46 links)
  • Template:Wik (44 links)
  • Template:Stub comuni (25 links)
  • Template:Geografia/colorestub (25 links)
  • Template:Musica (23 links)
  • Template:Letteratura (18 links)
  • Template:URSSPD (18 links)
  • Template:Da tradurre (16 links)
  • Template:Link esterni (15 links)
  • Template:Qif (15 links)
  • Template:Stub bio (14 links)
  • Template:Film/rinvio (13 links)
  • Template:Cinema/rinvio (13 links)
  • Template:Da wikificare (12 links)
  • Template:NavigazioneSport (11 links)
  • Template:Trama (9 links)

And most of the others have wrong counts.

Please run the script periodically on all wikis.

Thanks.

Reducing scope of this bug; let's open a separate one for different requests and try to solve at least part of the problems at last.

*** Bug 15152 has been marked as a duplicate of this bug. ***
*** Bug 27480 has been marked as a duplicate of this bug. ***

I am requesting a periodic run, because the ghost entries sometimes come back. Thanks.

Turning into an ops request, raising priority, and filing a ticket with ops (RT #2355)

(In reply to comment #26)

Turning into an ops request, raising priority, and filing a ticket with ops (RT #2355)

What's the status of this?

(In reply to comment #27)

(In reply to comment #26)

Turning into an ops request, raising priority, and filing a ticket with ops (RT #2355)

What's the status of this?

Status: NEW

robla and CT have been communicating on this, most recently on 2-13. I've pinged CT.

What is the status of this? Thanks.

Next week is over; what is the status of the RT ticket? Thanks.

Sam and Mutante just handled something on this. I've asked them to update the ticket.

(In reply to comment #35)

Sam and Mutante

Should have said "Daniel"...

Here's a suggestion (via Puppet, of course):

https://gerrit.wikimedia.org/r/#patch,sidebyside,5104,3,manifests/mediawiki.pp

Feel free to comment directly on the code in Gerrit if you like.

Gerrit change 5104 was successfully merged. Thanks!

Is the cron job run every day, at the hour matching the number in the cluster name?

How does monitoring of that cron job work when it fails or does not run?

Now that this has (almost) been fixed, someone may want to look into the similar bug 27480 to see what's needed there.

@Umherirrender: the cron job ran for all clusters except s1. Yes, at the hour of the number in the cluster name, at first. But since s1 failed I disabled the others (they had just refreshed successfully anyway), and I am now running it on s1 again manually in a screen. The cron jobs write logfiles to the local filesystem in /home/mwdeploy/refreshLinks.

(In reply to comment #40)

Now that this has (almost) been fixed, someone may want to look into the similar bug 27480 to see what's needed there.

It is not a similar request, because this bug requests a periodic run.

refreshLinks with --dfn-only is *only* SQL, which runs on the cluster (and may be too slow or too heavy for enwiki).

refreshLinks without --dfn-only means reparsing all pages; doing that periodically does not sound like a good idea...

In my opinion it is enough to run this script once a month, or at least right before updateSpecialPages runs (every 3 days), because that is the only place where you see the ghost entries. But if an enwiki run takes hours, it will also delay updateSpecialPages, which is not a good idea.

After repeatedly trying the run on cluster s1 and having it fail, for now I did this:

The refreshes on clusters s2-s7 all seem to be working fine, so those cron jobs are now changed to just run once monthly automatically. To keep it simple: s2 on day 2 of the month, s3 on day 3 and so on, always at midnight. So that would resolve the ticket, except that:

s1 stays deactivated in the automatic crons for now.

*** Bug 33817 has been marked as a duplicate of this bug. ***

daniel, anything new with s1?

(In reply to comment #46)

daniel, anything new with s1?

The problem is with the script (due to the sheer size of the enwiki database, I guess), so it most likely isn't Daniel's problem to fix.

I'm running it again manually to try and see what was wrong with it (I can't remember), but I guess the fix is going to be to query the highest page ID, and do batches of X (100,000? 1M?) up to the page count.

The original queries take an age, and the fix isn't going to attempt to load it all at once.

mysql> explain select DISTINCT pl_from from pagelinks LEFT JOIN page ON pl_from=page_id;
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                      | rows      | Extra                        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
|  1 | SIMPLE      | pagelinks | index  | NULL          | pl_from | 265     | NULL                     | 624327870 | Using index; Using temporary |
|  1 | SIMPLE      | page      | eq_ref | PRIMARY       | PRIMARY | 4       | enwiki.pagelinks.pl_from | 1         | Using index; Distinct        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
2 rows in set (0.01 sec)

Removing the DISTINCT would make things simpler... If we kept a client-side count and removed the DISTINCT, would this work for us?
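
For what it's worth, here is a sketch of the range-batched variant suggested above, assuming a driver script walks pl_from in fixed ranges of 100,000 up to the highest ID (the range bounds and batch size are illustrative assumptions, not what the script currently does):

-- Find the upper bound for the batching loop.
SELECT MAX(pl_from) FROM pagelinks;

-- One iteration of the loop: only rows in the current pl_from range are
-- scanned, so no single query walks the whole ~624M-row index. DISTINCT is
-- dropped as suggested above; duplicates would be handled client-side.
SELECT pl_from
FROM pagelinks
LEFT JOIN page ON pl_from = page_id
WHERE page_id IS NULL
  AND pl_from BETWEEN 0 AND 99999;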

mails2vichu wrote:

Can anyone tell me how to get assigned to a bug?
please mail me at :- mails2vichu@gmail.com

Vishnu: by adding an "I plan to work on this" comment here; however, this might be a harder one to contribute to.

JohnLewis claimed this task.

Seems to be done.

JohnLewis set Security to None.

So... Guess we're just missing wikitech here? Or does that not count? :)

Krenair claimed this task.

I guess maintenance scripts on wikitech/silver are a larger issue; I'll open a separate ticket.