
Ghost categories in wanted categories
Closed, Resolved (Public)

Description

There are several categories on it.wiki which are listed as wanted categories but do not contain any real pages. Only previously deleted categories seem to be affected (though evidently not all of them).
They are:

  1. Aree gaeltacht ‎(4 elementi)
  2. Film con trama ‎(4 elementi)
  3. Comuni Svizzeri ‎(3 elementi)
  4. Da finire Spagnolo ‎(2 elementi)
  5. Stub Biografie ‎(2 elementi)
  6. Storia dell'antico Egitto ‎(2 elementi)
  7. Rugbysti italiani ‎(1 elemento)
  8. Autori latini ‎(1 elemento)
  9. Artisti manga ‎(1 elemento)
  10. Rugbysti gallesi ‎(1 elemento)
  11. Arte di Siena ‎(1 elemento)
  12. Regine dei Belgi ‎(1 elemento)
  13. Laghi dell'Italia ‎(1 elemento)
  14. Internet e società ‎(1 elemento)
  15. Da finire Inglese ‎(1 elemento)
  16. Categorie orfane ‎(1 elemento)
  17. Trama ‎(1 elemento)
  18. Telefilm ‎(1 elemento)
  19. Stub Informatica ‎(1 elemento)
  20. Stub Cinema ‎(1 elemento)
  21. Strumenti ‎(1 elemento)
  22. Storia dell’Egitto ‎(1 elemento)

Version: unspecified
Severity: trivial
URL: http://it.wikipedia.org/wiki/Speciale:CategorieRichieste

Details

Reference
bz15152


Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:19 PM
bzimport set Reference to bz15152.
bzimport added a subscriber: Unknown Object (MLST).

This special page displays cached content. It is updated from time to time
and might not reflect the actual state.

jalo75 wrote:

http://it.wikipedia.org/wiki/Speciale:CategorieRichieste

It is updated once a day, but these categories have been visible since June, or earlier.

After talking with people in #wikipedia-it I am reopening this bug.

Some categories have been listed as wanted for a long time although
they have no articles.

SELECT cl_to, count(*) FROM categorylinks
LEFT JOIN page ON cl_to=page_title AND page_namespace='14'
WHERE page_title IS NULL AND cl_to='Stub_Biografie'
GROUP BY cl_to;
+----------------+----------+
| cl_to          | count(*) |
+----------------+----------+
| Stub_Biografie |        2 |
+----------------+----------+
1 row in set (0.00 sec)

Looking at the categorylinks table:

SELECT * FROM categorylinks WHERE cl_to='Stub_Biografie' \G
*************************** 1. row ***************************
     cl_from: 86979
       cl_to: Stub_Biografie
  cl_sortkey: Quinto Aurelio Simmaco
cl_timestamp: 20050430032231
*************************** 2. row ***************************
     cl_from: 85110
       cl_to: Stub_Biografie
  cl_sortkey: Rabindranath Tagore
cl_timestamp: 20050430032208
2 rows in set (0.01 sec)

Most probably, refreshLinks needs to clean out the categorylinks rows whose
cl_from page no longer exists :)
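
For illustration, here is a minimal sketch of that situation using Python's sqlite3 module. The schema is a toy one holding only the columns used in this thread (the real tables have more), and the sample rows are assumptions made up for the example, not data taken from it.wiki. A categorylinks row counts as a ghost when its cl_from matches no page_id.

```python
import sqlite3

# Toy schema: only the columns this thread's queries touch (an assumption,
# not the full MediaWiki table definitions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT
);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);

INSERT INTO page VALUES (1, 0, 'Existing_article');

-- Page ids 86979 and 85110 have no row in page here, mimicking deleted pages.
INSERT INTO categorylinks VALUES
    (86979, 'Stub_Biografie'),
    (85110, 'Stub_Biografie'),
    (1, 'Biografie');
""")

# Anti-join: keep the categorylinks rows whose source page is missing.
ghost_rows = conn.execute("""
    SELECT cl_from, cl_to
    FROM categorylinks
    LEFT JOIN page ON cl_from = page_id
    WHERE page_id IS NULL
    ORDER BY cl_from
""").fetchall()

print(ghost_rows)
```

In this toy data the two Stub_Biografie rows come back as ghosts, while the row pointing at the existing page does not.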

maintenance/refreshLinks.php:

# this bit's bad for replication: disabling temporarily
# --brion 2005-07-16
//deleteLinksFromNonexistent();

The problem with that cleanup is that it can potentially take a *very* long time. Because of the way replication is serialized in MySQL, long-running write queries disrupt the replication stream: while one runs, slaves lag significantly behind the master, which causes user-visible disruption (old data served out), too much load being diverted to the master, or some combination of the two.

Additionally, a single giant query of that sort is likely to get rolled back as other processes make updates to the table while it's working.

To be replication-friendly, it may need to be broken down into smaller batches, updating up to a few hundred rows at a time.
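
A rough sketch of what such batching could look like, again in Python with sqlite3 and a toy schema. This is not the refreshLinks.php implementation: the function name delete_ghost_links, the batch_size parameter and the sample rows are all hypothetical, and SQLite has no replication, so the point where a real script would wait for slaves to catch up is only marked by a comment.

```python
import sqlite3

def delete_ghost_links(conn, batch_size=200):
    """Delete categorylinks rows whose cl_from has no matching page,
    handling at most batch_size source page ids per transaction."""
    total = 0
    while True:
        # Pick a small batch of source ids that no longer exist in page.
        ids = [row[0] for row in conn.execute(
            "SELECT DISTINCT cl_from FROM categorylinks "
            "LEFT JOIN page ON cl_from = page_id "
            "WHERE page_id IS NULL LIMIT ?", (batch_size,))]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        cur = conn.execute(
            f"DELETE FROM categorylinks WHERE cl_from IN ({placeholders})",
            ids)
        conn.commit()  # one short write transaction per batch
        total += cur.rowcount
        # A production script would pause here until replication lag is low.

# Toy data (assumed for the example): page 1 exists, ids 101-105 do not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1);
INSERT INTO categorylinks VALUES
    (1, 'Biografie'),
    (101, 'Stub_Biografie'), (102, 'Stub_Biografie'),
    (103, 'Trama'), (104, 'Telefilm'), (105, 'Strumenti');
""")

deleted = delete_ghost_links(conn, batch_size=2)
remaining = conn.execute("SELECT cl_from, cl_to FROM categorylinks").fetchall()
print(deleted, remaining)
```

Each pass issues one short DELETE instead of a single giant one, so no individual write runs long enough to hold up the replication stream, and a batch that fails can be retried without redoing the earlier ones.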

Thanks Brion for the explanation. I will code something safer.

Note that the deletion of ghost entries in refreshLinks.php is now batched and replication-friendly. Possibly merge this issue with bug 12168?

There are other ghost categories:

  1. Biografia ‎(2 elementi)
  2. Componenti elettronici ‎(1 elemento)
  3. Sovrani greci ‎(1 elemento)
  4. Campionato di calcio italiano ‎(1 elemento)
  5. Template condizionali ‎(1 elemento)
  6. Album dei Doors ‎(1 elemento)
  7. Specie (uccelli) ‎(1 elemento)

Looks like Roan resolved bug 12168. This is still an issue on a lot of wikis. It may make sense to run the refresh script on all of them. If not, I'd like to add en.wiki to this request (or maybe I should file a new bug?).

Looks like we want refreshLinks to be run on itwiki and enwiki. I will consult with the ops folks on Monday and run the script then if there are no objections from them. Even though refreshLinks is supposed to be safe now, I'd rather not run it against such large wikis on a Saturday.

It's an issue on other wikis as well. Using the Toolserver's copy of the databases:

mysql> SELECT c.* FROM categorylinks c
    -> LEFT JOIN page ON cl_from = page_id
    -> WHERE page_id IS NULL AND cl_from > 0;

  • dewiki_p: 379 rows
  • frwiki_p: 188 rows
  • enwiki_p: 3842 rows
  • ruwiki_p: 154 rows

(In reply to comment #11)

I've used the same query (omitting cl_from > 0) to track down and delete ghost entries on dewiki, frwiki, enwiki, ruwiki and itwiki. This is not a substitute for running refreshLinks of course, but that takes a long time on such large wikis.

EN.WP.ST47 wrote:

Since the solution to this bug appears to be running a maintenance script, I'm changing this from MediaWiki->Special Pages to Wikimedia->Site Requests.

Yes. For example, a ghost element is still present in "Category:Biografie". In the categorylinks table from the 2012-01-09 dump I found this entry:
(2447501,'Biografie','Stoermer ,Mark August','2010-01-27 19:15:42','','','page')
but that page has been deleted.
Please run refreshLinks.php on all wikis periodically.

*** This bug has been marked as a duplicate of bug 16112 ***