
Weblinks not found by CirrusSearch on Wikimedia Commons
Closed, Resolved, Public

Details

Reference
bz59205

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:15 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz59205.
bzimport added a subscriber: Unknown Object (MLST).

For those following along at home:

  1. Make sure you disable the "New Search" BetaFeature or else both searches use Cirrus.
  2. I had more luck reproducing the behavior by searching "Everything":

CirrusSearch: https://commons.wikimedia.org/w/index.php?title=Special:Search&search=http%3A%2F%2Fwww.niag-online.de%2Fdownloads%2F2012-11-16_niag-kleve_sb58_nov2012.pdf&fulltext=Search&profile=all&redirs=1&srbackend=CirrusSearch
LuceneSearch: https://commons.wikimedia.org/w/index.php?title=Special:Search&search=http%3A%2F%2Fwww.niag-online.de%2Fdownloads%2F2012-11-16_niag-kleve_sb58_nov2012.pdf&fulltext=Search&profile=all&redirs=1

LuceneSearch finds the URL because it appears inside the wikitext. CirrusSearch doesn't, because the URL doesn't appear in the page _text_. The URL appears as the href attribute of an anchor tag:
<a class="external text" href="http://www.niag-online.de/downloads/niag-kleve_lile_sb58.pdf" rel="nofollow">

Because CirrusSearch renders the wikitext to HTML and then removes all the tags, it only sees the text of the link.
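
To illustrate the effect, here is a minimal Python sketch of tag stripping. It is not CirrusSearch's actual code; the class `TextOnly`, the helper `strip_tags`, and the link text "Fahrplan" are made up for this example. It only shows why a URL that lives in an href attribute drops out of the text once the tags are removed.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps character data, drops tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # visible text between tags
        self.hrefs = []   # link targets, which a text-only index never sees

    def handle_starttag(self, tag, attrs):
        # Record href values separately so we can show what gets lost.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return (visible_text, hrefs) for a fragment of rendered HTML."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.hrefs


URL = "http://www.niag-online.de/downloads/niag-kleve_lile_sb58.pdf"
# "Fahrplan" is placeholder link text, not taken from the real page.
rendered = (
    '<p>See <a class="external text" href="' + URL + '" rel="nofollow">'
    "Fahrplan</a>.</p>"
)

text, hrefs = strip_tags(rendered)
print(text)         # the visible text, with no URL in it
print(URL in text)  # False: the URL was only in the href attribute
print(hrefs)        # the URL survives only if attributes are kept
```

Indexing `text` alone gives a search for the URL nothing to match, while the raw wikitext that LuceneSearch indexes still contains it.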

I'm going to mark this bug as a duplicate of the older bug we've filed for the problem, but raise the priority of the other bug.

*** This bug has been marked as a duplicate of bug 52905 ***