
Allow search in the raw wiki text source via insource:
Closed, Resolved (Public)

Description

The German community held a vote on a "technical wish list". "Source search" made it into the top 20 wishes. See [[w:de:WP:Umfragen/Technische_Wünsche/Suche#Wünsche]].

I had the chance to talk with one of the current CirrusSearch developers and we think it should be fairly easy to implement this: In addition to the current field (which contains the visible text only) we could add a second field that contains the plain, untransformed wiki text. I suggest a keyword "insource:..." to allow searching this field. This could be very powerful in combination with the existing "hastemplate:...".
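To make the proposed keyword a bit more concrete, here is a minimal sketch of how a query string might be split into `insource:` terms (to be matched against the raw wiki text field) and everything else. The function name `split_insource` and the quoting rules are assumptions for illustration only; this is not how CirrusSearch parses queries.

```python
import re

# Hypothetical helper, not CirrusSearch code: pulls `insource:` terms out of
# a search string. Accepts a bare word (insource:foo) or a double-quoted
# phrase (insource:"foo bar").
_INSOURCE = re.compile(r'\binsource:(?:"([^"]*)"|(\S+))')


def split_insource(query):
    """Return (source_terms, remaining_query)."""
    source_terms = [
        m.group(1) if m.group(1) is not None else m.group(2)
        for m in _INSOURCE.finditer(query)
    ]
    # Drop the matched keywords and normalise the leftover whitespace.
    remaining = " ".join(_INSOURCE.sub(" ", query).split())
    return source_terms, remaining
```

With this sketch, a query such as `hastemplate:Infobox insource:"{{DEFAULTSORT" cats` would yield one source term to look up in the untransformed text, while `hastemplate:Infobox cats` is handled as before.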

Possible problems:

  1. This will roughly double the size of the index. Is this worth it?
  2. Stemming should be disabled on this field, if that's possible. And it probably needs a few more tweaks.
  3. Searching for special characters can't work, right?
  4. Can this still work if we switch to Parsoid some day? It should, right?

Version: master
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=43652

Details

Reference
bz65783

Related Objects

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:24 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz65783.
bzimport added a subscriber: Unknown Object (MLST).

Change 137733 had a related patch set uploaded by Manybubbles:
Basic insource support

https://gerrit.wikimedia.org/r/137733

(In reply to Gerrit Notification Bot from comment #2)

Change 137733 merged by jenkins-bot:
Insource support

https://gerrit.wikimedia.org/r/137733

Whoa, really?

(In reply to MZMcBride from comment #3)

Whoa, really?

Yep, should start making its way live with the next wmf branch come Thursday.

(In reply to Chad H. from comment #4)

Yep, should start making its way live with the next wmf branch come Thursday.

Caveats for regexes:

  1. It's kinda slow.
  2. We only allow 2 concurrent queries at a time.
  3. We have a maximum queue of 10. This is to keep more than 12 Apaches from being stuck waiting for it.
  4. Syntax error feedback is only OK, not great.
  5. If you fill up the queue then you won't get a useful error message.
  6. No highlighting of results at all. Something I'll work on fixing in the next couple of weeks.
  7. It's going to take some time after the initial release for all pages to be indexed. We didn't have the source indexed before, so we'll have to regenerate all the documents, and we didn't write anything fancy to do just the source, so we'll end up rerendering everything. It's slow, but it'll work.
  8. The regex language is actually Lucene's regex, which is designed to be efficient rather than super expressive. I chose it because it's safe.
  9. Other stuff I don't remember?
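The limits in caveats 2 and 3 (2 running plus 10 queued, so at most 12 callers tied up) can be sketched roughly as below. This is only an illustration of the idea under assumed names (`BoundedGate`, `run`); the real limiting is done on the search cluster side and is not shown in this task.

```python
import threading


class BoundedGate:
    """Admit at most `running` callers at once; let at most `queued` more wait."""

    def __init__(self, running=2, queued=10):
        # Callers that are actually executing a query.
        self._slots = threading.Semaphore(running)
        # Callers that are either executing or waiting for a slot.
        self._tickets = threading.Semaphore(running + queued)

    def run(self, fn, *args):
        # Fail fast when the queue is full instead of piling up more waiters.
        if not self._tickets.acquire(blocking=False):
            raise RuntimeError("regex search queue is full")
        try:
            with self._slots:  # blocks while `running` callers are busy
                return fn(*args)
        finally:
            self._tickets.release()
```

In this sketch the thirteenth simultaneous caller gets an immediate error rather than waiting, which is the behaviour caveat 5 refers to (minus a useful message).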

Docs are here: https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures#insource:

We were tired of waiting for ops to build out infrastructure for easy copying to labs. So we figured we'd just make it in prod and limit it to a few executors. Hopefully everything will be just fine. We might decide, but haven't yet, that it'd be best to limit it to users with a permission, or signed-in users, or something. We'd only do that if we saw that it was crushing us or that some asshole was keeping the queue full so that no legitimate users could use it.

(In reply to Nik Everett from comment #5)

We were tired of waiting for ops to build out infrastructure for easy
copying to labs.

Well it's a lower priority for Swift than say image storage, so I understand the delay. We still want it though for backups and labs :)

(In reply to Chad H. from comment #6)

Well it's a lower priority for Swift than say image storage, so I understand
the delay. We still want it though for backups and labs :)

Yeah! I totally want backups! I just was tired of waiting for it for regexes. Hopefully it won't turn out to be a mistake.