
Implement ability to search wikitext of current Wikimedia wiki pages with regular expressions (regex)
Closed, ResolvedPublic

Description

Wikimedia should offer the ability to search the current wikitext of live wiki pages with regular expressions. This would be very helpful in identifying a wide range of problems across wikis.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65783

Details

Reference
bz43652

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:21 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz43652.
bzimport added a subscriber: Unknown Object (MLST).

I was just thinking about this issue again. Because of the nature of prose, you want to be able to search for patterns like "\bthe the\b", where "\b" is a word boundary, and other silliness like that. MapReduce works wonderfully for this (Ori and I tried BigQuery at some point). I think resolving this bug would be a good goal for 2014. We could even look at Labs instead of the production cluster, if needed. Copying some of the search folks, as this is fundamentally a search issue.
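
To make that kind of pattern concrete, here is a minimal Python sketch of the doubled-word search described above (the sample text is invented):

import re

# "\b" marks a word boundary, so this matches "the the" as two whole words.
pattern = re.compile(r"\bthe the\b", re.IGNORECASE)

sample = "In the the early days of the project, the wiki grew quickly."
print(pattern.findall(sample))  # ['the the']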

I wonder if this is something that could replace some of the more uncommon customizations that lsearchd did to improve recall. It might not be, because this is really an expert tool and those uncommon customizations (dash handling and stuff) affect everyone.

In any case, I think it might be useful to lean on search to cut down the list of pages that must be checked. Lucene search and Elasticsearch both seem well optimized for a "first pass" you'd use to identify candidates that might match the regex. I suppose it wouldn't always be the right thing to do, but it might be nice.
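
As a rough sketch of that two-pass idea (not necessarily how CirrusSearch would do it), assuming an Elasticsearch index named "enwiki" with the wikitext stored in a "source_text" field and the official Python client, a cheap phrase filter could pick candidates and the regex could then be verified against the stored wikitext:

import re
from elasticsearch import Elasticsearch

# Hypothetical index and field names; adjust to however the wikitext is stored.
es = Elasticsearch("http://localhost:9200")
regex = re.compile(r"\bthe the\b")

# First pass: a cheap phrase query narrows the candidate set.
resp = es.search(
    index="enwiki",
    body={"query": {"match_phrase": {"source_text": "the the"}}, "size": 500},
)

# Second pass: verify the actual regex against each candidate's wikitext.
for hit in resp["hits"]["hits"]:
    if regex.search(hit["_source"].get("source_text", "")):
        print(hit["_source"].get("title"))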

I like implementing this in labs because it could be a real performance drain on the production infrastructure if done there. OTOH, if we put the wikitext in Elasticsearch we could have it run the regexes pretty easily. The only trouble would be making sure the regexes don't cause a performance problem and I'm not sure that is possible.

(In reply to comment #3)

I like implementing this in labs because it could be a real performance drain
on the production infrastructure if done there. OTOH, if we put the wikitext
in Elasticsearch we could have it run the regexes pretty easily. The only
trouble would be making sure the regexes don't cause a performance problem
and I'm not sure that is possible.

Can you please ballpark how much work would be involved in setting up Elasticsearch with the most recent English Wikipedia page text (wikitext) dump on Labs for use with sane regular expressions? The current dump is about 19.1 GB compressed (cf. http://dumps.wikimedia.org/enwiki/20140102/).

I suppose that depends on how good you need it to be. I spent half an hour this morning and have an instance loading the data. It is using the wikipedia river, which is a toy thing the Elasticsearch folks maintain, ostensibly for testing. It isn't what we want in the end for a great many reasons, not least of which is that it munges the wikitext something fierce, but it is something. It was easy to set up and gives us something to play with.
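
For a sense of what a more direct loading step might look like (separate from the river just mentioned), here is a rough Python sketch that streams a pages-articles dump and bulk-indexes the raw wikitext into Elasticsearch; the index name, field names, and file name are only examples:

import bz2
import xml.etree.ElementTree as ET
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # namespace varies by dump version

def pages(path):
    # Stream the dump so the uncompressed XML never has to fit in memory.
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield {"_index": "enwiki", "_source": {"title": title, "source_text": text}}
                elem.clear()  # free memory as we go

es = Elasticsearch("http://localhost:9200")
bulk(es, pages("enwiki-20140102-pages-articles.xml.bz2"))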

I think what you are asking for is actually a few pieces:

  • A tool to keep the index up to date: my guess is this'd take a day to get to know labs, another day or two to get it working the first time, then about a week of bug fixes spread out over the first couple of months. (A rough sketch of what such an updater might look like follows this list.)
  • A tool to dispatch queries against it sanely: I'm less sure about this. Anywhere from a couple of days to a month, depending on surprises. I can't really estimate bug fixes because I could easily be wildly off the mark on the estimate for the tool itself.
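
As a very rough sketch of the first piece (just the shape of it, not a design), an updater could poll the MediaWiki recent changes API and re-fetch the wikitext of changed pages; the index name and polling interval here are arbitrary:

import time
import requests
from elasticsearch import Elasticsearch

API = "https://en.wikipedia.org/w/api.php"
es = Elasticsearch("http://localhost:9200")

def latest_wikitext(title):
    # Fetch the current wikitext for a single page via the MediaWiki API.
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "rvprop": "content",
        "titles": title, "format": "json", "formatversion": 2,
    })
    page = r.json()["query"]["pages"][0]
    return page["revisions"][0]["content"] if "revisions" in page else ""

while True:
    # Poll recent changes and refresh those pages in the index.
    changes = requests.get(API, params={
        "action": "query", "list": "recentchanges", "rcprop": "title",
        "rclimit": 100, "format": "json", "formatversion": 2,
    }).json()["query"]["recentchanges"]
    for change in changes:
        title = change["title"]
        es.index(index="enwiki", id=title,
                 body={"title": title, "source_text": latest_wikitext(title)})
    time.sleep(60)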

I'll play with the wikipedia river instance and see what kind of queries I can fire off against it manually.

Finally, if we forgo making the second tool, users could technically just use it as an Elasticsearch instance with wikitext on it. I'm not sure how many people that would be useful for, or what kind of protection it'd need to have. I imagine hiding it in the labs network and making folks sign in to labs with port forwarding would be safe enough.

I think this is a great idea, so I talked to Marc today about doing this in labs. He's on board with the idea, but confirmed my fear that it's bad timing. We're in the middle of trying to move labs to eqiad, so it's a bad time to set up a new service. I'm thinking we set this up like database replication to real hardware, then figure out how people can query against it.

In the meantime, I've started a page on wikitech: https://wikitech.wikimedia.org/wiki/Search/Labs_services. Let's work on hashing out some of the implementation details while we let ops finish the migration.

We have a dumpGrepper in the parsoid repository:
https://git.wikimedia.org/blob/mediawiki%2Fservices%2Fparsoid/ac5483ae6cba6be86989457ea7cf2ae6e460388a/tests%2FdumpGrepper.js

Quickstart:
git clone https://gerrit.wikimedia.org/r/p/mediawiki/services/parsoid
cd parsoid
npm install libxmljs
cd tests
nodejs dumpGrepper --help # show options
zcat dump.xml.gz | nodejs dumpGrepper <regexp>

(In reply to Gabriel Wicke from comment #7)

We have a dumpGrepper in the parsoid repository

Yeah, there's a plethora of tools to search dumps for text. This is about searching the real-time indexes, though :)

Bug 54503 has been marked as a duplicate of this bug.

https://gerrit.wikimedia.org/r/#/c/137733/

The patch isn't fully ready yet, but it's on its way.

Before we merge it I'd like to get a better response to regex syntax errors than we have now.

I can live with deploying it without the optimization in Elasticsearch that'll make it faster and more memory efficient. That would be nice but isn't required.

I can also live with deploying it without highlighting, so long as we get highlighting "real soon" afterwards.

The patch has been merged and should be deployed (according to [[mw:MediaWiki 1.24/wmf10]] and looking at [[Special:Version]]), but the insource: prefix doesn't seem to work yet. Any hints why that could be?

This one requires that the index be rebuilt before it'll work. I tried to make that clear in my email to ambassadors about it but I should have posted that here as well.

The reindex is proceeding as quickly as I'm able to get it:

  1. group0 wikis have been done since Monday.
  2. group1 wikis are about 60% of the way through the process.
  3. wikipedias are about 10% of the way through the process.

One of the problems with the whole expand-templates thing that Cirrus does is that the reindex process is really slow when we have to dip all the way down into MediaWiki to get the data, which we do in this case. We might have been able to build a one-shot tool to do this, but that would have been a big chunk more work and a bit more risk of failure. What we're doing now we've done many times before. Safe = easier for a team of two to manage.

Anyway, some wikis are starting to see it:
https://en.wikipedia.org/wiki/Special:Search/insource:/a/
https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=default&search=url+insource%3A%2F%26title%3Dfoo%2F&fulltext=Search

As promised, it's quite slow (~30 seconds on enwiki). Right now it's still using the timeouts from full-text search. If that becomes a problem we'll have to raise them and look at other tricks to speed this up.

Marking this resolved now that it's working, pending index rebuilds. I (or you) can verify it once it works on your wiki.