CirrusSearch includes the text of audio tags
Closed, Resolved (Public)

Description

Searching for "JavaScript disabled" finds pages that contain audio, because the text inside the audio tags is being indexed.

http://en.wikipedia.beta.wmflabs.org/w/index.php?search=%22JavaScript+disabled%22&title=Special%3ASearch&fulltext=1


Version: unspecified
Severity: normal

Details

Reference
bz53426

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:54 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz53426.

So the reason this is languishing is that the patch uses a GitHub dependency to parse the page into a DOM and then XPath to remove the audio tags. That is cool and all, but we're not super happy about taking on another dependency, and it is slow. We don't like slow.

BTW, libxml, and therefore PHP's built-in DOMDocument, doesn't support HTML5.
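
For illustration only, here is a rough sketch of the general idea (drop everything inside audio tags before extracting the text to index). It is not the abandoned patch and not the eventual fix: it is written in Python rather than PHP, it uses the standard library's streaming `html.parser` instead of a DOM plus XPath, and the names `AudioStrippingTextExtractor` and `extract_indexable_text` are hypothetical.

```python
from html.parser import HTMLParser


class AudioStrippingTextExtractor(HTMLParser):
    """Collect page text, skipping anything nested inside <audio>.

    Hypothetical helper for illustration; not part of CirrusSearch.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._audio_depth = 0  # greater than 0 while inside an <audio> element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "audio":
            self._audio_depth += 1

    def handle_endtag(self, tag):
        if tag == "audio" and self._audio_depth > 0:
            self._audio_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside an <audio> element.
        if self._audio_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the kept chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_indexable_text(html):
    """Return the text of `html` with <audio> fallback content removed."""
    parser = AudioStrippingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Fed a page whose audio tag carries fallback text mentioning "JavaScript disabled", this sketch returns only the surrounding text, so a search for that phrase would no longer match the page. A streaming parser avoids building a full DOM, which is one way to sidestep the speed concern, though whether that would hold for real page sizes is an assumption.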

We've abandoned that particular patch and will be rewriting it using some functionality being merged into core soon. For now I'm marking this On_Hold but I plan to get to it as soon as our piece is merged into core.

Change 85135 had a related patch set uploaded by Chad:
Overhaul wikitext formatting

https://gerrit.wikimedia.org/r/85135

Looks like the core change required for this (https://gerrit.wikimedia.org/r/#/c/84342/) just landed. I'd like to wait until it is in production before we merge this patch just so we can still deploy from master.

Change 85135 merged by jenkins-bot:
Overhaul wikitext formatting

https://gerrit.wikimedia.org/r/85135