Page MenuHomePhabricator

CirrusSearch: Don't index text in visually hidden elements
Closed, DuplicatePublic

Description

Cirrus shouldn't index the contents of visually hidden elements. I haven't confirmed that we do this but I wouldn't be surprised. We're not going to boot a browser and run javascript use that to figure out if the contents are hidden, but we should be able to do something.


Version: unspecified
Severity: normal

Details

Reference
bz60484

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:06 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz60484.

I'd like to get to work on this but I'm not really sure what classes we normally use for hidden text.

I'm wondering if we should avoid using any existing classes for this and go a completely different route. Introduce something like a "nosearch" class and document it as being completely ignored by search.

I'll just do the nosearch class thing. This also has the neat side effect of making it possible to highlight all text on the page that is not included in search by injecting css like so:
.nosearch {

background-color: red;

}

Change 112500 had a related patch set uploaded by Manybubbles:
Don't search text wrapped in nosearch class

https://gerrit.wikimedia.org/r/112500

(In reply to Chad H. from comment #2)

I'm wondering if we should avoid using any existing classes for this and go
a completely different route. Introduce something like a "nosearch" class
and document it as being completely ignored by search.

How many such classes do we have, is there a list somewhere? Do we have an estimate of how much overlap there would be with "noprint" and "nomobile"? There were some complaints when "nomobile" was "spinned off" from noprint, IIRC. Users who complained about excessive indexing might be a suitable target whom to ask an estimate of where they'd like to add such a class on a live wiki.

jsalsman wrote:

(In reply to Nik Everett from comment #0)

Cirrus shouldn't index the contents of visually hidden elements.

Many of the "hat" templates used to elide comments are referred to as "archive" templates, but the templates are not removed when the pages are actually archived.

For that reason, I recommend not excluding them from search.

Was there a different use case?

The biggest case I can think of for excluding text from search is the license information on commons. Please take that as an example. Maybe it is the only example I think it is pretty important.

  1. The license information doesn't add a whole lot to the result. Try searching commons with Cirrus for "distribute", "transmit", or "following" and you'll very quickly start to see the text of the CC license. And the searches find 14 million results. Heaven forbid you want to find "distributed transmits" or something. You'll almost exclusively get the license highlighted and you'll still find 14 million results. This isn't _horrible_ because the top results all have "distribute" or "transmit" in the title but it isn't great.
  2. Knock on effect from #2: because relevance is calculated based on the inverse of the number of documents that contain the word the then every term in the CC license is worth less then words not in the license. I can't point to any example of why that is bad but I feel it in my bones. Feel free to ignore this. I'm probably paranoid.
  3. Entirely self serving: given #1, the contents of the license take up an awful lot of space for very little benefit. If I had more space I could make Cirrus a beta on more wikis. It is kind of a lame reason and I'm attacking the space issue from other angles so maybe it'll be moot long before we get this deployed and convince the community that it is worth doing.
  4. Really really self serving: if .nosearch is the right solution and is useful then it is super duper easy to implement. Like one line of code, a few tests, and bam. Its already done, just waiting to be rebased and merged. It was so easy it would have taken longer to estimate the effort then to propose an implementation.

I really wouldn't be surprised if someone couldn't come up with great reason why #1 is silly and we just shouldn't do it.

The big problem with the nosearch class implementation is that it'd be pretty simple to abuse and hard to catch the abuse because the text is still on the page. One of the nice things about the solution is you could use a web browser's debugger to highlight all the text excluded from search by writing a simple CSS class.

(In reply to Nik Everett from comment #7).

[...] 3. Entirely self serving: given #1, the contents of the license take up an
awful lot of space for very little benefit. [...]

Is the RDF markup in the HTML also indexed, currently? That's a good part of the HTML, a fee hundreds bytes like this:

<p><span class="licensetpl_link" style="display:none;">http://creativecommons.org/licenses/by/2.0</span> <span class="licensetpl_short" style="display:none;">CC-BY-2.0</span> <span class="licensetpl_long" style="display:none;">Creative Commons Attribution 2.0</span> <span class="licensetpl_link_req" style="display:none;">true</span><span class="licensetpl_attr_req" style="display:none;">true</span></p>

Are those already ignored due to the display:none?

The rest of the license is indeed not covered by noprint, noclass or "Exclude in print" :/, though a lot of stuff is on other projects, see e.g. https://en.wikipedia.org/wiki/Category:Exclude_in_print

(In reply to Nemo from comment #8)

Are those already ignored due to the display:none?

No. That would be nice, and I imagine we could detect the naïve case of it being in a style="" tag pretty easily. It's a whole lot harder when it's in CSS.

Removing PATCH_TO_REVIEW as we're going to have a few patches I think.

Nemo pointed out to me on irc that we're indexing all the welcome messages too. Those are good candidates to skip if possible.

Change 112500 abandoned by Manybubbles:
Don't search text wrapped in nosearch class

Reason:
Replacing in favor of I70be621d2702409d88befd8f7ba0936cbc4912ef.

It won't save us any space but it'll improve snippets and relevance. Hopefully.

https://gerrit.wikimedia.org/r/112500

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Deskana lowered the priority of this task from Medium to Lowest.
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

This bug has performance implications too, can't be lowest priority.

Nemo_bis raised the priority of this task from Lowest to Low.Dec 30 2015, 7:47 AM
Nemo_bis set Security to None.

This bug has performance implications too

Can you expand on what the performance implications are?

can't be lowest priority.

Performance bugs can be lowest priority if they're not causing much of an impact.