Search should index template expansion
Closed, ResolvedPublic

Description

Author: conrad.irwin

Description:
At the moment, any content generated by templates or extensions is ignored by the search engine.

This poses particular problems on Wiktionary where a large amount of inflection data is generated by templates - for example searching for "lusimus" (not easily identifiable as a form of "[[ludo]]") will return no results.

This should be a configuration option, and one that will only work with some search backends.


Version: unspecified
Severity: major

Details

Reference
bz18861

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 10:42 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz18861.

Adding Robert to CC list as we'd want to consider how to make this possible for the external Lucene search engine and others.

conrad.irwin wrote:

One possibility would be to index the final HTML output, which appears to be available as $editInfo->output at Article.php:2982 (which is where the search index seems to be explicitly updated). It would also require that maintenance/updateSearchIndex.inc either look in the parser cache or re-parse all articles. As I have no real idea how things hook together, this might of course be white noise; if so, sorry.
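A minimal sketch of the idea in this comment, not MediaWiki code: given the rendered HTML of a page, strip the markup and hand the remaining text to the indexer. The names `TextExtractor` and `rendered_html_to_index_text` and the sample inflection table are hypothetical; how the HTML is obtained (parser cache or re-parse) is left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from rendered HTML, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def rendered_html_to_index_text(html):
    """Return the plain text a search backend could tokenize."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical fragment of a template-generated inflection table.
sample = '<table class="inflection"><tr><td>lusimus</td><td>lusistis</td></tr></table>'
index_text = rendered_html_to_index_text(sample)
```

With text extracted this way, a form such as "lusimus" that exists only in template output would reach the index.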

jadcpub-mediawiki wrote:

Just indexing the parameters passed to templates would be a big step forward.

In our wiki (http://cameronedge.com/fixwiki), descriptive text contained in pages is passed to a common template which ensures standard formatting. For example:

{{Value info
|EnumName=HomeCompetentAuthority
|Sort=69
|Group=
|Enum=69
|Tag=452
|FromVersion=FIX.5.0
|Desc=Home Competent Authority (Home CA)
|FieldName=PartyRole
}}

Unfortunately, with the current search capabilities, none of this text shows up in searches.
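As a rough illustration of what "indexing the parameters passed to templates" could mean, the sketch below pulls the parameter values out of a single, non-nested template call so they could be fed to an indexer. It assumes the usual `|name=value` wikitext syntax; the function name `template_param_text` is hypothetical, and nested templates or links containing pipes are not handled.

```python
def template_param_text(wikitext):
    """Return the parameter values of one flat template call."""
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        return []
    parts = inner[2:-2].split("|")
    values = []
    for part in parts[1:]:  # parts[0] is the template name
        _, sep, value = part.partition("=")
        # Named parameter: keep the value; positional: keep the whole part.
        value = value.strip() if sep else part.strip()
        if value:
            values.append(value)
    return values


sample = """{{Value info
|EnumName=HomeCompetentAuthority
|Tag=452
|Desc=Home Competent Authority (Home CA)
|FieldName=PartyRole
}}"""

searchable = template_param_text(sample)
```

Here `searchable` holds the four parameter values, including the descriptive text "Home Competent Authority (Home CA)".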

conrad.irwin wrote:

*** Bug 22779 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

*** Bug 22908 has been marked as a duplicate of this bug. ***

Wikisource seems to have a similar issue relating to transcluded pages from the Page: namespace when transcluded into the main namespace.

Wikisource has a Page: namespace where we undertake the editing of pages of works. These pages are then transcluded to the main namespace for presentation to readers.

The issue is that when indexing takes place, the transcluded pages in the main namespace do not get indexed, only the pages in the Page: namespace, hence defeating the purpose of the work.

Examples of the two searches
http://en.wikisource.org/wiki/Highways_and_Byways_in_Sussex
contains the relevant pages

http://en.wikisource.org/wiki/Special:Search?search=singleton&prefix=Highways+and+Byways+in+Sussex&fulltext=Search+in+this+work&fulltext=Search

http://en.wikisource.org/wiki/Special:Search?search=singleton&prefix=Page%3AHighways+and+Byways+in+Sussex&fulltext=Search+in+this+work+%28Page%3A+prefix%29&fulltext=Search

This is especially problematic as the default searches do not include the Page: namespace, and we wouldn't particularly want them to anyway.

The impact is at two levels.

  • Base-level searches do not find relevant text that is on the site.
  • Drill-down searches per work do not function. This is especially problematic when one is working on reproducing large reference works, e.g. the 1911 Encyclopaedia Britannica.

The same problem happens with the /doc pages used to document templates at en.wikipedia. For example, searching for "This is a meta-template used for creating interwiki links to other" we get the page "Template:Sister/doc" but not the page "Template:Sister" where the former is transcluded:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&redirs=1&search=%22This+is+a+meta-template+used+for+creating+interwiki+links+to+other%22&fulltext=Search&ns10=1&title=Special%3ASearch&advanced=1&fulltext=Advanced+search

teukrosannon wrote:

Changed importance to "high", as on 18.04.2011 about 40% of main namespace of Polish Wikisource is ignored by the search engine.

(In reply to comment #8)

Changed importance to "high", as on 18.04.2011 about 40% of main namespace of
Polish Wikisource is ignored by the search engine.

Can you give a URL example of a page which is ignored?
Btw, the "Page" namespace can be included in search results by adding it to [[mw:Manual:$wgNamespacesToBeSearchedDefault]]

conrad.irwin wrote:

The one in the original comment is still broken:

http://en.wiktionary.org/wiki/ludo should appear when searching for
http://en.wiktionary.org/w/index.php?title=Special%3ASearch&search=l%C5%ABsimus

(Though the problem is less severe now that the lusimus entry exists).

beau wrote:

(In reply to comment #9)

Can you give a URL example of a page which is ignored?
Btw, the "Page" namespace can be included in search results by adding it to
[[mw:Manual:$wgNamespacesToBeSearchedDefault]]

For example the query: https://secure.wikimedia.org/wikisource/pl/w/index.php?title=Specjalna%3ASzukaj&search=oznaczaj%C4%85cym+anio%C5%82a
does not return https://secure.wikimedia.org/wikisource/pl/wiki/Encyklopedja_Ko%C5%9Bcielna/Abaddon which contains the text "oznaczającym anioła" transcluded by a template.

(In reply to comment #11)

(In reply to comment #9)

Can you give a URL example of a page which is ignored?
Btw, the "Page" namespace can be included in search results by adding it to
[[mw:Manual:$wgNamespacesToBeSearchedDefault]]

For example the query:
https://secure.wikimedia.org/wikisource/pl/w/index.php?title=Specjalna%3ASzukaj&search=oznaczaj%C4%85cym+anio%C5%82a
does not return
https://secure.wikimedia.org/wikisource/pl/wiki/Encyklopedja_Ko%C5%9Bcielna/Abaddon
which contains the text "oznaczającym anioła" transcluded by a template.

Thanks.

Including the "Strona" namespace is a temporary fix.

https://secure.wikimedia.org/wikisource/pl/w/index.php?title=Specjalna%3ASzukaj&redirs=1&search=oznaczaj%C4%85cym+anio%C5%82a&fulltext=Search&ns0=1&ns100=1&ns102=1&ns104=1&title=Specjalna%3ASzukaj&advanced=1&fulltext=Advanced+search

This can be added to default search [[mw:Manual:$wgNamespacesToBeSearchedDefault]]

(In reply to comment #12)

Btw, the "Page" namespace can be included in search results by adding it to
[[mw:Manual:$wgNamespacesToBeSearchedDefault]]

...

Including the "Strona" namespace is a temporary fix.

That's not a fix at all as the internal search engine needs to find the mainspace content, not the workspace content and we don't want it to search the pagespace by default. I've boosted the importance to major as this makes the search function practically useless for works that are transcluded from pagespace, which is the goal of wikisource and the purpose of Proofread Page.

orenbochman wrote:

As I see it the problem is that the search indexes are based on analyzing the wiki source and updated page source and the indexer never sees the final output.

The simplest fix, mentioned above, was my original plan: fetch the generated output from cache. This would solve the Wikisource use case of the Page namespace and the Wiktionary use case where text such as an inflection table is generated by a template.

Functionally it can be implemented as follows:

  1. Add a wikiRenderedField to the index.
  2. This field would be tokenized but not stored.
  3. The analysis could be done using Apache Tika.
  4. In some scenarios the field would be the size of a book and may adversely affect the ranking of documents throughout the index (a situation where 1% of the documents contain 95% of the search lexicon).
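The "tokenized but not stored" idea in the steps above can be sketched with a toy inverted index. This is only an illustration under stated assumptions, not the Lucene implementation: `TinyIndex` and the `{{la-verb|ludo}}` template name are hypothetical, and the ranking concern from step 4 is not addressed.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with one stored field and one index-only field."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.stored = {}  # document id -> fields kept for display

    @staticmethod
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, source_text, rendered_text):
        # The wiki source is both tokenized and stored.
        self.stored[doc_id] = {"source": source_text}
        for term in self.tokenize(source_text):
            self.postings[term].add(doc_id)
        # The rendered text (the proposed wikiRenderedField) is tokenized
        # for matching but never stored, so it adds no stored payload.
        for term in self.tokenize(rendered_text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add("ludo", "{{la-verb|ludo}}", "ludo ludis lusimus")
hits = index.search("lusimus")
```

The query matches the page through the rendered field, while only the wiki source remains available as stored content.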

The alternative depends on the progress with the new parser. If it is successful and can be used as a library it would be integrated into search as a new WikiSourceAnalyzer which will capture the full details of both source and output with many side benefits.

lowering priority to reflect reality

(In reply to comment #15)

lowering priority to reflect reality

To me that feels condescending to the Wikisource sites and seems WP-centric, so I am being bold and at least calling it normal. Wikisources have the default text search set to include main ns, but not Page: ns, due to the latter being our workspace, therefore not the prime display space.

Not fixing this bug, or giving it a priority that means nothing will occur, basically says that searching the Wikisources, where the transcluded texts are the better-quality ones, is not important to WMF. I feel that sends very much the wrong message, and it is discouraging to produce double-proofread works transcluded into main ns when they cannot be found by the search engine.

We spend more time on frippery like WikiLove than fixing products that are at the core of quality reproductions.

teukrosannon wrote:

(In reply to comment #15)

lowering priority to reflect reality

Take a look at this page - http://toolserver.org/~phe/statistics.php

As of today, over 50% of content on French, Polish, Catalan and Norwegian Wikisources is inaccessible by search engine. For Wikisource projects, it is a matter of critical importance. Obviously, changing priority back to "high" (or even "normal") changes nothing by itself, but maybe you should reconsider your approach to this problem.

wmf.amgine3691 wrote:

Restoring to high priority.

For projects other than Wikipedia, and increasingly on Wikipedia itself due to ever-expanding use of infoboxen, search fails to find structured information.

This is a high-priority for the communities.

(In reply to comment #15)

lowering priority to reflect reality

one more question here (in addition to the wikisource issue mentioned by others) is:
are we taking the "voting system" seriously?

  • if the answer is "yes", then i fail to understand lowering the priority of a bug with that many votes.
  • if the answer is "no", it would be more respectful to disable the voting system, rather than ask people to "vote" for bugs and then ignore their votes.

peace.

(In reply to comment #18)

This is a high-priority for the communities.

I don't think there is any doubt about that.

For the meaning of the fields "priority" and "severity", see:

  • [[mw:Bug management/Bugzilla usage#Priority]]

(In reply to comment #19)

are we taking the "voting system" seriously?

There are plans of renaming "vote" to "watch" or "bookmark":

teukrosannon wrote:

(In reply to comment #20)

For the meaning of the fields "priority" and "severity", see:

  • [[mw:Bug management/Bugzilla usage#Priority]]

Well, honestly - yes, I really think it is a major problem that should be fixed in a reasonable time. It has been open since 2009; forgive me that I'm not excited to read that it will be fixed "within 6 months, or the release after next" (from now, I presume).

wmf.amgine3691 wrote:

(In reply to comment #20)

(In reply to comment #18)

This is a high-priority for the communities.

I don't think there is any doubt about that.

For the meaning of the fields "priority" and "severity", see:

  • [[mw:Bug management/Bugzilla usage#Priority]]

It's difficult to reply to this without being unintentionally insulting. This is a major problem, and it should be fixed within a month. I think you should stop thinking about this as a problem for the sister projects: content on nearly a million en.Wikipedia pages is not being searched.

As an example, an optional parameter in the {{Infobox caste}} is Kuladevata/Kuladevi - the associated God/Goddess. Yet a search for kuladevi does not find pages including the Caste infobox unless the word is also in the wikitext. As more information is structured into infobox templates, and the templates become more specialized, there will be an increase in missed but relevant articles.

(In reply to comment #20)

There are plans of renaming "vote" to "watch" or "bookmark":

I've asked for the ability to change these sorts of things directly. Since it looks like that is going to take a while, or may not happen, then I'll have to use an intermediary in Ops. (Also see Bug 34668 for another instance of this.)

A big part of the problem with improving search is that there is no dedicated person at the Foundation who manages search. We're hiring someone to fill this role, but there is also a volunteer who has recently shown a lot of interest in search. I'll try to get him to look at this bug.

Un-assigning this bug from Rob L.

It's unclear to me whether this bug is properly sorted currently. Is this a Wikimedia/Lucene problem or is it a MediaWiki/Search problem?

Can we just cry over the ignoring of this bug.

The point is that at this stage we have between tens of and hundreds of thousands [YES HUNDREDS OF THOUSANDS] of pages that are not found in the main namespace [CONTENT!] by our own search engines. This is happening across tens of wikis.

Surely this is problematic.

Surely it can get some attention.

If the search person has been employed #c23 maybe this bug could be assigned to them?

Yes, let's be bold and assign it to Ram for further investigation when there's time: it's probably among the most worthy search bugs anyway (and it's been assigned to RobLa in the past so there's a precedent ;) ).

«The plan is to expand all templates. One question that has come up is, should we not expand some of the templates? NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)»
https://www.mediawiki.org/wiki/Talk:Requests_for_comment/CirrusSearch

CodeCat, it would be helpful if you could comment on that talk, or highlight any other pieces notably missing in the RfC.

It's worth noting explicitly -- since nobody has done so yet -- that this bug prevents Lucene's "incategory:" feature from working with transcluded category tags.

Example:

If article A transcludes template T, and template T contains <includeonly>[[Category:C]]</includeonly>, then article A is in category C. To Lucene, however, article A is not "incategory" C because article A doesn't contain an explicit category tag. So a Lucene query containing incategory:"C" will not find article A.
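The example can be reproduced with a toy model. The sketch below is not Lucene or MediaWiki code: `expand` is a hypothetical, heavily simplified stand-in for template expansion that handles only parameterless `{{T}}` calls, and it shows that category links are visible only in the expanded text.

```python
import re

CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")
TRANSCLUSION = re.compile(r"\{\{([^{}|]+)\}\}")
NOINCLUDE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL)
INCLUDEONLY_TAG = re.compile(r"</?includeonly>")


def categories_in(text):
    """Return the category names linked explicitly in the given text."""
    return {name.strip() for name in CATEGORY_LINK.findall(text)}


def expand(text, templates):
    """Replace each parameterless {{T}} call with the transcluded body of T."""

    def substitute(match):
        body = templates.get(match.group(1).strip(), "")
        body = NOINCLUDE.sub("", body)  # never transcluded
        return INCLUDEONLY_TAG.sub("", body)  # unwrap includeonly content

    return TRANSCLUSION.sub(substitute, text)


templates = {"T": "<includeonly>[[Category:C]]</includeonly>"}
article_a = "Some text. {{T}}"

from_source = categories_in(article_a)
from_expanded = categories_in(expand(article_a, templates))
```

An index built from `article_a` alone sees no categories, whereas one built from the expanded text sees category C, which is the behaviour users expect from incategory:"C".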

This shortcoming has caused innumerable headaches on our MediaWiki instance since, to the user, "incategory" does not seem to work.

Thanks for pointing that out Dan! I just verified that this works with CirrusSearch right now and I've added myself a TODO to make sure I add a regression test for this.

It's important to notice that our current and incomplete search engine makes it easy to find substituted templates that shouldn't be, like maintenance tags, or any text that does not belong directly to the Main namespace. So whatever solution we come up with should contain a configuration option to not index expansion as well.

msh210+wmfbugzilla wrote:

(In reply to comment #32)

It's important to notice that our current and incomplete search engine makes
it easy to find substituted templates that shouldn't be, like maintenance tags,
or any text that does not belong directly to the Main namespace. So whatever
solution we come up with should contain a configuration option to not index
expansion as well.

Maybe, but that seems a low priority, and certainly not sufficient to avoid fixing this bug. (And dumps can be used for the purpose mentioned. This bug affects users whereas #32 affects editors only, who can use dumps.)

This is solved by using the CirrusSearch extension which stores the expanded wikitext. Unexpanded wikitext is retained as well for editors (as requested here and in other places). Marking this as FIXED.