Page MenuHomePhabricator

Query pages should limit to content namespaces, not just main namespace
Closed, ResolvedPublic

Description

Author: robchur

Description:
Per bug 7991 comment 3, certain query pages should probably have their SQL
tweaked such that all valid content namespaces, as defined in
$wgContentNamespaces, are included, rather than just the main namespace.

This might still present problems, where pages are laid out with the assumption
that all list items reflect a single namespace, but it's probably a viable
consideration for the future.


Version: unspecified
Severity: enhancement

Details

Reference
bz8130

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 9:29 PM
bzimport set Reference to bz8130.

nephele wrote:

Implementation of Namespace::isContentQuery()

Makes various special pages check all specified content namespaces instead of just NS_MAIN.

attachment nscontent.diff ignored as obsolete

nephele wrote:

This bug has been a problem on the wiki where I'm an admin (www.uesp.net), because our site makes extensive use of content namespaces other than NS_MAIN. I've implemented a code modification on our wiki that fixes this bug, and I think it would be useful if the same fixes could be implemented in the core mediawiki code. In my previous comment I uploaded a patch file that implements all of the suggested changes.

I've added a new static function, Namespace::isContentQuery(), to Namespace.php (basically just moving existing code in SpecialPopularPages.php into a centralized function).
That function has then been added into the SQL queries of all relevant special pages, basically just replacing "page_namespace=".NS_MAIN with "page_namespace ".Namespace::isContentQuery() . A few special cases where more than just one line needed to be changed in the Specialxxx.php are:

  • On SpecialPopularPages.php and SpecialShortpages.php existing redundant code was removed.
  • On SpecialUncategorizedpages.php, the default request was changed to be "all content pages" (requested using NULL) instead of NS_MAIN (Uncategorizedtemplates and Uncategorizedcategories still work properly)

These changes have been in place at UESP for a year now, so I'm confident that they do work on a live wiki. If anyone would like to see the resulting code in action, you can check some of UESP's special pages, i.e.
http://www.uesp.net/wiki/Special:Disambiguations
http://www.uesp.net/wiki/Special:Lonelypages
http://www.uesp.net/wiki/Special:Deadendpages
http://www.uesp.net/wiki/Special:DoubleRedirects
The only caveat is that UESP is still running Mediawiki 1.10.0 (in part because we have numerous code modifications like this that make upgrading unbearably difficult, thus the interest in getting some of the modifications added to mediawiki). However, the relevant code hasn't changed much since 1.10. The attached patch was created for 1.14, and I did some basic tests on the 1.14 configuration to check that the changes work.

Argh, ugliness!

All this code should use the Database class functions instead of building a raw SQL query. When they do that, you no longer need the ugly-as-hell MWNamespace::isContentQuery() (the Namespace class was renamed to MWNamespace a while ago) that spits out weird raw SQL fragments, but just a function that returns an array of content namespaces.

nephele wrote:

Yes, the code may be ugly. But the patch is not the source of the ugliness -- the existing SpecialPage classes all pass around raw SQL fragments and make minimal use of the Database class functions. In a nutshell, trying to refactor the code to make it pretty would be a major code revamp -- with no performance or functionality benefits -- and refactoring the code would by itself still do nothing to fix this bug. I'm not trying to provide the ultimate, perfect code; I'm just trying to provide some progress towards fixing a two-year old bug.

Some of the fundamental obstacles complicating a conversion of all this code to use the Database class functions include:

  • Some 32 special pages are based on QueryPage. Any comprehensive fix requires revamping all of those pages, plus QueryPage.php and PageQueryPage.php. Even worse, a comprehensive fix is likely to break any extension special pages that create special pages derived from QueryPage -- possibly 235 extensions based on Mediawiki-registered extensions alone.
  • QueryPage::getSQL is supposed to return raw SQL. If getSQL is changed to instead return an array of ($table, $var, $cond, $fname, $options), then every use of getSQL is made uglier, because it's now passing around 5+ parameters instead of just one. The other option is that getSQL still returns raw SQL, but that the various implementations of getSQL use Database functions to generate that raw SQL -- however, that's not currently possible, because the Database class does not have functions that construct a raw SQL query and return it -- without actually executing the SQL query.
  • Many of the existing SQL queries cannot be handled by the Database functions. See SpecialDisambiguations, for example, where the page table is accessed twice within a single query, and therefore it is aliased ("FROM {$page} AS pb, {$pagelinks} AS la, {$page} AS pa"). The database functions are unable to handle table aliases. Without major revisions to Database, it is impossible to change all of the special pages to use the database functions -- and changing half of the special pages to use one syntax and leaving half using the old syntax is even uglier, in my opinion.

After all of that is done, it's still necessary to figure out how to fix this bug.

  • The "prettiest" way to do it would seem to be adding it as a flag recognized by Database::makeSelectOptions. However, burying it that deep within the Database class then means it's unavailable to queries that are not entirely constructed by the Database functions -- such as SpecialDisambiguations (unless the database functions are made fully alias-aware). So then a separate function is needed outside of makeSelectOptions that spits out raw SQL fragments -- so why not just introduce that separate function right now instead of forcing a code revamp that won't even get rid of the separate function?
  • I added isContentQuery to the MWNamespace class because the necessary information is ''not'' part of the database: the list of content namespaces is defined solely through a global variable, $wgContentNamespaces. With my patch, all uses of $wgContentNamespaces are centralized in Namespace.php -- only MWNamespace needs to know the inner details of the content namespaces are defined.

I'm willing to update my patch if provided with specific suggestions about what could be done differently within the scope of this approach. But I'm not prepared to take on a huge project to fix something that's basically unrelated to this bug. Especially since with my superficial knowledge of the MW codebase, my efforts to tackle such a huge revamp would inevitably add a whole new stack of bugs.

(In reply to comment #4)

Yes, the code may be ugly. But the patch is not the source of the ugliness --
the existing SpecialPage classes all pass around raw SQL fragments and make
minimal use of the Database class functions. In a nutshell, trying to refactor
the code to make it pretty would be a major code revamp -- with no performance
or functionality benefits -- and refactoring the code would by itself still do
nothing to fix this bug. I'm not trying to provide the ultimate, perfect code;
I'm just trying to provide some progress towards fixing a two-year old bug.

Some of the fundamental obstacles complicating a conversion of all this code to
use the Database class functions include:

  • Some 32 special pages are based on QueryPage. Any comprehensive fix requires

revamping all of those pages, plus QueryPage.php and PageQueryPage.php. Even
worse, a comprehensive fix is likely to break any extension special pages that
create special pages derived from QueryPage -- possibly 235 extensions based on
Mediawiki-registered extensions alone.

  • QueryPage::getSQL is supposed to return raw SQL. If getSQL is changed to

instead return an array of ($table, $var, $cond, $fname, $options), then every
use of getSQL is made uglier, because it's now passing around 5+ parameters
instead of just one. The other option is that getSQL still returns raw SQL,
but that the various implementations of getSQL use Database functions to
generate that raw SQL -- however, that's not currently possible, because the
Database class does not have functions that construct a raw SQL query and
return it -- without actually executing the SQL query.

  • Many of the existing SQL queries cannot be handled by the Database functions. See SpecialDisambiguations, for example, where the page table is accessed

twice within a single query, and therefore it is aliased ("FROM {$page} AS pb,
{$pagelinks} AS la, {$page} AS pa"). The database functions are unable to
handle table aliases. Without major revisions to Database, it is impossible to
change all of the special pages to use the database functions -- and changing
half of the special pages to use one syntax and leaving half using the old
syntax is even uglier, in my opinion.

After all of that is done, it's still necessary to figure out how to fix this
bug.

  • The "prettiest" way to do it would seem to be adding it as a flag recognized

by Database::makeSelectOptions. However, burying it that deep within the
Database class then means it's unavailable to queries that are not entirely
constructed by the Database functions -- such as SpecialDisambiguations (unless
the database functions are made fully alias-aware). So then a separate
function is needed outside of makeSelectOptions that spits out raw SQL
fragments -- so why not just introduce that separate function right now instead
of forcing a code revamp that won't even get rid of the separate function?

  • I added isContentQuery to the MWNamespace class because the necessary

information is ''not'' part of the database: the list of content namespaces is
defined solely through a global variable, $wgContentNamespaces. With my patch,
all uses of $wgContentNamespaces are centralized in Namespace.php -- only
MWNamespace needs to know the inner details of the content namespaces are
defined.

I'm willing to update my patch if provided with specific suggestions about what
could be done differently within the scope of this approach. But I'm not
prepared to take on a huge project to fix something that's basically unrelated
to this bug. Especially since with my superficial knowledge of the MW
codebase, my efforts to tackle such a huge revamp would inevitably add a whole
new stack of bugs.

I'm working on doing all this in the querypage-work branch; and yes, it is possible to prettify stuff and move away from raw SQL.

nephele wrote:

Aha, that makes it alot easier (especially for a newb like me) to see the direction where you're heading with the special page queries.

So, should I wait until the querypage-work branch is complete to proceed?

Or could I help by providing a patch for querypage-work? In which case, it looks like what you're suggesting is that the entry in $conds would be changed from

page_namespace => NS_MAIN

to something like

page_namespace => MWNamespace::getContentNamespaces()

where getContentNamespaces could return a single value or an array of values (returning a single-value as a non-array seems slightly more efficient for Database::makeList).

(In reply to comment #6)

Aha, that makes it alot easier (especially for a newb like me) to see the
direction where you're heading with the special page queries.

So, should I wait until the querypage-work branch is complete to proceed?

Or could I help by providing a patch for querypage-work? In which case, it
looks like what you're suggesting is that the entry in $conds would be changed
from

page_namespace => NS_MAIN

to something like

page_namespace => MWNamespace::getContentNamespaces()

Yeah, that's what will be needed. I'll look into that on a per-query basis (such a condition might hurt efficiency for some queries).

where getContentNamespaces could return a single value or an array of values
(returning a single-value as a non-array seems slightly more efficient for
Database::makeList).

Doesn't really matter.

nephele wrote:

Add/implement MWNamespace::getContentNamespaces(), patch for querypage-work branch

Here's a new patch, written specifically for the querypage-work branch. The new function is now MWNamespace::getContentNamespaces(). I went ahead and added it to every query where I think it should be used. In the process, I took a stab at converting SpecialPopularPages over to the new syntax.

In terms of efficiency, using getContentNamespaces() instead of just NS_MAIN shouldn't have any impact on wikipedia, because the two end up producing identical SQL. And as someone who uses a wiki where the content namespaces have been customized, I'd argue getContentNamespaces() needs to be used in all possible queries -- because otherwise the corresponding special page ends up being completely useless. If performance really is a limiting factors in some of these queries, it might make more sense to just drop the namespace condition completely (or at least drop it if the content namespaces have been customized).

There are a couple of other minor fixes included in the patch that I came across while doing a quick test-run of the code; I left them in place because it seemed likely you'd want to make those fixes eventually. (Specifically, abstract-ing PageQueryPage, fixing a missing ' in SpecialDeadendpages)

Attached:

(In reply to comment #8)

Created an attachment (id=6062) [details]
Add/implement MWNamespace::getContentNamespaces(), patch for querypage-work
branch

Here's a new patch, written specifically for the querypage-work branch.

Thanks, committed in r49951.

The
new function is now MWNamespace::getContentNamespaces(). I went ahead and
added it to every query where I think it should be used. In the process, I
took a stab at converting SpecialPopularPages over to the new syntax.

Thanks; I thought I had converted all of them, but I seem to have missed that one somehow.

In terms of efficiency, using getContentNamespaces() instead of just NS_MAIN
shouldn't have any impact on wikipedia, because the two end up producing
identical SQL. And as someone who uses a wiki where the content namespaces
have been customized, I'd argue getContentNamespaces() needs to be used in all
possible queries -- because otherwise the corresponding special page ends up
being completely useless. If performance really is a limiting factors in some
of these queries, it might make more sense to just drop the namespace condition
completely (or at least drop it if the content namespaces have been
customized).

True. In all of the queries you modified, it looks like the WHERE clause for page_namespace is not indexed anyway, so whether it limits to one or more values doesn't matter.

There are a couple of other minor fixes included in the patch that I came
across while doing a quick test-run of the code; I left them in place because
it seemed likely you'd want to make those fixes eventually. (Specifically,
abstract-ing PageQueryPage, fixing a missing ' in SpecialDeadendpages)

Good catch.

  • Bug 18804 has been marked as a duplicate of this bug. ***