Page MenuHomePhabricator

Redirect tag is not resolved in XML dump file
Closed, ResolvedPublic

Description

Hi,

Currently, the page_is_redirect field from the page table is not resolved. This means that in the XML files there is a tag <redirect> present if page_is_redirect is 1 and such a tag is absent if page_is_redirect is 0. It would be tremendously useful to add an attribute to the <redirect> tag that would include the final destination of the redirection.

Best,

Diederik


Version: unspecified
Severity: normal

Details

Reference
bz30513

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:50 PM
bzimport set Reference to bz30513.

Created attachment 8962
This patch resolves the page_is_redirect field in the xml dump file. It will add an attribute name to the <redirect> tag so it should be backwards compatible.

A huuuuuuuge thanks to Trevor who basically wrote this patch so all credits go to him if this works and all complaints should go to me in case this is a bad idea.

attachment resolve_redirect_patch.txt ignored as obsolete

Some issues:

  • no matching update to XML schema version & definition
  • redirect lookup _may_ not be fully synchronized with the page data snapshot
  • what does this do in the null case? No field or empty field? If this is an implicit choice, it should probably be made more explicit.
  • relies on implicit string conversion of Title object which... appears to give text formatting by default. Wouldn't hurt to make this explicit.

Diederik, Trevor,

Could you update the patch with respect to Brion's review in comment #3 so that it can be applied?

Created attachment 8973
Updated patch

Updated patch; fixes issues as mentioned by Brion.

attachment resolve_redirect_patch.diff ignored as obsolete

New patch fixes issues 1,3 and 4.
Brion: could you elaborate on issue 2 and possible ways to handle that? I am not sure what to do about it.

And this patch also included a separate namespace tag, sorry for not mentioning.

sumanah wrote:

Added the "patch" and "need-review" keywords; Mark hopes to get someone to review the patch soon.

sumanah wrote:

Diederik, could you re-upload your updated patch and ensure it's in the right format (and that it applies cleanly to trunk as is)? Right now, attachment # 8973 is in a format such that Bugzilla won't show the diff automatically...

Created attachment 9394
Patch adds a new namespace tag and determines redirect target.

As requested by Sumana, updated to most recent version of trunk. Patch also includes version increment.

attachment namespace_redirect_xml.patch ignored as obsolete

Created attachment 9400
New patch adds export-0.6.xsd and fixes minor != null check

Based on IRC feedback, made changes, including adding a new export-0.6.xsd file.

attachment namespace_redirect_xml.patch ignored as obsolete

Note for the committer: export-0.6.xsd should be added with export-0.5.xsd as history.

Stylistic issue: Add spaces around $redirect !== null
You're still using strval($redirect) (Brion point #4)

As I said on irc, I'm not convinced about adding the new <ns> tag, as that information is already available.

Created attachment 9401
Updated history section of export-0.5.xsd, export-0.6.xsd and added explicit string conversion to title object.

Attached:

The old patches should be marked obsolete.

Overall patch seems sane (it will still need and get more code review once committed).

(In reply to comment #13)

Created attachment 9401 [details]
and added explicit string conversion to title object.

No. We didn't mean you should change $title->getPrefixedText() to strval( $title->getPrefixedText() ) ), which is redundant.
But to use $redirect->getPrefixedText() instead of strval( $redirect )

The if condition could be changed to $redirect instanceOf Title && $redirect->isValidRedirectTarget() to make sure it's appropiate (just copied the codnition from Title.php:396).

Attached:

(In reply to comment #16)

The if condition could be changed to $redirect instanceOf Title &&
$redirect->isValidRedirectTarget() to make sure it's appropiate (just copied
the codnition from Title.php:396).

Yep, that would also make using getPrefixedText() OK since getRedirectTarget() can return a string and not a Title.

Created attachment 9414
Fixes issues as suggested by Platonides

The previous patch is still needed and is not obsolete.

Attached:

Marking the namespace id number is possibly not quite sufficient without specifying the prefix up in the namespaces section. Probably enough for search purposes, but consider adding aliases up in the namespaces section in <siteinfo> as well.

To resolve the Lucene search indexer's issue with this (bug 32376) which causes the search prefix problem (bug 31697), the Lucene indexer will also need to be updated to *use* the field.

Per IRC disucssion with Ariel a simpler one-off fix for the search stuff is:

  • per (bug 32376) switch to using the canonical names in XML export (so 'Usuario:' not 'Usuaria:'); the MWSearch prefix search implementation will rewrite the titles to gendered form in the actual search results.

Looking over the patch:
+ Version 0.6 adds a separate namespace tag, locates the
+ redirect target and strips of the namespace from the title.

it doesn't actually appear to be stripping the namespace from the <title>, it's still using getPrefixedText()... and if it did strip it that would be a serious breaking change in the format.

Can you clarify what this is meant for?

The current solution (as in trunk) is fine: it leaves the title as is and adds a new namespace tag. This tiny bit of redundancy (having the namespace as an integer in a separate tag and the fully qualified namespace in the title) will make it much more easier to select articles form a specific namespace. In particular, with the added gender specific namespaces it becomes only harder to parse out the namespace that an article belongs to. So the current is that there are no incompatible changes, just for each article an extra namespace tag.