Page MenuHomePhabricator

Store the page_id of the moved page in log_page
Closed, ResolvedPublic

Description

When pages are moved (log_action="move" AND log_type="move"), the log_page field in the logging table contains the page_id of the *redirect page* that was created or zero if no redirect was created.

This is painful and counter-intuitive. I expected log_page to contain the page_id of the *moved page*.

Proposed solution:

Store the page_id of the *moved page* in log_page for future moves and store the page_id of the newly created redirect in the serialized array stored in log_params.

Potential issues:

This may cause backwards compatibility issues if someone is relying on the redirect page_id appearing in log_page for a page move.


Version: 1.23.0
Severity: minor
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=57079
https://bugzilla.wikimedia.org/show_bug.cgi?id=68930

Details

Reference
bz57084

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 2:12 AM
bzimport set Reference to bz57084.

It is counter-intuitive.

I'm not sure if we should break backwards compatibility. I would normally want to keep it. However, the page ID is 0 when suppressing the redirect, which means there's no consistent way to get even the source page ID (though you can get the source on regular moves, and people may be relying on this).

An option that should not break backwards compatibility is to add the old page ID in log_params (and somehow expose this to the move-specific sub-object in logevents). See https://en.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=200&letype=move&format=jsonfm

I'd like to point out that adding the *moved page* id to log_params is sub-optimal since log_params can't be indexed in such a way that discovering move events for a particular page would be efficient.

However, if there's a serious concern about backwards compatibility, I find this to be a good compromise.

Have we considered the possibility of server-side instrumentation? We introduced ServerSideAccountCreation precisely for the same reasons: (1) generate a clean dataset for analysis; (2) add fields that are not natively available from MW (like mobile account creations); (3) avoid creating backward compatibility issues.

Obviously EventLogging doesn't address the problem of historical data and would only produce data that is not (yet?) publicly accessible.

This sounds like an orthogonal proposal. Regardless of whether we instrument an EventLogging capture, log_page should still capture the deleted page's ID.

I mistakenly wrote the comment above thinking that I was commenting on bug #26122. Yet, I feel like the argument still applies here, though to a lesser extent due to potential backwards compatibility issues.

(In reply to comment #5)

I mistakenly wrote the comment above thinking that I was commenting on bug
#26122. Yet, I feel like the argument still applies here, though to a lesser
extent due to potential backwards compatibility issues.

I agree. EventLogging is a useful tool for the WMF, but it is not a substitute for MediaWiki's built-in logging.

Built-in logging is important for third-party MW installs, on-wiki users (though obviously not everything is visible to all users), and the pages-logging.xml.gz dumps (though of course, what to dump requires consideration).

Backwards compatibility is important, but it's not the only thing. Sometimes we have to break it to move forward.

As noted, I think it can be solved for built-in logging either way (with or without backwards compatibility); they have different advantages.

I just wanted to leave another note because I've run into this issue again.

I'm working on building a graph of the use of draft namespace[1] in the English Wikipedia over time. This work is made substantially more difficult by the lack of a persistent identifier of the moved page associated with move events. Since move events change the non-persistent identifier of a page (namespace & title), and draft articles are intended to be moved, I need to reprocess the entire log in order to keep track of these changing identifiers every time I need to update the graph's data.

  1. https://en.wikipedia.org/wiki/Wikipedia:Draft_namespace

Shall we add another field to the logging table? Here are some possibilities:

*log_from_page
*log_source_page
*log_old_page

*log_to_page
*log_destination_page
*log_target_page
*log_new_page

I was thinking we'd want to avoid a name like log_moved_from because we might want to also use that field for, e.g., merging histories or some other unforeseen use, besides page moves, in which there's both a "from" page and a "to" page.

log_new_page has the disadvantage that we might not always be creating a new page, so it would be misleading.

"source" could have the disadvantage of being confused with rc_source. "old" could have the disadvantage of being confused with the fields in the text table that start with old_; plus it suggests that the page_id perhaps has been supplanted with a new page_id, which is misleading.

Suppose we go with the log_params option. How would that work? The way it's currently set up is as a serialized array like this:

array(2) {

["4::target"]=>
string(3) "Foo"
["5::noredir"]=>
string(1) "0"

}

Shall we add a "6::pageid" element?

I was actually just about to request something similar. I've been working on enumerating metadata about public actions in MediaWiki and flagging some data values that are (painfully) currently missing from the logs. See https://meta.wikimedia.org/wiki/Research:Ideas/MediaWiki_events:_a_generalized_public_event_datasource

Adding a field to params seems like a fine option to me in this circumstance.

The more I think about it, the more it seems like a good idea to put the "moved from" page ID in log_page, and put the redirect page ID in log_params. How much of a big deal would the backwards compatibility issues be? We might actually be doing the bot owners more of a favor than a disservice by making this change, since their implementations could become more straightforward.

I couldn't agree more. I can't imagine a use-case where the created redirect page benefits from a position in an indexed column (log_page).

Putting the page_id of the moved page in log_page fits with the principle of least surprise.

(In reply to Aaron Halfaker from comment #12)

I couldn't agree more. I can't imagine a use-case where the created
redirect page benefits from a position in an indexed column (log_page).

Also, as re backwards compatibility, if there is someone using the ID of the created redirect, they may be inconvenienced somewhat by the change. However, they won't be completely out of luck, since it's still in log_params.

Also importantly, if you need the redirect page ID, it's possible to programatically determine where to find it, even if you have to deal with data sets spanning the transition. I believe the algorithm would be:

  1. If log_page is 0, it's old-style and there is no redirect.
  2. If log_page is non-zero:

2a. If there is a log_redirect_page_id (or whatever) in the log_params, it's new-style, and that's obviously the redirect page ID.
2b. If there is no log_redirect_page_id, it's old-style, and log_page is the redirect page ID.

Change 150969 had a related patch set uploaded by leucosticte:
Make log_params contain null revision rev_id

https://gerrit.wikimedia.org/r/150969

This patch adds a "redirpageid" log parameter to the API that only shows up when there's a redirect page created. So, e.g., suppose "Foo" is moved to "Bar" with a redirect. You'd get something like this for list=recentchanges:

<rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
  <move new_ns="0" new_title="Bar" nullrevid="28" redirpageid="16" />

But suppose the redirect is suppressed; then you'd get something like this:

<rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
  <move new_ns="0" new_title="Bar" suppressedredirect="" nullrevid="28" />

Is that what you want, or do you want redirpageid to show up as, e.g., 0 if the redirect is suppressed?

This patch also, incidentally, fixes bug 68950.

When considering this change, I like to consider the EventLogging schema that we set up for tracking page moves on Wikimedia wikis given the lack of useful data in the logging table. See https://meta.wikimedia.org/wiki/Schema:PageMove.

In matching that schema to the proposed data structure:

  • userId -- rc.userid
  • userText -- rc.user
  • pageId -- rc.pageid
  • oldNamespace -- rc.ns
  • oldTitle -- parsed(rc.title) (to remove namespace prefix)
  • newNamespace -- rc.move.new_ns
  • newTitle -- parsed(rc.move.new_title) (to remove namespace prefix)
  • redirectId -- rc.move.redirpageid
  • comment -- rc.comment

Is that mapping right? If so, it makes sense to me.

Also, re. zero for rc.move.redirpageid, we decided to go that way in the EventLogging schema since it follows the pattern of zero'd ID fields that is relatively consistent across the database (e.g. rev_user and rev_parent_id in revision).

(In reply to Aaron Halfaker from comment #17)
Actually, the way the API is set up now, neither rc.title nor rc.move.new_title are parsed to remove the namespace prefix. This means that bots have to do that parsing. We could introduce a new result property for the title without namespace prefix. I'm not sure what we'd call it, though.

This latest patch uses zero for rc.move.redirpageid if the redirect is suppressed. So it would appear in the API as:

<rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
  <move new_ns="0" new_title="Bar" suppressedredirect="" nullrevid="28" redirpageid="0"/>

Brian Wolff says, "Im opposed to suddenly changing the definition of log_page in a breaking way without a compelling reason." Is there a compelling reason, and is the breakage a big deal? Should we ask at wikitech-l?

Well, I think we do have a compelling reason. I am also a fan of asking wikitech-l to help us imagine what might be broken by the change.

(In reply to Nathan Larson from comment #18)

We could introduce a new result property for the title without
namespace prefix.

This would be useful. I end up doing a lot of work to extract namespaces from full page names. Every time I realize I'm going to need to do it, I feel a deep sadness as parsing titles tends to be error prone. Sadly, splitting on ":" is not a sufficient strategy. I've actually written libraries to help minimize the pain for others. See https://pythonhosted.org/mediawiki-utilities/lib/title.html#mw-lib-title

How would we handle existing log items? Per what was written in comment #13 above, I was thinking we could write a script that would look for rows lacking the redirpageid log parameter. Those would be the ones before the change was made.

Then it would look at the log_page and see where that redirect points to. It would need to look in the revision history, or maybe even the archive table (if the redirect page was later deleted), to find the revision that was created with the same rev_timestamp or ar_timestamp as the log_timestamp.

Then it could look up the page title from that redirect revision in the page table to get the page_id to put in log_page.

That page could have been meanwhile moved to a different page title, or deleted and undeleted, either of which would make the page title have a different page ID than what it was when the page move occurred. So it might be necessary for the script to follow the trail in the logging table till it gets to the correct page ID. That will be an interesting engineering challenge.

Alternatively, we can take the easy way out and simply fix bug 68930. Then we wouldn't be making any breaking changes.

I'm a fan of the non-easy way and I'm willing to put into the work to build a script that would handle the engineering challenge of back-filling old data.

Doing this work will take some time (a couple of weeks?) and I'd prefer if it was done after Wikimania. I've already done some related work, so I'm familiar with the complexities that would be involved. See https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation#Datasets

FWIW, I'm planning to do this work for other reasons. See https://meta.wikimedia.org/wiki/Research:Ideas/MediaWiki_events:_a_generalized_public_event_datasource

(In reply to Aaron Halfaker from comment #23)

I'm a fan of the non-easy way and I'm willing to put into the work to build
a script that would handle the engineering challenge of back-filling old
data.

I've posted some discussion at [[mw:log_page migration script]].

(In reply to Nathan Larson from comment #18)

(In reply to Aaron Halfaker from comment #17)
Actually, the way the API is set up now, neither rc.title nor
rc.move.new_title are parsed to remove the namespace prefix. This means that
bots have to do that parsing. We could introduce a new result property for
the title without namespace prefix. I'm not sure what we'd call it, though.

Not sure it's necessary. It may be that most people who need to do it already have a library that does so.

Anyway, it would be outside the scope of this change.

(In reply to Nathan Larson from comment #22)

That page could have been meanwhile moved to a different page title, or
deleted and undeleted, either of which would make the page title have a
different page ID than what it was when the page move occurred.

Or been suppressed/oversighted, etc.

So it might
be necessary for the script to follow the trail in the logging table till it
gets to the correct page ID. That will be an interesting engineering
challenge.

Indeed, definitely interesting, possibly feasible.

(In reply to Aaron Halfaker from comment #20)

Well, I think we do have a compelling reason. I am also a fan of asking
wikitech-l to help us imagine what might be broken by the change.

Well people should include said compelling reason in the commit summary. Right now there is no reason mentioned in the commit summary compelling or otherwise.

Keep in mind that changing this value would break every existing usage of this value, including not only api users, but possibly tool labs folks who use the db. It would also change how namespace filtering works on special:recentchanges. To me that seems quite a breaking change.

(In reply to Bawolff (Brian Wolff) from comment #27)

Keep in mind that changing this value would break every existing usage of
this value, including not only api users, but possibly tool labs folks who
use the db. It would also change how namespace filtering works on
special:recentchanges. To me that seems quite a breaking change.

I wonder how many people are using this value? This is kind of a catch-22, but the fact that people weren't clamoring for this change to be made earlier suggests to me that not many people are using the value, or they would have noticed how counterintuitive the situation was and said something. I'm not very familiar with tool labs; if there a way we can do a keyword search to see what uses they're making of log_page?

How does it change namespace filtering on Special:RecentChanges? It seems like with or without the change in effect, when you filter by namespace, results show up or don't based on the value of rc_namespace, i.e. the namespace you're moving the page from (rather than to).

I assumed you wanted to change that value too (to be consistent).


A less backwards breaking change could be to add a target_page_id entry to the log_search table.

(In reply to Bawolff (Brian Wolff) from comment #29)

A less backwards breaking change could be to add a target_page_id entry to
the log_search table.

This is implemented in the patch set 5, gerrit change 150969. Should we WONTFIX this bug, then?

(In reply to Nathan Larson from comment #30)

(In reply to Bawolff (Brian Wolff) from comment #29)

A less backwards breaking change could be to add a target_page_id entry to
the log_search table.

This is implemented in the patch set 5, Gerrit change #150969. Should we
WONTFIX this bug, then?

The log_search approach seems like a good one to me.

Change 157872 had a related patch set uploaded by Aaron Schulz:
Move log log_page entries are now that of the moved page

https://gerrit.wikimedia.org/r/157872

Change 157872 merged by jenkins-bot:
Move log log_page entries are now that of the moved page

https://gerrit.wikimedia.org/r/157872

This is a quite drastic change, needs communication. (At least it has release notes.) I suggest to describe it as "Log entries for the move log are now stored for the target title rather than the original title", as this is effectively what users see isn't it.