
Ids in user contributions Atom feed are not unique
Closed, Resolved (Public)

Description

According to the Atom specification (http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.id), entries with the same id should represent the same entry. And because an entry in user contributions feed (e.g. http://en.wikipedia.org/w/index.php?title=Special%3AContributions/Svick&feed=atom&limit=50&target=Svick&year=&month=) represents an edit, each edit should have its own id, but currently, the id is the URL of the changed page.

I think this causes repeated showing of the same edit in Google Reader.


Version: 1.16.x
Severity: normal
URL: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#contributions_syndication

Details

Reference
bz23686

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:59 PM
bzimport set Reference to bz23686.
bzimport added a subscriber: Unknown Object (MLST).

rrr wrote:

(In reply to comment #0)

> According to the Atom specification
> (http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.id),
> entries with the same id should represent the same entry. And because an entry
> in user contributions feed (e.g.
> http://en.wikipedia.org/w/index.php?title=Special%3AContributions/Svick&feed=atom&limit=50&target=Svick&year=&month=)
> represents an edit, each edit should have its own id, but currently, the id is
> the URL of the changed page.
>
> I think this causes repeated showing of the same edit in Google Reader.

I validated the contributions Atom feed with http://feedvalidator.org/check.cgi.
The validator says:

: column 81: Two entries with the same id

When I change "feed=atom" to "feed=rss" in the URL, the validator says:

: column 84: guid values must not be duplicated within a feed http://.....
: column 1: Missing atom:link with rel="self"

So I think the feed function has a bug in how it generates ids, and the RSS feed function does not use a correct template.
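The "Two entries with the same id" finding can be reproduced locally with a small check. The following is only a minimal sketch in Python, not the validator's actual logic: the function name duplicate_entry_ids and the SAMPLE_FEED document are made up for illustration, and the sample only imitates a contributions feed in which two edits to the same page share the page URL as their id.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of Atom 1.0 elements, in ElementTree's "{uri}tag" notation.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample: two edits to the same page share one id.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>User contributions</title>
  <entry>
    <id>http://en.wikipedia.org/wiki/Example</id>
    <title>Example</title>
  </entry>
  <entry>
    <id>http://en.wikipedia.org/wiki/Example</id>
    <title>Example</title>
  </entry>
  <entry>
    <id>http://en.wikipedia.org/wiki/Other_page</id>
    <title>Other page</title>
  </entry>
</feed>
"""


def duplicate_entry_ids(feed_xml):
    """Return each <id> value used by more than one <entry>, in first-seen order."""
    root = ET.fromstring(feed_xml)
    ids = [
        (entry.findtext(ATOM_NS + "id") or "").strip()
        for entry in root.iter(ATOM_NS + "entry")
    ]
    counts = Counter(ids)  # keeps insertion order, like a dict
    return [entry_id for entry_id, n in counts.items() if n > 1]


if __name__ == "__main__":
    for entry_id in duplicate_entry_ids(SAMPLE_FEED):
        print("duplicate id:", entry_id)
```

An empty result means every entry id in the feed is distinct.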

rrr wrote:

Currently, the Atom feed's id is made from the article name only,
so ids (or the RSS feed's guids) collide.

I think that if the id generator used a mix of the article name and the revision number, this bug would be fixed.

nowone21 wrote:

It's exactly as KATO Takayuki says.

Currently, feeds use something like "http://xx.wikipedia.org/wiki/article_name" for the <guid> field in RSS and for <id> in Atom. The solution doesn't seem too complicated: instead of "http://xx.wikipedia.org/wiki/article_name", use something like "http://xx.wikipedia.org/w/index.php?title=article_name&oldid=xxxxxxxx". This way, every entry would have a unique identifier.
This change should be implemented as soon as possible, because the issue is already causing trouble. See this:

http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#To_Google_Reader_users:_You_may_be_missing_items_from_your_watchlist_feed.

and this:

http://www.google.com/support/forum/p/reader/thread?tid=4e28dcb545efabb3&hl=en
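The identifier scheme suggested in this comment (page title plus oldid) can be sketched as follows. This is an illustration in Python under stated assumptions, not MediaWiki's feed code: the function name revision_entry_id and its parameters are hypothetical, and the revision numbers in the usage lines are invented.

```python
from urllib.parse import urlencode


def revision_entry_id(script_url, title, rev_id):
    """Build a per-revision identifier of the form <script_url>?title=...&oldid=...

    script_url is assumed to be the wiki's index.php URL; rev_id is assumed
    to be a revision number that is unique across the wiki.
    """
    # Titles in wiki URLs conventionally use underscores instead of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return script_url + "?" + query


if __name__ == "__main__":
    base = "http://en.wikipedia.org/w/index.php"
    # Two edits to the same page now get two different ids.
    print(revision_entry_id(base, "Main Page", 12345))
    print(revision_entry_id(base, "Main Page", 12346))
```

Because the revision number is part of the id, two edits to the same page no longer share an identifier, which is the property feed readers rely on to keep entries apart.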

I think this is pretty much the same issue as the old bug 3998; technically what we're doing is valid (considering each page as an item, and we're including multiple versions of them) but it is indeed probably not matching up well with what receiving entities will be expecting.

Probably best to change the feeds to go ahead and use ids that are specific to the revision and the way it's being displayed, ensuring that feed-processing systems do keep them separate in their caches.

(My old arguments on bug 3998 are in the other direction, but I'm pretty convinced now that I was wrong in 2005. ;)

nowone21 wrote:

As to whether what is being done is valid or not, all I can say is that my RSS watchlist feed doesn't pass W3C validation:
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dfeedwatchlist%26allrev%3Dallrev%26wlowner%3DCanyq%26wltoken%3D080630b3f4931ff5964fa7e69e6ee5a19871d1dc%26feedformat%3Drss
or the Feed Validator test:
http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dfeedwatchlist%26allrev%3Dallrev%26wlowner%3DCanyq%26wltoken%3D080630b3f4931ff5964fa7e69e6ee5a19871d1dc%26feedformat%3Drss
My Atom watchlist feed passes both tests, but with recommendations related to this non-unique id issue:
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dfeedwatchlist%26allrev%3Dallrev%26wlowner%3DCanyq%26wltoken%3D080630b3f4931ff5964fa7e69e6ee5a19871d1dc%26feedformat%3Datom
http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dfeedwatchlist%26allrev%3Dallrev%26wlowner%3DCanyq%26wltoken%3D080630b3f4931ff5964fa7e69e6ee5a19871d1dc%26feedformat%3Datom
Finally, it must be remembered that, as I have shown, a feed reader as popular as Google Reader very often misses items from Wikipedia watchlists because of this problem. I was using it to follow changes in the Wikipedia articles I track, but since many of these articles are being monitored for vandalism, I can't accept these losses. Therefore, until this issue is fixed, I am following my watchlist manually and ignoring the feeds.
Obviously, I don't know how many people use Google Reader to monitor their watchlists, but for me this is a serious problem with (I think) an easy solution.

nowone21 wrote:

In the last few days, I've noticed a change: entries now use a unique identifier following the pattern:

//es.wikipedia.org/w/index.php?title=[Article_title]&amp;diff=[Edition_id]

where [Article_title] is, of course, the article title, and [Edition_id] is the revision number, which, as far as I know, is a unique identifier. Therefore, the issue described on this page should no longer be a problem.