Long edit summaries may get truncated in RC->IRC feeds
Closed, Resolved (Public)

Description

Author: rbeelaard

Description:
If an article name is much longer than 180 chars, the RC feed to IRC is incorrect.
In CDFV a user named ":" (colon) appears, and the summary shows
27.0.0.1 PRIVMSG #nl.wikipedia : <the article name>
Around 180 chars the behaviour varies.
A test with a 160-char article name and a long summary in the edit screen gives a
truncated summary in IRC. A test with a 150-char name also truncates the summary,
but the summary that gets through is not 10 chars longer.

History in MediaWiki looks fine.

Any subsequent edit on such an article maintains the faulty reporting in irc.

The article triggering the investigation is:
http://nl.wikipedia.org/w/index.php?title=Instructie_betreffende_de_criteria_voor_het_onderscheiden_van_roepingen_met_betrekking_tot_personen_met_homoseksuele_tendensen_in_het_licht_van_hun_toelating_tot_seminaries_en_de_heilige_geloften

This article, which is also a redirect, does not allow the addition of text after the
link. However, a test page with a short title that looks like this:

#redirect[[blabla]] this is a test

can be edited.

This might be a second but related problem.


Version: unspecified
Severity: minor
OS: Windows XP
Platform: PC

Details

Reference
bz4253

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:59 PM
bzimport set Reference to bz4253.
bzimport added a subscriber: Unknown Object (MLST).

This occurs rather more often with the ja.wikipedia RC stream. Because URLs with Unicode titles tend to be long, the 470-character limit (probably some 512-char limit for the entire line?) is reached quite fast... and incomplete messages are hard to parse ;)

Some recent examples from the enwiki RC feed (with coloring stripped):

"[[Image:Special Reaction Team, exit out in stack formation behind the shield after delivering a closed line telephone inside the Pacific Marine Credit Union, during a simulated bank robbery.jpg]] http://en.wikipedia.org/w/index.php?title=Image:Special_Reaction_Team%2C_exit_out_in_stack_formation_behind_the_shield_after_delivering_a_closed_line_telephone_inside_the_Pacific_Marine_Credit_Union%2C_during_a_simulated_bank_robbery.jpg&diff=168124521&ol"

"[[Image:163rd Military Police Detachment Special Reaction Team Soldiers assault a bus from the SRT Van while conducting tactical assault training at Fort Campbell, Kentucky.jpg]] http://en.wikipedia.org/w/index.php?title=Image:163rd_Military_Police_Detachment_Special_Reaction_Team_Soldiers_assault_a_bus_from_the_SRT_Van_while_conducting_tactical_assault_training_at_Fort_Campbell%2C_Kentucky.jpg&diff=168124969&oldid=168124852 * TabooTikiGod"

Of course, this could be taken as an argument against absurdly long image names like these. But as long as we continue to allow page names up to 255 bytes long, it'd be nice if the RC feed could be made to cope with them. (In particular, I can think of ways in which a clever vandal might use such long titles to avoid being spotted by antivandal bots watching the feed.)

*** Bug 16097 has been marked as a duplicate of this bug. ***

The IRC protocol (see RFC 2812) has a hard limit of 512 characters per command, and no way to split one message over multiple commands. If this is to be fixed, the format of the RC feed messages must be changed somehow, and the various bots updated to handle the new format.

Would probably have to involve some kind of continuation line syntax. Note that, due to URL-encoding, it's possible for the diff link alone to exceed 512 characters in length.
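
The URL-encoding arithmetic can be checked with a quick sketch. It assumes a made-up title of 85 three-byte characters (255 bytes in UTF-8, the page-name cap mentioned above) and uses Python's standard `urllib.parse.quote`; it is an illustration, not how MediaWiki builds its links.

```python
from urllib.parse import quote

# Hypothetical title: 85 copies of a 3-byte UTF-8 character = 255 bytes.
title = "あ" * 85
title_bytes = len(title.encode("utf-8"))

# Each non-ASCII byte is written as a 3-character %XX escape,
# so the encoded title is three times as long as the raw bytes.
encoded = quote(title)
print(title_bytes, len(encoded))
```

The encoded title alone comes to three characters per byte, before the rest of the URL is added.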

...or we could just leave the page name _out_ of the diff link: simply http://en.wikipedia.org/w/index.php?diff=168124969&oldid=168124852 works just fine.
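
As a rough comparison of the two link shapes, here is a sketch using the second example title from the feed above. The helper names `diff_url_with_title` and `diff_url_short` are hypothetical and only illustrate the idea.

```python
from urllib.parse import quote, urlencode

BASE = "http://en.wikipedia.org/w/index.php"

def diff_url_with_title(title, diff, oldid):
    # Older shape: the percent-encoded page title is part of the query string.
    encoded = quote(title.replace(" ", "_"), safe=":/")
    return f"{BASE}?title={encoded}&diff={diff}&oldid={oldid}"

def diff_url_short(diff, oldid):
    # Proposed shape: revision IDs only, so the length is independent of the title.
    return f"{BASE}?{urlencode({'diff': diff, 'oldid': oldid})}"

title = ("Image:163rd Military Police Detachment Special Reaction Team Soldiers "
         "assault a bus from the SRT Van while conducting tactical assault "
         "training at Fort Campbell, Kentucky.jpg")
long_url = diff_url_with_title(title, 168124969, 168124852)
short_url = diff_url_short(168124969, 168124852)
print(len(long_url), len(short_url))
```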

(In reply to comment #5)

...or we could just leave the page name _out_ of the diff link: simply
http://en.wikipedia.org/w/index.php?diff=168124969&oldid=168124852 works just
fine.

That would certainly reduce the incidence of the problem, but it would still fail if a long edit summary were used on a page with a long title or by a user with a long username, or if the user with the long name edits the page with the long title.

(In reply to comment #6)

That would certainly reduce the incidence of the problem, but it would still
fail if a long edit summary were used on a page with a long title or by a user
with a long username, or if the user with the long name edits the page with the
long title.

True, but as long as it's just the summary that's truncated I suspect most users of the feed can live with it. Of course, long username + long page title could still push it over the limit, but I think a lot of wikis these days cap username length (via the username blacklist) to something like 40-80 characters max anyway.

matthew.britton wrote:

(In reply to comment #7)

True, but as long as it's just the summary that's truncated I suspect most
users of the feed can live with it.

If by "live with it", you mean "spam the crap out of the API instead because it's the only way to get the same information while guaranteeing it to actually be correct", then yes... :/

Removed the title from diff URLs and trimmed the summary if it's still too long in r42695. This should fix the majority of cases, but "really long username" + "really long title" will still have the problem.

Ideally there would be some sort of continuation to second or third messages if necessary, but this would be more difficult to implement, and risks breaking bots that use the IRC RC feed, due to unexpected content in the messages and things being in the wrong places, so more discussion would be needed on that. Input from bot operators who use the IRC feeds would be welcome.

Note that the trimming you have in there won't really do much good. The IRC protocol tacks on the source user identifier, a command ("PRIVMSG"), a target, some punctuation, and a trailing "\r\n" which leaves rather less than 512 characters for the actual message (the exact amount less depends on the length of the channel name and the length of the user identifier).
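
To make the overhead concrete, here is a minimal sketch of the budget calculation. It assumes the line on the wire looks like `:<prefix> PRIVMSG <target> :<text>` followed by CR LF; the prefix `rc!~rc@localhost` is a made-up example, not the feed's real identifier.

```python
IRC_MAX_LINE = 512  # RFC 2812 limit, counting the trailing CR LF

def payload_budget(prefix: str, target: str) -> int:
    # Everything except the message text: leading colon, source prefix,
    # the PRIVMSG command, the target, the " :" separator and CR LF.
    overhead = len(f":{prefix} PRIVMSG {target} :\r\n".encode("utf-8"))
    return IRC_MAX_LINE - overhead

budget = payload_budget("rc!~rc@localhost", "#nl.wikipedia")
print(budget)
```

A longer channel name or user identifier shrinks the budget further, which is why a fixed trim length on the sender's side cannot be relied on.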

In fact, it might be best to just forgo the trimming entirely: as Brad notes, it's not very effective, and not having it allows bots to tell if the message has been truncated by checking for the trailing \003.
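
A bot-side check along these lines might look like the sketch below. It assumes, as described above, that a complete feed message ends with the \003 colour-code character; the function name is hypothetical.

```python
def looks_truncated(message: str) -> bool:
    # Assumption: an untruncated feed message ends with \x03 (the \003
    # colour code), so a missing \x03 suggests the line was cut off.
    return not message.rstrip("\r\n").endswith("\x03")

print(looks_truncated("...&diff=168124521&ol"))
print(looks_truncated("fixed a typo\x03\r\n"))
```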

Comment trimming removed in r42711.

matthew.britton wrote:

(In reply to comment #9)

Ideally there would be some sort of continuation to second or third messages if
necessary, but this would be more difficult to implement, and has the risk of
breaking bots that use the IRC RC feed, due to unexpected content in the
messages and things being in the wrong places, so more discussion would be
needed on that. Input from bot operators who use the IRC feeds would be
welcome.

All IRC bots that I'm aware of use a regex to match the IRC messages, so all that *should* happen is the continuation would not be picked up.
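
For illustration, a bot's matcher might resemble the sketch below. The line layout (`[[title]] flags url * user * (+delta) summary`, with colour codes already stripped) is an assumption loosely based on the examples earlier in this thread, and `parse_rc_line` is a hypothetical name; real bots will differ.

```python
import re

# Assumed colour-stripped layout: [[title]] flags url * user * (+delta) summary
RC_LINE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s+(?P<flags>\S*)\s*(?P<url>https?://\S+)"
    r"\s+\*\s+(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s*(?P<summary>.*)$"
)

def parse_rc_line(line: str):
    # Returns a dict of fields, or None when the line does not fit the pattern
    # (for example a hypothetical continuation line, or a truncated message).
    match = RC_LINE.match(line)
    return match.groupdict() if match else None

full = "[[Foo]] M http://en.wikipedia.org/w/index.php?diff=2&oldid=1 * Alice * (+12) fix typo"
print(parse_rc_line(full))
print(parse_rc_line("(cont.) rest of a long summary"))
```

With a pattern anchored like this, a continuation line would simply fail to match and be ignored rather than be misparsed.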

Just took a look at the #en.wikipedia channel, and it seems new page creations still include the title in the URL. They should probably be changed to include an "oldid=" parameter instead. Will take a look at it myself later, if I don't forget.

matthew.britton wrote:

(In reply to comment #14)

Just took a look at the #en.wikipedia channel, and it seems new page creations
still include the title in the URL. They should probably be changed to include
an "oldid=" parameter instead. Will take a look at it myself later, if I don't
forget.

Please do; that would fix bug 16586 too. :)

Comment #14 should be fixed in r44406.

mike.lifeguard+bugs wrote:

The original problem here is INVALID AFAICT, due to limits of IRC. Instead, we can replace IRC with XMPP (bug 17450), which blocks bug 14045, which is basically the real complaint here. Everything else mentioned seems to have been addressed. I guess I will mark this as FIXED.