RSS feed items (HTML) are not rendered as HTML but htmlescaped
Closed, Resolved (Public)

Assigned To

Authored By

	Wikinaut
	Feb 27 2012, 10:29 PM

Description

see https://www.mediawiki.org/wiki/Extension_talk:RSS#RSS_.5B2.10.5D.3B_r112480_output_is_distorted_containing_.28table_border_cellpadding_cellspacing_....29_12516

Requires a change in RSSParser.php: renderFeed, renderItem and escapeTemplateParameter.

It's difficult to change in the current code structure which uses a final
"$renderedFeed = $parser->recursiveTagParse( $renderedFeed, $frame );"

Developers' help is requested.

Version: unspecified
Severity: normal
URL: https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/RSS/RSSParser.php?view=markup

Details

Reference: bz34763

Related Objects

Status	Subtype	Assigned	Task
Open	Feature	None	T32377 Suggestion: add a new parameter to limit the number of characters when rendering the channel item <description>
Resolved		Wikinaut	T36763 RSS feed items (HTML) are not rendered as HTML but htmlescaped
Resolved		None	T37002 Sanitizer:removeHTMLtags fails for <img src=> tag when enclosed in <a> link
Invalid		None	T37013 Sanitizer:removeHTMLtags failure: it removes XHMTL style <img src=... /> tags when allowing this tag expressly

Event Timeline

• bzimport raised the priority of this task to High. Nov 22 2014, 12:18 AM

• bzimport added projects: MW-extension-1.20-version, MediaWiki-extensions-RSS.

• bzimport set Reference to bz34763.

Wikinaut created this task. Feb 27 2012, 10:29 PM

link is https://www.mediawiki.org/w/index.php?title=Extension_talk:RSS&oldid=504276

(In reply to comment #1)

link is
https://www.mediawiki.org/w/index.php?title=Extension_talk:RSS&oldid=504276

arrgh I hate LiquidThreads
link is http://preview.tinyurl.com/7hmbxer

the beast is https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/RSS/RSSParser.php?view=markup function renderItem

Be aware of protected $ItemMaxLength = 200;
For testing, increase that value to 20000; otherwise the item text is cut off in the middle of HTML tags.

A solution for this bug must change the item length limitation so that it is applied _after_ the HTML tags are rendered, in order not to break tag(s).
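For illustration only, here is a minimal sketch of a length limit that counts text and never cuts inside a tag. It is not the code from RSSParser.php: the function name truncateHtmlText and the list of void tags are assumptions made for this example, and the length is counted in bytes rather than characters.

```php
<?php
// Hypothetical helper, not part of Extension:RSS.
// Truncates the *text* of an HTML fragment to $maxLength bytes, copies tags
// through untouched, and closes any tags still open at the cut-off point.
function truncateHtmlText( string $html, int $maxLength ): string {
    $voidTags = [ 'img', 'br', 'hr' ]; // assumed list of tags without a closing tag
    // Split into text tokens and tag tokens; the tags are kept as delimiters.
    $tokens = preg_split( '/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
    $out = '';
    $remaining = $maxLength;
    $open = []; // stack of currently open tag names

    foreach ( $tokens as $token ) {
        if ( preg_match( '!^<(/?)\s*([a-z0-9]+)[^>]*>$!i', $token, $m ) ) {
            $name = strtolower( $m[2] );
            if ( $m[1] === '/' ) {
                // Closing tag: pop it if it matches the innermost open tag.
                if ( end( $open ) === $name ) {
                    array_pop( $open );
                }
            } elseif ( substr( $token, -2 ) !== '/>' && !in_array( $name, $voidTags, true ) ) {
                $open[] = $name;
            }
            $out .= $token; // tags never count against the limit
            continue;
        }
        // Text token: copy as much as the remaining budget allows.
        $piece = substr( $token, 0, $remaining );
        $out .= $piece;
        $remaining -= strlen( $piece );
        if ( $remaining <= 0 ) {
            break;
        }
    }
    // Close whatever is still open so the fragment stays balanced.
    while ( $open ) {
        $out .= '</' . array_pop( $open ) . '>';
    }
    return $out;
}
```

With a limit of 8, the input '<b>Hello world</b> and <i>more</i>' is cut inside the text of the bold element and the element is then closed, rather than being cut in the middle of a tag.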

Detailed description of the problem:

Function renderItem( $item ) renders each component (basically title, date, description) of each RSS feed item.

Item descriptions are currently sanitized by function escapeTemplateParameter, which includes htmlspecialchars. The call in renderItem could be replaced by Sanitizer::removeHTMLtags( item, null, array(), array( "a", "img", "b", "u", "i", "s" ) ) or something like that.

renderItem replaces the string "{{{description}}}" in Template MediaWiki:Rss-item by the description sanitized this way, which is "HTML" but not "[[Wikitext]]".

the problem comes from renderFeed which must use Template MediaWiki:Rss-item which has the form

{{MediaWiki:Rss-feed | title = {{{title}}} | link = {{{link}}} | date = {{{date}}} | author = {{{author}}} | description = {{{description}}} }} [*]

(the template Rss-feed controls the layout of the feed, list form, bullets, indentation of feed items; {{{description}}} being replaced by "HTML" in step 3.)

It must be parsed.

Problem:
"HTML" must not be further parsed.

$wgOut->addHTML() is not a solution per se, because, as I said, [*] needs to be parsed.

How can I achieve this "do parse [*] but do not parse

[CORR] Last sentence should read:

How can I achieve this "do parse [*] - but do not parse the part "HTML".

Some thoughts on this:

*First of all, unfortunately it looks like the sanitizer can't really be used, since (I assume) we want <tag-we-don't-recognize> to be silently ignored (i.e. the tag removed, but its contents kept) instead of htmlescaped.

What I would do is basically make my own regex filter (somewhat based on Sanitizer::removeHTMLTags; at the very least, steal its list of allowed html tags) that just kills any tag not on the safe list.

This should mostly work, since anything on the safe list should pass through the parser fine, and anything else would be gone. The only hiccup would be that people would probably want links to come through unharmed, which means they would have to be converted to wiki-syntax [http://foo bar] style links in order for that to work.

(If you want, I can make a patch that would probably better describe what I'm thinking than this comment did)
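A rough sketch of the filter idea described above, for illustration only. It is not the patch offered in this comment: the function name filterFeedHtml and the allow-list are assumptions, PHP's built-in strip_tags() stands in for a hand-written regex filter, and attributes on the tags that are kept are not sanitized here.

```php
<?php
// Hypothetical helper, not part of Extension:RSS or MediaWiki's Sanitizer.
function filterFeedHtml( string $html ): string {
    // Drop every tag that is not on the (assumed) allow-list.
    // strip_tags() removes the tags themselves but keeps the text between them.
    $html = strip_tags( $html, '<a><b><i><u><s>' );

    // Rewrite <a href="http(s)://...">text</a> as a wikitext external link
    // [url text] so that the link survives a later wikitext parse.
    // Caveat: link text containing "]" would need extra handling.
    $html = preg_replace(
        '!<a\s[^>]*?href\s*=\s*(["\'])(https?://[^"\'\s<>]+)\1[^>]*>(.*?)</a>!is',
        '[$2 $3]',
        $html
    );

    // Any <a> tag that was not rewritten (for example a non-http link)
    // is removed, keeping its text.
    return strip_tags( $html, '<b><i><u><s>' );
}

$input = '<table><tr><td><a href="http://foo">bar</a> <b>x</b><blink>y</blink></td></tr></table>';
echo filterFeedHtml( $input ), "\n";
```

The table and blink tags disappear while their text stays, the bold tag passes through, and the link is turned into the [http://foo bar] form mentioned above.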

The other thing to maybe look into, is for the actual substitution of {{{link}}} (or whatever) in the template - maybe use recursiveTagParse (or possibly some other method from the parser. Not sure off the top of my head which is most appropriate) with a custom frame containing the args from the feed instead of using str_replace. That way the parameter substitution would be exactly like how it normally works in templates. People could do things like {{{link|text if no link}}}, etc.

(In reply to comment #7)

Some thoughts on this:

I read it, thanks ! Will come with a new version in the next few days.

(In reply to comment #7)

My main problem with this "bug" is to find a solution to the question

"how can I recursively parse (because templates are to be rendered) when one of the template parameters should not be parsed but should be rendered as supplied?" Is there any solution?

recursiveTagParse( "{{MediaWiki:Rss-feed | 1 = 'something [[Wikitext]] to be parsed' | 2 = 'DO-NOT-PARSE-ME-I-AM-SANITIZED-HTML' }}" );

MediaWiki:Rss-feed has for example this content:

{{{1}}}

: {{{2}}}

I think this appears to be impossible in the framework and current code structure of the Extension:RSS (not developed by me), which uses a "two template" approach: RSS feed items are parsed through a template for allowed item elements, and then a second template renders the items into a listed feed.

Ok, there's a significantly easier way than what I suggested above:

$nonParsedText = $parser->insertStripItem( "DO NOT PARSE TEXT" );

$parser->recursiveTagParse( "some text... " . $nonParsedText . "...more text" );

This would make it mostly non-parsed (I think, I haven't tested it).

Or more specifically, it will make it non-parsed to the same extent that the return value of a parserHook <tag> extension is non-parsed. (So doBlockLevels is still done on the text, and a couple other things. Most people probably won't notice). To really totally make it non-parsed you can get the parser's mStripState object and do something like:

$nonParsedText = $parser->mStripState->addNoWiki( "{$parser->mUniqPrefix}-rssExtension-" . sprintf( '%08X', $parser->mMarkerIndex++ ) . Parser::MARKER_SUFFIX, "Text not to parse" );

but that's more complicated, and I'm not really sure if extensions are supposed to touch the parser's internal variables like that.
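To show the idea behind strip markers without touching MediaWiki, here is a self-contained toy sketch. Everything in it is an assumption made for illustration: ToyStripState and toyParse are invented names, toyParse is only a stand-in for the real parser, and the marker format is not the one MediaWiki uses.

```php
<?php
// Toy model of the strip-marker idea; not MediaWiki's StripState.
final class ToyStripState {
    /** @var array<string,string> marker => protected HTML */
    private array $items = [];
    private string $prefix;

    public function __construct() {
        // Random prefix so that feed content cannot guess a marker in advance.
        $this->prefix = 'TOYSTRIP-' . bin2hex( random_bytes( 8 ) ) . '-';
    }

    // Store HTML and return an opaque marker to put in its place.
    public function add( string $html ): string {
        $marker = $this->prefix . count( $this->items ) . '-END';
        $this->items[$marker] = $html;
        return $marker;
    }

    // Swap every marker back for the HTML it stands for.
    public function restore( string $text ): string {
        return strtr( $text, $this->items );
    }
}

// Stand-in for a wikitext parser: escapes HTML and turns '''x''' into bold.
function toyParse( string $wikitext ): string {
    $escaped = htmlspecialchars( $wikitext, ENT_NOQUOTES );
    return preg_replace( "/'''(.+?)'''/", '<b>$1</b>', $escaped );
}

$state = new ToyStripState();
$marker = $state->add( '<a href="http://example.org/">feed link</a>' );
// The surrounding text is parsed; the marker passes through untouched.
$parsed = toyParse( "'''Title''' & intro: " . $marker );
$result = $state->restore( $parsed );
echo $result, "\n";
```

The surrounding text is parsed (bold markup converted, "&" escaped), while the protected HTML comes out exactly as it went in because the parser only ever saw the marker.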

Just to keep you updated: I found a solution which solves the rendering of HTML at least for many cases.

fixed in r113297 .
small tolerated regression bug30377 .

Aklapper edited projects, added MW-1.20-release; removed MW-extension-1.20-version. Dec 19 2014, 8:19 PM

ori mentioned this in rMEXT50af8d40aae8: Updated mediawiki/extensions Project: mediawiki/extensions/RSS…. Jun 2 2015, 12:01 AM

ori mentioned this in rERSSe66a9afd99f1: Don't rely on strip marker uniqueness.

ori mentioned this in rERSS24988d6911d9: Don't rely on strip marker uniqueness.

ori mentioned this in rERSS6421f1f37b1b: Don't rely on strip marker uniqueness.
