RSS feed items (HTML) are not rendered as HTML but htmlescaped
Closed, ResolvedPublic

Description

see https://www.mediawiki.org/wiki/Extension_talk:RSS#RSS_.5B2.10.5D.3B_r112480_output_is_distorted_containing_.28table_border_cellpadding_cellspacing_....29_12516

Requires changes to renderFeed, renderItem and escapeTemplateParameter in RSSParser.php.

This is difficult to change in the current code structure, which ends with a final
"$renderedFeed = $parser->recursiveTagParse( $renderedFeed, $frame );"

Developers' help is requested.


Version: unspecified
Severity: normal
URL: https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/RSS/RSSParser.php?view=markup

Details

Reference
bz34763

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:18 AM
bzimport set Reference to bz34763.

Be aware of "protected $ItemMaxLength = 200;".
For testing, increase that value to 20000; otherwise HTML tags are cropped and cut off midway.

A solution to this bug must change the item length limitation so that it is applied _after_ the HTML tags are rendered, in order not to break tag(s).
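
To illustrate the point about the length limit, here is a minimal sketch of a truncation that counts only visible text and never cuts inside a tag. The function name truncateHtmlSafely is hypothetical and not part of Extension:RSS; it counts bytes rather than characters and assumes reasonably well-formed markup.

```php
<?php
// Hypothetical helper, NOT part of Extension:RSS: truncate the visible text
// of an HTML fragment to $maxLength bytes without cutting a tag in half.
function truncateHtmlSafely( string $html, int $maxLength ): string {
	// Split into tag tokens and text tokens; the tags are kept as delimiters.
	$tokens = preg_split( '/(<[^>]*>)/', $html, -1,
		PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
	$voidTags = [ 'img', 'br', 'hr' ];
	$openTags = [];
	$remaining = $maxLength;
	$out = '';

	foreach ( $tokens as $token ) {
		if ( preg_match( '/^<[^>]*>$/', $token ) ) {
			// Tags do not count towards the limit and are copied whole.
			if ( preg_match( '/^<\s*\/\s*([a-z][a-z0-9]*)/i', $token, $m ) ) {
				if ( end( $openTags ) === strtolower( $m[1] ) ) {
					array_pop( $openTags );
				}
			} elseif ( preg_match( '/^<\s*([a-z][a-z0-9]*)/i', $token, $m ) ) {
				$name = strtolower( $m[1] );
				$selfClosing = substr( $token, -2 ) === '/>';
				if ( !$selfClosing && !in_array( $name, $voidTags, true ) ) {
					$openTags[] = $name;
				}
			}
			$out .= $token;
			continue;
		}
		// Text token: cut it if it exceeds the remaining budget, then stop.
		if ( strlen( $token ) > $remaining ) {
			$out .= substr( $token, 0, $remaining );
			break;
		}
		$out .= $token;
		$remaining -= strlen( $token );
	}

	// Close whatever is still open, innermost first.
	while ( $openTags ) {
		$out .= '</' . array_pop( $openTags ) . '>';
	}
	return $out;
}
```

With this approach a long href no longer eats up the 200-character budget, and a cut in the middle of bold text still ends with a closing tag.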

Detailed description of the problem:

  1. function renderItem( $item ) renders each component (basically title, date, description) of each RSS feed item.
  2. Item descriptions are currently sanitized by function escapeTemplateParameter, which includes htmlspecialchars. The call in renderItem could be replaced by Sanitizer::removeHTMLtags( item, null, array(), array( "a", "img", "b", "u", "i", "s" ) ) or something like that.
  3. renderItem replaces the string "{{{description}}}" in Template MediaWiki:Rss-item by the so-sanitized description, which is "HTML" but not "[[Wikitext]]".
  4. The problem comes from renderFeed, which must use Template MediaWiki:Rss-item, which has the form

{{MediaWiki:Rss-feed | title = {{{title}}} | link = {{{link}}} | date = {{{date}}} | author = {{{author}}} | description = {{{description}}} }} [*]

(the template Rss-feed controls the layout of the feed, list form, bullets, indentation of feed items; {{{description}}} being replaced by "HTML" in step 3.)

It must be parsed.

Problem:
"HTML" must not be further parsed.

$wgOut->addHTML() is not a solution per se, because, as I said, [*] needs to be parsed.

How can I achieve this "do parse [*] but do not parse

[CORR] Last sentence should read:

How can I achieve this: do parse [*], but do not parse the "HTML" part?

Some thoughts on this:

First of all, unfortunately it looks like the sanitizer can't really be used, since (I assume) we want <tag-we-don't-recognize> to be silently ignored (i.e. the tag removed, but its contents kept) instead of htmlescaped.

What I would do is basically make my own regex filter (somewhat based on Sanitizer::removeHTMLTags; at the very least, steal its list of allowed HTML tags) that just kills any tag not on the safe list.

This should mostly work, since anything on the safe list should pass through the parser fine, and anything else would be gone. The only hiccup would be that people would probably want links to come through unharmed, which means they would have to be converted to wiki-syntax [http://foo bar] style links in order for that to work.
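
A rough sketch of what such a filter might look like, using only plain PHP functions (strip_tags and preg_replace_callback) rather than anything from MediaWiki. The function name filterFeedHtml and the short tag list are assumptions for illustration; the real safe list would come from the Sanitizer.

```php
<?php
// Hypothetical filter, NOT the Extension:RSS implementation.
function filterFeedHtml( string $html ): string {
	// Step 1: turn <a href="URL">label</a> into a [URL label] external link,
	// so the link survives the later wikitext parse.
	$html = preg_replace_callback(
		'/<a\s[^>]*href\s*=\s*(["\'])(https?:\/\/[^"\'\s<>]+)\1[^>]*>(.*?)<\/a>/is',
		function ( array $m ): string {
			$label = trim( strip_tags( $m[3] ) );
			return $label === ''
				? '[' . $m[2] . ']'
				: '[' . $m[2] . ' ' . $label . ']';
		},
		$html
	);
	// Step 2: drop every tag that is not on the (assumed) safe list.
	// strip_tags removes the tag itself but keeps the text inside it.
	return strip_tags( $html, '<b><i><u><s>' );
}
```

Two caveats: strip_tags does not clean attributes on the tags it lets through, and a "]" inside a link label would end the external link early, so a real patch would need to handle both.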

(If you want, I can make a patch that would probably better describe what I'm thinking than this comment did)

The other thing to maybe look into is the actual substitution of {{{link}}} (or whatever) in the template: instead of using str_replace, maybe use recursiveTagParse (or possibly some other method from the parser; I'm not sure off the top of my head which is most appropriate) with a custom frame containing the args from the feed. That way the parameter substitution would work exactly as it normally does in templates. People could do things like {{{link|text if no link}}}, etc.
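
To show why that matters, here is a small standalone imitation of template parameter substitution with defaults. It is only a toy, not the parser's frame mechanism, and the function name substituteParams is made up for this example; it handles flat {{{name}}} and {{{name|default}}} only, with no nesting.

```php
<?php
// Toy imitation of {{{name|default}}} handling, NOT the MediaWiki parser.
function substituteParams( string $template, array $args ): string {
	return preg_replace_callback(
		'/\{\{\{([^{}|]+)(?:\|([^{}]*))?\}\}\}/',
		function ( array $m ) use ( $args ): string {
			$name = trim( $m[1] );
			if ( array_key_exists( $name, $args ) ) {
				// Parameter supplied by the feed item.
				return $args[$name];
			}
			// Not supplied: use the default if one was written,
			// otherwise leave the placeholder untouched.
			return array_key_exists( 2, $m ) ? $m[2] : $m[0];
		},
		$template
	);
}

$item = '{{{title}}}: {{{link|text if no link}}}';
echo substituteParams( $item, [ 'title' => 'News' ] ), "\n";
```

A plain str_replace( '{{{link}}}', ... ) would never match '{{{link|text if no link}}}', which is the limitation described above.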

(In reply to comment #7)

Some thoughts on this:

I read it, thanks! I will come up with a new version in the next few days.

(In reply to comment #7)

My main problem with this "bug" is to find a solution to the question:

"How can I recursively parse (because templates are to be rendered) when one of the template parameters should not be parsed but should be rendered as supplied?" Is there any solution?

recursiveTagParse( "{{MediaWiki:Rss-feed | 1 = 'something [[Wikitext]] to be parsed' | 2 = 'DO-NOT-PARSE-ME-I-AM-SANITIZED-HTML' }}" );

MediaWiki:Rss-feed has for example this content:

  * {{{1}}}
  : {{{2}}}

I think this is impossible within the framework and current code structure of Extension:RSS (not developed by me), which uses a "two template" approach: RSS feed items are first passed through a template for the allowed item elements, and then through the template which renders the items into a listed feed.

Ok, there's a significantly easier way than what I suggested above:

$nonParsedText = $parser->insertStripItem( "DO NOT PARSE TEXT" );

$parser->recursiveTagParse( "some text... " . $nonParsedText . "...more text" );

This would make it mostly non-parsed (I think, I haven't tested it).

Or more specifically, it will make it non-parsed to the same extent that the return value of a parserHook <tag> extension is non-parsed. (So doBlockLevels is still done on the text, and a couple other things. Most people probably won't notice). To really totally make it non-parsed you can get the parser's mStripState object and do something like:

$nonParsedText = $parser->mStripState->addNoWiki( "{$parser->mUniqPrefix}-rssExtension-" . sprintf( '%08X', $parser->mMarkerIndex++ ) . Parser::MARKER_SUFFIX, "Text not to parse" );

but that's more complicated, and I'm not really sure if extensions are supposed to touch the parser's internal variables like that.
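
For anyone unfamiliar with the idea behind strip markers, here is a self-contained toy model of it. Nothing here is MediaWiki code: the class StripStateSketch, the function toyParse and the marker format are all made up for illustration, and toyParse is just a stand-in for the real parse step.

```php
<?php
// Toy model of the strip-marker idea, NOT MediaWiki's StripState.
class StripStateSketch {
	private array $items = [];
	private int $index = 0;

	// Store the protected text and hand back an opaque placeholder.
	public function insertStripItem( string $text ): string {
		$marker = "\x7fUNIQ-rss-" . sprintf( '%08X', $this->index++ ) . "-QINU\x7f";
		$this->items[$marker] = $text;
		return $marker;
	}

	// Put the protected text back once parsing is finished.
	public function unstrip( string $text ): string {
		return $this->items ? strtr( $text, $this->items ) : $text;
	}
}

// Stand-in for the parser: escapes HTML and turns '''x''' into <b>x</b>.
function toyParse( string $wikitext ): string {
	$html = htmlspecialchars( $wikitext, ENT_NOQUOTES );
	return preg_replace( "/'''(.+?)'''/", '<b>$1</b>', $html );
}

$state = new StripStateSketch();
$marker = $state->insertStripItem( '<i>sanitized HTML</i>' );
echo $state->unstrip( toyParse( "'''parsed''' & " . $marker ) ), "\n";
```

The parse step only ever sees the placeholder, so the protected HTML comes out exactly as it went in while the surrounding text is parsed normally.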

Just to keep you updated: I found a solution which solves the rendering of HTML at least for many cases.

Fixed in r113297.
Small tolerated regression: bug30377.