
Ampersand replaced by its HTML entity even in <html> sections, breaking JavaScript
Open, Low, Public

Description

Author: frederic.keller

Description:
The issue I am facing is that every bare ampersand "&" in page content is replaced by its HTML entity, &amp;.

Even with raw HTML allowed ($wgRawHtml = true) and the content wrapped in <html></html> tags, the ampersands are still replaced.

I would like to keep bare & characters, because users should be able to add JavaScript to their pages. With this replacement, any & used as a logical operator is corrupted, and the JavaScript breaks with it.

Here is a small snippet that demonstrates the problem. Add it to a page and check the HTML source.

<html>
ampersand : &amp;
pure ampersand: & (should not be replaced)

</html>

Is there a solution to this problem, or will it be fixed in the next version?

Thank you very much!


Version: 1.20.x
Severity: normal

Details

Reference
bz10407

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:54 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz10407.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

Sure &amp; should work correctly in JavaScript, just as it does in URLs. The XML parser is supposed to replace it with & before passing it on to the JavaScript parser or anything else. It really doesn't work? Try using <![CDATA[ ... ]]> around your JavaScript.

ais523 wrote:

(In reply to comment #1)

Sure &amp; should work correctly in JavaScript, just as it does in URLs. The
XML parser is supposed to replace it with & before passing it on to the
JavaScript parser or anything else. It really doesn't work? Try using
<![CDATA[ ... ]]> around your JavaScript.

I can confirm that this doesn't work correctly (tested on a private wiki with the <html> tag enabled, MediaWiki version 1.9.0).

Test code:
<html>
<script>
<[CDATA[
alert("Testing the & sign");
]]>
</script>
</html>

Output in page's HTML:
<script>
<[CDATA[
alert("Testing the &amp; sign");
]]>
</script>

The script then displays the message "Testing the &amp; sign" in the alert box that comes up. I've actually written scripts inside <html> tags on that wiki, and it's been a pain having to express a&&b as !(!a||!b).
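The workaround described above relies on De Morgan's law. As a minimal sketch (the helper names `andWithoutAmpersand` and `andValue` are hypothetical, chosen only for illustration), here are two ampersand-free ways to write a logical AND. The first always yields a boolean, whereas the native operator yields one of its operands; the ternary form preserves that behaviour.

```javascript
// Hypothetical helpers: ampersand-free equivalents of a logical AND,
// for scripts where the wiki parser rewrites the ampersand character.

// De Morgan form: always returns a boolean.
function andWithoutAmpersand(a, b) {
  return !(!a || !b);
}

// Ternary form: returns one of the operands, like the native operator.
function andValue(a, b) {
  return a ? b : a;
}

console.log(andWithoutAmpersand(true, false));
console.log(andValue('skin', 'stylepath'));
```

Written inline, both forms short-circuit just as the native operator does; wrapping them in functions, as here, evaluates both arguments before the call.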

The <![CDATA[ ... ]]> would be used to allow a raw & in the source to pass the XML parser correctly. It would ensure that &amp; is interpreted as &amp; *instead of* the & it otherwise would be.

Note that in HTML 4, <script> contents are defined as CDATA already, which is why common browsers already handle it that way. In pure XHTML, this wouldn't be implied, which is why we make it explicit in our own output.

Since various processing is done on the output code even after the <html> sections are done, currently it may not be possible to get nice 'clean' output of this sort.

ayg wrote:

(mid-air collision)

Right, right, of course <![CDATA[ will just muck things up further, since the MW parser doesn't recognize it. But it's an error to output a literal &, in <script> or elsewhere, that doesn't begin an entity. It should work correctly as &amp;, as far as I can tell.

Unfortunately, testing in Firefox, it does not. <![CDATA[ seems to be the only way to get this to work, so for this to function correctly MediaWiki would have to either insert <![CDATA[ ... ]]> intelligently inside <script> and maybe <style> tags, and not HTML-escape those; or else just not clean them at all.
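To illustrate the "not HTML-escape those" option, here is a rough sketch of an escaping pass that skips <script> and <style> elements. It is not MediaWiki's Sanitizer or Tidy; the function name `escapeOutsideRawText` and the regular expression are assumptions made for this example, and real markup would need an actual HTML parser.

```javascript
// Hypothetical sketch, not MediaWiki's actual sanitizer: escape bare
// ampersands in markup but leave <script> and <style> elements untouched.
function escapeOutsideRawText(html) {
  // First alternative: a whole script/style element, passed through as-is.
  // Second alternative: an ampersand that does not start an entity.
  const pattern =
    /<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>|&(?![a-zA-Z][a-zA-Z0-9]*;|#[0-9]+;|#[xX][0-9a-fA-F]+;)/gi;
  return html.replace(pattern, (match) => (match === '&' ? '&amp;' : match));
}

const input = 'a & b <script>if (x && y) go();</script> &amp; &copy;';
console.log(escapeOutsideRawText(input));
```

One limitation of this sketch: because the whole element is skipped, ampersands in the attributes of the <script> tag itself are also left unescaped.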

Is it Tidy doing the cleaning, or the Sanitizer? Does the entity get replaced even with Tidy off?

clemenzi wrote:

On http://en.wikipedia.org/wiki/Special:Watchlist, I get a javascript error every time I refresh due specifically to this bug. The following line is in the header section

<script type="text/javascript" src="http://en.wikipedia.org/w/index.php?title=-&amp;action=raw&amp;gen=js&amp;useskin=monobook"><!-- site js --></script>

Because the ampersands are not handled correctly, that line returns an html text page instead of the expected javascript.

This has just started happening in the last day or so.

herd wrote:

(In reply to comment #5)

On http://en.wikipedia.org/wiki/Special:Watchlist, I get a javascript error
every time I refresh due specifically to this bug. The following line is in the
header section

<script type="text/javascript"
src="http://en.wikipedia.org/w/index.php?title=-&amp;action=raw&amp;gen=js&amp;useskin=monobook"><!--
site js --></script>

Because the ampersands are not handled correctly, that line returns an html
text page instead of the expected javascript.

This has just started happening in the last day or so.

No, that's normal. This bug is about & being replaced with &amp; in the <script> body, not in the src attribute value. For example:
<html><script type="text/javascript">if(skin && stylepath) alert('woo')</script></html>
will break, whereas
<html><script type="text/javascript" src="http://en.wikipedia.org/w/index.php?title=MediaWiki:Common.js/watchlist.js&action=raw&ctype=text/javascript"></script></html>
will correctly escape the & to &amp; and the browser expects and understands this.
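A small sketch of why the escaped form is harmless in an attribute: an HTML parser decodes &amp; back to & in attribute values before the URL is requested. Here `decodeAmp` is a hypothetical stand-in for that decoding step (a real parser handles all entities), and the URL is the one quoted earlier in this thread.

```javascript
// Illustration only: decodeAmp is a hypothetical stand-in for the entity
// decoding an HTML parser applies to attribute values.
function decodeAmp(attributeValue) {
  return attributeValue.replace(/&amp;/g, '&');
}

const srcAsWritten =
  'http://en.wikipedia.org/w/index.php?title=-&amp;action=raw&amp;gen=js&amp;useskin=monobook';

// After decoding, the query string splits into the intended parameters.
const requested = new URL(decodeAmp(srcAsWritten));
console.log(requested.searchParams.get('action'));
console.log(requested.searchParams.get('gen'));
```

No such decoding happens to the body of a <script> element in HTML, which is why the same entity breaks inline code.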

What error are you getting exactly? That gen=js request appears on every page load, not just watchlists, and is what loads MediaWiki:Common.js and MediaWiki:SKINNAME.js (probably Monobook). http://en.wikipedia.org/wiki/MediaWiki:Common.js/watchlist.js is loaded only on the watchlist page, so the error is possibly there.

clemenzi wrote:

The exact error is

Line: 8
Char: 2
Error: Expected identifier, string or number
Code: 0
URL: http://en.wikipedia.org/wiki/Special:Watchlist

I got to the ampersands by saving the html locally and debugging one line at a time. However, I checked (as suggested) and other pages which do not produce errors have the same line. I tried a more complete test (I left in all the code). Running from my hard drive, I first converted the relative links to absolute links. Now, there are 2 errors. Commenting out the line I indicated above stopped them both. However, the errors are different when running locally, so it appears that I was wrong.

Line: 2
Char: 1
Error: invalid character
Code: 0
URL: file://path to my test case

BTW, I am running IE 6.

clemenzi wrote:

(In reply to comment #7)

The exact error is
Line: 8
Char: 2
Error: Expected identifier, string or number
Code: 0
URL: http://en.wikipedia.org/wiki/Special:Watchlist
I got to the ampersands by saving the html locally and debugging one line at a
time. However, I checked (as suggested) and other pages which do not produce
errors have the same line. I tried a more complete test (I left in all the
code). Running from my hard drive, I first converted the relative links to
absolute links. Now, there are 2 errors. Commenting out the line I indicated
above stopped them both. However, the errors are different when running
locally, so it appears that I was wrong.
Line: 2
Char: 1
Error: invalid character
Code: 0
URL: file://path to my test case
BTW, I am running IE 6.

The problem went away today at 2pm; I last saw it at 6am.

Cbarr wrote:

I have hit this bug on wikimediafoundation.org. I was trying to use the logical AND operator "&&" in my JavaScript, and the parser changed both ampersands, giving "&amp;&amp;".

Bump: this caused an issue again on wikimediafoundation.org. It makes code really annoying to write.

This comment was removed by Izno.