Syntax for stripping HTML and wiki markup
Open, Low, Public, Feature

Description

Author: ui2t5v002

Description:
Similar to {{urlencode: }}, I'd like a parserfunction for stripping wikimarkup
and HTML from text. For instance:

The quick brown fox --> The quick brown fox
The [[quick]] [[brown]] [[fox]] --> The quick brown fox

CO<sub>2</sub> --> CO2

My specific application is for generating machine-readable COinS tags from
citation templates. For instance, if someone cites the book:

title = [[Aristotle for Everybody]]: Difficult Thought Made Easy
edition = 6<sup>th</sup> edition

which we have an article for, it shows up in the citation template with a link,
which is great. But in the machine-readable citation information, it needs to
become plain text:

Aristotle for Everybody: Difficult Thought Made Easy
6th edition
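
A minimal sketch of the requested behaviour on the examples above, written in Python purely for illustration. It is not MediaWiki code: `strip_markup` and both regular expressions are hypothetical names, and the sketch only handles simple `[[links]]`, HTML-style tags, and character entities, not templates or other wikitext.

```python
import html
import re

# Hypothetical patterns for illustration only; real wikitext is far richer.
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_markup(text: str) -> str:
    """Reduce simple wiki links and HTML tags to plain text."""
    # [[target]] -> target, [[target|label]] -> label
    text = LINK_RE.sub(r"\1", text)
    # Drop HTML-style tags but keep the text between them.
    text = TAG_RE.sub("", text)
    # Decode entities such as &times; into their characters.
    return html.unescape(text)


print(strip_markup("[[Aristotle for Everybody]]: Difficult Thought Made Easy"))
# Aristotle for Everybody: Difficult Thought Made Easy
print(strip_markup("6<sup>th</sup> edition"))
# 6th edition
```

A regex pass like this cannot expand templates, which is one reason a parser-based approach may be needed in practice.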

This would also be useful in templates where a parameter needs to be linked in
one place but not in another, or where the template links the parameter itself
but people often link it by accident. It might be useful for automated
linking to section anchors with markup, too?

Test with <sub>sub</sub> and <sup>sup</sup>

has the anchor

#Test_with_sub_and_sup

for instance.
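
The heading-to-anchor mapping in this example can be sketched as follows. This is a simplified illustration, not MediaWiki's actual anchor encoding (which also escapes special characters); `section_anchor` is a hypothetical name.

```python
import re

# Hypothetical pattern: matches opening and closing HTML-style tags.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def section_anchor(heading: str) -> str:
    """Strip tags from a heading and join its words with underscores."""
    plain = TAG_RE.sub("", heading)
    return "#" + "_".join(plain.split())


print(section_anchor("Test with <sub>sub</sub> and <sup>sup</sup>"))
# #Test_with_sub_and_sup
```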

I'm sure there are many other template-related functions that would be helped by
this, too.


Version: unspecified
Severity: enhancement
URL: http://www.mediawiki.org/wiki/Extension:Strip_Markup

Details

Reference
bz8161

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:31 PM
bzimport set Reference to bz8161.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

I'd be concerned about the time this might require on large chunks of text.

ui2t5v002 wrote:

(In reply to comment #1)

I'd be concerned about the time this might require on large chunks of text.

If that's a limitation, could it just be limited to short strings? Does the
urlencode function have the same problem?

robchur wrote:

URL-encoding is less work.

ui2t5v002 wrote:

Does a similar function already exist for section anchors?

robchur wrote:

Yes, but there'd still be the potential for some moron to shove a load of
wikitext into the parser function and increase the amount of processing time.

I could just be being paranoid, of course; Tim Starling's probably the best
person to consult about this...

ui2t5v002 wrote:

(In reply to comment #5)

Yes, but there'd still be the potential for some moron to shove a load of
wikitext into the parser function and increase the amount of processing time.

Yeah. The applications I'm imagining are only short snippets of text, though,
so limiting it to 100 characters or so per instance would be fine.

But then do you have to worry about many instances on one page?
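
One way to picture a limit that covers both a single long input and many short ones is a per-page budget. The sketch below is purely hypothetical: the class name, the 2000-character page total, and the policy of refusing over-budget input are all assumptions, with only the "100 characters or so" figure taken from this comment.

```python
class StripBudget:
    """Tracks how much text one page has asked to strip (hypothetical policy)."""

    def __init__(self, per_call: int = 100, per_page: int = 2000):
        self.per_call = per_call    # cap on a single input, in characters
        self.remaining = per_page   # assumed total allowance for the whole page

    def allow(self, text: str) -> bool:
        """Return True and charge the budget if this input may be processed."""
        size = len(text)
        if size > self.per_call or size > self.remaining:
            return False
        self.remaining -= size
        return True


budget = StripBudget(per_call=100, per_page=150)
print(budget.allow("x" * 100))  # True: within both limits
print(budget.allow("x" * 60))   # False: only 50 characters of page budget remain
```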

I could just be being paranoid, of course; Tim Starling's probably the best
person to consult about this...

Yes, I was mentioning the urlencode and anchor name functions so that their
processing time and server impact could be compared.

ssanbeg wrote:

Image alt text may be a better comparison, e.g. [[Image:wiki.png|some text]]
will parse "some text" for the caption, then strip the tags for the alt text.

I don't think you can directly strip wiki markup, so it would seem a bit
wasteful to parse that just to discard the results, but I don't think it would
be that much slower than normal parsing.
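
The "parse first, then strip the tags" idea can be sketched with Python's standard `html.parser` module. The input here stands in for HTML that a wikitext parser has already produced; the sketch does not parse wikitext itself, and `TextOnly` and `text_from_html` are hypothetical names rather than anything in MediaWiki.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_from_html(rendered: str) -> str:
    """Return the plain text of already-rendered HTML, discarding all tags."""
    parser = TextOnly()
    parser.feed(rendered)
    parser.close()  # flush any text still buffered at the end
    return "".join(parser.parts)


print(text_from_html('some <i>text</i> with CO<sub>2</sub>'))
# some text with CO2
```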

ui2t5v002 wrote:

(In reply to comment #7)

Image alt text may be a better comparison, e.g. [[Image:wiki.png|some text]]
will parse "some text" for the caption, then strip the tags for the alt text.

Oh. You mean like:

[[Image:Ant.jpg|thumb|Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
3.63&times;10<sup>24</sup> things]]

will have alt text of:

Here is an ant with CO2 and 3.63×1024 things

I hadn't thought of that. So, in actuality, we already have a function that
does *exactly* what I'm looking for?

We've had it for years, it's in use on a very large number of articles, multiple
times each, and any moron can come along and put inordinate amounts of complex
wikicode into it (http://en.wikipedia.org/wiki/User:Omegatron/Sandbox) and no
one's ever complained about it causing server load problems?

:-)

How easy would it be to make this into a user-accessible ParserFunction?

ssanbeg wrote:

(In reply to comment #8)

(In reply to comment #7)

Image alt text may be a better comparison, e.g. [[Image:wiki.png|some text]]
will parse "some text" for the caption, then strip the tags for the alt text.

Oh. You mean like:

[[Image:Ant.jpg|thumb|Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
3.63&times;10<sup>24</sup> things]]

will have alt text of:

Here is an ant with CO2 and 3.63×1024 things

I hadn't thought of that. So, in actuality, we already have a function that
does *exactly* what I'm looking for?

We've had it for years, it's in use on a very large number of articles, multiple
times each, and any moron can come along and put inordinate amounts of complex
wikicode into it (http://en.wikipedia.org/wiki/User:Omegatron/Sandbox) and no
one's ever complained about it causing server load problems?

:-)

Yeah, that's my thought.

How easy would it be to make this into a user-accessible ParserFunction?

Shouldn't be too hard. I don't think a parserfunction, though, since it's
harder to pass arbitrary text to them, and it would return text anyway.
Something like

<stripmarkup>Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
3.63&times;10<sup>24</sup> things</stripmarkup>

would seem reasonable.

ui2t5v002 wrote:

(In reply to comment #9)

Shouldn't be too hard. I don't think a parserfunction, though, since it's
harder to pass arbitrary text to them, and it would return text anyway.

I'm not sure what you mean by this, but a stripmarkup tag (or something shorter
to type) would make me just as happy. Just as long as I can do things like
<strip>{{{parameter}}}</strip> inside a template.

ssanbeg wrote:

strip markup extension

I think that's a bit simpler for adding random text, since you don't have to worry
about something like a stray | terminating the argument.
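
The stray-pipe concern can be illustrated with a toy model of pipe-delimited arguments. This is not MediaWiki's parser: `split_top_level_pipes` is a hypothetical helper that splits at every `|` not nested inside `[[...]]` or `{{...}}`, which is roughly why a pipe inside a link is harmless while a bare one ends the argument early.

```python
def split_top_level_pipes(body: str) -> list[str]:
    """Split text at pipes that are not nested inside [[...]] or {{...}}."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A bare pipe ends the current argument.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


print(split_top_level_pipes("a [[b|c]] d"))  # ['a [[b|c]] d']
print(split_top_level_pipes("x | y"))        # ['x ', ' y']
```

Text wrapped in a tag such as `<stripmarkup>...</stripmarkup>` has no argument separator at all, so the second case does not arise.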

Here's a quick extension I just put together.

Attached:

ui2t5v002 wrote:

(In reply to comment #11)

I think that's a bit simpler for adding random text, since you don't have to worry
about something like a stray | terminating the argument.

Very good point. I agree that the pseudo-html tags are better.

Changed summary from "ParserFunction for stripping HTML and wiki markup" to
"Syntax for stripping HTML and wiki markup" to reflect Attachment #2831.

ui2t5v002 wrote:

Not to clutter up this bug, but are there plans for testing this/implementing it
on en?

ayg wrote:

Note that due to bug 2257, I believe this patch would not presently work for
template parameters, the intended use. Please correct me if I'm wrong.

ssanbeg wrote:

(In reply to comment #15)

Note that due to bug 2257, I believe this patch would not presently work for
template parameters, the intended use. Please correct me if I'm wrong.

Most of the examples are like <strip>{{thing}}</strip>, which would work fine;
but I see there is one example like <strip>{{{thing}}}</strip>, which wouldn't
work with the XML tag, but should be doable with a parser function.

ui2t5v002 wrote:

(In reply to comment #16)

Most of the examples are like <strip>{{thing}}</strip>, which would work fine;
but I see there is one example like <strip>{{{thing}}}</strip>, which wouldn't
work with the XML tag, but should be doable with a parser function.

All of the things I want to use this for are inside templates, like the
<strip>{{{thing}}}</strip> style.

Blindwanderer wrote:

*necromancy*
I contribute to a third-party wiki where we use tooltips to enhance the user experience. The problem is that a tooltip is an HTML attribute, so all wiki markup has to be processed and all of the resulting HTML markup stripped. This wouldn't be a problem if we weren't using complex templates and Extension:VariablesExtension.

Here is an example page:
https://wiki.secondlife.com/wiki/PRIM_TEXTURE

It's annoying to have to supply and handle alternate text. I'd be more than willing to limit the execution time of this function if it could reduce the complexity of our code.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Aklapper removed a subscriber: wikibugs-l-list.