Strip <br> tags from extracts
Open, LowPublic

Description

mobile view

On https://ja.wikipedia.org/wiki/%E6%B0%B4%E4%B8%AD%E3%80%81%E3%81%9D%E3%82%8C%E3%81%AF%E8%8B%A6%E3%81%97%E3%81%84 there's a <br clear="both" /> in the article.

That br tag shows up in the extract as well, causing some issues for downstream users (example: https://musicbrainz.org/artist/6fb627d9-983e-43c5-bf73-efcf8e81926b).

There's also extra whitespace in the mobile view (see attachment), which I think uses related code.

MusicBrainz bug report is http://tickets.musicbrainz.org/browse/MBS-7948


Version: unspecified
Severity: normal

Attached:

Screen_Shot_2014-10-26_at_6.49.30_PM.PNG (532×337 px, 67 KB)

Replication steps

On http://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exchars=100000000&titles=Test%20br%20tags%20in%20extracts
The <br> tag shows up, which is expected since HTML is requested, but it leads to a random empty space.

With the explaintext flag set, it doesn't show:
http://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exchars=100000000&explaintext=&titles=Test%20br%20tags%20in%20extracts
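
For anyone reproducing this outside the sandbox, the two requests above can be sketched roughly as follows. This is only an illustration: the query parameters are the ones in the sandbox links, but the `/w/api.php` endpoint path and the helper name `build_extract_url` are assumptions, not something stated in this task.

```python
from urllib.parse import urlencode

# Assumption: the usual MediaWiki API path; the task itself only links Special:ApiSandbox.
API_ENDPOINT = "http://en.wikipedia.beta.wmflabs.org/w/api.php"


def build_extract_url(title, plain_text=False, exchars=100000000):
    """Build a prop=extracts query URL (hypothetical helper, for illustration only)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "exchars": exchars,
        "titles": title,
    }
    if plain_text:
        # explaintext is a flag: sending it with an empty value turns it on.
        params["explaintext"] = ""
    return API_ENDPOINT + "?" + urlencode(params)


# HTML extract (the <br> tag is expected to appear) vs. plain-text extract.
html_url = build_extract_url("Test br tags in extracts")
text_url = build_extract_url("Test br tags in extracts", plain_text=True)
```

Fetching `html_url` and `text_url` and comparing the `extract` fields should show the difference described above, assuming the test page still exists.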

We would like to rethink this behaviour.

AC

Details

Reference
bz72546

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:52 AM
bzimport added a project: TextExtracts.
bzimport set Reference to bz72546.
bzimport added a subscriber: Unknown Object (MLST).

Is this still an issue? The example no longer works for me :(

Thanks @Legoktm that's super helpful. Will take a look Monday.

Yes, that's what this bug is about. There's no good reason for a <br clear="both" /> in the extract.

Well, you've asked for HTML, so it makes sense to include it. I can understand that in some applications this might not be useful, though in others it may be. Maybe what is actually needed here is an additional API parameter that explicitly asks for an extract without any unnecessary* formatting?

  * We'd have to define what "unnecessary" means.

If we need a decision maker, I can be that person: I think this makes sense. I could email mediawiki-api-announce and see if there are cases when this behaviour wouldn't be desirable?

Sounds good to me @phuedx that would be a great first step.

phuedx renamed this task from Strip <br> tags from extracts? to Strip <br> tags from extracts. Jun 22 2017, 10:00 AM

Here's the archived thread: https://lists.wikimedia.org/pipermail/mediawiki-api/2017-June/004001.html

Let's give people a week to respond to the email. If there are no problems, we'll go ahead and do this.

Given lack of responses I guess we should go ahead with this?
Are there any cases where removing the br tag may be problematic?

e.g. poetry?

'Er petticoat was yaller an' 'er little cap was green,
An' 'er name was Supi-yaw-lat - jes' the same as Theebaw's Queen, 
An' I seed her first a-smokin' of a whackin' white cheroot,
An' a-wastin' Christian kisses on an 'eathen idol's foot:
Bloomin' idol made o' mud 
Wot they called the Great Gawd Budd
Plucky lot she cared for idols when I kissed 'er where she stud!
On the road to Mandalay...

would become

'Er petticoat was yaller an' 'er little cap was green,An' 'er name was Supi-yaw-lat - jes' the same as Theebaw's Queen,An' I seed her first a-smokin' of a whackin' white cheroot,An' a-wastin' Christian kisses on an 'eathen idol's foot:Bloomin' idol made o' mudWot they called the Great Gawd BuddPlucky lot she cared for idols when I kissed 'er where she stud!On the road to Mandalay...

We probably want to replace it with a space rather than strip to avoid lines joining.

@phuedx what do you think?
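
To make the difference between stripping and replacing concrete, here is a minimal sketch. It is not the TextExtracts implementation; the regex and the helper name `replace_br` are assumptions made purely for illustration.

```python
import re

# Matches <br>, <br/>, <br /> and attribute-carrying forms such as
# <br clear="both" />, case-insensitively. The \b keeps it from matching
# other tags that merely start with "br".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def replace_br(html, replacement=" "):
    """Replace every <br> tag in `html` with `replacement` (hypothetical helper)."""
    return BR_TAG.sub(replacement, html)


verse = "cap was green,<br>An' 'er name<br clear=\"both\" />was Supi-yaw-lat"

stripped = replace_br(verse, "")  # "cap was green,An' 'er namewas Supi-yaw-lat"
spaced = replace_br(verse)        # "cap was green, An' 'er name was Supi-yaw-lat"
```

Stripping joins the lines together, while substituting a space keeps the words apart, which is the behaviour suggested above.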

> Given lack of responses I guess we should go ahead with this?

👍

> We probably want to replace it with a space rather than strip to avoid lines joining.

👍

@phuedx should we do this? It seems like quite a trivial fix, so I'm open to fixing it, but I'm also wary of adding more complexity to an API that already has many problems. An alternative approach would be to add a warning as part of T170617.

Jdlrobson raised the priority of this task from Low to Medium. Jul 13 2017, 6:51 PM

Yeah! It'll make HTML extracts easier to consume – arguably – and there's been no pushback about it on the mailing list.

Jdlrobson lowered the priority of this task from Medium to Low. Jul 14 2017, 3:24 PM