Extracts API: Extracts strips lang attributes from html by flattening the span elements
Closed, Resolved (Public)

Description

I really love the extracts feature, but I noticed that all span elements are currently flattened out of the cleaned-up HTML.

But one of the biggest uses of span tags is to mark different languages and script directions with the lang and dir attributes. These languages often appear in the first line of an article on a non-English topic, so I think these elements are very important to preserve in our multilingual content.
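To make the problem concrete, here is a minimal Python sketch of what flattening does to language markup. It is only an illustration: the sample HTML is made up, `flatten_spans` is a hypothetical helper, and the regular expression is an assumption, not the way TextExtracts itself removes tags.

```python
import re


def flatten_spans(html: str) -> str:
    """Drop <span ...> and </span> tags but keep their inner text.

    Illustrative only: a regex stand-in for the flattening described above.
    """
    return re.sub(r"</?span\b[^>]*>", "", html)


# Made-up sample resembling the first line of an article on a non-English topic.
sample = 'Bengali (<span lang="bn" dir="ltr">বাংলা</span>) is a language.'
flattened = flatten_spans(sample)

# The text survives, but nothing marks it as Bengali any more.
print(flattened)
```

After flattening, the Bengali text is still there, but the lang and dir attributes that told a browser or screen reader how to treat it are gone.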


Version: master
Severity: normal

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:19 AM
bzimport added a project: TextExtracts.
bzimport set Reference to bz57582.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1454

Can you provide an example of real-life breakages caused by this removal?

Font selection for the Bengali language article extract probably fails for many people in this result: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&titles=Bengali_language&format=jsonfm

There is no indication another font needs to be used for this fragment, so only glyph fallback can save you. Voice software also won't know when to select a different voice.

You could make a similar argument for the font-family CSS style attribute, which ULS depends on for IPA, for instance. But since ULS can also use language attributes, I think those are a tad more important.
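For anyone who wants to reproduce the query linked above from a script, the following sketch builds the same request URL in Python. The function name `build_extract_url` is hypothetical, and the default of `format=json` rather than `jsonfm` is an assumption made because `jsonfm` is meant for viewing in a browser. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str, fmt: str = "json") -> str:
    """Build an extracts query URL for the intro section of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        # Boolean flag: it only needs to be present, so the value is empty.
        "exintro": "",
        "titles": title,
        "format": fmt,
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url("Bengali_language")
print(url)
```

Fetching that URL and searching the returned extract for `lang=` is a quick way to check whether the language markup survived.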

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1479

Change 183496 had a related patch set uploaded (by Phuedx):
Don't flatten spans

https://gerrit.wikimedia.org/r/183496

Patch-For-Review

@MaxSem is it really as simple as 183496? What have I missed?

I've updated 183496 to remove class and style attributes from all spans in the output.
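The behaviour described in that comment, keeping spans but dropping their class and style attributes, can be sketched in Python as follows. This is not the code in change 183496. It is a standalone illustration built on the standard library `html.parser` module, and the names `SpanAttributeStripper` and `strip_span_presentation` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Attributes removed from spans, per the comment above.
DROPPED_SPAN_ATTRS = {"class", "style"}


def render_tag(tag, attrs, closing=">"):
    """Serialize a start tag, re-escaping attribute values."""
    rendered = "".join(
        f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
        for name, value in attrs
    )
    return f"<{tag}{rendered}{closing}"


class SpanAttributeStripper(HTMLParser):
    """Re-emit HTML unchanged except for class/style attributes on spans."""

    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _filter(self, tag, attrs):
        if tag == "span":
            return [(k, v) for k, v in attrs if k not in DROPPED_SPAN_ATTRS]
        return attrs

    def handle_starttag(self, tag, attrs):
        self.parts.append(render_tag(tag, self._filter(tag, attrs)))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(render_tag(tag, self._filter(tag, attrs), closing=" />"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_span_presentation(html: str) -> str:
    parser = SpanAttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up sample input.
sample = '<p><span lang="bn" dir="ltr" class="foo" style="font-family: X">বাংলা</span> &amp; more</p>'
cleaned = strip_span_presentation(sample)
print(cleaned)
```

The lang and dir attributes pass through untouched, so font selection and voice software still have the language information, while presentation-only attributes are removed.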

Change 183496 merged by jenkins-bot:
Don't flatten spans

https://gerrit.wikimedia.org/r/183496

Jdlrobson claimed this task.