
Incorrectly truncated multibyte UTF-8 char
Closed, Resolved, Public

Description

Proposed change to the preg_match / preg_replace calls in checkTitleEncoding.

Problem: some links in the Russian-language interface are very long. For example, a category page link of the form

http://ru.wikisource.org/w/index.php?title=Category:CatName&from=PageName

looks like

http://ru.wikisource.org/w/index.php?title=%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%9F%D0%BE%D1%8D%D0%B7%D0%B8%D1%8F_%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B8%D0%BB%D0%B8%D0%B0%D0%BD%D0%B0_%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B8%D1%87%D0%B0_%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0%BD%D0%B0&from=%D0%9F%D1%83%D1%81%D1%82%D1%8B%D0%BD%D1%8F+%28%D0%98+%D1%8F+%D0%B1%D1%8B%D0%BB+%D1%81%D0%BE%D1%81%D0%BB%D0%B0%D0%BD+%D0%B2+%D0%B3%D0%BB%D1%83%D0%B1%D1%8C+%D1%81%D1%82%D0%B5%D0%BF%D0%B5%D0%B9+%E2%80%94+%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0

The "from" parameter is often truncated in the middle of a multibyte character.

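The truncation can be checked directly on the URL quoted above. The following is a minimal Python sketch, not MediaWiki code: it percent-decodes the last few escapes of the "from" value exactly as they appear in this report and tests whether the bytes end inside a character. The helper name is_strict_utf8 is made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# Last few escapes of the "from" value, copied from the URL in this report.
encoded_tail = "%E2%80%94+%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0"

# In a query string "+" stands for a space; unquote_to_bytes leaves "+" alone,
# so it is replaced by hand before decoding the percent escapes.
raw = unquote_to_bytes(encoded_tail.replace("+", " "))


def is_strict_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The final byte 0xD0 is the lead byte of a two-byte sequence whose
# continuation byte was cut off, so only the string minus that byte is valid.
ends_mid_character = not is_strict_utf8(raw) and is_strict_utf8(raw[:-1])
print(ends_mid_character)
```

Dropping the single dangling byte is enough to make the value decodable again, which suggests the cut happened at a byte position rather than at a character boundary.
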
The getGPCVal function in WebRequest.php calls checkTitleEncoding.

The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
    '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8 or not.

But the remnant of a truncated multibyte UTF-8 character at the end of the string does not match this regexp.

So checkTitleEncoding wrongly decides that the truncated string is not UTF-8 and applies the fallback8bitEncoding conversion to it.
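
The mismatch is easy to reproduce outside MediaWiki. The sketch below is a Python port of the regexp quoted above, applied to byte strings. It is an illustration only, and looks_like_utf8 is a made-up name, not a MediaWiki function.

```python
import re

# Same alternatives as the preg_match pattern quoted in this report.
UTF8_RE = re.compile(
    rb"^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|"
    rb"[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$"
)


def looks_like_utf8(s: bytes) -> bool:
    """Hypothetical helper: True if every byte belongs to a complete sequence."""
    return UTF8_RE.match(s) is not None


complete = "Волошин".encode("utf-8")
truncated = complete[:-1]  # cut inside the last two-byte character

print(looks_like_utf8(complete))   # every sequence is complete
print(looks_like_utf8(truncated))  # the dangling lead byte breaks the match
```

Because the pattern is anchored at both ends, one stray lead byte at the end makes the whole string fail, even though everything before it is well-formed UTF-8.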

As a result, the "next 200 pages" link on the following Russian Wikisource category page works incorrectly:

http://ru.wikisource.org/wiki/Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible neither on the first nor on the second category page.

I suggest changing the regular expression to allow for a possible truncated UTF-8 sequence at the end of the string.
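
One possible shape for such a relaxed check, sketched in Python rather than PHP. This is not the fix that was actually committed; it is an assumption about how the suggestion could be realized. All names here (VALID, PARTIAL, looks_like_truncated_utf8, strip_partial_tail) are hypothetical.

```python
import re

# One complete encoded character, same alternatives as the regexp in this report.
VALID = (
    rb"[\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|"
    rb"[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}"
)
# A lead byte followed by fewer continuation bytes than it announces.
PARTIAL = rb"[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]?|[\xf0-\xf7][\x80-\xbf]{0,2}"

# Complete characters, optionally followed by one cut-off sequence at the end.
RELAXED_RE = re.compile(rb"(?:" + VALID + rb")+(?:" + PARTIAL + rb")?")
# A cut-off sequence sitting at the very end of the string.
TAIL_RE = re.compile(rb"(?:" + PARTIAL + rb")\Z")


def looks_like_truncated_utf8(s: bytes) -> bool:
    """True for well-formed input that may end in a cut-off sequence."""
    return RELAXED_RE.fullmatch(s) is not None


def strip_partial_tail(s: bytes) -> bytes:
    """Drop a cut-off trailing sequence; leave any other input unchanged."""
    if not looks_like_truncated_utf8(s):
        return s
    return TAIL_RE.sub(b"", s)


truncated = "Волошин".encode("utf-8")[:-1]
print(strip_partial_tail(truncated).decode("utf-8"))
```

Text in a legacy 8-bit encoding, for example "Волошин".encode("cp1251"), is still rejected by looks_like_truncated_utf8, so the fallback conversion would continue to apply to it. Like the original pattern, this is only a structural check and does not reject overlong forms.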


Version: 1.12.x
Severity: minor


Details

Reference
bz12444

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:57 PM
bzimport set Reference to bz12444.
bzimport added a subscriber: Unknown Object (MLST).

Why is the "from" parameter truncated? Is there some kind of limit? Wouldn't the link be broken anyway, even if the encoding were detected correctly?

Cannot reproduce anymore with the example category.