
Numbering system grouping for Indian languages
Closed, ResolvedPublic

Description

MediaWiki groups digits in threes by default (1234567890 → 1,234,567,890). All Indian languages and many other Asian languages use a different grouping (1234567890 → 1,23,45,67,890). Magic words such as {{NUMBEROFARTICLES}} also return counts pre-formatted with 3-digit grouping. Languages like Malayalam, Hindi, Tamil, Kannada, Bengali, etc. need easily readable and understandable formatting in the traditional Indian grouping style.

Please see: http://en.wikipedia.org/wiki/Indian_numbering_system , http://ml.wikipedia.org/wiki/Special:Statistics (example of MediaWiki's default grouping)


Version: unspecified
Severity: enhancement

Details

Reference
bz29495

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:27 PM
bzimport set Reference to bz29495.

I could have sworn this had already been done, but can't find it or a bug for it. :)

This'll require either a customized commafy() method on the Language subclass, or a way of triggering different behavior from a setting from the Message file (similar to the way the digit transform table can be specified there).

Questions:

  • Should this grouping *always* be used, for all numbers? Are there exceptions for certain lengths or certain types of numbers? (Years in dates usually are not run through commafy() so won't have this applied.)
  • Should this grouping *always* be used, regardless of whether using indic or western style digits? (see bug 29279)
  • Would there be any controversy or conflict over making this change for any particular languages?
  • Is there a complete list of which languages this should apply to?

From http://en.wikipedia.org/wiki/Decimal_mark#Countries_using_Arabic_numerals_with_decimal_comma

In India, due to a numeral system using lakhs (lacs) (1,00,000 equal to
100 000) and crores (1,00,00,000 equal to 10 000 000), comma is used at
levels of thousand, lakh and crore, for example, 10 million (1 crore)
would be written as 1,00,00,000.

So it looks like it doesn't apply to Devanagari digits. But that is just a guess.

shijualex wrote:

So it looks like it doesn't apply to Devanagari digits. But that is just a
guess.

It applies to all Indic language numerals (Devanagari, Kannada, Bengali, Odia, ...). A few languages, like Malayalam, Tamil, and Telugu, use Indo-Arabic numerals. The above enhancement is necessary for them as well.

(In reply to comment #1)

  • Should this grouping *always* be used, for all numbers? Are there exceptions for certain lengths or certain types of numbers? (Years in dates usually are not run through commafy() so won't have this applied.)

Yes, years in dates should not be grouped. :)

  • Should this grouping *always* be used, regardless of whether using indic or western style digits? (see bug 29279)

This grouping should always be used, regardless of the style of digits. For example, Malayalam's own digits are archaic now, but Malayalam still uses Indian-style grouping with borrowed digits. Also, the words crore and lakh are derived directly from the Hindi words Karode and lakh.

  • Would there be any controversy or conflict over making this change for any particular languages?

As of now, this kind of grouping is the only style popular in India. Even English newspapers use the words crore and lakh and the 3,2,2,... grouping. For readers here, this grouping is the most easily readable and understandable.

  • Is there a complete list of which languages this should apply to?

Currently Malayalam (ml), Hindi (hi), Sanskrit (sa), Tamil (ta), Kannada (kn), Telugu (te), Marathi (mr), Urdu (ur), Oriya (or), Bengali (bn), Punjabi (pa), Gujarati (gu), Bhojpuri (bho), Assamese (as), and Kashmiri (ks) should use this grouping, except in dates, both with their own digits and with English (1,2,3,...,0) digits.

There may be more languages, such as Sinhalese (si), Burmese (my), Farsi (fa), Dhivehi (dv), and many other South East Asian languages, that use the same style of numbering system.

Ok since that's used for so many languages, probably easiest to do it in the base Language::commafy() triggered by a setting, that way we won't have to add extra classes just to duplicate the same alternate layout. :)

Maybe options like:

$digitGrouping = '1k';    1,000 10,000 1,000,000 (default)
$digitGrouping = '10k';   1000 10,000 1,000,000
$digitGrouping = 'indic'; 1,000 10,000 10,00,000
$digitGrouping = 'none';  1000 10000 1000000

The default behavior would be covered by '1k' mode, adding the thousands separator at every 3 digits.
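As a rough illustration of the four proposed modes, here is a minimal sketch in Python (not MediaWiki's PHP). The mode names come from the proposal above; the function name group_digits and its parameters are hypothetical, and it only handles non-negative integers.

```python
def group_digits(n: int, mode: str = "1k", sep: str = ",") -> str:
    """Group the digits of a non-negative integer according to `mode`.

    Hypothetical helper: the mode names mirror the proposed $digitGrouping
    values, but this is not the MediaWiki implementation.
    """
    if mode not in ("1k", "10k", "indic", "none"):
        raise ValueError(f"unknown grouping mode: {mode!r}")
    digits = str(n)
    if mode == "none":
        return digits
    if mode == "10k" and len(digits) <= 4:
        # Skip the separator until the number reaches 10,000.
        return digits
    # The rightmost group always has three digits.
    head, tail = digits[:-3], digits[-3:]
    # Indic style uses groups of two to the left of the first group of three.
    size = 2 if mode == "indic" else 3
    groups = [tail]
    while head:
        groups.insert(0, head[-size:])
        head = head[:-size]
    return sep.join(groups)


print(group_digits(1000000, "1k"))     # 1,000,000
print(group_digits(1000, "10k"))       # 1000
print(group_digits(1000000, "indic"))  # 10,00,000
print(group_digits(1000000, "none"))   # 1000000
```

Keeping all modes in one function mirrors the idea of a single setting on the base class rather than one subclass per layout.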

Current languages to switch from manual commafy() overrides to using '10k' mode, skipping the separator until reaching 10,000 (mostly Eastern European and some Central Asian languages):

  • be_tarask
  • bg
  • et
  • hy
  • kaa
  • kk_cyrl
  • ksh
  • ku_ku
  • pl
  • ru
  • uk

Current languages to switch from manual commafy() to 'none' mode:

  • km (see [[Khmer_numerals]])
  • my (see [[Burmese_numerals]])

I wasn't sure whether those non-conversions were right. Khmer and Burmese both use South-East Asian Indic script variants, but neither appears to use standard digit grouping in the examples on the pages above, so I believe 'none' is indeed correct. Note, though, that [[Indian_numbering_system]] lists Burmese as sometimes using the crore/lakh grouping.

mayurdce wrote:

I agree with Shiju and Praveen on this issue: all Indian languages and many other Asian languages use a different way of grouping (1234567890 → 1,23,45,67,890). I think this system should be applied universally to all Indic wikis, because all Indic wikis use the same format.

Regards
mayur

(In reply to comment #5)

Ok since that's used for so many languages, probably easiest to do it in the
base Language::commafy() triggered by a setting, that way we won't have to add
extra classes just to duplicate the same alternate layout. :)

Maybe options like:

$digitGrouping = '1k';    1,000 10,000 1,000,000 (default)
$digitGrouping = '10k';   1000 10,000 1,000,000
$digitGrouping = 'indic'; 1,000 10,000 10,00,000
$digitGrouping = 'none';  1000 10000 1000000

A better way to specify these options in a more generic way is to follow the LC_NUMERIC grouping property format of Glibc locale definitions.

"grouping keyword consists of a sequence of semicolon-separated integers. Each integer specifies the number of digits in a group. The initial integer defines the size of the group immediately to the left of the decimal delimiter. The following integers define succeeding groups to the left of the previous group. If the last integer is not -1, the size of the previous group (if any) is used repeatedly for the remainder of the digits. If the last integer is -1, no further grouping is performed."

3;-1 123456,789
3 123,456,789 (this is en_US default format)
3;2;-1 1234,56,789
3;2 12,34,56,789 (this is Indic)
-1 123456789 (equivalent to 'none')

This can cover any complex formatting requirements.
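To make the quoted rule concrete, here is a minimal sketch in Python (not MediaWiki's PHP) that applies a glibc-style grouping string to a non-negative integer. The function name apply_grouping and its parameters are hypothetical; only the semicolon-separated format and the meaning of -1 are taken from the quoted definition.

```python
def apply_grouping(n: int, grouping: str, sep: str = ",") -> str:
    """Format a non-negative integer using an LC_NUMERIC-style grouping string.

    Hypothetical helper: `grouping` is a semicolon-separated list of group
    sizes, read right to left; the last size repeats unless it is -1.
    """
    sizes = [int(part) for part in grouping.split(";")]
    rest = str(n)
    groups = []
    index = 0
    size = sizes[0]
    while rest:
        if index < len(sizes):
            size = sizes[index]
            index += 1
        # Otherwise keep repeating the last size that was read.
        if size == -1:
            # -1 means no further grouping: keep the remainder as one block.
            groups.insert(0, rest)
            break
        if size <= 0:
            raise ValueError(f"invalid group size: {size}")
        groups.insert(0, rest[-size:])
        rest = rest[:-size]
    return sep.join(groups)


for spec in ("3;-1", "3", "3;2;-1", "3;2", "-1"):
    print(spec, apply_grouping(123456789, spec))
```

Run on 123456789, this reproduces the five example outputs listed above, including 12,34,56,789 for the Indic setting 3;2.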

(In reply to comment #7)

A better way to specify these options in a more generic way is to follow the
LC_NUMERIC grouping property format of Glibc locale definitions.

<snip>

This can cover any complex formatting requirements.

Can you create a patch or implement this, Santhosh?

akshay.leadindia wrote:

Santhosh's solution can be implemented by modifying the specific language localization file in Glibc. For example, if we wanted to fix this issue for Hindi Wikipedia, we would edit the hi_IN localization file, change the LC_NUMERIC grouping value to 3;2, recompile PHP, and set the current locale of Hindi Wikipedia to hi_IN.

An alternative solution is to use PHP's built-in NumberFormatter class and specify the language-specific formatting as the pattern: http://www.php.net/manual/en/numberformatter.setpattern.php
http://www.icu-project.org/apiref/icu4c/classDecimalFormat.html#_details
This class offers a wide range of options for formatting & can be implemented with just a few lines of code.

Continuing with example of Hindi Wikipedia, this can be done as

Add to LocalSettings.php
$wgNumberPattern = ",,###";

Modify the commafy() in Language.php
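function commafy( $_ ) {
	global $wgNumberPattern;

	// Passing "0" makes setlocale() return the current locale without changing it.
	$currentLocale = setlocale( LC_NUMERIC, "0" );
	$numberFormat = new NumberFormatter( $currentLocale, NumberFormatter::DEFAULT_STYLE );
	// Apply the configured grouping pattern before formatting.
	$numberFormat->setPattern( $wgNumberPattern );
	return $numberFormat->format( $_ );
}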

We can do this in MediaWiki itself. Yes it's duplication, but we can't yet require PHP 5.3 nor wait for PHP to be patched.

r97793 adds the required feature to support a number grouping pattern. Will add the pattern to the Message classes soon.

,,# pattern added to ml, hi, pa, gu, or, bn, as, te, ta, kn, mr languages in r97804