We were never able to implement anything other than a basic, locale-
unaware collator with the JSON export of the CLDR as it did not have
collation data. We can now use ICU to implement collation.
ICU 72 began using non-ASCII spaces in some formatted date-time strings.
Every major browser has found that this introduced major breakage in web
compatibility, as many sites and tools expect ASCII spaces. This patch
removes these non-ASCII spaces in the same manner as the major engines.
Such behavior is also tested by WPT.
This required updating some LibJS spec steps to their latest versions,
as the data expected by the old steps does not quite match the APIs that
are available with the ICU. The new spec steps are much more aligned.
There have been a number of changes to the locale resolution AOs that
we've fallen behind on. Mostly editorial, but includes one normative
change to canonicalize Unicode extension keywords in the Intl.Locale
constructor.
This uses ICU for all of the Intl.PluralRules prototypes, which lets us
remove all data from our plural rules generator.
Plural rules depend directly on internal data from the number formatter,
so rather than creating a separate Locale::PluralRules class (which will
make accessing that data awkward), this adds plural rules APIs to the
existing Locale::NumberFormat.
The proposal has undergone quite a few normative changes since we last
synced with it. There was a time when it could not be implemented as it
was written, which is no longer the case. The resulting proposal has had
so many changes compared to our implementation, that it wouldn't make
sense to implement them commit-by-commit as we normally do. So instead,
this just implements the HEAD revision of the spec in one pass.
This uses ICU for the Intl.DateTimeFormat `format` `formatToParts`,
`formatRange`, and `formatRangeToParts`.
This lets us remove most data from our date-time format generator. All
that remains are time zone data and locale week info, which are relied
upon still for other interfaces. So they will be removed in a future
patch.
Note: All of the changes to the test files in this patch are now aligned
with other browsers. This includes:
* Some very incorrect formatting of Japanese symbols. (Looking at the
old results now, it's very obvious they were wrong.)
* Old FIXMEs regarding range formatting not including the start/end date
when only time fields were requested, but the dates differ.
* Day period inconsistencies.
The IntlMV is meant to be arbitrarily precise. If the user provides a
string value to be formatted, we lose precision by converting extremely
large values to a double. We were never able to address this, as support
for arbitrary precision was a big FIXME. But ICU can handle it by just
passing the raw string on through.
This uses ICU for the Intl.NumberFormat `formatRange` and
`formatRangeToParts` prototypes.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
This uses ICU for the Intl.NumberFormat `format` and `formatToParts`
prototypes. It does not yet port the range formatter prototypes.
Most of the new code in LibLocale/NumberFormat is simply mapping from
ECMA-402 types to ICU types. Beyond that, the only algorithmic change is
that we have to mutate the output from ICU for `formatToParts` to match
what is expected by ECMA-402. This is explained in NumberFormat.cpp in
`flatten_partitions`.
This lets us remove most data from our number format generator. All that
remains are numbering system digits and symbols, which are relied upon
still for other interfaces (e.g. Intl.DateTimeFormat). So they will be
removed in a future patch.
Note: All of the changes to the test files in this patch are now aligned
with both Chrome and Safari.
Note: We keep locale parsing and syntactic validation as-is. ECMA-402
places additional restrictions on locales above what is required by the
Unicode spec. ICU doesn't provide methods that let us easily check those
restrictions, whereas LibLocale does. Other browsers also implement
their own validators here.
This introduces a locale cache to re-use parsed locale data and various
related structures (not doing so has a non-negligible performance impact
on Intl tests).
The existing APIs for canonicalization and display names are pretty
intertwined, so they must both be adapted at once here. The results of
canonicalization are slightly different on some edge cases. But the
changed results are actually now aligned with Chrome and Safari.
https://cldr.unicode.org/index/downloads/cldr-44
Notable changes that affect us include:
* The Islamic Calendar is now localized as the Hijri Calender (in en-US)
but has not been updated for all locales. So this patch updates tests
where possible and removes a few test cases that currently cannot be
localized.
* The und locale has received more likely subtag data (the und locale is
basically a pseudo-locale meaning "undetermined").
* The exponential symbol in the Arabic number system was changed from
U+0627 to U+0623.
For example, the locale "fr-FR" will have the preferred hour cycle list
of "H hB", meaning h23 and h12-with-day-periods. Whether date-times are
actually formatted with day-periods is up to the user, but we need to
parse the hour cycle as h12 to know that the FR region supports h12.
This bug was revealed by LibJS no longer blindly falling back to h12 (if
the `hour12` option is true) or h24 (if the `hour12` option is false).
These are normative changes in the ECMA-402 spec. See:
896ffccaf4ec46e25c455
(This combines the above commits into one patch as they each do not work
on their own).
When formatting a currency style pattern with compact notation, we were
(trying to) doubly insert the currency symbol into the formatted string.
We would first look up the currency pattern in GetNumberFormatPattern
(for the en locale, this is "¤#,##0.00", which our generator transforms
to "{currency}{number}").
When we hit the "{number}" field, NumberFormat will do a second lookup
for the compact pattern to use for the number being formatted. By using
the currency compact patterns, we receive a second pattern that also has
the currency symbol (for the en locale, if formatting the number 1000,
this is "¤0K", which our generator transforms to
"{currency}{number}{compactIdentifier:0}". This second lookup is not
supposed to have currency symbols (or any other symbols), thus we hit a
VERIFY_NOT_REACHED().
Instead, we are meant to use the decimal compact pattern, and allow the
currency symbol to be handled by only the outer currency pattern.
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.
The String test cases fixed here are because "b'ar" is now considered
one word.
This is a normative change in the Intl.NumberFormat V3 spec. See:
08f599b
Note that this didn't seem to actually affect our implementation. The
Unicode spec states:
https://www.unicode.org/reports/tr35/tr35-53/tr35-numbers.html#Plural_Ranges
"If there is no value for a <start,end> pair, the default result is end"
Therefore, our implementation did not have the behavior noted by the
issue this normative change addressed:
const pr = new Intl.PluralRules("en-US");
pr.selectRange(1, 1); // Is "other", should be "one"
Our implementation already returned "one" here because there is no such
<start=one, end=one> value in the CLDR for en-US. Thus, we already
returned the end value of "one".
This is a normative change in the Intl.NumberFormat V3 spec. See:
29acfc6
This is to allow Intl.PluralRules to use these options, as they were in-
effect required by later AOs anyways.
There were some notable changes to the CLDR JSON format and data in this
release.
The patterns for a date at a specific time, i.e. "{date} at {time}", now
appear under the "atTime" attribute of the "dateTimeFormats" object.
Locale specific changes that affected test-js:
All locales:
* In many patterns, the code points U+00A0 (NO-BREAK SPACE) and U+202F
(NARROW NO-BREAK SPACE) are now used in place of an ASCII space. For
example, before the "dayPeriod" fields AM and PM.
* Separators such as U+2013 (EN DASH) are now surrounded by U+2009 (THIN
SPACE) in place of an ASCII space character.
Locale "en":
* Narrow localizations of time formats are even more narrow. For
example, the abbreviation "wk." for "week" is now just "wk".
Locale "ar":
* The code point U+060C (ARABIC COMMA) is now used in place of an ASCII
comma.
* The code point U+200F (RIGHT-TO-LEFT MARK) now appears at the
beginning of many localizations.
* When the "latn" numbering system is used for currency formatting, the
currency symbol more consistently is placed at the end of the pattern.
Locale "he":
* The "many" plural rules category has been removed.
Locales "zh" and "es-419":
* Several display-name localizations were changed.
This isn't called out in TR-35, but before ICU even looks at CLDR data,
it adds a hard-coded set of default patterns to each locale's calendar.
It has done this since 2006 when its DateTimeFormat feature was first
created. Several test262 tests depend on this, which under ECMA-402,
falls into "implementation defined" behavior. For compatibility, we
can do the same in LibUnicode.