Commit graph

65 commits

Author SHA1 Message Date
Timothy Flynn
63c3437274 LibUnicode: Use BCP 47 data to generate available calendars and numbers
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
2022-02-16 07:23:07 -05:00
Timothy Flynn
89ead8c00a LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, the also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
2022-02-16 07:23:07 -05:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
bced4e9324 LibJS+LibUnicode: Convert Intl.ListFormat to use Unicode::Style
Remove ListFormat's own definition of the Style enum, which was further
duplicated by a generated ListPatternStyle enum with the same values.
2022-01-25 19:02:59 +00:00
Timothy Flynn
e261132e8b LibUnicode: Add helper methods to convert a Style to and from a string
This conversion is duplicated a few times in our Intl implementation, so
let's just define these once and be done with it.
2022-01-25 19:02:59 +00:00
Timothy Flynn
b0671ceb74 LibUnicode: Add a method to combine locale subtags into a display string
This is just a convenience wrapper around the underlying generated APIs.
2022-01-13 23:05:31 +01:00
Timothy Flynn
91acc2e9c5 LibUnicode: Parse and generate locale display patterns
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
2022-01-13 23:05:31 +01:00
Timothy Flynn
8126cb2545 LibJS+LibUnicode: Remove unnecessary locale currency mapping wrapper
Before LibUnicode generated methods were weakly linked, we had a public
method (get_locale_currency_mapping) for retrieving currency mappings.
That method invoked one of several style-specific methods that only
existed in the generated UnicodeLocale.

One caveat of weakly linked functions is that every such function must
have a public declaration. The result is that each of those styled
methods are declared publicly, which makes the wrapper redundant
because it is just as easy to invoke the method for the desired style.
2022-01-13 13:43:57 +01:00
Timothy Flynn
0d75949827 LibUnicode: Parse and generate locale display names for date fields 2022-01-13 13:43:57 +01:00
Timothy Flynn
7f162c471d LibUnicode: Parse and generate locale display names for calendars
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
2022-01-13 13:43:57 +01:00
Timothy Flynn
f576142fe8 LibJS+LibUnicode: Convert UnicodeLocale to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
ba4cdf34f8 LibUnicode: Convert UnicodeDateTimeFormat to link with weak symbols 2022-01-04 22:49:43 +00:00
Timothy Flynn
09be26b5d2 LibUnicode: Dynamically load the generated UnicodeLocale symbols 2021-12-21 13:09:49 -08:00
Timothy Flynn
6dbdfb6ba1 LibUnicode: Add special handling of hour cycle (hc) Unicode keywords
For other keywords, allowed values per locale are generated at compile
time. But since the CLDR doesn't present hour cycles on a per-locale
basis, and hour cycles lookups depend on runtime data, we must handle
hour cycle keyword lookups differently than other keywords.
2021-11-29 22:48:46 +00:00
Timothy Flynn
914675e826 LibJS+LibUnicode: Separate number formatting methods from Locale.h
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
2021-11-29 22:48:46 +00:00
Timothy Flynn
cafb717486 LibUnicode: Parse and generate CLDR unit data for Intl.NumberFormat
The units data is in another CLDR package, cldr-units.
2021-11-16 23:14:09 +00:00
Timothy Flynn
80493908d3 LibUnicode: Tweak the definition of the plurality "many"
As noted at the top of this method, this is a naive implementation of
the Unicode plurality specification. But for now, we should tweak the
defintion of "many" to be "more than 2" (which is what I had in mind
when I wrote this, but forgot about fractions).
2021-11-16 23:14:09 +00:00
Timothy Flynn
6d34a0b4e8 LibJS+LibUnicode: Rename method to select a NumberFormat plurality
Instead of currency pattern lookups within select_currency_unit_pattern,
rename the method to select_pattern_with_plurality and accept any list
of patterns. This method will be needed for units.
2021-11-16 23:14:09 +00:00
Timothy Flynn
3b7f5af042 LibUnicode: Generate primary and secondary number grouping sizes
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
2021-11-14 10:35:19 +00:00
Timothy Flynn
c65dea64bd LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array return by formatToParts will not contain the
expected currency key.

Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.

Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
2021-11-13 19:01:25 +00:00
Timothy Flynn
0c9711efba LibUnicode: Handle all space code points when creating currency patterns
Previously, we were checking if the code point immediately before/after
the {currency} key was U+00A0 (non-breaking space). Instead, to handle
other spacing code points, we must check if the surrounding code point
has the separator general category.
2021-11-13 19:01:25 +00:00
Timothy Flynn
ada4bab405 LibUnicode: Remove GeneralCategory::Symbol string lookup
When I originally wrote this method, I had it in LibJS, where we can't
refer to the GeneralCategory enumeration directly. This is a big TODO,
anyone outside of LibUnicode can't assume the generated enumerations
exist and must get these values by string lookup. But this function
ended up living in LibUnicode, who can reference the enumeration.
2021-11-13 19:01:25 +00:00
Timothy Flynn
a701ed52fc LibJS+LibUnicode: Fully implement currency number formatting
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:

    en: "¤#,##0.00"
    ar: "¤\u00A0#,##0.00"

Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.

ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
2021-11-13 11:52:45 +00:00
Timothy Flynn
39e031c4dd LibJS+LibUnicode: Generate all styles of currency localizations
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.

Note: There are some discrepencies between the engines on how style is
handled. In particular, running:

new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')

Gives:

  SpiderMoney: "USD"
  V8: "US Dollar"
  LibJS: "$"

And running:

new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')

Gives:

  SpiderMonkey: "$"
  V8: "US Dollar"
  LibJS: "$"

My best guess is V8 isn't handling style, and just returning the long
form (which is what LibJS did before this commit). And SpiderMoney can
handle some styles, but if they don't have a value for the requested
style, they fall back to the canonicalized code passed into of().
2021-11-13 11:52:45 +00:00
Timothy Flynn
1f2ac0ab41 LibUnicode: Move number formatting code generator to UnicodeNumberFormat 2021-11-12 20:46:38 +00:00
Timothy Flynn
feb8c22a62 LibUnicode: Parse and generate standard percentage formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
604a596c90 LibUnicode: Parse and generate compact decimal formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
12b468a588 LibUnicode: Begin parsing and generating locale number systems
The number system data in the CLDR contains information on how to format
numbers in a locale-dependent manner. Start parsing this data, beginning
with numeric symbol strings. For example the symbol NaN maps to "NaN" in
the en-US locale, and "非數值" in the zh-Hant locale.
2021-11-12 09:17:08 +00:00
Timothy Flynn
3ae4ff109f LibUnicode: Extract canonicalization of Unicode extension values
LibJS will need to canonicalize Unicode extension values, so extract the
lambda that was doing this work to its own function. This also changes
the helpers it invokes to take the provided key as a StringView because
we don't need (and won't always have) full String objects here.
2021-09-11 11:05:50 +01:00
Timothy Flynn
b1d4bcf364 LibUnicode: Generate numeric keyword values for each locale
This is needed for Intl.NumberFormat's usage of the ResolveLocale AO,
where the [[RelevantExtensionKeys]] internal slot will be "nu".
2021-09-11 11:05:50 +01:00
Timothy Flynn
4f2bcebe74 LibUnicode+LibJS: Store locale keyword values as a single string
Previously, LibUnicode would store the values of a keyword as a Vector.
For example, the locale "en-u-ca-abc-def" would have its keyword "ca"
stored as {"abc, "def"}. Then, canonicalization would occur on each of
the elements in that Vector.

This is incorrect because, for example, the keyword value "true" should
only be dropped if that is the entire value. That is, the canonical form
of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change
for canonicalization. However, we would canonicalize that locale as
"en-u-kb-abc".
2021-09-08 21:08:48 +01:00
Timothy Flynn
75657b79c6 LibUnicode: Update comment with link to related upstream issue
LibUnicode has to hard-code some aliases because the related data is not
available in the JSON export of CLDR. Turns out there is a ticket to add
this data in an upcoming CLDR release. Add a link to that ticket for
reference.
2021-09-08 21:08:48 +01:00
Timothy Flynn
3f64a14e06 LibUnicode: Parse and generate the Unicode locale list patterns dataset
This data informs consumers how to join lists of values. For example,
in en-US, the list ["a", "b", "c"] formatted to a string should become
"a, b, and c".
2021-09-06 23:49:56 +01:00
Timothy Flynn
12ae0a44d7 LibUnicode: Add public wrapper for the generated locale_from_string 2021-09-06 15:24:27 +01:00
Timothy Flynn
a77f323dfb LibUnicode: Implement the Remove Likely Subtags method
Unlike Add Likely Subtags, this method doesn't require generated data.
Instead, it is defined in terms of Add Likely Subtags.
2021-09-04 13:51:40 +01:00
Timothy Flynn
e6a2ab1202 LibUnicode: Generate an implementation of the Add Likely Subtags method 2021-09-04 13:51:40 +01:00
Timothy Flynn
ca90231794 LibUnicode: Define is_unicode_*_subtag helpers inline in their header
The UnicodeLocale generator will need to parse canonicalized locale
strings, and will require using these methods. However, the generator
cannot depend on LibUnicode because Locale.cpp within LibUnicode already
depends on the generated file. Instead, defining the methods that the
generator needs inline allows the generator to use them without linking
against LibUnicode.
2021-09-04 13:51:40 +01:00
Timothy Flynn
21c4922ac0 LibUnicode: Add helper methods to LocaleID and LanguageID for LibJS
Add a method to remove an extension type from the locale's extension set
and methods to convert a locale and language to a string without
canonicalization. Each of these will be used by LibJS.
2021-09-02 17:56:42 +01:00
Timothy Flynn
a05419db55 LibUnicode: Add lexer to test if a string matches the "type" production 2021-09-02 17:56:42 +01:00
Timothy Flynn
fd0011989a LibUnicode: Resolve the most likely territory alias when there are many 2021-09-01 14:14:47 +01:00
Timothy Flynn
1fbc5dba08 LibUnicode: Generate Unicode locale likely subtag data
CLDR contains a set of likely subtag data where, given a locale, you can
resolve what is the most likely language, script, or territory of that
locale. This data is needed for resolving territory aliases. These
aliases might contain multiple territories, and we need to resolve which
of those territories is most likely correct for a locale.

Note that the likely subtag data is quite huge (a few thousand entries).
As an optimization encouraged by the spec, we only generate the smallest
subset of this data that we actually need (about 150 entries).
2021-09-01 14:14:47 +01:00
Timothy Flynn
72f49e42b4 LibUnicode: Perform complex Unicode locale alias substitution 2021-09-01 14:14:47 +01:00
Timothy Flynn
da89cf9afb LibUnicode: Canonicalize calendar subtags
Calendar subtags are a bit of an odd-man-out in that we must match the
variants "ethiopic-amete-alem" in that order, without any other variant
in the locale. So a separate method is needed for this, and we now defer
sorting the variant list until after other canonicalization is done.
2021-09-01 14:14:47 +01:00
Timothy Flynn
8458f477a4 LibUnicode: Canonicalize timezone subtags 2021-09-01 14:14:47 +01:00
Timothy Flynn
335f985b31 LibUnicode: Canonicalize the subtag "imperial" to "uksystem" 2021-09-01 14:14:47 +01:00
Timothy Flynn
2d90144888 LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN" 2021-09-01 14:14:47 +01:00
Timothy Flynn
409f39b336 LibUnicode: Canonicalize the subtag "names" to "prprname" 2021-09-01 14:14:47 +01:00
Timothy Flynn
f907a7dc38 LibUnicode: Canonicalize the subtag "yes" to "true" 2021-09-01 14:14:47 +01:00
Timothy Flynn
556374a904 LibUnicode: Substitute Unicode locale aliases during canonicalization
Unicode TR35 defines how locale subtag aliases should be emplaced when
converting a locale to canonical form. For most subtags, it is a simple
substitution. Language subtags depend on context; for example, the
language "sh" should become "sr-Latn", but if the original locale has a
script subtag already ("sh-Cyrl"), then only the language subtag of the
alias should be taken ("sr-Latn").

To facilitate this, we now make two passes when canonicalizing a locale.
In the first pass, we convert the LocaleID structure to canonical syntax
(where the conversions all happen in-place). In the second pass, we form
the canonical string based on the canonical syntax.
2021-09-01 14:14:47 +01:00
Timothy Flynn
9b118f1f06 LibUnicode: Generate Unicode locale alias data
CLDR contains a set of aliases for languages, territories, etc. that no
longer are meant to be used (e.g. due to deprecation). For example, the
language "aam" is deprecated and should be canonicalized as "aas".
2021-09-01 14:14:47 +01:00