We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, this also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
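As a rough illustration of the difference between the two approaches
(the names and data below are made up for the example, not the actual
generated symbols):

```cpp
#include <array>
#include <cstddef>
#include <span>
#include <string_view>
#include <vector>

// Before: the keywords for a key such as "nu" were emitted as a single
// comma-separated string and split at runtime.
static constexpr std::string_view s_nu_keywords_csv = "latn,arab,grek";

static std::vector<std::string_view> split_keywords(std::string_view csv)
{
    std::vector<std::string_view> keywords;
    size_t start = 0;
    while (start <= csv.size()) {
        auto comma = csv.find(',', start);
        if (comma == std::string_view::npos)
            comma = csv.size();
        keywords.push_back(csv.substr(start, comma - start));
        start = comma + 1;
    }
    return keywords;
}

// After: the generator emits the keywords directly and hands out a span,
// so no runtime splitting or allocation is needed.
static constexpr std::array<std::string_view, 3> s_nu_keywords { "latn", "arab", "grek" };

static std::span<std::string_view const> nu_keywords()
{
    return s_nu_keywords;
}
```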
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.
This is insufficient for language display names like "en-GB", which do
not have a locale entry in the CLDR, and thus are skipped. So instead, we
create the list of known languages by actually reading through the list
of languages which have a display name.
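A sketch of the new collection step; the parsed display-name JSON is
modeled here as a plain map for illustration, while the real generator
reads this data per locale from the CLDR:

```cpp
#include <map>
#include <set>
#include <string>

// e.g. { "en", "English" }, { "en-GB", "British English" }, ...
using LanguageDisplayNames = std::map<std::string, std::string>;

static std::set<std::string> collect_known_languages(LanguageDisplayNames const& display_names)
{
    std::set<std::string> known_languages;
    // Take the keys of the display-name data itself, so entries like
    // "en-GB" (which have no locale of their own in the CLDR) are not lost.
    for (auto const& entry : display_names)
        known_languages.insert(entry.first);
    return known_languages;
}
```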
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
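The generated table might look roughly like the following; the digits
shown are the ones the CLDR publishes in numberingSystems.json, but the
structure and names here are illustrative:

```cpp
#include <array>
#include <string_view>

struct NumberSystemDigits {
    std::string_view number_system;
    std::array<std::string_view, 10> digits; // digits 0-9, as UTF-8 strings
};

static constexpr std::array s_number_system_digits {
    NumberSystemDigits { "latn", { "0", "1", "2", "3", "4", "5", "6", "7", "8", "9" } },
    NumberSystemDigits { "arab", { "٠", "١", "٢", "٣", "٤", "٥", "٦", "٧", "٨", "٩" } },
};
```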
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
The generator parses metaZones.json to form a mapping of meta zones to
time zones (AKA "golden zone" in TR-35). This parser erroneously assumed
this was a 1-to-1 mapping, when in fact a meta zone may correspond to
more than one time zone.
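The corrected shape of the mapping is roughly the following; the
container choice and names are illustrative, not the generator's actual
code:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using MetaZoneToTimeZones = std::unordered_map<std::string, std::vector<std::string>>;

static void add_time_zone_to_meta_zone(MetaZoneToTimeZones& mapping, std::string const& meta_zone, std::string time_zone)
{
    // Append instead of overwrite, so a meta zone keeps every time zone
    // it maps to.
    mapping[meta_zone].push_back(std::move(time_zone));
}
```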
Unlike most data in the CLDR, hour cycles are not stored on a per-locale
basis. Instead, they are keyed by a string that is usually a region, but
sometimes is a locale. Therefore, given a locale, to determine the hour
cycles for that locale, we:
1. Check if the locale itself is assigned hour cycles.
2. If the locale has a region, check if that region is assigned hour
cycles.
3. Otherwise, maximize that locale, and if the maximized locale has
a region, check if that region is assigned hour cycles.
4. If the above all fail, fall back to the "001" region.
Further, each locale's default hour cycle is the first assigned hour
cycle (see the sketch below).
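A self-contained sketch of this fallback chain; the helpers and the tiny
hard-coded data set below stand in for LibUnicode's real locale handling
and generated tables:

```cpp
#include <optional>
#include <span>
#include <string_view>
#include <unordered_map>
#include <vector>

enum class HourCycle { H11, H12, H23, H24 };

// Tiny illustrative data set, keyed by region or locale.
static std::unordered_map<std::string_view, std::vector<HourCycle>> const s_hour_cycles {
    { "001", { HourCycle::H23 } },
    { "US", { HourCycle::H12, HourCycle::H23 } },
};

static std::span<HourCycle const> hour_cycles_for_key(std::string_view key)
{
    if (auto it = s_hour_cycles.find(key); it != s_hour_cycles.end())
        return it->second;
    return {};
}

// Grossly simplified stand-ins for real locale handling.
static std::optional<std::string_view> region_of(std::string_view locale)
{
    if (auto dash = locale.rfind('-'); dash != std::string_view::npos && locale.size() - dash - 1 == 2)
        return locale.substr(dash + 1);
    return {};
}

static std::string_view maximize_locale(std::string_view locale)
{
    return locale == "en" ? "en-Latn-US" : locale; // e.g. via likely subtags
}

std::span<HourCycle const> hour_cycles_for_locale(std::string_view locale)
{
    // 1. The key may be the locale itself.
    if (auto cycles = hour_cycles_for_key(locale); !cycles.empty())
        return cycles;

    // 2. Otherwise, try the locale's own region subtag, if it has one.
    if (auto region = region_of(locale); region.has_value())
        if (auto cycles = hour_cycles_for_key(*region); !cycles.empty())
            return cycles;

    // 3. Otherwise, maximize the locale and try the maximized region.
    if (auto region = region_of(maximize_locale(locale)); region.has_value())
        if (auto cycles = hour_cycles_for_key(*region); !cycles.empty())
            return cycles;

    // 4. Fall back to the "001" (world) region.
    return hour_cycles_for_key("001");
}

// A locale's default hour cycle is then the first entry of the returned span.
```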
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However, some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
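A rough sketch of how the two sizes are applied when inserting
separators; this is illustrative only, not the actual formatting code:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

static std::string group_digits(std::string_view digits, size_t primary, size_t secondary, char separator = ',')
{
    std::string result(digits);
    if (result.size() <= primary)
        return result;

    // The primary size applies to the least significant digits...
    size_t position = result.size() - primary;
    result.insert(position, 1, separator);

    // ...and the secondary size to every more significant group.
    while (position > secondary) {
        position -= secondary;
        result.insert(position, 1, separator);
    }
    return result;
}

// group_digits("1234567", 3, 3) == "1,234,567"   (en-US style)
// group_digits("1234567", 3, 2) == "12,34,567"   (hi-IN style)
```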
This data is published under ISO-4217 as an XML file. Since we can't
parse XML files yet, and the data isn't very large, it was translated to
C++ manually here.
This data informs consumers how to join lists of values. For example,
in en-US, the list ["a", "b", "c"] formatted to a string should become
"a, b, and c".
Most alias substitutions are "simple", meaning that alias matching is
done by examining a single locale subtag. However, there are a handful
of "complex" aliases where matching is done by examining multiple
subtags. For example, the variant subtag "lojban" causes the locale
"art-lojban" to be canonicalized to "jbo", but only when the language
subtag is "art" (i.e. this should not occur for the locale "en-lojban").
This generates a method to perform complex alias matching.
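A sketch of what complex matching means for the example above; the real
generated method covers more aliases and works on LibUnicode's parsed
locale data rather than these stand-in types:

```cpp
#include <optional>
#include <string_view>
#include <vector>

struct LocaleSubtags {
    std::string_view language;
    std::vector<std::string_view> variants;
};

static bool has_variant(LocaleSubtags const& locale, std::string_view variant)
{
    for (auto const& v : locale.variants)
        if (v == variant)
            return true;
    return false;
}

// Returns the canonicalized language if a complex alias applies.
static std::optional<std::string_view> resolve_complex_language_alias(LocaleSubtags const& locale)
{
    // "art-lojban" -> "jbo": only when the language is "art" AND the variant
    // "lojban" is present; "en-lojban" must be left untouched.
    if (locale.language == "art" && has_variant(locale, "lojban"))
        return "jbo";
    return std::nullopt;
}
```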
This commit is in preparation for upcoming commits which add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.
Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the size of the generated file reduces by about 1.5MB.
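Roughly, the table layout this implies looks like the following; the
contents and names here are illustrative, not the generated code:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Unique strings live in one small table...
static constexpr std::array<std::string_view, 3> s_string_table {
    "latn", "arab", "grek",
};

// ...and per-locale data stores indices into it rather than the strings.
struct LocaleData {
    size_t default_number_system { 0 }; // index into s_string_table
};

static constexpr std::array<LocaleData, 2> s_locales {
    LocaleData { 0 }, // e.g. "en"
    LocaleData { 1 }, // e.g. "ar"
};

static constexpr std::string_view default_number_system(size_t locale_index)
{
    return s_string_table[s_locales[locale_index].default_number_system];
}
```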
The UCD data set contained a very small subset of all locales, just to
handle some special casing rules. This enumeration will be needed within
the CLDR generator as well. So rather than duplicate the enum, remove it
from the UCD generator in favor of the full list of locales known by the
CLDR generator.
The Unicode standard publishes a database known as the Common Locale
Data Repository (CLDR). This is a massive set of data from which anyone
implementing Unicode's Technical Standard #35 may generate their
implementation: https://www.unicode.org/reports/tr35/
This commit updates LibUnicode to download the compressed database and
extract a small subset. That subset is used to generate a list of
available locales and the territories (AKA regions) associated with each
locale.
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:
* Some General Categories are applied to unassigned code points, for
example the Unassigned (Cn) category. Unassigned code points are
strictly excluded from UnicodeData.txt, so by relying on that file,
the generator is unable to handle these categories.
* Lookups for General Categories are slower when searching through the
large UnicodeData hash map. Even though lookups are O(1), the hash
function turned out to be slower than binary searching through a
category-specific table.
So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done over the code point
ranges in that category's table to check whether the code point has that
category.
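A sketch of such a per-category range table and the binary search over
it; the struct, table, and ranges below are stand-ins for the generated
code:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <span>

struct CodePointRange {
    uint32_t first { 0 };
    uint32_t last { 0 };
};

// A (heavily truncated) stand-in for one category's table; ranges must be
// sorted and non-overlapping.
static constexpr std::array<CodePointRange, 2> s_category_ranges { {
    { 0x0378, 0x0379 },
    { 0x0380, 0x0383 },
} };

static bool code_point_has_category(uint32_t code_point, std::span<CodePointRange const> ranges)
{
    // Find the first range whose last code point is not below the query,
    // then check whether that range actually contains it.
    auto it = std::lower_bound(ranges.begin(), ranges.end(), code_point,
        [](CodePointRange const& range, uint32_t value) { return range.last < value; });
    return it != ranges.end() && code_point >= it->first;
}
```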
Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
There are a couple of minor nuances with parsing script values, compared
to other properties. In Scripts.txt, the UCD file lists the full name of
each script; other properties, like General Category, list the shorter
name in their primary files. This means that the aliases listed in
PropertyValueAliases.txt are reversed for script values.
This downloads the PropertyValueAliases.txt UCD file, which contains a
set of General Category aliases.
This changes the General Category enumeration to now be generated as a
bitmask. This is to easily allow General Category unions. For example,
the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt
categories.
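A sketch of the idea; the enumerator values and the helper below are
illustrative, not the generated code:

```cpp
#include <cstdint>

enum class GeneralCategory : uint64_t {
    Lu = 1ull << 0, // Uppercase_Letter
    Ll = 1ull << 1, // Lowercase_Letter
    Lt = 1ull << 2, // Titlecase_Letter
    LC = Lu | Ll | Lt, // Cased_Letter is simply the union (bitwise OR)
};

static constexpr bool has_general_category(GeneralCategory value, GeneralCategory mask)
{
    return (static_cast<uint64_t>(value) & static_cast<uint64_t>(mask)) != 0;
}

// has_general_category(GeneralCategory::Ll, GeneralCategory::LC) is true.
```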
Adds methods to retrieve a Unicode property from a string and to check
if a code point matches a Unicode property.
Also adds a <LibUnicode/Forward.h> header.
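A usage sketch only; the declarations below are stand-ins mirroring the
description above, not necessarily LibUnicode's exact API:

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

enum class Property; // generated list of Unicode properties

std::optional<Property> property_from_string(std::string_view);
bool code_point_has_property(uint32_t code_point, Property);

static bool matches_property(uint32_t code_point, std::string_view property_name)
{
    // Look the property up by name, then test the code point against it.
    if (auto property = property_from_string(property_name); property.has_value())
        return code_point_has_property(code_point, *property);
    return false;
}
```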