Commit graph

8 commits

Author SHA1 Message Date
Timothy Flynn
6719e5cb17 LibUnicode: Generate locale subtag data as multiple smaller tables
This commit is preemptive to upcoming commits which add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.

Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the size of the generated file reduces by about 1.5MB.
2021-08-27 12:32:24 +01:00
Timothy Flynn
b8ad4d302e LibUnicode: Move Locale enumeration from generated UCD data to CLDR data
The UCD set of data contained a very small subset of all locales just to
handle some special casing rules. This enumeration will be needed within
the CLDR generator as well. So rather than duplicate the enum, remove it
from the UCD generator in favor of the full list of locales known by the
CLDR generator.
2021-08-27 12:32:24 +01:00
Timothy Flynn
ea21573ed8 LibUnicode: Download Unicode's CLDR database and generate locale data
The Unicode standard publishes a database known as the Common Locale
Data Repository (CLDR). This is a massive set of data from which anyone
implementing Unicode's Technical Standard #35 may generate their
implementation: https://www.unicode.org/reports/tr35/

This commit updates LibUnicode to download the compressed database and
extract a small subset. That subset is used to generate a list of
available locales and the territories (AKA regions) associated with each
locale.
2021-08-26 22:04:09 +01:00
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
Timothy Flynn
f5c1bbc00b LibUnicode: Parse UCD Scripts.txt and generate as a Unicode property
There are a couple of minor nuances with parsing script values, compared
to other properties. In Scripts.txt, the UCD file lists the full name of
each script; other properties, like General Category, list the shorter
name in their primary files. This means that the aliases listed in
PropertyValueAliases.txt are reversed for script values.
2021-08-04 13:50:32 +01:00
Timothy Flynn
16e86ae743 LibUnicode: Generate General Category unions and aliases
This downloads the PropertyValueAliases.txt UCD file, which contains a
set of General Category aliases.

This changes the General Category enumeration to now be generated as a
bitmask. This is to easily allow General Category unions. For example,
the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt
categories.
2021-08-02 21:02:09 +04:30
Timothy Flynn
f1809db994 LibUnicode: Add public methods to compare and lookup Unicode properties
Adds methods to retrieve a Unicode property from a string and to check
if a code point matches a Unicode property.

Also adds a <LibUnicode/Forward.h> header.
2021-07-30 21:26:31 +01:00