Commit graph

256 commits

Author SHA1 Message Date
Timothy Flynn
671eaa0c59 LibUnicode: Add helper lambda for appending canonicalized strings
Once canonical extensions are implemented, the number of:

    if (optional_string.has_value()) {
        builder.append('-');
        builder.append(optional_string->to_lowercase_string());
    }

Will be quite large. This commit just adds a helper lambda to handle
this pattern to prevent this function from becoming even more enormous.
2021-08-30 19:42:40 +01:00
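A minimal sketch of the helper-lambda pattern this message describes, written against the standard library rather than Serenity's AK types. The function name canonicalize_subtags, its parameters, and the lambda name append_lowercase are hypothetical; the sketch only mirrors the append-a-dash-then-lowercase snippet quoted above and is not the actual LibUnicode code.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in for the real function: joins optional subtags with '-'.
std::string canonicalize_subtags(std::string const& language,
                                 std::optional<std::string> const& script,
                                 std::optional<std::string> const& region)
{
    std::string builder = language;

    // Helper lambda: appends "-<lowercased value>" only when a value is present,
    // replacing one copy of the repeated if-block per optional segment.
    auto append_lowercase = [&](std::optional<std::string> const& segment) {
        if (!segment.has_value())
            return;
        builder += '-';
        for (unsigned char ch : *segment)
            builder += static_cast<char>(std::tolower(ch));
    };

    append_lowercase(script);
    append_lowercase(region);
    return builder;
}
```

Each additional optional segment then costs one call to the lambda instead of another four-line block.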
Timothy Flynn
30855e6663 LibUnicode: Parse locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
29f76ef7c8 LibUnicode: Parse locale extensions of the other extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
d2d304fcf8 LibUnicode: Parse locale extensions of the transformed extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
eda92d15e4 LibUnicode: Parse locale extensions of the Unicode locale extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
dd89901b07 LibUnicode: Use GenericLexer to parse Unicode language IDs
This is preparatory work to read locale extensions. The parser currently
enforces that the entire string is consumed. But to parse extensions,
parse_unicode_locale_id() will need parse_unicode_language_id() to just
stop parsing on the first segment that does not match the language ID
grammar. It will also need to know where the parsing stopped. Both of
these needs are fulfilled by GenericLexer.

The caveat is that we can no longer simply split the parsed string on
separator characters. So parse_unicode_language_id() now operates as a
small state machine.
2021-08-30 19:42:40 +01:00
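The state-machine approach described above might look roughly like the following. This is a simplified sketch using std::string_view instead of GenericLexer; the LanguageID struct, its consumed field, and the subtag predicates are assumptions made for illustration, and the grammar is reduced to language, script, region, and variant subtags. Parsing stops at the first segment that fits no remaining slot and reports how far it got.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

struct LanguageID {
    std::string language;
    std::optional<std::string> script;
    std::optional<std::string> region;
    std::vector<std::string> variants;
    std::size_t consumed { 0 }; // Offset at which parsing stopped.
};

static bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool is_alnum(char c) { return is_alpha(c) || is_digit(c); }

static bool all_match(std::string_view s, bool (*predicate)(char))
{
    for (char c : s) {
        if (!predicate(c))
            return false;
    }
    return true;
}

// Simplified subtag shapes (length and character class only).
static bool is_language(std::string_view s)
{
    bool length_ok = (s.size() >= 2 && s.size() <= 3) || (s.size() >= 5 && s.size() <= 8);
    return length_ok && all_match(s, is_alpha);
}
static bool is_script(std::string_view s) { return s.size() == 4 && all_match(s, is_alpha); }
static bool is_region(std::string_view s)
{
    return (s.size() == 2 && all_match(s, is_alpha)) || (s.size() == 3 && all_match(s, is_digit));
}
static bool is_variant(std::string_view s)
{
    if (s.size() >= 5 && s.size() <= 8)
        return all_match(s, is_alnum);
    return s.size() == 4 && is_digit(s[0]) && all_match(s, is_alnum);
}

std::optional<LanguageID> parse_language_id(std::string_view input)
{
    enum class State { Language, Script, Region, Variant };
    State state = State::Language;
    LanguageID id;
    std::size_t position = 0;

    while (true) {
        // Every segment after the first must be preceded by a separator.
        std::size_t start = position;
        if (state != State::Language) {
            if (start >= input.size() || (input[start] != '-' && input[start] != '_'))
                break;
            ++start;
        }
        std::size_t end = start;
        while (end < input.size() && input[end] != '-' && input[end] != '_')
            ++end;
        std::string_view segment = input.substr(start, end - start);

        bool matched = false;
        switch (state) {
        case State::Language:
            if (!is_language(segment))
                return std::nullopt; // A language subtag is mandatory here.
            id.language = std::string(segment);
            state = State::Script;
            matched = true;
            break;
        case State::Script:
            if (is_script(segment)) {
                id.script = std::string(segment);
                state = State::Region;
                matched = true;
                break;
            }
            [[fallthrough]]; // Script is optional; try the next slot.
        case State::Region:
            if (is_region(segment)) {
                id.region = std::string(segment);
                state = State::Variant;
                matched = true;
                break;
            }
            [[fallthrough]]; // Region is optional too.
        case State::Variant:
            if (is_variant(segment)) {
                id.variants.emplace_back(segment);
                state = State::Variant;
                matched = true;
            }
            break;
        }

        if (!matched)
            break; // Leave the unmatched segment for the caller (e.g. extensions).
        position = end;
        id.consumed = position;
    }
    return id;
}
```

For an input such as "en-US-u-ca-gregory", the sketch stops before "-u", so a caller could resume extension parsing from the consumed offset.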
Andrew Kaster
63956b36d0 Everywhere: Move all host tools into the Lagom/Tools subdirectory
This allows us to remove all the add_subdirectory calls from the top
level CMakeLists.txt that referred to targets linking LagomCore.

Segregating the host tools and Serenity targets helps us get to a place
where the main Serenity build can simply use a CMake toolchain file
rather than swapping all the compiler/sysroot variables after building
host libraries and tools.
2021-08-28 08:44:17 +01:00
Andrew Kaster
e88761b2b9 Meta+LibUnicode: Move unicode_data helper to Meta/CMake
Moving this helper CMake file to the centralized Meta/CMake folder helps
to get a better grasp on what extra files are required for the build,
and what files are generated.

While we're at it, don't use add_compile_definitions for
ENABLE_UNICODE_DATA, which only needs to be seen by LibUnicode sources.
2021-08-28 08:44:17 +01:00
Robert Syring
4f2dc8db26 LibUnicode: Change unzip commands to also extract subdirectories
Changed unzip commands from * to ** in order to also extract subdirectories from cldr.zip.
2021-08-28 08:13:32 +01:00
Timothy Flynn
8b93d51212 LibUnicode: Parse Unicode CLDR currencies and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
297db925fc LibUnicode: Extract cldr-numbers dataset from CLDR database
This dataset holds the values needed to handle DisplayNames.prototype.of
with a type option of "currency".
2021-08-27 12:32:24 +01:00
Timothy Flynn
0f02def3c2 LibUnicode: Parse Unicode CLDR scripts and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
ab7a1dd89e LibUnicode: Parse Unicode CLDR languages and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
6719e5cb17 LibUnicode: Generate locale subtag data as multiple smaller tables
This commit is preemptive to upcoming commits which add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.

Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the size of the generated file reduces by about 1.5MB.
2021-08-27 12:32:24 +01:00
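To illustrate the idea of several small Array-based tables with index-based lookups, here is a hedged sketch. The enum names, the table names, and the two-locale sample data are all hypothetical and hand-written; the real generated file is far larger and may be laid out differently.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical enumerations; their values double as table indices.
enum class Locale : std::size_t { En, De, Count };
enum class Territory : std::size_t { FR, US, Count };

using TerritoryTable = std::array<std::string_view, static_cast<std::size_t>(Territory::Count)>;

// One small table per locale, indexed by Territory.
static constexpr TerritoryTable s_territories_en { "France", "United States" };
static constexpr TerritoryTable s_territories_de { "Frankreich", "Vereinigte Staaten" };

// A table of tables, indexed by Locale.
static constexpr std::array<TerritoryTable const*, static_cast<std::size_t>(Locale::Count)> s_territories {
    &s_territories_en,
    &s_territories_de,
};

// Two array index operations, no hashing. Callers must not pass the Count sentinels.
constexpr std::string_view territory_name(Locale locale, Territory territory)
{
    return (*s_territories[static_cast<std::size_t>(locale)])[static_cast<std::size_t>(territory)];
}
```

Because the enumerator values are the indices, a lookup needs no string comparison at runtime.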
Timothy Flynn
b8ad4d302e LibUnicode: Move Locale enumeration from generated UCD data to CLDR data
The UCD set of data contained a very small subset of all locales just to
handle some special casing rules. This enumeration will be needed within
the CLDR generator as well. So rather than duplicate the enum, remove it
from the UCD generator in favor of the full list of locales known by the
CLDR generator.
2021-08-27 12:32:24 +01:00
Timothy Flynn
a57615c2b4 Meta: Ensure cmake fails if we are unable to unzip the CLDR database 2021-08-26 23:40:23 +02:00
Timothy Flynn
137e98cb6f LibUnicode: Add public accessors to generated locale data 2021-08-26 22:04:09 +01:00
Timothy Flynn
b7a95cba65 LibUnicode: Implement grammar validators for Unicode TR-35
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35

This commit adds validators for that grammar, as well as other helpers,
e.g. to canonicalize a locale string.
2021-08-26 22:04:09 +01:00
Timothy Flynn
ea21573ed8 LibUnicode: Download Unicode's CLDR database and generate locale data
The Unicode standard publishes a database known as the Common Locale
Data Repository (CLDR). This is a massive set of data from which anyone
implementing Unicode's Technical Standard #35 may generate their
implementation: https://www.unicode.org/reports/tr35/

This commit updates LibUnicode to download the compressed database and
extract a small subset. That subset is used to generate a list of
available locales and the territories (AKA regions) associated with each
locale.
2021-08-26 22:04:09 +01:00
Timothy Flynn
a98d3a1a85 LibUnicode: Download and parse DerivedNormalizationProps UCD file
This file contains the last properties that LibUnicode is not parsing.
Much of the data in this file is not currently used; that is left as a
FIXME for when String.prototype.normalize is implemented. Until then,
only the code point properties are utilized for regular expression
pattern escapes.
2021-08-11 13:11:01 +02:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtension file does not contain these scripts. Rather, it
is implied the code points which have these scripts as an extension are
the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explicitly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
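The two conditions listed above can be sketched as a small filter. The container types and the function name derive_script_extension are assumptions for illustration; the generator's actual data structures are not shown in this log.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical inputs: primary script per code point, and the explicit
// script extension lists as they would be read from the UCD file.
using PrimaryScripts = std::map<std::uint32_t, std::string>;
using ExplicitExtensions = std::map<std::uint32_t, std::set<std::string>>;

// Collects the code points that implicitly carry `script` (Common or Inherited)
// as a script extension.
std::set<std::uint32_t> derive_script_extension(std::string const& script,
                                                PrimaryScripts const& primary_scripts,
                                                ExplicitExtensions const& explicit_extensions)
{
    std::set<std::uint32_t> result;
    for (auto const& [code_point, primary] : primary_scripts) {
        // Condition 1: the primary script must be the one being derived.
        if (primary != script)
            continue;
        // Condition 2: no other script may be listed as an extension.
        auto it = explicit_extensions.find(code_point);
        if (it == explicit_extensions.end() || it->second.empty())
            result.insert(code_point);
    }
    return result;
}
```

Calling it once with "Common" and once with "Inherited" would yield the two manually formed extension sets.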
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extensions do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
Timothy Flynn
e6e462249f LibUnicode: Generate *_from_string methods using a hash map
Rather than a long series of string comparisons, generate each of these
methods using a hash map of the enumeration name to its value.
2021-08-11 13:11:01 +02:00
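A hedged sketch of what a hash-map-backed *_from_string method could look like, using std::unordered_map in place of AK's HashMap. The Script enum and its three values are a hypothetical sample, not the generated enumeration.

```cpp
#include <optional>
#include <string_view>
#include <unordered_map>

// Hypothetical sample of a generated enumeration.
enum class Script { Latin, Greek, Cyrillic };

std::optional<Script> script_from_string(std::string_view name)
{
    // Built once on first use; maps each enumeration name to its value.
    static std::unordered_map<std::string_view, Script> const s_scripts {
        { "Latin", Script::Latin },
        { "Greek", Script::Greek },
        { "Cyrillic", Script::Cyrillic },
    };

    auto it = s_scripts.find(name);
    if (it == s_scripts.end())
        return std::nullopt;
    return it->second;
}
```

A single hash lookup replaces the chain of string comparisons, which matters more as the enumeration grows.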
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done on each code point
range in that category's table to check if the code point has that category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
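The per-category lookup described above can be sketched as a binary search over sorted, non-overlapping code point ranges. The CodePointRange struct, the function name, and the three-entry sample table (a small excerpt of decimal digit ranges) are illustrative assumptions, not the generated output.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last; // Inclusive.
};

// Hand-written sample table, sorted by code point: ASCII digits,
// Arabic-Indic digits, and Extended Arabic-Indic digits.
static constexpr std::array<CodePointRange, 3> s_decimal_number_sample { {
    { 0x0030, 0x0039 },
    { 0x0660, 0x0669 },
    { 0x06F0, 0x06F9 },
} };

template<std::size_t N>
bool code_point_in_table(std::array<CodePointRange, N> const& table, std::uint32_t code_point)
{
    // Find the first range that ends at or after the code point...
    auto it = std::lower_bound(table.begin(), table.end(), code_point,
        [](CodePointRange const& range, std::uint32_t value) { return range.last < value; });
    // ...and check that the range also starts at or before it.
    return it != table.end() && it->first <= code_point;
}
```

The search is O(log n) in the number of ranges for one category, which is typically much smaller than the number of code points.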
Timothy Flynn
4e546cee97 LibUnicode: Remove WordBreakProperty from generated Unicode data
This was originally used for the "is_final_code_point" algorithm in
LibUnicode/CharacterTypes.cpp. However, it has since been superseded by
DerivedCoreProperties and is now unused. Remove it as it is currently a
waste of time to process the data, and is trivial to add back if we need
it again.
2021-08-11 13:11:01 +02:00
Timothy Flynn
6f2640d031 LibUnicode: Parse UCD DerivedBinaryProperties.txt and generate property 2021-08-04 13:50:32 +01:00
Timothy Flynn
9113f892a7 LibUnicode: Parse UCD emoji-data.txt and generate Unicode property 2021-08-04 13:50:32 +01:00
Timothy Flynn
5edd458420 LibUnicode: Parse UCD ScriptExtensions.txt and generate property 2021-08-04 13:50:32 +01:00
Timothy Flynn
6bdb19fe21 LibUnicode: Remove unused parameter from Unicode data generator 2021-08-04 13:50:32 +01:00
Timothy Flynn
f5c1bbc00b LibUnicode: Parse UCD Scripts.txt and generate as a Unicode property
There are a couple of minor nuances with parsing script values, compared
to other properties. In Scripts.txt, the UCD file lists the full name of
each script; other properties, like General Category, list the shorter
name in their primary files. This means that the aliases listed in
PropertyValueAliases.txt are reversed for script values.
2021-08-04 13:50:32 +01:00
Timothy Flynn
1bb6404a19 LibUnicode: Invoke Unicode data generator a single time
It takes a non-negligible amount of time to parse all of the UCD files and
generate the Unicode data files. To help compile times, only invoke the
generator once.
2021-08-04 11:18:24 +02:00
Timothy Flynn
9413c3a0d1 LibUnicode: Generate a map of code points to their Unicode table index
The current strategy of searching for a code point within the generated
table is slow for code points > U+0377 (the last code point whose index
is the same value as the code point itself). For larger code points, we
are doing a linear search through the table.

Instead, generate a HashMap of each code point to its entry in the table
for faster runtime lookups.

This had the added benefit of being able to remove a fair amount of code
from the generator. We no longer need to track that last contiguous code
point (U+0377) nor each code point's index in the generated table.
2021-08-04 11:18:24 +02:00
Timothy Flynn
5de6d3dd90 LibUnicode: Add public methods to compare and lookup General Categories
Adds methods to retrieve a General Category from a string and to check
if a code point matches a General Category.
2021-08-02 21:02:09 +04:30
Timothy Flynn
f63287cd63 LibUnicode: Initialize manually created Unicode properties inline
Using initializer lists directly in the UnicodeData struct definition
feels a bit cleaner than invoking HashMap::set in main().
2021-08-02 21:02:09 +04:30
Timothy Flynn
16e86ae743 LibUnicode: Generate General Category unions and aliases
This downloads the PropertyValueAliases.txt UCD file, which contains a
set of General Category aliases.

This changes the General Category enumeration to now be generated as a
bitmask. This is to easily allow General Category unions. For example,
the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt
categories.
2021-08-02 21:02:09 +04:30
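As a rough illustration of why a bitmask makes unions easy, consider the sketch below. The enumerator bit positions and the has_category helper are hypothetical; only the LC = Ll | Lu | Lt relationship comes from the message above.

```cpp
#include <cstdint>

// Hypothetical subset of the enumeration, with one bit per category.
enum class GeneralCategory : std::uint32_t {
    Lu = 1u << 0,
    Ll = 1u << 1,
    Lt = 1u << 2,
    Nd = 1u << 3,

    // A union is simply the bitwise OR of its member categories.
    LC = Lu | Ll | Lt,
};

// True if the category assigned to a code point is covered by the queried
// category, which may be a single category or a union.
constexpr bool has_category(GeneralCategory assigned, GeneralCategory query)
{
    return (static_cast<std::uint32_t>(assigned) & static_cast<std::uint32_t>(query)) != 0;
}
```

With this layout, matching against a union costs the same single AND as matching against one category.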
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00
Timothy Flynn
f1809db994 LibUnicode: Add public methods to compare and lookup Unicode properties
Adds methods to retrieve a Unicode property from a string and to check
if a code point matches a Unicode property.

Also adds a <LibUnicode/Forward.h> header.
2021-07-30 21:26:31 +01:00
Timothy Flynn
3f80791ed5 LibUnicode: Manually assign special code point properties
The Unicode standard defines a few extra properties that are not defined
in any UCD file, so we must assign them manually.
2021-07-30 21:26:31 +01:00
Timothy Flynn
bba3152104 LibUnicode: Parse and generate PropertyAliases
These are all used for Unicode property escapes.
2021-07-30 21:26:31 +01:00
Timothy Flynn
761c16d873 LibUnicode: Parse and utilize DerivedCoreProperties
DerivedCoreProperties are pseudo-properties that are the union of other
categories and properties. For example, the derived property Math is the
union of the general category Sm and the property Other_Math.

Parsing these is necessary for implementing Unicode property escapes.
But it also has the added benefit that LibUnicode now does not need to
derive some of these properties at runtime.
2021-07-30 21:26:31 +01:00
Timothy Flynn
4eb4b06688 LibUnicode: Do not replace underscores in property names
Originally, this was done to make the generated enums look more like the
rest of Serenity's enums. But for Unicode property escapes, LibUnicode
will need to compare property names from a RegExp.prototype object to
these parsed property names, which will be easier without this
modification.
2021-07-30 21:26:31 +01:00
Timothy Flynn
5d09a00189 LibUnicode: Generate PropList enumeration as a bitmask
Rather than generating the PropList as a list of enums, generate it as
a bitmask. Not only will this be better for runtime property searching,
this will allow parsing of the DerivedCoreProperties list more easily.
2021-07-30 21:26:31 +01:00
Andrew Kaster
38707f4a20 LibUnicode: Make unicode data generation logic more relocatable
The previous logic had several checks for Lagom directories and
subdirectories. All we really want to do for these header checks is make
sure that the files end up in an included folder prefixed with
LibUnicode. We also don't need to hard code the path to the generator,
the $<TARGET_FILES> generator expression can create the path for us.
2021-07-29 21:46:25 +01:00
Timothy Flynn
c4bfda7f7f LibUnicode: Handle code points that are both cased and case-ignorable
Apparently, some code points fit both categories, for example U+0345
(COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if
a code point is a final code point in a string.
2021-07-28 23:42:29 +02:00
Timothy Flynn
dff156b7c6 LibUnicode: Reduce Unicode data generator boilerplate
There's a fair amount of boilerplate when e.g. adding a new UCD file to
parse or a new enumeration to generate. Reduce the overhead by adding
helper lambdas. Also adds a couple missing spec links with UCD field
information.
2021-07-28 23:42:29 +02:00
Timothy Flynn
7827aede6f LibUnicode: Check word break when deciding on case-ignorable code points 2021-07-28 23:42:29 +02:00
Timothy Flynn
12fb3ae033 LibUnicode: Download and parse the word break property list UCD file
Note that unlike the main property list, each code point has only one
word break property. Code points that do not have a word break property
are to be assigned the property "Other".
2021-07-28 23:42:29 +02:00
Timothy Flynn
c45a014645 LibUnicode: Check property list when deciding if a code point is cased 2021-07-28 23:42:29 +02:00