Commit graph

65 commits

Author SHA1 Message Date
Timothy Flynn
d13142f015 LibJS+LibUnicode: Store parsed Unicode locale data as full strings
Originally, it was convenient to store the parsed Unicode locale data as
views into the original string being parsed. But implementing locale
aliases will require mutating the data that was parsed. To prepare for
that, store the parsed data as proper strings.
2021-09-01 14:14:47 +01:00
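To illustrate the motivation given in d13142f015, here is a minimal sketch of why owned strings make alias replacement possible. It uses the C++ standard library rather than the AK types, and `LanguageIDOwned`, `parse_language_owned` and `apply_alias` are hypothetical names that do not come from the commit.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for the parsed data; not the LibUnicode type.
// A std::string_view member could only point into the caller's input,
// whereas a std::string member owns its characters and can be rewritten.
struct LanguageIDOwned {
    std::string language;
};

// Copies the first '-'-separated segment out of the input.
inline LanguageIDOwned parse_language_owned(std::string_view input)
{
    auto end = input.find('-');
    return { std::string(input.substr(0, end)) };
}

// Replaces the language subtag when it matches `from`.
// The alias pair is supplied by the caller; nothing is looked up here.
inline void apply_alias(LanguageIDOwned& id, std::string_view from, std::string_view to)
{
    if (id.language == from)
        id.language = std::string(to);
}
```

With a view-based struct, `apply_alias` would have to point at storage that outlives the parsed result; owning the string avoids that lifetime question.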
Timothy Flynn
f897c2edb3 LibUnicode: Canonicalize locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
6f0cb52dc4 LibUnicode: Canonicalize locale extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
671eaa0c59 LibUnicode: Add helper lambda for appending canonicalized strings
Once canonical extensions are implemented, the number of blocks like:

    if (optional_string.has_value()) {
        builder.append('-');
        builder.append(optional_string->to_lowercase_string());
    }

will be quite large. This commit just adds a helper lambda to handle
this pattern to prevent this function from becoming even more enormous.
2021-08-30 19:42:40 +01:00
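The pattern described in 671eaa0c59 can be sketched roughly as follows. This is an assumption-laden illustration using the standard library instead of AK: `to_lowercase`, `canonicalize_sketch` and `append_segment` are hypothetical names, and the sketch lowercases every segment only because the snippet quoted in the commit message does.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in for to_lowercase_string(); handles ASCII only.
inline std::string to_lowercase(std::string value)
{
    for (auto& ch : value)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    return value;
}

// Joins the optional subtags with '-' using a single local helper lambda
// instead of repeating the has_value()/append block for each subtag.
inline std::string canonicalize_sketch(std::string const& language,
    std::optional<std::string> const& script,
    std::optional<std::string> const& region)
{
    std::string builder = to_lowercase(language);

    auto append_segment = [&](std::optional<std::string> const& segment) {
        if (!segment.has_value())
            return;
        builder += '-';
        builder += to_lowercase(*segment);
    };

    append_segment(script);
    append_segment(region);
    return builder;
}
```

Each additional optional field then costs one line at the call site rather than a four-line block.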
Timothy Flynn
30855e6663 LibUnicode: Parse locale private use extensions 2021-08-30 19:42:40 +01:00
Timothy Flynn
29f76ef7c8 LibUnicode: Parse locale extensions of the other extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
d2d304fcf8 LibUnicode: Parse locale extensions of the transformed extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
eda92d15e4 LibUnicode: Parse locale extensions of the Unicode locale extension form 2021-08-30 19:42:40 +01:00
Timothy Flynn
dd89901b07 LibUnicode: Use GenericLexer to parse Unicode language IDs
This is preparatory work to read locale extensions. The parser currently
enforces that the entire string is consumed. But to parse extensions,
parse_unicode_locale_id() will need parse_unicode_language_id() to just
stop parsing on the first segment that does not match the language ID
grammar. It will also need to know where the parsing stopped. Both of
these needs are fulfilled by GenericLexer.

The caveat is that we can no longer simply split the parsed string on
separator characters. So parse_unicode_language_id() now operates as a
small state machine.
2021-08-30 19:42:40 +01:00
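The approach described in dd89901b07 (stop at the first segment that does not fit the language ID grammar, and report where parsing stopped) might look something like the sketch below. It does not use GenericLexer; it tracks a plain offset into a `std::string_view`, handles only language, script and region (variants are omitted), and every name in it is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

struct ParsedLanguageID {
    std::string language;
    std::optional<std::string> script;
    std::optional<std::string> region;
    std::size_t stopped_at { 0 }; // offset of the first unconsumed character
};

inline bool is_run(std::string_view s, std::size_t min, std::size_t max, bool digits)
{
    if (s.size() < min || s.size() > max)
        return false;
    for (unsigned char ch : s) {
        if (digits ? !std::isdigit(ch) : !std::isalpha(ch))
            return false;
    }
    return true;
}

inline std::optional<ParsedLanguageID> parse_language_id_sketch(std::string_view input)
{
    enum class State { Language, Script, Region, Done };
    ParsedLanguageID result;
    State state = State::Language;
    std::size_t position = 0;

    while (state != State::Done) {
        // Peek at the next segment; it is only consumed if the state accepts it.
        std::size_t start = position;
        if (state != State::Language) {
            if (start >= input.size() || (input[start] != '-' && input[start] != '_'))
                break;
            ++start;
        }
        std::size_t end = input.find_first_of("-_", start);
        if (end == std::string_view::npos)
            end = input.size();
        std::string_view segment = input.substr(start, end - start);

        switch (state) {
        case State::Language:
            if (!is_run(segment, 2, 3, false) && !is_run(segment, 5, 8, false))
                return std::nullopt; // the language subtag is mandatory in this sketch
            result.language = std::string(segment);
            position = end;
            state = State::Script;
            break;
        case State::Script:
            if (is_run(segment, 4, 4, false)) {
                result.script = std::string(segment);
                position = end;
            }
            state = State::Region;
            break;
        case State::Region:
            if (is_run(segment, 2, 2, false) || is_run(segment, 3, 3, true)) {
                result.region = std::string(segment);
                position = end;
            }
            state = State::Done;
            break;
        case State::Done:
            break;
        }
    }

    result.stopped_at = position;
    return result;
}
```

A caller that wants to parse extensions can resume from `input.substr(result.stopped_at)` instead of requiring that the whole string be consumed.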
Timothy Flynn
8b93d51212 LibUnicode: Parse Unicode CLDR currencies and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
0f02def3c2 LibUnicode: Parse Unicode CLDR scripts and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
ab7a1dd89e LibUnicode: Parse Unicode CLDR languages and generate locale mappings 2021-08-27 12:32:24 +01:00
Timothy Flynn
6719e5cb17 LibUnicode: Generate locale subtag data as multiple smaller tables
This commit prepares for upcoming commits that add more subtags to
the CLDR generator. Rather than generating a giant HashMap containing
all data, generate more (smaller) Array-based tables. This mimics the
UCD generator. This also allows simpler lookups at runtime since we can
generate index-based lookups into the smaller tables rather easily.

Without this change, adding the remaining locale subtags would result
in the generation and compilation of UnicodeLocale.cpp taking about 30s
on my machine. With this change, it takes about half that. Additionally,
the size of the generated file reduces by about 1.5MB.
2021-08-27 12:32:24 +01:00
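As a rough picture of the layout 6719e5cb17 describes (small Array-based tables with index-based lookups rather than one large HashMap), consider the sketch below. The enums, table contents and function name are made up for illustration and are not the generated code.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical generated enums; their values double as table indices.
enum class Locale : std::size_t { En, De };
enum class Territory : std::size_t { US, FR };

// One small table per locale, indexed by territory. Sample data only.
constexpr std::array<std::array<std::string_view, 2>, 2> s_territory_names { {
    { "United States", "France" },
    { "Vereinigte Staaten", "Frankreich" },
} };

// Lookup is two array indexing operations; no hashing is involved.
inline std::optional<std::string_view> territory_name(Locale locale, Territory territory)
{
    auto locale_index = static_cast<std::size_t>(locale);
    auto territory_index = static_cast<std::size_t>(territory);
    if (locale_index >= s_territory_names.size())
        return std::nullopt;
    if (territory_index >= s_territory_names[locale_index].size())
        return std::nullopt;
    return s_territory_names[locale_index][territory_index];
}
```

Because the tables are `constexpr` arrays, the compiler sees plain static data instead of code that populates a map at startup.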
Timothy Flynn
137e98cb6f LibUnicode: Add public accessors to generated locale data 2021-08-26 22:04:09 +01:00
Timothy Flynn
b7a95cba65 LibUnicode: Implement grammar validators for Unicode TR-35
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35

This commit adds validators for that grammar, as well as other helpers to,
e.g., canonicalize a locale string.
2021-08-26 22:04:09 +01:00
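For a sense of what validators for the TR-35 grammar mentioned in b7a95cba65 involve, here is a small sketch covering four subtag productions. The productions in the comments follow TR-35; the function names and structure are assumptions and may not match the code in the commit.

```cpp
#include <cstddef>
#include <string_view>

inline bool is_ascii_alpha(char ch) { return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'); }
inline bool is_ascii_digit(char ch) { return ch >= '0' && ch <= '9'; }
inline bool is_ascii_alphanum(char ch) { return is_ascii_alpha(ch) || is_ascii_digit(ch); }

// True if `s` has between `min` and `max` characters, all accepted by `predicate`.
inline bool matches(std::string_view s, std::size_t min, std::size_t max, bool (*predicate)(char))
{
    if (s.size() < min || s.size() > max)
        return false;
    for (char ch : s) {
        if (!predicate(ch))
            return false;
    }
    return true;
}

// unicode_language_subtag = alpha{2,3} | alpha{5,8}
inline bool is_language_subtag(std::string_view s)
{
    return matches(s, 2, 3, is_ascii_alpha) || matches(s, 5, 8, is_ascii_alpha);
}

// unicode_script_subtag = alpha{4}
inline bool is_script_subtag(std::string_view s)
{
    return matches(s, 4, 4, is_ascii_alpha);
}

// unicode_region_subtag = alpha{2} | digit{3}
inline bool is_region_subtag(std::string_view s)
{
    return matches(s, 2, 2, is_ascii_alpha) || matches(s, 3, 3, is_ascii_digit);
}

// unicode_variant_subtag = alphanum{5,8} | digit alphanum{3}
inline bool is_variant_subtag(std::string_view s)
{
    if (matches(s, 5, 8, is_ascii_alphanum))
        return true;
    return s.size() == 4 && is_ascii_digit(s[0]) && matches(s.substr(1), 3, 3, is_ascii_alphanum);
}
```

Keeping one function per production makes each rule easy to compare against the EBNF in the report.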