Commit graph

18 commits

Author SHA1 Message Date
Timothy Flynn
f1809db994 LibUnicode: Add public methods to compare and lookup Unicode properties
Adds methods to retrieve a Unicode property from a string and to check
if a code point matches a Unicode property.

Also adds a <LibUnicode/Forward.h> header.
2021-07-30 21:26:31 +01:00
Timothy Flynn
3f80791ed5 LibUnicode: Manually assign special code point properties
The Unicode standard defines a few extra properties that are not defined
in any UCD file, so we must assign them manually.
2021-07-30 21:26:31 +01:00
Timothy Flynn
bba3152104 LibUnicode: Parse and generate PropertyAliases
These are all used for Unicode property escapes.
2021-07-30 21:26:31 +01:00
Timothy Flynn
761c16d873 LibUnicode: Parse and utilize DerivedCoreProperties
DerivedCoreProperties are pseudo-properties that are the union of other
categories and properties. For example, the derived property Math is the
union of the general category Sm and the property Other_Math.

Parsing these is necessary for implementing Unicode property escapes.
But it also has the added benefit that LibUnicode now does not need to
derive some of these properties at runtime.
2021-07-30 21:26:31 +01:00
Timothy Flynn
4eb4b06688 LibUnicode: Do not replace underscores in property names
Originally, this was done to make the generated enums look more like the
rest of Serenity's enums. But for Unicode property escapes, LibUnicode
will need to compare property names from a RegExp.prototype object to
these parsed property names, which will be easier without this
modification.
2021-07-30 21:26:31 +01:00
Timothy Flynn
5d09a00189 LibUnicode: Generate PropList enumeration as a bitmask
Rather than generating the PropList as a list of enums, generate it as
a bitmask. Not only will this be better for runtime property searching,
this will allow parsing of the DerivedCoreProperties list more easily.
2021-07-30 21:26:31 +01:00
Andrew Kaster
38707f4a20 LibUnicode: Make unicode data generation logic more relocatable
The previous logic had several checks for Lagom directories and
subdirectories. All we really want to do for these header checks is make
sure that the files end up in an included folder prefixed with
LibUnicode. We also don't need to hard code the path to the generator,
the $<TARGET_FILES> generator expression can create the path for us.
2021-07-29 21:46:25 +01:00
Timothy Flynn
c4bfda7f7f LibUnicode: Handle code points that are both cased and case-ignorable
Apparently, some code points fit both categories, for example U+0345
(COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if
a code point is a final code point in a string.
2021-07-28 23:42:29 +02:00
Timothy Flynn
dff156b7c6 LibUnicode: Reduce Unicode data generator boilerplate
There's a fair amount of boilerplate when e.g. adding a new UCD file to
parse or a new enumeration to generate. Reduce the overhead by adding
helper lambdas. Also adds a couple missing spec links with UCD field
information.
2021-07-28 23:42:29 +02:00
Timothy Flynn
7827aede6f LibUnicode: Check word break when deciding on case-ignorable code points 2021-07-28 23:42:29 +02:00
Timothy Flynn
12fb3ae033 LibUnicode: Download and parse the word break property list UCD file
Note that unlike the main property list, each code point has only one
word break property. Code points that do not have a word break property
are to be assigned the property "Other".
2021-07-28 23:42:29 +02:00
Timothy Flynn
c45a014645 LibUnicode: Check property list when deciding if a code point is cased 2021-07-28 23:42:29 +02:00
Timothy Flynn
38adfd8874 LibUnicode: Download and parse the property list UCD file 2021-07-28 23:42:29 +02:00
Timothy Flynn
39f971e42b LibUnicode: Begin implementing special Unicode case folding
This implements unconditional special case folding, and conditional
folding for non-locale cases. Worth noting that the only conditional,
non-locale special case is for converting an uppercase sigma to
lowercase.
2021-07-27 21:04:36 +01:00
Timothy Flynn
5b110034dd LibUnicode: Produce each code point's general category
This will be needed for the Unicode Standard's Default Case Algorithm.
Generate the field as an enumeration rather than a string for easier
comparison.
2021-07-27 21:04:36 +01:00
Timothy Flynn
32ea461385 LibUnicode: Download and parse the special casing UCD file
This adds a SpecialCasing structure to the generated UnicodeData.h/cpp
files. This structure contains casing rules for code points which have
non-1-to-1 upper-to-lower case code point mappings. Further, these rules
may be limited to specific locales or other context.
2021-07-27 21:04:36 +01:00
Timothy Flynn
98d8274040 Meta: Add LibUnicode (and its tests) to Lagom
This is primarily to allow using LibUnicode within LibJS and its REPL.

Note: this seems to be the first time that a Lagom dependency requires
generated source files. For this to work, some of Lagom's CMakeLists.txt
commands needed to be re-organized to include the CMake files that fetch
and parse UnicodeData.txt. The paths required to invoke the generator
also differ depending on what is currently building (SerenityOS vs.
Lagom as part of the Serenity build vs. a standalone Lagom build).
2021-07-26 17:03:55 +01:00
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00