For SerenityOS, we parse emoji metadata from the UCD to learn emoji
groups, subgroups, names, etc. We used this information only in the
emoji picker dialog. It is entirely unused within Ladybird.
This removes our dependence on the UCD emoji file, as we no longer
need any of its information. All we need to know is the file path to
our custom emoji, which we get from Meta/emoji-file-list.txt.
There are a couple of differences here due to using ICU:
1. Titlecasing behaves slightly differently. We previously transformed
"123dollars" to "123Dollars", as we would use word segmentation to
split a string into words, then transform the first cased character
to titlecase. ICU doesn't go quite that far, and leaves the string
as "123dollars". While this is a behavior change, the only user of
this API is the `text-transform: capitalize;` CSS rule, and we now
match the behavior of other browsers.
2. There isn't an API to compare strings with case insensitivity without
allocating case-folded strings for both the left- and right-hand-side
strings. Our implementation was previously allocation-free; however,
in a benchmark, ICU is still ~1.4x faster.
These allow us to binary search the code point compositions based on
the first code point being combined, which makes the search close to
O(log N) instead of O(N).
And add a verification step to the emoji data generator to ensure all
emoji are listed in this file. This file will be used as a sources list
in both the CMake and GN build systems.
It is probably possible to generate this list. But in a first attempt,
the CMake code to set the file as a dependency of a pseudo target, which
would then parse the file and install the listed emoji was getting quite
verbose and complicated. So for now, let's just maintain this list.
The former automatically adapts the prefix to binary and octal
output, and is what we already use in the majority of cases.
Patch generated by:
rg -l '0x\{' | xargs sed -i '' -e 's/0x{:/{:#/'
I ran it 4 times (until it stopped changing things) since each
invocation only converted one instance per line.
No behavior change.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
https://unicode.org/versions/Unicode15.1.0/
This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
These were used to generate specialized tables. Now that those tables
have been migrated to general 2-stage lookup tables, these fields are
all unused.
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.
In total, this change:
* Does not change the size of libunicode.so (which is nice because,
generally, the 2-stage lookup tables are expected to trade a bit
of size for performance).
* Reduces the runtime of the new benchmark test case added here from
1.383s to 1.127s (about an 18.5% improvement).
There is no functional change here. This information will compose the
upcoming multistage casing tables in an upcoming patch. Extract it to
its own struct to prepare for that.
There is no functional change here. This just adjusts the changes made
in commit 0652cc4 to be a bit more generic for code point casing tables.
We currently only generate property tables, which boil down to a vector
of booleans. Casing tables will be a struct of varying types, so this
generalizes some of the generator to prepare for that ahead of time, to
make the upcoming casing patch smaller / easier to grok.
When generating code point property tables, we currently binary search
the code point range lists for each property to decide if a code point
has that property. However, we are both iterating over the code points
and through the sorted properties in order. This means we do not need
to search code point ranges that are below the current code point at
all. We can even remove the code point ranges that fall below the
current code point, as we will not see a code point in those ranges
again.
On my machine, this reduces the run time of GenerateUnicodeData from
3.4 seconds to 1.2 seconds.
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.
This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.
In total, this change:
* Increases the size of libunicode.so from 2.7 MB to 2.9 MB.
* Reduces the runtime of the new benchmark test case added here from
3.576s to 1.020s (a 3.5x speedup).
* In a profile of resizing a TextEditor window with a 3MB file open,
the runtime of checking if a code point has a word break property
reduces from ~81% to ~56%.
The next commit will need a type from LibUnicode/CharacterTypes.h. To
avoid conflicts between that header's CodePointRange and the one that is
defined in the code generator, just use the public definition.
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.
Note we *have* always used the block name data from that commit, and
that is still present here.
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit of how much data they read or write (apart
from "none at all").
Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure to
use the entire buffer (which should be the go-to function for most
usages).
No functional changes, just a lot of new FIXMEs.
This also removes DirIterator::error_string(), since the same strerror()
string will be included when you print the Error itself. Except in `ls`
which is still using fprintf() for now.
This sorts the array of generated emoji data by code point (first by
code point length, then by code point value). This lets us use a binary
search to find emoji data, rather than the current linear search.
In a profile of scrolling around /home/anon/Documents/emoji.txt, this
reduces the runtime of Gfx::Emoji::emoji_for_code_points from 69.03% to
28.42%. Within that, Unicode::find_emoji_for_code_points reduces from
28.42% to just 1.95%.
Similar to the FontDatabase, this will be needed for Ladybird to find
emoji images. We now generate just the file name of emoji image in
LibUnicode, and look for that file in the specified path (defaulting to
/res/emoji) at runtime.