Commit graph

300 commits

Author SHA1 Message Date
Timothy Flynn
aa3a30870b LibUnicode: Replace code point bidirectional classes with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
e77dafc987 LibUnicode: Replace code point scripts and script extensions with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
986ff984cc LibUnicode: Replace code point general categories with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
c804bda5fd LibUnicode: Replace code point properties with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
ab56b8c8dc LibUnicode: Remove the locale-unaware text segmentation implementation 2024-06-20 13:46:54 +02:00
Timothy Flynn
5cf818e305 LibUnicode: Replace case transformations and comparison with ICUs
There are a couple of differences here due to using ICU:

1. Titlecasing behaves slightly differently. We previously transformed
   "123dollars" to "123Dollars", as we would use word segmentation to
   split a string into words, then transform the first cased character
   to titlecase. ICU doesn't go quite that far, and leaves the string
   as "123dollars". While this is a behavior change, the only user of
   this API is the `text-transform: capitalize;` CSS rule, and we now
   match the behavior of other browsers.

2. There isn't an API to compare strings with case insensitivity without
   allocating case-folded strings for both the left- and right-hand-side
   strings. Our implementation was previously allocation-free; however,
   in a benchmark, ICU is still ~1.4x faster.
2024-06-20 10:59:55 +02:00
Timothy Flynn
8d7216f4e0 LibUnicode: Replace IDNA ASCII conversion with ICU 2024-06-18 21:07:56 +02:00
Timothy Flynn
83475c5380 LibUnicode: Replace Unicode string normalization with ICU
In a benchmark, ICU's implementation was over 3x faster than ours.
2024-06-18 21:07:56 +02:00
Timothy Flynn
1feef17bf7 LibUnicode: Remove completely unused code point name & block name data
These were used for e.g. the Character Map on Serenity, but are not used
at all for Ladybird.
2024-06-18 21:07:56 +02:00
Timothy Flynn
4de8adabac LibLocale: Replace available locale lookups with ICU 2024-06-16 06:57:08 +02:00
Dan Klishch
b8c3e75573 Meta+Userland: Fix more instances of bad lambda-Variant interaction
These don't cause compilation to fail but they still crash crashd.
2024-04-18 13:14:33 -06:00
Idan Horowitz
945c58c7c1 LibUnicode: Generate and use code point composition mappings
These allow us to binary search the code point compositions based on
the first code point being combined, which makes the search close to
O(log N) instead of O(N).
2024-04-06 14:21:04 -04:00
Timothy Flynn
683c08744a Userland: Avoid some conversions from rvalue strings to StringView
These are all actually fine, there is no UAF here. But once e.g.
`ByteString::view() &&` is deleted, these instances won't compile.
2024-04-04 11:23:21 +02:00
Timothy Flynn
91cd43a7ac Meta: Add a file containing a list of all emoji file names
And add a verification step to the emoji data generator to ensure all
emoji are listed in this file. This file will be used as a sources list
in both the CMake and GN build systems.

It is probably possible to generate this list. But in a first attempt,
the CMake code to set the file as a dependency of a pseudo target, which
would then parse the file and install the listed emoji was getting quite
verbose and complicated. So for now, let's just maintain this list.
2024-03-23 17:26:31 -04:00
Nico Weber
24a469f521 Everywhere: Prefer {:#x} over 0x{:x} in format strings
The former automatically adapts the prefix to binary and octal
output, and is what we already use in the majority of cases.

Patch generated by:

    rg -l '0x\{' | xargs sed -i '' -e 's/0x{:/{:#/'

I ran it 4 times (until it stopped changing things) since each
invocation only converted one instance per line.

No behavior change.
2024-02-21 17:54:38 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
43e9dc0500 LibUnicode: Use weak symbols to provide default IDNA defintions
Rather than using #ifdef blocks, update the fallback IDNA definitions to
use weak symbols to match the rest of LibUnicode / LibLocale.
2023-12-10 10:19:14 -05:00
Simon Wanner
7d9fe44039 LibUnicode: Download and parse IDNA data 2023-12-10 08:04:58 -05:00
Tim Schumacher
a2f60911fe AK: Rename GenericTraits to DefaultTraits
This feels like a more fitting name for something that provides the
default values for Traits.
2023-11-09 10:05:51 -05:00
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
2023-09-15 18:30:26 +02:00
Andreas Kling
8b936b5912 AK: Make SourceGenerator::set() infallible 2023-08-22 13:08:24 +02:00
Sam Atkins
0d021a63c7 LibUnicode: Generate data for bidirectional character types
This will let us examine code points to determine the rtl/ltr direction
of a piece of text.
2023-08-20 16:21:35 -04:00
Lucas CHOLLET
3f35ffb648 Userland: Prefer _string over _short_string
As `_string` can't fail anymore (since 3434412), there are no real
benefits to use the short variant in most cases.
2023-08-08 07:37:21 +02:00
Timothy Flynn
b91af3c6a0 LibUnicode: Remove a few generator tracking fields that are now unused
These were used to generate specialized tables. Now that those tables
have been migrated to general 2-stage lookup tables, these fields are
all unused.
2023-07-28 05:28:50 +02:00
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
Timothy Flynn
0ee133af90 LibUnicode: Separate code point case information into its own structure
There is no functional change here. This information will compose the
upcoming multistage casing tables in an upcoming patch. Extract it to
its own struct to prepare for that.
2023-07-28 05:28:50 +02:00
Timothy Flynn
a332a8ad19 LibUnicode: Prepare Unicode data generator for multistage casing tables
There is no functional change here. This just adjusts the changes made
in commit 0652cc4 to be a bit more generic for code point casing tables.
We currently only generate property tables, which boil down to a vector
of booleans. Casing tables will be a struct of varying types, so this
generalizes some of the generator to prepare for that ahead of time, to
make the upcoming casing patch smaller / easier to grok.
2023-07-28 05:28:50 +02:00
Timothy Flynn
3fae92eea2 LibUnicode: Search code point properties sequentially at compile time
When generating code point property tables, we currently binary search
the code point range lists for each property to decide if a code point
has that property. However, we are both iterating over the code points
and through the sorted properties in order. This means we do not need
to search code point ranges that are below the current code point at
all. We can even remove the code point ranges that fall below the
current code point, as we will not see a code point in those ranges
again.

On my machine, this reduces the run time of GenerateUnicodeData from
3.4 seconds to 1.2 seconds.
2023-07-28 05:28:50 +02:00
Timothy Flynn
0652cc48c0 LibUnicode: Perform code point property lookups in constant time
We currently produce a single table for all categories of code point
properties (GeneralCategory, Script, etc.). Each row contains a field
indicating the range of code points to which that property applies. At
runtime, we then do a binary search through that table to decide if a
code point has a property.

This changes our approach to generate a 2-stage lookup table for each of
those categories. There is an in-depth explanation of these tables above
the new `create_code_point_tables` method. The end effect is that code
point property lookup is reduced from a binary search to constant-time
array lookups.

In total, this change:

    * Increases the size of libunicode.so from 2.7 MB to 2.9 MB.

    * Reduces the runtime of the new benchmark test case added here from
      3.576s to 1.020s (a 3.5x speedup).

    * In a profile of resizing a TextEditor window with a 3MB file open,
      the runtime of checking if a code point has a word break property
      reduces from ~81% to ~56%.
2023-07-26 08:36:20 +02:00
Timothy Flynn
8f1d73abde LibUnicode: Use the public CodePointRange in the code generator
The next commit will need a type from LibUnicode/CharacterTypes.h. To
avoid conflicts between that header's CodePointRange and the one that is
defined in the code generator, just use the public definition.
2023-07-26 08:36:20 +02:00
Timothy Flynn
cb128dcf75 LibUnicode: Move the CodePointRangeComparator struct to a public header
Move it out of the generated code so that it may be used by the code
generator itself.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Ben Wiederhake
5cfa883b9f LibUnicode: Explicitly mark HashMap copy 2023-05-19 22:33:57 +02:00
Lucas CHOLLET
8c34959b53 AK: Add the Input word to input-only buffered streams
This concerns both `BufferedSeekable` and `BufferedFile`.
2023-05-09 11:18:46 +02:00
Cameron Youell
1d24f394c6 Everywhere: Use LibFileSystem where trivial 2023-03-21 19:03:21 +00:00
Sam Atkins
b18c1c7291 LibUnicode: Remove now-unused dir-iterator helper functions 2023-03-15 12:49:33 -04:00
Sam Atkins
8a8ad81aa1 LibUnicode: Migrate GenerateEmojiData to Directory::for_each_entry() 2023-03-15 12:49:33 -04:00
Sam Atkins
8672b380f6 LibUnicode: Read emoji file title from LexicalPath directly
... rather than taking the whole file name, and then manually trimming
the extension off.
2023-03-15 12:49:33 -04:00
gustrb
5141c86587 AK: Rename CaseInsensitiveStringViewTraits to reflect intent
Now it is called `CaseInsensitiveASCIIStringViewTraits`, so we can be
more specific about what data structure does it operate onto. ;)
2023-03-14 21:34:32 +00:00
Tim Schumacher
8032724574 CodeGenerators: Ensure that we always print the entire generated output 2023-03-13 15:16:20 +00:00
Tim Schumacher
d5871f5717 AK: Rename Stream::{read,write} to Stream::{read_some,write_some}
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit of how much data they read or write (apart
from "none at all").

Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure to
use the entire buffer (which should be the go-to function for most
usages).

No functional changes, just a lot of new FIXMEs.
2023-03-13 15:16:20 +00:00
Sam Atkins
774f328783 LibCore+Everywhere: Return an Error from DirIterator::error()
This also removes DirIterator::error_string(), since the same strerror()
string will be included when you print the Error itself. Except in `ls`
which is still using fprintf() for now.
2023-03-05 20:23:42 +01:00
Timothy Flynn
ca2b030336 LibUnicode: Use binary search for lookups into the generated emoji data
This sorts the array of generated emoji data by code point (first by
code point length, then by code point value). This lets us use a binary
search to find emoji data, rather than the current linear search.

In a profile of scrolling around /home/anon/Documents/emoji.txt, this
reduces the runtime of Gfx::Emoji::emoji_for_code_points from 69.03% to
28.42%. Within that, Unicode::find_emoji_for_code_points reduces from
28.42% to just 1.95%.
2023-03-05 16:44:20 +01:00
Timothy Flynn
03f32bdf86 LibUnicode: Validate that all emoji images in /res/emoji actually exist
This will raise a compile error if an emoji image was neglected to be
added to e.g. emoji-serenity.txt, or if the code points are not correct.
2023-03-03 17:09:58 +00:00
Timothy Flynn
fd1fbad1d2 LibGfx+LibUnicode: Support specifying the path to search for emoji
Similar to the FontDatabase, this will be needed for Ladybird to find
emoji images. We now generate just the file name of emoji image in
LibUnicode, and look for that file in the specified path (defaulting to
/res/emoji) at runtime.
2023-03-01 14:54:16 +00:00
MacDue
01fa3bb788 LibUnicode: Propagate try_append() errors when building emoji data 2023-02-24 22:18:25 +01:00
Timothy Flynn
8c38d46c1a LibUnicode: Generate the path to emoji images alongside emoji data
This will provide for quicker emoji lookups, rather than having to
discover and allocate these paths at runtime before we find out if they
even exist.
2023-02-24 19:48:47 +01:00
Tim Schumacher
874c7bba28 LibCore: Remove Stream.h 2023-02-13 00:50:07 +00:00
Tim Schumacher
606a3982f3 LibCore: Move Stream-based file into the Core namespace 2023-02-13 00:50:07 +00:00
Tim Schumacher
d43a7eae54 LibCore: Rename File to DeprecatedFile
As usual, this removes many unused includes and moves used includes
further down the chain.
2023-02-13 00:50:07 +00:00