Commit graph

5 commits

Author SHA1 Message Date
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
mikiubo
8ec72d6906 LibUnicode: Avoid rejecting end-of-text position as a valid boundary
When the cursor was positioned at the end of text,
attempting to move it left(using the left arrow key)
would fail because align_boundary() was rejecting
the end-of-text position as a valid boundary.
2025-04-11 15:30:17 -04:00
Timothy Flynn
e6b7c8cde2 LibUnicode: Consistently reject out-of-bounds segmenter indices
In the UTF-8 implementation, this prevents out-of-bounds access of the
underlying text data, as the ICU macro would essentially do something
akin to `text[text.length()]`.

The UTF-16 implementation already checks for out-of-bounds, but would
previously return 0. We now return an empty Optional in both impls. This
doesn't affect LibJS (the user of the UTF-16 impl), as it already does
bounds checking before invoking LibUnicode APIs.
2025-01-16 23:22:48 +01:00
Timothy Flynn
93712b24bf Everywhere: Hoist the Libraries folder to the top-level 2024-11-10 12:50:45 +01:00
Renamed from Userland/Libraries/LibUnicode/Segmenter.cpp (Browse further)