Commit graph

10 commits

Author SHA1 Message Date
Timothy Flynn
298ec6a12a AK: Ensure StringBuilder encodes U+10000 as 2 UTF-16 code units 2025-08-07 02:05:50 +02:00
Timothy Flynn
1b611fba67 AK: Ensure Utf16FlyString is hash-compatible with Utf16View/Utf16String 2025-08-07 02:05:50 +02:00
Timothy Flynn
2dc0a3b3ce AK: Add trim methods to Utf16String that skip allocation when not needed
If the string does not begin with any of the provided code units, we do
not need to create a new string.
2025-08-05 15:13:36 +02:00
Timothy Flynn
0bf565b97f AK: Allow comparing UTF-16 strings to UTF-8 strings
Before now, you could compare a Utf16View to a StringView, but it would
only be valid if the StringView were ASCII. When porting code to UTF-16,
it will be handy to have a code point-aware implementation for non-ASCII
StringViews.
2025-08-05 07:07:15 -04:00
Timothy Flynn
13ed6aba71 AK+LibIPC: Implement an encoder/decoder for UTF-16 strings 2025-08-02 10:10:14 -07:00
Timothy Flynn
a740bfd8ff AK+LibUnicode: Implement Unicode-aware UTF-16 case transformations 2025-07-25 18:16:22 +02:00
Timothy Flynn
df77ae1920 AK: Implement creating a UTF-16 string from a repeated code point 2025-07-25 18:16:22 +02:00
Timothy Flynn
f53389bab1 AK: Add a couple of Utf16String factories
* Utf16String::from_utf8_with_replacement_character
* Utf16String::from_code_point
2025-07-24 19:00:20 +02:00
Timothy Flynn
2803d66d87 AK: Support UTF-16 string formatting
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.

In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.

Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.

In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00