AK: Add a UTF-16 string with optimized short- and ASCII-string storage

This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII, then the string is stored in an inlined byte buffer.
* If created from a long UTF-8 or UTF-16 string that is also ASCII, then the string is stored in an outlined char buffer.
* If created from a short or long UTF-8 or UTF-16 string that is not ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid confusion with operations such as `length_in_code_units` and `code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes in short string form. But we still want `length_in_code_units` to be 2, and `code_unit_at(0)` to be 0xD83D.
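The storage rules above can be sketched as a small standalone classifier. This is an illustration only, not the AK implementation: `StorageKind`, `choose_storage`, `utf8_to_utf16`, and `run_examples` are hypothetical names, and the inline capacity of 7 bytes is an assumption, since the commit message does not state the actual limit. The decoder assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical names throughout; this is not the AK implementation.
enum class StorageKind {
    InlineAscii,   // short ASCII: bytes kept inside the string object
    OutlinedAscii, // long ASCII: separate buffer of 8-bit chars
    OutlinedUtf16, // any non-ASCII text: separate buffer of 16-bit code units
};

// Assumed inline capacity; the real limit is not given in the commit message.
constexpr std::size_t max_inline_bytes = 7;

constexpr bool is_ascii(std::string_view utf8)
{
    for (unsigned char byte : utf8) {
        if (byte >= 0x80)
            return false;
    }
    return true;
}

constexpr StorageKind choose_storage(std::string_view utf8)
{
    // Non-ASCII text is never inlined, regardless of its length.
    if (!is_ascii(utf8))
        return StorageKind::OutlinedUtf16;
    return utf8.size() <= max_inline_bytes ? StorageKind::InlineAscii : StorageKind::OutlinedAscii;
}

// Decodes UTF-8 into UTF-16 code units. Assumes well-formed input.
std::u16string utf8_to_utf16(std::string_view utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t code_point = 0;
        std::size_t length = 0;
        if (lead < 0x80) {
            code_point = lead;
            length = 1;
        } else if ((lead >> 5) == 0x6) {
            code_point = lead & 0x1F;
            length = 2;
        } else if ((lead >> 4) == 0xE) {
            code_point = lead & 0x0F;
            length = 3;
        } else {
            code_point = lead & 0x07;
            length = 4;
        }
        // Fold in the continuation bytes, six payload bits each.
        for (std::size_t k = 1; k < length && i + k < utf8.size(); ++k)
            code_point = (code_point << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += length;

        if (code_point >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            code_point -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (code_point >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (code_point & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(code_point));
        }
    }
    return out;
}

void run_examples()
{
    assert(choose_storage("hello") == StorageKind::InlineAscii);
    assert(choose_storage("a considerably longer ASCII string") == StorageKind::OutlinedAscii);

    // U+1F600 is 4 bytes in UTF-8 but 2 code units in UTF-16.
    std::string_view emoji = "\xF0\x9F\x98\x80";
    assert(emoji.size() == 4);
    assert(choose_storage(emoji) == StorageKind::OutlinedUtf16);

    std::u16string units = utf8_to_utf16(emoji);
    assert(units.size() == 2);
    assert(units[0] == 0xD83D);
}
```

With ASCII-only inline storage, every stored byte corresponds to exactly one UTF-16 code unit, so length and index queries need no decoding. The emoji case shows why short non-ASCII text is kept out of the inline buffer: its byte length (4) and code unit length (2) differ.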
Author: https://github.com/trflynn89
Commit: fe676585f5
Pull-request: https://github.com/LadybirdBrowser/ladybird/pull/5388
Reviewed-by: https://github.com/shannonbooth
2025-10-06 16:19:40 +00:00 · 2025-06-12 19:29:41 -04:00 · 2025-06-12 19:29:41 -04:00 · fe676585f5 · 2025-07-18 16:47:31 +00:00
commit fe676585f5
parent 8fbb80fffc
17 changed files with 1527 additions and 44 deletions
--- a/Libraries/LibUnicode/Segmenter.cpp
+++ b/Libraries/LibUnicode/Segmenter.cpp
@@ -75,7 +75,12 @@ public:

    virtual void set_segmented_text(Utf16View const& text) override
    {
-        m_segmented_text = icu::UnicodeString { text.span().data(), static_cast<i32>(text.length_in_code_units()) };
+        if (text.has_ascii_storage()) {
+            set_segmented_text(MUST(text.to_utf8()));
+            return;
+        }
+
+        m_segmented_text = icu::UnicodeString { text.utf16_span().data(), static_cast<i32>(text.length_in_code_units()) };
        m_segmenter->setText(m_segmented_text.get<icu::UnicodeString>());
    }