66 commits

Author SHA1 Message Date
Timothy Flynn
d9502505c2 AK: Fix bounds assertions in Utf16View::iterator_offset 2025-07-28 18:30:50 +02:00
Timothy Flynn
67723ef83c AK: Add a method to peek ahead of a UTF-16 iterator 2025-07-28 18:30:50 +02:00
Timothy Flynn
21d7d236e6 AK: Add a method to check if a UTF-16 string contains any code point 2025-07-28 18:30:50 +02:00
Timothy Flynn
ed63a60247 AK: Return an empty optional when UTF-16 code unit lookup fails
Accidentally returned the wrong type here.
2025-07-28 12:25:11 +02:00
Timothy Flynn
baddac5155 AK: Implement a method to split a UTF-16 string 2025-07-28 12:25:11 +02:00
Timothy Flynn
48a3b2c28e AK: Implement a method to count instances of a needle in a UTF-16 string 2025-07-28 12:25:11 +02:00
Timothy Flynn
745f288796 AK: Implement a method to acquire a UTF-16 iterator's code unit offset
This is the same as Utf8View::iterator_offset().
2025-07-25 18:16:22 +02:00
Timothy Flynn
6c73dff120 AK: Implement a UTF-16 method to check if a string is ASCII whitespace 2025-07-24 19:00:20 +02:00
Jelle Raaijmakers
b1c3ce807b AK: Rename Utf16View::trim_whitespace() to ::trim_ascii_whitespace()
This reflects the naming of String::trim_ascii_whitespace() and better
indicates what exactly we're trimming.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
9a03ee1c24 AK: Fix mention of renamed member in Utf16View 2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
15178d5230 AK: Add ::ends_with() to Utf16View and Utf16StringBase
I noticed that we can significantly simplify ::starts_with(), and based
the new ::ends_with() on that.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
7f8468b0e6 AK: Compare pointers in TypedTransfer<T>::compare()
We can return `true` quickly if the two pointers are identical.
2025-07-24 07:18:25 -04:00
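
The pointer short-circuit described in 7f8468b0e6 can be sketched in standalone C++ roughly as follows. `compare_elements` is a hypothetical stand-in written for this illustration, not AK's actual `TypedTransfer<T>::compare()`, and it assumes that `T`'s `operator==` is reflexive (which does not hold for a floating-point NaN, for example).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an element-wise comparison helper; this is NOT
// AK's TypedTransfer<T>::compare(), only a sketch of the same idea.
template<typename T>
bool compare_elements(T const* a, T const* b, std::size_t count)
{
    // Fast path: a range compared against itself is equal, and so are two
    // empty ranges. This assumes T's operator== is reflexive.
    if (a == b || count == 0)
        return true;

    // Slow path: compare element by element.
    for (std::size_t i = 0; i < count; ++i) {
        if (!(a[i] == b[i]))
            return false;
    }
    return true;
}
```

With this shape, comparing a buffer against itself costs a single pointer comparison regardless of `count`.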
Timothy Flynn
6ddbb70051 AK: Remove constexpr specifier from Utf16View::bytes()
The Span constructor used here uses reinterpret_cast under the hood, so
it and Utf16View::bytes() cannot be constexpr.
2025-07-22 13:33:51 -04:00
Timothy Flynn
ad7ac679fd AK: Compute Utf16View::code_point_offset_of correctly
There were a couple of issues here, including that the following
computation could actually wrap around to NumericLimits<size_t>::max():

    code_unit_offset -= it.length_in_code_units();
2025-07-22 17:17:33 +02:00
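
The hazard named in ad7ac679fd is ordinary unsigned arithmetic: subtracting a larger `size_t` from a smaller one wraps around instead of going negative. The sketch below uses hypothetical helper names and an assumed scenario (one remaining code unit, a two-unit surrogate pair under the iterator); it is not the code from Utf16View.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: mirrors `code_unit_offset -= length` with no guard.
// Unsigned subtraction is well defined, but it wraps modulo 2^N.
constexpr std::size_t unchecked_subtract(std::size_t offset, std::size_t length)
{
    return offset - length;
}

// Hypothetical guarded variant: refuses to subtract past zero.
// Returns false and leaves `result` untouched when it would wrap.
constexpr bool checked_subtract(std::size_t offset, std::size_t length, std::size_t& result)
{
    if (length > offset)
        return false;
    result = offset - length;
    return true;
}

// Assumed scenario: 1 code unit left, iterator sits on a 2-unit pair.
static_assert(unchecked_subtract(1, 2) == std::numeric_limits<std::size_t>::max());
```

A guard like `checked_subtract` is only one possible remedy; the commit itself does not say which approach was taken.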
Timothy Flynn
0bbb725bcd AK: Mark a couple of methods in Utf16View.h as constexpr 2025-07-22 17:17:33 +02:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Timothy Flynn
d40e3af697 AK: Implement UTF-16 string-to-number conversions 2025-07-18 12:45:38 -04:00
Timothy Flynn
6e0290ecaa AK: Define some UTF-16 helper methods
* contains
* escape_html_entities
* replace
* to_ascii_lowercase
* to_ascii_uppercase
* to_ascii_titlecase
* trim
* trim_whitespace
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
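
The numbers quoted in fe676585f5 for "😀" (U+1F600) can be checked with standard C++ string views. This sketch uses only `std::string_view` and `std::u16string_view`, not the AK types, and the variable names are chosen just for this illustration.

```cpp
#include <cassert>
#include <string_view>

// U+1F600 written with escapes so the example does not depend on the
// encoding of the source file.
constexpr std::u16string_view emoji_utf16 = u"\U0001F600";
constexpr std::string_view emoji_utf8 = "\xF0\x9F\x98\x80";

static_assert(emoji_utf8.size() == 4);   // 4 UTF-8 bytes
static_assert(emoji_utf16.size() == 2);  // 2 UTF-16 code units
static_assert(emoji_utf16[0] == 0xD83D); // high surrogate
static_assert(emoji_utf16[1] == 0xDE00); // low surrogate
```

Storing the 4-byte UTF-8 form inline would therefore make the byte count disagree with the code unit count, which is the confusion the commit message describes.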
Timothy Flynn
418409aa6f AK: Fix usage of constexpr within Utf16View and related utilities
* Error and ErrorOr are not themselves constexpr, so a function returning
  these types cannot be constexpr.

* The UDL was trying to call Utf16View::validate, which is not constexpr
  itself. The compiler will actually already raise an error if a UTF-16
  literal is invalid, so let's just avoid the call altogether.
2025-07-05 01:25:22 +12:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
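
For readers unfamiliar with the term, a "lonely" surrogate is a high surrogate not followed by a low one, or a low surrogate not preceded by a high one. A minimal detection sketch in standard C++ follows; `contains_lonely_surrogate` and the two predicates are hypothetical names for this illustration, not AK APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool is_high_surrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Hypothetical helper: true if any surrogate code unit in `text` is unpaired.
constexpr bool contains_lonely_surrogate(std::u16string_view text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (is_high_surrogate(text[i])) {
            if (i + 1 < text.size() && is_low_surrogate(text[i + 1])) {
                ++i; // skip the low half of a well-formed pair
                continue;
            }
            return true; // high surrogate with no low surrogate after it
        }
        if (is_low_surrogate(text[i]))
            return true; // low surrogate with no high surrogate before it
    }
    return false;
}
```

A string API that allows lonely surrogates by default would treat inputs for which this returns `true` as acceptable rather than as validation failures.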
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This lets us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
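
The general mechanism behind such a literal can be sketched with a user-defined literal operator over `char16_t`. The `_u16` suffix and the use of `std::u16string_view` are assumptions made for this standalone example; the real literal in 86b1c78c1a produces a `Utf16View`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical suffix for illustration only. The compiler passes the literal's
// code units and their count, so no template parameter is needed.
constexpr std::u16string_view operator""_u16(char16_t const* data, std::size_t length)
{
    return { data, length };
}

constexpr auto greeting = u"hello"_u16;
static_assert(greeting.size() == 5);
static_assert(greeting[0] == u'h');
```

Because the operator is an ordinary function taking a pointer and a length, it is not instantiated once per literal the way a non-type template parameter constructor would be.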
Timothy Flynn
c17b067e1d AK: Completely remove endianness from Utf16View APIs
These were mostly removed in 7628ddfaf7.
This removes the few remaining cases, as no callers are providing any
non-host endianness. This is just to prevent weird API dissymmetry
between Utf16View and an upcoming Utf16String.
2025-07-03 09:51:56 -04:00
Timothy Flynn
a0eb47e2fc AK: Add hash traits for Utf16View
This is based on the hash in JS::Utf16StringImpl::compute_hash.
2025-07-03 09:51:56 -04:00
Timothy Flynn
2abc955ca9 AK: Allow treating UTF-16 views with lonely surrogates as valid
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
d978a582a0 AK: Add a Utf16View ASCII validator 2025-07-03 09:51:56 -04:00
Timothy Flynn
66006d3812 AK+LibJS: Extract some UTF-16 helpers for use in an outside class
An upcoming Utf16String will need access to these helpers. Let's make
them publicly available.
2025-07-03 09:51:56 -04:00
Timothy Flynn
efa9737cf7 AK+LibJS: Do not set UTF-16 code point length to its code unit length 2025-06-25 22:20:47 +02:00
Jelle Raaijmakers
cc0a28ee7d AK: Add Utf16View::find_code_unit_offset(_ignoring_case) 2025-06-13 15:08:26 +02:00
Shannon Booth
5cf87dcfdc AK: Add a Utf16View::is_code_unit_less_than helper
This seems like the natural place to put this since it is specific
to UTF-16.
2025-05-17 08:00:59 -04:00
Ali Mohammad Pur
eea81738cd AK+Everywhere: Recognise that surrogates in utf16 aren't all that common
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy utf-16 substrings and code point operations.

This seems to be a significant chunk of time spent in many regex
benchmarks.
2025-04-23 07:56:02 -06:00
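
The idea in eea81738cd can be sketched as follows: count code points once, and whenever that count equals the code unit count there are no surrogate pairs, so code point indices and code unit offsets coincide. `CountedView` and its members are hypothetical names for this illustration and do not reflect the actual Utf16View layout.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool is_lead(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool is_trail(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// True if a well-formed surrogate pair starts at `offset`.
constexpr bool pair_at(std::u16string_view text, std::size_t offset)
{
    return is_lead(text[offset]) && offset + 1 < text.size() && is_trail(text[offset + 1]);
}

// Hypothetical view that pays for one code point count up front.
struct CountedView {
    std::u16string_view code_units;
    std::size_t code_point_count { 0 };

    explicit CountedView(std::u16string_view text)
        : code_units(text)
    {
        for (std::size_t i = 0; i < text.size(); i += pair_at(text, i) ? 2 : 1)
            ++code_point_count;
    }

    std::size_t code_unit_offset_of(std::size_t code_point_index) const
    {
        // Fast path: no surrogate pairs, so the index is already the offset.
        if (code_point_count == code_units.size())
            return code_point_index;

        // Slow path: walk the string one code point at a time.
        std::size_t offset = 0;
        for (std::size_t seen = 0; seen < code_point_index && offset < code_units.size(); ++seen)
            offset += pair_at(code_units, offset) ? 2 : 1;
        return offset;
    }
};
```

The extra `code_point_count` field is the "teeny bit of memory" the commit message mentions; the fast path then turns an O(n) walk into a constant-time return.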
Andreas Kling
0c93a07fb1 AK: Shrink Utf16View
Use a sentinel value instead of Optional for the cached length in code
points, shrinking Utf16View from 32 to 24 bytes.
2025-04-16 10:04:50 +02:00
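
The size saving in 0c93a07fb1 comes from dropping the engaged flag (and its padding) that an optional carries. The two structs below are hypothetical layouts using `std::optional` rather than AK's `Optional`, with assumed field names; on a typical 64-bit target they come out at 32 and 24 bytes, but only the relative ordering is asserted here.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>

// Hypothetical layout: the cached length carries a separate "has value" flag.
struct ViewWithOptional {
    char16_t const* data { nullptr };
    std::size_t length_in_code_units { 0 };
    mutable std::optional<std::size_t> cached_length_in_code_points;
};

// Hypothetical layout: one reserved value means "not computed yet".
struct ViewWithSentinel {
    static constexpr std::size_t not_computed = std::numeric_limits<std::size_t>::max();

    char16_t const* data { nullptr };
    std::size_t length_in_code_units { 0 };
    mutable std::size_t cached_length_in_code_points { not_computed };

    bool has_cached_length() const { return cached_length_in_code_points != not_computed; }
};

static_assert(sizeof(ViewWithSentinel) < sizeof(ViewWithOptional));
```

The trade-off is that the sentinel value itself can no longer be stored as a real length, which is harmless here because no string can contain that many code points.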
Andreas Kling
7628ddfaf7 AK: Remove endianness override from Utf16View
Utf16View is now always in "host" endian mode. This makes it smaller
and less branchy for everyone!
2025-04-16 10:04:50 +02:00
Andreas Kling
0e9480b944 AK+LibTextCodec: Stop using Utf16View endianness override
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.

While here, add helpers and make use of simdutf for fast conversion.
2025-04-16 10:04:50 +02:00
Andrew Kaster
5e7e6475c6 AK: Annotate [[no_unique_address]] members with NO_UNIQUE_ADDRESS macro 2025-04-15 02:19:06 -06:00
Andreas Kling
b2779ad9f7 AK: Shrink Utf16View from 40 bytes to 32 bytes
This ends up making RegexStringView smaller, which means less stuff to
copy when forking in the regex engine.

Thanks to Leon for suggesting the [[no_unique_address]] trick!
2025-04-09 07:22:01 +02:00
Jonne Ransijn
04920d06f0 AK: Use simdutf when appending UTF-16 to StringBuilder
Adds a fast path for valid UTF-16 using `simdutf`, and fall back to
the slow path for unmatched surrogates.
2024-10-30 10:28:24 +01:00
Timothy Flynn
7a17c654d2 AK: Add a method to compute UTF-16 length from a UTF-8 string 2024-07-31 05:55:34 -04:00
Timothy Flynn
71c29504af AK: Support non-native endianness in Utf16View
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
2024-07-18 19:43:57 +02:00
Timothy Flynn
32ffe9bbfc AK: Replace UTF-16 validation and length computation with simdutf 2024-07-18 14:46:25 +02:00
Timothy Flynn
ec492a1a08 Everywhere: Run clang-format
The following command was used to clang-format these files:

    clang-format-18 -i $(find . \
        -not \( -path "./\.*" -prune \) \
        -not \( -path "./Base/*" -prune \) \
        -not \( -path "./Build/*" -prune \) \
        -not \( -path "./Toolchain/*" -prune \) \
        -not \( -path "./Ports/*" -prune \) \
        -type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")

There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
2024-04-24 16:50:01 -04:00
Timothy Flynn
1b4a23095c AK: Add a Utf16View::starts_with method
Based heavily on Utf8View::starts_with.
2024-01-04 12:43:10 +01:00
Timothy Flynn
c46ba7e68d AK: Allow constructing a UTF-16 view from a UTF-16 string literal
UTF-16 string literals are a language-level feature. It is convenient to
be able to construct a Utf16View from these strings.
2024-01-04 12:43:10 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
370ea9441c AK: Define an alias for Utf16View's iterator type
Utf8View and Utf32View do so already. This allows using these views more
readily in generic code.
2023-11-08 12:54:26 -05:00
MacDue
63b11030f0 Everywhere: Use ReadonlySpan<T> instead of Span<T const> 2023-02-08 19:15:45 +00:00
Timothy Flynn
2eacc7aec1 AK: Add Utf16View::to_utf8 to convert the view to a UTF-8 AK::String 2023-01-09 23:00:24 +00:00
Timothy Flynn
d0403ec14f AK+Everywhere: Rename Utf16View::to_utf8 to to_deprecated_string
A subsequent commit will add to_utf8 back to create an AK::String.
2023-01-09 23:00:24 +00:00
Timothy Flynn
d793262beb AK+Everywhere: Make UTF-16 to UTF-8 converter fallible
This could fail to allocate the underlying storage needed to store the
UTF-8 data. Propagate this error.
2023-01-08 12:13:15 +01:00
Timothy Flynn
1edb96376b AK+Everywhere: Make UTF-8 and UTF-32 to UTF-16 converters fallible
These could fail to allocate the underlying storage needed to store the
UTF-16 data. Propagate these errors.
2023-01-08 12:13:15 +01:00