Commit graph

48 commits

Author SHA1 Message Date
Timothy Flynn
6e0290ecaa AK: Define some UTF-16 helper methods
* contains
* escape_html_entities
* replace
* to_ascii_lowercase
* to_ascii_uppercase
* to_ascii_titlecase
* trim
* trim_whitespace
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
Timothy Flynn
8fbb80fffc AK: Do not fall back to simdutf for UTF-16 ASCII validation
This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This
code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these
bytes are ASCII if interpreted as UTF-8. But the string itself is most
certainly not ASCII.
2025-07-18 12:45:38 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
c17b067e1d AK: Completely remove endianness from Utf16View APIs
These were mostly removed in 7628ddfaf7.
This removes the few remaining cases, as no callers are providing any
non-host endianness. This is just to prevent weird API dissymmetry
between Utf16View and an upcoming Utf16String.
2025-07-03 09:51:56 -04:00
Timothy Flynn
2abc955ca9 AK: Allow treating UTF-16 views with lonely surrogates as valid
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
d978a582a0 AK: Add a Utf16View ASCII validator 2025-07-03 09:51:56 -04:00
Timothy Flynn
66006d3812 AK+LibJS: Extract some UTF-16 helpers for use in an outside class
An upcoming Utf16String will need access to these helpers. Let's make
them publicly available.
2025-07-03 09:51:56 -04:00
Jelle Raaijmakers
408165d2f4 AK: Return early in utf8_to_utf16() for empty strings
No need to validate an empty string.
2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
2d6da6e112 AK: Remove superfluous check from Utf16View::subtring_view()
The `Span::slice()` operation just below it performs the exact same
check.
2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
0d543b604b AK: Make more use of lazily calculated code point count in Utf16View
In 0c93a07fb1, a lazily calculated code
point count was introduced but was not used in all places where we need
that count. No functional changes.
2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
cc0a28ee7d AK: Add Utf16View::find_code_unit_offset(_ignoring_case) 2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
7d7f6fa494 AK: Remove superfluous check from Utf16View::equals_ignoring_case()
Returning true if both lengths are 0 is already handled by the default
case.
2025-06-13 15:08:26 +02:00
Shannon Booth
5cf87dcfdc AK: Add a Utf16View::is_code_unit_less_than helper
This seems like the natural place to put this since it is specific
to UTF-16.
2025-05-17 08:00:59 -04:00
Andreas Kling
6efc5c54b5 AK: Make Utf16View::to_utf8() use simdutf fast path more often
By piggybacking on the already-optimized implementation in
StringBuilder, we can get simdutf for the AllowInvalidCodeUnits::Yes
case here as well.

1.03x speedup on Speedometer's TodoMVC-jQuery.
2025-05-09 21:36:59 +02:00
Ali Mohammad Pur
eea81738cd AK+Everywhere: Recognise that surrogates in utf16 aren't all that common
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy utf-16 substrings and code point operations.

This seems to be a significant chunk of time spent in many regex
benchmarks.
2025-04-23 07:56:02 -06:00
Andreas Kling
0c93a07fb1 AK: Shrink Utf16View
Use a sentinel value instead of Optional for the cached length in code
points, shrinking Utf16View from 32 to 24 bytes.
2025-04-16 10:04:50 +02:00
Andreas Kling
7628ddfaf7 AK: Remove endianness override from Utf16View
Utf16View is now always in "host" endian mode. This makes it smaller
and less branchy for everyone!
2025-04-16 10:04:50 +02:00
Andreas Kling
0e9480b944 AK+LibTextCodec: Stop using Utf16View endianness override
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.

While here, add helpers and make use of simdutf for fast conversion.
2025-04-16 10:04:50 +02:00
Timothy Flynn
d19b31529f AK+Meta: Update simdutf to version 5.5.0
Contains many fixes found upstream by fuzzers. Also includes fixes for
CPU-specific inconsistencies with null inputs.
2024-09-19 15:48:57 -04:00
Timothy Flynn
7a17c654d2 AK: Add a method to compute UTF-16 length from a UTF-8 string 2024-07-31 05:55:34 -04:00
Andrew Kaster
45301e8169 Everywhere: Remove AK_DONT_REPLACE_STD macro
Let's just always include `<utility>`. Placing our own incompatible with
the STL declaration of these functions in AK was always fishy to begin
with.
2024-07-30 18:38:02 -06:00
Timothy Flynn
74d644a216 AK: Explicitly check for null data in Utf16View
The underlying CPU-specific instructions for operating on UTF-16 strings
behave differently for null inputs. Add an explicit check for this state
for consistency.
2024-07-21 19:57:07 +02:00
Timothy Flynn
71c29504af AK: Support non-native endianness in Utf16View
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
2024-07-18 19:43:57 +02:00
Timothy Flynn
0c14a9417a AK: Replace converting to and from UTF-16 with simdutf
The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
2024-07-18 14:46:25 +02:00
Timothy Flynn
32ffe9bbfc AK: Replace UTF-16 validation and length computation with simdutf 2024-07-18 14:46:25 +02:00
Diego
7560b640f3 AK: Add AllowSurrogates to UTF-8 validator
The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5)
standard says to reject strings with upper or lower surrogates. However,
in many standards, ECMAScript included, unpaired surrogates (and
therefore UTF-8 surrogates) are allowed in strings. So, this commit
extends the UTF-8 validation API with `AllowSurrogates`, which will
reject upper and lower surrogate characters.
2024-06-09 12:16:32 +02:00
Timothy Flynn
1b4a23095c AK: Add a Utf16View::starts_with method
Based heavily on Utf8View::starts_with.
2024-01-04 12:43:10 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Nico Weber
aa9037eed4 AK: Add spec comments to Utf16CodePointIterator::operator*() 2023-01-22 21:30:44 +00:00
Timothy Flynn
2eacc7aec1 AK: Add Utf16View::to_utf8 to convert the view to a UTF-8 AK::String 2023-01-09 23:00:24 +00:00
Timothy Flynn
d0403ec14f AK+Everywhere: Rename Utf16View::to_utf8 to to_deprecated_string
A subsequent commit will add to_utf8 back to create an AK::String.
2023-01-09 23:00:24 +00:00
Timothy Flynn
d793262beb AK+Everywhere: Make UTF-16 to UTF-8 converter fallible
This could fail to allocate the underlying storage needed to store the
UTF-8 data. Propagate this error.
2023-01-08 12:13:15 +01:00
Timothy Flynn
1edb96376b AK+Everywhere: Make UTF-8 and UTF-32 to UTF-16 converters fallible
These could fail to allocate the underlying storage needed to store the
UTF-16 data. Propagate these errors.
2023-01-08 12:13:15 +01:00
Timothy Flynn
425c168ded AK+LibJS+LibRegex: Define an alias for UTF-16 string data storage
Instead of writing out "Vector<u16, 1>" everywhere, let's have a name
for it.
2023-01-08 12:13:15 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Linus Groh
d26aabff04 Everywhere: Run clang-format 2022-12-03 23:52:23 +00:00
Idan Horowitz
44e8c05c67 AK: Add a Utf16View::code_unit_offset_of(Utf16CodePointIterator) helper
This helper can be used to used to retrieve the code unit offset of an
active Utf16CodePointIterator efficiently.
2022-01-31 21:05:04 +02:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Andreas Kling
87290e300e AK: Simplify Utf16View::operator==(Utf16View) 2021-10-02 18:32:56 +02:00
Andreas Kling
024367d82e LibJS+AK: Use Vector<u16, 1> for UTF-16 string storage
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
2021-10-02 17:39:38 +02:00
Timothy Flynn
70080feab2 AK+LibJS: Implement String.from{CharCode,CodePoint} using UTF-16 strings
Most of String.prototype and RegExp.prototype is implemented with UTF-16
so this is to prevent extra copying of the string data.
2021-08-04 11:18:24 +02:00
Timothy Flynn
510bbcd8e0 AK+LibRegex: Add Utf16View::code_point_at and use it in RegexStringView
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
2021-08-04 11:18:24 +02:00
Timothy Flynn
0e6375558d AK+LibRegex: Partially implement case insensitive UTF-16 comparison
This will work for ASCII code points. Unicode case folding will be
needed for non-ASCII.
2021-07-23 23:06:57 +01:00
Timothy Flynn
2e45e52993 AK: Add UTF-16 helper methods required for use within LibRegex
To be used as a RegexStringView variant, Utf16View must provide a couple
more helper methods. It must also not default its assignment operators,
because that implicitly deletes move/copy constructors.
2021-07-23 23:06:57 +01:00
Timothy Flynn
9b83cd1abf AK: Add Utf16View for decoding UTF-16 strings
Also includes a way to transcode from and to UTF-8 strings.
2021-07-22 09:10:44 +02:00