There were a couple of issues here, including the following computation
could actually overflow to NumericLimits<size_t>::max():
code_unit_offset -= it.length_in_code_units();
This is a strictly UTF-16 string with some optimizations for ASCII.
* If created from a short UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an inlined byte buffer.
* If created with a long UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an outlined char buffer.
* If created with a short or long UTF-8 or UTF-16 string that is not
ASCII, then the string is stored in an outlined char16 buffer.
We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
* Error and ErrorOr are not themelves constexpr, so a function returning
these types cannot be constexpr.
* The UDL was trying to call Utf16View::validate, which is not constexpr
itself. The compiler will actually already raise an error if a UTF-16
literal is invalid, so let's just avoid the call altogether.
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.
This also adds a UDL to construct a Utf16View from a string literal:
auto string = u"hello"sv;
This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
These were mostly removed in 7628ddfaf7.
This removes the few remaining cases, as no callers are providing any
non-host endianness. This is just to prevent weird API dissymmetry
between Utf16View and an upcoming Utf16String.
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy utf-16 substrings and code point operations.
This seems to be a significant chunk of time spent in many regex
benchmarks.
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.
While here, add helpers and make use of simdutf for fast conversion.
This ends up making RegexStringView smaller, which means less stuff to
copy when forking in the regex engine.
Thanks to Leon for suggesting the [[no_unique_address]] trick!
Utf16View currently assumes host endianness. Add support for specifying
either big or little endianness (which we mostly just pipe through to
simdutf). This will allow using simdutf facilities with LibTextCodec.
The following command was used to clang-format these files:
clang-format-18 -i $(find . \
-not \( -path "./\.*" -prune \) \
-not \( -path "./Base/*" -prune \) \
-not \( -path "./Build/*" -prune \) \
-not \( -path "./Toolchain/*" -prune \) \
-not \( -path "./Ports/*" -prune \) \
-type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')