This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.
While here, add helpers and make use of simdutf for fast conversion.
This involves yeeting the 'invalid' union member as it was not really
checked against properly anyway; now the 'invalid' state is simply
StringData*{nullptr}, which was assumed to not exist previously.
Note that this is still accessing inactive union members, but is
promising to the compiler that they're fine where they are (the provided
debug macro AK_STRINGBASE_VERIFY_LAUNDER_DEBUG makes the
would-be-UB-if-not-for-launder ops verify that the operation is correct)
Should fix the GCC build.
These are copied and modified from ByteString, with the addition of a
Case parameter so that we can construct them in lowercase instead of
having to them make a copy.
These don't have to worry about the input not being valid UTF-8 and
so can be infallible (and can even return self if no changes needed.)
We use this instead of Infra::to_ascii_{upper,lower}_case in LibWeb.
Currently, invoking StringBuilder::to_string will re-allocate the string
data to construct the String. This is wasteful both in terms of memory
and speed.
The goal here is to simply hand the string buffer over to String, and
let String take ownership of that buffer. To do this, StringBuilder must
have the same memory layout as Detail::StringData. This layout is just
the members of the StringData class followed by the string itself.
So when a StringBuilder is created, we reserve sizeof(StringData) bytes
at the front of the buffer. StringData can then construct itself into
the buffer with placement new.
Things to note:
* StringData must now be aware of the actual capacity of its buffer, as
that can be larger than the string size.
* We must take care not to pass ownership of inlined string buffers, as
these live on the stack.
The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
currently crashes with an assertion failure in `String::repeated` if
malloc can't serve a `count * input_size` sized request, so add
`String::repeated_with_error` to propagate the error.
Before this change, the global FlyString table looked like this:
HashMap<StringView, Detail::StringBase>
After this change, we have:
HashTable<Detail::StringData const*, FlyStringTableHashTraits>
The custom hash traits are used to extract the stored hash from
StringData which avoids having to rehash the StringView repeatedly like
we did before.
This necessitated a handful of smaller changes to make it work.
The idea is to eventually get rid of protected state in StringBase. To
do this, we first need to remove all references to m_data and
m_short_string from String.
This starts separating memory management of string data and string
utilities like `String::formatted`. This would also allow to reuse the
same storage in `DeprecatedString` in the future.
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
Instead of using a StringBuilder, add a String::repeated(String, N)
overload that takes advantage of knowing it's already all UTF-8.
This makes the following microbenchmark go 4x faster:
"foo".repeat(100_000_000)
And for single character strings, we can even go 10x faster:
"x".repeat(100_000_000)
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
Now that ""_string is infallible, the only benefit of explicitly
constructing a short string is the ability to do it at compile-time. But
we never do that, so let's simplify the API and remove this
implementation detail from it.
This was missed in 02b74e5a70
We need to disable consteval in AK::String as well as AK::StringView,
and we need to disable it when building both the tools build and the
fuzzer build.
Calling `from_utf8` with a DeprecatedString will hide the fact that we
have a DeprecatedString, while using `from_deprecated_string` with a
StringView will silently and needlessly allocate a DeprecatedString,
so let's forbid that.