For the web, we allow a wobbly UTF-16 encoding (i.e. lonely surrogates
are permitted). Only in a few exceptional cases do we strictly require
valid UTF-16. As such, our `validate(AllowLonelySurrogates::Yes)` calls
will always succeed. Making such a check is wasted effort.
This patch eliminates such invocations. The validation methods will now
only check for strict UTF-16, and are only invoked when needed.
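As a rough illustration of what a strict check has to look for, here is a minimal sketch written against the standard library rather than AK. The name `is_strictly_valid_utf16` is hypothetical and is not the actual validation method; it only shows the rule that every surrogate must be part of a high/low pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the AK API: returns true only if `text` is
// strictly valid UTF-16, i.e. it contains no lonely surrogates.
constexpr bool is_strictly_valid_utf16(std::u16string_view text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        char16_t unit = text[i];
        bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;

        // A low surrogate with no preceding high surrogate is lonely.
        if (is_low)
            return false;

        if (is_high) {
            // A high surrogate must be immediately followed by a low one.
            if (i + 1 >= text.size())
                return false;
            char16_t next = text[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;
            ++i; // Skip the low half of the pair.
        }
    }
    return true;
}
```

A wobbly string such as the code units { 0xD83D, 'a' } fails this check, while still being acceptable to the web-facing APIs that never call it.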
When we build a UTF-16 string, we currently always switch to the UTF-16
storage mode inside StringBuilder. Then when it comes time to create the
string, we switch the storage to ASCII if possible (by shifting the
underlying bytes up).
Instead, let's start out with ASCII storage and then switch to UTF-16
storage once we see a non-ASCII code point. For most strings, this will
avoid allocating 2x the memory and avoid many ASCII validation calls.
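The idea can be sketched roughly as follows. This is not the real StringBuilder; `Utf16BuilderSketch` and all of its members are hypothetical names, and the sketch uses std::string and std::u16string as stand-ins for the two storage modes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical builder, not the real StringBuilder: starts with 8-bit ASCII
// storage and widens to UTF-16 storage on the first non-ASCII code unit.
class Utf16BuilderSketch {
public:
    void append(std::u16string_view text)
    {
        for (char16_t unit : text)
            append_code_unit(unit);
    }

    void append_code_unit(char16_t unit)
    {
        if (!m_is_utf16 && unit < 0x80) {
            m_ascii.push_back(static_cast<char>(unit));
            return;
        }
        if (!m_is_utf16)
            widen();
        m_utf16.push_back(unit);
    }

    bool is_ascii() const { return !m_is_utf16; }

    std::size_t length_in_code_units() const
    {
        return m_is_utf16 ? m_utf16.size() : m_ascii.size();
    }

    std::u16string to_utf16() const
    {
        if (m_is_utf16)
            return m_utf16;
        return std::u16string(m_ascii.begin(), m_ascii.end());
    }

private:
    // One-time switch: copy the ASCII bytes into 16-bit code units.
    void widen()
    {
        m_utf16.assign(m_ascii.begin(), m_ascii.end());
        m_ascii.clear();
        m_ascii.shrink_to_fit();
        m_is_utf16 = true;
    }

    std::string m_ascii;
    std::u16string m_utf16;
    bool m_is_utf16 { false };
};
```

In this sketch a string that never leaves ASCII only ever touches the 8-bit buffer, so no narrowing pass is needed when the final string is created.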
In the following synthetic benchmark, the simdutf version is 4x faster:
    BENCHMARK_CASE(find)
    {
        auto string = u"😀Foo😀Bar"sv;
        for (size_t i = 0; i < 100'000'000; ++i)
            (void)string.find_code_unit_offset('a');
    }
Before now, you could compare a Utf16View to a StringView, but it would
only be valid if the StringView were ASCII. When porting code to UTF-16,
it will be handy to have a code point-aware implementation for non-ASCII
StringViews.
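A code point-aware comparison could look something like the sketch below. It uses std::u16string_view and std::string_view as stand-ins for Utf16View and StringView; `next_utf16`, `next_utf8` and `equals` are hypothetical helpers, and the UTF-8 decoder assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Decode one code point from UTF-16, advancing `i`. Lonely surrogates are
// returned as-is, matching the wobbly UTF-16 model.
inline char32_t next_utf16(std::u16string_view s, std::size_t& i)
{
    char16_t unit = s[i++];
    if (unit >= 0xD800 && unit <= 0xDBFF && i < s.size() && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
        char32_t high = static_cast<char32_t>(unit) - 0xD800;
        char32_t low = static_cast<char32_t>(s[i++]) - 0xDC00;
        return 0x10000 + (high << 10) + low;
    }
    return unit;
}

// Decode one code point from UTF-8, advancing `i`. Assumes well-formed input.
inline char32_t next_utf8(std::string_view s, std::size_t& i)
{
    auto byte = static_cast<unsigned char>(s[i++]);
    std::size_t extra = byte >= 0xF0 ? 3 : byte >= 0xE0 ? 2 : byte >= 0xC0 ? 1 : 0;

    // Keep only the payload bits of the leading byte.
    char32_t code_point = extra == 0 ? byte : byte & (0x3F >> extra);

    for (; extra > 0 && i < s.size(); --extra)
        code_point = (code_point << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return code_point;
}

// Compare the two views code point by code point.
inline bool equals(std::u16string_view utf16, std::string_view utf8)
{
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < utf16.size() && j < utf8.size()) {
        if (next_utf16(utf16, i) != next_utf8(utf8, j))
            return false;
    }
    return i == utf16.size() && j == utf8.size();
}
```

Comparing code points rather than code units is what makes u"😀Foo" (5 code units) equal to its 7-byte UTF-8 form.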
There were a couple of issues here, including that the following
computation could actually wrap around to NumericLimits<size_t>::max():
    code_unit_offset -= it.length_in_code_units();
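To show how that subtraction can go wrong, here is a minimal sketch of a guarded version. It is not the AK implementation; `code_point_offset_of` here is a hypothetical free function over std::u16string_view, and stopping when the offset lands inside a surrogate pair is a choice made for this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the AK implementation: counts the code points
// that end at or before `code_unit_offset`.
inline std::size_t code_point_offset_of(std::u16string_view text, std::size_t code_unit_offset)
{
    std::size_t code_points = 0;
    std::size_t i = 0;

    while (i < text.size() && code_unit_offset > 0) {
        bool is_pair = text[i] >= 0xD800 && text[i] <= 0xDBFF
            && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF;
        std::size_t length = is_pair ? 2 : 1;

        // Guard: if the offset points into the middle of a surrogate pair,
        // subtracting 2 from 1 would wrap around to the maximum size_t.
        if (length > code_unit_offset)
            break;

        code_unit_offset -= length;
        i += length;
        ++code_points;
    }
    return code_points;
}
```

Without the guard, an offset of 1 into u"😀a" would subtract 2 from 1 and leave the loop running with a huge remaining offset.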
This is a strictly UTF-16 string with some optimizations for ASCII.
* If created from a short UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an inlined byte buffer.
* If created from a long UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an outlined char buffer.
* If created from a short or long UTF-8 or UTF-16 string that is not
ASCII, then the string is stored in an outlined char16 buffer.
We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
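The three cases above can be sketched as a small decision function. Everything here is hypothetical: `StorageKind`, `choose_storage` and especially `hypothetical_inline_capacity` are illustration-only names, and the real inline capacity is not stated in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class StorageKind {
    InlineAscii,   // Short ASCII: inlined byte buffer.
    OutlinedAscii, // Long ASCII: outlined char buffer.
    OutlinedUtf16, // Any non-ASCII: outlined char16 buffer.
};

// Assumed value for illustration only; not the real inline capacity.
constexpr std::size_t hypothetical_inline_capacity = 7;

constexpr StorageKind choose_storage(std::u16string_view text)
{
    // Any non-ASCII code unit forces 16-bit storage, regardless of length.
    for (char16_t unit : text) {
        if (unit >= 0x80)
            return StorageKind::OutlinedUtf16;
    }
    return text.size() <= hypothetical_inline_capacity
        ? StorageKind::InlineAscii
        : StorageKind::OutlinedAscii;
}
```

Because "😀" always lands in 16-bit storage in this sketch, its length is 2 code units and its first code unit is 0xD83D, with no special casing needed.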
* Error and ErrorOr are not themselves constexpr, so a function returning
these types cannot be constexpr.
* The UDL was trying to call Utf16View::validate, which is not constexpr
itself. The compiler will actually already raise an error if a UTF-16
literal is invalid, so let's just avoid the call altogether.
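A constexpr UDL that skips validation can be as small as the sketch below. It uses std::u16string_view as a stand-in for Utf16View, and the `_u16view` suffix is a hypothetical name chosen because user code must use a suffix starting with an underscore.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical UDL, not the AK one: wraps the literal's code units directly.
// No validation call and no ErrorOr, so it can be evaluated at compile time.
constexpr std::u16string_view operator""_u16view(char16_t const* data, std::size_t length)
{
    return { data, length };
}

// Evaluated entirely by the compiler.
static_assert(u"hello"_u16view.size() == 5);
```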
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.
This also adds a UDL to construct a Utf16View from a string literal:
    auto string = u"hello"sv;
This lets us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
These were mostly removed in 7628ddfaf7.
This removes the few remaining cases, as no callers are providing any
non-host endianness. This is just to prevent weird API dissymmetry
between Utf16View and an upcoming Utf16String.
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
For the slight cost of counting code points when converting between
encodings and a teeny bit of memory, this commit adds a fast path for
all-happy UTF-16 substrings and code point operations.
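One way to picture such a fast path is sketched below, assuming that "all-happy" means the view contains no surrogate pairs, so code point offsets and code unit offsets coincide. That reading is an assumption, and `Utf16ViewSketch` with its members is a hypothetical type, not the real Utf16View.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

class Utf16ViewSketch {
public:
    explicit Utf16ViewSketch(std::u16string_view units)
        : m_units(units)
    {
        // Count code points once up front and remember the result.
        for (std::size_t i = 0; i < m_units.size(); i += code_point_length_at(i))
            ++m_length_in_code_points;
    }

    std::size_t length_in_code_points() const { return m_length_in_code_points; }

    std::size_t code_unit_offset_of(std::size_t code_point_offset) const
    {
        // Fast path: no surrogate pairs, so the two offsets are identical.
        if (m_length_in_code_points == m_units.size())
            return std::min(code_point_offset, m_units.size());

        // Slow path: walk the string one code point at a time.
        std::size_t i = 0;
        while (i < m_units.size() && code_point_offset > 0) {
            i += code_point_length_at(i);
            --code_point_offset;
        }
        return i;
    }

private:
    std::size_t code_point_length_at(std::size_t i) const
    {
        bool is_pair = m_units[i] >= 0xD800 && m_units[i] <= 0xDBFF
            && i + 1 < m_units.size()
            && m_units[i + 1] >= 0xDC00 && m_units[i + 1] <= 0xDFFF;
        return is_pair ? 2 : 1;
    }

    std::u16string_view m_units;
    std::size_t m_length_in_code_points { 0 };
};
```

The cached count is the "teeny bit of memory"; in exchange, offset lookups on pair-free text become constant time in this sketch.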
This seems to be a significant chunk of time spent in many regex
benchmarks.
This is preparation for removing the endianness override, since it was
only used by a single client: LibTextCodec.
While here, add helpers and make use of simdutf for fast conversion.
This ends up making RegexStringView smaller, which means less stuff to
copy when forking in the regex engine.
Thanks to Leon for suggesting the [[no_unique_address]] trick!
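For readers unfamiliar with the attribute, here is a generic illustration of how [[no_unique_address]] can shrink a type. It does not reproduce the actual RegexStringView change; `EmptyTag`, `PlainView` and `PackedView` are hypothetical types, and the size saving depends on the compiler and ABI (some compilers ignore the attribute).

```cpp
#include <cassert>
#include <cstddef>

// An empty member still occupies at least one byte, plus padding, by default.
struct EmptyTag { };

struct PlainView {
    EmptyTag tag;
    void const* data;
};

// With the attribute, the compiler is allowed to overlap the empty member
// with other members, so it may take no space at all.
struct PackedView {
    [[no_unique_address]] EmptyTag tag;
    void const* data;
};

// The attribute can only ever make the type the same size or smaller.
static_assert(sizeof(PackedView) <= sizeof(PlainView));
```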