Originally I added this to use it in Utf16View::ends_with(), but the
final implementation ended up a lot simpler. I chose to keep this anyway
since it mirrors Span::starts_with().
There were a couple of issues here. For one, the following computation
could actually wrap around to NumericLimits<size_t>::max():
code_unit_offset -= it.length_in_code_units();
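To illustrate the failure mode, here is a minimal standalone sketch. It uses plain `size_t` values rather than the AK iterator types, and `checked_subtract` is a hypothetical helper, not necessarily the fix that landed.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>

// Unsigned subtraction never goes negative; it wraps around instead.
inline size_t unchecked_subtract(size_t offset, size_t length)
{
    return offset - length;
}

// Hypothetical guard: report failure instead of wrapping.
inline std::optional<size_t> checked_subtract(size_t offset, size_t length)
{
    if (length > offset)
        return std::nullopt;
    return offset - length;
}

inline void wrap_around_example()
{
    // Subtracting 1 from 0 yields the largest representable size_t.
    assert(unchecked_subtract(0, 1) == std::numeric_limits<size_t>::max());
    assert(!checked_subtract(0, 1).has_value());
    assert(checked_subtract(3, 2).value() == 1);
}
```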
And do the same for Utf8View::code_point_offset_of(). Some of these
`VERIFY`s of the view's length were introduced recently, but they caused
the parsing of named capture groups in RegexParser to crash in some
situations.
Instead, allow indexing at the view's length: the byte offset of code
point `length()` is known, even though that code point does not exist in
the view. Similarly, we know the code point offset at byte offset
`byte_length()`. Beyond those offsets, we still crash.
Fixes 13 failures in test262's `language/literals/regexp/named-groups`.
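The boundary rule can be sketched as follows. This is a simplified stand-in operating on `std::string_view`, not the actual Utf8View code; it assumes valid UTF-8 and uses `assert` where the real code would `VERIFY`. The function names mirror the ones in this message, but the signatures are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Assumes the input is valid UTF-8.
inline size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    return 4;
}

// Byte offset of the code point at `code_point_index`.
// An index equal to the number of code points is allowed and yields size().
inline size_t byte_offset_of(std::string_view utf8, size_t code_point_index)
{
    size_t byte_offset = 0;
    size_t code_points_seen = 0;
    while (byte_offset < utf8.size() && code_points_seen < code_point_index) {
        byte_offset += sequence_length(static_cast<unsigned char>(utf8[byte_offset]));
        ++code_points_seen;
    }
    // Anything past the one-past-the-end index is still a hard failure.
    assert(code_points_seen == code_point_index);
    return byte_offset;
}

// Code point offset at `byte_offset`, which is assumed to sit on a
// code point boundary. A byte offset equal to size() is allowed.
inline size_t code_point_offset_of(std::string_view utf8, size_t byte_offset)
{
    assert(byte_offset <= utf8.size());
    size_t code_points = 0;
    for (size_t i = 0; i < byte_offset; i += sequence_length(static_cast<unsigned char>(utf8[i])))
        ++code_points;
    return code_points;
}
```

For a view holding 3 code points in 7 bytes, `byte_offset_of(view, 3)` returns 7 and `code_point_offset_of(view, 7)` returns 3, while `byte_offset_of(view, 4)` still aborts.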
Utf16FlyString works more or less the same as FlyString. It will
store the raw encoded data of the string instance. If the string is a
short ASCII string, Utf16FlyString holds the ShortString bytes; else,
Utf16FlyString holds a pointer to the Utf16StringData.
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.
In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.
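As a rough illustration of the UTF-16 mode, the sketch below treats a byte buffer as a series of 2-byte code units. `Utf16ModeBuilder` is a hypothetical stand-in, not the real StringBuilder API; it assumes valid UTF-8 input and stores code units in the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a StringBuilder in UTF-16 mode; not the AK class.
class Utf16ModeBuilder {
public:
    // A single character becomes one 2-byte code unit.
    void append(char ch)
    {
        append_code_unit(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    }

    // A UTF-8 string is transcoded to UTF-16. Assumes valid UTF-8 input.
    void append(std::string_view utf8)
    {
        for (size_t i = 0; i < utf8.size();) {
            auto lead = static_cast<unsigned char>(utf8[i]);
            size_t length = lead < 0x80 ? 1 : (lead & 0xE0) == 0xC0 ? 2 : (lead & 0xF0) == 0xE0 ? 3 : 4;
            if (i + length > utf8.size())
                break; // Truncated sequence: stop rather than read out of bounds.

            auto code_point = static_cast<char32_t>(length == 1 ? lead : lead & (0xFF >> (length + 1)));
            for (size_t j = 1; j < length; ++j)
                code_point = (code_point << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);

            append_code_point(code_point);
            i += length;
        }
    }

    size_t byte_length() const { return m_bytes.size(); }

    // Reinterprets the accumulated bytes as UTF-16 code units.
    std::u16string to_u16string() const
    {
        std::u16string result(m_bytes.size() / sizeof(char16_t), u'\0');
        if (!m_bytes.empty())
            std::memcpy(result.data(), m_bytes.data(), m_bytes.size());
        return result;
    }

private:
    void append_code_point(char32_t code_point)
    {
        if (code_point < 0x10000) {
            append_code_unit(static_cast<char16_t>(code_point));
            return;
        }
        // Code points above the BMP are split into a surrogate pair.
        code_point -= 0x10000;
        append_code_unit(static_cast<char16_t>(0xD800 + (code_point >> 10)));
        append_code_unit(static_cast<char16_t>(0xDC00 + (code_point & 0x3FF)));
    }

    void append_code_unit(char16_t code_unit)
    {
        unsigned char raw[sizeof(code_unit)];
        std::memcpy(raw, &code_unit, sizeof(code_unit));
        m_bytes.insert(m_bytes.end(), raw, raw + sizeof(code_unit));
    }

    std::vector<unsigned char> m_bytes;
};

inline void utf16_mode_builder_example()
{
    Utf16ModeBuilder builder;
    builder.append('a');
    assert(builder.byte_length() == 2);

    builder.append(std::string_view("\xF0\x9F\x98\x80")); // U+1F600 in UTF-8
    assert(builder.byte_length() == 6);
    assert(builder.to_u16string() == u"a\U0001F600");
}
```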
Utf16String also gains the same memory optimization that we added for
String, where we hand off the underlying buffer to Utf16String to avoid
having to re-allocate.
In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
This is a strictly UTF-16 string with some optimizations for ASCII.
* If created from a short UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an inlined byte buffer.
* If created with a long UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an outlined char buffer.
* If created with a short or long UTF-8 or UTF-16 string that is not
ASCII, then the string is stored in an outlined char16 buffer.
We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
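The storage rules above could be sketched like this. The `Storage` enum, `choose_storage`, and the inline capacity of 7 are all hypothetical names and values picked for illustration; the real short string capacity is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class Storage {
    InlinedAscii,   // short and ASCII: inlined byte buffer
    OutlinedAscii,  // long and ASCII: outlined char buffer
    OutlinedUtf16,  // anything non-ASCII: outlined char16 buffer
};

// Hypothetical capacity of the inlined buffer, in bytes.
constexpr size_t max_inlined_length = 7;

inline bool is_ascii(std::u16string_view string)
{
    for (char16_t code_unit : string) {
        if (code_unit >= 0x80)
            return false;
    }
    return true;
}

inline Storage choose_storage(std::u16string_view string)
{
    // Non-ASCII text is never inlined, no matter how short it is.
    if (!is_ascii(string))
        return Storage::OutlinedUtf16;
    return string.size() <= max_inlined_length ? Storage::InlinedAscii : Storage::OutlinedAscii;
}

inline void storage_example()
{
    assert(choose_storage(u"abc") == Storage::InlinedAscii);
    assert(choose_storage(u"a considerably longer string") == Storage::OutlinedAscii);

    // U+1F600 is short, but it is 2 code units and must stay UTF-16.
    std::u16string_view emoji = u"\U0001F600";
    assert(choose_storage(emoji) == Storage::OutlinedUtf16);
    assert(emoji.size() == 2);
    assert(emoji[0] == 0xD83D);
}
```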
This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This
code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these
bytes are ASCII if interpreted as UTF-8. But the string itself is most
certainly not ASCII.
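A small sketch of the pitfall, assuming the mistake was an ASCII check performed on the raw bytes rather than on the code units. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string_view>

// The mistaken check: looks at each byte of the UTF-16 data individually.
inline bool all_bytes_ascii(std::u16string_view string)
{
    for (char16_t code_unit : string) {
        unsigned char bytes[sizeof(code_unit)];
        std::memcpy(bytes, &code_unit, sizeof(code_unit));
        if (bytes[0] >= 0x80 || bytes[1] >= 0x80)
            return false;
    }
    return true;
}

// The intended check: looks at whole code units.
inline bool all_code_units_ascii(std::u16string_view string)
{
    for (char16_t code_unit : string) {
        if (code_unit >= 0x80)
            return false;
    }
    return true;
}

inline void ascii_check_example()
{
    // U+201C is made of the bytes 0x1c and 0x20, both below 0x80.
    assert(all_bytes_ascii(u"\u201C"));
    assert(!all_code_units_ascii(u"\u201C"));
}
```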
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float
However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.
We can also use fast_float for integer parsing.
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.
This also adds a UDL to construct a Utf16View from a string literal:
auto string = u"hello"sv;
This lets us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
We now explicitly enable support for the minimum set of libraries needed
to build and run the AK test suite. 81/82 tests are running and
passing. The exception is LexicalPath, as some path behaviour on
Windows differs from Unix, so the current tests will have lots of
platform-specific failures. The implementer of LexicalPathWindows
recommended Windows-specific tests here, so I will do that in a
follow-up.
This test was only useful when AK/PrintfImplementation.h existed. But
that was removed 11 months ago, so since then this has just been
testing std library functions not implemented by us.
This gives us a more human-looking serialization of numbers by default,
and in case a fixed number of decimal digits is actually wanted, we
still have the 'f' specifier.
A challenge for getting LibTest working on Windows has always
been CrashTest. It implements death tests similar to Google Test
where a child process is cloned to invoke the expression that
should abort/terminate the program. Then the exit code of the
child is used by the parent test process to verify if the
application correctly aborted/terminated due to invoking
the expression.
The problem was that finding an equivalent way to port Crash::run()
to Windows was not looking very likely as publicly exposed Win32/
Native APIs have no equivalent to fork(); however, Windows actually
does have native support for process cloning via undocumented NT
APIs that clever people reverse engineered and published, see
`NtCreateUserProcess()`.
All that being said, this `EXPECT_DEATH()` implementation avoids
needing to use a child process in general, allowing us to remove
CrashTest in favour of a single cross-platform solution for death
tests.
Instead of going through String::formatted(), we now have a specialized
code path for base-10 serialization directly to UTF-8.
This is roughly 5-10x faster than the previous implementation, depending
on how many digits we end up outputting.
1.07x speedup on MicroBench/for-in-indexed-properties.js
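For reference, a specialized base-10 path can be as small as the sketch below. It is a generic illustration for unsigned 64-bit values that writes ASCII digits (which are also valid UTF-8) into a stack buffer; `serialize_base10` is a hypothetical name, not the function added here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

inline std::string serialize_base10(uint64_t value)
{
    // The largest uint64_t value has 20 decimal digits.
    char buffer[20];
    size_t position = sizeof(buffer);

    // Emit digits from least to most significant, filling the buffer backwards.
    do {
        buffer[--position] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);

    return std::string(buffer + position, sizeof(buffer) - position);
}

inline void serialize_base10_example()
{
    assert(serialize_base10(0) == "0");
    assert(serialize_base10(42) == "42");
    assert(serialize_base10(UINT64_MAX) == "18446744073709551615");
}
```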
This will change behaviour for moved-from `Optional<T>`s, since they
will no longer clear their value if `T` is trivial. However, a
moved-from value should be considered to be in an unspecified state.
Use `Optional<T>::clear` or `Optional<T>::release_value` instead.
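The same caveat applies to `std::optional`, which is used below purely as a stand-in to show the pattern; it is not AK's `Optional`. `take_and_clear` is a hypothetical helper written in the spirit of `release_value`.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Takes the value out and explicitly leaves the optional empty.
inline int take_and_clear(std::optional<int>& slot)
{
    assert(slot.has_value());
    return *std::exchange(slot, std::nullopt);
}

inline void moved_from_optional_example()
{
    std::optional<int> source = 5;
    std::optional<int> destination = std::move(source);
    assert(destination == 5);

    // Moving from a std::optional does not disengage the source,
    // so code must not rely on it being empty afterwards.
    assert(source.has_value());

    // Clearing has to be requested explicitly.
    assert(take_and_clear(source) == 5);
    assert(!source.has_value());
}
```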