Before porting to UTF-16, these instances held a String. The port to
UTF-16 changed them to hold the original string as a StringView, and
lazily allocated the UTF-16 message as needed. This somehow negatively
impacting the zlib.js benchmark in the Octane suite.
This ports the lexer to UTF-16 and deals with the immediate fallout up
to the AST. The AST will be dealt with in upcoming commits.
The lexer will still accept UTF-8 strings as input, and will transcode
them to UTF-16 for lexing. This doesn't actually incur a new allocation,
as we were already converting the input StringView to a ByteString for
each lexer.
One immediate logical benefit here is that we do not need to know off-
hand how many UTF-8 bytes some special code points occupy. They all
happen to be a single UTF-16 code unit. So instead of advancing the
lexer by 3 positions in some cases, we can just always advance by 1.
For the web, we allow a wobbly UTF-16 encoding (i.e. lonely surrogates
are permitted). Only in a few exceptional cases do we strictly require
valid UTF-16. As such, our `validate(AllowLonelySurrogates::Yes)` calls
will always succeed. It's a wasted effort to ever make such a check.
This patch eliminates such invocations. The validation methods will now
only check for strict UTF-16, and are only invoked when needed.
This has quite a lot of fall out. But the majority of it is just type or
UDL substitution, where the changes just fall through to other function
calls.
By changing property key storage to UTF-16, the main affected areas are:
* NativeFunction names must now be UTF-16
* Bytecode identifiers must now be UTF-16
* Module/binding names must now be UTF-16
This reverts commit c14173f651. We
should only annotate the minimum number of symbols that external
consumers actually use, so I am starting from scratch to do that
RegExp was the only caller relying on being able to provide an offset
larger than the string length. So let's do a pre-check in RegExp and
then enforce that the offsets we receive in Utf16View are valid.
There apparently is a bit of a disconnect between the spec asking us to
construct the pattern using code points and LibRegex not being able to
swallow those. Whenever we had multi-byte code points in the pattern and
tried to match that in unicode mode, we would fail.
Change the parser to encode all non-ASCII code units. Fixes 2 test262
cases in `language/literals/regexp`.
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.
In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.
Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.
In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
This is a strictly UTF-16 string with some optimizations for ASCII.
* If created from a short UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an inlined byte buffer.
* If created with a long UTF-8 or UTF-16 string that is also ASCII,
then the string is stored in an outlined char buffer.
* If created with a short or long UTF-8 or UTF-16 string that is not
ASCII, then the string is stored in an outlined char16 buffer.
We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
This will allow structured deserialization in LibWeb to create shared
buffers without having to go through CreateSharedByteDataBlock. That
will let us pass in the underlying ByteBuffer, rather than having to
allocate a second buffer via CreateSharedByteDataBlock and memcpy the
data into it.
This is an active proposal at stage 3 of the TC39 proposal process.
See: https://tc39.es/proposal-dynamic-code-brand-checks/
See: https://github.com/tc39/proposal-dynamic-code-brand-checks
This proposal essentially adds support for the TrustedScript type from
the Trusted Types specification to eval and Function. This in turn
pipes support for the type into the CSP hook to check if the CSP allows
dynamic code compilation.
However, it currently doesn't support ShadowRealms, so the
implementation here is a close approximation, using PerformEval as the
basis.
See: https://github.com/tc39/proposal-dynamic-code-brand-checks/issues/19
This is required to support the new function signature for the CSP
hook, and will allow us to slot in Trusted Types support in the future.
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float
However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.
We can also use fast_float for integer parsing.