This way, we still perform UTF-8 validation, but don't go through the
slow generic code path that rebuilds the decoded string one code point
at a time.
This was a bottleneck when loading a canned copy of reddit.com, which
ended up being ~120 MiB.
- Time spent decoding UTF-8 before this change: 1192 ms
- Time spent decoding UTF-8 after this change: 154 ms
That's still a long time, but 7.7x faster is nothing to sneeze at! :^)
Note that if the input fails UTF-8 validation, we still fall back to
the slow path and insert replacement characters per the WHATWG Encoding
spec: https://encoding.spec.whatwg.org/#utf-8-decode
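In sketch form, the idea is something like this (illustrative code, not
the actual LibTextCodec implementation): validate the whole buffer once,
and if it passes, adopt the bytes wholesale; otherwise hand the input to
the slow, replacing path.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Returns true if `input` is well-formed UTF-8 (no truncated sequences,
// overlong encodings, surrogates, or out-of-range code points).
static bool is_valid_utf8(std::string_view input)
{
    size_t i = 0;
    while (i < input.size()) {
        auto byte = static_cast<uint8_t>(input[i]);
        if (byte < 0x80) { ++i; continue; }
        size_t length = 0;
        uint32_t min_code_point = 0;
        if ((byte & 0xE0) == 0xC0) { length = 2; min_code_point = 0x80; }
        else if ((byte & 0xF0) == 0xE0) { length = 3; min_code_point = 0x800; }
        else if ((byte & 0xF8) == 0xF0) { length = 4; min_code_point = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte
        if (i + length > input.size())
            return false; // truncated sequence
        uint32_t code_point = byte & (0x7F >> length);
        for (size_t j = 1; j < length; ++j) {
            auto continuation = static_cast<uint8_t>(input[i + j]);
            if ((continuation & 0xC0) != 0x80)
                return false;
            code_point = (code_point << 6) | (continuation & 0x3F);
        }
        if (code_point < min_code_point || code_point > 0x10FFFF)
            return false; // overlong or out of range
        if (code_point >= 0xD800 && code_point <= 0xDFFF)
            return false; // surrogate
        i += length;
    }
    return true;
}

std::optional<std::string> try_fast_decode_utf8(std::string_view input)
{
    if (!is_valid_utf8(input))
        return std::nullopt; // caller falls back to the slow, replacing path
    // Fast path: one validation pass, then a bulk copy of the bytes,
    // instead of rebuilding the string one code point at a time.
    return std::string(input);
}
```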
The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 for 0x80-0x9F.
Other contexts, however, expect requesting ISO-8859-1 to yield the
actual encoding, with its 1-to-1 mapping to U+0000-U+00FF.
This introduces `decoder_for_exact_name`, which skips the mapping from
aliases to the canonical encoding name that `get_standardized_encoding`
performs.
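For illustration, here is roughly where the two mappings diverge
(hypothetical helper names, not the actual LibTextCodec tables):

```cpp
#include <cstdint>

// windows-1252 overrides most of the 0x80-0x9F range that ISO-8859-1
// maps 1:1 to the C1 controls.
static constexpr uint32_t windows1252_high[0x20] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Exact ISO-8859-1: every byte maps 1:1 to U+0000-U+00FF.
uint32_t decode_latin1_exact(uint8_t byte) { return byte; }

// windows-1252, which the Encoding spec substitutes for ISO-8859-1:
// e.g. byte 0x93 becomes U+201C (left double quotation mark) here,
// but stays U+0093 under the exact mapping above.
uint32_t decode_windows1252(uint8_t byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return windows1252_high[byte - 0x80];
    return byte;
}
```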
UTF8Decoder was already replacing invalid data with replacement
characters during conversion, so we know for sure we have valid UTF-8
by the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
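The decoder then takes roughly this shape (signatures approximate):

```cpp
ErrorOr<String> UTF8Decoder::to_utf8(StringView input)
{
    StringBuilder builder;
    // process() has already replaced every invalid sequence with U+FFFD
    // by the time a code point reaches this callback...
    TRY(process(input, [&builder](u32 code_point) {
        return builder.try_append_code_point(code_point);
    }));
    // ...so the builder's contents are known-valid UTF-8 and the second
    // validation pass that to_string() would perform can be skipped.
    return builder.to_string_without_validation();
}
```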
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
and replace invalid sequences with U+FFFD. This is required by the web.
Using char causes bytes equal to or over 0x80 to be treated as negative
values and to produce incorrect results when implicitly cast to u32.
For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80, so it produced
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].
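The underlying C++ pitfall, in isolation (0xBB is the first decoded
byte, 187, from the example above):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    char as_char = static_cast<char>(0xBB); // char is signed on most targets
    unsigned char as_u8 = 0xBB;

    // Implicit widening sign-extends the signed char:
    std::printf("0x%08X\n", static_cast<uint32_t>(as_char)); // 0xFFFFFFBB
    std::printf("0x%08X\n", static_cast<uint32_t>(as_u8));   // 0x000000BB
}
```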
Required by Cloudflare's IUAM challenges.
There were two problems:
1. They didn't handle surrogates
2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4
Also add a basic test that catches both issues.
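For illustration, both fixes boil down to something like this sketch
(names hypothetical, not the actual decoder code):

```cpp
#include <cstddef>
#include <cstdint>

// Fix for problem 2: read code units through unsigned bytes, so 0xE4
// widens to 0x00E4 rather than sign-extending to 0xFFE4.
static uint16_t read_code_unit_le(unsigned char const* bytes, size_t i)
{
    return static_cast<uint16_t>(bytes[i])
        | (static_cast<uint16_t>(bytes[i + 1]) << 8);
}

// Fix for problem 1: combine a surrogate pair into one code point,
// with `high` in [0xD800, 0xDBFF] and `low` in [0xDC00, 0xDFFF].
// E.g. 0xD83D + 0xDE00 -> U+1F600.
static uint32_t combine_surrogate_pair(uint16_t high, uint16_t low)
{
    return 0x10000u + ((static_cast<uint32_t>(high) - 0xD800u) << 10)
        + (static_cast<uint32_t>(low) - 0xDC00u);
}
```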
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.