These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump the clang-format version. So, since there are no
real downsides, let's commit them now.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
    $ xs=$(ack -l '\bDeprecatedString\b|deprecated_string' AK Userland \
        Meta Ports Ladybird Tests Kernel)
    $ perl -pi -e 's/\bDeprecatedString\b/ByteString/g;
        s/deprecated_string/byte_string/g' $xs
    $ clang-format --style=file -i \
        $(git diff --name-only | grep -E '\.(cpp|h)$')
    $ gn format $(git ls-files '*.gn' '*.gni')
Previously we were unable to parse code like `yield/2` because `/2`
was parsed as a regex. At the same time `for (a in / b/)` was parsed
as a division.
This is solved by defaulting to division in the lexer, but calling
`force_slash_as_regex()` from the parser whenever an IdentifierName
is parsed as a ReservedWord.
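A minimal sketch of the idea, not the actual LibJS code: the lexer
always emits a plain slash token, and the parser asks for a re-lex when
the context demands a regex. MiniLexer, Token and TokenType are
hypothetical names; only force_slash_as_regex() is taken from the text
above, and its exact signature here is an assumption.

```cpp
#include <cstddef>
#include <string_view>

enum class TokenType { Slash, RegexLiteral, Other, Eof };

struct Token {
    TokenType type;
    std::string_view text;
};

// Hypothetical miniature lexer: tokens are separated by spaces or '/'.
class MiniLexer {
public:
    explicit MiniLexer(std::string_view source)
        : m_source(source)
    {
    }

    Token next()
    {
        while (m_position < m_source.size() && m_source[m_position] == ' ')
            ++m_position;
        if (m_position >= m_source.size())
            return { TokenType::Eof, {} };
        m_last_start = m_position;
        if (m_source[m_position] == '/') {
            // Default: a '/' is always lexed as division.
            ++m_position;
            return { TokenType::Slash, m_source.substr(m_last_start, 1) };
        }
        while (m_position < m_source.size() && m_source[m_position] != ' '
            && m_source[m_position] != '/')
            ++m_position;
        return { TokenType::Other, m_source.substr(m_last_start, m_position - m_last_start) };
    }

    // Called by the parser right after it received a Slash token in a
    // position where only a regex makes sense. Re-lexes from that slash.
    Token force_slash_as_regex()
    {
        size_t start = m_last_start;
        m_position = start + 1;
        while (m_position < m_source.size() && m_source[m_position] != '/')
            ++m_position;
        if (m_position < m_source.size())
            ++m_position; // consume the closing '/'
        return { TokenType::RegexLiteral, m_source.substr(start, m_position - start) };
    }

private:
    std::string_view m_source;
    size_t m_position { 0 };
    size_t m_last_start { 0 };
};
```

With this split, `yield/2` lexes as three tokens without any parser
involvement, while the parser can turn the slash after `in` into the
regex `/ b/` by calling force_slash_as_regex().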
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, and B) make
room for a new FlyString class that is tied to String.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Before this change we would ignore that the second backslash is escaped,
and template strings ending with ` \\` would be unterminated because the
second backslash was used to escape the closing backtick.
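One way to avoid this class of bug, sketched under assumptions: consume
a backslash and the character after it as a single unit, so an escaped
backslash can never start a second escape. find_template_end() is a
hypothetical helper, not a LibJS function, and it ignores `${...}`
substitutions.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the index of the closing backtick of a template literal whose
// opening backtick is at source[0], or std::nullopt if it is unterminated.
inline std::optional<size_t> find_template_end(std::string_view source)
{
    if (source.empty() || source[0] != '`')
        return std::nullopt;
    for (size_t i = 1; i < source.size(); ++i) {
        if (source[i] == '\\') {
            // Skip the escaped character, whatever it is. Because the
            // pair is consumed together, the second backslash of a
            // doubled backslash is never treated as an escape itself.
            ++i;
            continue;
        }
        if (source[i] == '`')
            return i;
    }
    return std::nullopt;
}
```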
While null StringViews are just as bad, these prevent the removal of
StringView(char const*) as that constructor accepts a nullptr.
No functional changes.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
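For illustration only, the same trade-off can be shown with the
standard library's std::string_view, which is used here as a stand-in
for AK's StringView (an assumption; the two types are not identical).
from_c_string() is a hypothetical helper.

```cpp
#include <string_view>

using namespace std::literals; // brings operator""sv into scope

// Runtime path: the char const* constructor has to scan for the
// terminating NUL to find the length.
inline std::string_view from_c_string(char const* string)
{
    return std::string_view(string);
}

// Literal path: the compiler hands the length to operator""sv, so no
// scan is needed and the view can be a compile-time constant.
constexpr std::string_view keyword = "function"sv;
static_assert(keyword.size() == 8);

// A literal with an embedded NUL keeps its full length this way.
constexpr std::string_view with_nul = "a\0b"sv;
static_assert(with_nul.size() == 3);
```

The literal form also cannot be handed a nullptr, which is the property
that matters for removing the char const* constructor.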
If the current character under the lexer cursor is ASCII, we don't need
to create a Utf8View to consume a full code point.
This gives a ~3% speedup when parsing the largest Discord JS file.
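A rough sketch of such a fast path, assuming a byte-oriented cursor.
consume_code_point(), decode_two_byte() and the call counter are
hypothetical; the stand-in decoder only handles well-formed 2-byte
sequences to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Counts how often the general decoder runs, so the effect of the fast
// path can be observed.
inline size_t g_slow_path_calls = 0;

// Stand-in for the general decoder (the commit uses a Utf8View here).
inline uint32_t decode_two_byte(std::string_view source, size_t& position)
{
    ++g_slow_path_calls;
    uint32_t code_point = ((static_cast<unsigned char>(source[position]) & 0x1F) << 6)
        | (static_cast<unsigned char>(source[position + 1]) & 0x3F);
    position += 2;
    return code_point;
}

inline uint32_t consume_code_point(std::string_view source, size_t& position)
{
    unsigned char byte = static_cast<unsigned char>(source[position]);
    if (byte < 0x80) {
        // ASCII: the byte is the code point, no decoder needed.
        ++position;
        return byte;
    }
    return decode_two_byte(source, position);
}
```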
Previously we might swallow an invalid Unicode code point, which would
skip valid ASCII characters. This could be dangerous as we might skip a
'"' and thus not close a string where we should.
This might have been exploitable as it would not have been clear what
code gets executed when looking at a script.
Another approach to this would be simply replacing all invalid
characters with the replacement character (this is what V8 does). But
our lexer and parser are currently not set up for such a change.
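The commit message does not spell out the fix, so the following is only
one possible shape of it: validate the continuation bytes before
trusting the length announced by the lead byte, and consume just the
offending byte otherwise. code_point_length() is a hypothetical helper;
overlong encodings and surrogates are not rejected in this sketch.

```cpp
#include <cstddef>
#include <string_view>

// Returns how many bytes the code point starting at source[position]
// occupies. A malformed sequence yields 1, so a following ASCII
// character such as '"' is still seen by the caller.
inline size_t code_point_length(std::string_view source, size_t position)
{
    auto byte_at = [&](size_t index) { return static_cast<unsigned char>(source[index]); };
    unsigned char lead = byte_at(position);
    size_t expected = 1;
    if (lead < 0x80)
        return 1;
    else if ((lead & 0xE0) == 0xC0)
        expected = 2;
    else if ((lead & 0xF0) == 0xE0)
        expected = 3;
    else if ((lead & 0xF8) == 0xF0)
        expected = 4;
    else
        return 1; // stray continuation byte or invalid lead byte
    if (position + expected > source.size())
        return 1; // truncated sequence
    for (size_t i = 1; i < expected; ++i) {
        if ((byte_at(position + i) & 0xC0) != 0x80)
            return 1; // not a continuation byte, do not swallow it
    }
    return expected;
}
```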
Before this a closing HTML comment would not be treated as a comment if
directly following a block comment which was not the first token of its
first line.
This commit adds support for the most bare-bones version of async
functions. Support for async generator functions, async arrow functions,
and await expressions is still TODO.
The position is advanced manually in the line terminator and Unicode
character cases. While it checks for EOF after doing so, the EOF check
used `!=` instead of `<`, meaning that if the position went _over_ the
source length, the lexer wouldn't think it was at EOF and would read
past the end of the source buffer.
For example, `0xea` followed by `0xfd` would cause this.
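The difference between the two checks can be shown with a tiny cursor.
Cursor and its members are hypothetical names. The sketch assumes the
cursor is advanced by three bytes because the lead byte 0xEA announces
a 3-byte UTF-8 sequence; how exactly the lexer advanced is not stated
above.

```cpp
#include <cstddef>
#include <string_view>

struct Cursor {
    std::string_view source;
    size_t position { 0 };

    // Old check: only true EOF when position lands exactly on the end.
    bool has_more_buggy() const { return position != source.size(); }

    // New check: any position at or beyond the end counts as EOF.
    bool has_more() const { return position < source.size(); }
};
```

With a two-byte source and a position of three, has_more_buggy() still
reports more input while has_more() correctly reports EOF.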
By using the FlyString(StringView) constructor instead of the
FlyString(String) one, we can dodge a temporary String construction.
This improves parsing time on a large chunk of JS by ~1.6%.
Before this change, Lexer::is_identifier_{start,middle}() would do a
Unicode property lookup via Unicode::code_point_has_property() quite
frequently, especially for common characters like .,;{}[]() etc.
Since these and any other ASCII characters not covered by the alpha /
alphanumeric check are known to not have the ID_Start / ID_Continue
(except '_', which is special-cased now) properties, we can easily
avoid this function call.
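Roughly, the fast path looks like the sketch below. slow_has_id_start()
is a hypothetical stand-in for the real property lookup and only knows
the Greek lowercase letters; the '$' case is not mentioned above and is
added here because ECMAScript allows it at the start of an identifier.

```cpp
#include <cstddef>
#include <cstdint>

// Counts calls to the expensive lookup so the saving is observable.
inline size_t g_property_lookups = 0;

// Stand-in for Unicode::code_point_has_property(). Hypothetical: it only
// recognizes U+03B1..U+03C9.
inline bool slow_has_id_start(uint32_t code_point)
{
    ++g_property_lookups;
    return code_point >= 0x3B1 && code_point <= 0x3C9;
}

inline bool is_identifier_start(uint32_t code_point)
{
    bool is_ascii_alpha = (code_point >= U'a' && code_point <= U'z')
        || (code_point >= U'A' && code_point <= U'Z');
    if (is_ascii_alpha || code_point == U'_' || code_point == U'$')
        return true;
    // Every remaining ASCII character is known not to qualify, so the
    // lookup is skipped for punctuation, digits and whitespace.
    if (code_point < 0x80)
        return false;
    return slow_has_id_start(code_point);
}
```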
When we save/load state in the parser, we preserve the lexer state by
simply making a copy of it. This was made extremely heavy by the lexer
keeping a cache of all parsed identifiers.
It keeps the cache to ensure that StringViews into parsed Unicode escape
sequences don't become dangling views when the Token goes out of scope.
This patch solves the problem by replacing the Vector<FlyString> which
was used to cache the identifiers with a ref-counted
HashTable<FlyString> instead.
Since the purpose of the cache is just to keep FlyStrings alive, it's
fine for all Lexer instances to share the cache. And as a bonus, using a
HashTable instead of a Vector replaces the O(n) accesses with O(1) ones.
This makes a 1.9 MiB JavaScript file parse in 0.6s instead of 24s. :^)
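The shape of the change, sketched with standard library types as
stand-ins: std::shared_ptr plays the role of the ref-counting and
std::unordered_set<std::string> the role of HashTable<FlyString>. The
Lexer class and its member names below are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_set>
#include <utility>

class Lexer {
public:
    explicit Lexer(std::string_view source)
        : m_source(source)
        , m_identifier_cache(std::make_shared<std::unordered_set<std::string>>())
    {
    }

    // Stores the evaluated identifier and returns a view into the cached
    // copy. The view stays valid while any Lexer sharing the cache lives.
    std::string_view remember_identifier(std::string evaluated)
    {
        return *m_identifier_cache->insert(std::move(evaluated)).first;
    }

    size_t cached_identifier_count() const { return m_identifier_cache->size(); }
    long cache_owner_count() const { return m_identifier_cache.use_count(); }
    size_t position() const { return m_position; }

private:
    std::string_view m_source;
    size_t m_position { 0 };
    // Copying a Lexer copies this pointer, not the set behind it.
    std::shared_ptr<std::unordered_set<std::string>> m_identifier_cache;
};
```

Saving parser state then costs a pointer copy and a reference count
bump, regardless of how many identifiers have been seen so far.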
This bug was discovered via OSS-Fuzz. It's possible to fall through
to this assert with char_size == 1, so we need to account for that
in the VERIFY(..).
Repro test case can be found in the OSS-Fuzz bug:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37296
Added a test to ensure the behavior stays the same.
We now throw on direct usage of an escaped keyword with a specific
error to make it clearer to the user.
For example, "property.br\u{64}wn" should resolve to "property.brown".
To support this behavior, this commit changes the Token class to hold
both the evaluated identifier name and a view into the original source
for the unevaluated name. There are some contexts in which identifiers
are not allowed to contain Unicode escape sequences; for example, export
statements of the form "export {} from foo.js" forbid escapes in the
identifier "from".
The test file is added to .prettierignore because prettier will replace
all escaped Unicode sequences with their unescaped value.
If we consumed whitespace and/or comments after a RegexLiteral token,
the following token must not be RegexFlags - no whitespace or comments
are allowed between the closing / and the flag characters.
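The rule can be modeled with a small helper that only looks at the
characters immediately after the closing slash. This is a simplified
model and not how the lexer tracks consumed whitespace internally;
regex_flags() is a hypothetical name, and flags are approximated as
ASCII letters.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// `rest` is the source text immediately after the closing '/'.
// Returns the flag characters, or an empty view if anything else
// (whitespace, a comment, punctuation) comes first.
inline std::string_view regex_flags(std::string_view rest)
{
    size_t length = 0;
    while (length < rest.size() && std::isalpha(static_cast<unsigned char>(rest[length])))
        ++length;
    return rest.substr(0, length);
}
```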
Fixes #8201.
Stage 3 since August 2019 - we already have shebang stripping
implemented in js(1), so this removes it from there in favor of adding
support to the lexer directly.
Most straightforward proposal and implementation I've ever seen :^)
https://github.com/tc39/proposal-hashbang
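As a sketch of what the lexer has to do: a hashbang is only recognized
at the very start of the source and runs to the end of that line.
hashbang_length() is a hypothetical helper; it treats only LF and CR as
line terminators, whereas ECMAScript also recognizes U+2028 and U+2029.

```cpp
#include <cstddef>
#include <string_view>

// Returns the number of bytes taken up by a leading hashbang comment,
// excluding the line terminator, or 0 if the source has none.
inline size_t hashbang_length(std::string_view source)
{
    if (source.size() < 2 || source[0] != '#' || source[1] != '!')
        return 0;
    size_t end = source.find_first_of("\r\n");
    return end == std::string_view::npos ? source.size() : end;
}
```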