This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
This switches to using a simple string equality check if the regex
pattern is strictly a string literal.
Technically this optimisation can also be made on bounded literal
patterns like /[abc]def/ or /abc|def/ as well, but those are
significantly more complex to implement due to our bytecode-only
approach.
This makes the compiler assign a serial ID to each checkpoint instead of
using the IP as the identifier.
This will be used in a future commit to replace the backing store of
checkpoints with a vector.
In 7c5e30daaa, the focus was "only" on
Userland/Libraries/, whereas this commit cleans up the remaining
headers in the repo, and any new badly-formatted include.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Previously, for a regex such as /[a-sy-z]/i, we would incorrectly think
the character "u" fell into the range "a-s" because neither of the
conditions "u > s && U > s" or "u < a && U < a" would be true, resulting
in the lookup falling back to assuming the character is in the range.
Instead, first explicitly check if the character falls into the range,
rather than checking if it falls outside the range. If the explicit
checks fail, then we know the character is outside the range.
While null StringViews are just as bad, these prevent the removal of
StringView(char const*) as that constructor accepts a nullptr.
No functional changes.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
[^XYZ] is not(X | Y | Z), we used to translate this to
not(X) | not(Y) | not(Z), this commit makes LibRegex interpret this
pattern as not(X) & not(Y) & not(Z).
The lowercase version of a range is not required to be a valid range,
instead of casefolding the range and making it invalid, check twice with
both cases of the input character (which are the same as the input if
not insensitive).
This time includes an actual test :^)
We had a really naive and simplistic implementation, which lead to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
ECMA-262 defines \s as:
Return the CharSet containing all characters corresponding to a code
point on the right-hand side of the WhiteSpace or LineTerminator
productions.
The LineTerminator production is simply: U+000A, U+000D, U+2028, or
U+2029. Unfortunately there isn't a Unicode property that covers just
those code points.
The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any
code point with the Space_Separator general category.
If the Unicode generators are disabled, this will fall back to ASCII
space code points.
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
Instead of leaking all capture groups and selectively clearing some,
simply avoid leaking things and only "define" the ones that need to
exist.
This *actually* implements the capture groups ECMA262 quirk.
Also adds the test removed in the previous commit (to avoid messing up
test runs across bisects).
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
Doing so would increase memory consumption by quite a bit, since many
useless copies of the checkpoints hashmap would be created and later
thrown away.
- Make sure that all the Repeat ops are reset (otherwise the operation
would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
fork, otherwise the last failing fork would not have anywhere to
return to.
Fixes#9707.