Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
Finishes what 7f6b70fafb started.
Having one part use length and another code unit length lead to crashes,
the added test ensures we don't mess that up again.
This prevents empty matches from overwriting non-empty captures in
quantified alternations. Fixes patterns like (a|a?)+ where the optional
branch would incorrectly overwrite meaningful captures with empty
strings.
We had typo'd using ClassSetReservedDoublePunctuator which was
resulting in a parse error for the regex:
([^\\:]+?)
With the 'v' flag set.
Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
As LibRegex was not specified in TEST_DIRECTORIES, the existing
Tests/LibRegex subdirectory was not actually included during
configuration. Also the RegexLibC test has not been needed
since migration away from Serenitys LibC was done, so
that test has been fully removed. I also renamed the
Regex.cpp test to TestRegex.cpp to match the naming
convention of most test targets.
Fixes a bunch of websites breaking because we now verify jump offsets by
trying to remove 0-offset jumps.
This has been broken for a good while, it was just rare to see Repeat
inside alternatives that lended themselves well to tree alts.
This removes another Match member that required destruction. The "API"
for accessing the strings is definitely a bit awkward. We'll think of
something nicer eventually.
This mode made a lot of incorrect assumptions about string lifetimes,
and instead of fixing it, let's just remove it and tweak the few unit
tests that used it.
If some state has already been tried, skip over it as it would never
lead to a match regardless.
This fixes performance/memory issues in cases like
/(a+)+b/.exec("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
or
/(a|a?)+b/...
Fixes#2622.
Due to optimiser shenanigans in the tree alternative form, some
JumpNonEmpty ops might be moved before their Checkpoint instruction.
It is safe to assume the distance between the nonexistent checkpoint and
the current op is zero, so just do that.
Take record of the named capture group prior to parsing the group's
body. This requires removal of the recorded minimum length of the named
capture group directly, and now needs to be looked up via the group
minimu lengths table.
fe46b2c141 added the reset-temp-inverse flag, but set it up so all
tempinverse ops were negated at the start of the next op; this commit
makes it so these flags actually persist for one op and not zero.
Fixes#2296.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
- a
| - b
| | - c
| | | - <accept>
| | - <accept>
| - c
| | - <accept>
which will attempt to match 'a' or 'b' only once, and would also limit
the number of backtrackings performed in case alternatives fails to
match.
This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.