Commit graph

15 commits

Author SHA1 Message Date
Ali Mohammad Pur
97a333608e LibRegex: Make codegen+optimisation for alternatives much faster
Just a little thinking outside the box, and we can now parse and
optimise a million copies of "a|" chained together in just a second :^)
2022-02-20 11:53:59 +01:00
Ali Mohammad Pur
3b0943d24c LibRegex: Correct the alternative matching order when one is empty
Previously we were compiling `/a|/` into what was effectively `/|a/`,
which is clearly incorrect.
2022-02-14 11:30:50 +01:00
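Why the order matters can be sketched outside LibRegex. The snippet below is a minimal illustration, not the library's code: `first_matching_alternative()` is a hypothetical helper that tries literal alternatives from left to right, the way a backtracking matcher would, and returns the first one that is a prefix of the input.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of LibRegex): try each literal alternative in
// order and return the first one that is a prefix of `input`.
std::optional<std::string> first_matching_alternative(
    std::vector<std::string> const& alternatives, std::string_view input)
{
    for (auto const& alternative : alternatives) {
        // substr() clamps the length, so a too-long alternative simply fails to compare equal.
        if (input.substr(0, alternative.size()) == std::string_view(alternative))
            return alternative;
    }
    return std::nullopt;
}

// The same set of alternatives gives different results depending on order.
inline void check_alternative_order()
{
    assert(first_matching_alternative({ "a", "" }, "a") == "a"); // like /a|/
    assert(first_matching_alternative({ "", "a" }, "a") == "");  // like /|a/
}
```

With the alternatives in source order, the `a` branch is preferred whenever it can match; with the order swapped, the empty branch always wins and `a` is never consumed.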
Ali Mohammad Pur
6a4c8a66ae LibRegex: Only skip full instructions when optimizing alternations
It makes no sense to skip half of an instruction, so make sure to skip
only full instructions!
2022-02-09 21:02:24 +00:00
Ali Mohammad Pur
cd83325c7c LibRegex: Preserve capture groups and matches across ForkReplace
This makes the (flawed) ForkStay inserted as a loop header unnecessary,
and finally fixes LibRegex rewriting weird loops in weird ways.
2022-01-22 00:35:49 +00:00
Ali Mohammad Pur
bfe8f312f3 LibRegex: Correct jump offset to the start of the loop block
Previously we were jumping to the new end of the previous block (created
by the newly inserted ForkStay). Correct the offset to jump to the
correct block, as shown in the comments.
Fixes #12033.
2022-01-21 18:14:08 +03:30
Hendiadyoin1
b674de6957 LibRegex: Add some implied auto qualifiers
2021-12-21 18:17:28 -08:00
Ali Mohammad Pur
b8f03bb072 LibRegex: Make append_alternation() significantly faster
...by flattening the underlying bytecode chunks first.
Also avoid calling DisjointChunks::size() inside a loop.
This is a very significant improvement in performance, making the
compilation of a large regex with lots of alternatives take only ~100ms
instead of many minutes (I ran out of patience waiting for it) :^)
2021-12-21 22:10:07 +01:00
Ali Mohammad Pur
d2e51fafa9 LibRegex: Merge alternations based on blocks and not instructions
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes #11247.

Also adds the minimal test case(s) from that issue.
2021-12-15 19:36:45 +03:30
Ali Mohammad Pur
387df06385 LibRegex: Avoid rewriting a+ as a* as part of atomic rewriting
The initial `ForkStay` is only needed if the looping block has a
following block. If there's no following block, or the following block
does not attempt to match anything, we should not insert the ForkStay;
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes #10952.
2021-11-18 09:09:22 +01:00
Ali Mohammad Pur
ac856cb965 LibRegex: Don't ignore empty alternatives in append_alternation()
Doing so would cause patterns like `(a|)` to not match the empty string.
2021-10-29 15:57:59 +02:00
Ali Mohammad Pur
8f722302d9 LibRegex: Use a match table for character classes
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
2021-10-03 19:16:36 +02:00
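The match-table idea can be sketched as follows. This is a minimal illustration under assumptions, not the LibRegex implementation: `CodePointRange`, `build_match_table()` and `matches()` are hypothetical names, and the real bytecode layout may differ. Ranges are sorted, overlapping or adjacent ranges are merged, and lookup is a binary search over the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical range type; both ends are inclusive.
struct CodePointRange {
    std::uint32_t from;
    std::uint32_t to;
};

// Sort the ranges, then merge overlapping or adjacent ones into a compact table.
std::vector<CodePointRange> build_match_table(std::vector<CodePointRange> ranges)
{
    std::sort(ranges.begin(), ranges.end(),
        [](CodePointRange const& a, CodePointRange const& b) { return a.from < b.from; });

    std::vector<CodePointRange> table;
    for (auto const& range : ranges) {
        assert(range.from <= range.to);
        // The UINT32_MAX check avoids wrapping around in `to + 1`.
        if (!table.empty() && (table.back().to == UINT32_MAX || range.from <= table.back().to + 1))
            table.back().to = std::max(table.back().to, range.to);
        else
            table.push_back(range);
    }
    return table;
}

// Binary search: find the first range starting after `code_point`, step back
// one entry, and check whether that range still covers the code point.
bool matches(std::vector<CodePointRange> const& table, std::uint32_t code_point)
{
    auto it = std::upper_bound(table.begin(), table.end(), code_point,
        [](std::uint32_t value, CodePointRange const& range) { return value < range.from; });
    if (it == table.begin())
        return false;
    --it;
    return code_point <= it->to;
}
```

Because the table is sorted and non-overlapping, a lookup costs O(log n) comparisons in the number of merged ranges, rather than one comparison per range in the class.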
Andreas Kling
2758d99bbc LibRegex: Flatten bytecode before performing optimizations
This avoids doing DisjointChunks traversal for every bytecode access,
significantly reducing startup time for large regular expressions.
2021-09-29 18:45:26 +02:00
Ali Mohammad Pur
741886a4c4 LibRegex: Make the optimizer understand references and capture groups
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
2021-09-15 15:52:28 +04:30
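The pattern above depends on backtracking, which is why the fork cannot be dropped. The sketch below shows this with the C++ standard library's `std::regex` (not LibRegex); `is_doubled_run_of_ones()` is a hypothetical helper name. For the input `1111`, the group first grabs all four characters and has to give some back before `\1` can match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `input` is a non-empty run of '1's followed by the same run
// again. Uses std::regex purely for illustration; LibRegex is not involved.
bool is_doubled_run_of_ones(std::string const& input)
{
    static std::regex const pattern(R"((1+)\1)", std::regex::ECMAScript);
    return std::regex_match(input, pattern);
}
```

A matcher that treated `(1+)` as atomic, never giving characters back, would reject `1111` even though `11` followed by `11` is a valid match.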
Ali Mohammad Pur
bf0315ff8f LibRegex: Avoid excessive Vector copy when compiling regexps
Previously we would copy the bytecode instead of moving the chunks
around. Use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
2021-09-14 21:33:15 +04:30
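The trade-off behind chunked storage can be sketched like this. `ChunkList` is a hypothetical stand-in written for illustration and does not reproduce the actual `DisjointChunks<T>` interface: appending moves whole buffers instead of copying elements, while indexed access has to walk the chunks, which is what makes flattening worthwhile before access-heavy passes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a disjoint-chunks container (not AK's DisjointChunks<T>).
template<typename T>
class ChunkList {
public:
    // Take ownership of a whole chunk: the buffer is moved, its elements are not copied.
    void append(std::vector<T>&& chunk) { m_chunks.push_back(std::move(chunk)); }

    // Move every chunk of `other` onto the end of this list, leaving `other` empty.
    void extend(ChunkList&& other)
    {
        for (auto& chunk : other.m_chunks)
            m_chunks.push_back(std::move(chunk));
        other.m_chunks.clear();
    }

    // Total element count; walks all chunks, so avoid calling it inside hot loops.
    std::size_t size() const
    {
        std::size_t total = 0;
        for (auto const& chunk : m_chunks)
            total += chunk.size();
        return total;
    }

    // Indexed access walks the chunks, costing O(number of chunks) per call.
    T const& at(std::size_t index) const
    {
        for (auto const& chunk : m_chunks) {
            if (index < chunk.size())
                return chunk[index];
            index -= chunk.size();
        }
        throw std::out_of_range("ChunkList::at");
    }

    // Copy everything into one contiguous vector for cheap repeated access.
    std::vector<T> flatten() const
    {
        std::vector<T> result;
        result.reserve(size());
        for (auto const& chunk : m_chunks)
            result.insert(result.end(), chunk.begin(), chunk.end());
        return result;
    }

private:
    std::vector<std::vector<T>> m_chunks;
};
```

Building with `append()` and `extend()` keeps compilation cheap, and a single `flatten()` before a pass that reads by index repeatedly replaces many chunk walks with plain vector indexing.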
Ali Mohammad Pur
246ab432ff LibRegex: Add a basic optimization pass
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
2021-09-13 14:38:53 +04:30