Commit graph

146 commits

Author SHA1 Message Date
Callum Law
8ada4b7fdc LibRegex: Account for opcode size when calculating incoming jump edges
Not accounting for opcode size when calculating incoming jump edges
meant that we were merging nodes where we otherwise shouldn't have been,
for example /.*a|.*b/.
2025-07-28 17:06:58 +02:00
Ali Mohammad Pur
c7ad6cd508 LibRegex: Use code unit length in more places that apply
Finishes what 7f6b70fafb started.
Having one part use length and another code unit length led to crashes;
the added test ensures we don't mess that up again.
2025-07-24 23:09:01 +02:00
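To illustrate why mixing the two lengths is dangerous, here is a minimal Python sketch (not LibRegex code; the helper names are made up for this example) showing that the code point length and the UTF-16 code unit length of the same string can disagree:

```python
def code_point_length(text: str) -> int:
    # Python strings are sequences of code points.
    return len(text)


def utf16_code_unit_length(text: str) -> int:
    # Each UTF-16 code unit is two bytes; "surrogatepass" lets lone
    # surrogates through instead of raising an encoding error.
    return len(text.encode("utf-16-le", errors="surrogatepass")) // 2


sample = "a\U0001F600"  # 'a' followed by one astral (non-BMP) code point
print(code_point_length(sample), utf16_code_unit_length(sample))
```

An index computed with one measure and used against a buffer sized with the other can run past the end as soon as the input contains a character outside the Basic Multilingual Plane.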
aplefull
e2f8f5a350 LibRegex: Fix capture groups in quantified alternations
This prevents empty matches from overwriting non-empty captures in
quantified alternations. Fixes patterns like (a|a?)+ where the optional
branch would incorrectly overwrite meaningful captures with empty
strings.
2025-07-24 10:40:16 +02:00
Timothy Flynn
2dfcc4c307 LibRegex: Compare code units (not code points) in non-Unicode char range 2025-07-21 23:44:18 +02:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Ali Mohammad Pur
5b45223d5f LibRegex: Account for uppercase characters in insensitive patterns 2025-07-12 11:26:23 +02:00
Shannon Booth
bd6581fe22 LibRegex: Correctly use ClassSetReservedPunctuator in ClassSetCharacter
We had typo'd using ClassSetReservedDoublePunctuator which was
resulting in a parse error for the regex:

([^\\:]+?)

With the 'v' flag set.

Co-Authored-By: Ali Mohammad Pur <mpfard@serenityos.org>
2025-07-10 11:41:02 +02:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
aplefull
486602e796 LibRegex: Fix handling of + quantifier with zero-width matches
Small change that allows quantifiers using Fork* forms (e.g., +) to
succeed after one match, even if that match has zero width.
2025-06-02 15:52:26 +02:00
Ali Mohammad Pur
cfc241f61d LibRegex: Make the trie rewrite optimisation maintain the alt order
This is required by the spec.
2025-05-21 14:28:45 +02:00
ayeteadoe
11bca38f91 CMake: Build LibRegex tests in Tests/LibRegex not Meta/Lagom
As LibRegex was not specified in TEST_DIRECTORIES, the existing
Tests/LibRegex subdirectory was not actually included during
configuration. Also the RegexLibC test has not been needed
since migration away from Serenity's LibC was done, so
that test has been fully removed. I also renamed the
Regex.cpp test to TestRegex.cpp to match the naming
convention of most test targets.
2025-05-14 02:05:12 -06:00
Ali Mohammad Pur
022cd1adca LibRegex: Use the right offset when patching jumps through fork-trees
Fixes #4474.
2025-04-27 12:16:15 +02:00
Ali Mohammad Pur
fca1d33fec LibRegex: Correctly calculate the target for Repeat in table alts
Fixes a bunch of websites breaking because we now verify jump offsets by
trying to remove 0-offset jumps.
This has been broken for a good while; it was just rare to see Repeat
inside alternatives that lent themselves well to tree alts.
2025-04-24 01:17:27 -06:00
Ali Mohammad Pur
76f5dce3db LibRegex: Flatten capture group list in MatchState
This makes copying the capture group COWVector significantly cheaper,
as we no longer have to run any constructors for it - just memcpy.
2025-04-18 17:09:27 +02:00
Andreas Kling
96f1f15ad6 LibRegex: Remove unused Utf8View/Utf32View support in RegexStringView 2025-04-16 10:04:50 +02:00
Andreas Kling
5308d77600 LibRegex: Don't use Optional<T> inside regex::Match
This prevented Match from being trivially copyable, which we want it to
be for fast Vector copying.
2025-04-14 17:40:13 +02:00
Andreas Kling
54edf29f1b LibRegex: Make Match::capture_group_name an index into the string table
This removes another Match member that required destruction. The "API"
for accessing the strings is definitely a bit awkward. We'll think of
something nicer eventually.
2025-04-14 17:40:13 +02:00
Ali Mohammad Pur
69050da929 LibRegex: Merge inverse string table mappings separately
2025-04-06 20:21:16 +02:00
Ali Mohammad Pur
299b9ca572 LibRegex: Check backreference index before looking it up
If a backref happens after it's cleared, the slot may be cleared
already.
2025-04-06 20:21:16 +02:00
Andreas Kling
6b6d3b32a4 LibRegex: Remove the StringCopyMatches mode
This mode made a lot of incorrect assumptions about string lifetimes,
and instead of fixing it, let's just remove it and tweak the few unit
tests that used it.
2025-03-24 22:27:17 +00:00
mikiubo
c85df78c4c LibRegex: Remove orphaned save points in nested LookAhead 2025-03-17 16:11:02 +01:00
Tim Ledbetter
b9ac99d2eb Revert "LibRegex: Remove orphaned save points in nested LookAhead"
This reverts commit f2678bfcb8.
2025-03-14 19:57:33 +00:00
mikiubo
f2678bfcb8 LibRegex: Remove orphaned save points in nested LookAhead 2025-03-14 09:41:41 +01:00
Ali Mohammad Pur
a37315da87 Tests: Get rid of clang-format: off in the Regex tests
Should've done this a long time ago, but now is better than never.
2025-03-09 14:37:57 +01:00
Ali Mohammad Pur
5355710481 LibRegex: Don't treat single-jump blocks as noop in the optimizer 2025-03-09 14:37:57 +01:00
aplefull
389a63d6bf LibRegex: Allow duplicate named capture groups in separate alternatives 2025-03-05 14:36:09 +01:00
aplefull
61744322ad LibRegex: Ensure nullable quantifiers backtrack when input remains
Makes patterns like `/(a?b??)*/` correctly match the string
2025-03-02 15:19:04 +01:00
mikiubo
8a6f7b787e LibRegex: Use depth-first search in regex optimizer
Use depth-first search in the optimizer code because using breadth-first
search generates a bug. Add a test example to the test lib.
2025-02-25 00:09:20 +01:00
Ali Mohammad Pur
08ebfaff17 LibRegex: Take trailing inversion state into account in block comparison
Fixes #3421.
2025-02-01 11:30:02 +01:00
Ali Mohammad Pur
cce000d57c LibRegex: Don't repeat the same fork again
If some state has already been tried, skip over it as it would never
lead to a match regardless.
This fixes performance/memory issues in cases like
/(a+)+b/.exec("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
or
/(a|a?)+b/...

Fixes #2622.
2025-01-17 10:13:51 +01:00
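The idea of skipping already-tried states can be sketched with a toy backtracking VM. This is a simplified illustration, not LibRegex's implementation: the instruction set, the `run` function, and the use of `(ip, pos)` as the state key are all assumptions made for this example, and a real engine has to track more state (captures, for instance).

```python
# Hypothetical bytecode for /(a+)+b/ with captures ignored.
PROGRAM = [
    ("char", "a"),   # 0: match one 'a'
    ("fork", 0, 2),  # 1: inner a+: prefer looping, else continue
    ("fork", 0, 3),  # 2: outer (...)+: prefer looping, else continue
    ("char", "b"),   # 3: match 'b'
    ("match",),      # 4: success
]


def run(program, text, memoize=True):
    """Anchored match from index 0; returns (matched, steps_executed)."""
    stack = [(0, 0)]  # pending (instruction pointer, input position) states
    seen = set()
    steps = 0
    while stack:
        ip, pos = stack.pop()
        if memoize:
            if (ip, pos) in seen:
                continue  # this state was already tried and cannot succeed
            seen.add((ip, pos))
        steps += 1
        op = program[ip]
        if op[0] == "match":
            return True, steps
        if op[0] == "char":
            if pos < len(text) and text[pos] == op[1]:
                stack.append((ip + 1, pos + 1))
        elif op[0] == "fork":
            # Push the low-priority branch first so the preferred one pops first.
            stack.append((op[2], pos))
            stack.append((op[1], pos))
    return False, steps


print(run(PROGRAM, "a" * 12, memoize=True))
print(run(PROGRAM, "a" * 12, memoize=False))
```

Without the `seen` set the step count roughly doubles with every additional `a` in a non-matching input; with it, the work is bounded by the number of distinct `(ip, pos)` pairs.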
Ali Mohammad Pur
50733c564c LibRegex: Use the *actually* correct repeat start offset for Repeat
Fixes #2931 and various frequent crashes.
2024-12-23 13:13:52 +01:00
Ali Mohammad Pur
eee90f4aa2 LibRegex: Treat checks against nonexistent checkpoints as empty
Due to optimiser shenanigans in the tree alternative form, some
JumpNonEmpty ops might be moved before their Checkpoint instruction.
It is safe to assume the distance between the nonexistent checkpoint and
the current op is zero, so just do that.
2024-12-13 10:00:16 +01:00
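A minimal sketch of the "treat a missing checkpoint as empty" rule, assuming that JumpNonEmpty takes its jump only when input has been consumed since the referenced checkpoint. The function name and the dictionary representation of checkpoints are hypothetical and chosen only for illustration:

```python
def jump_non_empty_taken(checkpoints: dict, checkpoint_id: int, position: int) -> bool:
    # If the checkpoint was never recorded, fall back to the current position,
    # which makes the distance zero and the "non-empty" test false.
    saved = checkpoints.get(checkpoint_id, position)
    return position != saved


print(jump_non_empty_taken({}, 0, 5))      # checkpoint missing
print(jump_non_empty_taken({0: 3}, 0, 5))  # input consumed since checkpoint
```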
Marc Jessome
efcaf991e6 LibRegex: Ensure nested capture groups have non-conflicting names
Take record of the named capture group prior to parsing the group's
body. This requires removal of the recorded minimum length of the named
capture group directly, and now needs to be looked up via the group
minimum lengths table.
2024-11-24 10:26:09 +01:00
Ali Mohammad Pur
5a4d657a4e LibRegex: Avoid generating ForkJumps when jumping to the next alt block
Fixes #2398.
2024-11-17 20:12:39 +01:00
Ali Mohammad Pur
00bc22c332 LibRegex: Don't immediately ignore TempInverse in optimizer
fe46b2c141 added the reset-temp-inverse flag, but set it up so all
tempinverse ops were negated at the start of the next op; this commit
makes it so these flags actually persist for one op and not zero.

Fixes #2296.
2024-11-17 09:03:29 -05:00
Ali Mohammad Pur
dabd60180f LibRegex: Don't ignore references that weren't bound in checked blocks
Fixes #2281.
2024-11-12 10:37:57 +01:00
Ali Mohammad Pur
00c45243bd LibRegex: Don't blindly accept inverted charclasses for atomic rewrite 2024-10-24 07:36:51 -04:00
Ali Mohammad Pur
cc1f0c3af2 LibRegex: Restore checkpoints when restoring the state post-fork
Fixes the lockup/OOM in #968.
2024-10-09 11:20:58 +02:00
Andreas Kling
cc4b3cbacc Meta: Update my e-mail address everywhere 2024-10-04 13:19:50 +02:00
Gingeh
de588a97c0 LibRegex: Only search start of line if pattern begins with ^ 2024-09-30 12:28:22 +02:00
Ali Mohammad Pur
27a38932da LibRegex: Account for extra explicit And/Or in class parser assertion
Fixes #23691.
2024-03-24 08:24:46 +01:00
Ali Mohammad Pur
e265d81277 LibRegex: Correct And/Or and inversion interplay semantics
This commit also fixes an incorrect test case from very early on, our
behaviour now matches the ECMA262 spec in this case.

Fixes #21786.
2024-01-11 11:36:09 +01:00
Ali Mohammad Pur
267040dde7 LibRegex: Error out on Eof when parsing nonempty class range elements
Fixes #22507.
2023-12-31 15:36:42 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
e122039c99 LibRegex: Support non-ASCII case-insensitive character comparisons
Specifically, when the Unicode flag is set, use Unicode-aware case
folding to case-insensitively compare code points.
2023-11-08 12:54:26 -05:00
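The difference between ASCII-only and Unicode-aware case-insensitive comparison can be illustrated in Python. This is only an analogy: `equal_ignoring_case` and `ascii_fold` are hypothetical helpers, and Python's `str.casefold()` applies full case folding, which may differ in detail from the folding LibRegex uses.

```python
def ascii_fold(ch: str) -> str:
    # Fold only ASCII letters; leave everything else untouched.
    return ch.lower() if ch.isascii() else ch


def equal_ignoring_case(a: str, b: str, unicode_aware: bool) -> bool:
    if unicode_aware:
        return a.casefold() == b.casefold()
    return ascii_fold(a) == ascii_fold(b)


# U+00C4 (A with diaeresis, upper) vs U+00E4 (lower)
print(equal_ignoring_case("\u00c4", "\u00e4", unicode_aware=False))
print(equal_ignoring_case("\u00c4", "\u00e4", unicode_aware=True))
```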
Timothy Flynn
3fbf33bd37 LibRegex: Change a couple function parameters to east-const
Automatically done by clang-format-17 (and clang-format-16 leaves these
alone afterwards).
2023-11-08 12:54:26 -05:00
Ali Mohammad Pur
4d71f4edc4 LibRegex: Don't add the Repeat instruction size to its jump target
This was causing the calculated jump target to become invalid, leading
to possibly invalid optimisations and (more likely) crashes.
Fixes #21047.
2023-09-15 18:07:23 +03:30
Ali Mohammad Pur
4d27257c45 LibRegex: Treat backwards jumps to IP 0 as normal backwards jumps too
This shows up in something like /\d+|x/, where the `+` ends up jumping
to the start of its own alternative.
2023-08-16 22:20:24 +03:30
Ali Mohammad Pur
e689422564 LibRegex: Keep track of instruction positions for backwards tree jumps 2023-08-05 16:40:04 +02:00
Ali Mohammad Pur
4e69eb89e8 LibRegex: Generate a search tree when patterns would benefit from it
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
    - a
    | - b
    | | - c
    | | | - <accept>
    | | - <accept>
    | - c
    | | - <accept>
which will attempt to match 'a' or 'b' only once, and would also limit
the number of backtrackings performed in case alternatives fail to
match.

This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.
2023-07-31 05:31:33 +02:00
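The tree from the commit message above can be reproduced with a small sketch. It works on literal alternatives only and uses plain characters instead of regex instructions; `build_trie`, `render`, and the `ACCEPT` marker are names invented for this example and do not reflect how LibRegex represents the trie.

```python
ACCEPT = "<accept>"


def build_trie(alternatives):
    """Merge literal alternatives into a trie of nested dicts.

    Dicts preserve insertion order, so children stay in the order in which
    the alternatives first introduced them.
    """
    root = {}
    for alternative in alternatives:
        node = root
        for ch in alternative:
            node = node.setdefault(ch, {})
        node.setdefault(ACCEPT, {})  # mark the end of this alternative
    return root


def render(node, depth=0):
    """Return the trie as indented lines, one node per line."""
    lines = []
    for label, child in node.items():
        lines.append("| " * depth + "- " + label)
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(build_trie(["abc", "ac", "ab"]))))
```

Because the shared prefix `a` becomes a single node, a matcher walking this structure tests it once instead of once per alternative.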