Commit graph

56 commits

Author SHA1 Message Date
Ali Mohammad Pur
1b127ac082 LibRegex: Apply atomic loop rewrite in one more case
This commit makes LibRegex's atomic loop rewrite opt also accept cases
where the follow block jumps to the end of the forking block
(which is essentially a loop without a proper header in fancy clothes)

This makes patterns like /([^x]*)x/ where the loop is not _immediately_
followed by a block significantly faster.
2024-10-25 09:15:51 +02:00
Ali Mohammad Pur
536e4d1d36 LibRegex: Cache 1 MiB worth of bytecode for each parser
While LibJS does cache regex literals, there's more than one way to
create regex objects, this cache is hit regularly just browsing the web,
though no real measurement has been done on its potential speedups.
2024-10-24 20:55:08 +02:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Ali Mohammad Pur
4e69eb89e8 LibRegex: Generate a search tree when patterns would benefit from it
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
    - a
    | - b
    | | - c
    | | | - <accept>
    | | - <accept>
    | - c
    | | - <accept>
which will attempt to match 'a' or 'b' only once, and would also limit
the number of backtrackings performed in case alternatives fails to
match.

This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.
2023-07-31 05:31:33 +02:00
Ali Mohammad Pur
11abca421a LibRegex: Always inline get_opcode()
This function's prologue was showing up as very hot in profiles, taking
almost 9% of runtime in our Regex test suite.
2023-07-31 05:31:33 +02:00
Andreas Kling
17bff999c8 LibRegex: Remove redundant VERIFY in OpCode::argument()
The bounds check here is already performed by the at() we're calling,
and removing it appears to have non-trivial performance impact.
2023-07-14 09:09:11 +02:00
Ali Mohammad Pur
2d6f50932b LibRegex: Assign unique serial IDs to checkpoints
This makes the compiler assign a serial ID to each checkpoint instead of
using the IP as the identifier.
This will be used in a future commit to replace the backing store of
checkpoints with a vector.
2023-07-14 08:59:19 +02:00
Ben Wiederhake
6fd478b6ce Everywhere: Remove unused includes of AK/Format.h
These instances were detected by searching for files that include
AK/Format.h, but don't match the regex:

\\b(CheckedFormatString|critical_dmesgln|dbgln|dbgln_if|dmesgln|FormatBu
ilder|__FormatIfSupported|FormatIfSupported|FormatParser|FormatString|Fo
rmattable|Formatter|__format_value|HasFormatter|max_format_arguments|out
|outln|set_debug_enabled|StandardFormatter|TypeErasedFormatParams|TypeEr
asedParameter|VariadicFormatParams|v_critical_dmesgln|vdbgln|vdmesgln|vf
ormat|vout|warn|warnln|warnln_if)\\b

(Without the linebreaks.)

This regex is pessimistic, so there might be more files that don't
actually use any formatting functions.

Observe that this revealed that Userland/Libraries/LibC/signal.cpp is
missing an include.

In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
2023-01-02 20:27:20 -05:00
Moustafa Raafat
ae2abcebbb Everywhere: Use C++ concepts instead of requires clauses 2022-12-09 11:25:30 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Linus Groh
d26aabff04 Everywhere: Run clang-format 2022-12-03 23:52:23 +00:00
Ali Mohammad Pur
464ac85a1b LibRegex: Avoid copying MatchInput when getting argument descriptions 2022-11-17 20:13:04 +03:30
Ali Mohammad Pur
598dc74a76 LibRegex: Partially implement the ECMAScript unicodeSets proposal
This skips the new string unicode properties additions, along with \q{}.
2022-07-20 21:25:59 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Ali Mohammad Pur
6e655b7f89 LibRegex: Fully interpret the Compare Op when looking for overlaps
We had a really naive and simplistic implementation, which lead to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
2022-07-04 23:09:53 +02:00
Ali Mohammad Pur
1a35e27490 LibRegex: Make FailForks fail all forks up to the last save point
This makes negative lookarounds with more than one fork behave
correctly.
Fixes #11350.
2021-12-25 18:41:10 +01:00
Hendiadyoin1
27a967371b LibRegex: Remove duplicate declaration of execution_result_name(...) 2021-12-21 18:17:28 -08:00
Hendiadyoin1
5885e70df7 LibRegex: Remove some meaningless/useless const-qualifiers
Also replace String creation from `""` with `String::empty()`
2021-12-21 18:17:28 -08:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Ali Mohammad Pur
b9ffa0ad2e LibRegex: Transform 0,1 min/unbounded max repetitions to * or +
The implementation of transform_bytecode_repetition_min_max expects at
least min=1, so let's also optimise on our way out of a bug where
'x{0,}' would cause a crash.
2021-10-09 15:57:05 +02:00
Ali Mohammad Pur
8f722302d9 LibRegex: Use a match table for character classes
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
2021-10-03 19:16:36 +02:00
Ali Mohammad Pur
bf0315ff8f LibRegex: Avoid excessive Vector copy when compiling regexps
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
2021-09-14 21:33:15 +04:30
Ali Mohammad Pur
246ab432ff LibRegex: Add a basic optimization pass
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
2021-09-13 14:38:53 +04:30
Ali Mohammad Pur
abbe9da255 LibRegex: Make infinite repetitions short-circuit on empty matches
This makes (addmittedly weird) patterns like `(a*)*` work correctly
without going into an infinite fork loop.
2021-09-06 13:51:30 +04:30
Ali Mohammad Pur
dd82c2e9b4 LibRegex: Correctly handle failing in the middle of explicit repeats
- Make sure that all the Repeat ops are reset (otherwise the operation
  would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
  fork, otherwise the last failing fork would not have anywhere to
  return to.
Fixes #9707.
2021-09-01 13:36:53 +02:00
Ali Mohammad Pur
98624fe03f LibRegex: Implement min/max repetition using the Repeat bytecode
This makes repetitions with large max bounds work correctly.
Also fixes an OOM issue found by OSS-Fuzz:
https://oss-fuzz.com/testcase?key=4725721980338176
2021-08-31 16:37:49 +02:00
Timothy Flynn
9509433e25 LibRegex: Implement and use a REPEAT operation for bytecode repetition
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.

Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.

Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
2021-08-15 11:43:45 +01:00
Timothy Flynn
a0b72f5ad3 LibRegex: Remove (mostly) unused regex::MatchOutput
This struct holds a counter for the number of executed operations, and
vectors for matches, captures groups, and named capture groups. Each of
the vectors is unused. Remove the struct and just keep a separate
counter for the executed operations.
2021-08-15 11:43:45 +01:00
Timothy Flynn
f1ce998d73 LibRegex+LibJS: Combine named and unnamed capture groups in MatchState
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.

Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.

This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
2021-08-15 11:43:45 +01:00
Timothy Flynn
484ccfadc3 LibRegex: Support property escapes of Unicode script extensions 2021-08-04 13:50:32 +01:00
Timothy Flynn
06088df729 LibRegex: Support property escapes of the Unicode script property
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
2021-08-04 13:50:32 +01:00
Timothy Flynn
1e10d6d7ce LibRegex: Support property escapes of Unicode General Categories
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
2021-08-02 21:02:09 +04:30
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00
Ali Mohammad Pur
36bfc912fc LibRegex: Switch to east-const style 2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
c8b2199251 LibRegex: Clear previous capture group contents in ECMA262 mode
ECMA262 requires that the capture groups only contain the values from
the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the
second capture group when matching "cabcb".
2021-07-23 21:19:21 +04:30
Gunnar Beutner
36e36507d5 Everywhere: Prefer using {:#x} over 0x{:x}
We have a dedicated format specifier which adds the "0x" prefix, so
let's use that instead of adding it manually.
2021-07-22 08:57:01 +02:00
Ali Mohammad Pur
f364fcec5d LibRegex+Everywhere: Make LibRegex more unicode-aware
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings

As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
addfa1e82e LibRegex: Make the bytecode transformation functions static
They were pretty confusing when compared with other non-transforming
functions.
2021-07-10 13:33:08 +02:00
Andreas Kling
e59bf87374 Userland: Replace VERIFY(is<T>) with verify_cast<T>
Instead of doing a VERIFY(is<T>(x)) and *then* casting it to T, we can
just do the cast right away with verify_cast<T>. :^)
2021-06-24 21:13:09 +02:00
Gunnar Beutner
5bfe601152 LibRegex: Remove unused code 2021-06-14 16:09:58 +04:30
Gunnar Beutner
a167941852 LibRegex: Use a plain pointer for OpCode::m_state 2021-06-14 16:09:58 +04:30
Gunnar Beutner
d3c2a3caea LibRegex: Avoid initialization checks in get_opcode_by_id() 2021-06-14 16:09:58 +04:30
Gunnar Beutner
281f39073d LibRegex: Make get_opcode() return a reference
Previously this would return a pointer which could be null if the
requested opcode was invalid. This should never be the case though
so let's VERIFY() that instead.
2021-06-14 16:09:58 +04:30
Gunnar Beutner
cd49fb0229 LibRegex: Remove return value for setters 2021-06-14 16:09:58 +04:30
Gunnar Beutner
1fb4471506 LibRegex: Use a plain array to store opcodes
Using a hash map is unnecessary because the number of opcodes and their
IDs never change.
2021-06-14 16:09:58 +04:30
Andreas Kling
dc65f54c06 AK: Rename Vector::append(Vector) => Vector::extend(Vector)
Let's make it a bit more clear when we're appending the elements from
one vector to the end of another vector.
2021-06-12 13:24:45 +02:00
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
Andreas Kling
c68dcf45b6 LibRegex: Convert String::format() => String::formatted() 2021-04-21 23:49:02 +02:00
AnotherTest
e9279d1790 LibRegex: Allow a '?' suffix for brace quantifiers
This fixes another compat point in #6042.
2021-04-10 09:16:03 +02:00