101 commits

Author SHA1 Message Date
Timothy Flynn
484ccfadc3 LibRegex: Support property escapes of Unicode script extensions 2021-08-04 13:50:32 +01:00
Timothy Flynn
06088df729 LibRegex: Support property escapes of the Unicode script property
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
2021-08-04 13:50:32 +01:00
Timothy Flynn
27d555bab0 LibRegex: Track string position in both code units and code points
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.

In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
2021-08-04 11:18:24 +02:00
Timothy Flynn
510bbcd8e0 AK+LibRegex: Add Utf16View::code_point_at and use it in RegexStringView
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
2021-08-04 11:18:24 +02:00
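A rough sketch of what a constant-time code point lookup over UTF-16 can look like is shown below. It is illustrative only: the free function `decode_code_point_at` is a hypothetical name, and the real `Utf16View::code_point_at` in AK may differ in signature and in how it treats unpaired surrogates. The caller is assumed to pass an index smaller than the text length.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Decode the code point that starts at code unit `index`, without scanning
// the preceding text. Precondition (assumed): index < text.size().
inline std::uint32_t decode_code_point_at(std::u16string_view text, std::size_t index)
{
    char16_t const unit = text[index];
    bool const is_lead = unit >= 0xD800 && unit <= 0xDBFF;
    if (!is_lead || index + 1 >= text.size())
        return unit; // BMP code unit, or an unpaired lead surrogate at the end

    char16_t const next = text[index + 1];
    bool const is_trail = next >= 0xDC00 && next <= 0xDFFF;
    if (!is_trail)
        return unit; // unpaired lead surrogate is returned as-is

    // Combine the surrogate pair into a supplementary-plane code point.
    return 0x10000u
        + ((static_cast<std::uint32_t>(unit) - 0xD800u) << 10)
        + (static_cast<std::uint32_t>(next) - 0xDC00u);
}
```

The lookup touches at most two code units, so its cost does not grow with the length of the string, which is the property the commit message relies on.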
Timothy Flynn
dc9f516339 LibRegex: Generate negated property escapes as a single instruction
These were previously generated as two instructions, Compare [Inverse]
and Compare [Property].
2021-08-02 21:02:09 +04:30
Timothy Flynn
4de4312827 LibRegex: Support property escapes of the form \p{Type=Value}
Before now, only binary properties could be parsed. Non-binary props are
of the form "Type=Value", where "Type" may be General_Category, Script,
or Script_Extensions (or their aliases). Of these, LibUnicode currently
supports only General_Category, so LibRegex can parse only that type.
2021-08-02 21:02:09 +04:30
Timothy Flynn
1e10d6d7ce LibRegex: Support property escapes of Unicode General Categories
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
2021-08-02 21:02:09 +04:30
Ali Mohammad Pur
d5984d296f LibRegex: Make Matcher<>::match(Vector<>) take a reference to the vector
It was previously copying the entire vector every time, which is not a
nice thing to do. :^)
2021-08-02 17:22:50 +04:30
Ali Mohammad Pur
a7653e6a05 LibRegex: Use a bump-allocated linked list for fork save states
This avoids excessively high malloc() traffic.
2021-08-02 17:22:50 +04:30
Ali Mohammad Pur
5f342e4fa9 LibRegex: Make Fork{Jump,Stay} non-recursive
This makes very fork-heavy expressions (like `(aa)*`) not run out of
stack space when matching very long strings.
2021-08-02 17:22:50 +04:30
Brian Gianforcaro
18d6f9ed5c Libraries: Remove unused header includes 2021-08-01 08:10:16 +02:00
Timothy Flynn
d485cf29d7 LibRegex+LibUnicode: Begin implementing Unicode property escapes
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
2021-07-30 21:26:31 +01:00
Timothy Flynn
1400e3cf58 LibRegex: Allow separately parsing patterns and creating Regex objects
Adds a static method to parse a regex pattern and return the result, and
a constructor to accept a parse result. This is to allow LibJS to parse
the pattern string of a RegExpLiteral once and hand off regex objects
any number of times thereafter.
2021-07-30 21:26:31 +01:00
Timothy Flynn
b162517065 LibRegex: Take ownership of pattern string and fix move operations
The Regex object created a copy of the pattern string anyway, so tweak
the constructor to allow callers to move() pattern strings into the
regex.

The Regex move constructor and assignment operator currently result in
memory corruption. The Regex object stores a Matcher object, which holds
a reference to the Regex object. So when the Regex object is moved, that
reference is no longer valid. To fix this, the reference stored in the
Matcher must be updated when the Regex is moved.
2021-07-30 21:26:31 +01:00
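The move problem described above can be reproduced in miniature. The sketch below uses hypothetical `Pattern` and `Engine` types in place of Regex and Matcher, and a raw pointer in place of the stored reference; it only shows the general shape of the fix (re-pointing the back-reference after a move), not the actual LibRegex change.

```cpp
#include <string>
#include <utility>

class Pattern; // hypothetical stand-in for the Regex object

// Hypothetical stand-in for the Matcher, which points back at its owner.
struct Engine {
    Pattern const* owner { nullptr };
};

class Pattern {
public:
    explicit Pattern(std::string source)
        : m_source(std::move(source)) // take ownership instead of copying
    {
        m_engine.owner = this;
    }

    Pattern(Pattern&& other) noexcept
        : m_source(std::move(other.m_source))
        , m_engine(other.m_engine)
    {
        m_engine.owner = this; // re-point the back-reference at the new object
    }

    Pattern& operator=(Pattern&& other) noexcept
    {
        if (this != &other) {
            m_source = std::move(other.m_source);
            m_engine = other.m_engine;
            m_engine.owner = this; // same fix for move assignment
        }
        return *this;
    }

    Pattern(Pattern const&) = delete;
    Pattern& operator=(Pattern const&) = delete;

    std::string const& source() const { return m_source; }
    bool engine_points_at_self() const { return m_engine.owner == this; }

private:
    std::string m_source;
    Engine m_engine;
};
```

With defaulted move operations, `m_engine.owner` would keep pointing at the moved-from object; the explicit assignments to `owner` are what keep the pair consistent.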
Timothy Flynn
4a72b2c879 LibRegex: Allow RegexOptions to be declared at compile time 2021-07-30 21:26:31 +01:00
Ali Mohammad Pur
faef523567 LibRegex: Make unclosed-at-eof brace quantifiers an error
Otherwise we'd just loop trying to parse it over and over again, for
instance in `/a{/` or `/a{1,/`.
Unless we're parsing in Annex B mode, which allows `{` as a normal
ExtendedSourceCharacter.
2021-07-24 20:52:43 +04:30
Ali Mohammad Pur
1dd1378159 LibRegex: Preserve the type of the match when clearing capture groups
Even though the contents are supposed to be reset, the type should stay
unchanged, as that's an assumption the engine is making.
2021-07-24 20:52:43 +04:30
Timothy Flynn
345ef6abba LibRegex: Support ECMA-262 Unicode escapes of the form "\u{code_point}"
When the Unicode flag is set, regular expressions may escape code points
by surrounding the hexadecimal code point with curly braces, e.g. \u{41}
is the character "A".

When the Unicode flag is not set, this should be considered a repetition
symbol - \u{41} is the character "u" repeated 41 times. This is left as
a TODO for now.
2021-07-23 23:06:57 +01:00
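For illustration, parsing the braced escape form in Unicode mode might look roughly like the sketch below. The function name `parse_braced_unicode_escape` is hypothetical and this is not the LibRegex parser; it assumes the input begins with the backslash and rejects values above U+10FFFF, as ECMA-262 requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Parse an escape of the form \u{hex} at the start of `input`.
// Returns the code point, or nothing if the escape is malformed
// or the value exceeds U+10FFFF.
inline std::optional<std::uint32_t> parse_braced_unicode_escape(std::string_view input)
{
    if (input.size() < 5 || input.substr(0, 3) != "\\u{")
        return std::nullopt;

    std::uint32_t value = 0;
    std::size_t i = 3;
    for (; i < input.size() && input[i] != '}'; ++i) {
        char const c = input[i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9')
            digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = static_cast<std::uint32_t>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = static_cast<std::uint32_t>(c - 'A') + 10;
        else
            return std::nullopt; // not a hexadecimal digit

        value = value * 16 + digit;
        if (value > 0x10FFFF)
            return std::nullopt; // beyond the Unicode code space
    }

    if (i == 3 || i >= input.size())
        return std::nullopt; // no digits, or no closing brace
    return value;
}
```

The non-Unicode interpretation (a literal "u" followed by a repetition count) is deliberately not covered here, matching the TODO in the commit message.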
Timothy Flynn
0e6375558d AK+LibRegex: Partially implement case insensitive UTF-16 comparison
This will work for ASCII code points. Unicode case folding will be
needed for non-ASCII.
2021-07-23 23:06:57 +01:00
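An ASCII-only case-insensitive comparison over UTF-16 code units can be sketched as below. The helper names are hypothetical and this is not the AK implementation; it only folds `A` through `Z`, which is why non-ASCII letters still compare as different.

```cpp
#include <cstddef>
#include <string_view>

// Lowercase a single code unit if it is an ASCII uppercase letter.
inline char16_t to_ascii_lowercase(char16_t u)
{
    return (u >= u'A' && u <= u'Z') ? static_cast<char16_t>(u + 0x20) : u;
}

// Compare two UTF-16 strings, ignoring case for ASCII letters only.
inline bool equals_ignoring_ascii_case(std::u16string_view a, std::u16string_view b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (to_ascii_lowercase(a[i]) != to_ascii_lowercase(b[i]))
            return false;
    }
    return true;
}
```

Under this scheme `u"Regex"` equals `u"rEGEX"`, while `u"\u00C9"` and `u"\u00E9"` (É and é) do not match; handling those would require Unicode case folding.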
Timothy Flynn
47f6bb38a1 LibRegex: Support UTF-16 RegexStringView and improve Unicode matching
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
2021-07-23 23:06:57 +01:00
Ali Mohammad Pur
36bfc912fc LibRegex: Switch to east-const style 2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
c8b2199251 LibRegex: Clear previous capture group contents in ECMA262 mode
ECMA262 requires that the capture groups only contain the values from
the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the
second capture group when matching "cabcb".
2021-07-23 21:19:21 +04:30
Gunnar Beutner
36e36507d5 Everywhere: Prefer using {:#x} over 0x{:x}
We have a dedicated format specifier which adds the "0x" prefix, so
let's use that instead of adding it manually.
2021-07-22 08:57:01 +02:00
Ali Mohammad Pur
f364fcec5d LibRegex+Everywhere: Make LibRegex more unicode-aware
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings

As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
2961982277 LibRegex: Use <...> includes in RegexMatch.h 2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
5089fd8b3c LibRegex: Also print a newline after each debug line
Otherwise the new debug line would be printed right after the previous
one without a newline.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
052004f92d LibRegex: Partially implement string compare for Utf32View 2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
da1fda73a7 LibRegex: Implement line splitting for Utf32View
Co-authored-by: Timothy Flynn <trflynn89@pm.me>
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
9cdea2d521 LibRegex: Consider EOF in the middle of a range an error 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
1b2728f1ed LibRegex: Don't attempt to insert invalid bytecode in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
6e35b94034 LibRegex: Implement lookaround in ERE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
5f4e1338a1 LibRegex: Allow empty character classes in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
189922f442 LibRegex: Disallow excessively large repetition counts in {B,E}RE 2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
f9fed0b167 LibRegex+LibC: Make re_nsub available to the user
To comply with Dr.POSIX, this field has to be user-accessible.
2021-07-13 07:04:06 +02:00
Ali Mohammad Pur
11a8476cf4 LibRegex: Use the parser state capture group count in BRE
Otherwise the users won't know how many capture groups are in the
parsed regular expression.
2021-07-10 23:14:08 +04:30
Ali Mohammad Pur
1c584e9d80 LibRegex: Correctly parse BRE bracket expressions
Commonly, bracket expressions are, in fact, enclosed in brackets.
2021-07-10 22:58:24 +04:30
Ali Mohammad Pur
daa6d99e6e LibRegex: Add support for non-extended regular expressions in regcomp()
Fixes part of #8506.
2021-07-10 13:33:08 +02:00
Ali Mohammad Pur
54d89609de LibRegex: Add support for the Basic POSIX regular expressions
This implements the internal regex stuff for #8506.
2021-07-10 13:33:08 +02:00
Ali Mohammad Pur
addfa1e82e LibRegex: Make the bytecode transformation functions static
They were pretty confusing when compared with other non-transforming
functions.
2021-07-10 13:33:08 +02:00
Timothy Flynn
0f0ac37b56 LibRegex: Break from execution loop when the sticky flag is set
If the sticky flag is set, the regex execution loop should break
immediately even if the execution was a failure. The specifications for
several RegExp.prototype methods (e.g. exec and @@split) rely on this
behavior.
2021-07-09 19:45:55 +01:00
Timothy Flynn
65003241e4 LibRegex: Allow dollar signs in ECMA262 named capture groups
Fixes 1 test262 test.
2021-07-06 22:33:17 +01:00
Andrew Kaster
5e8a0c014e LibRegex: Make regex::Regex move-constructible and move-assignable
For some reason the default move constructor and default move-assign
operator were deleted, so we explicitly default them instead.
2021-06-30 08:18:28 +04:30
Andreas Kling
e59bf87374 Userland: Replace VERIFY(is<T>) with verify_cast<T>
Instead of doing a VERIFY(is<T>(x)) and *then* casting it to T, we can
just do the cast right away with verify_cast<T>. :^)
2021-06-24 21:13:09 +02:00
sin-ack
74d76528d6 LibRegex: Display correct position for Compare in REGEX_DEBUG
When REGEX_DEBUG is enabled, LibRegex dumps a table of information
regarding the state of the regex bytecode execution. The Compare opcode
manipulates state.string_position directly, so the string_position value
cannot be used to display where the comparison started; therefore, this
patch introduces a new variable to keep track of where we were before
the comparison happened.
2021-06-16 16:30:12 +04:30
sin-ack
6b2e264093 LibRegex: Fix incorrect case-sensitive comparisons
A tiny typo was introduced in bc8d16ad which caused all case insensitive
comparisons to fail.
2021-06-16 16:30:12 +04:30
Gunnar Beutner
5bfe601152 LibRegex: Remove unused code 2021-06-14 16:09:58 +04:30
Gunnar Beutner
a167941852 LibRegex: Use a plain pointer for OpCode::m_state 2021-06-14 16:09:58 +04:30
Gunnar Beutner
d3c2a3caea LibRegex: Avoid initialization checks in get_opcode_by_id() 2021-06-14 16:09:58 +04:30
Gunnar Beutner
794dc368f1 LibRegex: Avoid prepending items to vectors 2021-06-14 16:09:58 +04:30
Gunnar Beutner
214410b397 LibRegex: Avoid making unnecessary string copies 2021-06-14 16:09:58 +04:30