Commit graph

16 commits

Author SHA1 Message Date
Timothy Flynn
27d555bab0 LibRegex: Track string position in both code units and code points
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.

In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
2021-08-04 11:18:24 +02:00
Timothy Flynn
510bbcd8e0 AK+LibRegex: Add Utf16View::code_point_at and use it in RegexStringView
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
2021-08-04 11:18:24 +02:00
Ali Mohammad Pur
5f342e4fa9 LibRegex: Make Fork{Jump,Stay} non-recursive
This makes very fork-heavy expressions (like `(aa)*`) not run out of
stack space when matching very long strings.
2021-08-02 17:22:50 +04:30
Ali Mohammad Pur
1dd1378159 LibRegex: Preserve the type of the match when clearing capture groups
Even though the contents are supposed to be reset, the type should stay
unchanged, as that's an assumption the engine is making.
2021-07-24 20:52:43 +04:30
Timothy Flynn
0e6375558d AK+LibRegex: Partially implement case insensitive UTF-16 comparison
This will work for ASCII code points. Unicode case folding will be
needed for non-ASCII.
2021-07-23 23:06:57 +01:00
Timothy Flynn
47f6bb38a1 LibRegex: Support UTF-16 RegexStringView and improve Unicode matching
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
2021-07-23 23:06:57 +01:00
Ali Mohammad Pur
36bfc912fc LibRegex: Switch to east-const style 2021-07-23 21:19:21 +04:30
Ali Mohammad Pur
f364fcec5d LibRegex+Everywhere: Make LibRegex more unicode-aware
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings

As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
2961982277 LibRegex: Use <...> includes in RegexMatch.h 2021-07-18 21:10:55 +04:30
Ali Mohammad Pur
da1fda73a7 LibRegex: Implement line splitting for Utf32View
Co-authored-by: Timothy Flynn <trflynn89@pm.me>
2021-07-18 21:10:55 +04:30
sin-ack
74d76528d6 LibRegex: Display correct position for Compare in REGEX_DEBUG
When REGEX_DEBUG is enabled, LibRegex dumps a table of information
regarding the state of the regex bytecode execution. The Compare opcode
manipulates state.string_position directly, so the string_position value
cannot be used to display where the comparison started; therefore, this
patch introduces a new variable to keep track of where we were before
the comparison happened.
2021-06-16 16:30:12 +04:30
Brian Gianforcaro
1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
AnotherTest
f05e518cbc LibRegex: Implement section B.1.4. of the ECMA262 spec
This allows the parser to deal with crazy patterns like the one
in #5517.
2021-02-27 07:31:01 +01:00
Andreas Kling
5d180d1f99 Everywhere: Rename ASSERT => VERIFY
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)

Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.

We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
2021-02-23 20:56:54 +01:00
asynts
5c5665c1e7 Everywhere: Replace a bundle of dbg with dbgln.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
2021-01-22 22:14:30 +01:00
Andreas Kling
13d7c09125 Libraries: Move to Userland/Libraries/ 2021-01-12 12:17:46 +01:00
Renamed from Libraries/LibRegex/RegexMatch.h (Browse further)