ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2025-10-19 14:40:18 +00:00

Author	SHA1	Message	Date
Timothy Flynn	484ccfadc3	LibRegex: Support property escapes of Unicode script extensions	2021-08-04 13:50:32 +01:00
Timothy Flynn	06088df729	LibRegex: Support property escapes of the Unicode script property Note that unlike binary properties and general categories, scripts must be specified in the non-binary (Script=Value) form.	2021-08-04 13:50:32 +01:00
Timothy Flynn	27d555bab0	LibRegex: Track string position in both code units and code points In non-Unicode mode, the existing MatchState::string_position is tracked in code units; in Unicode mode, it is tracked in code points. In order for some RegexStringView operations to be performant, it is useful for the MatchState to have a field to always track the position in code units. This will allow RegexStringView methods (e.g. operator[]) to perform lookups based on code unit offsets, rather than needing to iterate over the entire string to find a code point offset.	2021-08-04 11:18:24 +02:00
Timothy Flynn	510bbcd8e0	AK+LibRegex: Add Utf16View::code_point_at and use it in RegexStringView The current method of iterating through the string to access a code point hurts performance quite badly for very large strings. The test262 test "RegExp/property-escapes/generated/Any.js" previously took 3 hours to complete; this one change brings it down to under 10 seconds.	2021-08-04 11:18:24 +02:00
Timothy Flynn	dc9f516339	LibRegex: Generate negated property escapes as a single instruction These were previously generated as two instructions, Compare [Inverse] and Compare [Property].	2021-08-02 21:02:09 +04:30
Timothy Flynn	4de4312827	LibRegex: Support property escapes of the form \p{Type=Value} Before now, only binary properties could be parsed. Non-binary props are of the form "Type=Value", where "Type" may be General_Category, Script, or Script_Extension (or their aliases). Of these, LibUnicode currently supports General_Category, so LibRegex can parse only that type.	2021-08-02 21:02:09 +04:30
Timothy Flynn	1e10d6d7ce	LibRegex: Support property escapes of Unicode General Categories This changes LibRegex to parse the property escape as a Variant of Unicode Property & General Category values. A byte code instruction is added to perform matching based on General Category values.	2021-08-02 21:02:09 +04:30
Ali Mohammad Pur	d5984d296f	LibRegex: Make Matcher<>::match(Vector<>) take a reference to the vector It was previously copying the entire vector every time, which is not a nice thing to do. :^)	2021-08-02 17:22:50 +04:30
Ali Mohammad Pur	a7653e6a05	LibRegex: Use a bump-allocated linked list for fork save states This makes it avoid the excessively high malloc() traffic.	2021-08-02 17:22:50 +04:30
Ali Mohammad Pur	5f342e4fa9	LibRegex: Make Fork{Jump,Stay} non-recursive This makes very fork-heavy expressions (like `(aa)*`) not run out of stack space when matching very long strings.	2021-08-02 17:22:50 +04:30
Brian Gianforcaro	18d6f9ed5c	Libraries: Remove unused header includes	2021-08-01 08:10:16 +02:00
Timothy Flynn	d485cf29d7	LibRegex+LibUnicode: Begin implementing Unicode property escapes This supports some binary property matching. It does not support any properties not yet parsed by LibUnicode, nor does it support value matching (such as Script_Extensions=Latin).	2021-07-30 21:26:31 +01:00
Timothy Flynn	1400e3cf58	LibRegex: Allow separately parsing patterns and creating Regex objects Adds a static method to parse a regex pattern and return the result, and a constructor to accept a parse result. This is to allow LibJS to parse the pattern string of a RegExpLiteral once and hand off regex objects any number of times thereafter.	2021-07-30 21:26:31 +01:00
Timothy Flynn	b162517065	LibRegex: Take ownership of pattern string and fix move operations The Regex object created a copy of the pattern string anyways, so tweak the constructor to allow callers to move() pattern strings into the regex. The Regex move constructor and assignment operator currently result in memory corruption. The Regex object stores a Matcher object, which holds a reference to the Regex object. So when the Regex object is moved, that reference is no longer valid. To fix this, the reference stored in the Matcher must be updated when the Regex is moved.	2021-07-30 21:26:31 +01:00
Timothy Flynn	4a72b2c879	LibRegex: Allow RegexOptions to be declared at compile time	2021-07-30 21:26:31 +01:00
Ali Mohammad Pur	faef523567	LibRegex: Make unclosed-at-eof brace quantifiers an error Otherwise we'd just loop trying to parse it over and over again, for instance in `/a{/` or `/a{1,/`. The exception is Annex B mode, which allows `{` as a normal ExtendedSourceCharacter.	2021-07-24 20:52:43 +04:30
Ali Mohammad Pur	1dd1378159	LibRegex: Preserve the type of the match when clearing capture groups Even though the contents are supposed to be reset, the type should stay unchanged, as that's an assumption the engine is making.	2021-07-24 20:52:43 +04:30
Timothy Flynn	345ef6abba	LibRegex: Support ECMA-262 Unicode escapes of the form "\u{code_point}" When the Unicode flag is set, regular expressions may escape code points by surrounding the hexadecimal code point with curly braces, e.g. \u{41} is the character "A". When the Unicode flag is not set, this should be considered a repetition symbol - \u{41} is the character "u" repeated 41 times. This is left as a TODO for now.	2021-07-23 23:06:57 +01:00
Timothy Flynn	0e6375558d	AK+LibRegex: Partially implement case insensitive UTF-16 comparison This will work for ASCII code points. Unicode case folding will be needed for non-ASCII.	2021-07-23 23:06:57 +01:00
Timothy Flynn	47f6bb38a1	LibRegex: Support UTF-16 RegexStringView and improve Unicode matching When the Unicode option is not set, regular expressions should match based on code units; when it is set, they should match based on code points. To do so, the regex parser must combine surrogate pairs when the Unicode option is set. Further, RegexStringView needs to know if the flag is set in order to return code point vs. code unit based string lengths and substrings.	2021-07-23 23:06:57 +01:00
Ali Mohammad Pur	36bfc912fc	LibRegex: Switch to east-const style	2021-07-23 21:19:21 +04:30
Ali Mohammad Pur	c8b2199251	LibRegex: Clear previous capture group contents in ECMA262 mode ECMA262 requires that the capture groups only contain the values from the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the second capture group when matching "cabcb".	2021-07-23 21:19:21 +04:30
Gunnar Beutner	36e36507d5	Everywhere: Prefer using {:#x} over 0x{:x} We have a dedicated format specifier which adds the "0x" prefix, so let's use that instead of adding it manually.	2021-07-22 08:57:01 +02:00
Ali Mohammad Pur	f364fcec5d	LibRegex+Everywhere: Make LibRegex more unicode-aware This commit makes LibRegex (mostly) capable of operating on any of the three main string views: StringView for raw strings, Utf8View for utf-8 encoded strings, and Utf32View for raw unicode strings. As a result, regexps with unicode strings should be able to properly handle utf-8 and not stop in the middle of a code point. A future commit will update LibJS to use the correct type of string depending on the flags.	2021-07-18 21:10:55 +04:30
Ali Mohammad Pur	2961982277	LibRegex: Use <...> includes in RegexMatch.h	2021-07-18 21:10:55 +04:30
Ali Mohammad Pur	5089fd8b3c	LibRegex: Also print a newline after each debug line Otherwise the new debug line would be printed right after the previous one without a newline.	2021-07-18 21:10:55 +04:30
Ali Mohammad Pur	052004f92d	LibRegex: Partially implement string compare for Utf32View	2021-07-18 21:10:55 +04:30
Ali Mohammad Pur	da1fda73a7	LibRegex: Implement line splitting for Utf32View Co-authored-by: Timothy Flynn <trflynn89@pm.me>	2021-07-18 21:10:55 +04:30
Ali Mohammad Pur	9cdea2d521	LibRegex: Consider EOF in the middle of a range an error	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	1b2728f1ed	LibRegex: Don't attempt to insert invalid bytecode in {B,E}RE	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	6e35b94034	LibRegex: Implement lookaround in ERE	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	5f4e1338a1	LibRegex: Allow empty character classes in {B,E}RE	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	189922f442	LibRegex: Disallow excessively large repetition counts in {B,E}RE	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	f9fed0b167	LibRegex+LibC: Make re_nsub available to the user To comply with Dr.POSIX, this field has to be user-accessible.	2021-07-13 07:04:06 +02:00
Ali Mohammad Pur	11a8476cf4	LibRegex: Use the parser state capture group count in BRE Otherwise the users won't know how many capture groups are in the parsed regular expression.	2021-07-10 23:14:08 +04:30
Ali Mohammad Pur	1c584e9d80	LibRegex: Correctly parse BRE bracket expressions Commonly, bracket expressions are in fact, enclosed in brackets.	2021-07-10 22:58:24 +04:30
Ali Mohammad Pur	daa6d99e6e	LibRegex: Add support for non-extended regular expressions in regcomp() Fixes part of #8506.	2021-07-10 13:33:08 +02:00
Ali Mohammad Pur	54d89609de	LibRegex: Add support for the Basic POSIX regular expressions This implements the internal regex stuff for #8506.	2021-07-10 13:33:08 +02:00
Ali Mohammad Pur	addfa1e82e	LibRegex: Make the bytecode transformation functions static They were pretty confusing when compared with other non-transforming functions.	2021-07-10 13:33:08 +02:00
Timothy Flynn	0f0ac37b56	LibRegex: Break from execution loop when the sticky flag is set If the sticky flag is set, the regex execution loop should break immediately even if the execution was a failure. The specification for several RegExp.prototype methods (e.g. exec and @@split) rely on this behavior.	2021-07-09 19:45:55 +01:00
Timothy Flynn	65003241e4	LibRegex: Allow dollar signs in ECMA262 named capture groups Fixes 1 test262 test.	2021-07-06 22:33:17 +01:00
Andrew Kaster	5e8a0c014e	LibRegex: Make regex::Regex move-constructible and move-assignable For some reason the default move constructor and default move-assign operator were deleted, so we explicitly default them instead.	2021-06-30 08:18:28 +04:30
Andreas Kling	e59bf87374	Userland: Replace VERIFY(is<T>) with verify_cast<T> Instead of doing a VERIFY(is<T>(x)) and then casting it to T, we can just do the cast right away with verify_cast<T>. :^)	2021-06-24 21:13:09 +02:00
sin-ack	74d76528d6	LibRegex: Display correct position for Compare in REGEX_DEBUG When REGEX_DEBUG is enabled, LibRegex dumps a table of information regarding the state of the regex bytecode execution. The Compare opcode manipulates state.string_position directly, so the string_position value cannot be used to display where the comparison started; therefore, this patch introduces a new variable to keep track of where we were before the comparison happened.	2021-06-16 16:30:12 +04:30
sin-ack	6b2e264093	LibRegex: Fix incorrect case-sensitive comparisons A tiny typo was introduced in `bc8d16ad` which caused all case insensitive comparisons to fail.	2021-06-16 16:30:12 +04:30
Gunnar Beutner	5bfe601152	LibRegex: Remove unused code	2021-06-14 16:09:58 +04:30
Gunnar Beutner	a167941852	LibRegex: Use a plain pointer for OpCode::m_state	2021-06-14 16:09:58 +04:30
Gunnar Beutner	d3c2a3caea	LibRegex: Avoid initialization checks in get_opcode_by_id()	2021-06-14 16:09:58 +04:30
Gunnar Beutner	794dc368f1	LibRegex: Avoid prepending items to vectors	2021-06-14 16:09:58 +04:30
Gunnar Beutner	214410b397	LibRegex: Avoid making unnecessary string copies	2021-06-14 16:09:58 +04:30
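
Several commits above (47f6bb38a1, 27d555bab0, 510bbcd8e0) turn on the difference between UTF-16 code units and code points. The following is a minimal sketch of that distinction using only the C++ standard library. The helper names are hypothetical and chosen for illustration; they are not the AK `Utf16View` or `RegexStringView` implementations, and the real code may differ. It shows why a lookup by code unit offset can be constant time, while counting code points requires walking the string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helpers for illustration only; not the AK or LibRegex APIs.

constexpr bool is_high_surrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Decode the code point that starts at the given code unit offset.
// No iteration from the start of the string is needed, so this is O(1).
inline char32_t code_point_at(std::u16string_view view, std::size_t code_unit_offset)
{
    assert(code_unit_offset < view.size());
    char16_t first = view[code_unit_offset];
    if (is_high_surrogate(first) && code_unit_offset + 1 < view.size()) {
        char16_t second = view[code_unit_offset + 1];
        if (is_low_surrogate(second)) {
            // Combine the surrogate pair into one supplementary code point.
            return 0x10000
                + ((static_cast<char32_t>(first) - 0xD800) << 10)
                + (static_cast<char32_t>(second) - 0xDC00);
        }
    }
    // A BMP code point, or an unpaired surrogate returned as-is.
    return first;
}

// Length in code points (what a Unicode-mode match would count).
// This has to walk the whole string, so it is O(n).
inline std::size_t length_in_code_points(std::u16string_view view)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < view.size(); ++count) {
        bool is_pair = is_high_surrogate(view[i])
            && i + 1 < view.size()
            && is_low_surrogate(view[i + 1]);
        i += is_pair ? 2 : 1;
    }
    return count;
}
```

For the string `u"a\U0001F600b"`, `size()` reports 4 code units while `length_in_code_points` reports 3, and `code_point_at(s, 1)` yields U+1F600 directly from its code unit offset. Keeping a code unit position alongside the code point position, as commit 27d555bab0 describes, is what makes this kind of direct lookup possible.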


101 commits