ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2025-07-25 02:12:39 +00:00

Author	SHA1	Message	Date
Timothy Flynn	6e0290ecaa	AK: Define some UTF-16 helper methods * contains * escape_html_entities * replace * to_ascii_lowercase * to_ascii_uppercase * to_ascii_titlecase * trim * trim_whitespace	2025-07-18 12:45:38 -04:00
Timothy Flynn	fe676585f5	AK: Add a UTF-16 string with optimized short- and ASCII-string storage This is a strictly UTF-16 string with some optimizations for ASCII. * If created from a short UTF-8 or UTF-16 string that is also ASCII, then the string is stored in an inlined byte buffer. * If created with a long UTF-8 or UTF-16 string that is also ASCII, then the string is stored in an outlined char buffer. * If created with a short or long UTF-8 or UTF-16 string that is not ASCII, then the string is stored in an outlined char16 buffer. We do not store short non-ASCII text in the inlined buffer to avoid confusion with operations such as `length_in_code_units` and `code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes in short string form. But we still want `length_in_code_units` to be 2, and `code_unit_at(0)` to be 0xD83D.	2025-07-18 12:45:38 -04:00
Timothy Flynn	8fbb80fffc	AK: Do not fall back to simdutf for UTF-16 ASCII validation This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these bytes are ASCII if interpreted as UTF-8. But the string itself is most certainly not ASCII.	2025-07-18 12:45:38 -04:00
Timothy Flynn	9fc3e72db2	AK+Everywhere: Allow lonely UTF-16 surrogates by default By definition, the web allows lonely surrogates by default. Let's have our string APIs reflect this, so we don't have to pass an allow option all over the place.	2025-07-03 09:51:56 -04:00
Timothy Flynn	86b1c78c1a	AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string To prepare for an upcoming Utf16String, this migrates Utf16View to store its data as a char16_t. Most function definitions are moved inline and made constexpr. This also adds a UDL to construct a Utf16View from a string literal: auto string = u"hello"sv; This lets us remove the NTTP Utf16View constructor, as we have found that such constructors bloat binary size quite a bit.	2025-07-03 09:51:56 -04:00
Timothy Flynn	c17b067e1d	AK: Completely remove endianness from Utf16View APIs These were mostly removed in `7628ddfaf7`. This removes the few remaining cases, as no callers are providing any non-host endianness. This is just to prevent weird API dissymmetry between Utf16View and an upcoming Utf16String.	2025-07-03 09:51:56 -04:00
Timothy Flynn	2abc955ca9	AK: Allow treating UTF-16 views with lonely surrogates as valid Much of the web requires us to allow lonely surrogates in UTF-16 data. The default behavior to disallow such code units has not been changed here - that will be changed in an upcoming commit.	2025-07-03 09:51:56 -04:00
Timothy Flynn	d978a582a0	AK: Add a Utf16View ASCII validator	2025-07-03 09:51:56 -04:00
Timothy Flynn	66006d3812	AK+LibJS: Extract some UTF-16 helpers for use in an outside class An upcoming Utf16String will need access to these helpers. Let's make them publicly available.	2025-07-03 09:51:56 -04:00
Jelle Raaijmakers	408165d2f4	AK: Return early in `utf8_to_utf16()` for empty strings No need to validate an empty string.	2025-06-13 15:08:26 +02:00
Jelle Raaijmakers	2d6da6e112	AK: Remove superfluous check from `Utf16View::substring_view()` The `Span::slice()` operation just below it performs the exact same check.	2025-06-13 15:08:26 +02:00
Jelle Raaijmakers	0d543b604b	AK: Make more use of lazily calculated code point count in `Utf16View` In `0c93a07fb1`, a lazily calculated code point count was introduced but was not used in all places where we need that count. No functional changes.	2025-06-13 15:08:26 +02:00
Jelle Raaijmakers	cc0a28ee7d	AK: Add `Utf16View::find_code_unit_offset(_ignoring_case)`	2025-06-13 15:08:26 +02:00
Jelle Raaijmakers	7d7f6fa494	AK: Remove superfluous check from `Utf16View::equals_ignoring_case()` Returning true if both lengths are 0 is already handled by the default case.	2025-06-13 15:08:26 +02:00
Shannon Booth	5cf87dcfdc	AK: Add a Utf16View::is_code_unit_less_than helper This seems like the natural place to put this since it is specific to UTF-16.	2025-05-17 08:00:59 -04:00
Andreas Kling	6efc5c54b5	AK: Make Utf16View::to_utf8() use simdutf fast path more often By piggybacking on the already-optimized implementation in StringBuilder, we can get simdutf for the AllowInvalidCodeUnits::Yes case here as well. 1.03x speedup on Speedometer's TodoMVC-jQuery.	2025-05-09 21:36:59 +02:00
Ali Mohammad Pur	eea81738cd	AK+Everywhere: Recognise that surrogates in utf16 aren't all that common For the slight cost of counting code points when converting between encodings and a teeny bit of memory, this commit adds a fast path for all-happy utf-16 substrings and code point operations. This seems to be a significant chunk of time spent in many regex benchmarks.	2025-04-23 07:56:02 -06:00
Andreas Kling	0c93a07fb1	AK: Shrink Utf16View Use a sentinel value instead of Optional for the cached length in code points, shrinking Utf16View from 32 to 24 bytes.	2025-04-16 10:04:50 +02:00
Andreas Kling	7628ddfaf7	AK: Remove endianness override from Utf16View Utf16View is now always in "host" endian mode. This makes it smaller and less branchy for everyone!	2025-04-16 10:04:50 +02:00
Andreas Kling	0e9480b944	AK+LibTextCodec: Stop using Utf16View endianness override This is preparation for removing the endianness override, since it was only used by a single client: LibTextCodec. While here, add helpers and make use of simdutf for fast conversion.	2025-04-16 10:04:50 +02:00
Timothy Flynn	d19b31529f	AK+Meta: Update simdutf to version 5.5.0 Contains many fixes found upstream by fuzzers. Also includes fixes for CPU-specific inconsistencies with null inputs.	2024-09-19 15:48:57 -04:00
Timothy Flynn	7a17c654d2	AK: Add a method to compute UTF-16 length from a UTF-8 string	2024-07-31 05:55:34 -04:00
Andrew Kaster	45301e8169	Everywhere: Remove AK_DONT_REPLACE_STD macro Let's just always include `<utility>`. Placing our own incompatible with the STL declaration of these functions in AK was always fishy to begin with.	2024-07-30 18:38:02 -06:00
Timothy Flynn	74d644a216	AK: Explicitly check for null data in Utf16View The underlying CPU-specific instructions for operating on UTF-16 strings behave differently for null inputs. Add an explicit check for this state for consistency.	2024-07-21 19:57:07 +02:00
Timothy Flynn	71c29504af	AK: Support non-native endianness in Utf16View Utf16View currently assumes host endianness. Add support for specifying either big or little endianness (which we mostly just pipe through to simdutf). This will allow using simdutf facilities with LibTextCodec.	2024-07-18 19:43:57 +02:00
Timothy Flynn	0c14a9417a	AK: Replace converting to and from UTF-16 with simdutf The one behavior difference is that we will now actually fail on invalid code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was arguably a bug that this wasn't already the case.	2024-07-18 14:46:25 +02:00
Timothy Flynn	32ffe9bbfc	AK: Replace UTF-16 validation and length computation with simdutf	2024-07-18 14:46:25 +02:00
Diego	7560b640f3	AK: Add `AllowSurrogates` to UTF-8 validator The [UTF-8](https://datatracker.ietf.org/doc/html/rfc3629#page-5) standard says to reject strings with upper or lower surrogates. However, in many standards, ECMAScript included, unpaired surrogates (and therefore UTF-8 surrogates) are allowed in strings. So, this commit extends the UTF-8 validation API with `AllowSurrogates`, which will reject upper and lower surrogate characters.	2024-06-09 12:16:32 +02:00
Timothy Flynn	1b4a23095c	AK: Add a Utf16View::starts_with method Based heavily on Utf8View::starts_with.	2024-01-04 12:43:10 +01:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only \| grep \.cpp\\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Nico Weber	aa9037eed4	AK: Add spec comments to Utf16CodePointIterator::operator*()	2023-01-22 21:30:44 +00:00
Timothy Flynn	2eacc7aec1	AK: Add Utf16View::to_utf8 to convert the view to a UTF-8 AK::String	2023-01-09 23:00:24 +00:00
Timothy Flynn	d0403ec14f	AK+Everywhere: Rename Utf16View::to_utf8 to to_deprecated_string A subsequent commit will add to_utf8 back to create an AK::String.	2023-01-09 23:00:24 +00:00
Timothy Flynn	d793262beb	AK+Everywhere: Make UTF-16 to UTF-8 converter fallible This could fail to allocate the underlying storage needed to store the UTF-8 data. Propagate this error.	2023-01-08 12:13:15 +01:00
Timothy Flynn	1edb96376b	AK+Everywhere: Make UTF-8 and UTF-32 to UTF-16 converters fallible These could fail to allocate the underlying storage needed to store the UTF-16 data. Propagate these errors.	2023-01-08 12:13:15 +01:00
Timothy Flynn	425c168ded	AK+LibJS+LibRegex: Define an alias for UTF-16 string data storage Instead of writing out "Vector<u16, 1>" everywhere, let's have a name for it.	2023-01-08 12:13:15 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Linus Groh	d26aabff04	Everywhere: Run clang-format	2022-12-03 23:52:23 +00:00
Idan Horowitz	44e8c05c67	AK: Add a Utf16View::code_unit_offset_of(Utf16CodePointIterator) helper This helper can be used to retrieve the code unit offset of an active Utf16CodePointIterator efficiently.	2022-01-31 21:05:04 +02:00
Timothy Flynn	6efbafa6e0	Everywhere: Update copyrights with my new serenityos.org e-mail :^)	2022-01-31 18:23:22 +00:00
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Andreas Kling	87290e300e	AK: Simplify Utf16View::operator==(Utf16View)	2021-10-02 18:32:56 +02:00
Andreas Kling	024367d82e	LibJS+AK: Use Vector<u16, 1> for UTF-16 string storage It's very common to encounter single-character strings in JavaScript on the web. We can make such strings significantly lighter by having a 1-character inline capacity on the Vectors.	2021-10-02 17:39:38 +02:00
Timothy Flynn	70080feab2	AK+LibJS: Implement String.from{CharCode,CodePoint} using UTF-16 strings Most of String.prototype and RegExp.prototype is implemented with UTF-16 so this is to prevent extra copying of the string data.	2021-08-04 11:18:24 +02:00
Timothy Flynn	510bbcd8e0	AK+LibRegex: Add Utf16View::code_point_at and use it in RegexStringView The current method of iterating through the string to access a code point hurts performance quite badly for very large strings. The test262 test "RegExp/property-escapes/generated/Any.js" previously took 3 hours to complete; this one change brings it down to under 10 seconds.	2021-08-04 11:18:24 +02:00
Timothy Flynn	0e6375558d	AK+LibRegex: Partially implement case insensitive UTF-16 comparison This will work for ASCII code points. Unicode case folding will be needed for non-ASCII.	2021-07-23 23:06:57 +01:00
Timothy Flynn	2e45e52993	AK: Add UTF-16 helper methods required for use within LibRegex To be used as a RegexStringView variant, Utf16View must provide a couple more helper methods. It must also not default its assignment operators, because that implicitly deletes move/copy constructors.	2021-07-23 23:06:57 +01:00
Timothy Flynn	9b83cd1abf	AK: Add Utf16View for decoding UTF-16 strings Also includes a way to transcode from and to UTF-8 strings.	2021-07-22 09:10:44 +02:00

48 commits
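
Two of the commit messages above (8fbb80fffc and fe676585f5) rest on small encoding facts that are easy to check by hand. The sketch below illustrates them in plain standard C++ and is not taken from AK. The helper names `bytes_look_ascii`, `code_units_are_ascii`, and `check_left_quote` are hypothetical and exist only for this illustration. It shows that U+201C passes a byte-wise ASCII check while failing a code-unit-wise one, and that U+1F600 occupies 4 UTF-8 bytes but 2 UTF-16 code units, the first of which is 0xD83D.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an AK API): treats the buffer as raw bytes.
// This is the flawed approach for UTF-16 data.
bool bytes_look_ascii(unsigned char const* bytes, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        if (bytes[i] >= 0x80)
            return false;
    }
    return true;
}

// Hypothetical helper (not an AK API): checks every 16-bit code unit.
bool code_units_are_ascii(char16_t const* units, std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i) {
        if (units[i] >= 0x80)
            return false;
    }
    return true;
}

// U+201C (LEFT DOUBLE QUOTATION MARK) as UTF-16LE bytes and as one code unit.
constexpr unsigned char left_quote_bytes[] = { 0x1c, 0x20 };
constexpr char16_t left_quote_units[] = { 0x201C };

// U+1F600 (GRINNING FACE), written as an escape so the source file encoding
// does not matter.
constexpr char16_t grinning_face[] = u"\U0001F600";

// The array sizes include the terminating null, hence the "- 1".
static_assert(sizeof(grinning_face) / sizeof(char16_t) - 1 == 2, "two UTF-16 code units");
static_assert(grinning_face[0] == 0xD83D, "high surrogate comes first");
static_assert(sizeof(u8"\U0001F600") - 1 == 4, "four UTF-8 bytes");

void check_left_quote()
{
    // Both bytes are below 0x80, so the byte-wise check is a false positive.
    assert(bytes_look_ascii(left_quote_bytes, 2));
    // The single code unit 0x201C is not ASCII.
    assert(!code_units_are_ascii(left_quote_units, 1));
}
```

Calling `check_left_quote()` exercises both helpers. The `static_assert` lines are verified at compile time, so the surrogate and byte-count facts hold whenever the snippet compiles.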