Commit graph

3919 commits

Author SHA1 Message Date
Aliaksandr Kalenik
d47a22150d AK: Define operator== for HashMap 2025-07-30 11:06:05 +02:00
Grant Knowlton
9e1e4f3b15 AK: Validate compressed tags in IPv4-mapped IPv6 address
This disallows parsing IPv4 mapped IPv6 address strings with multiple
compression prefixes.  Tests are provided for the updated
functionality.
2025-07-30 00:53:10 +02:00
Timothy Flynn
d9502505c2 AK: Fix bounds assertions in Utf16View::iterator_offset 2025-07-28 18:30:50 +02:00
Timothy Flynn
67723ef83c AK: Add a method to peek ahead of a UTF-16 iterator 2025-07-28 18:30:50 +02:00
Timothy Flynn
21d7d236e6 AK: Add a method to check if a UTF-16 string contains any code point 2025-07-28 18:30:50 +02:00
Timothy Flynn
96e75a023b AK: Implement a UTF-16 UnixDateTime stringifier 2025-07-28 12:25:11 +02:00
Timothy Flynn
ed63a60247 AK: Return an empty optional when UTF-16 code unit lookup fails
Accidentally returned the wrong type here.
2025-07-28 12:25:11 +02:00
Timothy Flynn
baddac5155 AK: Implement a method to split a UTF-16 string 2025-07-28 12:25:11 +02:00
Timothy Flynn
48a3b2c28e AK: Implement a method to count instances of a needle in a UTF-16 string 2025-07-28 12:25:11 +02:00
Timothy Flynn
1375e6bf39 AK+LibJS+LibWeb: Use simdutf to create well-formed strings 2025-07-26 00:40:06 +02:00
Timothy Flynn
a740bfd8ff AK+LibUnicode: Implement Unicode-aware UTF-16 case transformations 2025-07-25 18:16:22 +02:00
Timothy Flynn
df77ae1920 AK: Implement creating a UTF-16 string from a repeated code point 2025-07-25 18:16:22 +02:00
Timothy Flynn
a46e9b2adb AK: Compute the correct capacity in StringBuilder::try_append_repeated
This was mistakenly broken in 2803d66d87.
2025-07-25 18:16:22 +02:00
Timothy Flynn
745f288796 AK: Implement a method to acquire a UTF-16 iterator's code unit offset
This is the same as Utf8View::iterator_offset().
2025-07-25 18:16:22 +02:00
Jelle Raaijmakers
0b96690f0c AK: Add HashMap::update()
This updates a HashMap by copying another HashMap's keys and values.
2025-07-25 16:22:06 +02:00
Timothy Flynn
6c73dff120 AK: Implement a UTF-16 method to check if a string is ASCII whitespace 2025-07-24 19:00:20 +02:00
Timothy Flynn
f53389bab1 AK: Add a couple of Utf16String factories
* Utf16String::from_utf8_with_replacement_character
* Utf16String::from_code_point
2025-07-24 19:00:20 +02:00
Jelle Raaijmakers
b1c3ce807b AK: Rename Utf16View::trim_whitespace() to ::trim_ascii_whitespace()
This reflects the naming of String::trim_ascii_whitespace() and better
indicates what exactly we're trimming.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
9a03ee1c24 AK: Fix mention of renamed member in Utf16View 2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
15178d5230 AK: Add ::ends_with() to Utf16View and Utf16StringBase
I noticed that we can significantly simplify ::starts_with(), and based
the new ::ends_with() on that.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
7f8468b0e6 AK: Compare pointers in TypedTransfer<T>::compare()
We can return `true` quickly if the two pointers are identical.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
54dd45d3f6 AK: Add Span::ends_with()
Originally I added this to use it in Utf16View::ends_with(), but the
final implementation ended up a lot simpler. I chose to keep this anyway
since it mirrors Span::starts_with().
2025-07-24 07:18:25 -04:00
Timothy Flynn
6ddbb70051 AK: Remove constexpr specifier from Utf16View::bytes()
The Span constructor used here uses reinterpret_cast under the hood, so
it and Utf16View::bytes() cannot be constexpr.
2025-07-22 13:33:51 -04:00
Timothy Flynn
42b41431eb AK+LibJS: Enforce limits in Utf16View offset computations
RegExp was the only caller relying on being able to provide an offset
larger than the string length. So let's do a pre-check in RegExp and
then enforce that the offsets we receive in Utf16View are valid.
2025-07-22 17:17:33 +02:00
Timothy Flynn
ad7ac679fd AK: Compute Utf16View::code_point_offset_of correctly
There were a couple of issues here, including the following computation
could actually overflow to NumericLimits<size_t>::max():

    code_unit_offset -= it.length_in_code_units();
2025-07-22 17:17:33 +02:00
Timothy Flynn
0bbb725bcd AK: Mark a couple of methods in Utf16View.h as constexpr 2025-07-22 17:17:33 +02:00
Jelle Raaijmakers
265e278275 AK: Allow indexing at length in Utf8View::byte_offset_of()
And do the same for Utf8View::code_point_offset_of(). Some of these
`VERIFY`s of the view's length were introduced recently, but they caused
the parsing of named capture groups in RegexParser to crash in some
situations.

Instead, allow indexing at the view's length: the byte offset of code
point `length()` is known, even though that code point does not exist in
the view. Similarly, we know the code point offset at byte offset
`byte_length()`. Beyond those offsets, we still crash.

Fixes 13 failures in test262's `language/literals/regexp/named-groups`.
2025-07-22 09:10:32 -04:00
dmaivel
52a23dc02e AK+LibWeb/CSS: Add lower-greek counter style 2025-07-21 15:18:17 +01:00
Jelle Raaijmakers
86dc3ce001 AK: Add dbgln_dump() macro
This turns:

  dbgln_dump(some_expression() + 1);

Into:

  dbgln("some_expression() + 1: {}", (some_expression() + 1));
2025-07-18 14:40:00 -04:00
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Timothy Flynn
d40e3af697 AK: Implement UTF-16 string-to-number conversions 2025-07-18 12:45:38 -04:00
Timothy Flynn
6e0290ecaa AK: Define some UTF-16 helper methods
* contains
* escape_html_entities
* replace
* to_ascii_lowercase
* to_ascii_uppercase
* to_ascii_titlecase
* trim
* trim_whitespace
2025-07-18 12:45:38 -04:00
Timothy Flynn
7f069efbc4 AK: Implement a flyweight string for Utf16String
Utf16FlyString more or less works exactly the same as FlyString. It will
store the raw encoded data of the string instance. If the string is a
short ASCII string, Utf16FlyString holds the ShortString bytes; else,
Utf16FlyString holds a pointer to the Utf16StringData.
2025-07-18 12:45:38 -04:00
Timothy Flynn
2803d66d87 AK: Support UTF-16 string formatting
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.

In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.

Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.

In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
Timothy Flynn
8fbb80fffc AK: Do not fall back to simdutf for UTF-16 ASCII validation
This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This
code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these
bytes are ASCII if interpreted as UTF-8. But the string itself is most
certainly not ASCII.
2025-07-18 12:45:38 -04:00
Aliaksandr Kalenik
6be559f639 AK: Define ConstIterator for SegmentedVector 2025-07-13 19:15:05 +02:00
Shannon Booth
8bd43f2cb9 AK: Add hash traits for Optional<T>
To enable storing it in a hashmap. 13 is a somewhat arbitrary
value, something like 0 is not appropriate since a lot of types
return 0 as a hash for an invalid / empty state.
2025-07-07 20:24:47 +01:00
Andrew Kaster
ad1938086d AK: Add missing find_package command for fast_float
This was missed in 62d9a84b8d
2025-07-07 06:47:06 -04:00
Timothy Flynn
01ebf1eb07 AK: Replace surrogates in String::from_utf8_with_replacement_character
Some checks are pending
CI / macOS, arm64, Sanitizer_CI, Clang (push) Waiting to run
CI / Linux, x86_64, Fuzzers_CI, Clang (push) Waiting to run
CI / Linux, x86_64, Sanitizer_CI, GNU (push) Waiting to run
CI / Linux, x86_64, Sanitizer_CI, Clang (push) Waiting to run
Package the js repl as a binary artifact / Linux, arm64 (push) Waiting to run
Package the js repl as a binary artifact / macOS, arm64 (push) Waiting to run
Package the js repl as a binary artifact / Linux, x86_64 (push) Waiting to run
Run test262 and test-wasm / run_and_update_results (push) Waiting to run
Lint Code / lint (push) Waiting to run
Label PRs with merge conflicts / auto-labeler (push) Waiting to run
Push notes / build (push) Waiting to run
We are expected to replace lonely surrogates with U+FFFD when decoding
UTF-8 text.
2025-07-06 04:30:17 +12:00
Timothy Flynn
51afbf5280 AK: Simplify BOM parsing in String::from_utf8_with_replacement_character 2025-07-06 04:30:17 +12:00
Gingeh
f098bd029c LibTextCodec: Replace unmatched utf16 surrogates 2025-07-05 09:58:57 -04:00
Timothy Flynn
418409aa6f AK: Fix usage of constexpr within Utf16View and related utilities
* Error and ErrorOr are not themelves constexpr, so a function returning
  these types cannot be constexpr.

* The UDL was trying to call Utf16View::validate, which is not constexpr
  itself. The compiler will actually already raise an error if a UTF-16
  literal is invalid, so let's just avoid the call altogether.
2025-07-05 01:25:22 +12:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
Timothy Flynn
69074a3841 AK: Avoid double allocations when converting UTF-16 LE/BE to UTF-8
We can form the UTF-8 string in-place.
2025-07-03 11:45:23 -04:00
Timothy Flynn
62d9a84b8d AK+Everywhere: Replace custom number parsers with fast_float
Some checks failed
CI / macOS, arm64, Sanitizer_CI, Clang (push) Waiting to run
CI / Linux, x86_64, Fuzzers_CI, Clang (push) Waiting to run
CI / Linux, x86_64, Sanitizer_CI, GNU (push) Waiting to run
CI / Linux, x86_64, Sanitizer_CI, Clang (push) Waiting to run
Package the js repl as a binary artifact / Linux, arm64 (push) Waiting to run
Package the js repl as a binary artifact / macOS, arm64 (push) Waiting to run
Package the js repl as a binary artifact / Linux, x86_64 (push) Waiting to run
Run test262 and test-wasm / run_and_update_results (push) Waiting to run
Lint Code / lint (push) Waiting to run
Label PRs with merge conflicts / auto-labeler (push) Waiting to run
Push notes / build (push) Waiting to run
Build Dev Container Image / build (push) Has been cancelled
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float

However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.

We can also use fast_float for integer parsing.
2025-07-03 09:51:56 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This let's us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
c17b067e1d AK: Completely remove endianness from Utf16View APIs
These were mostly removed in 7628ddfaf7.
This removes the few remaining cases, as no callers are providing any
non-host endianness. This is just to prevent weird API dissymmetry
between Utf16View and an upcoming Utf16String.
2025-07-03 09:51:56 -04:00
Timothy Flynn
a0eb47e2fc AK: Add hash traits for Utf16View
This is based on the hash in JS::Utf16StringImpl::compute_hash.
2025-07-03 09:51:56 -04:00