Commit graph

571 commits

Author SHA1 Message Date
Aliaksandr Kalenik
d47a22150d AK: Define operator== for HashMap 2025-07-30 11:06:05 +02:00
Grant Knowlton
9e1e4f3b15 AK: Validate compressed tags in IPv4-mapped IPv6 address
This disallows parsing IPv4 mapped IPv6 address strings with multiple
compression prefixes.  Tests are provided for the updated
functionality.
2025-07-30 00:53:10 +02:00
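A minimal sketch of the kind of check this commit describes, assuming the "compression prefix" is the `::` marker. The helper name `has_at_most_one_compression` is hypothetical and is not AK's actual parser code:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts occurrences of the "::" compression marker
// and accepts the address text only if there is at most one of them.
// This is not a full IPv6 validator.
bool has_at_most_one_compression(std::string_view address)
{
    std::size_t count = 0;
    std::size_t position = 0;
    while ((position = address.find("::", position)) != std::string_view::npos) {
        ++count;
        position += 2; // continue searching after this marker
    }
    return count <= 1;
}
```

With this sketch, `::ffff:192.168.0.1` is accepted and `::ffff::192.168.0.1` is rejected.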
Timothy Flynn
d9502505c2 AK: Fix bounds assertions in Utf16View::iterator_offset 2025-07-28 18:30:50 +02:00
Timothy Flynn
67723ef83c AK: Add a method to peek ahead of a UTF-16 iterator 2025-07-28 18:30:50 +02:00
Timothy Flynn
21d7d236e6 AK: Add a method to check if a UTF-16 string contains any code point 2025-07-28 18:30:50 +02:00
Timothy Flynn
ed63a60247 AK: Return an empty optional when UTF-16 code unit lookup fails
Accidentally returned the wrong type here.
2025-07-28 12:25:11 +02:00
Timothy Flynn
baddac5155 AK: Implement a method to split a UTF-16 string 2025-07-28 12:25:11 +02:00
Timothy Flynn
48a3b2c28e AK: Implement a method to count instances of a needle in a UTF-16 string 2025-07-28 12:25:11 +02:00
Andrew Kaster
7d669b8b0c AK: Update Swift test for Utf16String changes 2025-07-26 23:33:58 +02:00
Timothy Flynn
a740bfd8ff AK+LibUnicode: Implement Unicode-aware UTF-16 case transformations 2025-07-25 18:16:22 +02:00
Timothy Flynn
df77ae1920 AK: Implement creating a UTF-16 string from a repeated code point 2025-07-25 18:16:22 +02:00
Jelle Raaijmakers
0b96690f0c AK: Add HashMap::update()
This updates a HashMap by copying another HashMap's keys and values.
2025-07-25 16:22:06 +02:00
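The behaviour described above can be sketched with `std::unordered_map` as a stand-in for AK's HashMap. The name `update_from` is hypothetical, and overwriting the values of keys that already exist is an assumption about the intended semantics:

```cpp
#include <unordered_map>

// Copies every key/value pair from `source` into `target`.
// Assumption: values for keys already present in `target` are overwritten.
template<typename K, typename V>
void update_from(std::unordered_map<K, V>& target, std::unordered_map<K, V> const& source)
{
    for (auto const& [key, value] : source)
        target.insert_or_assign(key, value);
}
```

Keys that exist only in `target` are left untouched.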
Timothy Flynn
6c73dff120 AK: Implement a UTF-16 method to check if a string is ASCII whitespace 2025-07-24 19:00:20 +02:00
Timothy Flynn
f53389bab1 AK: Add a couple of Utf16String factories
* Utf16String::from_utf8_with_replacement_character
* Utf16String::from_code_point
2025-07-24 19:00:20 +02:00
Jelle Raaijmakers
15178d5230 AK: Add ::ends_with() to Utf16View and Utf16StringBase
I noticed that we can significantly simplify ::starts_with(), and based
the new ::ends_with() on that.
2025-07-24 07:18:25 -04:00
Jelle Raaijmakers
54dd45d3f6 AK: Add Span::ends_with()
Originally I added this to use it in Utf16View::ends_with(), but the
final implementation ended up a lot simpler. I chose to keep this anyway
since it mirrors Span::starts_with().
2025-07-24 07:18:25 -04:00
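As an illustration of a suffix check that mirrors a prefix check, here is a sketch over `std::vector` rather than AK's Span. The free function `ends_with` below is an assumption for illustration, not the AK implementation:

```cpp
#include <algorithm>
#include <vector>

// Returns true if `haystack` ends with the elements of `suffix`.
// Comparing from the back mirrors how a prefix check compares from the front.
template<typename T>
bool ends_with(std::vector<T> const& haystack, std::vector<T> const& suffix)
{
    if (suffix.size() > haystack.size())
        return false;
    return std::equal(suffix.rbegin(), suffix.rend(), haystack.rbegin());
}
```

An empty suffix matches any haystack, which is consistent with the usual behaviour of prefix checks.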
Timothy Flynn
ad7ac679fd AK: Compute Utf16View::code_point_offset_of correctly
There were a couple of issues here, including that the following computation
could actually wrap around to NumericLimits<size_t>::max():

    code_unit_offset -= it.length_in_code_units();
2025-07-22 17:17:33 +02:00
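The hazard with subtracting from an unsigned offset can be shown in isolation. This sketch uses `std::size_t` and hypothetical helper names; it is not the code from Utf16View:

```cpp
#include <cstddef>
#include <optional>

// Unchecked: if `amount` is larger than `offset`, unsigned arithmetic wraps
// around to a very large value instead of going negative.
constexpr std::size_t subtract_unchecked(std::size_t offset, std::size_t amount)
{
    return offset - amount;
}

// Guarded: reports failure instead of wrapping.
constexpr std::optional<std::size_t> subtract_checked(std::size_t offset, std::size_t amount)
{
    if (amount > offset)
        return std::nullopt;
    return offset - amount;
}
```

For example, an offset of 1 minus a surrogate pair's length of 2 wraps to the maximum `size_t` value in the unchecked version.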
Timothy Flynn
f595e47c1f AK: Add unit tests for Utf16View::code_unit_offset_of 2025-07-22 17:17:33 +02:00
Jelle Raaijmakers
265e278275 AK: Allow indexing at length in Utf8View::byte_offset_of()
And do the same for Utf8View::code_point_offset_of(). Some of these
`VERIFY`s of the view's length were introduced recently, but they caused
the parsing of named capture groups in RegexParser to crash in some
situations.

Instead, allow indexing at the view's length: the byte offset of code
point `length()` is known, even though that code point does not exist in
the view. Similarly, we know the code point offset at byte offset
`byte_length()`. Beyond those offsets, we still crash.

Fixes 13 failures in test262's `language/literals/regexp/named-groups`.
2025-07-22 09:10:32 -04:00
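A sketch of the "index at length" idea for a byte-offset lookup, assuming well-formed UTF-8 input. The free function below is hypothetical; unlike the AK code, which crashes beyond the end, it returns an empty optional there:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the byte offset at which code point number `code_point_index`
// starts. An index equal to the number of code points is allowed and yields
// the total byte length. Larger indices yield std::nullopt.
// Assumption: `utf8` is well-formed UTF-8.
std::optional<std::size_t> byte_offset_of(std::string_view utf8, std::size_t code_point_index)
{
    std::size_t byte_offset = 0;
    std::size_t code_points_seen = 0;
    while (byte_offset < utf8.size()) {
        if (code_points_seen == code_point_index)
            return byte_offset;
        // The lead byte determines the length of the encoded code point.
        auto lead = static_cast<unsigned char>(utf8[byte_offset]);
        std::size_t length = 1;
        if (lead >= 0xF0)
            length = 4;
        else if (lead >= 0xE0)
            length = 3;
        else if (lead >= 0xC0)
            length = 2;
        byte_offset += length;
        ++code_points_seen;
    }
    // One past the last code point: the offset is known to be the byte length.
    if (code_points_seen == code_point_index)
        return utf8.size();
    return std::nullopt;
}
```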
Timothy Flynn
9582895759 AK+LibJS+LibWeb+LibRegex: Replace AK::Utf16Data with AK::Utf16String 2025-07-18 12:45:38 -04:00
Timothy Flynn
d40e3af697 AK: Implement UTF-16 string-to-number conversions 2025-07-18 12:45:38 -04:00
Timothy Flynn
6e0290ecaa AK: Define some UTF-16 helper methods
* contains
* escape_html_entities
* replace
* to_ascii_lowercase
* to_ascii_uppercase
* to_ascii_titlecase
* trim
* trim_whitespace
2025-07-18 12:45:38 -04:00
Timothy Flynn
7f069efbc4 AK: Implement a flyweight string for Utf16String
Utf16FlyString works in much the same way as FlyString. It will
store the raw encoded data of the string instance. If the string is a
short ASCII string, Utf16FlyString holds the ShortString bytes; else,
Utf16FlyString holds a pointer to the Utf16StringData.
2025-07-18 12:45:38 -04:00
Timothy Flynn
2803d66d87 AK: Support UTF-16 string formatting
The underlying storage used during string formatting is StringBuilder.
To support UTF-16 strings, this patch allows callers to specify a mode
during StringBuilder construction. The default mode is UTF-8, for which
StringBuilder remains unchanged.

In UTF-16 mode, we treat the StringBuilder's internal ByteBuffer as a
series of u16 code units. Appending a single character will append 2
bytes for that character (cast to a char16_t). Appending a StringView
will transcode the string to UTF-16.

Utf16String also gains the same memory optimization that we added for
String, where we hand-off the underlying buffer to Utf16String to avoid
having to re-allocate.

In the future, we may want to further optimize for ASCII strings. For
example, we could defer committing to the u16-esque storage until we
see a non-ASCII code point.
2025-07-18 12:45:38 -04:00
Timothy Flynn
fe676585f5 AK: Add a UTF-16 string with optimized short- and ASCII-string storage
This is a strictly UTF-16 string with some optimizations for ASCII.

* If created from a short UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an inlined byte buffer.

* If created with a long UTF-8 or UTF-16 string that is also ASCII,
  then the string is stored in an outlined char buffer.

* If created with a short or long UTF-8 or UTF-16 string that is not
  ASCII, then the string is stored in an outlined char16 buffer.

We do not store short non-ASCII text in the inlined buffer to avoid
confusion with operations such as `length_in_code_units` and
`code_unit_at`. For example, "😀" would be stored as 4 UTF-8 bytes
in short string form. But we still want `length_in_code_units` to
be 2, and `code_unit_at(0)` to be 0xD83D.
2025-07-18 12:45:38 -04:00
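The emoji example above can be checked with a small UTF-16 encoder. This is a sketch using standard library types, not Utf16String itself, and `encode_utf16` is a hypothetical helper:

```cpp
#include <string>

// Encodes a single code point as UTF-16 code units.
// Code points above U+FFFF become a high/low surrogate pair.
// Assumption: `code_point` is at most U+10FFFF.
std::u16string encode_utf16(char32_t code_point)
{
    if (code_point < 0x10000)
        return std::u16string(1, static_cast<char16_t>(code_point));

    code_point -= 0x10000;
    std::u16string result;
    result.push_back(static_cast<char16_t>(0xD800 + (code_point >> 10)));
    result.push_back(static_cast<char16_t>(0xDC00 + (code_point & 0x3FF)));
    return result;
}
```

For U+1F600, this yields two code units, the first of which is 0xD83D, even though the same code point takes 4 bytes in UTF-8.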
Timothy Flynn
8fbb80fffc AK: Do not fall back to simdutf for UTF-16 ASCII validation
This was a mistake. Consider U+201C (LEFT DOUBLE QUOTATION MARK). This
code point is encoded as the bytes 0x1c 0x20 in UTF-16LE. Both of these
bytes are ASCII if interpreted as UTF-8. But the string itself is most
certainly not ASCII.
2025-07-18 12:45:38 -04:00
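The U+201C example can be reproduced with two small checks, one over raw bytes and one over 16-bit code units. Both helper names are hypothetical, and neither is the simdutf or AK implementation:

```cpp
#include <cstddef>
#include <string_view>

// Byte-wise check: every byte must be below 0x80.
bool is_ascii_bytes(unsigned char const* bytes, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        if (bytes[i] >= 0x80)
            return false;
    }
    return true;
}

// Code-unit-wise check: every 16-bit code unit must be below 0x80.
bool is_ascii_utf16(std::u16string_view view)
{
    for (char16_t code_unit : view) {
        if (code_unit >= 0x80)
            return false;
    }
    return true;
}
```

The bytes 0x1c 0x20 pass the byte-wise check, while the single code unit 0x201C fails the code-unit check, which is why validating UTF-16 data byte by byte gives the wrong answer.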
Timothy Flynn
01ebf1eb07 AK: Replace surrogates in String::from_utf8_with_replacement_character
We are expected to replace lonely surrogates with U+FFFD when decoding
UTF-8 text.
2025-07-06 04:30:17 +12:00
ayeteadoe
25f5936dee CMake: Rename serenity_* helper functions/macros to ladybird_* 2025-07-03 23:19:41 +02:00
Timothy Flynn
62d9a84b8d AK+Everywhere: Replace custom number parsers with fast_float
Our floating point number parser was based on the fast_float library:
https://github.com/fastfloat/fast_float

However, our implementation only supports 8-bit characters. To support
UTF-16, we will need to be able to convert char16_t-based strings to
numbers as well. This works out-of-the-box with fast_float.

We can also use fast_float for integer parsing.
2025-07-03 09:51:56 -04:00
Timothy Flynn
9fc3e72db2 AK+Everywhere: Allow lonely UTF-16 surrogates by default
By definition, the web allows lonely surrogates by default. Let's have
our string APIs reflect this, so we don't have to pass an allow option
all over the place.
2025-07-03 09:51:56 -04:00
Timothy Flynn
86b1c78c1a AK+Everywhere: Prepare Utf16View for integration with a UTF-16 string
To prepare for an upcoming Utf16String, this migrates Utf16View to store
its data as a char16_t. Most function definitions are moved inline and
made constexpr.

This also adds a UDL to construct a Utf16View from a string literal:

    auto string = u"hello"sv;

This lets us remove the NTTP Utf16View constructor, as we have found
that such constructors bloat binary size quite a bit.
2025-07-03 09:51:56 -04:00
Timothy Flynn
2abc955ca9 AK: Allow treating UTF-16 views with lonely surrogates as valid
Much of the web requires us to allow lonely surrogates in UTF-16 data.
The default behavior to disallow such code units has not been changed
here - that will be changed in an upcoming commit.
2025-07-03 09:51:56 -04:00
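A sketch of a UTF-16 validator with an explicit option for lonely surrogates. The enum and function names are hypothetical and do not reflect AK's actual signatures:

```cpp
#include <cstddef>
#include <string_view>

enum class AllowLonelySurrogates {
    No,
    Yes,
};

constexpr bool is_high_surrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Returns true if `view` is valid UTF-16. A surrogate that is not part of a
// high/low pair is "lonely" and is only accepted when the caller allows it.
bool validate_utf16(std::u16string_view view, AllowLonelySurrogates allow)
{
    for (std::size_t i = 0; i < view.size(); ++i) {
        char16_t unit = view[i];
        if (is_high_surrogate(unit) && i + 1 < view.size() && is_low_surrogate(view[i + 1])) {
            ++i; // Skip the low half of a well-formed pair.
            continue;
        }
        if (is_high_surrogate(unit) || is_low_surrogate(unit)) {
            if (allow == AllowLonelySurrogates::No)
                return false;
        }
    }
    return true;
}
```

A well-formed surrogate pair is valid under both options; only unpaired surrogates are affected by the flag.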
Timothy Flynn
d978a582a0 AK: Add a Utf16View ASCII validator 2025-07-03 09:51:56 -04:00
Timothy Flynn
35a1832d08 Tests/AK: Rename TestUtf16 / TestUtf8 to TestUtf16View / TestUtf8View
These are the files they actually test, so let's rename them to avoid
confusion with an upcoming Utf16String test.
2025-07-03 09:51:56 -04:00
Luke Wilde
31a8004ddb AK: Add the ability to consume specifically by a predicate
This will be used by Content Security Policy to consume the next
character, if it matches a whole range of characters, such as
is_ascii_alpha.
2025-07-01 10:24:24 +12:00
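Consuming by predicate can be sketched with a tiny lexer. The class `MiniLexer` and its `consume_if` method are hypothetical names for illustration and are not the API added by this commit:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical minimal lexer over a string view.
class MiniLexer {
public:
    explicit MiniLexer(std::string_view input)
        : m_input(input)
    {
    }

    // Consumes and returns the next character only if `predicate` accepts it.
    // Otherwise the position is left unchanged and an empty optional is returned.
    template<typename Predicate>
    std::optional<char> consume_if(Predicate predicate)
    {
        if (m_position >= m_input.size())
            return std::nullopt;
        char next = m_input[m_position];
        if (!predicate(next))
            return std::nullopt;
        ++m_position;
        return next;
    }

    std::size_t position() const { return m_position; }

private:
    std::string_view m_input;
    std::size_t m_position { 0 };
};
```

A caller can pass any character-class test, such as an ASCII-alpha check, without the lexer needing a dedicated method per class.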
Tomasz Strejczek
8f8e51b1fc AK: Implement AK::UnixDateTime::to_string()
Copy the implementation of LibCore::DateTime::to_string() to AK.
Rename TestDuration.cpp to TestTime.cpp and add tests for
to_string() there.
2025-06-19 18:42:45 -06:00
Tomasz Strejczek
e03c558a0a AK: Implement demangle() for MSVC ABI
This implements demangle() using the Windows API. A rudimentary test
is also provided.
2025-06-17 18:39:18 -06:00
Sam Atkins
26105b8b11 AK: Add a Formatter for Checked
This goes in Format.h instead of Checked.h, to avoid an include cycle.
2025-06-17 20:44:01 +02:00
Jelle Raaijmakers
6f926e6977 AK: Add Utf8View::code_point_offset_of() 2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
cc0a28ee7d AK: Add Utf16View::find_code_unit_offset(_ignoring_case) 2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
7d7f6fa494 AK: Remove superfluous check from Utf16View::equals_ignoring_case()
Returning true if both lengths are 0 is already handled by the default
case.
2025-06-13 15:08:26 +02:00
Jelle Raaijmakers
b558b4dba6 AK: Add Span<T>::index_of(ReadonlySpan)
This will be used for case-sensitive substring index matches in a later
commit.
2025-06-13 15:08:26 +02:00
ayeteadoe
8cf01a25c2 AK: Add initial support for AK testsuite on Windows
We now explicitly enable support for the minimum libraries needed
to build and run the AK testsuite. 81/82 tests are running and
passing. The exception is LexicalPath: some path behaviour on
Windows differs from Unix, so the current tests will have lots of
platform-specific failures. The implementer of LexicalPathWindows
recommended Windows-specific tests here, so I will do that in a
follow-up.
2025-05-20 10:58:43 -06:00
Ashton
5f5ae6bf8b AK: Replace wchar_t formatting with char32_t
This makes TestFormat fully cross-platform, as we no longer have to
work around the 16- vs 32-bit wide strings.
2025-05-18 19:18:13 -06:00
Ashton
4b3a3b0856 AK: Remove redundant TestPrint test
This test was only useful when AK/PrintfImplementation.h existed. But
that was removed 11 months ago, so since then this has just been
testing std library functions not implemented by us.
2025-05-18 19:18:13 -06:00
Andreas Kling
734bc2a0ea AK: Strip trailing zero decimals in default formatting of float numbers
This gives us a more human-looking serialization of numbers by default,
and in case a fixed number of decimal digits is actually wanted, we
still have the 'f' specifier.
2025-05-18 17:23:34 +02:00
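Stripping trailing zero decimals can be sketched on top of `snprintf`. The function name and the choice of 6 decimal digits as the starting precision are assumptions for illustration, not AK's formatter:

```cpp
#include <cstdio>
#include <string>

// Formats `value` with a fixed number of decimals, then strips trailing
// zeros and, if nothing remains after it, the decimal point as well.
// Assumption: 6 decimal digits as the starting precision.
std::string format_stripping_trailing_zeros(double value)
{
    char buffer[512];
    std::snprintf(buffer, sizeof(buffer), "%.6f", value);
    std::string result(buffer);
    if (result.find('.') != std::string::npos) {
        while (result.back() == '0')
            result.pop_back();
        if (result.back() == '.')
            result.pop_back();
    }
    return result;
}
```

With this sketch, 1.5 is rendered as `1.5` rather than `1.500000`, and 2.0 as `2`.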
ayeteadoe
744fd91d0b LibTest: Support death tests without child process cloning
A challenge for getting LibTest working on Windows has always
been CrashTest. It implements death tests similar to Google Test
where a child process is cloned to invoke the expression that
should abort/terminate the program. Then the exit code of the
child is used by the parent test process to verify if the
application correctly aborted/terminated due to invoking
the expression.

The problem was that finding an equivalent way to port Crash::run()
to Windows was not looking very likely as publicly exposed Win32/
Native APIs have no equivalent to fork(); however, Windows actually
does have native support for process cloning via undocumented NT
APIs that clever people reverse engineered and published, see
`NtCreateUserProcess()`.

All that being said, this `EXPECT_DEATH()` implementation avoids
needing to use a child process in general, allowing us to remove
CrashTest in favour of a single cross-platform solution for death
tests.
2025-05-16 13:23:32 -06:00
Andreas Kling
cf6e2531d9 AK: Make String::number() much faster for integer types
Instead of going through String::formatted(), we now have a specialized
code path for base-10 serialization directly to UTF-8.

This is roughly 5-10x faster than the previous implementation, depending
on how many digits we end up outputting.

1.07x speedup on MicroBench/for-in-indexed-properties.js
2025-05-02 19:13:03 +02:00
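A specialized base-10 path can look roughly like the following sketch, which writes digits straight into a small stack buffer instead of going through a general formatter. The function name is hypothetical and this is not the code from the commit:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Serializes a signed 64-bit integer to decimal ASCII (which is also valid
// UTF-8) by filling a fixed-size buffer from the back.
std::string number_to_string(std::int64_t value)
{
    // 19 digits plus an optional sign is the longest possible output.
    char buffer[20];
    std::size_t position = sizeof(buffer);

    bool negative = value < 0;
    // Work on the magnitude as unsigned so the minimum value does not overflow.
    std::uint64_t magnitude = negative
        ? 0 - static_cast<std::uint64_t>(value)
        : static_cast<std::uint64_t>(value);

    do {
        buffer[--position] = static_cast<char>('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);

    if (negative)
        buffer[--position] = '-';

    return std::string(buffer + position, sizeof(buffer) - position);
}
```

The loop runs once per output digit, which matches the observation above that the cost depends on how many digits are produced.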
Tim Ledbetter
31dea89fe0 AK: Add lowest common multiple and greatest common divisor functions 2025-04-23 09:13:45 +01:00
Jonne Ransijn
bb20a0d8f8 AK: Allow the Optional<T> move assignment operator to be trivial
This will change behaviour for moved-from `Optional<T>`s, since they
will now no longer clear their value if `T` is trivial. However, a
moved-from value should be considered to be in an unspecified state.
Use `Optional<T>::clear` or `Optional<T>::release_value` instead.
2025-04-22 21:19:31 -06:00
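The same caveat exists for `std::optional`, which serves as a stand-in here for AK's `Optional<T>`. The sketch below shows why a moved-from optional of a trivial type should not be assumed to be empty; the function names are hypothetical:

```cpp
#include <optional>
#include <utility>

// Returns whether the source still reports a value after being moved from.
// For std::optional, moving does not disengage the source.
bool source_still_engaged_after_move()
{
    std::optional<int> source = 42;
    std::optional<int> destination;
    destination = std::move(source);
    return source.has_value();
}

// Explicitly takes the value and leaves the source empty, which is the
// pattern to use when the emptied state matters.
std::optional<int> take_and_clear(std::optional<int>& source)
{
    return std::exchange(source, std::nullopt);
}
```

Code that needs the source to be empty afterwards should clear it explicitly, as in `take_and_clear`, rather than rely on the move.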