With the newly supported fuzzy matching in our test-web runner, we can
now define the expected maximum color channel and pixel count errors per
failing test and set a baseline they should not exceed.
The figures I added to these tests all come from my macOS M4 machine.
Most discrepancies seem to come from color calculations being slightly
off.
WPT reference tests can add metadata to tests to instruct the test
runner how to interpret the results. Because of this, it is not enough
to have an action that starts loading the (mis)match reference: we need
the test runner to receive the metadata so it can act accordingly.
This sets our test runner up for potentially supporting multiple
(mis)match references, and fuzzy rendering matches - the latter will be
implemented in the following commit.
This produces more granural information on the actual pixel errors
present between two bitmaps. It now also asserts that both bitmaps are
the same size, since we should never compare two differently sized
bitmaps anyway.
Of the available 71 screenshot tests that we have, 42 fail on macOS
arm64. Let's make it possible to skip those and run the remaining
succeeding screenshot tests on arm64 anyway.
When a test has a `reftest-wait` class, we now wait for fonts to load
and then for 2 `requestAnimationFrame` callbacks. This matches the
behavior of the WPT runner and allows more imported WPT tests to work
correctly.
Now that headless mode is built into the main Ladybird executable, the
headless-browser's only purpose is to run tests. So let's move it to the
testing directory and rename it to test-web (a la test-js / test-wasm).