LibWeb: Fix numeric character reference at EOF leaking its last digit

Previously, if the NumericCharacterReferenceEnd state was reached when current_input_character was None, then the DONT_CONSUME_NEXT_INPUT_CHARACTER macro would restore back before the EOF, and allow the next state (after the SWITCH_TO_RETURN_STATE) to proceed with the last digit of the numeric character reference.

For example, with something like `&#1111`, before this commit the output would incorrectly be `<code point with the value 1111>1` instead of just `<code point with the value 1111>`.

Instead of putting the `if (current_input_character.has_value())` check inside NumericCharacterReferenceEnd directly, it was instead added to DONT_CONSUME_NEXT_INPUT_CHARACTER, because all usages of the macro benefit from this check, even if the other existing usage sites don't exhibit any bugs without it:

- In MarkupDeclarationOpen, if the current_input_character is EOF, then the previous character is always `!`, so restoring and then checking forward for strings like `--`, `DOCTYPE`, etc. won't match and the BogusComment state will run one extra time (once for `!` and once for EOF) with no practical consequences. With the `has_value()` check, BogusComment will only run once with EOF.
- In AfterDOCTYPEName, ConsumeNextResult::RanOutOfCharacters can only occur when stopping at the insertion point, and because of how the code is structured, it is guaranteed that current_input_character is either `P` or `S`, so the `has_value()` check is irrelevant.
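
The leak described above can be reproduced with a small stand-alone model. This is a rough sketch and not the LibWeb code: `Cursor`, `tokenize`, and `guard_restore` are hypothetical names, and the model assumes that hitting EOF leaves the "previous position" pointing before the last character that was actually consumed, which is what makes an unconditional restore step back over the final digit.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Toy cursor (hypothetical, not HTMLTokenizer). Assumption: at EOF nothing is
// consumed, so `prev` still points before the last character that was read.
struct Cursor {
    std::string_view input;
    std::size_t pos = 0;
    std::size_t prev = 0;

    std::optional<char> next()
    {
        if (pos >= input.size())
            return std::nullopt; // EOF: `prev` is left untouched
        prev = pos;
        return input[pos++];
    }
};

// Reads "&#<decimal digits>" and returns "<value>" followed by whatever the
// follow-up state would still see in the input.
std::string tokenize(std::string_view input, bool guard_restore)
{
    assert(input.substr(0, 2) == "&#");
    Cursor cursor { input, 2, 2 };

    unsigned value = 0;
    std::optional<char> current;
    while ((current = cursor.next()) && *current >= '0' && *current <= '9')
        value = value * 10 + static_cast<unsigned>(*current - '0');

    // "Don't consume the next input character": hand the non-digit back.
    // Without the guard, this also runs at EOF and rewinds over the last digit.
    if (!guard_restore || current.has_value())
        cursor.pos = cursor.prev;

    std::string out = "<" + std::to_string(value) + ">";
    // Stand-in for the return state: pass the remaining input through as text.
    while (auto c = cursor.next())
        out += *c;
    return out;
}
```

In this model, `tokenize("&#1111", false)` yields `<1111>1` (the trailing digit leaks), while `tokenize("&#1111", true)` yields `<1111>`. Input that does not end at EOF, such as `&#65x`, yields `<65>x` either way, which mirrors why the other call sites are unaffected by the added check.
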
Author: https://github.com/squeek502
Commit: df87a9689c
Pull-request: https://github.com/LadybirdBrowser/ladybird/pull/3163
Reviewed-by: https://github.com/gmta ✅
2025-10-18 22:19:50 +00:00 · 2024-12-20 06:05:37 -08:00 · 2024-12-20 06:05:37 -08:00 · df87a9689c · 2025-01-06 23:44:49 +00:00
commit df87a9689c
parent 752deaf6ef
3 changed files with 15 additions and 6 deletions
--- a/Libraries/LibWeb/HTML/Parser/HTMLTokenizer.cpp
+++ b/Libraries/LibWeb/HTML/Parser/HTMLTokenizer.cpp
@@ -94,9 +94,10 @@ namespace Web::HTML {
        }                                                        \
    } while (0)

-#define DONT_CONSUME_NEXT_INPUT_CHARACTER \
-    do {                                  \
-        restore_to(m_prev_utf8_iterator); \
+#define DONT_CONSUME_NEXT_INPUT_CHARACTER        \
+    do {                                         \
+        if (current_input_character.has_value()) \
+            restore_to(m_prev_utf8_iterator);    \
    } while (0)

 #define ON(code_point) \