/*
 * Copyright (c) 2024, the SerenityOS developers.
 *
 * SPDX-License-Identifier: BSD-2-Clause
 */

#include "GeneratorUtil.h"
#include <AK/Array.h>
#include <AK/CharacterTypes.h>
#include <AK/FixedArray.h>
#include <AK/SourceGenerator.h>
#include <AK/StringBuilder.h>
#include <LibCore/ArgsParser.h>
#include <LibMain/Main.h>

// The goal is to encode the necessary data compactly while still allowing for fast matching of
// named character references, and taking full advantage of the note in the spec[1] that:
//
// > This list [of named character references] is static and will not be expanded or changed in the future.
//
// An overview of the approach taken (see [2] for more background/context):
//
// First, a deterministic acyclic finite state automaton (DAFSA) [3] is constructed from the set of
// named character references. The nodes in the DAFSA are populated with a "number" field that
// represents the count of all possible valid words from that node. This "number" field allows for
// minimal perfect hashing, where each word in the set corresponds to a unique index. The unique
// index of a word in the set is calculated during traversal/search of the DAFSA:
// - For any non-matching node that is iterated over when searching a list of children, add its
//   number to the unique index
// - For nodes that match the current character, if the node is a valid end-of-word, add 1 to the
//   unique index
// Note that "searching a list of children" is assumed to use a linear scan, so, for example, if
// a list of children contained 'a', 'b', 'c', and 'd' (in that order), and the character 'c' was
// being searched for, then the "number" of both 'a' and 'b' would get added to the unique index,
// and then 1 would be added after matching 'c' (this minimal perfect hashing strategy comes from
// [4]; a worked example is given at the end of this comment block).
//
// Something worth noting is that a DAFSA can be used with the set of named character references
// (with minimal perfect hashing) while keeping the nodes of the DAFSA <= 32 bits. This is a property
// that really matters, since any increase over 32 bits would immediately double the size of the data
// due to padding bits when storing the nodes in a contiguous array.
//
// There are also a few modifications made to the DAFSA to increase performance:
// - The 'first layer' of nodes is extracted out and replaced with a lookup table. This turns
//   the search for the first character from O(n) to O(1), and doesn't increase the data size because
//   all first characters in the set of named character references have the values 'a'-'z'/'A'-'Z',
//   so a lookup array of exactly 52 elements can be used. The lookup table stores the cumulative
//   "number" fields that would be calculated by a linear scan that matches a given node, thus allowing
//   the unique index to be built up as normal with an O(1) search instead of a linear scan.
// - The 'second layer' of nodes is also extracted out, and searches of the second layer are done
//   using a bit field of 52 bits (the set bits of the bit field depend on the first character's value),
//   where each set bit corresponds to one of 'a'-'z'/'A'-'Z' (similar to the first layer, the second
//   layer can only contain ASCII alphabetic characters). The bit field is then re-used (along with
//   an offset) to get the index into the array of second layer nodes. This technique ultimately
//   allows for storing the minimum number of nodes in the second layer, and therefore only increases
//   the size of the data by the size of the 'first to second layer link' info, which is 52 * 8 = 416 bytes.
// - After the second layer, the rest of the data is stored using a mostly-normal DAFSA, but there
//   are still a few differences:
//   - The "number" field is cumulative, in the same way that the first/second layer store a
//     cumulative "number" field. This cuts down slightly on the amount of work done during
//     the search of a list of children, and we can get away with it because the cumulative
//     "number" fields of the remaining nodes in the DAFSA (after the first and second layer
//     nodes were extracted out) happen to require few enough bits that we can store the
//     cumulative version while staying under our 32-bit budget.
//   - Instead of storing a 'last sibling' flag to denote the end of a list of children, the
//     length of each node's list of children is stored. Again, this is mostly done just because
//     there are enough bits available to do so while keeping the DAFSA node within 32 bits.
//   - Note: Together, these modifications open up the possibility of using a binary search instead
//     of a linear search over the children, but due to the consistently small lengths of the lists
//     of children in the remaining DAFSA, a linear search actually seems to be the better option.
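//
// As a worked example of the hashing scheme (illustration only): suppose a node's children are
// 'a' (number 3), 'b' (number 2), 'c' (number 4), and 'd' (number 1), and 'c' is being matched.
// The classic scheme skips 'a' and 'b' while adding 3 + 2 to the unique index, then adds 1 more
// if the matched 'c' node is a valid end-of-word. With the cumulative scheme used here, each
// child instead stores the sum of its previous siblings' numbers ('a' stores 0, 'b' stores 3,
// 'c' stores 5, 'd' stores 9), so matching 'c' adds 5 in one step; the scan still compares
// characters against 'a' and 'b', but no longer needs to sum their numbers along the way.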
//
// [1]: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
// [2]: https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/
// [3]: https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
// [4]: Applications of finite automata representing large vocabularies (Cláudio L. Lucchesi,
//      Tomasz Kowaltowski, 1993) https://doi.org/10.1002/spe.4380230103

ErrorOr<void> generate_header_file(Core::File& file);
ErrorOr<void> generate_implementation_file(JsonObject& named_character_reference_data, Core::File& file);

ErrorOr<int> ladybird_main(Main::Arguments arguments)
{
    StringView generated_header_path;
    StringView generated_implementation_path;
    StringView json_path;

    Core::ArgsParser args_parser;
    args_parser.add_option(generated_header_path, "Path to the Entities header file to generate", "generated-header-path", 'h', "generated-header-path");
    args_parser.add_option(generated_implementation_path, "Path to the Entities implementation file to generate", "generated-implementation-path", 'c', "generated-implementation-path");
    args_parser.add_option(json_path, "Path to the JSON file to read from", "json-path", 'j', "json-path");
    args_parser.parse(arguments);

    auto json = TRY(read_entire_file_as_json(json_path));
    VERIFY(json.is_object());
    auto named_character_reference_data = json.as_object();

    auto generated_header_file = TRY(Core::File::open(generated_header_path, Core::File::OpenMode::Write));
    auto generated_implementation_file = TRY(Core::File::open(generated_implementation_path, Core::File::OpenMode::Write));

    TRY(generate_header_file(*generated_header_file));
    TRY(generate_implementation_file(named_character_reference_data, *generated_implementation_file));

    return 0;
}

struct Codepoints {
    u32 first;
    u32 second;
};

inline static StringView get_second_codepoint_enum_name(u32 codepoint)
{
    switch (codepoint) {
    case 0x0338:
        return "CombiningLongSolidusOverlay"sv;
    case 0x20D2:
        return "CombiningLongVerticalLineOverlay"sv;
    case 0x200A:
        return "HairSpace"sv;
    case 0x0333:
        return "CombiningDoubleLowLine"sv;
    case 0x20E5:
        return "CombiningReverseSolidusOverlay"sv;
    case 0xFE00:
        return "VariationSelector1"sv;
    case 0x006A:
        return "LatinSmallLetterJ"sv;
    case 0x0331:
        return "CombiningMacronBelow"sv;
    default:
        return "None"sv;
    }
}

ErrorOr<void> generate_header_file(Core::File& file)
{
    StringBuilder builder;
    SourceGenerator generator { builder };
    generator.append(R"~~~(
#pragma once

#include <AK/Optional.h>
#include <AK/Types.h>

namespace Web::HTML {

// Uses u32 to match the `first` field of NamedCharacterReferenceCodepoints for bit-field packing purposes.
enum class NamedCharacterReferenceSecondCodepoint : u32 {
    None,
    CombiningLongSolidusOverlay, // U+0338
    CombiningLongVerticalLineOverlay, // U+20D2
    HairSpace, // U+200A
    CombiningDoubleLowLine, // U+0333
    CombiningReverseSolidusOverlay, // U+20E5
    VariationSelector1, // U+FE00
    LatinSmallLetterJ, // U+006A
    CombiningMacronBelow, // U+0331
};

inline Optional<u16> named_character_reference_second_codepoint_value(NamedCharacterReferenceSecondCodepoint codepoint)
{
    switch (codepoint) {
    case NamedCharacterReferenceSecondCodepoint::None:
        return {};
    case NamedCharacterReferenceSecondCodepoint::CombiningLongSolidusOverlay:
        return 0x0338;
    case NamedCharacterReferenceSecondCodepoint::CombiningLongVerticalLineOverlay:
        return 0x20D2;
    case NamedCharacterReferenceSecondCodepoint::HairSpace:
        return 0x200A;
    case NamedCharacterReferenceSecondCodepoint::CombiningDoubleLowLine:
        return 0x0333;
    case NamedCharacterReferenceSecondCodepoint::CombiningReverseSolidusOverlay:
        return 0x20E5;
    case NamedCharacterReferenceSecondCodepoint::VariationSelector1:
        return 0xFE00;
    case NamedCharacterReferenceSecondCodepoint::LatinSmallLetterJ:
        return 0x006A;
    case NamedCharacterReferenceSecondCodepoint::CombiningMacronBelow:
        return 0x0331;
    default:
        VERIFY_NOT_REACHED();
    }
}

// Note: The first codepoint could fit in 17 bits, and the second could fit in 4 (if unsigned).
// However, to get any benefit from minimizing the struct size, it would need to be accompanied by
// bit-packing the g_named_character_reference_codepoints_lookup array.
struct NamedCharacterReferenceCodepoints {
    u32 first : 24; // Largest value is U+1D56B
    NamedCharacterReferenceSecondCodepoint second : 8;
};
static_assert(sizeof(NamedCharacterReferenceCodepoints) == 4);

struct NamedCharacterReferenceFirstLayerNode {
    // Really only needs 12 bits.
    u16 number;
};
static_assert(sizeof(NamedCharacterReferenceFirstLayerNode) == 2);

struct NamedCharacterReferenceFirstToSecondLayerLink {
    u64 mask : 52;
    u64 second_layer_offset : 12;
};
static_assert(sizeof(NamedCharacterReferenceFirstToSecondLayerLink) == 8);
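
// For illustration only (a sketch, not used by the generated code): given `first` and `second`
// already mapped to 0-51 indexes ('A'-'Z' then 'a'-'z'), a consumer of this data could locate a
// second layer node roughly like this, using a population count on the link's bit field:
//
//     auto link = g_named_character_reference_first_to_second_layer[first];
//     u64 bit = 1ull << second;
//     if ((link.mask & bit) == 0)
//         return {}; // No named character reference starts with these two characters.
//     // The number of set bits below `bit` is the node's position within this first
//     // character's slice of the second layer array.
//     auto index = link.second_layer_offset + popcount(link.mask & (bit - 1));
//     auto& node = g_named_character_reference_second_layer[index];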

// Note: It is possible to fit this information within 24 bits, which could then allow for tightly
// bit-packing the second layer array. This would reduce the size of the array by 630 bytes.
struct NamedCharacterReferenceSecondLayerNode {
    // Could be 10 bits
    u16 child_index;
    u8 number;
    // Could be 4 bits
    u8 children_len : 7;
    bool end_of_word : 1;
};
static_assert(sizeof(NamedCharacterReferenceSecondLayerNode) == 4);

struct NamedCharacterReferenceNode {
    // The actual alphabet of characters used in the list of named character references only
    // includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z').
    u8 character;
    // Typically, nodes are numbered with "an integer which gives the number of words that
    // would be accepted by the automaton starting from that state." This numbering
    // allows calculating "a one-to-one correspondence between the integers 1 to L
    // (L is the number of words accepted by the automaton) and the words themselves."
    //
    // This allows us to have a minimal perfect hashing scheme such that it's possible to store
    // and look up the codepoint transformations of each named character reference using a
    // separate array.
    //
    // This uses that idea, but instead of storing a per-node number that gets built up while
    // searching a list of children, the cumulative number that would result from adding together
    // the numbers of all the previous sibling nodes is stored instead. This cuts down on a bit
    // of work done while searching while keeping the minimal perfect hashing strategy intact.
    //
    // Empirically, the largest number in our DAFSA is 51, so all number values could fit in a u6.
    u8 number : 7;
    bool end_of_word : 1;
    // Index of the first child of this node.
    // There are 3190 nodes in our DAFSA after the first and second layers were extracted out, so
    // all indexes can fit in a u12 (there would be 3872 nodes with the first/second layers
    // included, so still a u12).
    u16 child_index : 12;
    u16 children_len : 4;
};
static_assert(sizeof(NamedCharacterReferenceNode) == 4);
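
// For illustration only (a sketch, not used by the generated code): with the cumulative "number"
// scheme described above, searching a node's list of children for a character `c` looks roughly
// like this:
//
//     auto const* children = &g_named_character_reference_nodes[node.child_index];
//     for (u16 i = 0; i < node.children_len; i++) {
//         if (children[i].character != c)
//             continue;
//         // children[i].number already equals the sum of the "number" fields of all previous
//         // siblings, and a matched end-of-word contributes 1 to the unique index.
//         unique_index += children[i].number + children[i].end_of_word;
//         node = children[i];
//         break;
//     }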

extern NamedCharacterReferenceNode g_named_character_reference_nodes[];
extern NamedCharacterReferenceFirstLayerNode g_named_character_reference_first_layer[];
extern NamedCharacterReferenceFirstToSecondLayerLink g_named_character_reference_first_to_second_layer[];
extern NamedCharacterReferenceSecondLayerNode g_named_character_reference_second_layer[];

Optional<NamedCharacterReferenceCodepoints> named_character_reference_codepoints_from_unique_index(u16 unique_index);

} // namespace Web::HTML

)~~~");

    TRY(file.write_until_depleted(generator.as_string_view().bytes()));
    return {};
}

// Maps 'A'-'Z' to indexes 0-25 and 'a'-'z' to indexes 26-51.
static u8 ascii_alphabetic_to_index(u8 c)
{
    VERIFY(AK::is_ascii_alpha(c));
    return c <= 'Z' ? (c - 'A') : (c - 'a' + 26);
}

class Node final : public RefCounted<Node> {
private:
    // Two nodes hash/compare as equal if they have the same terminality and identical children;
    // this is what allows equivalent suffixes to be shared during minimization.
    struct NonnullRefPtrNodeTraits {
        static unsigned hash(NonnullRefPtr<Node> const& node)
        {
            u32 hash = 0;
            for (int i = 0; i < 128; i++) {
                hash ^= ptr_hash(node->m_children[i].ptr());
            }
            hash ^= int_hash(static_cast<u32>(node->m_is_terminal));
            return hash;
        }
        static bool equals(NonnullRefPtr<Node> const& a, NonnullRefPtr<Node> const& b)
        {
            if (a->m_is_terminal != b->m_is_terminal)
                return false;
            for (int i = 0; i < 128; i++) {
                if (a->m_children[i] != b->m_children[i])
                    return false;
            }
            return true;
        }
    };

public:
    static NonnullRefPtr<Node> create()
    {
        return adopt_ref(*new (nothrow) Node());
    }

    using NodeTableType = HashTable<NonnullRefPtr<Node>, NonnullRefPtrNodeTraits, false>;

    // Recursively computes each node's "number": 1 if the node is a valid end-of-word, plus the
    // sum of the numbers of all of its children.
    void calc_numbers()
    {
        m_number = static_cast<u16>(m_is_terminal);
        for (int i = 0; i < 128; i++) {
            if (m_children[i] == nullptr)
                continue;
            m_children[i]->calc_numbers();
            m_number += m_children[i]->m_number;
        }
    }

    u8 num_direct_children()
    {
        u8 num = 0;
        for (int i = 0; i < 128; i++) {
            if (m_children[i] != nullptr)
                num += 1;
        }
        return num;
    }

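    // Builds the 52-bit mask described at the top of this file, with one bit per ASCII
    // alphabetic child. For example (illustration only): children 'B' and 'a' would set
    // bits 1 and 26, matching ascii_alphabetic_to_index().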
    u64 get_ascii_alphabetic_bit_mask()
    {
        u64 mask = 0;
        for (int i = 0; i < 128; i++) {
            if (m_children[i] == nullptr)
                continue;
            mask |= ((u64)1) << ascii_alphabetic_to_index(i);
        }
        return mask;
    }

    Array<RefPtr<Node>, 128>& children() { return m_children; }

    void set_as_terminal() { m_is_terminal = true; }

    bool is_terminal() const { return m_is_terminal; }

    u16 number() const { return m_number; }

private:
    Node() = default;

    Array<RefPtr<Node>, 128> m_children { 0 };
    bool m_is_terminal { false };
    u16 m_number { 0 };
};

struct UncheckedNode {
    RefPtr<Node> parent;
    char character;
    RefPtr<Node> child;
};

class DafsaBuilder {
    AK_MAKE_NONCOPYABLE(DafsaBuilder);

public:
    using MappingType = HashMap<StringView, String>;

    DafsaBuilder()
        : m_root(Node::create())
    {
    }

    void insert(StringView str)
    {
        // Must be inserted in sorted order
        VERIFY(str > m_previous_word);

        size_t common_prefix_len = 0;
        for (size_t i = 0; i < min(str.length(), m_previous_word.length()); i++) {
            if (str[i] != m_previous_word[i])
                break;
            common_prefix_len++;
        }

        minimize(common_prefix_len);

        RefPtr<Node> node;
        if (m_unchecked_nodes.size() == 0)
            node = m_root;
        else
            node = m_unchecked_nodes.last().child;

        auto remaining = str.substring_view(common_prefix_len);
        for (char const c : remaining) {
            VERIFY(node->children().at(c) == nullptr);

            auto child = Node::create();
            node->children().at(c) = child;
            m_unchecked_nodes.append(UncheckedNode { node, c, child });
            node = child;
        }
        node->set_as_terminal();

        bool fits = str.copy_characters_to_buffer(m_previous_word_buf, sizeof(m_previous_word_buf));
        // It's guaranteed that m_previous_word_buf is large enough to hold the longest named character reference
        VERIFY(fits);
        m_previous_word = StringView(m_previous_word_buf, str.length());
    }

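    // Standard incremental DAFSA construction: starting from the deepest unchecked node, each
    // node is either replaced by an equivalent node that has already been minimized (same
    // terminality and identical children, see NonnullRefPtrNodeTraits) or registered as a new
    // minimized node. Stops once only `down_to` unchecked nodes remain, so the common prefix
    // shared with the next inserted word is left intact.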
    void minimize(size_t down_to)
    {
        if (m_unchecked_nodes.size() == 0)
            return;
        while (m_unchecked_nodes.size() > down_to) {
            auto unchecked_node = m_unchecked_nodes.take_last();
            auto child = unchecked_node.child.release_nonnull();
            auto it = m_minimized_nodes.find(child);
            if (it != m_minimized_nodes.end()) {
                unchecked_node.parent->children().at(unchecked_node.character) = *it;
            } else {
                m_minimized_nodes.set(child);
            }
        }
    }

    void calc_numbers()
    {
        m_root->calc_numbers();
    }

    Optional<size_t> get_unique_index(StringView str)
    {
        size_t index = 0;
        Node* node = m_root.ptr();

        for (char const c : str) {
            if (node->children().at(c) == nullptr)
                return {};
            for (int sibling_c = 0; sibling_c < 128; sibling_c++) {
                if (node->children().at(sibling_c) == nullptr)
                    continue;
                if (sibling_c < c) {
                    index += node->children().at(sibling_c)->number();
                }
            }
            node = node->children().at(c);
            if (node->is_terminal())
                index += 1;
        }

        return index;
    }

    NonnullRefPtr<Node> root()
    {
        return m_root;
    }

private:
    NonnullRefPtr<Node> m_root;
    Node::NodeTableType m_minimized_nodes;
    Vector<UncheckedNode> m_unchecked_nodes;
    char m_previous_word_buf[64];
    StringView m_previous_word = { m_previous_word_buf, 0 };
};

struct NodeData {
    u8 character;
    u8 number;
    bool end_of_word;
    u16 child_index;
    u8 children_len;
};

// Assigns a child_index to (and enqueues) any not-yet-seen children of `node`, returning the
// first index that is still available afterwards. Children without children of their own are
// still enqueued but get no index entry (their child_index falls back to 0 when written).
static u16 queue_children(NonnullRefPtr<Node> const& node, Vector<NonnullRefPtr<Node>>& queue, HashMap<Node*, u16>& child_indexes, u16 first_available_index)
{
    auto current_available_index = first_available_index;
    for (u8 c = 0; c < 128; c++) {
        if (node->children().at(c) == nullptr)
            continue;
        auto child = NonnullRefPtr(*node->children().at(c));

        if (!child_indexes.contains(child.ptr())) {
            auto child_num_children = child->num_direct_children();
            if (child_num_children > 0) {
                child_indexes.set(child, current_available_index);
                current_available_index += child_num_children;
            }
            queue.append(child);
        }
    }
    return current_available_index;
}

// Writes the NodeData for each of `node`'s children (in character order) and, like
// queue_children, assigns indexes to and enqueues any not-yet-seen children.
static u16 write_children_data(NonnullRefPtr<Node> const& node, Vector<NodeData>& node_data, Vector<NonnullRefPtr<Node>>& queue, HashMap<Node*, u16>& child_indexes, u16 first_available_index)
{
    auto current_available_index = first_available_index;
    u8 unique_index_tally = 0;
    for (u8 c = 0; c < 128; c++) {
        if (node->children().at(c) == nullptr)
            continue;
        auto child = NonnullRefPtr(*node->children().at(c));
        auto child_num_children = child->num_direct_children();

        if (!child_indexes.contains(child.ptr())) {
            if (child_num_children > 0) {
                child_indexes.set(child, current_available_index);
                current_available_index += child_num_children;
            }
            queue.append(child);
        }

        // The cumulative "number" stored for each child is the sum of the numbers of all of
        // its previous siblings.
        node_data.append({ c, unique_index_tally, child->is_terminal(), child_indexes.get(child).value_or(0), child_num_children });

        unique_index_tally += child->number();
    }
    return current_available_index;
}

// Does not include the root node
static void write_node_data(DafsaBuilder& dafsa_builder, Vector<NodeData>& node_data, HashMap<Node*, u16>& child_indexes)
{
    Vector<NonnullRefPtr<Node>> queue;

    u16 first_available_index = 1;
    first_available_index = queue_children(dafsa_builder.root(), queue, child_indexes, first_available_index);

    // The pass above reserved indexes for the second layer nodes, but those are written out as
    // a separate array, so discard the reservations and start over: this pass reserves indexes
    // for the children of the second layer nodes, i.e. the nodes that remain in the main DAFSA
    // array, which the loop below then writes out in the same order.
    child_indexes.clear_with_capacity();
    first_available_index = 1;
    auto second_layer_length = queue.size();
    for (size_t i = 0; i < second_layer_length; i++) {
        auto node = queue.take_first();
        first_available_index = queue_children(node, queue, child_indexes, first_available_index);
    }

    while (queue.size() > 0) {
        auto node = queue.take_first();
        first_available_index = write_children_data(node, node_data, queue, child_indexes, first_available_index);
    }
}

ErrorOr<void> generate_implementation_file(JsonObject& named_character_reference_data, Core::File& file)
{
    StringBuilder builder;
    SourceGenerator generator { builder };
    DafsaBuilder dafsa_builder;

    named_character_reference_data.for_each_member([&](auto& key, auto&) {
        // Skip the leading '&' of each named character reference.
        dafsa_builder.insert(key.bytes_as_string_view().substring_view(1));
    });
    dafsa_builder.minimize(0);
    dafsa_builder.calc_numbers();

    // As a sanity check, confirm that the minimal perfect hashing doesn't
    // have any collisions
    {
        HashTable<size_t> index_set;

        named_character_reference_data.for_each_member([&](auto& key, auto&) {
            auto index = dafsa_builder.get_unique_index(key.bytes_as_string_view().substring_view(1)).value();
            VERIFY(!index_set.contains(index));
            index_set.set(index);
        });
        VERIFY(named_character_reference_data.size() == index_set.size());
    }

    auto index_to_codepoints = MUST(FixedArray<Codepoints>::create(named_character_reference_data.size()));

    named_character_reference_data.for_each_member([&](auto& key, auto& value) {
        auto codepoints = value.as_object().get_array("codepoints"sv).value();
        auto unique_index = dafsa_builder.get_unique_index(key.bytes_as_string_view().substring_view(1)).value();
        // The unique index is 1-based, so subtract 1 for the array index.
        auto array_index = unique_index - 1;
        u32 second_codepoint = 0;
        if (codepoints.size() == 2) {
            second_codepoint = codepoints[1].template as_integer<u32>();
        }
        index_to_codepoints[array_index] = Codepoints { codepoints[0].template as_integer<u32>(), second_codepoint };
    });

    generator.append(R"~~~(
#include <LibWeb/HTML/Parser/Entities.h>

namespace Web::HTML {

static NamedCharacterReferenceCodepoints g_named_character_reference_codepoints_lookup[] = {
)~~~");

    for (auto codepoints : index_to_codepoints) {
        auto member_generator = generator.fork();
        member_generator.set("first_codepoint", MUST(String::formatted("0x{:X}", codepoints.first)));
        member_generator.set("second_codepoint_name", get_second_codepoint_enum_name(codepoints.second));
        member_generator.append(R"~~~(    {@first_codepoint@, NamedCharacterReferenceSecondCodepoint::@second_codepoint_name@},
)~~~");
    }

    Vector<NodeData> node_data;
    HashMap<Node*, u16> child_indexes;
    write_node_data(dafsa_builder, node_data, child_indexes);

    generator.append(R"~~~(};

NamedCharacterReferenceNode g_named_character_reference_nodes[] = {
    { 0, 0, false, 0, 0 },
)~~~");

    for (auto data : node_data) {
        auto member_generator = generator.fork();
        member_generator.set("char", StringView(&data.character, 1));
        member_generator.set("number", String::number(data.number));
        member_generator.set("end_of_word", MUST(String::formatted("{}", data.end_of_word)));
        member_generator.set("child_index", String::number(data.child_index));
        member_generator.set("children_len", String::number(data.children_len));
        member_generator.append(R"~~~(    { '@char@', @number@, @end_of_word@, @child_index@, @children_len@ },
)~~~");
    }

    generator.append(R"~~~(};

NamedCharacterReferenceFirstLayerNode g_named_character_reference_first_layer[] = {
)~~~");

    auto num_children = dafsa_builder.root()->num_direct_children();
    VERIFY(num_children == 52); // A-Z, a-z exactly
    u16 unique_index_tally = 0;
    for (u8 c = 0; c < 128; c++) {
        if (dafsa_builder.root()->children().at(c) == nullptr)
            continue;
        VERIFY(AK::is_ascii_alpha(c));
        auto child = dafsa_builder.root()->children().at(c);

        auto member_generator = generator.fork();
        member_generator.set("number", String::number(unique_index_tally));
        member_generator.append(R"~~~(    { @number@ },
)~~~");

        unique_index_tally += child->number();
    }

    generator.append(R"~~~(};

NamedCharacterReferenceFirstToSecondLayerLink g_named_character_reference_first_to_second_layer[] = {
)~~~");

    u16 second_layer_offset = 0;
    for (u8 c = 0; c < 128; c++) {
        if (dafsa_builder.root()->children().at(c) == nullptr)
            continue;
        VERIFY(AK::is_ascii_alpha(c));
        auto child = dafsa_builder.root()->children().at(c);
        auto bit_mask = child->get_ascii_alphabetic_bit_mask();

        auto member_generator = generator.fork();
        member_generator.set("bit_mask", String::number(bit_mask));
        member_generator.set("second_layer_offset", String::number(second_layer_offset));
        member_generator.append(R"~~~(    { @bit_mask@ull, @second_layer_offset@ },
)~~~");

        second_layer_offset += child->num_direct_children();
    }

    generator.append(R"~~~(};

NamedCharacterReferenceSecondLayerNode g_named_character_reference_second_layer[] = {
)~~~");

    for (u8 c = 0; c < 128; c++) {
        if (dafsa_builder.root()->children().at(c) == nullptr)
            continue;
        VERIFY(AK::is_ascii_alpha(c));
        auto first_layer_node = dafsa_builder.root()->children().at(c);

        u8 unique_index_tally = 0;
        for (u8 child_c = 0; child_c < 128; child_c++) {
            if (first_layer_node->children().at(child_c) == nullptr)
                continue;
            VERIFY(AK::is_ascii_alpha(child_c));
            auto second_layer_node = first_layer_node->children().at(child_c);
            auto child_num_children = second_layer_node->num_direct_children();
            auto child_index = child_indexes.get(second_layer_node).value_or(0);

            auto member_generator = generator.fork();
            member_generator.set("child_index", String::number(child_index));
            member_generator.set("number", String::number(unique_index_tally));
            member_generator.set("children_len", String::number(child_num_children));
            member_generator.set("end_of_word", MUST(String::formatted("{}", second_layer_node->is_terminal())));
            member_generator.append(R"~~~(    { @child_index@, @number@, @children_len@, @end_of_word@ },
)~~~");
            unique_index_tally += second_layer_node->number();
        }
    }

    generator.append(R"~~~(};

// Note: The unique index is 1-based.
Optional<NamedCharacterReferenceCodepoints> named_character_reference_codepoints_from_unique_index(u16 unique_index) {
    if (unique_index == 0) return {};
    return g_named_character_reference_codepoints_lookup[unique_index - 1];
}

} // namespace Web::HTML
)~~~");

    TRY(file.write_until_depleted(generator.as_string_view().bytes()));
    return {};
}