Commit graph

13 commits

Author SHA1 Message Date
Luke
19d6884529 LibWeb: Implement quirks mode detection
This allows us to determine which mode to render the page in.

Exposes "doctype" and "compatMode" on Document.
Exposes "name", "publicId" and "systemId" on DocumentType.
2020-07-21 01:08:32 +02:00
stelar7
5eb39a5f61 LibWeb: Update parser with more insertion modes :^)
Implements handling of InHeadNoScript, InSelectInTable, InTemplate,
InFrameset, AfterFrameset, and AfterAfterFrameset.
2020-06-21 10:13:31 +02:00
Andreas Kling
b6288163f1 LibWeb: Make the new HTML parser parse input as UTF-8
We already convert the input to UTF-8 before starting the tokenizer,
so all this patch had to do was switch the tokenizer to use a Utf8View
for its input (and to emit 32-bit codepoints).
2020-06-04 21:12:17 +02:00
Kyle McLean
1ad81e4833 LibWeb: Parse "br" end tags during "in body"
2020-06-04 09:09:33 +02:00
Andreas Kling
4788bcd6f8 LibWeb: Add HTMLToken::make_character()
It's tedious to make character tokens manually all the time.
2020-05-28 18:43:52 +02:00
Andreas Kling
772b51038e LibWeb: Parse "input" tags during the "in body" insertion mode
2020-05-28 12:19:18 +02:00
Andreas Kling
f62a8d3b19 LibWeb: Handle some more parser inputs in the "in head" insertion mode
2020-05-25 20:16:48 +02:00
Andreas Kling
20911efd4d LibWeb: More work on the HTML parser and tokenizer
The parser can now switch the state of the tokenizer! Very webby. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
31db3f21ae LibWeb: Start implementing character token parsing
Now that we've gotten rid of the misguided character buffering in the
tokenizer, it actually spits out character tokens that we have to deal
with in the parser.

This patch implements enough to bring us back to speed with simple.html.
2020-05-24 23:54:22 +02:00
Andreas Kling
fd1b31d0ff LibWeb: Start building the tree building part of the new HTML parser
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.

The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug-free as possible. :^)

This is going to become pretty large, but it's pretty cool!
2020-05-24 00:14:23 +02:00
Andreas Kling
6caa5661f3 LibWeb: Teach HTMLTokenizer how to tokenize attributes
Properly tokenize single-quoted, double-quoted and unquoted attributes!
2020-05-23 01:22:15 +02:00
Andreas Kling
004ef9a86b LibWeb: Minor tweaks to HTMLToken declaration
2020-05-22 23:45:02 +02:00
Andreas Kling
272b35d2e1 LibWeb: Begin work on a spec-compliant HTML parser
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!

This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes according to the HTML spec tokenization algorithm.

The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.

This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
2020-05-22 21:46:13 +02:00