ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2025-06-15 14:51:52 +00:00

Author	SHA1	Message	Date
Simon Wanner	eb9ed10573	LibTextCodec: Add windows-1253 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	2d35687db0	LibTextCodec: Add windows-874 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	1b6878b6ca	LibTextCodec: Add KOI8-U decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	1fd3a6f48c	LibTextCodec: Add ISO-8859-16 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	3e882f26db	LibTextCodec: Sort checks in decoder_for mostly alphabetically Keeps checks for common encodings (Latin1 & UTF-*) at the top.	2024-05-27 20:50:50 +02:00
Simon Wanner	56241df604	LibTextCodec: Add ISO-8859-14 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	4188e328ac	LibTextCodec: Add ISO-8859-13 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	cc640f4363	LibTextCodec: Add ISO-8859-10 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	d73220837e	LibTextCodec: Add ISO-8859-8(-I) decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	24028e353e	LibTextCodec: Add ISO-8859-7 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	01c3b8091a	LibTextCodec: Add ISO-8859-6 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	763d904ad5	LibTextCodec: Add ISO-8859-5 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	c6b17320db	LibTextCodec: Add ISO-8859-4 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	6c84edaaa2	LibTextCodec: Add ISO-8859-3 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	fc783199f1	LibTextCodec: Add IBM866 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	96b3c35358	LibTextCodec: Implement table based decoders as SingleByteDecoder Instead of copy-pasting the implementation, let's use a single class. This "Single Byte Decoder" concept even exists in the Encoding Spec :^)	2024-05-27 20:50:50 +02:00
Michal Grich	7a6d84d036	LibTextCodec: Add Windows-1250 text decoder This commit adds Windows-1250 decoding based on the unicode.org mapping table.	2024-04-23 16:26:16 +02:00
Andreas Kling	3c039903fb	LibTextCodec+AK: Don't validate UTF-8 strings twice UTF8Decoder was already converting invalid data into replacement characters while converting, so we know for sure we have valid UTF-8 by the time conversion is finished. This patch adds a new StringBuilder::to_string_without_validation() and uses it to make UTF8Decoder avoid half the work it was doing.	2023-12-30 13:49:50 +01:00
Nico Weber	8f47acee6a	LibTextCodec: Add PDFDocEncoding decoder	2023-11-22 09:08:06 -07:00
Idan Horowitz	079c96376c	LibTextCodec: Support validating encoded inputs	2023-11-17 16:02:36 +01:00
Luke Wilde	eaa4048870	LibTextCodec: Add "get output encoding" from the Encoding specification	2023-06-19 06:12:26 +02:00
Timothy Flynn	00fa23237a	LibTextCodec: Change UTF-8's decoder to replace invalid code points The UTF-8 decoder will currently crash if it is provided invalid UTF-8 input. Instead, change its behavior to match that of all other decoders to replace invalid code points with U+FFFD. This is required by the web.	2023-05-12 05:47:36 +02:00
Andreas Kling	a504ac3e2a	Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case Let's make it clear that these functions deal with ASCII case only.	2023-03-10 13:15:44 +01:00
Luke Wilde	e864444fe3	LibTextCodec/Latin1: Iterate over input string with u8 instead of char Using char causes bytes equal to or over 0x80 to be treated as negative values and produce incorrect results when implicitly cast to u32. For example, `atob` in LibWeb uses this decoder to convert non-ASCII values to UTF-8, but non-ASCII values are >= 0x80 and thus produce incorrect results in such cases: `Uint8Array.from(atob("u660"), c => c.charCodeAt(0));` This used to produce [253, 253, 253] instead of [187, 174, 180]. Required by Cloudflare's IUAM challenges.	2023-02-28 08:46:06 +00:00
Sam Atkins	2db168acc1	LibTextCodec+Everywhere: Port Decoders to new Strings	2023-02-19 17:15:47 +01:00
Sam Atkins	3c5090e172	LibTextCodec: Return Optional<Decoder&> from `bom_sniff_to_decoder()`	2023-02-19 17:15:47 +01:00
Sam Atkins	f2a9426885	LibTextCodec+Everywhere: Return Optional<Decoder&> from `decoder_for()`	2023-02-19 17:15:47 +01:00
Sam Atkins	d6075ef5b5	LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView We don't need a full String/DeprecatedString inside this function, so we might as well not force users to create one.	2023-02-15 12:48:26 -05:00
Nico Weber	eac2b2382c	LibTextCodec: Add a MacRoman decoder Allows displaying `<meta charset="x-mac-roman">` html files. (`:set fenc=macroman`, `:w` in vim to save in that encoding.)	2023-01-24 14:37:20 +00:00
Nico Weber	b14b5a4d06	LibTextCodec: Simplify Latin1Decoder::process() a tiny bit	2023-01-24 14:37:20 +00:00
Nico Weber	3423b54eb9	LibTextCodec: Make utf-16be and utf-16le codecs actually work There were two problems: 1. They didn't handle surrogates. 2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4. Also add a basic test that catches both issues. There's some code duplication with Utf16CodePointIterator::operator*(), but let's get things working first.	2023-01-22 21:30:44 +00:00
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const*) Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
Idan Horowitz	086969277e	Everywhere: Run clang-format	2022-04-01 21:24:45 +01:00
Karol Kosek	b006a60366	LibTextCodec: Pass code points instead of bytes on UTF-8 string process Previously we were passing raw UTF-8 bytes as code points, which caused CSS content properties to display incorrect characters. This makes bullet separators in Wikipedia templates display correctly.	2022-03-29 01:01:32 +02:00
Hendiadyoin1	6a95df2526	LibTextCodec: Don't allocate Strings on encoding normalisation This ripples down to LibWeb's HTML and XHR decoders, which therefore become less allocation heavy.	2022-03-21 10:48:17 +01:00
Jelle Raaijmakers	9c2a7c0e03	LibTextCodec: Add support for the UTF16-LE encoding	2022-03-08 14:51:06 +01:00
Luke Wilde	0e0f98a45e	LibTextCodec: Add x-user-defined decoder It's a pretty simple charset: the bottom 128 bytes (0x00-0x7F) are standard ASCII, while the top 128 bytes (0x80-0xFF) are mapped to a portion of the Unicode Private Use Area, specifically 0xF780-0xF7FF. This is used by Google Maps for certain blobs.	2022-02-12 12:53:28 +01:00
Luke Wilde	835a344337	LibTextCodec: Add decoder function that overrides given decoder on BOM This function takes a user-provided decoder and will only use it if no BOM is in the input. If there is a BOM, it will ignore the given decoder and instead decode the input with the appropriate Unicode decoder for the detected BOM. This is only to be used where it's specifically needed, for example XHR uses this for compatibility with deployed content. As such, it has an obnoxious name to discourage usage.	2022-02-12 12:53:28 +01:00
Luke Wilde	94965ba28d	LibTextCodec: Add BOM sniffer This takes the input and sniffs it for a BOM. If it has the UTF-8 or UTF-16BE BOM, it will return their respective decoder. Currently we don't have a UTF-16LE decoder, so it will assert TODO if it detects a UTF-16LE BOM. If there is no recognisable BOM, it will return no decoder.	2022-02-12 12:53:28 +01:00
Daniel Bertalan	6003b6f4d3	LibTextCodec: Do not allocate the various decoders These objects contain no data members, so there is no point in creating 1-byte heap allocations for them. We don't need to have them as static local variables, as they are trivially constructible, so they can simply be global variables.	2022-01-28 23:31:00 +01:00
Dmitry Petrov	6f5102f435	LibTextCodec: Add alternate Cyrillic (aka Koi8-r) encoding Fixes #6840.	2021-12-16 22:44:45 +01:00
Andreas Kling	8b1108e485	Everywhere: Pass AK::StringView by value	2021-11-11 01:27:46 +01:00
Sam Atkins	d7ffa51424	LibTextCodec: Ignore BYTE ORDER MARK at the start of utf8/16 strings Before, this was getting included as part of the output text, which was confusing the HTML parser. Nobody needs the BOM after we have identified the codec, so now we remove it when converting to UTF-8.	2021-09-15 17:00:18 +02:00
sin-ack	e6818388e4	LibTextCodec: Add "process" API for allocation-free code point iteration This commit adds a new process method to all Decoder subclasses which does what to_utf8 used to do, and allows callers to customize the handling of individual UTF-8 code points through a callback. Decoder::to_utf8 now uses this API to generate a string via StringBuilder, preserving the original behavior.	2021-08-30 00:08:40 +02:00
Andreas Kling	ed7a2f21ff	LibTextCodec: Remove unused is_standardized_encoding()	2021-08-20 15:31:46 +02:00
Aatos Majava	3b2a528b33	LibTextCodec: Add Turkish (aka ISO-8859-9, Windows-1254) encoding	2021-06-23 16:32:47 +01:00
Aatos Majava	7597cca5c6	LibTextCodec: Add ISO-8859-15 (aka Latin-9) encoding	2021-06-15 15:12:09 +01:00
Max Wipfli	d325403cb5	LibTextCodec: Use Optional<String> for get_standardized_encoding This patch changes get_standardized_encoding to use an Optional<String> return type instead of just returning the null string when unable to match the provided encoding to one of the canonical encoding names. This is part of an effort to move away from using null strings towards explicitly using Optional<String> to indicate that the String may not have a value.	2021-05-18 21:02:07 +02:00
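
Several of the commit messages above describe byte-level behavior that is easier to follow in code: the x-user-defined mapping (0e0f98a45e), BOM sniffing (94965ba28d), and the signed-char pitfall (e864444fe3, 3423b54eb9). The following is a minimal standalone sketch of those ideas, assuming C++17 and the standard library only. It is not LibTextCodec's implementation; the names `decode_x_user_defined_byte`, `byte_at`, `sniff_bom`, and `BomEncoding` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not LibTextCodec's API): map one x-user-defined byte
// to a code point. 0x00-0x7F is ASCII; 0x80-0xFF maps to U+F780-U+F7FF.
constexpr char32_t decode_x_user_defined_byte(std::uint8_t byte)
{
    return byte < 0x80 ? static_cast<char32_t>(byte)
                       : static_cast<char32_t>(0xF780 + (byte - 0x80));
}

// Read a byte as unsigned. Reading it through a plain (possibly signed) char
// is the pitfall the Latin1 and UTF-16 fixes describe: a byte >= 0x80 can
// become negative and turn into a huge value when widened.
constexpr std::uint8_t byte_at(std::string_view input, std::size_t index)
{
    return static_cast<std::uint8_t>(input[index]);
}

// Hypothetical result type for the sketch below.
enum class BomEncoding { Utf8, Utf16BE, Utf16LE };

// Return the encoding implied by a leading byte order mark, or nothing if
// the input does not start with a recognised BOM.
constexpr std::optional<BomEncoding> sniff_bom(std::string_view input)
{
    if (input.size() >= 3 && byte_at(input, 0) == 0xEF
        && byte_at(input, 1) == 0xBB && byte_at(input, 2) == 0xBF)
        return BomEncoding::Utf8;
    if (input.size() >= 2 && byte_at(input, 0) == 0xFE && byte_at(input, 1) == 0xFF)
        return BomEncoding::Utf16BE;
    if (input.size() >= 2 && byte_at(input, 0) == 0xFF && byte_at(input, 1) == 0xFE)
        return BomEncoding::Utf16LE;
    return std::nullopt;
}
```

A caller in the spirit of 835a344337 would try `sniff_bom` first and fall back to the user-provided decoder only when it returns no value.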


59 commits