241 commits

Author SHA1 Message Date
Tim Schumacher
ed4c2f2f8e LibCore: Rename Stream::read_all to read_until_eof
This generally seems like a better name, especially if we somehow also
need a better name for "read the entire buffer, but not the entire file"
somewhere down the line.
2022-12-12 14:16:42 +01:00
Thomas Queiroz
6debd967ba Lagom/CodeGenerators: Use HashMap::try_ensure_capacity 2022-12-10 14:29:46 +01:00
Tim Schumacher
2fc2025f49 LibCore: Move Core::Stream::File::exists() to Core::File
`Core::Stream::File` shouldn't hold any utility methods that are
unrelated to constructing a `Core::Stream`, so let's just replace the
existing `Core::File::exists` with the nicer looking implementation.
2022-12-08 12:52:14 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Linus Groh
babfc13c84 Everywhere: Remove 'clang-format off' comments that are no longer needed
https://github.com/SerenityOS/serenity/pull/15654#issuecomment-1322554496
2022-12-03 23:52:23 +00:00
Linus Groh
d26aabff04 Everywhere: Run clang-format 2022-12-03 23:52:23 +00:00
Timothy Flynn
b2164ad979 Meta: Do not hard-code index types for UCD/CLDR/TZDB code generators
Hand-picking the smallest index type that fits a particular generated
array started with commit 3ad159537e. This
was to reduce the size of the generated library.

Since then, the number of types using UniqueStorage has grown a ton,
creating a long list of types for which index types are manually picked.
When a new UCD/CLDR/TZDB is released, and the current index type no
longer fits the generated data, we fail to generate. Tracking down which
index caused the failure is a pretty annoying process.

Instead, we can just use size_t while in the generators themselves, then
automatically pick the size needed for the generated code.
2022-11-18 17:00:51 +00:00
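The "automatically pick the size needed" step can be sketched as below. This is a minimal illustration, not the generator's actual code: the helper name `smallest_index_type` is hypothetical, and the returned names (`u8`, `u16`, ...) are assumed to match the type aliases the generated sources would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Hypothetical helper: returns the name of the narrowest unsigned type that
// can hold every index up to and including `max_index`. The generator itself
// would keep using size_t and only call this when emitting code.
constexpr std::string_view smallest_index_type(std::size_t max_index)
{
    if (max_index <= std::numeric_limits<std::uint8_t>::max())
        return "u8";
    if (max_index <= std::numeric_limits<std::uint16_t>::max())
        return "u16";
    if (max_index <= std::numeric_limits<std::uint32_t>::max())
        return "u32";
    return "u64";
}
```

With this shape, a data update that pushes an index past 255 would change the emitted type from `u8` to `u16` rather than failing generation.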
Gunnar Beutner
4e406b0730 Meta+LibUnicode: Avoid relocations for emoji data
Previously each emoji had its own symbol in the library which was then
referred to by another symbol. This caused thousands of avoidable data
relocations at load time.

This saves about 122kB RAM for each process which uses LibUnicode.
2022-11-06 17:34:06 +01:00
Gunnar Beutner
2d3567ee92 Meta+LibUnicode: Avoid relocations for static unicode data
Previously the s_decomposition_mappings variable would refer to other
data in s_decomposition_mappings_data. This would cause thousands of
avoidable relocations at load time.

This saves about 128kB RAM for each process which uses LibUnicode.
2022-11-06 17:34:06 +01:00
Timothy Flynn
b820b9b2ff LibUnicode: Make the generated .h and .cpp paths for emoji data optional
This is to allow people making emoji to run the generator to create the
expected commit message format.
2022-11-03 16:37:04 +00:00
Timothy Flynn
bd592480e4 Meta: Replace Bash script for generating emoji.txt with C++ generator
We currently have two build-time parsers for the UCD's emoji-test.txt
file. To prepare for future changes, this removes the Bash parser and
moves its functionality to the newer C++ parser.
2022-10-27 12:59:56 +02:00
demostanis
3e8b5ac920 AK+Everywhere: Turn bool keep_empty into an enum in split* functions 2022-10-24 23:29:18 +01:00
Timothy Flynn
f08a979b96 LibUnicode: Remove GCC codegen workaround
Reverts commits:
ffbf5596cd
f190e394b3
2022-10-07 18:21:40 +01:00
Timothy Flynn
f38c68177b LibUnicode: Update code point ideographic replacements for Unicode 15 2022-10-07 18:17:40 +01:00
Andreas Kling
f190e394b3 LibUnicode: Let's use the GCC 11/12 workaround on all platforms
I seem to be getting some miscompiles on Linux as well, so let's make
the hitherto macOS-specific workaround universal.
2022-10-06 17:15:28 +02:00
matcool
70d0c1616f LibUnicode: Add decomposition mappings and Unicode normalization
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.

The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
meaning no quick check optimizations.
2022-10-06 08:24:39 -04:00
Nico Weber
2af028132a AK+Everywhere: Add AK_COMPILER_{GCC,CLANG} and use them most places
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.

AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things some.

AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
2022-10-04 23:35:07 +01:00
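The point about `__GNUC__` also being defined by clang can be illustrated with a small sketch. The macro names `EXAMPLE_COMPILER_CLANG` and `EXAMPLE_COMPILER_GCC` and the function `example_compiler_name` are hypothetical stand-ins; the real definitions live in AK/Platform.h and may differ.

```cpp
#include <string_view>

// clang defines __GNUC__ too, so test for __clang__ first. The GCC macro
// is then only set for real gcc builds.
#if defined(__clang__)
#    define EXAMPLE_COMPILER_CLANG 1
#elif defined(__GNUC__)
#    define EXAMPLE_COMPILER_GCC 1
#endif

// Hypothetical helper that reports which branch was taken.
constexpr std::string_view example_compiler_name()
{
#if defined(EXAMPLE_COMPILER_CLANG)
    return "clang";
#elif defined(EXAMPLE_COMPILER_GCC)
    return "gcc";
#else
    return "other";
#endif
}
```

Code that needs a gcc-only workaround can then test a single macro instead of writing `defined(__GNUC__) && !defined(__clang__)` at every use site.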
Nico Weber
ffbf5596cd Lagom: Work around gcc codegen bug
Without this, GenerateUnicodeData crashes when run during the build.
With this, `serenity.sh run` brings up a running SerenityOS.
Since GenerateUnicodeData doesn't take a lot of time to run, just
disable optimizations to work around the problem for now.

Works around #15449.
2022-10-03 15:30:51 +01:00
Timothy Flynn
739798e075 LibUnicode: Use recently added Core::Stream::read_all in code generators
The generators had a manual implementation when Core::Stream did not
have a read_all method.
2022-09-21 14:04:22 +01:00
Timothy Flynn
b7ef36aa36 LibUnicode: Parse and generate custom emoji added for SerenityOS
Parse emoji from emoji-serenity.txt to allow displaying their names and
grouping them together in the EmojiInputDialog.

This also adds an "Unknown" value to the EmojiGroup enum. This will be
useful for emoji that aren't found in the UCD, or for when UCD downloads
are disabled.
2022-09-11 20:33:57 +01:00
Timothy Flynn
0aadd4869d LibUnicode: Generate emoji data for non-fully-qualified emoji
This allows us to find emoji data for files such as /res/emoji/U+A9.png.
U+00A9 is not fully-qualified (its full form is U+00A9 U+FE0F). But the
UCD has unqualified data for this code point; generating it allows us to
categorize these emoji appropriately in the EmojiInputDialog.
2022-09-11 20:33:57 +01:00
Timothy Flynn
b61eca0a1e LibUnicode: Parse and generate emoji code point data
According to TR #51, the "best definition of the full set [of emojis] is
in the emoji-test.txt file". This defines not only the emoji themselves,
but the order in which they should be displayed, and what "group" of
emojis they belong to.
2022-09-08 23:12:31 +01:00
Timothy Flynn
f082b6ae48 LibUnicode: Generate a separate Locale enumeration for special casing
The UCD only cares about a few locales for special casing rules (az, lt,
and tr). Unfortunately, LibUnicode cannot use LibLocale once the
libraries are separate because LibLocale will need to use LibUnicode for
many more things; thus there would be a circular dependency. Instead,
just generate the small enum needed for this one use case.
2022-09-05 14:37:16 -04:00
Timothy Flynn
43a3471298 LibLocale: Move locale source files to the LibLocale folder
These are still included in LibUnicode, but this updates their location
and the include paths of other files which include them.
2022-09-05 14:37:16 -04:00
Timothy Flynn
ff48220dca Userland: Move files destined for LibLocale to the Locale namespace 2022-09-05 14:37:16 -04:00
Timothy Flynn
1e0276f541 LibLocale+LibUnicode: Move generated CLDR data files to LibLocale folder
They are still included into LibUnicode, but this moves their generated
location to be under LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
89d1813b5d LibUnicode: Move CLDR data generators to a LibLocale subfolder
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
2022-09-05 14:37:16 -04:00
davidot
cd763de280 LibJS+LibUnicode: Move some constant arrays to a separate header
Since LibUnicode depends on this data it used to include
Intl/AbstractOperations which in turn includes a number of other LibJS
headers. By moving this to its own header with minimal includes we can
save on rebuilding LibUnicode for unrelated LibJS header changes.
2022-08-27 10:55:44 -04:00
Timothy Flynn
ca92e37ae0 LibUnicode: Generate code point display names with run-length encoding
Similar to commit becec35, our code point display name data was a large
list of StringViews. RLE can be used here as well to move about 32 MB
from the initialized data section to the read-only section.

Some of the refactoring to store strings as indices into an RLE array
also lets us clean up some of the code point name generators.
2022-08-17 15:42:12 +01:00
Timothy Flynn
2c2ede8581 LibUnicode: Mark UniqueStringStorage::generate as constant
This is just to allow it to be invoked from callers who hold a constant
UniqueStringStorage instance.
2022-08-17 15:42:12 +01:00
Timothy Flynn
becec3578f LibTimeZone+LibUnicode: Generate string data with run-length encoding
Currently, the unique string lists are stored in the initialized data
sections of their shared libraries. In order to move the data to the
read-only section, generate the strings using RLE arrays.

We generate two arrays: the first is the RLE data itself, the second is
a list of indices into the RLE array for each string. We then generate a
decoding method to convert an RLE string to a StringView.
2022-08-16 16:56:17 +02:00
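The two-array layout described above can be sketched as follows. The exact encoding the generators emit is not shown in this log; this sketch assumes each string is stored as one length byte followed by its characters, which lets the decoder return a view without copying. All names and the sample strings are hypothetical, and `std::string_view` stands in for AK's StringView.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string_view>

// First array: the encoded data. Each entry is a length byte followed by
// that many characters (an assumed layout). Entry 0 is the empty string.
static constexpr char s_encoded_strings[] = {
    0,
    2, 'e', 'n',
    3, 'U', 'T', 'C',
    4, 'n', 'o', 'o', 'n',
};

// Second array: for each string, the offset of its length byte in the
// encoded data. Plain integers need no load-time relocation.
static constexpr std::uint16_t s_string_offsets[] = { 0, 1, 4, 8 };

// Decoding method: turns a string index into a view over the encoded data.
inline std::string_view decode_string(std::size_t index)
{
    assert(index < std::size(s_string_offsets));
    std::size_t offset = s_string_offsets[index];
    auto length = static_cast<unsigned char>(s_encoded_strings[offset]);
    return { &s_encoded_strings[offset + 1], length };
}
```

Because both arrays hold only integers and characters, with no pointers, they can be placed in a read-only section and shared between processes.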
Timothy Flynn
ae2acc8cdf LibJS+LibUnicode: Generate a set of default DateTimeFormat patterns
This isn't called out in TR-35, but before ICU even looks at CLDR data,
it adds a hard-coded set of default patterns to each locale's calendar.
It has done this since 2006 when its DateTimeFormat feature was first
created. Several test262 tests depend on this, which under ECMA-402,
falls into "implementation defined" behavior. For compatibility, we
can do the same in LibUnicode.
2022-07-22 23:51:56 +01:00
Timothy Flynn
32c07bc6c3 LibUnicode: Generate per-locale data for the "noon" fixed day period
Note that not all locales have this day period.
2022-07-21 20:36:03 +01:00
Timothy Flynn
16b673eaa9 LibUnicode: Check whether a calendar symbol for a locale actually exists
In the generated unique string list, index 0 is the empty string, and is
used to indicate a value doesn't exist in the CLDR. Check for this
before returning an empty calendar symbol.

For example, an upcoming commit will add the fixed day period "noon",
which not all locales support.
2022-07-21 20:36:03 +01:00
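A minimal sketch of the "index 0 means missing" check, assuming a simplified string table. The table contents and the name `symbol_for_index` are hypothetical, and `std::optional` with `std::string_view` stands in for the AK types.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Simplified unique string list: index 0 is reserved for the empty string
// and marks "no value in the CLDR".
static constexpr std::array<std::string_view, 3> s_unique_strings { "", "AM", "noon" };

// Hypothetical lookup: reports a missing symbol as an empty optional
// instead of handing back an empty string.
inline std::optional<std::string_view> symbol_for_index(std::size_t index)
{
    if (index == 0 || index >= s_unique_strings.size())
        return std::nullopt;
    return s_unique_strings[index];
}
```

Callers can then tell "this locale has no such symbol" apart from a symbol that exists, which matters for day periods such as "noon" that only some locales define.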
Timothy Flynn
0f26ab89ae LibJS+LibUnicode: Handle flexible day periods on both sides of midnight
Commit ec7d535 only partially handled the case of flexible day periods
rolling over midnight, in that it only worked for hours after midnight.
For example, the en locale defines a day period range of [21:00, 06:00).
The previous method of adding 24 hours to the given hour would change
e.g. 23:00 to 47:00, which isn't valid.
2022-07-21 20:36:03 +01:00
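One way to test membership in a day period that wraps past midnight, without producing out-of-range hours such as 47:00, is sketched below. The function name is hypothetical and the sketch works on whole hours only; the actual fix in LibJS and LibUnicode may be structured differently.

```cpp
// Returns whether `hour` (0..23) lies in the half-open range [begin, end).
// When begin > end the range wraps past midnight, e.g. [21, 6).
constexpr bool hour_in_day_period(int hour, int begin, int end)
{
    if (begin <= end)
        return hour >= begin && hour < end;
    // Wrapping range: the hour matches if it falls in the late-evening part
    // or in the early-morning part.
    return hour >= begin || hour < end;
}
```

For the en range [21:00, 06:00), both 23 and 3 are inside, while 6 and 12 are not.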
Timothy Flynn
b2709f161e LibUnicode: Generate per-locale approximately & range separator symbols 2022-07-20 22:30:16 +01:00
Timothy Flynn
b24b9c0a65 LibUnicode: Fall back to per-locale default calendars
When patterns, symbols, etc. for a requested calendar are not found, use
the locale's default calendar.
2022-07-15 12:31:43 +02:00
Timothy Flynn
c849cb9d76 LibUnicode: Fall back to per-locale default numbering systems
When patterns, grouping digits, symbols, etc. for a requested numbering
system are not found, use the locale's default numbering system. This
will allow using the correct digits e.g. for the locale "en-u-nu-arab"
even though the "en" locale only contains patterns for the "latn"
numbering system.
2022-07-15 12:31:43 +02:00
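The fallback described above might look roughly like this. `LocaleNumberData` and `find_pattern` are hypothetical names for illustration, and the generated LibUnicode data is laid out differently; only the lookup order (requested system first, then the locale's default) is taken from the commit message.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical, simplified per-locale data.
struct LocaleNumberData {
    std::string default_numbering_system;
    // Numbering system name -> number pattern. std::less<> allows lookups
    // by string_view without building a temporary std::string.
    std::map<std::string, std::string, std::less<>> patterns_by_system;
};

// Try the requested numbering system first; if the locale has no patterns
// for it, fall back to the locale's default numbering system.
inline std::optional<std::string_view> find_pattern(LocaleNumberData const& locale, std::string_view requested_system)
{
    auto const& patterns = locale.patterns_by_system;
    if (auto it = patterns.find(requested_system); it != patterns.end())
        return std::string_view { it->second };
    if (auto it = patterns.find(locale.default_numbering_system); it != patterns.end())
        return std::string_view { it->second };
    return std::nullopt;
}
```

With an "en" entry that only carries "latn" patterns, a request for "arab" still yields the "latn" pattern, and the digits themselves can come from the requested system.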
Timothy Flynn
f8f7015419 LibUnicode: Generate a method to look up locale-preferred keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
80568d5776 LibUnicode: Generate a method to look up available keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
c2e5b20eb6 LibUnicode: Generate available values for the keywords co, kf, kn, hc
This also ensures we only include values we actually support in the
generated list of available values.
2022-07-15 12:31:43 +02:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
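The difference between the two construction paths can be shown with the standard library, used here as a stand-in for AK's StringView and its `""sv` operator (the AK implementation is not reproduced, and the variable names are illustrative only).

```cpp
#include <string_view>

using namespace std::literals::string_view_literals;

// With the sv suffix, the compiler passes the literal's length directly;
// no strlen-style scan is needed.
constexpr std::string_view example_name = "LibUnicode"sv;

// The suffix also keeps embedded NUL bytes, because the length comes from
// the literal itself.
constexpr std::string_view literal_with_nul = "a\0b"sv;

// The char const* constructor has to scan for the terminator, so it stops
// at the first NUL byte.
constexpr std::string_view strlen_with_nul = std::string_view("a\0b");
```

Here `example_name.size()` is 10, `literal_with_nul.size()` is 3, and `strlen_with_nul.size()` is 1.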
sin-ack
e5f09ea170 Everywhere: Split Error::from_string_literal and Error::from_string_view
Error::from_string_literal now takes direct char const*s, while
Error::from_string_view does what Error::from_string_literal used to do:
taking StringViews. This change will remove the need to insert `sv`
after error strings when returning string literal errors once
StringView(char const*) is removed.

No functional changes.
2022-07-12 23:11:35 +02:00
sin-ack
7456904a39 Meta+Userland: Simplify some formatters
These are mostly minor mistakes I've encountered while working on the
removal of StringView(char const*). The usage of builder.put_string over
Format<FormatString>::format is preferable as it will avoid the
indirection altogether when there's no formatting to be done. Similarly,
there is no need to do format(builder, "{}", number) when
builder.put_u64(number) works equally well.

Additionally a few Strings where only constant strings were used are
replaced with StringViews.
2022-07-12 23:11:35 +02:00
Timothy Flynn
a337b059dd LibUnicode: Parse and generate per-locale plural ranges 2022-07-12 00:43:34 +01:00
Timothy Flynn
232df4196b LibUnicode: Replace NumberFormat::Plurality with Unicode::PluralCategory
To prepare for using plural rules within number & duration format, this
removes the NumberFormat::Plurality enumeration.

This also adds PluralCategory::ExactlyZero & PluralCategory::ExactlyOne.
These are used in locales like French, where PluralCategory::One really
means any value from 0.00 to 1.99. PluralCategory::ExactlyOne means only
the value 1, as the name implies. These exact rules are not known by the
general plural rules, they are explicitly for number / currency format.
2022-07-08 20:33:52 +02:00
Timothy Flynn
cc5c707649 LibJS+LibUnicode: Do not generate the PluralCategory enum
The PluralCategory enum is currently generated for plural rules. Instead
of generating it, this moves the enum to the public LibUnicode header.
While it was nice to auto-discover these values, they are well defined
by TR-35, and we will need their values from within the number format
code generator (which can't rely on the plural rules generator having
run yet). Further, number format will require additional values in the
enum that plural rules doesn't know about.
2022-07-08 20:33:52 +02:00
Timothy Flynn
bf85bf2a9e LibJS: Use Intl.PluralRules within Intl.RelativeFormat
The Polish test cases added here cover previous failures from test262,
due to the way that 0 is specified to be "many" in Polish.
2022-07-08 11:51:54 +02:00
Timothy Flynn
8aeacccd82 LibUnicode: Generate a list of available plural categories per locale
Separate lists are generated for cardinal and ordinal form.
2022-07-08 11:51:54 +02:00