ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2025-10-09 17:49:39 +00:00

Author	SHA1	Message	Date
Timothy Flynn	15947aa1f0	LibUnicode: Add an hour-cycle field to DateTimeFormat's format pattern	2022-01-10 16:18:05 +01:00
Timothy Flynn	498b741434	LibUnicode: Use LibTimeZone's list of time zone names LibUnicode no longer needs to generate a list of time zone names that it parsed from metaZones.json. We can defer to the TZDB for a golden list of time zones.	2022-01-08 12:45:34 +01:00
Timothy Flynn	6d7d9dd324	LibUnicode: Do not assume time zones & meta zones have a 1-to-1 mapping The generator parses metaZones.json to form a mapping of meta zones to time zones (AKA "golden zone" in TR-35). This parser errantly assumed this was a 1-to-1 mapping.	2022-01-06 22:28:01 +01:00
Timothy Flynn	1116a29c19	LibUnicode: Remove now unused Unicode symbol loader All generated sources are now linked via weak symbols.	2022-01-04 22:49:43 +00:00
Timothy Flynn	437b9fe204	LibUnicode: Convert UnicodeData to link with weak symbols	2022-01-04 22:49:43 +00:00
Timothy Flynn	f576142fe8	LibJS+LibUnicode: Convert UnicodeLocale to link with weak symbols	2022-01-04 22:49:43 +00:00
Timothy Flynn	ba4cdf34f8	LibUnicode: Convert UnicodeDateTimeFormat to link with weak symbols	2022-01-04 22:49:43 +00:00
Timothy Flynn	98709d9be1	LibUnicode: Convert UnicodeNumberFormat to link with weak symbols Currently, we load the generated Unicode symbols with dlopen at runtime. This is unnecessary as of `565a880ce5`. Applications that want Unicode data now link directly against the shared library holding that data. So the same functionality can be achieved with weak symbols.	2022-01-04 22:49:43 +00:00
Timothy Flynn	126a3fe180	LibUnicode: Add minimal support for generic & offset-based time zones ECMA-402 now supports short-offset, long-offset, short-generic, and long-generic time zone name formatting. For example, in the en-US locale the America/Eastern time zone would be formatted as: short-offset: GMT-5 long-offset: GMT-05:00 short-generic: ET long-generic: Eastern Time We currently only support the UTC time zone, however. Therefore, this very minimal implementation does not consider GMT offset or generic display names. Instead, the CLDR defines specific strings for UTC.	2022-01-03 15:11:59 +01:00
Timothy Flynn	c417374dd6	LibUnicode: Remove linkage from LibUnicode to LibUnicodeData LibUnicodeData can now be loaded dynamically at runtime.	2021-12-21 13:09:49 -08:00
Timothy Flynn	15e1498419	LibUnicode: Dynamically load the generated UnicodeDateTimeFormat symbols	2021-12-21 13:09:49 -08:00
Timothy Flynn	a1f0ca59ae	LibUnicode: Dynamically load the generated UnicodeNumberFormat symbols	2021-12-21 13:09:49 -08:00
Timothy Flynn	09be26b5d2	LibUnicode: Dynamically load the generated UnicodeLocale symbols	2021-12-21 13:09:49 -08:00
Timothy Flynn	3fd53baa25	LibUnicode: Dynamically load the generated UnicodeData symbols The generated data for libunicodedata.so is quite large, and loading it is a price paid by nearly every application by way of depending on LibRegex. In order to defer this cost until an application actually uses one of the surrounding APIs, dynamically load the generated symbols. To be able to load the symbols dynamically, the generated methods must have demangled names. Typically, this is accomplished with `extern "C"` blocks. The clang toolchain complains about this here because the types returned from the generators are strictly C++ types. So to demangle the names, we use the asm() compiler directive to manually define a symbol name; the caveat is that we must be sure the symbols are unique. As an extra precaution, we prefix each symbol name with "unicode_". For more details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html The symbol loader used in this implementation provides the additional benefit of removing many [[maybe_unused]] attributes from the LibUnicode methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the loader is able to stub out the function pointers it returns. Note that as of this commit, LibUnicode is still directly linked against LibUnicodeData. This commit is just a first step towards removing that.	2021-12-21 13:09:49 -08:00
Timothy Flynn	749d5ebd68	LibUnicode: Add missing forward declarations to forwarding header	2021-12-21 13:09:49 -08:00
Timothy Flynn	97508b74eb	LibUnicode: Remove declaration of function which moved to another header Unicode::get_number_system_symbol is declared in UnicodeNumberFormat and defined in UnicodeNumberFormat.cpp.	2021-12-21 13:09:49 -08:00
Timothy Flynn	92233660b8	LibUnicode: Compile generated sources optimized for size This breaks LibUnicode into two libraries: LibUnicode containing the public APIs for accessing the library, and LibUnicodeData containing the generated source files. LibUnicodeData has compile options optimized for size, which save about 1MB of data in total.	2021-12-15 13:26:03 +00:00
Timothy Flynn	62ff029890	LibUnicode: Generate CalendarSymbols in a predetermined order Similar to commit `2a7f36b392`, this change moves the generated CalendarSymbol enumeration to the public LibUnicode/NumberFormat.h header with a pre-defined set of symbols that we need. This is to prepare for uniquely generating the CalendarSymbols structure.	2021-12-13 21:28:56 -08:00
Timothy Flynn	2a7f36b392	LibJS+LibUnicode: Generate unique numeric symbol lists There are 443 number system objects generated, each of which held an array of number system symbols. Of those 443 arrays, only 39 are unique. To uniquely store these, this change moves the generated NumericSymbol enumeration to the public LibUnicode/NumberFormat.h header with a pre-defined set of symbols that we need. This is to ensure the generated, unique arrays are created in a known order with known symbols. While it is unfortunate to no longer discover these symbols at generation time, it does allow us to ignore unwanted symbols and perform fewer string-to-enumeration conversions at lookup time.	2021-12-11 14:17:47 +00:00
Timothy Flynn	a417c23de0	LibUnicode: Parse and generate per-locale day period ranges	2021-12-10 21:27:24 +00:00
Timothy Flynn	fa8e881cfa	LibUnicode: Parse and generate secondary day period symbols Generate morning2, afternoon2, evening2, and night2 symbols.	2021-12-10 21:27:24 +00:00
Timothy Flynn	76aab821f4	LibJS+LibUnicode: Rename some Unicode::DayPeriod values In the CLDR, there aren't "night" values, there are "night1" & "night2" values. This is for locales which use a different name for nighttime depending on the hour. For example, the ja locale uses "夜" between the hours of 19:00 and 23:00, and "夜中" between the hours of 23:00 and 04:00. Our CLDR parser is currently ignoring "night2", so this rename is to prepare for that. We could probably come up with better names, but in the end, the API in LibUnicode will be such that outside callers won't even see Night1, etc.	2021-12-10 21:27:24 +00:00
Timothy Flynn	2024d9e9ea	LibUnicode: Add method to combine two format pattern skeletons The fields of the generated elements must be in the same order as the table here: https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table Further, only one field from each group of fields is allowed.	2021-12-09 23:43:04 +00:00
Timothy Flynn	9d4c4303fd	LibUnicode: Parse and generate date time range format patterns	2021-12-09 23:43:04 +00:00
Timothy Flynn	fe84a365c2	LibUnicode: Parse and generate format pattern skeletons Pattern skeletons are more or less the "key" of format patterns. Every format pattern is assigned a skeleton. Interval patterns (which are not yet parsed) are also assigned a skeleton - this is used to match them to an "owning" format pattern. So we will use the skeleton generated here to match format patterns at runtime with their available interval patterns. An alternative approach would be to append interval patterns directly to their owning format pattern, but this has some drawbacks: 1. Skeletons aren't totally unique. A skeleton may appear in both the "dateFormats" and "availableFormats" objects, in which case the same interval formats would be generated more than once. 2. Otherwise unique format patterns may only differ by the interval patterns assigned to them. This would cause the UniqueStorage for the format patterns to increase in size, impacting both compile times and libunicode.so size.	2021-12-09 23:43:04 +00:00
Timothy Flynn	b76e44f66f	LibUnicode: Parse and generate time zone names in long and short form	2021-12-08 11:29:36 +00:00
Timothy Flynn	2bbf8aa24c	LibUnicode: Generate era, month, weekday and day period calendar symbols The parsing in parse_calendar_symbols() might be a bit more verbose than it really needs to be, but it is to ensure the symbols are generated in a known order that we can control with enumerations.	2021-12-08 11:29:36 +00:00
Timothy Flynn	6ace4000bf	LibJS+LibUnicode: Supply field type in CalendarPattern's for-each method Some callers will want different behavior depending on what field is being provided to the callback.	2021-12-08 11:29:36 +00:00
Timothy Flynn	f02ecc1da2	LibUnicode: Fix copy-paste error in calendar_pattern_style_to_string The string returned must be lowercase.	2021-12-01 16:36:26 +00:00
Timothy Flynn	7e6ad172a4	LibUnicode: Support code point names that apply to ranges of code points For example, consider the following adjacent entries in UnicodeData.txt: 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;; 4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;; Our current implementation would assign the display name "CJK Ideograph Extension A" to code points U+3400 & U+4DBF, but not to the code points in between. Not only should those code points be assigned a name, but the Unicode spec also has formatting rules on what the names should be (the names for these ranged code points are not as they appear in UnicodeData.txt). The spec also defines names for code point ranges that actually are listed individually in UnicodeData.txt. For example: 2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;; 2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;; 2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;; Code points are only coalesced into a range if all fields after the name are equivalent. Our parser will insert the range and its name formatting pattern when it comes across the first code point in that range, then ignore other code points in that range. This reduces the number of names we generate by nearly 2,000.	2021-11-30 11:24:02 +01:00
Timothy Flynn	16151aa7d5	LibJS+LibUnicode: Implement the Intl.DateTimeFormat constructor	2021-11-29 22:48:46 +00:00
Timothy Flynn	6dbdfb6ba1	LibUnicode: Add special handling of hour cycle (hc) Unicode keywords For other keywords, allowed values per locale are generated at compile time. But since the CLDR doesn't present hour cycles on a per-locale basis, and hour cycles lookups depend on runtime data, we must handle hour cycle keyword lookups differently than other keywords.	2021-11-29 22:48:46 +00:00
Timothy Flynn	48ce72e472	LibUnicode: Parse and generate regional hour cycles Unlike most data in the CLDR, hour cycles are not stored on a per-locale basis. Instead, they are keyed by a string that is usually a region, but sometimes is a locale. Therefore, given a locale, to determine the hour cycles for that locale, we: 1. Check if the locale itself is assigned hour cycles. 2. If the locale has a region, check if that region is assigned hour cycles. 3. Otherwise, maximize that locale, and if the maximized locale has a region, check if that region is assigned hour cycles. 4. If the above all fail, fallback to the "001" region. Further, each locale's default hour cycle is the first assigned hour cycle.	2021-11-29 22:48:46 +00:00
Timothy Flynn	7872934861	LibUnicode: Parse and generate available candidate format patterns These formats are used by ECMA-402 when neither a date nor time style is specified. In that case, these patterns are searched for a best match.	2021-11-29 22:48:46 +00:00
Timothy Flynn	f471ecdbe9	LibUnicode: Parse and generate date, time, and date-time format patterns	2021-11-29 22:48:46 +00:00
Timothy Flynn	914675e826	LibJS+LibUnicode: Separate number formatting methods from Locale.h Currently, we generate separate data files for locale and number format related tables/methods, but provide public accessors for all of the data in one Locale.h file. Rather than continuing this trend for date-time, relative time, etc. formatting, it's a bit easier to reason about if the public accessors are also in separate files.	2021-11-29 22:48:46 +00:00
Ben Wiederhake	b06b54772e	Meta+LibUnicode: Provide code point names through library	2021-11-20 00:31:55 +01:00
Timothy Flynn	cafb717486	LibUnicode: Parse and generate CLDR unit data for Intl.NumberFormat The units data is in another CLDR package, cldr-units.	2021-11-16 23:14:09 +00:00
Timothy Flynn	80493908d3	LibUnicode: Tweak the definition of the plurality "many" As noted at the top of this method, this is a naive implementation of the Unicode plurality specification. But for now, we should tweak the definition of "many" to be "more than 2" (which is what I had in mind when I wrote this, but forgot about fractions).	2021-11-16 23:14:09 +00:00
Timothy Flynn	04b8b87c17	LibJS+LibUnicode: Support multiple identifiers within format pattern This wasn't the case for compact patterns, but unit patterns can contain multiple (up to 2, really) identifiers that must each be recognized by LibJS. Each generated NumberFormat object now stores an array of identifiers parsed. The format pattern itself is encoded with the index into this array for that identifier, e.g. the compact format string "0K" will become "{number}{compactIdentifier:0}".	2021-11-16 23:14:09 +00:00
Timothy Flynn	3b68370212	LibJS+LibUnicode: Rename the generated compact_identifier to identifier This field is currently used to store the StringView into the compact name/symbol in the format string. Units will need to store a similar field, so rename the field to be more generic, and extract the parser for it.	2021-11-16 23:14:09 +00:00
Timothy Flynn	6d34a0b4e8	LibJS+LibUnicode: Rename method to select a NumberFormat plurality Instead of currency pattern lookups within select_currency_unit_pattern, rename the method to select_pattern_with_plurality and accept any list of patterns. This method will be needed for units.	2021-11-16 23:14:09 +00:00
Timothy Flynn	1f546476d5	LibJS+LibUnicode: Fix computation of compact pattern exponents The compact scale of each formatting rule was precomputed in commit: `be69eae651` Using the formula: compact scale = magnitude - pattern scale This computation was off-by-one. For example, consider the format key "10000-count-one", which maps to "00 thousand" in en-US. What we are really after is the exponent that best represents the string "thousand" for values greater than 10000 and less than 100000 (the next format key). We were previously doing: log10(10000) - "00 thousand".count("0") = 2 Which clearly isn't what we want. Instead, if we do: log10(10000) + 1 - "00 thousand".count("0") = 3 We get the correct exponent for each format key for each locale. This commit also renames the generated variable from "compact_scale" to "exponent" to match the terminology used in ECMA-402.	2021-11-16 00:56:55 +00:00
Timothy Flynn	48d5684780	LibUnicode: Parse compact identifiers and replace them with a format key For example, in en-US, the decimal, long compact pattern for numbers between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand" is the compact identifier, and the generated format pattern is now "{number} {compactIdentifier}". This also generates that identifier as its own field in the NumberFormat structure.	2021-11-16 00:56:55 +00:00
Timothy Flynn	30fbb7d9cd	LibUnicode: Parse and generate scientific formatting rules	2021-11-14 17:00:35 +00:00
Timothy Flynn	3b7f5af042	LibUnicode: Generate primary and secondary number grouping sizes Most locales have a single grouping size (the number of integer digits to be written before inserting a grouping separator). However some have a primary and secondary size. We parse the primary size as the size used for the least significant integer digits, and the secondary size for the most significant.	2021-11-14 10:35:19 +00:00
Timothy Flynn	c65dea64bd	LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern In order to implement Intl.NumberFormat.prototype.formatToParts, do not replace {currency} keys in the format pattern before ECMA-402 tells us to. Otherwise, the array returned by formatToParts will not contain the expected currency key. Early replacement was done to avoid resolving the currency display more than once, as it involves a couple of round trips to search through LibUnicode data. So this adds a non-standard method to NumberFormat to do this resolution and cache the result. Another side effect of this change is that LibUnicode must replace unit format patterns of the form "{0} {1}" during code generation. These were previously skipped during code generation because LibJS would just replace the keys with the currency display at runtime. But now that the currency display injection is delayed, any {0} or {1} keys in the format pattern will cause PartitionNumberPattern to abort.	2021-11-13 19:01:25 +00:00
Timothy Flynn	0c9711efba	LibUnicode: Handle all space code points when creating currency patterns Previously, we were checking if the code point immediately before/after the {currency} key was U+00A0 (non-breaking space). Instead, to handle other spacing code points, we must check if the surrounding code point has the separator general category.	2021-11-13 19:01:25 +00:00
Timothy Flynn	ada4bab405	LibUnicode: Remove GeneralCategory::Symbol string lookup When I originally wrote this method, I had it in LibJS, where we can't refer to the GeneralCategory enumeration directly. This is a big TODO: anyone outside of LibUnicode can't assume the generated enumerations exist and must get these values by string lookup. But this function ended up living in LibUnicode, which can reference the enumeration.	2021-11-13 19:01:25 +00:00
Timothy Flynn	a701ed52fc	LibJS+LibUnicode: Fully implement currency number formatting Currencies are a bit strange; the layout of currency data in the CLDR is not particularly compatible with what ECMA-402 expects. For example, the currency format in the "en" and "ar" locales for the Latin script are: en: "¤#,##0.00" ar: "¤\u00A0#,##0.00" Note how the "ar" locale has a non-breaking space after the currency symbol (¤), but "en" does not. This does not mean that this space will appear in the "ar"-formatted string, nor does it mean that a space won't appear in the "en"-formatted string. This is a runtime decision based on the currency display chosen by the user ("$" vs. "USD" vs. "US dollar") and other rules in the Unicode TR-35 spec. ECMA-402 shies away from the nuances here with "implementation-defined" steps. LibUnicode will store the data parsed from the CLDR however it is presented; making decisions about spacing, etc. will occur at runtime based on user input.	2021-11-13 11:52:45 +00:00
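
The exponent arithmetic described in commit 1f546476d5 can be sketched roughly as follows. This is an illustration only, not the generator's actual code: the function name `compact_exponent` and its parameters are hypothetical, and the digit count of the format key stands in for `log10(key) + 1`, which is equivalent when the key is a power of ten.

```cpp
#include <algorithm>
#include <string_view>

// Hypothetical helper (not part of LibUnicode): computes the exponent for a
// compact format key such as "10000-count-one" mapped to "00 thousand".
// Assumes magnitude_key is a positive power of ten, as CLDR format keys are.
int compact_exponent(long long magnitude_key, std::string_view pattern)
{
    // Counting decimal digits gives log10(key) + 1 for powers of ten,
    // without relying on floating-point math.
    int digits = 0;
    for (auto value = magnitude_key; value > 0; value /= 10)
        ++digits;

    // The number of '0' placeholders in the pattern, e.g. 2 for "00 thousand".
    auto zeros = static_cast<int>(std::count(pattern.begin(), pattern.end(), '0'));

    return digits - zeros;
}
```

With the values from the commit message, `compact_exponent(10000, "00 thousand")` evaluates to 5 - 2 = 3, matching the corrected result.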

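The four-step hour cycle lookup described in commit 48ce72e472 might look something like the sketch below. Everything here is an assumption made for illustration: the `Locale` struct, the `HourCycleTable` alias, the `language-region` key format, and the caller-supplied `maximize` callback are hypothetical and are not LibUnicode's API.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified locale: a language plus an optional region.
struct Locale {
    std::string language;
    std::optional<std::string> region;
};

// Hypothetical table keyed by a region ("US", "001") or a locale ("en-XX").
using HourCycleTable = std::map<std::string, std::vector<std::string>>;

std::vector<std::string> hour_cycles_for(
    Locale const& locale,
    HourCycleTable const& table,
    std::function<Locale(Locale const&)> const& maximize)
{
    auto lookup = [&](std::string const& key) -> std::vector<std::string> const* {
        auto it = table.find(key);
        return it == table.end() ? nullptr : &it->second;
    };

    // 1. Check if the locale itself is assigned hour cycles.
    std::string tag = locale.language;
    if (locale.region)
        tag += "-" + *locale.region;
    if (auto const* cycles = lookup(tag))
        return *cycles;

    if (locale.region) {
        // 2. If the locale has a region, check that region.
        if (auto const* cycles = lookup(*locale.region))
            return *cycles;
    } else {
        // 3. Otherwise, maximize the locale and check the resulting region.
        auto maximized = maximize(locale);
        if (maximized.region) {
            if (auto const* cycles = lookup(*maximized.region))
                return *cycles;
        }
    }

    // 4. Fall back to the "001" (world) region.
    if (auto const* cycles = lookup("001"))
        return *cycles;
    return {};
}
```

Per the commit message, a locale's default hour cycle would then be the first element of the returned list.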

256 commits