-
Notifications
You must be signed in to change notification settings - Fork 542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore using icu4 #1227
base: main
Are you sure you want to change the base?
Explore using icu4 #1227
Conversation
Relates to chronotope#854, but doesn't replace datetime code as a mentioned possible approach. Uses icu4 for: * short_months * long_months * short_weekdays * long_weekdays * am_pm * decimal_point Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...): * d_fmt * d_t_fmt * t_fmt * t_fmt_ampm
Codecov Report
@@ Coverage Diff @@
## 0.4.x #1227 +/- ##
==========================================
- Coverage 86.14% 86.03% -0.11%
==========================================
Files 37 37
Lines 13394 13663 +269
==========================================
+ Hits 11538 11755 +217
- Misses 1856 1908 +52
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
What's the benefit here? How does it impact compile sizes and binary sizes? |
My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in For the compile size are you referring to the size of I made a little test wrapper that depends on chrono ( without icu the bin is 4.6M |
So a 3M binary size penalty seems like a big downside... |
Good that you look at localization overall and not just adding "nd", "th" suffixes 👍. That looks to English-focused to me, as there are similar but different shortcomings when formatting in other languages. But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU. Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc? Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. |
We already don't include timezone data but leave it in chrono-tz because it would add 2+ Mb to the binary size (and there is the problem it can get outdated...). I would argue that that data is a lot closer to the core functionality of chrono. |
assert_eq!(dt.format_localized("%T%.3f", ar_SY).to_string(), "18:58:00.123"); | ||
assert_eq!(dt.format_localized("%T%.6f", ar_SY).to_string(), "18:58:00.123456"); | ||
assert_eq!(dt.format_localized("%T%.9f", ar_SY).to_string(), "18:58:00.123456780"); | ||
assert_eq!(dt.format_localized("%T%.f", ar_SY).to_string(), "18:58:00٫123456780"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An Arabic comma (or what is it called) in roman numerals doesn't seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, some locales have outdigits
defined as part of LC_CTYPE
, but pure-rust-locales
doesn't support these. I did check though and there's nothing defined for ar_SY
. I beleive that ICU has the numbers, though I'm still researching.
let nano = t.nanosecond() % 1_000_000_000; | ||
if nano == 0 { | ||
let ret = | ||
match *spec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rustfmt is at it again 😆. #1082 (comment)
Some links for a bit of context/history of our localization support:
Some of the goals where to not generate the data at build time, to not increase the binary size with locale data that is not used, and to work without allocating. |
I'll take a look at generating rust files from the icu data that only has the data we'd need. For example there was a lot of extra information included about non-gregorian calendars. Producing data in a format like |
Upon further testing by dumping out some mods from the ICU data I see results like: #[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
pub const AM_PM: &[&str] = &["AM", "PM"];
pub const DECIMAL_POINT: &str = ".";
pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
} Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either) The data for decimal and digits looks promising (the digits are displayed 9-0 on github, even though they are 0-9 in my terminal. Probably due to left to right vs right to left. As a side note, although arabic is read right to left, the numbers are constructed left to right): pub mod ar_SY {
[...]
pub const DECIMAL_POINT: &str = "٫";
pub const DIGITS: &[char] = &['٠', '١', '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'];
} Still need to investigate the ordinal numbers... |
This shows a comparison between the glibc and icu for am_pm and the decimal point https://gist.github.com/bim9262/920d29bd6155e207239bf11f42fe5da6 There are a lot of interesting cases where glibc has something more specific than am/pm or icu will be more specific. Some examples to check out:
I'd be tempted to take use glibc unless it's blank or am/pm then try icu as the fallback. |
ICU has a set of patterns for date & time components. See https://cldr.unicode.org/translation/date-time/date-time-patterns It covers a lot more cases than the |
This sort of thing happens when the ICU project's data for a locale is incomplete. You can see the underlying data for that locale at https://github.com/unicode-org/cldr/blob/main/common/main/aa_DJ.xml. It's nearly empty. Compare it to a more complete locale at https://github.com/unicode-org/cldr/blob/main/common/main/am.xml. There's a whole system for locale "inheritance". In the case of |
One more thing re: CLDR formatting. If you want to support this I think it's best to come up with new functions that accept these formats, like |
In my opinion chrono should focus mostly on being a good date-and-time library. ICU can be better at localization. I propose to close this PR and instead make it easy to convert to/from |
Relates to #854, but doesn't replace datetime code as a mentioned possible approach.
Uses icu4 for:
Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...):
Thanks for contributing to chrono!
If your feature is semver-compatible, please target the 0.4.x branch;
the main branch will be used for 0.5.0 development going forward.
Please consider adding a test to ensure your bug fix/feature will not break in the future.