csv import date: Add dateformat "Locale" to pick current locale #2011

Merged
merged 1 commit into from
Sep 14, 2024

Conversation

christopherlam
Contributor

Another approach to #2010: add a "Locale" date format, which uses the current ICU parser with the current locale. It still has the same fault as #2010.

@christopherlam christopherlam deleted the csv-date-locale-2 branch September 8, 2024 14:16
@gjanssens
Member

I think this approach will actually get you closest to what you want in a universal way, provided the locale is set properly. While icu in the Australian locale doesn't do what you want it to do, you could set LC_TIME=en_US for gnucash, since en_US does parse it properly. I don't know if we have to do something extra in the code to have icu pick up this environment variable or whether it understands its own set of variables (I know for example postgres allows for icu specific parameters, but I don't know whether this is icu or postgres specific). The added advantage would be that each user could override LC_TIME as they see fit. So far only a few requests for a date format outside of what we offer have been made. So a workaround that requires setting an environment file may be sufficient for now.

As to your remark that icu is not properly parsing "Sep" in the Australian locale, it looks like this was an intentional change. Apparently it's not considered as set in stone but it will need someone (or a few people) to offer enough "evidence" to warrant the change back from "Sept" to "Sep". Likewise for "June/Jun" and "July/Jul".

@christopherlam
Contributor Author

Ok. I've just tried with all ICU English locales; all "dmy" locales output and expect "Sept" and all "mdy" locales output and expect "Sep" ☹️

@gjanssens
Member

I have reworked your #2010 experiment a little to test for locales that can handle your Australian dates. On my system three remained after testing a date in each month:

76: 805 locales available. Testing 12 dates.
76: 08 Jan 2021 - available locales:  af af_NA af_ZA asa asa_TZ bem bem_ZM en_001 en_150 en_AE en_AG en_AI en_AT en_AU en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM fr_MA fy fy_NL ia ia_001 id id_ID jmc jmc_TZ jv jv_ID kde kde_TZ kea kea_CV ksb ksb_TZ lg lg_UG luy luy_KE mer mer_KE ms ms_BN ms_ID ms_MY ms_SG mt mt_MT naq naq_NA nl nl_AW nl_BE nl_BQ nl_CW nl_NL nl_SR nl_SX rwk rwk_TZ sq sq_AL sq_MK sq_XK su su_Latn su_Latn_ID sv sv_AX sv_FI sv_SE sw sw_CD sw_KE sw_TZ sw_UG vun vun_TZ xog xog_UG
76: 08 Feb 2021 - available locales:  af af_NA af_ZA asa asa_TZ bem bem_ZM en_001 en_150 en_AE en_AG en_AI en_AT en_AU en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM fy fy_NL ia ia_001 id id_ID jmc jmc_TZ jv jv_ID kde kde_TZ kea kea_CV ksb ksb_TZ lg lg_UG luy luy_KE mer mer_KE ms ms_BN ms_ID ms_MY ms_SG naq naq_NA nl nl_AW nl_BE nl_BQ nl_CW nl_NL nl_SR nl_SX rwk rwk_TZ sv sv_AX sv_FI sv_SE sw sw_CD sw_KE sw_TZ sw_UG vun vun_TZ xog xog_UG
76: 08 Mar 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_AU en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM ia ia_001 id id_ID jv jv_ID kea kea_CV lg lg_UG luy luy_KE naq naq_NA xog xog_UG
76: 08 Apr 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_AU en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM ia ia_001 id id_ID jv jv_ID luy luy_KE naq naq_NA
76: 08 May 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_AU en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM naq naq_NA
76: 08 Jun 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM naq naq_NA
76: 08 Jul 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM naq naq_NA
76: 08 Aug 2021 - available locales:  en_001 en_150 en_AE en_AG en_AI en_AT en_BB en_BE en_BM en_BS en_BW en_CC en_CH en_CK en_CM en_CX en_CY en_DE en_DG en_DK en_DM en_ER en_FI en_FJ en_FK en_FM en_GB en_GD en_GG en_GH en_GI en_GM en_GY en_HK en_IE en_IL en_IM en_IO en_JE en_JM en_KE en_KI en_KN en_KY en_LC en_LR en_LS en_MG en_MO en_MS en_MT en_MU en_MW en_MY en_NA en_NF en_NG en_NL en_NR en_NU en_PG en_PN en_PW en_RW en_SB en_SC en_SD en_SE en_SG en_SH en_SI en_SL en_SS en_SX en_SZ en_TC en_TK en_TO en_TT en_TV en_TZ en_UG en_VC en_VG en_VU en_WS en_ZA en_ZM naq naq_NA
76: 08 Sep 2021 - available locales:  en_AE naq naq_NA
76: 08 Oct 2021 - available locales:  en_AE naq naq_NA
76: 08 Nov 2021 - available locales:  en_AE naq naq_NA
76: 08 Dec 2021 - available locales:  en_AE naq naq_NA
76: 3 locales left, checked in 0.217325 seconds:
76:  en_AE naq naq_NA

So what happens if you try en_AE naq or naq_NA? You may need to test both the short and the medium format.
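
The elimination test above can be sketched roughly as follows. This is not the reworked #2010 experiment itself, which runs against ICU; it is a small standard-library illustration of the same idea. The `MONTH_TABLES` entries and the style names `three_letter` and `long_sept` are toy stand-ins invented for this sketch, not real CLDR data.

```python
from datetime import date

MONTHS_3 = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy stand-ins for per-locale month abbreviations (assumption, NOT real CLDR data).
MONTH_TABLES = {
    "three_letter": MONTHS_3,
    "long_sept": [m if m != "Sep" else "Sept" for m in MONTHS_3],
}


def parse_dmy(text, months):
    """Parse 'dd Mon yyyy' strictly against one month table; None on failure."""
    parts = text.split()
    if len(parts) != 3 or parts[1] not in months:
        return None
    try:
        return date(int(parts[2]), months.index(parts[1]) + 1, int(parts[0]))
    except ValueError:
        # Non-numeric day/year or an impossible date such as 31 Feb.
        return None


def surviving_styles(samples, tables=MONTH_TABLES):
    """Keep only the styles that parse every sample date."""
    remaining = set(tables)
    for text in samples:
        remaining = {name for name in remaining
                     if parse_dmy(text, tables[name]) is not None}
    return remaining


# One sample date per month, as in the test run above.
samples = [f"08 {m} 2021" for m in MONTHS_3]
print(sorted(surviving_styles(samples)))
```

Each sample date can only shrink the candidate set, which is why a single month with an unexpected abbreviation is enough to drop a style.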

@christopherlam
Contributor Author

So what happens if you try en_AE naq or naq_NA? You may need to test both the short and the medium format.

Looks good; however, in my setup en_AE doesn't exist. I don't know why. How about combining this ICU approach with #2015?

@christopherlam christopherlam force-pushed the csv-date-locale-2 branch 3 times, most recently from e7671b9 to ead1fe8 Compare September 12, 2024 11:50
@christopherlam christopherlam marked this pull request as ready for review September 12, 2024 12:25
@christopherlam
Contributor Author

I think this is ready. Maybe "Locale" should be the first choice to match the Currency format (instead of the current y-m-d, or this branch's boost UK parser).

The icu formatter and calendar objects are generated only once.

Here's a small issue with the ICU locale: the csv save settings will store "Locale" but won't specify which locale.

@gjanssens
Member

Here's a small issue with the ICU locale: the csv save settings will store "Locale" but won't specify which locale.

You could argue that the exact locale is not set in the import preview itself and so we shouldn't save it. It's probably not the most user friendly view, but a pragmatic one in this case.

Saving the actual current locale even though it's not explicitly set can equally cause unexpected behaviour.
To really solve this, we will have to offer a better way to select actual locales in the importer. As we established, that's a larger effort than we currently can or want to spend.

What is a problem IMO is that your added dropdown options alter the meanings of currently saved presets. That should be avoided. To solve this, you could add the new options below the ones that were already there.

That aside, I'm not too much in favour of adding the boost options. If you're not living in the US or the UK, these options hint at a simple solution for other countries that we can't provide. Boost only provides these two methods and won't for, say, the Netherlands or Vietnam. I can see the usefulness of the ISO date option. I don't know what others think of this.

@christopherlam
Contributor Author

How about hiding boost's parser behind the existing options? Without boost my "30 Sep 2024" remains unparsable! Unfortunately dd MMM yyyy is becoming a de facto standard.

@jralls
Member

jralls commented Sep 13, 2024

How about hiding boost's parser behind the existing options? Without boost my "30 Sep 2024" remains unparsable!

Maybe this is what you mean: if the locale is en_XX try the ICU parser and if that fails try the boost::gregorian one. If both fail raise an error.

@christopherlam
Contributor Author

How about hiding boost's parser behind the existing options? Without boost my "30 Sep 2024" remains unparsable!

Maybe this is what you mean: if the locale is en_XX try the ICU parser and if that fails try the boost::gregorian one. If both fail raise an error.

I feel this is hacky. This current branch accepts @gjanssens' feedback and will augment the existing d-m-y, m-d-y and y-m-d options to use the boost parser. See tests.

@jralls
Member

jralls commented Sep 13, 2024

I don't think it's any more hacky than overloading the dmy/mdy options to accept month names for English only. The non-hacky fix is to get the Unicode Consortium to fix the CLDR (good luck with that) or ICU to get their parser to recognize the 3-letter versions of those months (slightly more likely but much patience required).

Another more general hack would be to introduce a correction table of some sort that might offer an alternative month abbreviation when there's something goofy in the CLDR.
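
A correction table of that kind might look something like the sketch below. It is only an illustration of the idea, not code from this branch: `ALTERNATIVES`, `candidate_spellings`, `parse_with_corrections` and the toy `strict_parse` are all hypothetical names, and `STRICT_MONTHS` is an assumed stand-in for a locale that expects "Sept", "June" and "July" as mentioned earlier in the thread.

```python
from datetime import date

# Hypothetical correction table: an abbreviation the strict parser rejects,
# mapped to the spelling it is assumed to expect.
ALTERNATIVES = {"Sep": "Sept", "Jun": "June", "Jul": "July"}

# Toy month table standing in for a strict locale parser (assumption).
STRICT_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "June",
                 "July", "Aug", "Sept", "Oct", "Nov", "Dec"]


def strict_parse(text):
    """Parse 'dd Mon yyyy'; raises ValueError on anything unexpected."""
    day, month, year = text.split(" ")
    return date(int(year), STRICT_MONTHS.index(month) + 1, int(day))


def candidate_spellings(text, alternatives=ALTERNATIVES):
    """Yield the original text first, then variants with one token swapped."""
    yield text
    tokens = text.split(" ")
    for i, token in enumerate(tokens):
        if token in alternatives:
            yield " ".join(tokens[:i] + [alternatives[token]] + tokens[i + 1:])


def parse_with_corrections(text, parser=strict_parse):
    """Try the parser on each candidate spelling; fail only if all fail."""
    last_error = None
    for candidate in candidate_spellings(text):
        try:
            return parser(candidate)
        except ValueError as err:
            last_error = err
    raise ValueError(f"unparsable date: {text!r}") from last_error


print(parse_with_corrections("30 Sep 2024"))
```

The original spelling is always tried first, so input that the strict parser already accepts never goes through the table.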

But this is workable enough for a first release. There's no point in expending further effort until users tell us that it falls short.

@gjanssens
Member

I wonder, how much overlap is there between the regexes we have for d-m-y/m-d-y and the date_from_uk_string/date_from_us_string boost functions?

If the boost functions are a superset, we could just replace them instead of our regex based options. On the other hand, if there are date formats that our regex properly parses and boost doesn't, I would propose to first try the regex and then the boost function.

@christopherlam
Contributor Author

They're complementary. The boost functions don't accept 2-digit years but parse wordy months. The current implementation uses heuristics for 2-digit years but fails on wordy months.
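
To make the "complementary" point concrete, here is a rough sketch of trying a numeric regex parser first and a month-name parser second. It uses only the Python standard library as a stand-in; it is not the importer's regex and not boost. The function names are hypothetical, and the `pivot=70` rule for 2-digit years is an assumed heuristic chosen for illustration, since the thread does not say which heuristic the importer uses.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

NUMERIC_DMY = re.compile(r"(\d{1,2})[-/.](\d{1,2})[-/.](\d{2}|\d{4})")
WORDY_DMY = re.compile(r"(\d{1,2})[ -]([A-Za-z]{3})[ -](\d{4})")


def expand_year(year_text, pivot=70):
    """Assumed heuristic: 2-digit years below the pivot are 20xx, others 19xx."""
    year = int(year_text)
    if len(year_text) == 4:
        return year
    return 2000 + year if year < pivot else 1900 + year


def parse_numeric_dmy(text):
    """Numeric d-m-y; accepts 2-digit years but no month names."""
    m = NUMERIC_DMY.fullmatch(text.strip())
    if not m:
        raise ValueError(f"not a numeric d-m-y date: {text!r}")
    return date(expand_year(m.group(3)), int(m.group(2)), int(m.group(1)))


def parse_wordy_dmy(text):
    """Month-name d-m-y; requires a 4-digit year, like the boost parsers."""
    m = WORDY_DMY.fullmatch(text.strip())
    if not m or m.group(2).lower() not in MONTHS:
        raise ValueError(f"not a month-name d-m-y date: {text!r}")
    return date(int(m.group(3)), MONTHS[m.group(2).lower()], int(m.group(1)))


def parse_dmy(text):
    """Try the numeric parser first, then the month-name parser."""
    for parser in (parse_numeric_dmy, parse_wordy_dmy):
        try:
            return parser(text)
        except ValueError:
            continue
    raise ValueError(f"unparsable d-m-y date: {text!r}")


print(parse_dmy("30-09-24"), parse_dmy("30 Sep 2024"))
```

With this split, "30 Sep 24" is still rejected: the numeric parser does not understand "Sep" and the month-name parser insists on four year digits.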

@gjanssens
Member

Ok. For me your implementation is good enough then.

@christopherlam
Contributor Author

Thank you! Your "good enough" means a "good start" because I haven't completely tested enough combinations of invalid dates. There are some slight differences in the behaviours that will need tweaking before merging in.

@christopherlam
Contributor Author

Ok, now I'm happy that the tests are complete. The exception for invalid dates, e.g. 31-Feb, isn't "std::invalid_argument"; therefore the tests were modified to catch all exceptions.

@christopherlam christopherlam force-pushed the csv-date-locale-2 branch 4 times, most recently from 0dcb877 to b28408d Compare September 13, 2024 14:56
1. Add dateformat "Locale" with ICU; uses current locale for date
   parsing. ICU's locale date parser may parse "3 May 2023" or
   "2024年9月13日" (LC_TIME=zh_TW.utf8) and maybe others.

2. Augment d-m-y m-d-y and y-m-d with boost UK/US/ISO parsers. This allows
   CSV import of dates with months as words as "30 Sep 2023" or
   "May 4, 1978" or "2023-Dec-25". Note boost parser cannot recognise
   2-digit years, therefore "30 Sep 24" is invalid.
@code-gnucash-org code-gnucash-org merged commit ab641b3 into Gnucash:stable Sep 14, 2024
4 checks passed
@christopherlam christopherlam deleted the csv-date-locale-2 branch September 14, 2024 03:20