Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 797724 - Add support for custom date formats when importing CSV transactions #1710

Closed
wants to merge 4 commits into from

Conversation

christopherlam
Copy link
Contributor

@christopherlam christopherlam commented Jul 17, 2023

Allows csv import to interpret month Jan/Feb/Mar etc. Silliest implementation ever.

@christopherlam christopherlam force-pushed the stable-797724 branch 3 times, most recently from d793e2b to e0ae1c7 Compare July 17, 2023 15:42
@christopherlam christopherlam marked this pull request as ready for review July 17, 2023 16:20
@christopherlam
Copy link
Contributor Author

christopherlam commented Jul 18, 2023

This last one liner would solve majority* cases of month parsing, until we can find a good locale sensitive month parser.

*Majority users

I am not sure I've seen CSVs with localised dates such as "10 fév 2023" or "2023 年 7 月 1 日" or "१ जुलाई २०२३".

@jralls
Copy link
Member

jralls commented Jul 18, 2023

Majority users

Well, all of the half-dozen users who've reported the problem.

I am not sure I've seen CSVs with localised dates such as "10 fév 2023" or "2023 年 7 月 1 日" or "१ जुलाई २०२३".

But how often have you downloaded CSVs from French, Chinese, or Indian banks? I don't remember ever seeing a CSV with English month names instead of numbers, but if one American bank does it then there are bound to be others around the world.

The gold standard for localization is ICU. Their date parsing needs a formatter. Unfortunately both the boost::locale wrapper and boost::date_time I/O are stream oriented. That's a pain, but we deal with it elsewhere using std::stringstream.

You're ignoring half the problem, though: All three require that you tell them what format—%m, %b, or %B—to expect the month format in. Rather than trying to use a regex to figure it out, ask the user by adding e.g. 'b-d-y' and 'B-d-y' to the date selector listmodel.

The Chinese case is special: it's still "%Y%m%d" as 年, 月, and 日 can be converted to - and the result passed as an ISO formatted date. Do Indians really write dates with devangari numerals?

@christopherlam
Copy link
Contributor Author

christopherlam commented Jul 20, 2023

Here's the work so far but the compilation phase doesn't compile.

&& /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/bin/c++ -Werror -Wall -Wmissing-declarations -g gnucash/CMakeFiles/gnucash.dir/gnucash.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-commands.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-core-app.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-gresources.c.o -o bin/gnucash -Wl,-rpath,/home/chris/sources/gnucash/stable-build/lib:/home/chris/sources/gnucash/stable-build/lib/gnucash:/gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib:/gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib:/gnu/store/n7vynkl0rkqmvahxji6530n8hmfscxsd-harfbuzz-5.3.1/lib:/gnu/store/ams35jy7lmxw5xas4b33zznwdx7m1d18-at-spi2-core-2.45.90/lib:/gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib:/gnu/store/1ig678c8vqxvr60x8swmc5wriga7sjf3-gdk-pixbuf-2.42.8/lib:/gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib: lib/libgnc-module.so lib/gnucash/libgnc-qif-import.so lib/gnucash/libgnc-csv-import.so lib/gnucash/libgnc-csv-export.so lib/gnucash/libgnc-log-replay.so lib/gnucash/libgnc-bi-import.so lib/gnucash/libgnc-customer-import.so -L/gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/lib -L/gnu/store/1w1r6r56z9lhg8ghcb7lxss6mkn7d5l1-libgc-8.2.2/lib -lguile-3.0 -lgc -lpthread -ldl lib/gnucash/libgnc-generic-import.so lib/libgnc-gnome.so lib/gnucash/libgnc-ledger-core.so lib/gnucash/libgnc-report.so lib/gnucash/libgnc-expressions-guile.so lib/gnucash/libgnc-register-gnome.so lib/gnucash/libgnc-register-core.so lib/gnucash/libgnc-gnome-search.so lib/libgnc-html.so lib/gnucash/libgnc-gnome-utils.so /gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib/libgtk-3.so /gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib/libgdk-3.so /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libz.so /gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib/libpangocairo-1.0.so /gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib/libpango-1.0.so /gnu/store/n7vynkl0rkqmvahxji6530n8hmfscxsd-harfbuzz-5.3.1/lib/libharfbuzz.so /gnu/store/ams35jy7lmxw5xas4b33zznwdx7m1d18-at-spi2-core-2.45.90/lib/libatk-1.0.so /gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib/libcairo-gobject.so /gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib/libcairo.so /gnu/store/1ig678c8vqxvr60x8swmc5wriga7sjf3-gdk-pixbuf-2.42.8/lib/libgdk_pixbuf-2.0.so /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libgio-2.0.so /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libgobject-2.0.so lib/gnucash/libgnc-expressions.so lib/libgnc-locale-tax.so lib/libgnucash-guile.so -L/gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/lib -L/gnu/store/1w1r6r56z9lhg8ghcb7lxss6mkn7d5l1-libgc-8.2.2/lib -lguile-3.0 -lgc -lpthread -ldl lib/libgnc-backend-xml-utils.so -L/gnu/store/slzq3zqwj75lbrg4ly51hfhbv2vhryv5-zlib-1.2.13/lib -lz -ldl lib/libgnc-app-utils.so -lgio-2.0 -L/gnu/store/g9cpfynzy3ryv9jprgvwf3g2mnd5p11q-libxml2-2.9.14/lib -L/gnu/store/xvrcpsrd2y07i860cp8ckgysz4zjrgzf-libxslt-1.1.37/lib -lxslt -lxml2 lib/libgnc-engine.so -L/gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib -lgobject-2.0 -L/gnu/store/0ibv7vw1ff6f4c15p9c0k4izx4kqwlkr-icu4c-71.1/lib -licui18n -licuuc -licudata -Wl,--export-dynamic -lgmodule-2.0 -lglib-2.0 -pthread lib/libgnc-core-utils.so /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_date_time.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_filesystem.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_atomic.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_locale.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_chrono.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_thread.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_program_options.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_regex.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_system.so.1.80.0 /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libglib-2.0.so && :
ld: lib/libgnc-engine.so: undefined reference to `icu_71::Calendar::get(icu_71::Calendar::EDateFields, UErrorCode&) const'
collect2: error: ld returned 1 exit status

@jralls
Copy link
Member

jralls commented Jul 20, 2023

Calendar::get says you want UDateField, i.e. UCAL_YEAR, UCAL_MONTH, and UCAL_DAY

@christopherlam christopherlam force-pushed the stable-797724 branch 2 times, most recently from de19d8a to 4d8b173 Compare July 20, 2023 10:59
@christopherlam
Copy link
Contributor Author

Still not sure how to capture non-punctuation in the regex. A-Za-z is woefully inadequate.

@jralls
Copy link
Member

jralls commented Jul 20, 2023

Still not sure how to capture non-punctuation in the regex. A-Za-z is woefully inadequate.

Use POSIX character classes. The general replacement for [A-Za-z] is [[:alpha:]] and its almost-antithesis is [[:punct:]]; both exclude the non-printing characters like 0x00-0x1f represented by [[:cntrl:]].

@christopherlam
Copy link
Contributor Author

Ok \p{L} is what we need.

@jralls
Copy link
Member

jralls commented Jul 20, 2023

Ok \p{L} is what we need.

The [[:alpha:]] spelling is more portable, \p{L} is a PCRE extension that boost::regex probably supports (L isn't explicitly mentioned on that page, but IIUC boost::regex just passes the regex on to libpcre) but std::regex doesn't.

(One of these years I'd like to switch everything to the immensely faster Compile Time Regex library that's mostly PCRE compatible.)

…ransactions

use boost::date_time_month_str_to_ushort to interpret number or month_string

an invalid month eg "Marches" will throw exception "Month number is
out of range 1..12" which is not localised.
@christopherlam
Copy link
Contributor Author

Using icu doesn't seem to cut it. The first commit does work well for anglo csv.

@jralls
Copy link
Member

jralls commented Aug 11, 2023

Using icu doesn't seem to cut it. The first commit does work well for anglo csv.

What is it doing wrong? Can you add a test case to illustrate?

@christopherlam
Copy link
Contributor Author

Here's a simple csv to try import. It won't interpret the mmm correctly in the first couple lines.

Date,Account Name,Number,Description,Full Category Path,Reconcile,Amount With Sym,Amount Num.,Value With Sym,Value Num.,Rate/Price
06/May/2020,Assets:Broker Account:CAD Cash,,Fund trading account,Assets:Current Assets:Checking Account,n,"CA$100,000.00","100,000.00","CA$100,000.00","100,000.00",1.0000
10/May/2020,Assets:Broker Account:CAD Cash,,Convert CAD to USD,Assets:Broker Account:USD Cash,n,"-CA$100,000.00","-100,000.00","-CA$100,000.00","-100,000.00",1.0000
01/07/2019,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$9.95,9.95,0.000000
11/12/2019,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$9.95,9.95,0.000000
18/03/2020,Assets:Broker Account:SPY,,SELL SPY,-- Split Transaction --,n,-75 SPY,-75,"-CA$12,000.00","-12,000.00",160.000000
01/04/2020,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$9.95,9.95,0.000000
16/04/2020,Assets:Broker Account:SPY,,Return of Capital,Assets:Broker Account:USD Cash,n,0 SPY,0,"-CA$2,500.00","-2,500.00",0.000000
02/05/2020,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$0.00,0.00,0.000000
11/05/2020,Assets:Broker Account:SPY,,Stock Split 2:1,-- Split Transaction --,n,450 SPY,450,CA$0.00,0.00,0.000000
21/05/2020,Assets:Broker Account:SPY,,SELL SPY,-- Split Transaction --,n,-135 SPY,-135,"-CA$21,500.00","-21,500.00",159 + 7/27
03/06/2020,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$0.00,0.00,0.000000
10/06/2020,Assets:Broker Account:SPY,,SELL SPY,-- Split Transaction --,n,-915 SPY,-915,"-CA$128,100.00","-128,100.00",140.000000
10/06/2020,Assets:Broker Account:SPY,,SELL SPY Short,-- Split Transaction --,n,0 SPY,0,CA$9.95,9.95,0.000000
15/06/2020,Assets:Broker Account:SPY,,SELL SPY Short,-- Split Transaction --,n,0 SPY,0,CA$9.95,9.95,0.000000
16/06/2020,Assets:Broker Account:SPY,,BUY SPY Close Short,-- Split Transaction --,n,50 SPY,50,"CA$5,000.00","5,000.00",100.000000
17/06/2020,Assets:Broker Account:SPY,,Stock Split 2:1,-- Split Transaction --,n,-100 SPY,-100,CA$0.00,0.00,0.000000
18/06/2020,Assets:Broker Account:SPY,,Reverse Stock Split 1:2,-- Split Transaction --,n,100 SPY,100,CA$0.00,0.00,0.000000
19/06/2020,Assets:Broker Account:SPY,,Compensatory dividend payment,-- Split Transaction --,n,0 SPY,0,CA$0.00,0.00,0.000000
19/06/2020,Assets:Broker Account:SPY,,Compensatory Return of Capital,Assets:Broker Account:USD Cash,n,0 SPY,0,"CA$2,500.00","2,500.00",0.000000
19/06/2020,Assets:Broker Account:SPY,,Compensatory Notional Distribution,-- Split Transaction --,n,0 SPY,0,-CA$200.00,-200.00,0.000000
20/06/2020,Assets:Broker Account:SPY,,BUY SPY Close Short,-- Split Transaction --,n,100 SPY,100,"CA$8,000.00","8,000.00",80.000000
20/06/2020,Assets:Broker Account:SPY,,BUY SPY,-- Split Transaction --,n,0 SPY,0,CA$4.98,4.98,0.000000
21/06/2020,Assets:Broker Account:SPY,,Dividend,-- Split Transaction --,n,0 SPY,0,CA$0.00,0.00,0.000000
25/06/2020,Assets:Broker Account:SPY,,Partially Reinvested Dividend,-- Split Transaction --,n,1 SPY,1,CA$100.00,100.00,100.000000
26/06/2020,Assets:Broker Account:SPY,,Reverse Stock Split 1:2 -- redeem remainder 1 share (cash in lieu),-- Split Transaction --,n,0 SPY,0,CA$0.00,0.00,0.000000
26/06/2020,Assets:Broker Account:SPY,,Reverse Stock Split 1:2,-- Split Transaction --,n,-50 SPY,-50,CA$0.00,0.00,0.000000
27/06/2020,Assets:Broker Account:SPY,,Notional Distribution,-- Split Transaction --,n,0 SPY,0,CA$100.00,100.00,0.000000
10/05/2023,Assets:Current Assets:AUD,num,desc,-- Split Transaction --,n,$0.00,0.00,$0.00,0.00,0.0000
01/05/2020,Assets:Current Assets:Checking Account,,Opening Cash,Equity:Opening Balances,n,"CA$200,000.00","200,000.00","CA$200,000.00","200,000.00",1.0000

@jralls
Copy link
Member

jralls commented Aug 12, 2023

Well, have you tested status at each step to see whether and where it's raising an error?

@flywire
Copy link
Contributor

flywire commented Aug 12, 2023

I don't remember ever seeing a CSV with English month names instead of numbers

Of the four main Australian banks, NAB download only uses 3-letter month.

@christopherlam
Copy link
Contributor Author

You're ignoring half the problem, though: All three require that you tell them what format—%m, %b, or %B—to expect the month format in. Rather than trying to use a regex to figure it out, ask the user by adding e.g. 'b-d-y' and 'B-d-y' to the date selector listmodel.

btw instead of increasing options d/m/y d/b/y whereby b=month_name, I'd rather augment d/m/y so that m is numeric or alpha.

@jralls
Copy link
Member

jralls commented Aug 14, 2023

btw instead of increasing options d/m/y d/b/y whereby b=month_name, I'd rather augment d/m/y so that m is numeric or alpha.

I understand that's what you're trying to do, but it's not sufficient because of the need to distinguish between %b and %B. You're not going to be able to do that with a simple regex; you might not be able to do it with any regex considering that you have to account for multiple languages/locales and unicode blocks.

@christopherlam
Copy link
Contributor Author

need to distinguish between %b and %B

The idea is both aug and August would be parsed as 8.

@christopherlam
Copy link
Contributor Author

I'd really like to merge something like this soon. The FI I'll use frustratingly uses "30 Mar 2024" format in the CSV export. What should be the minimum requirement to parse month names?

@christopherlam
Copy link
Contributor Author

Ok if there's no need for a locale dropdown then it's much simpler. How about the "d-m-y" vs "m, d y" specifiers? Do they belong in the CSV dropdown as a limited list, or should be free text?

@flywire
Copy link
Contributor

flywire commented Aug 31, 2024

"m, d y" specifiers

This should specify the order rather than the format. Please drop the , from the display. One selection should support any delimiters (even none which would have to assume 2 digit day and month).

@jralls
Copy link
Member

jralls commented Aug 31, 2024

It's m d, y.

@flywire, the code already accommodates a variety of separators, the separators in the selection list are just to make the choices easier to read.

It occurs to me though that we might want a locale selector anyway because we have quite a few users who bank in more than one country and others who like to use GnuCash in a language/locale different from their system locale. The latter folks probably get their bank downloads written in the system locale's language.

If left to itself ICU::DateFormat::parse will use the locale to determine the dmy order, but that will also need a selector to overide it when necessary. It would be a nice touch for the selector to default to the locale on the first use and to remember the user's selections by account afterwards.

@christopherlam you don't have to do everything at once. If the current locale and dmy order selector are working that's enough for 5.9., you (or others) can add enhancements later.

@gjanssens
Copy link
Member

Ok let's thrash it out: a locale dropdown in global preferences or book properties, and the date parser automatically understands wordy months. The CSV has a date dropdown to choose dmy or mdy etc. Any further ideas?

The locale dropdown risks widening the scope immensely. As John already replied to @flywire there may be cases where a user wants to use GnuCash in a different locale than from the system. Or the csv files use a locale that are different from what GnuCash or the system uses.

For the first case (using a locale different from the system) we currently do support that already via the environment file, but it lacks a gui. We could replace that mechanism by setting a locale dropdown. It would still leave several unanswered questions:

  • would a user only want to do that for one book or for all GnuCash books
  • what if the book is shared with other users (non-concurrent access of course), is the intention the other users also get this locale override or not ?
  • how fine-grained a control would one offer via the gui? Locale has several parts like date, numeric, monetary,... What if the user wants a mix of these that's not found in a single locale ? Would you offer to override each separately ?

IMO this use case is best left for another PR.

The other use case is where the csv file uses different locale for date and number formatting. If a user downloads csv files from multiple financial institutions, it may well be the date and number formats are different for each institution. So from that point of view it continues to make sense to offer some way to customize the date and number formatting during the import itself (and optionally save it in a preset for future reuse).

I would consider replacing the current date and number formate dropdowns with two dropdowns that allow to select locales instead and present a list of all locales that are available on the system. For the date format dropdown we should come up with a way to present what the valid dates would look like when a given locale is selected. Likewise for the number format dropdown. It could be integrated either in each entry in the dropdown lists or in a separate widget.

To cover the case where none of the locales offers the date or number formats as used in the csv file, we could add a custom field next of below each of the two dropdowns to allow the user to enter a custom format string. I would suggest to also restrict valid entries in here to format strings that are properly parsable by ICU.

If the date or number format used in the csv file can not be specified in a format string that's parsable by ICU, I'd flag that as an invalid or unsupported format in the csv file and require the user to preprocess the csv file before importing.

With this in place the date and number parsing could completely get rid of the regexes and just depend straight on ICU's date and number parsing.

@christopherlam
Copy link
Contributor Author

Ok to be very specific how should the existing ui change:

image

Date Format gets a new default "Locale"? And should the locale then accept whatever is the default locale date format?

My initial process for this branch was simply to augment the "m" to accept anglo short or long month names, and I'm afraid the subsequent discussions about other locales or US dates has diluted the ui/ux discussion.

@gjanssens
Copy link
Member

My proposal was prompted by @jralls comment earlier:

It shouldn't be a kluge. It should work, have a reasonable UI, and be localizable to all supported languages. It should be general: You're writing for all GnuCash users, not just yourself. It needs unit tests covering at least the locales currently in the unit tests. It should be a clean commit set, and please take a couple of extra minutes to write clear and complete commit messages. Commit messages are how we communicate with our users and our future selves to explain what we did and why.

I think the current way to handle dates is already a kluge in itself. It doesn't work for you and you are also a GnuCash user. This method of date format selection was there before I embarked on the rewrite of the csv importer. I kept it at the time as it was not my focus. But today I think trying to simplify date parsing to one or more regexes is bound to fail on edge cases until we have essentially reimplemented the date parsing found in ICU. That library is specialized in handling locale specific parsing. It would be silly to attempt to reinvent that wheel. As it's way better at parsing dates, IMO relying on its parsers is the best long term path to avoid the kluge.

As for the specific UI change that would go with it:

  • in the date format dropdown
    • remove the 5 current options
    • replace it with a list of locales found on the system
    • prepend this with a "Use current locale format" option
    • optionally also present a "Custom" option
  • In a label next to this, print examples of the various date strings that the selected locale (or the current locale) supports (short date, long date, perhaps others still?)
  • If there is a custom option, we also need a text field in which to type a format string. It's up to the user to make it ICU compatible.
  • If the custom option is selected, whatever is typed in the text field is parsed to present the sample date strings in the example label. If ICU declares the string as not parsable, print an error message in the label instead.

Alternatively we could parse al possible date format string variations from all locales, deduplicate them and present the sample date strings in the date dropdown instead. The user is likely not too interested in the exact locale that can parse the date in the csv file, but rather in an example of a date string that matches the format of the csv file. This can also be prepended with a "Current Locale" option (which is likely what will work for most users) and an optional "Custom" to allow advanced users to provide their own.

@christopherlam
Copy link
Contributor Author

christopherlam commented Aug 31, 2024

@gjanssens are you suggesting we get rid of GncDateFormat vector and use ICU instead?

Edit: this will likely rid the dd/mm options.

@flywire
Copy link
Contributor

flywire commented Aug 31, 2024

I agree users need the ability to reset date format at runtime. The existing dropdown list is useful. There seems to be subtle differences in suggested implementation.

It could be delivered in two stages:

  1. Give user a program option setting for default Date Format including Locale. This would just set dmy order but user could use existing functionality to Load Settings or change Date Format option at runtime.

  2. As development effort permits, support character month strings. User should not have to select the specific string format to parse month and 2/4 digit year given order is known.

@gjanssens
Copy link
Member

@gjanssens are you suggesting we get rid of GncDateFormat vector and use ICU instead?

That's indeed what I'm suggesting.

Edit: this will likely rid the dd/mm options.

That thought did cross my mind. I have no metrics to estimate how frequently this option is ever used in real situations ("never say never" though). If we want to keep this feature the custom field may allow users to enter such a yearless date format themselves. Next question is then if ICU would accept such an incomplete date format string. I haven't found a conclusive answer in the ICU documentation, but the examples page suggests in comments that parsing with a custom format string should also be possible.

@gjanssens
Copy link
Member

I agree users need the ability to reset date format at runtime. The existing dropdown list is useful. There seems to be subtle differences in suggested implementation.

It could be delivered in two stages:

1. Give user a program option setting for default Date Format including `Locale`. This would just set `dmy` order but user could use existing functionality to Load Settings or change Date Format option at runtime.

This would add a nice default to the current dropdown for date formats. While nice it's tangential to the goal of this PR.

2. As development effort permits, support character month strings. User should not have to select the specific string format to parse month and 2/4 digit year given order is known.

I again will quote from @jralls ' requirements:

It should be general: You're writing for all GnuCash users

Assuming every (even short) date format consists of a day, month and year is very Western thinking. We can not assume this. When I look at the long list of locale formats present on my Fedora 39 linux box, I find several examples where this isn't the case:

  • ar_SA.utf8 - d_fmt="%A %e %B %Y"
  • bi_VU.utf8 - d_fmt="%A %d %b %Y"
  • en_HK.utf8 - d_fmt="%A, %B %d, %Y"
  • en_PH.utf8 - d_fmt="%A, %d %B, %Y"
  • eu_ES.utf8 - d_fmt="%a, %Y.eko %bren %da"
  • ...

Others do use three fields, but are using alternative numerical symbols:

  • fa_IR.utf8 - d_fmt="%Oy/%Om/%Od"
  • lzh_TW.utf8 - d_fmt="%OC%Oy年%B%Od日"
  • ...

A while back I read an article about a developer of a datetime library. The article mainly consisted of all the possible false assumptions people have about date and time if they are not specialized in writing date-time code. The cases above only shows a subset of these. I doubt we will be able to write a solid date parser that works as good as ICU's with less effort so I am in favour of going with ICU.

@christopherlam Perhaps this is more work than you are willing to take up or too difficult. That would be perfectly ok. If so we have to find a middle ground that scratches your itch but is at the same time sufficiently flexible that it can also work for others. I just don't know what that could be right now. My suggestion is meant as a first design improvement to discuss further on.

@christopherlam
Copy link
Contributor Author

Ok here's the next conundrum: there's 799 items returned in my uloc_countAvailable(), small section below. Too much!

fa
fa_AF
fa_IR
ff
ff_Adlm
ff_Adlm_BF
ff_Adlm_CM
ff_Adlm_GH

@jralls
Copy link
Member

jralls commented Sep 3, 2024

Ok here's the next conundrum: there's 799 items returned in my uloc_countAvailable(), small section below. Too much!

Too much indeed. You could reasonably limit it to the number of languages we have translations for, but at 50+ even that is pushing it for a one-level combo, especially so when you add in all of the place permutations.

The UI problem can be alleviated with a multi-level combo where the first level is continents, i.e. Africa, Americas, Asia, Europe, and Oceana, then the countries in each region and finally the languages supported for that country. That's not very satisfactory, it requires a lot of clicks to get to the locale you want. We could move the complicated UI to preferences where users would create a short list of locales that we'd put in the date parser selector.

Or we could punt the idea of accommodating arbitrary locales and rely only on the result of setlocale(LC_DATE, NULL).

@gjanssens
Copy link
Member

Hmm, I agree my idea was a bit naive... That's way too many options to present to the user.

The other idea I have always had in the back of my mind (and I believe it was originally suggested by @derekatkins ) is to automatically filter the possible date formats to present to that subset of formats that don't have any parsing errors on the selected date column.

That would require the user to first select (at least one) date column, the code would go over a list of possible date formats (extracted from locale data?) and filters out any that can not validly parse the date fields. Only if the resulting list has one or more valid date formate, the date format dropdown would be enabled, presenting just that subset. As long as there are still date parsing errors that either means some rows have to be skipped (like header lines) or the csv file has dates in a format we don't know how to support.

I realize this may slow down things a lot as well, because we have to test this for each supported locale. Given en_US and en_AU use different abbreviations for months, I can only assume this goes for several other locales using the same ordering of day, month and year as well. On the other hand, I think the elimination for potential candidates would happen fairly quickly. There would only be a few valid locale date formats for the first date you'd encounter. From there on the only ambiguity I'd expect still is whether 05-06-24 is to be interpreted as june 5th (as we do in Belgium) or May 6th as ango-saxon countries do.

@christopherlam
Copy link
Contributor Author

The other idea I have always had in the back of my mind (and I believe it was originally suggested by @derekatkins ) is to automatically filter the possible date formats to present to that subset of formats that don't have any parsing errors on the selected date column.

Nice idea.... Would you have any spare CPU cycles to make it work, likely with better architecture than my attempts, or would you be happy with my hacks here?

@jralls
Copy link
Member

jralls commented Sep 5, 2024

It's a nice user-friendly idea, but it's not really practical if we're going to have ICU do the parsing. Each test would be create locale, create calendar with the locale, try to parse the input, check the error, destroy the calendar, destroy the locale, start over, for an average of what, 400 locales? That's going to take a while.

But if we constrain the number of locales to try with a preference it becomes a lot more tractable. For nearly all users it will be the UI locale (i.e., the locale returned by setlocale(LC_DATE, NULL). For a few more it will be that plus the system locale (setlocale(LC_DATE, "")). There are a few others who do business in countries whose locales are neither the one they run GnuCash in nor the one that their computer is set up for. But how many of them import data from CSVs that have alpha months? It's got to be a vanishingly small number. It might well be none.

So for the first pass lets just do the current locale and see if anyone asks for more.

@christopherlam
Copy link
Contributor Author

christopherlam commented Sep 6, 2024

It's a nice user-friendly idea, but it's not really practical if we're going to have ICU do the parsing. Each test would be create locale, create calendar with the locale, try to parse the input, check the error, destroy the calendar, destroy the locale, start over, for an average of what, 400 locales? That's going to take a while.

See latest commit in #2010 which caches the icu objects trading some memory.

But if we constrain the number of locales to try with a preference it becomes a lot more tractable. For nearly all users it will be the UI locale (i.e., the locale returned by setlocale(LC_DATE, NULL). For a few more it will be that plus the system locale (setlocale(LC_DATE, "")). There are a few others who do business in countries whose locales are neither the one they run GnuCash in nor the one that their computer is set up for. But how many of them import data from CSVs that have alpha months? It's got to be a vanishingly small number. It might well be none.

It's not unreasonable to expect many FIs (incl banks brokers) to stubbornly assume everyone will use excel and spit "dd MMM yyyy" formats. I'm seeing at least 2 in my daily use and they won't update their systems to accommodate a few customers and output dd-mm-yyyy or yyyy-mm-dd.

So for the first pass lets just do the current locale and see if anyone asks for more.

As above, my locale is en_AU which is good for reports outputting AUD by default. However IMHO ICU's month parser is d*mb in expecting only "Sept" for September. Hence I like @gjanssens idea.

@jralls
Copy link
Member

jralls commented Sep 6, 2024

However IMHO ICU's month parser is d*mb in expecting only "Sept" for September. Hence I like @gjanssens idea.

I agree it's dumb, though maybe the Unicode Consortium's fault rather than ICUs. But even with cached locales it's going to take a while to test hundreds of locales until you get one that likes 'Sep'. You can't trust it with just scanning one date to set the locale because the first date might be Aug and it will pick en_AU then fail when it gets halfway down the transactions and finds Sep instead of Sept. Maybe limit the search to the other locales using the same language as the current locale?

@christopherlam
Copy link
Contributor Author

Ok benchmarked: with 500 dates, testing 800 locales takes 3s which is too slow. I can't think of a solution that would pass in en_AU and their dumb "Sept" requirement.

@gjanssens
Copy link
Member

How did you benchmark ?
If you test 800 locales against all of the 500 dates, I can imagine that would be too slow.
But perhaps it becomes reasonable if you remove each locale from the list as soon as it fails to parse one date. So you effectively attempt to parse the first date against all 800 locales. If 300 fail, attempt to parse the second date only against the 500 remaining locales and so on. I don't know how fast the list of valid locales would drop that way, but it may be worth a try (if that's not how you benchmarked already).

@christopherlam
Copy link
Contributor Author

How did you benchmark ? If you test 800 locales against all of the 500 dates, I can imagine that would be too slow. But perhaps it becomes reasonable if you remove each locale from the list as soon as it fails to parse one date. So you effectively attempt to parse the first date against all 800 locales. If 300 fail, attempt to parse the second date only against the 500 remaining locales and so on. I don't know how fast the list of valid locales would drop that way, but it may be worth a try (if that's not how you benchmarked already).

That's exactly what I did :-) f1629d7 for each of 500 dates, start with 800 locales, filter into successful parses, reducing locales into successful ones after each date. I've used memory-efficient std::swap. I've also rewritten to avoid try-except block. Maybe the algo isn't right?

@gjanssens
Copy link
Member

I quickly ran your code. The output is:

76: 805 locales available. Testing 500 dates.
76: 804 locales left, checked in 3.823 seconds:

That seems odd to me. You generate date strings all containing "09-22-2021". Out of the 805 locales you test against 804 parse this as a valid date ? That's very unlikely. I know for a fact it's invalid in most European countries and suspect even in several Asian countries. So something is not working as expected.

@christopherlam
Copy link
Contributor Author

odd indeed... GncDate is creating some garbage dates. Well within my blind spot.

76: 09-22-2021=22/09/21;en_US
76: 09-22-2021=22/09/21;en_US_POSIX
76: 09-22-2021=09/10/22;en_VC
76: 09-22-2021=09/10/22;en_VG
76: 09-22-2021=22/09/21;en_VI

@jralls
Copy link
Member

jralls commented Sep 7, 2024

With c164733 I get

75: 851 locales available. Testing 500 dates.
75: 38 locales left, checked in 0.336943 seconds:
75:  blo blo_BJ ceb ceb_PH chr chr_US ee ee_GH ee_TG en en_AS en_BI en_GU en_MH en_MP en_PH en_PR en_UM en_US en_US_POSIX en_VI es_PA es_PR fil fil_PH ks ks_Arab ks_Arab_IN lkt lkt_US or or_IN sd_Deva sd_Deva_IN xh xh_ZA zu zu_ZA

Which is certainly better than 3.58 seconds, but this is only testing a single date format, SHORT (d/m/yy for en_GB and m/d/yy for en_US). That delimiter is the only one the parser recognizes. If it's always / regardless of locale then we can do a simple substitution before passing the string to the parser. If the delimiter depends on the locale then we have to figure out how to modify the string to match what the locale doing the parsing requires.

It's probably faster and more reliable to count the alpha characters in the string to decide whether the MEDIUM or LONG format is needed than to test all the locales 3 times.

@christopherlam
Copy link
Contributor Author

How about we ignore icu completely, keep the dmy mdy ymd options, and use boost instead? We'll still remove the d-m and m-d options.

https://www.boost.org/doc/libs/1_31_0/libs/date_time/doc/class_date.html

@jralls
Copy link
Member

jralls commented Sep 8, 2024

How about we ignore icu completely, keep the dmy mdy ymd options, and use boost instead? We'll still remove the d-m and m-d options.

There's no localization.

@christopherlam
Copy link
Contributor Author

#2011 wins...

@christopherlam christopherlam deleted the stable-797724 branch September 11, 2024 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants