-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 797724 - Add support for custom date formats when importing CSV transactions #1710
Conversation
d793e2b
to
e0ae1c7
Compare
This last one liner would solve majority* cases of month parsing, until we can find a good locale sensitive month parser. *Majority users I am not sure I've seen CSVs with localised dates such as "10 fév 2023" or "2023 年 7 月 1 日" or "१ जुलाई २०२३". |
Well, all of the half-dozen users who've reported the problem.
But how often have you downloaded CSVs from French, Chinese, or Indian banks? I don't remember ever seeing a CSV with English month names instead of numbers, but if one American bank does it then there are bound to be others around the world. The gold standard for localization is ICU. Their date parsing needs a formatter. Unfortunately both the boost::locale wrapper and boost::date_time I/O are stream oriented. That's a pain, but we deal with it elsewhere using You're ignoring half the problem, though: All three require that you tell them what format— The Chinese case is special: it's still "%Y%m%d" as 年, 月, and 日 can be converted to - and the result passed as an ISO formatted date. Do Indians really write dates with devangari numerals? |
Here's the work so far but the compilation phase doesn't compile. && /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/bin/c++ -Werror -Wall -Wmissing-declarations -g gnucash/CMakeFiles/gnucash.dir/gnucash.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-commands.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-core-app.cpp.o gnucash/CMakeFiles/gnucash.dir/gnucash-gresources.c.o -o bin/gnucash -Wl,-rpath,/home/chris/sources/gnucash/stable-build/lib:/home/chris/sources/gnucash/stable-build/lib/gnucash:/gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib:/gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib:/gnu/store/n7vynkl0rkqmvahxji6530n8hmfscxsd-harfbuzz-5.3.1/lib:/gnu/store/ams35jy7lmxw5xas4b33zznwdx7m1d18-at-spi2-core-2.45.90/lib:/gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib:/gnu/store/1ig678c8vqxvr60x8swmc5wriga7sjf3-gdk-pixbuf-2.42.8/lib:/gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib: lib/libgnc-module.so lib/gnucash/libgnc-qif-import.so lib/gnucash/libgnc-csv-import.so lib/gnucash/libgnc-csv-export.so lib/gnucash/libgnc-log-replay.so lib/gnucash/libgnc-bi-import.so lib/gnucash/libgnc-customer-import.so -L/gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/lib -L/gnu/store/1w1r6r56z9lhg8ghcb7lxss6mkn7d5l1-libgc-8.2.2/lib -lguile-3.0 -lgc -lpthread -ldl lib/gnucash/libgnc-generic-import.so lib/libgnc-gnome.so lib/gnucash/libgnc-ledger-core.so lib/gnucash/libgnc-report.so lib/gnucash/libgnc-expressions-guile.so lib/gnucash/libgnc-register-gnome.so lib/gnucash/libgnc-register-core.so lib/gnucash/libgnc-gnome-search.so lib/libgnc-html.so lib/gnucash/libgnc-gnome-utils.so /gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib/libgtk-3.so /gnu/store/6vqx7nip7r95h2nhvhb9vxzjpri581b9-gtk+-3.24.37/lib/libgdk-3.so /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libz.so /gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib/libpangocairo-1.0.so /gnu/store/rr1vbf04j27z5465wsm1kdfaw3iriz2k-pango-1.50.10/lib/libpango-1.0.so /gnu/store/n7vynkl0rkqmvahxji6530n8hmfscxsd-harfbuzz-5.3.1/lib/libharfbuzz.so /gnu/store/ams35jy7lmxw5xas4b33zznwdx7m1d18-at-spi2-core-2.45.90/lib/libatk-1.0.so /gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib/libcairo-gobject.so /gnu/store/gg3kycn5wfjwskx3xfkk1qscjgsvaxcn-cairo-1.16.0/lib/libcairo.so /gnu/store/1ig678c8vqxvr60x8swmc5wriga7sjf3-gdk-pixbuf-2.42.8/lib/libgdk_pixbuf-2.0.so /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libgio-2.0.so /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libgobject-2.0.so lib/gnucash/libgnc-expressions.so lib/libgnc-locale-tax.so lib/libgnucash-guile.so -L/gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/lib -L/gnu/store/1w1r6r56z9lhg8ghcb7lxss6mkn7d5l1-libgc-8.2.2/lib -lguile-3.0 -lgc -lpthread -ldl lib/libgnc-backend-xml-utils.so -L/gnu/store/slzq3zqwj75lbrg4ly51hfhbv2vhryv5-zlib-1.2.13/lib -lz -ldl lib/libgnc-app-utils.so -lgio-2.0 -L/gnu/store/g9cpfynzy3ryv9jprgvwf3g2mnd5p11q-libxml2-2.9.14/lib -L/gnu/store/xvrcpsrd2y07i860cp8ckgysz4zjrgzf-libxslt-1.1.37/lib -lxslt -lxml2 lib/libgnc-engine.so -L/gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib -lgobject-2.0 -L/gnu/store/0ibv7vw1ff6f4c15p9c0k4izx4kqwlkr-icu4c-71.1/lib -licui18n -licuuc -licudata -Wl,--export-dynamic -lgmodule-2.0 -lglib-2.0 -pthread lib/libgnc-core-utils.so /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_date_time.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_filesystem.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_atomic.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_locale.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_chrono.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_thread.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_program_options.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_regex.so.1.80.0 /gnu/store/sqy3ccjj9hsvgrikg865ljhnwvdsssw2-profile/lib/libboost_system.so.1.80.0 /gnu/store/nb40pwd37v6i1g4b1fq4l6q4h9px3asr-glib-2.72.3/lib/libglib-2.0.so && : |
Calendar::get says you want UDateField, i.e. |
de19d8a
to
4d8b173
Compare
Still not sure how to capture non-punctuation in the regex. A-Za-z is woefully inadequate. |
Use POSIX character classes. The general replacement for |
Ok |
The (One of these years I'd like to switch everything to the immensely faster Compile Time Regex library that's mostly PCRE compatible.) |
26b9cb5
to
adba683
Compare
…ransactions use boost::date_time_month_str_to_ushort to interpret number or month_string an invalid month eg "Marches" will throw exception "Month number is out of range 1..12" which is not localised.
adba683
to
ffaf850
Compare
Using icu doesn't seem to cut it. The first commit does work well for anglo csv. |
What is it doing wrong? Can you add a test case to illustrate? |
Here's a simple csv to try import. It won't interpret the mmm correctly in the first couple lines.
|
Well, have you tested |
Of the four main Australian banks, NAB download only uses 3-letter month. |
btw instead of increasing options d/m/y d/b/y whereby b=month_name, I'd rather augment d/m/y so that m is numeric or alpha. |
I understand that's what you're trying to do, but it's not sufficient because of the need to distinguish between |
The idea is both aug and August would be parsed as 8. |
I'd really like to merge something like this soon. The FI I'll use frustratingly uses "30 Mar 2024" format in the CSV export. What should be the minimum requirement to parse month names? |
Ok if there's no need for a locale dropdown then it's much simpler. How about the "d-m-y" vs "m, d y" specifiers? Do they belong in the CSV dropdown as a limited list, or should be free text? |
This should specify the order rather than the format. Please drop the |
It's m d, y. @flywire, the code already accommodates a variety of separators, the separators in the selection list are just to make the choices easier to read. It occurs to me though that we might want a locale selector anyway because we have quite a few users who bank in more than one country and others who like to use GnuCash in a language/locale different from their system locale. The latter folks probably get their bank downloads written in the system locale's language. If left to itself ICU::DateFormat::parse will use the locale to determine the dmy order, but that will also need a selector to overide it when necessary. It would be a nice touch for the selector to default to the locale on the first use and to remember the user's selections by account afterwards. @christopherlam you don't have to do everything at once. If the current locale and dmy order selector are working that's enough for 5.9., you (or others) can add enhancements later. |
The locale dropdown risks widening the scope immensely. As John already replied to @flywire there may be cases where a user wants to use GnuCash in a different locale than from the system. Or the csv files use a locale that are different from what GnuCash or the system uses. For the first case (using a locale different from the system) we currently do support that already via the environment file, but it lacks a gui. We could replace that mechanism by setting a locale dropdown. It would still leave several unanswered questions:
IMO this use case is best left for another PR. The other use case is where the csv file uses different locale for date and number formatting. If a user downloads csv files from multiple financial institutions, it may well be the date and number formats are different for each institution. So from that point of view it continues to make sense to offer some way to customize the date and number formatting during the import itself (and optionally save it in a preset for future reuse). I would consider replacing the current date and number formate dropdowns with two dropdowns that allow to select locales instead and present a list of all locales that are available on the system. For the date format dropdown we should come up with a way to present what the valid dates would look like when a given locale is selected. Likewise for the number format dropdown. It could be integrated either in each entry in the dropdown lists or in a separate widget. To cover the case where none of the locales offers the date or number formats as used in the csv file, we could add a custom field next of below each of the two dropdowns to allow the user to enter a custom format string. I would suggest to also restrict valid entries in here to format strings that are properly parsable by ICU. If the date or number format used in the csv file can not be specified in a format string that's parsable by ICU, I'd flag that as an invalid or unsupported format in the csv file and require the user to preprocess the csv file before importing. With this in place the date and number parsing could completely get rid of the regexes and just depend straight on ICU's date and number parsing. |
Ok to be very specific how should the existing ui change: Date Format gets a new default "Locale"? And should the locale then accept whatever is the default locale date format? My initial process for this branch was simply to augment the "m" to accept anglo short or long month names, and I'm afraid the subsequent discussions about other locales or US dates has diluted the ui/ux discussion. |
My proposal was prompted by @jralls comment earlier:
I think the current way to handle dates is already a kluge in itself. It doesn't work for you and you are also a GnuCash user. This method of date format selection was there before I embarked on the rewrite of the csv importer. I kept it at the time as it was not my focus. But today I think trying to simplify date parsing to one or more regexes is bound to fail on edge cases until we have essentially reimplemented the date parsing found in ICU. That library is specialized in handling locale specific parsing. It would be silly to attempt to reinvent that wheel. As it's way better at parsing dates, IMO relying on its parsers is the best long term path to avoid the kluge. As for the specific UI change that would go with it:
Alternatively we could parse al possible date format string variations from all locales, deduplicate them and present the sample date strings in the date dropdown instead. The user is likely not too interested in the exact locale that can parse the date in the csv file, but rather in an example of a date string that matches the format of the csv file. This can also be prepended with a "Current Locale" option (which is likely what will work for most users) and an optional "Custom" to allow advanced users to provide their own. |
@gjanssens are you suggesting we get rid of Edit: this will likely rid the dd/mm options. |
I agree users need the ability to reset date format at runtime. The existing dropdown list is useful. There seems to be subtle differences in suggested implementation. It could be delivered in two stages:
|
That's indeed what I'm suggesting.
That thought did cross my mind. I have no metrics to estimate how frequently this option is ever used in real situations ("never say never" though). If we want to keep this feature the custom field may allow users to enter such a yearless date format themselves. Next question is then if ICU would accept such an incomplete date format string. I haven't found a conclusive answer in the ICU documentation, but the examples page suggests in comments that parsing with a custom format string should also be possible. |
This would add a nice default to the current dropdown for date formats. While nice it's tangential to the goal of this PR.
I again will quote from @jralls ' requirements:
Assuming every (even short) date format consists of a day, month and year is very Western thinking. We can not assume this. When I look at the long list of locale formats present on my Fedora 39 linux box, I find several examples where this isn't the case:
Others do use three fields, but are using alternative numerical symbols:
A while back I read an article about a developer of a datetime library. The article mainly consisted of all the possible false assumptions people have about date and time if they are not specialized in writing date-time code. The cases above only shows a subset of these. I doubt we will be able to write a solid date parser that works as good as ICU's with less effort so I am in favour of going with ICU. @christopherlam Perhaps this is more work than you are willing to take up or too difficult. That would be perfectly ok. If so we have to find a middle ground that scratches your itch but is at the same time sufficiently flexible that it can also work for others. I just don't know what that could be right now. My suggestion is meant as a first design improvement to discuss further on. |
Ok here's the next conundrum: there's 799 items returned in my
|
Too much indeed. You could reasonably limit it to the number of languages we have translations for, but at 50+ even that is pushing it for a one-level combo, especially so when you add in all of the place permutations. The UI problem can be alleviated with a multi-level combo where the first level is continents, i.e. Africa, Americas, Asia, Europe, and Oceana, then the countries in each region and finally the languages supported for that country. That's not very satisfactory, it requires a lot of clicks to get to the locale you want. We could move the complicated UI to preferences where users would create a short list of locales that we'd put in the date parser selector. Or we could punt the idea of accommodating arbitrary locales and rely only on the result of |
Hmm, I agree my idea was a bit naive... That's way too many options to present to the user. The other idea I have always had in the back of my mind (and I believe it was originally suggested by @derekatkins ) is to automatically filter the possible date formats to present to that subset of formats that don't have any parsing errors on the selected date column. That would require the user to first select (at least one) date column, the code would go over a list of possible date formats (extracted from locale data?) and filters out any that can not validly parse the date fields. Only if the resulting list has one or more valid date formate, the date format dropdown would be enabled, presenting just that subset. As long as there are still date parsing errors that either means some rows have to be skipped (like header lines) or the csv file has dates in a format we don't know how to support. I realize this may slow down things a lot as well, because we have to test this for each supported locale. Given en_US and en_AU use different abbreviations for months, I can only assume this goes for several other locales using the same ordering of day, month and year as well. On the other hand, I think the elimination for potential candidates would happen fairly quickly. There would only be a few valid locale date formats for the first date you'd encounter. From there on the only ambiguity I'd expect still is whether 05-06-24 is to be interpreted as june 5th (as we do in Belgium) or May 6th as ango-saxon countries do. |
Nice idea.... Would you have any spare CPU cycles to make it work, likely with better architecture than my attempts, or would you be happy with my hacks here? |
It's a nice user-friendly idea, but it's not really practical if we're going to have ICU do the parsing. Each test would be create locale, create calendar with the locale, try to parse the input, check the error, destroy the calendar, destroy the locale, start over, for an average of what, 400 locales? That's going to take a while. But if we constrain the number of locales to try with a preference it becomes a lot more tractable. For nearly all users it will be the UI locale (i.e., the locale returned by So for the first pass lets just do the current locale and see if anyone asks for more. |
See latest commit in #2010 which caches the icu objects trading some memory.
It's not unreasonable to expect many FIs (incl banks brokers) to stubbornly assume everyone will use excel and spit "dd MMM yyyy" formats. I'm seeing at least 2 in my daily use and they won't update their systems to accommodate a few customers and output dd-mm-yyyy or yyyy-mm-dd.
As above, my locale is en_AU which is good for reports outputting AUD by default. However IMHO ICU's month parser is d*mb in expecting only "Sept" for September. Hence I like @gjanssens idea. |
I agree it's dumb, though maybe the Unicode Consortium's fault rather than ICUs. But even with cached locales it's going to take a while to test hundreds of locales until you get one that likes 'Sep'. You can't trust it with just scanning one date to set the locale because the first date might be Aug and it will pick en_AU then fail when it gets halfway down the transactions and finds Sep instead of Sept. Maybe limit the search to the other locales using the same language as the current locale? |
Ok benchmarked: with 500 dates, testing 800 locales takes 3s which is too slow. I can't think of a solution that would pass in en_AU and their dumb "Sept" requirement. |
How did you benchmark ? |
That's exactly what I did :-) f1629d7 for each of 500 dates, start with 800 locales, filter into successful parses, reducing locales into successful ones after each date. I've used memory-efficient std::swap. I've also rewritten to avoid try-except block. Maybe the algo isn't right? |
I quickly ran your code. The output is:
That seems odd to me. You generate date strings all containing "09-22-2021". Out of the 805 locales you test against 804 parse this as a valid date ? That's very unlikely. I know for a fact it's invalid in most European countries and suspect even in several Asian countries. So something is not working as expected. |
odd indeed... GncDate is creating some garbage dates. Well within my blind spot.
|
With c164733 I get
Which is certainly better than 3.58 seconds, but this is only testing a single date format, SHORT ( It's probably faster and more reliable to count the alpha characters in the string to decide whether the MEDIUM or LONG format is needed than to test all the locales 3 times. |
How about we ignore icu completely, keep the dmy mdy ymd options, and use boost instead? We'll still remove the d-m and m-d options. https://www.boost.org/doc/libs/1_31_0/libs/date_time/doc/class_date.html |
There's no localization. |
#2011 wins... |
Allows csv import to interpret month Jan/Feb/Mar etc. Silliest implementation ever.