adding german JSON-normalizer / changes to extract_datetime_de #175
base: master
|
If this is a bug across the spectrum, I will add specific lines to all parsers. |
(6 weeks later...) On reflection, it still seems like a bad idea to restrict year to > 60, because we almost certainly want to support utterances like, "What happened on June 13, '27?" |
In spoken German you normally don't use this kind of abbreviation. The same goes for written language. Another thing that I just saw: I don't know how the others handle Back to the abbreviation, a So, this has to be adjusted one way or the other. Parsing with
temp = datetime.strptime(datestr, "%B %d")
|
Nope, STT will just send it back as two consecutive numbers, along the lines of "18 Januar 20" (hence my concern, though I guess not if it doesn't apply to German.) |
Hm.. Just skimmed through parse_en and found no indication that parsing a year like 18 january '20 is even possible. Maybe I'm missing something. "Feb 18 18" / "18 Feb 18" should also run into exceptions,
since the two-digit year wouldn't get picked up by datetime.strptime(datestr, "%B %d %Y"): %Y does not match 18 and raises an exception.
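The failure described here can be reproduced with the standard library alone. Below is a minimal sketch, assuming English month names and the default C/English locale; `parse_with_short_year` is a hypothetical helper for illustration and is not part of parse_en or parse_de.

```python
from datetime import datetime


def parse_with_short_year(datestr):
    """Try a four-digit year first, then fall back to a two-digit year.

    Hypothetical helper for illustration only.
    """
    try:
        # %Y only matches four-digit years, so "February 18 18" raises ValueError
        return datetime.strptime(datestr, "%B %d %Y")
    except ValueError:
        # %y accepts two digits: 00-68 map to 2000-2068, 69-99 to 1969-1999
        return datetime.strptime(datestr, "%B %d %y")
```

With this fallback, "June 13 27" resolves to 2027, which is the kind of utterance the discussion above wants to keep supporting.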
|
Oh, it absolutely doesn't work at the moment. Jarbas has a rewrite roadmapped, and there's (just over the past couple days) been some discussion about TZ management that will play into it. So will the existence of Nevertheless, at least in languages that intend to support that format - which is extremely common in English, but I obviously can't speak to German! - it seems unwise to place restrictions on years (except 0.) |
All this wouldn't be necessary if
|
That we can't control STT is an ongoing challenge. The module also transitioned slowly into supporting "written" input, which is beginning to pay dividends when STT returns weird variations on stuff like that =P It'll only get weirder as more STT engines proliferate. Things like this play back into the normalizer, as well, which should be able to sanitize most of these edge cases. Even still, with reliably normalized input, we'd be parsing something like "10 12" in a datetime extractor. This could be 10:12, Oct. 12, or Dec. 10. It's tricky business. This is both the upside and the downside of algorithmic vs. ML parsers. We can find rules, bake in edge cases, and that's that, but disambiguation is hard. |
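To make the "10 12" ambiguity concrete, here is a small sketch that tries each reading with `datetime.strptime` and keeps the ones that parse. `candidate_readings` and its labels are hypothetical names for illustration; the year is supplied by the caller so that every format is fully specified.

```python
from datetime import datetime


def candidate_readings(text, year):
    """Return every interpretation of a two-number string that parses.

    Hypothetical helper; `year` is appended so each format includes a year.
    """
    formats = {
        "time (HH MM)": "%H %M %Y",
        "month day": "%m %d %Y",
        "day month": "%d %m %Y",
    }
    readings = {}
    for label, fmt in formats.items():
        try:
            readings[label] = datetime.strptime(f"{text} {year}", fmt)
        except ValueError:
            # This reading is impossible for the given numbers; skip it
            continue
    return readings
```

For "10 12" all three readings survive, so a rule-based parser needs extra context to pick one, whereas "13 30" only parses as a time.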
A variation could be just to check for ":":
    if wordNext and wordNext[0].isdigit() and ":" not in wordNext:
        datestr += " " + wordNext
        used += 1
        hasYear = True
    else:
        hasYear = False
Nah, it is recognized by Google STT, since I say "... 10 uhr 10", yet it randomly returns 10 uhr 10 uhr, 10.10 uhr or 10:10 uhr, along with the infamous "dreißig" bug (13 uhr 30 -> 13 uhr dreißig; not with 10, 20, 40, ...). I just realized that I split along ":" and haven't had memorable problems (in production mode). So it might be a good idea to change it that way. |
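The ":" check proposed above can be sketched in isolation as follows. Both function names are hypothetical, and the sketch assumes the date takes up exactly the first two words; the real parser walks the token list with its own bookkeeping (`used`, `hasYear`).

```python
def looks_like_year(token):
    """Return True if a token following the date could be a year.

    Hypothetical helper: a token starting with a digit counts as a year
    unless it contains ":", in which case it is treated as a time.
    """
    return bool(token) and token[0].isdigit() and ":" not in token


def split_date_and_time(words):
    """Split e.g. ["5", "June", "20:00"] into a date string and leftovers."""
    datestr = " ".join(words[:2])  # assumption: day and month come first
    rest = words[2:]
    has_year = bool(rest) and looks_like_year(rest[0])
    if has_year:
        datestr += " " + rest[0]
        rest = rest[1:]
    return datestr, has_year, rest
```

Note that this only covers the "10:10" form; a variant such as "10.10" would still be taken as a year by this check.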
To allow parsing of future year abbreviations ('20). Essential ValueError handling to be able to parse the year if one is passed (not possible atm)
Oh boy, oh boy. With these German parsers you stumble from brick to brick. First off, changed back the 'temp' tz addition due to #180. With the addition of the German normalizer another unrecognized problem emerged. is changed wholesale. Will add these lines to the normalizer.json, though this has to be treated in the specific parsers/formatters, with "ein", ... kicked out of the normalization. |
Can you please add some unit tests for what this is supposed to be fixing with the datetime changes?
If some other native speaker could double check this it would be great, but it's quite simple, so let's not block merging because of that.
I think this looks good in general, but I need to test properly before hitting the green button.
Glad to see more normalizers being migrated to the new JSON mechanism. |
@@ -833,6 +778,7 @@ def date_found():
datestr = datestr.replace(monthsShort[idx], en_month)

temp = datetime.strptime(datestr, "%B %d")
extractedDate = extractedDate.replace(hour=0, minute=0, second=0)
if not hasYear:
    temp = temp.replace(year=extractedDate.year)
this might need changing depending on #180
I thought so, yet I didn't find a change in the other parsers.
What about..
    else:
        # ignore the current HH:MM:SS if relative using days or greater
        if hrOffset == 0 and minOffset == 0 and secOffset == 0:
            extractedDate = extractedDate.replace(hour=0, minute=0, second=0)
in the case of extract_datetime('today')? If coupled with a to_utc() (a problem with the weather skill atm), this can cause datetime jumps.
It wouldn't do harm if this were replaced as well.
The issue I'm anticipating is the temp datetime being a naive datetime; it will cause an issue with timezones. In #180 you can see I changed that step in every language to add the timezone to the temp datetime.
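The naive/aware mismatch can be shown with a short sketch. The fixed UTC+1 offset and the variable names `anchor` and `temp_aware` are assumptions for illustration; this is not the change made in #180, only the general mechanism.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+1 offset stands in for the user's timezone
tz = timezone(timedelta(hours=1))
anchor = datetime(2020, 6, 5, 14, 30, tzinfo=tz)  # aware reference datetime

# strptime without a %z directive returns a naive datetime (tzinfo is None)
temp = datetime.strptime("June 13 2020", "%B %d %Y")

try:
    temp < anchor
    comparable = True
except TypeError:
    # Ordering comparisons between naive and aware datetimes raise TypeError
    comparable = False

# Attaching the reference timezone makes the two values comparable
temp_aware = temp.replace(tzinfo=anchor.tzinfo)
```

After the `replace`, `temp_aware` can be compared with and subtracted from `anchor` without errors.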
wouldn't call this part a fix, more of a convenience addition. I'll add a test |
Description
Few things done here:
- Add a German JSON normalizer
- Add the possibility to parse utterances such as "5 June 20:00", e.g. to set a reminder (previously 20:00 would be treated as a year and not parsed as a time at all)
- Remove _de_numbers from parse_de, since it's imported from parse_common_de
- Brushed up the code to be consistent across the spectrum
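As a rough idea of what a JSON-driven normalizer does, here is a minimal sketch. The config keys, the word lists and the `normalize` function are all hypothetical and do not reflect the actual schema of the project's normalizer.json. Note that "ein" is deliberately left out of both lists, since it is both an article and a number word, which is the problem mentioned in the conversation.

```python
import json

# Hypothetical config; the keys are illustrative, not the library's schema
CONFIG = json.loads("""
{
  "lowercase": true,
  "number_replacements": {"zwei": "2", "drei": "3"},
  "articles": ["der", "die", "das"]
}
""")


def normalize(text, config, remove_articles=True):
    """Apply the replacements described by a JSON config to an utterance."""
    if config.get("lowercase"):
        text = text.lower()
    words = []
    for word in text.split():
        if remove_articles and word in config["articles"]:
            continue  # drop articles entirely
        # swap number words for digits, leave everything else untouched
        words.append(config["number_replacements"].get(word, word))
    return " ".join(words)
```

Keeping the word lists in data rather than in code is what lets each language ship its own file while sharing one normalizer routine.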
Type of PR
If your PR fits more than one category, there is a high chance you should submit more than one PR. Please consider this carefully before opening the PR.
Either delete those that do not apply, or add an x between the square brackets like so:
- [x]
CLA
👍