Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding german JSON-normalizer / changes to extract_datetime_de #175

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

emphasize
Copy link
Contributor

@emphasize emphasize commented Jan 28, 2021

Description

Few things done here:

  1. Add german json-Normalizer

  2. Adding the possibility to parse utterances as such "5 June 20:00" - to set a reminder or else (prior 20:00 would be seen as a year date and not parsing 20:00 at all)

  • check for ":" (20:00)
  1. remove _de_numbers from parse_de since it's imoprted from parse_common_de

  2. brushed up the code to be consistent across the spectrum

Type of PR

If your PR fits more than one category, there is a high chance you should submit more than one PR. Please consider this carefully before opening the PR.
Either delete those that do not apply, or add an x between the square brackets like so: - [x]

  • Bugfix
  • Feature implementation
  • Refactor of code (without functional changes)
  • Documentation improvements
  • Test improvements

CLA

👍

@devs-mycroft devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Jan 28, 2021
@emphasize
Copy link
Contributor Author

emphasize commented Jan 28, 2021

Hm.. probably a bad idea to restrict year to >2000.
will replace it with >60 since this this would exclude minute (and ofc hour) time data

@emphasize emphasize changed the title fix extract_datetime offset-aware/naive bug fix extract_datetime(_de) offset-aware/naive bug Jan 28, 2021
@emphasize
Copy link
Contributor Author

If this is a bug across the spectrum i will add specific lines to all parsers

@ChanceNCounter
Copy link
Contributor

(6 weeks later...)

On reflection, it still seems like a bad idea to restrict year to > 60, because we almost certainly want to support utterances like, "What happened on June 13, '27?"

@emphasize
Copy link
Contributor Author

emphasize commented Mar 7, 2021

In german spoken language you normally don't use this kind of abbrevation. As in written language.

Another thing that i just saw. I don't know how the others handle datestr. But ours wouldn't work no matter what. (the relevant parts are broken out and python highlighted). The temp datetime object is only created with month and day, which is a problem if datestr contains the year.

Back to the abbrevation, a datestr like 18. Januar '20 (resp. 18 Januar 20) would be also greeted with an exception parsed like this.

So, this has to be adjusted one way or the other. Parsing with ' as indicator would be easy yet
i don't think the stt services send it back seen as a year date.

extractedDate = dateNow
extractedDate = extractedDate.replace(microsecond=0,
                                          second=0,
                                          minute=0,
                                          hour=0)

if datestr != "":
    en_months = ['january', 'february', 'march', 'april', 'may', 'june',
               'july', 'august', 'september', 'october', 'november',
               'december']
    en_monthsShort = ['jan', 'feb', 'mar', 'apr', 'may', 'june', 'july',
                      'aug',
                      'sept', 'oct', 'nov', 'dec']
    for idx, en_month in enumerate(en_months):
        datestr = datestr.replace(months[idx], en_month)
    for idx, en_month in enumerate(en_monthsShort):
        datestr = datestr.replace(monthsShort[idx], en_month)
    temp = datetime.strptime(datestr, "%B %d")
    if not hasYear:
        temp = temp.replace(year=extractedDate.year)
        if extractedDate < temp:
            extractedDate = extractedDate.replace(year=int(currentYear),
                                                  month=int(
                                                      temp.strftime(
                                                          "%m")),
                                                  day=int(temp.strftime(
                                                      "%d")))
        else:
            extractedDate = extractedDate.replace(
                year=int(currentYear) + 1,
                month=int(temp.strftime("%m")),
                day=int(temp.strftime("%d")))
    else:
        extractedDate = extractedDate.replace(
            year=int(temp.strftime("%Y")),
            month=int(temp.strftime("%m")),
            day=int(temp.strftime("%d")))

@ChanceNCounter
Copy link
Contributor

Nope, STT will just send it back as two consecutive numbers, along the lines of "18 Januar 20" (hence my concern, though I guess not if it doesn't apply to German.)

@emphasize
Copy link
Contributor Author

emphasize commented Mar 7, 2021

Hm.. Just skimmed through parse_en and found no indicator that parsing a year like 18 january '20 is even possible. Maybe i'm missing something.

"Feb 18 18" / "18 Feb 18" should also run into exceptions

elif word in months or word in monthsShort and not fromFlag:
        .
        .
        if wordPrev and (wordPrev[0].isdigit() or (wordPrev == "of" and wordPrevPrev[0].isdigit())):
        .
        .
              if wordNext and wordNext[0].isdigit():
                    #18 Feb 18
                    datestr += " " + wordNext
                    used += 1
                    hasYear = True
              else:
                    hasYear = False
        elif wordNext and wordNext[0].isdigit():
            datestr += " " + wordNext
            used += 1
            if wordNextNext and wordNextNext[0].isdigit():
                #Feb 18 18
                datestr += " " + wordNextNext
                used += 1
                hasYear = True
            else:
                hasYear = False

which wouldn't get picked up by datetime.strptime(datestr, "%B %d %Y") > %Y(18 ->exception)
you only wouldn't notice because of try

>>> date_string = "18"
>>> date_object = datetime.strptime(date_string, "%Y")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sweng/anaconda3/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/home/sweng/anaconda3/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '18' does not match format '%Y'

@ChanceNCounter
Copy link
Contributor

Oh, it absolutely doesn't work at the moment. Jarbas has a rewrite roadmapped, and there's (just over the past couple days) been some discussion about TZ management that will play into it. So will the existence of lingua_franca.config

Nevertheless, at least in languages that intend to support that format - which is extremely common in English, but I obviously can't speak to German! - it seems unwise to place restrictions on years (except 0.)

@emphasize
Copy link
Contributor Author

emphasize commented Mar 7, 2021

All this wouldn't be necessary if

  • not only the first char is checked to validate a number and on the other hand
  • STT not sending back some weird variations of 10:30

@ChanceNCounter
Copy link
Contributor

That we can't control STT is an ongoing challenge. The module also transitioned slowly into supporting "written" input, which is beginning to pay dividends when STT returns weird variations on stuff like that =P

It'll only get weirder as more STT engines proliferate.

Things like this play back into the normalizer, as well, which should be able to sanitize most of these edge cases.

Even still, with reliably normalized input, we'd be parsing something like "10 12" in a datetime extractor. This could be 10:12, Oct. 12, or Dec. 10. It's tricky business. This is both the upside and the downside of algorithmic vs. ML parsers. We can find rules, bake in edge cases, and that's that, but disambiguation is hard.

@emphasize
Copy link
Contributor Author

emphasize commented Mar 7, 2021

A variation could be just to check ":"

if if wordnext and wordnext[0].isdigit() and not ":" in wordnext:
    datestr += " " + wordNext
    used += 1
    hasYear = True                         
else:
    hasYear = False

Nah, it is recognized from Google STT since i call "... 10 uhr 10", yet it randomly returns 10 uhr 10 uhr, 10.10 uhr or 10:10 uhr with the infamous "dreißig" bug (13 uhr 30 -> 13 uhr dreißig; not with 10,20,40,...)

I just realized that i split along ":" and haven't had memorable problems (in production mode). So this might be a good idea to change it that way

@emphasize
Copy link
Contributor Author

emphasize commented Mar 29, 2021

oh boy, oh boy. with this german parsers you stumble from brick to brick.

First off, changed back 'temp' tz addition due to #180

With the addition of the german normalizer another unrecognized problem emerged.
In the former approach (directly changed words in normalzer_de instead of loading a json)
'ein': 1, 'eins': 1, 'eine': 1, 'einer': 1, 'einem': 1, 'einen': 1, 'eines': 1,

is changed wholesale.
Resulting in spoken sentences like "I have one(1) heck of a day" instead of "I have a heck of a day" (Maybe workable in english, yet isn't in german).

Will add these lines to the normalizer.json, though this has to be treated in specific parsers/formatters with "ein",... kicked out of the normalization

@emphasize emphasize changed the title fix extract_datetime(_de) offset-aware/naive bug adding german JSON-normalizer / changes to extract_datetime_de Mar 29, 2021
@JarbasAl
Copy link
Collaborator

can you please add some unittests for what this is supposed to be fixing with the datetime changes?

if some other native speaker could double check this it would be great, but its quite simple so let's not block merging because of that

i think this looks good in general, but need to test properly before hitting the green button

glad to see more normalizers being migrated to the new json mechanism

@JarbasAl JarbasAl added the de related to German language label Mar 29, 2021
@@ -833,6 +778,7 @@ def date_found():
datestr = datestr.replace(monthsShort[idx], en_month)

temp = datetime.strptime(datestr, "%B %d")
extractedDate = extractedDate.replace(hour=0, minute=0, second=0)
if not hasYear:
temp = temp.replace(year=extractedDate.year)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might need changing depending on #180

Copy link
Contributor Author

@emphasize emphasize Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought so, yet i don't found a change in other parsers.

What about..

else:
   #ignore the current HH:MM:SS if relative using days or greater
   if hrOffset == 0 and minOffset == 0 and secOffset == 0:
      extractedDate = extractedDate.replace(hour=0, minute=0, second=0)

in case of extract_datetime('today'). If coupled with a to_utc() (problem with weatherskill atm) this is causing possible dateime jumps.
wouldn't do harm if this is replaced also.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue im anticipating is the temp datetime being a naive datetime, it will cause an issue with timezones, in #180 you can see i changed that step in every language to add the timezone to the temp datetime

@emphasize
Copy link
Contributor Author

emphasize commented Mar 29, 2021

can you please add some unittests for what this is supposed to be fixing with the datetime changes?

wouldn't call this part a fix, more of a convenience addition.
For something like "set a reminder on the 5th of july 19:00 o'clock"
Basically if a day or month is given it would expect a year after july and treat 19:00 as such since
wordNext[0].isdigit()

I'll add a test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) de related to German language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants