adding german JSON-normalizer / changes to extract_datetime_de #175

emphasize · 2021-01-28T13:14:07Z

Description

A few things done here:

- Add a German JSON normalizer
- Add the ability to parse utterances such as "5 June 20:00", e.g. to set a reminder (previously 20:00 was taken for a year, so the time was not parsed at all)
  - check for ":" (20:00)
- Remove _de_numbers from parse_de, since it is imported from parse_common_de
- Brush up the code so it is consistent throughout


CLA

👍

emphasize · 2021-01-28T13:23:25Z

~~Hm.. probably a bad idea to restrict year to >2000.~~
~~Will replace it with >60, since this would exclude minute (and ofc hour) time data.~~

emphasize · 2021-01-28T13:38:32Z

If this is a bug across the spectrum, I will add specific lines to all parsers.

lingua_franca/lang/parse_de.py

ChanceNCounter · 2021-03-07T00:46:09Z

(6 weeks later...)

On reflection, it still seems like a bad idea to restrict year to > 60, because we almost certainly want to support utterances like, "What happened on June 13, '27?"

emphasize · 2021-03-07T20:08:44Z

In spoken German you normally don't use this kind of abbreviation, as you would in written language.

Another thing I just saw: I don't know how the other parsers handle datestr, but ours wouldn't work no matter what (the relevant parts are broken out and Python-highlighted). The temp datetime object is only created from month and day, which is a problem if datestr contains the year.

Back to the abbreviation: a datestr like 18. Januar '20 (resp. 18 Januar 20) would also be greeted with an exception when parsed like this.

So this has to be adjusted one way or the other. Parsing with ' as an indicator would be easy, yet I don't think the STT services send it back in a form that marks it as a year.

extractedDate = dateNow
extractedDate = extractedDate.replace(microsecond=0,
                                          second=0,
                                          minute=0,
                                          hour=0)

if datestr != "":
    en_months = ['january', 'february', 'march', 'april', 'may', 'june',
               'july', 'august', 'september', 'october', 'november',
               'december']
    en_monthsShort = ['jan', 'feb', 'mar', 'apr', 'may', 'june', 'july',
                      'aug',
                      'sept', 'oct', 'nov', 'dec']
    for idx, en_month in enumerate(en_months):
        datestr = datestr.replace(months[idx], en_month)
    for idx, en_month in enumerate(en_monthsShort):
        datestr = datestr.replace(monthsShort[idx], en_month)

    temp = datetime.strptime(datestr, "%B %d")

    if not hasYear:
        temp = temp.replace(year=extractedDate.year)
        if extractedDate < temp:
            extractedDate = extractedDate.replace(year=int(currentYear),
                                                  month=int(
                                                      temp.strftime(
                                                          "%m")),
                                                  day=int(temp.strftime(
                                                      "%d")))
        else:
            extractedDate = extractedDate.replace(
                year=int(currentYear) + 1,
                month=int(temp.strftime("%m")),
                day=int(temp.strftime("%d")))
    else:
        extractedDate = extractedDate.replace(
            year=int(temp.strftime("%Y")),
            month=int(temp.strftime("%m")),
            day=int(temp.strftime("%d")))
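
The format problem described above can be reproduced in isolation. The following is a minimal sketch, not the parser's actual code: `parse_datestr` is a hypothetical helper, and it assumes datestr has already been translated to English month names, carries a four-digit year when `has_year` is set, and that the process runs in an English/C locale (which `%B` depends on).

```python
from datetime import datetime


def parse_datestr(datestr, has_year):
    """Parse 'month day' or 'month day year' (hypothetical helper).

    Picks the strptime format from has_year instead of always using
    "%B %d", which rejects any trailing year.
    """
    fmt = "%B %d %Y" if has_year else "%B %d"
    return datetime.strptime(datestr, fmt)


# "%B %d" alone raises ValueError as soon as a year is present.
try:
    datetime.strptime("january 18 2020", "%B %d")
    failed = False
except ValueError:
    failed = True

with_year = parse_datestr("january 18 2020", has_year=True)
# Without a year, strptime falls back to its default year (1900),
# which the caller replaces afterwards.
without_year = parse_datestr("january 18", has_year=False)
```

Choosing the format up front keeps the later `hasYear` branch meaningful, because `temp` then really carries the year it is asked for.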

ChanceNCounter · 2021-03-07T20:47:11Z

Nope, STT will just send it back as two consecutive numbers, along the lines of "18 Januar 20" (hence my concern, though I guess not if it doesn't apply to German.)

emphasize · 2021-03-07T20:56:10Z

Hm.. Just skimmed through parse_en and found no indicator that parsing a year like 18 january '20 is even possible. Maybe i'm missing something.

"Feb 18 18" / "18 Feb 18" should also run into exceptions

elif word in months or word in monthsShort and not fromFlag:
    .
    .
    if wordPrev and (wordPrev[0].isdigit() or (wordPrev == "of" and wordPrevPrev[0].isdigit())):
        .
        .
        if wordNext and wordNext[0].isdigit():
            # 18 Feb 18
            datestr += " " + wordNext
            used += 1
            hasYear = True
        else:
            hasYear = False
    elif wordNext and wordNext[0].isdigit():
        datestr += " " + wordNext
        used += 1
        if wordNextNext and wordNextNext[0].isdigit():
            # Feb 18 18
            datestr += " " + wordNextNext
            used += 1
            hasYear = True
        else:
            hasYear = False

which wouldn't get picked up by datetime.strptime(datestr, "%B %d %Y"), because %Y does not accept a two-digit year (18 -> exception).
You just wouldn't notice because of the surrounding try.

>>> date_string = "18"
>>> date_object = datetime.strptime(date_string, "%Y")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sweng/anaconda3/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/home/sweng/anaconda3/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '18' does not match format '%Y'
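
For comparison, `%y` is the strptime directive that does accept a two-digit year. A small sketch of a fallback, assuming an English/C locale for `%B`; `parse_day_month_year` is a hypothetical helper and not something the parsers currently contain.

```python
from datetime import datetime


def parse_day_month_year(datestr):
    """Try a four-digit year first, then fall back to a two-digit one."""
    for fmt in ("%d %B %Y", "%d %B %y"):
        try:
            return datetime.strptime(datestr, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {datestr!r}")


short = parse_day_month_year("18 February 18")
full = parse_day_month_year("18 February 2018")
```

Both calls yield 18 February 2018. Note that `%y` maps 00-68 to 2000-2068 and 69-99 to 1969-1999, so whether that pivot suits spoken utterances would still have to be decided.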

ChanceNCounter · 2021-03-07T21:39:21Z

Oh, it absolutely doesn't work at the moment. Jarbas has a rewrite roadmapped, and there's (just over the past couple days) been some discussion about TZ management that will play into it. So will the existence of lingua_franca.config

Nevertheless, at least in languages that intend to support that format - which is extremely common in English, but I obviously can't speak to German! - it seems unwise to place restrictions on years (except 0.)

emphasize · 2021-03-07T22:03:29Z

All this wouldn't be necessary if

- more than just the first char were checked to validate a number, and
- STT didn't send back weird variations of 10:30

ChanceNCounter · 2021-03-07T22:34:12Z

That we can't control STT is an ongoing challenge. The module also transitioned slowly into supporting "written" input, which is beginning to pay dividends when STT returns weird variations on stuff like that =P

It'll only get weirder as more STT engines proliferate.

Things like this play back into the normalizer, as well, which should be able to sanitize most of these edge cases.

Even still, with reliably normalized input, we'd be parsing something like "10 12" in a datetime extractor. This could be 10:12, Oct. 12, or Dec. 10. It's tricky business. This is both the upside and the downside of algorithmic vs. ML parsers. We can find rules, bake in edge cases, and that's that, but disambiguation is hard.
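
To make the ambiguity concrete, here is an illustrative sketch that lists every reading of two bare numbers that is at least formally valid. `candidate_readings` is a made-up name for this example; a real extractor would still need context to pick one.

```python
from datetime import date, time


def candidate_readings(first, second, year):
    """Return every formally valid reading of two bare numbers."""
    readings = []
    if 0 <= first <= 23 and 0 <= second <= 59:
        readings.append(("time", time(first, second)))
    for label, month, day in (("month-day", first, second),
                              ("day-month", second, first)):
        try:
            readings.append((label, date(year, month, day)))
        except ValueError:
            pass  # e.g. month 13, or day 31 in a 30-day month
    return readings


readings = candidate_readings(10, 12, year=2021)
```

For "10 12" all three readings survive (10:12, 12 October, 10 December), which is exactly why rules alone cannot settle it.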

emphasize · 2021-03-07T22:46:27Z

A variation could be to just check for ":"

if wordNext and wordNext[0].isdigit() and ":" not in wordNext:
    datestr += " " + wordNext
    used += 1
    hasYear = True
else:
    hasYear = False
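
A slightly stricter variant of the same idea, as a sketch only: it looks at the whole token rather than just its first character. `classify_following_token` is a hypothetical helper, not part of parse_de.

```python
def classify_following_token(token):
    """Classify the word after a day/month as 'time', 'year' or 'other'."""
    if ":" in token:
        hours, _, minutes = token.partition(":")
        if hours.isdigit() and minutes.isdigit():
            return "time"
        return "other"
    # Check the whole token, not just its first character.
    if token.isdigit():
        return "year"
    return "other"


kinds = [classify_following_token(t) for t in ("20:00", "2021", "20", "5th")]
```

With this check "20:00" is never appended to datestr as a year, and a token like "5th" is no longer mistaken for one either.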

Nah, Google STT does recognize it when I say "... 10 uhr 10", yet it randomly returns 10 uhr 10 uhr, 10.10 uhr or 10:10 uhr, plus the infamous "dreißig" bug (13 uhr 30 -> 13 uhr dreißig; not with 10, 20, 40, ...).

I just realized that I already split along ":" and haven't had memorable problems (in production mode). So it might be a good idea to change it that way.
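
The STT variants listed above could be folded into one HH:MM form before parsing. This is a rough sketch under stated assumptions: the patterns cover only the variants mentioned here, `SPOKEN_MINUTES` and `normalize_time` are invented names, minutes are not range-checked, and the utterance is lowercased as a side effect.

```python
import re

# Spelled-out minutes that STT sometimes returns; extend as needed.
SPOKEN_MINUTES = {"dreißig": 30}

_TIME_PATTERNS = [
    # "10:10 uhr" or "10.10 uhr"
    re.compile(r"\b(\d{1,2})[:.](\d{2})\s+uhr\b"),
    # "10 uhr 10 uhr", "10 uhr 10" or "13 uhr dreißig"
    re.compile(r"\b(\d{1,2})\s+uhr\s+(\d{1,2}|dreißig)(?:\s+uhr)?\b"),
]


def normalize_time(utterance):
    """Rewrite the known 'uhr' variants in an utterance as HH:MM."""
    def to_hhmm(match):
        hour, minute = match.group(1), match.group(2)
        minute = SPOKEN_MINUTES.get(minute, minute)
        return f"{int(hour):02d}:{int(minute):02d}"

    text = utterance.lower()
    for pattern in _TIME_PATTERNS:
        text = pattern.sub(to_hhmm, text)
    return text


variants = ["10 uhr 10", "10 uhr 10 uhr", "10.10 uhr", "10:10 uhr"]
normalized = [normalize_time(v) for v in variants]
```

Once every variant contains a ":", splitting along ":" or the ":" check in the year branch has a single shape to deal with.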


emphasize · 2021-03-29T12:26:48Z

Oh boy, oh boy. With these German parsers you stumble from brick to brick.

First off, I changed back the 'temp' tz addition due to #180.

With the addition of the German normalizer, another previously unnoticed problem emerged.
In the former approach (words changed directly in normalizer_de instead of loading a JSON file),
'ein': 1, 'eins': 1, 'eine': 1, 'einer': 1, 'einem': 1, 'einen': 1, 'eines': 1,

is changed wholesale.
This results in spoken sentences like "I have one(1) heck of a day" instead of "I have a heck of a day" (maybe workable in English, but not in German).

I will add these lines to the normalizer.json, though this has to be handled in the specific parsers/formatters, with "ein", ... kicked out of the normalization.
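
The effect of the wholesale replacement can be shown with a toy normalizer. This sketch does not use the lingua_franca normalizer or its JSON schema; `NUMBER_WORDS`, `ARTICLE_FORMS` and `replace_numbers` are hypothetical names chosen for illustration.

```python
NUMBER_WORDS = {
    "ein": "1", "eins": "1", "eine": "1", "einer": "1",
    "einem": "1", "einen": "1", "eines": "1",
    "zwei": "2", "drei": "3",
}
# Forms that double as indefinite articles and are therefore ambiguous.
ARTICLE_FORMS = {"ein", "eine", "einer", "einem", "einen", "eines"}


def replace_numbers(utterance, keep_articles=False):
    """Replace number words with digits, optionally sparing article forms."""
    words = []
    for word in utterance.split():
        key = word.lower()
        if keep_articles and key in ARTICLE_FORMS:
            words.append(word)
        else:
            words.append(NUMBER_WORDS.get(key, word))
    return " ".join(words)


sentence = "ich habe einen hund und zwei katzen"
wholesale = replace_numbers(sentence)
careful = replace_numbers(sentence, keep_articles=True)
```

The wholesale pass turns the article into a digit ("ich habe 1 hund und 2 katzen"), while the second pass leaves "einen" alone and still converts "zwei". Which of the two a given parser or formatter wants is the open question raised above.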

JarbasAl · 2021-03-29T15:54:52Z

can you please add some unittests for what this is supposed to be fixing with the datetime changes?

if some other native speaker could double check this it would be great, but it's quite simple so let's not block merging because of that

i think this looks good in general, but need to test properly before hitting the green button

glad to see more normalizers being migrated to the new json mechanism

JarbasAl · 2021-03-29T16:30:58Z

lingua_franca/lang/parse_de.py

@@ -833,6 +778,7 @@ def date_found():
            datestr = datestr.replace(monthsShort[idx], en_month)

        temp = datetime.strptime(datestr, "%B %d")
+        extractedDate = extractedDate.replace(hour=0, minute=0, second=0)
        if not hasYear:
            temp = temp.replace(year=extractedDate.year)


this might need changing depending on #180

I thought so, yet I didn't find a change in the other parsers.

What about..

else:
    # ignore the current HH:MM:SS if relative using days or greater
    if hrOffset == 0 and minOffset == 0 and secOffset == 0:
        extractedDate = extractedDate.replace(hour=0, minute=0, second=0)

in the case of extract_datetime('today')? If coupled with a to_utc() (a problem with the weatherskill atm), this can cause datetime jumps.
It wouldn't do any harm if this were replaced as well.

The issue I'm anticipating is the temp datetime being a naive datetime; it will cause an issue with timezones. In #180 you can see I changed that step in every language to add the timezone to the temp datetime.
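
The naive/aware problem can be demonstrated with the standard library alone. A minimal sketch, assuming a fixed UTC+1 offset as a stand-in for the user's timezone and an English/C locale for `%B`; it illustrates the general idea of copying tzinfo onto temp and is not the exact change made in #180.

```python
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=1))  # stand-in for the user's timezone
extracted = datetime(2021, 3, 29, 0, 0, tzinfo=tz)  # offset-aware
# strptime returns a naive datetime (no tzinfo).
temp = datetime.strptime("june 5", "%B %d").replace(year=extracted.year)

# Comparing aware and naive datetimes raises TypeError.
try:
    extracted < temp
    comparable = True
except TypeError:
    comparable = False

# Copy the tzinfo over before comparing.
temp_aware = temp.replace(tzinfo=extracted.tzinfo)
is_future = extracted < temp_aware
```

Taking the tzinfo from the already extracted date keeps both operands in the same timezone without consulting a global default.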

emphasize · 2021-03-29T18:27:29Z

> can you please add some unittests for what this is supposed to be fixing with the datetime changes?

I wouldn't call this part a fix, more of a convenience addition,
for something like "set a reminder on the 5th of july 19:00 o'clock".
Basically, if a day or month is given, the parser expects a year after "july" and treats 19:00 as such, because
wordNext[0].isdigit() is true.

I'll add a test

fix extract_datetime offset-aware/naive bug

6267fc1

devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Jan 28, 2021

lower wordnext check to > 60

bafc771

emphasize changed the title ~~fix extract_datetime offset-aware/naive bug~~ fix extract_datetime(_de) offset-aware/naive bug Jan 28, 2021

JarbasAl reviewed Jan 28, 2021

extract tzinfo from extractedDate instead of default_timezone()

18712f6

emphasize added 7 commits March 8, 2021 00:19

datetime parsing: No restriction to year, added ValueError exception

a3cbd59

To allow future year abbreviations ('20) parsing. Essential ValueException to be able to parse the year if any is passed (not possible atm)

datetime parsing: No restriction to year, added ValueError exception

8a6d1d0

without typos

normalizer -> json

25119a5

revert changes to temp to reflect MycroftAI#180

85c6ae1

add german normalizer

8830769

remove wrong entries in word_replacements

7460cfd

set remove_articles=True

0089d23

add ein, eine,... to normalizer

f9287e2

emphasize changed the title ~~fix extract_datetime(_de) offset-aware/naive bug~~ adding german JSON-normalizer / changes to extract_datetime_de Mar 29, 2021

extractnumber -> extract_number

a222c6a

JarbasAl added the de related to German language label Mar 29, 2021

JarbasAl reviewed Mar 29, 2021

emphasize added 2 commits March 29, 2021 20:39

add test

6f11610

add test

4f85ac6
