
ISO 8601 format check fails on valid dates. #2098

Open
Ingmarvdg opened this issue Jun 17, 2024 · 13 comments

Comments

@Ingmarvdg

When checking if a datetime string is ISO 8601 compliant, some valid datetimes fail.

Dates in the 10th month.

The date 2020-10-11 fails, while the dates 2020-09-11 and 2020-11-12 pass. This appears to be caused by the fragment (0[0-9]|1[12]) in the regular expression, which should be (0[0-9]|1[0-2]) so that October (month 10) is accepted.

Dates before 1900 or after 2099

The dates 2100-01-01 and 1899-01-01 fail due to the fragment (19|20)[[:digit:]][[:digit:]] in the regular expression, which restricts years to the range 1900–2099.
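To illustrate both fragments, here is a minimal sketch using Python's re module. This is hypothetical demonstration code, not Soda's actual source; the patterns only isolate the month and year pieces discussed above:

```python
import re

# Sketch of the reported fragments (not Soda's full regex).
buggy = re.compile(r"^(19|20)\d\d-(0[0-9]|1[12])-\d{2}$")  # as reported
fixed = re.compile(r"^\d{4}-(0[0-9]|1[0-2])-\d{2}$")       # month widened, year unrestricted

print(bool(buggy.match("2020-10-11")))  # False: "10" matches neither 0[0-9] nor 1[12]
print(bool(fixed.match("2020-10-11")))  # True
print(bool(buggy.match("1899-01-01")))  # False: (19|20)\d\d rejects years outside 1900-2099
print(bool(fixed.match("1899-01-01")))  # True
```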

@tools-soda

SAS-3692

@pholser
Contributor

pholser commented Jun 17, 2024

Same goes for any valid format value that would involve the year or year4 regexes: date inverse, for example.

@m1n0
Contributor

m1n0 commented Jul 12, 2024

Hi, thanks for reporting. Could you verify whether this is fixed by #2128?

@pholser
Contributor

pholser commented Jul 12, 2024

@Ingmarvdg would you be willing to try out the above fix for your case?

@pholser
Contributor

pholser commented Jul 15, 2024

@m1n0 @Ingmarvdg I can confirm that #2128 corrects the problem. However, I also noticed that ISO 8601 dates should accept 24-hour times, and currently they don't. I am opening another PR to correct this.
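For reference, a minimal sketch of the 24-hour problem (hypothetical Python, not the actual Soda regex): the hour fragment (0[0-9]|1[012]) in the reported pattern only matches hours 00–12, while ISO 8601 allows 00–23:

```python
import re

old_hour = re.compile(r"^(0[0-9]|1[012])$")  # hour fragment from the reported regex
iso_hour = re.compile(r"^([01]\d|2[0-3])$")  # accepts the full 24-hour range

print(bool(old_hour.match("17")))  # False: afternoon/evening hours are rejected
print(bool(iso_hour.match("17")))  # True
print(bool(iso_hour.match("24")))  # False: 24 is still invalid
```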

@pholser
Contributor

pholser commented Jul 15, 2024

See also #2133

@Ingmarvdg
Author

Hi @pholser, I am using the spark-df version of soda-core, and it seems the problem is not yet corrected:

from soda.scan import Scan

# `spark` is an existing SparkSession in scope
scan = Scan()
scan.set_scan_definition_name("tmp")
scan.add_spark_session(spark, data_source_name="tmp")
scan.set_data_source_name("y")

scan.add_sodacl_yaml_str(
    """
checks for tmp:
    - invalid_count(datetime_string) = 0:
        valid format: date iso 8601
"""
)

df = spark.createDataFrame(
    [("1623-10-11T10:10:10.0000+01:00",)], schema=["datetime_string"] # This should pass
)
df.createOrReplaceTempView("tmp")

result_code = scan.execute()
results = scan.get_logs_text()
print(results)

Worse yet, the problem surfaces as a query execution error during the scan rather than as a failed check:

[19:05:02] Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100

[19:05:02] Error occurred while executing scan.
  | 'DataFrame' object has no attribute 'offset'
INFO   | Soda Core 3.3.10
INFO   | Using DefaultSampler
ERROR  | Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100 | 
ERROR  | Error occurred while executing scan. | 'DataFrame' object has no attribute 'offset'

@pholser
Contributor

pholser commented Jul 15, 2024

@Ingmarvdg -- OK, I didn't know about the spark-df version of soda-core. Has it incorporated the above change? I'm satisfied that the ISO 8601 date check is improved by the change in soda-core itself.

@pholser
Contributor

pholser commented Jul 16, 2024

The regex being fed to the query still appears to be the old incorrect one. I don't believe you've incorporated the soda-core update above into your setup.

@pholser
Contributor

pholser commented Jul 16, 2024

@Ingmarvdg what version are you using? I doubt the fix is in a released version yet.

@Ingmarvdg
Author

I pulled the most recent version of the main branch, then ran 'pip install .' from the spark-df folder. That should include your changes, right?

@pholser
Contributor

pholser commented Jul 16, 2024

Perhaps I changed something that didn't affect your bug.

@pholser
Contributor

pholser commented Jul 16, 2024

Try also #2133

I added a format test for "1623-10-11T10:10:10.0000+01:00", and it seems to pass when the core tests are run for spark-df: https://github.com/sodadata/soda-core/actions/runs/9962160324/job/27525299481
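That value can also be checked locally with an illustrative pattern. The regex below is an assumption for this sketch (year restriction dropped, months and hours widened), not the exact regex shipped in #2128:

```python
import re

# Illustrative ISO 8601 datetime pattern (an assumption for this sketch,
# not Soda's exact post-fix regex).
iso = re.compile(
    r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"([ T]([01]\d|2[0-3])(:[0-5]\d(:[0-5]\d([.,]\d+)?)?)?"
    r"([+-]([01]\d|2[0-3]):?[0-5]\d|Z)?)?$"
)

print(bool(iso.match("1623-10-11T10:10:10.0000+01:00")))  # True: the value from this issue
print(bool(iso.match("2020-13-01")))                      # False: month 13 rejected
```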
