Parse more datetime formats #75

Merged on May 17, 2021 (1 commit)

Conversation

cole-floodbase
Contributor

I was running into an issue in stac-fastapi where POST requests for STAC items were failing. It looks like it's because the format string passed to strptime doesn't handle the time-secfrac bit of RFC3339. This PR adds a regression test that demonstrates the issue. It also adds a fix by using the Pydantic project's datetime parser instead.

Related to #65
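
For illustration, here is a minimal sketch of the failure using only the standard library. The strict format string is the one quoted in the validation error further down this thread; `parse_strict` and `parse_with_optional_fraction` are hypothetical helpers written for this example, not the code in this PR (which uses Pydantic's parser instead).

```python
from datetime import datetime, timezone

# Format string quoted in the validation error in this thread.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_strict(value: str) -> datetime:
    """Parse with the strict format; raises ValueError on fractional seconds."""
    return datetime.strptime(value, STRICT_FORMAT).replace(tzinfo=timezone.utc)


def parse_with_optional_fraction(value: str) -> datetime:
    """Hypothetical helper: try the strict format, then one that includes %f."""
    for fmt in (STRICT_FORMAT, "%Y-%m-%dT%H:%M:%S.%fZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"not a supported UTC timestamp: {value!r}")


# A timestamp without fractional seconds parses fine.
print(parse_strict("2018-10-04T21:05:21Z"))

# A timestamp with time-secfrac is rejected by the strict format.
try:
    parse_strict("2020-08-30T10:49:43.719Z")
except ValueError as exc:
    print("strict format failed:", exc)

# The hypothetical helper accepts both shapes.
print(parse_with_optional_fraction("2020-08-30T10:49:43.719Z"))
```

This sketch only covers timestamps ending in a literal `Z`; RFC 3339 also allows numeric offsets, which a general-purpose parser would need to handle.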

@cole-floodbase
Contributor Author

@vincentsarago What do you think? Let me know if I should change anything.

@vincentsarago
Member

I think @geospatial-jeff would be a better reviewer ;-)

@kylebarron
Contributor

Again, I'll let @geospatial-jeff do the reviewing, but I'd be curious to see a simple benchmark here. Presumably full datetime processing is an order of magnitude slower than regex parsing. I'm guessing full datetime processing might be on the order of 0.5-1 ms, so it's still only relevant if you're handling millions of items.

@cole-floodbase
Contributor Author

Thanks, folks. I'll see if I can put together a quick benchmark.

@kylebarron
Contributor

kylebarron commented May 17, 2021

IMO the simplest approach is to use %%timeit in IPython. Just run %timeit once against the existing master and once against this branch.
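
Outside IPython, roughly the same measurement can be sketched with the standard-library `timeit` module. `parse_strptime` and `time_parser` are hypothetical names for this example; the actual parsers from master and from this branch would be swapped in for the comparison.

```python
import timeit
from datetime import datetime

SAMPLE = "2018-10-04T21:05:21Z"


def parse_strptime(value: str) -> datetime:
    """Stand-in parser; replace with the function under test."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def time_parser(parser, value: str = SAMPLE, number: int = 10_000) -> float:
    """Return the mean time in seconds per call over `number` calls."""
    total = timeit.timeit(lambda: parser(value), number=number)
    return total / number


per_call = time_parser(parse_strptime)
print(f"strptime: {per_call * 1e6:.2f} microseconds per call")
```

Absolute numbers depend on the machine and Python version, so only the ratio between the two branches is meaningful.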

Just mentioning this because I know from experience that some types of datetime parsing can be really, really slow. E.g. in making suncalc I found that pandas.to_datetime is like 100x slower than np.astype https://github.com/kylebarron/suncalc-py#benchmark

The first is 2800x faster than the second! Some of the difference here is that under the hood the non-vectorized approach uses pd.to_datetime while the vectorized implementation uses np.astype('datetime64[ns, UTC]'). pd.to_datetime is really slow!!

So basically my question is how many items do we expect to not have RFC3339 formatting? If 90% of input items are formatted as such, it would probably be significantly faster overall to first try regex parsing and then only fall back to pydantic's datetime parser if the regex matching fails.

Edit: But it looks like pydantic also uses regex entirely for its datetime parser, so maybe it isn't that slow 🤷‍♂️ https://github.com/samuelcolvin/pydantic/blob/master/pydantic/datetime_parse.py
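
As a rough sketch of the "fast path first, fall back on failure" idea described above: the regex below handles only the strict `YYYY-MM-DDTHH:MM:SSZ` shape, and `fallback_parse` is a hypothetical stand-in for a more general parser such as pydantic's. None of these names come from stac-pydantic.

```python
import re
from datetime import datetime, timezone

# Fast path: strict UTC timestamps with no fractional seconds.
STRICT_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z")


def fallback_parse(value: str) -> datetime:
    """Hypothetical slower parser; here it only adds fractional seconds."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )


def parse_fast_then_fallback(value: str) -> datetime:
    """Try the cheap regex first; use the general parser only if it misses."""
    match = STRICT_RE.fullmatch(value)
    if match:
        year, month, day, hour, minute, second = map(int, match.groups())
        return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return fallback_parse(value)


print(parse_fast_then_fallback("2018-10-04T21:05:21Z"))
print(parse_fast_then_fallback("2020-08-30T10:49:43.719Z"))
```

Whether the extra branch pays off depends on what share of inputs takes the fast path, which is the question raised above.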

@cole-floodbase
Contributor Author

@kylebarron Yeah, great points. Here's my benchmark. It looks like this PR causes a speedup.

@kylebarron
Contributor

kylebarron commented May 17, 2021

That's wild. I tried to quickly check too, but I can't even get existing master to work 😄

I'm testing with this item...

STAC Item
{
  "type": "Feature",
  "stac_version": "1.0.0-beta.2",
  "stac_extensions": [
    "eo",
    "view",
    "proj"
  ],
  "id": "S2B_1CCV_20181004_0_L2A",
  "bbox": [
    176.86465735875524,
    -72.9927453068842,
    178.4336680925981,
    -72.0124876694908
  ],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          176.90734876525534,
          -72.9927453068842
        ],
        [
          176.86465735875524,
          -72.99146546737437
        ],
        [
          177.18936642547558,
          -72.0124876694908
        ],
        [
          178.4336680925981,
          -72.04567023611742
        ],
        [
          177.63304373965198,
          -72.55414509715332
        ],
        [
          176.90734876525534,
          -72.9927453068842
        ]
      ]
    ]
  },
  "properties": {
    "datetime": "2018-10-04T21:05:21Z",
    "platform": "sentinel-2b",
    "constellation": "sentinel-2",
    "instruments": [
      "msi"
    ],
    "gsd": 10,
    "data_coverage": 20.18,
    "view:off_nadir": 0,
    "eo:cloud_cover": 17.19,
    "proj:epsg": 32701,
    "sentinel:latitude_band": "C",
    "sentinel:grid_square": "CV",
    "sentinel:sequence": "0",
    "sentinel:product_id": "S2B_MSIL2A_20181004T210519_N0001_R071_T01CCV_20200307T115707",
    "created": "2020-08-30T10:49:43.719Z",
    "updated": "2020-08-30T10:49:43.719Z",
    "sentinel:valid_cloud_cover": true,
    "sentinel:utm_zone": 1,
    "sentinel:data_coverage": 20.18
  },
  "collection": "sentinel-s2-l2a-cogs",
  "assets": {
    "thumbnail": {
      "title": "Thumbnail",
      "type": "image/png",
      "href": "https://roda.sentinel-hub.com/sentinel-s2-l1c/tiles/1/C/CV/2018/10/4/0/preview.jpg",
      "roles": [
        "thumbnail"
      ]
    },
    "overview": {
      "title": "True color image",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/L2A_PVI.tif",
      "roles": [
        "overview"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.038,
          "center_wavelength": 0.6645,
          "name": "B04",
          "common_name": "red"
        },
        {
          "full_width_half_max": 0.045,
          "center_wavelength": 0.56,
          "name": "B03",
          "common_name": "green"
        },
        {
          "full_width_half_max": 0.098,
          "center_wavelength": 0.4966,
          "name": "B02",
          "common_name": "blue"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        343,
        343
      ],
      "proj:transform": [
        320.0,
        0.0,
        300000.0,
        0.0,
        -320.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "info": {
      "title": "Original JSON metadata",
      "type": "application/json",
      "href": "https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/1/C/CV/2018/10/4/0/tileInfo.json",
      "roles": [
        "metadata"
      ]
    },
    "metadata": {
      "title": "Original XML metadata",
      "type": "application/xml",
      "href": "https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/1/C/CV/2018/10/4/0/metadata.xml",
      "roles": [
        "metadata"
      ]
    },
    "visual": {
      "title": "True color image",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/TCI.tif",
      "roles": [
        "overview"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.038,
          "center_wavelength": 0.6645,
          "name": "B04",
          "common_name": "red"
        },
        {
          "full_width_half_max": 0.045,
          "center_wavelength": 0.56,
          "name": "B03",
          "common_name": "green"
        },
        {
          "full_width_half_max": 0.098,
          "center_wavelength": 0.4966,
          "name": "B02",
          "common_name": "blue"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B01": {
      "title": "Band 1 (coastal)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.027,
          "center_wavelength": 0.4439,
          "name": "B01",
          "common_name": "coastal"
        }
      ],
      "gsd": 60,
      "proj:shape": [
        1830,
        1830
      ],
      "proj:transform": [
        60.0,
        0.0,
        300000.0,
        0.0,
        -60.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B02": {
      "title": "Band 2 (blue)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.098,
          "center_wavelength": 0.4966,
          "name": "B02",
          "common_name": "blue"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B03": {
      "title": "Band 3 (green)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.045,
          "center_wavelength": 0.56,
          "name": "B03",
          "common_name": "green"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B04": {
      "title": "Band 4 (red)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.038,
          "center_wavelength": 0.6645,
          "name": "B04",
          "common_name": "red"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B05": {
      "title": "Band 5",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B05.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.019,
          "center_wavelength": 0.7039,
          "name": "B05"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B06": {
      "title": "Band 6",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B06.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.018,
          "center_wavelength": 0.7402,
          "name": "B06"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B07": {
      "title": "Band 7",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B07.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.028,
          "center_wavelength": 0.7825,
          "name": "B07"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B08": {
      "title": "Band 8 (nir)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B08.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.145,
          "center_wavelength": 0.8351,
          "name": "B08",
          "common_name": "nir"
        }
      ],
      "gsd": 10,
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B8A": {
      "title": "Band 8A",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B8A.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.033,
          "center_wavelength": 0.8648,
          "name": "B8A"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B09": {
      "title": "Band 9",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B09.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.026,
          "center_wavelength": 0.945,
          "name": "B09"
        }
      ],
      "gsd": 60,
      "proj:shape": [
        1830,
        1830
      ],
      "proj:transform": [
        60.0,
        0.0,
        300000.0,
        0.0,
        -60.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B11": {
      "title": "Band 11 (swir16)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B11.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.143,
          "center_wavelength": 1.6137,
          "name": "B11",
          "common_name": "swir16"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "B12": {
      "title": "Band 12 (swir22)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B12.tif",
      "roles": [
        "data"
      ],
      "eo:bands": [
        {
          "full_width_half_max": 0.242,
          "center_wavelength": 2.22024,
          "name": "B12",
          "common_name": "swir22"
        }
      ],
      "gsd": 20,
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "AOT": {
      "title": "Aerosol Optical Thickness (AOT)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif",
      "roles": [
        "data"
      ],
      "proj:shape": [
        1830,
        1830
      ],
      "proj:transform": [
        60.0,
        0.0,
        300000.0,
        0.0,
        -60.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "WVP": {
      "title": "Water Vapour (WVP)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/WVP.tif",
      "roles": [
        "data"
      ],
      "proj:shape": [
        10980,
        10980
      ],
      "proj:transform": [
        10.0,
        0.0,
        300000.0,
        0.0,
        -10.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    },
    "SCL": {
      "title": "Scene Classification Map (SCL)",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/SCL.tif",
      "roles": [
        "data"
      ],
      "proj:shape": [
        5490,
        5490
      ],
      "proj:transform": [
        20.0,
        0.0,
        300000.0,
        0.0,
        -20.0,
        2000020.0,
        0.0,
        0.0,
        1.0
      ]
    }
  },
  "links": [
    {
      "rel": "self",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/S2B_1CCV_20181004_0_L2A.json",
      "type": "application/json"
    },
    {
      "rel": "canonical",
      "href": "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/S2B_1CCV_20181004_0_L2A.json",
      "type": "application/json"
    },
    {
      "rel": "parent",
      "href": "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a"
    },
    {
      "rel": "collection",
      "href": "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a"
    },
    {
      "rel": "root",
      "href": "https://earth-search.aws.element84.com/v0/"
    },
    {
      "title": "Source STAC Item",
      "rel": "derived_from",
      "href": "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a/items/S2B_1CCV_20181004_0_L2A",
      "type": "application/json"
    }
  ]
}

and I'm getting this validation error on master

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-5-11c9be24caf5> in <module>
----> 1 Item(**feature)
      2 
      3 # new pr

~/.pyenv/versions/miniconda3-3.8-4.8.3/lib/python3.8/site-packages/pydantic/main.cpython-38-darwin.so in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Item
properties -> updated
  Invalid datetime, must match format (%Y-%m-%dT%H:%M:%SZ). (type=value_error)

I put a print at the top of _parse_rfc3339 and it shows

2020-08-30 10:49:43.719000+00:00

while the value of updated is "2020-08-30T10:49:43.719Z"
🤯
No idea where the pre-processing happens before the value hits _parse_rfc3339. Anyway, it works fine with this PR 🤷‍♂️
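
One possible reading, offered as an assumption and not confirmed anywhere in this thread: the printed value has exactly the shape that `str()` produces for a timezone-aware `datetime`, which would be consistent with the field already having been converted to a `datetime` before reaching the function. The sketch below only shows that the shapes match and that the strict format string rejects that text; it does not reproduce stac-pydantic's validators.

```python
from datetime import datetime, timezone

# The instant encoded by "2020-08-30T10:49:43.719Z", as an aware datetime.
already_parsed = datetime(2020, 8, 30, 10, 49, 43, 719000, tzinfo=timezone.utc)

# str() of an aware datetime uses a space separator and a numeric offset.
as_text = str(already_parsed)
print(as_text)  # 2020-08-30 10:49:43.719000+00:00

# The strict format string from the error message cannot parse that text.
try:
    datetime.strptime(as_text, "%Y-%m-%dT%H:%M:%SZ")
except ValueError as exc:
    print("strptime rejects it:", exc)
```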

@geospatial-jeff geospatial-jeff self-requested a review May 17, 2021 21:29
Collaborator

@geospatial-jeff geospatial-jeff left a comment

LGTM! I like the regex validation better, agree with the things mentioned here.

We are using DATETIME_RFC339 (yikes, I just realized it should be RFC3339) for JSON serialization of item properties, which may cause the JSON-serialized datetime format to differ from the one passed to the model.

This is fine with me, just bringing it up in case anyone sees a problem here.

@geospatial-jeff geospatial-jeff merged commit c2db77c into stac-utils:master May 17, 2021
@cole-floodbase
Contributor Author

Thanks for reviewing and merging!

Good point on the serialization discrepancy. The major meaningful difference to me is that the sub-second portion of the datetime is discarded during serialization. But this probably doesn't matter most of the time.

Could be fixed by changing

https://github.com/stac-utils/stac-pydantic/blob/master/stac_pydantic/shared.py#L18

to include the %f (microsecond) part.

DATETIME_RFC339 = "%Y-%m-%dT%H:%M:%S.%fZ"
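
To make the discrepancy concrete, here is a small standard-library sketch comparing the two format strings on the `updated` value from the item above. It calls `strftime` directly as an illustration and makes no claim about how stac-pydantic wires the constant into serialization.

```python
from datetime import datetime, timezone

# The instant encoded by "2020-08-30T10:49:43.719Z".
value = datetime(2020, 8, 30, 10, 49, 43, 719000, tzinfo=timezone.utc)

# Current format string: the sub-second portion is dropped.
without_fraction = value.strftime("%Y-%m-%dT%H:%M:%SZ")
print(without_fraction)  # 2020-08-30T10:49:43Z

# Proposed format string: %f always writes six digits of microseconds.
with_fraction = value.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
print(with_fraction)  # 2020-08-30T10:49:43.719000Z
```

Note that `%f` pads to six digits, so the output reads `.719000` where the input had `.719`, and that the literal `Z` is only correct if the datetime is already in UTC.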

4 participants