Faster ISD #113

zmoon · 2023-05-06T00:22:11Z

Currently the ISD reader is pretty slow. Youhua reported that it took somewhat over an hour to get all US sites for a period falling within a single year.

Andrew Lambert of NRL said that with his code he was able to do something similar (vis data only though?) loading from Amazon S3 in 3 minutes or so. So there is a lot of room for improvement.

I did some initial benchmarking on my desktop, picking a site-year with a larger file size: site 01010099999, year 2020, 748K. I loaded it with pd.read_fwf¹, specifying widths and dtypes (also using header=None, compression="gzip"). There are 3 options for this fixed-width (FW) file:

s3://noaa-isd-pds/data/2020/010100-99999-2020.gz ~ 0.7 s
https://noaa-isd-pds.s3.amazonaws.com/data/2020/010100-99999-2020.gz ~ 1 s
https://www1.ncdc.noaa.gov/pub/data/noaa/2020/010100-99999-2020.gz ~ 1.4 s

☝️ So there is a factor of 2 speed gain in this case by using the S3 URL instead. But note not all the processing that monetio does is included.
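For illustration, here is a minimal sketch of the kind of `pd.read_fwf` call described above. It is not the code used for the timings. The names `CONTROL_FIELDS` and `read_isd_control` are hypothetical, and the field widths are my reading of the leading "control" fields of an ISD record, so they should be checked against the ISD format document. Only those first few fields are parsed; everything after them on each line is ignored.

```python
import io

import pandas as pd

# Assumed (name, width) layout of the leading control fields of an ISD record.
# This is an assumption for illustration; verify against the ISD format document.
CONTROL_FIELDS = [
    ("nvar", 4),    # number of variable-length characters
    ("usaf", 6),    # USAF station ID
    ("wban", 5),    # WBAN station ID
    ("date", 8),    # YYYYMMDD
    ("time", 4),    # HHMM
    ("source", 1),  # data source flag
    ("lat", 6),     # signed latitude, thousandths of a degree
    ("lon", 7),     # signed longitude, thousandths of a degree
]


def read_isd_control(path_or_buf, compression="infer"):
    """Read the leading control fields of an ISD fixed-width file."""
    df = pd.read_fwf(
        path_or_buf,
        widths=[width for _, width in CONTROL_FIELDS],
        names=[name for name, _ in CONTROL_FIELDS],
        header=None,
        dtype=str,  # keep leading zeros in the station IDs
        compression=compression,
    )
    # Combine the date and time strings into a single timestamp column.
    df["datetime"] = pd.to_datetime(df["date"] + df["time"], format="%Y%m%d%H%M")
    # Convert the scaled, signed coordinate strings to degrees.
    df["lat"] = df["lat"].astype(float) / 1000
    df["lon"] = df["lon"].astype(float) / 1000
    return df
```

With `compression="infer"`, a path or URL ending in `.gz` (such as the ones listed above) would be decompressed automatically. Reading from an `s3://` URL additionally requires `s3fs` to be installed.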

Loading this site-year similarly

ish.add_data(pd.date_range("2020/01/01", "2021/01/01", freq="D")[:-1], site="01010099999", resample=False)

(though with more processing done, e.g. 99999 -> NaN) with monetio currently takes ~ 15 s. By leveraging pd.read_fwf etc. as above instead of the current method, we can probably speed the processing up.
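As a rough sketch of the "sentinel -> NaN" step mentioned above, done in a vectorized way on whole columns: the column names, sentinel values, and scale factors in `SENTINELS` are hypothetical placeholders, and `decode_columns` is not monetio's actual routine.

```python
import pandas as pd

# Hypothetical mapping: column name -> (missing-value sentinel, scale factor).
# The real values must come from the ISD format document.
SENTINELS = {
    "temp": (9999, 10.0),
    "wspd": (9999, 10.0),
}


def decode_columns(df):
    """Return a copy with sentinels replaced by NaN and values rescaled."""
    out = df.copy()
    for col, (missing, scale) in SENTINELS.items():
        vals = out[col].astype(float)
        # Mask the sentinel first, then apply the scale factor.
        out[col] = vals.mask(vals == missing) / scale
    return out
```

Operating on whole columns like this avoids per-record Python loops, which is one place where time could plausibly be saved.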

Also note that loading the CSV (s3://noaa-global-hourly-pds/2020/01010099999.csv) instead of the FW text seems not that much slower. Will have to compare with all the processing included etc.

Also note that https://www.ncei.noaa.gov/data/global-hourly/archive/ has compiled files (all sites, presumably) by year. These are just .tar.gzs of all the CSV or FW files though.
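If the yearly archives turn out to be useful, one option is to read the member files straight out of the `.tar.gz` without extracting to disk. This is a minimal sketch, assuming the archive has already been opened as a binary file object and that the members of interest are CSVs; `iter_site_frames` is a hypothetical helper name.

```python
import io
import tarfile

import pandas as pd


def iter_site_frames(fileobj, suffix=".csv"):
    """Yield (member name, DataFrame) for each matching file in a .tar.gz."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            # Skip directories and files with other extensions.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            with tar.extractfile(member) as f:
                # Read everything as strings; decoding can happen later.
                yield member.name, pd.read_csv(f, dtype=str)
```

Because it is a generator, only one site's file is held in memory at a time unless the caller collects the results.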

cc: @ytangnoaa @bbakernoaa

¹ pd.read_fwf args are not all documented in its docstring, but they are mostly the same as pd.read_csv, with the differences noted here

zmoon · 2024-01-18T20:35:10Z

Remember, ISH-Lite is in the bucket as well

zmoon added the enhancement New feature or request label May 6, 2023

zmoon self-assigned this May 6, 2023

zmoon added this to the v0.3 milestone May 10, 2023

This was referenced Jun 5, 2023

SSL cert issue with NCDC ISD files #120

Closed

Fix OpenAQ reader #116

Merged

zmoon mentioned this issue Jun 13, 2023

Faster OpenAQ (using openaq-fetches) #122

Closed

This was referenced Jan 18, 2024

CLI: get data from ISH and ISH-Lite NOAA-CSL/MELODIES-MONET#236

Merged

Add retry and timeout options for ISH reader #157

Merged
