Currently it is pretty slow. Youhua reported that it took somewhat over an hour to get all US sites for a period with one unique year.
Andrew Lambert of NRL said that with his code he was able to do something similar (vis data only though?) loading from Amazon S3 in 3 minutes or so. So there is a lot of room for improvement.
I did some initial benchmarking on my desktop, picking a site-year with a larger file size: site 01010099999, year 2020, 748K. Loading with `pd.read_fwf`[^1], specifying `widths` and `dtype`s (also using `header=None`, `compression="gzip"`). 3 options for this fixed-width (FW) file:

- `s3://noaa-isd-pds/data/2020/010100-99999-2020.gz` ~ 0.7 s

☝️ So there is a factor of 2 speed gain in this case by using the S3 URL instead. But note not all the processing that monetio does is included.
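For reference, a minimal sketch of the kind of `pd.read_fwf` call described above. The widths and names cover only what I assume are the first five ISD control fields (they should be checked against the ISD format document), and `read_isd_fw` is a hypothetical helper name, not something in monetio.

```python
import gzip

import pandas as pd

# Assumed layout of the first five ISD control fields (illustrative only):
# record length, USAF ID, WBAN ID, date (YYYYMMDD), time (HHMM)
WIDTHS = [4, 6, 5, 8, 4]
NAMES = ["nchar", "usaf", "wban", "date", "time"]
# Keep the station IDs and date/time as strings so leading zeros survive
DTYPES = {"nchar": int, "usaf": str, "wban": str, "date": str, "time": str}


def read_isd_fw(path, compression="gzip"):
    """Read the leading fixed-width fields of one ISD site-year file."""
    df = pd.read_fwf(
        path,
        widths=WIDTHS,
        names=NAMES,
        dtype=DTYPES,
        header=None,
        compression=compression,
    )
    # Combine the date and time strings into a single datetime column
    df["time"] = pd.to_datetime(df["date"] + df["time"], format="%Y%m%d%H%M")
    return df.drop(columns="date")
```

Passing an `s3://` URL as `path` should work the same way as a local path, assuming `s3fs` is installed; characters beyond the listed widths are simply ignored.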
Loading this site-year similarly (though with more processing done, e.g. 99999 -> NaN) with monetio currently takes ~ 15 s. Probably by leveraging `pd.read_fwf` etc. like above instead of the current method, we can speed the processing up.

Also note that loading the CSV (`s3://noaa-global-hourly-pds/2020/01010099999.csv`) instead of the FW text seems not that much slower. Will have to compare with all the processing included etc.

Also note that https://www.ncei.noaa.gov/data/global-hourly/archive/ has compiled files (all sites, presumably) by year. These are just `.tar.gz`s of all the CSV or FW files though.

cc: @ytangnoaa @bbakernoaa
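To make the "more processing" part concrete (e.g. 99999 -> NaN), here is a rough sketch of doing the missing-value replacement and scaling in a vectorized way after loading. The column names, sentinel values, and scale factors are placeholders I am assuming for illustration; they are not taken from monetio or verified against the ISD format document.

```python
import pandas as pd

# Hypothetical per-column missing-value sentinels and scale factors
MISSING = {"temp": 9999, "vsb": 999999}
SCALE = {"temp": 10.0, "vsb": 1.0}


def apply_missing_and_scale(df):
    """Replace sentinel values with NaN and apply scale factors, column-wise."""
    out = df.copy()
    for col, sentinel in MISSING.items():
        vals = out[col].astype(float)
        # mask() sets entries equal to the sentinel to NaN in one vectorized step
        out[col] = vals.mask(vals == sentinel) / SCALE[col]
    return out
```

Doing this on whole columns, rather than row by row, should keep the extra processing cheap compared to the file read itself, though that would need to be confirmed by timing it.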
[^1]: `pd.read_fwf` args are not all documented in its docstring, but they are mostly the same as `pd.read_csv`, with the differences noted here.