Unable to read ECMWF ensemble into xarray with FastHerbie #196

karlwx · 2023-05-25T14:56:00Z

Hey Brian, I'm trying to download a whole run of ECMWF ensemble data with FastHerbie - it seems like the data format is causing things to break. Here is the error:

AttributeError                            Traceback (most recent call last)
Cell In[7], line 2
      1 H = FastHerbie(["2023-05-25 00:00"], model="ecmwf", product="enfo", fxx=[0,6,12])
----> 2 ds = H.xarray(":2t:")

File ~/.conda/envs/radar/lib/python3.9/site-packages/herbie/fast.py:294, in FastHerbie.xarray(self, searchString, max_threads, **xarray_kwargs)
    291     ds_list = [H.xarray(**xarray_kwargs) for H in self.file_exists]
    293 # Sort the DataSets, first by lead time (step), then by run time (time)
--> 294 ds_list.sort(key=lambda x: x.step.data.max())
    295 ds_list.sort(key=lambda x: x.time.data.max())
    297 # Reshape list with dimensions (len(DATES), len(fxx))

File ~/.conda/envs/radar/lib/python3.9/site-packages/herbie/fast.py:294, in FastHerbie.xarray.<locals>.<lambda>(x)
    291     ds_list = [H.xarray(**xarray_kwargs) for H in self.file_exists]
    293 # Sort the DataSets, first by lead time (step), then by run time (time)
--> 294 ds_list.sort(key=lambda x: x.step.data.max())
    295 ds_list.sort(key=lambda x: x.time.data.max())
    297 # Reshape list with dimensions (len(DATES), len(fxx))

AttributeError: 'list' object has no attribute 'step'

Also, a couple of suggestions regarding ECMWF data:

By default, choosing product "enfo" returns one dataset with the ensemble members and one dataset with the ensemble control (I think this is the control run, not the ensemble mean). It would be nice to update the product strings to allow retrieval of one or the other and not both.
While attempting to debug this issue I updated to the latest version. After doing so, I noticed the download speed was MUCH slower. It appears to be downloading the ensemble members one at a time, rather than all from a single file. Is there a way to revert to the older download method (I was on v0.0.10)?


blaylockbk · 2023-05-25T18:03:25Z

Hi @karlwx,

Thanks for reporting this. This gives me a number of things to think about; I can't promise I'll make any changes any time soon, but I will be looking into this.

I admit, I haven't spent much time testing the ECMWF data with the FastHerbie implementation.

> By default, choosing product "enfo" returns one dataset with the ensemble members and one dataset with the ensemble control (I think this is the control run, not the ensemble mean). It would be nice to update the product strings to allow retrieval of one or the other and not both.

As you pointed out, the reason FastHerbie is not working here is because when cfgrib reads product="enfo", it returns multiple hypercubes; one for member number 0 and another for member numbers 1-50.

ds = Herbie(
    "2023-05-25 00:00",
    model="ecmwf",
    product="enfo",
).xarray(":2t:", verbose=True, remove_grib=False)

len(ds)  # list of 2 datasets
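
For anyone handling this in their own code, here is a rough sketch of why the sort in the traceback fails and one way to cope with a result that may be either a single dataset or a list of them. `FakeDataset` and `as_list` are hypothetical stand-ins (not part of Herbie or xarray) so the snippet runs without downloading anything.

```python
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for an xarray.Dataset; only holds a lead time."""
    step: int  # forecast lead time in hours


def as_list(result):
    """Wrap a single dataset in a list; pass a list through unchanged."""
    return list(result) if isinstance(result, (list, tuple)) else [result]


# One entry per file. Each entry is a list because two hypercubes were found
# (control + members), mirroring what product="enfo" returns above.
per_file = [
    [FakeDataset(step=6), FakeDataset(step=6)],
    [FakeDataset(step=0), FakeDataset(step=0)],
]

# Sorting per_file directly fails: each element is a list, and a list has
# no `.step` attribute (the AttributeError in the traceback).
try:
    per_file.sort(key=lambda x: x.step)
    sort_failed = False
except AttributeError:
    sort_failed = True

# Flattening first gives every element a `.step` to sort on.
flat = [ds for result in per_file for ds in as_list(result)]
flat.sort(key=lambda x: x.step)
steps = [ds.step for ds in flat]
```

Flattening only avoids the AttributeError; the control and member datasets have different shapes, so they would still need to be combined separately.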

This downloads about 30 MB of data (in ~2 mins!!) and the grib messages are spread throughout the file which requires multiple cURL downloads (which could probably be optimized with multithreading; I'll look into that).

I'd suggest modifying your searchString to get only one of these two hypercubes. If you look at the inventory, you see ECMWF does provide the ensemble number. I am assuming "number=NaN" is the control.

H = Herbie("2023-05-25 00:00", model="ecmwf", product="enfo")
H.inventory(":2t:")

# This will get you what must be the control run...
H.inventory(":2t:sfc:g:")

# This will get you member number 1...
H.inventory(":2t:sfc:1:g:")

# This will get you members 1-5 that are in the same hypercube...
H.inventory(":2t:sfc:[1-5]:g:")
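
Since searchString is treated as a regular expression, the patterns above can be checked offline. This is a minimal sketch using Python's `re` module on made-up inventory strings; the exact text Herbie searches for each message is an assumption here, so check `H.inventory()` for the real values. The `\d+` pattern for "any numbered member" is my own addition, not something tested against Herbie.

```python
import re

# Hypothetical inventory search strings, modeled on the patterns used above.
inventory = [
    ":2t:sfc:g:",     # no member number (assumed to be the control)
    ":2t:sfc:1:g:",
    ":2t:sfc:5:g:",
    ":2t:sfc:12:g:",
    ":10u:sfc:1:g:",
]


def matching(pattern, lines):
    """Return the lines that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


control = matching(r":2t:sfc:g:", inventory)
members_1_to_5 = matching(r":2t:sfc:[1-5]:g:", inventory)
all_members = matching(r":2t:sfc:\d+:g:", inventory)
```

Note that `[1-5]` matches a single digit, so member 12 is not picked up by the second pattern because the colon must follow immediately.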

Using this searchString in FastHerbie does work (albeit slowly) and returns the dataset with number=0:

FH = FastHerbie(
    ["2023-05-25 00:00"],
    model="ecmwf",
    product="enfo",
    fxx=[0, 6, 12]
)
ds = FH.xarray(":2t:sfc:g:")
ds

> While attempting to debug this issue I updated to the latest version. After doing so, I noticed the download speed was MUCH slower. It appears to be downloading the ensemble members one at a time, rather than all from a single file. Is there a way to revert to the older download method (I was on v0.0.10)?

You are right! I was having issues with multithreading (running out of memory, Jupyter crashing). Long story short, I don't yet know enough about multithreading/multiprocessing to keep things under control.

By default you will notice that max_threads=None in FastHerbie().xarray(). This downloads each file one at a time. You can still use multithreading and see if you gain some speed by setting max_threads to an integer.

This example took 18 seconds, whereas max_threads=None took 25 seconds on my home WiFi. Both are slow.

FH = FastHerbie(
    ["2023-05-25 00:00"],
    model="ecmwf",
    product="enfo",
    fxx=[0, 6, 12]
)
ds = FH.xarray(":2t:sfc:g:", max_threads=3)
ds
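
To illustrate the difference between the two modes, here is a small sketch of the general pattern using `concurrent.futures`. It is not Herbie's actual implementation; `fake_download` and `fetch_all` are hypothetical names, and the sleep stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(fxx):
    """Hypothetical stand-in for fetching one forecast hour."""
    time.sleep(0.2)  # pretend network latency
    return f"f{fxx:03d}"


def fetch_all(fxx_list, max_threads=None):
    """Fetch one at a time when max_threads is None, else use a thread pool."""
    if max_threads is None:
        return [fake_download(fxx) for fxx in fxx_list]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in input order, regardless of finish order
        return list(pool.map(fake_download, fxx_list))


start = time.perf_counter()
serial = fetch_all([0, 6, 12])
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
threaded = fetch_all([0, 6, 12], max_threads=3)
threaded_seconds = time.perf_counter() - start
```

Threads help most when the time is spent waiting on the network; each worker still holds its own data in memory, which is where the memory trouble mentioned above can come from.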

Hope this gives you some answers!

P.S. I'll have to look back at v0.0.10 and see why that was downloading stuff faster. I can't actually remember what changed since then.

karlwx · 2023-05-26T13:54:59Z

Thanks Brian! I am at least avoiding errors and seeing some improved speed by using your suggestions. I was wondering whether something changed in the download process, or whether the process just became more verbose in recent versions. I was not seeing a bunch of curl requests in v0.0.10.

The bottleneck seems to be the need for multiple requests to download parts of a single file. I remember playing around with python's requests package and I was able to download multiple byte ranges with a single request. If that's possible with curl, that would be a huge improvement in speed!
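
For reference, a multi-range request is a single HTTP `Range` header with comma-separated byte ranges. Below is a minimal sketch of building one and merging ranges that touch, so neighboring GRIB messages become one range. The byte offsets are made up, `coalesce` and `range_header` are hypothetical helpers, and whether a given server honors more than one range per request is a separate question.

```python
def coalesce(ranges):
    """Merge byte ranges that touch or overlap; (start, end) are inclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(ranges):
    """Build an HTTP Range header value such as 'bytes=0-99,200-299'."""
    return "bytes=" + ",".join(f"{s}-{e}" for s, e in coalesce(ranges))


# Hypothetical byte offsets for four GRIB messages in one file
messages = [(0, 99), (100, 249), (1000, 1499), (5000, 5999)]
header = range_header(messages)
```

Even where multiple ranges are not supported, merging adjacent messages this way would cut the number of separate requests.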

blaylockbk · 2023-05-26T16:56:48Z

> I was able to download multiple byte ranges with a single request. If that's possible with curl, that would be a huge improvement in speed!

This is a good idea. I seem to remember that the ability to do a multiple byte range request depends on the server where the data is hosted, and AWS S3 didn't allow multiple byte range requests last I looked (several years ago).

Just googled this...looks like they still don't (which is unfortunate)...
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html

But maybe Azure (where ECMWF data is) does???
Hmmm, possibly not: https://stackoverflow.com/a/57882772/2383070

karlwx · 2023-06-03T13:19:10Z

That's unfortunate! I had the idea to try multiple byte ranges in a single request as a way to get around the hit limits on the NOMADS server. So I know it works there, maybe it would work directly on the ECMWF website (not Azure)?

blaylockbk added the ECMWF (Issues with access to ECMWF open data files) and FastHerbie labels on May 25, 2023
