Unable to read ECMWF ensemble into xarray with FastHerbie #196

Open
karlwx opened this issue May 25, 2023 · 4 comments
Labels
ECMWF Issues with access to ECMWF open data files FastHerbie

Comments

@karlwx
Contributor

karlwx commented May 25, 2023

Hey Brian, I'm trying to download a whole run of ECMWF ensemble data with FastHerbie - it seems like the data format is causing things to break. Here is the error:

AttributeError                            Traceback (most recent call last)
Cell In[7], line 2
      1 H = FastHerbie(["2023-05-25 00:00"], model="ecmwf", product="enfo", fxx=[0,6,12])
----> 2 ds = H.xarray(":2t:")

File ~/.conda/envs/radar/lib/python3.9/site-packages/herbie/fast.py:294, in FastHerbie.xarray(self, searchString, max_threads, **xarray_kwargs)
    291     ds_list = [H.xarray(**xarray_kwargs) for H in self.file_exists]
    293 # Sort the DataSets, first by lead time (step), then by run time (time)
--> 294 ds_list.sort(key=lambda x: x.step.data.max())
    295 ds_list.sort(key=lambda x: x.time.data.max())
    297 # Reshape list with dimensions (len(DATES), len(fxx))

File ~/.conda/envs/radar/lib/python3.9/site-packages/herbie/fast.py:294, in FastHerbie.xarray.<locals>.<lambda>(x)
    291     ds_list = [H.xarray(**xarray_kwargs) for H in self.file_exists]
    293 # Sort the DataSets, first by lead time (step), then by run time (time)
--> 294 ds_list.sort(key=lambda x: x.step.data.max())
    295 ds_list.sort(key=lambda x: x.time.data.max())
    297 # Reshape list with dimensions (len(DATES), len(fxx))

AttributeError: 'list' object has no attribute 'step'

Also, a couple of suggestions regarding ECMWF data:

  • By default, choosing product "enfo" returns one dataset with the ensemble members and one dataset with the ensemble control (I think this is the control run, not the ensemble mean). It would be nice to update the product strings to allow retrieval of one or the other and not both.
  • While attempting to debug this issue I updated to the latest version. After doing so, I noticed the download speed was MUCH slower. It appears to be downloading the ensemble members one at a time, rather than all from a single file. Is there a way to revert to the older download method (I was on v0.0.10)?
@blaylockbk blaylockbk added ECMWF Issues with access to ECMWF open data files FastHerbie labels May 25, 2023
@blaylockbk
Owner

Hi @karlwx,

Thanks for reporting this. This gives me a number of things to think about; I can't promise I'll make any changes any time soon, but I will be looking into this.

I admit, I haven't spent much time testing the ECMWF data with the FastHerbie implementation.

By default, choosing product "enfo" returns one dataset with the ensemble members and one dataset with the ensemble control (I think this is the control run, not the ensemble mean). It would be nice to update the product strings to allow retrieval of one or the other and not both.

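As you pointed out, FastHerbie fails here because when cfgrib reads product="enfo" it returns multiple hypercubes: one for member number 0 and another for member numbers 1-50. Herbie's xarray() then returns a list of datasets instead of a single dataset, and the sort in FastHerbie cannot handle a list.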

from herbie import Herbie

ds = Herbie(
    "2023-05-25 00:00",
    model="ecmwf",
    product="enfo",
).xarray(":2t:", verbose=True, remove_grib=False)

len(ds)  # list of 2 datasets
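
To illustrate why the sort in the traceback fails, here is a minimal sketch that uses stand-in objects instead of real xarray datasets. The names FakeData, FakeDataset, and flatten are hypothetical and not part of Herbie; flattening is only one possible workaround, not necessarily how FastHerbie should handle this.

```python
from types import SimpleNamespace


class FakeData:
    """Stand-in for a NumPy array; only max() is needed here."""

    def __init__(self, values):
        self.values = list(values)

    def max(self):
        return max(self.values)


class FakeDataset:
    """Stand-in for an xarray.Dataset exposing .step.data.max()."""

    def __init__(self, step_hours, label):
        self.step = SimpleNamespace(data=FakeData([step_hours]))
        self.label = label


def flatten(items):
    """Flatten one level so lists of datasets become individual datasets."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


# Each forecast hour yields a list of two hypercubes (control and members),
# so ds_list is a list of lists rather than a list of datasets.
ds_list = [
    [FakeDataset(12, "control"), FakeDataset(12, "members")],
    [FakeDataset(0, "control"), FakeDataset(0, "members")],
]

try:
    ds_list.sort(key=lambda x: x.step.data.max())
    error_message = None
except AttributeError as err:
    # Same failure mode as in the traceback: a list has no 'step' attribute.
    error_message = str(err)

# After flattening, every element has a 'step' and the sort succeeds.
flat = flatten(ds_list)
flat.sort(key=lambda x: x.step.data.max())
sorted_steps = [d.step.data.max() for d in flat]
```

Narrowing the searchString so that cfgrib returns a single hypercube avoids the problem without any flattening.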

This downloads about 30 MB of data (in ~2 mins!!). The GRIB messages are spread throughout the file, which requires multiple cURL downloads (this could probably be optimized with multithreading; I'll look into that).

I'd suggest modifying your searchString to get only one of these two hypercubes. If you look at the inventory, you see ECMWF does provide the ensemble number. I am assuming "number=NaN" is the control.

H = Herbie("2023-05-25 00:00", model="ecmwf", product="enfo")
H.inventory(":2t:")

[image: inventory output]

# This will get you what must be the control run...
H.inventory(":2t:sfc:g:")

# This will get you member number 1...
H.inventory(":2t:sfc:1:g:")

# This will get you members 1-5 that are in the same hypercube...
H.inventory(":2t:sfc:[1-5]:g:")
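
Since searchString is treated as a regular expression, the three patterns above can be checked with Python's re module. This is a rough sketch: the inventory strings below are hypothetical, made up to match the patterns in this comment, and the real index lines contain more fields.

```python
import re

# Hypothetical search strings: one control entry (no member number)
# and 50 perturbed members numbered 1-50.
inventory = [":2t:sfc:g:"] + [f":2t:sfc:{n}:g:" for n in range(1, 51)]


def select(pattern):
    """Return the inventory entries that match the regular expression."""
    return [entry for entry in inventory if re.search(pattern, entry)]


control = select(r":2t:sfc:g:")             # entry without a member number
member_1 = select(r":2t:sfc:1:g:")          # member 1 only, not 11 or 21
members_1_to_5 = select(r":2t:sfc:[1-5]:g:")  # [1-5] matches a single digit
all_members = select(r":2t:sfc:\d+:g:")     # any member number
```

Note that [1-5] matches exactly one character, so it selects members 1 through 5 and not 10 through 50; a pattern such as \d+ would be needed to select every numbered member.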

Using this searchString in FastHerbie does work (albeit slowly) and returns the dataset with number=0:

from herbie import FastHerbie

FH = FastHerbie(
    ["2023-05-25 00:00"],
    model="ecmwf",
    product="enfo",
    fxx=[0, 6, 12]
)
ds = FH.xarray(":2t:sfc:g:")
ds



While attempting to debug this issue I updated to the latest version. After doing so, I noticed the download speed was MUCH slower. It appears to be downloading the ensemble members one at a time, rather than all from a single file. Is there a way to revert to the older download method (I was on v0.0.10)?

You are right! I was having issues with multithreading (running out of memory, Jupyter crashing). Long story short, I don't yet know enough about multithreading/multiprocessing to keep things under control.

By default you will notice that max_threads=None in FastHerbie().xarray(). This downloads each file one at a time. You can still use multithreading and see if you gain some speed by setting max_threads to an integer.

The example below took 18 seconds with max_threads=3, whereas max_threads=None took 25 seconds on my home WiFi. Both are slow.

FH = FastHerbie(
    ["2023-05-25 00:00"],
    model="ecmwf",
    product="enfo",
    fxx=[0, 6, 12]
)
ds = FH.xarray(":2t:sfc:g:", max_threads=3)
ds
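
For anyone curious about the difference between the two modes, here is a minimal sketch of the serial versus threaded pattern using concurrent.futures. The functions fake_download and fetch_all are hypothetical stand-ins (the sleep simulates network latency); this is not Herbie's actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(fxx):
    """Pretend to download one forecast hour; the sleep mimics network wait."""
    time.sleep(0.1)
    return f"f{fxx:03d}"


def fetch_all(fxx_list, max_threads=None):
    """Fetch every lead time, serially or with a thread pool."""
    if max_threads is None:
        # One file at a time, like the default behavior described above.
        return [fake_download(fxx) for fxx in fxx_list]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in input order, regardless of finish order.
        return list(pool.map(fake_download, fxx_list))


serial = fetch_all([0, 6, 12])
threaded = fetch_all([0, 6, 12], max_threads=3)
```

Threads help here because the work is mostly waiting on the network, but each thread holds its own data in memory, which is consistent with the memory trouble mentioned above.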

Hope this gives you some answers!

P.S. I'll have to look back at v0.0.10 and see why that was downloading stuff faster. I can't actually remember what changed since then.

@karlwx
Contributor Author

karlwx commented May 26, 2023

Thanks Brian! I am at least avoiding errors and seeing some improved speed by using your suggestions. I was wondering if something changed in the download process, or if the process just became more verbose in recent versions. I was not seeing a bunch of curl requests in v0.0.10.

The bottleneck seems to be the need for multiple requests to download parts of a single file. I remember playing around with python's requests package and I was able to download multiple byte ranges with a single request. If that's possible with curl, that would be a huge improvement in speed!
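
As a rough sketch of the idea, the HTTP Range header can carry several comma-separated byte ranges in one request. The helper build_range_header below is a hypothetical name and the byte offsets are made up; whether a given server honors a multi-range request is up to that server.

```python
def build_range_header(ranges):
    """Format inclusive (start, end) byte offsets as an HTTP Range value."""
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f"invalid byte range: {start}-{end}")
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)


# Made-up offsets for two GRIB messages in the same file.
headers = {"Range": build_range_header([(0, 99), (500, 749)])}
```

A server that supports this answers with status 206 and a multipart/byteranges body that has to be split into its parts; a server that does not may ignore the header or return only part of what was asked for, so the response needs checking before use.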

@blaylockbk
Owner

blaylockbk commented May 26, 2023

I was able to download multiple byte ranges with a single request. If that's possible with curl, that would be a huge improvement in speed!

This is a good idea. I seem to remember that the ability to do a multiple byte range request depends on the server where the data is hosted, and AWS S3 didn't allow multiple byte range requests last I looked (several years ago).

Just googled this...looks like they still don't (which is unfortunate)...
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html

But maybe Azure (where ECMWF data is) does???
Hmmm, possibly not: https://stackoverflow.com/a/57882772/2383070
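
If multi-range requests are not available, one fallback is to merge byte ranges that are adjacent or close together so fewer single-range requests are needed. This is only a sketch under that assumption; coalesce and max_gap are hypothetical names, and the offsets are made up.

```python
def coalesce(ranges, max_gap=0):
    """Merge inclusive (start, end) ranges separated by at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1 + max_gap:
            # Extend the previous range instead of starting a new request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three messages, two of them back to back: two requests instead of three.
requests_needed = coalesce([(0, 99), (100, 199), (500, 599)])

# Allowing a 50-byte gap trades a little extra download for one less request.
with_gap = coalesce([(0, 99), (150, 199)], max_gap=50)
```

A larger max_gap means fewer requests but more unwanted bytes downloaded, so the best value depends on how scattered the messages are in the file.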

@karlwx
Contributor Author

karlwx commented Jun 3, 2023

That's unfortunate! I had the idea to try multiple byte ranges in a single request as a way to get around the hit limits on the NOMADS server, so I know it works there. Maybe it would work directly on the ECMWF website (not Azure)?
