FEATURE: Parse GRIB .idx files #11

Closed
JackKelly opened this issue Sep 4, 2024 · 12 comments

@JackKelly
Owner

.idx files for GEFS have this form:

<message number>:<byte_offset>:d=<init date in YYYYMMDDHH>:<variable>:<vertical level>:<forecast step>:<ensemble member>

For example:

jack@jack-NUC:~/dev/rust/hypergrib$ head gec00.t00z.pgrb2af000.idx 
1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl
2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl
3:70653:d=2017010100:RH:10 mb:anl:ENS=low-res ctl
4:81565:d=2017010100:UGRD:10 mb:anl:ENS=low-res ctl
5:104906:d=2017010100:VGRD:10 mb:anl:ENS=low-res ctl
6:125690:d=2017010100:HGT:50 mb:anl:ENS=low-res ctl
7:184420:d=2017010100:TMP:50 mb:anl:ENS=low-res ctl
8:208654:d=2017010100:RH:50 mb:anl:ENS=low-res ctl
9:232073:d=2017010100:UGRD:50 mb:anl:ENS=low-res ctl
10:281494:d=2017010100:VGRD:50 mb:anl:ENS=low-res ctl
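A minimal sketch of parsing one such line, assuming the seven colon-separated fields described above. The names `IdxRecord` and `parse_idx_line` are hypothetical (not hypergrib's actual types), and every field other than the message number and byte offset is kept as an undecoded string:

```rust
/// One line of a GEFS `.idx` file. Hypothetical struct, for illustration only.
#[derive(Debug, PartialEq)]
pub struct IdxRecord {
    pub message_number: u32,
    pub byte_offset: u64,
    pub init_time: String, // YYYYMMDDHH, kept as text
    pub variable: String,
    pub vertical_level: String,
    pub step: String,
    pub ensemble_member: String,
}

/// Parse a single `.idx` line into an `IdxRecord`.
pub fn parse_idx_line(line: &str) -> Result<IdxRecord, String> {
    // splitn(7, ..) keeps any ':' inside the final field intact.
    let fields: Vec<&str> = line.trim_end().splitn(7, ':').collect();
    if fields.len() != 7 {
        return Err(format!("expected 7 fields, found {}: {line:?}", fields.len()));
    }
    let message_number = fields[0]
        .parse::<u32>()
        .map_err(|e| format!("bad message number {:?}: {e}", fields[0]))?;
    let byte_offset = fields[1]
        .parse::<u64>()
        .map_err(|e| format!("bad byte offset {:?}: {e}", fields[1]))?;
    // The init date is written as `d=YYYYMMDDHH`.
    let init_time = fields[2]
        .strip_prefix("d=")
        .ok_or_else(|| format!("missing `d=` prefix in {:?}", fields[2]))?;
    Ok(IdxRecord {
        message_number,
        byte_offset,
        init_time: init_time.to_string(),
        variable: fields[3].to_string(),
        vertical_level: fields[4].to_string(),
        step: fields[5].to_string(),
        ensemble_member: fields[6].to_string(),
    })
}
```

For example, `parse_idx_line("2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl")` yields a record with `byte_offset` 50487 and `variable` "TMP".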
@JackKelly
Owner Author

9416724

@JackKelly
Owner Author

Next steps:

  • Remind myself how Zarr stores keys, and how that maps to xarray's labelled arrays. (Because Kerchunk uses Zarr's concept of keys).
  • Understand the Kerchunk manifest format (with a view to using the Kerchunk Parquet format)

@JackKelly
Owner Author

Actually, I might build my own to start with. Because I'm not planning to use chunks to start with.

Can I use BTreeSets? One set for each dimension. Each set contains a ref to the message struct. But each set would use a different comparison function to order its elements??

Or maybe BTreeMaps, one for each dim?

@JackKelly
Owner Author

JackKelly commented Sep 4, 2024

Some thoughts on storing the manifest in memory:

Option 1: Multiple BTreeMap<dimension_coord_type, BTreeSet<&GribMessage>>s. One per dimension.

For example, the BTreeMap for the init_time dimension might look like this:

(2020-01-01T00, {refs to all grib messages with this init time})
(2020-01-01T06, {refs to all grib messages with this init time})

To find the appropriate set of grib messages for a query, we'd look up the requested coordinate in the BTreeMap for each dim, to get one set of refs per dim, and then find the intersection between those sets.

But this requires an enormous amount of duplication. For example, every grib message has every "step" (so "step 0" will map to a set of all grib messages; as will "step 1", etc.).
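A rough sketch of Option 1, under some simplifying assumptions: message IDs (`usize` positions in some message list) stand in for the `&GribMessage` refs, coordinate labels are plain strings, and the names `DimIndex` and `query` are made up for illustration:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Index for one dimension: coordinate label -> IDs of the messages with that label.
/// The IDs stand in for the `&GribMessage` refs (an assumption of this sketch).
pub type DimIndex = BTreeMap<String, BTreeSet<usize>>;

/// Select one label per dimension and intersect the resulting sets.
pub fn query(selections: &[(&DimIndex, &str)]) -> BTreeSet<usize> {
    let mut result: Option<BTreeSet<usize>> = None;
    for (index, label) in selections {
        // An unknown label selects no messages.
        let matches = index.get(*label).cloned().unwrap_or_default();
        result = Some(match result {
            None => matches,
            Some(acc) => acc.intersection(&matches).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

Every message ID appears once in every dimension's index, which is the duplication noted above.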

Option 2: Hierarchy

Basically mimic the directory hierarchy that's usually used to store NWPs. Something like:

init_time / step / variable / vertical_level / ensemble_member

But this will require lots of loops, I think?
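To make the "lots of loops" concern concrete, here is a sketch of Option 2 as nested BTreeMaps. All type and function names are hypothetical, the leaf value is assumed to be a byte offset, and the step is assumed to be stored as whole hours:

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

// One map per level of: init_time / step / variable / vertical_level / ensemble_member
pub type ByMember = BTreeMap<String, u64>; // ensemble_member -> byte offset
pub type ByLevel = BTreeMap<String, ByMember>;
pub type ByVariable = BTreeMap<String, ByLevel>;
pub type ByStep = BTreeMap<u32, ByVariable>; // step in hours
pub type Manifest = BTreeMap<String, ByStep>; // keyed by init_time

pub fn insert(
    m: &mut Manifest, init: &str, step: u32, var: &str, level: &str, member: &str, offset: u64,
) {
    m.entry(init.to_string()).or_default()
        .entry(step).or_default()
        .entry(var.to_string()).or_default()
        .entry(level.to_string()).or_default()
        .insert(member.to_string(), offset);
}

/// A fully specified key needs no loops, just one lookup per level.
pub fn lookup(
    m: &Manifest, init: &str, step: u32, var: &str, level: &str, member: &str,
) -> Option<u64> {
    m.get(init)?.get(&step)?.get(var)?.get(level)?.get(member).copied()
}

/// A partially specified query needs a loop for each dimension left open.
pub fn offsets_for_variable(
    m: &Manifest, init: &str, steps: RangeInclusive<u32>, var: &str,
) -> Vec<u64> {
    let mut out = Vec::new();
    if let Some(by_step) = m.get(init) {
        for (_, by_var) in by_step.range(steps) {
            if let Some(by_level) = by_var.get(var) {
                for by_member in by_level.values() {
                    out.extend(by_member.values().copied());
                }
            }
        }
    }
    out
}
```

Because the step map is ordered, `range` can pull out a contiguous run of steps directly, which a HashMap cannot do.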

Option 3: Use DuckDB 🙂

This probably requires the least code for me to write. So perhaps this is the most appropriate for the MVP?

Maybe write an extension for DuckDB so DuckDB can directly ingest .idx files?! Although the primary language for DuckDB extensions is C++ (see the extension-template). But there is work ongoing to write extensions in Rust. It sounds like it is just about possible to write extensions today in Rust. But it might be best to wait. TL;DR: I probably shouldn't write a DuckDB extension for my first pass!

@JackKelly
Owner Author

I'm using DuckDB! Very impressed so far!

Next step: Work through the TODOs in crate/hypergrib_manifest/src/lib.rs

@JackKelly
Owner Author

On reflection, I think I might go back to my original idea of manually writing functions to map from requested index ranges to GRIB messages.

Which probably requires a tree of BTreeMaps, similar to a directory hierarchy.

And some good error reporting for when the mapping fails
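One possible shape for that mapping and its error reporting, reduced to two dimensions to keep it short. Everything here is an illustrative assumption: the `MappingError` enum, the `Tree` alias, the function names, and the idea that each dimension has an ordered list of coordinate labels which integer index ranges select from:

```rust
use std::collections::BTreeMap;
use std::fmt;
use std::ops::Range;

#[derive(Debug, PartialEq)]
pub enum MappingError {
    IndexOutOfRange { dim: &'static str, requested: Range<usize>, len: usize },
    MissingMessage { init_time: String, variable: String },
}

impl fmt::Display for MappingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::IndexOutOfRange { dim, requested, len } => write!(
                f,
                "requested {dim}[{}..{}] but {dim} has only {len} labels",
                requested.start, requested.end
            ),
            Self::MissingMessage { init_time, variable } => write!(
                f,
                "no GRIB message for init_time={init_time}, variable={variable}"
            ),
        }
    }
}

impl std::error::Error for MappingError {}

/// init_time -> variable -> byte offset (a two-level stand-in for the full tree).
pub type Tree = BTreeMap<String, BTreeMap<String, u64>>;

/// Turn an index range into the coordinate labels it selects.
fn select_labels<'a>(
    dim: &'static str, labels: &'a [String], r: &Range<usize>,
) -> Result<&'a [String], MappingError> {
    labels.get(r.clone()).ok_or_else(|| MappingError::IndexOutOfRange {
        dim,
        requested: r.clone(),
        len: labels.len(),
    })
}

/// Map requested index ranges to byte offsets, reporting exactly what failed.
pub fn locate(
    tree: &Tree, init_times: &[String], variables: &[String],
    init_idx: Range<usize>, var_idx: Range<usize>,
) -> Result<Vec<u64>, MappingError> {
    let mut offsets = Vec::new();
    for init_time in select_labels("init_time", init_times, &init_idx)? {
        for variable in select_labels("variable", variables, &var_idx)? {
            let offset = tree
                .get(init_time)
                .and_then(|by_var| by_var.get(variable))
                .copied()
                .ok_or_else(|| MappingError::MissingMessage {
                    init_time: init_time.clone(),
                    variable: variable.clone(),
                })?;
            offsets.push(offset);
        }
    }
    Ok(offsets)
}
```

The error variants carry the dimension name and the offending labels, so a failed lookup can say which coordinate was missing rather than just that something was.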

JackKelly added a commit that referenced this issue Sep 12, 2024
JackKelly added a commit that referenced this issue Sep 17, 2024
@JackKelly
Owner Author

Next tasks:

  • impl GefsKey::try_from<Path>
  • test
  • More tests for GefsKey::to_path
  • define struct GefsCoordLabels
  • define and impl Dataset<K, C>
  • write some tests which test basic indexing functions!
  • impl code to convert from idx to Key

@JackKelly
Owner Author

JackKelly commented Sep 23, 2024

Rust's HashMap is now based on hashbrown, which is very fast. So maybe I should use a HashMap instead of a BTreeMap? It might be faster, and I wouldn't have to worry about ordering.

@JackKelly
Owner Author

JackKelly commented Sep 30, 2024

New plan: No HashMap. Instead, we'll algorithmically compute the path of the .idx files and load the .idx files on demand. See #14 (comment) and also see the new design.md in commit a89d30c
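As a sketch of what "algorithmically compute the path" could look like, the function below rebuilds only the file name, with the pattern inferred from the single example `gec00.t00z.pgrb2af000.idx` shown earlier in this issue. The function name, the argument types, and the assumption that the step is always zero-padded to three digits are all guesses, and the directory prefix is deliberately left out because it isn't given here:

```rust
/// Build a GEFS `.idx` file name.
/// Pattern inferred from one example (`gec00.t00z.pgrb2af000.idx`); treat it as an assumption.
pub fn gefs_idx_filename(member: &str, init_hour: u8, step_hours: u16) -> String {
    // `member` is the code that follows "ge" in the file name, e.g. "c00".
    format!("ge{member}.t{init_hour:02}z.pgrb2af{step_hours:03}.idx")
}
```

With this, `gefs_idx_filename("c00", 0, 0)` reproduces the example name, so no in-memory map of file paths is needed.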

@JackKelly
Owner Author

JackKelly commented Oct 8, 2024

I was planning to parse the parameter abbreviation strings (e.g. "TMP") into gribberish enums (see mpiannucci/gribberish#62). But implementing a clean way to map from the abbrev string to any parameter type was proving slightly tricky.

So, for the hypergrib MVP, I'll not bother decoding the abbrev strings. Instead I'll just use the abbrev strings to refer to the parameter. Specifically: The coordinate labels passed to xarray will just be the abbrev strings, with no additional metadata about the parameter.

Further down the line, we should definitely give the user more information about each parameter. We could use the GRIB2 tables that GDAL ships as .csv files. Perhaps this could be implemented in Python.

There is the issue that some .idx files (like HRRR) use parameter "abbreviations" like var discipline=0 center=7 local_table=1 parmcat=16 parm=201. That's OK for now because that will just be another string. But we definitely should decode that for the user.
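For a later version, decoding that style of string could start with something like the sketch below. The `RawParameter` struct and `parse_raw_parameter` function are hypothetical, the integer widths are guesses, and the only keys handled are the five in the example above (a key missing from the string is left at 0):

```rust
/// Numeric parameter identifiers from an undecoded `var ...` string.
/// Hypothetical struct; field widths are assumptions.
#[derive(Debug, Default, PartialEq)]
pub struct RawParameter {
    pub discipline: u8,
    pub center: u16,
    pub local_table: u8,
    pub parmcat: u8,
    pub parm: u8,
}

/// Returns None if the string is not of the form `var key=value key=value ...`.
pub fn parse_raw_parameter(s: &str) -> Option<RawParameter> {
    let rest = s.strip_prefix("var ")?;
    let mut p = RawParameter::default();
    for pair in rest.split_whitespace() {
        let (key, value) = pair.split_once('=')?;
        match key {
            "discipline" => p.discipline = value.parse().ok()?,
            "center" => p.center = value.parse().ok()?,
            "local_table" => p.local_table = value.parse().ok()?,
            "parmcat" => p.parmcat = value.parse().ok()?,
            "parm" => p.parm = value.parse().ok()?,
            _ => return None, // unknown key: refuse rather than guess
        }
    }
    Some(p)
}
```

An ordinary abbreviation such as "TMP" returns `None`, so the caller can fall back to treating it as a plain string.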

For the MVP, I'll also not decode the vertical level or the ensemble member. In a future version we'll decode these.

For the MVP we will decode the step.
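A minimal sketch of step decoding, assuming only two forms: `anl` (as in the example at the top of this issue), treated as a zero-length step, and strings like `6 hour fcst`, which is an assumption about what forecast steps look like since no such line appears in this issue. Anything else returns `None`. The function name is made up:

```rust
use std::time::Duration;

/// Decode the step field of an `.idx` line into a duration.
/// Handles only "anl" and "<N> hour fcst" (the latter is an assumed format).
pub fn parse_step(step: &str) -> Option<Duration> {
    if step == "anl" {
        // Analysis: zero hours after the init time.
        return Some(Duration::ZERO);
    }
    let hours: u64 = step.strip_suffix(" hour fcst")?.parse().ok()?;
    hours.checked_mul(3600).map(Duration::from_secs)
}
```

Returning `None` for unrecognised forms leaves room to add other step formats later without silently misreading them.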

@JackKelly
Owner Author

I'm gonna close this issue and start more focused issues

JackKelly added a commit that referenced this issue Oct 8, 2024
Updated plan, as discussed in #11#issuecomment-2399417349