FEATURE: Parse GRIB `.idx` files #11

JackKelly · 2024-09-04T09:41:40Z

.idx files for GEFS have this form:

<message number>:<byte_offset>:d=<init date in YYYYMMDDHH>:<variable>:<vertical level>:<forecast step>:<ensemble member>

For example:

jack@jack-NUC:~/dev/rust/hypergrib$ head gec00.t00z.pgrb2af000.idx 
1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl
2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl
3:70653:d=2017010100:RH:10 mb:anl:ENS=low-res ctl
4:81565:d=2017010100:UGRD:10 mb:anl:ENS=low-res ctl
5:104906:d=2017010100:VGRD:10 mb:anl:ENS=low-res ctl
6:125690:d=2017010100:HGT:50 mb:anl:ENS=low-res ctl
7:184420:d=2017010100:TMP:50 mb:anl:ENS=low-res ctl
8:208654:d=2017010100:RH:50 mb:anl:ENS=low-res ctl
9:232073:d=2017010100:UGRD:50 mb:anl:ENS=low-res ctl
10:281494:d=2017010100:VGRD:50 mb:anl:ENS=low-res ctl
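
A minimal sketch of parsing one such line in Rust, assuming every line has exactly the seven colon-separated fields shown above. The `IdxRecord` struct and `parse_idx_line` function are hypothetical names for illustration, not hypergrib's actual API, and the init date, level, step and ensemble member are kept as raw strings:

```rust
/// One line of a GEFS `.idx` file. Field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct IdxRecord {
    pub message_number: u32,
    pub byte_offset: u64,
    /// Raw `YYYYMMDDHH` string with the `d=` prefix removed (not decoded).
    pub init_time: String,
    pub variable: String,
    pub vertical_level: String,
    pub step: String,
    pub ensemble_member: String,
}

/// Parse a single `.idx` line into an `IdxRecord`.
pub fn parse_idx_line(line: &str) -> Result<IdxRecord, String> {
    // Split into at most 7 fields so any extra colons stay in the last field.
    let fields: Vec<&str> = line.trim_end().splitn(7, ':').collect();
    if fields.len() != 7 {
        return Err(format!("expected 7 fields, found {} in {line:?}", fields.len()));
    }
    let message_number = fields[0]
        .parse::<u32>()
        .map_err(|e| format!("bad message number {:?}: {e}", fields[0]))?;
    let byte_offset = fields[1]
        .parse::<u64>()
        .map_err(|e| format!("bad byte offset {:?}: {e}", fields[1]))?;
    let init_time = fields[2]
        .strip_prefix("d=")
        .ok_or_else(|| format!("init date should start with `d=`, got {:?}", fields[2]))?;
    Ok(IdxRecord {
        message_number,
        byte_offset,
        init_time: init_time.to_string(),
        variable: fields[3].to_string(),
        vertical_level: fields[4].to_string(),
        step: fields[5].to_string(),
        ensemble_member: fields[6].to_string(),
    })
}
```

Calling `parse_idx_line` on the second line of the listing above would give `message_number = 2`, `byte_offset = 50487` and `variable = "TMP"`. The byte length of each message is not stored in the file, but it can be derived from the difference between consecutive offsets.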

JackKelly · 2024-09-04T10:54:50Z

9416724

JackKelly · 2024-09-04T10:56:43Z

Next steps:

- Remind myself how Zarr stores keys, and how that maps to xarray's labelled arrays (because Kerchunk uses Zarr's concept of keys).
- Understand the Kerchunk manifest format (with a view to using the Kerchunk Parquet format).

JackKelly · 2024-09-04T11:45:02Z

Actually, I might build my own manifest to start with, because I'm not planning to use chunks initially.

Can I use BTreeSets? One set for each dimension. Each set would contain refs to the message structs, but each set would use a different comparison function to order its elements?

JackKelly · 2024-09-04T11:45:45Z

Or maybe BTreeMaps, one for each dim?

JackKelly · 2024-09-04T14:11:11Z

Some thoughts on storing the manifest in memory:

Option 1: Multiple `BTreeMap<dimension_coord_type, BTreeSet<&GribMessage>>`s. One per dimension.

For example, the BTreeMap for the init_time dimension might look like this:

(2020-01-01T00, {refs to all grib messages with this init time})
(2020-01-01T06, {refs to all grib messages with this init time})

To find the appropriate set of grib messages for a query, we'd look up the requested coordinate in the BTreeMap for each dim to get a set of refs to the matching grib messages, and then find the intersection between those sets.

But this requires an enormous amount of duplication. For example, every grib message has every "step" (so "step 0" will map to a set of all grib messages; as will "step 1", etc.).
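
A rough sketch of Option 1, under some simplifying assumptions: only two dimensions are indexed, and messages are identified by an integer `MsgId` (an index into some `Vec` of messages) instead of `&GribMessage`, which avoids lifetimes and the need for `Ord` on the message struct. `DimIndex` and its methods are hypothetical names:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Index into a `Vec` of messages; stands in for `&GribMessage` in this sketch.
pub type MsgId = usize;

/// One BTreeMap per dimension, each mapping a coordinate label to message IDs.
#[derive(Default)]
pub struct DimIndex {
    pub by_init_time: BTreeMap<String, BTreeSet<MsgId>>,
    pub by_step: BTreeMap<u32, BTreeSet<MsgId>>,
}

impl DimIndex {
    /// Register a message under its coordinate on every dimension.
    pub fn insert(&mut self, id: MsgId, init_time: &str, step: u32) {
        self.by_init_time
            .entry(init_time.to_string())
            .or_default()
            .insert(id);
        self.by_step.entry(step).or_default().insert(id);
    }

    /// Messages matching both coordinates: the intersection of the two sets.
    pub fn query(&self, init_time: &str, step: u32) -> BTreeSet<MsgId> {
        match (self.by_init_time.get(init_time), self.by_step.get(&step)) {
            (Some(a), Some(b)) => a.intersection(b).copied().collect(),
            _ => BTreeSet::new(),
        }
    }
}
```

The duplication is visible here: every message ID is stored once per dimension, so with five dimensions the index holds five copies of every ID.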

Option 2: Hierarchy

Basically mimic the directory hierarchy that's usually used to store NWPs. Something like:

init_time / step / variable / vertical_level / ensemble_member

But this will require lots of loops, I think?
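
To get a feel for how many loops Option 2 needs, here is a sketch with only three levels of the hierarchy (init_time / step / variable). It assumes init times are stored as `YYYYMMDDHH` integers and steps as hours, and that the leaf is just a byte offset. The `Tree` alias and the `insert` and `select` functions are hypothetical:

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Byte offset of a GRIB message within its file (illustrative leaf payload).
pub type Offset = u64;

/// init_time -> step -> variable -> offset. Only three levels, for brevity.
pub type Tree = BTreeMap<u32, BTreeMap<u32, BTreeMap<String, Offset>>>;

/// Insert one message, creating intermediate levels as needed.
pub fn insert(tree: &mut Tree, init_time: u32, step: u32, variable: &str, offset: Offset) {
    tree.entry(init_time)
        .or_default()
        .entry(step)
        .or_default()
        .insert(variable.to_string(), offset);
}

/// Collect offsets for one variable over inclusive ranges of init time and step.
/// Note: `BTreeMap::range` panics if a range's start is greater than its end.
pub fn select(
    tree: &Tree,
    init_times: RangeInclusive<u32>,
    steps: RangeInclusive<u32>,
    variable: &str,
) -> Vec<Offset> {
    let mut out = Vec::new();
    for (_, by_step) in tree.range(init_times) {
        for (_, by_variable) in by_step.range(steps.clone()) {
            if let Some(&offset) = by_variable.get(variable) {
                out.push(offset);
            }
        }
    }
    out
}
```

Each extra dimension adds one more nested loop, but `BTreeMap::range` keeps each loop restricted to the requested slice, so nothing outside the query is visited.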

Option 3: Use DuckDB 🙂

This probably requires the least code for me to write. So perhaps this is the most appropriate for the MVP?

Maybe write an extension for DuckDB so DuckDB can directly ingest .idx files?! Although the primary language for DuckDB extensions is C++ (see the extension-template). But there is work ongoing to write extensions in Rust. It sounds like it is just about possible to write extensions today in Rust. But it might be best to wait. TL;DR: I probably shouldn't write a DuckDB extension for my first pass!

JackKelly · 2024-09-04T18:47:22Z

I'm using DuckDB! Very impressed so far!

Next step: Work through the TODOs in crate/hypergrib_manifest/src/lib.rs

JackKelly · 2024-09-06T12:00:37Z

On reflection, I think I might go back to my original idea of manually writing functions to map from requested index ranges to GRIB messages.

Which probably requires a tree of BTreeMaps, similar to a directory hierarchy.

And some good error reporting for when the mapping fails
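
One possible shape for that error reporting, as a sketch only: integer positions along each dimension are first translated to coordinate labels, then looked up in a two-level tree, and each failure mode gets its own error variant. `Manifest`, `LookupError` and `offset_at` are hypothetical names, and only two dimensions are shown:

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Why a mapping from integer indexes to a GRIB message failed.
#[derive(Debug, PartialEq)]
pub enum LookupError {
    IndexOutOfBounds { dim: &'static str, index: usize, len: usize },
    MissingMessage { init_time: u32, step: u32 },
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LookupError::IndexOutOfBounds { dim, index, len } => write!(
                f,
                "index {index} is out of bounds for dimension `{dim}` (length {len})"
            ),
            LookupError::MissingMessage { init_time, step } => {
                write!(f, "no GRIB message for init_time={init_time}, step={step}")
            }
        }
    }
}

impl std::error::Error for LookupError {}

/// Coordinate labels per dimension, plus a tree of init_time -> step -> byte offset.
pub struct Manifest {
    pub init_times: Vec<u32>,
    pub steps: Vec<u32>,
    pub offsets: BTreeMap<u32, BTreeMap<u32, u64>>,
}

/// Translate an integer position along one dimension into its coordinate label.
fn label(dim: &'static str, labels: &[u32], index: usize) -> Result<u32, LookupError> {
    labels
        .get(index)
        .copied()
        .ok_or(LookupError::IndexOutOfBounds { dim, index, len: labels.len() })
}

impl Manifest {
    /// Map integer indexes to a byte offset, reporting which step of the mapping failed.
    pub fn offset_at(&self, init_idx: usize, step_idx: usize) -> Result<u64, LookupError> {
        let init_time = label("init_time", &self.init_times, init_idx)?;
        let step = label("step", &self.steps, step_idx)?;
        self.offsets
            .get(&init_time)
            .and_then(|by_step| by_step.get(&step))
            .copied()
            .ok_or(LookupError::MissingMessage { init_time, step })
    }
}
```

Separating "the index is outside the coordinate labels" from "the labels are valid but no message exists" should make it easier to tell a bad query apart from a gap in the dataset.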

JackKelly · 2024-09-18T17:31:39Z

Next tasks:

- impl GefsKey::try_from<Path>
- test
- More tests for GefsKey::to_path
- define struct GefsCoordLabels
- define and impl Dataset<K, C>
- write some tests which test basic indexing functions!
- impl code to convert from idx to Key

JackKelly · 2024-09-23T16:24:33Z

Rust's HashMap is now based on hashbrown, which is very fast. So maybe I should use a HashMap instead of a BTreeMap? It might be faster, and I don't have to worry about ordering.
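
A sketch of what the HashMap version might look like, assuming the coordinate labels for each dimension are kept separately (for example in sorted `Vec`s). In that case the map never needs to be ordered: a range query becomes a series of point lookups. `SketchKey` is a made-up stand-in for the real key type, with only three fields:

```rust
use std::collections::HashMap;

/// A made-up composite key; the real key would cover every dimension.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SketchKey {
    pub init_time: u32,
    pub step: u32,
    pub variable: String,
}

/// Look up one offset per requested step. `None` marks a missing message.
pub fn offsets_for_steps(
    map: &HashMap<SketchKey, u64>,
    init_time: u32,
    steps: &[u32],
    variable: &str,
) -> Vec<Option<u64>> {
    steps
        .iter()
        .map(|&step| {
            let key = SketchKey { init_time, step, variable: variable.to_string() };
            map.get(&key).copied()
        })
        .collect()
}
```

The trade-off is that a HashMap cannot answer "all steps between 0 and 6" on its own; the caller has to enumerate the labels first.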

JackKelly · 2024-09-30T15:27:15Z

New plan: No HashMap. Instead, we'll algorithmically compute the path of the .idx files and load the .idx files on demand. See #14 (comment) and also see the new design.md in commit a89d30c
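
As a rough illustration of computing a path algorithmically, the sketch below rebuilds only the file name, with the pattern inferred from the single example earlier in this issue (`gec00.t00z.pgrb2af000.idx`). The directory prefix is left out because it isn't shown here, and the function name, the parameters, and the reading of `c00` as the ensemble member, `t00z` as the init hour and `f000` as the forecast hour are all assumptions:

```rust
/// Build a GEFS `.idx` file name such as `gec00.t00z.pgrb2af000.idx`.
/// Assumes `member` is a label like "c00", and that the init hour and
/// forecast hour are zero-padded to 2 and 3 digits respectively.
pub fn idx_filename(member: &str, init_hour: u8, step_hours: u16) -> String {
    format!("ge{member}.t{init_hour:02}z.pgrb2af{step_hours:03}.idx")
}
```

For example, `idx_filename("c00", 0, 0)` reproduces the file name used above. The real mapping would also need the init date and whatever else appears in the directory part of the path.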

JackKelly · 2024-10-08T10:03:55Z

I was planning to parse the parameter abbreviation strings (e.g. "TMP") into gribberish enums (see mpiannucci/gribberish#62). But implementing a clean way to map from the abbrev string to any parameter type was proving slightly tricky.

So, for the hypergrib MVP, I'll not bother decoding the abbrev strings. Instead I'll just use the abbrev strings to refer to the parameter. Specifically: The coordinate labels passed to xarray will just be the abbrev strings, with no additional metadata about the parameter.

Further down the line, we should definitely give the user more information about each parameter. We could use the GRIB2 tables recorded as .csv files in gdal. Perhaps this could be implemented in Python.

There is the issue that some .idx files (like HRRR) use parameter "abbreviations" like var discipline=0 center=7 local_table=1 parmcat=16 parm=201. That's OK for now because that will just be another string. But we definitely should decode that for the user.

For the MVP, I'll also not decode the vertical level or the ensemble member. In a future version we'll decode these.

For the MVP we will decode the step.
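
A minimal sketch of decoding the step field. Only `anl` appears in the example at the top of this issue; the `N hour fcst` form is an assumption based on how wgrib2-style inventories usually write forecast steps, and anything else (accumulation ranges, for instance) is rejected here. `Step` and `parse_step` are hypothetical names:

```rust
/// Decoded forecast step. Only two forms are handled in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Step {
    /// The `anl` (analysis) step.
    Analysis,
    /// A step of the assumed form `N hour fcst`.
    ForecastHours(u32),
}

/// Parse the step field of an `.idx` line.
pub fn parse_step(s: &str) -> Result<Step, String> {
    let s = s.trim();
    if s == "anl" {
        return Ok(Step::Analysis);
    }
    let mut parts = s.split_whitespace();
    // Accept exactly three tokens: a number, "hour", "fcst".
    match (parts.next(), parts.next(), parts.next(), parts.next()) {
        (Some(n), Some("hour"), Some("fcst"), None) => n
            .parse::<u32>()
            .map(Step::ForecastHours)
            .map_err(|e| format!("bad forecast hour {n:?}: {e}")),
        _ => Err(format!("unsupported step string: {s:?}")),
    }
}
```

Returning an error for unrecognised strings, not a silent default, means new step formats surface immediately when a different dataset is tried.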

JackKelly · 2024-10-08T10:15:13Z

I'm gonna close this issue and start more focused issues

Updated plan, as discussed in #11#issuecomment-2399417349

JackKelly self-assigned this Sep 4, 2024

JackKelly added this to the Create manifest from `.idx` files on cloud object storage milestone Sep 4, 2024

JackKelly added a commit that referenced this issue Sep 12, 2024

Remove duckdb #11

d1eeb91

JackKelly added a commit that referenced this issue Sep 17, 2024

Updating design.md. #11

55ff1d6

JackKelly closed this as completed Oct 8, 2024

JackKelly added a commit that referenced this issue Oct 8, 2024

Update design.md

91e688d
