Add idx metadata file handling #41
Example from the HRRR:
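For illustration, the first few lines of a typical HRRR `.idx` file look like this (the byte offsets below are made up; real values vary from file to file):

```text
1:0:d=2023011312:REFC:entire atmosphere:anl:
2:366530:d=2023011312:RETOP:cloud top:anl:
3:580355:d=2023011312:VIL:entire atmosphere:anl:
4:787643:d=2023011312:VIS:surface:anl:
```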
@emfdavid, can you please link to the most general overview of your work on .idx files here?
I created a parser that infers the offset and length
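The inference itself is simple: each `.idx` row only records where its message starts, so a message's length is the next row's offset minus its own, and the last message runs to the end of the file. A minimal sketch of that logic (not the actual parser referred to above):

```rust
/// Given the byte offsets listed in an `.idx` file (one per GRIB message,
/// in file order) and the total size of the GRIB file, return an
/// (offset, length) pair for each message.
fn offsets_to_ranges(offsets: &[u64], file_len: u64) -> Vec<(u64, u64)> {
    offsets
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            // The message ends where the next one begins, or at end-of-file.
            let end = offsets.get(i + 1).copied().unwrap_or(file_len);
            (start, end - start)
        })
        .collect()
}
```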
@martindurant the notebook in that directory provides the overview of how to use these methods. The pangeo showcase talk has my narrated version.
FWIW, I've started tinkering with parsing `.idx` files (I stumbled across this github issue after I started tinkering). @mpiannucci please shout if you'd prefer to take this in a different direction. Here's a first sketch using Rust's csv and serde crates:

```rust
#[derive(PartialEq, Debug, serde::Deserialize)]
struct IdxRecord {
    msg_id: u32,
    byte_offset: u32,
    init_time: String,       // TODO: Use DateTime<Utc>?
    nwp_variable: String,    // TODO: Use NWPVariable enum?
    vertical_level: String,  // TODO: Use VerticalLevel enum?
    forecast_step: String,   // TODO: Use TimeDelta?
    ensemble_member: String, // TODO: Use EnsembleMember enum?
}

/// `b` is the contents of an `.idx` file.
fn parse_idx(b: &[u8]) -> anyhow::Result<Vec<IdxRecord>> {
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b':')
        .has_headers(false)
        .from_reader(b);
    let mut records = vec![];
    for result in rdr.deserialize() {
        records.push(result?);
    }
    Ok(records)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_idx() -> anyhow::Result<()> {
        let idx_text = "\
1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl
2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl
3:70653:d=2017010100:RH:10 mb:anl:ENS=low-res ctl
4:81565:d=2017010100:UGRD:10 mb:anl:ENS=low-res ctl
";
        let records = parse_idx(idx_text.as_bytes())?;
        assert_eq!(records.len(), 4);
        assert_eq!(
            records[0],
            IdxRecord {
                msg_id: 1,
                byte_offset: 0,
                init_time: String::from("d=2017010100"),
                nwp_variable: String::from("HGT"),
                vertical_level: String::from("10 mb"),
                forecast_step: String::from("anl"),
                ensemble_member: String::from("ENS=low-res ctl"),
            }
        );
        Ok(())
    }
}
```
Thanks @JackKelly. Take whatever you need from the kerchunk files and outstanding PRs on the topic, of course. Most of that is not around the idx files themselves (which are simple) but how to map them onto sets of grib files, and how to query these mappings to make logical data sets.
@mpiannucci I see that commit 01745ad implemented part of this. Is the intention of this github issue to implement a full `.idx` parser?
It would be great to have a complete solution in Rust, but if you want to jump-start things, the real work in Rust is reading the grib files. You should be able to change the codec in these demo files pretty easily if you want to experiment.
This is sooo cool to see!
Yeah that was my intention, but I clearly have not gotten to build it. I would LOVE to have this as part of the gribberish repo if folks are open to that.
@emfdavid wrote:
I hear you: The main performance benefits of using Rust will come from reading the grib files themselves, rather than the `.idx` files. For the MVP of hypergrib, though, I'm planning to lean on the `.idx` files. Plus, I'd love to help with gribberish.
@mpiannucci wrote:
Great! Let's make it happen 🙂 First, please may I ask a few quick questions about how best to approach this? 1) Where do I find the most authoritative definition of
This is exactly what we deal with in kerchunk - we scan the first of a set of files to get all the data we need, and then use the .idx files for all the rest to infer the complete data for all of them. From your point of view, this is somewhere between solutions 2) and 3).
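A minimal sketch of that idea, with hypothetical types (this is not kerchunk's actual data model): scan one GRIB file to capture the per-message metadata that `.idx` files don't carry, then pair that template with the byte ranges parsed from every other file's `.idx`:

```rust
/// Per-message metadata that only a full GRIB scan provides
/// (grid definition, packing, units, ...). Hypothetical type.
struct MessageTemplate {
    variable: String,
    vertical_level: String,
}

/// Byte range of one message, parsed from an `.idx` file.
struct ByteRange {
    offset: u64,
    length: u64,
}

/// Reference to one chunk: (variable, file path, offset, length).
type ChunkRef = (String, String, u64, u64);

/// Combine the template scanned from the first file with the `.idx`-derived
/// byte ranges of another file in the same dataset. Assumes both files
/// contain the same messages in the same order.
fn build_references(template: &[MessageTemplate], ranges: &[ByteRange], path: &str) -> Vec<ChunkRef> {
    template
        .iter()
        .zip(ranges)
        .map(|(t, r)| (t.variable.clone(), path.to_string(), r.offset, r.length))
        .collect()
}
```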
That's great to know, thanks @martindurant. Sorry to ask a slightly off-topic question, but: what does kerchunk do if the metadata changes over the course of a dataset? e.g. an NWP dataset which is, say, 0.5 degree horizontal resolution from 2015 to 2020, but gets upgraded in 2021 to 0.25 degree resolution. (I've been worrying about this exact issue for hypergrib.)
It doesn't "do" anything; you would, I think, have to process the two batches separately. If the chunking is actually different in the overall dataset, then this can't be represented in the zarr model at all without at least variable chunking, and maybe not at all. This is why virtualizarr is a more general solution, and the two projects should probably be tied more closely together :)
On the topic of how to handle the metadata: I'm sure everyone has considered this already, but it feels like there are three distinct types of message metadata:

1) The coordinates needed to place each message in the hypercube: init time, NWP variable, vertical level, forecast step and ensemble member.
2) The byte offset and length of each message within its GRIB file.
3) The static metadata shared across messages: compression, grid dimensions and product attributes.
IIUC, the Maybe a 5th option (following on from the 4 "potential paths forward" listed above) would be to split
These are exactly the issues we struggled with last winter. @martindurant and @Anu-Ra-g did a great job adding documentation about how the kerchunk solution works.

What we found is that the idx file does not provide complete Type I metadata - at least not without reverse engineering the wgrib fortran code that wrote the "attrs" column, which includes the product level and step in a form better suited to humans than machines. The solution implemented for kerchunk is to build a mapping from this attrs string to the Type I metadata you need to locate each chunk in the hypercube. The result is a table that stores all the Type I & II data. From this, plus the static metadata about compression, dimensions and product attributes, you can then construct any logical dataset you like.

Operationally we know when NOAA is going to change the model - it is an operational product - but just to be sure, I compare reading the grib file directly with parsing the idx file for 1 in 1000 files to make sure nothing has changed.

I would love to get rid of the mapping and improve the api for building the logical dataset, but I hope understanding one working solution can give you a boost to find a better way. If we come up with a standard form for the Type I & II metadata we may be able to get NOAA/NODD to build and maintain the database for us. That would be really exciting. If we can show that a common form could support kerchunk, virtualizarr and hypergrib, I think we would be well on our way!
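To make the shape of that table concrete, one row might pair the Type I coordinates with the Type II byte range; the struct below is a hypothetical sketch, not kerchunk's actual schema:

```rust
/// One row of a hypothetical chunk-manifest table: the Type I coordinates
/// that place a GRIB message in the hypercube, plus the Type II byte range
/// that locates it in object storage.
struct ChunkRow {
    init_time: String,       // e.g. "2017-01-01T00:00Z"
    nwp_variable: String,    // e.g. "TMP"
    vertical_level: String,  // e.g. "10 mb"
    forecast_step: String,   // e.g. "anl" or "6 hour fcst"
    ensemble_member: String, // e.g. "ENS=low-res ctl"
    path: String,            // object-store key of the GRIB file
    byte_offset: u64,
    byte_length: u64,
}
```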
And the name gribberish is totally awesome! |
This is all super-useful, @emfdavid!
This is fantastic - the data driven mapping between the IDX files and the cf grib metadata is definitely a bit of a hack. If you can build and maintain tables/code based on wgrib that would be strictly better.
The full dataset for all products, all steps & runtimes is billions of chunks for a single perturbation of GEFS.
I would love to see NODD actually build and maintain the manifest database for us. I think we could do it if the community comes together around a table schema. I think there is funding and a path to implementation.
@rabernat, you might have some thoughts on the DB approach being sketched out here and related kerchunk threads. This is not exactly a kerchunk manifest, but a set of chunk details that can be made into various datasets, but no one dataset can use all the references (they wouldn't fit in the data model, with mutually conflicting coordinates).
This is the big goal! None of us wants to read however many PB of data they have, or support the storage/query interface it would need to use. Making manifests for some specific view is doable, but the data is always being updated, so there needs to be a process too.
Have NOAA hinted that they might have appetite for maintaining a public manifest for NODD? |
Yeah. I'm interested in defining a very concise manifest format but, even if the manifest only used a single byte per chunk - which sounds impossibly concise(!) - we're still looking at a manifest that's gigabytes in size. These dang NWP datasets are so BIG! Hmm, I see what you mean... this definitely needs more thought... It feels solvable though...
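Rough numbers behind that estimate (illustrative, not measured):

```text
~10^9 chunks × 1 byte each                          ≈ 1 GB   (an unrealistically tight lower bound)
~10^9 chunks × ~50 bytes (path + offset + length)   ≈ 50 GB
```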
(BTW, I'm going to propose that we move discussion of giga-scale manifests etcetera to a separate thread.) UPDATED WITH LINKS:
Yes - if we do determine we need a manifest, and we come up with an information model that supports the community's needs (hypergrib, virtualizarr, kerchunk), I think we could definitely find funding and enthusiastic support from NODD to operate it. If we can deliver a tool that doesn't need the manifest at all and can load whole datasets by looking up chunks algorithmically - as you are now proposing - well, that would be even better!
@mpiannucci I've made a start on parsing the `.idx` abbreviations. I don't (yet) have great Rust macro skills. Please may I ask your advice: do you think it should be possible to write a Rust macro to automatically convert a string like "TMP" to the appropriate enum variant? At the moment, I'm manually mapping from abbreviation strings to the appropriate enum variant (here's my code so far; we can move this into gribberish later if that's useful).
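The manual mapping being described looks roughly like this; the enum and variant names here are hypothetical, not the actual hypergrib or gribberish types:

```rust
/// A tiny slice of the GRIB parameter table, keyed by the
/// abbreviation that appears in `.idx` files. Hypothetical enum.
#[derive(Debug, PartialEq)]
enum Parameter {
    Temperature,
    GeopotentialHeight,
    RelativeHumidity,
}

impl std::str::FromStr for Parameter {
    type Err = String;

    fn from_str(abbrev: &str) -> Result<Self, Self::Err> {
        match abbrev {
            "TMP" => Ok(Parameter::Temperature),
            "HGT" => Ok(Parameter::GeopotentialHeight),
            "RH" => Ok(Parameter::RelativeHumidity),
            other => Err(format!("unknown GRIB abbreviation: {other}")),
        }
    }
}
```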
I'm afraid I'm moving away from the idea of parsing After the MVP,
That looks like a good resource for a machine-readable form... but to get all the NOAA special variables you will have to take a look at the NCEPLIBS library. I opened an issue asking for the data to be exposed in a machine-readable form, but I don't think it has happened yet.
Yikes! Thanks for flagging that! GRIB is a bit of a mess, isn't it?! It feels like a useful contribution would be "just" to collate all these GRIB code tables into a single place, in a machine-readable form. (I've started a thread to track progress on this idea) |
Sorry, I linked to the wrong gdal path in my comment above. The following is the correct path, which contains more GRIB tables as CSVs: The README for that directory is here:
Does this CSV contain what you need: |
Yes! I wonder how often new entries are added... Of course, many "definitions" come with an implementation too, for example all the coordinate projections.
After posting the comment where I said:
I discovered that the GDAL codebase appears to contain a bunch of vendor-specific tables:
Is this sufficient? Does GDAL already contain all the tables we need?
Obviously I don't know, but I wouldn't be too surprised if GDAL was mostly on top of this. Of course, how up to date that is, is another matter - but probably they have a pretty active user base pushing for updates as they arise.
The trouble is that there are multiple formats for the code tables. GDAL is probably the most comprehensive resource. You'll notice that the format GDAL has probably doesn't match the WMO code tables, but that doesn't really matter. I hate the code tables impl in gribberish and was going to codegen them, but never had the time.
Oh, I'm in awe of your pure-Rust representation of the GRIB tables! Reading your code genuinely expanded my understanding of what can be represented in pure Rust! Making no promises... but I'm wondering about writing a new Rust crate which:
Does that sound useful? Or are you determined to use a pure-Rust representation of the code tables (where the Rust might be codegen'd from the GDAL CSVs)?
This is definitely more of a people problem than a technical problem. |
Good question. I've just requested to be subscribed to the gdal-dev mailing list, so I can ask this question! (The GDAL github issues page says that github issues are only for feature requests and bug reports.)
On the topic of generating Rust code which represents the GRIB code table... Having just seen a great talk at EuroRust on codegen, and then finding this blog post on codegen, I'm now excited about codegen! (Prior to today I didn't know much about codegen and had assumed it'd be very hard.) I've started a new issue: #63
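As a flavour of what that codegen might look like, here is a sketch of a `build.rs` that reads a code-table CSV and emits a Rust lookup function into `OUT_DIR`. It assumes a simple three-column `abbrev,name,unit` file called `code_table.csv`, not GDAL's actual layout:

```rust
// build.rs (sketch). Reads code_table.csv with rows like "TMP,Temperature,K"
// and generates `fn abbrev_to_name(abbrev: &str) -> Option<&'static str>`,
// which the crate pulls in with
// `include!(concat!(env!("OUT_DIR"), "/tables.rs"));`
use std::{env, fs, path::Path};

fn main() {
    let csv_text = fs::read_to_string("code_table.csv").expect("read code_table.csv");
    let mut arms = String::new();
    for line in csv_text.lines().filter(|l| !l.trim().is_empty()) {
        // Assumes well-formed rows; a real build script would validate.
        let mut cols = line.splitn(3, ',');
        let abbrev = cols.next().unwrap().trim();
        let name = cols.next().unwrap().trim();
        arms.push_str(&format!("        {abbrev:?} => Some({name:?}),\n"));
    }
    let generated = format!(
        "pub fn abbrev_to_name(abbrev: &str) -> Option<&'static str> {{\n    match abbrev {{\n{arms}        _ => None,\n    }}\n}}\n"
    );
    let out_path = Path::new(&env::var("OUT_DIR").unwrap()).join("tables.rs");
    fs::write(out_path, generated).expect("write tables.rs");
    println!("cargo:rerun-if-changed=code_table.csv");
}
```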
So, for some of the more opaque descriptions in the IDX files, can we now parse the "level" and "step" descriptions to get coordinates and indices for variables like these?
The vertical intervals are particularly difficult, and probably the more obscure meteorological variables too, but the time average and accumulation variables are really important.
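To make the "step" half of that concrete, the wgrib-style strings fall into a few recurring patterns ("anl", "N hour fcst", "N-M hour acc fcst", "N-M hour ave fcst"). Below is a sketch of a parser for just those common forms; the enum is hypothetical and real `.idx` files contain more variants than this:

```rust
#[derive(Debug, PartialEq)]
enum ForecastStep {
    Analysis,
    Instant { hours: u32 },
    Accumulation { start_hours: u32, end_hours: u32 },
    Average { start_hours: u32, end_hours: u32 },
}

/// Parse the "step" column of an `.idx` file, e.g. "anl", "6 hour fcst",
/// "0-1 hour acc fcst", "0-6 hour ave fcst". Returns None for forms this
/// sketch doesn't handle.
fn parse_step(s: &str) -> Option<ForecastStep> {
    let s = s.trim();
    if s == "anl" {
        return Some(ForecastStep::Analysis);
    }
    let words: Vec<&str> = s.split_whitespace().collect();
    match words.as_slice() {
        [hours, "hour", "fcst"] => Some(ForecastStep::Instant {
            hours: hours.parse().ok()?,
        }),
        [range, "hour", kind, "fcst"] => {
            // e.g. "0-6" -> (0, 6)
            let (start, end) = range.split_once('-')?;
            let (start_hours, end_hours) = (start.parse().ok()?, end.parse().ok()?);
            match *kind {
                "acc" => Some(ForecastStep::Accumulation { start_hours, end_hours }),
                "ave" => Some(ForecastStep::Average { start_hours, end_hours }),
                _ => None,
            }
        }
        _ => None,
    }
}
```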
Not yet! I haven't written any code to parse the body of the `.idx` files. I'm afraid I won't get round to parsing "level" and "step" from the body of the `.idx` files for the MVP; that comes later in my plan for the development of hypergrib. For now, I will extract the level string and product abbreviation from the body of the `.idx` files. Also, before we can parse the body of the `.idx` files, we need the code tables.
This is why I shouldn't name things - but I deeply appreciate the great names other people come up with.
This sounds like a great compromise to get at the meat of the problem in the MVP and leave some nasty string parsing logic to later. Suggest you avoid spending time on the CF attrs as well. Fix that later. Let's see that IO rate flatlined at the NIC limit, then fix the little stuff.
My little
Thanks @JackKelly! The source code is at https://github.com/JackKelly/hypergrib/tree/main/crates/grib_tables, if that wasn't obvious.
Oooh, good point, I've just updated the |
http://gradsusr.org/pipermail/gradsusr/2008-July/007358.html
https://github.com/j-m-adams/GrADS/blob/master/src/gribmap.c
EDIT: I think the formatting comes from wgrib2:
Vertical Levels
https://github.com/NOAA-EMC/NCEPLIBS-grib_util/blob/558fad4ae6121b5e1754177839cf7c8179abcb26/src/wgrib/wgrib.c#L2004
https://github.com/NOAA-EMC/NCEPLIBS-grib_util/blob/558fad4ae6121b5e1754177839cf7c8179abcb26/src/degrib2/prlevel.F90#L57
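For reference, a few of the level strings this formatting produces in real NOAA `.idx` files (not an exhaustive list):

```text
surface
2 m above ground
10 m above ground
entire atmosphere
cloud top
0-0.1 m below ground
500 mb
255-0 mb above ground
```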