
Add idx metadata file handling #41

Open
mpiannucci opened this issue Nov 21, 2023 · 43 comments
Labels: enhancement (New feature or request), python, question (Further information is requested), rust (Pull requests that update Rust code)

Comments

@mpiannucci added the enhancement, question, python, and rust labels on Nov 21, 2023
@mpiannucci
Owner Author

mpiannucci commented Nov 21, 2023

Example from the HRRR

1:0:d=2023031200:REFC:entire atmosphere:735 min fcst:
2:665218:d=2023031200:RETOP:cloud top:735 min fcst:
3:935932:d=2023031200:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:735 min fcst:
4:1715820:d=2023031200:VIL:entire atmosphere:735 min fcst:
5:2136475:d=2023031200:VIS:surface:735 min fcst:
6:3522779:d=2023031200:REFD:1000 m above ground:735 min fcst:
7:3911359:d=2023031200:REFD:4000 m above ground:735 min fcst:
8:4134977:d=2023031200:GUST:surface:735 min fcst:
9:5452398:d=2023031200:UPHL:5000-2000 m above ground:735 min fcst:
10:5468548:d=2023031200:UGRD:80 m above ground:735 min fcst:
11:6663933:d=2023031200:VGRD:80 m above ground:735 min fcst:
12:7872079:d=2023031200:PRES:surface:735 min fcst:
13:9384269:d=2023031200:HGT:surface:735 min fcst:
14:11537964:d=2023031200:TMP:2 m above ground:735 min fcst:
15:12747367:d=2023031200:SPFH:2 m above ground:735 min fcst:
16:14101255:d=2023031200:DPT:2 m above ground:735 min fcst:
17:15296244:d=2023031200:UGRD:10 m above ground:735 min fcst:
18:17677859:d=2023031200:VGRD:10 m above ground:735 min fcst:
19:20059474:d=2023031200:WIND:10 m above ground:730-735 min ave fcst:
20:21952786:d=2023031200:UGRD:10 m above ground:730-735 min ave fcst:
21:23828763:d=2023031200:VGRD:10 m above ground:730-735 min ave fcst:
22:25670085:d=2023031200:DSWRF:surface:720-735 min ave fcst:
23:26068792:d=2023031200:VBDSF:surface:720-735 min ave fcst:
24:26512449:d=2023031200:CPOFP:surface:735 min fcst:
25:26725650:d=2023031200:PRATE:surface:735 min fcst:
26:26830043:d=2023031200:APCP:surface:720-735 min acc fcst:
27:27261050:d=2023031200:WEASD:surface:720-735 min acc fcst:
28:27546180:d=2023031200:FROZR:surface:720-735 min acc fcst:
29:27642511:d=2023031200:CSNOW:surface:735 min fcst:
30:27682184:d=2023031200:CICEP:surface:735 min fcst:
31:27685706:d=2023031200:CFRZR:surface:735 min fcst:
32:27691105:d=2023031200:CRAIN:surface:735 min fcst:
33:27764907:d=2023031200:TCOLW:entire atmosphere:735 min fcst:
34:28910791:d=2023031200:TCOLI:entire atmosphere:735 min fcst:
35:30763769:d=2023031200:HGT:cloud ceiling:735 min fcst:
36:32699978:d=2023031200:HGT:cloud base:735 min fcst:
37:35794489:d=2023031200:HGT:cloud top:735 min fcst:
38:37648546:d=2023031200:ULWRF:top of atmosphere:735 min fcst:
39:39561926:d=2023031200:DSWRF:surface:735 min fcst:
40:40235734:d=2023031200:DLWRF:surface:735 min fcst:
41:42303313:d=2023031200:USWRF:surface:735 min fcst:
42:42823796:d=2023031200:ULWRF:surface:735 min fcst:
43:44447719:d=2023031200:VBDSF:surface:735 min fcst:
44:45083573:d=2023031200:VDDSF:surface:735 min fcst:
45:45765649:d=2023031200:USWRF:top of atmosphere:735 min fcst:
46:46448872:d=2023031200:SBT123:top of atmosphere:735 min fcst:
47:48050397:d=2023031200:SBT124:top of atmosphere:735 min fcst:
48:50411726:d=2023031200:SBT113:top of atmosphere:735 min fcst:
49:51890086:d=2023031200:SBT114:top of atmosphere:735 min fcst:
50:54126340:d=2023031200:REFC:entire atmosphere:750 min fcst:
51:54795062:d=2023031200:RETOP:cloud top:750 min fcst:
52:55064675:d=2023031200:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:750 min fcst:
53:55827086:d=2023031200:VIL:entire atmosphere:750 min fcst:
54:56250218:d=2023031200:VIS:surface:750 min fcst:
55:57637467:d=2023031200:REFD:1000 m above ground:750 min fcst:
56:58030883:d=2023031200:REFD:4000 m above ground:750 min fcst:
57:58254449:d=2023031200:GUST:surface:750 min fcst:
58:59568193:d=2023031200:UPHL:5000-2000 m above ground:750 min fcst:
59:59674280:d=2023031200:UGRD:80 m above ground:750 min fcst:
60:60868814:d=2023031200:VGRD:80 m above ground:750 min fcst:
61:62076573:d=2023031200:PRES:surface:750 min fcst:
62:63590686:d=2023031200:HGT:surface:750 min fcst:
63:65744381:d=2023031200:TMP:2 m above ground:750 min fcst:
64:66949541:d=2023031200:SPFH:2 m above ground:750 min fcst:
65:68300930:d=2023031200:DPT:2 m above ground:750 min fcst:
66:69492750:d=2023031200:UGRD:10 m above ground:750 min fcst:
67:71874365:d=2023031200:VGRD:10 m above ground:750 min fcst:
68:74255980:d=2023031200:WIND:10 m above ground:745-750 min ave fcst:
69:76147706:d=2023031200:UGRD:10 m above ground:745-750 min ave fcst:
70:78022269:d=2023031200:VGRD:10 m above ground:745-750 min ave fcst:
71:79862604:d=2023031200:DSWRF:surface:735-750 min ave fcst:
72:80366961:d=2023031200:VBDSF:surface:735-750 min ave fcst:
73:80867664:d=2023031200:CPOFP:surface:750 min fcst:
74:81081696:d=2023031200:PRATE:surface:750 min fcst:
75:81185428:d=2023031200:APCP:surface:735-750 min acc fcst:
76:81623998:d=2023031200:WEASD:surface:735-750 min acc fcst:
77:81906446:d=2023031200:FROZR:surface:735-750 min acc fcst:
78:82004006:d=2023031200:CSNOW:surface:750 min fcst:
79:82043242:d=2023031200:CICEP:surface:750 min fcst:
80:82046759:d=2023031200:CFRZR:surface:750 min fcst:
81:82052733:d=2023031200:CRAIN:surface:750 min fcst:
82:82126049:d=2023031200:TCOLW:entire atmosphere:750 min fcst:
83:83271986:d=2023031200:TCOLI:entire atmosphere:750 min fcst:
84:85127922:d=2023031200:HGT:cloud ceiling:750 min fcst:
85:87068879:d=2023031200:HGT:cloud base:750 min fcst:
86:90169738:d=2023031200:HGT:cloud top:750 min fcst:
87:92025900:d=2023031200:ULWRF:top of atmosphere:750 min fcst:
88:93937266:d=2023031200:DSWRF:surface:750 min fcst:
89:94769189:d=2023031200:DLWRF:surface:750 min fcst:
90:96834549:d=2023031200:USWRF:surface:750 min fcst:
91:97486053:d=2023031200:ULWRF:surface:750 min fcst:
92:99107310:d=2023031200:VBDSF:surface:750 min fcst:
93:99822112:d=2023031200:VDDSF:surface:750 min fcst:
94:100665853:d=2023031200:USWRF:top of atmosphere:750 min fcst:
95:101503586:d=2023031200:SBT123:top of atmosphere:750 min fcst:
96:103122046:d=2023031200:SBT124:top of atmosphere:750 min fcst:
97:105479467:d=2023031200:SBT113:top of atmosphere:750 min fcst:
98:106958753:d=2023031200:SBT114:top of atmosphere:750 min fcst:
99:109192866:d=2023031200:REFC:entire atmosphere:765 min fcst:
100:109865168:d=2023031200:RETOP:cloud top:765 min fcst:
101:110135591:d=2023031200:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:765 min fcst:
102:110921341:d=2023031200:VIL:entire atmosphere:765 min fcst:
103:111345494:d=2023031200:VIS:surface:765 min fcst:
104:112735081:d=2023031200:REFD:1000 m above ground:765 min fcst:
105:113135816:d=2023031200:REFD:4000 m above ground:765 min fcst:
106:113359391:d=2023031200:GUST:surface:765 min fcst:
107:114669754:d=2023031200:UPHL:5000-2000 m above ground:765 min fcst:
108:114775700:d=2023031200:UGRD:80 m above ground:765 min fcst:
109:115969110:d=2023031200:VGRD:80 m above ground:765 min fcst:
110:117175214:d=2023031200:PRES:surface:765 min fcst:
111:118691577:d=2023031200:HGT:surface:765 min fcst:
112:120845272:d=2023031200:TMP:2 m above ground:765 min fcst:
113:122044777:d=2023031200:SPFH:2 m above ground:765 min fcst:
114:123392081:d=2023031200:DPT:2 m above ground:765 min fcst:
115:124580274:d=2023031200:UGRD:10 m above ground:765 min fcst:
116:126961889:d=2023031200:VGRD:10 m above ground:765 min fcst:
117:129343504:d=2023031200:WIND:10 m above ground:760-765 min ave fcst:
118:131231925:d=2023031200:UGRD:10 m above ground:760-765 min ave fcst:
119:133101853:d=2023031200:VGRD:10 m above ground:760-765 min ave fcst:
120:134938264:d=2023031200:DSWRF:surface:750-765 min ave fcst:
121:135545311:d=2023031200:VBDSF:surface:750-765 min ave fcst:
122:136109315:d=2023031200:CPOFP:surface:765 min fcst:
123:136326812:d=2023031200:PRATE:surface:765 min fcst:
124:136430161:d=2023031200:APCP:surface:750-765 min acc fcst:
125:136864501:d=2023031200:WEASD:surface:750-765 min acc fcst:
126:137148884:d=2023031200:FROZR:surface:750-765 min acc fcst:
127:137245677:d=2023031200:CSNOW:surface:765 min fcst:
128:137284964:d=2023031200:CICEP:surface:765 min fcst:
129:137288244:d=2023031200:CFRZR:surface:765 min fcst:
130:137294660:d=2023031200:CRAIN:surface:765 min fcst:
131:137367386:d=2023031200:TCOLW:entire atmosphere:765 min fcst:
132:138513446:d=2023031200:TCOLI:entire atmosphere:765 min fcst:
133:140356222:d=2023031200:HGT:cloud ceiling:765 min fcst:
134:142306588:d=2023031200:HGT:cloud base:765 min fcst:
135:145414575:d=2023031200:HGT:cloud top:765 min fcst:
136:147275924:d=2023031200:ULWRF:top of atmosphere:765 min fcst:
137:149184134:d=2023031200:DSWRF:surface:765 min fcst:
138:150177044:d=2023031200:DLWRF:surface:765 min fcst:
139:152238973:d=2023031200:USWRF:surface:765 min fcst:
140:153018360:d=2023031200:ULWRF:surface:765 min fcst:
141:154636041:d=2023031200:VBDSF:surface:765 min fcst:
142:155436515:d=2023031200:VDDSF:surface:765 min fcst:
143:156442803:d=2023031200:USWRF:top of atmosphere:765 min fcst:
144:157429615:d=2023031200:SBT123:top of atmosphere:765 min fcst:
145:159047731:d=2023031200:SBT124:top of atmosphere:765 min fcst:
146:161403679:d=2023031200:SBT113:top of atmosphere:765 min fcst:
147:162883102:d=2023031200:SBT114:top of atmosphere:765 min fcst:
148:165117583:d=2023031200:REFC:entire atmosphere:780 min fcst:
149:165791792:d=2023031200:RETOP:cloud top:780 min fcst:
150:166064978:d=2023031200:var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:780 min fcst:
151:166852804:d=2023031200:VIL:entire atmosphere:780 min fcst:
152:167277590:d=2023031200:VIS:surface:780 min fcst:
153:168669201:d=2023031200:REFD:1000 m above ground:780 min fcst:
154:169072822:d=2023031200:REFD:4000 m above ground:780 min fcst:
155:169296034:d=2023031200:GUST:surface:780 min fcst:
156:170602055:d=2023031200:UPHL:5000-2000 m above ground:780 min fcst:
157:170707623:d=2023031200:UGRD:80 m above ground:780 min fcst:
158:171899386:d=2023031200:VGRD:80 m above ground:780 min fcst:
159:173103414:d=2023031200:PRES:surface:780 min fcst:
160:174617688:d=2023031200:HGT:surface:780 min fcst:
161:176771383:d=2023031200:TMP:2 m above ground:780 min fcst:
162:177965670:d=2023031200:SPFH:2 m above ground:780 min fcst:
163:179309300:d=2023031200:DPT:2 m above ground:780 min fcst:
164:180493978:d=2023031200:UGRD:10 m above ground:780 min fcst:
165:182637450:d=2023031200:VGRD:10 m above ground:780 min fcst:
166:185019065:d=2023031200:WIND:10 m above ground:775-780 min ave fcst:
167:186903307:d=2023031200:UGRD:10 m above ground:775-780 min ave fcst:
168:188768478:d=2023031200:VGRD:10 m above ground:775-780 min ave fcst:
169:190601139:d=2023031200:DSWRF:surface:765-780 min ave fcst:
170:191309973:d=2023031200:VBDSF:surface:765-780 min ave fcst:
171:191937128:d=2023031200:CPOFP:surface:780 min fcst:
172:192155257:d=2023031200:PRATE:surface:780 min fcst:
173:192258275:d=2023031200:APCP:surface:765-780 min acc fcst:
174:192695126:d=2023031200:WEASD:surface:765-780 min acc fcst:
175:192981299:d=2023031200:FROZR:surface:765-780 min acc fcst:
176:193078758:d=2023031200:CSNOW:surface:780 min fcst:
177:193117614:d=2023031200:CICEP:surface:780 min fcst:
178:193120881:d=2023031200:CFRZR:surface:780 min fcst:
179:193127347:d=2023031200:CRAIN:surface:780 min fcst:
180:193199352:d=2023031200:TCOLW:entire atmosphere:780 min fcst:
181:194345866:d=2023031200:TCOLI:entire atmosphere:780 min fcst:
182:196188441:d=2023031200:HGT:cloud ceiling:780 min fcst:
183:198149050:d=2023031200:HGT:cloud base:780 min fcst:
184:201260133:d=2023031200:HGT:cloud top:780 min fcst:
185:203132121:d=2023031200:ULWRF:top of atmosphere:780 min fcst:
186:205037684:d=2023031200:DSWRF:surface:780 min fcst:
187:206196064:d=2023031200:DLWRF:surface:780 min fcst:
188:208253734:d=2023031200:USWRF:surface:780 min fcst:
189:209163148:d=2023031200:ULWRF:surface:780 min fcst:
190:210778196:d=2023031200:VBDSF:surface:780 min fcst:
191:211664932:d=2023031200:VDDSF:surface:780 min fcst:
192:212830468:d=2023031200:USWRF:top of atmosphere:780 min fcst:
193:213965444:d=2023031200:SBT123:top of atmosphere:780 min fcst:
194:215583407:d=2023031200:SBT124:top of atmosphere:780 min fcst:
195:217939575:d=2023031200:SBT113:top of atmosphere:780 min fcst:
196:219418397:d=2023031200:SBT114:top of atmosphere:780 min fcst:

@martindurant

@emfdavid , can you please link to the most general overview of your work on .idx files here?

@emfdavid

emfdavid commented May 9, 2024

I created a parser that infers each message's offset and length from the .idx file.
I punted on reverse engineering the tags in the idx file.
Instead, I run scan_grib on each group and map the idx entries one-to-one, in order, onto the scan_grib groups.
I am traveling this week - happy to follow up more next week.
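The offset/length inference described here can be sketched as follows (a minimal illustration, not kerchunk's actual implementation): the .idx file only stores each message's starting byte offset, so a message's length is the distance to the next offset, and the final message's length is unknowable from the .idx alone.

```rust
// Minimal sketch of inferring byte ranges from consecutive .idx offsets.
// Each message's length is the gap to the next offset; the final message
// runs to end-of-file, so its length is None.
fn byte_ranges(offsets: &[u64]) -> Vec<(u64, Option<u64>)> {
    offsets
        .iter()
        .enumerate()
        .map(|(i, &off)| (off, offsets.get(i + 1).map(|&next| next - off)))
        .collect()
}

fn main() {
    // Offsets taken from the first three HRRR .idx lines above.
    let ranges = byte_ranges(&[0, 665218, 935932]);
    assert_eq!(ranges[0], (0, Some(665218)));
    assert_eq!(ranges[1], (665218, Some(270714)));
    assert_eq!(ranges[2], (935932, None));
}
```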

@emfdavid

@martindurant the notebook in that directory provides an overview of how to use these methods. The Pangeo Showcase talk has my narrated version.

@JackKelly
Contributor

JackKelly commented Sep 24, 2024

FWIW, I've started tinkering with .idx parsing in Rust (in my nascent hypergrib project).

@mpiannucci please shout if you'd prefer the .idx parsing to live in gribberish!

(I stumbled across this github issue after I started tinkering with parsing .idx files... Now that I've seen this issue, it feels like .idx parsing probably should live in gribberish not hypergrib! hypergrib is all about making huge multi-grib datasets easy to use lazily... basically "the kerchunk trick" 🙂 plus any GRIB-specific hacks we can think of to go as fast as possible. My guess is that anything that relates to parsing single GRIB files probably belongs in gribberish???)

Rust's csv crate makes it pretty easy to get started:

#[derive(PartialEq, Debug, serde::Deserialize)]
struct IdxRecord {
    msg_id: u32,
    byte_offset: u32,
    init_time: String,       // TODO: Use DateTime<Utc>?
    nwp_variable: String,    // TODO: Use NWPVariable enum?
    vertical_level: String,  // TODO: Use VerticalLevel enum?
    forecast_step: String,   // TODO: Use TimeDelta?
    ensemble_member: String, // TODO: Use EnsembleMember enum?
}

/// `b` is the contents of an `.idx` file.
fn parse_idx(b: &[u8]) -> anyhow::Result<Vec<IdxRecord>> {
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b':')
        .has_headers(false)
        .from_reader(b);
    let mut records = vec![];
    for result in rdr.deserialize() {
        records.push(result?);
    }
    Ok(records)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_idx() -> anyhow::Result<()> {
        let idx_text = "\
1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl
2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl
3:70653:d=2017010100:RH:10 mb:anl:ENS=low-res ctl
4:81565:d=2017010100:UGRD:10 mb:anl:ENS=low-res ctl
";
        let records = parse_idx(idx_text.as_bytes())?;
        assert_eq!(records.len(), 4);
        assert_eq!(
            records[0],
            IdxRecord {
                msg_id: 1,
                byte_offset: 0,
                init_time: String::from("d=2017010100"),
                nwp_variable: String::from("HGT"),
                vertical_level: String::from("10 mb"),
                forecast_step: String::from("anl"),
                ensemble_member: String::from("ENS=low-res ctl"),
            }
        );
        Ok(())
    }
}
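One gotcha worth noting: the HRRR lines at the top of this issue end with a trailing ':' (which a colon-delimited reader will see as an empty extra field), while the GEFS-style lines in the test above instead end with a seventh ensemble field. A std-only splitting sketch that tolerates both (a hypothetical helper, not part of gribberish):

```rust
// Split one .idx line on ':', dropping the empty final field produced by
// a trailing colon (as in the HRRR example) while keeping a real trailing
// field such as the GEFS ensemble member.
fn split_idx_line(line: &str) -> Vec<&str> {
    let mut fields: Vec<&str> = line.split(':').collect();
    if fields.last() == Some(&"") {
        fields.pop(); // trailing ':' produced an empty field; discard it
    }
    fields
}

fn main() {
    // HRRR-style line: trailing colon, no ensemble field -> 6 fields.
    let f = split_idx_line("1:0:d=2023031200:REFC:entire atmosphere:735 min fcst:");
    assert_eq!(f.len(), 6);
    assert_eq!(f[3], "REFC");

    // GEFS-style line: ensemble member as a real seventh field.
    let f = split_idx_line("1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl");
    assert_eq!(f.len(), 7);
    assert_eq!(f[6], "ENS=low-res ctl");
}
```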

@martindurant

Thanks @JackKelly . Take whatever you need from the kerchunk files and outstanding PRs on the topic, of course. Most of that is not around the idx files themselves (which are simple) but how to map them onto sets of grib files, and how to query these mappings to make logical data sets.

@JackKelly
Contributor

JackKelly commented Sep 24, 2024

@mpiannucci I see that commit 01745ad implemented MessageMetadata::as_idx(&self, index: usize) -> String.

Is the intention of this github issue to implement a MessageMetadata::from_idx(idx_data: &[u8]) -> MessageMetadata method?

@emfdavid

It would be great to have a complete solution in Rust, but if you want a jump start, the real work in Rust is reading the grib files themselves. You should be able to change the codec in these demo files pretty easily if you want to experiment.

@mpiannucci
Owner Author

This is sooo cool to see!

Is the intention of this github issue to implement a MessageMetadata::from_idx(idx_data: &[u8]) -> MessageMetadata method?

Yeah, that was my intention, but I clearly have not gotten around to building it. I would LOVE to have this as part of the gribberish repo if folks are open to that.

@JackKelly
Contributor

@emfdavid wrote:

the real work in rust is reading the grib files

I hear you: The main performance benefits of using Rust will come from reading the grib files themselves, rather than the .idx files. But, for the MVP of hypergrib, I'm starting by "cheating" and creating my manifest by reading existing .idx files (although, soon after the MVP, I will have to read the grib files to create my manifest because the .idx files don't include essential information like the grid definition... maybe gribberish could define and generate "extended .idx" sidecar files which include additional metadata about each message?!).

For the MVP of hypergrib I'm focused on the NODD GEFS dataset (which already has .idx files). So, I agree that tinkering with .idx files in Rust isn't necessarily the fastest way to a big performance boost. But reading .idx files in Rust is a requirement of the hypergrib MVP 🙂.

Plus, I'd love to help with gribberish. If only because it has such an awesome name 🙂

@JackKelly
Contributor

JackKelly commented Sep 25, 2024

@mpiannucci wrote:

I would LOVE to have [a MessageMetadata::from_idx function] as part of the gribberish

Great! Let's make it happen 🙂

First, please may I ask a few quick questions about how best to approach this?

1) Where do I find the most authoritative definition of .idx files?

Is the best "definition" the source code for wgrib?

2) .idx files don't include all the required fields in MessageMetadata

For example, the body of an .idx file doesn't include data_compression, proj, or crs.

Some potential paths forward might include:

  1. Change the definition of struct MessageMetadata so almost all the fields are Option
    • Advantages: Conceptually simple.
    • Disadvantages: Does downstream code in gribberish require that MessageMetadata is fully populated? If so, all that code will have to be changed to handle metadata fields that are set to None.
  2. Force the user to supply missing metadata. e.g. implement a Builder pattern whereby some (but not all) required fields can be populated from an .idx file. But the user will have to supply the remaining fields before they're allowed to construct a MessageMetadata instance.
    • Advantages: Ensures gribberish has all the metadata it needs. Minimal changes to downstream gribberish code.
    • Disadvantages: Makes life much harder for users!
  3. Hardcode the missing metadata for each dataset into gribberish such that gribberish knows, for example, which data_compression is used by GEFS. The user would just have to supply an .idx file and tell gribberish which dataset that .idx file comes from (GEFS, HRRR, etc.)
    • Advantages: Makes life easy for users! And guarantees that downstream gribberish code gets all the fields it expects.
    • Disadvantages: More work for us! We'd have to hardcode data for lots of datasets! (But there aren't that many NWPs). And what would gribberish do if it comes across an .idx file from a dataset it doesn't know about?
  4. We define an "extended idx" format 🙂
    • I quite like this idea. But let's park it for now 🙂
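As a concrete illustration of option 2, a builder could accept the fields an .idx line provides and fail fast if the caller never supplies the rest. All names below are hypothetical stand-ins, not gribberish's actual MessageMetadata API, and only a few fields are shown:

```rust
// Sketch of the builder pattern (option 2). Names are hypothetical.
#[derive(Debug)]
struct MessageMetadata {
    variable: String,
    level: String,
    data_compression: String, // not present in .idx files
}

struct MetadataBuilder {
    variable: Option<String>,
    level: Option<String>,
    data_compression: Option<String>,
}

impl MetadataBuilder {
    // Populate the fields an .idx line can provide.
    fn from_idx_fields(variable: &str, level: &str) -> Self {
        MetadataBuilder {
            variable: Some(variable.to_string()),
            level: Some(level.to_string()),
            data_compression: None,
        }
    }

    // The user must supply what the .idx file lacks...
    fn data_compression(mut self, c: &str) -> Self {
        self.data_compression = Some(c.to_string());
        self
    }

    // ...and construction fails if they never did.
    fn build(self) -> Result<MessageMetadata, String> {
        Ok(MessageMetadata {
            variable: self.variable.ok_or("missing variable")?,
            level: self.level.ok_or("missing level")?,
            data_compression: self.data_compression.ok_or("missing data_compression")?,
        })
    }
}

fn main() {
    let md = MetadataBuilder::from_idx_fields("TMP", "2 m above ground")
        .data_compression("jpeg2000")
        .build()
        .unwrap();
    assert_eq!(md.variable, "TMP");

    // Forgetting the non-.idx fields is an error, not a half-built struct.
    assert!(MetadataBuilder::from_idx_fields("TMP", "2 m above ground")
        .build()
        .is_err());
}
```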

@martindurant

.idx files don't include all the required fields in MessageMetadata

This is exactly what we deal with in kerchunk - we scan the first of a set of files to get all the data we need, and then use the .idx files for all the rest to infer the complete data for all of them. From your point of view, this is somewhere between solutions 2) and 3).

@JackKelly
Contributor

That's great to know, thanks @martindurant. Sorry to ask a slightly off-topic question but: What does kerchunk do if the metadata changes over the course of a dataset? e.g. an NWP dataset which is, say, 0.5 degree horizontal resolution from 2015 to 2020. But the NWP gets upgraded in 2021 to 0.25 degree resolution? (I've been worrying about this exact issue for hypergrib recently. I don't have a great solution!)

@martindurant

It doesn't "do" anything; you would, I think, have to process the two batches separately.

If the chunking is actually different in the overall dataset, then this can't be represented in the zarr model at all without at least variable chunking, and maybe not at all. This is why virtualizarr is a more general solution, and the two projects should probably be tied more closely together :)

@JackKelly
Contributor

On the topic of how to handle .idx metadata in gribberish...

I'm sure everyone has considered this already but it feels like there are three distinct types of message metadata:

  1. The metadata that locates the 2D grib message in the huge, virtual, 5-dimensional array that xarray thinks about. Specifically, those 5 dims are the reference time, forecast step, product (AKA variable), vertical level, and the ensemble member.
  2. The physical location of the grib message in storage: the path, offset, and length.
  3. Metadata that helps us interpret the grib message: the crs, proj, data compression, etc.

IIUC, the .idx file always gives us everything we need for "metadata type 1" and gives us 2 of the 3 fields we need for "metadata type 2".

Maybe a 5th option (following on from the 4 "potential paths forward" listed above) would be to split gribberish's MessageMetadata into 2 or 3 separate structs?
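That 5th option might look roughly like this (a sketch with hypothetical names, not a proposal for gribberish's exact API). The point is that an .idx line fully populates the first struct, partially populates the second, and cannot populate the third:

```rust
// Sketch of splitting metadata into the three kinds listed above.
// All type and field names here are hypothetical.
struct DatasetCoords {
    // type 1: position in the 5-D hypercube
    reference_time: String,
    forecast_step: String,
    product: String,
    vertical_level: String,
    ensemble_member: Option<String>,
}

struct StorageLocation {
    // type 2: where the message bytes live
    path: String,
    offset: u64,
    length: Option<u64>, // not recoverable from .idx for the last message
}

struct DecodingInfo {
    // type 3: how to interpret the payload (only obtainable from the grib itself)
    crs: String,
    data_compression: String,
}

// An .idx line yields full coords and a partial location; decoding info
// must come from scanning a grib message, so it starts as None.
struct GribMessageRef {
    coords: DatasetCoords,
    location: StorageLocation,
    decoding: Option<DecodingInfo>,
}

fn main() {
    let r = GribMessageRef {
        coords: DatasetCoords {
            reference_time: "2023031200".into(),
            forecast_step: "735 min fcst".into(),
            product: "REFC".into(),
            vertical_level: "entire atmosphere".into(),
            ensemble_member: None,
        },
        location: StorageLocation {
            path: "hrrr.grib2".into(),
            offset: 0,
            length: Some(665218),
        },
        decoding: None,
    };
    assert_eq!(r.coords.product, "REFC");
    assert!(r.decoding.is_none());
}
```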

@emfdavid

emfdavid commented Sep 25, 2024

These are exactly the issues we struggled with last winter. @martindurant and @Anu-Ra-g did a great job adding documentation about how the kerchunk solution works.

What we found is that the idx file does not provide complete Type I metadata - at least not without reverse engineering the wgrib fortran code that wrote the "attrs" column, which encodes the product, level, and step in a form better suited to humans than machines.

[Screenshot: example .idx "attrs" strings]

The solution implemented for kerchunk is to build a mapping from this attrs string to the Type I metadata you need to locate each chunk in the hypercube.
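For illustration, part of such an attrs-to-machine-form mapping might look like the following for the forecast-step strings visible in this thread. This sketch handles only the formats shown above ("anl", "N min fcst", "A-B min ave/acc fcst"), not the full wgrib grammar:

```rust
// Map a wgrib-style step string to a (start, end) pair in minutes.
// "anl" (analysis) maps to (0, 0); a plain forecast maps to (N, N);
// an averaged/accumulated window "A-B min ..." maps to (A, B).
fn parse_step_minutes(s: &str) -> Option<(u32, u32)> {
    if s == "anl" {
        return Some((0, 0));
    }
    let words: Vec<&str> = s.split_whitespace().collect();
    // e.g. ["735", "min", "fcst"] or ["720-735", "min", "acc", "fcst"]
    if words.len() < 3 || words[1] != "min" {
        return None;
    }
    match words[0].split_once('-') {
        Some((a, b)) => Some((a.parse().ok()?, b.parse().ok()?)),
        None => {
            let m = words[0].parse().ok()?;
            Some((m, m))
        }
    }
}

fn main() {
    assert_eq!(parse_step_minutes("anl"), Some((0, 0)));
    assert_eq!(parse_step_minutes("735 min fcst"), Some((735, 735)));
    assert_eq!(parse_step_minutes("720-735 min acc fcst"), Some((720, 735)));
    assert_eq!(parse_step_minutes("entire atmosphere"), None);
}
```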

The result is a table that stores all the Type I & II data:
[Screenshot: table of Type I & II metadata]

From this, plus the static metadata about compression, dimensions, and product attributes, you can then construct any logical dataset you like.

Operationally, we know when NOAA is going to change the model - it is an operational product - but just to be sure, I compare reading the grib file directly with parsing the idx file for 1 in 1000 files to make sure nothing has changed.

I would love to get rid of the mapping and improve the API for building the logical dataset, but I hope understanding one working solution can give you a boost toward finding a better way.

If we come up with a standard form for the Type I & II metadata, we may be able to get NOAA/NODD to build and maintain the database for us. It would be really exciting if we could show that a common form supports kerchunk, virtualizarr, and hypergrib - I think we would be well on our way!

@emfdavid

And the name gribberish is totally awesome!

@JackKelly
Contributor

JackKelly commented Sep 26, 2024

This is all super-useful, @emfdavid!

.idx spec

I think I'm up for trying to fully understand the wgrib source code, and to build something in gribberish that can understand any .idx file created by wgrib. That said, I haven't fully digested what this entails, so please don't hold me to this if it turns out to be far more work than I'm imagining! Maybe my first step will be to convert the wgrib source code to a human-readable spec for .idx files, if such a spec doesn't already exist?

Scanning grib files

On the topic of scanning grib files... One of my ultimate goals is to (help) run a service that maintains a public manifest of NODD grib datasets. So users can just run xr.open_dataset(MANIFEST_URL, engine="hypergrib") to lazily open petabyte-scale grib datasets. I'm talking myself into the idea that the "public manifest service" will have to scan every grib file that NODD publishes, so it knows exactly when the dataset changes (e.g. when the NWP upgrades its horizontal resolution etc.). Which is one of the reasons I'm interested in making our own "extended .idx" format! Or using your trick, @emfdavid, of storing the manifest in a database!

(Actually, I'm hoping I won't have to scan the entirety of every grib file. I'm hoping I can use the existing .idx files to tell me where each message begins, and only "scan" the metadata sections, not the binary payload. Although reading lots of small byteranges may be inefficient on cloud object storage...)
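Scanning just the headers is plausible because every GRIB2 message starts with a fixed 16-byte indicator section (section 0): the ASCII magic "GRIB", two reserved bytes, the discipline, the edition number, and the total message length as a big-endian u64. A sketch of decoding one from a fetched byterange:

```rust
// Decode a GRIB2 indicator section (section 0, 16 bytes) that might sit
// at a byte offset taken from an .idx file. Returns (discipline, edition,
// total message length), or None if the magic bytes don't match.
use std::convert::TryInto;

fn parse_indicator(buf: &[u8; 16]) -> Option<(u8, u8, u64)> {
    if &buf[0..4] != b"GRIB" {
        return None;
    }
    let discipline = buf[6];
    let edition = buf[7];
    let total_len = u64::from_be_bytes(buf[8..16].try_into().ok()?);
    Some((discipline, edition, total_len))
}

fn main() {
    // Build a synthetic header for a meteorological (discipline 0),
    // edition-2 message whose length matches the first HRRR range above.
    let mut msg = [0u8; 16];
    msg[0..4].copy_from_slice(b"GRIB");
    msg[6] = 0; // discipline: meteorological
    msg[7] = 2; // GRIB edition 2
    msg[8..16].copy_from_slice(&665218u64.to_be_bytes());
    assert_eq!(parse_indicator(&msg), Some((0, 2, 665218)));
    assert_eq!(parse_indicator(&[0u8; 16]), None);
}
```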

@emfdavid

build something in gribberish that can understand any .idx file created by wgrib

This is fantastic - the data-driven mapping between the idx files and the CF grib metadata is definitely a bit of a hack. If you can build and maintain tables/code based on wgrib, that would be strictly better.

So users can just run xr.open_dataset(MANIFEST_URL, engine="hypergrib") to lazily open petabyte-scale grib datasets

The full dataset - all products, steps, and runtimes - is billions of chunks for a single perturbation of GEFS.
Would you have different MANIFEST_URLs for different slices, or maybe a time-range operator as part of the URL? It will be a very large manifest response. When you are ready, I would like to discuss the 'reinflate' method we built in kerchunk to turn a manifest and an axis specification into a logical dataset.

ultimate goals is to (help) run a service that maintains a public manifest

I would love to see NODD actually build and maintain the manifest database for us. I think we could do it if the community comes together around a table schema. I think there is funding and a path to implementation.

@martindurant

@rabernat, you might have some thoughts on the DB approach being sketched out here and in related kerchunk threads. This is not exactly a kerchunk manifest, but a set of chunk details that can be made into various datasets; no one dataset can use all the references (they wouldn't fit in the data model, with mutually conflicting coordinates).

I would love to see NODD actually build and maintain the manifest database for us.

This is the big goal! None of us wants to read however many PB of data they have, or support the storage/query interface it would need. Making manifests for some specific view is doable, but the data is always being updated, so there needs to be a process too.

@JackKelly
Contributor

Have NOAA hinted that they might have appetite for maintaining a public manifest for NODD?

@JackKelly
Contributor

The full dataset for all products, all steps & runtimes is billions of chunks for a single perturbation of GEFS

Yeah. I'm interested in defining a very concise manifest format but, even if the manifest only uses a single byte per chunk - which sounds impossibly concise(!) - then we're still looking at a manifest that's gigabytes in size. These dang NWP datasets are so BIG! Hmm, I see what you mean... this definitely needs more thought... It feels solvable though...

@JackKelly
Contributor

JackKelly commented Sep 27, 2024

(BTW, I'm going to propose that we move discussion of giga-scale manifests etcetera to the hypergrib discussion forum, so we can focus this gribberish GitHub issue on the question of how gribberish can handle .idx files... I'll post a link to relevant hypergrib discussions within a few hours. Sorry, my fault for getting carried away!)

UPDATED WITH LINKS:

@emfdavid

Yes - if we do determine we need a manifest, and we come up with an information model that supports the community's needs (hypergrib, virtualizarr, kerchunk), I think we could definitely find funding and enthusiastic support from NODD to operate it.

If we can deliver a tool that doesn't need the manifest at all and can load whole datasets by looking up chunks algorithmically - as you are now proposing - well, that would be even better!

@JackKelly
Contributor

JackKelly commented Oct 1, 2024

@mpiannucci I've made a start on parsing .idx files into gribberish's parameters.

I don't (yet) have great Rust macro skills. Please may I ask your advice: Do you think it should be possible to write a Rust macro to automatically convert a string like "TMP" to gribberish::templates::product::parameters::meteorological::TemperatureProduct::Temperature, using the #[abbrev = "TMP"] annotations that you've already painstakingly defined?!

At the moment, I'm manually mapping from abbreviation strings to the appropriate enum variant (here's my code so far - we can move this into gribberish if you'd like).
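For reference, the manual mapping being described is essentially one big match over abbreviation strings; a tiny stand-in version (the Product enum here is hypothetical, not gribberish's real parameter enums such as meteorological::TemperatureProduct):

```rust
// Hand-written abbreviation-to-enum mapping, as an alternative to deriving
// it with a macro from gribberish's #[abbrev = "..."] annotations.
#[derive(Debug, PartialEq)]
enum Product {
    Temperature,        // "TMP"
    GeopotentialHeight, // "HGT"
    Unknown,
}

fn product_from_abbrev(abbrev: &str) -> Product {
    match abbrev {
        "TMP" => Product::Temperature,
        "HGT" => Product::GeopotentialHeight,
        _ => Product::Unknown,
    }
}

fn main() {
    assert_eq!(product_from_abbrev("TMP"), Product::Temperature);
    assert_eq!(product_from_abbrev("XYZ"), Product::Unknown);
}
```

A derive or proc macro could in principle generate exactly this match from the existing annotations, which is the question being asked above.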

@JackKelly
Contributor

JackKelly commented Oct 8, 2024

I'm afraid I'm moving away from the idea of parsing .idx files into gribberish types (sorry!). For the hypergrib MVP, I'm going to keep the parameter abbreviations and levels as strings. So I'm afraid I'm going to pause work on this (for now, at least). If anyone else wants to take this on then go for it!

After the MVP, hypergrib will decode the parameter abbreviations and levels, but will probably use the GRIB2 tables encoded as .csv files in gdal. UPDATE: The correct link is gdal/frmts/grib/data.

@emfdavid

emfdavid commented Oct 8, 2024

That looks like a good resource for a machine-readable form... but to get all the NOAA special variables you will have to take a look at the NCEPLIBS library. I opened an issue asking for the data to be exposed in a machine-readable form, but I don't think it has happened yet.

@JackKelly
Contributor

Yikes! Thanks for flagging that! GRIB is a bit of a mess, isn't it?! It feels like a useful contribution would be "just" to collate all these GRIB code tables into a single place, in a machine-readable form. (I've started a thread to track progress on this idea)

@JackKelly
Contributor

JackKelly commented Oct 10, 2024

Sorry, I linked to the wrong gdal path in my comment above. The following is the correct path, which contains more GRIB tables as CSVs: gdal/frmts/grib/data/

The README for that directory is here: gdal/frmts/grib/degrib/README.TXT

to get all the NOAA special variables you will have to take a look at the ncep libs library

Does this CSV contain what you need? gdal/frmts/grib/data/grib2_table_4_2_local_NCEP.csv

@martindurant

a useful contribution would be "just" to collate all these GRIB code tables into a single place

Yes! I wonder how often new entries are added... Of course, many "definitions" come with an implementation too, for example all the coordinate projections.

@JackKelly
Contributor

JackKelly commented Oct 10, 2024

After posting the comment where I said:

a useful contribution would be "just" to collate all these GRIB code tables into a single place

I discovered that the GDAL codebase appears to contain a bunch of vendor-specific tables:

Is this sufficient? Does gdal already contain all the tables we need?

@martindurant

Obviously I don't know, but I wouldn't be too surprised if GDAL was mostly on top of this. Of course, how up to date that is, is another matter - but probably they have a pretty active user base pushing for updates as they arise.

@mpiannucci
Owner Author

mpiannucci commented Oct 10, 2024

The trouble is that there are multiple formats for the code tables. GDAL is probably the most comprehensive resource. You'll notice that the format GDAL uses probably doesn't match the WMO code tables, but that doesn't really matter.

I hate the code tables impl in gribberish and was going to codegen them but never had the time.

@JackKelly
Contributor

JackKelly commented Oct 10, 2024

I hate the code tables impl in gribberish and was going to codegen them but never had the time.

Oh, I'm in awe of your pure-Rust representation of the GRIB tables! Reading your code genuinely expanded my understanding of what can be represented in pure Rust!

Making no promises... but I'm wondering about writing a new Rust crate which:

  1. Contains a copy of GDAL's CSVs (with a README which gives full attribution to GDAL, of course!)
    • A subsequent PR could convert the CSVs to a binary format for better performance at runtime.
  2. Contains some simple Rust code to read the GRIB tables into memory to provide a way to look up information?

Does that sound useful? Or are you determined to use a pure-Rust representation of the code tables (where the Rust might be codegen'd from the GDAL CSVs)?
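To make the idea concrete, here's a minimal sketch of what step 2 might look like. This is not grib_tables' real API — the struct, the column layout (parameter number, abbreviation, name, unit), and the function names are all my own invention for illustration; the real GDAL CSVs have more columns:

```rust
use std::collections::HashMap;

/// One row of a GRIB2 code table (hypothetical 4-column layout:
/// parameter number, abbreviation, long name, unit).
#[derive(Debug, Clone, PartialEq)]
struct Param {
    number: u8,
    abbrev: String,
    name: String,
    unit: String,
}

/// Load a CSV (here an in-memory &str for illustration) into a map
/// keyed by the parameter abbreviation, so callers can look up e.g.
/// "VTMP" -> full parameter details.
fn load_table(csv: &str) -> HashMap<String, Param> {
    csv.lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(|line| {
            let cols: Vec<&str> = line.split(',').collect();
            if cols.len() != 4 {
                return None; // skip malformed rows
            }
            Some(Param {
                number: cols[0].trim().parse().ok()?,
                abbrev: cols[1].trim().to_string(),
                name: cols[2].trim().to_string(),
                unit: cols[3].trim().to_string(),
            })
        })
        .map(|p| (p.abbrev.clone(), p))
        .collect()
}

fn main() {
    // Two made-up rows in the hypothetical 4-column layout.
    let csv = "0,TMP,Temperature,K\n1,VTMP,Virtual temperature,K\n";
    let table = load_table(csv);
    let vtmp = &table["VTMP"];
    println!("{} -> {} ({})", vtmp.abbrev, vtmp.name, vtmp.unit);
}
```

A binary format (as in the subsequent-PR idea above) would just replace the CSV parsing step; the lookup map would stay the same.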

@emfdavid

This is definitely more of a people problem than a technical problem.
The existing gdal work certainly looks the most comprehensive - a great find!
I wonder if osgeo.org would consider moving the data to a separate data only repository and treating it as a proper versioned dependency that could be easily shared?
It would be great to find the correct docs to link from NCEP and WMO/ECMWF and the reference implementations they maintain.

@JackKelly
Contributor

JackKelly commented Oct 11, 2024

I wonder if osgeo.org would consider moving the data to a separate data only repository

Good question. I've just requested to be subscribed to the gdal-dev mailing list, so I can ask this question! (The gdal github issues page says that github issues are only for feature requests and bug reports)

UPDATE: Here's my post to the GDAL-dev mailing list.

@JackKelly
Contributor

JackKelly commented Oct 11, 2024

On the topic of generating Rust code which represents the GRIB code table... Having just seen a great talk at EuroRust on codegen, and then finding this blog post on codegen, I'm now excited about codegen! (Prior to today I didn't know much about codegen and had assumed it'd be very hard). I've started a new issue: #63

@emfdavid

So, for some of the more opaque descriptions in the .idx files, can we now parse the "level" and "step" descriptions to get coordinates and indices for variables like these?

153:168669201:d=2023031200:REFD:1000 m above ground:780 min fcst:
156:170602055:d=2023031200:UPHL:5000-2000 m above ground:780 min fcst:
166:185019065:d=2023031200:WIND:10 m above ground:775-780 min ave fcst:

The vertical intervals are particularly difficult, and probably only appear on more obscure meteorological variables, but the time-average and accumulation variables are really important.
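For reference, the colon-delimited structure of those records can be split out with a rough sketch like the one below (the field names are my own; per the discussion elsewhere in this thread, level and step stay as unparsed strings, and a real parser might use nom instead):

```rust
/// A single record from a GRIB .idx sidecar file, e.g.
/// 156:170602055:d=2023031200:UPHL:5000-2000 m above ground:780 min fcst:
/// Fields after the byte offset are kept as strings for now.
#[derive(Debug, PartialEq)]
struct IdxRecord {
    message_number: u32,
    byte_offset: u64,
    reference_time: String, // e.g. "d=2023031200"
    parameter: String,      // e.g. "UPHL"
    level: String,          // e.g. "5000-2000 m above ground"
    step: String,           // e.g. "775-780 min ave fcst"
}

fn parse_idx_line(line: &str) -> Option<IdxRecord> {
    // Drop the trailing ':' then split into at most 6 fields, so the
    // step description survives even if it contained no further colons.
    let mut parts = line.trim().trim_end_matches(':').splitn(6, ':');
    Some(IdxRecord {
        message_number: parts.next()?.parse().ok()?,
        byte_offset: parts.next()?.parse().ok()?,
        reference_time: parts.next()?.to_string(),
        parameter: parts.next()?.to_string(),
        level: parts.next()?.to_string(),
        step: parts.next()?.to_string(),
    })
}

fn main() {
    let line = "166:185019065:d=2023031200:WIND:10 m above ground:775-780 min ave fcst:";
    let record = parse_idx_line(line).expect("valid .idx line");
    println!("level = {:?}, step = {:?}", record.level, record.step);
}
```

Turning "5000-2000 m above ground" or "775-780 min ave fcst" into actual coordinates is the hard part this sketch deliberately punts on.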

@JackKelly
Contributor

JackKelly commented Oct 12, 2024

can we now parse the "level" and "step"

Not yet! I haven't written any code to parse the body of the .idx files yet. (Although, when we do write code to parse the body, we could use nom).

I'm afraid I won't get round to parsing "level" and "step" from the body of .idx files for a while (a few months). For the hypergrib MVP I'm initially focused on GEFS. And, in GEFS, a bunch of information is available in the .idx filenames in a form that's easier to parse than the body of the .idx. Specifically, the GEFS filenames include the initialisation datetime, the ensemble member, and the step.

I will extract the level string and product abbreviation from the body of the .idx files, but for the hypergrib MVP I'm just gonna leave these as strings.

My plan for the development of hypergrib is here. The ultimate plan is definitely to fully parse this information.

Also, before we can parse the body of .idx files into gribberish data structures, I'm guessing we should first complete #63

@emfdavid

Although, when we do write code to parse the body, we could use nom

This is why I shouldn't name things - but I deeply appreciate the great names other people come up with.

I will extract the level string and product abbreviation from the body of the .idx files, but for the hypergrib MVP I'm just gonna leave these as strings.

This sounds like a great compromise to get at the meat of the problem in the MVP and leave some nasty string parsing logic to later. Suggest you avoid spending time on the CF attrs as well. Fix that later.

Let's see that IO rate flatlined at the NIC limit, then fix the little stuff.

@JackKelly
Contributor

My little grib_tables Rust crate is now available: https://crates.io/crates/grib_tables

grib_tables just loads the GDAL CSV files into memory and allows the user to map from the parameter abbreviation string (e.g. "VTMP") to the full details of the param. You can also map from the numeric parameter ID to the full param details.

@martindurant

Thanks @JackKelly ! The source code is at https://github.com/JackKelly/hypergrib/tree/main/crates/grib_tables , if that wasn't obvious.

@JackKelly
Contributor

Oooh, good point, I've just updated the repository field in the Cargo.toml to point to that crates/grib_tables subdirectory!
