
Create dataset of UK-cropped satellite data from Europe dataset #150

Open
devsjc opened this issue Feb 1, 2023 · 5 comments
devsjc (Contributor) commented Feb 1, 2023

Summary

There currently exists a ~40 TB satellite image dataset on GCP (and on Leonardo). For ease of ML training, a more manageably sized ~100 GB dataset containing purely UK image data would be beneficial. As such, we want to read in the existing dataset, crop the images so they cover the UK alone, and write the result to a new dataset.

Data structure

The dataset in GCP is stored in the bucket solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4.

The satellite dataset consists of several years of data, stored as a grid of chunks, each chunk containing 12 five-minute timesteps making up an hour's worth of imagery.

The bounds used to specify the UK in Satip are "UK": (-16, 45, 10, 62), read as (west, south, east, north).

Method (Work in progress)

  1. Pull and uncompress current data, x timesteps at a time
  2. Copy/save metadata to avoid loss
  3. Extract images from chunks
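The cropping step above could be sketched with xarray. The snippet below runs against a tiny synthetic stand-in for the real Zarr store; the coordinate names `latitude`/`longitude` and the lat/lon selection approach are assumptions (the real SEVIRI data may well use geostationary x/y coordinates instead):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the Europe dataset (the real store is Zarr on GCP).
lats = np.arange(30.0, 70.0, 1.0)
lons = np.arange(-30.0, 30.0, 1.0)
ds = xr.Dataset(
    {"data": (("latitude", "longitude"), np.random.rand(lats.size, lons.size))},
    coords={"latitude": lats, "longitude": lons},
)

# Satip's UK bounds, read as (west, south, east, north).
west, south, east, north = -16, 45, 10, 62

# Crop to the UK alone; the coords here are ascending, so slice is (min, max).
uk = ds.sel(latitude=slice(south, north), longitude=slice(west, east))
```

In the real pipeline this selection would be applied lazily to the opened store before writing the cropped result out as a new dataset.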

Known gotchas

  • XArray will often delete Zarr attribute files when writing new data: be sure to copy them explicitly into the new dataset
  • Will require decoding via OCF's blosc2 Python library
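One way to guard against the attribute-file gotcha above is to copy every `.zattrs` file across explicitly after the write. A minimal stdlib sketch, where the helper name and store layout are illustrative rather than from any existing code:

```python
import shutil
from pathlib import Path


def copy_zarr_attrs(src_store: Path, dst_store: Path) -> None:
    """Copy every .zattrs file from src_store into dst_store,
    preserving the relative directory layout of the Zarr store."""
    for attrs_file in src_store.rglob(".zattrs"):
        dest = dst_store / attrs_file.relative_to(src_store)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(attrs_file, dest)
```

Running this after the `to_zarr` write would restore any attribute files that xarray dropped.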
jacobbieker (Member) commented
You might want to rechunk the dataset as well, primarily in the x and y dims, to better match the spatial extent.

devsjc (Contributor, Author) commented Feb 1, 2023

I seem to recall that the images for this dataset were chunked on a 4x4 grid? If x and y are each only split into 4 on the large image dataset, and the cropped images are expected to be ~100x smaller, won't one entire cropped image be significantly smaller than what was previously in a single x/y chunk, and hence we might not even need to chunk x/y at all?

Forgive me if/as my lack of understanding renders this question nonsensical...!
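A back-of-envelope version of the question above; the 4x4 chunk grid and the ~100x size ratio are taken from this thread, not verified against the store:

```python
# Treat the full Europe image as unit area.
full_area = 1.0
chunk_area = full_area / (4 * 4)  # 4x4 spatial chunk grid -> 1/16 per chunk
uk_area = full_area / 100         # cropped UK image assumed ~100x smaller

# The whole cropped image is smaller than one old spatial chunk, so a single
# chunk can cover the entire UK extent.
assert uk_area < chunk_area
```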

jacobbieker (Member) commented
Yeah, I agree! But you might have to explicitly rechunk the data to that size

zakwatts commented
@devsjc Is this complete now? I.e. has the code to do this been merged?

peterdudfield (Collaborator) commented
This could be linked to #180

Status: Todo
No branches or pull requests
4 participants