Making NASA JPL PO.DAAC datasets available from ocean.pangeo.io? #686

Closed
lewismc opened this issue Aug 6, 2019 · 11 comments

lewismc commented Aug 6, 2019

Hi folks,

I'm currently evaluating Pangeo and came across the Google Cloud Storage Data Catalog documentation and specifically how one could prepare cloud optimized Zarr data.

PO.DAAC's data catalog currently offers >550 datasets covering a diverse and rich world of oceanography variables e.g. sea surface salinity, gravity, sea ice, sea surface temperature, etc. I think it would be excellent if these were available for processing within Pangeo through simple Python function calls.

How would I go about investigating the above? Thank you

lewismc changed the title from "Adding NASA JPL PO.DAAC datasets" to "Making NASA JPL PO.DAAC datasets available as part of Google Cloud Storage Data Catalog?" on Aug 6, 2019
@lewismc
Author

lewismc commented Aug 6, 2019

A follow-up here is that I would like guidance on whether we should create the Zarr manifestations on the PO.DAAC-side and serve them from our data center or if there is a precedent/guidance on how this should be done. Any comments are really appreciated. Thank you

@rabernat
Member

rabernat commented Aug 6, 2019

Hi @lewismc - welcome to Pangeo! As a physical oceanographer myself, I'm thrilled to see someone from NASA getting involved in this issue. I personally would love to have the PO.DAAC datasets available in a cloud-optimized format. So THANK YOU for your participation!

First let me state that all of this zarr / cloud stuff we are doing is pretty new and experimental. We are working towards defining best practices, but there are still a lot of open questions about how to do things.

When trying to host cloud-optimized mirrors of existing data repositories like PO.DAAC, the main issues that come to mind are:

  1. What format to use? We have learned that simply copying netCDF4 / HDF files to cloud storage does not work well. Zarr works great, but is not yet considered a community standard. In the long term, Unidata's incorporation of Zarr into the netCDF library will solve this problem. In the short term, Zarr written by xarray is our recommended format for analysis-ready data.
  2. What aggregation level to use? Zarr is conducive to much larger aggregation levels than netCDF. For satellite data, it's typical for one netCDF file to contain a single day's snapshot of a field (e.g. SST). In contrast, with Zarr we like to put the entire timeseries into a single zarr Array. Our goal here is to stop thinking about "files" and start thinking about "datasets" holistically. This is closely related to the next question...
  3. What chunking to use? NetCDF files distributed with one file per day have already made a strong choice about chunking, which affects the type of analysis that can be done efficiently. In Zarr, you have to be explicit about chunking. Do you want to chunk in time, space, depth, etc? Or maybe some combination. The answer depends on what sort of use cases you anticipate. It's also reasonable to store multiple copies of the same dataset with different chunk sizes.
  4. What compression to use? We would like to see more experimentation in this area. We are especially excited about lossy compression like zfp, which may lead to compression ratios as high as 10x with little loss of scientific value.
  5. How to produce the zarr datasets? We have several examples of how to generate zarr from netCDF archives here, but this is far from a comprehensive guide. A related question is:
  6. How to update the zarr datasets? Many of the datasets that depend on live satellites require regular updates. It's a very open question how to automate this and do it efficiently. @davidbrochart has a nice blog post on this, and xarray recently gained the ability to append to existing stores. We would love to see some experimentation here.
  7. Where to host the data? Zarr datasets work best and have the lowest cost when they are placed in a cloud object store such as S3 or GCS. If you are going to be accessing the data from ocean.pangeo.io, the data should live in the Google Cloud US-Central region.
  8. How to catalog the cloud datasets? When we were starting this work, we just had a few experimental zarr datasets in the cloud. Now we have hundreds, and we need some sort of catalog. The traditional catalog tools, e.g. THREDDS, ERDDAP, are overkill here, since we don't need to actually serve the data, just catalog where it lives. We are very inspired by the design of STAC (Spatio Temporal Asset Catalog), and there is even nascent support for netCDF-type data in STAC (see "STAC for generic netCDF-type data?", radiantearth/stac-spec#366). For now, our catalog approach is what you found in https://github.com/pangeo-data/pangeo-datastore; we are using intake to catalog the datasets and a custom Sphinx website to display this catalog on the web. The easiest path towards cataloging any new datasets you produce is submitting a PR to that repo. Again, we welcome experimentation in this area.

I hope this list can help you figure out how to move forward. I invite you to attend our weekly Pangeo meetings if you want to discuss things face-to-face.

@rabernat
Member

rabernat commented Aug 6, 2019

Also, I'm curious if you have official support from PO.DAAC to be doing this?

@lewismc
Author

lewismc commented Aug 6, 2019

This is really refreshing to see @rabernat thank you for your detailed thoughts. There is a lot for me to respond to so let me gradually make my way through it

First let me state that all of this zarr / cloud stuff we are doing is pretty new and experimental. We are working towards defining best practices, but there are still a lot of open questions about how to do things.

NASA Earth Science Data Information System (ESDIS) program's Earth Science Data System Working Group (ESDSWG) currently facilitates one WG focused on analysis ready data. I think presenting your work to them would be an excellent step in the right direction. Maybe you already knew about them. Please let me know if you would like to learn more.

In the short term, Zarr written by xarray is our recommended format for analysis-ready data.

By reading the documentation I understood this. It is very interesting, and until I read the Pangeo documentation I was not actually aware of the Zarr format, if I am honest.

Our goal here is to stop thinking about "files" and start thinking about "datasets" holistically.

For analysis purposes we've been going in this direction for about 5 years now as well. The way we described it was that the legacy formats e.g. netCDF are optimized for write performance with the timeseries being written in one pass. This is of course not optimal for analysis where data access patterns are significantly different in nature. In short, getting away from files has been something we've been doing with our R&TD projects for some time so we are on the same page here :)

What chunking to use?...The answer depends on what sort of use cases you anticipate.

Correct. In the Apache SDAP project, the analysis stack NEXUS uses various tiling schemes which are dataset specific. A suitable chunking or data structuring methodology requires inherent knowledge of both the dataset one is working with and what your access patterns are. This is non-trivial in nature and we only really find something that works well after several iterations and comparisons.
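
As a rough illustration of why the chunking choice depends on the access pattern, the sketch below counts how many chunks a rectangular selection touches under two chunking schemes. The grid size (365 x 1800 x 3600) and both chunk shapes are hypothetical and do not describe any specific PO.DAAC dataset or the NEXUS tiling schemes.

```python
from math import prod


def chunks_touched(selection, chunks):
    """Count the chunks intersected by a hyper-rectangular selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunks: chunk length per dimension.
    """
    return prod(
        (stop - 1) // size - start // size + 1
        for (start, stop), size in zip(selection, chunks)
    )


# Hypothetical one-year daily dataset on a (time, lat, lon) grid.
one_map_per_chunk = (1, 1800, 3600)   # mirrors "one file per day"
spatial_tiles = (365, 100, 100)       # full time series per 100 x 100 tile

# Access pattern A: a full-year time series at a single grid point.
point_series = [(0, 365), (900, 901), (1800, 1801)]
# Access pattern B: one complete daily map.
daily_map = [(10, 11), (0, 1800), (0, 3600)]

for name, scheme in [("daily maps", one_map_per_chunk), ("tiles", spatial_tiles)]:
    print(
        name,
        "| point series:", chunks_touched(point_series, scheme),
        "| daily map:", chunks_touched(daily_map, scheme),
    )
```

Under these assumptions the time series touches 365 chunks with daily-map chunking but only 1 with spatial tiles, while the daily map touches 1 chunk versus 648. This trade-off is why a scheme usually has to be compared against the expected queries.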

What compression to use? We would like to see more experimentation in this area.

Maybe working with the PO.DAAC datasets is an opportunity to experiment here...?
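
One way such an experiment could start is by comparing a lossless codec with a simple lossy step on the same field. The sketch below is only a stand-in: it uses `zlib` from the standard library and fixed-precision integer packing rather than zfp, and the synthetic field, the 0.01 precision, and the compression level are all assumptions.

```python
import zlib

import numpy as np

# Synthetic smooth field plus small-scale noise (not real SST data).
rng = np.random.default_rng(0)
lat = np.linspace(-1.0, 1.0, 200)[:, None]
lon = np.linspace(0.0, 2.0 * np.pi, 400)[None, :]
field = (
    20.0 + 8.0 * np.cos(lat) * np.sin(lon) + rng.normal(0.0, 0.05, (200, 400))
).astype("float32")


def ratio(raw_nbytes, payload):
    """Compression ratio of the original float32 size to the zlib output."""
    return raw_nbytes / len(zlib.compress(payload, 6))


# Lossless: compress the float32 bytes as they are.
lossless_ratio = ratio(field.nbytes, field.tobytes())

# Lossy: keep only 0.01 precision by packing into int16, then compress.
precision = 0.01
packed = np.round(field.astype("float64") / precision).astype("int16")
lossy_ratio = ratio(field.nbytes, packed.tobytes())

# The price of the lossy step is a bounded reconstruction error.
restored = packed * precision
max_error = float(np.abs(restored - field).max())

print(f"lossless ratio: {lossless_ratio:.2f}")
print(f"lossy ratio:    {lossy_ratio:.2f} (max error {max_error:.4f})")
```

Whether an error of about half the chosen precision is acceptable is a scientific judgment per variable, which is the part that needs experimentation with real datasets.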

How to produce the zarr datasets?

I will certainly share my experiences on the referenced Github issue.

How to update the zarr datasets?

I have nothing to share right now but we do have several datasets at PO.DAAC which would allow us to experiment in this area as well.

Where to host the data?

I think for the time being, accessing from ocean.pangeo.io is the best place. We will not be attempting to host a dedicated Pangeo stack at PO.DAAC (or in one of our AWS accounts) any time soon I wouldn't imagine. Maybe we can discuss further about hosting some candidate Zarrified PO.DAAC dataset(s) in Google Cloud US-Central region which would allow us to address a few of the experimentation areas stated above?

How to catalog the cloud datasets? ...The easiest path towards cataloging any new datasets you produce is submitting a PR to that repo. Again, we welcome experimentation in this area.

Yes, I agree that the lightweight data catalog is a fitting concept. Once we've reviewed some of the candidates and made some progress on the netCDF/HDF-->Zarr conversion pipeline, we can come back to this one.

I invite you to attend our weekly Pangeo meetings if you want to discuss things face-to-face.

Yes I will attend the next one which is tomorrow Wednesday, Aug 7th, 2019. If you could spare some time to include this item on the agenda it would be excellent.

Also, I'm curious if you have official support from PO.DAAC to be doing this?

We were discussing Pangeo only yesterday and although it is very early days (that is to say that we have known about the Pangeo community and stack for quite some time but that there are few examples which employ PO.DAAC data) I see no reason for us not to push on. There is no official support for us mirroring the entire PO.DAAC archive as Zarr... the justification to do this would need to come from our community and at this point in time there is just no way that we would get the support required. HOWEVER there is certainly value in us prototyping a conversion pipeline, staging data for rapid infusion into ocean.pangeo.io and demonstrating the interactive analysis capability for one or more representative science analyses.

I look forward to joining the call tomorrow.

lewismc changed the title from "Making NASA JPL PO.DAAC datasets available as part of Google Cloud Storage Data Catalog?" to "Making NASA JPL PO.DAAC datasets available as part of ocean.pangeo.io?" on Aug 6, 2019
lewismc changed the title from "Making NASA JPL PO.DAAC datasets available as part of ocean.pangeo.io?" to "Making NASA JPL PO.DAAC datasets available from ocean.pangeo.io?" on Aug 6, 2019
@lewismc
Author

lewismc commented Aug 6, 2019

@rabernat just got confirmation that Mike Gangl (my boss and lead of PO.DAAC Cloud System) will join us as well.

@rabernat
Member

There was quite a bit of interest in the PO.DAAC datasets here at the Pangeo annual meeting in Seattle.

When we last talked, our plan was to brainstorm some desired outcomes for this work. I'll kick that off here:

  • Place analysis-ready PO.DAAC data in cloud storage somewhere
  • Provide data-proximate computing for PO.DAAC data; bypass download step
  • Analyze PO.DAAC data using dask in elastic scaling mode
  • Demonstrate real-world use cases with this technology stack
  • Develop best practices for chunking and compression parameters

@abarciauskas-bgse
Contributor

Following up on @rabernat I have been looking into converting the MUR SST dataset to Zarr using the PO.DAAC opendap server. This conversion is motivated by interest from @cgentemann in using this dataset in some cloud science tutorials.

Here is a gist of the pangeo jupyter notebook that I am working on for converting this dataset: https://gist.github.com/abarciauskas-bgse/95259d8ccce60b452afe4415a476225f

My next step is to look into running this on a separate cluster (i.e. not on hub.pangeo) so I can have it run over a couple of days. Right now reading each time step (a day) is taking at least a minute using the opendap server (@rabernat helped me figure this out), which means doing the whole MUR SST archive (16-17 years) will take 4-5 days by my estimation.
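
The estimate above can be reproduced with some quick arithmetic. The sketch below takes the figures from this comment (one daily time step, about 60 seconds per step, 16 to 17 years) and adds a hypothetical `workers` parameter to show what parallel reads would buy. The parallel case assumes the OPeNDAP server tolerates concurrent requests and that throughput scales linearly, neither of which has been verified here.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(years, seconds_per_step, workers=1):
    """Wall-clock days to read one time step per calendar day of the archive."""
    steps = years * 365.25
    return steps * seconds_per_step / workers / SECONDS_PER_DAY


# Serial read at one minute per daily time step.
for years in (16, 17):
    print(f"{years} years, 1 worker: {estimated_days(years, 60):.2f} days")

# Hypothetical: 8 concurrent readers with ideal linear scaling.
print(f"17 years, 8 workers: {estimated_days(17, 60, workers=8):.2f} days")
```

Since a minute per step is a lower bound, the serial figure of a little over 4 days is consistent with the 4-5 day estimate.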

If anyone has advice on things to look into to speed this up I would be open to them!

@stale

stale bot commented Oct 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 23, 2019
@stale

stale bot commented Oct 30, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Oct 30, 2019
@DaniJonesOcean

Hi all. Has there been any progress on this lately? I'd also like to have access to JPL data via Pangeo. Specifically, sea ice would be excellent.

@cgentemann
Member

Yes, NASA DAACs are moving their data to the cloud. You can find the datasets here: https://search.earthdata.nasa.gov/search?ff=Available%20from%20AWS%20Cloud. There is more information here: https://podaac.jpl.nasa.gov/cloud-datasets/about. There are some tutorials on access, but there isn't yet a clean way to easily use your earthdata login to access it. The tutorials provide some code, but it is a bit complicated. Hopefully a library will come out soon. -chelle


5 participants