Add CaFFe Dataset #2350

Open
wants to merge 12 commits into
base: main

Collaborator

@nilsleh nilsleh commented Oct 16, 2024

This PR adds the CaFFe (CAlving Fronts and where to Find thEm) dataset and an accompanying DataModule for calving front and landscape zone segmentation.

The implementation uses the chipped version of the dataset, generated with this script, which I uploaded to Hugging Face.

Dataset features:

  • 13,090 train, 2,241 validation, and 3,761 test images
  • varying spatial resolution of 6-20m
  • paired binary calving front segmentation masks
  • paired multi-class land cover segmentation masks

Dataset format:

  • images are single-channel pngs with dimension 512x512
  • segmentation masks are single-channel pngs

TODOs:

  • Check class enumeration and plotting for correct class/colors

Random Sample Plot:
[Image: glacier_test]

@github-actions github-actions bot added the documentation, datasets, testing, and datamodules labels Oct 16, 2024
@nilsleh nilsleh marked this pull request as draft October 16, 2024 13:27
@adamjstewart adamjstewart added this to the 0.7.0 milestone Oct 16, 2024
Collaborator Author

nilsleh commented Oct 16, 2024

Hi @Nora-Go,

we would like to add your dataset to torchgeo but I have a couple of questions:

  • the imagery and labels come as .png files, and I did not find any accompanying metadata files with information such as the geolocation of the imagery and patches, which would be extremely useful. Where might I be able to find this information?
  • I am not precisely sure about the encoding of the land zone classes. In your repo, the classes are converted to labels here and this comment seems to suggest that for the class label pixel values of [0, 64, 127, 255], the labels are [Glacier, Rock, Ocean/Ice Melange, NA] in that same order, where NA is essentially a background class. Is that correct?

Thanks in advance!


self.size = size

def setup(self, stage: str) -> None:
Collaborator

This looks the same as the base class and could probably be removed.


mask_dirs = ('fronts', 'zones')

url = 'https://huggingface.co/datasets/torchgeo/glacier_calving_front/resolve/main/glacier_calving_data.zip'
Collaborator

Replace main with the git commit hash for stability/reproducibility.
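
A minimal sketch of what such pinning could look like, assuming the `resolve/<revision>/` layout already visible in the URL above. The `pinned_url` helper and the `REVISION` placeholder are hypothetical; the real commit hash would have to be taken from the history of the dataset repository.

```python
def pinned_url(repo_id: str, revision: str, filename: str) -> str:
    """Build a Hugging Face dataset download URL for a fixed revision."""
    return f'https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}'


# Placeholder only: substitute the actual commit hash of the dataset repo.
REVISION = '<commit-hash>'

url = pinned_url(
    'torchgeo/glacier_calving_front', REVISION, 'glacier_calving_data.zip'
)
print(url)
```

Unlike `main`, a commit hash keeps pointing at the same file contents even if the archive is re-uploaded later, so a stored checksum stays valid.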

0: 'background',
64: 'ocean',
127: 'rock',
254: 'glacier',
Collaborator

Either in the dataset or the datamodule, we need to map these to ordinal numbers, correct? I've been meaning to add a transform for this since it comes up so often.

Collaborator Author

@nilsleh nilsleh Oct 16, 2024

I marked this as a draft because I am not 100% sure here, which is why I asked the author. But yes, you are correct: the code for mapping these values to ordinal numbers is missing, and I will add it once I get the answer :)


Nora-Go commented Oct 17, 2024

Hi @nilsleh,
First of all - cool! :D happy that the access to the dataset gets easier and that you take the time! Thank you :)

* the imagery and labels come as `.png` files and I did not find any accompanying metadata files that would give information about the geo-location etc of the imagery and patches, which would be extremely useful. Thus, I am wondering where I might be able to find this information.

This information is not publicly available so far. If you want I can send you the geo tiffs - those have the information and you can extract them to include them here.

* I am not precisely sure about the encoding of the land zone classes. In your repo, the classes are converted to labels [here](https://github.com/Nora-Go/Calving_Fronts_and_Where_to_Find_Them/blob/de8e8c36292470fd4037501155db2f5d4aa8ef13/validate_or_test.py#L67) and [this comment](https://github.com/Nora-Go/Calving_Fronts_and_Where_to_Find_Them/blob/de8e8c36292470fd4037501155db2f5d4aa8ef13/models/zones_segmentation_model.py#L10) seems to suggest that for the class label pixel values of [0, 64, 127, 255], the labels are [Glacier, Rock, Ocean/Ice Melange, NA] in that same order, where NA is essentially a background class. Is that correct?

It is 0: NA (no information available - e.g. you don't see anything in the image, but in reality it would be one of the following), 64: Rock outcrop, 127: Glacier, and 254: Ocean/Ice Melange

I have some questions myself:
Do you have to provide the patches instead of the full images? And a predefined val split (using the one that I used for the baseline)? At the moment I'm actually using patches of size 512 x 512 and that works better. And the val split I used so far would not work if you wanted to take advantage of the time series information :)
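
Using the encoding given in this reply (0: NA, 64: rock outcrop, 127: glacier, 254: ocean/ice melange), the mapping to ordinal class indices raised in the review could be sketched as follows. This is only an illustration in plain Python, not the code from the PR: the helper names, the choice to order classes by ascending pixel value, and the use of 255 as a marker for unexpected values are all assumptions.

```python
# Pixel values and class names as given in this thread.
ZONE_PIXEL_VALUES = {
    0: 'NA',
    64: 'rock outcrop',
    127: 'glacier',
    254: 'ocean/ice melange',
}


def build_lookup_table(pixel_values) -> bytes:
    """Return a 256-entry table mapping each known pixel value to an ordinal index."""
    # Assumption: ordinal indices follow the ascending order of the pixel values.
    ordinals = {value: index for index, value in enumerate(sorted(pixel_values))}
    # Assumption: unexpected values map to 255 so they are easy to spot afterwards.
    return bytes(ordinals.get(value, 255) for value in range(256))


def to_ordinal(mask: bytes, table: bytes) -> bytes:
    """Remap a single-channel 8-bit mask buffer through the lookup table."""
    return mask.translate(table)


table = build_lookup_table(ZONE_PIXEL_VALUES)
raw_mask = bytes([0, 64, 127, 254, 254, 0])  # toy stand-in for PNG mask pixels
ordinal_mask = to_ordinal(raw_mask, table)
print(list(ordinal_mask))  # [0, 1, 2, 3, 3, 0]
```

With array or tensor masks, the same idea would apply by indexing a 256-element lookup array with the mask values.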

Collaborator Author

nilsleh commented Oct 18, 2024

@Nora-Go thank you for the reply. Too bad that the metadata is not available right away. I won't have time for a deep dive into the dataset; I was just looking for an interesting task with good labels for an evaluation.

However, if you upload the GeoTIFFs somewhere (this could also be the Hugging Face repo), someone can add them at a later time, this time as a GeoDataset where patch sampling happens on the fly with a GeoSampler. Since I was only interested in a "benchmark" evaluation, a predefined split is better for consistency and comparability. I can rerun your patching script to include the dataset as a 512x512 version if you prefer that?

If you are interested in adding any of these parts yourself, and should you have questions about that, feel free to reach out.


Nora-Go commented Oct 18, 2024

@nilsleh ah I see. Ok. Do you already need the metadata for using it as a benchmark? If yes, I can try to extract what you need. Otherwise I'll see how I'll deal with the GeoTIFFs for a future GeoDataset.

For using the dataset as a benchmark, do you want to compare against my baseline (then the 256 x 256 is good) or do you want to compare against the state-of-the-art? The state-of-the-art (https://ieeexplore.ieee.org/abstract/document/10440599) uses a mixture between 256x256 and 512x512.

Let me know if you have any further questions or need help handling the dataset :)

Collaborator Author

nilsleh commented Oct 18, 2024

Having latitude, longitude, and time available as metadata would already be really helpful for further downstream evaluation, so it would be great to have that in the patched dataset version, for example as a CSV or JSON file that maps PNG filenames to that information (other formats work as well, of course).

I think it's fine to go with 512x512 for now as the evaluation would be more an internal comparison of models, and if you are using the 512x512 version in your research now as well, this might be more up to date.

Also in the linked paper, I see that you name the dataset CaFFe, should we use that name as well here?
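
To make the request concrete, here is a rough sketch of how such a CSV sidecar could be read and keyed by PNG filename. The column names (`filename`, `latitude`, `longitude`, `time`), the file layout, and the rows below are made-up placeholders, not the actual CaFFe metadata, which had not been provided at this point in the thread.

```python
import csv
import io

# Made-up example rows; the real file, its columns, and its values may differ.
EXAMPLE_CSV = """filename,latitude,longitude,time
example_patch_0.png,0.0,0.0,2000-01-01
example_patch_1.png,0.0,0.0,2000-01-02
"""


def load_metadata(handle) -> dict[str, dict[str, object]]:
    """Read the CSV and return a mapping from PNG filename to its metadata."""
    metadata = {}
    for row in csv.DictReader(handle):
        metadata[row['filename']] = {
            'latitude': float(row['latitude']),
            'longitude': float(row['longitude']),
            'time': row['time'],  # kept as a string; the date format is unknown
        }
    return metadata


metadata = load_metadata(io.StringIO(EXAMPLE_CSV))
print(metadata['example_patch_0.png'])
```

A dataset class could then look up the entry for each patch by its filename and return it alongside the image and masks.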


Nora-Go commented Oct 21, 2024

Yes, I guess 512x512 would be better :) and yes, it would be great if you could use the name CaFFe (which stands for "CAlving Fronts and where to Find thEm")! Thank you!

I'll provide you with a csv - just give me a little time :) (I'm just waiting for a response from a collaborator)

@nilsleh nilsleh changed the title from "Add Glacier Calving Front Dataset" to "Add CaFFe Dataset" Oct 21, 2024
@nilsleh nilsleh marked this pull request as ready for review October 21, 2024 20:03
Labels
datamodules, datasets, documentation, testing
3 participants