Add CaFFe Dataset #2350

nilsleh · 2024-10-16T13:26:26Z

This PR adds the CaFFe (CAlving Fronts and where to Find thEm) dataset and accompanying DataModule for calving front and landscape zone segmentation.

The implementation is for the chipped dataset generated with this script, which I uploaded to Hugging Face.

Dataset features:

* 13,090 train, 2,241 validation, and 3,761 test images
* varying spatial resolution of 6-20 m
* paired binary calving front segmentation masks
* paired multi-class land cover segmentation masks

Dataset format:

* images are single-channel PNGs with dimensions 512x512
* segmentation masks are single-channel PNGs

TODOs:

Check class enumeration and plotting for correct class/colors

Random Sample Plot:

nilsleh · 2024-10-16T13:33:19Z

Hi @Nora-Go,

we would like to add your dataset to torchgeo but I have a couple of questions:

* the imagery and labels come as .png files and I did not find any accompanying metadata files that would give information about the geo-location etc of the imagery and patches, which would be extremely useful. Thus, I am wondering where I might be able to find this information.
* I am not precisely sure about the encoding of the land zone classes. In your repo, the classes are converted to labels here and this comment seems to suggest that for the class label pixel values of [0, 64, 127, 255], the labels are [Glacier, Rock, Ocean/Ice Melange, NA] in that same order, where NA is essentially a background class. Is that correct?

Thanks in advance!

adamjstewart · 2024-10-16T13:32:30Z

torchgeo/datamodules/glacier_calving_front.py

+
+        self.size = size
+
+    def setup(self, stage: str) -> None:


Looks the same as the base class, could prob be removed

adamjstewart · 2024-10-16T13:32:53Z

torchgeo/datasets/glacier_calving_front.py

+
+    mask_dirs = ('fronts', 'zones')
+
+    url = 'https://huggingface.co/datasets/torchgeo/glacier_calving_front/resolve/main/glacier_calving_data.zip'


Replace main with the git commit hash for stability/reproducibility.
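A minimal sketch of what pinning could look like, assuming the usual Hugging Face `resolve/<revision>/` URL layout. `REVISION`, `URL_TEMPLATE`, and `pinned_url` are illustrative names, and the revision shown is a placeholder rather than the real commit hash:

```python
# Hypothetical sketch: build the download URL from a pinned revision.
# REVISION is a placeholder; substitute the actual commit hash of the
# Hugging Face dataset repository.
REVISION = '<commit-hash>'

URL_TEMPLATE = (
    'https://huggingface.co/datasets/torchgeo/glacier_calving_front/'
    'resolve/{revision}/glacier_calving_data.zip'
)


def pinned_url(revision: str) -> str:
    """Return the archive URL for a specific git revision."""
    return URL_TEMPLATE.format(revision=revision)


url = pinned_url(REVISION)
```

With a fixed revision, later pushes to `main` no longer change what the dataset class downloads, so a stored checksum stays valid.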

adamjstewart · 2024-10-16T13:33:56Z

torchgeo/datasets/glacier_calving_front.py

+        0: 'background',
+        64: 'ocean',
+        127: 'rock',
+        254: 'glacier',


Either in the dataset or the datamodule, we need to map these to ordinal numbers, correct? I've been meaning to add a transform for this since it comes up so often.

I marked this as a draft because I am not 100% sure here, which is why I asked the author. But yes, you are correct: the code for mapping these to ordinal numbers is missing, and I will add it once I get the answer :)
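For reference, a minimal sketch of the ordinal mapping discussed in this thread, assuming the zone masks contain only the pixel values 0, 64, 127, and 254 from the diff above. `PIXEL_VALUES` and `to_ordinal` are illustrative names, not the code in this PR, and NumPy stands in for whatever tensor type the dataset ends up using:

```python
import numpy as np

# Raw pixel values in the zone masks, in ascending order.
# Assumption: these four values are the only ones that occur.
PIXEL_VALUES = (0, 64, 127, 254)


def to_ordinal(mask: np.ndarray) -> np.ndarray:
    """Map raw uint8 mask values to ordinal class indices 0..3."""
    # 256-entry lookup table; any unexpected pixel value falls back to 0.
    lut = np.zeros(256, dtype=np.uint8)
    for index, value in enumerate(PIXEL_VALUES):
        lut[value] = index
    return lut[mask]


raw = np.array([[0, 64], [127, 254]], dtype=np.uint8)
ordinal = to_ordinal(raw)
```

The lookup table only fixes the indices. Which class name belongs to which index still depends on the author's answer.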

Nora-Go · 2024-10-17T12:16:57Z

Hi @nilsleh,
First of all - cool! :D happy that the access to the dataset gets easier and that you take the time! Thank you :)

* the imagery and labels come as `.png` files and I did not find any accompanying metadata files that would give information about the geo-location etc of the imagery and patches, which would be extremely useful. Thus, I am wondering where I might be able to find this information.

This information is not publicly available so far. If you want I can send you the geo tiffs - those have the information and you can extract them to include them here.

* I am not precisely sure about the encoding of the land zone classes. In your repo, the classes are converted to labels [here](https://github.com/Nora-Go/Calving_Fronts_and_Where_to_Find_Them/blob/de8e8c36292470fd4037501155db2f5d4aa8ef13/validate_or_test.py#L67) and [this comment](https://github.com/Nora-Go/Calving_Fronts_and_Where_to_Find_Them/blob/de8e8c36292470fd4037501155db2f5d4aa8ef13/models/zones_segmentation_model.py#L10) seems to suggest that for the class label pixel values of [0, 64, 127, 255], the labels are [Glacier, Rock, Ocean/Ice Melange, NA] in that same order, where NA is essentially a background class. Is that correct?

It is 0: NA (no information available - e.g. you don't see anything in the image, but in reality it would be one of the following), 64: Rock outcrop, 127: Glacier, and 254: Ocean/Ice Melange

I have some questions myself:
Do you have to provide the patches instead of the full images? And a predefined val split (using the one that I used for the baseline)? At the moment I'm actually using patches of size 512 x 512 and that works better. And the val split I used so far would not work if you wanted to take advantage of the time series information :)

nilsleh · 2024-10-18T06:16:45Z

@Nora-Go thank you for the reply. Ah, too bad that the metadata is not available right away. I won't have the time to do a deep dive into the dataset; I was just looking for an interesting task with good labels for an evaluation.

However, if you upload the geo tiffs somewhere (could also be the Hugging Face repo), someone can add it at a later time, this time as a GeoDataset where patch sampling then happens on the fly with a GeoSampler. Since I was interested in just a "benchmark" evaluation, a predefined split is better for consistency and comparability. I can rerun your patching script to include the dataset as a 512x512 version if you prefer that?

If you are interested in adding any of these parts yourself, and should you have questions about that, feel free to reach out.

Nora-Go · 2024-10-18T07:28:51Z

@nilsleh ah I see. Ok. Do you need the metadata already for using it as a benchmark? If yes, I can try to extract what you need. Otherwise I'll see how I'll deal with the geo tiffs for a future GeoDataset.

For using the dataset as a benchmark, do you want to compare against my baseline (then the 256 x 256 is good) or do you want to compare against the state-of-the-art? The state-of-the-art (https://ieeexplore.ieee.org/abstract/document/10440599) uses a mixture between 256x256 and 512x512.

Let me know if you have any further questions or need help handling the dataset :)

nilsleh · 2024-10-18T08:04:01Z

Having latitude, longitude, and time available as metadata would already be really helpful for further downstream evaluation, so that would be great already in the patched dataset version, for example as a csv or json file with png filenames mapping to that information (but other formats work as well of course).

I think it's fine to go with 512x512 for now as the evaluation would be more an internal comparison of models, and if you are using the 512x512 version in your research now as well, this might be more up to date.

Also in the linked paper, I see that you name the dataset CaFFe, should we use that name as well here?
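To make the metadata request above concrete, here is a rough sketch of a CSV that maps PNG filenames to latitude, longitude, and time, and how it could be read. The column names, filenames, and values are made up for illustration and are not an agreed schema or real dataset entries:

```python
import csv
import io

# Hypothetical file layout: one row per PNG patch. Column names and
# all values below are invented placeholders, not real metadata.
EXAMPLE_CSV = """filename,lat,lon,time
patch_0001.png,-64.5,-60.1,2015-03-14
patch_0002.png,69.2,-49.8,2018-07-02
"""


def load_metadata(handle) -> dict[str, dict[str, object]]:
    """Index metadata rows by PNG filename."""
    metadata = {}
    for row in csv.DictReader(handle):
        metadata[row['filename']] = {
            'lat': float(row['lat']),
            'lon': float(row['lon']),
            'time': row['time'],  # kept as a string; parse as needed
        }
    return metadata


metadata = load_metadata(io.StringIO(EXAMPLE_CSV))
```

Keying by filename would let the dataset attach this information to each sample without changing the PNG files themselves.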

Nora-Go · 2024-10-21T15:07:58Z

Yes, I guess 512x512 would be better :) and yes, it would be great if you could use the name CaFFe (which stands for "CAlving Fronts and where to Find thEm")! Thank you!

I'll provide you with a csv - just give me a little time :) (I'm just waiting for a response from a collaborator)

nilsleh added 2 commits October 16, 2024 13:18

glacier ds

be55b94

typo

3ef9d7e

github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing datamodules PyTorch Lightning datamodules labels Oct 16, 2024

nilsleh marked this pull request as draft October 16, 2024 13:27

adamjstewart added this to the 0.7.0 milestone Oct 16, 2024

adamjstewart reviewed Oct 16, 2024


quick review

a66c9a7

nilsleh added 2 commits October 21, 2024 15:01

ordinal class map and plotting

17ceb7e

merge main

aad4550

rename to caffe

26633a8

nilsleh changed the title ~~Add Glacier Calving Front Dataset~~ Add CaFFe Dataset Oct 21, 2024

nilsleh added 6 commits October 21, 2024 16:14

more rename

d673470

datamodule test without trainer

e6004be

forgot a rename

a12cf3b

test for dm

f9e8188

test loader

b85016b

docs target dataset name

a9a84ed

nilsleh marked this pull request as ready for review October 21, 2024 20:03
