LightningDataModule to load GeoTIFF files #52

weiji14 · 2023-11-24T06:31:54Z

What I am changing

A LightningDataModule to load GeoTIFF data

How I did it

Using torchdata to construct the DataPipe
GeoTIFF files are read using rasterio
Train/validation split is 80%/20%

TODO:

Install torchdata dependency
Initial implementation of GeoTIFFDataPipeModule
Add extra parameters to control DataLoader (e.g. num_workers)
Add unit tests
Refactor to load GeoTIFF data from s3 bucket instead of local drive
etc

Notes:

Have tried using rioxarray to read the GeoTIFFs, but seems a little slower than rasterio
Also experimented with loading from NetCDF files using xarray's h5netcdf engine (about same speed as rioxarray loading GeoTIFF)
Fastest seems to be loading from Zarr, but would require re-formatting of data, so leaving that to a future PR.

How you can test it

Download the GeoTIFF files from the s3 bucket (TODO add instructions)
Run `python trainer.py fit --trainer.max_epochs=20 --trainer.precision=16-mixed --data.data_path=data --data.batch_size=32 --data.num_workers=8` locally

Related Issues

References:

https://zen3geo.readthedocs.io/en/v0.6.2/walkthrough.html

Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. The FileLister DataPipe iterates over *.tif files in the data/ folder, followed by a random 80/20 split into the training and validation sets. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to avoid an extra layer of overhead in the data loading.
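The file listing and 80/20 split described above can be sketched with the standard library alone. This is a stand-in for illustration, not the torchdata code in this PR: `list_geotiffs` and `split_train_val` are hypothetical helper names, `pathlib` globbing replaces the FileLister DataPipe, and a seeded shuffle replaces the DataPipe's random split. The seed value and the `recursive` default are assumptions.

```python
import random
import tempfile
from pathlib import Path


def list_geotiffs(data_path: str, recursive: bool = True) -> list[Path]:
    """Return sorted *.tif paths under data_path (stand-in for FileLister)."""
    root = Path(data_path)
    paths = root.rglob("*.tif") if recursive else root.glob("*.tif")
    return sorted(paths)


def split_train_val(paths, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle deterministically, then cut into (train, validation) lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Build a throwaway folder of empty placeholder files to split.
    with tempfile.TemporaryDirectory() as tmpdir:
        for i in range(10):
            tile_dir = Path(tmpdir) / f"tile_{i % 2}"
            tile_dir.mkdir(exist_ok=True)
            (tile_dir / f"chip_{i}.tif").touch()
        train, val = split_train_val(list_geotiffs(tmpdir))
        print(f"{len(train)} training files, {len(val)} validation files")
```

Sorting before the seeded shuffle keeps the split reproducible regardless of the order in which the filesystem returns files.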

yellowcap · 2023-11-24T11:52:48Z

src/datamodule.py

+    # GeoTIFF - Rasterio
+    with rasterio.open(fp=filepath) as dataset:
+        array: np.ndarray = dataset.read()
+        tensor: torch.Tensor = torch.as_tensor(data=array.astype(dtype="float16"))


Is float32 to float16 a safe transformation?

weiji14 · 2023-11-26

There will be some loss of floating point precision, but we'll likely be using 16-bit precision training (see https://lightning.ai/docs/pytorch/2.1.0/common/precision_intermediate.html) to speed up the model training, so best to pre-emptively convert the data to float16 dtype here.
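To make the precision trade-off concrete, here is a minimal sketch using only the standard library. It uses the `struct` module's `"e"` format (IEEE 754 half precision) as a stand-in for the `astype(dtype="float16")` call in the diff; `to_float16` and `fits_float16` are hypothetical helper names. One behavioural difference to keep in mind: `struct` raises `OverflowError` for values beyond the float16 range, whereas NumPy's `astype` would produce `inf` instead.

```python
import struct


def to_float16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_float16(value: float) -> bool:
    """Return False if a finite value overflows the float16 range."""
    try:
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


if __name__ == "__main__":
    # Small fractions lose digits: float16 has an 11-bit significand.
    print("0.1 as float16:", to_float16(0.1))
    # Above 2048, consecutive integers are no longer all representable.
    print("2049.0 as float16:", to_float16(2049.0))
    # The largest finite float16 value is 65504.
    print("65504.0 fits:", fits_float16(65504.0))
    print("70000.0 fits:", fits_float16(70000.0))
```

Whether this matters depends on the value range of the rasters; if pixel values can exceed 65504, rescaling before the cast would be worth considering.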

yellowcap · 2023-11-24T11:58:43Z

src/datamodule.py

    """
+    # GeoTIFF - Rasterio
+    with rasterio.open(fp=filepath) as dataset:


Did you experiment with other file openers? Maybe the loader gets more stable if we use

    from skimage import io
    im = io.imread(filepath)

or other TIFF-specific loaders like

https://pypi.org/project/tifffile/

Could be worth a shot to see if that helps stabilize the loader when compared to zarr.

weiji14 · 2023-11-26

So skimage's imread actually uses tifffile behind the scenes for reading TIFF files, see https://github.com/scikit-image/scikit-image/blob/441fe68b95a86d4ae2a351311a0c39a4232b6521/skimage/io/_io.py#L16-L68, but I know that tifffile has some issues with multiprocessing/threading, see cgohlke/tifffile#215. Will also need to see if skimage/tifffile supports reading from s3 buckets directly like rasterio/GDAL. Found this thread cgohlke/tifffile#125 which looks interesting.

Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.

weiji14 · 2023-11-28T07:20:40Z

Refactor to load GeoTIFF data from s3 bucket instead of local drive

Decided to handle reading from s3 in a separate PR, because it was about 10x slower than reading from a local disk, even from within the same us-east-1 region. Specifically, mini-batches were processed at about 0.2it/s when reading from s3, compared to ~2it/s from a local data folder. Might need to play with some I/O or networking related settings.
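For comparing data sources when tuning those settings, a small timing helper along these lines could report iterations per second for any iterable of batches. This is only a sketch: `iterations_per_second` and `max_batches` are hypothetical names, and the `slow_batches` generator merely simulates a slow source with `time.sleep` instead of real s3 or disk reads.

```python
import time
from typing import Iterable


def iterations_per_second(batches: Iterable, max_batches: int = 50) -> float:
    """Consume up to max_batches items and return the observed it/s rate."""
    start = time.perf_counter()
    count = 0
    for count, _ in enumerate(batches, start=1):
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def slow_batches(n: int, delay: float):
    """Simulated data source that waits `delay` seconds before each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    rate = iterations_per_second(slow_batches(5, delay=0.02))
    print(f"{rate:.1f} it/s")
```

Passing a DataLoader in place of `slow_batches` would measure loading throughput on its own, without the model's forward and backward passes.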

weiji14 · 2023-11-28T07:24:47Z

Again, merging directly in the interest of speed. Will refactor to try other drivers (following discussion at #52 (comment)) later.

weiji14 added 3 commits November 24, 2023 18:31

➕ Add torchdata

a99b6e8

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!

♻️ Refactor test_model_vit to use datapipe fixture

7988293

Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.

weiji14 added the data-pipeline label (Pull Requests about the data pipeline) Nov 24, 2023

weiji14 self-assigned this Nov 24, 2023

weiji14 changed the title ~~Implement GeoTIFFDataPipeModule~~ LightningDataModule to load GeoTIFF files Nov 24, 2023

yellowcap reviewed Nov 24, 2023

weiji14 added 4 commits November 27, 2023 12:00

🧵 Allow configuring num_workers in DataLoader

8b4155e

Enable setting the number of subprocesses used for data loading. Default to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.

📌 Install torchdata=0.7.1 from conda-forge instead of PyPI

e52c9a4

Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.

✅ Add unit test for GeoTIFFDataModule

80a6130

Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.

weiji14 marked this pull request as ready for review November 28, 2023 07:17

weiji14 merged commit be426c1 into main Nov 28, 2023
1 check passed

weiji14 deleted the geotiff-datapipe branch November 28, 2023 07:25
