Commit
LightningDataModule to load GeoTIFF files (#52)
* ➕ Add torchdata

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!

* ♻️ Refactor test_model_vit to use datapipe fixture

Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.

* ✨ Implement GeoTIFFDataPipeModule

Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. Using the FileLister DataPipe to iterate over *.tif files in the data/ folder, and do a random 80/20 split for the training and validation set. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to avoid an extra layer of overhead in the data loading.

* 🧵 Allow configuring num_workers in DataLoader

Enable setting the number of subprocesses used for data loading. Defaults to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.

* 📌 Install torchdata=0.7.1 from conda-forge instead of PyPI

Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.

* 🔧 Allow configuring data path containing the GeoTIFF files

Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.

* ✅ Add unit test for GeoTIFFDataModule

Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.
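For illustration only, the recursive file listing and random 80/20 split described above can be approximated with the standard library. This is a minimal sketch and not the code in this commit: it swaps torchdata's FileLister and its split for `pathlib` and `random`, and the function name `split_geotiff_paths`, the `train_fraction` parameter, and the `seed` default are hypothetical.

```python
import random
from pathlib import Path


def split_geotiff_paths(data_path="data/", train_fraction=0.8, seed=42):
    """Recursively list *.tif files and split them into train/val lists.

    Hypothetical helper: a stdlib stand-in for the torchdata pipeline,
    not the implementation used by the LightningDataModule.
    """
    # rglob descends into nested directories, similar in spirit to recursive=True.
    # Sorting first keeps the shuffle reproducible across filesystems.
    paths = sorted(Path(data_path).rglob("*.tif"))

    # Shuffle with a dedicated, seeded generator so the split is repeatable.
    random.Random(seed).shuffle(paths)

    # The first train_fraction of the shuffled paths go to training, the rest to validation.
    n_train = int(len(paths) * train_fraction)
    return paths[:n_train], paths[n_train:]
```

Calling `split_geotiff_paths("data/56HKH")` would return two lists of `Path` objects; reading each file into an array (for example with rasterio) and batching would happen in later steps that this sketch does not cover.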