-
Notifications
You must be signed in to change notification settings - Fork 333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RandomGeoSampler for Pre-chipped Datasets #1976
Conversation
We should do this for all the datamodules you're using, not just these two. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, is there anything else you wanted to add to this PR or can I merge?
Tests fail because the fake data consists of a single file and 10% of 1 is 0. |
That's all I want to add to this PR. Thanks for reviewing the code, @adamjstewart! |
This PR adds 1K new files to git. Do we really need that many files? |
If we change the split portion from [0.8, 0.1, 0.1] to something like [0.6, 0.2, 0.2], we can reduce 50% of test images. That being said, are 500 images too many to be added to git? We can manually change the split portion later using [0.8, 0.1, 0.1]. |
We shouldn't change the default split just to make testing easier. If you want to make the split ratio an input parameter (like we do for many other data modules), then we could use [0.33, 0.33, 0.33] only in CI and use [0.8, 0.1, 0.1] by default.
I'm fine with at least 10 images, I'm less fine with at least 1000 images. |
@yichiac is this still needed? |
I think it is still better to include this change. If we can reduce the number of test files, it would be great to have |
I don't think the sampler is important here, the splitter is what's important. I just feel like most users will either use raw Sentinel-2 tiles or write their own NonGeoDataset for the pre-chipped data. GeoDatasets don't make as much sense for pre-chipped data, they're really for raw data. |
@yichiac We should make a decision on this PR before the TorchGeo 0.6.0 release next week. Basically, it boils down to one question. Are these data modules designed for working with:
If 1, this PR should be closed, as the data modules already work as intended. If 2, we need to document this. We could also consider automatically downloading the harmonized version of the dataset in |
I lean to option 1. This PR could be closed. |
This PR replaces
RandomBatchGeoSampler
withRandomGeoSampler
for Agrifieldnet, Sentinel2_CDL, Sentinel2_NCCM, Sentinel2_EuroCrops, and Sentinel2_SouthAmericaSoybean datamodules.RandomGeoSampler
works better for the pre-chipped datasets because it returns the correct random bounding boxes.