
Add PANDASmall dataset #664

Merged: 12 commits into main, Oct 8, 2024
Conversation

@nkaenzig (Collaborator) commented Oct 7, 2024

Closes #662

  • Uses only 20% of all slides (~2,000)
  • 200 instead of 1,000 patches per slide (determined experimentally that this still yields similar results)

-> This results in 25× fewer patches, so it runs approximately 25× faster than the full PANDA benchmark, given that patch embedding generation accounts for most of the compute time.
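The 25× figure can be checked with a quick back-of-envelope sketch. Note that the full-benchmark slide count of ~10,000 is an assumption inferred from "20% of all slides (~2,000)"; it does not appear in the PR itself:

```python
# Back-of-envelope check of the 25x patch reduction.
# ASSUMPTION: the full PANDA benchmark uses ~10,000 slides
# (inferred from "20% of all slides (~2,000)" in the PR description).
full_slides = 10_000
small_slides = int(full_slides * 0.20)      # ~2,000 slides in PANDASmall

full_patches = full_slides * 1_000          # 1,000 patches per slide (full)
small_patches = small_slides * 200          # 200 patches per slide (small)

reduction = full_patches / small_patches
print(small_slides, reduction)              # 2000 25.0
```

Since patch embedding dominates the runtime, the wall-clock speedup tracks the patch-count reduction almost directly.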

@nkaenzig nkaenzig linked an issue Oct 7, 2024 that may be closed by this pull request
@nkaenzig nkaenzig marked this pull request as ready for review October 7, 2024 10:03
@roman807 (Collaborator) commented Oct 7, 2024

Thanks @nkaenzig, looks good. How did you determine the data size (20% of slides & 200 patches)? Do we know that, for example, 10% of the data or 100 patches would not be sufficient?

@nkaenzig (Collaborator, Author) commented Oct 7, 2024

@roman807 Good question.

The number of patches (200) was determined experimentally:

[figure: downstream performance vs. number of patches per slide]

You can see in this graphic that there is a significant performance drop when going from 200 down to only 100 patches.

Regarding the 20% question: this dataset has 6 classes, and we want to make sure that each of the train, val & test splits still has sufficient examples per class. Using the current ratio, we have 166 WSIs per class in the train set and 83 per class in each of val and test. Especially for the val/test sets I don't want to go lower in sample count. Also, at 20% the evaluation runtime becomes reasonable: for ViT-S inference, eva predict only takes around 5-10 min, while for ViT-G (giant) it takes around 2 hours.
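The per-class counts quoted above imply a 50/25/25 train/val/test split over roughly 2,000 slides. That split ratio is an inference from the counts (166 + 83 + 83 = 332 per class), not something stated in the PR:

```python
# Sketch: per-class split sizes for PANDASmall.
# ASSUMPTION: a 50/25/25 train/val/test split, inferred from the
# quoted per-class counts (166 train, 83 val, 83 test).
num_classes = 6
total_slides = 1_992                        # ~20% of the full dataset
per_class = total_slides // num_classes     # 332 slides per class

train_per_class = int(per_class * 0.50)     # 166
val_per_class = int(per_class * 0.25)       # 83
test_per_class = int(per_class * 0.25)      # 83

print(train_per_class, val_per_class, test_per_class)  # 166 83 83
```

With 83 samples per class in val and test, each evaluation metric is still computed on ~500 slides, which is about as low as one can go before per-class estimates get noisy.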

@roman807 (Collaborator) commented Oct 7, 2024

Thinking about terminology: should we use "small" instead of "tiny"? I think "tiny" usually refers to something very small, e.g. minimal data for a unit or integration test.

@nkaenzig nkaenzig changed the title Add PANDATiny dataset Add PANDASmall dataset Oct 7, 2024
@nkaenzig nkaenzig requested a review from roman807 October 7, 2024 14:47
@nkaenzig nkaenzig self-assigned this Oct 8, 2024
@nkaenzig nkaenzig enabled auto-merge (squash) October 8, 2024 08:06
@nkaenzig nkaenzig merged commit be6dc72 into main Oct 8, 2024
6 checks passed
@nkaenzig nkaenzig deleted the 662-create-a-tiny-version-of-the-panda-dataset branch October 8, 2024 08:10