
Add PANDASmall dataset #664

Merged: 12 commits into main, Oct 8, 2024
Conversation

@nkaenzig (Collaborator) commented Oct 7, 2024

Closes #662

  • Uses only 20% of all slides (~2,000)
  • 200 instead of 1,000 patches per slide (determined experimentally that this still yields similar results)

-> This results in 25× fewer patches, so it runs approximately 25× faster than the full PANDA benchmark, given that patch embedding generation accounts for most of the compute time.
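The 25× figure can be checked with a quick back-of-envelope sketch. Note that the full-benchmark slide count of ~10,000 is an assumption inferred from "20% of all slides (~2,000)"; it does not appear in the PR itself:

```python
# Back-of-envelope check of the 25x patch reduction.
# ASSUMPTION: the full PANDA benchmark uses ~10,000 slides
# (inferred from "20% of all slides (~2,000)" in the PR description).
full_slides = 10_000
small_slides = int(full_slides * 0.20)      # ~2,000 slides in PANDASmall

full_patches = full_slides * 1_000          # 1,000 patches per slide (full)
small_patches = small_slides * 200          # 200 patches per slide (small)

reduction = full_patches / small_patches
print(small_slides, reduction)              # 2000 25.0
```

Since patch embedding dominates the runtime, the wall-clock speedup tracks the patch-count reduction almost directly.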

@nkaenzig nkaenzig linked an issue Oct 7, 2024 that may be closed by this pull request
@nkaenzig nkaenzig marked this pull request as ready for review October 7, 2024 10:03
@roman807 (Collaborator) commented Oct 7, 2024

Thanks @nkaenzig, looks good. How did you determine the data size (20% of slides & 200 patches)? Do we know that, for example, 10% of the data or 100 patches would not be sufficient?

@nkaenzig (Collaborator, Author) commented Oct 7, 2024

@roman807 Good question.

The number of patches (200) was determined experimentally:

[figure: downstream performance vs. number of patches per slide]

You can see in this graphic that there is a significant performance drop when going from 200 down to only 100 patches.

Regarding the 20% question: this dataset has 6 classes, and we want to make sure that each of the train, val & test splits still has sufficient examples per class. Using the current ratio, we have 166 WSIs per class in the train set and 83 per class in each of val and test. Especially for the val/test sets I don't want to go lower in sample count. Also, at 20% the evaluation runtime becomes reasonable: for ViT-S inference, eva predict only takes around 5-10 min, while for ViT-G (giant) it takes around 2 hours.
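The per-class counts quoted above imply a 50/25/25 train/val/test split over roughly 2,000 slides. That split ratio is an inference from the counts (166 + 83 + 83 = 332 per class), not something stated in the PR:

```python
# Sketch: per-class split sizes for PANDASmall.
# ASSUMPTION: a 50/25/25 train/val/test split, inferred from the
# quoted per-class counts (166 train, 83 val, 83 test).
num_classes = 6
total_slides = 1_992                        # ~20% of the full dataset
per_class = total_slides // num_classes     # 332 slides per class

train_per_class = int(per_class * 0.50)     # 166
val_per_class = int(per_class * 0.25)       # 83
test_per_class = int(per_class * 0.25)      # 83

print(train_per_class, val_per_class, test_per_class)  # 166 83 83
```

With 83 samples per class in val and test, each evaluation metric is still computed on ~500 slides, which is about as low as one can go before per-class estimates get noisy.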

@roman807 (Collaborator) commented Oct 7, 2024

Thinking about terminology: should we use "small" instead of "tiny"? I think "tiny" usually refers to something very small, e.g. minimal data for a unit or integration test.

@nkaenzig nkaenzig changed the title Add PANDATiny dataset Add PANDASmall dataset Oct 7, 2024
@nkaenzig nkaenzig requested a review from roman807 October 7, 2024 14:47
@nkaenzig nkaenzig self-assigned this Oct 8, 2024
@nkaenzig nkaenzig enabled auto-merge (squash) October 8, 2024 08:06
@nkaenzig nkaenzig merged commit be6dc72 into main Oct 8, 2024
6 checks passed
@nkaenzig nkaenzig deleted the 662-create-a-tiny-version-of-the-panda-dataset branch October 8, 2024 08:10