Add tests for MNIST data loader and update data loading functionality #138
base: main
Conversation
Hi @ketayon, thank you for the contribution. I will get back to reviewing as soon as I can. However, are you in a rush to get this done before hacktoberfest gets over?
Hello, no, I don’t take part in hacktoberfest. (Replied by email on 22 Oct 2024 to Saasha Joshi’s comment above.)
Hi @ketayon, I have been reviewing this PR, and it made me realize that I would prefer loading the train and test datasets separately, as PyTorch suggests. Like here:
piQture/piqture/data_loader/mnist_data_loader.py Lines 93 to 103 in b3e7203
With this, I would like to add a […]. However, the idea of just loading the train set and splitting it into train and test datasets is not bad, but it unnecessarily reduces the total amount of data we load. Or should we simply discard the idea of having a […]?
Let me know if you think otherwise.
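The reviewer's point about reduced data can be made concrete with a quick calculation. This is an illustration only: the numbers are the standard MNIST split sizes, and the 0.8 ratio matches this PR's `split_ratio` default.

```python
# Standard MNIST split sizes (illustration of the reviewer's point).
TRAIN_SIZE, TEST_SIZE = 60_000, 10_000

# Option A (this PR): load only the official train set and split it 80/20.
split_train = int(TRAIN_SIZE * 0.8)    # images used for training
split_test = TRAIN_SIZE - split_train  # images used for testing
total_split = split_train + split_test

# Option B (reviewer's preference): load the official train and test sets.
total_separate = TRAIN_SIZE + TEST_SIZE

print(split_train, split_test)       # 48000 12000
print(total_separate - total_split)  # 10000 images left unused by option A
```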
# collate_fn with args.
new_batch = []
custom_collate = partial(collate_fn, labels=labels, new_batch=new_batch)
raise TypeError("img_size tuple must contain integers.")
Can we revert the TypeError to what it was before? I prefer having more details in the error message.
batch_size_train: int = 64,
batch_size_test: int = 1000,
Having separate batch_size arguments is good!
However, any particular reason behind these arguments having default values of 64 and 1000?
Returns:
Train and Test DataLoader objects.
Train and/or Test DataLoader objects, depending on the `load` argument.
Can we say,
Suggested change:
Train and/or Test DataLoader objects, depending on the `dataset` argument.
raise TypeError("batch_size_train and batch_size_test must be integers.")
if labels is not None and not isinstance(labels, list):
raise TypeError("labels must be a list.")
Suggested change:
raise TypeError("The input labels must be of the type list.")
normalize_min: Optional[float] = None,
normalize_max: Optional[float] = None,
split_ratio: float = 0.8,
load: str = "both",  # Options: "train", "test", or "both"
Suggested change:
dataset: str = "both",  # Options: "train", "test", or "both"
if load not in {"train", "test", "both"}:
raise ValueError('load must be one of "train", "test", or "both".')
Suggested change:
if dataset not in {"train", "test", "both"}:
raise ValueError('The dataset argument must be one of "train", "test", or "both".')
MinMaxNormalization(normalize_min, normalize_max)
if normalize_min is not None and normalize_max is not None
else torchvision.transforms.Lambda(lambda x: x)
We do not need an else statement.
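One way to drop the `else` branch is to build the transform list conditionally and append the normalization step only when both bounds are given, so no identity `Lambda` is needed. The sketch below uses placeholder string/tuple stand-ins for the real torchvision and piQture transform objects, so it illustrates only the control flow, not the actual API:

```python
def build_transforms(normalize_min=None, normalize_max=None):
    """Assemble the transform pipeline without an identity else-branch."""
    # Placeholder stand-ins for the real transforms (assumption).
    transforms = ["resize", "to_tensor"]
    # Append normalization only when both bounds are provided.
    if normalize_min is not None and normalize_max is not None:
        transforms.append(("min_max_normalization", normalize_min, normalize_max))
    return transforms

print(build_transforms())          # ['resize', 'to_tensor']
print(build_transforms(0.0, 1.0))  # normalization step appended at the end
```

With a real pipeline, the resulting list would then be passed to `torchvision.transforms.Compose`.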
# Load the full MNIST dataset
mnist_full = datasets.MNIST(
root="data/mnist_data",
train=True,  # Always load train to split later
download=True,
transform=mnist_transform,
)
PyTorch suggests loading the train and test datasets separately. I would prefer keeping it the original way.
# Download dataset.
mnist_train = datasets.MNIST(
root="data/mnist_data",
train=True,
download=True,
transform=mnist_transform,
)
mnist_test = datasets.MNIST(
root="data/mnist_data", train=False, download=True, transform=mnist_transform
)
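Loading the two datasets separately also keeps the selection logic for the (proposed) `dataset` argument trivial. A hypothetical sketch of that dispatch, with illustrative names rather than the PR's actual code:

```python
def select_loaders(dataset, train_loader, test_loader):
    """Return the loader(s) requested by the `dataset` argument."""
    if dataset == "train":
        return train_loader
    if dataset == "test":
        return test_loader
    if dataset == "both":
        return train_loader, test_loader
    raise ValueError('The dataset argument must be one of "train", "test", or "both".')

# Strings stand in for real DataLoader objects in this sketch.
print(select_loaders("both", "train_dl", "test_dl"))  # ('train_dl', 'test_dl')
```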
This PR addresses issue #101 by adding train/test split functionality within the `load_mnist_dataset` function. Now, users no longer need to manually split the MNIST dataset into training and testing sets before loading.
Changes:
Let me know if any further changes are required.