Add tests for MNIST data loader and update data loading functionality #138

Open
wants to merge 4 commits into base: main

ketayon
Contributor

@ketayon ketayon commented Oct 20, 2024

This PR addresses issue #101 by adding train_test_split functionality within the load_mnist_dataset function. Now, users no longer need to manually split the MNIST dataset into training and testing sets before loading.

Changes:

  1. Added train-test splitting based on the split_ratio argument, allowing for flexible dataset sizes.
  2. Updated the function to allow loading only the train or test dataloader, or both, enhancing usability.
  3. Added comprehensive tests in test_mnist_data_loader.py to ensure the functionality of the load_mnist_dataset function, including:
  • Verifying that the dataset splits correctly according to the specified split_ratio.
  • Testing the functionality of loading train and test dataloaders independently or together.
  • Checking that the data normalization functionality works as intended.
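
As a rough illustration of the first change, the sizes produced by a `split_ratio` can be computed as below. This is a minimal sketch, not the PR's code: `split_sizes` is a hypothetical helper, and 60,000 is the size of the standard MNIST training set. In PyTorch, the resulting lengths could then be passed to `torch.utils.data.random_split`.

```python
def split_sizes(n_samples: int, split_ratio: float = 0.8) -> tuple[int, int]:
    """Return (train_size, test_size) for a dataset of n_samples items.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be strictly between 0 and 1.")
    train_size = round(n_samples * split_ratio)
    # Give the remainder to the test set so the two sizes always sum to n_samples.
    return train_size, n_samples - train_size


train_size, test_size = split_sizes(60_000, 0.8)
print(train_size, test_size)
```

Computing the test size as a remainder, rather than rounding it separately, avoids losing or duplicating a sample when the ratio does not divide the dataset evenly.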

Let me know if any further changes are required.

@SaashaJoshi
Owner

Hi @ketayon, thank you for the contribution. I will get back to reviewing as soon as I can. However, are you in a rush to get this done before Hacktoberfest ends?

@SaashaJoshi SaashaJoshi added the hacktoberfest-accepted Approved for Hacktoberfest label Oct 22, 2024
@ketayon
Contributor Author

ketayon commented Oct 22, 2024 via email

@SaashaJoshi SaashaJoshi linked an issue Dec 11, 2024 that may be closed by this pull request
@SaashaJoshi
Owner

SaashaJoshi commented Dec 18, 2024

Hi @ketayon,

I have been reviewing this PR, and it made me realize that I would prefer loading the train and test datasets separately, as PyTorch suggests. For example:

# Download dataset.
mnist_train = datasets.MNIST(
    root="data/mnist_data",
    train=True,
    download=True,
    transform=mnist_transform,
)
mnist_test = datasets.MNIST(
    root="data/mnist_data", train=False, download=True, transform=mnist_transform
)

With this, I would like to add a ratio argument to specify how many train:test samples a user wishes to load. This argument would provide the value for the batch_size argument.

The idea of loading only the train set and splitting it into train and test datasets is not bad, but it unnecessarily reduces the total amount of data we load. Alternatively, should we simply discard the idea of a train_test_split, since we are loading the datasets separately?

Let me know if you think otherwise.
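
The trade-off described above can be made concrete with a little arithmetic. The sketch below assumes the standard MNIST sizes (60,000 training and 10,000 test images) and uses the PR's default `split_ratio` of 0.8; the variable names are illustrative only.

```python
# Standard MNIST sizes: 60,000 training images and 10,000 test images.
MNIST_TRAIN = 60_000
MNIST_TEST = 10_000

# Option 1: load the official train and test sets separately.
separate_total = MNIST_TRAIN + MNIST_TEST

# Option 2: load only the training set and split it (split_ratio = 0.8, the PR default).
split_ratio = 0.8
split_train = round(MNIST_TRAIN * split_ratio)
split_test = MNIST_TRAIN - split_train
split_total = split_train + split_test

print(f"separate: {MNIST_TRAIN} train + {MNIST_TEST} test = {separate_total}")
print(f"split:    {split_train} train + {split_test} test = {split_total}")
print(f"unused official test images: {separate_total - split_total}")
```

Under these assumptions, splitting only the training set never touches the official test images, which is the reduction in loaded data the comment refers to.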

# collate_fn with args.
new_batch = []
custom_collate = partial(collate_fn, labels=labels, new_batch=new_batch)
raise TypeError("img_size tuple must contain integers.")
Owner

Can we revert the TypeError to what it was before? I prefer having more details in the error message.
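
The original message is not shown in this thread, so the following is only a sketch of what a more detailed error could look like. `validate_img_size`, the accepted types, and the message wording are all hypothetical; the point is to report the offending value and its element types.

```python
def validate_img_size(img_size) -> None:
    """Raise a descriptive TypeError if img_size is not an int or a tuple of ints.

    Hypothetical helper; the accepted types are an assumption for illustration.
    """

    def is_int(value) -> bool:
        # bool is a subclass of int, so reject it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    if is_int(img_size):
        return
    if isinstance(img_size, tuple):
        if all(is_int(dim) for dim in img_size):
            return
        bad_types = [type(dim).__name__ for dim in img_size]
        raise TypeError(
            f"img_size tuple must contain integers, got {img_size!r} "
            f"with element types {bad_types}."
        )
    raise TypeError(
        f"img_size must be an int or a tuple of ints, got {type(img_size).__name__}."
    )
```

Including the received value in the message lets a caller see at a glance which element was wrong, without opening a debugger.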

Comment on lines +27 to +28
batch_size_train: int = 64,
batch_size_test: int = 1000,
Owner

Having separate batch_size arguments is good!

Owner

However, any particular reason behind these arguments having default values of 64 and 1000?


Returns:
-   Train and Test DataLoader objects.
+   Train and/or Test DataLoader objects, depending on the `load` argument.
Owner

Can we say,

Suggested change
- Train and/or Test DataLoader objects, depending on the `load` argument.
+ Train and/or Test DataLoader objects, depending on the `dataset` argument.

    raise TypeError("batch_size_train and batch_size_test must be integers.")

if labels is not None and not isinstance(labels, list):
    raise TypeError("labels must be a list.")
Owner

Suggested change
- raise TypeError("labels must be a list.")
+ raise TypeError("The input labels must be of the type list.")

normalize_min: Optional[float] = None,
normalize_max: Optional[float] = None,
split_ratio: float = 0.8,
load: str = "both", # Options: "train", "test", or "both"
Owner

Suggested change
- load: str = "both", # Options: "train", "test", or "both"
+ dataset: str = "both", # Options: "train", "test", or "both"

Comment on lines +69 to +70
if load not in {"train", "test", "both"}:
    raise ValueError('load must be one of "train", "test", or "both".')
Owner

Suggested change
- if load not in {"train", "test", "both"}:
-     raise ValueError('load must be one of "train", "test", or "both".')
+ if dataset not in {"train", "test", "both"}:
+     raise ValueError('The dataset argument must be one of "train", "test", or "both".')
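
To show how the `dataset` argument might be used, here is a minimal sketch of validating it and returning only the requested loaders. `select_loaders` is a hypothetical helper and the strings stand in for real `DataLoader` objects; the PR's actual return convention may differ.

```python
VALID_DATASETS = ("train", "test", "both")


def select_loaders(train_loader, test_loader, dataset: str = "both"):
    """Return the loader(s) requested via the dataset argument.

    Hypothetical helper: returns a single loader for "train" or "test",
    and a (train, test) tuple for "both".
    """
    if dataset not in VALID_DATASETS:
        raise ValueError(
            'The dataset argument must be one of "train", "test", or "both"; '
            f"got {dataset!r}."
        )
    if dataset == "train":
        return train_loader
    if dataset == "test":
        return test_loader
    return train_loader, test_loader


# Placeholder strings stand in for DataLoader objects.
print(select_loaders("train_loader", "test_loader", dataset="test"))
```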

MinMaxNormalization(normalize_min, normalize_max)
if normalize_min is not None
and normalize_max is not None # pylint: disable=C0301
else torchvision.transforms.Lambda(lambda x: x)
Owner

We do not need an else statement.
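
One way to avoid the `else` branch is to build the list of transforms first and append the normalization step only when both bounds are given. The sketch below uses plain callables as stand-ins; in the PR the steps would be torchvision transforms and the project's `MinMaxNormalization`, whose exact behavior is assumed here.

```python
from typing import Callable, List, Optional


def min_max_normalization(new_min: float, new_max: float) -> Callable:
    """Stand-in for the project's MinMaxNormalization (behavior assumed)."""

    def normalize(values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        if hi == lo:
            # Constant input: map everything to the lower bound.
            return [new_min for _ in values]
        scale = (new_max - new_min) / (hi - lo)
        return [new_min + (v - lo) * scale for v in values]

    return normalize


def build_transforms(
    normalize_min: Optional[float] = None,
    normalize_max: Optional[float] = None,
) -> List[Callable]:
    """Collect transform steps, adding normalization only when requested."""
    steps: List[Callable] = [list]  # placeholder for steps such as ToTensor
    if normalize_min is not None and normalize_max is not None:
        steps.append(min_max_normalization(normalize_min, normalize_max))
    return steps


def apply_all(steps: List[Callable], values):
    """Apply each step in order, similar to a Compose pipeline."""
    for step in steps:
        values = step(values)
    return values


print(apply_all(build_transforms(0.0, 1.0), (0, 5, 10)))
```

With this shape, no identity transform is needed when normalization is disabled, because the step is simply never added to the list.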

Comment on lines +86 to 92
# Load the full MNIST dataset
mnist_full = datasets.MNIST(
    root="data/mnist_data",
-   train=True,
+   train=True, # Always load train to split later
    download=True,
    transform=mnist_transform,
)
Owner

PyTorch suggests loading the train and test datasets separately. I would prefer keeping it the original way.

# Download dataset.
mnist_train = datasets.MNIST(
    root="data/mnist_data",
    train=True,
    download=True,
    transform=mnist_transform,
)

mnist_test = datasets.MNIST(
    root="data/mnist_data", train=False, download=True, transform=mnist_transform
)

Labels
hacktoberfest-accepted Approved for Hacktoberfest
Development

Successfully merging this pull request may close these issues.

load_mnist_dataset must include train_test_split
2 participants