Add tests for MNIST data loader and update data loading functionality #138

Open

ketayon wants to merge 4 commits into SaashaJoshi:main from ketayon:fix-issue-101

Contributor

ketayon commented Oct 20, 2024

This PR addresses issue #101 by adding train_test_split functionality within the load_mnist_dataset function. Now, users no longer need to manually split the MNIST dataset into training and testing sets before loading.

Changes:

Added train-test splitting based on the split_ratio argument, allowing for flexible dataset sizes.
Updated the function to allow loading only the train or test dataloader, or both, enhancing usability.
Added comprehensive tests in test_mnist_data_loader.py to ensure the functionality of the load_mnist_dataset function, including:

Verifying that the dataset splits correctly according to the specified split_ratio.
Testing the functionality of loading train and test dataloaders independently or together.
Checking that the data normalization functionality works as intended.
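
As a rough illustration of the split described above, the sketch below divides a range of sample indices according to a `split_ratio`. It is not the code from this PR: `split_indices` and its `seed` parameter are hypothetical names, and in a PyTorch pipeline `torch.utils.data.random_split` would typically play this role.

```python
import random
from typing import List, Tuple


def split_indices(
    num_samples: int, split_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a shuffled index split.

    Hypothetical helper for illustration only; not part of piqture.
    """
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must lie strictly between 0 and 1.")

    indices = list(range(num_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)

    # Truncate towards zero so the train portion never exceeds the ratio.
    train_size = int(num_samples * split_ratio)
    return indices[:train_size], indices[train_size:]


train_idx, test_idx = split_indices(60_000, split_ratio=0.8)
print(len(train_idx), len(test_idx))
```

Every index lands in exactly one of the two lists, so the two parts always add up to the original dataset size.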

Let me know if any further changes are required.

ketayon added 4 commits

October 20, 2024 15:12


Add tests for MNIST data loader and update data loading functionality
9b70a9f

Fixed issues
026ce69

Fix isort and black issues
d1a22a9

Increased test coverage to 100% for mnist_data_loader
02e9147

Owner

SaashaJoshi commented Oct 22, 2024

Hi @ketayon, thank you for the contribution. I will get back to reviewing as soon as I can. However, are you in a rush to get this done before hacktoberfest gets over?

SaashaJoshi added the hacktoberfest-accepted label

Contributor Author

ketayon commented Oct 22, 2024 via email

Hello, no, I don’t take part in Hacktoberfest.

SaashaJoshi mentioned this pull request

load_mnist_data with split #137

Closed

SaashaJoshi linked an issue

that may be closed by this pull request

load_mnist_dataset must include train_test_split #101

Open

Owner

SaashaJoshi commented Dec 18, 2024 (edited)

I have been reviewing this PR, and it made me realize that I would prefer loading the train and test datasets separately, as PyTorch suggests. Like here,

piQture/piqture/data_loader/mnist_data_loader.py

Lines 93 to 103 in b3e7203

    
           # Download dataset.
           mnist_train = datasets.MNIST(
               root="data/mnist_data",
               train=True,
               download=True,
               transform=mnist_transform,
           )
           mnist_test = datasets.MNIST(
               root="data/mnist_data", train=False, download=True, transform=mnist_transform
           )

With this, I would like to add a ratio argument to specify how many train:test samples a user wishes to load. This argument would provide a value to the batch_size argument.

However, the idea of just loading the train set and splitting it into train and test dataset is not bad, but it unnecessarily reduces the total amount of data we load. Or should we simply discard the idea of having a train_test_split since we are loading the datasets separately?

Let me know if you think otherwise.
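
To make the trade-off concrete, the sketch below compares how many samples each approach yields. It assumes the standard MNIST sizes (60,000 training and 10,000 test images) and the PR's default `split_ratio` of 0.8; the function names are hypothetical and not part of the project.

```python
from typing import Tuple

MNIST_TRAIN_SIZE = 60_000  # standard MNIST training set size
MNIST_TEST_SIZE = 10_000  # standard MNIST test set size


def sizes_when_splitting_train(split_ratio: float) -> Tuple[int, int]:
    """Sizes obtained by loading only the train set and splitting it."""
    train_size = int(MNIST_TRAIN_SIZE * split_ratio)
    return train_size, MNIST_TRAIN_SIZE - train_size


def sizes_when_loading_separately() -> Tuple[int, int]:
    """Sizes obtained by loading the official train and test sets as-is."""
    return MNIST_TRAIN_SIZE, MNIST_TEST_SIZE


split_sizes = sizes_when_splitting_train(0.8)
separate_sizes = sizes_when_loading_separately()
print(sum(split_sizes), sum(separate_sizes))
```

Under these assumptions, splitting the train set uses 60,000 samples in total, while loading both official sets uses 70,000.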

SaashaJoshi reviewed

piqture/data_loader/mnist_data_loader.py

-                  # collate_fn with args.
-                  new_batch = []
-                  custom_collate = partial(collate_fn, labels=labels, new_batch=new_batch)
+                      raise TypeError("img_size tuple must contain integers.")

Owner

SaashaJoshi Dec 18, 2024

Can we revert the TypeError to what it was before? I prefer having more details in the error message.
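
The earlier message is not shown in this thread, so the sketch below only illustrates what a more detailed `TypeError` could look like. `validate_img_size` and the exact wording are assumptions for illustration, not the project's code.

```python
from typing import Tuple, Union


def validate_img_size(img_size: Union[int, Tuple[int, ...]]) -> None:
    """Raise a descriptive TypeError if img_size has an unexpected type.

    Hypothetical helper; the message wording is illustrative only.
    """
    if isinstance(img_size, int):
        return
    if not isinstance(img_size, tuple):
        raise TypeError(
            "img_size must be an int or a tuple of ints, "
            f"but got {type(img_size).__name__}."
        )
    for position, size in enumerate(img_size):
        if not isinstance(size, int):
            # Report which element failed and what type it had.
            raise TypeError(
                "img_size tuple must contain integers, but element "
                f"{position} is {size!r} of type {type(size).__name__}."
            )
```

Naming the offending element and its type lets a caller fix the input without reading the loader's source.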

piqture/data_loader/mnist_data_loader.py

Comment on lines +27 to +28

		batch_size_train: int = 64,
		batch_size_test: int = 1000,

Owner

SaashaJoshi Dec 18, 2024

Having separate batch_size arguments is good!

Owner

SaashaJoshi Dec 18, 2024

However, any particular reason behind these arguments having default values of 64 and 1000?

piqture/data_loader/mnist_data_loader.py

                   Returns:
-                      Train and Test DataLoader objects.
+                      Train and/or Test DataLoader objects, depending on the `load` argument.

Owner

SaashaJoshi Dec 18, 2024

Can we say,

Suggested change

      
-                   Train and/or Test DataLoader objects, depending on the `load` argument.
+                   Train and/or Test DataLoader objects, depending on the `dataset` argument.

piqture/data_loader/mnist_data_loader.py

+                      raise TypeError("batch_size_train and batch_size_test must be integers.")
+                  if labels is not None and not isinstance(labels, list):
+                      raise TypeError("labels must be a list.")

Owner

SaashaJoshi Dec 18, 2024

Suggested change

      
-                   raise TypeError("labels must be a list.")
+                   raise TypeError("The input labels must be of the type list.")

piqture/data_loader/mnist_data_loader.py

+                  normalize_min: Optional[float] = None,
+                  normalize_max: Optional[float] = None,
+                  split_ratio: float = 0.8,
+                  load: str = "both",  # Options: "train", "test", or "both"

Owner

SaashaJoshi Dec 18, 2024

Suggested change

      
-               load: str = "both",  # Options: "train", "test", or "both"
+               dataset: str = "both",  # Options: "train", "test", or "both"

piqture/data_loader/mnist_data_loader.py

Comment on lines +69 to +70

		if load not in {"train", "test", "both"}:
			raise ValueError('load must be one of "train", "test", or "both".')

Owner

SaashaJoshi Dec 18, 2024

Suggested change

      
-               if load not in {"train", "test", "both"}:
-                   raise ValueError('load must be one of "train", "test", or "both".')
+               if dataset not in {"train", "test", "both"}:
+                   raise ValueError('The dataset argument must be one of "train", "test", or "both".')
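
For context, a minimal sketch of how a validated `dataset` argument could then drive which splits are loaded. `select_splits` and `VALID_DATASET_CHOICES` are hypothetical names; the PR's actual control flow is not shown in this thread.

```python
from typing import List

VALID_DATASET_CHOICES = ("train", "test", "both")


def select_splits(dataset: str = "both") -> List[str]:
    """Map the dataset argument to the list of splits to load.

    Hypothetical helper for illustration; not part of piqture.
    """
    if dataset not in VALID_DATASET_CHOICES:
        raise ValueError(
            'The dataset argument must be one of "train", "test", or "both".'
        )
    # "both" expands to the two concrete splits; otherwise load just one.
    return ["train", "test"] if dataset == "both" else [dataset]


print(select_splits("both"))
print(select_splits("test"))
```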

piqture/data_loader/mnist_data_loader.py

+                              MinMaxNormalization(normalize_min, normalize_max)
+                              if normalize_min is not None
+                              and normalize_max is not None  # pylint: disable=C0301
+                              else torchvision.transforms.Lambda(lambda x: x)

Owner

SaashaJoshi Dec 18, 2024

We do not need an else statement.
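
One way to read this comment: rather than falling back to an identity `Lambda`, the transform list can be built up and the normalization step appended only when both bounds are given. The sketch below shows the pattern in plain Python; `min_max_normalize`, `build_steps`, and `run` are hypothetical stand-ins for `MinMaxNormalization` and `torchvision.transforms.Compose`, not the project's code.

```python
from typing import Callable, List, Optional

Step = Callable[[List[float]], List[float]]


def min_max_normalize(low: float, high: float) -> Step:
    """Stand-in for MinMaxNormalization: rescale values to [low, high]."""

    def apply(values: List[float]) -> List[float]:
        v_min, v_max = min(values), max(values)
        span = (v_max - v_min) or 1.0  # avoid dividing by zero on constant input
        return [low + (v - v_min) * (high - low) / span for v in values]

    return apply


def build_steps(
    normalize_min: Optional[float] = None,
    normalize_max: Optional[float] = None,
) -> List[Step]:
    """Collect transform steps, adding normalization only when requested."""
    steps: List[Step] = []
    if normalize_min is not None and normalize_max is not None:
        steps.append(min_max_normalize(normalize_min, normalize_max))
    return steps


def run(steps: List[Step], values: List[float]) -> List[float]:
    """Apply each step in order, as a Compose-style wrapper would."""
    for step in steps:
        values = step(values)
    return values


print(run(build_steps(), [0.0, 5.0, 10.0]))
print(run(build_steps(0.0, 1.0), [0.0, 5.0, 10.0]))
```

With no bounds given, the step list stays empty and the input passes through unchanged, so no identity transform is needed.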

piqture/data_loader/mnist_data_loader.py

Comment on lines +86 to 92

    
                  # Load the full MNIST dataset
                  mnist_full = datasets.MNIST(
                      root="data/mnist_data",
-                     train=True,
+                     train=True,  # Always load train to split later
                      download=True,
                      transform=mnist_transform,
                  )

Owner

SaashaJoshi Dec 18, 2024

PyTorch suggests loading the train and test datasets separately. I would prefer keeping it the original way.

# Download dataset.
mnist_train = datasets.MNIST(
    root="data/mnist_data",
    train=True,
    download=True,
    transform=mnist_transform,
)

mnist_test = datasets.MNIST(
    root="data/mnist_data", train=False, download=True, transform=mnist_transform
)

Labels

hacktoberfest-accepted