IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter #225
Comments
@anrahman4 Thanks for reporting. I will look into this.
1. Fix uneven sampling done for index based and iterative
2. Add a validation step to ensure we can validate that global indices are correctly shuffled and no indices are lost.
3. Make sure we do file and sample shuffling in reconfigure step.
4. Remove sample shuffling from dataloader Sampler code.
5. Added test case to support uneven file distributions #225
@anrahman4 Please check #226 and see if it solves your problem.
Pulled the commit and ran the same command:
Still received the following error in dlio_benchmark/reader/reader_handler.py:
@anrahman4 Thank you for your testing. It looks like the PyTorch sampler went overboard. :( It should be correct now. Can you retest and confirm? I tried with four epochs, and it works with your configuration.
@hariharan-devarajan That did the trick :) I was able to run the mpirun command with np as 1, 2, 3, 4, 5, 6, 7, 8. Thank you for the very prompt fix!
* For sample indexing we fix the uneven sampling
  1. Fix uneven sampling done for index based and iterative
  2. Add a validation step to ensure we can validate that global indices are correctly shuffled and no indices are lost.
  3. Make sure we do file and sample shuffling in reconfigure step.
  4. Remove sample shuffling from dataloader Sampler code.
  5. Added test case to support uneven file distributions #225
* Increase GOTCHA and DFTRACER LEVEL to reduce print
* Increasing test case to improve testing
* Reduced reader threads as not enough data
* Ensure the sampler does not go past the file in the last rank.
IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter
After the recent changes to dlio_benchmark/utils/config.py, I am running into a list index out of range error when running with certain numbers of parallel workers under mpirun. Runs succeed when the number of workers evenly divides 9375 (the value I have set for num_files_train), but fail when the division leaves a remainder.
Command:
Error Message
Here is my custom_workload.yaml:
custom_workload.yaml
I figured out the issue with the code that is currently on the main branch of dlio_benchmark. I went to the file that the Python traceback pointed to. The error came from the get_global_map_index function:
Overall, the code here is trying to divide the samples among the number of cores you set with mpirun in the initial command, building a Python map between files and indexes. The value of total_samples comes from your workload file: it is the value set for num_files_train in the dataset section of your custom_workload.yaml file.
The problem is that when you specify the number of files as, say, 9375, it will indeed create that many files, but with zero-based indexing the very last file has index 9374. So let’s take a look at an example where we run mpirun -np 4:
In this case 4 cores will be utilized and will be defined as ranks 0-3. The samples_per_proc variable is calculated as ceiling(9375/4) = 2344 samples per processor. The code defines the start_sample and end_sample range that each rank is responsible for. For 9,375 files, the for loop above breaks down like this:
rank 0: for loop 0, 2344 => samples 0-2343
rank 1: for loop 2344, 4688 => samples 2344-4687
rank 2: for loop 4688, 7032 => samples 4688-7031
rank 3: for loop 7032, 9376 => samples 7032-9375
But for the last rank, in this case rank 3, end_sample is actually set to 9376, not 9375. The last valid index is 9374 (the 9375th file), but the loop also tries to use index 9375, which does not exist, hence the Python list index out of range.
The reason you don’t run into this bug when running np = 1 is that samples_per_proc is simply 9375, since self.comm_size = 1. The for loop then runs over indices 0 to 9374, which is correct.
I believe this behavior comes from the introduction of the ceiling function to calculate samples_per_proc: when more than one rank is used, the range for the very last rank is calculated incorrectly if num_files_train is not evenly divisible by the number of processes set in mpirun. Using the MLPerf Storage commit version bc693c6 of this function seems to fix the issue initially and allows the first epoch to complete:
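To make the arithmetic concrete, here is a minimal standalone sketch of ceiling-based partitioning. It is not the actual get_global_map_index code; the function name ceil_partition is hypothetical, and it assumes each rank takes a contiguous half-open range of samples_per_proc indices with no clamping.

```python
import math


def ceil_partition(total_samples, comm_size):
    """Hypothetical helper: half-open (start, end) ranges per rank.

    Assumes each rank gets ceil(total_samples / comm_size) indices and
    that the end of the last range is NOT clamped to total_samples.
    """
    samples_per_proc = math.ceil(total_samples / comm_size)
    return [
        (rank * samples_per_proc, (rank + 1) * samples_per_proc)
        for rank in range(comm_size)
    ]


total = 9375
for comm_size in (1, 3, 4):
    ranges = ceil_partition(total, comm_size)
    last_end = ranges[-1][1]
    # Valid indices are 0 .. total - 1, so any end above total overruns.
    status = "overruns" if last_end > total else "ok"
    print(f"np={comm_size}: last rank ends at {last_end} ({status})")
```

Under these assumptions, np = 1 and np = 3 stay within bounds because 9375 divides evenly, while np = 4 ends the last rank at 9376 and so touches index 9375.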
commit bc693c6
But afterwards, I get this error message instead:
Looks like in the main branch, the DataLoader then also has a key error, where it is trying to look at key 9375 as opposed to 9374.
Please confirm if this is truly the issue and fix the relevant files to have the global map create a proper index map. Thank you.
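For illustration only, one way to build ranges that never pass the last index is to give every rank the floor share and hand the remainder out one sample at a time to the lowest ranks. This is a sketch under my own assumptions, not the fix that was merged; the names balanced_partition and covers_exactly are hypothetical.

```python
def balanced_partition(total_samples, comm_size):
    """Hypothetical helper: split indices 0..total_samples-1 across ranks.

    Each rank gets total_samples // comm_size indices, and the first
    (total_samples % comm_size) ranks get one extra, so the last range
    always ends exactly at total_samples.
    """
    base, extra = divmod(total_samples, comm_size)
    ranges = []
    start = 0
    for rank in range(comm_size):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def covers_exactly(ranges, total_samples):
    """Check that the ranges visit every index once, in order, with none lost."""
    seen = [i for start, end in ranges for i in range(start, end)]
    return seen == list(range(total_samples))


for comm_size in range(1, 9):
    ranges = balanced_partition(9375, comm_size)
    assert covers_exactly(ranges, 9375)
    print(comm_size, ranges[-1])
```

The coverage check is the kind of validation that would catch both a lost index and an out-of-range one before the reader ever indexes into the file list.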