IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter #225

anrahman4 · 2024-08-29T23:58:44Z

After the recent changes to dlio_benchmark/utils/config.py, I am hitting a list index out of range error when running with certain numbers of parallel processes under mpirun. Runs succeed when the process count evenly divides 9375 (the value I have set for num_files_train), but fail when the division leaves a remainder.

Command:

mpirun -np 4 dlio_benchmark --config-dir /mnt/mlperf_stor/dlio_benchmark/dlio_benchmark/configs/ workload=custom_workload.yaml

Error Message

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 330, in run
    self.args.reconfigure(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/utils/config.py", line 382, in reconfigure
    self.train_global_index_map = self.get_global_map_index(self.file_list_train, self.total_samples_train)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/utils/config.py", line 361, in get_global_map_index
    abs_path = os.path.abspath(file_list[file_index])
IndexError: list index out of range

Here is my custom_workload.yaml:

custom_workload.yaml

model: unet3d

framework: pytorch

workflow:
  generate_data: False
  train: True
  checkpoint: False
  profiling: True

dataset:
  data_folder: /mnt/dlio_benchmark/dlio_benchmark/dlio_benchmark/configs/data
  format: npz
  num_files_train: 9375
  num_samples_per_file: 1
  record_length: 146600628
  record_length_stdev: 68341808

reader:
  data_loader: pytorch
  batch_size: 4
  read_threads: 4
  file_shuffle: seed
  sample_shuffle: seed
  shuffle_size: 4

train:
  epochs: 1
  computation_time: 0.323

checkpoint:
  checkpoint_folder: /mnt/dlio_benchmark/dlio_benchmark/dlio_benchmark/configs/checkpoints
  checkpoint_after_epoch: 5
  epochs_between_checkpoints: 2
  model_size: 499153191

metric:
  au: 0.90

I figured out the issue with the code that is currently on the main branch of dlio_benchmark. I went to the file that the traceback points to. The error comes from the get_global_map_index function:

    @dlp.log
    def get_global_map_index(self, file_list, total_samples):
        process_thread_file_map = {}
        num_files = len(file_list)
        if num_files > 0:
            samples_per_proc = int(math.ceil(total_samples/self.comm_size)) 
            start_sample = self.my_rank * samples_per_proc
            end_sample = (self.my_rank + 1) * samples_per_proc
            for global_sample_index in range(start_sample, end_sample):
                file_index = global_sample_index//self.num_samples_per_file
                abs_path = os.path.abspath(file_list[file_index]) 
                sample_index = global_sample_index % self.num_samples_per_file
                process_thread_file_map[global_sample_index] = (abs_path, sample_index)
            logging.debug(f"{self.my_rank} {process_thread_file_map}")
        return process_thread_file_map

What the code is trying to do here is divide the samples among the processes launched by mpirun in the initial command, building a map from each global sample index to a file path and a sample index within that file. total_samples comes from the workload file: with num_samples_per_file set to 1, as in my custom_workload.yaml, it equals the num_files_train value in the dataset section.

The problem is that when you specify the number of files as, say, 9375, it does create that many files, but with zero-based indexing the last valid index is 9374. So let's take a look at an example where we run mpirun -np 4:

In this case 4 processes are launched as ranks 0-3. The samples_per_proc variable is calculated as ceiling(9375/4) = 2344 samples per rank. The code then defines the start_sample and end_sample range that each rank is responsible for. For 9,375 files, the for loop above breaks down like this:

rank 0: for loop 0, 2344 => samples 0-2343
rank 1: for loop 2344, 4688 => samples 2344-4687
rank 2: for loop 4688, 7032 => samples 4688-7031
rank 3: for loop 7032, 9376 => samples 7032-9375

But for the last rank, rank 3 in this case, end_sample is set to 9376, not 9375. File number 9375 lives at index 9374, but the loop also tries index 9375, which does not exist, hence the Python list index out of range error.
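The arithmetic above can be reproduced outside of DLIO with a few lines of Python. This is only a sketch: rank_range is a hypothetical helper that mirrors the start_sample/end_sample calculation quoted from get_global_map_index, and total_samples is assumed to be 9375 because num_samples_per_file is 1.

```python
import math

def rank_range(total_samples, comm_size, rank):
    """Per-rank [start, end) bounds, mirroring the quoted arithmetic (hypothetical helper)."""
    samples_per_proc = int(math.ceil(total_samples / comm_size))
    start_sample = rank * samples_per_proc
    end_sample = (rank + 1) * samples_per_proc
    return start_sample, end_sample

total_samples = 9375  # assumed: num_files_train * num_samples_per_file (1 here)
comm_size = 4         # mpirun -np 4

ranges = [rank_range(total_samples, comm_size, r) for r in range(comm_size)]
for rank, (start, end) in enumerate(ranges):
    # Any end value above total_samples means the loop indexes past the file list.
    overrun = max(0, end - total_samples)
    print(f"rank {rank}: range({start}, {end}) overrun={overrun}")
```

Only the last rank reports a non-zero overrun, which matches the single out-of-range index 9375.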

The reason you don't run into this bug with np = 1 is that samples_per_proc ends up being exactly 9375, since self.comm_size = 1. The for loop then runs over indexes 0 to 9374, which is correct.
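The same reasoning suggests which process counts are affected: any count for which the ceiling-based ranges add up to more than total_samples. A quick sketch, assuming the quoted arithmetic and using a hypothetical overrun helper, lists them for np from 1 to 8:

```python
import math

def overrun(total_samples, comm_size):
    """How many indexes past the end the ceiling-based split would cover (hypothetical helper)."""
    samples_per_proc = math.ceil(total_samples / comm_size)
    return samples_per_proc * comm_size - total_samples

total_samples = 9375  # assumed: num_files_train with one sample per file

# Process counts whose ranges extend past the last valid index.
affected = [n for n in range(1, 9) if overrun(total_samples, n) > 0]
print(affected)  # [2, 4, 6, 7, 8]
```

The counts that do not appear (1, 3, and 5) are the ones that divide 9375 evenly, which is consistent with the runs that succeeded.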

I believe this behavior comes from the introduction of the ceiling function to calculate samples_per_proc: when more than one rank is used and num_files_train is not evenly divisible by the number of processes set in mpirun, the range for the very last rank extends past the end of the file list. Using the MLPerf Storage commit version bc693c6 of this function seems to fix the issue initially and allows the first epoch to complete:

commit bc693c6

        process_thread_file_map = {}
        for global_sample_index in range(total_samples):
            file_index = global_sample_index//self.num_samples_per_file
            abs_path = os.path.abspath(file_list[file_index]) 
            sample_index = global_sample_index % self.num_samples_per_file
            process_thread_file_map[global_sample_index] = (abs_path, sample_index)
        return process_thread_file_map

But after that, I get this error message instead:

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 338, in run
    steps = self._train(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 259, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 222, in iter
    for v in func:
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/reader_handler.py", line 116, in read_index
    filename, sample_index = self.global_index_map[global_sample_idx]
KeyError: 9375

It looks like the DataLoader on the main branch then also hits a KeyError: it looks up key 9375, while the last valid key is 9374.
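Both errors point at an index one past the last valid sample. As a rough sketch of a bounds-safe split, each rank's range could be clamped to total_samples and the result checked so that no index is lost or duplicated. The helper names below are hypothetical and this is not DLIO code or the maintainers' fix, just one way to express the idea:

```python
import math

def clamped_rank_range(total_samples, comm_size, rank):
    """Ceiling-based split with both bounds clamped to total_samples (hypothetical helper)."""
    samples_per_proc = math.ceil(total_samples / comm_size)
    start = min(rank * samples_per_proc, total_samples)
    end = min((rank + 1) * samples_per_proc, total_samples)
    return start, end

def covers_all_indices(total_samples, comm_size):
    """True if the per-rank ranges together cover 0..total_samples-1 exactly once, in order."""
    seen = []
    for rank in range(comm_size):
        start, end = clamped_rank_range(total_samples, comm_size, rank)
        seen.extend(range(start, end))
    return seen == list(range(total_samples))

total_samples = 9375  # assumed: num_files_train with one sample per file
results = {n: covers_all_indices(total_samples, n) for n in range(1, 9)}
print(results)
```

With clamping, the last rank simply receives fewer samples (2343 instead of 2344 for np = 4), so any consumer of these ranges would need to tolerate uneven per-rank counts.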

Please confirm if this is truly the issue and fix the relevant files to have the global map create a proper index map. Thank you.

hariharan-devarajan · 2024-08-30T01:51:00Z

@anrahman4 Thanks for reporting. I will look into this.

1. Fix uneven sampling done for index based and iterative
2. Add a validation step to ensure we can validate that global indices are correctly shuffled and no indices are lost.
3. Make sure we do file and sample shuffling in reconfigure step.
4. Remove sample shuffling from dataloader Sampler code.
5. Added test case to support uneven file distributions #225

hariharan-devarajan · 2024-08-30T08:07:28Z

@anrahman4 Please check #226 and see if it solves your problem.

anrahman4 · 2024-08-30T17:35:17Z

@hariharan-devarajan

Pulled the commit and ran the same command:

mpirun -np 4 dlio_benchmark --config-dir /mnt/mlperf_stor/dlio_benchmark/dlio_benchmark/configs/ workload=custom_workload.yaml

Still received the following error in dlio_benchmark/reader/reader_handler.py:

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 338, in run
    steps = self._train(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 259, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 222, in iter
    for v in func:
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/reader_handler.py", line 116, in read_index
    filename, sample_index = self.global_index_map[global_sample_idx]
KeyError: 9375

hariharan-devarajan · 2024-08-31T00:58:38Z

@anrahman4 Thank you for your testing. It looks like the PyTorch sampler went overboard. :( It should be correct now. Can you retest and confirm? I tried with four epochs, and it works with your configuration.

anrahman4 · 2024-09-02T05:50:27Z

@hariharan-devarajan That did the trick :)

Was able to run the mpirun command with np set to 1, 2, 3, 4, 5, 6, 7, and 8.

Thank you for the very prompt fix!

* For sample indexing we fix the uneven sampling
  1. Fix uneven sampling done for index based and iterative
  2. Add a validation step to ensure we can validate that global indices are correctly shuffled and no indices are lost.
  3. Make sure we do file and sample shuffling in reconfigure step.
  4. Remove sample shuffling from dataloader Sampler code.
  5. Added test case to support uneven file distributions #225
* Increase GOTCHA and DFTRACER LEVEL to reduce print
* Increasing test case to improve testing
* reduced reader threads as not enough data
* ensure the sampler do not goes past the file in the last rank.

hariharan-devarajan self-assigned this Aug 30, 2024

hariharan-devarajan mentioned this issue Aug 30, 2024

For sample indexing we fix the uneven sampling #226

Merged

5 tasks

zhenghh04 closed this as completed in #226 Sep 24, 2024
