
Expected GPU utilization pattern for video decoding. #5764

Open

ivandariojr opened this issue Dec 28, 2024 · 1 comment

Labels: question (Further information is requested)

ivandariojr commented Dec 28, 2024

Describe the question.

While optimizing a training pipeline, I observed a GPU utilization pattern suggesting that the DALI pipeline and the training code run sequentially, rather than in parallel as I would have expected from reading the DALI documentation.

In the plot below, the model is training when the GPU CUDA utilization (blue) spikes, but during training the GPU decoder utilization (green) drops to zero. Overall decoder utilization also seems very low. Is there some way to keep the decoder running at all times so the downstream model doesn't stop training?
[Image: GPU utilization over time, showing CUDA utilization (blue) and GPU decoder utilization (green)]
This is a very simple GPU video processing pipeline in DALI that decodes, resizes, and then pads videos. I am using this pipeline to train a downstream model. Here are some of the parameters used to configure the pipeline:

dataloader:
  num_devices: 8
  last_batch_policy: nvidia.dali.plugin.base_iterator.LastBatchPolicy.PARTIAL
  video_pipeline_partial:
    _target_: nvidia.dali.pipeline_def
    _partial_: true
    batch_size: 272
    num_threads: 8
    py_num_workers: 16
    exec_dynamic: true
  video_resize_fn:
    _target_: nvidia.dali.fn.readers.video_resize
    _partial_: true
    device: "gpu"
    sequence_length: 17
    max_size: 256
    resize_longer: 256
    file_list_include_preceding_frame: true
    prefetch_queue_depth: 4
    pad_sequences: false
    pad_last_batch: false
    random_shuffle: true
    stick_to_shard: true
    minibatch_size: 128
  pad_fn:
    _target_: nvidia.dali.fn.pad
    _partial_: true
    fill_value: 0
    axes: [1,2]
    shape:
      - 256
      - 256

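For reference, the configuration above roughly corresponds to the following direct DALI calls (a sketch based on my reading of the config, not the exact code; the Hydra-style _target_/_partial_ entries are assumed to resolve to the DALI functions they name):

from nvidia.dali import fn, pipeline_def

# Rough equivalent of video_pipeline_partial + video_resize_fn + pad_fn above.
@pipeline_def(batch_size=272, num_threads=8, py_num_workers=16, exec_dynamic=True)
def video_pipeline(filenames, shard_id, num_shards):
    # GPU video reader with the resize fused into the same operator.
    video = fn.readers.video_resize(
        filenames=filenames,
        name="Reader",
        device="gpu",
        sequence_length=17,
        max_size=256,
        resize_longer=256,
        file_list_include_preceding_frame=True,
        prefetch_queue_depth=4,
        pad_sequences=False,
        pad_last_batch=False,
        random_shuffle=True,
        stick_to_shard=True,
        minibatch_size=128,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    # Pad frame height and width (axes 1 and 2) to 256x256.
    return fn.pad(video, fill_value=0, axes=[1, 2], shape=[256, 256])

# device_id is supplied when the pipeline is instantiated, e.g.:
# pipe = video_pipeline(file_list, shard_id=0, num_shards=8, device_id=0)
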
And here is the Python code where these are used:

class DaliVideoDataloader(DALIGenericIterator):
    """A DALI dataloader for video data."""

    def __init__(
        self,
        dataset: Dataset,
        video_pipeline_partial: Callable,
        video_resize_fn: Callable,
        pad_fn: Callable,
        device_id: int,
        num_devices: int,
        # DaliGenericIterator args
        size: int = -1,
        auto_reset: bool = False,
        last_batch_padded: bool = False,
        last_batch_policy: LastBatchPolicy = LastBatchPolicy.FILL,
        prepare_first_batch: bool = True,
    ) -> None:

        def video_pipeline(filenames: list[str | Path]) -> Any:
            video = video_resize_fn(filenames=filenames, name="Reader", num_shards=num_devices, shard_id=device_id)
            padded = pad_fn(video)
            return padded

        pipeline = video_pipeline_partial(fn=video_pipeline, device_id=device_id)
        pipe = pipeline(dataset.data_files)
        super().__init__(
            pipelines=[pipe],
            size=size,
            reader_name="Reader",
            auto_reset=auto_reset,
            last_batch_padded=last_batch_padded,
            last_batch_policy=last_batch_policy,
            prepare_first_batch=prepare_first_batch,
            output_map=["video"],
        )

    def __next__(self) -> torch.Tensor:  # pyright: ignore
        """Returns the next video tensor."""
        out = super().__next__()
        return out[0]["video"]

If this is expected behavior, that is fine, but I want to make sure there isn't a flag or misconfiguration causing this performance.

Thanks for your help!

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
ivandariojr added the question label on Dec 28, 2024
JanuszL (Contributor) commented Dec 30, 2024

Hi @ivandariojr,

Thank you for reaching out. The utilization plots you showed are a good place to start a more thorough analysis. I recommend capturing a profile with Nsight Systems to learn more details. It may be that a piece of CPU code in the training loop stalls the GPU work, and DALI is not the bottleneck but simply provides data at the pace the training can consume it.
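
As a rough sketch (assuming a PyTorch training loop similar to the one above; loader, model, optimizer, loss_fn, and num_steps are placeholders), you can wrap the data wait and the model step in NVTX ranges so the Nsight Systems timeline clearly separates them, and capture the run with something like `nsys profile -t cuda,nvtx -o report python train.py`:

import torch

it = iter(loader)
for step in range(num_steps):  # num_steps is a placeholder
    # Time spent in this range means the training loop is waiting for DALI output.
    torch.cuda.nvtx.range_push("data_wait")
    video = next(it)
    torch.cuda.nvtx.range_pop()

    # Time spent in this range is the forward/backward/optimizer work.
    torch.cuda.nvtx.range_push("model_step")
    loss = loss_fn(model(video))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()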
