
Expected GPU utilization pattern for video decoding. #5764

Open

ivandariojr opened this issue Dec 28, 2024 · 1 comment

Labels: question (Further information is requested)

ivandariojr commented Dec 28, 2024

Describe the question.

While optimizing a training pipeline, I observed a GPU utilization pattern suggesting that the DALI pipeline and the training code run sequentially, rather than in parallel as I would have expected from reading the DALI documentation.

In the plot below, the model is training when the GPU CUDA utilization (blue) spikes, but during training the GPU decoder utilization (green) drops to zero. Overall decoder utilization also seems very low. Is there some way to keep the decoder running at all times so the downstream model doesn't stop training?
[Image: GPU utilization over time, showing CUDA utilization (blue) and GPU decoder utilization (green)]
This is a very simple GPU video processing pipeline in DALI that decodes, resizes, and then pads videos. I am using this pipeline to train a downstream model. Here are some of the parameters used to configure the pipeline:

dataloader:
  num_devices: 8
  last_batch_policy: nvidia.dali.plugin.base_iterator.LastBatchPolicy.PARTIAL
  video_pipeline_partial:
    _target_: nvidia.dali.pipeline_def
    _partial_: true
    batch_size: 272
    num_threads: 8
    py_num_workers: 16
    exec_dynamic: true
  video_resize_fn:
    _target_: nvidia.dali.fn.readers.video_resize
    _partial_: true
    device: "gpu"
    sequence_length: 17
    max_size: 256
    resize_longer: 256
    file_list_include_preceding_frame: true
    prefetch_queue_depth: 4
    pad_sequences: false
    pad_last_batch: false
    random_shuffle: true
    stick_to_shard: true
    minibatch_size: 128
  pad_fn:
    _target_: nvidia.dali.fn.pad
    _partial_: true
    fill_value: 0
    axes: [1,2]
    shape:
      - 256
      - 256

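For reference, the configuration above roughly corresponds to the following direct DALI calls (a sketch based on my reading of the config, not the exact code; the Hydra-style _target_/_partial_ entries are assumed to resolve to the DALI functions they name):

from nvidia.dali import fn, pipeline_def

# Rough equivalent of video_pipeline_partial + video_resize_fn + pad_fn above.
@pipeline_def(batch_size=272, num_threads=8, py_num_workers=16, exec_dynamic=True)
def video_pipeline(filenames, shard_id, num_shards):
    # GPU video reader with the resize fused into the same operator.
    video = fn.readers.video_resize(
        filenames=filenames,
        name="Reader",
        device="gpu",
        sequence_length=17,
        max_size=256,
        resize_longer=256,
        file_list_include_preceding_frame=True,
        prefetch_queue_depth=4,
        pad_sequences=False,
        pad_last_batch=False,
        random_shuffle=True,
        stick_to_shard=True,
        minibatch_size=128,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    # Pad frame height and width (axes 1 and 2) to 256x256.
    return fn.pad(video, fill_value=0, axes=[1, 2], shape=[256, 256])

# device_id is supplied when the pipeline is instantiated, e.g.:
# pipe = video_pipeline(file_list, shard_id=0, num_shards=8, device_id=0)
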
And here is the Python code where these are used:

class DaliVideoDataloader(DALIGenericIterator):
    """A DALI dataloader for video data."""

    def __init__(
        self,
        dataset: Dataset,
        video_pipeline_partial: Callable,
        video_resize_fn: Callable,
        pad_fn: Callable,
        device_id: int,
        num_devices: int,
        # DaliGenericIterator args
        size: int = -1,
        auto_reset: bool = False,
        last_batch_padded: bool = False,
        last_batch_policy: LastBatchPolicy = LastBatchPolicy.FILL,
        prepare_first_batch: bool = True,
    ) -> None:

        def video_pipeline(filenames: list[str | Path]) -> Any:
            video = video_resize_fn(filenames=filenames, name="Reader", num_shards=num_devices, shard_id=device_id)
            padded = pad_fn(video)
            return padded

        pipeline = video_pipeline_partial(fn=video_pipeline, device_id=device_id)
        pipe = pipeline(dataset.data_files)
        super().__init__(
            pipelines=[pipe],
            size=size,
            reader_name="Reader",
            auto_reset=auto_reset,
            last_batch_padded=last_batch_padded,
            last_batch_policy=last_batch_policy,
            prepare_first_batch=prepare_first_batch,
            output_map=["video"],
        )

    def __next__(self) -> torch.Tensor:  # pyright: ignore
        """Returns the next video tensor."""
        out = super().__next__()
        return out[0]["video"]

If this is expected behavior, that is fine, but I want to make sure there isn't a flag or misconfiguration causing this performance.

Thanks for your help!

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
ivandariojr added the question label on Dec 28, 2024
JanuszL (Contributor) commented Dec 30, 2024

Hi @ivandariojr,

Thank you for reaching out. The utilization plots you showed are a good place to start a more thorough analysis. I recommend capturing a profile with Nsight Systems to learn more details. It may be that a piece of CPU code in the training loop stalls the GPU work, and DALI is not the bottleneck but simply provides data at the pace the training can consume it.
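
As a rough sketch (assuming a PyTorch training loop similar to the one above; loader, model, optimizer, loss_fn, and num_steps are placeholders), you can wrap the data wait and the model step in NVTX ranges so the Nsight Systems timeline clearly separates them, and capture the run with something like `nsys profile -t cuda,nvtx -o report python train.py`:

import torch

it = iter(loader)
for step in range(num_steps):  # num_steps is a placeholder
    # Time spent in this range means the training loop is waiting for DALI output.
    torch.cuda.nvtx.range_push("data_wait")
    video = next(it)
    torch.cuda.nvtx.range_pop()

    # Time spent in this range is the forward/backward/optimizer work.
    torch.cuda.nvtx.range_push("model_step")
    loss = loss_fn(model(video))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()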
