
When using the crop_mirror_normalize func, output layout "CHW" is slower than "HWC" #216

Open · qihang720 opened this issue Oct 10, 2023 · 5 comments

qihang720 commented Oct 10, 2023

  1. Output layout is "CHW", profiled with perf_analyzer:
# Imports added for completeness; autoserialize is assumed to come from
# the DALI Triton backend plugin.
from nvidia.dali.plugin.triton import autoserialize
import nvidia.dali as dali
import nvidia.dali.types as types

@autoserialize
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="encoded")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="CHW",
                                           crop=(299, 299),
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images

(screenshot: perf_analyzer results for the CHW pipeline)

  2. Output layout is "HWC", profiled with perf_analyzer:
@autoserialize
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="encoded")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="HWC",
                                           crop=(299, 299),
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images

(screenshot: perf_analyzer results for the HWC pipeline)

Most of the time the model input layout is "NCHW"; is there any way we can improve performance?
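A note on why the two layouts can differ in cost (this is an assumption about the mechanism, illustrated with NumPy rather than DALI): decoded images arrive in HWC layout, so a "CHW" output forces crop_mirror_normalize to transpose while writing, turning contiguous writes into strided ones. A minimal sketch of that layout change:

```python
import numpy as np

# A decoded RGB image is laid out HWC: each pixel's R,G,B values are adjacent.
hwc = np.arange(299 * 299 * 3, dtype=np.float32).reshape(299, 299, 3)

# "CHW" output makes each colour plane contiguous instead, so every output
# element must be gathered with a stride of 3 from the HWC input.
chw = np.ascontiguousarray(hwc.transpose(2, 0, 1))

print(chw.shape)                             # (3, 299, 299)
print(np.array_equal(chw[0], hwc[:, :, 0]))  # True: plane 0 is the R channel
```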

JanuszL (Collaborator) commented Oct 10, 2023

Hi @qihang720,

Can you provide more details about the environment you are using to run your tests?
Can you reproduce similar numbers using DALI as a standalone library?
We recently introduced a couple of optimizations to crop_mirror_normalize, so updating to the latest Triton version is a good first step to confirm whether your use case has improved.

qihang720 (Author) commented

I used nvcr.io/nvidia/tritonserver:23.05-py3 as my working environment.
DALI version: nvidia-dali-cuda110 1.29.0

I'm not sure how to profile DALI on its own, because every input image's encoded length is different. Triton can batch these dynamic shapes for me when I add the ragged_batches option.

I will test with the new version later.
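For timing a pipeline outside Triton, a common pattern is to warm up once and then average over several runs. A sketch (run_pipeline is a hypothetical stand-in for building the DALI pipeline and calling pipe.run() on a fixed set of pre-read encoded files):

```python
import time

def run_pipeline():
    # Hypothetical stand-in: in a real benchmark this would call pipe.run()
    # on a batch of pre-read encoded images.
    time.sleep(0.001)

# Warm up first: the initial run includes graph build and allocations.
run_pipeline()

iters = 10
start = time.perf_counter()
for _ in range(iters):
    run_pipeline()
avg_ms = (time.perf_counter() - start) / iters * 1000
print(f"avg per batch: {avg_ms:.2f} ms")
```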

JanuszL (Collaborator) commented Oct 11, 2023

Hi @qihang720,

nvcr.io/nvidia/tritonserver:23.05-py3 uses DALI 1.25. In DALI 1.30 we did a couple of optimizations for the crop_mirror_normalize operator. Please stay tuned for Triton 23.10, which should include this DALI version.
Also, the biggest gain from GPU processing is visible when you process a batch of data. Do you see similar results for bigger batches?

qihang720 (Author) commented

Hi @JanuszL,

Thanks for your advice; I will keep an eye on Triton 23.10.

Regarding batching: each of my inputs is different, so the base64 length differs too; how can I batch them together?

JanuszL (Collaborator) commented Oct 13, 2023

Hi @qihang720,

> Regarding batching: each of my inputs is different, so the base64 length differs too; how can I batch them together?

Please check if this part of our documentation answers your question.
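As a rough illustration of the ragged-batch idea (an assumption about the convention, shown with plain NumPy): each encoded image travels as a variable-length 1-D uint8 buffer, so a batch is a list of per-sample arrays rather than one dense tensor, which is the shape of input that external_source and Triton's ragged_batches option deal with.

```python
import numpy as np

# Hypothetical stand-ins for three encoded payloads of different lengths.
payloads = [b"\xff\xd8" + b"a" * n for n in (1000, 2500, 4200)]

# A "ragged batch": a plain list of 1-D uint8 arrays, one per sample,
# each with its own length.
batch = [np.frombuffer(p, dtype=np.uint8) for p in payloads]

print([b.shape[0] for b in batch])  # [1002, 2502, 4202]
```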
