
When using the crop_mirror_normalize func, output layout "CHW" is slower than "HWC" #216

Open · qihang720 opened this issue Oct 10, 2023 · 5 comments

qihang720 commented Oct 10, 2023

  1. Output layout is "CHW", profiled with perf_analyzer:
# Imports added for completeness; autoserialize is assumed to come from
# the DALI Triton backend plugin.
from nvidia.dali.plugin.triton import autoserialize
import nvidia.dali as dali
import nvidia.dali.types as types

@autoserialize
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="encoded")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="CHW",
                                           crop=(299, 299),
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images

(screenshot: perf_analyzer results for the CHW pipeline)

  2. Output layout is "HWC", profiled with perf_analyzer:
@autoserialize
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="encoded")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="HWC",
                                           crop=(299, 299),
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images

(screenshot: perf_analyzer results for the HWC pipeline)

Most of the time the model input layout is "NCHW"; is there any way we can improve performance?
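A note on why the two layouts can differ in cost (this is an assumption about the mechanism, illustrated with NumPy rather than DALI): decoded images arrive in HWC layout, so a "CHW" output forces crop_mirror_normalize to transpose while writing, turning contiguous writes into strided ones. A minimal sketch of that layout change:

```python
import numpy as np

# A decoded RGB image is laid out HWC: each pixel's R,G,B values are adjacent.
hwc = np.arange(299 * 299 * 3, dtype=np.float32).reshape(299, 299, 3)

# "CHW" output makes each colour plane contiguous instead, so every output
# element must be gathered with a stride of 3 from the HWC input.
chw = np.ascontiguousarray(hwc.transpose(2, 0, 1))

print(chw.shape)                             # (3, 299, 299)
print(np.array_equal(chw[0], hwc[:, :, 0]))  # True: plane 0 is the R channel
```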

JanuszL (Collaborator) commented Oct 10, 2023

Hi @qihang720,

Can you provide more details about the environment you are using to run your tests?
Can you reproduce similar numbers using DALI as a standalone library?
We recently introduced a couple of optimizations to crop_mirror_normalize, so updating to the latest Triton version is a good first step to confirm whether your use case has improved.

qihang720 (Author) commented

I used nvcr.io/nvidia/tritonserver:23.05-py3 as my working environment.
DALI version: nvidia-dali-cuda110 1.29.0

I'm not sure how to profile DALI on its own, because every input image's encoded length is different. Triton can batch these dynamic shapes for me when I add the ragged_batches option.

I will test with the new version later.
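For timing a pipeline outside Triton, a common pattern is to warm up once and then average over several runs. A sketch (run_pipeline is a hypothetical stand-in for building the DALI pipeline and calling pipe.run() on a fixed set of pre-read encoded files):

```python
import time

def run_pipeline():
    # Hypothetical stand-in: in a real benchmark this would call pipe.run()
    # on a batch of pre-read encoded images.
    time.sleep(0.001)

# Warm up first: the initial run includes graph build and allocations.
run_pipeline()

iters = 10
start = time.perf_counter()
for _ in range(iters):
    run_pipeline()
avg_ms = (time.perf_counter() - start) / iters * 1000
print(f"avg per batch: {avg_ms:.2f} ms")
```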

JanuszL (Collaborator) commented Oct 11, 2023

Hi @qihang720,

nvcr.io/nvidia/tritonserver:23.05-py3 uses DALI 1.25. In DALI 1.30 we did a couple of optimizations for the crop_mirror_normalize operator. Please stay tuned for Triton 23.10, which should include this DALI version.
Also, the biggest gain from GPU processing is visible when you process a batch of data. Do you see similar results for bigger batches?

qihang720 (Author) commented

Hi @JanuszL,

Thanks for your advice; I will keep an eye on Triton 23.10.

Regarding batching: each of my inputs is different, so the base64 length differs too; how can I batch them together?

JanuszL (Collaborator) commented Oct 13, 2023

Hi @qihang720,

> Regarding batching: each of my inputs is different, so the base64 length differs too; how can I batch them together?

Please check if this part of our documentation answers your question.
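As a rough illustration of the ragged-batch idea (an assumption about the convention, shown with plain NumPy): each encoded image travels as a variable-length 1-D uint8 buffer, so a batch is a list of per-sample arrays rather than one dense tensor, which is the shape of input that external_source and Triton's ragged_batches option deal with.

```python
import numpy as np

# Hypothetical stand-ins for three encoded payloads of different lengths.
payloads = [b"\xff\xd8" + b"a" * n for n in (1000, 2500, 4200)]

# A "ragged batch": a plain list of 1-D uint8 arrays, one per sample,
# each with its own length.
batch = [np.frombuffer(p, dtype=np.uint8) for p in payloads]

print([b.shape[0] for b in batch])  # [1002, 2502, 4202]
```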
