Bottleneck in _initialize_and_fill_output function in model_runner_cpp.py #2827

Open

nicekevin opened this issue Feb 26, 2025

While running Qwen2-VL with TRT-LLM, profiling identified a performance bottleneck in the "_initialize_and_fill_output" function in "model_runner_cpp.py", which consumes over 1.18 seconds per query.

I suspect this is because "self.session.await_responses(request_ids)" blocks until model inference for the request completes.
I'd appreciate any solution or suggestion for optimizing this.

Thanks,

  1. cProfile result

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    10   11.836    1.184   11.870    1.187  /root/anaconda3/envs/tensorrt_qvl/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py:861(_initialize_and_fill_output)
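For reference, a profile like the one above can be collected with the standard-library cProfile; a minimal sketch, assuming the same "model", "questions", "input_image_paths", and "args" objects used in run.py below:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
for i in range(len(questions)):
    model.run(questions[i], input_image_paths[i], args.max_new_tokens)
profiler.disable()

# Sort by cumulative time to surface _initialize_and_fill_output.
pstats.Stats(profiler).sort_stats("cumtime").print_stats(15)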


  2. model_runner_cpp.py

def _initialize_and_fill_output(
    self,
    *,
    request_ids,
    end_id,
    return_dict,
    output_sequence_lengths,
    output_generation_logits,
    output_log_probs,
    output_cum_log_probs,
    batch_input_ids,
    streaming,
    sampling_config: SamplingConfigType,
    is_draft_target_model: bool = False,
):
    num_sequences = self._get_num_sequences(sampling_config)
    # (batch_size, num_sequences, sequence_len)
    output_ids = [[[] for _ in range(num_sequences)]
                  for _ in range(len(request_ids))]

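    # Note: await_responses() blocks until the executor returns responses
    # for every request id, i.e. until generation for the whole batch has
    # completed in this non-streaming path.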
    multi_responses = self.session.await_responses(request_ids)
    responses = [
        response for responses in multi_responses for response in responses
    ]

    return self._fill_output(
        responses=responses,
        output_ids=output_ids,
        end_id=end_id,
        return_dict=return_dict,
        output_sequence_lengths=output_sequence_lengths,
        output_generation_logits=output_generation_logits,
        output_log_probs=output_log_probs,
        output_cum_log_probs=output_cum_log_probs,
        batch_input_ids=batch_input_ids,
        batch_input_ids_list=[],
        streaming=streaming,
        request_ids=request_ids,
        return_all_generated_tokens=False,
        sampling_config=sampling_config,
        is_draft_target_model=is_draft_target_model,
    )
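Since await_responses is the hand-off to the C++ executor, the 1.18 s cumtime above should mostly be the autoregressive decode itself rather than Python-side overhead. A minimal instrumentation sketch to confirm this ("timed" is a hypothetical helper; the commented lines show where it would wrap the blocking call inside _initialize_and_fill_output):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Wall-clock timer used to attribute latency inside the runner.
    t0 = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - t0:.3f}s")

# Inside _initialize_and_fill_output, wrapping the blocking call:
#     with timed("await_responses"):
#         multi_responses = self.session.await_responses(request_ids)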

  3. run.py

for i in range(len(questions)):
    input_text, output_text = model.run(questions[i], input_image_paths[i], args.max_new_tokens)
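Each loop iteration is a fully synchronous round trip, so the per-query wall time can be checked against the cProfile number directly; a minimal sketch, reusing the same objects as above:

import time

for i in range(len(questions)):
    t0 = time.perf_counter()
    input_text, output_text = model.run(questions[i], input_image_paths[i],
                                        args.max_new_tokens)
    print(f"query {i}: {time.perf_counter() - t0:.3f}s")

If this also comes out near 1.18 s per query, the time spent in _initialize_and_fill_output is simply end-to-end generation latency. The usual levers would then be lowering max_new_tokens, submitting multiple queries per generate call (if the multimodal runner supports batched inputs) so the executor's in-flight batching can overlap them, or streaming so tokens are consumed as they are produced.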


[Environment]

Python: 3.10
TRT-LLM: 0.16.0
Transformers: 4.46.0
GPU: A100 40GB

