
Looking for complete conversion from pretrained huggingface model #611

Open

lionsheep24 opened this issue Jun 18, 2024 · 7 comments

lionsheep24 commented Jun 18, 2024

Hello,
I have pretrained a model with Hugging Face and attempted to deploy it using the TRTLLM-Triton Server method as documented here. However, I've noticed that the transcription results differ significantly from those of the original model run through the Transformers pipeline.

Upon further investigation, I compared the mel spectrograms and the decoding results between the TRT-LLM implementation and the original pipeline. Both showed noticeable differences, leading to degraded transcription accuracy in the TRT-LLM implementation; in some cases it even returned a blank string.
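A minimal sketch of such a feature comparison, assuming the openai-whisper package is installed and sample.wav is a hypothetical 16 kHz mono test clip (the checkpoint path is the one used below):

import numpy as np
import librosa
import whisper  # openai-whisper reference implementation
from transformers import WhisperFeatureExtractor

audio, _ = librosa.load("sample.wav", sr=16000)  # hypothetical test clip

# Hugging Face path: pads/trims to 30 s and returns log-mel features of shape (80, 3000).
fe = WhisperFeatureExtractor.from_pretrained("./models/whisper-large-v2/2")
hf_mel = fe(audio, sampling_rate=16000, return_tensors="np").input_features[0]

# OpenAI path: same 30 s padding, then the reference log-mel computation.
oa_mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).numpy()

print("max abs diff:", np.abs(hf_mel - oa_mel).max())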

Let me share my pipeline implementation:

import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    WhisperFeatureExtractor,
    WhisperTokenizer,
    pipeline,
)

model_ckpt = "./models/whisper-large-v2/2"
torch_dtype = torch.float16
feature_extractor: WhisperFeatureExtractor = WhisperFeatureExtractor.from_pretrained(pretrained_model_name_or_path=model_ckpt)
tokenizer: WhisperTokenizer = WhisperTokenizer.from_pretrained(pretrained_model_name_or_path=model_ckpt)
batch_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_ckpt,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    device_map="cuda:0",
)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=batch_model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=False,
    torch_dtype=torch_dtype,
    generate_kwargs={"language": "ko", "num_beams": 1, "do_sample": False},
)
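A transcription with this pipeline would then look roughly like the following, assuming sample.wav is a hypothetical 16 kHz test clip:

result = asr_pipeline("sample.wav")  # any 16 kHz mono clip works
print(result["text"])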

The TRT-LLM implementation is the same as in the link I mentioned earlier, and the engine was built with the script below. (The trtllm version is 0.11.0.dev2024060400.)

1. Save the HF model
from transformers import AutoModel
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half()
model.save_pretrained("/workspace/models/whisper-large-v2")  # save to /workspace/models/whisper-large-v2

2. Convert to OpenAI format
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2 --output_dir /workspace/models/whisper-openai --output_name large-v2

3. Convert to a TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2

4. Build the TensorRT-LLM engines
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/whisper-tensorrt-llm/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 4 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/whisper-tensorrt-llm/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 4 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
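A quick sanity check of the intermediate OpenAI-format checkpoint can isolate whether quality already degrades at the HF-to-OpenAI conversion or only later in the TRT-LLM build. A minimal sketch, assuming the openai-whisper package and that step 2 wrote large-v2.pt into its output directory (the file name is an assumption based on --output_name), with sample.wav as a hypothetical test clip:

import whisper

# Load the converted OpenAI-format checkpoint produced in step 2.
model = whisper.load_model("/workspace/models/whisper-openai/large-v2.pt")
result = model.transcribe("sample.wav", language="ko", beam_size=1)
print(result["text"])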

Client code for TensorRT-LLM + Triton Server:

from transformers import WhisperFeatureExtractor, WhisperTokenizer, AutoModelForSpeechSeq2Seq, pipeline
import torch
import numpy as np

from tritonclient.grpc import InferenceServerClient
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def send_whisper(
    samples: np.ndarray,
    triton_client: InferenceServerClient,
    protocol_client,
    model_name: str = "whisper-large-v2-tensorrt-llm",
    whisper_prompt: str = "<|startoftranscript|><|ko|><|transcribe|><|notimestamps|>"
):

    inputs = [
        protocol_client.InferInput("WAV", samples.shape, np_to_triton_dtype(samples.dtype)),
        protocol_client.InferInput("TEXT_PREFIX", [1, 1], "BYTES"),
    ]
    inputs[0].set_data_from_numpy(samples)
    
    input_data_numpy = np.array([whisper_prompt], dtype=object).reshape((1, 1))
    inputs[1].set_data_from_numpy(input_data_numpy)

    outputs = [protocol_client.InferRequestedOutput("TRANSCRIPTS")]
    sequence_id = 100000000  # Example sequence_id, this can be changed as needed
    
    response = triton_client.infer(model_name, inputs, request_id=str(sequence_id), outputs=outputs)
    
    decoding_results = response.as_numpy("TRANSCRIPTS")[0]
    if isinstance(decoding_results, np.ndarray):
        decoding_results = b" ".join(decoding_results).decode("utf-8")
    else:
        decoding_results = decoding_results.decode("utf-8")

    print(f"TensorRT LLM STT Result: {decoding_results}")

Could anyone help me understand why these discrepancies are occurring and how to resolve them?

Thank you in advance for your assistance.

@csukuangfj
Collaborator

@yuekaizhang Could you have a look at this issue?

@lionsheep24
Author

lionsheep24 commented Jun 18, 2024

Let me share my build script for trt-llm.

1. Save the HF model
from transformers import AutoModel
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half()
model.save_pretrained("/workspace/models/whisper-large-v2")  # save to /workspace/models/whisper-large-v2

2. Convert to OpenAI format
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2 --output_dir /workspace/models/whisper-openai --output_name large-v2

3. Convert to a TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2

4. Build the TensorRT-LLM engines
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/whisper-tensorrt-llm/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 4 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/whisper-tensorrt-llm/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 4 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

@yuekaizhang
Collaborator

@lionsheep24 Check #597 (comment). You may need to align the prompt, beam_size, and other hyper-parameters to get the same outputs.

There are several successful integrations of Whisper TRT-LLM you may refer to, e.g. https://github.com/Wordcab/wordcab-transcribe/tree/main/src/wordcab_transcribe/engines/tensorrt_llm. Your export steps also look good to me.

@lionsheep24
Author

lionsheep24 commented Jun 18, 2024

@yuekaizhang
I'm using the <|startoftranscript|><|ko|><|transcribe|><|notimestamps|> prompt with a beam_size of 1, and I found differences in the mel spectrograms extracted from the same audio array between the HF way and the OpenAI way.

Do you mean the decoding results should be the same even with different audio features? Some values in the HF features were -0.74171734 while the corresponding OpenAI values were 0.

I switched the compute_feature function to the HF WhisperFeatureExtractor, but the tokenizer throws OverflowError: out of range integral type conversion because the decoding result contains a -1 token.
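The immediate OverflowError can be worked around by dropping invalid ids before decoding, though that only hides the symptom rather than explaining why the engine emits -1. A minimal sketch, where output_ids is a hypothetical list of generated token ids and tokenizer is the WhisperTokenizer from above:

valid_ids = [i for i in output_ids if i >= 0]  # drop -1 placeholders before decoding
text = tokenizer.decode(valid_ids, skip_special_tokens=True)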

I reviewed the link you shared, but it seems similar to the current repo.

I'm not sure how the transcription results can be the same even though the extracted features are different.

@lionsheep24
Author

lionsheep24 commented Jun 20, 2024

Hi all! Any updates here?

I am curious why the audio features extracted from the same audio array differ when using the Hugging Face library compared to the method provided in this repository.

Additionally, I want to confirm whether it is correct for the values to differ. In my opinion, even if the model is converted, the input audio features should be the same.

When I fed the features extracted with the Hugging Face library into the TensorRT-LLM engine, I received a -1 token (which differs from the Hugging Face pipeline result), and this seems to have caused an error during decoding.

Feel free to let me know if you need any additional information!

@yuekaizhang
Collaborator

> Huggingface library compared to the method provided in this repository.

Theoretically, a minor difference in feature values would not have an effect on the transcript results. We actually support Hugging Face Distil-Whisper in TensorRT-LLM, which was trained with the Hugging Face feature extractor, yet it works with our feature extractor at inference time.

You may try replacing the feature extractor if you think that is the root cause.
@lionsheep24

@lionsheep24
Author

Yeah, I calculated the difference between the features from Hugging Face and the TensorRT-LLM example, and the absolute difference was up to 0.74. I don't think that's a minor difference.

I tried replacing the feature extractor with the Hugging Face one and feeding its features to TensorRT-LLM, but I got a -1 token from the engine, as I mentioned earlier.
@yuekaizhang
