-
This is my gist to convert a Whisper model fully to HQQ: https://gist.github.com/huseinzol05/70daae3a4557616f315e7744ba3fcc93. The speed does not seem faster than Flash Attention 2 on 30-second examples, but a simple matmul is faster: https://gist.github.com/huseinzol05/ff59996034604d17c1e53074e9adc03f
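For anyone who wants to reproduce the isolated-matmul comparison, here is a minimal sketch. It assumes the HQQLinear(layer, quant_config=...) constructor shown in the HQQ examples and the plain PyTorch backend; with that backend the quantized layer typically dequantizes on the fly and will be slower than fp16, and the speed-up only shows up once an optimized int4 kernel (e.g. the torchao backend used below) is patched in.

import copy
import time
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQBackend, HQQLinear

device, dtype = "cuda:0", torch.float16
layer = torch.nn.Linear(4096, 4096, bias=False).to(device=device, dtype=dtype)
x = torch.randn(1, 4096, dtype=dtype, device=device)

# Quantize a copy so the fp16 layer stays intact for the comparison
HQQLinear.set_backend(HQQBackend.PYTORCH)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
qlayer = HQQLinear(copy.deepcopy(layer), quant_config=quant_config, compute_dtype=dtype, device=device)

def bench(fn, iters=200):
    for _ in range(10):  # warm-up
        fn(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("fp16 linear:", bench(layer), "sec / call")
print("hqq linear :", bench(qlayer), "sec / call")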
-
So I was able to run a benchmark and compare with vanilla fp16. It is not as straightforward because the encoder and decoder require different logic.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-medium"
compute_dtype = torch.bfloat16  # please don't change this
device = "cuda:0"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=compute_dtype)
processor = AutoProcessor.from_pretrained(model_id)

##############################################################################
# No quantize
# model = model.to(device)
##############################################################################
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *

# Please keep nbits=4 and axis=1
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
HQQLinear.set_backend(HQQBackend.PYTORCH)
AutoHQQHFModel.quantize_model(model.model.encoder, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
AutoHQQHFModel.quantize_model(model.model.decoder, quant_config=quant_config, compute_dtype=compute_dtype, device=device)

# Replace HQQLinear layers' matmuls to support int4 mm (torchao backend for the decoder only)
import hqq.models.base as hqq_base
hqq_base._QUANT_LAYERS = [torch.nn.Linear, HQQLinear]
from hqq.utils.patching import prepare_for_inference
AutoHQQHFModel.set_auto_linear_tags(model.model.encoder)
prepare_for_inference(model.model.encoder)
AutoHQQHFModel.set_auto_linear_tags(model.model.decoder)
prepare_for_inference(model.model.decoder, backend="torchao_int4")

# Compile both forwards with full graphs; the first iterations are slow (warm-up)
model.model.encoder.forward = torch.compile(model.model.encoder.forward, mode="reduce-overhead", fullgraph=True)
model.model.decoder.forward = torch.compile(model.model.decoder.forward, mode="reduce-overhead", fullgraph=True)
##############################################################################
import time
import numpy as np

# Encoder benchmark: one 30-second mel-spectrogram chunk (80 bins x 3000 frames)
encoder_input = torch.randn([1, 80, 3000], dtype=compute_dtype, device=device)

def run_encoder():
    with torch.no_grad():
        model.model.encoder(encoder_input)
    torch.cuda.synchronize()

t = []
for _ in range(200):
    t1 = time.time()
    run_encoder()
    t2 = time.time()
    t.append(t2 - t1)
print("Encoder", np.mean(t[-100:]), "sec / sample")

# Decoder benchmark: a single-token forward pass (no KV cache)
decoder_input = torch.randint(0, 1000, [1, 1], dtype=torch.int64, device=device)

def run_decoder():
    with torch.no_grad():
        out = model.model.decoder(decoder_input)
    torch.cuda.synchronize()

t = []
for _ in range(200):
    t1 = time.time()
    run_decoder()
    t2 = time.time()
    t.append(t2 - t1)
print("Decoder", np.mean(t[-100:]), "sec / sample")
-
I tested the torch.compile code and it works. But it does not work on distil models (distil-whisper/distil-large-v3). How can I solve this?

Error message:

TorchRuntimeError: Failed running call_module L__self___conv1(*(FakeTensor(..., device='cuda:0', size=(1, 80, 3000), dtype=torch.bfloat16),), **{}):
Invalid channel dimensions

from user code:
File "/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1172, in forward
inputs_embeds = nn.functional.gelu(self.conv1(input_features))

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
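A likely cause, though only an assumption here: distil-large-v3 is derived from large-v3, which uses 128 mel bins instead of 80, so the hardcoded [1, 80, 3000] dummy encoder input fails at conv1. Reading the bin count from the model config sidesteps the hardcoding:

# Build the dummy encoder input from the model config instead of hardcoding 80 mel bins
n_mels = model.config.num_mel_bins  # 80 for whisper-medium, 128 for (distil-)large-v3
encoder_input = torch.randn([1, n_mels, 3000], dtype=compute_dtype, device=device)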
-
@huseinzol05 @kadirnar But token decoding is definitely significantly faster with HQQ using the torchao backend and fullgraph compilation when you measure it alone, as I shared here. Depending on the size of the cache, that speed-up will decrease a bit.
-
Feature request to add static cache support to Whisper: huggingface/transformers#30707
-
@mobicham Is the code for testing with long-form audio (>30 s) available publicly?
-
@mobicham Can I use torch.compile with HQQ optimization? |
-
Moving the HQQ-Whisper conversation here.