Add support for exporting LLaMA to ONNX format #922
Conversation
First-time contribution. Need help running the CI.
tests/exporters/exporters_utils.py
Outdated
"opt": "hf-internal-testing/tiny-random-llama", | ||
"llama": "hf-internal-testing/tiny-random-OPTModel", |
It seems you inverted OPT and Llama here
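If so, the corrected mapping would presumably just swap the two repository names:

"llama": "hf-internal-testing/tiny-random-llama",
"opt": "hf-internal-testing/tiny-random-OPTModel",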
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Okay, so the hf-internal-testing model doesn't have any pytorch_model*.bin files. And for test_pipeline_ort_model, I guess we need to wait until the transformers 4.28 release. I will change tiny-random to llama if that's fine, as well as fix the point raised by @regisss.
Okay, I managed to get a powerful machine with lots of RAM and can confirm it works fine. Tested the 7-billion-parameter model. Inference: fp32, 28-31 s; ONNX, no optimizations:
The argument `from_transformers` is deprecated, and will be removed in optimum 2.0. Use `export` instead
Framework not specified. Using pt to export to ONNX.
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:475: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1:
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:46: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min))
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:108: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:231: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:238: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:243: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))
/home/user/miniconda3/envs/jupyter/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:249: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Saving external data to one file...
Saving external data to one file...
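For reference, a minimal sketch of the export-and-run flow behind the log above, assuming a local or Hub LLaMA checkpoint (the model id and save directory below are placeholders, not the exact paths used):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "path/to/llama-7b"  # placeholder for the checkpoint used above

tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True runs the ONNX export on the fly; it replaces the deprecated
# from_transformers=True argument mentioned in the log.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("llama-7b-onnx")

inputs = tokenizer("Tell me about alpacas.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))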
Small update:

Exception tb
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[4], line 3
      1 device = torch.device("cuda")
      2 t1 = time.time()
----> 3 evall = evaluate("Tell me about Alpacas", use_cache=False)
      4 print(time.time() - t1)

Cell In[3], line 44, in evaluate(instruction, input, temperature, top_p, top_k, num_beams, max_new_tokens, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py:1416, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/transformers/generation/utils.py:2211, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/optimum/modeling_base.py:85, in OptimizedModel.call(self, *args, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/optimum/onnxruntime/modeling_decoder.py:573, in ORTModelForCausalLM.forward(self, input_ids, attention_mask, past_key_values, labels, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/optimum/onnxruntime/base.py:63, in ORTModelPart.call(self, *args, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/optimum/onnxruntime/base.py:307, in ORTDecoder.forward(self, input_ids, attention_mask, past_key_values, labels, use_cache_branch)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py:746, in ORTModel._prepare_io_binding(self, model, ordered_input_names, known_output_shapes, outputs_to_not_bind, *model_inputs)

IndexError: list index out of range
Hi @nenkoru, would this work: https://huggingface.co/HuggingFaceM4/tiny-random-LlamaForCausalLM/tree/main?
I guess it won't work, because for some reason ORTModelForCausalLM fails to work with the use_cache=False option. I mean, the instance of the class initializes, but inference doesn't work (I provided a tb in a previous comment under a spoiler). UPD: tried this tiny-random llama (converted to ONNX using optimum-cli) and the same thing happens (tb in the comment above).
Small observation: UPD: monkey-patching ORTDecoder's forward method fixes the problem. UPD2: basically, use_cache=False does not make the forward method ignore past_key_values at all, and because there are no input_names for the inputs that get created when the model is executed, it fails with an exception. That is why this hack exists: it patches the forward method so that past_key_values are not used (see the sketch below).
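A minimal sketch of that kind of workaround, assuming the ORTDecoder.forward signature shown in the traceback above; this illustrates the idea only, not the exact patch used:

# Hypothetical monkey patch: wrap ORTDecoder.forward so that past_key_values
# handed over by generate() are dropped before the ONNX session runs, matching
# a model loaded with use_cache=False.
from optimum.onnxruntime.base import ORTDecoder

_original_forward = ORTDecoder.forward

def _forward_without_past(self, input_ids, attention_mask=None, past_key_values=None, **kwargs):
    # Always pass past_key_values=None so no unexpected inputs reach IO binding.
    return _original_forward(self, input_ids=input_ids, attention_mask=attention_mask, past_key_values=None, **kwargs)

ORTDecoder.forward = _forward_without_past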
@nenkoru You opened this PR on the main branch of your fork, so it appears I cannot push there. So to me, once the conflict is resolved and transformers is released, this looks good!
I will change the link. But what do you think about the issue I reported above?
About this issue:
It could be that the large memory requirement during the export has been fixed in #932 (if the issue was when merging the decoders). Although it could also be happening even before this postprocessing step, as we export the model in two parts (the decoder without past key values and the decoder with past key values).
To me, the exported model sizes look decent.
About the issue with the code snippet you gave: CUDA EP + the test at optimum/tests/onnxruntime/test_modeling.py, lines 2326 to 2331 in 2b5d950.
But what about my way of fixing that? It was working quite well with that fancy monkey patching.
@nenkoru Yes, I guess it works! Ideally we would want to fix the forward/IO Binding code itself and add a test for IO Binding + use_cache=False, but I think there's no strong reason for not reusing past key values.
For me it was necessary. Either way, as long as the exporter generates two files, one for the base model and the other for the model with past keys, it was loading 2x more VRAM, which is not the expected behaviour. And yep, I was unable to merge even on 256 GB of RAM; I guess that is fixed now by #932, as you mentioned above. Still, I think it should work as expected with no past keys. Just my two cents.
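For context, a sketch of the loading options under discussion; the directory path is a placeholder, and the exact per-version behaviour of these arguments is an assumption:

from optimum.onnxruntime import ORTModelForCausalLM

onnx_dir = "llama-7b-onnx"  # placeholder: directory produced by the export

# Reuse past key values (two decoder graphs, or one merged decoder once the
# post-processing from #932 succeeds).
model_with_cache = ORTModelForCausalLM.from_pretrained(onnx_dir, use_cache=True)

# Disable the cache entirely; past_key_values should then be ignored, which is
# the behaviour being discussed above.
model_without_cache = ORTModelForCausalLM.from_pretrained(onnx_dir, use_cache=False)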
I've exported a Llama-based model (using past key values + merged); however, when I run inference on it, the memory usage is greater (~67 GB for the 13B-parameter model) AND the runtime is significantly slower compared to the original model:

NUM TOKENS: 50
ORT TIMES: [4.2505810260772705, 3.958254814147949, 3.9548168182373047, 3.8123669624328613, 3.7423346042633057]
NORMAL TIMES: [1.8341279029846191, 1.831620693206787, 1.8367419242858887, 1.827073097229004, 1.8272616863250732]
NUM TOKENS: 100
ORT TIMES: [12.765657663345337, 11.290537118911743, 11.22416877746582, 11.383142709732056, 11.26672911643982]
NORMAL TIMES: [3.7361843585968018, 3.7319560050964355, 3.737166166305542, 3.7369601726531982, 3.735769748687744]
NUM TOKENS: 200
ORT TIMES: [40.04459619522095, 35.00810360908508, 33.88215517997742, 33.990755558013916, 36.0823016166687]
NORMAL TIMES: [7.595353603363037, 7.5860066413879395, 7.58692479133606, 7.597402095794678, 7.584235429763794]

And when I try to convert it with the … Any suggestions? Mind providing me with your package versions?
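For what it's worth, a sketch of how timings like the ones above can be collected; model paths, prompt, and run counts are placeholders, not the exact benchmark that produced these numbers:

import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

pt_id = "path/to/llama-13b"          # placeholder: original checkpoint
onnx_dir = "path/to/llama-13b-onnx"  # placeholder: exported model directory

tokenizer = AutoTokenizer.from_pretrained(pt_id)
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt")

ort_model = ORTModelForCausalLM.from_pretrained(onnx_dir)
pt_model = AutoModelForCausalLM.from_pretrained(pt_id)

def time_generate(model, num_tokens, runs=5):
    # Force a fixed number of generated tokens so the runs are comparable.
    timings = []
    for _ in range(runs):
        start = time.time()
        model.generate(**inputs, min_new_tokens=num_tokens, max_new_tokens=num_tokens)
        timings.append(time.time() - start)
    return timings

for n in (50, 100, 200):
    print("NUM TOKENS:", n)
    print("ORT TIMES:", time_generate(ort_model, n))
    print("NORMAL TIMES:", time_generate(pt_model, n))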
Hi @gilljon, is this with the CPU execution provider or CUDA EP? Hopefully the transformers release comes soon so that we can merge the PR and test it!
It seems like the issue was running an outdated torch. Upgrading torch gives the following timings:

NUM TOKENS: 20
ORT TIMES: [0.5817182064056396, 0.5831649303436279, 0.5821049213409424, 0.5825436115264893, 0.5824661254882812]
NORMAL TIMES: [0.4072282314300537, 0.4071035385131836, 0.4074513912200928, 0.40746521949768066, 0.407412052154541]
NUM TOKENS: 50
ORT TIMES: [1.5875389575958252, 1.565403938293457, 1.5717802047729492, 1.5658864974975586, 1.5674340724945068]
NORMAL TIMES: [1.0803217887878418, 1.0725362300872803, 1.0720314979553223, 1.0732998847961426, 1.0723967552185059]
NUM TOKENS: 100
ORT TIMES: [3.256457805633545, 3.2374284267425537, 3.2630958557128906, 3.241764545440674, 3.2146661281585693]
NORMAL TIMES: [2.1945881843566895, 2.1934919357299805, 2.2603249549865723, 2.1941864490509033, 2.19608736038208]
NUM TOKENS: 200
ORT TIMES: [6.571287631988525, 7.072286605834961, 6.634500503540039, 6.614758729934692, 6.610734224319458]
NORMAL TIMES: [4.461749792098999, 4.541143417358398, 4.538911581039429, 4.538933992385864, 4.537773847579956]

The model has the following:

{
"one_external_file": true,
"opset": null,
"optimization": {
"disable_attention": null,
"disable_attention_fusion": false,
"disable_bias_gelu": null,
"disable_bias_gelu_fusion": false,
"disable_bias_skip_layer_norm": null,
"disable_bias_skip_layer_norm_fusion": false,
"disable_embed_layer_norm": true,
"disable_embed_layer_norm_fusion": true,
"disable_gelu": null,
"disable_gelu_fusion": false,
"disable_layer_norm": null,
"disable_layer_norm_fusion": false,
"disable_shape_inference": true,
"disable_skip_layer_norm": null,
"disable_skip_layer_norm_fusion": false,
"enable_gelu_approximation": false,
"enable_transformers_specific_optimizations": true,
"fp16": false,
"no_attention_mask": false,
"optimization_level": 2,
"optimize_for_gpu": false,
"optimize_with_onnxruntime_only": null,
"use_mask_index": false
},
"optimum_version": "1.7.4.dev0",
"quantization": {},
"transformers_version": "4.28.0.dev0",
"use_external_data_format": true
}

and was created using the standard … Any ideas? This is with …
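A config like this is what ORTOptimizer typically writes out. A rough sketch of producing it, under the assumption that the model above went through an O2-style optimization pass (paths are placeholders):

from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export (or load an already exported model), then optimize it.
model = ORTModelForCausalLM.from_pretrained("path/to/llama-13b", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(
    optimization_level=2,    # matches "optimization_level": 2 above
    optimize_for_gpu=False,  # matches "optimize_for_gpu": false
    fp16=False,              # matches "fp16": false
)
optimizer.optimize(save_dir="path/to/llama-13b-onnx-o2", optimization_config=optimization_config)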
For runtime, I am not sure; I will profile better once this is merged. For memory, this is probably related: microsoft/onnxruntime#14526. I've seen a similar issue with gpt-j.
@sam-h-bean You can definitely host it on Triton in ONNX, although the memory footprint is still greater than with raw Python. Alternatively, you can deploy it on Triton using the Python backend.
@gilljon Do you think it will get to the point where the memory usage with ONNX is smaller? At 67 GB it wouldn't even fit on an A100.
Using #975 instead of this branch, as I could not push here. Llama ONNX export will be included in today's release! Not sure if this will work well with TRT, and I always have issues with ORT CUDA EP memory usage.
This PR adds support for exporting LLaMA models to the ONNX format.
Fixes #918
Before submitting