
convert Onnx problem #12

Open
xcxhy opened this issue May 8, 2023 · 11 comments

Comments

@xcxhy

xcxhy commented May 8, 2023

Hi, thanks for open-sourcing this. I would like to ask why the llama-7b model I converted with torch.onnx.export is not the same as the model you published on Hugging Face.
I ran your tools/export-onnx.py directly with the llama-7b model and it goes OOM. If I instead load the model with torch_dtype=torch.float16, no ONNX model is produced at the end of the run.
I only have 8 RTX 3090 cards. Is there any way to deploy LLaMA 13B with LoRA?

@tpoisonooo
Owner

Sorry for the poor documentation.
export-onnx.py is just an entry point that calls LLaMA inference.

If you want to convert to ONNX, you need this hacked branch: https://github.com/tpoisonooo/transformers/tree/add-convert

You will find 3 small commits there:
(screenshot of the three commits)

@tpoisonooo
Owner

I just added some torch.onnx.export calls and verification code inside it.

@tpoisonooo
Owner

tpoisonooo commented May 8, 2023

A 3090 has 24 GB of memory, which is enough to load the llama-7B model in fp16 but not in fp32, so please check the precision.
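For reference, a minimal sketch of loading llama-7b in fp16 with transformers; the checkpoint path here is an assumption, substitute your own:

```python
# Minimal sketch: load llama-7b in fp16 so it fits in a single 3090's 24 GB.
# The checkpoint path is an assumption; point it at your local llama-7b-hf weights.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "path/to/llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # ~13 GB of weights instead of ~26 GB in fp32
).cuda()
model.eval()
```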

For llama 13B with LoRA, I suggest that you

  1. merge the LoRA weights into the base model; there are many scripts for this (see the sketch after this list)
  2. quantize to 4-bit with GPTQ-for-LLaMa on Triton/CUDA
  3. or run llama.cpp on CPU
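A minimal sketch of step 1, merging the LoRA weights into the base model with peft (the paths below are assumptions; merge_and_unload is peft's merge helper):

```python
# Minimal sketch: fold the LoRA adapter into the base 13B weights with peft.
# Both paths are assumptions; use your own base model and adapter.
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained("path/to/llama-13b-hf", torch_dtype=torch.float16)
lora = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = lora.merge_and_unload()            # bakes the LoRA deltas into the base weights
merged.save_pretrained("llama-13b-merged")  # plain checkpoint, usable by GPTQ or llama.cpp converters
```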

@xcxhy
Author

xcxhy commented May 8, 2023

@tpoisonooo Thanks for your response. I have tried llama.cpp before, and it does accelerate inference on the CPU, but the speed is not significantly better than on the GPU.
I also tried quantizing to 4-bit, but the performance drops sharply.
I hope that converting to TRT can give a bigger speedup.

@tpoisonooo
Owner

For the 4-bit precision problem, I have contributed an --observe option to GPTQ-for-LLaMa, but my inference kernel (where each layer can use a different quantization option) is not finished yet. AutoGPTQ.tvm would be a good thing to try, although its author archived the code two days ago.
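For what it's worth, a minimal sketch of 4-bit GPTQ quantization with the AutoGPTQ package; the model path, calibration text, and group_size here are assumptions, not settings from this thread:

```python
# Minimal sketch: 4-bit GPTQ quantization with AutoGPTQ.
# Model path, calibration sample and group_size are assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "llama-13b-merged"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)

# a couple of tokenized calibration samples; real runs use a few hundred
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("llama-13b-4bit-gptq")
```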

For TRT, I have converted to .engine files; a precision check is on the way.

@tpoisonooo
Owner

cc @xcxhy NVIDIA/TensorRT#2928

@xcxhy
Author

xcxhy commented May 9, 2023

@tpoisonooo Thank you for the response, but I'm still stuck at the ONNX conversion part. Sorry, I pulled your branch, but I don't know how to convert to ONNX. I tried both onnx.export and optimum-cli today, but neither worked. Looking forward to your reply.

@tpoisonooo
Owner

tpoisonooo commented May 12, 2023


STEP 1. git clone https://github.com/tloen/alpaca-lora and run the generate.py example. This requires installing huggingface/transformers.

STEP 2. Once transformers is installed, it lives in your conda/pip environment. Find it.

STEP 3. Read the commit history of the branch above and apply those changes to the transformers source code in your conda/pip environment.

STEP 4. Run generate.py again; the torch.onnx.export calls added in STEP 3 will produce the ONNX files (see the sketch below).
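As a rough illustration only (not the exact code on the add-convert branch), the export call patched in during STEP 3 looks roughly like this; the module name, dummy shapes, and output path are assumptions:

```python
# Rough sketch of the kind of torch.onnx.export call patched into the transformers
# source in STEP 3. Module name, dummy shapes and output path are assumptions;
# llama-7b uses hidden size 4096.
import torch

def export_decoder_layer(decoder_layer, layer_id: int):
    # batch=1, seq_len=32 are arbitrary dummy sizes used only for tracing
    hidden_states = torch.zeros(1, 32, 4096, dtype=torch.float16, device="cuda")
    torch.onnx.export(
        decoder_layer,
        (hidden_states,),
        f"decoder_layer_{layer_id}.onnx",
        input_names=["hidden_states"],
        output_names=["output"],
        dynamic_axes={"hidden_states": {1: "seq_len"}},
        opset_version=16,
    )
```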

@xcxhy
Author

xcxhy commented May 16, 2023

@tpoisonooo Thanks for your response. I have studied this carefully for many days and have basically gotten through the whole pipeline. But now, when converting from PyTorch to ONNX, some If nodes appear in the ONNX graph, and they still remain after optimization with onnxsim. This leads to an error when using trtexec.

@tpoisonooo
Owner

I guess the If nodes come from LLaMA's past_key_values; try a zero-length tensor to eliminate them. @xcxhy

@tpoisonooo
Owner

That is, build a torch.Tensor or np.array with the shape [1, x, 0, x].
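A minimal sketch of such a zero-length past_key_values input; the head count, head_dim, and layer count are assumptions taken from the llama-7b config (32 heads, head_dim 128, 32 layers):

```python
# Minimal sketch: empty past_key_values with a fixed rank, so tracing always takes
# the "past exists" path and no If node is emitted for the prefill branch.
# 32 heads, head_dim 128 and 32 layers match llama-7b; adjust for other sizes.
import torch

num_layers, num_heads, head_dim = 32, 32, 128
past_key_values = tuple(
    (
        torch.zeros(1, num_heads, 0, head_dim, dtype=torch.float16),  # key cache, past length 0
        torch.zeros(1, num_heads, 0, head_dim, dtype=torch.float16),  # value cache, past length 0
    )
    for _ in range(num_layers)
)
```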
