
[WIP] Run inference with CB/non-CB SpD Models #155

Merged

merged 12 commits into quic:main on Dec 18, 2024

Conversation

quic-agokhale
Contributor

No description provided.

@Akshat-Tripathi

Hey all, to help with this effort, I previously also created a minimal inference script for SpD
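For context, the core of a minimal greedy speculative decoding (SpD) inference script looks roughly like the sketch below. This is a hedged illustration, not the linked script's actual code: the `draft_model`/`target_model` callables and the speculation length `k` are assumptions, standing in for the draft (DLM) and target (TLM) model forward passes.

```python
# Minimal greedy speculative decoding loop (a sketch under assumed interfaces,
# not the linked script's code). `draft_model` and `target_model` are assumed
# to be callables mapping a list of token ids to per-position next-token logits.
import numpy as np

def speculative_decode(draft_model, target_model, prompt_ids, max_new_tokens=128, k=4):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        # 1. Draft model (DLM) proposes k tokens autoregressively.
        drafted, ctx = [], list(out)
        for _ in range(k):
            tok = int(np.argmax(draft_model(ctx)[-1]))
            drafted.append(tok)
            ctx.append(tok)
        # 2. Target model (TLM) scores context + drafted tokens in one pass;
        #    logits[i] predicts the token that follows position i.
        logits = np.asarray(target_model(out + drafted))
        target_toks = np.argmax(logits[len(out) - 1 : len(out) + k - 1], axis=-1)
        # 3. Accept the longest prefix where draft and target agree, then
        #    append one bonus token from the target at the first mismatch
        #    (or after the last drafted token if all k were accepted).
        n_accepted = 0
        for d, t in zip(drafted, target_toks):
            if d != t:
                break
            n_accepted += 1
        out.extend(drafted[:n_accepted])
        out.append(int(target_toks[n_accepted]) if n_accepted < k
                   else int(np.argmax(logits[-1])))
    return out[len(prompt_ids):][:max_new_tokens]
```

Each iteration emits between one and k+1 tokens at the cost of one target forward pass, which is where the speedup over plain autoregressive decoding comes from.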

@eplatero97
Contributor

Hello all, I followed the implementation of Apoorva's inference app in tests/spd/test_spd_inference.py, but merged it into a more modularized pattern to fit the TextGeneration inference class in QEfficient.

This is far from perfect and needs more work and alignment before it can merge into QEfficient. So far I have only validated that the app works for full_batch_size=1; next week I can validate that it also works with higher batch sizes.

The general idea is that this PR will be used to validate that the features for exporting TLM/DLM models work as expected.
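As a rough illustration of the modularized pattern, the per-request results could be packaged as a small record like the sketch below. The class and constructor here are hypothetical placeholders, not QEfficient's actual API; only the field names mirror the statistics printed further down.

```python
# Hypothetical per-request statistics record; the class layout is an
# illustrative assumption, only the field names match the printed output.
from dataclasses import dataclass

@dataclass
class SpDRequestStats:
    generated_output: str            # decoded text for this request
    ttft: float                      # time-to-first-token, in seconds
    avg_decode_throughput: float     # tokens/second over the decode phase
    avg_num_accepted_tokens: float   # mean accepted draft tokens per target step

# Example: one record per request, matching the four printed lines below.
stats = SpDRequestStats("...", ttft=0.080, avg_decode_throughput=22.6,
                        avg_num_accepted_tokens=1.98)
print(f"{stats.ttft=}")
```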

An example of the statistics that this returns is below:

```
$ python3 QEfficient/generation/spd_text_generation_inference.py 
generated_output=', welcome my forum i new new this my is first.\n.\n am can help with any question problem youI happy to you with any i.\n.\n am here to you any any can.\nI to help you help any can.\nI to help help any i can am help any i i.\n am help help any i can to help you help any can am to help any i am help help any i can to help help any i am help you any am to you any can am help any can am help you any am help help any am am help you am help you any can help you am help am help'
ttft=0.0804790819529444
avg_decode_throughput=22.607819810201747
avg_num_accepted_tokens=1.9838709677419355
generated_output=', welcome my forum i new new this my is first.\n.\n am can help with any question problem youI happy to you with any i.\n.\n am here to you any any can.\nI to help you help any can.\nI to help help any i can am help any i i.\n am help help any i can to help you help any can am to help any i am help help any i can to help help any i am help you any am to you any can am help any can am help you any am help help any am am help you am help you any can help you am help am help'
ttft=0.07963072706479579
avg_decode_throughput=22.61431604245997
avg_num_accepted_tokens=1.9838709677419355
```

NOTE: avg_decode_throughput will be the same for each request, because it is calculated as the total number of generated tokens divided by the elapsed decode time over the whole decode batch.
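To make that note concrete, a sketch of the metric arithmetic follows; the function and variable names are assumptions for illustration, not the script's actual internals.

```python
# Sketch of the metric arithmetic behind the note above; all names here
# are illustrative assumptions, not the script's actual internals.
def decode_metrics(prefill_start, first_token_time, decode_start, decode_end,
                   total_decoded_tokens, total_accepted_tokens, num_target_steps):
    ttft = first_token_time - prefill_start
    # Computed once over the whole decode batch, which is why every request
    # in the same batch reports an identical avg_decode_throughput value.
    avg_decode_throughput = total_decoded_tokens / (decode_end - decode_start)
    # Accepted draft tokens averaged over target verification steps.
    avg_num_accepted_tokens = total_accepted_tokens / num_target_steps
    return ttft, avg_decode_throughput, avg_num_accepted_tokens
```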

@eplatero97 eplatero97 force-pushed the run_spd_infer branch 3 times, most recently from b9fc5e9 to b3c6da4 on November 19, 2024 11:59
@eplatero97 eplatero97 force-pushed the run_spd_infer branch 2 times, most recently from ac52060 to f59a988 on December 11, 2024 04:13
@ochougul ochougul merged commit 1d7c624 into quic:main Dec 18, 2024
4 checks passed