An exploration of inference latency across various llama setups.
- I didn't explore throughput; that is a deep rabbit hole. I only measured latency for a single request (see the measurement sketch after this list). Throughput and latency can be traded off against each other through various forms of request batching.
- I tried my best to use each tool according to its documentation.
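
A minimal sketch of how single-request latency can be measured, assuming a local llama.cpp `llama-server` instance exposing its OpenAI-compatible API on port 8080 (the URL, prompt, and token limit below are placeholder assumptions, not the setup actually benchmarked). Streaming the response lets us separate time-to-first-token from total latency:

```python
import json
import time

import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

payload = {
    "model": "default",  # llama-server accepts an arbitrary model name
    "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
    "max_tokens": 64,
    "stream": True,  # stream so TTFT and total latency can be measured separately
}

start = time.perf_counter()
first_token_at = None
n_chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style SSE: each event is a "data: {...}" line, ending with "data: [DONE]"
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            n_chunks += 1  # stream chunks: a rough proxy for generated tokens

total = time.perf_counter() - start
print(f"time to first token: {first_token_at - start:.3f}s")
print(f"total latency:       {total:.3f}s for ~{n_chunks} chunks")
```

With a single in-flight request like this, the server's batch size is effectively 1, which is the low-latency end of the latency/throughput tradeoff mentioned above.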