An LLM written from scratch in CUDA, for inference. It follows the architecture presented in the Llama 3.1 paper, "The Llama 3 Herd of Models". It is meant to be used with pretrained weights; navigate to main.cu and set your weights path on line 88. This repository also contains a tokenizer written from scratch in C++, as close to the Llama tokenizer as possible, using the GPT-4o (o200k_base) regex pattern with BPE.
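To illustrate the BPE step the tokenizer performs after the regex pre-split, here is a minimal sketch of the greedy merge loop. It assumes a rank table mapping token pairs to merge priorities (lower rank merges first), as in o200k_base-style tokenizers; the function and variable names are illustrative, not taken from this repository.

```cpp
#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Greedy BPE: repeatedly merge the adjacent pair with the lowest rank
// until no pair in the sequence appears in the rank table.
std::vector<std::string> bpe_merge(
    std::vector<std::string> tokens,
    const std::map<std::pair<std::string, std::string>, int>& ranks) {
    while (tokens.size() > 1) {
        int best_rank = INT_MAX;
        size_t best_i = 0;
        for (size_t i = 0; i + 1 < tokens.size(); ++i) {
            auto it = ranks.find({tokens[i], tokens[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank == INT_MAX) break;  // no mergeable pair left
        tokens[best_i] += tokens[best_i + 1];  // fuse the winning pair
        tokens.erase(tokens.begin() + best_i + 1);
    }
    return tokens;
}
```

In the real tokenizer the ranks come from the o200k_base merge table and the loop runs per regex-matched chunk, but the merge order logic is the same.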