
How to add the llama3 prompt format to inference.py & minimal_chat.py? Is there a guide or tutorial? #437

Answered by turboderp
minienglish1 asked this question in Q&A

There are many ways to do it, and I guess the most "normal" way would be to format the entire chat history as a single text string each round and tokenize it all. This works well enough, since the generator skips inference for any tokens that haven't changed when you call begin_stream_ex.
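
As a rough sketch of that first approach (not part of the original answer), the whole history could be rebuilt with the standard Llama 3 instruct template each round and re-tokenized. The format_llama3 helper and the history structure below are made up for illustration:

# Illustrative only: rebuild the full Llama 3 prompt string every round.
def format_llama3(system_prompt, history):
    # history: list of (user_text, assistant_text or None) tuples
    text = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    text += system_prompt + "<|eot_id|>"
    for user_text, assistant_text in history:
        text += "<|start_header_id|>user<|end_header_id|>\n\n" + user_text + "<|eot_id|>"
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
        if assistant_text is not None:
            text += assistant_text + "<|eot_id|>"
    return text

# Each round, re-tokenize the whole thing and hand it to the streaming generator.
# begin_stream_ex only reprocesses the tokens that changed since the last call,
# so this stays cheap even as the history grows.
# ids = tokenizer.encode(format_llama3(system_prompt, history), encode_special_tokens = True)
# generator.begin_stream_ex(ids, settings)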

For the minimal example I chose to concatenate tokenized sequences instead. For Llama3, that could look something like this:

from exllamav2 import *
from exllamav2.generator import *
import sys, torch

print("Loading model...")

config = ExLlamaV2Config("/mnt/str/models/llama3-8b-instruct-exl2/4.0bpw/")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)

to…
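
The answer is truncated above. Purely as a sketch of how the tokenized-segment version might continue (not turboderp's original code, and assuming exllamav2's ExLlamaV2Tokenizer, ExLlamaV2StreamingGenerator and ExLlamaV2Sampler APIs), the rest could look roughly like this:

# Sketch only: stream replies with the Llama 3 template, concatenating token IDs.
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.8

# Stop at the model's end-of-turn marker.
generator.set_stop_conditions([tokenizer.eos_token_id, "<|eot_id|>"])

def encode(text):
    # Encode a template fragment, keeping the special tokens as single IDs.
    return tokenizer.encode(text, encode_special_tokens = True, add_bos = False)

system_prompt = "You are a helpful assistant."
context_ids = encode("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
                     + system_prompt + "<|eot_id|>")

while True:

    user_input = input("\nUser: ")
    context_ids = torch.cat([
        context_ids,
        encode("<|start_header_id|>user<|end_header_id|>\n\n" + user_input + "<|eot_id|>"),
        encode("<|start_header_id|>assistant<|end_header_id|>\n\n")
    ], dim = -1)

    generator.begin_stream_ex(context_ids, settings)

    print("\nAssistant: ", end = "")
    response_text = ""
    while True:
        res = generator.stream_ex()
        print(res["chunk"], end = "", flush = True)
        response_text += res["chunk"]
        if res["eos"]: break
    print()

    # Append the reply plus end-of-turn so the next round carries the full history.
    context_ids = torch.cat([context_ids, encode(response_text + "<|eot_id|>")], dim = -1)

Re-encoding the streamed reply text is the simplest way to extend the history in a sketch like this; context-length management is left out.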

Answer selected by minienglish1