How to add the llama3 prompt format to inference.py & minimal_chat.py? Is there a guide or tutorial? #437
-
I'm new to writing LLM scripts, so I don't understand how to correctly set up the prompt format. I'm trying to use Llama-3-70B-Instruct-exl2 2.4bpw to summarize and extract key words from long captions, which will then be used by my SDXL fine-tuner, so I don't need full chat, just a simple one-turn script. I've modified inference.py & minimal_chat.py, and with a simple prompt it seems to work fine. But the Llama 3 instructions state: "This format has to be exactly reproduced for effective use." Can anyone point me towards a guide or tutorial that shows how to correctly configure the prompt format when using exllamav2?

From https://huggingface.co/blog/llama3 ("How to prompt Llama 3"):

"The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning your own use cases. The Instruct versions use the following conversation structure:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
```

This format has to be exactly reproduced for effective use."
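For context, this is roughly what a single, filled-in one-turn prompt string would look like under that template. This is only a sketch based on the format quoted above, nothing exllamav2-specific; the helper name and example texts are made up:

```python
# Sketch only: a single-turn Llama 3 Instruct prompt assembled from the template
# quoted above. The helper name and the example system/user text are made up.

def build_llama3_prompt(system_prompt: str, user_msg: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        # End with the assistant header so the model generates its answer next
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You summarize image captions and extract key words.",
    "Summarize the following caption and list its key words: <caption text here>",
)
```

Generation then needs to stop on <|eot_id|>, which is what the reply below configures via the generator's stop conditions.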
-
There are many ways to do it, and I guess the most "normal" way would be to format the entire chat history each round as a single text string and tokenize it all. This would work well enough since the generator skips inference for any tokens that haven't changed when you call begin_stream_ex. For the minimal example I chose to concatenate tokenized sequences instead. For Llama3, that could look something like this:

```python
from exllamav2 import *
from exllamav2.generator import *
import sys, torch

print("Loading model...")

config = ExLlamaV2Config("/mnt/str/models/llama3-8b-instruct-exl2/4.0bpw/")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.single_id("<|eot_id|>")])  # <- Set the correct stop condition

gen_settings = ExLlamaV2Sampler.Settings()

system_prompt = "You are a duck."

prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"  # <- First round starts with BOS + system prompt
prompt += system_prompt
prompt += "<|eot_id|>"

while True:

    print()
    instruction = input("User: ")
    print()
    print("Assistant: ", end = "")

    prompt += "<|start_header_id|>user<|end_header_id|>\n\n"  # <- First prompt is system prompt plus this
    prompt += instruction
    prompt += "<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"

    instruction_ids = tokenizer.encode(prompt, encode_special_tokens = True)  # <- Make sure control tokens are treated as such
    context_ids = instruction_ids if generator.sequence_ids is None \
        else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

    generator.begin_stream_ex(context_ids, gen_settings)

    while True:
        res = generator.stream_ex()
        if res["eos"]: break
        print(res["chunk"], end = "")
        sys.stdout.flush()

    # Note, when the loop above breaks, generator.sequence_ids contains the tokenized context plus all tokens
    # sampled so far, including whatever caused a stop condition (in this case, <|eot_id|>), even though it is
    # not returned by stream_ex() as part of the streamed response.

    print()
    prompt = ""  # <- Start next round with empty string instead of BOS + system prompt
```