How to add the llama3 prompt format to inference.py & minimal_chat.py? Is there a guide or tutorial? #437
-
I'm new to writing LLM scripts, so I don't understand how to correctly set up the prompt format. I'm trying to use Llama-3-70B-Instruct-exl2 2.4bpw to summarize and extract key words from long captions, which will then be used by my SDXL fine-tuner, so I don't need full chat, just a simple one-turn script. I've modified inference.py & minimal_chat.py, and with a simple prompt it seems to work fine. But the Llama 3 instructions state: "This format has to be exactly reproduced for effective use." Can anyone point me towards a guide or tutorial that shows how to correctly configure the prompt format when using exllamav2?

From https://huggingface.co/blog/llama3 ("How to prompt Llama 3"):

"The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning your own use cases. The Instruct versions use the following conversation structure:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
```

This format has to be exactly reproduced for effective use."
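For context, this is roughly what a single, filled-in one-turn prompt string would look like under that template. This is only a sketch based on the format quoted above, nothing exllamav2-specific; the helper name and example texts are made up:

```python
# Sketch only: a single-turn Llama 3 Instruct prompt assembled from the template
# quoted above. The helper name and the example system/user text are made up.

def build_llama3_prompt(system_prompt: str, user_msg: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        # End with the assistant header so the model generates its answer next
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You summarize image captions and extract key words.",
    "Summarize the following caption and list its key words: <caption text here>",
)
```

Generation then needs to stop on <|eot_id|>, which is what the reply below configures via the generator's stop conditions.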
-
There are many ways to do it, and I guess the most "normal" way would be to format the entire chat history each round as a single text string and tokenize it all. This would work well enough since the generator skips inference for any tokens that haven't changed when you call begin_stream_ex. For the minimal example I chose to concatenate tokenized sequences instead. For Llama3, that could look something like this:

```python
from exllamav2 import *
from exllamav2.generator import *
import sys, torch

print("Loading model...")

config = ExLlamaV2Config("/mnt/str/models/llama3-8b-instruct-exl2/4.0bpw/")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.single_id("<|eot_id|>")])  # <- Set the correct stop condition

gen_settings = ExLlamaV2Sampler.Settings()

system_prompt = "You are a duck."

prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"  # <- First round starts with BOS + system prompt
prompt += system_prompt
prompt += "<|eot_id|>"

while True:

    print()
    instruction = input("User: ")
    print()
    print("Assistant: ", end = "")

    prompt += "<|start_header_id|>user<|end_header_id|>\n\n"  # <- First prompt is system prompt plus this
    prompt += instruction
    prompt += "<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"

    instruction_ids = tokenizer.encode(prompt, encode_special_tokens = True)  # <- Make sure control tokens are treated as such
    context_ids = instruction_ids if generator.sequence_ids is None \
        else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

    generator.begin_stream_ex(context_ids, gen_settings)

    while True:
        res = generator.stream_ex()
        if res["eos"]: break
        print(res["chunk"], end = "")
        sys.stdout.flush()

    # Note, when the loop above breaks, generator.sequence_ids contains the tokenized context plus all tokens
    # sampled so far, including whatever caused a stop condition (in this case, <|eot_id|>), even though it is
    # not returned by stream_ex() as part of the streamed response.

    print()
    prompt = ""  # <- Start next round with empty string instead of BOS + system prompt
```