Why is LLaMA response to queries in the conversation so wrong? #466

Harsh-raj · 2023-10-31T07:57:58Z

I have made a cumulative prompt structure for doing conversation with LLaMA-2 using lit-llama repository such that it can remember context of previous prompts and their responses from the conversation. Here is a snippet of a conversation with LLaMA which had bizzare responses.

Loading model ...
Time to load model: 20.19 seconds.
Prompt:Name all the districts in the Indian state of Bihar.
LLaMA: There are 38 districts in the Indian state of Bihar. The districts are:

1. Aurangabad
2. Begusarai
3. Bhagalpur
4. Bhojpur
5. Buxar
6. Darbhanga
7. Gaya
8. Hajipur
9. Jamui
10. Khagaria
11. Kishanganj
12. Lakhisarai
13. Madhubani
14. Munger
15. Nalanda
16. Nawada
17. Patna
18. Purnia
19. Rohtas
20. Samastipur
21. Saran
22. Sheohar
23. Sitamarhi
24. Siwan
25. Supaul
26. Vaishali
27. West Champaran
28. East Champaran
29. Sitamarhi
30. Darbhanga
31. Khagaria
32. Begusarai
33. Purnia
34. Kishanganj
35. Aurangabad
36. Gaya
37. Lakhisarai
38. Nalanda


Time for inference: 8.09 sec total, 126.64 tokens/sec
Memory used: 14.12 GB
Prompt:Name all the ministers in the cabinet of prime minister of India
LLaMA: India has several hill stations, including:

1. Shimla - Himachal Pradesh
2. Darjeeling - West Bengal
3. Gangtok - Sikkim
4. Kodaikanal - Tamil Nadu
5. Ooty - Tamil Nadu
6. Kashmir - Jammu and Kash -


Time for inference: 18.84 sec total, 54.35 tokens/sec
Memory used: 14.23 GB
Prompt:

======================================================================================
One explaination for this could be somehow the response form previous prompt is getting printed here in the next prompt. Why is this happening?

Following is the conversation loop i used in the lora.py script in generate folder of the repo.

conversation = ""
    tokenizer = Tokenizer(tokenizer_path)
    while True:
        prompt = input("Prompt:")  # type: ignore
        if not prompt:
            break
        sample = {"instruction": prompt, "input": input_inst}
        
        #prompt = generate_prompt(sample)
        conversation = generate_conversation(sample, conversation) #testing for conversation
        # print(f"\n\n{conversation}\n\n")
        #encoded = tokenizer.encode(prompt, bos=True, eos=False, device=model.device)
        encoded = tokenizer.encode(conversation, bos=True, eos=False, device=model.device) #testing for conversation
        
        t0 = time.perf_counter()
        output = generate(
            model,
            idx=encoded,
            max_seq_length=max_new_tokens,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
            eos_id=tokenizer.eos_id
        )
        t = time.perf_counter() - t0

        output = tokenizer.decode(output)
        output = output.split("### Response:")[-1].strip()
        print(f"LLaMA: {output}")
        
        conversation+=f" {output}\n\n" #testing for conversation

        print(f"\n\nTime for inference: {t:.02f} sec total, {max_new_tokens / t:.02f} tokens/sec", file=sys.stderr)
        if fabric.device.type == "cuda":
            print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)

also i used custom function generate_conversation in perpare_alpaca.py script

def generate_conversation(example, conversation: str):
    """Custom function to generate a message to do conversation with the model by making it remember the previous prompts"""
    if example["input"]:
        conversation+=f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:"
        return (
            #"Below is an instruction that describes a task, paired with an input that provides further context. "
            #"Write a response that appropriately completes the request.\n\n"
            conversation
        )
    conversation+=f"### Instruction:\n{example['instruction']}\n\n### Response:"
    return (
        #"Below is an instruction that describes a task. "
        #"Write a response that appropriately completes the request.\n\n"
        conversation
    )

rest of the code is not touched in the repository, I am first doing fine-tuning of the LLaMA model then i am trying to do conversation. Please help!

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why is LLaMA response to queries in the conversation so wrong? #466

Why is LLaMA response to queries in the conversation so wrong? #466

Harsh-raj commented Oct 31, 2023 •

edited

Loading

Why is LLaMA response to queries in the conversation so wrong? #466

Why is LLaMA response to queries in the conversation so wrong? #466

Comments

Harsh-raj commented Oct 31, 2023 • edited Loading

Harsh-raj commented Oct 31, 2023 •

edited

Loading