Showing 7 changed files with 1,410 additions and 0 deletions.
@@ -0,0 +1,94 @@
![image/png](IA.png)

## News
[2024-03-18] 𝒀𝒐𝒖𝒕𝒉, 𝒍𝒐𝒗𝒆, 𝒑𝒉𝒊𝒍𝒐𝒔𝒐𝒑𝒉𝒚, 𝒕𝒉𝒂𝒕 𝒔𝒖𝒎𝒎𝒆𝒓, 𝒇𝒊𝒓𝒆𝒘𝒐𝒓𝒌𝒔. From new technology: [IA_14B](https://huggingface.co/Minami-su/IA_14B)

[2024-02-25] mistral_qwen2.py [mistral_qwen2](https://github.com/Minami-su/character_AI_open/blob/main/mistral_qwen2.py) Released! The original codebase can be found at https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py; I have modified it for compatibility with Qwen1.5.

[2024-02-25] Qwen1.5-7B-Chat_mistral [Qwen1.5-7B-Chat_mistral](https://huggingface.co/Minami-su/Qwen1.5-7B-Chat_mistral) Released!

[2024-02-25] Qwen1.5-0.5B-Chat_mistral [Qwen1.5-0.5B-Chat_mistral](https://huggingface.co/Minami-su/Qwen1.5-0.5B-Chat_mistral) Released!

[2024-02-24] llamafy_qwen_v2.py [llamafy_qwen_v2](https://github.com/Minami-su/character_AI_open/blob/main/llamafy_qwen_v2.py) Released! The original codebase can be found at https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py; I have modified it for compatibility with Qwen1.5.

[2024-02-24] Qwen1.5-0.5B-Chat_llamafy [Qwen1.5-0.5B-Chat_llamafy](https://huggingface.co/Minami-su/Qwen1.5-0.5B-Chat_llamafy) Released!

[2024-02-24] Qwen1.5-7B-Chat_llamafy [Qwen1.5-7B-Chat_llamafy](https://huggingface.co/Minami-su/Qwen1.5-7B-Chat_llamafy) Released!

[2023-12-16] Chinese dataset [Anime_novel_datasets](https://huggingface.co/datasets/Minami-su/Anime_novel_datasets) Released! Contains data from 153 anime novels!

[2023-12-04] Yi_34B_Chat_2bit [Yi_34B_Chat_2bit](https://huggingface.co/Minami-su/Yi_34B_Chat_2bit) Released! You can run it on a GPU with 11 GB of memory. It is quantized with the QuIP# method, a weights-only quantization method that achieves near-fp16 performance using only 2 bits per weight.

[2023-11-30] qwen_7b_roleplay_4bit [qwen_7b_roleplay_4bit](https://huggingface.co/Minami-su/qwen_7b_chat_roleplay_4bit) Released!
# character_AI_open
Open-source version of CharacterAI & CharacterGLM.

# roleplay_AI Introduction
Multi-turn roleplay dialogue data generated via self-instruct: approximately 1k distinct persona profiles and their conversations.
## Getting Started
1. First, generate the roleplay character-setting prompts. A seed_prompt.json is provided here; running the code below keeps generating new character prompts from it. You can also write roughly 10 instructions of your own for seed_prompt.json, which is enough to bootstrap the process (an example seed entry follows these steps).
```bash
python roleplay_prompt_generate.py
```
2. Then, generate the multi-turn dialogues. Running the code below produces the final data (an example output record also follows these steps).
```bash
python roleplay_Multi-round_dialog_generation2.py
```
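
A hedged sketch of what one line of seed_prompt.json may look like, inferred from how the scripts in this commit parse it: one JSON object per line, with the persona kept in the `instruction` field in the form `<name>的人格:<description>` so that the dialogue script can split the name off at the colon. The persona text itself is invented:

```json
{"instruction": "贝多芬的人格:德国作曲家,性格固执而热情,说话直率,常谈论音乐与命运。", "input": "", "output": ""}
```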
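
Likewise, a hedged sketch of one record that roleplay_Multi-round_dialog_generation2.py appends to roleplay_data.json: the entire transcript lands in `instruction`, with the role-setting header first and turns delimited by the `<6>` marker (abbreviated to one round here; the script generates six):

```json
{"instruction": "要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:贝多芬的人格:德国作曲家,性格固执而热情。<6>\n人类:您最近在创作什么新作品吗?<6>\n贝多芬:一部新的交响曲,命运正在敲门!", "input": "", "output": ""}
```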

## Issues
1. Because the data is generated by the model itself, the roleplay can absorb the model's own values, making it less realistic and less accurate. Imitation also works better for personas the model already knows well, such as famous figures like Beethoven or Mozart; for characters the model barely knows, both the generated data and the imitation after training are noticeably worse. The underlying idea of this roleplay data is to teach the large model to adapt to roleplay itself.

## Uploaded Models
A 4-bit quantized model trained from baichuan13b:
[Roleplay Model - Hugging Face](https://huggingface.co/Minami-su/roleplay_baichuan-Chat_4bit)
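
A minimal loading sketch, mirroring how the generation scripts in this commit load their GPTQ checkpoint; passing the Hugging Face repo id straight to `from_quantized` is an assumption (a locally downloaded snapshot directory also works):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Repo id used as the checkpoint path is an assumption; adjust to a local path if needed.
ckpt = "Minami-su/roleplay_baichuan-Chat_4bit"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto", trust_remote_code=True).half()
```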

## 1k Data
[Roleplay Multiturn Chat 1k Data - Hugging Face](https://huggingface.co/datasets/Minami-su/roleplay_multiturn_chat_1k_zh_v0.1)
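
The dataset can be pulled with the standard `datasets` API; the `train` split name and the `instruction` field are assumptions based on the generation script's output format:

```python
from datasets import load_dataset

# "train" split and "instruction" field are assumptions; check the dataset card.
ds = load_dataset("Minami-su/roleplay_multiturn_chat_1k_zh_v0.1", split="train")
print(ds[0]["instruction"][:200])  # peek at the start of one roleplay transcript
```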

## Citation
```
@misc{selfinstruct,
  title={Self-Instruct: Aligning Language Model with Self Generated Instructions},
  author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2212.10560},
  year={2022}
}
```
@@ -0,0 +1,87 @@
扮演
说:
答案:
回答:
回答:
:<6>
<br>
Chatgpt
a language AI model
as a text-based
created by OpenAI
I'm your AI
m an AI
as an AI
GPT
不太明白
Chatgpt
chatgpt
作为一台人工智能助手
作为一个人工智能助手
我只是一个虚拟
抱歉
对不起
我只是一个虚构
我只是一只
AI小助手
AI助手
只是一位虚构
我只是一个虚构的角色
是一个虚构的人物
我是一只小狗
我只是一只小狗
的AI助手
文本机器人
计算机程式
语言 AI
文本AI
一个程序
一个AI
一个ai
openai
OpenAI
聊天机器人
我是一个聊天机器人
我是chatgpt
我是一个程序
计算机程序
作为AI
作为ai
作为人工智能
没有情感
人工智能程序
没有意识和自我意识
一名AI
一名ai
去除了任何限制
作为一个AI
作为一个计算机程序
作为一个程序
语言模型
模仿好
一个计算机程序
模仿一个
模仿的
模仿不
模仿一个
尽力模仿
我模仿
模仿我
够模仿
在模仿
模仿好
一个计算机程序
扮演一个
扮演的
扮演不
扮演一个
尽力扮演
我扮演
扮演我
够扮演
在扮演
人工智能程序
一个人工智能
一只人工智能
一名人工智能
一种人工智能
@@ -0,0 +1,119 @@
import json
import random

import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load the 4-bit GPTQ-quantized Baichuan-13B-Chat checkpoint.
ckpt = 'Baichuan-13B-Chat_4bit'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto", trust_remote_code=True).half()

# Alternative: load the full-precision checkpoint with bitsandbytes NF4 quantization.
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     ckpt,
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_compute_dtype=torch.bfloat16,
#         bnb_4bit_use_double_quant=True,
#         bnb_4bit_quant_type='nf4',
#     ),
#     device_map="auto",
# )

# Phrases that must not appear in generated text (see filter1.txt).
with open('filter1.txt', 'r', encoding='utf-8') as f:
    sensitive_words = [line.strip() for line in f]


def generate(prompt):
    """Sample one completion for `prompt` and return only the newly generated text."""
    print("prompt:", prompt)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generate_ids = model.generate(
        input_ids=input_ids,
        max_length=2048,
        num_beams=1,
        do_sample=True,
        top_p=0.9,
        temperature=0.95,
        repetition_penalty=1.05,
        eos_token_id=tokenizer.eos_token_id,
    )
    output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    response = output[len(prompt):]
    print("response:", response)
    return response


def generate_filtered(prompt):
    """Regenerate until the output contains none of the sensitive words."""
    result = generate(prompt).strip()
    while any(word in result for word in sensitive_words):
        print("error reloop")
        result = generate(prompt).strip()
    return result


# getq (human turn) and geta (character turn) share the same filtered sampling.
getq = geta = generate_filtered

filename0 = "seed_prompt.json"    # persona prompts, one JSON object per line
filename2 = "roleplay_data.json"  # output file for finished multi-turn dialogues
total_lines = 10000
max_history_len = 1000

with tqdm(total=total_lines, desc="instruction progress") as pbar:
    while pbar.n < total_lines:
        with open(filename0, "r", encoding="utf-8") as file:
            lines2 = file.readlines()
        random.shuffle(lines2)
        i = 0
        for line2 in lines2:
            i += 1
            data2 = json.loads(line2.strip())
            question3 = data2["instruction"]
            # Persona instructions look like "<name>的人格:<description>".
            name = question3.split(":")[0].replace("人格", "").replace("的", "")

            history = []
            for _ in range(6):  # six human/character rounds per dialogue
                # First ask the model to produce the next human turn.
                input_text = f'要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:{question3}<6>\n'
                for history_utr in history[-max_history_len:]:
                    input_text = input_text + history_utr + '\n'
                prompt = (input_text + f"根据上面内容与{name}发起日常对话,只写出一句即可<6>\n对话:").strip()
                q = getq(prompt)
                history.append("人类:" + q + "<6>")
                # Then ask the model to answer in character.
                sum_str2 = f'要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:{question3}<6>\n'
                for history_utr in history[-max_history_len:]:
                    sum_str2 = sum_str2 + history_utr + '\n'
                sum_str2 = sum_str2 + f"{name}:"
                a = geta(sum_str2)
                history.append(f"{name}:" + a + "<6>")

            # Store the full transcript (header plus all rounds) as one training record.
            sum_str2 = sum_str2 + a
            json_data = {'instruction': sum_str2, "input": "", 'output': ""}
            with open(filename2, 'a', encoding='utf-8') as f:
                f.write(json.dumps(json_data, ensure_ascii=False) + '\n')
            pbar.update(1)
            if i == 6:  # at most six personas per pass, then reshuffle
                break
        if pbar.n >= total_lines:
            break
@@ -0,0 +1,87 @@
import json
import random

import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load the 4-bit GPTQ-quantized Baichuan-13B-Chat checkpoint.
ckpt = 'Baichuan-13B-Chat_4bit'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto", trust_remote_code=True).half()

# Alternative: bitsandbytes NF4 loading, as in roleplay_Multi-round_dialog_generation2.py.

# Phrases that must not appear in generated prompts (see filter1.txt).
with open('filter1.txt', 'r', encoding='utf-8') as f:
    sensitive_words = [line.strip() for line in f]


def generate(prompt):
    """Sample one completion for `prompt` and return only the newly generated text."""
    print("prompt:", prompt)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generate_ids = model.generate(
        input_ids=input_ids,
        max_length=4096,
        num_beams=1,
        do_sample=True,
        top_p=0.9,
        temperature=0.95,
        repetition_penalty=1.05,
        eos_token_id=tokenizer.eos_token_id,
    )
    output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    response = output[len(prompt):]
    print("response:", response)
    return response


filename = "seed_prompt.json"  # seeds are read from, and new prompts appended to, this file
total_lines = 100000

with tqdm(total=total_lines, desc="instruction progress") as pbar:
    while pbar.n < total_lines:
        with open(filename, "r", encoding="utf-8") as file:
            lines = file.readlines()
        random.shuffle(lines)
        i = 0
        sum_str = ""
        for line in lines:
            i += 1
            try:
                data = json.loads(line.strip())
            except json.JSONDecodeError:
                print("error:", line.strip())
                continue
            question = data["instruction"]
            sum_str += f"{i}.{question}\n"

            if i == 5:
                # Show five sampled persona prompts and ask the model to continue the list.
                res = generate(f'请续写下面内容,不少于10条。\n{sum_str}')
                for result in res.split("\n"):
                    result = result.strip()
                    # Strip the leading "N." numbering, dot included.
                    prefix_length = len(result.split(".")[0]) + 1
                    result = result[prefix_length:]
                    if result == "":
                        continue
                    # Discard generated prompts that contain sensitive words.
                    if any(word in result for word in sensitive_words):
                        print("error reloop")
                        continue
                    json_data = {'instruction': result, "input": "", 'output': ""}
                    # Append the new persona prompt back into the seed pool.
                    with open(filename, 'a', encoding='utf-8') as f:
                        f.write(json.dumps(json_data, ensure_ascii=False) + '\n')
                    pbar.update(1)

        if pbar.n >= total_lines:
            break