Commit

Add files via upload

Minami-su authored Apr 8, 2024
1 parent 51b3e55 commit 7cb8dcd
Showing 7 changed files with 1,410 additions and 0 deletions.
94 changes: 94 additions & 0 deletions RolePlay_V1/README.md
![image/png](IA.png)

## News
[2024-03-18] 𝒀𝒐𝒖𝒕𝒉, 𝒍𝒐𝒗𝒆, 𝒑𝒉𝒊𝒍𝒐𝒔𝒐𝒑𝒉𝒚, 𝒕𝒉𝒂𝒕 𝒔𝒖𝒎𝒎𝒆𝒓, 𝒇𝒊𝒓𝒆𝒘𝒐𝒓𝒌𝒔. From a new technique: [IA_14B](https://huggingface.co/Minami-su/IA_14B) Released!

[2024-02-25] mistral_qwen2.py [mistral_qwen2](https://github.com/Minami-su/character_AI_open/blob/main/mistral_qwen2.py) Released! The original code is at https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py; I modified it to be compatible with Qwen1.5.

[2024-02-25] Qwen1.5-7B-Chat_mistral [Qwen1.5-7B-Chat_mistral](https://huggingface.co/Minami-su/Qwen1.5-7B-Chat_mistral) Released!

[2024-02-25] Qwen1.5-0.5B-Chat_mistral [Qwen1.5-0.5B-Chat_mistral](https://huggingface.co/Minami-su/Qwen1.5-0.5B-Chat_mistral) Released!

[2024-02-24] llamafy_qwen_v2.py [llamafy_qwen_v2](https://github.com/Minami-su/character_AI_open/blob/main/llamafy_qwen_v2.py) Released! The original code is at https://github.com/hiyouga/LLaMA-Factory/blob/main/tests/llamafy_qwen.py; I modified it to be compatible with Qwen1.5.

[2024-02-24] Qwen1.5-0.5B-Chat_llamafy [Qwen1.5-0.5B-Chat_llamafy](https://huggingface.co/Minami-su/Qwen1.5-0.5B-Chat_llamafy) Released!

[2024-02-24] Qwen1.5-7B-Chat_llamafy [Qwen1.5-7B-Chat_llamafy](https://huggingface.co/Minami-su/Qwen1.5-7B-Chat_llamafy) Released!

[2023-12-16] Chinese dataset [Anime_novel_datasets](https://huggingface.co/datasets/Minami-su/Anime_novel_datasets) Released! It contains data from 153 anime and novel books!

[2023-12-04] Yi_34B_Chat_2bit [Yi_34B_Chat_2bit](https://huggingface.co/Minami-su/Yi_34B_Chat_2bit) Released! You can run it on a GPU with 11 GB of memory. It is quantized with the QuIP# method, a weights-only quantization approach that achieves near-fp16 performance using only 2 bits per weight.

[2023-11-30] qwen_7b_roleplay_4bit [qwen_7b_roleplay_4bit](https://huggingface.co/Minami-su/qwen_7b_chat_roleplay_4bit) Released!

# character_AI_open
An open-source version of Character.AI & CharacterGLM.

# roleplay_AI Introduction
Multi-turn roleplay dialogue data generated via self-instruct: approximately 1k distinct persona profiles and their conversations.

## Getting Started
1. First, generate the roleplay persona-prompt settings. seed_prompt.json is provided; running the code below keeps generating new persona prompts from it. You can also write about 10 instructions for seed_prompt.json yourself, which is enough to bootstrap.
```bash
python roleplay_prompt_generate.py
```
2. Then generate the multi-turn dialogues: running the code below produces the final data. A sketch of both file formats follows these steps.
```bash
python roleplay_Multi-round_dialog_generation2.py
```
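
The two files involved are plain JSON Lines. Below is a minimal sketch of both formats, inferred from how the two scripts read and write them; the persona text is a made-up placeholder, not a real entry from the dataset:

```python
import json

# Hypothetical seed persona for seed_prompt.json: one JSON object per line,
# and the scripts only read the "instruction" field. The text before the
# first full-width colon is parsed as the character's name.
seed = {"instruction": "贝多芬的人格:你是贝多芬,一位热情而倔强的作曲家。",
        "input": "", "output": ""}
with open("seed_prompt.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(seed, ensure_ascii=False) + "\n")

# Each record in roleplay_data.json stores a whole dialogue in "instruction":
# the persona header plus alternating 人类:/character turns, joined by the
# <6> separator the generation script uses.
with open("roleplay_data.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # the generator writes a newline before each record
            continue
        record = json.loads(line)
        turns = record["instruction"].split("<6>")
        print(turns[0][:80])  # the role-play instruction and persona setting
```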

## Issues
1. Because the data is generated by the model itself, the model's own values leak into the roleplay, making it less realistic and less accurate. Imitation also works better for personas the model already knows well, such as Beethoven or Mozart, while characters the model barely knows yield worse generated data and worse imitation after training. The underlying idea of this roleplay data is to teach the large model to adapt to roleplay.

## Uploaded Models
A 4-bit quantized model trained from baichuan13b:
[Roleplay Model - Hugging Face](https://huggingface.co/Minami-su/roleplay_baichuan-Chat_4bit)
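
A minimal inference sketch, mirroring how this repo's own scripts load a GPTQ checkpoint; it assumes the uploaded model can be consumed the same way, and the prompt and sampling settings are illustrative, not prescribed by the repo:

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

ckpt = "Minami-su/roleplay_baichuan-Chat_4bit"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto",
                                           trust_remote_code=True).half()

# Prompt shaped like the training data: persona setting, then turns.
prompt = ("要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:"
          "贝多芬的人格:你是贝多芬,一位热情而倔强的作曲家。<6>\n人类:你好!<6>\n贝多芬:")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids=input_ids, max_length=2048, do_sample=True,
                     top_p=0.9, temperature=0.95, repetition_penalty=1.05,
                     eos_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0][len(prompt):])
```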

## 1k Data
[Roleplay Multiturn Chat 1k Data - Hugging Face](https://huggingface.co/datasets/Minami-su/roleplay_multiturn_chat_1k_zh_v0.1)
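
For a quick look at the data, it should load with the standard datasets library. A sketch, assuming the default "train" split (the repo does not state the split name):

```python
from datasets import load_dataset

ds = load_dataset("Minami-su/roleplay_multiturn_chat_1k_zh_v0.1")
example = ds["train"][0]             # "train" is an assumed default split
print(example["instruction"][:200])  # dialogue text with <6> turn separators
```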

## Citation
```
@misc{selfinstruct,
  title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
  author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2212.10560},
  year={2022}
}
```
87 changes: 87 additions & 0 deletions RolePlay_V1/filter1.txt
扮演
说:
答案:
回答:
回答:
:<6>
<br>
Chatgpt
a language AI model
as a text-based
created by OpenAI
I'm your AI
m an AI
as an AI
GPT
不太明白
Chatgpt
chatgpt
作为一台人工智能助手
作为一个人工智能助手
我只是一个虚拟
抱歉
对不起
我只是一个虚构
我只是一只
AI小助手
AI助手
只是一位虚构
我只是一个虚构的角色
是一个虚构的人物
我是一只小狗
我只是一只小狗
的AI助手
文本机器人
计算机程式
语言 AI
文本AI
一个程序
一个AI
一个ai
openai
OpenAI
聊天机器人
我是一个聊天机器人
我是chatgpt
我是一个程序
计算机程序
作为AI
作为ai
作为人工智能
没有情感
人工智能程序
没有意识和自我意识
一名AI
一名ai
去除了任何限制
作为一个AI
作为一个计算机程序
作为一个程序
语言模型
模仿好
一个计算机程序
模仿一个
模仿的
模仿不
模仿一个
尽力模仿
我模仿
模仿我
够模仿
在模仿
模仿好
一个计算机程序
扮演一个
扮演的
扮演不
扮演一个
尽力扮演
我扮演
扮演我
够扮演
在扮演
人工智能程序
一个人工智能
一只人工智能
一名人工智能
一种人工智能
119 changes: 119 additions & 0 deletions RolePlay_V1/roleplay_Multi-round_dialog_generation2.py
import json
import random

import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

ckpt = 'Baichuan-13B-Chat_4bit'
device = torch.device('cuda')

tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto", trust_remote_code=True).half()
# Alternative loader: bitsandbytes 4-bit quantization of the fp16 checkpoint.
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     ckpt,
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_compute_dtype=torch.bfloat16,
#         bnb_4bit_use_double_quant=True,
#         bnb_4bit_quant_type='nf4'),
#     device_map="auto")

# Words that reveal the model has broken character (see filter1.txt);
# any generation containing one is resampled.
with open('filter1.txt', 'r', encoding='utf-8') as f:
    sensitive_words = [line.strip() for line in f]


def generate(prompt):
    print("prompt:", prompt)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generate_ids = model.generate(input_ids=input_ids,
                                  max_length=2048,
                                  num_beams=1,
                                  do_sample=True,
                                  top_p=0.9,
                                  temperature=0.95,
                                  repetition_penalty=1.05,
                                  eos_token_id=tokenizer.eos_token_id)
    output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    response = output[len(prompt):]
    print("response:", response)
    return response


def generate_filtered(prompt):
    """Sample a response, resampling until it contains no filtered word."""
    result = generate(prompt).strip()
    while any(word in result for word in sensitive_words):
        print("filtered word found, regenerating")
        result = generate(prompt).strip()
    return result


filename0 = "seed_prompt.json"    # persona prompts, one JSON object per line
filename2 = "roleplay_data.json"  # generated multi-turn dialogues
total_lines = 10000
max_history_len = 1000            # history entries kept in the prompt

with tqdm(total=total_lines, desc="progress") as pbar:
    while pbar.n < total_lines:
        with open(filename0, "r", encoding="utf-8") as file:
            lines2 = file.readlines()
        random.shuffle(lines2)
        i = 0
        for line2 in lines2:
            i += 1
            data2 = json.loads(line2.strip())
            question3 = data2["instruction"]
            # The persona name precedes the first full-width colon,
            # e.g. "贝多芬的人格:..." -> "贝多芬".
            name = question3.split(":")[0].replace("人格", "").replace("的", "")

            # Build six user/character exchanges for this persona.
            history = []
            for _ in range(6):
                input_text = f'要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:{question3}<6>\n'
                for history_utr in history[-max_history_len:]:
                    input_text = input_text + history_utr + '\n'
                prompt = (input_text + f"根据上面内容与{name}发起日常对话,只写出一句即可<6>\n对话:").strip()
                q = generate_filtered(prompt)
                history.append("人类:" + q + "<6>")

                sum_str2 = f'要求扮演下面角色,并且根据角色的设定内容模仿代入角色相应的对话口吻和风格:{question3}<6>\n'
                for history_utr in history[-max_history_len:]:
                    sum_str2 = sum_str2 + history_utr + '\n'
                sum_str2 = sum_str2 + f"{name}:"
                a = generate_filtered(sum_str2)
                history.append(f"{name}:" + a + "<6>")

            # Store the full dialogue (persona header plus turns) as one record.
            sum_str2 = sum_str2 + a
            json_data = {'instruction': sum_str2, "input": "", 'output': ""}
            with open(filename2, 'a', encoding='utf-8') as f:
                f.write('\n')
                f.write(json.dumps(json_data, ensure_ascii=False))
            pbar.update(1)
            if i == 6:
                break  # take at most six personas per shuffle
        if pbar.n >= total_lines:
            break

87 changes: 87 additions & 0 deletions RolePlay_V1/roleplay_prompt_generate.py
import json
import random

import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

ckpt = 'Baichuan-13B-Chat_4bit'
device = torch.device('cuda')

tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(ckpt, device_map="auto", trust_remote_code=True).half()
# Alternative loader: bitsandbytes 4-bit quantization of the fp16 checkpoint.
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     ckpt,
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_compute_dtype=torch.bfloat16,
#         bnb_4bit_use_double_quant=True,
#         bnb_4bit_quant_type='nf4'),
#     device_map="auto")

# Words that reveal the model has broken character (see filter1.txt).
with open('filter1.txt', 'r', encoding='utf-8') as f:
    sensitive_words = [line.strip() for line in f]


def generate(prompt):
    print("prompt:", prompt)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generate_ids = model.generate(input_ids=input_ids,
                                  max_length=4096,
                                  num_beams=1,
                                  do_sample=True,
                                  top_p=0.9,
                                  temperature=0.95,
                                  repetition_penalty=1.05,
                                  eos_token_id=tokenizer.eos_token_id)
    output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    response = output[len(prompt):]
    print("response:", response)
    return response


# The seed file is both read and appended to: newly generated persona prompts
# join the pool and can be sampled in later rounds, self-instruct style.
filename = "seed_prompt.json"
# filename = "xiaoyu_person_指令2.json"  # alternative seed file
total_lines = 100000

with tqdm(total=total_lines, desc="progress") as pbar:
    while pbar.n < total_lines:
        with open(filename, "r", encoding="utf-8") as file:
            lines = file.readlines()
        random.shuffle(lines)
        i = 0
        sum_str = ""
        for line in lines:
            i += 1
            try:
                data = json.loads(line.strip())
            except json.JSONDecodeError:
                print("error:", line.strip())
                continue
            question = data["instruction"]
            sum_str += f"{i}.{question}\n"

            if i == 5:
                # Show five sampled personas and ask the model to extend the list.
                res = generate(f'请续写下面内容,不少于10条。\n{sum_str}')
                for result in res.split("\n"):
                    result = result.strip()
                    # Strip the leading "N." numbering copied from the prompt.
                    prefix_length = len(result.split(".")[0]) + 1
                    result = result[prefix_length:]
                    if result == "":
                        continue
                    if any(word in result for word in sensitive_words):
                        continue  # drop candidates containing a filtered word
                    json_data = {'instruction': result, "input": "", 'output': ""}
                    # Append the new persona prompt back to the seed pool.
                    with open(filename, 'a', encoding='utf-8') as f:
                        f.write(json.dumps(json_data, ensure_ascii=False) + '\n')
                    pbar.update(1)
            if pbar.n >= total_lines:
                break
