Data preprocessing problem #19

Open
LiangYong1216 opened this issue Feb 4, 2024 · 2 comments
Comments

@LiangYong1216

I used the 50,000-example dataset you open-sourced. When I ran data_process.py to process the data, it failed with this error:
The dataset will save in HuatuoGPT2_sft_instruct_GPT4_HuatuoGPT2-7B_4096_dataset
Traceback (most recent call last):
  File "/home/ly/test/HuatuoGPT-II-main/adaption/one_stage_training/data_process.py", line 274, in <module>
    preprocess(args)
  File "/home/ly/test/HuatuoGPT-II-main/adaption/one_stage_training/data_process.py", line 201, in preprocess
    train_dataset = HuatuoGPT_data(args, tokenizer)
  File "/home/ly/test/HuatuoGPT-II-main/adaption/one_stage_training/data_process.py", line 90, in __init__
    self.data_dict = json.load(f)
  File "/home/ly/anaconda3/envs/huatuo/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/ly/anaconda3/envs/huatuo/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/home/ly/anaconda3/envs/huatuo/lib/python3.8/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 598)

Process finished with exit code 1
The dataset looks like this:
[image]
How should I change the dataset format? Or should I modify the code somewhere?
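
For context, `Extra data: line 2 column 1` is the message `json.load` gives when a file holds more than one top-level JSON value, which is typical of a JSON Lines file (one object per line). The sketch below reproduces this on a made-up two-line sample; the `data` key and the record contents are assumptions for illustration, not taken from the actual dataset.

```python
import json

# Hypothetical two-record sample in JSON Lines form (one object per line).
sample = '{"data": ["q1", "a1"]}\n{"data": ["q2", "a2"]}\n'


def try_whole_file(text):
    """Parse the text as one JSON document; return the error message, or None on success."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as exc:
        return str(exc)


def parse_lines(text):
    """Parse every non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


error = try_whole_file(sample)   # fails: a second object follows the first
records = parse_lines(sample)    # succeeds: one object per line
print(error)
print(len(records))
```

If parsing line by line succeeds on the real file, the file is JSON Lines and needs to be wrapped into a single JSON document before data_process.py can load it.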

@qlibp

qlibp commented Mar 20, 2024

Edit it by hand so that it becomes a JSON file of the following form:

{
    "SFT_data": [
     ["问:xxxxx",
      "答:xxxxx"]
    ]
}

@jymChen
Contributor

jymChen commented Apr 2, 2024

Edit it by hand so that it becomes a JSON file of the following form:

{
    "SFT_data": [
     ["问:xxxxx",
      "答:xxxxx"]
    ]
}

The "问:" (question) and "答:" (answer) prefixes on the dialogue turns are not needed here, so remember to remove them.
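
For a large file, the manual edit described above could be scripted. This is only a sketch under assumptions: it assumes the downloaded file is in JSON Lines form and that each record stores its dialogue turns as a list of strings under a key named `data` (the key name is a guess, so check it against your file). `strip_prefix`, `convert_records`, and `convert_file` are hypothetical helper names, not part of data_process.py.

```python
import json

# Prefixes to drop, with both ASCII and full-width colons.
PREFIXES = ("问:", "答:", "问:", "答:")


def strip_prefix(turn):
    """Remove a leading question/answer marker from one dialogue turn, if present."""
    for prefix in PREFIXES:
        if turn.startswith(prefix):
            return turn[len(prefix):]
    return turn


def convert_records(records, key="data"):
    """Wrap per-line records into the single {"SFT_data": [...]} structure.

    `key` is an ASSUMED field name holding the list of turns in each record.
    """
    return {"SFT_data": [[strip_prefix(t) for t in rec[key]] for rec in records]}


def convert_file(src_path, dst_path, key="data"):
    """Read a JSON Lines file and write one JSON document in the target format."""
    with open(src_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(convert_records(records, key), f, ensure_ascii=False, indent=2)


# Small in-memory demonstration with made-up records.
demo = convert_records([{"data": ["问:xxxxx", "答:xxxxx"]}])
print(json.dumps(demo, ensure_ascii=False))
```

To convert a real file, something like `convert_file("input.jsonl", "output.json")` would apply, where both file names are placeholders; the output path is then what data_process.py should be pointed at.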
