ShareGPT Dataset? #109
Hello, I see ShareGPT's dataset is listed in the README, but the download for the Alpaca-format version is not. Can it be listed? Very interested.

Comments
Sorry, I found the dataset on your Hugging Face. I looked it over, though, and the dataset format might be concerning. I may be ignorant, but a model trained on the ShareGPT Alpaca-format dataset may not learn coherently. For example, two sequences split from the same conversation will likely not be related to each other during training, making the result much more erratic than training on Vicuna's original dataset in its own format would be.
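To make the concern concrete, here is a minimal sketch of the flattening I mean. The `conversations` layout follows the common ShareGPT schema; the exact field names are my assumption, not confirmed from this repo's files:

```python
# Sketch (assumed schema): flattening a multi-turn ShareGPT conversation
# into Alpaca-format records drops the shared context, so consecutive
# training sequences end up unrelated to each other.

conversation = {
    "id": "abc123",
    "conversations": [
        {"from": "human", "value": "Write a haiku about rain."},
        {"from": "gpt", "value": "Soft rain on the roof..."},
        {"from": "human", "value": "Now make it about snow."},  # depends on turn 1
        {"from": "gpt", "value": "Quiet snow descends..."},
    ],
}

def to_alpaca(convo):
    """Pair each human turn with the following gpt turn, discarding all history."""
    records = []
    turns = convo["conversations"]
    for i in range(0, len(turns) - 1, 2):
        if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt":
            records.append({
                "instruction": turns[i]["value"],
                "input": "",
                "output": turns[i + 1]["value"],
            })
    return records

# The second record's instruction ("Now make it about snow.") is meaningless
# without the first turn, yet the two records are shuffled independently.
for r in to_alpaca(conversation):
    print(r["instruction"])
```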
Did you check the _context.json version?
Looked into this for a bit. sharegpt_context.json has the same issue to an extent. It seems that everyone is processing the ShareGPT data using Vicuna's pipeline, including this part, which chunks long conversations based on token count. So rather than throwing out data after hitting the context window, we have a fair amount of chats in sharegpt_context.json that start in the middle of things, with the first prompt being something like "[HM]: continue". Not sure if training on this is harmful or helpful.
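A rough way to gauge how much of the data is affected; the `[HM]:`/`[AI]:` markers match the prompts I saw, but the record schema here is an assumption, not a confirmed format:

```python
import json

# Hypothetical filter: flag records in sharegpt_context.json whose first
# human turn looks like a continuation of a chunk we no longer have.

CONTINUATION_PREFIXES = ("continue", "go on", "keep going")

def starts_mid_conversation(record):
    # Take the first human turn: everything before the first "[AI]:" marker.
    text = record.get("text", "")
    first_turn = text.split("[AI]:")[0].replace("[HM]:", "").strip().lower()
    return first_turn.startswith(CONTINUATION_PREFIXES)

with open("sharegpt_context.json") as f:
    data = json.load(f)

orphans = [r for r in data if starts_mid_conversation(r)]
print(f"{len(orphans)} of {len(data)} chats start mid-conversation")
```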
It would be wise, imo, to alter the Vicuna pipeline being used to simply throw away the sequences that get split off, or, if needed, throw out all conversations that are too long. Maybe make a 2k-context-length version and a 4k one, since 4k LLaMA models have started to appear (they are not working well at all right now, but they will soon).
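Something like this, as a sketch of the filter-instead-of-chunk idea; the tokenizer checkpoint and record layout are assumptions:

```python
from transformers import AutoTokenizer

# Sketch: instead of chunking long chats, keep only conversations that fit
# the context window, and emit separate 2k- and 4k-token versions.
# "huggyllama/llama-7b" is just an example LLaMA tokenizer.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def token_length(record):
    text = "\n".join(t["value"] for t in record["conversations"])
    return len(tokenizer(text).input_ids)

def split_by_context(records, limits=(2048, 4096)):
    """Return {limit: [records that fit whole]} rather than chunking overflow."""
    buckets = {limit: [] for limit in limits}
    for rec in records:
        n = token_length(rec)
        for limit in limits:
            if n <= limit:
                buckets[limit].append(rec)  # whole conversation fits; keep it
        # conversations longer than every limit are simply dropped
    return buckets
```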
I also think that, since a lot of datasets are doing this, it is likely related to Vicuna's "random stopping" issues.
At present, there are some efforts to clean the ShareGPT dataset, and we will continue to follow them.
Can you link any of those?
https://paratranz.cn/projects/6725 |