Incorrect setup of Learning Rate Scheduler #81
Comments
Hi ~ I have also been having issues reproducing selfrag-7B; I got low evaluation results compared with the eval results from the paper. Would you share your reproduction results from fine-tuning Llama-2 7B into selfrag 7B?
I ran their finetuning script without making any changes, using the hyperparameter settings from their finetuning scripts, and evaluated with retrieval using their pre-computed top-k files. I was only able to get up to 34.28% on PopQA, 69.60% on PubHealth, and str-em of 28.02 and rg of 35.76 on ASQA. Can you share the results you were able to reproduce? That would be helpful for context.
My results are as follows:
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
Have you attempted to train a model with a corrected version of the above code along with the rest of finetune.py? How does the corrected code affect the results?
I did, and it did improve the numbers, but they are still lower than the paper's.
Could you please tell me how to modify the above code in finetune.py to make it correct? I would like to test whether the corrected code can reproduce the results presented in the paper. Thanks a lot.
I found that finetune.py in Self-RAG is revised from the open-instruct finetune.py, and the original finetune.py code from open-instruct is correct.
Hello! Thanks for sharing your great work.
I noticed a discrepancy in the way you set up the learning rate scheduler in finetune.py.
When you calculate:
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
The number of update steps per epoch should be obtained by dividing by the total batch size across all GPUs (per-device batch size × number of processes × gradient accumulation steps), not by gradient_accumulation_steps alone. This in turn affects both your warmup schedule and your linear decay schedule for the learning rate.
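For concreteness, here is a minimal sketch of how I would compute the scheduler steps. The argument names (per_device_train_batch_size, warmup_ratio, lr_scheduler_type) and the args/accelerator/optimizer/train_dataset objects follow common Hugging Face / open-instruct conventions and are my assumptions, not the exact code in this repo:

```python
import math
from transformers import get_scheduler

# Effective number of examples consumed per optimizer update:
# per-device batch size, aggregated over all GPUs and over
# gradient-accumulation steps.
total_batch_size = (
    args.per_device_train_batch_size
    * accelerator.num_processes
    * args.gradient_accumulation_steps
)

# Update steps per epoch based on the full dataset size, rather than
# len(train_dataloader) / gradient_accumulation_steps.
num_update_steps_per_epoch = math.ceil(len(train_dataset) / total_batch_size)
max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,  # e.g. "linear"
    optimizer=optimizer,
    num_warmup_steps=int(args.warmup_ratio * max_train_steps),
    num_training_steps=max_train_steps,
)
```

If len(train_dataloader) is not sharded across processes, the original computation overestimates the number of training steps by roughly a factor of the number of GPUs, so warmup runs longer than intended and the linear decay never reaches its intended endpoint within the actual training run.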
I've also been having issues reproducing your results with a locally fine-tuned Llama-2 7B model using your codebase and settings, compared to your Hugging Face checkpoint. So please let me know if you can share any feedback on additional settings needed to reproduce the Hugging Face checkpoint's level of performance. Thank you.