Data Augmentation #17
Comments
I will look into the back-translation later.
It is possible to use PPDB (the Paraphrase Database) to generate additional paraphrased questions:
Any updates on creating more questions? Maybe @HenrykBorzymowski can use the MS Azure translator here for back-translation? I heard they offer 2M free characters per month :)
I have tried the google-research/uda project (https://github.com/google-research/uda). Its back-translation part takes existing sentences, translates them into French and then back into English with different temperature parameters, which increases the sample size of the existing dataset. Unfortunately the repository is quite outdated, and the packages no longer work at the versions it specifies. Please install these packages (with python==2.7) and then follow the instructions in the UDA README to make it work:
The following command back-translates the sample file provided in the back_translate directory (google-research/uda). It automatically splits paragraphs into sentences, translates the English sentences into French, and then translates them back into English. Go to the back_translate directory and run it:
I tried several temperature settings (0.3, 0.5, 0.7, 0.9) on the eval_question_similarity_en.csv table and found that lower temperatures (0.3 or 0.5) work better for our case. With 0.7 and 0.9 we get quite a lot of random translations :D The results are attached in case anyone is interested :) This could help us get more variance in our sentences and make us less dependent on particular words that appear in our training set.
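For anyone wondering why the higher temperatures drift towards random output, here is a minimal sketch. It assumes the temperature is the usual softmax sampling temperature applied to the decoder's next-word scores. The function name `softmax_with_temperature` and the three example scores are made up for illustration and are not taken from the UDA code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by a sampling temperature.

    Low temperatures concentrate probability on the best-scoring word;
    high temperatures flatten the distribution so unlikely words are
    sampled more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words (assumption, not real data).
example_logits = [2.0, 1.0, 0.0]

for temperature in (0.3, 0.5, 0.7, 0.9):
    probs = softmax_with_temperature(example_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the top candidate shrinks and the tail candidates gain mass. If the assumption holds, that would explain more varied but also more off-target translations at 0.7 and 0.9.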
Experiment with different methods for data augmentation, report results and compare to baseline.