Different results each model run #617
Hi @datistiquo
Hey, thank you! I will look at it. Does the paper offer any solutions or advice for handling this? Does it address getting different results for different seeds, or getting different results for the same seed on each training run? Because my issue is getting different results for the same seed (set via e.g. random.seed(42)). I don't know how to end up with a good BERT model when getting one is just a lucky punch.
You also have to set the seed for numpy and pytorch to get consistent results. Otherwise you have to live with the fact that the seed can play quite an important role and that you might need to test different seeds.
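For reference, a minimal sketch of seeding the usual suspects before a training run (the helper name `set_all_seeds` is just for illustration, not part of any library):

```python
import random

import numpy as np
import torch

def set_all_seeds(seed: int) -> None:
    """Seed the RNGs that typically matter for a training run."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs (no-op without CUDA)

set_all_seeds(42)
```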
So to recap: if I set those 3 seeds, I can get similar (the same?) results even for BERT fine-tuning? And if I set only one of the seeds, then I get different results... That is my direct understanding of your sentences. :)
Right, there are different random number generators you have to seed to get consistent results.
So in the paper they just use the numbers 1...N as seeds, like tf.random(10)? What initialization is used in SentenceTransformer (uniform, normal, ...)? How would I set different seeds for the weight initialization and the data order? Do you know where they released their data and code for this evaluation?
@nreimers It seems I now get the same results when setting the seed right. Only the digits from the 5th decimal place onward change; the first 4 digits after the decimal point are always the same. I think this is normal. Do you see the same, or do you always get exactly the same numbers? Actually really interesting that the seed is now one more important hyperparameter to tune! :-I
@nreimers I wonder how to practically train with different seeds (e.g. in Colab). Because if I do a simple for loop over the seeds, I get different results each time.
Hi @tide90
Note that training on a GPU is not deterministic. So even when you fix the seeds, the results can differ each time you execute on a GPU. Also note that in your small example you don't set the seed for PyTorch (I don't know if you use it).
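If PyTorch is in the mix, the settings commonly combined with seeding to reduce GPU non-determinism look roughly like this sketch (exact behaviour depends on the PyTorch/CUDA version):

```python
import torch

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Ask cuDNN for deterministic kernels and disable its auto-tuner,
# which may otherwise pick different algorithms from run to run.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Newer PyTorch versions can additionally raise an error on known
# non-deterministic ops (may require CUBLAS_WORKSPACE_CONFIG to be set):
# torch.use_deterministic_algorithms(True)
```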
@nreimers My example was only meant to show how I use the BERT model; the fit() call comes right after it. I just train the model inside the loop. So creating the training data (just with the BERT tokenizer) happens before setting the seed. I did this because I did not see that there is any randomness in it. But maybe the tokenizer somehow also uses some random processes? I checked it by setting the seed just once without a loop, restarted the notebook many times, and every time I get almost exactly the same results (with GPU). So that works, but inside the loop it does not. I use the TensorFlow Hugging Face models for this. Maybe it is because everything is in the same cell and it is due to the notebook? Maybe I am overlooking something?
Note that the random numbers you get in that loop will be different for each run. You have to set the seeds directly before you start a single training run.
@nreimers What do you mean? You mean inside the run loop? But why are they different? If I just put the seed right at the beginning of the notebook as usual and then reload and retrain many times, the results are still the same. So why not in this for loop? Why are the random numbers inside the loop different each time, whereas when starting the entire notebook anew they are the same? :)
You can try it:

```python
import os
import random

import numpy as np
import tensorflow as tf

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
    SEED = seed
    tf.random.set_seed(SEED)
    random.seed(SEED)
    os.environ['PYTHONHASHSEED'] = str(SEED)
    np.random.seed(SEED)
    print("Seed:", seed)
    for r in range(runs):
        print("Run:", r)
        print("Rnd int 1:", random.randint(0, 1000))
        print("Rnd int 2:", random.randint(0, 1000))
        print("Rnd int 3:", random.randint(0, 1000))
```

You will get 9 different random numbers per seed. These 9 numbers will be the same each time you run the notebook, but inside the runs loop they change from run to run, because the seeds are only set once per seed and not before each individual run.
Ok, but why? So when starting the notebook anew, the numbers (and results) are the same, right? How would I then do seed experiments like in the paper? Do I need a separate notebook for each of the (maybe 20!) seeds? What else can I do for the experiments?
Either remove that second loop, or set the seeds again inside the run loop, directly before each run.
@nreimers: Ok, if I remove the run loop, it would hardly be possible to check whether the different results were due to the SEED or just due to other randomness with the same seed. For the second part, you mean setting the seeds inside the run loop? Then I would have the same numbers for each run r per seed?
Yes, with that code example you use the same random numbers in each run (if the seeds are the same). This should yield consistent results for the 3 runs.
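A sketch of that variant, with the seeds reset directly before each run of the loop from above (`set_seeds` is just a local helper, not a library function):

```python
import os
import random

import numpy as np
import tensorflow as tf

def set_seeds(seed: int) -> None:
    tf.random.set_seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
    print("Seed:", seed)
    for r in range(runs):
        set_seeds(seed)  # re-seed directly before each run
        print("Run:", r)
        print("Rnd int 1:", random.randint(0, 1000))
        print("Rnd int 2:", random.randint(0, 1000))
        print("Rnd int 3:", random.randint(0, 1000))
```

With this, the three runs for a given seed print identical numbers.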
Yes, thanks. This indeed gives the same results!
Now comes the next issue. If I use something like Optuna for tuning a parameter, I specify the number of trials. So it can happen that the model is trained multiple times with the same parameters. Before, in my custom for loop, it was consistent, but now, using a framework like Optuna, the results are different again. Can you share some experience on tuning and seeds? I assume you have experience here. I don't know if you are familiar with Optuna, but maybe it is enough to set the seeds again inside the objective function?
I never set seeds. Setting seeds is a bad experimental setup, and the conclusions you draw should not depend on which seed you are using. I published more about this here:
The right experimental setup, if you want to compare two settings (e.g., is BERT or RoBERTa better for your task?), is: train both setups with e.g. 5 or 10 different random seeds, average the results, and check whether the difference is statistically significant. Then it doesn't matter which seeds you are using.
and
Isn't this contradictory? Or maybe I don't see what you mean? Also, the paper above makes it clear that the seed can be significant.
The seed plays a significant role. However, there is no justification for selecting a specific seed, i.e., you cannot justify that you used seed 23 instead of seed 42. Hence, your experimental setup must be robust against the variance introduced by the random seed. You achieve this by not computing a single performance score with one seed. Instead, you train e.g. 10 runs with 10 different seeds (which seeds you use doesn't matter) and then report mean scores.
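As a sketch of that setup (`train_and_eval` is a hypothetical placeholder for whatever training and evaluation code you run; it takes a setup and a seed and returns a dev score):

```python
import numpy as np
from scipy import stats

def compare_setups(train_and_eval, setup_a, setup_b, seeds=range(10)):
    """Train both setups with several seeds and compare mean scores."""
    scores_a = [train_and_eval(setup_a, seed) for seed in seeds]
    scores_b = [train_and_eval(setup_b, seed) for seed in seeds]

    print(f"{setup_a}: {np.mean(scores_a):.4f} +/- {np.std(scores_a):.4f}")
    print(f"{setup_b}: {np.mean(scores_b):.4f} +/- {np.std(scores_b):.4f}")

    # A simple unpaired t-test over the per-seed scores; depending on your
    # assumptions, another significance test may be more appropriate.
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    return scores_a, scores_b
```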
@nreimers Ok, thanks for this advice. But in my last comment I asked about tuning. So, how do you tune? Does this mean you would need to run 10 different seeds for every parameter combination?
It depends on what your goal is. If you just want the best model, then you tune the hyperparameters and select whatever model performs best on your development set. If you want sound conclusions, like "RoBERTa works better on this task than BERT", then you have to find a good (optimal) hyperparameter configuration for both settings, and then evaluate that configuration with 10 different random seeds.
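Sketched in code, with `tune_hyperparameters` and `train_and_eval` again as hypothetical placeholders for your own search and training code:

```python
import numpy as np

def tune_then_evaluate(tune_hyperparameters, train_and_eval, n_seeds=10):
    # Phase 1: hyperparameter search, selecting the configuration that
    # performs best on the development set.
    best_config = tune_hyperparameters()

    # Phase 2: re-train the selected configuration with several random
    # seeds and report the mean score (and its spread) rather than the
    # result of a single, possibly lucky, run.
    scores = [train_and_eval(best_config, seed) for seed in range(n_seeds)]
    print(f"Dev score: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
    return best_config, scores
```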
And I do this with just a single seed? So in both cases I would first tune and then check with multiple seeds? I would have guessed that the other way around is the way to go: that you need to average the results over multiple seeds for each parameter combination to get the best parameters?
@nreimers Would you also do the tuning with multiple seeds?
@nreimers I found something interesting: my results from an Optuna trial search cannot be reproduced anymore if I use BatchHardLoss together with the new SentenceLabelDataset:
I set all seeds beforehand, etc. Whereas for OnlineContrastiveLoss my trial results are reproducible. Any idea? I suppose I handled the numpy randomness with the seed in each trial. But since you use np.random in the new dataset, I would have to import this module right after setting the seeds, right? I think that is my problem... So in each Optuna trial I have to import this module!
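A rough sketch of re-setting the seeds at the start of every Optuna trial (`build_and_train_model` is a hypothetical placeholder for your own training and evaluation code; the Optuna calls assume a reasonably recent version):

```python
import os
import random

import numpy as np
import optuna

SEED = 42

def set_seeds(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)  # SentenceLabelDataset draws from np.random
    os.environ['PYTHONHASHSEED'] = str(seed)
    # ...plus tf.random.set_seed / torch.manual_seed, depending on the backend

def objective(trial: optuna.Trial) -> float:
    set_seeds(SEED)  # re-seed at the start of every trial
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    # build_and_train_model is a hypothetical stand-in for your own
    # training + evaluation code; it should return the dev score.
    return build_and_train_model(lr=lr)

# A seeded sampler makes the search order itself reproducible as well.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=SEED))
study.optimize(objective, n_trials=20)
```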
Hey,
I think this is normal, but I sometimes get very different results when training, e.g., a cross-encoder multiple times with the same parameters.
This is really bad, because tuning parameters does not make sense when training again with the best-found parameters yields different results.
How can I cope with this or reduce it? Setting a seed came to mind, but in my experience using a seed did not work (with a TensorFlow neural network).
What seeds would I need to set?