Different results each model run #617

Open
datistiquo opened this issue Dec 11, 2020 · 27 comments

Comments

@datistiquo

Hey,

I think this is normal, but I sometimes get very different results when training e.g. a cross-encoder repeatedly with the same parameters.
This is really bad, because tuning parameters then makes no sense: training again with the best parameters found yields different results.

How can I cope with this or reduce it? My first thought was setting a seed, but in my experience setting a seed did not help (with a TensorFlow neural network).

Which seeds would I need to set?

@nreimers
Member

Hi @datistiquo
This is a known issue with BERT & Co.: for reasons that are not fully understood, the results can become very bad depending on the random seed. This happens often for small datasets; for large datasets it is more seldom the case. It also happens more often with BERT-large than with BERT-base.

https://arxiv.org/pdf/2002.06305.pdf

@datistiquo
Author

Hey, thank you! I will look at it. Does the paper offer any solutions or advice for handling this?

Does the paper address getting different results for different seeds, or getting different results for the same seed on each training run? My issue is that I get different results for the same seed (set via e.g. random.seed(42)).

I don't know how I can rely on getting a good BERT model when it is a matter of luck.

@nreimers
Member

You also have to set the seeds for NumPy and PyTorch to get consistent results.

Otherwise you have to live with the fact that the seed can play quite an important role and that you might need to test different seeds.
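
For reference, a minimal sketch of what I mean (the helper name set_seed is just illustrative, not part of the library):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed Python's built-in RNG, NumPy, and PyTorch (CPU and all GPUs)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # call this before building the model and the DataLoader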

@datistiquo
Author

You also have to set the seeds for NumPy and PyTorch to get consistent results.

Otherwise you have to live with the fact ....

So to recap: if I set those three seeds, I can get similar (the same?) results even for BERT fine-tuning?

And if I set only one of the seeds, I get different results... That is my direct understanding of your sentences. :)

@nreimers
Member

Right, there are several different random number generators you have to seed to get consistent results.

@datistiquo
Author

So in the paper they just use the numbers 1...N as seeds, like tf.random.set_seed(10)?

What initialization is used in SentenceTransformer (uniform, normal, ...)?

How would I set different seeds for weight initialization and data order?

Do you know where they released their data and code for this evaluation?

@datistiquo
Author

datistiquo commented Dec 15, 2020

@nreimers: it seems I now get the same results after setting the seeds correctly.

The scores only change from the fifth decimal place onwards, so the first four digits after the decimal point are always the same, but after that there are changes. I think this is normal. Do you see the same, or are your numbers always identical?

Actually quite interesting that the seed is now one more important hyperparameter to tune! :-I

@tide90

tide90 commented Jan 6, 2021

@nreimers: I wonder how to practically train with different seeds (e.g. in Colab). Because if I do a simple for loop like

import os
import random

import numpy as np
import tensorflow as tf
from transformers import TFBertForSequenceClassification

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  SEED = seed
  tf.random.set_seed(SEED)
  random.seed(SEED)
  os.environ['PYTHONHASHSEED'] = str(SEED)
  np.random.seed(SEED)

  for r in range(runs):

    bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')

    BATCH_SIZE = 64
    EPOCHS = 1

    # train_size, x_train, y_train are created before this loop (tokenization happens earlier)
    steps_per_epoch = int(train_size // BATCH_SIZE)
    num_training_steps = int(steps_per_epoch * 1)
    warmup_steps = 50
    # WarmUp is a warm-up learning-rate schedule defined/imported elsewhere
    learn_rate_warmup = WarmUp(initial_learning_rate=3e-5, warmup_steps=warmup_steps, num_training_steps=num_training_steps)
    optimizer = tf.keras.optimizers.Adam(learn_rate_warmup)

    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    bert_model.compile(optimizer=optimizer, loss=loss)

    bert_model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)


I get different results each time.

@nreimers
Member

nreimers commented Jan 6, 2021

Hi @tide90
What does the rest of the code look like? And what do you mean by different results?

Note that training on a GPU is not deterministic. So even when you fix the seeds, the results can differ on a GPU each time you execute. Also note that in your small example you don't set the seed for PyTorch (I don't know if you use it).
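
If you want to reduce the GPU non-determinism as well, a rough sketch (assuming TensorFlow >= 2.9 for enable_op_determinism; deterministic kernels are slower):

import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy and TF in one call
tf.config.experimental.enable_op_determinism()  # request deterministic GPU kernels

# PyTorch equivalent, if you use it:
# import torch
# torch.manual_seed(42)
# torch.use_deterministic_algorithms(True)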

@tide90

tide90 commented Jan 6, 2021

@nreimers: my example was only meant to show how I use the BERT model; afterwards just the fit() call comes. I only train the model inside the loop, so creating the training data (just with the BERT tokenizer) happens before setting the seeds. I did this because I did not see any randomness there, but maybe the tokenizer also uses some random processes?

I checked it by setting the seed once without a loop and restarting the notebook many times, and every time I get almost exactly the same results (with GPU). So that works, but inside the loop it does not.

I use the TensorFlow Hugging Face models for this.

Maybe it is because everything is in the same cell and it is an effect of the notebook? Or maybe I am overlooking something?

@nreimers
Member

nreimers commented Jan 6, 2021

Note that the random numbers you get in that inner loop will be different for each run. You have to set the seeds directly before you start a single training run.

@tide90

tide90 commented Jan 6, 2021

@nreimers what do you mean? Inside the run loop? But why are they different? If I just set the seed at the beginning of the notebook as usual and then restart and retrain many times, the results are the same. So why not in this for loop?

Why are the random numbers inside the loop different each time, whereas when I start the entire notebook fresh they are the same? :)

@nreimers
Member

nreimers commented Jan 6, 2021

You can try it:

import os
import random

import numpy as np
import tensorflow as tf

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  SEED = seed
  tf.random.set_seed(SEED)
  random.seed(SEED)
  os.environ['PYTHONHASHSEED'] = str(SEED)
  np.random.seed(SEED)
  print("Seed:", seed)
  for r in range(runs):
       print("Run:", r)
       print("Rnd int 1:", random.randint(0, 1000))
       print("Rnd int 2:", random.randint(0, 1000))
       print("Rnd int 3:", random.randint(0, 1000))

You will get 9 different random numbers. The 9 different random numbers will be the same each time you run the notebook, but inside the for r in range(runs):, you will get different random ints for the three runs. The reason is that the seeds are only set once per outer iteration, so the generators simply keep advancing across the three inner runs.

@tide90

tide90 commented Jan 6, 2021

@nreimers

The 9 different random numbers will be the same each time you run the notebook, but inside the for r in range(runs):, you will get different random ints for the three runs.

OK, but why? So when I start the notebook fresh, the numbers (and results) are the same, right?

How would I then do seed experiments like in the paper? Do I need a separate notebook for each of maybe 20 (!) seeds? What else can I do for the experiments?

@nreimers
Member

nreimers commented Jan 6, 2021

Either remove that second loop for r in range(runs): or move the code where you set the seeds into the second loop.

@tide90

tide90 commented Jan 7, 2021

@nreimers: OK

If I remove the run loop, it would hardly be possible to check whether the different results were due to the seed or due to other randomness with the same seed.

For the second option, you mean this?


seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  for r in range(runs):
       SEED = seed
       tf.random.set_seed(SEED)
       random.seed(SEED)
       os.environ['PYTHONHASHSEED'] = str(SEED)
       np.random.seed(SEED)

       print("Seed:", seed)
       print("Run:", r)
       print("Rnd int 1:", random.randint(0, 1000))
       print("Rnd int 2:", random.randint(0, 1000))
       print("Rnd int 3:", random.randint(0, 1000))

Then, for each run r of a given seed, I get the same numbers?

@nreimers
Member

nreimers commented Jan 7, 2021

Yes, in that code example you use the same random numbers in each run (for a given seed). This should yield consistent results across the 3 runs.

@tide90

tide90 commented Jan 7, 2021

Yes, thanks, this indeed gives the same results!

@tide90

tide90 commented Jan 7, 2021

Now comes the next issue. If I use something like Optuna for tuning a parameter, I specify the number of trials, so it can happen that the model is trained multiple times with the same parameters. In my custom for loop the results were consistent, but now, using a framework like Optuna, the results differ again.

Can you share some experience regarding tuning and seeds? I assume you have experience here. I don't know if you are familiar with Optuna, but maybe it is enough to set the seeds again inside the objective function?
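
Something like this sketch is what I have in mind (assuming Optuna's usual study/objective API; set_seed and train_and_evaluate are hypothetical helpers, not library functions):

import optuna

def objective(trial):
    set_seed(42)  # hypothetical helper: re-seed random, np.random, tf/torch at the start of each trial
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    # train_and_evaluate is a hypothetical helper that trains with lr and
    # returns the dev-set score for this trial
    return train_and_evaluate(lr)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)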

@nreimers
Member

nreimers commented Jan 8, 2021

I never set seeds. Setting seeds is a bad experimental setup, and the conclusions you draw should not depend on what seed you are using.

I published more about this here:
https://arxiv.org/abs/1707.09861
https://arxiv.org/abs/1803.09578

The right experimental setup, if you want to compare two settings (e.g. is BERT or RoBERTa better for your task?), is: train both setups with e.g. 5 or 10 different random seeds, average the results, and check whether the difference is statistically significant.

Here, it doesn't matter which seeds you are using.
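
For instance, a minimal sketch of such a comparison (compare_setups is just an illustrative helper; Welch's t-test is one simple choice of significance test, the papers above discuss more suitable ones):

import numpy as np
from scipy import stats

def compare_setups(scores_a, scores_b):
    # scores_a / scores_b: one dev-set score per random seed for each setup
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    print("Setup A: mean=%.4f std=%.4f" % (a.mean(), a.std(ddof=1)))
    print("Setup B: mean=%.4f std=%.4f" % (b.mean(), b.std(ddof=1)))
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print("t=%.3f, p=%.4f" % (t_stat, p_value))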

@tide90

tide90 commented Jan 8, 2021

@nreimers

I never set seeds. Setting seeds is a bad experimental setup, and the conclusions you draw should not depend on what seed

and

Train both setups with e.g. 5 or 10 different random seeds,

Isn't this contradictory? Or maybe I don't see what you mean? Also, the paper linked above makes it clear that the choice of seed can be significant.

@nreimers
Member

nreimers commented Jan 8, 2021

The seed plays a significant role. However, there is no justification for selecting a specific seed, i.e., you cannot justify why you used seed 23 instead of seed 42.

Hence, your experimental setup must be robust against the variance introduced by the random seed.

You achieve this by not computing a single performance number with one seed.

Instead, you train e.g. 10 runs with 10 different seeds (which seeds does not matter) and then report mean scores.

@tide90

tide90 commented Jan 8, 2021

@nreimers OK, thanks for this advice. But in my last comment I asked about tuning.

So, how do you tune? Does this mean you would need to run 10 different seeds for every parameter combination?

@nreimers
Member

nreimers commented Jan 8, 2021

It depends on what your goal is.

If you just want to have the best model, then you just tune the hyperparameters and select whatever model performs best on your development set.

If you want to draw sound conclusions, like "RoBERTa works better on this task than BERT", then you have to find a good (optimal) hyperparameter configuration for both settings, and then evaluate each configuration with 10 different random seeds.
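
As a rough sketch of the second workflow (tune_and_select_config and train_and_evaluate are hypothetical helpers standing in for your own tuning and training code):

import numpy as np

def evaluate_config_across_seeds(train_and_evaluate, config, seeds=range(10)):
    # train_and_evaluate(config, seed) is a user-supplied function that runs one
    # training with the given seed and returns the dev-set score
    scores = [train_and_evaluate(config, seed=s) for s in seeds]
    return float(np.mean(scores)), float(np.std(scores, ddof=1))

# Usage idea:
# best_config = tune_and_select_config()                 # e.g. the best Optuna trial
# mean, std = evaluate_config_across_seeds(train_and_evaluate, best_config)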

@tide90

tide90 commented Jan 8, 2021

@nreimers

If you just want to have the best model, then you just tune the hyperparameters and select whatever model performs best on your development set.

And I do this with just a single seed?

So in both cases I would first tune and then check with multiple seeds? I would have guessed it is the other way around: that you need to average the results over multiple seeds for each parameter combination in order to find the best parameters?

@tide90

tide90 commented Jan 8, 2021

@nreimers Would you also do the tuning with multiple seeds?

@datistiquo
Author

datistiquo commented Feb 4, 2021

@nreimers I found something interesting: my results from an Optuna trial search can no longer be reproduced if I use BatchHardTripletLoss together with the new SentenceLabelDataset:

train_loss = losses.BatchHardTripletLoss(model=model, margin=margin, distance_metric=distance_metric)

like in https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec.py.

I set all the seeds beforehand etc.

Whereas for OnlineContrastiveLoss my trial results are reproducible.

Any idea?

I suppose I handled the NumPy randomness with the seed in each trial. But since you use np.random in the new dataset, I would have to import this module right after setting the seeds, right? I think that is my problem... so in each Optuna trial I have to import this module!
