Different results each model run #617

Open
datistiquo opened this issue Dec 11, 2020 · 27 comments

Comments

@datistiquo

Hey,

I think this is normal, but I sometimes get very different results when training e.g. a cross-encoder repeatedly with the same parameters.
This is really bad, because tuning parameters then makes no sense: training again with the best parameters found yields different results.

How can I cope with this or reduce it? My first thought was setting a seed, but in my experience setting a seed did not help (with a TensorFlow neural network).

Which seeds would I need to set?

@nreimers
Member

Hi @datistiquo
This is a known issue with BERT & Co.: for reasons that are not fully understood, the results can become very bad depending on the random seed. This happens often for small datasets; for large datasets it is more seldom the case. It also happens more often with BERT-large than with BERT-base.

https://arxiv.org/pdf/2002.06305.pdf

@datistiquo
Author

Hey, thank you! I will look at it. Does the paper offer any solutions or advice for handling this?

Does the paper address getting different results for different seeds, or getting different results for the same seed on each training run? My issue is that I get different results for the same seed (set via e.g. random.seed(42)).

I don't know how I can rely on getting a good BERT model when it is a matter of luck.

@nreimers
Member

You also have to set the seeds for NumPy and PyTorch to get consistent results.

Otherwise you have to live with the fact that the seed can play quite an important role and that you might need to test different seeds.
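
For reference, a minimal sketch of what I mean (the helper name set_seed is just illustrative, not part of the library):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed Python's built-in RNG, NumPy, and PyTorch (CPU and all GPUs)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # call this before building the model and the DataLoader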

@datistiquo
Author

You also have to set the seeds for NumPy and PyTorch to get consistent results.

Otherwise you have to live with the fact ....

So to recap: if I set those three seeds, I can get similar (the same?) results even for BERT fine-tuning?

And if I set only one of the seeds, I get different results... That is my direct understanding of your sentences. :)

@nreimers
Member

Right, there are several different random number generators you have to seed to get consistent results.

@datistiquo
Author

So in the paper they just use the numbers 1...N as seeds, like tf.random.set_seed(10)?

What initialization is used in SentenceTransformer (uniform, normal, ...)?

How would I set different seeds for weight initialization and data order?

Do you know where they released their data and code for this evaluation?

@datistiquo
Author

datistiquo commented Dec 15, 2020

@nreimers: it seems I now get the same results after setting the seeds correctly.

The scores only change from the fifth decimal place onwards, so the first four digits after the decimal point are always the same, but after that there are changes. I think this is normal. Do you see the same, or are your numbers always identical?

Actually quite interesting that the seed is now one more important hyperparameter to tune! :-I

@tide90

tide90 commented Jan 6, 2021

@nreimers: I wonder how to practically train with different seeds (e.g. in Colab). Because if I do a simple for loop like

import os
import random

import numpy as np
import tensorflow as tf
from transformers import TFBertForSequenceClassification

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  SEED = seed
  tf.random.set_seed(SEED)
  random.seed(SEED)
  os.environ['PYTHONHASHSEED'] = str(SEED)
  np.random.seed(SEED)

  for r in range(runs):

    bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')

    BATCH_SIZE = 64
    EPOCHS = 1

    # train_size, x_train, y_train are created before this loop (tokenization happens earlier)
    steps_per_epoch = int(train_size // BATCH_SIZE)
    num_training_steps = int(steps_per_epoch * 1)
    warmup_steps = 50
    # WarmUp is a warm-up learning-rate schedule defined/imported elsewhere
    learn_rate_warmup = WarmUp(initial_learning_rate=3e-5, warmup_steps=warmup_steps, num_training_steps=num_training_steps)
    optimizer = tf.keras.optimizers.Adam(learn_rate_warmup)

    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    bert_model.compile(optimizer=optimizer, loss=loss)

    bert_model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)


I get different results each time.

@nreimers
Member

nreimers commented Jan 6, 2021

Hi @tide90
What does the rest of the code look like? And what do you mean by different results?

Note that training on a GPU is not deterministic. So even when you fix the seeds, the results can differ on a GPU each time you execute. Also note that in your small example you don't set the seed for PyTorch (I don't know if you use it).
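
If you want to reduce the GPU non-determinism as well, a rough sketch (assuming TensorFlow >= 2.9 for enable_op_determinism; deterministic kernels are slower):

import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy and TF in one call
tf.config.experimental.enable_op_determinism()  # request deterministic GPU kernels

# PyTorch equivalent, if you use it:
# import torch
# torch.manual_seed(42)
# torch.use_deterministic_algorithms(True)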

@tide90

tide90 commented Jan 6, 2021

@nreimers: my example was only meant to show how I use the BERT model; afterwards just the fit() call comes. I only train the model inside the loop, so creating the training data (just with the BERT tokenizer) happens before setting the seeds. I did this because I did not see any randomness there, but maybe the tokenizer also uses some random processes?

I checked it by setting the seed once without a loop and restarting the notebook many times, and every time I get almost exactly the same results (with GPU). So that works, but inside the loop it does not.

I use the TensorFlow Hugging Face models for this.

Maybe it is because everything is in the same cell and it is an effect of the notebook? Or maybe I am overlooking something?

@nreimers
Member

nreimers commented Jan 6, 2021

Note that the random numbers you get in that inner loop will be different for each run. You have to set the seeds directly before you start a single training run.

@tide90

tide90 commented Jan 6, 2021

@nreimers what do you mean? Inside the run loop? But why are they different? If I just set the seed at the beginning of the notebook as usual and then restart and retrain many times, the results are the same. So why not in this for loop?

Why are the random numbers inside the loop different each time, whereas when I start the entire notebook fresh they are the same? :)

@nreimers
Member

nreimers commented Jan 6, 2021

You can try it:

import os
import random

import numpy as np
import tensorflow as tf

seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  SEED = seed
  tf.random.set_seed(SEED)
  random.seed(SEED)
  os.environ['PYTHONHASHSEED'] = str(SEED)
  np.random.seed(SEED)
  print("Seed:", seed)
  for r in range(runs):
       print("Run:", r)
       print("Rnd int 1:", random.randint(0, 1000))
       print("Rnd int 2:", random.randint(0, 1000))
       print("Rnd int 3:", random.randint(0, 1000))

You will get 9 different random numbers. The 9 different random numbers will be the same each time you run the notebook, but inside the for r in range(runs):, you will get different random ints for the three runs. The reason is that the seeds are only set once per outer iteration, so the generators simply keep advancing across the three inner runs.

@tide90

tide90 commented Jan 6, 2021

@nreimers

The 9 different random numbers will be the same each time you run the notebook, but inside the for r in range(runs):, you will get different random ints for the three runs.

OK, but why? So when I start the notebook fresh, the numbers (and results) are the same, right?

How would I then do seed experiments like in the paper? Do I need a separate notebook for each of maybe 20 (!) seeds? What else can I do for the experiments?

@nreimers
Member

nreimers commented Jan 6, 2021

Either remove that second loop for r in range(runs): or move the code where you set the seeds into the second loop.

@tide90

tide90 commented Jan 7, 2021

@nreimers: OK

If I remove the run loop, it would hardly be possible to check whether the different results were due to the seed or due to other randomness with the same seed.

For the second option, you mean this?


seeds = [20, 30, 50]
runs = 3

for seed in seeds:
  for r in range(runs):
       SEED = seed
       tf.random.set_seed(SEED)
       random.seed(SEED)
       os.environ['PYTHONHASHSEED'] = str(SEED)
       np.random.seed(SEED)

       print("Seed:", seed)
       print("Run:", r)
       print("Rnd int 1:", random.randint(0, 1000))
       print("Rnd int 2:", random.randint(0, 1000))
       print("Rnd int 3:", random.randint(0, 1000))

Then, for each run r of a given seed, I get the same numbers?

@nreimers
Member

nreimers commented Jan 7, 2021

Yes, in that code example you use the same random numbers in each run (for a given seed). This should yield consistent results across the 3 runs.

@tide90

tide90 commented Jan 7, 2021

Yes, thanks, this indeed gives the same results!

@tide90

tide90 commented Jan 7, 2021

Now comes the next issue. If I use something like Optuna for tuning a parameter, I specify the number of trials, so it can happen that the model is trained multiple times with the same parameters. In my custom for loop the results were consistent, but now, using a framework like Optuna, the results differ again.

Can you share some experience regarding tuning and seeds? I assume you have experience here. I don't know if you are familiar with Optuna, but maybe it is enough to set the seeds again inside the objective function?
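
Something like this sketch is what I have in mind (assuming Optuna's usual study/objective API; set_seed and train_and_evaluate are hypothetical helpers, not library functions):

import optuna

def objective(trial):
    set_seed(42)  # hypothetical helper: re-seed random, np.random, tf/torch at the start of each trial
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    # train_and_evaluate is a hypothetical helper that trains with lr and
    # returns the dev-set score for this trial
    return train_and_evaluate(lr)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)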

@nreimers
Member

nreimers commented Jan 8, 2021

I never set seeds. Setting seeds is a bad experimental setup, and the conclusions you draw should not depend on what seed you are using.

I published more about this here:
https://arxiv.org/abs/1707.09861
https://arxiv.org/abs/1803.09578

The right experimental setup, if you want to compare two settings (e.g. is BERT or RoBERTa better for your task?), is: train both setups with e.g. 5 or 10 different random seeds, average the results, and check whether the difference is statistically significant.

Here, it doesn't matter which seeds you are using.
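
For instance, a minimal sketch of such a comparison (compare_setups is just an illustrative helper; Welch's t-test is one simple choice of significance test, the papers above discuss more suitable ones):

import numpy as np
from scipy import stats

def compare_setups(scores_a, scores_b):
    # scores_a / scores_b: one dev-set score per random seed for each setup
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    print("Setup A: mean=%.4f std=%.4f" % (a.mean(), a.std(ddof=1)))
    print("Setup B: mean=%.4f std=%.4f" % (b.mean(), b.std(ddof=1)))
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print("t=%.3f, p=%.4f" % (t_stat, p_value))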

@tide90

tide90 commented Jan 8, 2021

@nreimers

I never set seeds. Setting seeds is a bad experimental setup, and the conclusions you draw should not depend on what seed

and

Train both setups with e.g. 5 or 10 different random seeds,

Isn't this contradictory? Or maybe I don't see what you mean? Also, the paper linked above makes it clear that the choice of seed can be significant.

@nreimers
Member

nreimers commented Jan 8, 2021

The seed plays a significant role. However, there is no justification for selecting a specific seed, i.e., you cannot justify why you used seed 23 instead of seed 42.

Hence, your experimental setup must be robust against the variance introduced by the random seed.

You achieve this by not computing a single performance number with one seed.

Instead, you train e.g. 10 runs with 10 different seeds (which seeds does not matter) and then report mean scores.

@tide90

tide90 commented Jan 8, 2021

@nreimers OK, thanks for this advice. But in my last comment I asked about tuning.

So, how do you tune? Does this mean you would need to run 10 different seeds for every parameter combination?

@nreimers
Member

nreimers commented Jan 8, 2021

It depends on what your goal is.

If you just want to have the best model, then you just tune the hyperparameters and select whatever model performs best on your development set.

If you want to draw sound conclusions, like "RoBERTa works better on this task than BERT", then you have to find a good (optimal) hyperparameter configuration for both settings, and then evaluate each configuration with 10 different random seeds.
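
As a rough sketch of the second workflow (tune_and_select_config and train_and_evaluate are hypothetical helpers standing in for your own tuning and training code):

import numpy as np

def evaluate_config_across_seeds(train_and_evaluate, config, seeds=range(10)):
    # train_and_evaluate(config, seed) is a user-supplied function that runs one
    # training with the given seed and returns the dev-set score
    scores = [train_and_evaluate(config, seed=s) for s in seeds]
    return float(np.mean(scores)), float(np.std(scores, ddof=1))

# Usage idea:
# best_config = tune_and_select_config()                 # e.g. the best Optuna trial
# mean, std = evaluate_config_across_seeds(train_and_evaluate, best_config)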

@tide90

tide90 commented Jan 8, 2021

@nreimers

If you just want to have the best model, then you just tune the hyperparameters and select whatever model performs best on your development set.

And I do this with just a single seed?

So in both cases I would first tune and then check with multiple seeds? I would have guessed it is the other way around: that you need to average the results over multiple seeds for each parameter combination in order to find the best parameters?

@tide90

tide90 commented Jan 8, 2021

@nreimers Would you also do the tuning with multiple seeds?

@datistiquo
Author

datistiquo commented Feb 4, 2021

@nreimers I found something interesting: my results from an Optuna trial search can no longer be reproduced if I use BatchHardTripletLoss together with the new SentenceLabelDataset:

train_loss = losses.BatchHardTripletLoss(model=model, margin=margin, distance_metric=distance_metric)

like in https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec.py.

I set all the seeds beforehand etc.

Whereas for OnlineContrastiveLoss my trial results are reproducible.

Any idea?

I suppose I handled the NumPy randomness with the seed in each trial. But since you use np.random in the new dataset, I would have to import this module right after setting the seeds, right? I think that is my problem... so in each Optuna trial I have to import this module!
