
Question about Cross-Validation for a downstream task #984

Open
PaulForInvent opened this issue Jun 3, 2021 · 22 comments

Comments

@PaulForInvent

PaulForInvent commented Jun 3, 2021

Hey,

do you think I should use cross-validation on my training data while fine-tuning a model for semantic search (and a similarity task)?

Surprisingly, I have always ignored this...

@PaulForInvent PaulForInvent changed the title Question about Coross-Validation for a downstream task Question about Cross-Validation for a downstream task Jun 3, 2021
@nreimers
Member

nreimers commented Jun 4, 2021

If you perform an ablation, e.g. on what the best model, the best loss, or the best parameters are, then using CV can make sense if it is computationally feasible.

@PaulForInvent
Author

@nreimers Thanks.

I just tried to do it, but I saw that for k-fold you of course need the SubsetRandomSampler. In my case I use the SentencesLabelDataset, which is an IterableDataset and cannot be used with a sampler. That is bad.

Is it possible to have the SentencesLabelDataset as a normal Dataset?

@nreimers
Member

nreimers commented Jun 4, 2021

It would be better to first create the folds, and then re-init your SentencesLabelDataset.

@PaulForInvent
Author

> It would be better to first create the folds, and then re-init your SentencesLabelDataset.

So you suggest creating the folds without any PyTorch dataset? But isn't it possible to change the SentencesLabelDataset into a normal dataset, e.g. by replacing yield with return...?

@nreimers
Member

nreimers commented Jun 4, 2021

I think it is easier to first create your different folds, and then create a new SentencesLabelDataset from them.
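
A minimal sketch of what this could look like, assuming sklearn's `KFold` and the label dataset from sentence-transformers (written `SentenceLabelDataset` in the library, spelled `SentencesLabelDataset` in this thread); the model name and the `labeled_pairs` variable are placeholders:

```python
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import SentenceLabelDataset

# labeled_pairs is a placeholder for your own list of (text, class_label) tuples
examples = [InputExample(texts=[text], label=label) for text, label in labeled_pairs]

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(examples)):
    train_examples = [examples[i] for i in train_idx]
    val_examples = [examples[i] for i in val_idx]

    # re-init the label dataset from the raw training fold, as suggested above
    train_dataset = SentenceLabelDataset(train_examples, samples_per_label=2)
    train_dataloader = DataLoader(train_dataset, batch_size=32)

    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # placeholder base model
    train_loss = losses.BatchHardTripletLoss(model=model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
    # ... evaluate on val_examples here and store the score for this fold
```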

@PaulForInvent
Author

> I think it is easier to first create your different folds

For this, I would like to use a dataset and a SubsetRandomSampler to sample the folds in a PyTorch way. Or how would you create the folds?

@PhilipMay
Contributor

PhilipMay commented Jun 4, 2021

Maybe you want to have a look here: https://github.com/German-NLP-Group/xlsr

In this script: https://github.com/German-NLP-Group/xlsr/blob/main/xlsr/train_optuna_stsb.py

There I use cross-validation, as I think it is useful.

@PhilipMay
Contributor

PhilipMay commented Jun 4, 2021

I prefer to use cross-validation when I do automated hyperparameter search; a minimal sketch follows the list. The reasons are:

  • cross-validation reduces overfitting on the validation set when you do automated hyperparameter search
  • by using multiple validation sets, you cover your data space better when working with small datasets
  • because neural networks are randomly initialized, the random effects on the results are reduced when you calculate the mean over the folds
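
A minimal sketch of that pattern, assuming Optuna; `examples` and the `train_and_eval_fold` helper are hypothetical stand-ins for your own data and training/evaluation code, and the search space is purely illustrative:

```python
import numpy as np
import optuna
from sklearn.model_selection import KFold

def objective(trial):
    # illustrative search space; adapt to your setup
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    epochs = trial.suggest_int("epochs", 1, 4)

    kfold = KFold(n_splits=5, shuffle=True)
    scores = []
    for train_idx, val_idx in kfold.split(examples):
        # train_and_eval_fold: hypothetical helper that trains a fresh model on the
        # training fold and returns the score on the validation fold
        scores.append(train_and_eval_fold(examples, train_idx, val_idx, lr=lr, epochs=epochs))

    # optimize the mean over the folds, i.e. the cross-validated score
    return float(np.mean(scores))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```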

@PaulForInvent
Author

PaulForInvent commented Jun 4, 2021

@PhilipMay Thanks. Maybe using simple arrays is better. I wanted to do it like here:
https://www.machinecurve.com/index.php/2021/02/03/how-to-use-k-fold-cross-validation-with-pytorch/

But I think the SentencesLabelDataset can be rewritten to a simple dataset.

I saw you are also tuning optimizer parameters like weight decay. Did you find any improvement from that? I found that tuning the learning rate is not very useful (at least in my case).

@PhilipMay
Contributor

@PaulForInvent

Here is the Optuna Importance Plot

[image: Optuna hyperparameter importance plot]

@PhilipMay
Contributor

@PaulForInvent

and the slice plot

[image: Optuna slice plot]
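
For reference, plots like these can be generated directly from a finished Optuna study (a small sketch; `optuna.visualization` needs plotly installed, and `study` is assumed to be the study from the search above):

```python
import optuna

# `study` is assumed to be a finished optuna.Study
fig_importance = optuna.visualization.plot_param_importances(study)
fig_slice = optuna.visualization.plot_slice(study)
fig_importance.show()
fig_slice.show()
```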

@PaulForInvent
Author

I asked this myself too.

#791

@PaulForInvent
Author

PaulForInvent commented Jun 4, 2021

Now I have a different issue. Since I mainly use batch-hard losses, I have examples with their class labels. Up to now, I evaluated with a ranking metric on a separate validation set. Now I wonder how to evaluate my model on each fold, since both sets are structured the same way (they are just labeled examples). I could use an evaluation metric that checks whether the class label is predicted correctly (multi-class task), or a triplet evaluator...

My main task is actually ranking, so I would also like to do a ranking evaluation for each fold... but since my fold is fixed, I cannot set up a separate ranking task and just have to use the available samples of each class (possibly with a ParaphraseMiningEvaluator)?

Oh, this just came to my mind: has someone used a combination of a ranking metric like MRR and a binary metric like precision for evaluation (and for parameter tuning)? @PhilipMay @nreimers
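
One possible way to build a ranking-style evaluator from a labeled fold (a sketch, assuming the held-out fold is a list of `(text, label)` tuples and treating every same-label pair as a positive; `build_fold_evaluator` is a hypothetical helper):

```python
from collections import defaultdict
from itertools import combinations
from sentence_transformers.evaluation import ParaphraseMiningEvaluator

def build_fold_evaluator(fold_examples):
    """fold_examples: list of (text, class_label) tuples from the held-out fold."""
    sentences_map = {str(i): text for i, (text, _) in enumerate(fold_examples)}
    by_label = defaultdict(list)
    for i, (_, label) in enumerate(fold_examples):
        by_label[label].append(str(i))
    # every pair of sentences with the same class label counts as a positive pair
    duplicates_list = [pair for ids in by_label.values() for pair in combinations(ids, 2)]
    return ParaphraseMiningEvaluator(sentences_map, duplicates_list=duplicates_list, name="fold")
```

And if you want a single number for tuning that mixes a ranking metric (e.g. MRR) with a binary one (e.g. precision), a simple weighted average is one pragmatic, if arbitrary, choice:

```python
def combined_score(mrr, precision, alpha=0.5):
    # alpha is an arbitrary trade-off between ranking and classification quality
    return alpha * mrr + (1 - alpha) * precision
```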

@PaulForInvent
Author

@nreimers :

I wonder whether your ParaphraseMiningEvaluator or BinaryClassificationEvaluator ignores self-references when calculating the cosine scores of a list of sentences against itself?

@nreimers
Member

nreimers commented Jun 7, 2021

It computes whatever you pass as your data. The ParaphraseMiningEvaluator ignores self references.
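
For the cosine-over-the-same-list case, the underlying `util.paraphrase_mining` utility skips identical indices, so self-pairs do not show up (a small sketch; the model name and sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # placeholder model
sentences = ["first sentence", "second sentence", "third sentence"]

# returns [score, i, j] triples with i != j, so a sentence is never paired with itself
pairs = util.paraphrase_mining(model, sentences, top_k=10)
for score, i, j in pairs:
    print(f"{score:.3f}  {sentences[i]}  <->  {sentences[j]}")
```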

@PaulForInvent
Author

PaulForInvent commented Jun 8, 2021

@PhilipMay I just saw that you are drawing the parameters anew in each fold. I did the same thing. But shouldn't the parameters be the same for all folds?

I am also trying to find out how to build the final model after I have used CV to find the best parameters. This seems to be a heavily discussed topic...

Should I then retrain the model using all training data? Also, despite setting a seed, you cannot guarantee that each model trained with the same parameters yields the same results... So should you save each model during CV and then continue fine-tuning on all the data?
Is there any standard way that you have found to work well? @nreimers

@PhilipMay
Contributor

PhilipMay commented Jun 8, 2021

> I just saw that you are drawing the parameters anew in each fold.

No, it just seems like that. When you draw a parameter from Optuna multiple times, the second and all following calls return the same value until the trial is over.

> I did the same thing. But shouldn't the parameters be the same for all folds?

They should (must) all be the same.
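
A small illustration of that Optuna behaviour: within one trial, suggesting the same parameter name again returns the value that was already drawn, so every fold sees the same hyperparameters:

```python
import optuna

def objective(trial):
    first = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    second = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    assert first == second  # same name within one trial -> same cached value
    return first

optuna.create_study().optimize(objective, n_trials=1)
```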

@PhilipMay
Contributor

PhilipMay commented Jun 8, 2021

> Should I then retrain the model using all training data? Also, despite setting a seed, you cannot guarantee that each model trained with the same parameters yields the same results... So should you save each model during CV and then continue fine-tuning on all the data?
> Is there any standard way that you have found to work well? @nreimers

I hate seeds and do not use them when doing HP optimization with CV. I just do many CV steps and average them. CV is only about hyperparameter finding and not about model creation.

When I want to create the "best" final model, I train it with the best HP set on the full dataset at the end.
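
A sketch of that final step (assuming `study` is the finished Optuna study from the CV search; `train_on_full_data` is a hypothetical helper that runs the same training code as the CV objective, just on the full training set):

```python
best = study.best_params  # e.g. {"lr": ..., "epochs": ...}

# retrain from scratch with the best hyperparameters on all training data
final_model = train_on_full_data(examples, **best)
final_model.save("output/final-model")
```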

@PaulForInvent
Author

> I hate seeds and do not use them when doing HP optimization with CV. I just do many CV steps and average them. CV is only about hyperparameter finding and not about model creation.
>
> When I want to create the "best" final model, I train it with the best HP set on the full dataset at the end.

Yes. That is straightforward if the model behaves the same in every training run with the same hyperparameters. I found that with some loss types, the results vary (sometimes strongly) when training the same model each time. But I feel I am the only one having this problem... That's why I sometimes set seeds and save the model for each parameter set. But then I can only take that trained model and continue training it on all the data... which may be different from training from scratch with the best parameters found.

@PaulForInvent
Author

@PhilipMay What is your experience with randomness? If I do an HP search and try to retrain the model with a given set of parameters, I always get different results. So just finding the HPs does not seem meaningful, as it is not reproducible...

@PhilipMay
Contributor

@PaulForInvent just because there is randomness and you get different results does not mean it is not useful.

For example: I have use cases with small datasets (6000 examples) where I do 10-fold cross-validation. The result is the mean over the folds. That helps to reduce the effect of randomness.

@PhilipMay
Contributor

By the way, I saw that the stsb dataset has duplicate sentences in the train set. So doing cross-validation might not be a good idea, since you might have information leakage from train to validation...
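
If you still want to cross-validate on such data, one option is a group-aware split so that duplicate sentences never end up in both the train and validation fold (a sketch using sklearn's `GroupKFold`; `texts` and `labels` are placeholders for your data, and for sentence pairs you would need a grouping over both sentences):

```python
from sklearn.model_selection import GroupKFold

# texts: list of sentences, labels: their class labels (placeholders)
groups = texts  # identical sentences share a group and therefore stay in one fold
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(texts, labels, groups=groups):
    ...  # build the train and validation folds as above
```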
