
Overfitting of ridge regression? #1

Open
jhfoxliu opened this issue Jun 21, 2021 · 4 comments

Comments

@jhfoxliu

Hello, it seems that in the FP and BLAC experiments, ridge regression worked well on the validation set. However, both in my hands and in the examples from Ivan's re-implementation, ridge regression fits the training set well but performs poorly on the validation set. I suspect this is due to insufficient training of eUniRep. The question is: is it worth moving on to directed evolution with an overfit model?
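For context, here is a minimal sketch of the kind of train/validation check being described, using scikit-learn's RidgeCV on precomputed UniRep-style embeddings; the file names and the 80/20 split below are illustrative placeholders, not details from this thread:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# X: UniRep/eUniRep embeddings, one row per variant; y: measured fitness scores.
# Both files are hypothetical placeholders for whatever data you already have.
X = np.load("unirep_embeddings.npy")   # shape (n_variants, 1900)
y = np.load("fitness_scores.npy")      # shape (n_variants,)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# RidgeCV chooses the regularization strength by internal cross-validation.
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
model.fit(X_train, y_train)

# A large gap between these two scores is the overfitting symptom described above.
print("train R^2:", model.score(X_train, y_train))
print("val   R^2:", model.score(X_val, y_val))
```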

@surgebiswas
Contributor

Hard to answer without more information.

A few things that would help me answer your question: What's the application you're working on here? What does train vs. val performance look like? Which ridge implementation are you using, and how are you doing hyperparameter selection?

@jhfoxliu
Author

I am training models for ADAR2. I could only find <10,000 closely related proteins, so I used ~60,000 sequences, including other editases, to re-train UniRep. I did the training with jax-unirep. The loss decreased very quickly, from 0.12 to ~0.02 within 10 epochs. I then used RidgeCV to fit the fitness scores of a set of single amino acid mutants (N=33).

Two figures are attached: the first is from my results, the second is from Ivan's notebook.

[Figures: ADAR (my results) and Ivan (from Ivan's notebook)]
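For reference, a minimal sketch of that top-model step with jax-unirep's get_reps plus RidgeCV; with only N=33 labeled mutants, leave-one-out cross-validation is one way to get a less noisy view of generalization than a single split. The sequences and scores below are placeholders:

```python
import numpy as np
from jax_unirep import get_reps
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Placeholders for the labeled single mutants and their fitness scores.
mutant_seqs = ["MKTAYIAKQR", "MKTAYVAKQR", "MKSAYIAKQR", "MKTAYIAKQW"]
fitness = np.array([0.8, 1.3, 0.5, 1.1])

# get_reps returns (h_avg, h_final, c_final); h_avg is the usual 1900-d representation.
# Whether/how evotuned parameters can be passed in depends on the jax-unirep version.
h_avg, h_final, c_final = get_reps(mutant_seqs)

model = RidgeCV(alphas=np.logspace(-3, 3, 13))
loo_pred = cross_val_predict(model, h_avg, fitness, cv=LeaveOneOut())
print("LOO Pearson r:", np.corrcoef(loo_pred, fitness)[0, 1])
```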

@surgebiswas
Contributor

surgebiswas commented Jun 23, 2021 via email

When you say "re-train," what do you mean? Evotune/fine-tune? How did you monitor the unsupervised loss while you were evotuning, and how did you use that information to decide when to stop?

@jhfoxliu
Author

jhfoxliu commented Jul 7, 2021

I have done additional runs over the past few days. It seems the global UniRep parameters can be broken within a few epochs if the learning rate is too high (1e-6 or 1e-5), so I am now evotuning with lr=1e-7. It looks much better now, but it will take some time to converge.
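A rough sketch of that kind of conservative evotuning run with jax-unirep's fit. The learning-rate keyword name and any holdout argument vary between jax-unirep versions, so treat the argument names below as assumptions and check the signature of your installed release:

```python
from jax_unirep import fit

# Placeholder list; in practice this would be the ~60,000 editase/ADAR-family sequences.
sequences = ["MKTAYIAKQR", "MKTAYVAKQR", "MKSAYIAKQR"]

# Keep some sequences aside so the unsupervised loss can be checked on data the
# weights never see; 10% is an arbitrary illustrative choice. How the held-out set
# is plugged into fit (if at all) depends on the jax-unirep version.
n_holdout = max(1, len(sequences) // 10)
holdout_seqs = sequences[:n_holdout]
train_seqs = sequences[n_holdout:]

# Low learning rate, as discussed above, to avoid breaking the global weights.
# NOTE: the keyword may be named differently (e.g. step_size) in your version of
# jax-unirep; this call is an assumption, not a verified signature.
evotuned_params = fit(
    train_seqs,
    n_epochs=10,
    learning_rate=1e-7,
)
```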
