Running chemCPA on new dataset #168

Open
hraeder41 opened this issue Jan 23, 2025 · 2 comments

hraeder41 commented Jan 23, 2025

Hello,

I have re-trained the model on only the LINCS L1000 data, and I would now like to apply this model to a new dataset and use the predicted expression for another task. In this dataset, I believe I already have the information I need (i.e. SMILES, dose, cell line information, and baseline gene expression). However, I have not been able to find a function in the model that will simply output the predicted gene expression based on this information.

I know that in Issue #109 you mentioned that one could potentially use the evaluate_r2 method (and specifically the compute_prediction function contained within it) for this type of analysis. However, I have not been able to find a set of functions that will allow me to properly instantiate the model and pull only the predictions from this method.

Do you have any advice for those who simply want to run a pre-trained model on a dataset, and have the output simply be the predicted gene expression rather than a set of evaluation metrics? Thank you!

Edit note: I am also having trouble with using new SMILES that were not in the original training set. I know some experiments in the paper used drugs outside of the training data, but I do not see any way to do this without retraining the model. Is this true, or is there a way for the model to process novel SMILES?

B1RO (Collaborator) commented Feb 6, 2025

Hello,

On the branch biroscak/predict-method-for-new-dataset there is now a chemCPA/predict.py file that shows how to use a pretrained checkpoint to get predictions on your data.

  • The predict function accepts the information that you have (drug_embeddings, covariate configuration, control gene expression data and checkpoint) and returns the predictions.

  • The prepare function shows one example of how to obtain these four inputs for the lincs_full_sciplex_genes dataset.

As for using new SMILES, this should also work: you simply compute embeddings for your new SMILES and supply them via the aforementioned drug_embeddings. Just note that, during its setup, the model loads the embeddings that were used during training, so the code might crash if you don't have those files. The prediction itself, however, can use any embeddings you supply. This will be fixed.
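A minimal sketch of the "compute embeddings for your new SMILES" step, in case it helps. It is not taken from the chemCPA repository: the function names `build_drug_embeddings` and `rdkit_2d_descriptors` are hypothetical, and the plain RDKit descriptor list is only a stand-in featurizer. The assumption to check is that new embeddings come from the same featurizer, with the same dimensionality and normalization, as the ones the checkpoint was trained with. What container predict expects for drug_embeddings (DataFrame, parquet file, tensor) is not stated here, so check predict.py on the branch before converting the result.

```python
import math
from typing import Callable, Dict, Iterable, List, Optional


def rdkit_2d_descriptors(smiles: str) -> List[float]:
    """Stand-in featurizer (assumption): RDKit's built-in descriptors for one molecule."""
    # Imported lazily so the rest of the sketch works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return [float(fn(mol)) for _, fn in Descriptors.descList]


def build_drug_embeddings(
    smiles_list: Iterable[str],
    featurize: Callable[[str], List[float]],
    expected_dim: Optional[int] = None,
) -> Dict[str, List[float]]:
    """Map each unique SMILES string to an embedding vector.

    expected_dim should be the embedding size the checkpoint was trained
    with, so that a featurizer mismatch fails here rather than in the model.
    """
    embeddings: Dict[str, List[float]] = {}
    for smiles in smiles_list:
        if smiles in embeddings:
            continue  # skip duplicates, keep first-seen order
        vector = [float(x) for x in featurize(smiles)]
        if expected_dim is not None and len(vector) != expected_dim:
            raise ValueError(
                f"{smiles!r}: got {len(vector)} features, expected {expected_dim}"
            )
        # Assumption: non-finite descriptor values are replaced by 0.0.
        embeddings[smiles] = [x if math.isfinite(x) else 0.0 for x in vector]
    return embeddings
```

With RDKit installed, a call such as `build_drug_embeddings(new_smiles, rdkit_2d_descriptors, expected_dim=d)` returns one vector per unique SMILES, where `d` is the embedding width of your training embeddings. Passing the featurizer as an argument makes it easy to swap in whichever embedding method the checkpoint actually used.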

Hopefully this is helpful.


ACDBio commented Feb 19, 2025

Hi! Could you add embeddings for testing predict.py from biroscak/predict-method-for-new-dataset (e.g. sciplex_complete_lincs_genes_v2_rdkit2D_embedding.parquet), please?
