Running chemCPA on new dataset #168

hraeder41 · 2025-01-23T20:08:45Z

Hello,

I have re-trained the model on only the LINCS L1000 data, and I would now like to apply this model to a new dataset and use the predicted expression for another task. In this dataset, I believe I already have the information I need (i.e. SMILES, dose, cell line information, and baseline gene expression). However, I have not been able to find a function in the model that will simply output the predicted gene expression based on this information.

I know that in Issue #109 you mentioned that one could potentially use the evaluate_r2 method (and specifically the compute_prediction function contained within it) for this type of analysis. However, I have not been able to find a set of functions that will allow me to properly instantiate the model and pull only the predictions from this method.

Do you have any advice for those who simply want to run a pre-trained model on a dataset, and have the output simply be the predicted gene expression rather than a set of evaluation metrics? Thank you!

Edit note: I am also having trouble with using new SMILES that were not in the original training set. I know some experiments in the paper used drugs outside of the training data, but I do not see any way to do this without retraining the model. Is this true, or is there a way for the model to process novel SMILES?

B1RO · 2025-02-06T15:22:01Z

Hello,

On the branch biroscak/predict-method-for-new-dataset there is now a chemCPA/predict.py file that shows how to use a pretrained checkpoint to get predictions on data.

The predict function accepts the information that you have (drug_embeddings, the covariate configuration, control gene expression data, and the checkpoint) and returns the predictions.
The prepare function shows one way to obtain these four inputs for the lincs_full_sciplex_genes dataset.

Using new SMILES should also work: you compute embeddings for your new SMILES and supply them via the aforementioned drug_embeddings. Note that during setup the model loads the embeddings that were used during training, so the code might crash if you don't have those, even though the prediction itself can use any embeddings. This will be fixed.
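As a rough illustration of the "compute embeddings for your new SMILES" step, here is a minimal sketch. Everything in it is an assumption made for illustration: build_drug_embeddings and toy_featurizer are hypothetical names that do not come from chemCPA/predict.py, and the toy featurizer only counts characters so that the snippet runs without extra packages. In practice you would plug in the same featurizer that produced the training embeddings (for example the rdkit2D features), and convert the result into whatever container predict.py expects for drug_embeddings.

```python
from typing import Callable, Dict, List, Sequence


def toy_featurizer(smiles: str) -> List[float]:
    """Hypothetical stand-in for a real featurizer (NOT part of chemCPA).

    It only counts characters, so it carries no chemical meaning.
    Replace it with the featurizer used for the training embeddings.
    """
    return [
        float(len(smiles)),
        float(smiles.count("C")),
        float(smiles.count("O")),
        float(smiles.count("N")),
    ]


def build_drug_embeddings(
    smiles_list: Sequence[str],
    featurize: Callable[[str], Sequence[float]],
    expected_dim: int,
) -> Dict[str, List[float]]:
    """Map each SMILES string to a fixed-length embedding vector.

    expected_dim should equal the embedding size the checkpoint was
    trained with; a mismatch is reported early instead of surfacing
    later as a shape error inside the model.
    """
    embeddings: Dict[str, List[float]] = {}
    for smi in smiles_list:
        vec = [float(x) for x in featurize(smi)]
        if len(vec) != expected_dim:
            raise ValueError(
                f"Embedding for {smi!r} has length {len(vec)}, "
                f"expected {expected_dim}"
            )
        embeddings[smi] = vec
    return embeddings


# Two SMILES that are assumed to be absent from the training set.
new_smiles = ["CCO", "CC(=O)O"]
drug_embeddings = build_drug_embeddings(new_smiles, toy_featurizer, expected_dim=4)
```

The dimension check is the part that matters: embeddings for new drugs only make sense to the model if they have the same length and come from the same featurizer as the ones used during training.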

Hopefully this is helpful.

ACDBio · 2025-02-19T00:54:25Z

Hi! Could you add embeddings for testing predict.py from biroscak/predict-method-for-new-dataset (e.g. sciplex_complete_lincs_genes_v2_rdkit2D_embedding.parquet), please?
