scFoundation gene embeddings for GEARS #35

Open
rvinas opened this issue Jun 27, 2024 · 4 comments

Comments

@rvinas

rvinas commented Jun 27, 2024

Hello, thank you for your work and the code. I am trying to understand how the scFoundation embeddings were used within the GEARS framework. In the paper, you mention:

(...) In our method, we obtained gene context embeddings for each cell from the scFoundation decoder and set these embeddings as the nodes in the graph (Methods), resulting in a cell-specific gene co-expression graph for predicting perturbations.

How exactly was the cell-specific gene co-expression graph constructed? I was examining your code and I believe this happens here. Could you clarify what the variable pre_in represents? Am I correct in thinking that the GEARS data loader provides the expression of perturbed single cells in data.x? My understanding from your paper is that the scFoundation embeddings are extracted using control cells only.

Your help would be greatly appreciated!

@WhirlFirst
Collaborator

Hi, for the details of how the gene co-expression graph is constructed, you may also need to read the original GEARS paper: https://www.nature.com/articles/s41587-023-01905-6. pre_in represents the unperturbed cells. We used the same data loader as GEARS. What we did was replace the randomly initialized gene embeddings of the original GEARS model with the contextual gene embeddings from our model. The edges in the gene co-expression graph remain unchanged.
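To make the distinction in this reply concrete, here is a minimal sketch of the two kinds of node features, with the graph edges held fixed in both cases. It is not the scFoundation or GEARS code: the names random_gene_embeddings, toy_contextual_encoder and build_node_features are hypothetical, and the toy encoder is only a placeholder for a pretrained model that maps a cell's expression profile to one vector per gene.

```python
import random


def random_gene_embeddings(num_genes, dim, seed=0):
    """One randomly initialized vector per gene, shared by every cell."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_genes)]


def toy_contextual_encoder(expression, dim):
    """Hypothetical stand-in for a pretrained model.

    Returns one vector per gene that depends on the whole expression
    profile of the cell, so different cells give different vectors.
    """
    total = sum(expression) or 1.0
    return [[(x / total) * (k + 1) for k in range(dim)] for x in expression]


def build_node_features(control_expression, dim, contextual, static_table=None):
    """Pick the node features for one sample; edges are not touched here."""
    if contextual:
        return toy_contextual_encoder(control_expression, dim)
    return static_table


# Co-expression edges between gene indices: identical in both settings.
edges = [(0, 1), (1, 2)]

table = random_gene_embeddings(num_genes=3, dim=4)
cell_a = [0.0, 2.0, 6.0]  # expression of one unperturbed cell (made-up values)
cell_b = [4.0, 4.0, 0.0]  # expression of another unperturbed cell

static_a = build_node_features(cell_a, 4, contextual=False, static_table=table)
static_b = build_node_features(cell_b, 4, contextual=False, static_table=table)
context_a = build_node_features(cell_a, 4, contextual=True)
context_b = build_node_features(cell_b, 4, contextual=True)

print("static features identical across cells:", static_a == static_b)
print("contextual features identical across cells:", context_a == context_b)
```

With the static table, every sample sees the same node features; with the contextual variant, the features change with the unperturbed cell that is fed in, while the edge list stays the same.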

@rvinas
Author

rvinas commented Jun 28, 2024

Thank you for the clarification! I now understand why pre_in represents the unperturbed cells. In the create_cell_graph_dataset function, control cells are sampled at random and their expression is then stored in data.x. Do you have any intuition on why the contextual gene embeddings from scFoundation are helpful for that task, considering that control cells are sampled at random? I wonder why the contextual aspect is important, given that the sampled control cell is unrelated to the perturbed cell.
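The pairing described in this comment can be sketched as follows. This is a simplified illustration under stated assumptions, not the body of create_cell_graph_dataset: the function name pair_with_random_controls and the dictionary keys are hypothetical, and storing the perturbed cell's expression as the target y is an assumption about how the samples are used for training.

```python
import random


def pair_with_random_controls(perturbed_cells, control_cells, num_samples=1, seed=0):
    """Pair each perturbed cell with randomly drawn control cells.

    x holds the expression of a sampled control (unperturbed) cell and
    y holds the expression of the perturbed cell (assumed target).
    """
    rng = random.Random(seed)
    samples = []
    for pert_expr in perturbed_cells:
        for _ in range(num_samples):
            # The control cell is drawn uniformly at random, so it has no
            # particular relationship to the perturbed cell it is paired with.
            ctrl_expr = rng.choice(control_cells)
            samples.append({"x": list(ctrl_expr), "y": list(pert_expr)})
    return samples


# Made-up expression vectors over three genes.
controls = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [2.0, 2.0, 1.0]]
perturbed = [[0.0, 0.0, 3.0], [1.0, 4.0, 0.0]]

for sample in pair_with_random_controls(perturbed, controls, num_samples=2):
    print(sample)
```

Under this pairing, any cell-specific information in x comes from a control cell that was picked independently of the perturbed cell in y, which is the point the question above is raising.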

@WhirlFirst
Collaborator

Glad to hear that you figured out the code. As for the contextual embeddings, I think they offer a more flexible input for the model. This variety in the input data may make it easier for the model to learn the input distribution and predict the results well. Also, the contextual embeddings contain more information about gene expression levels than the randomly initialized ones, which is another gain for prediction.

@rvinas
Author

rvinas commented Jun 28, 2024

I see, thank you for your insights. Did you try conditioning the GEARS model on your learnt, non-contextual gene embeddings (i.e. the gene name embeddings)? In other words, can the performance gain be explained by the quality of the gene embeddings as opposed to the contextual aspect? I am still unsure why it is helpful to condition the model on random control cells.
