about the masking procedure #248

Open
FUminlee opened this issue Sep 7, 2024 · 2 comments

Comments

FUminlee commented Sep 7, 2024

Hi scGPT team,

Your work is quite amazing and I am interested in your pre-training procedure! Could I ask several questions?

In the pretrain.py you provide, the default training_tasks is "both", which means that args.mask_ratio will be set to [0.25, 0.50, 0.75]. Also, the DataCollator will use "_call_both" in its "__call__" part. However, "_call_both" does not seem to include a "_mask" operation. How, then, does training randomly mask some values?

Best,

Member

subercui commented Sep 7, 2024

Hi, thank you for the question! I think you mean the code around here:

    data_dict = {
        "pcpt_gene": padded_pcpt_genes,
        "pcpt_expr": padded_pcpt_expressions,
        "gen_gene": padded_gen_genes,
        "gen_expr_target": padded_gen_expressions,
    }
    return data_dict

In pretraining, these "gen_genes" can be considered unknown to the model. Their expression values, stored in "gen_expr_target", are not given to the model as input. The ratio of "gen_genes" is set by [0.25, 0.50, 0.75].
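
To make the idea concrete, here is a minimal sketch of such a split. It is not the actual scGPT collator code: the helper name `split_pcpt_gen`, its arguments, and the toy gene names are hypothetical, and padding and tokenization are left out. It only shows that splitting genes into a perceived part and a generated part, and withholding the generated expressions from the input, acts as masking without a separate "_mask" step.

```python
import random


def split_pcpt_gen(genes, exprs, gen_ratio, rng):
    """Randomly split one cell's genes into perceived and generated parts.

    Hypothetical helper for illustration only; not the scGPT implementation.
    """
    n_gen = int(round(len(genes) * gen_ratio))
    idx = list(range(len(genes)))
    rng.shuffle(idx)
    gen_idx = sorted(idx[:n_gen])    # genes treated as unknown to the model
    pcpt_idx = sorted(idx[n_gen:])   # genes whose expression the model sees
    return {
        "pcpt_gene": [genes[i] for i in pcpt_idx],
        "pcpt_expr": [exprs[i] for i in pcpt_idx],
        "gen_gene": [genes[i] for i in gen_idx],
        "gen_expr_target": [exprs[i] for i in gen_idx],
    }


rng = random.Random(0)
genes = ["g0", "g1", "g2", "g3", "g4", "g5", "g6", "g7"]  # toy gene names
exprs = [1.0, 0.0, 3.0, 2.0, 5.0, 0.0, 1.0, 4.0]          # toy expression values

batch = split_pcpt_gen(genes, exprs, gen_ratio=0.25, rng=rng)

# Only these keys would be passed to the model as input in this sketch;
# "gen_expr_target" is kept aside and used only as the prediction target.
model_inputs = {k: batch[k] for k in ("pcpt_gene", "pcpt_expr", "gen_gene")}
```

In this sketch the model receives the identities of the "gen_gene" entries but not their expression values, which is what makes them effectively masked.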

Author

FUminlee commented Sep 7, 2024

Thanks for the reply! Yes, that is the part I was looking at. So, does "random_split" split the data into "pcpt_gene" and "gen_gene", so that "_call_both" does not need a "_mask" function to implement masking?

Could I ask another question? As I understand it, the ratio of "gen_genes" is set by [0.25, 0.50, 0.75], so that at each step 25% of the genes will be unmasked. However, "random_split" randomly splits all the genes into "gen_genes" and "pcpt_genes". In this case, how do you ensure that data_collator does not mask genes that were already in "pcpt_gene" before? (E.g., at the second step, gene A is one of the 25% of genes that are unmasked, but in the next step gene A becomes one of the 50% of genes that are masked.)

I really appreciate your time and help!

Best,
Fumin
