about the masking procedure #248
Comments
Hi, thank you for the question! I think you mean the code around here: Lines 348 to 355 in 4068d67
In the pretraining, these "gen_genes" can be considered unknown to the model. Their expression, as in "gen_expr_target", will not be input to the model. The ratio of "gen_genes" is drawn from [0.25, 0.50, 0.75].
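For readers following along, here is a minimal sketch of how such a perception/generation split can play the role of masking: the expression values of the "gen" genes are held out as prediction targets and never fed to the model, while their gene ids may still be presented as queries. This is not scGPT's actual implementation; names such as `split_pcpt_gen` and `gen_ratios` are illustrative only.

```python
import torch

gen_ratios = [0.25, 0.50, 0.75]  # candidate fractions of genes treated as "unknown"

def split_pcpt_gen(gene_ids: torch.Tensor, expr: torch.Tensor):
    """Randomly split one cell's genes into a perceived part and a generated part."""
    ratio = gen_ratios[torch.randint(len(gen_ratios), (1,)).item()]
    n_genes = gene_ids.shape[0]
    n_gen = int(n_genes * ratio)
    perm = torch.randperm(n_genes)
    gen_idx, pcpt_idx = perm[:n_gen], perm[n_gen:]

    pcpt_genes, pcpt_expr = gene_ids[pcpt_idx], expr[pcpt_idx]  # visible to the model
    gen_genes = gene_ids[gen_idx]        # gene ids can still be queried
    gen_expr_target = expr[gen_idx]      # labels only, never fed as input
    return pcpt_genes, pcpt_expr, gen_genes, gen_expr_target

# Toy example: one cell with 10 genes
genes = torch.arange(10)
expr = torch.rand(10)
pcpt_g, pcpt_e, gen_g, gen_target = split_pcpt_gen(genes, expr)
```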
Thanks for the reply! Yes, that is the part I was looking into. In this case, does "random_split" divide the dataset into "pcpt_gene" and "gen_gene", so that the "_call_both" path does not need a "_mask" function to implement masking? Could I ask another question? Under my understanding, the ratio of "gen_genes" is set by [0.25, 0.50, 0.75], so that at each step 25% of the genes will be unmasked. However, "random_split" randomly splits the whole set of genes into "gen_genes" and "pcpt_genes". In this case, how do you ensure that the data_collator does not mask genes that were already in "pcpt_gene" before? (e.g., at the second step gene A is one of the 25% of genes that are unmasked, but in the next step gene A becomes one of the 50% of genes that are masked.) I really appreciate your time and help! Best,
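As an aside for readers, the scenario raised in the comment above can be reproduced with the toy split from the earlier sketch, under the assumption (not confirmed in this thread) that the split is re-drawn independently every time the collator is called, with no bookkeeping of which genes were perceived at previous steps:

```python
import torch

def gene_status(gene_id: int, n_genes: int = 10, ratio: float = 0.50) -> str:
    """Report whether a given gene falls into the 'gen' (held-out) or 'pcpt' (visible) part."""
    perm = torch.randperm(n_genes)
    n_gen = int(n_genes * ratio)
    is_gen = bool((perm[:n_gen] == gene_id).any())
    return "gen (held out)" if is_gen else "pcpt (visible)"

# Because the permutation is re-sampled on each call, gene 0 can switch sides between steps.
for step in range(3):
    print(f"step {step}: gene 0 is {gene_status(0)}")
```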
Hi scGPT team,
Your work is quite amazing and I am interested in your pre-training procedure! Could I ask several questions?
In the pretrain.py you provide, the default training_tasks is "both", which means that args.mask_ratio will be set to [0.25, 0.50, 0.75]. Also, the DataCollator will dispatch to "_call_both" in its call. However, it seems that "_call_both" doesn't have a "_mask" operation. How does the whole training then randomly mask some values?
Best,