about the masking procedure #248

FUminlee · 2024-09-07T00:11:16Z

Hi scGPT team,

Your work is quite amazing and I am interested in your pre-training procedure! Could I ask several questions?

In the pretrain.py you provide, the default training_tasks is "both", which means that args.mask_ratio will be set to [0.25, 0.50, 0.75]. Also, the DataCollator will use "_call_both" in its call step. However, "_call_both" doesn't seem to include a "_mask" operation. How, then, does training randomly mask some of the values?

Best,

subercui · 2024-09-07T00:20:50Z

Hi, thank you for the question! I think you mean the code around here:

scGPT/scgpt/data_collator.py

Lines 348 to 355 in 4068d67

    
        data_dict = {
            "pcpt_gene": padded_pcpt_genes,
            "pcpt_expr": padded_pcpt_expressions,
            "gen_gene": padded_gen_genes,
            "gen_expr_target": padded_gen_expressions,
        }
        return data_dict

During pretraining, the "gen_gene" entries can be considered unknown to the model: their expression values, stored in "gen_expr_target", are not given to the model as input. The fraction of genes assigned to "gen_gene" is set by the mask ratio, [0.25, 0.50, 0.75].
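
To make the idea concrete, here is a minimal sketch of how a random split can play the role of masking. It is not the actual scGPT implementation: the helper name `split_pcpt_gen`, the use of NumPy, the omission of padding, and the choice to draw one ratio per call are all assumptions made for illustration. Only the four dictionary keys come from the snippet above.

```python
import numpy as np


def split_pcpt_gen(genes, expressions, gen_ratio, rng):
    """Hypothetical helper (not from scGPT): split one cell's genes at random.

    Genes placed in the "gen" part have their expression withheld from the
    model input and kept only as a target, which acts as masking.
    """
    n = len(genes)
    n_gen = int(round(n * gen_ratio))
    perm = rng.permutation(n)  # random order of gene positions
    gen_idx, pcpt_idx = perm[:n_gen], perm[n_gen:]
    return {
        "pcpt_gene": genes[pcpt_idx],
        "pcpt_expr": expressions[pcpt_idx],
        "gen_gene": genes[gen_idx],
        "gen_expr_target": expressions[gen_idx],  # target only, never an input
    }


rng = np.random.default_rng(0)
genes = np.arange(8)                              # toy gene ids
expressions = rng.random(8).astype(np.float32)    # toy expression values
# Assumption: one ratio is drawn from the configured list for each call.
ratio = rng.choice([0.25, 0.50, 0.75])
out = split_pcpt_gen(genes, expressions, ratio, rng)
```

In this sketch every gene lands in exactly one of the two parts, so no separate "_mask" step is needed: the model would see "pcpt_gene", "pcpt_expr" and "gen_gene", while "gen_expr_target" is held back.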

FUminlee · 2024-09-07T01:00:42Z

Thanks for the reply! Yes, that is the part I was looking into. So does "random_split" split the data into "pcpt_gene" and "gen_gene", which is why "_call_both" doesn't need a "_mask" function to implement masking?

Could I ask another question? As I understand it, the ratio of "gen_genes" is set by [0.25, 0.50, 0.75] so that at each step, 25% of the genes will be unmasked. However, "random_split" randomly splits all of the genes into "gen_genes" and "pcpt_genes". In this case, how do you ensure that the data_collator does not mask genes that were already in "pcpt_gene" before? (For example, at the second step, gene A is one of the 25% of genes that are unmasked, but in the next step, gene A becomes one of the 50% of genes that are masked.)

I really appreciate your time and help!

Best,
Fumin
