
single_snp with large number of SNPs #51

Open
snowformatics opened this issue Jul 17, 2024 · 5 comments

@snowformatics

Hi Carl,

I have some datasets with > 5 million SNPs (but < 500 samples and 1 phenotype). A run with single_snp takes more than 8 hours. Is there any way to speed things up?

Thanks

@CarlKCarlK

CarlKCarlK commented Jul 17, 2024 via email

@snowformatics
Author

Hi Carl,

thanks for the reply, here are more details:

  • Are you doing leave-one-out-chromosome?
    Yes, we use the default settings.
  • For how many chromosomes? (Sorry, I forgot if this is people, mice, or something else)
    We are working with plants, 5-22 chromosomes.
  • Is your machine's CPU not at 100% when you do a single_snp run?
    It's at 100%
  • And/or do you have access to multiple machines?
    I have access to a GPU and a Slurm cluster.

Thanks

@CarlKCarlK

CarlKCarlK commented Jul 19, 2024

Below is some Python code that will let you run single_snp in parts. Put the code in a file such as run_in_parts.py. Change the names of the input files (and add a covariate file if you use one).

To divide the work into 1000 parts and run part index 0, you do:

python run_in_parts.py 0 1000

This will produce the output file result.0of1000.tsv. I'd suggest doing this for several indexes on your local machine to check that it works and that each part runs in a reasonable amount of time.

I don't know that much about Slurm, but the idea is to run this on the cluster. Set the number of parts to something reasonable for the cluster (maybe 10, 20, 100, or 1000; it really depends on the cluster's policies). Be sure the input and output files will work on the cluster (by keeping them in some shared space). You also need a way to have fastlmm installed on the cluster for your job. Then you have Slurm run the jobs. For example, if you want to run in 10 parts, you'd want 10 Slurm tasks of the following (a sketch of a possible sbatch submission script follows the list):

python run_in_parts.py 0 10
python run_in_parts.py 1 10
python run_in_parts.py 2 10
python run_in_parts.py 3 10
python run_in_parts.py 4 10
python run_in_parts.py 5 10
python run_in_parts.py 6 10
python run_in_parts.py 7 10
python run_in_parts.py 8 10
python run_in_parts.py 9 10
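
If your cluster supports Slurm job arrays, a minimal sketch of a submission script might look like the one below. The job name, time limit, and log file pattern are placeholders you would need to adapt to your cluster's policies; the only essential piece is that each array task passes its own index as the part index.

#!/bin/bash
#SBATCH --job-name=single_snp_parts
#SBATCH --array=0-9              # one array task per part (0 .. part_count - 1)
#SBATCH --time=04:00:00          # placeholder time limit; adjust for your data
#SBATCH --output=part_%a.log     # one log file per array task

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; use it as the part index.
python run_in_parts.py ${SLURM_ARRAY_TASK_ID} 10

You would submit this once (for example, sbatch run_in_parts.sbatch) and Slurm would schedule the 10 tasks for you.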

And the output will be 10 tab-separated output files that you could merge and sort to see the results, with Pandas or other tools (a small merge sketch follows the script below).

import argparse
from fastlmm.association import single_snp
from pysnptools.snpreader import Bed

def main(part_count, part_index):
    # File paths
    bed_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.bed"
    pheno_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.phe"
    
    # Load the BED and phenotype data
    bed = Bed(bed_file, count_A1=True)

    # Calculate start and end indexes for the current part
    snp_start = bed.sid_count * part_index // part_count
    snp_end = bed.sid_count * (part_index + 1) // part_count

    # Slice the BED file to get the SNPs for the current part
    test_snps = bed[:, snp_start:snp_end]

    # Perform single SNP association test and save results to a file
    output_file_name = f"result.{part_index}of{part_count}.tsv"
    single_snp(test_snps=test_snps, pheno=pheno_file, K0=bed, output_file_name=output_file_name)
    print(f"Results saved to {output_file_name}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Run single SNP association test on a partition of SNPs.')
    parser.add_argument('part_index', type=int, help='Index of the current part (0-based).')
    parser.add_argument('part_count', type=int, help='Total number of parts to divide the SNP data into.')
    
    args = parser.parse_args()
    
    main(args.part_count, args.part_index)
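
For the merge step mentioned above, a small sketch with Pandas could look like this. It assumes the part files sit in the working directory and that the standard single_snp output columns, including PValue, are present.

import glob
import pandas as pd

# Read every part file (e.g., result.0of10.tsv ... result.9of10.tsv),
# concatenate them, and sort by p-value so the strongest associations come first.
part_files = sorted(glob.glob("result.*of10.tsv"))
frames = [pd.read_csv(f, sep="\t") for f in part_files]
merged = pd.concat(frames, ignore_index=True)
merged = merged.sort_values("PValue")  # assumes the usual single_snp PValue column
merged.to_csv("result.all.tsv", sep="\t", index=False)
print(f"Merged {len(part_files)} files into result.all.tsv ({len(merged)} rows)")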

@CarlKCarlK

CarlKCarlK commented Jul 19, 2024 via email

@snowformatics
Author

Thanks a lot Carl, I will test it!
