single_snp with large number of SNPs #51
Hi Carl,
I have some datasets with > 5 million SNPs (but < 500 samples and 1 phenotype). The run with single_snp takes more than 8 hours. Is there any way to speed up things?
Thanks

Comments
Greetings,
* Is there any way to speed up things?
Yes, (probably, kind of)
* Are you doing leave-one-out-chromosome?
* For how many chromosomes? (Sorry, I forgot if this is people, mice, or something else)
* Is your machine's CPU not at 100% when you do a single_snp run?
* And/or do you have access to multiple machines?
After I get more information, I can give you more details, but the general idea is:
* (optional) Precompute the similarity matrix for each leave-one-out chromosome
* Run different parts of the test SNPs on different processors on one machine (see the sketch after this list)
* Run different parts of the test SNPs on different machines.
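For the "different processors on one machine" item, here is a minimal sketch (not part of the original exchange) that splits the test SNPs into index ranges and runs each range in its own process with the standard library's concurrent.futures, reusing the same Bed / single_snp calls as the script later in this thread. The file paths, the part_count value, and the "PValue" column name are assumptions to adapt; note that if one single_snp run already keeps the CPU at 100%, extra processes on the same machine may not gain much.

import concurrent.futures

import pandas as pd
from fastlmm.association import single_snp
from pysnptools.snpreader import Bed

BED_FILE = "mydata.bed"    # placeholder paths; point these at your own data
PHENO_FILE = "mydata.phe"


def run_range(snp_start, snp_end):
    # Each worker process opens the data itself, tests only its slice of SNPs,
    # and still uses the full Bed file for the similarity matrix (K0).
    bed = Bed(BED_FILE, count_A1=True)
    return single_snp(test_snps=bed[:, snp_start:snp_end], pheno=PHENO_FILE, K0=bed)


if __name__ == "__main__":
    part_count = 4  # e.g. one part per core; adjust for your machine
    sid_count = Bed(BED_FILE, count_A1=True).sid_count
    bounds = [sid_count * i // part_count for i in range(part_count + 1)]

    with concurrent.futures.ProcessPoolExecutor(max_workers=part_count) as pool:
        frames = list(pool.map(run_range, bounds[:-1], bounds[1:]))

    # "PValue" is FaST-LMM's p-value column; sort so the top hits come first.
    pd.concat(frames).sort_values("PValue").to_csv("results.all.tsv", sep="\t", index=False)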
* Carl
Hi Carl, thanks for the reply, here are more details:
* Are you doing leave-one-out-chromosome?
Yes, we use the default settings.
* For how many chromosomes? (Sorry, I forgot if this is people, mice, or something else)
We are working with plants, 5-22 chromosomes.
* Is your machine's CPU not at 100% when you do a single_snp run?
It's at 100%.
* And/or do you have access to multiple machines?
I have access to a GPU and a Slurm cluster.
Thanks
Below is some Python code that will let you run single_snp in parts. Put the code in a file such as run_in_parts.py (the name used in the commands below).

To divide the work into 1000 parts and run part index 0, you do:

python run_in_parts.py 0 1000

This will produce the output file result.0of1000.tsv.

I don't know that much about Slurm, but the idea is to run this on the cluster. Set the number of parts to something reasonable for the cluster (maybe 10, 20, 100, or 1000; it really depends on the cluster's policies). Be sure the input and output files will work for the cluster (by putting them in some shared space). You also need a way to have fastlmm installed on the cluster for your job. Then you have Slurm run the jobs. For example, if you want to run in 10 parts, then you'd want 10 Slurm tasks of:

python run_in_parts.py 0 10
python run_in_parts.py 1 10
python run_in_parts.py 2 10
python run_in_parts.py 3 10
python run_in_parts.py 4 10
python run_in_parts.py 5 10
python run_in_parts.py 6 10
python run_in_parts.py 7 10
python run_in_parts.py 8 10
python run_in_parts.py 9 10

And the output will be 10 tab-separated output files that you can merge and sort to see the results (with Pandas or other tools; see the sketch after the script).

import argparse
from fastlmm.association import single_snp
from pysnptools.snpreader import Bed
def main(part_count, part_index):
    # File paths
    bed_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.bed"
    pheno_file = r"O:\programs\pysnptools\pysnptools\examples\toydata.phe"

    # Load the BED and phenotype data
    bed = Bed(bed_file, count_A1=True)

    # Calculate start and end indexes for the current part
    snp_start = bed.sid_count * part_index // part_count
    snp_end = bed.sid_count * (part_index + 1) // part_count

    # Slice the BED file to get the SNPs for the current part
    test_snps = bed[:, snp_start:snp_end]

    # Perform single SNP association test and save results to a file
    output_file_name = f"result.{part_index}of{part_count}.tsv"
    single_snp(test_snps=test_snps, pheno=pheno_file, K0=bed, output_file_name=output_file_name)
    print(f"Results saved to {output_file_name}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Run single SNP association test on a partition of SNPs.')
    parser.add_argument('part_index', type=int, help='Index of the current part (0-based).')
    parser.add_argument('part_count', type=int, help='Total number of parts to divide the SNP data into.')
    args = parser.parse_args()
    main(args.part_count, args.part_index)
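To merge and sort the per-part files, here is a minimal Pandas sketch (not part of the original exchange). The result.*of10.tsv pattern matches the names written by run_in_parts.py above for a 10-part run; the "PValue" column name and the merged file name are assumptions.

import glob

import pandas as pd

# Collect every per-part result file written by run_in_parts.py.
part_files = sorted(glob.glob("result.*of10.tsv"))

# Stack them into one frame and sort by p-value so the strongest hits come first.
merged = pd.concat((pd.read_csv(f, sep="\t") for f in part_files), ignore_index=True)
merged.sort_values("PValue").to_csv("result.all.tsv", sep="\t", index=False)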
I wrote a reply but accidentally sent it unfinished, so be sure to read it on GitHub for the final version.
Thanks a lot Carl, I will test it!