-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Processing of off-target reference regions using target bed file #114
Comments
Thanks for submitting this bug, I will take a look hopefully this weekend. |
I think this ticket is fixed now. Please reopen if you have further issues. |
Hi, thanks for the new release, all other issues seem to have been fixed. However I am still having problems with this one. |
are you using a publicly available bed file? Or can you send me a sample of the bed file? |
|
Right, well, it still needs to generate reads, even if there are no mutations (unless you skip fastq and bam files). The way it works is it generates variants first, then generates reads, to which it adds the variants (and random sequencing errors). The sampling reads and covering datasets functions are related to generating the reads themselves. It does check the bed files when generating reads, and throws out any reads that fall outside of the bed regions. It does generate variants for the entire dataset as a first pass. It seemed like that part went fast enough that it wasn't critical to speed up, but we haven't tested on a full human dataset yet. I can look into filtering out bed regions. But for a given read, there may be differences from the reference (due to sequencing errors), but if you want to by pass read generation completely, you can try setting produce_fastq and produce_bam to false. I think the final vcf output will be filtered for the bam file (or it should be). I will look into generating the mutations, in terms of being able to skip areas that aren't in a bed file, to see if that speeds things up. |
Thank you so much for explaining in detail. As you mentioned, generating mutations does not take that long, the critical step is the read generation one. It would be ideal if neat focused on generating reads directly in the bed regions instead of covering the full genome/chromosome and then discarding those reads that do not land in the bed regions. I will try what you suggest, but I really wanted the fastq file to benchmark a variant calling pipeline and see if it agrees with my input_vcf and neat's ground truth. |
Right, the problem we keep running into is fragment sizes. For small fragments, coverage gets stuck trying to generate enough valid reads to cover the dataset, especially in paired-ended mode. The initial solution then was to generate end points for all the reads, then throw out the ones we don't need before doing any actual processing. This method may not work as well for longer chromosomes, because it requires looping through the length of the chromosome. Trying to find a balance or a solution that works for both long and short read regions is proving to be a challenge. |
Describe the bug
It seems like neat processes regions in the reference that I did not explicitly include in the target bed file. I want to simulate reads for some cancer related genes using the full human reference genome and a target bed file, but before using the complete bed file I wanted to test it with just 2 regions of chromosome 2. However, it seems like neat generates random mutations and samples reads for other chromosomes too. I don’t know if they will be written to the fastq file in the end, but these unnecessary steps makes the process much slower. Additionally, neat also generated mutations for the whole chromosome 2 and then filtered out the ones that landed in regions that were not included in the bed file.
To Reproduce
Steps to reproduce the behavior:
Download the latest human reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
Make a copy of the provided template config file (I called it test_config.yml) and set the parameters:
‘’’reference: <path_to_GRCh38_latest_genomic.fna>
target_bed: <path_to_bed_file>
rng_seed: 6386514007882411’’’
The rest are left with the “.” as default.
Run neat on the command line:
neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config.yml -o test
Expected behavior
Maybe process just targeted regions to accelerate operations.
Additional comments
Desktop:
The text was updated successfully, but these errors were encountered: