Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect insertions positions with VCF #86

Open
ChristopheLegendre opened this issue Jun 24, 2023 · 2 comments
Open

Incorrect insertions positions with VCF #86

ChristopheLegendre opened this issue Jun 24, 2023 · 2 comments

Comments

@ChristopheLegendre
Copy link

Hello Nils,

I have been using DWGSIM version 0.1.14.
My goal is to simulate Insertion ranging from length 10 to 200, and being able to add specific sequences of insertion

First test simulation with VCF

In my first try, I used a VCF with -v option and an insertion length of 198, the simulation gave me the correct location for my insertion after aligning the reads with BWA and called the variant with variant callers.
I only put one insertion in my VCF.

Second test of Simulation with VCF

My next try was to used different insertions length with the same workflow tested with my first try. I unfortunately noticed that none of the new insertions were correctly simulated.

  1. The genomic coordinates were shifted upward to the location originally present in the VCF, and the inserted bases got modified compared to my inputted ones.
  2. the recaptured inserted sequences were not the one originally put in the VCF.

### Test simulation with BED

Based on previous issues described in this repo, I decided to give a shot to the BED file as well to simulate my insertions.

After trial and errors in making the BED file, I got the message:
Error: insertion of length 36 exceeded the maximum supported length of 26
but when using the command dwgsim --help, there is a note that mentions that the longest insertion supported is 4294967295, I cite:
Note: The longest supported insertion is 4294967295.

I tried with an insertion of 25 and it worked as expected (I shrank my original insertion to a length of 25 (26 = 1 + 25; i.e., the REF plus all the new 25 inserted bases)).

case 1

if the bed file has the following line,
chr13 28034117 28034143 GATCATATTCATATTCTCTGAAATCA INSERT
G is the REF at position 28034118, here the first G, and the inserted bases I want in are ATCATATTCATATTCTCTGAAATCA, i.e. 25 bases. I get this output in the dwgsim's mutations.vcf file:
chr13 28034117 . A AAATCATATTCATATTCTCTGAAATC 100 PASS AF=1.0;pl=3;mt=INSERT

case 2

If I put the following line in the bed file,
chr13 28034118 28034144 GATCATATTCATATTCTCTGAAATCA INSERT
I get this output in the dwgsim's mutations.vcf file:
chr13 28034119 . A AGATCATATTCATATTCTCTGAAATCA 100 PASS AF=1.0;pl=3;mt=INSERT

case 3

If I put the following line in the bed file
chr13 28034118 **28034143** ATCATATTCATATTCTCTGAAATCA INSERT
I get this output in the dwgsim's mutations.vcf file:
chr13 28034119 . A AATCATATTCATATTCTCTGAAATCA 100 PASS AF=1.0;pl=3;mt=INSERT

Knowing that BED file is 0-based, the case 1 above should have been the correct case to use to get the insertion being inserted right after the G on position 28034118.


Impact of the SEED value on the AF and pl values in the dwgsim's mutations.vcf file

I have been using the -F to control the allele frequency specifically;
With using a BED file and -b option, I noticed that the values of AF were not the 0.5 all the time, but that value changes when using a different seed value; I woulf have expected not to be modified since I stiplated on the command line that I wanted AF=0.5 specifically using the -F option; Unless I am mistaken on the correlation between the two values of -F and AF in the mutations output file.




Questions:

  1. Is there anything wrong with the way the INSERT line in the BED file is constructed?
  2. Is there any option that could make dwgsim working with a BED file and with insertions longer than 26?
  3. Is there any option that should be adjusted to make it work with a VCF using -v?
  4. Is AF only fixed and not dependent on the seed value when using the -v option and not the -b ?
  5. Is the longest insertion of 4 billion bases only supported when using a VCF? otherwise what is the max length of insertion supported when using a VCF?
  6. Could you let me know what would be the best way (i.e. dwgsim command) to simulate exactly these insertions? Same exact coordinate and same exact sequence as the one in the provided BED or VCF. Thanks.

I am providing an example of one of my VCF with an insertion of 81 bases.
The Reference genome was only the complete sequence of the chromosome 13.

Homo sapiens chromosome 13, GRCh38.p14 Primary Assembly
NCBI Reference Sequence: NC_000013.11

Example of dwgsim command

dwgsim -d 486 -s 95 -1 150 -2 150 -C 20.0 -y 0 -R 0 -X 0 -I 30 -n 0 -r 0.00000000000000001 -c 0 -M -S 2 -z 28034113 -o 1 -F 0.50    -b   insertion81.bed   GRCh38_chr13_complete_sequence.fa    myoutPrefix

Let me know what further information you may need.
Thanks for your feedback,
Chris.
inputs_insertion.zip

@nh13
Copy link
Owner

nh13 commented Jun 24, 2023

I’ll be honest that I have very little time for open source unpaid support given my work and family commitments. My apologies in advance, and feel free to reach out to www.fulcrumgenomics.com if you’d like paid support, though we are also booked out pretty far. I’ll keep this open so I can get to it if I am able to come up for air.

@ChristopheLegendre
Copy link
Author

Thanks for your quick and honest reply.

Chris from TGen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants