Error when generating variants where there is a degenerate symbol in the reference #122

Open
dani-ture opened this issue Jul 11, 2024 · 6 comments


dani-ture commented Jul 11, 2024

Describe the bug

It looks like NEAT ran into a “Y” in the reference sequence while generating variants and aborted the variant generation process.

To Reproduce

Steps to reproduce the behavior:

  1. Download the latest human reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

  2. Make a copy of the provided template config file (I called it test_config_human.yml) and set the parameters:

    reference: <path_to_GRCh38_latest_genomic.fna>
    target_bed: <path_to_bed_file>
    produce_vcf: true
    produce_fastq: false
    rng_seed: 6386514007882411

    All other parameters are left at their default value, “.”.

  3. Run neat on the command line:

    neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config_human.yml -o test

Expected behavior

Generate variants and output them to a vcf file.

Desktop:

  • OS: Linux
  • Browser: Chrome
  • Version: 4.2.2


Collaborator

joshfactorial commented Jul 11, 2024

Yeah, I've never seen a "Y" in the reference before. I can investigate how to handle that. For now, I would just do something like sed -i 's/Y/N/g' genome.fa to swap out Y's for N's and see if it runs okay.
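
A note on this workaround: a plain Y-to-N substitution over the whole file only touches uppercase Y, and it also edits FASTA header lines, so any Y in a sequence name or description becomes N as well. Below is a minimal Python sketch of the same idea that skips header lines and covers the other IUPAC ambiguity codes. The function names and file paths are made up for illustration, and it assumes the tool accepts A, C, G, T and N; it is not taken from the NEAT codebase.

```python
import re

# IUPAC ambiguity codes other than N, in both cases.
# Assumption: the downstream tool accepts only A, C, G, T and N.
AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_line(line: str) -> str:
    """Replace ambiguity codes with N (or n) on sequence lines only."""
    if line.startswith(">"):
        # Header line: leave names such as "chromosome Y" untouched.
        return line
    return AMBIGUOUS.sub(lambda m: "N" if m.group().isupper() else "n", line)


def mask_fasta(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, masking one line at a time."""
    # newline="" keeps the original line endings byte-for-byte.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(mask_line(line))
```

Calling mask_fasta("genome.fa", "genome.masked.fa") (hypothetical paths) writes a masked copy and leaves the original file alone. Each character is replaced one-for-one, so sequence lengths and line widths do not change; a purely position-based index such as a samtools faidx .fai file should therefore still describe the masked file, while content-based indexes (aligner indexes, for example) would need rebuilding.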

Author

dani-ture commented Jul 11, 2024

I've been inspecting the ref files and apparently there are some ambiguous characters like K, Y, M, R, W... I guess I'll just have to preprocess them as you suggest. I don't know if I would have to reindex the human ref genome afterwards. Thanks!
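
For anyone who wants to check their own reference before preprocessing, here is a rough Python sketch that counts every character other than A, C, G, T and N per sequence. The function name is hypothetical and the parsing is deliberately simple (plain-text FASTA, name taken as the first word of the header line).

```python
from collections import Counter

# Characters treated as unambiguous; everything else is reported.
STANDARD = set("ACGTNacgtn")


def count_nonstandard(lines):
    """Return {sequence_name: Counter of non-ACGTN characters}."""
    counts = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Use the first whitespace-separated word as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = Counter()
        elif name is not None:
            counts[name].update(ch for ch in line if ch not in STANDARD)
    return counts
```

It accepts any iterable of lines, so an open file handle works: with open("genome.fa") as fh: report = count_nonstandard(fh). Sequences whose counter comes back empty contain no ambiguity codes.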

Collaborator

joshfactorial commented Jul 11, 2024 via email

@dani-ture
Author

It is indeed the DNA reference, but a few of these degenerate bases are scattered throughout it to indicate variation or uncertainty in the assembly.

You can read more here: https://en.wikipedia.org/wiki/Nucleic_acid_notation
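
For quick reference, the notation described on that page can be written as a small lookup table. This is only a sketch for DNA symbols; the names IUPAC_DNA, possible_bases and is_degenerate are illustrative and not part of NEAT.

```python
# IUPAC nucleotide codes for DNA: each symbol maps to the bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",   # strong pairing
    "W": "AT",   # weak pairing
    "K": "GT",   # keto
    "M": "AC",   # amino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def possible_bases(symbol: str) -> str:
    """Return the bases a symbol can represent (case-insensitive)."""
    return IUPAC_DNA[symbol.upper()]


def is_degenerate(symbol: str) -> bool:
    """True when the symbol stands for more than one base."""
    return len(possible_bases(symbol)) > 1
```

So the “Y” from the original report means "C or T", and the K, M, R and W mentioned earlier each stand for two possible bases as well.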

@joshfactorial
Collaborator

Okay, I guess I just haven't run into those in the wild yet.

@joshfactorial
Collaborator

You might try hg19 or some other older version of the reference.
