Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding the annotaTR tool #226

Merged
merged 80 commits into from
Aug 23, 2024
Merged
Show file tree
Hide file tree
Changes from 77 commits
Commits
Show all changes
80 commits
Select commit Hold shift + click to select a range
15bf91f
adding initial annotatr tool
Aug 7, 2024
586f59e
adding dosage fields to vcf output
Aug 7, 2024
8079a89
draft dosage functions
Aug 7, 2024
5f8cffe
add multiple outputs and add dslen to pvar
Aug 8, 2024
f0cc101
initial version of beagle annotation
Aug 8, 2024
8eaeada
add pgenlib as dependency, remove tensorqtl output for now
Aug 8, 2024
ceafe5c
adding tests for dosages in tr harmonizer
Aug 9, 2024
23b58ef
drafting annotatr readme
Aug 9, 2024
d96b893
updating annotatr docs
Aug 9, 2024
6aff70a
updating annotatr docs
Aug 9, 2024
3ff937f
adding first test of beagle type dosages
Aug 9, 2024
b2eb0ae
adding tests for beagle dosages
Aug 10, 2024
7c9a0f0
adding first annotatr tests
Aug 10, 2024
16c5daa
update cyvcf2 version to have num_records
Aug 10, 2024
f8d41ab
adding outtype tests
Aug 11, 2024
743337e
added unit tests for annotaTR
Aug 11, 2024
8befb77
annotatr fail if no operation
Aug 11, 2024
412c2ed
remove unnecessary vcfwriter function
Aug 11, 2024
d01ce9e
adding documentation to annotatr functions
Aug 11, 2024
082c1d5
updating annotaTR readme
Aug 11, 2024
0e3fbad
adding notes to callers documentation
Aug 11, 2024
853780c
updating pgenlib dependency version
Aug 12, 2024
71459d7
update pgenlib version in dev-env
Aug 12, 2024
fd68767
making annotatr test file smaller
Aug 12, 2024
1804d76
further reducing file sizes
Aug 12, 2024
a858a63
update pgenlib version in dev-env
Aug 12, 2024
fca52fa
updating cyvcf2 version
Aug 12, 2024
d418f0e
adding back modified large associatr files
Aug 12, 2024
1c63128
adding missing docs file for annotatr
Aug 12, 2024
a7651b2
only count dup locus if both ref and alt the same
Aug 12, 2024
48a4261
fix docs typos
Aug 12, 2024
153f553
more docs typo fixes
Aug 12, 2024
d733654
change how we access AP1
Aug 12, 2024
f819d3a
updating how AP1 and AP2 format fields accessed
Aug 13, 2024
264aad5
updating how format is accessed in tests
Aug 13, 2024
f848bcf
updated missing docs file
Aug 13, 2024
bef6c98
adding more options for matching refpanel
Aug 14, 2024
f2fb971
typo in match option
Aug 14, 2024
ca25580
don't load samples in ref panel
Aug 14, 2024
d63bd42
better warning for invalid pgen output
Aug 14, 2024
78136fb
updating annotatr docs
Aug 14, 2024
2754992
add message about loading ref panel
Aug 14, 2024
a92eb27
adding debug option
Aug 14, 2024
b711455
updating how we trim alleles
Aug 14, 2024
a179df8
don't sort alt alleles when getting locuskey
Aug 14, 2024
502393b
updating how we get variant ct for refpanel
Aug 14, 2024
8e2cac6
updating variant count message
Aug 14, 2024
c81b567
Merge branch 'master' into annotaTR
gymreklab Aug 14, 2024
0b80210
bug in psam output
Aug 14, 2024
da575bf
Merge branch 'annotaTR' of https://github.com/gymrek-lab/TRTools into…
Aug 14, 2024
7bfa9b0
typo on readme - annotatr link pointed to prancstr
Aug 15, 2024
2307b19
minor change to annotatr readme
Aug 15, 2024
cd56092
Apply suggestions from code review
gymreklab Aug 22, 2024
a92ae75
Apply suggestions from code review
gymreklab Aug 22, 2024
09690dc
Update doc/CALLERS.rst
gymreklab Aug 22, 2024
55a576e
Update doc/CALLERS.rst
gymreklab Aug 22, 2024
59b901d
Update doc/CALLERS.rst
gymreklab Aug 22, 2024
332dbca
Update trtools/utils/tests/test_trharmonizer.py
gymreklab Aug 22, 2024
8c41978
Update trtools/utils/tests/test_trharmonizer.py
gymreklab Aug 22, 2024
1600de7
Update pyproject.toml
gymreklab Aug 22, 2024
a667a91
adding poetry.lock file suggested from code review
Aug 22, 2024
eaa2a95
make it clear annotatr requires bgzipped and indexed vcf
Aug 22, 2024
313e25e
add test for beagle input but no refpanel specified
Aug 22, 2024
46c47c3
add type suggestion to GetDosages function
Aug 22, 2024
5190f57
Update test/cmdline_tests.sh
gymreklab Aug 22, 2024
fbf1b1b
input set for sample order rather than list
Aug 22, 2024
70f3d58
Merge branch 'annotaTR' of https://github.com/gymrek-lab/TRTools into…
Aug 22, 2024
ab98702
remove numpy typing suggestion
Aug 23, 2024
34f28ee
added check to dosages for case of no samples to return None
Aug 23, 2024
9eb5d3e
expose pgen batch size
Aug 23, 2024
2c349e1
typos in batch size argparse
Aug 23, 2024
667f0ba
clean up checking input vcftype
Aug 23, 2024
2239956
update to checking vcftype
Aug 23, 2024
44d8e94
restore old beagle test files
Aug 23, 2024
30c5646
minor changes to make pvars valid vcfs
Aug 23, 2024
23c2067
add test for vcfs not zipped or indexed
Aug 23, 2024
c7f2b04
adding file tests for annotatr
Aug 23, 2024
4f7def2
updated CEU subset unzip/index files to have variants
Aug 23, 2024
ff84ab3
making unzipped example smaller
Aug 23, 2024
003b0ed
Update trtools/utils/tr_harmonizer.py
gymreklab Aug 23, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,7 @@ TRTools includes the following tools.
* `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_: a tool for testing TR length-phenotype associations (e.g., running a TR GWAS)
* `prancSTR <https://trtools.readthedocs.io/en/stable/source/prancSTR.html>`_: a tool for identifying somatic mosacisim at TRs. Currently only compatible with HipSTR VCF files. (*beta mode*)
* `simTR <https://trtools.readthedocs.io/en/stable/source/simTR.html>`_: a tool for simulating next-generation sequencing reads from TR regions. (*beta mode*)
* `annotaTR <https://trtools.readthedocs.io/en/stable/source/annotaTR.html>`_: a tool for annotating TR VCF files with dosage or other metadata and optionally converting to PGEN output.

Type :code:`<command> --help` to see a full set of options.

Expand Down
34 changes: 14 additions & 20 deletions doc/CALLERS.rst
Original file line number Diff line number Diff line change
Expand Up @@ -83,40 +83,34 @@ TRTools supports TR genotypes produced by any of the above genotypers and then i
in this tool suite, unless it's docs specifically say otherwise, that tool can be used on Beagle VCFs as if those VCFs were produced directly by the underlying TR genotyper,
with no additional flags or arguments needed, as long as the steps below were followed to make sure the Beagle VCF is properly formatted.

Caveats:
Notes and caveats:

* Beagle provides phased best-guess genotypes for each imputed sample at each TR locus. When run with the :code:`ap` or :code:`gp` flags Beagle will also output
probabilities for each possible haplotype/genotype, respectively. These probabilities are also called dosages. While dosages are often more informative for downstream
analyses than the best-guess genotypes located in the :code:`GT` format field (for instance, for association testing), TRTools currently does *not* support dosage
based analyses and instead will only look at the :code:`GT` field. Feel free to submit PRs with features that handle dosages (see the :ref:`Contributing` docs).
probabilities for each possible allele/genotype, respectively. The allele probabilities can be used to compute genotype dosages that take into account imputation uncertainty. The annotaTR utility described below can be used to compute dosages based on the AP fields. These can also be accessed from the TR Harmonizer :code:`TRRecord.GetDosages` function.
* At each locus Beagle returns the most probable phased genotype. This will often but not always correspond to the most probable unphased genotype. For instance,
it is possible that :code:`P(A|A) > P(A|B)` and :code:`P(A|A) > P(B|A)`, but :code:`P(A/A) = P(A|A) < P(A|B) + P(B|A) = P(A/B)`. Similarly, it is possible that
:code:`P(A|B) > P(C|D)` and :code:`P(A|B) > P(D|C)`, but :code:`P(A/B) = P(A|B) + P(B|A) < P(C|D) + P(D|C) = P(C/D)`.
TRTools currently does not take this into
account and just uses the phased genotypes returned by Beagle. If you deem this to be an issue, feel free to submit PRs to help TRTools take this into account
account and just uses the phased genotypes returned by Beagle. If you would like to add functionality to support this, feel free to submit PRs to help TRTools take this into account
(see the :ref:`Contributing` docs).
* For callers which return sequences, not just lengths (e.g. HipSTR), if there are loci with multiple plausible sequences of the same length, then its possible
* For callers which return sequences, not just lengths (e.g. HipSTR), if there are loci with multiple plausible sequences of the same length, then it's possible
that the most probable genotype returned by Beagle does not have the most probable length. For example, the following could be true of a single haplotype:
:code:`Len(S_1) = L_1, Len(S_2) = L_1, Len(S_3) = L_2` and :code:`P(S_1) < P(S_3), P(S_2) < P(S_3)` but :code:`P(S_3) < P(S_1) + P(S_2)`.

An overview of steps to perform before Beagle imputation:
See `here <https://github.com/gymrek-lab/ensembleTR?tab=readme-ov-file#usage-1>`_ for example commands for running Beagle to impute TRs. Some additional notes:

* The samples being imputed into must have directly genotyped loci that are also genotyped in the reference samples. This allows those samples to be 'matched' with samples in the reference.
* The genotypes of both the reference samples and samples of interest must be phased. That can be done by statistically phasing the genotypes prior to running Beagle imputation.
* The referece samples must also not contain any missing genotypes. Possible methods for dealing with that include removing loci with missing genotypes or using imputation to impute
the missing genotypes prior to imputing the TRs.
* VCF files for the samples being imputed into must include genotyped loci (typically SNPs) that are also genotyped in the reference samples. This allows those samples to be 'matched' with samples in the reference.
* The reference samples must be phased and also not contain any missing genotypes. Possible methods for dealing with that include removing loci with missing genotypes or using imputation to impute the missing genotypes prior to imputing the TRs. The reference panel at the link above has already been phased/imputed and should work directly as input to Beagle.

The VCFs that Beagle outputs need to be preprocessed before use by TRTools. We have provided a tool :code:`trtools_prep_beagle_vcf.sh` to run on those VCFs.
After running this script, the files should be usable by any of the tools in TRTools.
The VCFs that Beagle outputs need to be preprocessed before use by TRTools to contain required INFO fields. The :code:`annotaTR` utility provided by TRTools can be used to add this information as well as compute dosages, with the option to output either VCF or PGEN format.
(As of Aug 2024, we had only provided the script :code:`trtools_prep_beagle_vcf.sh` which is still available and can also be used to add the INFO annotations but cannot write to PGEN files or compute dosages). After running either of those tools to annotate Beagle VCFs, the files should be usable by any of the tools in TRTools.

In case of error, it may be useful to know what steps the script attempts to perform:
In case of error, it may be useful to know what steps those tools attempt to perform:

* It copies over source and command meta header lines from the reference panel to the imputed VCF so
that it is clear which genotyper's syntax is being used to represent the STRs in the VCF.
* It copies over contig and ALT lines which is required for downstream tools including mergeSTR and is good practice to include in the VCF header.
* It annotates each STR with the necessary INFO fields from the reference panel that Beagle dropped from the imputed VCF.
* The imputed VCF contains both TR loci and the shared loci (commonly SNPs) that were used for the imputation.
This script removes the non-STR loci (identified as those loci not having STR-specific INFO fields).
* Copy over source and command meta header lines from the reference panel to the imputed VCF so that it is clear which genotyper's syntax is being used to represent the STRs in the VCF.
* Copy over contig and ALT lines which is required for downstream tools including mergeSTR and is good practice to include in the VCF header.
* Annotate each STR with the necessary INFO fields from the reference panel that Beagle dropped from the imputed VCF.
* Removes the non-TR loci (identified as those loci not having TR-specific INFO fields) so that the final output file contains only TRs.

.. _Beagle: http://faculty.washington.edu/browning/beagle/beagle.html
.. _bcftools: https://samtools.github.io/bcftools/bcftools.html
9 changes: 9 additions & 0 deletions doc/UTILITIES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ TRTools offers a variety of command-line tools for performing manipulations to T
source/associaTR.rst
source/prancSTR.rst
source/simTR.rst
source/annotaTR.rst

.. Include directives from tool READMEs

Expand Down Expand Up @@ -51,6 +52,10 @@ TRTools offers a variety of command-line tools for performing manipulations to T
:start-after: overview_directive
:end-before: overview_directive_done

.. include:: ../trtools/annotaTR/README.rst
:start-after: overview_directive
:end-before: overview_directive_done

:doc:`source/dumpSTR`

* |dumpSTR overview|
Expand Down Expand Up @@ -83,3 +88,7 @@ TRTools offers a variety of command-line tools for performing manipulations to T

* |simTR overview|

:doc:`source/annotaTR`

* |annotaTR overview|

1 change: 1 addition & 0 deletions doc/source/annotaTR.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
.. include:: ../../trtools/annotaTR/README.rst
5 changes: 5 additions & 0 deletions doc/trtools.annotaTR.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
trtools.annotaTR module
---------------------------

.. automodule:: trtools.annotaTR
:imported-members:
Binary file added example-files/CEU_subset_unindexed.vcf.gz
Binary file not shown.
Loading
Loading