- VIA automates the process of variant identification, analyzing family data and returning possible candidates based on a range of specific filters and models of inheritance.
- The bundled phenotype-to-genes file is an older version with 7 columns. It is included as a compressed file and must be decompressed before use.
- For users on the CUMC HPC, the unzipped phenotype-to-genes file is available upon request.
- Data Pre-Requisites
- cleaned data file (output of Step 4 Post Processing)
- pedigree file
- phenotype file (optional)
- System Pre-Requisites
Notes on cleaned data file:
- Remove shifted rows from Annovar output (typically 1-10 in large file; check Otherinfo10; should all be .)
- Add headers with sample names (from VCF file)
- Replace all 0|1, 1|1 and 0|0 with 0/1, 1/1 and 0/0
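The genotype replacement in the last note can be sketched in a few lines of Python. This is only an illustration: the helper name `unphase_line` and the `genotype:depths` layout of the example fields are assumptions, not part of VIA.

```python
# Hypothetical helper: rewrite the three phased genotype codes as unphased ones.
PHASED_TO_UNPHASED = {"0|1": "0/1", "1|1": "1/1", "0|0": "0/0"}


def unphase_line(line: str) -> str:
    """Replace 0|1, 1|1 and 0|0 with 0/1, 1/1 and 0/0 in one line of text."""
    for phased, unphased in PHASED_TO_UNPHASED.items():
        line = line.replace(phased, unphased)
    return line


# Example sample fields; the "genotype:depths" layout is assumed for illustration.
example = "0|1:12,9\t1|1:0,20\t0|0:18,0"
converted = unphase_line(example)
print(converted)
```

Applying the helper to every line of the cleaned file before running VIA would cover this step; a `sed` one-liner would work equally well.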
To install the application, clone this GitHub repository on your machine. If you have git installed on your computer, you can do this with the following command:
git clone https://github.com/yr542/variant-filtering.git
This file should have the following columns in the following order:
- Family_ID (e.g. FIN1)
- individual_ID (e.g. FIN1.2)
- Status (e.g. Mother, Father, Child, Sibling, Other)
- Sex (e.g. Male, Female)
- Phenotype (Affected OR Unaffected)
An example would be a tab-delimited table like this:
Family_ID | individual_ID | Status | Sex | Phenotype |
---|---|---|---|---|
FIN1 | FIN1-1 | Father | Male | Unaffected |
FIN1 | FIN1-2 | Mother | Female | Unaffected |
FIN1 | FIN1-3 | Child | Male | Affected |
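As a quick sanity check before running VIA, a pedigree file in this layout could be read and grouped by family roughly as follows. This is a sketch using only the standard library; it assumes the file has a header row with exactly the column names listed above, and `read_pedigree` is a hypothetical helper rather than a VIA function.

```python
import csv
import io
from collections import defaultdict

PED_COLUMNS = ["Family_ID", "individual_ID", "Status", "Sex", "Phenotype"]


def read_pedigree(handle):
    """Group tab-delimited pedigree rows by Family_ID, checking the column order."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != PED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    families = defaultdict(list)
    for row in reader:
        families[row["Family_ID"]].append(row)
    return dict(families)


# The example table above, written out as tab-delimited text.
example = (
    "Family_ID\tindividual_ID\tStatus\tSex\tPhenotype\n"
    "FIN1\tFIN1-1\tFather\tMale\tUnaffected\n"
    "FIN1\tFIN1-2\tMother\tFemale\tUnaffected\n"
    "FIN1\tFIN1-3\tChild\tMale\tAffected\n"
)
families = read_pedigree(io.StringIO(example))
print(sorted(families))
```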
This file should be the exact output of the Mendelian_filtering_WORK.Rmd script. Within the defined output format of this script, the important columns used by VIA are:
- Chr (Chromosome number)
- AF_popmax, PopFreqMax, GME_AF, Kaviar_AF, and abraom_freq (population allele frequencies according to a variety of sources)
- CLNSIG (clinical significance of the variant, i.e. benign vs pathogenic)
- < Individual ID > (each individual has a column that contains the allelic depth, the zygosity, etc. for each gene)
This file should have the following columns in the following order:
- Family_ID (e.g. FIN5)
- HPO (e.g. HP:0000365)
An example would be a tab-delimited table like this:
Family_ID | HPO |
---|---|
FIN1 | HP:0000365 |
FIN10 | HP:0001249 |
FIN13 | HP:0001249,HP:0012758,HP:0000252 |
FIN15 | HP:0001251,HP:0001252 |
Notice that when a family has multiple HPO numbers they must be separated by commas (without spaces) in the HPO column.
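The comma-separated HPO column can be split into individual terms as sketched below. The helper name `read_phenotypes` is hypothetical, and the sketch assumes the first line of the file is a header row.

```python
def read_phenotypes(lines):
    """Map each Family_ID to its list of HPO terms (first line assumed to be a header)."""
    phenotypes = {}
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index == 0 or not line:
            continue  # skip the header row and blank lines
        family_id, hpo = line.split("\t")
        # Terms are separated by commas without spaces.
        phenotypes[family_id] = hpo.split(",")
    return phenotypes


example_lines = [
    "Family_ID\tHPO\n",
    "FIN1\tHP:0000365\n",
    "FIN13\tHP:0001249,HP:0012758,HP:0000252\n",
]
phenotypes = read_phenotypes(example_lines)
print(phenotypes["FIN13"])
```

A space after a comma would end up inside the term (for example `" HP:0012758"`), which is one reason the no-spaces rule matters.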
This file corresponds to the mapfile input. It should be the latest phenotype_to_genes.txt file from HPO, which can be found at https://hpo.jax.org/app/download/annotation. This file updates every 2 months so the user should check that they have the most updated version of the file in their repository to get the most accurate and current results. Once this file is downloaded it should be in the format compatible with this application without any changes needing to be made.
The complete application can be run through main.py with the following command:
python main.py
For greater flexibility, there are also the following optional arguments:
- --pedfile OR -p : specify the absolute or relative path to the pedigree file. If no argument is specified, the application will look for a file named Test_Ped.txt in the repository's directory
- --data OR -d : specify the absolute or relative path to the cleaned data file. If no argument is specified, the application will look for a file named Test_cleaned.txt in the repository's directory
- --output OR -o : specify the file name (including extension and, optionally, the file path) for the output file. If no argument is specified, the application will name the output file filtered.csv and place it in the repository's directory
- --output_phen OR -op : specify the file name (including extension and (optionally) the file path) for the output file when additional phenotype filtering is performed. If no argument is specified, the application will name the output file filtered_phen.csv and place it in the repository's directory
- --family OR -f : specify a certain family to output an individual csv file for. The default behaviour is to produce a single output file with variants for all families.
- --phenfile OR -ph : specify the absolute or relative path to the phenotype file. If no argument is specified, the application will look for a file named Test_Phen.txt in the repository's directory
- --mapfile OR -m : specify the absolute or relative path to the phenotype-to-gene mapping file. If no argument is specified, the application will look for a file named phenotype_to_genes.txt in the repository's directory. If no such file exists, the user is prompted to download one.
- --nophen: specify that no phenotype filtering will be performed.
Any combination of these arguments can be used, and they can be chained together. For example, combining five of them would look like:
python main.py -p <file path> -d <file path> -o <file path> -f <family name> -ph <file path>
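For readers who want to see how such an interface fits together, the options and defaults documented above can be mirrored with `argparse` as follows. This is a sketch written from the documentation, not VIA's actual source; `build_parser`, the help strings, and the choice of `None` as the default for `--family` are assumptions.

```python
import argparse


def build_parser():
    """Mirror the documented options; defaults follow the descriptions above."""
    parser = argparse.ArgumentParser(description="Sketch of the documented VIA options")
    parser.add_argument("--pedfile", "-p", default="Test_Ped.txt", help="pedigree file")
    parser.add_argument("--data", "-d", default="Test_cleaned.txt", help="cleaned data file")
    parser.add_argument("--output", "-o", default="filtered.csv", help="output file")
    parser.add_argument("--output_phen", "-op", default="filtered_phen.csv",
                        help="output file for phenotype filtering")
    # Assumption: no family selected means one combined output for all families.
    parser.add_argument("--family", "-f", default=None, help="single family to output")
    parser.add_argument("--phenfile", "-ph", default="Test_Phen.txt", help="phenotype file")
    parser.add_argument("--mapfile", "-m", default="phenotype_to_genes.txt",
                        help="phenotype-to-gene mapping file")
    parser.add_argument("--nophen", action="store_true", help="skip phenotype filtering")
    return parser


args = build_parser().parse_args(["-p", "my_ped.txt", "-op", "phen.csv", "--nophen"])
print(args.pedfile, args.output, args.output_phen, args.nophen)
```

Options that are not given on the command line fall back to the documented default file names.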
VIA outputs two csv files with a row for each candidate gene for each individual. In both files the columns are the same as the second input (the cleaned annotated file), except that there are three columns prepended:
- inh model: The inheritance model(s) (comma-separated) that the variant for that row corresponds to. (e.g. addn; xl,ad; ad, etc.)
- family: The Family ID of the individual to whom the variant for that row corresponds (e.g. FIN5).
- sample: The Individual ID of the individual(s) to whom the variant for that row corresponds (e.g. FIN5.3). Note that the individuals appear in the same order as their corresponding inheritance models. So if the inheritance models are "xl,ad" and the individuals are "A,B", then the variant is under an xl inheritance model for A and an ad inheritance model for B.
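When post-processing the output, the order-matched `inh model` and `sample` values can be paired up as in the sketch below. The helper `pair_models_with_samples` is hypothetical, and it assumes both values contain the same number of comma-separated entries.

```python
def pair_models_with_samples(inh_model: str, sample: str):
    """Pair each sample with the inheritance model listed at the same position."""
    models = [m.strip() for m in inh_model.split(",")]
    samples = [s.strip() for s in sample.split(",")]
    if len(models) != len(samples):
        raise ValueError("expected one inheritance model per sample")
    return list(zip(samples, models))


pairs = pair_models_with_samples("xl,ad", "A,B")
print(pairs)
```

For the example in the text this pairs A with xl and B with ad.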
The first csv file lists all of the candidate genes for each individual without taking the HPO terms into account. The candidate genes are sorted first by sample and then by inh model. The second csv file lists the candidate genes for each individual that also match the phenotype of the affected individuals. It is sorted first by family, then by the number of HPO terms each candidate 'matches' for a family, then by sample. In addition, the second file includes a column with the number of HPO terms matched by each candidate, placed after the sample column (so it is the fourth column).
Note: the 'ad, addn' model is not shown for affected individuals with no parents (singletons with or without sibs) in the full output, but will be shown in the 'phen' output.
VIA identifies variants corresponding to four models of inheritance. Note that, for affected singletons, addn and ad are not shown.
Identifies variants in affected individuals who are 1/1 for a given variant. For autosomal recessive, any affected children or siblings are 1/1 and unaffected mothers and fathers are both 0/1. For X-linked, affected siblings and children are 1/1 while unaffected mothers are 0/1 and unaffected fathers are 0/0. The affected person must be male and the variant must be on the X chromosome. For X-linked de novo variants, all of the above holds except that unaffected mothers are 0/0.
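The genotype rules in this paragraph can be written as two small predicates. This is a sketch of the stated rules only, not VIA's code: the function names are hypothetical, genotypes are assumed to be plain strings such as "0/1", and the chromosome is assumed to be labelled "X" or "chrX".

```python
def is_autosomal_recessive(affected_gts, mother_gt, father_gt):
    """Affected children/siblings are 1/1; both unaffected parents are 0/1."""
    return (
        bool(affected_gts)
        and all(gt == "1/1" for gt in affected_gts)
        and mother_gt == "0/1"
        and father_gt == "0/1"
    )


def is_x_linked(affected_gts, affected_is_male, chrom, mother_gt, father_gt, de_novo=False):
    """X-linked rule: affected male is 1/1 on X, father 0/0, mother 0/1 (0/0 if de novo)."""
    expected_mother = "0/0" if de_novo else "0/1"
    return (
        chrom in ("X", "chrX")  # assumed chromosome labels
        and affected_is_male
        and bool(affected_gts)
        and all(gt == "1/1" for gt in affected_gts)
        and mother_gt == expected_mother
        and father_gt == "0/0"
    )


print(is_autosomal_recessive(["1/1"], "0/1", "0/1"))
print(is_x_linked(["1/1"], True, "X", "0/0", "0/0", de_novo=True))
```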
Identifies variants in affected individuals who are 0/1 more than or equal to 2 times in a single gene. If the parents are available, variants are only listed when they are correctly phased (one variant comes from one parent and the other from the second parent). If there are more than 2 variants in a single gene in the affected individual, only two of them need to be phased correctly for the variants to be listed.
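The phasing requirement can be illustrated with a small check over the variants found in one gene. This is a sketch under stated assumptions: each variant is a dict with hypothetical keys `child`, `mother` and `father` holding genotype strings, and a variant counts as coming from one parent when that parent carries the alternate allele and the other parent is 0/0.

```python
def carries_alt(gt: str) -> bool:
    """True when the genotype string contains at least one alternate allele."""
    return "1" in gt.split("/")


def is_compound_het(variants, parents_available=True):
    """At least two 0/1 variants in the gene; with parents, one from each parent."""
    hets = [v for v in variants if v["child"] == "0/1"]
    if len(hets) < 2:
        return False
    if not parents_available:
        return True  # phasing cannot be checked without parental genotypes
    maternal = any(carries_alt(v["mother"]) and not carries_alt(v["father"]) for v in hets)
    paternal = any(carries_alt(v["father"]) and not carries_alt(v["mother"]) for v in hets)
    # A variant cannot be both maternal-only and paternal-only, so two distinct
    # variants are involved whenever both flags are true.
    return maternal and paternal


gene_variants = [
    {"child": "0/1", "mother": "0/1", "father": "0/0"},
    {"child": "0/1", "mother": "0/0", "father": "0/1"},
    {"child": "0/1", "mother": "0/1", "father": "0/0"},
]
print(is_compound_het(gene_variants))
```

With three heterozygous variants, as in the example, one maternal and one paternal variant are enough for the gene to qualify.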
Identifies variants in affected children/siblings which are not present in either parent. Variants with an allele frequency greater than 0.0005 or a low coverage depth (<6) are removed.
Identifies variants in affected children/siblings when at least one of the parents is also affected. Variants with an allele frequency greater than 0.0005 or a low coverage depth (<6) are removed.
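The de novo and dominant rules share the same frequency and depth thresholds, which makes them easy to sketch together. The function names below are hypothetical. The sketch also assumes that, in the dominant case, an affected parent must carry the variant, which is not stated explicitly in the text.

```python
MAX_AF = 0.0005  # variants with a higher allele frequency are removed
MIN_DEPTH = 6    # variants with a lower coverage depth are removed


def carries_alt(gt: str) -> bool:
    """True when the genotype string contains at least one alternate allele."""
    return "1" in gt.split("/")


def passes_thresholds(af: float, depth: int) -> bool:
    return af <= MAX_AF and depth >= MIN_DEPTH


def is_de_novo(child_gt, mother_gt, father_gt, af, depth):
    """Variant present in the affected child and absent from both parents."""
    return (
        carries_alt(child_gt)
        and mother_gt == "0/0"
        and father_gt == "0/0"
        and passes_thresholds(af, depth)
    )


def is_dominant(child_gt, parents, af, depth):
    """parents is a list of (genotype, is_affected) pairs.

    Assumption: at least one affected parent also carries the variant.
    """
    return (
        carries_alt(child_gt)
        and any(affected and carries_alt(gt) for gt, affected in parents)
        and passes_thresholds(af, depth)
    )


print(is_de_novo("0/1", "0/0", "0/0", af=0.0001, depth=20))
print(is_dominant("0/1", [("0/1", True), ("0/0", False)], af=0.0001, depth=20))
```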
Each sampled individual is a member of the person class. Their characteristic attributes are:
- ID (e.g. FIN5.1)
- sex
- phenotype (unaffected or affected)
Each family belongs to the family class. Their characteristic attributes are:
- ID (e.g. FIN5)
- people - a list of Person objects
- siblings - a list of Person objects whose relationship in the family (relative to the affected individual being studied) is 'sibling'
- mother - a Person object for the mother of the family
- father - a Person object for the father of the family
- child - a Person object for the affected individual being studied
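The two classes described above could look roughly like the following dataclasses. This is an illustrative sketch built from the attribute lists, not VIA's actual class definitions; the helper `add_member` and the use of the pedigree `Status` value to assign roles are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    ID: str          # e.g. FIN5.1
    sex: str
    phenotype: str   # "Affected" or "Unaffected"


@dataclass
class Family:
    ID: str          # e.g. FIN5
    people: List[Person] = field(default_factory=list)
    siblings: List[Person] = field(default_factory=list)
    mother: Optional[Person] = None
    father: Optional[Person] = None
    child: Optional[Person] = None


def add_member(family: Family, person: Person, status: str) -> None:
    """Hypothetical helper: file a person under the role given by the pedigree Status."""
    family.people.append(person)
    role = status.lower()
    if role == "sibling":
        family.siblings.append(person)
    elif role in ("mother", "father", "child"):
        setattr(family, role, person)
    # Any other status (e.g. "Other") is kept in people only.


fam = Family("FIN5")
add_member(fam, Person("FIN5.1", "Male", "Unaffected"), "Father")
add_member(fam, Person("FIN5.2", "Female", "Unaffected"), "Mother")
add_member(fam, Person("FIN5.3", "Male", "Affected"), "Child")
add_member(fam, Person("FIN5.4", "Female", "Unaffected"), "Sibling")
print(fam.child.ID, len(fam.people), len(fam.siblings))
```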
Each of these filters is used to pull out candidate variants:
- filter_AD(df, name, ad) - filters a DataFrame (df) by the minimum allele depth (ad) in a particular column (name)
- filter_DP(df, name, dp, inplace=1) - filters the DataFrame (df) by min depth in a particular column (name). If inplace is set to an integer other than 1, it will filter df into a new data frame, but by default the function filters in place.
- filter_occurences(df, zyg, namestart, nameend, cap) - filters the DataFrame (df) by the max number of occurrences (cap) of a particular zygosity (zyg) in a range of columns
- filter_AF(df, cap) - filters the DataFrame (df) in place by the maximum population allele frequency (cap)
- filter_zyg(df, name, zyg) - filters the DataFrame (df) for the zygosity in a particular column (name)
- exclude_zyg(df, name, zyg) - filters the DataFrame (df) to exclude a certain zygosity (zyg) in a particular column (name)
- filter_benign(df) - filters the DataFrame (df) to exclude variants that are "Benign" or "Likely benign". This filter is not used in any of the models.
- filter_DP_Max(df, names, dp, inplace=1) - filters the DataFrame (df) for variants with a maximum DP across a list of affected people (names) that is greater than the minimum value (dp), a given constant. If inplace is 1, it filters df in place; if it is not, it filters into a new DataFrame
- filter_chr(df, chrom, exclude = False) - filters the DataFrame (df) to keep only the rows in which the gene is located in a particular chromosome (chrom)
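To make the behaviour of two of these filters concrete, here is a plain-Python analogue that works on a list of dicts instead of a DataFrame. It is not VIA's implementation: the names `keep_rare` and `keep_chromosome` are hypothetical, the frequency cap is assumed to apply to the largest value across the population frequency columns, and missing values written as "." are assumed to count as 0.

```python
AF_COLUMNS = ["AF_popmax", "PopFreqMax", "GME_AF", "Kaviar_AF", "abraom_freq"]


def max_population_af(row):
    """Largest allele frequency across the population columns ("." treated as 0)."""
    values = []
    for column in AF_COLUMNS:
        raw = row.get(column, ".")
        values.append(0.0 if raw in (".", "") else float(raw))
    return max(values)


def keep_rare(rows, cap):
    """Analogue of filter_AF: keep rows whose maximum frequency does not exceed cap."""
    return [row for row in rows if max_population_af(row) <= cap]


def keep_chromosome(rows, chrom, exclude=False):
    """Analogue of filter_chr: keep rows on chrom, or drop them when exclude is True."""
    return [row for row in rows if (row["Chr"] == chrom) != exclude]


rows = [
    {"Chr": "X", "AF_popmax": "0.0001", "PopFreqMax": ".", "GME_AF": "."},
    {"Chr": "1", "AF_popmax": "0.02", "PopFreqMax": "0.03", "GME_AF": "."},
    {"Chr": "1", "AF_popmax": ".", "PopFreqMax": ".", "Kaviar_AF": "0.0004"},
]
rare = keep_rare(rows, 0.0005)
print(len(rare), len(keep_chromosome(rows, "X")), len(keep_chromosome(rows, "X", exclude=True)))
```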