iva_qc stats
martinghunt edited this page Oct 31, 2014 · 6 revisions
This page has an explanation of each of the stats computed by the script `iva_qc`.
- `ref_EMBL_dir`: directory of EMBL files used for the reference genome.
- `ref_EMBL_files`: EMBL files in the above directory.
- `ref_bases`: total length of the reference.
- `ref_sequences`: number of sequences in the reference (i.e. number of EMBL files).
- `cds_number`: the number of CDSs in the reference.
- `assembly_bases`: total length of the assembly.
- `assembly_contigs`: number of contigs in the assembly (i.e. number of fasta records in the input file).
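As a rough illustration of the last two stats, the sketch below counts the records and bases in FASTA-formatted text. The function name `fasta_stats` is hypothetical and the toy input is made up; this is not the code that `iva_qc` itself runs.

```python
def fasta_stats(fasta_text):
    """Return (number of records, total bases) for FASTA-formatted text."""
    records = 0
    bases = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records += 1        # one header line per contig
        else:
            bases += len(line)  # sequence may wrap over several lines
    return records, bases


# Toy two-contig assembly (made-up data, for illustration only).
example = ">contig1\nACGTAC\nGT\n>contig2\nTTGA\n"
assembly_contigs, assembly_bases = fasta_stats(example)
print(assembly_contigs, assembly_bases)  # prints: 2 12
```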
During QC, before calculating any metrics, the following things happen:
- The reads are mapped to the assembly.
- The reads are mapped to the reference.
- The assembly and reference are compared using nucmer.
- All CDSs are extracted from the reference and compared to the assembly using nucmer.
This gives rise to the following metrics.
- `ref_bases_assembled`: total length of the nucmer hits in the reference, counting only once any bases that are in more than one nucmer hit.
- `assembly_bases_in_ref`: as for `ref_bases_assembled`, except counting nucmer hits in the assembly.
- `assembly_contigs_hit_ref`: number of contigs that have a nucmer hit to the reference.
- `ref_sequences_assembled`: number of reference sequences that have at least 90% of their bases covered by nucmer hits.
- `ref_sequences_assembled_ok`: a reference sequence of length `r` is counted as assembled OK if it has a single nucmer hit of length `h` to a contig of length `c` with (i) `0.9 <= h / r <= 1.1` and (ii) `0.9 <= h / c`.
- `ref_bases_assembler_missed`: a base is counted as missed if it is not covered by a nucmer hit, but has at least 5X read coverage on both strands.
- `assembly_bases_reads_disagree`: number of positions in the assembly where, on each strand, more than half of the reads differ from the assembled base.
- `assembly_sum_longest_match_each_segment`: for each segment in the reference, we take the longest matching contig. The lengths of these contigs are added up.
- `cds_assembled`: as for `ref_sequences_assembled`, but considering CDS sequences instead of whole reference sequences.
- `cds_assembled_ok`: as for `ref_sequences_assembled`, but considering CDS sequences instead of whole reference sequences. Further, each assembled CDS must contain an open reading frame of length at least 90% of the CDS length. This is extremely stringent and you are probably better off considering the RATT metrics listed later.
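The coverage-based definitions above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the `iva_qc` implementation: the function names are hypothetical, and hits are assumed to be given as 1-based inclusive `(start, end)` pairs with `start <= end`.

```python
def covered_bases(hits):
    """Total bases covered by the hits, counting overlapping bases once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(hits):
        if cur_end is None or start > cur_end + 1:
            # Gap before this hit: close the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent hit: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def is_assembled(ref_length, hits, min_fraction=0.9):
    """True if at least min_fraction of the sequence is covered by hits."""
    return covered_bases(hits) >= min_fraction * ref_length


def is_assembled_ok(r, h, c):
    """Conditions (i) and (ii) for one hit of length h, reference length r,
    contig length c."""
    return 0.9 <= h / r <= 1.1 and 0.9 <= h / c


# Made-up hits: the first two overlap, so bases 51-100 are counted once.
hits = [(1, 100), (51, 150), (201, 250)]
print(covered_bases(hits))               # prints: 200
print(is_assembled(220, hits))           # prints: True
print(is_assembled_ok(1000, 950, 1020))  # prints: True
print(is_assembled_ok(1000, 950, 2000))  # prints: False
```

Under these assumptions, `covered_bases` corresponds to how `ref_bases_assembled` and `assembly_bases_in_ref` are described, and `is_assembled` to the 90% rule used for `ref_sequences_assembled` and `cds_assembled`.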
The following metrics are taken directly from the output of the GAGE analysis code. Descriptions are taken directly from the supplementary information of the GAGE publication.
- `gage_Missing_Reference_Bases`: number of bases that did not align to any contig.
- `gage_Missing_Assembly_Bases`: number of bases that did not align to the reference genome.
- `gage_Missing_Assembly_Contigs`: number of contigs with no matches to the reference.
- `gage_Duplicated_Reference_Bases`: sequences that occurred more times in the assembly than in the reference genome.
- `gage_Compressed_Reference_Bases`: sequences that occurred more times in the reference than in the assembly.
- `gage_Bad_Trim`: bases at the beginning or end of contigs that did not align to the reference.
- `gage_Avg_Idy`: computed on the one-to-one aligned segments, ignoring duplicated bases.
- `gage_SNPs`: computed on the one-to-one aligned segments, ignoring duplicated bases.
- `gage_Indels_<_5bp`
- `gage_Indels_>=_5`
- `gage_Inversions`: a switch between strands (and orientation).
- `gage_Relocation`: connect distant segments from the same chromosome.
- `gage_Translocation`: connect segments from different chromosomes.
The following are taken directly from the output of RATT.
- `ratt_elements_found`: number of annotation elements found in the reference EMBL files.
- `ratt_elements_transferred`: number of annotation elements transferred onto the assembly.
- `ratt_elements_transferred_partially`: number of annotation elements partially transferred onto the assembly.
- `ratt_elements_split`: number of annotation elements split into more than one new element in the assembly.
- `ratt_parts_of_elements_not_transferred`: total parts of elements not transferred, from only those elements that were partially transferred.
- `ratt_elements_not_transferred`: number of annotation elements that were not transferred.
- `ratt_gene_models_to_transfer`: number of genes found in the reference EMBL files.
- `ratt_gene_models_transferred`: number of genes transferred onto the assembly.
- `ratt_gene_models_transferred_partially`: number of genes partially transferred onto the assembly.
- `ratt_exons_not_transferred_from_partial_matches`: number of exons that could not be transferred, from only those genes that were partially transferred.
- `ratt_gene_models_not_transferred`: number of genes not transferred onto the assembly.