iva_qc stats

martinghunt edited this page Oct 31, 2014 · 6 revisions

This page explains each of the stats computed by the script iva_qc.

Basic assembly and reference information

  • ref_EMBL_dir: directory of EMBL files used for the reference genome.
  • ref_EMBL_files: EMBL files in the above directory.
  • ref_bases: total length of the reference.
  • ref_sequences: number of sequences in the reference (i.e. number of EMBL files).
  • cds_number: the number of CDSs in the reference.
  • assembly_bases: total length of the assembly.
  • assembly_contigs: number of contigs in the assembly (i.e. number of fasta records in the input file).
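
As an illustration of the last two entries, here is a minimal sketch of how assembly_bases and assembly_contigs could be derived from a FASTA file. The function name `assembly_stats` is hypothetical and this is not necessarily how iva_qc computes them; it only assumes that each record starts with a `>` header line.

```python
def assembly_stats(fasta_lines):
    """Return (assembly_bases, assembly_contigs) from an iterable of FASTA lines.

    Hypothetical helper for illustration only, not part of iva_qc.
    """
    bases = 0
    contigs = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            contigs += 1  # each header line starts a new fasta record
        else:
            bases += len(line)  # sequence lines contribute to the total length
    return bases, contigs


# Made-up two-contig assembly, given as lines of a FASTA file
example = [">contig1", "ACGTACGT", "ACG", ">contig2", "TTTT"]
print(assembly_stats(example))
```

For a real file, the same function can be called on an open file handle, e.g. `assembly_stats(open("contigs.fasta"))`, where the filename is a placeholder.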

Custom metrics

During QC, before calculating any metrics, the following things happen:

  1. The reads are mapped to the assembly.
  2. The reads are mapped to the reference.
  3. The assembly and reference are compared using nucmer.
  4. All CDSs are extracted from the reference and compared to the assembly using nucmer.

This gives rise to the following metrics.

  • ref_bases_assembled: total number of reference bases covered by nucmer hits, counting each base only once even if it lies in more than one hit.
  • assembly_bases_in_ref: as for ref_bases_assembled, but counting the bases covered by nucmer hits in the assembly.
  • assembly_contigs_hit_ref: number of contigs that have a nucmer hit to the reference.
  • ref_sequences_assembled: number of reference sequences that have at least 90% of their bases covered by nucmer hits.
  • ref_sequences_assembled_ok: a reference sequence of length r is counted as assembled OK if it has a single nucmer hit of length h to a contig of length c with (i) 0.9 <= h / r <= 1.1 and (ii) 0.9 <= h / c
  • ref_bases_assembler_missed: a base is counted as missed if it is not covered by a nucmer hit, but has at least 5X read coverage on both strands.
  • assembly_bases_reads_disagree: a position in the assembly where, on each strand, more than half of the reads are different from the assembled base.
  • assembly_sum_longest_match_each_segment: for each segment in the reference, we take the longest matching contig. The lengths of these contigs are added up.
  • cds_assembled: as for ref_sequences_assembled, but considering CDS sequences instead of reference sequences.
  • cds_assembled_ok: as for ref_sequences_assembled, but considering CDS sequences instead of reference sequences. In addition, each assembled CDS must contain an open reading frame of length at least 90% of the CDS length. This is extremely stringent, and you are probably better off considering the RATT metrics listed later.
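
The definitions above can be made concrete with a short sketch. This is not the iva_qc source: the function names are hypothetical, hits are assumed to be (start, end) pairs with 0-based, end-exclusive coordinates on a single reference sequence, and per-strand read depth is assumed to be available as one list of integers per strand.

```python
def merge_intervals(hits):
    """Merge overlapping (start, end) hits so that no base is counted twice."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def bases_covered(hits):
    """Number of bases covered by at least one hit (cf. ref_bases_assembled)."""
    return sum(end - start for start, end in merge_intervals(hits))


def sequence_assembled(seq_length, hits, min_fraction=0.9):
    """True if at least 90% of the sequence is covered (cf. ref_sequences_assembled)."""
    return bases_covered(hits) >= min_fraction * seq_length


def assembled_ok(r, h, c):
    """Conditions (i) and (ii) of ref_sequences_assembled_ok for a single hit.

    r = reference sequence length, h = hit length, c = contig length.
    """
    return 0.9 <= h / r <= 1.1 and 0.9 <= h / c


def bases_assembler_missed(seq_length, hits, fwd_cov, rev_cov, min_cov=5):
    """Count bases outside every hit that have >= min_cov reads on both strands."""
    in_hit = [False] * seq_length
    for start, end in merge_intervals(hits):
        for i in range(start, min(end, seq_length)):
            in_hit[i] = True
    return sum(
        1
        for i in range(seq_length)
        if not in_hit[i] and fwd_cov[i] >= min_cov and rev_cov[i] >= min_cov
    )


# Made-up example: a 100 bp reference sequence with two overlapping hits
hits = [(0, 60), (50, 95)]
fwd_cov = [10] * 100            # forward-strand depth at each position
rev_cov = [10] * 98 + [2] * 2   # reverse-strand depth drops at the last 2 bases

print(bases_covered(hits))                                  # 95
print(sequence_assembled(100, hits))                        # True
print(assembled_ok(r=100, h=95, c=100))                     # True
print(assembled_ok(r=100, h=95, c=200))                     # False
print(bases_assembler_missed(100, hits, fwd_cov, rev_cov))  # 3
```

In the last call, positions 95 to 99 lie outside every hit, but only positions 95 to 97 have at least 5X coverage on both strands, so three bases count as missed.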

GAGE metrics

The following metrics are taken directly from the output of the GAGE analysis code. Descriptions are taken directly from the supplementary information of the publication.

  • gage_Missing_Reference_Bases: number of bases that did not align to any contig.
  • gage_Missing_Assembly_Bases: number of bases that did not align to the reference genome.
  • gage_Missing_Assembly_Contigs: number of contigs with no matches to the reference.
  • gage_Duplicated_Reference_Bases: sequences that occurred more times in the assembly than in the reference genome.
  • gage_Compressed_Reference_Bases: sequences that occurred more times in the reference than in the assembly.
  • gage_Bad_Trim: bases at the beginning or end of contigs that did not align to the reference.
  • gage_Avg_Idy: average identity, computed on the one-to-one aligned segments, ignoring duplicated bases.
  • gage_SNPs: number of SNPs, computed on the one-to-one aligned segments, ignoring duplicated bases.
  • gage_Indels_<_5bp: number of indels shorter than 5 bp.
  • gage_Indels_>=_5: number of indels of 5 bp or longer.
  • gage_Inversions: a switch between strands (and orientation).
  • gage_Relocation: rearrangements that connect distant segments from the same chromosome.
  • gage_Translocation: rearrangements that connect segments from different chromosomes.

RATT metrics

The following are taken directly from the output of RATT.

  • ratt_elements_found: number of annotation elements found in the reference EMBL files.
  • ratt_elements_transferred: number of annotation elements transferred onto the assembly.
  • ratt_elements_transferred_partially: number of annotation elements partially transferred onto the assembly.
  • ratt_elements_split: number of annotation elements split into more than one new element in the assembly.
  • ratt_parts_of_elements_not_transferred: total parts of elements not transferred, from only those elements that were partially transferred.
  • ratt_elements_not_transferred: number of annotation elements that were not transferred.
  • ratt_gene_models_to_transfer: number of genes found in the reference EMBL files.
  • ratt_gene_models_transferred: number of genes transferred onto the assembly.
  • ratt_gene_models_transferred_partially: number of genes partially transferred onto the assembly.
  • ratt_exons_not_transferred_from_partial_matches: number of exons that could not be transferred, from only those genes that were partially transferred.
  • ratt_gene_models_not_transferred: number of genes not transferred onto the assembly.