Staphylinidae Bait Set For Anchored Hybrid Enrichment

The bait-stringent-RM25pc-all.fas bait set contains 39938 baits of 100 nt each, designed from 10 diverse species of Staphylinidae and targeting 1229 single-copy, protein-coding orthologs. Experimental application of this bait set is described in Brunke et al. (Accepted).

Acknowledgements

This bait set was made possible by the significant contributions from the following scientists:

  • Janina L. Kypke (Natural History Museum of Denmark, Denmark) and Hermes Escalona (Australian National Insect Collection, Australia), for providing the underlying ortholog alignments based on transcriptome and genome alignments given in the Supplementary_S2.csv file. Clusters of orthologous genes were identified using the methodology given in Kypke (2018) and Brunke et al. (Accepted).

  • Dr. Adam Brunke (Agriculture & Agri-Food Canada), for his instrumental guidance on the design of the bait set.

  • Brian Brunelle (Arbor BioSciences), for probe creation and QA/QC prior to manufacturing.

  • Dr. Jeremy Dettman, Julie Chapados, Robin Richter and Wayne McCormick (Agriculture & Agri-Food Canada), for extensive validation and implementation of the bait set.

Files

  • bait-stringent-RM25pc-all.fas: Final bait set with OrthoDB v9 headers
  • bait-stringent-RM25pc-all.fasta: Final bait set with Phyluce compatible UCE headers
  • input.seq.fas: Final file generated by staphylinidae_baits.py submitted to Arbor BioSciences for bait design.
  • Supplementary_S2.csv: List of all reference taxa

Design Methodology

Input Files: 3814 orthologs, as amino acid (protein) and nucleotide (DNA) sequences, were generated from 34 insect genomes, transcriptomes and low-coverage genomes processed with OrthoDB v9 and the Orthograph pipeline (see Kypke (2018) and Brunke et al. (Accepted) for details). These reference taxa included 20 representatives of Staphylinidae and are listed in Supplementary_S2.csv.

  1. Amino acid orthologs were aligned using T_Coffee 11.0.8 with default settings, additionally outputting score_ascii files
  2. Nucleotide alignments were generated using Tranalign from EMBOSS 6.6.0 with T_Coffee AA alignments and DNA orthologs
  3. staphylinidae_baits.py performed the following steps:
    • Scanned each score_ascii file with a sliding window, starting at a length of 2000 AA and shrinking to a minimum of 75 AA. Each alignment position has a TCS score from 1-9, so the maximum score for a 300 AA window, for example, is 2700. The largest window with a sum above 95% of its maximum TCS score was considered a conserved block.
    • Corresponding conserved blocks were excised from the nucleotide alignments by multiplying the start and stop indices of the protein alignments by 3.
    • Any sequences with at least 20% gapped bases were removed from conserved blocks.
    • To fit within a 40000 bait target, the 20 Staphylinidae references were curated down to 10 priority species which represented a diverse sampling of the informal subfamily groups. Nucleotide conserved blocks were examined for the presence of the following priority Staphylinidae species:
      • Acidota crenata (Omaliine group: Omaliinae)
      • Silphotelus sp (Omaliine group: Proteininae)
      • Deleaster dichrous (Oxyteline group: Oxytelinae)
      • Paederus cruenticollis (Staphylinine group: Paederinae)
      • Philonthus decorus (Staphylinine group: Staphylininae)
      • Quedius fuliginosus (Staphylinine group: Staphylininae)
      • Stenus bimaculatus (Staphylinine group: Steninae)
      • Nicrophorus vespilloides (Silphidae or Staphylinidae: Tachyporine group)
      • Deinopsis erosa (Tachyporine group: Aleocharinae)
      • Lordithon lunulatus (Tachyporine group: Tachyporinae)
    • 1229 conserved blocks were found that contained 5-10 priority Staphylinidae and had a minimum size of 300 bp.
    • Curated blocks were created by removing all non-priority Staphylinidae from conserved blocks.
    • Curated blocks were appended into a multi-fasta (input-seq.fas) and submitted to Arbor Biosciences for manufacturing.
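The block-selection steps above can be sketched in Python as follows. This is a minimal illustration, not the code in staphylinidae_baits.py: it assumes the per-column TCS scores have already been parsed from a score_ascii file into a list of integers, and that alignments are held as dictionaries mapping a taxon name to its aligned sequence. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

MAX_TCS = 9        # highest per-column TCS score
THRESHOLD = 0.95   # fraction of the maximum window score required


def find_conserved_block(scores: List[int], max_len: int = 2000,
                         min_len: int = 75) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the largest window whose summed score exceeds
    95% of the maximum possible for that window length (end exclusive)."""
    # Prefix sums let each window sum be computed in constant time.
    prefix = [0]
    for s in scores:
        prefix.append(prefix[-1] + s)
    # Try the longest window first and shrink until one passes.
    for length in range(min(max_len, len(scores)), min_len - 1, -1):
        needed = THRESHOLD * MAX_TCS * length
        for start in range(len(scores) - length + 1):
            if prefix[start + length] - prefix[start] > needed:
                return start, start + length
    return None


def excise_nucleotide_block(nt_alignment: Dict[str, str],
                            aa_start: int, aa_end: int) -> Dict[str, str]:
    """Cut the matching codon range out of a codon-aware nucleotide alignment."""
    return {name: seq[aa_start * 3:aa_end * 3]
            for name, seq in nt_alignment.items()}


def drop_gappy(block: Dict[str, str],
               max_gap_fraction: float = 0.20) -> Dict[str, str]:
    """Remove sequences in which at least 20% of the bases are gaps."""
    return {name: seq for name, seq in block.items()
            if seq and seq.count("-") / len(seq) < max_gap_fraction}


def curate_block(block: Dict[str, str], priority: Set[str],
                 min_taxa: int = 5, min_bp: int = 300) -> Optional[Dict[str, str]]:
    """Keep only priority taxa; reject blocks with too few taxa or too short."""
    curated = {name: seq for name, seq in block.items() if name in priority}
    if len(curated) < min_taxa:
        return None
    if min(len(seq) for seq in curated.values()) < min_bp:
        return None
    return curated
```

Searching from the longest window downward means the first passing window is also the largest one, which matches the "largest window" rule described above. When several windows of the same length pass, this sketch returns the leftmost; how the original script breaks such ties is not stated.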

Arbor BioSciences Bait Design

The following procedures were performed by Brian Brunelle of Arbor Biosciences

  1. 7,985 loci were provided for bait design (Ns were replaced with Ts if they occurred consecutively for 1-10 bases).

  2. Using RepeatMasker, the input sequences were soft-masked for simple repeats and repeats found in the Staphylinidae database; 0.12% was masked (all simple repeats).

  3. The optimal 100 nt bait was selected from every 120 nt interval based on GC, ΔG, etc., giving 40,127 raw baits.

  4. Each bait candidate was BLASTed against the 8 provided genomes:

    • ref01 = GCF_001937115.1_Atum_1.0_genomic.fas
    • ref02 = GCF_000699045.2_Apla_2.0_genomic.fas
    • ref03 = GCF_000648695.1_Otau_2.0_genomic.fas
    • ref04 = GCF_000500325.1_Ldec_2.0_genomic.fas
    • ref05 = GCF_000390285.2_Agla_2.0_genomic.fas
    • ref06 = GCF_000355655.1_DendPond_male_1.0_genomic.fas
    • ref07 = GCF_000002335.3_Tcas5.2_genomic.fas
    • ref08 = GCF_001412225.1_Nicve_v1.0_genomic.fas

    A hybridization melting temperature (Tm, defined as temperature at which 50% of molecules are hybridized) was estimated for each hit assuming standard myBaits® buffers and conditions.

  5. For each bait candidate, one BLAST hit with the highest Tm is first discarded from the results (allowing for 1 hit in the genome), and only the top 500 hits (by bit score) are considered. Based on the distribution of remaining calculated Tm's, we filtered out non-specific baits using the following criteria:

    • Stringent (only specific baits pass). Bait candidates pass if they satisfy one of these conditions:
      • No hits with Tm above 60°C
      • At most 2 hits 62.5 – 65°C
      • At most 10 hits 62.5 – 65°C and at least 1 failing flanking bait
      • At most 10 hits 62.5 – 65°C, 2 hits 65 – 67.5°C, and fewer than 2 passing flanking baits
      • At most 2 hits 62.5 – 65°C, 1 hit 65 – 67.5°C, 1 hit 70°C or above, and < 2 passing flanking baits
    • Moderate (some non-specific baits pass)
      • Additional candidates pass if they have at most 10 hits 62.5 – 65°C and 2 hits above 65°C, and fewer than 2 passing baits on each flank.
    • Relaxed (more non-specific baits pass)
      • Additional candidates pass if they have at most 10 hits 62.5 – 65°C and 4 hits above 65°C, and fewer than 2 passing baits on each flank.
  6. Optimal bait design: kept only baits that passed “Moderate” BLAST filtering for all 8 genomes and were ≤25% repeat-masked.

  7. Final bait set, bait-stringent-RM25pc-all.fas: 100 nt baits after recommended filtration = 39938 baits
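Two of the steps above are simple enough to sketch in Python: the replacement of short N runs (step 1) and the preparation of BLAST hits before the Tm criteria are applied (step 5). This is an illustration only, not Arbor Biosciences' pipeline. The `Hit` type and both function names are hypothetical, sequences are assumed to be uppercase, and the sketch assumes the highest-Tm hit is discarded before the top 500 hits by bit score are taken.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    tm: float         # estimated hybridization melting temperature, in °C
    bit_score: float  # BLAST bit score


def replace_short_n_runs(seq: str, max_run: int = 10) -> str:
    """Replace runs of 1..max_run consecutive Ns with Ts; leave longer runs."""
    def fix(match: "re.Match[str]") -> str:
        run = match.group()
        return "T" * len(run) if len(run) <= max_run else run
    return re.sub(r"N+", fix, seq)


def hits_for_filtering(hits: List[Hit], keep: int = 500) -> List[Hit]:
    """Drop the single highest-Tm hit (the expected on-target match),
    then keep the top `keep` remaining hits ranked by bit score."""
    if not hits:
        return []
    best = max(range(len(hits)), key=lambda i: hits[i].tm)
    remaining = hits[:best] + hits[best + 1:]
    return sorted(remaining, key=lambda h: h.bit_score, reverse=True)[:keep]
```

The Tm thresholds themselves (Stringent, Moderate, Relaxed) would then be evaluated on the list returned by `hits_for_filtering`, together with the pass/fail status of the flanking baits.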

Phyluce Prediction of Bait Effectiveness

Phyluce provides a method for evaluating the theoretical effectiveness of a bait set prior to manufacturing.
The bait set was evaluated against 18 Coleoptera genome assemblies available on NCBI as of November 2018. Bait set fasta headers were converted to Phyluce "UCE" format and processed with Phyluce, following the steps outlined in Tutorial IV: Identifying UCE Loci and Designing Baits To Target Them: In-silico test of the bait design.

Table 1. Target Hits Among NCBI Coleoptera Assemblies

| GenBank Accession | Species | Post-filtered "UCEs" |
|---|---|---|
| GCA_000002335.3 | Tribolium castaneum | 873 |
| GCA_000281835.1 | Priacma serrata | 530 |
| GCA_000346045.2 | Dendroctonus ponderosae | 727 |
| GCA_000355655.1 | Dendroctonus ponderosae | 775 |
| GCA_000390285.2 | Anoplophora glabripennis | 775 |
| GCA_000500325.2 | Leptinotarsa decemlineata | 621 |
| GCA_000648695.2 | Onthophagus taurus | 841 |
| GCA_000699045.2 | Agrilus planipennis | 748 |
| GCA_001012855.1 | Hypothenemus hampei | 854 |
| GCA_001412225.1 | Nicrophorus vespilloides* | 992 |
| GCA_001443705.1 | Oryctes borbonicus | 718 |
| GCA_001937115.1 | Aethina tumida | 730 |
| GCA_002278615.1 | Pogonus chalceus | 691 |
| GCA_002938485.1 | Sitophilus oryzae | 688 |
| GCA_003013835.1 | Diabrotica virgifera | 420 |
| GCA_003054995.1 | Aleochara bilineata* | 896 |
| GCA_003402655.1 | Harmonia axyridis | 756 |
| GCA_003568925.1 | Coccinella septempunctata | 769 |
*Staphylinidae or Silphidae (sister-group)
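To put the counts in Table 1 in context, they can be expressed as a share of the 1229 targeted orthologs. The short Python sketch below does this for four rows copied from the table. It assumes that each post-filtered "UCE" corresponds to exactly one of the 1229 target loci, which this README does not state explicitly; the names `TARGET_LOCI`, `HITS` and `recovery_percent` are illustrative.

```python
TARGET_LOCI = 1229  # orthologs targeted by the bait set

# A subset of rows from Table 1: species -> post-filtered "UCE" count.
HITS = {
    "Nicrophorus vespilloides": 992,
    "Aleochara bilineata": 896,
    "Tribolium castaneum": 873,
    "Diabrotica virgifera": 420,
}


def recovery_percent(hit_count: int, targets: int = TARGET_LOCI) -> float:
    """Percentage of target loci recovered in silico for one assembly."""
    return 100.0 * hit_count / targets


for species, count in sorted(HITS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species}: {count}/{TARGET_LOCI} ({recovery_percent(count):.1f}%)")
```

Under that assumption, the two starred assemblies (Staphylinidae or its sister group) show the highest in-silico recovery among the rows shown here.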

Built With

  • Python - Programming language
  • Conda - Package, dependency and environment management
  • Phyluce - Target enrichment data analysis
  • BioPython - Tools for biological computation
  • T_Coffee - Multiple sequence alignment
  • Tranalign - Nucleotide alignments from protein alignments
  • Orthograph - Orthologous gene detection
  • OrthoDB - Ortholog database

Contact

Jackson Eyres, Bioinformatics Programmer, Agriculture & Agri-Food Canada
[email protected]

Cite Us

If you use this bait set, please cite the following paper:

Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J. T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J. R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.

Copyright

Agriculture & Agri-Food Canada, Government of Canada

License

This project is licensed under the MIT License - see the LICENSE file for details

Citations

  1. Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J. T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J. R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.

  2. Kypke, J.L. (2018). Phylogenetics of the world’s largest beetle family (Coleoptera: Staphylinidae): A methodological exploration. Natural History Museum of Denmark, Faculty of Science, University of Copenhagen. Available: https://curis.ku.dk/portal/files/217380217/PhD_Janina_Lisa_Kypke_SNM.pdf

  3. Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786-788. doi:10.1093/bioinformatics/btv646.

  4. Kriventseva, E. K. et al. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Research, Nov 2018. doi:10.1093/nar/gky1053. PMID:30395283

  5. Petersen, M., Meusemann, K., Donath, A. et al. Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics 18, 111 (2017). https://doi.org/10.1186/s12859-017-1529-8