The bait-stringent-RM25pc-all.fas bait set contains 39938 100 nt baits designed from 10 diverse species of Staphylinidae, targeting 1229 single-copy, protein-coding orthologs. Experimental application of this bait set is described in Brunke et al. (Accepted).
This bait set was made possible by the significant contributions from the following scientists:
- Janina L. Kypke (Natural History Museum of Denmark, Denmark) and Hermes Escalona (Australian National Insect Collection, Australia), for providing the underlying ortholog alignments based on transcriptome and genome alignments given in the Supplementary_S2.csv file. Clusters of orthologous genes were identified using the methodology given in Kypke (2018) and Brunke et al. (Accepted).
- Dr. Adam Brunke (Agriculture & Agri-Food Canada), for his instrumental guidance on the design of the bait set.
- Brian Brunelle (Arbor BioSciences), for probe creation and QA/QC prior to manufacturing.
- Dr. Jeremy Dettman, Julie Chapados, Robin Richter, Wayne McCormick (Agriculture & Agri-Food Canada), for the extensive validation and implementation of the bait set.
- bait-stringent-RM25pc-all.fas: Final bait set with OrthoDB v9 headers
- bait-stringent-RM25pc-all.fasta: Final bait set with Phyluce compatible UCE headers
- input.seq.fas: Final file generated by staphylinidae_baits.py submitted to Arbor BioSciences for bait design.
- Supplementary_S2.csv: List of all reference taxa
Input Files: 3814 orthologs, as amino acid (protein) and nucleotide (DNA) sequences, were generated from 34 insect genomes, transcriptomes and low-coverage genomes, processed with OrthoDB v9 and the Orthograph pipeline (see Kypke (2018) and Brunke et al. (Accepted) for details). These reference taxa included 20 representatives of Staphylinidae and are listed in Supplementary_S2.csv.
- Amino acid orthologs were aligned using T_Coffee 11.0.8 with default settings, additionally outputting score_ascii files
- Nucleotide alignments were generated using Tranalign from EMBOSS 6.6.0 with T_Coffee AA alignments and DNA orthologs
- staphylinidae_baits.py performed the following steps:
- Scanned each score_ascii file with a sliding window, starting at a length of 2000 AA and shrinking to a minimum of 75 AA. Each alignment position has a TCS score from 1-9, so the maximum score for a 300 AA window is 2700. The largest window with a sum above 95% of its maximum TCS score was considered a conserved block.
- Corresponding conserved blocks were excised from the nucleotide alignments by multiplying the start and stop indices of protein alignments by 3.
- Any sequences with at least 20% gapped bases were removed from conserved blocks
- To fit within a 40000 bait target, the 20 Staphylinidae references were curated down to 10 priority species which represented a diverse sampling of the informal subfamily groups.
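The window scan, the protein-to-nucleotide index conversion, and the gap filter described above can be sketched as follows. This is a minimal illustration and not the code in staphylinidae_baits.py: the function names, the exclusive stop index, and the input types (a list of per-column TCS scores and a dict of aligned sequences) are assumptions, and parsing of the score_ascii files is omitted.

```python
def find_conserved_block(scores, max_len=2000, min_len=75,
                         threshold=0.95, max_per_pos=9):
    """Return (start, stop) amino acid indices of the largest window whose
    summed TCS score exceeds threshold * (max_per_pos * window length).

    `scores` is a list of per-column TCS scores; `stop` is exclusive
    (an assumption of this sketch). Returns None if no window qualifies.
    """
    n = len(scores)
    # Prefix sums let each window sum be computed in constant time.
    prefix = [0]
    for s in scores:
        prefix.append(prefix[-1] + s)

    # Try the longest window first so the first match is the largest block.
    for length in range(min(max_len, n), min_len - 1, -1):
        cutoff = threshold * max_per_pos * length
        for start in range(n - length + 1):
            if prefix[start + length] - prefix[start] > cutoff:
                return start, start + length
    return None


def to_nucleotide_coords(start, stop):
    """Convert protein alignment indices to nucleotide alignment indices."""
    return start * 3, stop * 3


def drop_gappy_sequences(seqs, max_gap_fraction=0.20):
    """Remove sequences with at least `max_gap_fraction` gapped bases.

    `seqs` maps a sequence name to its aligned string ('-' marks a gap).
    """
    return {
        name: seq
        for name, seq in seqs.items()
        if seq and seq.count("-") / len(seq) < max_gap_fraction
    }
```

Scanning from the longest window downward means the search can stop at the first window that passes, which matches the "largest window" wording above.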
Nucleotide conserved blocks were examined for the presence of the following priority Staphylinidae species:
- Acidota crenata (Omaliine group: Omaliinae)
- Silphotelus sp (Omaliine group: Proteininae)
- Deleaster dichrous (Oxyteline group: Oxytelinae)
- Paederus cruenticollis (Staphylinine group: Paederinae)
- Philonthus decorus (Staphylinine group: Staphylininae)
- Quedius fuliginosus (Staphylinine group: Staphylininae)
- Stenus bimaculatus (Staphylinine group: Steninae)
- Nicrophorus vespilloides (Silphidae or Staphylinidae: Tachyporine group)
- Deinopsis erosa (Tachyporine group: Aleocharinae)
- Lordithon lunulatus (Tachyporine group: Tachyporinae)
- 1229 conserved blocks were found containing 5-10 priority Staphylinidae with a minimum size of 300bp
- Curated blocks were created by removing all non priority Staphylinidae from conserved blocks
- Curated blocks were appended into a multi-fasta (input-seq.fas) and submitted to Arbor Biosciences for manufacturing
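The selection and curation of blocks described above can be sketched as follows. This is a hedged illustration only: it assumes each conserved block is held as a dict keyed by species name, which is a hypothetical layout, and `curate_block` is not a function from staphylinidae_baits.py.

```python
# The 10 priority Staphylinidae species listed above.
PRIORITY_SPECIES = frozenset({
    "Acidota crenata",
    "Silphotelus sp",
    "Deleaster dichrous",
    "Paederus cruenticollis",
    "Philonthus decorus",
    "Quedius fuliginosus",
    "Stenus bimaculatus",
    "Nicrophorus vespilloides",
    "Deinopsis erosa",
    "Lordithon lunulatus",
})


def curate_block(block, priority=PRIORITY_SPECIES,
                 min_species=5, min_length=300):
    """Return the block reduced to priority species, or None if rejected.

    `block` maps species name -> aligned nucleotide sequence (assumed layout).
    A block is kept only if it is at least `min_length` bp long and contains
    at least `min_species` priority species.
    """
    if not block:
        return None
    block_length = max(len(seq) for seq in block.values())
    kept = {name: seq for name, seq in block.items() if name in priority}
    if block_length < min_length or len(kept) < min_species:
        return None
    return kept
```

Blocks that return a dict would then be appended to the multi-fasta; blocks that return None are dropped.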
The following procedures were performed by Brian Brunelle of Arbor Biosciences
- 7,985 loci provided for bait design (Ns were replaced with Ts if they occurred consecutively for 1-10 bases)
- Using RepeatMasker, soft-masked the input sequences for simple repeats and repeats found in the Staphylinidae database; 0.12% masked (all simple repeats)
- Optimal 100 nt bait from every 120 nt interval based on GC, ΔG, etc = 40,127 raw baits
- Each bait candidate was BLASTed against the 8 provided genomes:
- ref01 = GCF_001937115.1_Atum_1.0_genomic.fas
- ref02 = GCF_000699045.2_Apla_2.0_genomic.fas
- ref03 = GCF_000648695.1_Otau_2.0_genomic.fas
- ref04 = GCF_000500325.1_Ldec_2.0_genomic.fas
- ref05 = GCF_000390285.2_Agla_2.0_genomic.fas
- ref06 = GCF_000355655.1_DendPond_male_1.0_genomic.fas
- ref07 = GCF_000002335.3_Tcas5.2_genomic.fas
- ref08 = GCF_001412225.1_Nicve_v1.0_genomic.fas
A hybridization melting temperature (Tm, defined as temperature at which 50% of molecules are hybridized) was estimated for each hit assuming standard myBaits® buffers and conditions.
-
For each bait candidate, one BLAST hit with the highest Tm is first discarded from the results (allowing for 1 hit in the genome), and only the top 500 hits (by bit score) are considered. Based on the distribution of remaining calculated Tm's, we filtered out non-specific baits using the following criteria:
- Stringent (only specific baits pass). Bait candidates pass if they satisfy one of these conditions:
- No hits with Tm above 60°C
- At most 2 hits 62.5 – 65°C
- At most 10 hits 62.5 – 65°C and at least 1 failing flanking bait
- At most 10 hits 62.5 – 65°C, 2 hits 65 – 67.5°C, and fewer than 2 passing flanking baits
- At most 2 hits 62.5 – 65°C, 1 hit 65 – 67.5°C, 1 hit 70°C or above, and < 2 passing flanking baits
- Moderate (some non-specific baits pass)
- Additional candidates pass if they have at most 10 hits 62.5 – 65°C and 2 hits above 65°C, and fewer than 2 passing baits on each flank.
- Relaxed (more non-specific baits pass)
- Additional candidates pass if they have at most 10 hits 62.5 – 65°C and 4 hits above 65°C, and fewer than 2 passing baits on each flank.
- Optimal bait design: keep only baits that passed “Moderate” BLAST filtering for all 8 genomes and were ≤25% repeat-masked
- Final bait set, bait-stringent-RM25pc-all.fas: 100 nt baits after recommended filtration = 39938 baits
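The hit pre-processing and the first Stringent condition from the BLAST filtering criteria above can be sketched as follows. This is not Arbor Biosciences' implementation. The `Hit` type and function names are hypothetical, and the sketch assumes the highest-Tm hit is removed before the top 500 hits are selected by bit score. The remaining conditions depend on Tm bin boundaries and flanking-bait status that the text does not fully specify, so they are left out.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit for a bait candidate (hypothetical container)."""
    tm: float         # estimated hybridization melting temperature, in °C
    bit_score: float  # BLAST bit score


def remaining_tms(hits, max_hits=500):
    """Discard the single hit with the highest Tm (the allowed on-target
    hit), then return the Tm values of the top `max_hits` hits by bit score.
    """
    hits = list(hits)
    if not hits:
        return []
    best = max(range(len(hits)), key=lambda i: hits[i].tm)
    rest = hits[:best] + hits[best + 1:]
    rest.sort(key=lambda h: h.bit_score, reverse=True)
    return [h.tm for h in rest[:max_hits]]


def passes_no_hits_above(hits, tm_limit=60.0):
    """First Stringent condition: no remaining hit has a Tm above 60 °C."""
    return all(tm <= tm_limit for tm in remaining_tms(hits))
```

A candidate with one strong on-target hit and only weak secondary hits passes this condition; a candidate with a second hit at 66 °C does not, and would have to satisfy one of the other listed conditions instead.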
Phyluce provides methodology to evaluate the theoretical effectiveness of a bait set prior to manufacturing.
The bait set was evaluated against 18 Coleoptera genome assemblies available on NCBI as of November 2018.
Bait set fasta headers were converted to Phyluce "UCE" format and processed via Phyluce, outlined in
Tutorial IV: Identifying UCE Loci and Designing Baits To Target Them: In-silico test of the bait design
Table 1. Target Hits Among NCBI Coleoptera Assemblies
GenBank Accession | Species | Post Filtered "UCEs" |
---|---|---|
GCA_000002335.3 | Tribolium castaneum | 873 |
GCA_000281835.1 | Priacma serrata | 530 |
GCA_000346045.2 | Dendroctonus ponderosae | 727 |
GCA_000355655.1 | Dendroctonus ponderosae | 775 |
GCA_000390285.2 | Anoplophora glabripennis | 775 |
GCA_000500325.2 | Leptinotarsa decemlineata | 621 |
GCA_000648695.2 | Onthophagus taurus | 841 |
GCA_000699045.2 | Agrilus planipennis | 748 |
GCA_001012855.1 | Hypothenemus hampei | 854 |
GCA_001412225.1 | Nicrophorus vespilloides* | 992 |
GCA_001443705.1 | Oryctes borbonicus | 718 |
GCA_001937115.1 | Aethina tumida | 730 |
GCA_002278615.1 | Pogonus chalceus | 691 |
GCA_002938485.1 | Sitophilus oryzae | 688 |
GCA_003013835.1 | Diabrotica virgifera | 420 |
GCA_003054995.1 | Aleochara bilineata* | 896 |
GCA_003402655.1 | Harmonia axyridis | 756 |
GCA_003568925.1 | Coccinella septempunctata | 769 |
*Staphylinidae or Silphidae (sister-group) |
- Python - Programming language
- Conda - Package, dependency and environment management
- Phyluce - Target enrichment data analysis
- BioPython - Tools for biological computation
- T_Coffee - Multiple sequence alignment
- Tranalign - Nucleotide alignments from protein alignments
- Orthograph - Orthologous gene detection
- OrthoDB - Ortholog database
Jackson Eyres, Bioinformatics Programmer, Agriculture & Agri-Food Canada
[email protected]
If you use this bait set, please cite the following paper:
Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J.T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J.R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.
Agriculture & Agri-Food Canada, Government of Canada
This project is licensed under the MIT License - see the LICENSE file for details
- Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J.T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J.R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.
- Kypke, J.L. (2018). Phylogenetics of the world’s largest beetle family (Coleoptera: Staphylinidae): A methodological exploration. Natural History Museum of Denmark, Faculty of Science, University of Copenhagen. Available: https://curis.ku.dk/portal/files/217380217/PhD_Janina_Lisa_Kypke_SNM.pdf
- Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786-788. doi:10.1093/bioinformatics/btv646.
- Kriventseva EK et al. 2018. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. NAR. doi:10.1093/nar/gky1053. PMID:30395283
- Petersen, M., Meusemann, K., Donath, A. et al. Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics 18, 111 (2017). https://doi.org/10.1186/s12859-017-1529-8