MIG-Phylogenomics

To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.

Read data

The read data for this analysis is in SRA under accession number PRJNA340324

Raw paired read libraries

Expect read-1 and read-2 fastq files for each of the following libraries. FIle paths in notebooks my need to be adjusted depending on where you place the files on your machine (big data is usually placed outsied the work drive and the path for those are system specific)

Sample	Library	Used in analysis
MincA14	150715_D00248_0103_AC75KUANXX_4_IL-TP-021	+
MareHarA	150403_D00261_0236_AC6E37ANXX_8_IL-TP-021	+
MareHarA	150403_D00261_0236_AC6E37ANXX_8_IL-TP-023
MareHarA	150521_D00200_0260_AC6V40ANXX_2_IL-TP-021
MareHarA	150521_D00200_0260_AC6V40ANXX_2_IL-TP-023
MjavLD15	150715_D00248_0103_AC75KUANXX_4_IL-TP-010	+
MincL19	150715_D00248_0103_AC75KUANXX_4_IL-TP-011	+
MareL32	150715_D00248_0103_AC75KUANXX_4_IL-TP-022	+
MareL28	150715_D00248_0103_AC75KUANXX_4_IL-TP-008	+
MjavL57	150715_D00248_0103_AC75KUANXX_4_IL-TP-001	+
MjavVW4	mjavanicaVW4_500	+
MjavVW4	mjavanicaVW4_300
MincW1	150212_D00261_0225_AC6EKCANXX_1_IL-TP-013	+
MincW1	150212_D00261_0225_AC6EKCANXX_1_IL-TP-005
MincVW6	150212_D00261_0225_AC6EKCANXX_1_IL-TP-007	+
MincVW6	150212_D00261_0225_AC6EKCANXX_1_IL-TP-002
MincHarC	150212_D00261_0225_AC6EKCANXX_1_IL-TP-012	+
MincHarC	150212_D00261_0225_AC6EKCANXX_1_IL-TP-004
Minc557R	150212_D00261_0225_AC6EKCANXX_1_IL-TP-006	+
MincL9	150715_D00248_0103_AC75KUANXX_4_IL-TP-009	+
MincL27	150715_D00248_0103_AC75KUANXX_4_IL-TP-020	+
MjavLD17	150715_D00248_0103_AC75KUANXX_4_IL-TP-003	+
MentL30	150716_D00248_0104_BC75KYANXX_3_IL-TP-005	+
MentL30	150716_D00248_0104_BC75KYANXX_3_IL-TP-019
MfloSJF1	160425_E00397_0014_AHLYG7CCXX_1_TP-D7-003	+
MfloSJF1	160426_K00166_0058_AH7WLVBBXX_8_TP-D7-005_TP-D5-003

Genome assembly scripts

Genome assembly scripts by Dr. Laura Salazar are available here. The genome assembly files are in this repository.

Quality trimmed paired read file

These were used for mapping of genes and of contig pairs, based on raw read libraries indicated by + . They are available in this location until 25/6/2018. Aternatively, they can be created in notebook 2.

25M read subset of the first trimmed read file

These were used for mitochondrial genome assembly, based on the first read trimmed file. When link is provided instead of a file, the trimmed read one file had less than 25 M reads in it and was also used as the subset. The links will need to be recreated on your system. These files are created in notebook 5.

Notebooks and related files

0. Dependencies

Notebook file name: Dependencies.ipynb

1. CDSs and proteins from genome assemblies

Notebook file name: CDSs_and_proteins_from_genome_assemblies.ipynb

Related files:

meloidogyne_assemblies: contains fasta genome assemblies
annotation: contain gff files for the assemblies in assemblies

dirs that start with None: genes, cdss or proteins without premature stop codon
dirs that start with stopped: genes, cdss or proteins with a premature stop codon
dirs that start with all: a merge of None and stopped
dirs that end with files: raw, as indicated in the gff
dirs that end with centroinds: cds files that were reduced with a vsearch step
dirs that end with reviewed: final treated datasets (see notebook)
ref in all the dir names indicate that these files are derived from a genome assembly annotation.

2. Map-assemble genes from read data for samples without assemblies

Notebook file name: Map_assemble_gene.ipynb

Related files

<sample name>_bwa/<sample name>.nt.fasta: map-assembled gene files

<None | stopped | all >_<cdss | proteins | gffs>
with None indicating that nothing is written.

dirs that start with None: gffs, cdss or proteins without premature stop codon
dirs that start with stopped: gffs, cdss or proteins with a premature stop codon
dirs that start with all: a merge of None and stopped

3. Orthology clustering

Notebook file name: Orthology_clustering.ipynb

Related files:

orthofinder/all_inputs/<sample name><None|_ref>.aa.fasta: links to protein sequences of all the samples. They will need to be regenerated locally (step included in the notebook).

orthofinder/all_inputs/Results_Jan16/<inflation value>_OrthologousGroup.csv: Orthology clusters, with <inflation value> representign the mcl inflation parameter, except for 0, representing an inflation of 1.5, and 1, representing inflation of 1.1.

orthofinder/all_inputs/Results_Jan16/WorkingDirectory: OrthoFinder inputs and outputs of the Blast step.
orthofinder/all_inputs/Results_Jan16/OGs_I2_1-4.gb.gz: A genbank file with coding and protein sequences of orthology clusters with 1 to 4 gene copies for each reference sample`.

orthofinder/all_inputs/Results_Jan16/OGs_I2_1-4.gb.loci.<csv|txt>: ReproPhylo formated list of the loci that are in the genbank file.

orthofinder/all_inputs/Results_Jan16/rootknot_phylogenomics: Input and output files of the OC filtering and correction pipeline, with trimal settings of gt=0.7 and st=0.01`

orthofinder/all_inputs/Results_Jan16/I2_3X2_gt0.7_st_0.01_alns_<1-4 | all2 | flo2>: Sequence alignments of orthology clusters in which inparalogs are collapsed into a single sequence, OCs with fragmanted orthologs are excluded and each genome copy contains up to one copy per sample.
1-4: all the orthology clusters in which there are at least 3 reference samples with 2 gene copies. all2: a subset of 1-4 in which all the reference samples have two gene copies.
flo2: a subset of 1-4 in which all MfloSJF1 has two gene copies.

Figures

4. Nuclear phylogenomics

Notebook file name: Nuclear_phylogenomics.ipynb

Related files:

orthofinder/all_inputs/Results_Jul02/I2_3X2_gt0.7_st_0.01_alns_1-4/<astralshuffeled | raxmlshuffled>: randomization analyses in which homeolog 1 and homeolog 2 are randomly assigned for each gene.
astralshuffeled: 100 astral runs, in which hom 1 and 2 were randomly assigend for each gene.
raxmlshuffled: 100 raxml supermatrix trees, in which hom 1 and 2 were randomly assigned for each gene, prior to the concatenation of the supermatrix.

orthofinder/all_inputs/Results_Jul02/I2_3X2_gt0.7_st_0.01_alns_1-4/trees.txt: a list of gene trees that were used for astral (non randomized)

orthofinder/all_inputs/Results_Jul02/I2_3X2_gt0.7_st_0.01_alns_1-4/raxmlshuffled/trees.txt: a list of randomized supermatrix trees.

orthofinder/all_inputs/Results_Jul02/I2_3X2_gt0.7_st_0.01_alns_1-4/ RAxML_StrictConsensusTree<AstStrict | RaxStrict>: strict consensus trees that resulted from the two randomization analyses with astral and raxml.

orthofinder/all_inputs/Results_Jul02/I2_3X2_gt0.7_st_0.01_alns_1-4/ RAxML_<>.merged_clusters_<>:
A through raxml tree reconstrction of a supermatrix of all the OCs, following a treeCL analysis confirming their shared phylogeny.

Figures

5. Mitochondrial genome assembly

Notebook file name: Mitochondrial_genome_assembly.ipynb

Related files:

<sample name>_mitobim: mitobim assembly based on mitochondrial gene seeds.
mito_references: reference mitochondrial genomes from ncbi.

6. Mitochondrial genome annotation

Notebook file name: Mitochondrial_genomes_annotation.ipynb

Related files:

mitochondrial_assemblies/<sample name>_genes.fasta: fasta files of mitochondrial genes as predicted by exonerate.

7. Mitochondrial genome phylogenomics

Notebook file name: Mitochondrial_genomes_tree.ipynb

Related files:

mitochondrial_assemblies/phylogenetic_analysis: all the files associated with the reprophylo pipeline.

Figures

8. Intra-genome identity among homeolog gene pairs

Notebook file name: Intra_genome_sequence_divergence.ipynb

Related files:

intrablast_p_ident_dict.pkl: pairwise homoeolog identity values for all the samples.

Figures:

9. Coverage ratio between homeolog contigs within a genome

Notebook file name: Median_ratio.ipynb

Related files:

<sample>_contig_pairs_bwa: read mapping to homoeolog contig pairs.
coverage_ratio_histograms: outputs.
genes_to_contigs.pkl: contig assignment of genes.
OG_contig_relationship.pkl: contig assignments of OCs.
contig_pairs_data: fasta files with contigs pairs.

Figures

10. Gene Conversion

Notebook file name: GeneConversion.ipynb

Related files:

synteny: all the related files.

Figures

11. Transposable elements

Notebook file name: TE.ipynb

Related files:

TEs: all the related files.

Figures

12. Intra and interspecific genetic diversity

Notebook file name: GeneticVariation.ipynb

Files

README.md

Latest commit

History

README.md

File metadata and controls

MIG-Phylogenomics

Read data

Raw paired read libraries

Genome assembly scripts

Quality trimmed paired read file

25M read subset of the first trimmed read file

Notebooks and related files

0. Dependencies

1. CDSs and proteins from genome assemblies

Related files:

2. Map-assemble genes from read data for samples without assemblies

Related files

3. Orthology clustering

Related files:

Figures

4. Nuclear phylogenomics

Related files:

Figures

5. Mitochondrial genome assembly

Related files:

6. Mitochondrial genome annotation

Related files:

7. Mitochondrial genome phylogenomics

Related files:

Figures

8. Intra-genome identity among homeolog gene pairs

Related files:

Figures:

9. Coverage ratio between homeolog contigs within a genome

Related files:

Figures

10. Gene Conversion

Related files:

Figures

11. Transposable elements

Related files:

Figures

12. Intra and interspecific genetic diversity

Figures