Code used in the study of Petunia axillaris X P. exserta hybrid necrosis, associated with publication.
Authors: Chaobin Li, Marta Binaghi, Vivien Pichon, Gina Cannarozzi, Loreta Brandao de Freitas, Mathieu Hanemian, Cris Kuhlemeier
Author of this page: Marta Binaghi
Genome sequence and annotation used in all the analyses: P. axillaris N version 4.03 (submitted to NCBI).
To identify the regions of the genome asssociated with the presence and abscence of the necrosis phenotype, we applied a BSR-Seq approach. This is similar to a BSA (bulk segregant analysis) but is performed on RNA-seq data. The approach is very similar to that used by Soyk et al. 2017 doi.org/10.1038/ng.3733.
We produced two F2 populations, coming from the crosses P. axillaris N X P. exserta and P. exserta X P. axillaris N. The plants from these populations were then phenotyped for the presence and absence of necrotic symptoms and were grouped accordingly. Note that in the article we only present the results from the cross P. axillaris N X P. exserta. The raw reads for the other cross are nonetheless available on SRA.
We performed RNA extraction (see article materials and methods for details), and pooled the necrotic individuals together in a sample, and the healthy individual in another sample.
Was performed by Novogene, with the Eukaryotic RNA-seq (polyA enriched library prep).
- paired end reads
- 150 bp read length
- 250-300 bp insert cDNA library
- Sanger Illumina 1.9 encoding
Are available on NCBI BioProject PRJNA708139, under SRA accessions:
- SRR13907521: healthy individuals, P. exserta X P. axillaris N
- SRR13907522: necrotic individuals, P. exserta X P. axillaris N
- SRR13907523: healthy individuals, P. axillaris N X P. exserta
- SRR13907524: necrotic individuals, P. axillaris N X P. exserta
Forward reads are numbered 1, reverse are numbered 2.
Read quality assessment performed with fastqc.
Reads trimmed with trimmomatic parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
.
Script bsa01_quality_trim.sh.
Read numbers before and after cleaning listed in bsa_read_alignment_stats.csv.
Genome index with STAR: bsa02_STAR_index.sh.
Alignment with bsa03_STARmapping.sh.
--sjdbOverhang 149
--outFilterType BySJout
--outFilterMultimapNmax 20
--twopassMode Basic
Aligned reads number in bsa_read_alignment_stats.csv.
We then add the read group info in each bam file with bsa04_readGroup_index.sh.
And mark duplicated reads with bsa05_markdup.sh.
We called variants using GATK v4.0.4.0, using the haplotypeCaller and filtering out variants based on quality values because Petunia does not have a high quality variants database to use BQSR.
Genome dictionary created in a previous step with bsa04_readGroup_index.sh, same script to add read group info in the bam files.
We called variants with bsa06_SNPcalling.sh, with parameters
--dont-use-soft-clipped-bases
--pcr-indel-model NONE
--stand-call-conf 30
--native-pair-hmm-threads 4
We then plot the distribution of the quality values of the variants to know if the GATK suggested filters are fine. This is done with bsa07_SNPselect_qualityplot.sh which uses plot_vcfq_distribution.R to make the plots.
We then apply the filters for quality with bsa08_SNPfilter.sh. Filters:
-window 10
-cluster 3
--filter-expression "QD<2.0"
--filter-name "QD"
--filter-expression "FS>30.0"
--filter-name "FS"
Then we select variants passing filters and only the biallelic SNPs. We then keep only variants where the read depth is at least 100 and then we thin the dataset to 100bp.
Number of variants in each dataset is listed in bsa_variant_numbers.csv.
The final set of variants is available in BSA_SNP_biallelic_gatkselected_minDP100_thin100.recode.vcf.
Is performed in R, with script bsa09_analysis.R.
The plots shown in the manuscript are obtained with script bsa10_plots_manuscript.R.
Low coverage sequencing performed to define the boundaries of the introgression region in the IL5 line (Hermann et al 2015, https://link.springer.com/article/10.1007/s00425-015-2251-2).
Library preparation and sequencing were performed by the Next Generation Sequencing platform of the University of Bern, with settings:
- whole genome sequencing library
- Paired end
- 150 bp long reads
- Illumina HiSeq 3000
Have been uploaded to NCBI SRA under BioProject PRJNA705072.
- SRR13809747: sequencing sample KMH1, recombinant line ID 1_A1 (Extended data table 3)
- SRR13809746: sequencing sample KMH2, recombinant line ID 7_A1 (Extended data table 3)
- SRR13809741: sequencing sample KMH3, recombinant line ID 18_E3 (Extended data table 3)
- SRR13809740: sequencing sample KMH4, recombinant line ID 26_E1 (Extended data table 3)
- SRR13809739: sequencing sample KMH5, recombinant line ID 1_E1 (Extended data table 3)
- SRR13809738: sequencing sample KMH6, recombinant line ID 7_E1 (Extended data table 3)
- SRR13809737: sequencing sample KMH7, recombinant line ID 24_E1 (Extended data table 3)
- SRR13809736: sequencing sample KMH8, recombinant line ID 18_A1 (Extended data table 3)
- SRR13809735: sequencing sample KMH9, recombinant line ID 10_A1 (Extended data table 3)
- SRR13809734: sequencing sample KMH10, recombinant line ID 29_A1 (Extended data table 3)
- SRR13809745: sequencing sample KMH11, recombinant line ID 26_A1 (Extended data table 3)
- SRR13809744: sequencing sample KMH12, recombinant line ID 25_A1 (Extended data table 3)
- SRR13809743: sequencing sample KMH13, recombinant line ID 39_A3 (Extended data table 3)
- SRR13809742: sequencing sample KMH14, recombinant line ID 34_A1 (Extended data table 3)
Forward reads are numbered 1, reverse are numbered 2.
Reads renamed with il01_rename_raw_reads.sh.
Read quality assessment performed with fastqc. Reads trimmed with trimmomatic parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:100
.
Script il02_fastqc_trim.sh. Summary of raw and trimmed reads obtained with fastq-stats, il03_stats_index_genome.sh.
Read numbers before and after cleaning listed in il_read_alignment_stats.csv.
Genome index done in script il03_stats_index_genome.sh, alignment performed with BWA MEM, with default parameters in script il04_align.sh.
The proportion of reads aligned is shown in the table il_read_alignment_stats.csv.
Is performed with GATK 4.0.4.0, following GATK best practices for organisms without a database of high quality variants. Variants are called in ERC mode. Variant calling done with script il05_call_variants.sh, then the single samples are combined in a vcf file with il06_combineGvcfs.sh. The same script is also extracting the quality values to plot them and verify that the hard filters of GATK are suitable for the dataset. The plotting is done with an R script plot_vcfq_distribution.R. The plots are shown in il_snp_quality.pdf and il_indel_quality.pdf.
I then apply the hard filters to the variants in script il07_filter_variants.sh. In order to visualise more easily the boundaries of the introgression, we filter the resulting vcf file to keep only sites on chromosome 2, with a call rate of 100% (all samples have a called genotype at the site), and we extract the homozygous genotype positions.
To identify genes differentially expressed in the necrotic and healthy plants from the introgression lines we performed an RNAseq experiment. The samples used are plants constituting the progeny of a single selfed plant heterozygous for the introgression IL5 (see article materials and methods for details). In the progeny of this heterozygous plant we have some plants homozygous exserta, homozygous axillaris and heterozygous in the introgression. Three plants per genotype were selected, and one leaf per plant was collected. The tissue was collected from leaves of the homozygous axillaris IL when they just started displaying necrotic symptoms, and equivalent tissue was collected from the other genotypes.
The RNAseq data were used to perform a differential expression (DE) analysis.
Was performed at the Lausanne Genomics Technologies Facility.
- Single end reads
- 125 bp long
- Illumina HiSeq
- TruSeq stranded RNA library prep
Have been uploaded to NCBI SRA under BioProject PRJNA705649.
- SRR13861738: heterozygous introgression, replicate 3 (kmh9)
- SRR13861739: heterozygous introgression, replicate 2 (kmh8)
- SRR13861740: heterozygous introgression, replicate 1 (kmh7)
- SRR13861741: homozygous exserta introgression, replicate 3 (kmh6)
- SRR13861742: homozygous exserta introgression, replicate 2 (kmh5)
- SRR13861743: homozygous exserta introgression, replicate 1 (kmh4)
- SRR13861744: homozygous axillaris introgression, replicate 3 (kmh3)
- SRR13861745: homozygous axillaris introgression, replicate 2 (kmh2)
- SRR13861746: homozygous axillaris introgression, replicate 1 (kmh1)
Raw reads are renamed with script rs01_rename_raw_reads.sh.
Quality control and trimming in script rs02_fastqc_trim.sh. Parameters for trimming:
trimmomatic SE \
-threads 4 \
-phred33 \
kmh${SLURM_ARRAY_TASK_ID}.fastq.gz \
kmh${SLURM_ARRAY_TASK_ID}_trim.fastq.gz \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 \
ILLUMINACLIP:TruSeq3-SE.fa:2:30:3
The read numbers before and after trimming are reported in rs_read_alignment_stats.csv.
Note that I build another STAR index for the genome because in the genomeGenerate step they require the read length and here it is 125 bp. So I use script rs03_STAR_genome_index.sh to redo the genome index. Parameters:
STAR --runMode genomeGenerate \
--runThreadN 4 \
--genomeDir ${scdir}/data/genomes/Pax_125bp \
--genomeFastaFiles ${scdir}/data/genomes/Peax402INV.fasta \
--sjdbGTFfile ${scdir}/data/genomes/peaxi162AQ_Peax402INV.cds.gff \
--sjdbOverhang 124 \
--sjdbGTFtagExonParentTranscript Parent \
--genomeChrBinNbits 16 --limitGenomeGenerateRAM 24000000000
The alignment is then performed with rs04_STARmapping.sh. Aligned read numbers reported in rs_read_alignment_stats.csv. Alignment performed with parameters:
STAR --genomeDir data/genomes/Pax_125bp \
--runThreadN 16 \
--readFilesIn data/raw/rs_rawreads/kmh${SLURM_ARRAY_TASK_ID}_trim.fastq.gz \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--outFileNamePrefix data/raw/rs_aligned_reads/kmh${SLURM_ARRAY_TASK_ID}_trim_STAR_ \
--readFilesCommand zcat \
--genomeLoad NoSharedMemory
Is performed with subread featureCounts function, in script rs05_featureCounts.sh. Parameters are:
featureCounts -T 16 \
-t gene \
-s 2 \
-g ID \
-a ${scdir}/data/genomes/peaxi162AQ_Peax402INV.cds.gff \
-o ${scdir}/data/raw/rs_counts/rs_counts.txt kmh1_trim_STAR_Aligned.sortedByCoord.out.bam kmh2_trim_STAR_Aligned.sortedByCoord.out.bam kmh3_trim_STAR_Aligned.sortedByCoord.out.bam kmh4_trim_STAR_Aligned.sortedByCoord.out.bam kmh5_trim_STAR_Aligned.sortedByCoord.out.bam kmh6_trim_STAR_Aligned.sortedByCoord.out.bam kmh7_trim_STAR_Aligned.sortedByCoord.out.bam kmh8_trim_STAR_Aligned.sortedByCoord.out.bam kmh9_trim_STAR_Aligned.sortedByCoord.out.bam
The DE analysis is done with DESeq2 in the R markdown script rs06_DE_analysis.Rmd. The output is in rs06_DE_analysis.html.
- bwa/0.7.17
- fastqc/0.11.7l
- fastq-stats ea-utils/1.1.2
- GenomeAnalysisTK/4.0.4.0
- picard-tools/2.18.11 or 2.9.0, see scripts to know which version was used
- R on the computing cluster: R/3.4.2
- R on the local machine: R/3.3.3
- samtools/1.8
- SnpEff/4.3t
- STAR/2.6.0c
- subread/1.6.0
- trimmomatic/0.36
- vcftools/0.1.15
For the versions, see at the bottom of the R scripts in the code folder.
- DESeq2 Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
- dplyr Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr
- ggplot2 Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org
- optparse https://cran.r-project.org/package=optparse
- pheatmap https://cran.r-project.org/package=pheatmap
- PoiClaClu https://cran.r-project.org/package=PoiClaClu
- RColorBrewer https://cran.r-project.org/package=RColorBrewer
- tidyr Wickham H, Girlich M (2022). tidyr: Tidy Messy Data. https://tidyr.tidyverse.org, https://github.com/tidyverse/tidyr
- vcfUtils custom package, available here vcfUtils.tar.gz