Scripts for the compositional analysis of metagenomic shotgun sequencing data calssified by kraken2
Here, I provide the code that was used for the processing, filtering, and analysis of the metagenomic shotgun sequencing data of sedimentary ancient DNA from a marine sediment core from the Bering Sea for the paper by Zimmermann et al. Marine ecosystem shifts with deglacial sea-ice loss inferred from ancient DNA shotgun sequencing. Nat Commun 14, 1650 (2023). https://doi.org/10.1038/s41467-023-36845-x
The processing of the sequencing data and taxonomic classification were carried out on a HPC
The code to process the data is part of this README. Further data processing, filtering and the analysis of the data was carried out in R
. The R-scripts are part of this repository and the README guides through the order of running them.
Overview of used programs, versions used for the analysis, and the link to the publication
Program | Version | Link to publication |
---|---|---|
FastQC | 0.11.9 | Andrews et al. 2010 |
FastUniq | 1.1 | Xu et al. 2012 |
Fastp | 0.20.0 | Chen et al. 2018 |
kraken2 | 2.0.8-beta | Wood et al. 2019 |
HOPS | 0.3.4 | Hübler et al. 2019 |
R | 4.0.3 |
for i in *fastq.gz; do fastqc $i -t 4 -o fastqc_reports/; done
fastuniq -i samplename.list -t q -o samplename_dedup_R1.fastq -p samplename_dedup_R2.fastq
Trimming low quality ends and removing reads with low complexity, removal of residual adapters and merging of overlapping forward and reverse reads. Here is the link to the fastp manual for further information.
fastp --in1=samplename_dedup_R1.fastq --in2=samplename_dedup_R2.fastq --out1=unmerged_forward.fastq --out2=unmerged_reverse.fastq \
--unpaired1=unpaired_forward.fastq --unpaired2=unpaired_reverse.fastq --merge \
--merged_out=merged_samplename.fastq --failed_out=failed_samplename.fastq --thread=36 --verbose \
--adapter_fasta=adapter_2020_04_06.fasta --length_required=30 --overlap_len_require=10 --overlap_diff_percent_limit=20 \
--correction --overlap_diff_limit=5 --low_complexity_filter --complexity_threshold=30 --qualified_quality_phred=15 \
--unqualified_percent_limit=40 --n_base_limit=5 --json=samplename.json --html=samplename.html \
--report_title="samplename_fastp_report" --overrepresentation_analysis --cut_front --cut_tail --cut_window_size=4 \
--cut_mean_quality=10 -q 10 -x 10
Here is the link to the kraken2 manual for further information.
kraken2-2.0.8-beta/kraken2 --conf=0.2 --db kraken2_db merged_samplename.fastq --threads 36 \
--output samplename_merged.kraken --report samplename_merged.report
kraken2-2.0.8-beta/kraken2 --conf=0.2 --paired --db kraken2_db unmerged_forward.fastq \
unmerged_reverse.fastq --threads 36 --report samplename_unmerged.report \
--output samplename_unmerged.kraken
awk 'BEGIN{FS=OFS="\t"} {print FILENAME, $0}' *0.2*report | awk 'BEGIN{FS=OFS="\t"} {gsub(/^[ \t]+/, "", $7)}1' > APMG689_nt0.2_Kraken2.txt
To check whether correlations could be influenced by a reference data bias of closely related taxa, we inspected the our custom full-nt kraken database.
Formal analysis was carried out afterwards using the R-script fish_db_inspection.R
.
kraken2-inspect --threads 36 --db /home/ollie/projects/bio/db/kraken2/nt_2021_04_db \
--use-mpa-style >> ~/krakeninspectdb/kraken2-inspect_nt_2021_04_db.txt
Here is the link to the HOPS manual for further information.
We prepared 3 datasets in which samples were binned according to their age to estimate whether damage patterns increase with time: set1 (1.08–5.6 ka), set2 (6.3–12.6 ka), set3 (13.6–19.9 ka). The samples were concatenated using the cat
command and the different datasets were processed after each other. The configuration file and the taxa list can be found in the folder Input_files/HOPS.
#####Author: Lars Harms
# set variables
#================================================
WORKDIR=/shotgun/APMG689/hops/
INPUT=/shotgun/APMG689/output/out.fastp/set1.fastq.gz
OUTPUT=/shotgun/APMG689/hops/output
CONFIG=/shotgun/APMG689/hops/configfile.txt
INDEXDB=/db/hops/nt-hops-20-11-03_step8/
TAXFILE=/shotgun/APMG689/hops/taxalist_HOPS_APMG689.txt
NCBIRESC=/db/hops/ncbi
MEM=1400
cpus-per-task=28
# preparing the working environment
#================================================
cd $WORKDIR
module load bio/hops/0.34
# create Hops config file
#================================================
echo "preProcess=0
alignment=1
dupRemOff=0
filter=def_anc
index=${INDEXDB}
pathToList=${TAXFILE}
resources=${NCBIRESC}
useSlurm = 0
threadsMalt=${SLURM_CPUS_PER_TASK}
maxMemoryMalt=${MEM}
threadsMaltEx=${SLURM_CPUS_PER_TASK}
maxMemoryMaltEx=${MEM}
threadsPost=${SLURM_CPUS_PER_TASK}
maxMemoryPost=256" > ${CONFIG}
# tasks to be performed
#================================================
srun hops -Xmx${MEM}G -input ${INPUT} -output ${OUTPUT} -m full -c ${CONFIG}
The following R-scripts and python-script are run in the order they are listed below. Input files for running the scripts are included in the folder Input_files.
- Cleaning up the kraken2-output and filtering for family level with
1_prep_famlevel.R
- Making taxonomic subgroups based on manually checked taxa lists
2_prep_taxalists.R
- Resampling of reads to account for differences in per sample read counts for pelagic taxa
3_resampling_plankton_pelagic.R
and benthic taxa3_resampling_benthic.R
. This script is based on the github script by Stefan Kruse R-rarefaction. - Compositional analysis, stratigraphic diagrams, Spearman correlations, and network analysis for pelagic taxa
4_correlation_analysis_networks_plankton_pelagic.R
and benthic taxa4_correlation_analysis_benthic.R
. - We calculated ratios between pelagic and benthic taxa counts and between phototrophic bacteria and phototroophic protists using the script
ratio_plots.R
.