Guide to OSD 2014 data
The purpose of this guide is to give a consolidated and authoritative overview of the data from OSD 2014.
After reading it you should
- have an overview of which OSD 2014 data is available where
- understand how the data was generated
- be in a position to use and interpret the data correctly
In the days around the June 2014 solstice, more than 150 teams of scientists collected samples around the world.
This resulted in:
- 150 metagenome samples from around 150 sites
- 155 samples collected by protocol NPL022 and 16S amplicon sequenced by LGC (lgc)
- 155 samples collected by protocol NPL022 and 18S amplicon sequenced by LGC (lgc)
- 7 samples collected by protocol NPL022 and 16S amplicon sequenced by Australia (ramaciotti)
- 30 samples collected by protocol NE08 and 18S V4 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
- 32 samples collected by protocol NE08 and 18S V9 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
Details about protocols NPL022 and NE08 are in the OSD Handbook
Please note that the sequencing was done by several sequencing centers:
- LGC Genomics (shorthand: lgc), our main sequencing center, which sequenced the metagenomes and the 16S/18S amplicons from protocol NPL022
- LifeWatch Italy (shorthand: lw), which kindly provided sequencing for protocol NE08 samples; see protocol for details
- Australia - Ramaciotti (shorthand: ramaciotti), who for legal reasons had to sequence all 7 Australian sites
The sequence data as delivered by the sequencing centers was pre-processed to derive common data sets on which to base follow-up analyses. Please see the wiki page on pre-processing for details.
In summary, the pre-processing results in two kinds of quality-controlled sequence datasets, raw and workable, for each input sequence set:
- For amplicon data the output files per sample are:
  - raw: non-merged
  - workable: merged
- For shotgun data the output files per sample are:
  - raw: non-merged (used e.g. by EMG)
  - workable output files:
    - merged (used e.g. by mg-traits)
    - non-merged (used e.g. for assemblies)
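The summary above can also be written down as a small lookup table, which may help when scripting against the file server. This is only a sketch that transcribes the list; the names `PREPROCESSING_OUTPUTS` and `read_layouts` are hypothetical and not part of any OSD tooling.

```python
# Pre-processing outputs per sample, transcribed from the summary above.
# The dictionary and function names are illustrative assumptions.
PREPROCESSING_OUTPUTS = {
    "amplicon": {"raw": ("non-merged",), "workable": ("merged",)},
    "shotgun": {"raw": ("non-merged",), "workable": ("merged", "non-merged")},
}


def read_layouts(data_type, dataset_kind):
    """Return the read layouts available for a data type and dataset kind."""
    try:
        return PREPROCESSING_OUTPUTS[data_type][dataset_kind]
    except KeyError:
        raise ValueError(
            f"unknown combination: {data_type!r} / {dataset_kind!r}"
        ) from None
```

For example, `read_layouts("shotgun", "workable")` lists both layouts, since workable shotgun data comes merged (e.g. for mg-traits) and non-merged (e.g. for assemblies).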
- Sample and environmental data *
- ENA archived data
- http://www.ebi.ac.uk/ena/data/view/PRJEB8682
- as of 2014-04-30: all metagenome, 16S and 18S raw data from OSD protocol NPL022
- LifeWatch 18S data still pending
- All workable data is available here
- Metagenome analysis by EMG based on raw data
- Metagenome analysis by MG-Traits based on workable data
- 16S/18S analysis by SILVAngs
- see details below
In the ENA dataset, the distinction between 16S and 18S is recorded in the Run alias. The ENA browser displays the Run title (a short informative description) rather than the Run alias (a submitter-provided unique name, frequently a unique ID meaningful only to the submitter). For example, for Run ERR867761:
<RUN alias="OSD3-lgc-genomics-18S-199"
<TITLE>Illumina MiSeq paired end sequencing; Illumina MiSeq sequencing of sample OSD3_2014-06-20_0m_NPL022from OSD-JUN-2014</TITLE>
Furthermore, Run ERR867760 belongs to Experiment ERX947555, and Run ERR867761 belongs to Experiment ERX947554.
Each Experiment has its own description, in which the submitter clearly states which amplicon has been sequenced:
- http://www.ebi.ac.uk/ena/data/view/ERX947555 (marine 16S rDNA amplicon sequencing)
- http://www.ebi.ac.uk/ena/data/view/ERX947554 (marine 18S rDNA amplicon sequencing)
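Because the 16S/18S distinction lives in the Run alias, it can be useful to pull it out programmatically. The following is a minimal sketch that assumes every alias follows the shape of the single example shown here (`OSD3-lgc-genomics-18S-199`), i.e. `OSD<site>-<center>-<16S|18S>-<number>`. That pattern, the function name `parse_run_alias`, and the field names are assumptions for illustration; the meaning of the trailing number is not stated in this guide.

```python
import re
from typing import NamedTuple


class RunAlias(NamedTuple):
    site: int      # OSD site number, e.g. 3 for "OSD3"
    center: str    # sequencing center part of the alias, e.g. "lgc-genomics"
    marker: str    # "16S" or "18S"
    suffix: int    # trailing number; its meaning is not documented here


# Assumed alias shape, inferred from one example only.
_ALIAS_RE = re.compile(
    r"^OSD(?P<site>\d+)-(?P<center>.+)-(?P<marker>16S|18S)-(?P<suffix>\d+)$"
)


def parse_run_alias(alias: str) -> RunAlias:
    """Split a Run alias into its parts, or raise ValueError if it does not fit."""
    match = _ALIAS_RE.match(alias)
    if match is None:
        raise ValueError(f"alias does not match the assumed pattern: {alias!r}")
    return RunAlias(
        site=int(match["site"]),
        center=match["center"],
        marker=match["marker"],
        suffix=int(match["suffix"]),
    )
```

With this, `parse_run_alias("OSD3-lgc-genomics-18S-199").marker` gives `"18S"`. Aliases from other sequencing centers may be formatted differently, so the `ValueError` branch should be expected rather than treated as a data error.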
We make all other data (i.e. data not archived in public repositories) available via the MPI Bremen file server. This is the highest-level entry point.
All metagenomic raw datasets are archived at the European Nucleotide Archive (ENA).
You can browse and download the archived metagenomic data at the European Nucleotide Archive (ENA) here:
-
Based on the raw datasets as archived at ENA, the EMG pipeline analyzed all metagenomes:
- You can browse the EMG results here: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/
Clicking on a sample name takes you to a page where you can view and download the results of the EBI analysis pipeline (EMG) by clicking on the hyperlinks labelled "Taxonomy" or "Function", or on the download icon in the "Analysis Results" column. You can also download the sequence data itself from these download pages. For example, you can download the data and results for the sample identified as OSD15_2014-06-21_0m_NPL022 (ERS667653) here.
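Sample names such as `OSD15_2014-06-21_0m_NPL022` appear to combine the site, the sampling date, the depth and the protocol, separated by underscores. The sketch below splits a name on that assumption; the field interpretation, the function name `parse_sample_label`, and the field names are inferred from the examples in this guide and are not an official specification.

```python
from datetime import date
from typing import NamedTuple


class SampleLabel(NamedTuple):
    site: int         # OSD site number, e.g. 15 for "OSD15"
    sampled_on: date  # sampling date
    depth_m: float    # sampling depth in metres
    protocol: str     # sampling protocol, e.g. "NPL022"


def parse_sample_label(label: str) -> SampleLabel:
    """Split a sample name, assuming the form OSD<site>_<YYYY-MM-DD>_<depth>m_<protocol>."""
    parts = label.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields: {label!r}")
    site, day, depth, protocol = parts
    if not site.startswith("OSD") or not depth.endswith("m"):
        raise ValueError(f"unexpected site or depth field: {label!r}")
    return SampleLabel(
        site=int(site[3:]),
        sampled_on=date.fromisoformat(day),
        depth_m=float(depth[:-1]),
        protocol=protocol,
    )
```

Calling `parse_sample_label("OSD15_2014-06-21_0m_NPL022")` yields site 15, date 2014-06-21, depth 0 m and protocol NPL022. Names that deviate from this layout raise `ValueError` instead of being silently misread.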
The main analysis was done with the SILVAngs pipeline on the workable sequence data set. SILVA taxonomy version 119.1 was used for all 16S datasets and version 119 for all 18S datasets; the differences are very minor and can be viewed here.
The analysis was done for the sequence data obtained from LGC, LifeWatch and Australia (Ramaciotti).
The MED exports contain a taxonomy path for each sequence inside the FASTA header. However, this taxonomy is not filtered by the 93% quality value that is the default in SILVAngs. Therefore, to be consistent with other SILVAngs exports, an extra file with filtered taxonomy was added to the MED folder. See this issue for more details.
- Direct link to 16S data
- Details of method and overview of results
- MED-formatted fasta exports of 16S data by sample
- Direct link to 18S data
- Details of method and overview of results
- MED-formatted fasta exports of 18S data by sample
- Direct link to 16S data
- Details of method and overview of results
- MED-formatted fasta exports of 18S data by sample