Guide to OSD 2014 data
The purpose of this guide is to give a consolidated and authoritative overview of the data from OSD 2014.
After reading it, you should
- have an overview of which data from OSD 2014 is available where
- understand how the data was generated
- be in a position to use and interpret the data correctly
In the days around the June 2014 solstice, more than 150 teams of scientists collected samples around the world.
This resulted in:
- 150 metagenome samples from around 150 sites
- 155 samples collected by protocol NPL022 and 16S amplicon sequenced by LGC (lgc)
- 155 samples collected by protocol NPL022 and 18S amplicon sequenced by LGC (lgc)
- 7 samples collected by protocol NPL022 and 16S amplicon sequenced by Australia (ramaciotti)
- 30 samples collected by protocol NE08 and 18S V4 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
- 32 samples collected by protocol NE08 and 18S V9 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
Details about protocols NPL022 and NE08 are in the OSD Handbook
[Material and Methods for "OSD 2014 - from sampling to sequencing" can be downloaded here](http://mb3is.megx.net/osd-files?path=/2014/protocols)
Please note that several sequencing centers were involved:
- LGC Genomics (shorthand: lgc), our main sequencing center, which sequenced the metagenomes and the 16S/18S amplicons from protocol NPL022
- LifeWatch Italy (shorthand: lw), which kindly provided sequencing for protocol NE08 samples (see protocol for details)
- Australia - Ramaciotti (shorthand: ramaciotti), which for legal reasons had to sequence all 7 Australian sites
The sequence data as delivered by the sequencing centers was pre-processed to derive common datasets on which to base follow-up analyses. Please see the wiki page on pre-processing for details.
In summary, the pre-processing results in two kinds of quality-controlled sequence datasets, raw and workable, for each input sequence set:
- For amplicon data the output files per sample are:
  - raw: non-merged
  - workable: merged
- For shotgun data the output files per sample are:
  - raw: non-merged (used e.g. by EMG)
  - workable output files:
    - merged (used e.g. by mg-traits)
    - non-merged (used e.g. for assemblies)
- Measured by OSD Site Coordinators: OSD 2014 Environmental Metadata
- CSV with '|' (pipe) as delimiter and UTF-8 encoded
- take care to adjust the import settings accordingly when loading this file into e.g. Excel
- Detailed documentation of file structure and content
- Calculated environmental data based on data from public environmental databases: OSD 2014 Environmental Ancillary Data
- Including data based on Halpern et al. (as of 2015-12-05)
- Documentation can be found as readme sheets in the file
- This data was kindly generated by Dr. Shruti Malaviya, by crawling related public datasets
- Scanned copies of log sheets are available at PANGAEA
- ENA archived data
- http://www.ebi.ac.uk/ena/data/view/PRJEB8682
- as of 2014-04-30 all metagenomes,16S and 18S raw data from OSD protocol NPL022
- LifeWatch 18S available (see below), archiving at ENA pending
- All workable data is available here
- All raw data are available here
- Metagenome analysis by EMG based on raw data
- Metagenome analysis by MG-Traits based on workable data
- 16S/18S analysis by SILVAngs
- see details below
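As a rough illustration of the import settings mentioned for the environmental metadata (a '|' delimiter and UTF-8 encoding), the following Python sketch reads such a file with the standard `csv` module. The function names, the demo values and the `site_name` column are hypothetical; only `osd_label` is a column name taken from this guide.

```python
import csv
import io


def read_pipe_csv(handle):
    """Parse a '|'-delimited CSV stream into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="|"))


def load_metadata(path):
    """Open a UTF-8 encoded, '|'-delimited file and return its rows."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as handle:
        return read_pipe_csv(handle)


# In-memory stand-in for the real file. Only osd_label is a column name taken
# from this guide; the site_name column and all values are made up.
demo = io.StringIO("osd_label|site_name\nlabel-1|Site A\nlabel-2|Site B (é)\n")
rows = read_pipe_csv(demo)
print(rows[1]["site_name"])
```

For the real file, `load_metadata` would be called with the local path of the downloaded CSV; every value is returned as a string, so numeric columns still need converting.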
The OSD 2014 Environmental Metadata includes a column named `osd_label`.
Here you can find a file which maps these `osd_label` values to the respective ENA RUN identifiers.
In this dataset, the distinction between 16S and 18S is recorded in the Run alias. The ENA browser displays the Run title (a short informative description) rather than the Run alias (a submitter-provided unique name, frequently an ID meaningful only to the submitter). For example, Run ERR867761 has the alias `OSD3-lgc-genomics-18S-199` and the title "Illumina MiSeq paired end sequencing; Illumina MiSeq sequencing of sample OSD3_2014-06-20_0m_NPL022 from OSD-JUN-2014".
Furthermore, Run ERR867760 belongs to Experiment ERX947555, and Run ERR867761 belongs to Experiment ERX947554.
Each Experiment has its own description, in which the submitter clearly states which amplicon has been sequenced:
- http://www.ebi.ac.uk/ena/data/view/ERX947555 (marine 16S rDNA amplicon sequencing)
- http://www.ebi.ac.uk/ena/data/view/ERX947554 (marine 18S rDNA amplicon sequencing)
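Because the 16S/18S distinction sits in the Run alias, a small helper can recover it once the aliases have been exported from ENA. The sketch below is in Python and assumes that every alias follows the layout of the single example given here (`OSD3-lgc-genomics-18S-199`); the pattern, the function name and the 16S alias used in the demo are hypothetical.

```python
import re

# Assumed alias layout, inferred from the one example in this guide:
#   <site>-<sequencing centre words>-<16S|18S>-<number>
ALIAS_PATTERN = re.compile(
    r"^(?P<site>OSD\d+)-(?P<center>.+)-(?P<amplicon>16S|18S)-(?P<number>\d+)$"
)


def amplicon_from_alias(alias):
    """Return '16S' or '18S' for a matching Run alias, otherwise None."""
    match = ALIAS_PATTERN.match(alias)
    return match.group("amplicon") if match else None


# The first alias is the example from this guide; the second is made up.
for alias in ("OSD3-lgc-genomics-18S-199", "OSD3-lgc-genomics-16S-198"):
    print(alias, "->", amplicon_from_alias(alias))
```

Aliases that do not match the assumed layout yield `None` rather than a guess, so they can be checked by hand against the Experiment description.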
We make all other data (i.e. data not archived in public repositories) available via the MPI Bremen file server. This is the highest-level entry point.
All raw metagenomic datasets are archived at the European Nucleotide Archive (ENA).
You can browse and download the archived metagenomes at ENA here:
-
Based on the raw datasets as archived at ENA, the EMG pipeline analyzed all metagenomes:
- You can browse the EMG results here: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/
- All workable metagenome data are available here.
- See pre-processing pipeline for further documentation
Clicking on a sample name takes you to a page where you can view and download the results of the EBI analysis pipeline (EMG), either via the hyperlinks labelled "Taxonomy" or "Function" or via the download icon in the "Analysis Results" column. You can also download the sequence data itself from these download pages. For example, you can download the data and results for the sample identified as OSD15_2014-06-21_0m_NPL022 (ERS667653) here.
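Sample names such as OSD15_2014-06-21_0m_NPL022 appear to encode the site number, sampling date, depth and protocol. The Python sketch below splits a name into these parts; the field layout is an assumption inferred from the two sample names shown in this guide, and the pattern and function name are hypothetical rather than part of any OSD tool.

```python
import re
from datetime import date

# Assumed layout: OSD<site>_<YYYY-MM-DD>_<depth>m_<protocol>
SAMPLE_PATTERN = re.compile(
    r"^OSD(?P<site>\d+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<depth>\d+(?:\.\d+)?)m_(?P<protocol>\w+)$"
)


def parse_sample_name(name):
    """Split a sample name into site, date, depth (metres) and protocol."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected sample name layout: {name!r}")
    return {
        "site": int(match.group("site")),
        "date": date.fromisoformat(match.group("date")),
        "depth_m": float(match.group("depth")),
        "protocol": match.group("protocol"),
    }


parsed = parse_sample_name("OSD15_2014-06-21_0m_NPL022")
print(parsed["site"], parsed["date"], parsed["protocol"])
```

Names that deviate from the assumed layout raise a `ValueError` instead of being parsed silently, which makes unexpected variants easy to spot.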
- See the OSD assemblies page
We analysed the rDNA sequences identified by the EMG pipeline with SILVAngs. In addition, we identified the rDNAs in the EMG-derived dataset using the SINA aligner and analysed them with SILVAngs.
- Direct link to EMG data analysed with SILVAngs
- Direct link to EMG data screened with SINA aligner and analysed with SILVAngs
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
The main analysis was done using the SILVAngs pipeline on the workable sequence datasets. SILVA taxonomy version 119.1 was used for all 16S datasets and version 119 for all 18S datasets; the differences are very minor and can be viewed here.
The analysis was done for the sequence data as obtained from LGC, LifeWatch and Australia.
The MED exports contain a taxonomy path for each sequence inside the FASTA header. However, this taxonomy is not filtered by the 93% quality value that is the default in SILVAngs. Therefore, to be consistent with the other SILVAngs exports, an extra file with filtered taxonomy was added to the MED folder. See this issue for more details.
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
- Direct link to 16S data
- Details of method and overview of results
- MED-formatted fasta exports of 16S data by sample (please read the note on taxonomy paths in MED exports)
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
- Direct link to 18S data
- Details of method and overview of results
- MED-formatted fasta exports of 18S data by sample (please read the note on taxonomy paths in MED exports)
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
- Direct link to 16S data
- Details of method and overview of results
- MED-formatted fasta exports of 16S data by sample (please read the note on taxonomy paths in MED exports)
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
- Direct link to V4 data
- Details of method and overview of results
- MED-formatted fasta exports of 18S data by sample (please read the note on taxonomy paths in MED exports)
NB: One of the cluster/OTU files in the SILVAngs 'exports' folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.