
Guide to OSD 2014 data

Ivo edited this page Jul 30, 2015 · 44 revisions

Introduction

The purpose of this guide is to give a consolidated and authoritative overview of the data from OSD 2014.

After reading it, you should

  • have an overview of which data from OSD 2014 is available where
  • understand how the data was generated
  • be in a position to correctly use and interpret the data

Overview

In the days around the June 2014 solstice, more than 150 teams of scientists collected samples around the world.

This resulted in:

  • 150 metagenome samples from around 150 sites
  • 155 samples collected by protocol NPL022 and 16S amplicon sequenced by LGC (lgc)
  • 155 samples collected by protocol NPL022 and 18S amplicon sequenced by LGC (lgc)
  • 7 samples collected by protocol NPL022 and 16S amplicon sequenced by Australia (ramaciotti)
  • 30 samples collected by protocol NE08 and 18S V4 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
  • 32 samples collected by protocol NE08 and 18S V9 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details

Details about protocols NPL022 and NE08 are in the OSD Handbook.

Please note that three different sequencing centers were involved:

  1. LGC Genomics (shorthand: lgc), our main sequencing center, which sequenced the metagenomes and the 16S/18S amplicons from protocol NPL022
  2. LifeWatch Italy (shorthand: lw), which kindly provided sequencing for protocol NE08 samples; see protocol for details
  3. Australia - Ramaciotti (shorthand: ramaciotti), which for legal reasons had to sequence all 7 Australian sites

Initial sequence data pre-processing

The sequence data as delivered by the sequencing centers was pre-processed in order to derive common datasets on which to base follow-up analyses. Please see the wiki page on pre-processing for details.

In summary, the pre-processing produces two kinds of quality-controlled sequence datasets, raw and workable, for each input sequence set:

  • For amplicon data the output files per sample are:

    • raw: non-merged

    • workable: merged

  • For shotgun data the output files per sample are:

    • raw: non-merged (used e.g. by EMG)

    • workable output files

      • merged (used e.g. by mg-traits)

      • non-merged (used e.g. for assemblies)

Data deposited in public archives and available on web sites

How to find the correct data at EBI

The distinction between 16S and 18S datasets is recorded in the Run alias. Note that the ENA browser displays the Run title (a short informative description) rather than the Run alias (a submitter-provided unique name, frequently a unique ID meaningful only to the submitter). For example, for ERR867761:

<RUN alias="OSD3-lgc-genomics-18S-199"

<TITLE>Illumina MiSeq paired end sequencing; Illumina MiSeq sequencing of sample OSD3_2014-06-20_0m_NPL022from OSD-JUN-2014</TITLE>

Furthermore, Run ERR867760 belongs to Experiment ERX947555, and Run ERR867761 belongs to Experiment ERX947554.

Each Experiment has its own description, in which the submitter clearly states which amplicon has been sequenced:

  • http://www.ebi.ac.uk/ena/data/view/ERX947555 (marine 16S rDNA amplicon sequencing)
  • http://www.ebi.ac.uk/ena/data/view/ERX947554 (marine 18S rDNA amplicon sequencing)
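If you have a list of Run aliases, the 16S/18S distinction can be read out of the alias itself. The following Python snippet is a minimal sketch only: the alias layout (site, sequencing center, marker, number, separated by hyphens) is an assumption inferred from the single example alias shown above, and the function name parse_run_alias is hypothetical. Aliases that do not follow this layout are reported as unmatched rather than guessed at.

```python
import re

# Assumed alias layout, inferred from the one example on this page:
#   <site>-<sequencing center>-<marker>-<number>
# e.g. OSD3-lgc-genomics-18S-199
ALIAS_PATTERN = re.compile(
    r"^(?P<site>OSD\d+)-(?P<center>.+)-(?P<marker>16S|18S)-(?P<number>\d+)$"
)


def parse_run_alias(alias):
    """Split a Run alias into its parts; return None if it does not match."""
    match = ALIAS_PATTERN.match(alias)
    if match is None:
        return None
    return match.groupdict()


info = parse_run_alias("OSD3-lgc-genomics-18S-199")
print(info["marker"])  # 18S
```

Returning None for unmatched aliases keeps the sketch from silently misclassifying runs whose alias follows a different convention.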

Additional supplementary/ancillary data

We make all other data (i.e. data not archived in public repositories) available via the MPI Bremen file server. This is the highest-level entry point.

Metagenomic data

Raw metagenomic datasets

All raw metagenomic datasets are archived at the European Nucleotide Archive (ENA).

You can browse and download the archived metagenomes at the European Nucleotide Archive (ENA) here:

Browsing EMG data tip

Clicking on a sample name will take you to a page where you can view and download the results of the EBI analysis pipeline (EMG), either by clicking on the hyperlinks labelled "Taxonomy" or "Function" or by clicking the download icon in the "Analysis Results" column. You can also download the sequence data itself from these download pages. For example, you can download the data and results for the sample identified as OSD15_2014-06-21_0m_NPL022 (ERS667653) here.
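Sample names such as OSD15_2014-06-21_0m_NPL022 appear to encode the site, the sampling date, the sampling depth and the protocol. The following Python snippet is a sketch under that assumption: the field order and meaning are inferred from the sample names shown on this page, not taken from a documented specification, and the function name parse_sample_name is hypothetical.

```python
from datetime import date, datetime


def parse_sample_name(name):
    """Split a sample name into site, date, depth in meters and protocol.

    Assumes four underscore-separated fields, as in
    OSD15_2014-06-21_0m_NPL022; raises ValueError otherwise.
    """
    site, day, depth, protocol = name.split("_")
    return {
        "site": site,
        "date": datetime.strptime(day, "%Y-%m-%d").date(),
        "depth_m": float(depth.rstrip("m")),  # "0m" becomes 0.0
        "protocol": protocol,
    }


sample = parse_sample_name("OSD15_2014-06-21_0m_NPL022")
print(sample["site"], sample["date"], sample["depth_m"], sample["protocol"])
```

A name with a different number of fields fails loudly with a ValueError, which is preferable to producing a wrong site or protocol.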

Workable datasets

Amplicon data (16/18S rDNA) Analysis by SILVAngs

The main analysis was done using the SILVAngs pipeline on the workable sequence dataset. SILVA taxonomy version 119.1 was used for all 16S datasets and version 119 for all 18S datasets; the differences are very minor and can be viewed here.

The analysis was done for the sequence data obtained from LGC, from LifeWatch Italy and from Australia.

Note on taxonomy paths in MED exports

The MED exports contain a taxonomy path for each sequence inside the FASTA header. However, this taxonomy is not filtered by the 93% quality value threshold, which is the default in SILVAngs. Therefore, to be consistent with other SILVAngs exports, an extra file with the filtered taxonomy was added to the MED folder. See this issue for more details.
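To illustrate what filtering by a quality value means, here is a minimal Python sketch. It is not the SILVAngs implementation and does not read the MED export format: the record layout (sequence ID, taxonomy path, quality value on a 0 to 100 scale), the placeholder label for rejected sequences and the inclusive comparison are all assumptions made for illustration. For real work, use the filtered taxonomy file provided in the MED folder.

```python
QUALITY_THRESHOLD = 93.0  # the default value named on this page
UNCLASSIFIED = "unclassified"  # hypothetical placeholder label


def filter_taxonomy(records, threshold=QUALITY_THRESHOLD):
    """Keep a taxonomy path only if its quality value reaches the threshold.

    records: iterable of (sequence_id, taxonomy_path, quality) tuples.
    This layout is hypothetical and is not the MED export format.
    """
    filtered = []
    for seq_id, path, quality in records:
        # Assumption: a value equal to the threshold is accepted.
        kept_path = path if quality >= threshold else UNCLASSIFIED
        filtered.append((seq_id, kept_path))
    return filtered


# Made-up example records, for illustration only.
example = [
    ("seq1", "Bacteria;Proteobacteria", 97.5),
    ("seq2", "Bacteria;Cyanobacteria", 88.0),
]
print(filter_taxonomy(example))
```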

Analysis of workable 16S/18S rDNA from main sequence data set (by LGC)

16S

18S

Analysis of workable 16S rDNA dataset from Australia (sequenced by Ramaciotti)

Analysis of workable 18S rDNA datasets (sequenced by LifeWatch Italy)

V4

V9