
Questions about metagenomic data at EMG compared to documentation #18

Open
renzok opened this issue Jun 9, 2015 · 19 comments

@renzok

renzok commented Jun 9, 2015

The questions below were raised by Pascal.

Hi,

As already discussed, there is little reason to work with the OSD raw reads (other than for technical or QC questions), so within the scope of the OSD analysis paper we should mostly all be using the workable reads (and assemblies, when available). But at this point I'm a little confused as to where the main workable reads may be hiding :)

Can I do a quick sanity check with you about the location of the OSD raw and workable read data, taking ERR770976 as an example?
Raw reads (after demultiplexing + adapter clipping as defined on GitHub) are here:
https://www.ebi.ac.uk/ena/data/view/ERR770976
with two nice-looking fastq PE files:
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R2_shotgun_raw.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R1_shotgun_raw.fastq.gz
containing 1,149,852 * 2 reads.
A certain flavor of workable reads (after quality trimming + length filtering as defined on GitHub) is here:
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0
which provides a link to the above raw reads, and a link to "Processed nucleotide reads (FASTA)":
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/sequences/versions/2.0/export?contentType=text&exportValue=processedReads
which downloads a file that, from its name, looks like merged reads:
ERR770976_MERGED_FASTQ_nt_reads.fasta
It contains 975,769 reads, which is not incompatible with the raw PE read number above.
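
(As a quick way to double-check these numbers, here is a minimal Python sketch; it assumes the two ENA fastq.gz files and the EMG FASTA export above have been downloaded into the working directory.)

```python
# Reproduce the read counts quoted above from the downloaded files.
import gzip

def count_fastq_gz(path):
    # A FASTQ record is exactly 4 lines, so records = total lines / 4.
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

def count_fasta(path):
    # Each FASTA record starts with a '>' header line.
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

print(count_fastq_gz("OSD73_R1_shotgun_raw.fastq.gz"))       # expect 1,149,852
print(count_fasta("ERR770976_MERGED_FASTQ_nt_reads.fasta"))  # expect 975,769
```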

There are a number of questions:
1/ Is this file ERR770976_MERGED_FASTQ_nt_reads.fasta the merged workable reads file?
2/ But then how come this file has an extra "ambiguous base filtering" step which is not documented on GitHub (as seen in the "Quality Control" tab of https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0)?
3/ Where are the workable un-merged reads, which are described on GitHub as part of the shotgun data output files per sample?
4/ Why is the "initial" (I guess "raw") number of reads (1,241,922) in the EMG "Quality Control" tab different from the raw read number on the ENA page above (1,149,852)? Since the EMG analysis appears to be based on more reads than the supposedly official raw dataset, is it possible that EMG started with the "pre-raw" dataset, i.e. the files obtained directly from the sequencing company?

So it seems that the data located on the EMG portal are in fact not the workable dataset described in the GitHub documentation, but instead an EMG-specific dataset based on a different raw dataset and a different read pre-processing?

I seem to remember we'd agreed that we should all be working with the same datasets; otherwise, how can we hope to compare and merge our results? Plus, the materials & methods section would end up describing a different read pre-processing method for each OSD analysis partner.

Thanks ever so much for your help (and patience, as this might have been said before - but trust me, I really did my best to search before writing this mail, so it might in fact just be an issue with the documentation or my poor understanding).
Cheers,
Pascal

@renzok

renzok commented Jun 9, 2015

Hi Pascal,

To start answering your questions:

> 4/ Why is the "initial" (I guess "raw") number of reads (1,241,922) in the EMG "Quality Control" tab different from the raw read number on the ENA page above (1,149,852)? Since the EMG analysis appears to be based on more reads than the supposedly official raw dataset, …

This one is more for the EMG folks. @sittichpeitsche, can you answer?

@almitchell

Hi,

If we take ERR770976, the fastq PE files on the ENA site with 1,149,852 * 2 reads were the ones used by the EBI analysis pipeline - there is no other raw dataset.

The first thing the pipeline does is run SeqPrep to merge the paired end sequences. The resultant file contains 1,241,922 sequences, as illustrated in the ‘Initial reads’ section of the QC page. The reason that this number of reads is higher than the reads in the individual paired end files is that it contains merged reads plus reads that cannot be merged. So the 'initial reads file' used by the pipeline contains:
(merged R1+R2 reads) + (unmerged R1 reads) + (unmerged R2 reads).
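
A back-of-the-envelope check (a sketch assuming nothing is discarded at the merging step, i.e. every pair is either merged into one sequence or passed through as two reads) shows these counts are internally consistent:

```python
# Decompose the EMG "initial reads" count for ERR770976, assuming each
# raw pair is either merged into one sequence or kept as two reads.
pairs = 1_149_852    # raw read pairs on ENA
initial = 1_241_922  # "Initial reads" on the EMG QC page

# initial = merged + 2 * unmerged_pairs, with merged + unmerged_pairs = pairs,
# hence merged = 2 * pairs - initial.
merged = 2 * pairs - initial     # 1,057,782 pairs merged by SeqPrep
unmerged_pairs = pairs - merged  # 92,070 pairs kept as separate R1/R2 reads
assert merged + 2 * unmerged_pairs == initial
print(f"merged pairs: {merged:,}, unmerged pairs: {unmerged_pairs:,}")
```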

The various QC steps (length filtering and removal of sequences with > 10% undetermined nucleotides) are then applied to this initial read set. As a result of these processes, we end up with a file containing 975,769 sequences, which is the processed reads file that you linked to.
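
(For illustration only, here is a minimal sketch of that ambiguous-base rule; this is not the EBI pipeline code, and the input file name is hypothetical.)

```python
# Drop sequences whose fraction of undetermined bases (N) exceeds 10%,
# mirroring the QC rule described above. Input file name is illustrative.
def read_fasta(path):
    """Tiny FASTA reader yielding (header, sequence) tuples."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def passes_n_filter(seq, max_n_fraction=0.10):
    """Keep sequences whose N fraction is at most max_n_fraction."""
    return bool(seq) and seq.upper().count("N") / len(seq) <= max_n_fraction

kept = [(h, s) for h, s in read_fasta("ERR770976_merged.fasta")
        if passes_n_filter(s)]
print(f"{len(kept):,} sequences pass the <=10% N filter")
```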

I hope this makes sense. If there is anything that is unclear, please let me know and I will try to clarify.

Alex

@AVezzi

AVezzi commented Jun 10, 2015

Hi,

I have to join the conversation, as I had the same "confusion" that Pascal had.
Here in Padova we have two analysis proposals that are based on the IPR codes present in the "*InterPro.tsv" tables released by the EBI EMG effort.
And, as Renzo presumably remembers, more than 3 weeks ago I asked the OSD leaders whether all the data (shotgun, 16S and 18S) were "the same", i.e. coming out of a single pre-processing step.
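
(Tangentially, for anyone tallying IPR codes from those tables: a minimal sketch, assuming a standard InterProScan-style TSV layout with the InterPro accession in the twelfth column; the file name and column index are assumptions, so check them against your actual files.)

```python
# Tally IPR accessions in an EMG "*InterPro.tsv" table.
# File name and column index are illustrative assumptions, not confirmed.
import csv
from collections import Counter

def count_ipr(path, ipr_column=11):
    """Count IPR accessions found in the given tab-separated column."""
    counts = Counter()
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) > ipr_column and row[ipr_column].startswith("IPR"):
                counts[row[ipr_column]] += 1
    return counts

print(count_ipr("ERR770976_InterPro.tsv").most_common(10))
```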

Today I took my first look at the GitHub forum (prompted by the same doubt Pascal had), and what did I discover?
Just browsing the EMG data (and portal) a bit, it became clear to me that someone had merged the reads (which is absolutely correct), and that the number of merged reads could be higher than in a single raw data file as present in ENA.

The question here is: what do we have to do?
Is EBI going to release a "new version of the EMG data", based on the workable dataset that Bremen is producing? In that case, we clearly have to wait for the new release.
If instead the dataset stays as it is, can we proceed with our analysis, or will everything be useless in the end because there is no consistency among the data?

Could someone answer my last points? Right now we are completely stuck.
Cheers,
Ale

@renzok

renzok commented Jun 11, 2015

Hi @AVezzi ,

no need to be stuck or lost or frustrated. I am not aware of what exactly you want to study. However, I do think your analysis will be valid whether it is based on the EMG data or on the OSD workable data. It is just important to be aware of the differences.

Indeed EMG used the 'OSD raw dataset' and produced their own output.

However, OSD decided to define 'workable datasets', as described in the GitHub wiki, for the OSD community independently of any existing or future analysis pipeline. The main reasons for that were:

  • Each pipeline has its own policy of changing over time (e.g. EMG changed shortly before the OSD jamboree)
  • Many other pipelines need a 'workable dataset' as input (e.g. mg-traits or SILVAngs)

In this specific case, the main differences between the EMG and OSD workable data are:

  • EMG uses SeqPrep for merging, OSD uses PEAR; the differences are not that big (see the attachment kindly produced by Antonio)
  • The main difference in the results comes from the fact that the OSD pre-processing uses Q20 instead of Q15 in the quality-trimming step after merging (see the sketch below)
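
For reference, the Phred relation behind the Q15 vs Q20 numbers (a minimal check that reproduces the accuracy figures discussed in this thread):

```python
# Phred scale: Q = -10 * log10(p_error), so per-base accuracy = 1 - 10**(-Q/10).
for q in (15, 20):
    accuracy = 1 - 10 ** (-q / 10)
    print(f"Q{q}: base-call accuracy = {accuracy:.1%}")
# Q15: base-call accuracy = 96.8%
# Q20: base-call accuracy = 99.0%
```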

@genomewalker (Antonio) and I have also run all metagenome workable datasets through the mg-traits pipeline. The final results will be available in two weeks. Maybe you are interested in also doing your analysis on the mg-traits data in addition to EMG. This might or might not give you different results, but it will definitely make your analysis and interpretation more robust.

ciao, renzo

@renzok

renzok commented Jun 11, 2015

Forgot to mention: all workable datasets are available for download now, and I have updated the documentation with the aim of making things about workable and raw data clearer!

@Zaphod-dev

I followed the link "All workable data is available here" to https://zarafa.mpi-bremen.de/owncloud/index.php/apps/files?dir=/Shared/mgg/OSD/analysis-data/2014/datasets/workable/ and it then asks for a username and password?

@Zaphod-dev

Also, the documentation is still a little unclear: under the section "Metagenomic data" it is still written "You can browse and download the metagenomic data as well as analysis results generated by the European Bioinformatics Institute (EBI) here: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/".
It should be explained that the EMG function & taxonomy results can be accessed there, but that the raw reads are archived at ENA (although linked from the EMG web page; this is to make sure everyone understands where the data is actually officially archived, rather than providing secondary links that confuse things further), whereas the "workable reads" are available somewhere else (e.g. zarafa, once the access is fixed). I would explicitly state here what exactly the "Processed nucleotide reads (FASTA)" available from the EMG website are, and how they differ from the "workable reads". Better to repeat things than risk someone jumping to this section and missing the important points?

@AVezzi

AVezzi commented Jun 11, 2015

Thanks Antonio for your answer, which allows us to go further with the analyses.
As far as I know (could you confirm this, please?), mg-traits does not produce any InterPro matches table, so we need to use the EMG dataset.

One possibility, and this is a question to the EMG folks, is to redo the entire EMG analysis using the Bremen-produced workable dataset (i.e. skipping the merging procedure of the EMG pipeline).
What do the EMG folks think of this? It could be (at least for me) the only way to use the same dataset for all the OSD metagenomic analyses.
Thanks,
Ale

@renzok

renzok commented Jun 11, 2015

@hingamp thanks for the further suggestions, I will try to improve the documentation

@renzok

renzok commented Jun 11, 2015

@AVezzi First, no one requires you to use InterPro matches. That said, InterPro matches are good. Mg-Traits produces PFAM matches using the uproc tool.

@almitchell

Hi Ale,
The problem with EMG re-doing the analysis is that it would break the standard merging procedure that we have used for all of our samples since we began running the EBI metagenomics pipeline several years ago. Because we use a standardised system, everything in the portal is (broadly) comparable, since the raw data have been prepared in the same way. This means, for example, that the OSD data can be compared to the TARA Oceans dataset that we are currently analysing and uploading to the portal, or to other oceanographic studies.

The alternative would be for the Bremen dataset to use our merging system - i.e. SeqPrep with Q15 for quality trimming. I believe this was the original plan.

Alex


@genomewalker
Copy link

Hi all,
just a little comment on:

> The alternative would be for the Bremen dataset to use our merging system - i.e. SeqPrep with Q15 for quality trimming. I believe this was the original plan.

We had a meeting with the OSD analysis group to discuss the pre-processing pipeline, and this was never planned. We decided on the OSD pre-processing with the perspective of it being suitable for a variety of analyses (including amplicon), not only for EMG. In that meeting we decided to use PEAR as the merger because it performed better than the other mergers in our benchmarks and it was published. In terms of quality, we decided to go for Q20 because we thought many analyses would require higher base-calling quality than Q15, where the accuracy is ~96% compared to the 99% of Q20 (important, e.g., for MED analyses).
Maybe EMG can provide the fastq files after their pre-processing steps.

Antonio

@almitchell

Hi Antonio,
Just to confirm: the EMG fastq files after the SeqPrep merging steps have taken place (but before any QC) will be made available from ENA. We are currently working on getting the files transferred across.
This will mean the raw files, the merged files and the QC-processed files will all be available. Hopefully this covers all use cases.
Alex


@almitchell

Hi Renzo,

Sorry for the delayed reply on this - I’ve been busy chasing other things.

Rob has been looking into uproc and (in his capacity as head of Pfam) is a bit concerned about the statement that Mg-Traits produces Pfam matches. He asked me to point out that you get different results using a non-HMM method than with the actual HMM library, so uproc does not give Pfam matches - it gives a fast approximation to them, and treating the two as equivalent can be dangerous.

I’ve cc:ed him for more detail if necessary.

Alex


@renzok

renzok commented Jun 17, 2015

Hi,

I used the term "match" because uproc is trained on the PFAM database and gives as output the PFAM accession of the corresponding database entry. It is a different method compared to HMMs, but it "matches" against the same database.

So, which term should be used in your opinion, and how does it differ from "matches"?

ciao,
renzo


@genomewalker

> Just to confirm: the EMG fastq files after the SeqPrep merging steps have taken place (but before any QC) will be made available from ENA. We are currently working on getting the files transferred across.
> This will mean the raw files, the merged files and the QC-processed files will all be available. Hopefully this covers all use cases.

Many thanks Alex, I am sure all OSD colleagues will be happy with the data.

Cheers!

Antonio

@genomewalker

> So, which term should be used in your opinion, and how does it differ from "matches"?

I think the proper term should be "hit" instead of "match", in the same way it is used by Peter in his paper. What do you think?

@renzok

renzok commented Jun 17, 2015

Yes, "hit" is a good term.
