Questions about metagenomic data at EMG compared to documentation #18
Comments
Hi Pascal, to start answering your question: this is more up to the EMG folks. @sittichpeitsche, can you answer?
Hi, if we take ERR770976, the fasta PE files on the ENA site with 1,149,852 * 2 reads were the ones used by the EBI analysis pipeline - there is no other raw dataset. The first thing the pipeline does is run SeqPrep to merge the paired-end sequences. The resulting file contains 1,241,922 sequences, as illustrated in the 'Initial reads' section of the QC page. The reason this number is higher than the read count in each of the individual paired-end files is that it contains the merged reads plus the reads that could not be merged. The 'initial reads file' used by the pipeline then goes through the various QC steps (length filtering and removal of sequences with > 10% undetermined nucleotides). I hope this makes sense. If there is anything that is unclear, please let me know and I will try to clarify. Alex
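As an aside for readers puzzling over the numbers above, here is a minimal sketch of the read accounting, assuming (as the figures suggest) that unmergeable pairs are kept as two separate reads in the initial reads file; the variable names are illustrative only:

```python
# Read accounting after SeqPrep merging (illustrative sketch, not the EMG pipeline itself).
# Assumption: each mergeable pair becomes one read, each unmergeable pair stays as two reads.

raw_pairs = 1_149_852      # read pairs in each of the ENA FASTQ files (R1 and R2)
initial_reads = 1_241_922  # 'Initial reads' reported on the EMG QC page

# If M pairs merge and U pairs do not, then M + U = raw_pairs and M + 2*U = initial_reads,
# so the number of unmergeable pairs is simply the difference between the two counts:
unmerged_pairs = initial_reads - raw_pairs  # 92,070
merged_pairs = raw_pairs - unmerged_pairs   # 1,057,782

assert merged_pairs + 2 * unmerged_pairs == initial_reads
print(f"merged pairs: {merged_pairs:,}, unmerged pairs: {unmerged_pairs:,}")
```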
Hi, I had to join the conversation, as I had the same "confusion" that Pascal had. Today, having my first look at the GitHub forum (for exactly the same doubt Pascal had), I discovered that ....? What? The question here is: could someone answer my last points? Now we are completely stuck.
Hi @AVezzi, no need to be stuck or lost or frustrated. I am not aware of what exactly you want to study. However, I do think your analysis will be valid whether based on the EMG data or on the OSD workable data. It is just important to be aware of the differences. Indeed, EMG used the 'OSD raw dataset' and produced their own output. However, OSD decided to define 'workable datasets', as described in the GitHub wiki, for the OSD community independently of any existing or future analysis pipeline. The main reasons for that were:
In this specific case, the main differences between the EMG and OSD workable datasets are:
@genomewalker (Antonio) and I also ran all the workable metagenome datasets through the mg-traits pipeline. The final results will be available in two weeks. Maybe you are interested in also doing your analysis on the mg-traits data in addition to EMG. This might or might not give you different results, but it will definitely make your analysis and interpretation more robust. ciao, renzo
I forgot: all workable datasets are available for download now, and I updated the documentation with the aim of making things about workable and raw data clearer!
I followed the link "All workable data is available here" to https://zarafa.mpi-bremen.de/owncloud/index.php/apps/files?dir=/Shared/mgg/OSD/analysis-data/2014/datasets/workable/ and it then asks for a username and password?
Also, the documentation is still a little unclear: under the section "Metagenomic data" it is still written "You can browse and download the metagenomic data as well as analysis results generated by the European Bioinformatics Institute (EBI) here: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/"
In fact, it looks like the workable data is here:
Thanks Antonio for your answer, which allows us to go further with the analyses. One possibility, and this is a question for the EMG folks, is to redo all the EMG analysis using the Bremen-produced workable dataset (= skipping the merging procedure of the EMG pipeline).
@hingamp thanks for the further suggestion, I will try to improve the documentation.
@AVezzi First, no one requires you to use InterPro matches. That said, InterPro matches are good. Mg-Traits produces PFAM matches using the uproc tool.
Hi Ale, the alternative would be for the Bremen dataset to use our merging system - i.e., SeqPrep with Q15 for quality trimming. Alex
Hi all,
We had a meeting with the OSD analysis group to discuss the pre-processing pipeline, and this was never planned. We decided on the OSD pre-processing with the perspective of it being suitable for a variety of analyses (including amplicon), not only for EMG. In this meeting we decided to use PEAR as the merger, because in our benchmarks it performed better than the other mergers and it is published. In terms of quality we decided to go for Q20, as we thought that many analyses would require higher base-calling quality than Q15, where the accuracy is about 97% compared to 99% at Q20 (important, e.g., for MED analyses). Antonio
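For readers who want to check the accuracy figures quoted here, the relation between a Phred quality cutoff and per-base accuracy can be computed directly; this is a generic sketch, not part of either pipeline:

```python
# Phred quality score Q corresponds to an error probability of 10**(-Q/10).

def base_accuracy(q: float) -> float:
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10 ** (-q / 10.0)

for q in (15, 20):
    print(f"Q{q}: accuracy = {base_accuracy(q):.1%}")
# Q15: accuracy = 96.8%
# Q20: accuracy = 99.0%
```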
Hi Antonio,
Hi Renzo, sorry for the delayed reply on this - I've been busy chasing other things. Rob has been looking into uproc and (in his capacity as head of Pfam) is a bit concerned about the terminology. I've cc'ed him for more detail if necessary. Alex
Hi, I used the term "match" because uproc is trained on the PFAM database. So, which term should be used, in your opinion? ciao,
Many thanks Alex, I am sure all OSD colleagues will be happy with the data. Cheers! Antonio
I think the proper term should be "hit" instead of "match", in the same way it is used by Peter in his paper. What do you think?
Yes, hit is a good term.
The issue questions below were raised by Pascal.
Hi,
As already discussed, there is little reason to work with the OSD raw reads (other than technical or QC questions), so within the scope of the OSD analysis paper we should mostly all be using the workable reads (and assemblies, when available), but at this point I'm a little confused as to where the main workable reads may be hiding :)
Can I do a quick sanity check with you about the location of the OSD raw and workable read data, taking ERR770976 as an example:
Raw reads (after demultiplexing + adapter clipping, as defined on GitHub) are here:
https://www.ebi.ac.uk/ena/data/view/ERR770976
With two nice-looking FASTQ PE files:
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R2_shotgun_raw.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R1_shotgun_raw.fastq.gz
containing 1,149,852 * 2 reads
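One quick way to double-check the quoted count locally, a hedged sketch assuming the files have been downloaded from the FTP links above under the same names:

```python
# Count reads in a gzipped FASTQ file: each record spans exactly four lines.
import gzip

def count_fastq_reads(path: str) -> int:
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

for fname in ("OSD73_R1_shotgun_raw.fastq.gz", "OSD73_R2_shotgun_raw.fastq.gz"):
    print(fname, count_fastq_reads(fname))  # expected: 1,149,852 reads each
```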
A certain flavor of workable reads (after quality trimming + length filtering, as defined on GitHub) is here:
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0
Which provides a link to the above raw reads, and a link to "Processed nucleotide reads (FASTA)":
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/sequences/versions/2.0/export?contentType=text&exportValue=processedReads
which downloads a file that, from its name, looks like merged reads:
ERR770976_MERGED_FASTQ_nt_reads.fasta
It contains 975,769 reads, which is not incompatible with the raw PE read number above.
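The same sanity check can be run on the downloaded FASTA file (a sketch; it simply counts header lines starting with '>'):

```python
# Count sequences in a FASTA file by counting header lines.
def count_fasta_records(path: str) -> int:
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

print(count_fasta_records("ERR770976_MERGED_FASTQ_nt_reads.fasta"))  # expected: 975,769
```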
There are a number of questions:
1/ Is this file ERR770976_MERGED_FASTQ_nt_reads.fasta the merged workable reads file?
2/ But then how come this file has an extra "ambiguous base filtering" step which is not documented on GitHub (as seen in the "Quality Control" tab of https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0)?
3/ Where are the workable un-merged reads, which are described on GitHub as part of the shotgun data output files per sample?
4/ Why is the "initial" (I guess "raw") number of reads (1,241,922) in the EMG "Quality Control" tab different from the raw read number on the ENA page above (1,149,852)? Because the EMG analysis appears to be based on more reads than the supposedly official raw dataset, is it possible that EMG started with the "pre-raw" dataset, i.e. the files obtained directly from the sequencing company?
So it seems that the data located on the EMG portal are in fact not the workable dataset described in the GitHub documentation, but instead an EMG-specific dataset based on a different raw dataset, using a different read pre-processing?
I seem to remember we had agreed that we should all be working with the same datasets; otherwise, how are we to hope to compare and merge our results, and the materials & methods section is going to list one read pre-processing method for each OSD analysis partner?
Thanks ever so much for your help (and patience, as this might have been said before - but trust me, I really did my best to search before writing this mail, so it might in fact just be an issue with documentation or my poor understanding).
Cheers,
Pascal