fall_2018_log
August 2018
Aug.19, Aug.21, Aug.22, Aug.23, Aug.31,
September 2018
Sep.10, Sep.13, Sep.17, Sep.18, Sep.19, Sep.21, Sep.24, Sep.25, Sep.27, Sep.28,
October 2018
Oct.01, Oct.10, Oct.12, Oct.21, Oct.22, Oct.23, Oct.29, Oct.30,
November 2018
Nov.01, Nov.02, Nov.07, Nov.14, Nov.16, Nov.19, Nov.20, Nov.21, Nov.22, Nov.23, Nov.26, Nov.29,
December 2018
meeting with PLB
align EBI ENA biome terms to ENVO biomes. Pier had been asked to do this by ENA people; it would provide semantic reference for the large and unprocessed ENA genomic data holdings, which would be made accessible to the Planet Microbe tools. It could also be used later on as a rich source of material to do machine learning over. Bonnie says yes.
Could dockerize CAMI and get it to run on our infrastructure.
Don't waste too much time getting the data cleaned up if CAMERA is too bad; I think it'd be better to leave it be and restart clean. Perhaps get some undergrad interns to help with the annotation task if necessary. My thoughts would be to just make sure that the HOT, TARA, OSD, BATS ... + a couple more standard datasets of interest ... are ontologized and integrated.
Pier suggests I read up on the expressivity of OWL to better understand what it can represent, I guess to better search ontology knowledge graphs.
Use data holdings to machine learn over and create new putative environment classifications. Create new ENVO nodes which may be new, unknown environments. Use stats over lots of genomes to see if there are recurrent patterns which are not explained by / don't cluster into known environments, but are perhaps uncharacterized microniches or something of the sort (uncharacterized environments in human terms).
Observatories to potentially draw data in from: National Ecological Observatory Network (NEON), ARGO. Have a feature in Planet Microbe where we could connect genomic data to sensor data; ask: is there any sensor data in the same spatiotemporal pocket as this genomic sampling of interest?
Initiative Pier is involved in, building after 2018 ISME: an omics observatory network. A new MIxS field was made for observatory ID, in an ontology to link. It helps to cross-compare observatory omics data and justifies an ecological foundation for doing machine learning on omics data. For example, compare the same ecosystem across space or time by having standardized the analysis on the same techniques. Run a pipeline in 2019 and be able to rerun the same thing in 2029 to cross-compare potential changes due to global warming. Pier will add me to the Google group.
C-DEBI deep dark ocean samples hydrothermal vent, sludge different environments.
bring together HOT, Bermuda (BATS), C-DEBI (USC), C-MORE, BCO-DMO (Woods Hole; housing cruise/physicochemical ocean chemistry datasets)
bring in other Earthcube datasets geological ocean chem,
ontologize the metadata
MM-ETSP is only in imicrobe; look this up
help Matt B to add ontology to his submission tool.
ontology for the informatics tools
restructure schema for our data backend.
Pier added me to the Global Omics Observatory Network (GLOMICON)
A group for discussions around the Global Omics Observatory Network (GLOMICON). Workshop on enhancing interoperability & coordination of long-term omics observations (Bremen, 21.-23. Feb. 2018)
Pier also sent me the list of EBI biomes; seems pretty reasonable, only 491, and I'm sure most are in ENVO.
Mandatory TA training for my RAship
Bonnie sent me a link to the lab basecamp page
Question for Pier: I presume these will be biome classes I am to be creating for ENA/EBI?
integrate with pegasus /other cyberinf projects.
A collaborator was doing NLP ontology work for abstracts
My job harmonize the metadata with EBI.
Checked out metadata for various projects, available in the folder metadata_planet_microbe_projects
I was checking the accession numbers which are provided in the Tara PANGAEA metadata page. The accession numbers weren't simply searchable in their search bar, hence perhaps there is more to accessing them. So I checked out their page on downloading read and analysis data.
try:
ftp://ftp.sra.ebi.ac.uk/vol1/<submission accession prefix>/<submission accession>
where <submission accession prefix> contains the first 6 letters and numbers of the SRA Submission accession.
trying it with the first tara accession: ERS1313951
ftp://ftp.sra.ebi.ac.uk/vol1/ERS131/ERS1313951
Didn't work; perhaps I need permission or something to traverse the file structure.
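The path pattern above can be sketched as code. A minimal sketch, assuming the documented pattern applies to these ERS (sample) accessions; that assumption may itself be the problem, since the pattern is described for submission accessions, which could explain why the URL didn't resolve.

```python
def ena_ftp_url(accession: str) -> str:
    """Build ftp://ftp.sra.ebi.ac.uk/vol1/<prefix>/<accession>, where
    <prefix> is the first 6 characters of the accession, per the docs."""
    prefix = accession[:6]
    return f"ftp://ftp.sra.ebi.ac.uk/vol1/{prefix}/{accession}"

# The first Tara accession tried in the log:
print(ena_ftp_url("ERS1313951"))
# -> ftp://ftp.sra.ebi.ac.uk/vol1/ERS131/ERS1313951
```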
From the Fierer et al. 2007 total microbial diversity paper I was thinking: is it possible to do richness and evenness comparisons of functional genes, not just OTUs? Could we ask a question of our future large data holdings such as "what are the richness and evenness of ... some defined set of housekeeping or common functional genes ... across different environment types"? Has this been done? Something to think about. Maybe also the richness and evenness of the "hypothetical environment classes" I'd try to discover via machine learning. Could we get a better view of the global richness of these environments? Or has this maybe already been done?
Could we somehow separate out bacterial and viral genomic material from the metagenomes?
From Alise: link to her spreadsheet for cleaning the CAMERA metadata.
Previously got Dr. Clayton T. Morrison to agree to be on my committee. Have a first draft of the ENA ontology project sketched out; will include shortly.
Met John H. Hartman and Illyoung Choi, collaborators of ours. They suggested I read the HULK paper, which is like MASH for k-mer hashing. Perhaps they can help with the ENA ontology project. This paper makes use of the CAMI (Critical Assessment of Metagenome Interpretation) dataset which Pier had suggested we incorporate into imicrobe.
I'm presenting for the Sep 27th lab meeting.
There is major interest in adopting schema.org across EarthCube, ESIP, RDA and the data facilities across NSF, NOAA and NASA. From Elisha.
Can print out my health insurance in 5-7 business days from https://www.aetnastudenthealth.com/
The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) is, as the name implies, about transcriptomes of eukaryotic phytoplankton. Phylogeny semantics could be an interesting addition later; set it up to ask questions like what metabolic processes differentiate Prasinophyceae from Bacillariophyceae, that sort of thing, using the NCBITaxon ontology terms. A cool idea for later, perhaps.
For the GOS data, I looked at a couple random samples; they contain links to sra_bioproject and sra_biosample, the latter of which has the MIMS (minimum information about a metagenomic sequence). So we'd probably need to extract the MIMS/MIxS data from the link drawn from the sra_biosample field. This may be the case for other samples as well. Not that the metadata is super informative if it gives us something such as, for sample GOS Sample S0135:
environment biome: ocean_biome
environment feature: ocean
environment material: water
This plays to the ENA_EBI_BIOMES_ML idea, where we'd ideally want to reclassify at a deeper hierarchical level than ocean biome. To be fair, the current biome hierarchy isn't as well developed as the environmental feature hierarchy. I think Pier was uncertain in which ENVO hierarchy the genomic data would/should cluster. Do we go with biome, environmental feature or environmental material? (Probably not environmental material.) An excellent project such as OSD has fields for all three. TARA only has environmental feature, which links to ENVO terms. I think how to deal with this may remain an open question. Philosophically speaking, you could describe the environmental feature annotations as an environmental system determined by X environmental feature. May or may not be necessary.
As for the creation of the ENA_EBI_BIOMES_ML biome terms, we should also consider whether the biome hierarchy is the correct one in which to place this, as there is an open question about ENVO:biome over whether organisms labeled in a biome are asserted to have evolved/adapted to that biome. The comment read:
There has been some concern raised (see
Issue #143) about the usefulness of the assertion that
organisms have evolved within a given biome. They may have
evolved adaptations elsewhere and demonstrating one or the
other is often not feasible. Consider relabelling to
"environmental system determined by an ecological
community" or similar.
Thus I think this ought to be addressed before I start doing all the EBI "biome" terms under the biome hierarchy. Bringing this back to the other data sets: if it's been more common (e.g. TARA) to label samples by environmental feature, should we set up the future ML classifier to work in that environmental feature space? Or do we map those to environmental system determined by X environmental feature terms? I'll have to see what Pier thinks about all this. Pragmatist Kai wants one hierarchy by which to classify genomic material, but he understands the philosophy behind having multiple; it just can confuse people trying to annotate/set up new experiments.
As for the first round of metadata to ontologize for Planet Microbe, here is a first pass of what to start with:
HOT_CTD: only use what's overlapped with in others.
HOT_Niskin: only use what's overlapped with in others, make sure we get at least some of the dissolved organic and inorganic terms DOC DON ...
BATS_CTD: Use basically all of it.
BATS_bottle: only use what's overlapped with in others.
BATS_pigments: chl a and b for now.
TARA_Env_context: most of it, minus first 7 columns. Could probably for now skip some of the sensor data things like angular scattering coefficient etc.
TARA_Methodological_context: not as useful, but it does link to ENA, and it may have some info in the Environment (Material). Additionally, there are a bunch of datasets in the Registry of all samples from the Tara Oceans Expedition collection (2009-2013). I can't determine if they're useful or not, because not all of these link to ENA. I think that some of these were generated afterward using models.
OSD: Take nearly all good model.
GOS: take things which overlapped with others.
MM-ETSP*: take things which overlapped with others.
I can't access C-DEBI on imicrobe yet. Alise didn't include it yet, so I'll consider this a later TODO; probably not adding much in terms of new metadata.
Note: not all the TARA EBI accession numbers link to files at ENA!
also check out the ocean omics page and the Ocean Gene Atlas
Weekly PM Meeting
Bonnie's grants say:
Repo science data in one place.
STC: science tech center.
C-DEBI data's structure should theoretically be like HOT's.
A next step is to integrate the SeaView data set from the EarthCube project (cherry on top), integrated with other types of data to link to. Bring this in with the 4D-plus; I could check out other terms to ontologize.
BATS messed up their BCO-DMO submission, so put this on lower priority.
RNR seminar: have a draft of the intro for the semantics talk for RNR by next Tuesday.
From Dr. Saleska:
A History of the Ecosystem Concept in Ecology: More than the Sum of the Parts
Francis Evans -> idea about ecology being the taxonomy of ecosystems
Eugene Odum, Fundamentals of Ecology.
For some of the GOS samples, e.g. https://www.imicrobe.us/#/samples/588, there are 2 chlorophyll fields with different values. I'm guessing that it's chl a and b, but I'm not sure, and we should look into this (somehow) before assigning it as such. I've looked over the GOS papers https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0050077 as well as http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0050016 and https://www.ncbi.nlm.nih.gov/pubmed/?term=17504484, but they don't talk about the methods for the chl measurements. We may have to ask someone who actually handled the data for GOS.
For HOT chloropigments, is it chl a? Check out the HOT protocols page.
worked on Tara notes.
Note: for OSD, not all of the samples are collected from marine systems; there are lakes, rivers and estuaries as well. I'm not sure, but I'd presume Pier would want that to be reflected in the semantics. Hence I wonder if it'd be better to create classes which are concentrations of x, y, z in saline water (instead of seawater); then in the knowledge graph we could search for the right concentration of xyz in a saline water subclass, i.e. seawater. Could this level of detail also be used to help properly reclassify annotation semantics between marine and freshwater samples? For example, looking at the salinities of all the "Marine" samples and suggesting outliers for reannotation. This fits in with our ML annotation metadata suggestion framework. For now, instead of a class like "concentration of x in seawater", would it be better to have these types of fields be annotated with a class like "concentration of x in water", then have another annotation for individual samples with "concentration of x in seawater" (or in freshwater)? I think in principle we'd want to import all of ENVO (+ other ontologies) as part of our knowledge graph, and we could allow users to query for samples based on a nitrate concentration or based on nitrate in seawater. We'll need to pin this down.
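The salinity-outlier reannotation idea could be sketched roughly like this. The function name, sample IDs and values are made up for illustration; a real version would pull salinities from our metadata backend and probably use a smarter model than a z-score.

```python
import statistics

def flag_salinity_outliers(samples, z_cutoff=3.0):
    """Given (sample_id, salinity_psu) pairs all annotated as marine,
    flag ones whose salinity is a statistical outlier and so might need
    re-annotation (e.g. actually freshwater or estuarine)."""
    sals = [s for _, s in samples]
    mean, sd = statistics.mean(sals), statistics.stdev(sals)
    return [sid for sid, s in samples if abs(s - mean) > z_cutoff * sd]

marine = [("s1", 35.1), ("s2", 34.8), ("s3", 35.3), ("s4", 35.0),
          ("s5", 0.4)]  # s5 looks like a mislabeled freshwater sample
print(flag_salinity_outliers(marine, z_cutoff=1.5))  # ['s5']
```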
When looking at a description of fluorescence I found a cool Dutch dataset/initiative to get crowd-sourced water fluorescence data, called the Citizens’ Observatory for Coast and Ocean Optical Monitoring, part of the Marine and Ocean Data Management group MARIS. They have a mobile application in which, I guess, people take pictures of water and it determines a fluorescence value in FU from the photo. It looks like they have a pretty decent dataset of water fluorescence (a common parameter in our Planet Microbe data). It'd be cool to collaborate: see if they'd be willing to let our infrastructure query their database, get the data in a common searchable format and maybe connect up some data. Cool idea for the future.
Fluorescence
From this USGS page, they have data about 3 types of fluorescence measurements:
Colored dissolved organic matter, water, filtered, field, single band excitation, fluorescence emission, ppb QSE
Colored dissolved organic matter (CDOM), water, in situ, fluorometric method, relative fluorescence units (RFU)
Colored dissolved organic matter fluorescence (fDOM), water, in situ, mg/l of carbon
We have examples of at least two of them.
TARA fluorescence is colored dissolved organic matter, short name fCDOM, with units ppb (QSE), which would make it the first one from the USGS list.
BATS_CTD fluorescence is in units of RFU, so it must be the USGS CDOM.
HOT_CTD FLUOR: this fluorescence is not the same as BATS's in RFUs. It is instead some specific fluorescence that was used to measure chloropigment; see this HOT protocol, which has a unit of mVolts. I'm guessing this was used to help determine the CHLPIG chloropigments (microgram/liter) measurements.
GOS fluorescence: I have no idea (about any of the GOS metadata)
GitHub issue for Pier and me to work on the ENA EBI biome terms:
Worked on PM metadata sheet as well as PM ontology terms
Bonnie/me meeting: I should do my comprehensive exams next fall, or January 2020 at the latest. I should have an initial PM-related publication (or at least a draft) by then to leverage for the comps exam (to direct the questions toward my work and expertise and away from other things I may not know).
The MMETSP dataset could be a really neat example of something PM can ontologize/make FAIR/link to tools such as RNA-seq. Could be part of my first publication. Bonnie and I could write up an exec summary to ontologize the MMETSP dataset, if the MMETSP lead would be willing to support it with funding.
Bonnie tentatively agreed to my 3rd paper unit being the tool that leverages GO to compare differences between omics samples, having it be a tool we host on PM.
To comply with INSDC reporting of sample location by country, we can use the mapping OWL file for countries in the GAZ gazetteer; this could be useful for some of our PM projects with country code info.
Presentation seemed to have gone well didn't get a ton of feedback.
Meeting with Alise after:
For OMEGA-C, the first implementation wouldn't involve the ENVO graph searching/linked data. It would just be user-inputted FASTAs + user-defined annotations (they don't even need to be PURLs, could just be strings). Then we'd have a local GO to parse to make sure we do the PFAM2GO annotation correctly, as in we only map one PFAM annotation to one term in each of the main GO hierarchies (BP, MF, CC). Having a brief look at the pfam2go page, we noticed that the same PFAM ID, PF00032, links to 3 GO terms: oxidoreductase activity, electron transfer activity and membrane. The first two are in a subclass-superclass relationship within the MF hierarchy. For my tool we would only want a PFAM term to map to a single term from each of the 3 GO hierarchies, and we want it to be the deepest term. In this example we'd want electron transfer activity but not oxidoreductase activity for the MF hierarchy (in addition to membrane for the CC hierarchy). We don't want to be double counting terms when doing the PFAM2GO mapping, because that would make everything wrong. Hence I'll have to write a new script to properly parse the pfam2go page, making sure not to double count and to map to the full PURLs, not just the term_id strings. I could have a local version of GO in house (with regular updates) to query to make sure the mapping isn't double counting. More info about the PFAM2GO mapping on this blog and this paper.
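A first sketch of the pfam2go parsing, assuming the standard external2go flat-file layout (`Pfam:PF00032 Cytochrom_B_C > GO:label ; GO:ID`). Picking only the deepest term per GO hierarchy would additionally need the GO DAG loaded, which this sketch leaves out.

```python
def parse_pfam2go(lines):
    """Parse pfam2go flat-file lines into {pfam_id: [go_ids]}.
    Selecting only the deepest term per GO namespace (BP/MF/CC), as the
    tool needs, requires walking the GO DAG and is not done here."""
    mapping = {}
    for line in lines:
        if line.startswith("!") or " > " not in line:  # skip header comments
            continue
        left, right = line.split(" > ", 1)
        pfam_id = left.split()[0].split(":", 1)[1]   # "Pfam:PF00032" -> "PF00032"
        go_id = right.rsplit(";", 1)[1].strip()      # "... ; GO:0016491" -> "GO:0016491"
        mapping.setdefault(pfam_id, []).append(go_id)
    return mapping

# The PF00032 example from the meeting: one Pfam ID mapping to three GO terms.
example = [
    "!version date: 2018/11/24",
    "Pfam:PF00032 Cytochrom_B_C > GO:oxidoreductase activity ; GO:0016491",
    "Pfam:PF00032 Cytochrom_B_C > GO:electron transfer activity ; GO:0009055",
    "Pfam:PF00032 Cytochrom_B_C > GO:membrane ; GO:0016020",
]
print(parse_pfam2go(example))
# {'PF00032': ['GO:0016491', 'GO:0009055', 'GO:0016020']}
```

Emitting full PURLs would then just be prefixing `http://purl.obolibrary.org/obo/` to the underscored IDs.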
For PM, Alise and I discussed having a knowledge graph in addition to the backend database. The knowledge graph would import ontologies: GO, ENVO, UO (units ontology), etc. In addition, we could treat it as a Knowledge Representation (KR) schema (AI knowledge representation). KR schemas have 2 main components: 1) semantic networks, or knowledge graphs; often ontologies are used. 2) Frame representations, where you encode facts like values about entities in the knowledge graph. Useful video (in Hindi/English) here. In our case we'd encode facts to represent the conversions between units. //Discuss the umol/Kg -> umol/L example.
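A toy version of the frame idea for the umol/Kg -> umol/L case: store the conversion as a fact keyed by unit pair and apply it with context. The density value used here is an assumed nominal surface-seawater density, purely illustrative.

```python
# Unit-conversion "facts" the knowledge graph could hold, applied to
# sample measurements. Keys are (from_unit, to_unit) pairs; values take
# the measurement plus any contextual quantities the conversion needs.
CONVERSIONS = {
    ("umol/kg", "umol/L"): lambda v, density_kg_per_L: v * density_kg_per_L,
}

def convert(value, from_unit, to_unit, **context):
    if from_unit == to_unit:
        return value
    return CONVERSIONS[(from_unit, to_unit)](value, **context)

# e.g. a nitrate concentration reported per kg of seawater,
# with an assumed density of 1.025 kg/L:
print(convert(10.0, "umol/kg", "umol/L", density_kg_per_L=1.025))  # ~10.25
```

A real version would pull the density from in-situ temperature/salinity rather than a constant.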
Luckily we have FragGeneScan and UProC in imicrobe, so it shouldn't be that big of a deal to assemble the pipeline to do this.
Meeting with Alise
Re-integrating collection metadata with samples. We'd need the labeled depth (what they wrote on a bottle, e.g. 20 m) vs. what's linked to the actually collected metadata, for example the CTD_depth, which was 19.7. If we have separate semantics for these things, we could help to better integrate them. This applies to the 4D parameters (plus some others): depth, time, lat, long (could also have metagenome-associated temp values or similar).
For MMETSP, we need to incorporate the fact that the 4D info is about the collection source and not in-situ omics. If we handle this properly we'd be ready to incorporate data about other isolates, i.e. metagenomes of Prochlorococcus isolated from those coordinates, not metagenomes of anything from those coordinates. We'll have to figure out how to handle this and whether it goes in the 4D search or not. In future we could pull down type collections, ATCC, metagenomes of NCBI isolates, etc.
Planet Microbe/EarthCube_Proposal_Narrative grant in Google Docs.
Made a mini example figure of what the knowledge graph for PM could involve / help to deal with unit standardization here
Also coded up a separate mini example of querying a knowledge graph using python/sparql here
For the OMEGA-Compare tool I could maybe use InterPro2GO rather than PFAM2GO; check it out and see if it gives more info. Theoretically it should, as Pfam is a subset of InterPro.
For the taxonomy of EBI metagenomes: Alise suggested I check out Latent Dirichlet Allocation, a type of Dirichlet process proposed by Andrew Ng and others, where you roughly have documents with words that belong to different topics, which get assigned to the documents. It is similar to probabilistic latent semantic analysis (pLSA), but makes use of sparse Dirichlet priors (Bayesian vs. frequentist). There is a hierarchical version for making groups among clusters called the Hierarchical Dirichlet Process, which I can hopefully use. Paper on Hierarchical Dirichlet Processes here. Similar to the Chinese restaurant process. I would try to do a Dirichlet process where the documents are metagenomes, the words are k-mers and/or gene calls/Pfam/GO terms, and the topics are the putative ENVO terms.
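For intuition, a minimal Chinese restaurant process simulation (my own illustration, not from the discussion): customers are the data items (here they would be metagenomes or their words) and tables are the inferred clusters, i.e. the putative ENVO topics, whose number is not fixed in advance.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate CRP table assignments: customer i joins existing table t
    with probability |t|/(i+alpha), or opens a new table with
    probability alpha/(i+alpha)."""
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []  # assignments[i] = table index for customer i
    for _ in range(n_customers):
        # weights: existing table sizes, plus alpha for a new table
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)  # open a new table
        tables[k] += 1
        assignments.append(k)
    return tables, assignments

tables, _ = chinese_restaurant_process(200, alpha=1.0)
print(len(tables), sum(tables))  # number of "topics" found, then 200
```

Larger alpha yields more tables, which is the knob that lets the data speak for the number of categories.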
meeting with Clay:
embedding: know a word by the company it keeps (the context in which the word occurred). In NLP it's the other words around a word; when training an embedding, you look for words that share common context. Introduce a loss function to capture different properties. He works with the NLP group embedding syntactic structure (may be correlated); learn taxonomies of words with little training. Looking into hyperbolic embeddings: when words have a taxonomic hierarchy, embedding into Euclidean space is hard; you can separate things high in the tree but not the leaves. Instead make the space hyperbolic: moving in 2 different directions, points get far away from each other fast, and further from the root there is more room for laying things out (depends on having stuff in the intermediate steps). Try to embed things with taxonomic relationships into hyperbolic spaces and you end up with a tree in hyperbolic space. There are other tree induction algorithms. Ryan, on Matt's committee, has done work on inferring taxonomic reconstructions from Bayesian similarities. Topic modeling: thinking of documents as a mix of topics, a collection of topics in metagenomes, distributions of words that show up. Inference searches for what constitutes a topic; adjust parameters for how many topics represent the data.
HDPs add hierarchical structure. In stats, mixture models put a probability over when some type of thing becomes part of a sample; LDA/HDPs search over what makes up the things in the samples.
A likelihood function of how you expect GO terms to be distributed under some parameters, with priors coming from GO. A direct hierarchical clustering model.
A generative model can leave out some data and evaluate on the withheld data; you can't do that with non-generative models.
GANs (for Matt's problem): learners competing, one to generate and one to detect whether a sample came from the first learner or the original data. Competing with each other: the generator trying to fool the police, and the police trying to recognize if a sample came from the generator. The GAN tries to prescribe a category around the boundary.
Bayesian non-parametric models (LDA, HDP, etc.): infinite-dimensional; the inference algorithms allow the data to speak for the number of categories. Chinese restaurant process, stick-breaking, a sea of different processes, all about infinite dimensions; depending on what info is available you infer certain parameters. Dirichlet processes are building blocks for topics of topics (topics being the new ENVO classes). Ch. 10 of FCML has topic modeling.
searching on the EBI metagenomics page https://www.ebi.ac.uk/metagenomics/
You can apply filters such as pipeline version 4.1 and experiment type: metagenomic to get marine biome metagenomes processed with InterPro and taxonomic info.
Can download a csv table of 'marine biome' labeled metagenomes which would look like:
Analysis      | Pipeline version | Sample     | MGnify ID    | Experiment type | Assembly | ENA run    | ENA WGS sequence set
MGYA00147345  | 4.1              | ERS1510282 | MGYS00001481 | metagenomic     |          | ERR1806502 |
MGYA00147342  | 4.1              | ERS1510276 | MGYS00001481 | metagenomic     |          | ERR1806501 |
MGYA00147347  | 4.1              | ERS1510283 | MGYS00001481 | metagenomic     |          | ERR1806503 |
MGYA00147344  | 4.1              | ERS1510280 | MGYS00001481 | metagenomic     |          | ERR1806505 |
MGYA00147343  | 4.1              | ERS1510278 | MGYS00001481 | metagenomic     |          | ERR1806500 |
MGYA00147348  | 4.1              | ERS1510277 | MGYS00001481 | metagenomic     |          | ERR1806507 |
MGYA00147341  | 4.1              | ERS1510279 | MGYS00001481 | metagenomic     |          | ERR1806506 |
MGYA00147346  | 4.1              | ERS1510281 | MGYS00001481 | metagenomic     |          | ERR1806504 |
From this we could web scrape with a bash file calling wget to get the InterPro files, something like:
https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00147345/file/ERR1806502_FASTA_InterPro.tsv.gz
in order to get all the data for my INFO 521 final project.
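The scraping could be sketched like this. I'm assuming the URL pattern from the single example above generalizes to the other (Analysis, ENA run) pairs, which should be verified on a few analyses before downloading everything.

```python
# Build the MGnify InterPro download URLs from (analysis, run) pairs taken
# from the CSV table, and emit the wget commands a bash script would run.
BASE = "https://www.ebi.ac.uk/metagenomics/api/v1/analyses"

def interpro_url(analysis_id: str, run_id: str) -> str:
    return f"{BASE}/{analysis_id}/file/{run_id}_FASTA_InterPro.tsv.gz"

rows = [
    ("MGYA00147345", "ERR1806502"),
    ("MGYA00147342", "ERR1806501"),
    ("MGYA00147347", "ERR1806503"),
]
for analysis, run in rows:
    print(f"wget -nc '{interpro_url(analysis, run)}'")  # -nc: skip existing files
```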
https://radimrehurek.com/gensim/models/hdpmodel.html
Great video series on Bayesian statistics
TODO from Bonnie:
Kai: find out how many credits you can get from the grad college for transfer. Find out how many total credits need to be from UA.
From Elisha: schema.org expansion paper
The design patterns to auto-create the 'concentration of' classes I'll need to make can be found in both the entity_attribute_location.yaml as well as the entity_attribute_location.csv pages.
The .yaml script (YAML being an easy way to write JSON objects) is run with the csv file, autofilling terms based on the design pattern. I'm wondering exactly how I'd run it, and also if I should change line 30 of the .yaml script, text: "The %s of a %s when measured in %s.",
getting rid of the 'a', as they probably wrote the script with carbon atom (concentration of a carbon atom) in mind. I presume I'd be good to write new .csv files and add them to the /src/envo/modules folder.
My remaining question is how do I run it. According to the Makefile, "To make a release, type 'make release' in this directory".
I probably need to have owltools and robot installed; hopefully that'll work on my Mac. I could never get owltools to work on Gouda (insufficient RAM). The Makefile states requirements: Oort (http://code.google.com/p/owltools/wiki/OortIntro).
My understanding of the code in the Makefile is that it separates out the diffs, which I hope means that it won't re-run all the yaml patterns/associated csv files in the patterns and modules folders respectively. But I'm not sure about this, so I should consult with Pier before I try running it, if not Chris Mungall. I'd also need to make sure I can install owltools/robot/Oort. In the meantime I could prepare a csv file like the entity_attribute_location.csv file, with the relevant 'concentration of' classes for Planet Microbe. Make sure to get the right IRI number, the next one in my ID range; I'll have to track down what that was. I'm in the range between 3000000 and 3001000. Also I'll need to make sure we have all the proper import PURLs, for example for CHEBI, in the imports directory.
Later, once we establish a design pattern for the EBI biomes, I could write a new yaml script to auto-generate these.
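A toy version of what I understand the pattern filling to do with that line 30 template. The rows here are hypothetical PM-relevant examples, not from the real entity_attribute_location.csv, and the real pipeline also mints IRIs and logical axioms, which this skips.

```python
# The text template quoted from line 30 of the yaml pattern:
template = "The %s of a %s when measured in %s."

rows = [
    # (attribute, entity, location) -- hypothetical rows a PM csv might hold
    ("concentration", "nitrate", "seawater"),
    ("concentration", "chlorophyll a", "seawater"),
]
for attribute, entity, location in rows:
    print(template % (attribute, entity, location))
# The concentration of a nitrate when measured in seawater.
# The concentration of a chlorophyll a when measured in seawater.
```

The output shows why dropping the 'a' reads better for mass-noun chemicals like nitrate.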
Created a google doc for the ISTA Project with Ken.
From the EBI page on interpro2go, read this paper about it: Camon et al. (2005) An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 6. Check out the interpro2go mapping file.
From Pier: there will be a January ESIP meeting; check the ESIPfed.org site for details (I think Washington DC). It will involve people from NASA/NOAA, which will be pertinent to EarthCube (which is ESIP-related material). Hopefully Bonnie will send me. ESIP will be focusing on schema.org, so it's our chance to showcase PM as a scientific schema.org use case. Email Pier about PM/schema.org for an introduction email to Adam.
Also email Pier a link to the EBI biomes spreadsheet to be shared with the EBI folks once I make it.
Pier also launched an ontology for omics data, OMICON, so that should be super useful for the PM metadata.
Searching NCBI for things like North Pacific Subtropical Gyre
another name for Station ALOHA; we get 27 records, for example one called HOT metagenomic time and depth series 2010-2011,
which has many HOT metagenomes. //TODO: a crawling exercise to try and find all the publicly submitted HOT data in NCBI. //Alise has already done this.
Study these pull requests: https://github.com/EnvironmentOntology/envo/issues/622, https://github.com/INCATools/ontology-development-kit/issues/50 and https://github.com/EnvironmentOntology/envo/pull/623 to understand how Chris et al. are trying to make an Atomic base for a combined obo ontology.
The first will compare Marine, Human Digestive system, Soil, and Wastewater. I'm wondering if it would be better to take terms from these broad level classes which are annotated at slightly deeper taxonomy levels; for example, instead of just Marine we could take a mix of EBI MGnify biomes which are subclasses of Marine biome, e.g. Oceanic, Coastal and/or Hydrothermal Vent, or Environmental/Aquatic/Marine and Freshwater.
Looking at the EBI biomes which have been run with pipeline 3.0 we have the following in their biome hierarchy:
Marine (28945)
Oceanic (7661)
Abyssal plane (1146)
Photic zone (407)
Sediment (395)
Benthic (234)
Aphotic zone (62)
Coastal (5745)
Sediment (1563)
Sediment (2590)
Intertidal zone (2194)
Coral reef (976)
Oil-contaminated (408)
Salt marsh (387)
Estuary (170)
Sediment (177)
Sediment (139)
Hydrothermal vents (1147)
Sediment (228)
Diffuse flow (162)
Cold seeps (604)
Sediment (74)
Marginal Sea (307)
Oil-contaminated sediment (230)
Pelagic (229)
Neritic zone (69)
The ENVO hierarchy looks like the following:
marine biome -> Marine (28945)
estuarine biome -> Estuary (170)
marine salt marsh biome -> Salt marsh (387)
marine pelagic biome -> Pelagic (229)
neritic pelagic zone biome -> Neritic zone (69)
oceanic pelagic zone biome -> Photic zone (407)
marine benthic biome -> Benthic (234)
marine neritic benthic zone biome
marine bathyal zone biome
marine abyssal zone biome -> Abyssal plane (1146)
marine hadal zone biome
marine reef biome
marine coral reef biome -> Coral reef (976)
marine hydrothermal vent biome
marine black smoker biome
marine white smoker biome
marine ultramafic hydrothermal vent biome
marine basaltic hydrothermal vent biome
marine cold seep biome -> Cold seeps (604)
epeiric sea biome
marginal sea biome -> Marginal Sea (307)
temperate marginal sea biome
tropical marginal sea biome
mediterranean sea biome
ocean biome
marine upwelling biome
Notes:
In the neritic zone the water column is shallow enough to be sun exposed (wiki ref), thus ENVO:neritic pelagic zone biome -> EBI:Neritic zone (69)
marine neritic benthic zone biome: comprises sea floor from the high tide mark to a continental shelf break.
neritic pelagic zone biome: comprises the marine water column above a continental shelf.
Making the assumption from the wiki ref that most Abyssal plane samples are within an abyssal zone biome, hence: ENVO:marine abyssal zone biome -> EBI:Abyssal plane (1146)
Unfortunately, I realized that the number filtering on the MGnify website is deceiving: it's not the number of samples run with all the constraints applied, it's just the total number. It looks like for pipeline 3.0 there aren't samples run at the more detailed levels of hierarchical annotation. I'll have to use pipeline version 2.0 (for the marine-only data) and see which annotation terms I can actually get data for.
Trying to retrieve data from the site again with the filters: metagenomic, the biomes, then pipeline 2.0, pipeline 3.0, then 4.0 (most missing):
Marine 1007, 1141
Oceanic 630, 213, 204
Abyssal plane 0, 0
Photic zone 344, 0, 0
Sediment 26, 0
Benthic 0, 0
Aphotic zone 0, 0
Oil-contaminated 0, 0
Oil-contaminated sediments 0, 0
Coastal 7, 28, 0
Sediment 0, 3
Sediment 0, 33, 2
Intertidal zone 38, 10, 21
Coral reef 17, 0, 0
Oil-contaminated 0, 0
Salt marsh 0, 0
Estuary 16, 0
Sediment 0, 0
Sediment 0, 10
Mangrove swamp 0, 0
Microbialites 0, 0
Hydrothermal vents 61, 595, 0
Sediment 15, 0
Diffuse flow 0, 110
Microbial mats 0, 21
Cold seeps 0, 0, 0
Sediment 0, 0
Marginal Sea 0, 0, 0
Oil-contaminated sediment 0, 3
Pelagic 0, 0, 0
Neritic zone 0, 0, 0
I had tried version 2 of the pipeline as Photic zone had the most, but I'm trying version 3 as the second column. Pipeline 4.0 only has 733 marine total, so probably less in a 3rd column. Pipeline 4.1: Oceanic 0; probably skip 4.1, only 101 results for all marine.
Hence I think I will only take results from pipeline 2 as follows:
Marine 1007 -> marine biome
Oceanic 630 -> ocean biome
Photic zone 344 -> oceanic pelagic zone biome
Sediment 26
Intertidal zone 38
Coral reef 17 -> marine coral reef biome
Estuary 16 -> estuarine biome
Hydrothermal vents 61 -> marine hydrothermal vent biome
Sediment 15
Alternative idea:
marine biome -> Marine 1007
ocean biome -> Oceanic 630 (general)
marine pelagic biome (no reps)
oceanic pelagic zone biome -> Photic zone 344 (general)
oceanic sea surface microlayer biome (top 0.001m) -> Photic zone [0-1]m 85
oceanic sea surface biome (not in ENVO) -> Photic zone [2-20]m 86
oceanic epipelagic zone biome surface-(200-250)m -> Oceanic [21-200]m 98
oceanic mesopelagic zone biome 200-1000m -> Oceanic [201-1000]m 53
oceanic bathypelagic zone biome 1000-(2500or2700)m -> Oceanic [1001-2500]m 1
oceanic abyssopelagic zone biome (2500-2700)-6000m
oceanic hadal pelagic zone biome 6000m-10000m
oceanic benthopelagic zone biome 100m above the seafloor
marine benthic biome -> Sediment 26 (use Oceanic sediment as rep)
marine hydrothermal vent biome -> Hydrothermal vents 61 (most are between 1500 and 1550m depth)
marine reef biome (no reps)
marine coral reef biome -> Coral reef 17
estuarine biome -> Estuary 16
marine littoral zone (not biome but would fit here) -> Intertidal zone 38
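The pipeline-2.0 mapping above could be captured as a simple lookup table; a minimal sketch in Python (labels copied from the log, the dict structure itself is hypothetical):

```python
# EBI biome label -> ENVO class, per the pipeline-2.0 plan above.
# Labels copied from the log; the structure is a hypothetical sketch.
EBI_TO_ENVO = {
    "Marine": "marine biome",
    "Oceanic": "ocean biome",
    "Photic zone": "oceanic pelagic zone biome",
    "Coral reef": "marine coral reef biome",
    "Estuary": "estuarine biome",
    "Hydrothermal vents": "marine hydrothermal vent biome",
    "Intertidal zone": "marine littoral zone",  # not a biome class in ENVO
}

def envo_for(ebi_label):
    """Look up the ENVO label for an EBI biome label (None if unmapped)."""
    return EBI_TO_ENVO.get(ebi_label)
```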
Analyses with filters: pipeline_version:2.0, biome:Environmental/Aquatic/Marine, experiment_type:metagenomic
Aquatic
Marine 1007
Freshwater 34 Not good!
Analyses with filters: pipeline_version:3.0, experiment_type:metagenomic
Environmental
Aquatic
Marine 1141
Freshwater 264
Terrestrial
Soil 210
soil [0-0m] (depth) 84
soil [0-0.1m] (depth) 130
soil [0.1-5000m] (depth) 73
soil [1-100m] (depth) 50
Forest soil 0
Desert 0
Agricultural 0
Contaminated 0
Grasslands 0
Tropical rainforest 0
Engineered
Wastewater 105
Food production 147
Bioreactor 25
Modeled 32
the rest have fewer
Host-associated
Human
Digestive system 1883
Skin 0
Reproductive system 0
Respiratory system 0
Circulatory system 1
milk 0
Mammals
Digestive system 956
Ideas from Matt Bonhoff: open science data framework to convert csv/json into marked up text for our db for the user data input system.
Grakn: a knowledge-graph system we could potentially use to build our knowledge graph that does the unit inter-conversion / NOx column imputing.
Old version of Tax-E: low hierarchical-resolution data
marine biome -> Marine 1007 (general)
ocean biome -> Oceanic 630 (general)
marine pelagic biome (no reps)
oceanic pelagic zone biome -> Photic zone 344 (general)
oceanic sea surface microlayer biome (top 0.001m) -> Photic zone [0-1]m 85
oceanic sea surface biome (not in ENVO) -> Photic zone [2-20]m 86
oceanic epipelagic zone biome (200-250)m -> Oceanic [21-200]m 98
oceanic mesopelagic zone biome 200-1000m -> Oceanic [201-1000]m 53
marine benthic biome -> Sediment 26 (use Oceanic > Sediment as rep)
marine hydrothermal vent biome -> Hydrothermal vents 61 (most are between 1500 and 1550m depth)
marine reef biome (no reps)
marine coral reef biome -> Coral reef 17
estuarine biome -> Estuary 16
marine littoral zone (not biome but would fit here if it were) -> Intertidal zone 38
Example scikit-learn LDA, and played with these
From Clay: it would be OK to manually select biologically relevant GO term features, the equivalent of how humans decided which words were "stop words" in NLP.
Try it 3 ways: 1) (semi)manually select subsets from GO, 2) trim the first 3 levels of the GO hierarchy, 3) use a scikit-learn feature selection method based on correlating features.
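A minimal sketch of option 3, assuming the GO term frequencies are in a samples × terms table; the toy data and threshold are illustrative, not from the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy samples x GO-terms frequency matrix (the real one would be loaded
# from the subsample-frequency files).
rng = np.random.default_rng(0)
freqs = pd.DataFrame(rng.random((8, 5)),
                     columns=[f"GO_000000{i}" for i in range(5)])
freqs["GO_0000004"] = 0.5  # a constant, uninformative term

# Drop near-constant GO terms, analogous to removing NLP "stop words".
selector = VarianceThreshold(threshold=1e-6)
kept = selector.fit_transform(freqs)
kept_terms = list(freqs.columns[selector.get_support()])
```

Correlation-based selection would be a second pass over `kept`, dropping one of each highly correlated pair.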
From Matt B: Towards a Definition of Knowledge Graphs
Great resource for R including this page on clustering_and_heatmaps
I can also check out hierarchical-k-means-clustering
Open question: is a log transformation of the data OK to do? The data are exponentially distributed, hence Clay says yes. Alise is concerned that it may overly weight genes of lower abundance relative to those of higher abundance; it also helps us better see which are different by giving them more weight, but is it biologically OK to do?
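For reference, the transformation in question is just an element-wise log; a sketch using log(1 + x) so that zero counts stay defined (toy counts, not real data):

```python
import numpy as np

counts = np.array([[0, 1, 10, 1000],
                   [5, 0, 20, 500]], dtype=float)
# log1p compresses the dynamic range: high-abundance genes are pulled
# down toward the low-abundance ones, which is exactly Alise's concern.
log_counts = np.log1p(counts)
```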
When doing top-60 (GO terms with highest variance) heatmaps, the figure looks really good for the log-transformed data, but not so much for the term-frequency data. The latter mostly separates shallow soil from everything else based on really high values of the top-level GO terms metabolic process and catalytic activity. Without filtering out these high-level GO terms (like stop words), we're basically just learning that the interpro2GO mapping, or perhaps the Pfam detection, doesn't work well for different subclasses of soil metabolic processes and catalytic activity, which makes sense given how much isn't known about soil proteins. Hence for the term-frequency data I'll need to do the term-reduction for-loop, i.e. a modified version of the mk_d1_subsample_freqs.sh script, so that I'm not just getting high-level terms which aren't well characterized in some systems, like what we're seeing in soil.
The shallow-soil-vs-everything-else clustering is what we're now observing in the hclust on TF as well; it's likely the same story.
heatmap.2 defaults to the hclust clustering function; I'm not sure what the default method is, but it looks like Euclidean distance with centroid linkage. Perhaps using this post I can change the heatmap hclustfun to Ward, so that it's comparable to the other hclust.
Can also try this post to set the method which gets used in the row and column hclustering. Got that to work.
When comparing Ward clustering vs. the default (perhaps centroid?) I'm still not sure; I should read more on the different linkage methods, for example here https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering/217742#217742, and I could test to figure out which method is the standard. Regardless, it "looks" like Ward does a better job than the default, because the clusters don't just trail off into infinite recursive tails (which Clay alluded to being not super good) but are more symmetrical. The column clustering looks similar either way, but the row clustering looks better with Ward; I could check the latest one to see if the row clustering makes sense with the GO ontology.
I should also try ward.D2 in addition to the normal ward.D.
from this ETH posting about hclust we have the following:
Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(, method="ward") corresponds to hclust(, "ward.D2").
Link to Murtagh and Legendre 2014, the same Legendre who is the author of Numerical Ecology with R aka the source. From the paper we have:
Looking closer at forms of the criterion in (1) and (2) in Section 4.1 – and contrasting these forms of the criterion with the input dissimilarities in Sections 4.2 (Ward1) and 4.3 (Ward2) leads us to the following observation. The Ward2 criterion values are “on a scale of distances” whereas the Ward1 criterion values are “on a scale of distances squared”. Hence to make direct comparisons between the ultrametric distances read off a dendrogram, and compare them to the input distances, it is preferable to use the Ward2 form of the criterion. Thus, the use of cophenetic correlations can be more directly related to the dendrogram produced.
Hence I think I'll use ward.D2, as you need to have squared distances going into ward1 which I'm not doing, and you can just put regular (non squared) distances into ward2.
Ward1 will be kept for back compatibility with previous versions of the function and a warning will indicate that this method does not implement Ward's clustering criterion.
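For a quick cross-check outside R: SciPy's hierarchical clustering offers a single Ward method, which takes un-squared (Euclidean) distances like hclust's ward.D2. A sketch on toy points:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two obvious groups of points; pdist gives plain Euclidean distances,
# which is what the Ward2 (ward.D2) criterion expects as input.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(pdist(X), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```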
When I do the heatmap with the log-transformed data and cluster by the top 25 GO terms (by highest variance), in the first little cluster of 3 terms we have:
GO_0019281: L-methionine biosynthetic process from homoserine via O-succinyl-L-homoserine and cystathionine
GO_0015655: alanine:sodium symporter activity
GO_0032328: alanine transport
which biologically makes a lot of sense: 2 alanine genes plus a closely related methionine biosynthesis process.
Can use the Ontobee query page with a query such as this to get the labels:
PREFIX obo-term: <http://purl.obolibrary.org/obo/>
SELECT ?x ?label
from <http://purl.obolibrary.org/obo/merged/GO>
WHERE {
?x rdfs:label ?label.
values ?x {obo-term:GO_0046933
obo-term:GO_0046961
obo-term:GO_0006426
obo-term:GO_0006432
obo-term:GO_0006189
obo-term:GO_0006537
obo-term:GO_0009435
obo-term:GO_0009089
obo-term:GO_0006614
obo-term:GO_0006419
obo-term:GO_0006400
obo-term:GO_0006526
obo-term:GO_0006529
obo-term:GO_0015986
obo-term:GO_0042626
obo-term:GO_0009086
obo-term:GO_0006098
obo-term:GO_0006546
obo-term:GO_0008556
obo-term:GO_0003918
obo-term:GO_0006096
obo-term:GO_0006418
obo-term:GO_0032328
obo-term:GO_0015655
obo-term:GO_0019281
}
}
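Rather than hand-editing the VALUES list each time, the block could be generated; a small hypothetical helper (the prefix name matches the query above):

```python
def values_block(go_ids, var="?x", prefix="obo-term:"):
    """Build the SPARQL VALUES clause for a list of GO IDs."""
    terms = "\n".join(f"{prefix}{go_id}" for go_id in go_ids)
    return f"values {var} {{\n{terms}\n}}"

block = values_block(["GO_0032328", "GO_0015655", "GO_0019281"])
```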
Looking at the top 25 terms + their clusters for the log_norm and TF_norm data, I observe some similar patterns recurring, which we could drill down into as GO hierarchies to use in feature selection for differentiation/taxonomic clustering of ecosystems. Common themes:
alpha-amino acid biosynthetic process
ATPase activity; could also try ATP biosynthetic process (for whatever reason the purl for its major subclass glycolytic process isn't resolving). But it would make sense to cluster ecosystems based on ATP biosynthesis.
nucleotide metabolic process, nicotinamide nucleotide biosynthetic process, nucleoside phosphate metabolic process, nicotinamide nucleotide metabolic process; for this group go with subclasses of nucleotide biosynthetic process
tRNA aminoacylation, tRNA processing; for these do subclasses of tRNA metabolic process
Doing hclust on alpha-amino acid biosynthetic process, ATPase activity, nucleotide biosynthetic process, and tRNA metabolic process, the graphs are remarkably similar in terms of how the overall clusters look and which biomes go where. Is there some real biological differentiator which separates them all that way, even for these different metabolic/biosynthesis GO hierarchies?
Doing the alpha-amino acid biosynthetic process heatmap, we have a really clear-looking separation in both the columns and rows. The rows don't (at first pass) seem to group by amino acid family, but the 3 row clusters look consistent with matching expression patterns in the GO term frequency (in the heatmap). We'll see what this all means. Perhaps certain amino acid groups really are produced much more than others and this varies with biome type, or we just don't detect them in soil as well, which would explain why topsoil clusters together...
Thinking about the biological reasoning for the previous pass at the problem, which led to the selection of the GO hierarchies alpha-amino acid biosynthetic process, ATPase activity, nucleotide biosynthetic process (which I realize should actually be nucleoside phosphate metabolic process), and tRNA metabolic process, I realize all of these are implicated in (biological) translation, the production of proteins. Amino acids make up proteins, hence cells making proteins need to make amino acids. Nucleotides are the key constituents of DNA and RNA and hence are required to make both; it would be interesting to check whether it's about the biosynthesis of RNA-specific nucleotides, to see if it's specifically about transcription/translation, biological processes which for prokaryotes occur almost simultaneously. RNA is unstable and quickly broken down, hence as soon as it's being transcribed within a cell, ribosomes attach to it and start to translate the messenger RNA (mRNA) into proteins; I'm pretty sure that in prokaryotes this happens as the mRNA is being transcribed. Additionally, tRNA metabolic process makes sense in the context of translation, as tRNA molecules bring amino acids to the ribosome/mRNA complex to complete translation into proteins. Finally, ATPase activity makes sense, as ATPases (adenosine triphosphatases) are a class of enzymes that catalyze the decomposition of ATP into ADP, liberating energy used to drive other chemical reactions. Many cellular processes, including translation (and the litany of its sub-processes), require ATP, though it is not specific to translation.
I see subclasses of aminoacyl-tRNA synthetases, which are enzymes that catalyze the bonding between specific tRNAs and amino acids during translation.
The story emerges! It's about the clustering of ecosystems based on the relative proportions of genes involved in translation. We can cluster ecosystems based on how the prokaryotes there tend to do translation. In a sense we are clustering life by its various variations on translation, and seeing in which ecosystems the different translation types inhere. For example, the overall ecosystem clustering pattern I see over and over in these various diagrams is that there are two major clusters: one with all the 0m soil plus some food production and human and animal digestive system biomes; the other with clusters for freshwater, deep soil, human/mammalian digestive systems, wastewater, and marine/food production.
What I will do, as a hopefully final figure for this story for now, is use all the GO term subclass sets for the various translation-involved hierarchies, including the 4 I have (plus I'll check GO for more which may be relevant), and make an hclustering figure in which the selected GO term features are all (more or less) involved in translation. Then I'll propose that hclustering as the hierarchy/taxonomy of ecosystems based on genes involved in translation. Hence a prokaryotic translation-centric hierarchy of ecosystems.
Looking for more translation related classes from GO:
translation, with synonym protein biosynthesis: it has very few subclasses, but try it regardless.
RNA binding, or its subclass translation factor activity, RNA binding: not a ton of subclasses, but relevant. RNA binding has many other subclasses, not all of which are directly involved in translation, though most appear to be. Examples include mRNA binding, tRNA binding, and rRNA binding, each with more subclasses, all relevant to translation. It would be good to add subclasses of RNA binding, but there are 174 of them; it makes more sense to take only the subclasses for which I know there is a direct relation to translation/transcription, namely: tRNA binding, mRNA binding, translation factor activity, RNA binding, and rRNA binding. For these I will include the top-level term (e.g. mRNA binding) among the list of subclasses.
translational elongation: not many subclasses, but just add it to be comprehensive.
regulation of translation many subclasses (maybe more eukaryotic specific but still worth trying)
negative regulation of translation many subclasses, is a subclass of regulation of translation.
I could also try to dig down into ATPase activity to find translation-specific classes. I still need to look at the subclasses of ATPase activity which we see in our data helping to define the clusters, but perhaps (if it isn't already there) I could instead use the subclass RNA-dependent ATPase activity. I could also go the other way and look at the superclass of ATPase activity, nucleoside-triphosphatase activity, or its subclass GTPase activity, as certain GTPases such as those of the translation factor family play an important role in initiation, elongation, and termination of protein biosynthesis; there are not many subclasses of GTPase. Perhaps better to go with nucleoside-triphosphatase activity, which also has the subclass helicase activity (enzymes that unzip RNA/DNA), its subclass RNA helicase activity, and in turn ATP-dependent RNA helicase activity, which also shows up as a subclass of ATPase activity. Perhaps I'll just take subclasses of nucleoside-triphosphatase activity to make sure to include the GTPase and helicase activity classes. Many of the subclasses already overlap to a certain extent, for example the overlap between ATP-dependent helicase activity and GTP/ATPase. It was too memory-intensive to use the query_for_subclasses_of script; if there are too many classes I'll just do separate queries for GTPase activity, coupled; RNA helicase activity; and ATP-dependent helicase activity. There were 216, so take the separated ones.
On further inspection of the ATPase activity hierarchy, I realize only the subclasses of ATP-dependent helicase activity are relevant to translation/transcription, as there are many different ATPases which power many different things. ATP-dependent helicases, whether for unzipping DNA, RNA, or both, are relevant to translation, as DNA needs to be unzipped in order to transcribe it and then translate it. A separate question could be asked about classifying ecosystems by their ATPase activity, especially in regard to what molecules they transport etc., but that can be for another time. Maybe with all nucleoside-triphosphatase activity (http://purl.obolibrary.org/obo/GO_0017111) subclasses, one could build a different gene-centric model for clustering ecosystems based on how they derive energy from various NTPs. Like translation, these processes are pretty well studied, hence why I can find a good deal of them when going through the dubious interpro2GO bottleneck of which proteins have been mapped to GO terms. Maybe I can do this as a separate figure and compare and contrast the "energy"-centric and translation-centric taxonomies of ecosystems.
With regard to GTPase activity: in addition to protein biosynthesis (which I'm after), it is involved in signal transduction, membrane translocation, and vesicle transport, none of which I want to include here in the translation-centric feature-selection model. Hence I don't want GTPase motor activity, but all the other classes are about the coupled reaction and are not specific about what it's coupled to. As coupling is really important for protein biosynthesis, I'll take the 3 subclasses (including itself) of GTPase activity, coupled.
All told, for these translation-associated classes (dataset called translation_assoc.csv) there were 574 GO term features (with just a few duplicates), of which 99 could be found in our data.
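The 574-to-99 reduction is just a deduplicated set intersection between the curated list and the GO terms observed in the data; a sketch (IDs illustrative):

```python
def usable_features(curated_ids, observed_ids):
    """GO terms that are both in the curated list and present in the data."""
    return sorted(set(curated_ids) & set(observed_ids))

feats = usable_features(
    ["GO_0006412", "GO_0006412", "GO_0003743", "GO_0019281"],  # dup on purpose
    ["GO_0006412", "GO_0019281", "GO_0008152"],
)
```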
for i in *.pdf; do sips -s format jpeg -s formatOptions 100 "${i}" --out "${i%pdf}jpg"; done
the purl http://purl.obolibrary.org/obo/GO_0006096 glycolytic process doesn't resolve on the ontobee server
marine biome
ocean biome
marine pelagic biome
oceanic pelagic zone biome
oceanic sea surface microlayer biome
oceanic sea surface biome
oceanic epipelagic zone biome
oceanic mesopelagic zone biome
marine benthic biome
marine hydrothermal vent biome
marine reef biome
marine coral reef biome
marine littoral zone
Doing a post-mortem on the Tax-E round 1 project.
From my Translation_associated dataset 1 hclust/heatmap analysis: the biome clusters essentially broke down by how sparse the overall genomes were, i.e. into large clusters each with varying degrees of completeness. Instead of differences throughout the analysis, we had a gradient from complete to incomplete. Hence I'll collect some stats to try and better understand what was happening; I mostly suspect it's genome size (smaller vs. larger) which made the two overall clusters.
EBI ERR # | cluster | biome | initial reads | reads sub for QC | median read length | # Interpro terms | notes |
---|---|---|---|---|---|---|---|
ERR1527906 | 1 | soil_0m | 374000 | 290000 | 255 | 308432 | very bad GO term annotations only top levels |
ERR1854464 | 1 | food_production | 26000000 | 2000000 | 150 | 617000 | not super well distributed GO terms freqs |
ERR209517 | 1 | human_digestive | 65500000 | 0 | 115 | 3620 | barely any InterPro terms nor any reads passing QC? Looks like all the reads were dropped in the length-filtering step, hence the low term counts. |
SRR2182543 | 2 | food production | 3600000 | 2000000 | 150 | 1000000 | still uneven GO dist better than group 1 though |
ERR525820 | 2 | human_digestive | 41000000 | 2000000 | 100 | 7300000 | still uneven GO dist better than group 1 though |
ERR1662271 | 2 | marine | 4000000 | 2000000 | 137 | 1800000 | more even MF BP but still uneven |
ERR1726991 | 2 | freshwater | 10500000 | 2000000 | 150 | 3700000 | more even MF BP but still uneven |
ERR1612256 | 2 | soil | 70700000 | 2000000 | 100 | 13600000 | more even MF BP but still uneven |
Take-homes from this table: the number of initial reads isn't exactly correlated with the low/high cluster, but there's clearly a difference in the number of InterPro terms annotated; cluster 1 samples are typically under a million, whereas cluster 2 samples are > 1 million. I'm wondering if ERR209517 (cluster 1, human digestive system) having barely any InterPro terms is due to human contamination? I imported the forward reads into CyVerse and ran Centrifuge on it, as that would tell me if the kmers belong to human (there's probably a better way to do it; maybe the DeconSeq tool would be better).
Perhaps the human-decontamination step would be a good idea; however, I'm thinking that a size cutoff may not actually help. Instead we'd want an InterPro-terms cutoff value such as 1 million. To get there, though, we may have to run the relevant parts of the EBI pipeline ourselves, so that it's the same version, then only analyze genomes with >= 1 million InterPro matches.
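The proposed cutoff would be a one-line filter on the stats table; a sketch with pandas (toy rows drawn from the table above, column names hypothetical):

```python
import pandas as pd

stats = pd.DataFrame({
    "run": ["ERR1527906", "SRR2182543", "ERR525820"],
    "interpro_terms": [308_432, 1_000_000, 7_300_000],
})
MIN_INTERPRO = 1_000_000  # proposed cutoff separating cluster 1 from cluster 2
keep = stats[stats["interpro_terms"] >= MIN_INTERPRO]
```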
Tax-E mini version: hand-select 10 metagenomes with >1,000,000 InterPro terms, run using EBI-metagenomes pipeline version 3.0.
Marine:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00083249?version=3.0#functional Marine > Hydrothermal vents Mid-Cayman Rise Subseafloor microbes at Mid-Cayman Rise
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00085839?version=3.0#functional Marine > Coastal artificial minimal coastal microbial mats at dilution 0, replicate 3
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00104213?version=3.0#functional Hydrothermal vents > Diffuse flow Metagenomes and metatranscriptomes from the diffuse hydrothermal vents of Axial Seamount from 2013 east pacific
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00134047#functional Marine > Hydrothermal vents > Diffuse flow Metagenomes and metatranscriptomes from the diffuse hydrothermal vents of Axial Seamount from 2015 west pacific
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00084050?version=3.0#functional Marine > Coastal Oyster and water samples from Texas bays.
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00085820#functional Marine A metagenomics study of two north sea communities.
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00096449#functional Marine Metagenomic, Metatranscriptomic and Metviriomic analysis of samples collected at four time points during a single day at the Gulf of Aqaba in the Red Sea. Sample Red Sea Diel
Marine Sediment:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00100646?version=3.0#functional Marine > Sediment Postglacial viability and colonization in North America's ice-free corridor
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00096438?version=3.0#functional Marine > Coastal > Sediment Metagenomic investigation of the coastal region of south Gujarat using soil sample - Nargol, Umargam and Dandi
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00098222#functional Marine > Intertidal zone > Sediment Shotgun metagenomic study of sedimentary ancient DNA (sedaDNA), from four strata of sediment core taken from an Bouldner Cliff, a submarine archaeological site in the Solent. Dates of
Soil:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00084040?version=3.0#functional Soil > Oil-contaminated
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00097651?version=3.0#functional Soil Realigned salt marsh, Fingringhoe Range, Winter Samples from salt marshes in the south of England
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00097639?version=3.0#functional Soil Natural salt marsh, Mersea Island, Summer Samples from salt marshes in the south of England
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00071161?version=3.0#functional Soil Metagenomic samples from soil saudi arabia Metagenomic samples from soil
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00071082?version=3.0#functional Soil > Contaminated, oil contaminated military base Hungary
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00083523?version=3.0#functional Soil Russia Soil samples collected for whole-genome shotgun sequencing. Sample 1129-1-6
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00076567#functional Soil Sample kernen 1130 Soil samples collected for whole-genome shotgun sequencing.
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00126471#functional Soil > Permafrost NGEE Arctic Microbial Communities of Polygonal Grounds Arctic tundra
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00084040#functional Soil > Oil-contaminated Metagenome from enrichment culture olive oil contaminated enrichment culture (not a real ecosystem?)
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00067564#functional Soil Labirinto_Cave_Metagenomes 400 m sediment Functional and taxonomic profiling of microbial communities in caves of the Algarve, Southern Portugal.
Aquatic:
Freshwater:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00087112?version=3.0#functional Freshwater > Lake Effects of organic matter manipulation on bacterial community diversity and function Sudbury lake mesocosms
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00069093?version=3.0#functional Freshwater > Groundwater > Contaminated Sample Little Forest Legacy Site
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00112305?version=3.0#functional Freshwater > Lentic By taking three metagenomic approaches to reveal the microplankton communities from composition to functional properties in this study.
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00069086?version=3.0#functional Freshwater > Groundwater > Contaminated Study of the changes in the metagenome of the groundwater community at
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00089085?version=3.0#functional Freshwater > Lake tucurui Brazil
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00085383?version=3.0#functional Freshwater > Lentic > Sediment Daisy Lake Shotgun Test
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00123081?version=3.0#functional Freshwater > Groundwater > Contaminated River Sediment metagenomes of textile dye degrading communities collected from Ankleshwar, India
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00112307?version=3.0#functional Freshwater > Lentic By taking three metagenomic approaches to reveal the microplankton communities from composition to functional properties in this study.
Hot Spring:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00095527?version=3.0#functional Aquatic > Thermal springs > Hot (42-90C) Hot spring microbial communities from Beowulf Spring, Yellowstone National Park, USA - T=65-68 metagenome
Non-marine Saline and Alkaline:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00115219?version=3.0#functional Aquatic > Non-marine Saline and Alkaline > Saline > Microbial mats Microbialite and microbial mats systems in lagoons, salt flats and Volcanoes of Andean South America Altiplane Argentina
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00112226?version=3.0#functional Aquatic > Non-marine Saline and Alkaline > Saline > Microbial mats Microbialite and microbial mats systems in lagoons, salt flats and Volcanoes of Andean South America Altiplane Chile
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00087076?version=3.0#functional Aquatic > Non-marine Saline and Alkaline > Hypersaline Metagenomic analysis of an Iranian Hypersaline Lake microbial ecosystem
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00094695?version=3.0#functional Aquatic > Non-marine Saline and Alkaline > Alkaline Molecular study on haloarchaea Egypt
Air:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00112459?version=3.0#functional Air > Outdoor Air Metagenomic analysis of the atmospheric microbial composition over Cyprus.
Wastewater:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00082010?version=3.0#functional Wastewater > Activated Sludge Shotgun sequencing on multiple activated sludge samples
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00097786?version=3.0#functional Wastewater > Nutrient removal > Dissolved organics (anaerobic) Metagenomic analysis of anaerobic reactors for wastewater treatment
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00095145?version=3.0#functional Wastewater > Industrial wastewater > Petrochemical Retrieval of Commamox genomes using metagenomics
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00095121?version=3.0#functional Wastewater > Activated Sludge EMG produced TPA metagenomics assembly of the Genomic and in situ investigations of the novel uncultured Chloroflexi associated with 0092 morphotype filamentous bulking in activated sludge (Genome sequence for Ca. "Promineofilum breve" Cfx-K) data set
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00131888?version=3.0#functional Wastewater > Nutrient removal > Dissolved organics (anaerobic) Heterotrophic methanogens dominate in anaerobic digesters
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00102737?version=3.0#functional Wastewater > Nutrient removal > Biological phosphorus removal > Activated sludge Metagenomes of Danish EBPR WWTPs
Food Production:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00105558?version=3.0#functional Engineered > Food production > Dairy products Thermus thermophilus is responsible for the pink discolouration defect in cheese
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00127067#functional Engineered > Food production Dietary supplements Metagenome Sample Product J
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00131717#functional Engineered > Food production We describe here the metagenomics-derived viral sequences found in beef, pork, chicken purchased from supermarkets in San Francisco maybe viral but it may also just be the metagenome of the meat itself
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00131825#functional Engineered > Food production food contamination metagenome Metagenome EC1705-spinach fresh bagged spinach spiked with STEC
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00129028#functional Engineered > Food production > Fermented beverages Metagenomic analysis of ivorian sweet wort (tchapalo processing) Metagenomic analysis sweet wort
Human Digestive:
Fecal:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00128713?version=3.0#functional Human > Digestive system > Large intestine > Fecal EMG produced TPA metagenomics assembly of the Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life (InfantGut) data set & Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life ... human feces from Denmark
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00102907?version=3.0#functional Human > Digestive system > Large intestine > Fecal Dysbiosis of gut microbiota contributes to the pathogenesis of hypertension
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00128794?version=3.0#functional Human > Digestive system > Large intestine > Fecal EMG produced TPA metagenomics assembly of the Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life (InfantGut) data set & Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life ......... Stool sample from spanish baby looks like
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00102508?version=3.0#functional Human > Digestive system > Large intestine > Fecal Analysis of stool samples from sickle cell disease patients and healthy controls Stool sample from human subject - Healthy Washington DC
Oral: saliva
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00067596?version=3.0#functional Human > Digestive system > Oral > Saliva Oral microbiome samples from the Philippines EMG produced TPA metagenomics assembly of the Oral microbiome samples from the Philippines (Phillipines oral microbiome) data set.
Oral:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00067599?version=3.0#functional Human > Digestive system > Oral EMG produced TPA metagenomics assembly of the Oral Microbiome (human oral metagenome) data set. & Oral Microbiome China
Oral: animal buccal mucosa
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00103104#functional Mammals > Digestive system > Oral cavity > Buccal mucosa Metagenomic analysis of gut microbiota in sows and piglets (EBI) Sample gut metagenome from Weaned Piglet 7
Animal Digestive:
Fecal:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00094735?version=3.0#functional Mammals > Digestive system > Fecal Intergenerational transfer of antibiotic-perturbed microbiota enhances colitis in susceptible mice Sample Control Wild-type pup ... mouse poop
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00122972?version=3.0#functional Mammals > Digestive system > Fecal Uganda_DrugResistance2016 Surveillance for prevalence of drug resistance bacteria in Ugandan animal agriculture .... not sure which animal's poop
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00128533#functional Mammals > Digestive system > Large intestine > Fecal A longitudinal study of the feline faecal microbiome identifies changes into early adulthood irrespective of sexual development Feline faecal sample ... cat poop
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00076569#functional Mammals > Digestive system > Large intestine > Fecal The Effect of High Oxalate Exposure on the Gut Microbiota of Neotoma albigula White-throated woodrat feces Sample Maximum oxalate Neotoma albigula
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00074240#functional Mammals > Digestive system > Large intestine > Fecal The Pig's other genome: a reference gene catalogue of the gut microbiome ... pig poop Binary mixed pig china
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00102738#functional Engineered > Food production Microbes present in feed to Atlantic salmon Feed for Atlantic salmon farming SHOULD ACTUALLY BE under salmon feces
Other Don't use:
Oral:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00084472?version=3.0#functional Human > Digestive system > Oral A plaque on both your houses. Exploring the history of urbanisation and infectious diseases through the study of archaeological dental tartar .... tartar from old teeth?
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00067596?version=3.0#functional Human > Digestive system > Oral > Saliva Oral microbiome samples from the Philippines EMG produced TPA metagenomics assembly of the Oral microbiome samples from the Philippines (Phillipines oral microbiome) data set.
Marine viruses:
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00133855?version=3.0#functional Marine Sample TARA virus Arabian sea
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00102577?version=3.0#functional Environmental > Aquatic > Marine Marine water Metagenome Examine viral communities in ballast and harbor waters
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00133906?version=3.0#functional Marine Sample TARA virus south pacific
https://www.ebi.ac.uk/metagenomics/analyses/MGYA00133895?version=3.0#functional Marine Sample TARA virus pacific near mexico
For Tax-E version 2, modify merge_go.py to take the downloaded lists of functional InterPro terms. Store them following the Tax-E folder structure: ~/Desktop/software/tax-e/data/go/d1_subsample/
For now, manually download everything on the list, putting it into folders with names like 'marine', 'marine_sediment', ...
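A minimal sketch of the loading step merge_go.py would need, assuming the manual-download layout above: one subfolder per biome ('marine', 'marine_sediment', ...), each containing downloaded term-list files with the GO/InterPro accession in the first tab-separated column. The function name, file extension, and column layout are assumptions, not the actual merge_go.py interface.

```python
from pathlib import Path

def load_biome_terms(base_dir):
    """Map each biome folder name (e.g. 'marine') to the set of
    functional term accessions found in its downloaded files.

    Assumes one file per downloaded analysis, with the GO/InterPro
    accession in the first tab-separated column of each line."""
    biome_terms = {}
    for biome_dir in sorted(Path(base_dir).iterdir()):
        if not biome_dir.is_dir():
            continue  # skip stray files at the top level
        terms = set()
        for term_file in biome_dir.iterdir():
            with open(term_file) as fh:
                for line in fh:
                    # first column is assumed to be the accession
                    term = line.strip().split("\t")[0]
                    if term:
                        terms.add(term)
        biome_terms[biome_dir.name] = terms
    return biome_terms
```

This keeps the per-biome term sets separate, so merge_go.py can later intersect or union them across biomes when building the d1_subsample-style inputs.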