
spring_2019_log


Table of contents

January 2019

Jan.07, Jan.11, Jan.12, Jan.13, Jan.15, Jan.16, Jan.23, Jan.27, Jan.28,

February 2019

Feb.10, Feb.17, Feb.21, Feb.25, Feb.26, Feb.27, Feb.28,

March 2019

Mar.01, Mar.05, Mar.06, Mar.07, Mar.11, Mar.13, Mar.14, Mar.18, Mar.19, Mar.21, Mar.25, Mar.26, Mar.27, Mar.28,

April 2019

Apr.01, Apr.03, Apr.04, Apr.08, Apr.09, Apr.10, Apr.17, Apr.23, Apr.25,

Jan.07

Tax-E: examining the InterPro results, I noticed there is a lot of "double counting" of genes.

For example:

M00530:90:000000000-AC36U:1:2113:22500:7563-1:N:0:3_1_214_-	16a03a88e8f4cecbf58989a3eaab6da9	70	Pfam	PF02925	Bacteriophage scaffolding protein D	1	70	1.1E-45	T	09-03-2016	IPR004196	Scaffold protein D, bacteriophage	GO:0046797
M00530:90:000000000-AC36U:1:2113:22500:7563-1:N:0:3_1_214_-	16a03a88e8f4cecbf58989a3eaab6da9	70	Gene3D	G3DSA:1.10.1850.10		1	70	9.6E-46	T	09-03-2016	IPR004196	Scaffold protein D, bacteriophage	GO:0046797

Here, for the same read over the same region (1-70), both Pfam and Gene3D predict the same InterPro hit, IPR004196. If we simply summed the total number of InterPro hits we'd double count this annotation. So if/when we parse the final InterPro results (after re-running the analysis using the same EBI pipeline version on a bunch of genomes), we'd have to parse each results TSV file and eliminate the double counting: if the read is the same and the start-stop coordinates overlap, only count it once. I'm guessing that doing this would reduce the number of InterPro results by ~1/4 (a rough estimate based on looking at repeats in some data).
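A minimal sketch of that de-duplication, assuming the InterProScan TSV layout in the example rows above (read ID in column 1, start/stop in columns 7-8, InterPro accession in column 12); the function name is mine:

import csv
from collections import defaultdict

def count_deduped_interpro(tsv_path):
    # Count each (read, InterPro accession) pair once per region, skipping
    # member-database hits whose coordinates overlap a hit already counted,
    # per the Pfam/Gene3D example above.
    seen = defaultdict(list)  # (read, IPR) -> [(start, stop), ...] kept so far
    total = 0
    with open(tsv_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 12 or not row[11].startswith("IPR"):
                continue  # hit with no InterPro mapping
            read, ipr = row[0], row[11]
            start, stop = int(row[6]), int(row[7])
            intervals = seen[(read, ipr)]
            if any(start <= e and stop >= s for s, e in intervals):
                continue  # overlaps an already-counted hit: double counting
            intervals.append((start, stop))
            total += 1
    return total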

I still think I could use the 1 million cutoff to select reads, and I could do this programmatically from all the InterPro results run under all EBI pipeline versions. I could then write a script to parse them all and report the resulting number of InterPro hits, from which we could see how many samples we have in which biomes, etc., and select a cutoff of InterPro terms and/or which biomes to analyse for a "real" experiment.

Jan.11

Link to NCBI BioSample Packages, which has the MIxS MIMS etc. packages. The water ones have many optional attributes which I could ontologize for the sake of PM and beyond. Link to the MIxS/MIMARKS (x: sequence; marker genes: 16S plus others) paper. Link to the MISAG/MIMAG (SAGs and metagenome-assembled genomes) paper.

Link to EBI BioSamples: if we filter by submission package MIMS.me.water.4.0 or submission model MIMS/MIMARKS.water we get 957 samples, which as far as I can tell are all or mostly all 16S; we could webcrawl to get the attributes.

Could also search the ENA site for MIMS, download the XML, and parse it for all SRS (biosamples) of type MIMS Environmental/Metagenome sample from marine metagenome. (I presume this will get the same 16S samples as the EBI BioSamples link above.) Neither seems to get us the metagenomes of type WGS.

A better way of importing metagenomic data from EBI for PM is to use the MGnify portal and search with filters: experiment type (metagenomic or amplicon), biome (marine), as well as temperature and depth; that way we're more likely to get samples with metadata.

When doing this with metagenomic data we get 1052 results. You can download this as a CSV to get the sample numbers. The sample link, for example, is a page with the sample overview, the map, and the metadata in a human-readable format, including external links to the ENA website and the EBI BioSample. The former, for example, has the link to the FASTQ files. The latter (BioSample) has the metadata, which you can download as XML, JSON, Bioschemas, or Phenopacket. For example, the Bioschemas file (an ldjson file) can be downloaded via a web link such as https://www.ebi.ac.uk/biosamples/samples/SAMN02628409.ldjson, or https://www.ebi.ac.uk/biosamples/samples/SAMN02628409.json for the JSON file. This would be a good way of importing higher-quality data for PM.

We may however have to do some manual checking/editing and/or trimming of these data, as they can be a bit variable. In good cases the text will have associated units, stored as a unit tag in the JSON under the field (which I think is how Matt's doing this). Some have no units at all, and some have the units embedded in the same text field, for example "4.47 m". Sometimes you'll have the units embedded in the characteristics headers, for example Chlorophyll (ng/uL). In the best-case scenarios, for example TARA, you'll have separate text and unit fields for a given characteristic, as in this example from the JSON file for SAMEA2620852:

Chlorophyll Sensor	
  0	
    text	"0.214573"
    unit	"mg Chl/m3" 

It's interesting that EBI has also decided to store the metadata using JSON schemas, as Matt has been doing. These metagenomes are unfortunately not quite as clean as we need, so we'll have to decide whether to proceed with this and the extent to which we manually clean them; perhaps this can be automated somehow?
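A minimal sketch of pulling one record with the URL pattern above (Python standard library only; the characteristics key name is taken from the TARA example, and real records may vary):

import json
from urllib.request import urlopen

def fetch_biosample(accession):
    # Pull one BioSamples record as JSON, via the URL pattern above.
    url = "https://www.ebi.ac.uk/biosamples/samples/%s.json" % accession
    with urlopen(url) as resp:
        return json.load(resp)

sample = fetch_biosample("SAMEA2620852")
# characteristics hold lists of {"text": ..., "unit": ...} entries,
# as in the Chlorophyll Sensor example above
entry = sample["characteristics"]["Chlorophyll Sensor"][0]
print(entry["text"], entry.get("unit"))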

When searching MGnify for marine amplicons with depth and temp we get 4465 results. This is also a mixed bag of metadata with and without units. Perhaps a "heuristic" way to get around this is to filter the JSON files for fields which contain 'depth' and 'temp' (or some variation thereon) which also have units of C or m somewhere; a sketch of this follows the examples below. The units can appear in a variety of places in the JSON files; some examples are as follows:

total depth of water column	
  0	
    text	"35 m"
Depth	
  0	
    text	"3"
    unit	"m"
depth	
  0	
    text	"1m"
depth	
  0	
    text	"1.6 meters"

An example of a "bad" sample without units:

temperature	
  0	
    text	"27.318"
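A minimal sketch of that heuristic filter, covering the unit variants shown above (the regex and names are mine; a real implementation would need more unit spellings):

import re

# number optionally followed by a depth/temperature unit, covering the
# variants above: "35 m", "1m", "1.6 meters", and the bare "27.318"
NUM_UNIT = re.compile(r'^\s*(-?\d+(?:\.\d+)?)\s*(m|meters?|c|celsius)?\s*$', re.I)

def value_and_unit(entry):
    # entry is one {"text": ..., "unit": ...} dict from "characteristics"
    if "unit" in entry:
        return float(entry["text"]), entry["unit"]
    m = NUM_UNIT.match(entry["text"])
    if m and m.group(2):
        return float(m.group(1)), m.group(2)
    return None  # no recognizable unit: the "bad" temperature case above

def depth_and_temp(characteristics):
    # keep only depth/temp fields that yield a value with a unit
    out = {}
    for field, entries in characteristics.items():
        if "depth" in field.lower() or "temp" in field.lower():
            parsed = value_and_unit(entries[0])
            if parsed:
                out[field] = parsed
    return out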

Although we could assume that metadata such as nitrate or nitrite without units are in uM, we wouldn't know what to do for something like chlorophyll, which is typically reported in a variety of units. Perhaps an exercise for later would be to try to impute missing units for metagenome/amplicon metadata reported without units, using some machine learning methods trained on the "good data" which has units.

Jan.12

From Anna Kuparinen (Sky's friend, a prof in Finland): the wild, weird, out-there, challenging ideas are what really move science. Even if an idea turns out to be disproven, it will start a debate, which is how things progress (like the old Greek tradition of debate between students and teachers to arrive at the truth).

Journal tip: Trends in Ecology and Evolution is entirely perspective, opinion, and review (not primary lit) and high impact; she thinks you could publish about the struggles of applying different machine learning approaches to answering biological questions (a response to the informatics-heavy approach and lack of biological questions I note among informatician types).

Explore the environmental gradient, and see if the clusters correspond. For example, try hierarchical clustering of metagenomes along a temp, pH, or salinity gradient, ideally temperature. Perhaps try constraining this with the fluid ecosystem types: marine, freshwater, (maybe wastewater), hot springs, etc.

Key to a Science paper: a simple phenomenon/question of interest; a map of where such phenomena exist (with as many colors as possible); a causal relation (if x then y); and predictions of how things will be impacted. Could maybe try this with the taxonomy of fluid ecosystems along temp gradients. If we get clusters (which we attribute to step-wise temp ranges), we say these are the functional community types. With global warming we expect a shift in types -> could make predictions about which types exist where and what they would change to. That would be super brilliant if this worked out nicely.

Bootstrap the hclust results; outliers may end up in their own cluster. We want a probability distribution over clusters more than a static, rigid clustering.

Things in nature operate in step functions; this should apply to clusters of ecosystem functions as well.

Comparison across systems is good, for example different biomes. This is analogous to her example about the population growth rates of terrestrial mammals being similar to those of fish, and that there is some universality to the trends seen across different systems.

Break the tree (Tax-E) -> see if you can get the same tree from only ~10 features instead of 10,000s.

She liked the revisiting N* idea, testing if the Redfield ratio holds across different ecosystem types.

In response to Tax-E, she lent me her book, Body Size: The Structure and Function of Aquatic Ecosystems, which states that although body size explains a lot, it doesn't explain everything. She takes issue with the standard disproving-the-null-hypothesis idea as too narrow: if you're just out to prove the one thing you thought you were going to prove, you miss all the other possibilities of what may exist. It isn't good to "nay-say" and shut stuff down right away, because you may miss things, relationships, and patterns which you need to dig around and explore for.

Toward the temperature-gradient-dependent Tax-E: there are 20360 total metagenomes; filtering for those with temperature leaves only 1559, and further filtering those by biome we get 1529 marine, 13 freshwater, and 17 host-associated. Many of the freshwater samples are Groundwater > Mine drainage. All but 1 of the host-associated samples are Human > Digestive system > Oral. Many or all of the host-associated samples have <1 million InterPro annotations. Amazingly, none of the soil metagenomes have associated temp values. Hard to fathom.

I wonder if there could be a case to be made for a question about a depth gradient? Although I can imagine that depth in soil vs. aquatic systems doesn't work on the same scale. Filtering for metagenomic and terrestrial: 1481 results; additionally filtering by depth: 639. Metagenomes with depth, 3519 total: 639 terrestrial, 2450 marine, 228 freshwater, 11 non-marine saline and alkaline, 4 estuary, 8 thermal springs, 24 engineered (15 of which are food production), and 147 host-associated. A depth gradient across different systems doesn't seem really comparable. Doing this for pH, nitrate, phosphate, etc. is currently difficult (not currently searchable on MGnify); however, with an implemented PM it would be possible. Could be a publication or part of a PM pub.

Some soil metagenomes have pH in their metadata (an example here), so it may be possible to do a pH gradient across different biomes, at least soil and marine. We'd just need to do a lot of parsing to collect the appropriate samples.

Jan.13

Appending onto the temperature-gradient-dependent Tax-E from yesterday. There is a biome: root > Environmental > Aquatic > Thermal springs > Hot (42-90C). Although the samples from this biome don't have stated T values, we know they are >= 42 C. Of the 43 Thermal springs biome results, there are 29 under root > Environmental > Aquatic > Thermal springs > Hot (42-90C), and 3 under root > Environmental > Aquatic > Thermal springs > Hot (42-90C) > Sediment. Of the Thermal springs samples, only 9 have been run with the EBI pipeline; of those, only the two following results have >1 million InterPro hits:

1: MGYA00010586 from sample ERS802771, which they describe as having a temperature of 50° to 80° C

2: MGYA00095527 from SRS1616308, with Temp in the range 65-68

The ocean's sea surface ranges from ~0-32 C, hence >42 would be its own category (or multiple step-wise categories of its own). It may be worth running the 29 Hot (42-90C) and 3 Hot (42-90C) > Sediment results through the pipeline to see if any more are >1 million InterPro annotations.

Taking this a step further, a really interesting paper would explore the clustering of metagenomes along both a pH and a temp gradient. This may be feasible, as many of the marine samples which have a temp also have a pH. Exploring NCBI, I find things such as the hot spring water metagenome Nonoyufunki sample, which has both pH and temperature fields.

Hence, to achieve this I would get all the metagenomes I could which have associated pH and temperature values, and which have >1 million InterPro terms (or some other cutoff threshold) when run through the EBI pipeline. The simple question:

Do the functional genes of ecosystems cluster by temperature or pH? Or by some combination thereof?

Searching the NCBI biosamples page for hot spring we get many results such as:

https://www.ncbi.nlm.nih.gov/biosample/10617608 from the bio project: https://www.ncbi.nlm.nih.gov/bioproject/485075

Maybe search by bioproject or by organism: hot springs metagenome

Searching: BioProject for hot springs metagenome and Data type: Metagenome we get 1273 results.

Searching: BioProject for hot spring and Data type: Metagenome we get 1089 results.

Doing the search, many results are non-useful JGI samples with no metadata that are not public. In addition to

https://www.ncbi.nlm.nih.gov/bioproject/485075 (the Japanese one with 9 samples)

we find:

https://www.ncbi.nlm.nih.gov/bioproject/435466, but temp only, no pH

https://www.ncbi.nlm.nih.gov/bioproject/427880: temp and pH, only 1 sample, but good!

https://www.ncbi.nlm.nih.gov/bioproject/296750: most seem to have pH but not all have temp; hopefully we can find some useful samples here.

https://www.ncbi.nlm.nih.gov/bioproject/267597: very good, lots with pH and temp

https://www.ncbi.nlm.nih.gov/bioproject/247409: also good, temp and pH, from China

Just searching: BioProject for hot spring -> nothing new

Searching BioProject for hypersaline, we find:

https://www.ncbi.nlm.nih.gov/bioproject/341268: 2 out of the 3 salt mine samples have temp and pH!

https://www.ncbi.nlm.nih.gov/bioproject/240206: permafrost hypersaline, with temp and pH

Searching BioProject for permafrost, we find nothing new.

Searching BioProject for meromictic, we find nothing new, but as a side note I found the Hallam lab Sakinaw pyrotags, https://www.ncbi.nlm.nih.gov/bioproject/257655, which have some nice metadata and could be incorporated into PM.

Searching BioProject for snow, we find nothing new.

Searching BioProject for ice, we find:

https://www.ncbi.nlm.nih.gov/bioproject/476152: ice from glaciers and crater lakes, with 4 samples having pH and temp

https://www.ncbi.nlm.nih.gov/bioproject/448431: glacier lake metagenome, 6 with pH and temp

Searching BioProject for cryoconite, frost, glacier, melt water, and fjord: nothing new.

Side note for PM: https://www.ncbi.nlm.nih.gov/bioproject/308531, Baltic sediments with lots of metadata (but not temp or pH); better is something like https://www.ncbi.nlm.nih.gov/bioproject/213719 from the Marine Biological Labs at U Chicago, really good metadata with pyrotags. https://www.ncbi.nlm.nih.gov/bioproject/329071 -> Bergen Ocean Acidification Mesocosms, also good metadata. Found Saanich: https://www.ncbi.nlm.nih.gov/bioproject/247822.

Saanich Inlet Data:

For the Hallam lab Saanich Inlet data on NCBI: from samples 97-242 we get pyrotag samples, e.g. pyrotags from Saanich Inlet water incubation taken at station S3 100m 2015-02-15. The metadata for these isn't as good, but they are publicly accessible.

From 243 to ~287 (some mix-ups with the next set) we get metatranscriptomic data, for example Marine microbial communities from expanding oxygen minimum zones in the Saanich Inlet - MetaT SI074_150m_A, from the JGI, so no metadata. However, the names have the SI cruise numbers, so we could ask the Hallam lab people for the cruise Excel sheets (for example, the first one is cruise number SI074). Never mind, these are not publicly available.

For samples 288-389 we get what I believe are metagenomes, Marine microbial communities from expanding oxygen minimum zones in the Saanich Inlet - SI075_LV_DNA_10m, or genomic data, Marine microbial communities from expanding oxygen minimum zones in the Saanich Inlet - SI054_200m_RNA, but there is no public data for any of these. :(

From samples 390-815 we get the non-JGI pyrotag samples with good metadata, for example pyrotags (pre-filtered) from Saanich Inlet station S3 200m 2006-11-14. These are also publicly available. These ~400 pyrotags with good metadata would be worth importing into PM; they seem to cover the years 2006-2011.

From samples 816-827 we get more 454 pyrotag sequencing data, e.g. 454 pyrosequencing from Saanich Inlet cruise SI037, S4 190m depth, V6-V8 region of the 16S gene.

From samples 828-830 we get the WGS metagenomes, e.g. WGS metagenome from Saanich Inlet cruise SI020, 200m depth, but I can't find links to SRA...

However, when I check the link to the 804 SRA Experiments, there are many of the samples, including metagenomes, metatranscriptomes, and pyrotags... I'd have to figure out how they align. The metatranscriptomes link from their SRA page to the BioSample numbers, same for the metagenomes; however, they all lack metadata, so we'd have to collect that from the associated pyrotags, or just ask for all the Excel data. The 454 pyrosequencing from Saanich Inlet data does link to BioSamples with metadata.

More from Anna Kuparinen

She liked the temp and pH metagenome clustering idea. Her notes were to also do the taxonomic analysis and use some statistical methods to see how much of the variation is captured by taxonomy in addition to the functional gene clusters. Do some stuff like GLMMs (something like GLMMER4 in R). Also add lat/long as parameters (most things that have T and pH should also have them), in order to try to examine the lat/long gradient. I may also want to try Bayesian stats methods, and figure out a good normalization, such as z-score or log transform. She mentioned she knew some people who may have done transformations which may be useful.

From her experience it was good to just go and travel. She would just ask scientists who published things she found interesting if she could come and visit, and they always said yes. Then give a talk there as a visiting scientist.

She recommended 2 places I should visit

  1. The Helmholtz Centre for Environmental Research in Leipzig. Specifically, I should check out the work of, and ask if I could visit, Prof. Dr. Ulrich Brose. He did some seminal work on the functional characteristics of soil bacteria (or something to that effect) and now studies stuff like

"What are the effects of environmental gradients, climate change, and habitat fragmentation on the structure, stability and functioning of ecological networks?"

  2. She also recommended I check out the Centre for Ecological and Evolutionary Synthesis (CEES), UiO in Oslo. She recommended I reach out to Professor Nils Chr. Stenseth, write him a nice formal email etc., and see if he would be willing to pass me along to people at CEES working on synthesizing genomic data. She said I could tell him that she recommended me to email him.

Finally, she suggested I take an applied Bayesian stats course; she mentioned there was one at the epidemiology school in Estonia. I'm only finding one in the Netherlands, Applied Bayesian Statistics | Utrecht Summer School, which, other than being in the wrong country, fits her description of content and price. I'll still be here in April, but maybe I could look at the course program.

She previously had also mentioned something about a genomics synthesis institute in San Diego or Santa Barbara (somewhere in California).

Jan.15

From Alise:

COPO - Collaborative Open Plant Omics, an English project for plant omics very similar to Planet Microbe; I should look into what they're doing and try to communicate/meet with them to get ideas. It would be cool to visit these guys.

EUNIS habitat classification from the European Environment Agency. An ontology/controlled vocabulary(?) for the descriptions of habitats used in European natural parks and by some naturalists and scientists. Pier is aware; it is quite a chunk of work to ontologise properly, but large EU data stores are required to be annotated as such.

Jan.16

Meeting minutes from Bonnie. People with whom to be familiar:

Folker Meyer at Argonne National Laboratory in Chicago. MG-RAST, EMP, GSC.

Guy Cochrane, head of the European Nucleotide Archive; enough said. Lots of interesting publications to look through here. Skim/read some of these.

Read The new science of metagenomics, as well as Functional metagenomic profiling of nine biomes and Uncovering Earth's virome, to get insights.

Briefly check out Greg Caporaso at Northern Arizona. A semi-important genomics guy; not quite careful enough, according to Bonnie.

Elizabeth A. Dinsdale, first author on Functional metagenomic profiling of nine biomes; see her other papers: 10.1038/nmicrobiol.2016.42, http://dx.doi.org/10.1038%2Fncomms5498

Dr. Jill Banfield, a huge name in the field; Hugenholtz, Greg Dick, and many more were her post-docs (almost all now profs). Matt S also mentioned she has a metagenome database of some sort I should check out. I think she did that really provocative phylogenetic tree.

Professor Phil Hugenholtz, also a big name in metagenomics, at the Australian Centre for Ecogenomics. Also works with Prof. Gene Tyson; important early acid mine drainage metagenomics.

Nikos Kyrpides at JGI leads the Prokaryotic Super Program and the Microbiome Data Science group; worked with Carl Woese... no biggie. Develops the data management and comparative analysis platforms for microbial genomes and metagenomes (IMG).

From these:

From The European Bioinformatics Institute in 2018: tools, infrastructure and training paper, see the ELIXIR platforms site, where they have an Interoperability Platform (/interoperability).

The metagenomic data life-cycle: standards and best practices: in this paper they talk about the great achievement of comparing Tara and OSD samples with the same depth and salinity... we could do a lot better; maybe report how we handle way more metadata from more projects in PM.

Jan.23

Tara paper: collect the ~5 main Tara papers and see what metagenomes/transcriptomes/amplicons they link to, so we can figure out the real number of Tara metagenomes. Ask Matt how many they used.

Lab meeting: update for website. Send bio/research blurb; hiring a photographer. Bios due next lab meeting.

From Matt Schechter microbiology blogs to follow:

Microbiome Digest

Living in an Ivory Basement: stochastic thoughts on science, testing, and programming.

Getting Genetics Done

Blog: In between lines of code; the author works at the Centre for Ecological and Evolutionary Synthesis (CEES), which Anna mentioned I should contact.

Enseqlopedia

Simply Statistics blog on being a data scientist

Analysis pipelines with Python

http://eccb18.org/tutorial-13/

Thrash Lab

7 Things You Need to Know About Writing Lists That Work, a one-off blog post.

The Molecular Ecologist

Basem Al-Shayeb (microbiology, bioinformatics, genetics), a PhD student in Jill Banfield's lab; could be a model of starting early to have your own blog at the PhD stage. Build the online presence.

GO-SHIP: Towards a Sustained Global Survey of the Ocean's Interior. GO-SHIP brings together scientists with interests in physical oceanography, the carbon cycle, marine biogeochemistry and ecosystems, and other users and collectors of hydrographic data to develop a globally coordinated network of sustained hydrographic sections as part of the global ocean/climate observing system.

GO-SHIP doesn't seem to have omics data, but they have a large variety of biochemical metadata. It has links to, theoretically, the HOT and BATS CTD data, but I'm not convinced this would be of help to Planet Microbe.

Tara

Tara paper: collect the ~5 main Tara papers and see what metagenomes/transcriptomes/amplicons they link to, so we can figure out the real number of Tara metagenomes. Or take it from http://science.sciencemag.org/content/348/6237/873.full -> links to Computational eco‐systems biology in Tara Oceans: translating data into knowledge, which details how five Research Articles in this issue of Science describe the samples, data, and analysis from Tara Oceans (based on a data freeze of 579 samples at 75 stations as of November 2013).

For example, a data volume of ca. 13 terabytes has already been archived at the EBI (PRJEB402)

In table 1 from this paper there are 243 prokaryotic metagenomes and 43 virus metagenomes.

Jan.27

From the first 10 pages of Google Scholar results for papers citing Functional metagenomic profiling of nine biomes, those which may be relevant to me are: Cross-biome metagenomic analyses of soil microbial communities and their functional attributes; Identifying biologically relevant differences between metagenomic communities; Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients; Viral and microbial community dynamics in four aquatic environments; The microbial ocean from genomes to biomes; An Introduction to Ecological Genomics (book); Uncovering Earth's virome, which Bonnie had suggested; and Toward molecular trait‐based ecology through integration of biogeochemical, geographical and metagenomic data.

Jan.28

Saanich Inlet pyrotag data for PM. From the SI SRA page, records 383-792 (409 total) are pyrotags of the V6-V9 hypervariable region of the small subunit rRNA gene, with lots of accompanying metadata, as shown in the example below. These would be worth adding to PM.

strain	not Applicable
collection date	14-May-2008
geographic location	Canada: Saanich Inlet, Vancouver Island, BC
isolation source	seawater
sample type	Community DNA
broad-scale environmental context	coastal seawater, oxycline
latitude and longitude	48.6 N 123.5 W
depth (m)	100
Cruise_ID	SI021
Pre_fitration	No
Volume_filtered L	2
Oxygen uM	91.24
Phosphate PO4-3 uM	2.1
Silicate SiO4-4 uM	50.39
Nitrate NO3-2 uM	18.9
Ammonium NH4+ uM	ND
Nitrite NO2- uM	ND
Hydrogen Sulfide H2S uM	0
Methane CH4 nM	38
Nitrous Oxide N2O nM	14.2 

EBI marine metagenomes

For the 1493 results of EBI marine metagenomes with depth and temp, to calculate the total number of projects I did the following (print the 4th CSV column, which holds the project accession, then collapse to unique values):

awk -F "\"*,\"*" '{print $4}' ebi_marine_wgs_temp_depth.csv | sort | uniq

which results in 20 projects with metagenomes:

MGYS00000257 -> 10 Western English Channel diurnal study 20 with metatranscriptomes. 
MGYS00000275 -> 1 The Metagenome of the deep chlorophyll maximum in the Mediterranean studied by direct and fosmid library 454 pyrosequencing
MGYS00000288 -> 1 Brazos-Trinity Basin Sediment Metagenome: IODP Site 1320
MGYS00000289 -> 5 Peru Margin Subseafloor Biosphere good metadata with envo annotations
MGYS00000296 -> 2 Global Ocean Sampling Expedition (GOS) 52 results with tag sequences. 
MGYS00000297 -> 1 Arctic Winter marine ecosystem
MGYS00000324 -> 6 A metagenomics transect into the deepest point of the Baltic Sea reveals clear stratification of microbial functional capacities good other metadata
MGYS00000338 -> 1 Metagenome of a microbial consortium obtained from the Tuna oil field in the Gippsland Basin, Australia
MGYS00000382 -> 24 Amazon Continuum Metagenomes from ocean inward, unpublished data, good other metadata
MGYS00000391 -> 6 Sydney Harbour, good other metadata
MGYS00000410 -> 249 Shotgun Sequencing of Tara Oceans DNA samples size fractions for prokaryotes
MGYS00000462 -> 617 Ocean Sampling Day (OSD) 2014: amplicon and metagenome (617 is both together)
MGYS00000641 -> 146 Beyster Family Fund and Life Technologies Foundation-funded Global Ocean Sampling Expedition, 2009-2011
MGYS00000974 -> 344 Core genomes of cosmopolitan surface ocean plankton good metadata. 
MGYS00000991 -> 278  Arctic Ocean metagenomes from HLY1502, decent metadata. 
MGYS00001918 -> 114 Tara Oceans DNA samples corresponding to size fractions for small DNA viruses.
MGYS00001932 -> 58 Gulf of Mexico water and sediments Metagenome, also has transcriptomes, not all have temp total of 76 I'm guessing with transcriptomes included. 
MGYS00001935 -> 50 metagenomes EMOSE (2017) Inter-Comparison of Marine Plankton Metagenome Analysis Methods, envo annotations, 542 samples with tags
MGYS00002105 -> 37 Baltic Sea Surface Water Metagenome, good metadata envo/MIXS
MGYS00002108 -> 45 Red Sea metagenomes, decent metadata. 

I also found MGYS00001977, which is 282 results from OSD; not sure why it wasn't in this list.

Feb.10

Thinking about applying basic stats concepts to metagenomes for Tax-E. If we were to use samples run on the same/similar platform with the same/similar read length, for example Illumina MiSeq or HiSeq which have commonly been used, could we treat the number of reads sequenced as the sample size? We could treat the numbers of known and unknown ORF calls (or some variation on this) as treatments and perform an F-test, doing so to explore the feasibility of comparing metagenomes from different biomes, looking for biomes where the ratios of between- to within-group variation are similar enough that they may be compared. Maybe this won't help, as the differences in unknowns aren't standardized...

What if we did a clustering of ORFs and compared samples of different biomes based on variation among the ORF clusters? It's analogous to an F-test between the variances of two machines measuring something like size, but we'd split it up into the different ORF clusters and see how the variances across samples from different platforms compare for each ORF cluster. To deal with the sample size/degrees of freedom, would it be possible to treat reads as "observations" but normalize them to a constant per base pair? Or set a cutoff of base pairs sequenced as one observation for the sample size? Or just treat metagenomes as samples (and normalize for the number of base pairs sequenced)? For a chi-squared distribution you want at least a sample size of 10; 20 would be better. Maybe something in between: make up a unit of sequenced genome to relativize by and call an observation?

All this with the F-test would be to see if the different platforms or biomes have similar (statistically indistinguishable) variance for the different ORF cluster groups. I guess if enough of the clusters are of the same variance (in their detection), then we could say we can compare the metagenomes from different biomes/platforms. Analogous to the gas price F-test example here, we'd have the number of ORFs (of a given cluster) observed per X bp, and we'd be comparing the sample variance of two different machines, in this case for example Illumina MiSeq vs Roche 454. This should ideally be done on the same type of sample, analogous to two aluminum-producing machines or gas prices from two different cities. Tara used the Illumina HiSeq 2000; OSD the Illumina MiSeq; and the Beyster Family Fund and Life Technologies Foundation-funded Global Ocean Sampling Expedition, 2009-2011, the 454 GS FLX Titanium. The file sizes are all quite different: the OSD and GOS files I could download and open right away, whereas the Tara file was much larger and would take ~40-50 mins to download. Obviously the numbers of reads are quite different as well.
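To make this concrete, a minimal sketch of the two-sample variance F-test described above, using SciPy; the counts are made-up placeholders, not real data:

import numpy as np
from scipy import stats

def var_f_test(counts_a, counts_b):
    # Two-sided F-test for equal variances of per-sample ORF-cluster
    # counts (normalized per X bp) between two platforms or biomes.
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    f = a.var(ddof=1) / b.var(ddof=1)
    dfn, dfd = len(a) - 1, len(b) - 1
    # two-sided p-value from the F distribution
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, p

# e.g. one ORF cluster's counts per Mbp in MiSeq vs 454 samples (made up)
f_stat, p_value = var_f_test([12.1, 10.8, 13.0, 11.5], [9.9, 14.2, 8.7, 12.8])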

We need to differentiate the effects of platform type, batch effects, and sequencing depth. Maybe I'd first need to try an experiment where I take samples of the same biome, ideally from the same place, and see how the variances of the different ORF clusters compare. Maybe try to normalize by number of reads/read length, or total number of nucleotides sequenced? Just see if it's possible to compare the functional space of the same biome across platforms. If that can't be done, then we'd probably need to use data from the same/similar platforms to compare across biomes quantitatively.

For the question of comparing marine/aquatic vs soil/other biomes which aren't filter-fractionated, we'd definitely need a viral trimming step. Tax-E should either take the direction of comparing similar environments (marine/aquatic/aqueous) and trying to hclust them, or explore whether you can compare aqueous vs solid. I still think it'd be interesting to do the clustering of biomes along T/pH gradients and see which gene clusters are most affected by the gradients, but I need to establish some baseline for comparison first.

Feb.17

Some examples of ENVO biomes.

ENVO:biome

SubClass of (Anonymous Ancestor)

('environmental system'
 and ('has part' some 'cellular organisms')) or ('has part' some 'collection of organisms') or ('determined by' some 'cellular organisms') or ('determined by' some 'collection of organisms') or ('determined by' some 'plant anatomical entity') or ('determined by' some 'anatomical entity')

ENVO:aquatic biome

SubClass of:

'biome'
'determined by' some 'water body'

'water body' which is in the 'environmental feature' hierarchy.

ENVO:marine biome

SubClass of:

'aquatic biome'
'determined by' some 'marine water body'

The relation ENVO:determined by is an ENVO relation, not one from RO. I don't see anything about its domain or range, nor a definition.

ENVO:ocean biome

SubClass of:

'marine biome'
'has part' some ocean

Hence we can also have the 'has part' relation to an 'environmental feature'. I suspect we'd prefer to use 'determined by', but I'm not sure. There are many examples where they seem to be used interchangeably.

ENVO:forest biome

SubClass of:

'terrestrial biome'
'has part' some 'forest ecosystem'

ENVO:forest ecosystem

SubClass of:

ecosystem
'has part' some 'forest canopy'

Forest canopy is a 'geographic feature', but it is the ecosystem class which has this part, not the biome.

ENVO:marine upwelling biome

SubClass of:

'marine biome'
'has part' some 
    ('environmental system'
     and ('determined by' some upwelling))

where upwelling is an 'environmental feature'.

ENVO:cropland biome

SubClass of:

'anthropogenic terrestrial biome'
'composed primarily of' some 'cropland ecosystem'

ENVO:desert biome

SubClass of:

'has quality' some arid
'terrestrial biome'
overlaps some 'desert area'

'desert area' is a

ENVO:flooded grassland biome

SubClass of:

'grassland biome'
'has part' some 
    (soil
     and ('has quality' some wet))

Here we have an example where it can have part a 'material entity' (soil). This is what I was looking for.

In summary:

biome is in the system hierarchy (and is a material entity)

system 
  environmental system
    ecosystem
      biome

The relations for biome:

Subclass of 'environmental system' 
'has part' some ecosystem (could it thus have part some 'environmental system'?)
'has part' some 'environmental feature' || 'determined by' some 'environmental feature'
'has part' some 'material entity'
'has quality' some quality
'mereotopologically related to' some 'environmental zone' (site)

There also exists the 'environmental system' associated or determined by ... classes such as

ENVO:environmental system determined by a material

SubClass of:

'environmental system'
'determined by' some 'environmental material'

Feb.21

For semantics to deal with sequence info: use OBI when possible, also the new release of the Sequence Types and Features Ontology, and maybe a bit of EDAM (for file types), which can be forced in as a subclass of an OBO term. Maybe also SWO, the Software Ontology.

Feb.25

Visit with Chris at LBNL.

Objectives:

  • Launch Planet_microbe_ontology

  • Create the first round of PM terms using the ENVO: Design Patterns

  • Create new design patterns to create the EBI biome terms

Trying to install Docker on WSL: https://askubuntu.com/questions/1049852/how-to-solve-system-has-not-been-booted-with-systemd-as-init-system-pid-1, http://robot.obolibrary.org/. Our conclusion: it's probably not going to work; see the blog here.

'ODK': maybe need to dual-boot Ubuntu.

Feb.26

To create the Planet Microbe Ontology using their Ontology Development Kit ('ODK'). Doing it on a Linux machine borrowed here.

First step: install Docker on Ubuntu: https://docs.docker.com/install/linux/docker-ce/ubuntu/

GO workflow

From Chris, the GO daily workflow: the essence is to do small pull requests from origin rather than forking.

Create the new ontology

./seed-via-docker.sh -d ro pato bco envo iao obi -u kaiiam -t "Planet Microbe Ontology" pmo

Chris said to do it with the yaml file instead:

./seed-via-docker.sh -C examples/triffo/project.yaml

pmo.yaml

id: pmo
title: Planet Microbe Ontology
github_org: hurwitzlab
repo: planet-microbe-ontology
import_group:
  products:
    - id: ro
    - id: pato
    - id: bco
    - id: envo
    - id: bfo
    - id: iao
    - id: obi

We fixed the Python file on the ODK project. Then I got the example triffo YAML file (on Linux) to work.

Now I'm trying it on Windows with the Docker Quickstart Terminal, where I ran systemctl start docker then docker build ., which seems to be working, building the Docker image from the updated Python script. I'm hoping this is the equivalent of having run sudo make docker-build on Linux.

I also ran docker build -t obolibrary/odkfull:v1.2.5 . && docker tag obolibrary/odkfull:v1.2.5 obolibrary/odkfull:latest to set the tags and such properly (from how it was supposed to be run in the Makefile), to make it equivalent to running make on Linux.

Next I'm hoping I can use the Docker Quickstart Terminal or WSL to run the command to seed a new ontology from the ontology-development-kit:

./seed-via-docker.sh -C examples/triffo/project.yaml

This didn't work on Windows:

OSError: [Errno 26] Text file busy: 'target/triffo/_dynamic_files'

Got that to work with Chris's help and could run the test creation; now doing it with the real PMO.

./seed-via-docker.sh -C pmo.yaml

Got the error:

ERROR:root:b'make: *** [Makefile:182: imports/bco_import.owl] Error 1\n'

Perhaps something to do with BCO? I'm trying it again without importing BCO.

Running with just these to start (BCO had a PURL formatting issue):

pmo.yaml

id: pmo
title: Planet Microbe Ontology
github_org: hurwitzlab
repo: planet-microbe-ontology
import_group:
  products:
    - id: ro
    - id: bfo
    - id: pato
    - id: envo
    - id: iao
    - id: obi

Still having issues with ENVO.

Ask Seth about Noctua.

Feb.27

pmo.yaml

id: pmo
title: Planet Microbe Ontology
github_org: hurwitzlab
repo: planet-microbe-ontology
import_group:
  products:
    - id: ro
    - id: bfo
    - id: pato
    - id: envo
    - id: iao
    - id: obi
    - id: uo

Got that to run on the Ubuntu machine they lent me (my Windows machine is being useless today, it won't even open Skype!).

Got the output:

####
NEXT STEPS:
 0. Examine target/pmo and check it meets your expectations. If not blow it away and start again
 1. Go to: https://github.com/new
 2. The owner MUST be hurwitzlab. The Repository name MUST be planet-microbe-ontology
 3. Do not initialize with a README (you already have one)
 4. Click Create
 5. See the section under '…or push an existing repository from the command line'
    E.g.:
cd target/pmo
git remote add origin git\@github.com:hurwitzlab/planet-microbe-ontology.git
git push -u origin master

BE BOLD: you can always delete your repo and start again


FINAL STEPS:
Folow your customized instructions here:

    https://github.com/hurwitzlab/planet-microbe-ontology/blob/master/src/ontology/README-editors.md

Workflow to make changes to a GitHub repo via pull requests; see the GO daily workflow.

First, on GitHub start an issue, then get an issue number, for example 666.

Next, check out a branch corresponding to that issue: git checkout -b issue-666

Workflow for patterns: run this to change the line formatting from Unix to CRLF: ./unix2dos.sh ../modules/chemical_concentration.csv

Commit/add a file (all files staged to commit) with a message: git commit -am 'added utility to convert files form unix line format to CRLF'

Push my branch for the issue to origin to create a pull request: git push origin issue-666

Finally, on GitHub an editor can view the pull request and, if the build checks and everything are OK, click Merge Pull Request.

git checkout master

To delete the branch:

git pull (to get the new origin with the merged branch)

git branch -d issue-674

Design patterns: following similar patterns to those in uPheno.

For the EBI biomes, check out ROBOT templates; alternatively we use YAML patterns.

Feb.28

For the EBI biomes we've decided to go with the ROBOT templates.

To run it: for whatever reason it didn't seem to be working from ODK, so I'm trying to run it from ROBOT directly.

First install ROBOT; make sure to export it to your PATH, and install Java.

Run it as:

robot template --template modules/ebi_biomes.csv --output foo.owl -i imports/ro_import.owl

Getting some errors; Chris posted it to the ROBOT issue tracker.

Fixed that up.

We can now run the ROBOT template ebi_biomes.csv with the command (as in the ENVO Makefile):

make modules/ebi_biomes.owl

chemical concentrations

For the design patterns

Assuming you have Docker installed, by going through the DOSDP instructions you can run:

sudo ./run.sh make all_modules

Mar.01

To build the release

On Linux I installed the DOSDP Docker using sudo, so it was making me do everything with sudo, which could change file permissions in nefarious ways. So, following the instructions here, I used sudo usermod -a -G docker $USER to free Docker from sudo.

'ODK': put a note about this in the ENVO editors README. It's the uber do-everything-you-need-for-the-ontology in one Docker container. It has ROBOT for the ROBOT templates, and OWLTools for releases and patterns.

To prepare the release, in envo/src/envo (needs at least 8 GB memory):

./run.sh make prepare_release

to fix the error:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.

I ran: systemctl start docker

Fun docker/java memory issues: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

https://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#par_gc.oom suggests using the concurrent collector, which is enabled with the command-line option -XX:+UseConcMarkSweepGC.

From here it says how to add a Java arg to Docker.

Putting both together, I try the run.sh file as:

#!/bin/sh
docker run -m 12g -e JAVA_OPTS='-XX:+UseConcMarkSweepGC' -v $PWD/../../:/work -w /work/src/envo --rm -ti obolibrary/odkfull  "$@"

this failed with error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"-m\": executable file not found in $PATH": unknown.

Changed it to

#!/bin/sh
docker run -m 8g -e ROBOT_JAVA_ARGS=-Xmx7G -v $PWD/../../:/work -w /work/src/envo --rm -ti obolibrary/odkfull  "$@"

Get a different error, due to reasoning over some earlier changes. Installed ELK (put the .jar in the plugins folder) to check things over.

Can open envo-edit-module-merge.owl, run the ELK reasoner, change from asserted to inferred classes, and look for ones which end up under owl:Nothing (in red); this shows us the unsatisfiable classes. Make issues and fix them.

For PMO, to add new import ontologies:

Inside the Makefile, add 'foo' (an additional ontology) to the imports list.

Add a new foo.terms.txt file (seeded with the necessary terms) -> git add it.

Make the imports: foo_imports.obo.

git add the .obo and .owl files.

Add foo to the catalog-v001.xml file.

Add foo to envo-edit.owl.

Next time you make a release, git add the new files in root/imports (which are not in src).

git workflow again:

git checkout -b issue-686

do stuff to a file (or more)

git commit -am 'fixes #686' DON'T FORGET TO do the fixes!!!

git push origin issue-686

git checkout master

On github:

compare pull-request (in yellow) -> create pull request

wait for checks

merge pull request -> confirm merge

delete branch

Mar.05

Mungalls-Ontology-Design-Guidelines

Mar.06

Making use of the Frictionless Data schema types for the 'type' and 'format' columns in our PURL-mapping spreadsheets.

Matt's pm-schemas repo, where we'll put the datasets.

PM_datasets on Basecamp, where we'll put the PURL-mapping spreadsheets.

Playing around with PMO.

Added terms to the OBI terms.txt file in the imports/ folder,

then from the ontology folder ran:

./run.sh make all_imports

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.

I had to turn docker on for this to work.

It worked and I now have the newly imported terms showing up in the ontology.

From pier:

add "templated request, no review required" message when adding to the design pattern csvs without need for PR revision.

Mar.07

I was finally able to make the pmo.json file, by getting ./run.sh make prepare_release to run after hacking the Makefile to not generate import modules.

I had previously run it normally; then I deleted the 'comment on investigation' (http://purl.obolibrary.org/obo/OBI_0001898) class from the mirror/obi.owl file, the seed.txt file, and the ontology.

Then in the Makefile I commented out the lines:

#imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt
#	$(ROBOT) extract -i $< -T imports/$*_terms_combined.txt --method BOT \
		 annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@
#.PRECIOUS: imports/%_import.owl

These were importing the latest versions of the https://raw.githubusercontent.com/obi-ontology/obi/v2018-08-27/obi.owl files (which pulled in the bad 'comment on investigation' class, making the build process fail during the ELK reasoning step). All this hacking allows me to prepare the release, which builds the pmo.json file Matt can use.

Create RDF types for our schema table objects:

runs, experiments, samples, sampling events

Mar.11

https://github.com/jakevdp/PythonDataScienceHandbook

Mar.13

Link my presentation at LBNL

Mar.14

Post about OWL to JSON.

Links to WebVOWL; trying the OWL2VOWL program.

java -jar owl2vowl.jar -file pmo.owl

getting the error

Exception in thread "main" java.lang.UnsupportedClassVersionError: de/uni_stuttgart/vis/vowl/owl2vowl/ConsoleMain : Unsupported major.minor version 52.0

Got it to work. Made a util folder to run this and produce the pmo_owl.json for Matt to pull from.

Discussion with Matt: I'm making the schema JSON objects corresponding to the resources (CSVs) in the overall data packages for each project. Matt's script schema_tsv_to_json.py will take my schema TSV from stdin and output a Frictionless Data Table Schema JSON template, which I can use to create the JSON objects to put into the project's (master overall) datapackage.json file.

For now I need to establish a "final" template for this (one set of proprietary headers) which we'll use to create the json blobs, so that I can edit the files in .tsv format.

headers for schema_tsv_to_json

Here is (hopefully, for now) the final version of the headers for schema_tsv_to_json.py:

parameter	rdf type purl label	rdf type purl	pm:searchable	units label	units purl	measurement source purl label	measurement source purl	pm:source category	pm:source url	frictionless type	frictionless format

Protege entity creation preferences:

Entity IRI:

Start with: (choose) Specified IRI: http://purl.obolibrary.org/obo

Followed by: (choose) /

End with: (choose) Auto-generated ID

Entity Label (for use with Auto-generated ID)

(choose) Same as label renderer

Auto-generated ID:

(choose) Numeric (iterative)

prefix: ENVO_

(no suffix)

Digit count 8

Start: 3,000,000

End: 3,100,000

Probably should check the Remember last ID between Protege sessions box

Mar.18

Remove the pm:source category from the ontology sheets and add a measurement source protocol, which links to the relevant protocol page (when possible). For example: History of CTD Dissolved Oxygen Measurements during the Hawaii Ocean Time-series (HOT).

Mar.19

Responding to pull request 720, I did the following to check it out on my own computer and view it in Protégé.

git checkout issue-719 to switch to the branch

git pull origin issue-719 to pull the changes from the branch to my computer (it's really hard for me to read OWL diff files on GitHub).

Mar.21

Creating the OSD datapackage.json

from ~/Desktop/software/planet-microbe-datapackages/OSD/ontology$ I ran:

cat osd.tsv | ../../../pm-schemas/scripts/schema_tsv_to_json.py > ../datapackage.json

Filling in the headers of the datapackage using Matt's example from here

Made some changes to my Atom config file; added the following in order to have tabs be 2 spaces in .json files:

".js.source":
  editor:
    tabLength: 2

Read more about https://frictionlessdata.io/

Mar.25

Create a new SSH key:

cd

If you don't have a .ssh directory, create one: mkdir .ssh

Run ssh-keygen to generate a key pair.

pbcopy < ~/.ssh/id_rsa.pub copies the public key to the clipboard (Mac); otherwise cat it out.

Go to https://github.com/settings/keys

New SSH key: paste the key into the main box (and add a title).

Connect to UA hpc

pbcopy < ~/.ssh/id_rsa.pub

vi ~/.ssh/authorized_keys and add the line of text from id_rsa.pub.

If the file does not exist, create it. Ensure that the permissions on the file are 600 ('chmod 600').

ssh <NetID>@hpc.arizona.edu

In your local terminal, open '~/.ssh/config' and add the following lines:

Host hpc
Hostname hpc.arizona.edu
User <NetID>

bash_completion on mac

brew install bash-completion

Setting remote for github fork

Post on configuring-a-remote-for-a-fork

UA VPN

Link to documentation about how to download and install.

Mar.26

Example of a better way of printing strings in Python (right-justified), based on the War.py homework assignment:

a = 'ab'
print('{:>3}'.format(a))

will print ' ab' (right-justified to a width of 3).

Can also use an index into a list:

vals = ['2','3','4']
vals.index('3')

This prints 1 (the index of '3' in the list); could have used this instead of a dict for the card values.

Instead of while len(deck) > 0: it's more Pythonic to do while deck:

regex:

can also write print(re.match('\d{1,4}', '1234567890')) as print(re.search('^\d{1,4}', '1234567890'))

Mar.27

1:1 Meeting with Bonnie:

Following up from the LBNL presentation, there is scope, based on need, for an ontology biome term suggestion/annotation tool; people from KBase/JGI say so. Hence it would presumably be low-hanging fruit for me to develop one in conjunction with the creation of the EBI biome terms. Bonnie strongly suggests that kmer (information theory) based approaches are superior to protein clustering/ORF annotation. The workflow: we'd suggest a biome annotation term for new WGS metagenomic data. I could rework the setup idea from Tax-E in terms of the curation of starting biomes. Bonnie suggests I use Libra, or at least the cosine similarity of kmer frequencies. I could potentially have centroids of "canonical" metagenomes representing an ontology biome term, against which new metagenomes are computed all-vs-all, and assign each to the group with the greatest cosine similarity. Bonnie suggests we could perhaps have a first small proof-of-concept paper, then try to shoot for a second, larger, higher-impact paper using all of EBI to train, plus a bunch of metagenomes from NMDC, JGI, or whichever source we can access. Implement everything as a PM (or later KBase) workflow.

I need to figure out how I could use the cosine similarity of kmer frequencies to develop an ML model. Perhaps use a Mash index for the first step of the all-vs-all comparison, then use Libra on the result from Mash and suggest ontology terms. Start with just using Libra on a small number of datasets. New project idea for the summer. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons.
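As a starting point, a minimal sketch of the core computation in plain Python; Libra itself is a distributed implementation with several weighting schemes, so this only shows the idea, and all names here are mine:

import math
from collections import Counter

def kmer_counts(seq, k=20):
    # k-mer frequency profile of one read/contig/sample
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    # cosine similarity of two k-mer count vectors (the quantity Libra
    # computes at scale); 1.0 = identical profiles, 0.0 = nothing shared
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def suggest_biome(sample_profile, centroids):
    # assign a new metagenome to the "canonical" biome centroid profile
    # it is most cosine-similar to
    return max(centroids, key=lambda biome: cosine(sample_profile, centroids[biome]))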

Raskin scholarship

ODK and docker

To manage ontologies via ODK, see the ontology-development-kit repo. Clone it, then docker pull obolibrary/odkfull. On a Mac it just works, for example running the Makefile for PMO. This container includes ROBOT and OWLTools.

java

Downloading the latest Java SE from here didn't seem to work on its own to run the Java script in PMO.

Tried brew cask install java.

on my previous computer I had the following in my .bash_profile

#Java
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`

When I source the .bash_profile I get Unable to find any JVMs matching version "1.8".

Tried installing Java 8 from here: https://java.com/en/download/; it didn't fix the issue.

Trying the Java SDK (I think this is what solved the issue last time); now when I source the .bash_profile I don't get the error, but the Java script isn't working. Try restarting the computer?

The latter Java SDK link fixed the issue. The remaining error was different: I hadn't run the make prepare_release step, which creates the file my mini Java pipeline takes as input.

Mar.28

Solved the pandoc pdflatex issue

pdflatex not found. Please select a different --pdf-engine or install pdflatex pandoc

by running brew cask install basictex

Making changes to a pull request

I submitted a pull request but received reviews requesting changes. On GitHub it says Add more commits by pushing to the kaiiam-patch-1-1 branch on EnvironmentOntology/envo.

trying:

git checkout -b kaiiam-patch-1-1

git pull origin kaiiam-patch-1-1, then added the (autogenerated) merge commit message in Vim.

Modify the file(s) according to the comments.

git commit envo-idranges.owl -m 'Modifed envo-idranges.owl file as specified by @pbuttigieg in PR #734'

Tried git push, and it told me to push as:

git push --set-upstream origin kaiiam-patch-1-1

Back on the GitHub PR page, I clicked Re-request review.

Apr.01

jupyter notebook re.ipynb to open a Jupyter notebook.

Apr.03

Meeting with Alise:

Check with Chris/Elisha, plus perhaps someone from NCBI and/or EBI, to determine if the retrieval and annotation of biomes is truly in scope.

First pass at the project: build a test dataset from EBI with the biome terms. Use a kmer-based method like SIMCA to calculate distances, cluster the biomes by distance, and check if the clusters match the labels. Try to do this over the summer. I could also try to use Libra, or the distance derived therefrom, to get the distances. Drawing from Alise's work on her review investigating whether kmers correspond to beta diversity, her work suggests it maybe does, at least for Jaccard distance. We'd try to optimize a method from which we can retrieve the distances from a new sample to an existing matrix built from the test set. We'd filter the test set's kmers down to remove redundant kmers to make it run faster. A tool like Mash (I believe) only takes kmer profiles into account, not frequencies as Libra does. Hence, we'd need to test how both methods work to retrieve biome-metagenomes at various hierarchical depth levels. We'd want to explore whether it's possible to retrieve and cluster based only on kmer profiles instead of frequencies, as that would significantly decrease computation time. If we used long kmers (~20 bp), their profiles may function similarly to beta diversity using a Jaccard distance, which only deals with presence/absence data. Comparing Libra's cosine-similarity-based approach with a Jaccard-like kmer-profile-based approach makes a lot of sense thematically for my comprehensive exam, as a way to really leverage Dr. Jana U'Ren and Dr. Clay Morrison's expertise and draw them into a deeper comparison.
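For contrast with the frequency-weighted cosine approach, a minimal sketch of the presence/absence (Jaccard) view discussed above; the names are mine, and reads/contigs are assumed to be plain strings:

def kmer_set(seq, k=20):
    # presence/absence k-mer profile (no frequencies)
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    # Jaccard distance on k-mer sets: the cheap presence/absence
    # alternative to frequency-weighted cosine similarity
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

# with ~20 bp k-mers this behaves like beta diversity on presence/absence
d = jaccard_distance(kmer_set(seq_a), kmer_set(seq_b))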

A paper (or papers) could involve:

  • A final pipeline putting it all together, allowing either automated (lower biome-taxonomic resolution) legacy dataset annotation, or suggestions of biome-annotation terms to users, perhaps as a tool plugged into EBI.

  • Tests comparing kmer profile (only) vs kmer profile + frequency in order to cluster and retrieve from the test sets.

  • Determine if we can actually get signal out of the noise of sequencing type. Set up tests to determine the extent to which sequencing platform affects what is retrieved. For example, set up a test in which we choose a couple of biomes which have each been sequenced on two separate platforms, to see if we retrieve clusters based on biome or just on sequencing type.

  • Determine if kmer profile and/or frequency methods work for amplicon data as well. (I'm not certain it will be as clean, considering the large variation in amplicon types: 16S vs 18S, variable regions, etc. Would we need separate models for 16S and 18S?)

  • We could maybe also look into the related question of comparing biome classification based on kmers, 16S taxonomic annotations, and functional annotations. Do we get different results? Is one method faster? ... Or try some actual machine learning methods to cluster the biomes.

Pipeline:

  1. Users upload metagenome(s).

  2. Use PARTIE to separate amplicon and WGS data: 2a) if amplicon, proceed with the amplicon dataset; 2b) if WGS, proceed with the WGS dataset.

  3. Compute the kmer distance for the new sample.

  4. Compare the new distance to the precomputed SIMCA or Libra distance matrix and retrieve the closest matching cluster(s).

  5. Prompt the user with a suggested annotation term (or a list, if there are multiple from different clusters), displaying it in context within the ENVO biome hierarchy. The user could then accept the suggested term, or search the hierarchy starting at the suggested one to find an appropriate term. If no appropriate term exists, there could be a button linking to the ENVO issue tracker, helping to automate the creation of a new term request.

Apr.04

Jeff Oliver learn-r

Read the GNU make documentation to try to better understand all that can be done with make.

For Biosys hw 12-unclustered-proteins:

Parse line by line; use a regex with a capture group (could use re.search) to grab the >gi| IDs. Want a data structure to hold which proteins clustered: use a dict? Perhaps a set is better.

Apr.08

Amazon continuum metagenomes paper; perhaps Alise and I can do something with this.

Apr.09

1st talk AquaDiv

Makes use of Joyce to select ontologies; talk to the guy, in Germany. Talked to them and sent an email.

2nd talk EDMED

(Talk to her as well, a Spanish woman working in Ireland.)

Search env data from edmed.seadatanet.org.

They have their data catalogue mapping vocab,

using schema.org.

3rd talk Ontogeonous

An ontology for geosciences. Alizia Mantovani.

Makes use of SWEET and GeoSciML.

How do they map between their ontology and database?

Uses CGI vocabularies.

Geosciences knowledge management

GFZ Helmholtz Centre Potsdam.

Uses the Semantic Network Ontology (is this the same as the Semantic Sensor Network Ontology? Looks like it), as well as SWEET.

Semantic challenges facing AI applications in Earth geosciences

Stephan Richards, Minerva Intelligence in Vancouver, BC.

How to use human observations in AI applications.

Built on HILUCS from the INSPIRE vocabulary (which is maybe connected to the CGI vocabularies, per the questions?).

Ontology-based reasoning and AI

Also from Minerva Intelligence in Vancouver, BC.

Apr.10

Notes from Pier:

MMseqs2 repo to check out; maybe useful for clustering/retrieval of metagenomes (maybe).

He believes there is a rationale for the ontology term retrieval idea: it would help encourage people to annotate at deeper hierarchical levels and be less sloppy (epipelagic instead of marine biome). It's also good for people dealing with their center's or collaborators' data that they aren't sure about. We could maybe even turn it around and suggest ontology terms for things which don't cluster well; it could be a new type of marine biome.

Biomes: make a subclass of biome (for now), like microbial biome, with a relation to PCO:microbial community. Add some editor's notes about the semantics of biome and the issues therein, e.g. microbial adaptation vs just habitation. Have all the EBI biome classes be subclasses thereof.

Apr.17

Jellyfish, Kraken, Centrifuge. Matt thinks the kmer-based retrieval should work.

From chris:

Stuff he's up to today: trying to integrate the semantics from a bunch of systems for samples and specimens; Extensible standardized representation of samples in RDF and JSON-LD.

“Standards” for sample information

International Geo Sample Number IGSN

Experimental-OBO-Core

Apr.23

LDA

We had discussed using LDA to do topic modeling of genes from metagenomes. There are some good metagenomic datasets that have lots of accompanying environmental metadata, some of which exist along an ecological gradient. Typically, ecologists would do something like a principal coordinate analysis to see how the sites ordinate relative to one another based on their metadata.

I was thinking of trying to do both on the same data and comparing the relative distances between samples: comparing how the DNA-as-a-bag-of-genes LDA ordinates samples in topic space, as well as how the samples ordinate in ecological metadata space.

Resources: great video explaining LDA

Python implementation of LDA which I think I could get working relatively quickly; see the author's code here. From May 2018.

See this post about Topic Modeling with Gensim (Python) for a more sophisticated implementation, including topic visualization and the Mallet model for cleaner topics. Perhaps try this one after?

Here is another one that's even simpler; I think starting with this may be easiest.

Here is the original gensim site and homepage, as well as their GitHub page, which also has links to getting-started guides and tutorials which would be good to start with.

Here is an older one from 2016
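Following the gensim tutorials linked above, a minimal "DNA as a bag of genes" sketch; the gene-annotation tokens here are made up for illustration:

from gensim import corpora, models

# each metagenome as a bag of gene annotations (tokens are made up)
docs = [["nifH", "amoA", "nirS", "nifH"],
        ["psbA", "rbcL", "psbA", "nifH"],
        ["amoA", "nirS", "nosZ", "nosZ"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

# each sample's coordinates in topic space: the values to compare against
# the ordination of the same samples in environmental metadata space
for bow in corpus:
    print(lda.get_document_topics(bow))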

Apr.25

Malaspina project site and publication
