
summer_2019_log


Table of contents

May 2019

May.02, May.04, May.28, May.29, May.30, May.31,

June 2019

Jun.07, Jun.12, Jun.13, Jun.14, Jun.17, Jun.18, Jun.19, Jun.20, Jun.21, Jun.24, Jun.25, Jun.28,

July 2019

Jul.02, Jul.15, Jul.16, Jul.17, Jul.18, Jul.19,

August 2019

Aug.06, Aug.08, Aug.09, Aug.16, Aug.20, Aug.21,

May.02

From Alise Illyong's Biospectra writeup. Probably not useful for what I want to do; go with simka instead.

Alise's kmers and taxonomy review paper could be useful to understand the tools.

Multiple comparative metagenomics using multiset k-mer counting: Alise suggested I start with this. Does all-vs-all metagenomic comparison based on all kmers in all samples (only takes the first x reads from each genome). Designed to run on HPCs, vs Libra which required a Hadoop architecture. Can also check out the author's thesis for citations.

Basic EBI biome ontology term retrieval project outline:

  • Make "bins": collections of metagenomes by biome, using the EBI hierarchy and metagenomes.

  • Run all vs all comparison using simka -> outputs distance matrix.

  • In the distance matrix (Jaccard or Bray-Curtis), make clusters by biome, along biome taxonomic levels (the biome hierarchy). Could test in a leave-one-out cross-validation setup or with a designated training/testing split (bootstrapping). Could also do k-nearest neighbors (see the sketch after this list) -> calc stats: false positives, precision, recall, etc.
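A minimal sketch of that evaluation step, assuming Simka's semicolon-separated distance matrix and a hypothetical dataset_metadata.csv with a biome column (file and column names are placeholders):

# Leave-one-out k-NN over a distance matrix: for each sample, predict the
# majority biome label among its k nearest neighbors and score it against
# the sample's true label.
from collections import Counter

import pandas as pd

dist = pd.read_csv("mat_presenceAbsence_jaccard.csv", sep=";", index_col=0)
labels = pd.read_csv("dataset_metadata.csv", sep=";", index_col=0)["biome"]

k = 7
correct = 0
for sample in dist.index:
    # drop the sample itself, then take the k closest remaining samples
    neighbors = dist.loc[sample].drop(sample).nsmallest(k).index
    predicted = Counter(labels[n] for n in neighbors).most_common(1)[0][0]
    correct += predicted == labels[sample]

print(f"k={k} leave-one-out accuracy: {correct / len(dist.index):.3f}")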

First pass could manually get biome samples like with TaxE; for future iterations, could hit the EBI API to get all samples associated with a biome. Unfortunately, many biomes don't have metagenomes (just amplicon data).

Would be good to test how well it performs at different hierarchical levels: soil vs marine vs human etc., vs within the marine hierarchy like I did for TaxE. Also test how well it does when the samples come from different sequencing technologies: Illumina only vs a mix of technologies. Also test different ecological distance metrics: Jaccard vs Bray-Curtis. Maybe also test different clustering methods; retrieving from cluster centroids is the first way I can think of, but look into this, I'm sure this exists.

If these initial tests don't pan out, ditch the idea and move on. If all this works, then we'd consider writing our own version which selects only subsets of kmers (statistically determined to be relevant) to speed things up; the final version would have the precomputed kmer spectra, and for a new metagenome it would simply be a matter of scanning for those kmers and calculating a distance (sketch below).
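A rough sketch of what that final version could look like; the k-mer sets, file name, and selection of "relevant" k-mers are all hypothetical stand-ins:

# Scan a new metagenome for a precomputed set of discriminative k-mers only,
# then compute a Jaccard distance against each biome's k-mer spectrum.
import gzip

K = 21

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def scan(fastq_gz, keep):
    """Collect only the k-mers of interest from a gzipped FASTQ."""
    found = set()
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # FASTQ sequence lines
                found |= kmers(line.strip()) & keep
    return found

# stand-ins for precomputed, statistically selected spectra per biome
spectra = {"marine": {"ACGTACGTACGTACGTACGTA"}, "soil": {"TTTTAAAACCCCGGGGTTTTA"}}
keep = set().union(*spectra.values())

observed = scan("new_sample.fastq.gz", keep)
for biome, spectrum in spectra.items():
    jaccard = len(observed & spectrum) / len(observed | spectrum)
    print(biome, "distance:", 1 - jaccard)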

May.04

A call for standardized classification of metagenome projects, the paper which describes the original metagenomes ontology which both the EBI and JGI use for their metagenomes. This is effectively what I'm ontologizing (rather than just EBI). But I need to find an updated list of these terms, if it exists; the http://www.genomesonline.org/pdf/MetagenomesOntology.pdf link is down. I found this GOLD Ecosystems Tree View, however it takes forever to load.

Citation for JGI IMG: IMG/M: a data management and analysis system for metagenomes. Newer paper for GOLD: Genomes OnLine Database (GOLD) v.7: updates and new features.

Should also cite FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus, for the ch2 work; it's a similar thing that's been done. Probably also cite this paper.

Should cite this for ch1 for similar projects to PM http://metagenomesonline.org/

May.28

Have previously been working on the EBI/JGI GOLD microbiome terms, see my Tax-E page.

I was going through the JGI GOLD (see Biosample Ecosystem Classification and Ecosystem Tree); many of the publicly shown microbiome samples are publicly available in NCBI, and you can follow the links through the JGI pages to get to the SRX numbers. Hence I think it may actually be possible (perhaps a better way, like an API, also exists) to grab all (or most of) that JGI microbiome data in addition to the EBI data, especially once I've added JGI paths to the microbiome terms, as I have done with the EBI path URLs. Taken together, I could harness both the EBI and JGI metagenome data holdings in the kmer-based retrieval problem to predict/suggest a new sample's microbiome annotation term. Bonnie and I had previously discussed this; however, now that I have a better feel for the JGI data holdings, which are larger than those at EBI MGnify, I think we could really pull this off. It's leveraging the best of both worlds: Pier and Chris' de facto life science/omics data annotation ontology, and the Hurwitz lab's kmer analytics.

Need to look into using MGnify's API, as well as whether JGI has an API (otherwise, how to webcrawl through their pages to get SRX numbers), then use a tool (perhaps one Matt has, or NCBI's API) to pull the metagenomic data to an HPC. Use a tool like simka to make kmer profiles and classify them, suggest nearest neighbors, filter those through the ENVO microbiome hierarchy, and return/suggest the lowest common ancestor node. It would be cool to do this using the non-available or messy JGI data to help them sort it out and go for a higher-impact paper.

May.29

Link to our summer internship undergraduate student Shelby's github repo

Found all the existing PURLs to serve in axioms for the engineered microbiome classes. After I fill these out, the plan is to make the necessary PRs and finish off the environmental microbiome classes. Then I can do a first release of the EBI/JGI classes which includes all of the environmental classes that are at least in EBI. We'll see if Chris is provided with JGI URLs to link to; if so, I could add those as well.

Releasing this before my talk would be ideal, so I can point to it and say: look, the environmental ones are done. I'll need to come up with some good examples of environmental microbiome classes to use for my talk, which perhaps relate to the UNSDGIO and/or Essential Ocean Variables (EOVs). Pier and Chris like the plan of presenting the idea of (ENVO microbiome x terms) with axiom determined by (x env feature term) which maps to (x SWEET) terms.

May.30

link to google drive for EBI/JGI biomes project from chris

May.31

envo microbiome paper draft

Jun.07

Link to newly released Planet Microbe. PM Issues page from Matt to help with. First release of Planet Microbe Ontology. Updated the ontology to have definitions, so it could be useful when rendered on the site. I also made the HOT_Delong_metatranscriptomes datapackage

Jun.12

Bonnie and Elisha's AGU session; I feel like I should try to go.

Ontology tips from Chris Mungall, as well as a google drive about similar things.

Picking up on the SIMKA kmer based retrieval plan:

Tacc:

---------------------- Project balances for user kaiiam -----------------------
| Name           Avail SUs     Expires |                                      |
| iPlant-Collabs       6570  2020-03-31 |                                      |
------------------------- Disk quotas for user kaiiam -------------------------
| Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
| /home1              0.0      10.0     0.00           11      200000    0.01 |
| /work               0.0    1024.0     0.00            4     3000000    0.00 |
| /scratch            0.0       0.0     0.00           12           0    0.00 |
-------------------------------------------------------------------------------

/home1/06091/kaiiam, /scratch/06091/kaiiam, and /work/06091/kaiiam are my directories. Based on the above table it looks like the disk quota limits are the difference, so I presume I'll just do stuff in /work as it has the best specs.

Alise suggests putting most of my stuff in /work. You can put as much as you want in /scratch, but it is erased after a few weeks. Per Ken, usually install tools in /home and store data in /work.

Simka github: trying to install simka-v1.5.0-bin-Linux.tar.gz.

As per the "Install a binary release of simka" page, doing it in ~/software:

wget https://github.com/GATB/simka/releases/download/v1.5.0/simka-v1.5.0-bin-Linux.tar.gz
gunzip simka-v1.5.0-bin-Linux.tar.gz
tar -xf simka-v1.5.0-bin-Linux.tar
cd simka-v1.5.0-bin-Linux
chmod +x bin/* example/*.sh

Test it out:

cd example
./simple_test.sh

It works; partial output:

Output dir: ./simka_results/

Simka



*** Test: PASSED

Command used:
	../bin/simka -in ../example/simka_input.txt -out ./simka_results/ -out-tmp ./simka_temp_output

Command for visualizing results:
	python ../scripts/visualization/run-visualization.py -in ./simka_results/ -out ./simka_results/ -pca -heatmap -tree

Command for visualizing results with metadata annotations:
	python ../scripts/visualization/run-visualization.py -in ./simka_results/ -out ./simka_results/ -pca -heatmap -tree -metadata-in ../example/dataset_metadata.csv -metadata-variable VARIABLE_1

scp to get files from HPC to local computer to run the visualizations.

scp -r [email protected]:/home1/06091/kaiiam/software/simka-v1.5.0-bin-Linux/example/simka_results simka_results (have tacc code ready).

Try the authors' visualizations:

from ~/Desktop/software/ebi_biomes/paper2/example_test:

python ~/Desktop/software/simka/simka-v1.5.0-bin-Darwin/scripts/visualization/run-visualization.py -in ./simka_results/ -out ./simka_results_out/ -pca -heatmap -tree -metadata-in dataset_metadata.csv -metadata-variable VARIABLE_2

Got the heatmaps to work (they weren't working as the source code shipped) by adding repos = "http://cran.us.r-project.org" to the heatmap.r script in simka. I also added the metadata to the command, putting the dataset_metadata.csv file in the same directory (will need to do that when scp-ing from TACC).

if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE, repos = "http://cran.us.r-project.org")
  library(gplots)
}

On TACC, I'll be storing the data in /work/06091/kaiiam/stampede2/ch2/simka_microbiomes_1 for the first experiment using simka.

manually choose samples for test run:

TARA: Shotgun Sequencing of Tara Oceans DNA samples corresponding to size fractions for prokaryotes. ENA page: https://www.ebi.ac.uk/ena/data/view/PRJEB1787. Instead, use the metagenomes from MGnify I had previously cooked up for the Tax-E analysis, except instead of randomly selecting a small subset as we did before, I'll narrow them down by sorting the CSVs by MGnify ID to unique projects, and just take one sample (or a few, for dataset2) from each. This way I know they're all metagenomic WGS and from a variety of environments, and it's comparable to how it sort of worked when we classified by functionally predicted ORFs, so we can compare that to kmer space. See the pandas sketch below.
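A pandas sketch of that narrowing step (the file and column names are guesses at what's in the Tax-E CSVs):

# Keep one sample per MGnify project so each unique project contributes a
# single WGS metagenome to dataset1.
import pandas as pd

df = pd.read_csv("taxe_samples.csv")  # hypothetical export of the Tax-E table
one_per_project = df.sort_values("mgnify_id").groupby("project_id").head(1)
one_per_project.to_csv("dataset1_samples.csv", index=False)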

Doing it on tacc /work/06091/kaiiam/stampede2/ch2/simka_microbiomes_1 and my computer ~/Desktop/software/ebi_biomes/paper2/simka_microbiomes_1

Try to modify awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt from this stack overflow to split the dataset_metadata.csv into the ftp links and get them.

Jun.13

picking up from yesterday:

cut -d ";" -f 1,3 dataset_metadata.csv works

wget for EBI files is super slow; ftp is also slow. Trying aspera, based on this EBI page.

[login3@~/software]$ ./ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.sh

Installing IBM Aspera Connect

Deploying IBM Aspera Connect (/home1/06091/kaiiam/.aspera/connect) for the current user only.

Install complete.

from this page on aspera for ebi

try:

ascp -QT -l 300m -P33001 -i /home1/06091/kaiiam/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR977/ERR977412/ERR977412_1.fastq.gz .

Didn't work; try running it from ~/.aspera/connect/bin:

./ascp -P33001 -O33001 -QT -L- -l 1000M [email protected]:/vol1/fastq/ERR174/002/ERR1742652/ERR1742652.fastq.gz /work/06091/kaiiam/stampede2/ch2/simka_microbiomes_1/dataset1/

Needs user and pass. Alise recommends just using wget.

Manually parallelized via screen. Getting everything I wanted for dataset1 that way.

Tried it first with just 4 metagenomes saved to ~/Desktop/software/ebi_biomes/paper2/simka_microbiomes_0/simka_results

marine_oceanic_454:    SRR062154.fastq.gz
marine_deep_chl_max:    ERR1701760_1.fastq.gz
human_fecal1:    ERR1549323_1.fastq.gz
human_fecal2:    ERR1600436_1.fastq.gz

get those results back scp -r [email protected]:/work/06091/kaiiam/stampede2/ch2/simka_microbiomes_1/dataset1/simka_results ~/Desktop/software/ebi_biomes/paper2/simka_microbiomes_0/simka_results

From Alise:

use the presence-absence Jaccard (which is the original one). Another good metric is the abundance Bray-curtis.

minion pore tools

export PATH=$PATH:/home1/06091/kaiiam/.local/bin, but in the end it didn't work and the MinION nanopore format is too weird. Will probably not use; stick to Illumina, 454, and Ion Torrent.

got kicked off because I was on the head node

see stampede 2 userguide

do it (more) properly with

sbatch -p normal -n 28 -N 8 -t 48 ./dataset1.sh

-p queue (normal), -n total core count, -N total node count, -t time limit (a bare number is minutes, not hours; use HH:MM:SS for hours. This likely explains the time-limit cancellation below.)

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/06091/kaiiam)...OK
--> Verifying availability of your work dir (/work/06091/kaiiam/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/06091/kaiiam)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (iPlant-Collabs)...OK
Submitted batch job 3780672

check jobs via squeue -u kaiiam

Jun.14

Launched dataset2 job.

Semi-pipeline to run from the ~/Desktop/software/ebi_biomes/paper2/simka_microbiomes_X folder:

scp -r [email protected]:/work/06091/kaiiam/stampede2/ch2/simka_microbiomes_1/dataset1/simka_results ./simka_results
python ~/Desktop/software/simka/simka-v1.5.0-bin-Darwin/scripts/visualization/run-visualization.py -in ./simka_results/ -out ./simka_results_out/ -pca -heatmap -tree 

Note: I removed "-metadata-in dataset_metadata.csv -metadata-variable microbiome_label" from the above as it wasn't working. The heatmap for abundance_braycurtis also isn't working; perhaps negative values?

Command to extract the useful results from the simka_results_out and simka_results dirs:

mkdir simka_results_out/use
cp simka_results_out/hclust_presenceAbsence_jaccard.png simka_results_out/use
cp simka_results_out/heatmap_presenceAbsence_jaccard.png simka_results_out/use
cp simka_results_out/pca_presenceAbsence_jaccard.png simka_results_out/use
cp simka_results_out/hclust_abundance_braycurtis.png simka_results_out/use
cp simka_results_out/heatmap_abundance_braycurtis.png simka_results_out/use
cp simka_results_out/pca_abundance_braycurtis.png simka_results_out/use
mkdir csv
cp simka_results/mat_presenceAbsence_jaccard.csv.gz csv
cp simka_results/mat_abundance_braycurtis.csv.gz csv
gunzip csv/*.csv.gz

Simka job canceled due to time limit; try:

sbatch -p normal -n 28 -N 8 -t 24:00:00 ./dataset2.sh

Jun.17

Ran sbatch -p normal -n 28 -N 8 -t 24:00:00 ./dataset3.sh for dataset3, with 73 marine and human samples: marine samples from dataset2, and human samples from respiratory system, skin, blood, oral, and fecal.

Prototype SPARQL query for the talk. Finds any ENVO class which links to SWEETRealm:Soil, or is a subclass of one that does. Run from http://www.ontobee.org/sparql

PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> 

SELECT DISTINCT ?s ?label
FROM <http://purl.obolibrary.org/obo/merged/ENVO>

WHERE
{
?s a owl:Class .

?s rdfs:subClassOf*/<http://www.geneontology.org/formats/oboInOwl#hasDbXref> "SWEETRealm:Soil"^^<http://www.w3.org/2001/XMLSchema#string> .
?s rdfs:label ?label .

}

This query gets any ?s term which is determined by a ?x term.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> 
PREFIX obo-term: <http://purl.obolibrary.org/obo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?s ?x

FROM <http://purl.obolibrary.org/obo/merged/ENVO>

WHERE
{
?s a owl:Class .
?s owl:equivalentClass/owl:intersectionOf/rdf:rest/rdf:first/owl:onProperty obo-term:RO_0002507 ; 
   owl:equivalentClass/owl:intersectionOf/rdf:rest/rdf:first/owl:someValuesFrom ?x .

?x rdfs:label  ?label.
}

Put them together to get any ?s ENVO term which is determined by a ?x term that links to SWEETRealm:Soil (or is a subclass of a class that does), and, if the ?s term has a ?y inSubset link (on itself or a superclass), return that. Once I push the microbiome terms, I should be able to run this replacing inSubset with hasDbXref to retrieve the JGI/MGnify links to genomic data which correspond to the input SWEET term.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> 
PREFIX obo-term: <http://purl.obolibrary.org/obo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?s ?x ?y

FROM <http://purl.obolibrary.org/obo/merged/ENVO>

WHERE
{
?s a owl:Class .
?s owl:equivalentClass/owl:intersectionOf/rdf:rest/rdf:first/owl:onProperty obo-term:RO_0002507 ; 
   owl:equivalentClass/owl:intersectionOf/rdf:rest/rdf:first/owl:someValuesFrom ?x .

?x rdfs:subClassOf*/<http://www.geneontology.org/formats/oboInOwl#hasDbXref> "SWEETRealm:Soil"^^<http://www.w3.org/2001/XMLSchema#string> .
?x rdfs:label ?label .

?s rdfs:subClassOf*/<http://www.geneontology.org/formats/oboInOwl#inSubset> ?y .
}

Jun.18

Best results from simka run 2 (dataset3 is still computing; it went over 24 hours and is now in the queue to restart):

With method = presenceAbsence_jaccard, k = 5 neighbors, the accuracy is 0.627906976744186.
With method = abundance_braycurtis, k = 7 neighbors, the accuracy is 0.6511627906976745.

For the latter, we get a visual of how well it's working by printing out each sample's true label along with the labels of its top matches:

coastal: 'coastal': 3, 'oceanic': 2, 'intertidal': 1, 'hydrothermal_vent': 1
coastal: 'intertidal': 4, 'coastal': 1, 'hydrothermal_vent': 1, 'oceanic': 1
coastal: 'intertidal': 3, 'coastal': 2, 'oceanic': 1, 'hydrothermal_vent': 1
coastal: 'coastal': 4, 'intertidal': 3
coastal: 'intertidal': 3, 'coastal': 2, 'oceanic': 1, 'hydrothermal_vent': 1
coastal: 'intertidal': 4, 'hydrothermal_vent': 1, 'oceanic': 1, 'coastal': 1
coastal: 'oceanic': 4, 'coastal': 1, 'intertidal': 1, 'hydrothermal_vent': 1
coastal: 'oceanic': 6, 'hydrothermal_vent': 1
hydrothermal_vent: 'hydrothermal_vent': 3, 'oceanic': 3, 'intertidal': 1
hydrothermal_vent: 'hydrothermal_vent': 4, 'oceanic': 1, 'coastal': 1, 'intertidal': 1
hydrothermal_vent: 'hydrothermal_vent': 4, 'oceanic': 2, 'intertidal': 1
hydrothermal_vent: 'hydrothermal_vent': 5, 'coastal': 1, 'oceanic': 1
hydrothermal_vent: 'oceanic': 2, 'coastal': 2, 'intertidal': 2, 'hydrothermal_vent': 1
hydrothermal_vent: 'oceanic': 4, 'hydrothermal_vent': 2, 'coastal': 1
hydrothermal_vent: 'hydrothermal_vent': 4, 'oceanic': 2, 'coastal': 1
intertidal: 'intertidal': 4, 'coastal': 2, 'hydrothermal_vent': 1
intertidal: 'intertidal': 5, 'coastal': 1, 'oceanic': 1
intertidal: 'intertidal': 4, 'coastal': 1, 'oceanic': 1, 'hydrothermal_vent': 1
intertidal: 'intertidal': 6, 'oceanic': 1
intertidal: 'intertidal': 3, 'oceanic': 2, 'hydrothermal_vent': 2
intertidal: 'oceanic': 3, 'coastal': 2, 'intertidal': 2
intertidal: 'intertidal': 4, 'coastal': 3
intertidal: 'intertidal': 3, 'oceanic': 2, 'hydrothermal_vent': 1, 'coastal': 1
intertidal: 'intertidal': 5, 'coastal': 1, 'oceanic': 1
intertidal: 'intertidal': 2, 'oceanic': 2, 'hydrothermal_vent': 2, 'coastal': 1
intertidal: 'coastal': 3, 'oceanic': 2, 'intertidal': 1, 'hydrothermal_vent': 1
intertidal: 'oceanic': 5, 'intertidal': 1, 'coastal': 1
intertidal: 'intertidal': 4, 'coastal': 3
oceanic: 'oceanic': 6, 'hydrothermal_vent': 1
oceanic: 'oceanic': 6, 'coastal': 1
oceanic: 'oceanic': 5, 'coastal': 1, 'intertidal': 1
oceanic: 'oceanic': 3, 'intertidal': 2, 'hydrothermal_vent': 1, 'coastal': 1
oceanic: 'oceanic': 5, 'coastal': 1, 'intertidal': 1
oceanic: 'oceanic': 5, 'coastal': 1, 'hydrothermal_vent': 1
oceanic: 'oceanic': 5, 'coastal': 1, 'hydrothermal_vent': 1
oceanic: 'oceanic': 5, 'coastal': 1, 'hydrothermal_vent': 1
oceanic: 'oceanic': 6, 'coastal': 1
oceanic: 'oceanic': 4, 'intertidal': 1, 'coastal': 1, 'hydrothermal_vent': 1
oceanic: 'hydrothermal_vent': 2, 'oceanic': 2, 'intertidal': 2, 'coastal': 1
oceanic: 'oceanic': 7
oceanic: 'intertidal': 2, 'oceanic': 2, 'coastal': 2, 'hydrothermal_vent': 1
oceanic: 'oceanic': 5, 'hydrothermal_vent': 1, 'coastal': 1
oceanic: 'oceanic': 5, 'hydrothermal_vent': 1, 'coastal': 1
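The printout above can be produced with a small variation of the earlier k-NN sketch: instead of scoring, print each sample's true label with a Counter of its neighbors' labels (same assumed inputs as before):

# For each sample, show its true label and the label counts among its k
# nearest neighbors, as in the listing above.
from collections import Counter

import pandas as pd

dist = pd.read_csv("mat_abundance_braycurtis.csv", sep=";", index_col=0)
labels = pd.read_csv("dataset_metadata.csv", sep=";", index_col=0)["biome"]

k = 7
for sample in dist.index:
    neighbors = dist.loc[sample].drop(sample).nsmallest(k).index
    counts = Counter(labels[n] for n in neighbors)
    print(f"{labels[sample]}: {dict(counts.most_common())}")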

Jun.19

Thesis ideas: chapter 1 PM, with at least my role as the ontology paper if not another one later; ch2 microbiomes: paper 1 the EBI/JGI GOLD microbiome ENVO paper, paper 2 the kmer-based retrieval. Chapter 3: actually trying to compare omics/env metadata from different projects leveraging PM datasets. The obvious first step is a big correlational analysis of quantitative metadata with gene or taxon abundance (run through all parameters in PM against all genes/taxa), similar to the Ed DeLong paper I pulled data from. Finally, try to do something with LDA: see if ordination of a sample's genome within an LDA space and the metadata in a PCoA space yields anything new...

Notes from meeting with Bonnie:

PhD proposal due 2 weeks before comps; the proposal can be the written exam.

Draft: 2 weeks; revisions: 3 weeks; then 3 weeks for the oral exam. Part 1 is the proposal, part 2 is questions on biology and cyberinfrastructure (am I competent?).

Need to define dates for this (Bonnie will make a timeline on basecamp).

Bonnie is not allowed to do line-by-line editing.

Format like an NSF proposal (get Bonnie to send me some examples):

  • executive summary

  • background lit review

  • components

Bonnie is also writing a data piracy paper which I could get a chance to be on.

earthcube wants registry of data and tools all FAIR

community engagement

Bonnie wants 4 Planet Microbe papers:

  • 0th: maybe an announcement paper, with Alise 1st.

  • 1st paper: me 1st, Alise 2nd; the ontology and datapackage paper (like what I have), a model for the community; my 3.1 Planet Microbe an experimental cyberinfrastructure system paper.

  • 2nd paper: Matt, Ken, Alise, me; the cyberinfrastructure development behind PM: why Postgres, front-end specifics about building the cyberinfrastructure, the 4d search architecture.

  • 3rd paper: community paper (but not). My second paper about kmers making refined ontology terms available, layered in with PM. An enabling-ontologies-for-the-everyday-person paper; maybe 2 papers, one on the algorithms. Convinced Bonnie otherwise; this just ends up being my 4.2 Kmer based metagenomic retrieval for term annotation suggestion paper.

  • 4th paper: consolidation paper, Alise or Bonnie first. Developed components; cite it all in a review package, complete vision.

Jun.20

ISA tools

The open-source ISA framework and tools help to manage an increasingly diverse set of life science, environmental, and biomedical experiments that employ one or a combination of technologies.

JSON and RDF; sounds similar to what we're doing with frictionless. Should discuss it with Matt and Alise.

I'll need to check out the ESIP summer meeting schedule before I go.

Playing with MASH:

releases documentation

Better to sketch first, then calc distance. -r for raw reads; -m 2 filters out unique kmers; -b uses a bloom filter to filter out most unique kmers in constant memory; -c stops read sketching once the estimated average coverage is reached; -s for sketch size (examples for RefSeq seem to use s=1000); -k for kmer size (default is 21). As in any k-mer based method, larger k-mers will provide more specificity, while smaller k-mers will provide more sensitivity. Larger genomes will also require larger k-mers to avoid k-mers that are shared by chance. mash paste to cat together sketch files of the same k and sketch size.

proto mash pipeline:

mash sketch -s 1000 -r -b -k 21 -c -o reference *.fna (not working with the .fastq.gz)

mash triangle reference.msh > out.txt

try bash for loop based on for x in *.fastq.gz; do echo $x ; done

for x in *.fastq.gz; do mash sketch -s 1000 -r -m 2 -k 21 $x ; done

mash triangle *.msh

Had to up the sketch size to -s 100000 to get differences.

-s 100000 -r -m 2 -k 21 gives:

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.429909
ERR1056317_1.fastq.gz	0.462914	0.295039
Max p-value: 2.71832e-08

-s 100000 -r -m 2 -k 31 gives:

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.286256
ERR1056317_1.fastq.gz	0.349025	0.234348
Max p-value: 1.94274e-08

-s 1000000 -r -m 2 -k 31 gives: a much better p value

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.303511
ERR1056317_1.fastq.gz	0.343144	0.235473
Max p-value: 5.71557e-90

-s 100000 -r -k 31 -b 2M with bloom filters of 2 (megabytes) gives:

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.326666
ERR1056317_1.fastq.gz	0.349025	0.22771
Max p-value: 6.86583e-09

-s 100000 -r -k 31 -b 20M. increasing the bloom filter size seemed to make it a little worse?

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.349025
ERR1056317_1.fastq.gz	0.349025	0.226969
Max p-value: 6.50473e-09

-s 100000 -r -k 31 -b 200K reduce to 200 kilobytes improves p values.

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.297109
ERR1056317_1.fastq.gz	0.349025	0.235283
Max p-value: 9.42697e-09

-s 100000 -r -k 31 -b 20K reduce to 20K and it got worse

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.286256
ERR1056317_1.fastq.gz	0.349025	0.236245
Max p-value: 1.71885e-08

-s 1000000 -r -k 31 -b 200K

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.308615
ERR1056317_1.fastq.gz	0.352424	0.230014
Max p-value: 3.0545e-69

-s 1000000 -r -k 27 -b 200K reducing k to 27 improves p value a lot.

ERR010488.fastq.gz
ERR1055664.fastq.gz	0.34939
ERR1056317_1.fastq.gz	0.388271	0.249148
Max p-value: 1.78104e-76

-k 26 -> Max p-value: 5.00889e-63 -k 28 -> Max p-value: 4.83173e-72

Based on these mock trial runs (no idea how representative this is), I've selected some mash parameters to try:

for x in *.fastq.gz; do mash sketch -s 1000000 -r -k 27 -b 200K $x ; done

mash triangle *.msh

Jun.21

mash triangle *.msh > dist_tri.csv

This is insane: running mash with -s 1000000 -r -k 27 -b 200K on the marine set gives:

Using mash distances, k = 7 neighbors, the accuracy is 0.6976744186046512. This is higher than the predictions with simka, which maxed out at 0.65!!!!

Retrieval results

Results of analysis comparing MASH and SIMKA using marine and or human samples. MASH run using -s 1000000 -r -k 27 -b 200K and SIMKA with defaults.

Dataset             SIMKA (Bray-Curtis)   SIMKA (Jaccard)   MASH (Jaccard est.)   Number of Neighbors
marine only         0.674                 0.674             0.651                 1
marine only         0.674                 0.674             0.605                 3
marine only         0.651                 0.605             0.698                 7
marine only         0.581                 0.535             0.465                 14
marine only         0.349                 0.349             0.349                 43 (all)
human only          0.387                 0.419             0.419                 1
human only          0.323                 0.323             0.323                 3
human only          0.419                 0.452             0.387                 7
human only          0.516                 0.516             0.516                 14
human only          0.516                 0.516             0.516                 31 (all)
marine and human    0.5                   0.514             0.527                 1
marine and human    0.446                 0.405             0.419                 3
marine and human    0.446                 0.459             0.446                 7
marine and human    0.486                 0.446             0.432                 14
marine and human    0.486                 0.392             0.432                 34
marine and human    0.0                   0.0               0.0                   74
marine v.s. human   0.865                 0.865             0.878                 1
marine v.s. human   0.838                 0.919             0.905                 3
marine v.s. human   0.838                 0.878             0.878                 7
marine v.s. human   0.797                 0.851             0.851                 14
marine v.s. human   0.770                 0.784             0.865                 34
marine v.s. human   0.581                 0.581             0.581                 74 (all)

Should run mash as I did for simka:

for x in *.fast*; do mash sketch -s 1000000 -r -k 21 -m 2 $x ; done

Table again with MASH at k = 21 to be comparable:

Dataset             SIMKA (Bray-Curtis)   SIMKA (Jaccard)   MASH (Jaccard est.)   Number of Neighbors
marine only         0.674                 0.674             x                     1
marine only         0.674                 0.674             x                     3
marine only         0.651                 0.605             x                     7
human only          0.387                 0.419             x                     1
human only          0.323                 0.323             x                     3
human only          0.419                 0.452             x                     7
marine and human    0.5                   0.514             x                     1
marine and human    0.446                 0.405             x                     3
marine and human    0.446                 0.459             x                     7
marine v.s. human   0.865                 0.865             x                     1
marine v.s. human   0.838                 0.919             x                     3
marine v.s. human   0.838                 0.878             x                     7

*Finish later when I'm back on TACC.

Jun.24

Doing the datapackage for GOS; I couldn't find details on their protocols anywhere, for example what they used to measure ammonia. I was also worried that the column is named ammonia, not ammonium as in the other cases. I looked around and didn't find protocols, but found these 2 papers:

Spectrophotometric determination of ammonia nitrogen in water by flow injection analysis based on the NH3-o-phthalaldehyde-Na2SO3 reaction, and this paper, Determination of nanomolar levels of nutrients in seawater, on common methods for nutrient quantification from seawater, which describe what's commonly done.

from the first one:

Ammonia nitrogen is one of the major nitrogen forms in the nitrogen cycle, especially in natural waters [1]. Ammonia nitrogen consists of ammonia (NH3) and ammonium (NH4+) in natural waters. Ammonium is predominant when the pH is below 8.75, and ammonia is predominant when the pH is above 9.75

And the common methods for quantitative analysis of ammonia nitrogen in natural water are the indophenol blue (IPB) spectrophotometric methods and the o-phthalaldehyde (OPA) fluorometric methods. So my understanding is it would be a measurement of both.

Also check out this paper on the Marine microbial biodiversity, bioinformatics and biotechnology (M2B3) data reporting and service standards which is what OSD uses for their metadata reporting. Perhaps to cite for PM.

Jun.25

In order to use gensim to do LDA, it expects document input (as it's normally done on text docs). A "better" way to import gene annotation frequency data would be to "hack" the source code for gensim.corpora.dictionary to write a function similar to doc2bow which takes a dict as input, directly making the dictionary from the dict of MGnify term frequencies.

This would require a little hacking to get correct, so instead do it using this gensim tutorial or this one, where I first print out the gene annotations to docs (this is super lame but works with how gensim is set up); see the sketch below. Later, consult the gensim tfidf model.
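A minimal sketch of that lame-but-working route, with made-up InterPro frequencies standing in for the MGnify annotations:

# Expand each {term: count} dict into a token list so gensim's standard
# Dictionary/doc2bow path works without hacking gensim.corpora.dictionary.
from gensim import corpora, models

samples = [  # hypothetical per-sample annotation frequencies
    {"IPR016040": 12, "IPR013785": 7, "IPR003594": 1},
    {"IPR016040": 3, "IPR003594": 9, "IPR013087": 4},
]

docs = [[term for term, n in s.items() for _ in range(n)] for s in samples]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)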

Jun.28

Chris made a project for the ENVO microbiome work https://github.com/orgs/EnvironmentOntology/projects/1

Finally got the Amazon continuum's metagenomes LDA running yesterday. Running it as a for loop over the number of topics, but it's taking forever to run on my laptop: for 12 topics it took 10 hours 47 minutes 47 seconds,

and the topics were:

0	4.16667	IPR016040 IPR017849 IPR011990 IPR013785 IPR003594 IPR020683 IPR015943 IPR016038 IPR015421 IPR000917 IPR011991 IPR014729 IPR018247 IPR013781 IPR011992 IPR012336 IPR017442 IPR002110 IPR013783 IPR019781
1	4.16667	IPR016040 IPR013785 IPR014729 IPR015421 IPR017849 IPR012340 IPR000917 IPR015422 IPR012336 IPR013816 IPR011990 IPR013221 IPR001509 IPR002300 IPR000795 IPR003439 IPR024084 IPR003594 IPR007120 IPR000788
2	4.16667	IPR016040 IPR013785 IPR014729 IPR000484 IPR015421 IPR011991 IPR001280 IPR015422 IPR012340 IPR012336 IPR016038 IPR012128 IPR017849 IPR013816 IPR013781 IPR014710 IPR012338 IPR000795 IPR003594 IPR000873
3	4.16667	IPR016040 IPR003439 IPR002198 IPR000531 IPR002347 IPR013785 IPR000873 IPR000795 IPR011991 IPR006076 IPR000515 IPR012910 IPR001036 IPR015590 IPR001030 IPR012336 IPR004839 IPR014729 IPR000620 IPR008274
4	4.16667	IPR016040 IPR013785 IPR015421 IPR014729 IPR015422 IPR015590 IPR024041 IPR011991 IPR013816 IPR012340 IPR016162 IPR008274 IPR024084 IPR000873 IPR012336 IPR003594 IPR015813 IPR017849 IPR001030 IPR016038
5	4.16667	IPR016040 IPR013785 IPR014729 IPR015421 IPR015422 IPR012340 IPR012336 IPR003439 IPR011990 IPR013221 IPR001509 IPR011991 IPR003594 IPR013816 IPR001986 IPR002300 IPR001280 IPR000795 IPR012338 IPR024084
6	4.16667	IPR007087 IPR013087 IPR011889 IPR017442 IPR005046 IPR020683 IPR001280 IPR003409 IPR011992 IPR011990 IPR002110 IPR015943 IPR019781 IPR001752 IPR018247 IPR013783 IPR016040 IPR000477 IPR002885 IPR018357
7	4.16667	IPR000484 IPR001343 IPR018511 IPR019636 IPR012340 IPR000795 IPR001280 IPR000994 IPR007080 IPR022666 IPR013126 IPR000883 IPR002423 IPR012677 IPR016040 IPR003610 IPR020818 IPR005532 IPR013025 IPR014722
8	4.16667	IPR016040 IPR013785 IPR014729 IPR015421 IPR012340 IPR015422 IPR012336 IPR003594 IPR013816 IPR011991 IPR013221 IPR016038 IPR024084 IPR012338 IPR000531 IPR016162 IPR015590 IPR001986 IPR000994 IPR002300
9	4.16667	IPR003514 IPR006777 IPR004196 IPR003515 IPR003513 IPR016407 IPR011889 IPR005046 IPR020962 IPR007087 IPR006815 IPR013103 IPR013087 IPR016040 IPR000788 IPR013785 IPR000477 IPR018247 IPR014729 IPR001584
10	4.16667	IPR013103 IPR001584 IPR000477 IPR013084 IPR001878 IPR001280 IPR001969 IPR011989 IPR000484 IPR013762 IPR003034 IPR017442 IPR011992 IPR016040 IPR019781 IPR008543 IPR015943 IPR018247 IPR002559 IPR020683
11	4.16667	IPR016040 IPR013785 IPR014729 IPR015421 IPR000484 IPR012340 IPR000531 IPR012336 IPR011990 IPR000788 IPR013816 IPR015422 IPR003594 IPR001509 IPR012910 IPR000795 IPR015590 IPR010823 IPR002300 IPR000994

I've parsed this to get the InterPro names; the work is in amazon-continuation/lda/test3/12_topics_24genomes_model/output_topics

Thinking more about the problem: LDA is meant to work on a collection of many coherent (intact) full documents, articles, text snippets which are from one source. A metagenome is separated into many reads, each of which is more akin to a document "snippet". Perhaps it would make more sense to run the LDA where each read is considered a document; then find topics based on reads, and in the end map the distribution of read topics back to their original metagenomes, i.e. metagenome 1 has 12% reads of topic 1, 2% reads of topic 2; metagenome 2 has 5% reads of topic 1, 2% reads of topic 2; etc., and compare that way (see the sketch below). My semi-end goal is to see if I can ordinate the genomes in LDA topic space, match the ordination in PCoA space, and see which topics correlate with the ecological parameters. Maybe that's too much, but at least find which reads are from what "topics" and then try to make sense of those topics.
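A toy sketch of that mapping-back step (the per-read topic assignments are made up):

# Aggregate per-read topic assignments into per-metagenome topic proportions,
# e.g. "metagenome 1 has 12% reads of topic 1".
from collections import Counter

read_topics = {  # hypothetical: one inferred topic per read
    "metagenome_1": [0, 0, 1, 2, 0, 1, 0],
    "metagenome_2": [1, 1, 2, 2, 2],
}

for mg, topics in read_topics.items():
    counts = Counter(topics)
    profile = {t: round(n / len(topics), 2) for t, n in sorted(counts.items())}
    print(mg, profile)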

Looking through the first Amazon continuum InterPro annotation file, it looks like most reads have at most two InterPro annotations. Hence if I try it this way I'll have a ton of documents with 0, 1, or 2 terms. Is that useful or meaningful in an LDA? I'm not sure. I could try it... just to see. Would it be silly to bin or assemble reads into short contigs, then annotate, then try LDA?

Based on this post, LDA doesn't work well with short texts like tweets, as shown in this paper or here. The Biterm Topic Model was suggested, which may be better for inferring topics from reads: the model infers over the whole corpus and assumes each "biterm" is drawn from a topic; giving context to each term allows for easier topic inference than from a single word as in LDA.

Also check out this paper

Perhaps I can try the https://github.com/xiaohuiyan/BTM Biterm model on the reads. Trying it, but it looks like there would basically be no reads which have more than one InterPro annotation; kind of makes sense, since the Amazon continuum metagenomes were sequenced via Illumina with short reads. Could try something with longer reads, such as this sample from GOS. Looking through the GOS sample, there don't appear to be any cases where there are multiple ORFs per read. This makes sense: complete proteins should be longer than 250bp Illumina reads or even longer 454 reads. Should have thought of this before.

Recap of the directories in amazon-continuation/lda:

test1: old, just trying to get LDA running.

test2: working LDA on text docs, small example.

test2.1: working LDA on text docs with the Khashoggi / World Cup examples.

test3: good copy; has the scripts to unpack all the Amazon continuum InterPro TSVs, parse them into text files, then run a single LDA or multiple LDAs in a for loop to find an optimal number of topics based on coherence scores plateauing. Also prints some TSVs about the model, and finally a script to get the InterPro labels from the output model and the text to make sense of them. I was running it but it was taking too long; the results for 12 topics are in amazon-continuation/lda/test3/12_topics_24genomes_model/output_topics. Come back to this and try running it on TACC.

test4: trying to do it using the Biterm Topic Model on reads; had it sort of working, but then I realized there would only ever be one annotation per read... :( I thought maybe 2, so we could use the BTM model, but it's unlikely.

Jul.02

prep for ESIP talk

Spreadsheet columns: ID, label, definition, EBI Biomes, EBI Biomes 2, JGI Biomes, environmental feature, environmental material, anatomical entity, environmental system, environmental system 2, quality, biome (subclass of)

label                       environmental feature   material entity   subclass of
'air-associated ecosystem'  -                       'air'             'ecosystem'
'glacial ecosystem'         'glacial feature'       'glacier'         'ecosystem'

Jul.15

ESIP day 1

Centre de Recherche Informatique de Montréal

It would be interesting to do something like image recognition, but with kmers based on their ontology annotations; perhaps PM could be a framework for this.

It could be cool to do basic NLP in PM. Pier mentioned there is something for stemming terms based on ENVO (back in my masters work; I should ask him or try to find it from my lab rotation work), but we could allow people to type something like:

What data do you have from rivers?

could use Stemmer-Support in WEKA or any off the shelf python NLP package.

open data cube

Data analytics for canadian climate services starts soon.

DASK to scale python jobs; mentioned again in the ESRI talk, it distributes pandas jobs over larger data.

Microsoft AI for Earth: could maybe apply for a grant from them? Tell Bonnie about this one. Next call in October.

Integrated Marine Observing System

SSN extension for observation collection, at some point will need to make sure BCO is aligned with this and used properly for the PMO classes I need to push.

Amazon Alexa uses Amazon Neptune, which uses a graph/RDF database to answer questions like: "Who is the president of France?".

Going through the ESRI GIS presentation: using a python notebook to add existing GIS layers to their map. Perhaps we could do something similar to add the Longhurst province shape files (etc.) to our PM map and auto-assign extra metadata. Perhaps also the NOAA/NASA biogeochemical parameters, such as the extra metadata from OSD.

Empirical Bayesian kriging: inferring a pattern from geospatial data. The ESRI demo did this with oxygen and depth interpolated in 3D space. Could we do this with taxonomy and function, like I'd mentioned based on the ecosystems genomics talk from last fall? Link to ESRI kriging here. I wonder if we have enough spatiotemporal variation to do this kind of thing. Perhaps an app to do fun stuff with the EBI functional and taxonomic pipeline, if we can really quantify genes/taxa across datasets? Perhaps this along with the GO terms searching as my last paper?

If we could import something like water masses as GIS or (something similar) in 4d perhaps we could try to find all genomic data in a given water mass, and see if that has an effect (as we'd expect it to) on the metagenomic and physiochemical data.

Permutation-based entropy and minimum spanning trees: observing entropy for a neighborhood of 2 vars, whether dependent or independent of each other. Can determine if there is a relationship between two vars using entropy, giving significance with p-values. Would be cool to do this with the genes, taxa, and functional data somehow. Done in the context of a neighborhood for a feature. Their example was air pollution and median household income: can they predict particulate organic matter based on median income? Meant to check what relationships could exist in various datasets. Most of the city isn't significant, but some regions are; in the foothills of LA there is a correlation between income and less pollution. Local bivariate relationships. Could I do this with the functional, taxonomic, and physio data from PM?

Spatial clustering:

Density-based clustering: dbscan or hdbscan (hierarchical, no search distance). Determines whether observations in the data are in a cluster or are sparse noise. You set how many features per cluster and the minimum search distance to cluster; search distance vs core distance to make clusters. For feature selection, I think. With hdbscan, the distance distribution of the data tells you what a cluster is.
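A quick scikit-learn sketch of the cluster-vs-noise idea from these notes, on synthetic points (eps and min_samples play the search-distance and min-points roles):

# DBSCAN labels each point as a cluster member or as noise (-1), given a
# search distance (eps) and a minimum number of points per dense region.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(0, 0.3, (50, 2)),   # dense blob 1
    rng.normal(5, 0.3, (50, 2)),   # dense blob 2
    rng.uniform(-2, 7, (10, 2)),   # sparse noise
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(pts)
print("clusters found:", set(labels) - {-1})
print("noise points:", int((labels == -1).sum()))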

OPTICS: peaks in feature space to find clusters. Social media check-in data (timestamps of when people check in): label observations as part of a cluster; finds the malls in suburban San Diego. With HDBSCAN they find clusters for different buildings.

multivariate clustering

Clusters of features in value space, spatially constrained: rates of uninsured people, rates of ethnic groups, etc. K-means. Spatially constrained multivariate clustering uses a minimum spanning tree.

Geographically weighted regression uses ordinary least squares of features on GIS spatial data: the relationship between hospital readmittance rates and medicare spending. Exploratory regression explores different candidate explanatory factors for their effect on a given factor. It found vars like dehydration, distance to Houston, and number of beds in the hospital as the best factors, and also optimizes neighborhood size. Check if the residuals have spatial clustering. You get a global R2, but also local R2 values for local variance clusters. Can see where dehydration rates better explained total medicare costs.

Forest-based classification and regression uses random forest with geospatial data. Example: presence of a specific plant in an Oregon forest, with the goal of predicting over a larger study area. Passed parameters like elevation, canopy, slope characteristics, etc. as explanatory vars against presence/absence of the species. Gives top variable importances. Could we do this with presence/absence of genes or taxa against env vars in PM (see the sketch below)? Could be my last paper, doing this with random forest. Gives a sense of which variables are potential predictors for the presence/absence of the plant. Assess which factors are useful for the model.
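A hedged sketch of what that could look like for PM; the file and column names are placeholders, not real PM exports:

# Random forest: environmental variables as explanatory factors vs
# presence/absence of a gene or taxon, plus variable importances.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("pm_samples.csv")  # hypothetical PM metadata export
X = df[["temperature", "salinity", "depth", "chlorophyll"]]  # assumed columns
y = df["taxon_present"]  # assumed 0/1 presence/absence

rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

rf.fit(X, y)
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")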

Examples of R-arcGIS for plastics in the ocean (didn't cover it).

Emerging hotspot analysis: summarize findings in a story map; the example is deforestation. Learn ArcGIS (learn.arcgis.com) has lessons on dealing with ArcGIS data, including a cool one on climate.

Jul.16

ESIP schedule

FAIR metadata session; Interoperable session github

DataCite: a plot measuring FAIRness for the DataCite providers.

DataOne: a federation of repository systems, with datasets from a wide variety of providers. List of checks for FAIR; generates a composite score for the FAIRness of datasets.

LTER is now called EDI (it evolved into EDI); uses primarily EML, the Ecological Metadata Language. Question for @pier: do we have an ENVO alignment to EML?

Research Organization Registry: like ORCID for institutions (UA etc.); perhaps add to datapackages along with the sample IDs Kerstin Lehnert is working on?

cross ref

Integrated Ocean Observing System (IOOS)

Free text bad; cross-references/identifiers good.

  • Simon Cox

Sad to see that this is where we're at: metadata for people and affiliations is already hard to get, never mind the 13th column of the 2nd spreadsheet.

Kerstin works at the Interdisciplinary Earth Data Alliance (IEDA).

Met Dr. Karen Stocks from the Geological Data Center, Scripps Institution of Oceanography; she mentioned a once-a-decade ocean omics and biogeochemical meeting happening in September in Hawaii. Perhaps something Bonnie should be attending, or at least try to slip PM in? I'll try to follow up with her.

Jul.17

Data | An Open Access Journal from MDPI

ESRI Living Atlas: large GIS library.

GeoCODES | EarthCube: a very large cyberinfrastructure system. ESRI adopted it and does real-time analytics on the data.

A dashboard was suggested for data visualization in real time. We don't have real-time data for PM; however, it could be cool to have an interactive one on the computed functional and taxonomic data. It would be really cool to be able to do kriging with it in real time.

nature paper: make scientific data fair cite this for PM.

API and schema.org talk (Adam sheppard) ESIP Summer 2019 API Session and here

Examples of APIs which other systems can directly get data from (drag and drop a geospatial snapshot); see if schema-marked-up datasets show up: State of the Ocean, Giovanni (The Bridge Between Data and Science).

Data Catalog Vocabulary (DCAT): schema.org apparently started with material from this.

There are these two sessions Bonnie is convening: Ocean Sciences Meeting, 16-21 Feb 2020, and AGU, 9-13 December. Figure out if I'm going to either.

National Ecological Observatory Network

I should join one of the ESIP communities: science on schema.org, or Marine Data. Probably Marine Data, to get PM in there.

Of relevance to the discussion in the ESIP schema community: BioSchemas and iotschema, and the Open Geospatial Consortium. Open question on how those would link to domain ontologies like OBO/SWEET. Ruth is thinking about it under the semantics committee.

Harvesting Metadata through Schema.org: we should mark up all our PM datapackages with schema.org and get them published and searchable.

Jul.18

ESIP Lab gives small grants and strategic outreach, to get community input and some organizational support. One member has to be from an ESIP member organization.

fairtools.org: a guy is building it; could be interesting, but very alpha.

National Center for Ecological Analysis and Synthesis first synthesis science center in the world

conda-forge allows packages to be built with an automated workflow.

Example ESIP project: Explore-HRRR-with-XrViz, bringing visualization to a community; a Jupyter notebook to visualize data in the system. It would be cool to have something like this that's interactive to visualize available data layers from PM. It may be a bit sparse planetary-coverage-wise, but could be cool. Searchable parameters from PM, but maybe also taxonomy and gene frequency.

Presentation from the Anaconda guy, nice! PyViz.org explains what python tools can be used for different visualizations. HoloViz.org is an opinionated subset of PyViz, with server-side rendering of large datasets. Some of the examples from HoloViz: cloud-enabled data visualization/exploration. I think it would be really cool to have this for GO/NCBI taxon terms computed by the MGnify pipeline for all the data in PM. Have widgets for subclasses of GO/NCBI paired with the map visualization so we can explore what genes/taxa are where in the world's oceans. Bring that cool data visualization from the geosciences world into the life sciences, as a note on how powerful the combined use of semantically-standardized data can be, especially with visualization. Extra stuff could be to add overlays of external data products: satellite data, chl, temp, etc. Look to HoloViz.org for inspiration.

Daily Satellite Imagery and Insights | Planet: the largest private constellation of satellites in the world (at the moment).

Pier's site on oceanbestpractices; would be cool to link to this for PM.

From Dr. Ilya Zaslavsky, Director of the Spatial Information Systems Laboratory, SDSC, UCSD: datadiscoverystudio, as well as SuAVE: Survey Analysis via Visual Exploration, with a new 2018 SDG Index. It discovers data for the UN SDGs using data mobilized by a variety of ontologies, including OBO.

Ruth Duer's talk

The WMO (World Meteorological Organisation)'s GCW (Global Cryosphere Watch) has 26 glossaries; they asked Ruth to deal with it and come up with something consistent. GCMD geoscience metadata convention. Arctic Data Committee. NSF's EarthCube program plans on crawling known repos to accumulate a federated data discovery system via schema.org.

ESIP vocabularies and semantics working group: coordinates activities across the community. Compiling resources and activities in the polar realm.

Pier's session

ESIPFed sweet issues page

Jul.19

COR ESIP Summer Meeting Running Minutes

Currently the semantic tech regular calls are for the cryosphere, future work could be marine.

ML and knowledge representation are also a focus for the semantic tech committee.

Aug.06

On branch issue-821, working on a PR to fix up 'marine oxygen minimum zone'. For the axiom, perhaps I could use something similar to ENVO:'hypoxic water', with 'liquid water' changed to 'sea water':

'has quality' some 
    ('decreased concentration'
     and ('inheres in' some 'sea water')
     and (towards some 
        (dioxygen
         and ('has quality' some dissolved))))

Bonnie suggests I apply for the NSF graduate fellowship, specifically under the Computer and Information Science and Engineering discipline, subdiscipline something about cyberinfrastructure. Also try to apply for ARCS (could be politics keeping me back), and definitely apply to the ESIP Community Fellows program and go to the cal libraries workshop.

Aug.08

From Dr. Simon Cox: https://github.com/UKGovLD/registry-core example deployments

http://codes.wmo.int/, http://environment.data.gov.uk/registry/?_browse=true, http://registry.it.csiro.au/?_browse=true. I think these are controlled vocabularies supported by the World Meteorological Organization.

From Dr. Karen Stocks: "NetCDF-CF, for parameter naming conventions: http://cfconventions.org/standard-names.html

BODC/NERC/SeaDataNet vocabulary: http://seadatanet.maris2.nl/v_bodc_vocab_v2/welcome.asp - check out the parameter ones.

This is very small compared to your efforts, but it was a static interop effort that included a HOT component and it may amuse you: http://www.seaviewdata.org/ The good part of this project was very easy export into ODV, a highly used tool, plus NetCDF for Matlab, Python, R, etc. users.

R2R website is at rvdata.us, go to the cruise catalog. Reach the R2R team with [email protected] "

NERC Vocabulary Server The NERC Vocabulary Server provides access to lists of standardised terms that cover a broad spectrum of disciplines of relevance to the oceanographic and wider community.

From Ramona: https://github.com/The-Sequence-Ontology/MSO, the sequence ontology; may be useful (but probably not).

Aug.09

Ongoing idea for PM: could consider linking out to satellite data products such as MODIS or the NSIDC sea ice cover, and doing an automated join programmatically.

Other longer-term idea: precomputing, via the EBI tools, the GO and NCBI taxon terms, and allowing ontological search of genes or taxa on a map; could be a really cool showcase of the use of ontologies.

ENVO issue on NERC VS Mapping if/when I get to that for PM.

Aug.16

Notes From Pier

For the EBI/JGI work, we're going to go with making heavily axiomatized pre-composed ecosystem terms, in a similar fashion to what was done in the ENVO EMPomics subset.

For example: cnidarian-associated environment

'animal-associated environment'
and ('determined by' some
   (Cnidaria or ('part of' some Cnidaria)))

But with ecosystem instead of system of system.

For the host-associated terms, the issue with fuzzy terms like "associated" (add that to your paper) comes in: the white cliffs of Dover are "associated" with coccolithophores. In all the EBI cases, they would be ecosystems or have an ecosystem as a part. Cite Luke's EMP paper where the ENVO-EMPO links were described (my text), and frame this as a natural extension boosting interoperability between these resources.

They got some sort of community out of these environments to sequence, so they were/had as parts ecosystems.

It would be cool to even in the second (retrieval) paper draw on the heavily axiomatized terms to provide better suggestions I'll have to think more on that later.

EBI/JGI semantics:

Example: an ecosystem associated with a biofilm inside a drainage pipe.

1. Create solid semantics for 'drainage pipe' as a manufactured product. 2. Make sure biofilm is imported or represented. 3. Create an ecosystem class defined as "ecosystem and 'determined by' some (biofilm and 'located in' some drainage pipe)". Strictly it's ('part of' some biofilm), as they may not have sampled the whole thing.

example Ocean/marine

Would be something like: ecosystem and 1) 'determined by' some marine water body and 2) 'located in' some ('part of' some marine water body).

Same for EBI:soil. The 'located in' ('part of') is important, as it shows we're not talking about the whole thing.

From Earlier: when I said

It would be cool to even in the second (retrieval) paper draw on the heavily axiomatized terms to provide better suggestions I'll have to think more on that later.

Idea for the kmer-based retrieval paper: the program would be a website/GUI that not only retrieves semantics based on kmer profiles, but also helps users create in-depth annotations like those I created manually in the first paper.

It would be something along the lines of accessing the kmer profiles of the GOLD set as well as their associated axioms, and it would find the best matches based on what was there; so, for example, the best matches might be to kmer profiles associated with ENVO terms whose semantics included ocean and surface water. The software would operate in two modes. Basic mode would just give a general-purpose (automated) annotation at a higher level, like marine ecosystem, and provide the PURL. Alternatively, advanced mode could help users (with a GUI) to create complex axioms for annotation in the style of the 1st paper. For example:

ecosystem
  and ('determined by' some 
    (air 
        and ('part of' some 
            ('nitrogen-oxygen planetary atmosphere'
               and ('located in' some 
                    ('part of' some construction))))))

Where users would get to keep chaining in semantics which better describe the sample (see the sketch below). The program would provide the relations (and dropdown menus for the different ENVO sub-hierarchies: feature, material, biome, etc.) and would generate the complete OWL code for the term, as if they had edited it in Protege and put it together (as I would do in the first paper). The program would make sure to only allow BFO-correct axioms and do ELK reasoning checks etc., then would do something like create a pull request to ENVO with the new code (in an ID range we specify for it). This way people could create their own well-axiomatized annotations to comprehensively describe their own samples/data, making use of existing data/annotations to retrieve suggested semantics. This would fit nicely with the MIxS standards and how they've been used to describe data, using a combination of feature, material, and biome, or in MIxS 5 any 3 ENVO terms. It could maybe also add the PCO:microbial community term in there too.
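A toy sketch of the chaining idea; this is not real tooling, just the string-assembly core (a real GUI would populate the relation/class choices from ENVO dropdowns and validate with ELK):

# Build a nested Manchester-syntax expression from an outermost-to-innermost
# chain of (class, relation) choices, like the construction example above.
def build_axiom(chain):
    cls, relation = chain[0]
    if relation is None:
        return cls
    return f"{cls} and ('{relation}' some ({build_axiom(chain[1:])}))"

chain = [
    ("ecosystem", "determined by"),
    ("air", "part of"),
    ("'nitrogen-oxygen planetary atmosphere'", "located in"),
    ("'part of' some construction", None),
]
print(build_axiom(chain))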

Aug.20

From Chris Mungall: https://douroucouli.wordpress.com/2019/06/29/ontotip-learn-the-rector-normalization-technique/ as well as https://douroucouli.wordpress.com/2019/05/10/ontotip-single-inheritance-principle-considered-dangerous/. I should read through the rest of his blog; good stuff there. Also read this Rector 2003 paper.

TARA pm datapackages:

Niskin_profiles_PANGEA.tsv Done

TARA_samples_nutrients_PANGEA.tsv Done

Sampling_events.tsv Done

TARA_water_context_PANGEA.tsv Not sure how to handle semantics for this will do in next round of Tara

TARA_Ardyna_water_context_PANGEA.tsv Will need a lot of new terms to do this properly will do in next round of Tara

campaign.tsv Done

TARA_samples_HPLC_PANGEA.tsv Done for now need to make some new ENVO concentration of terms to finish

sample_NCBI.tsv Done for now

TARA_samples_carbonate_PANGEA.tsv Will need a lot of new terms to do this properly; see the PANGAEA link and paper about the measurements. Will do in next round of Tara

Aug.21

NSF GRFP eligibility and FAQ. From this I think that having a master's degree makes me ineligible to apply ...

Frictionless data units proposal

Making the first pass at the Tara datapackage.json; it will include:

Niskin_profiles_PANGEA.tsv, TARA_samples_nutrients_PANGEA.tsv, Sampling_events.tsv, campaign.tsv, TARA_samples_HPLC_PANGEA.tsv, sample_NCBI.tsv

Later I can do a second release including:

TARA_Ardyna_water_context_PANGEA.tsv, TARA_samples_carbonate_PANGEA.tsv, TARA_water_context_PANGEA.tsv

Note that not every Tara folder has Niskin_profiles_PANGEA.tsv, and in APX it's Niskin_profiles_PANGEA_allatt with fewer fields. Did some quick checking; the other stuff is the same.

while making the datapackage I got the error:

cat TARA_samples_carbonate_PANGEA.tsv | ../../../../pm-schemas/scripts/schema_tsv_to_json.py
Traceback (most recent call last):
  File "../../../../pm-schemas/scripts/schema_tsv_to_json.py", line 9, in <module>
    header = sys.stdin.readline()
  File "/Users/kai/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It was only happening to files which were of type UTF-16 Unicode text (.txt from Excel); resolved it by re-saving in Excel as Tab delimited Text (.txt).
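An alternative to the Excel round-trip is converting the encoding directly; a sketch using one of the offending files:

# Re-encode a UTF-16 TSV (the kind Excel emits, starting with a 0xFF BOM
# byte) as UTF-8 so the schema_tsv_to_json.py pipe can read it.
from pathlib import Path

src = Path("TARA_samples_carbonate_PANGEA.tsv")
src.write_text(src.read_text(encoding="utf-16"), encoding="utf-8")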

PM resource types:

"pm:resourceType": "niskin",

"pm:resourceType": "campaign",

"pm:resourceType": "sample",

"pm:resourceType": "sampling_event",

"pm:resourceType": "ctd",