2021_log
ESIP
ESSDIVE id metadata and supplemental
PM meeting:
get irods iinit
Use a script like what Alise has to pull down datasets to the UA HPC.
Create a script to benchmark the datasets / sanity checks for ..: flag bad files or ones that look weird (non-annotated; quality control in general); flag a dataset if >X% of reads are human, or if we completely lost a sample (got no annotation).
There should be a protocols.io link for irods, or just see Alise's example script.
Submit script and run script: the latter runs on HPC and executes Python code. Look at Matt's pipeline as an example of how to run it, and the iget commands from Alise. The submit script will run a for loop to submit multiple jobs with args; it executes the run script, which loads Singularity, the Python executable, etc. Split the list into chunks and run an array command (a rough sketch follows below).
Create submit and run scripts to download data for processing.
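A minimal sketch of that submit pattern, with hypothetical file names (samples.txt, run.sh); each array task would load singularity/python and iget the files in its chunk:

```bash
# Split the sample list into chunks (numbersplit of 50, as noted below) and
# submit one SLURM array task per chunk; run.sh stands in for the run script
# that downloads and processes the files listed in its chunk.
split -l 50 -d samples.txt chunk_
n=$(ls chunk_* | wc -l)
sbatch --array=0-$((n - 1)) run.sh   # each task maps $SLURM_ARRAY_TASK_ID to one chunk_* file
```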
Create a script to sanity check the data coming out of the pipeline: QC leaves too few reads, too few annotations, etc. Want a summary table of what went wrong; maybe pull some of Matt's intermediate steps and do additional checks.
https://github.com/ontodev/cogs https://github.com/INCATools/ontology-development-kit/releases/tag/v.1.2.26 https://github.com/ontodev/robot/blob/master/CHANGELOG.md
ESIP marine data Vocabularies Special Session
CF conventions: a climate model data standard to deal with interoperability, based on the NetCDF file format. The conventions provide full metadata (spatio-temporal and spectral info), up to 32 dimensions on a variable, and standard names to show what is in the variable.
GCMD keywords http://gcmd.arcticlcc.org/ facepalm
SeaDataNet NVS: exchange formats, platforms, measurement devices, etc. Similar stuff. NERC L22 for instruments.
Pier's mapping of ENVO to NERC https://github.com/EnvironmentOntology/envo/issues/731
https://github.com/tanyagupta/codeblogs/blob/development/githubHowTos/howto.md
Meeting with Alise:
iinit
ERROR: environment_properties::capture: missing environment file. should be at [/home/u19/kblumberg/.irods/irods_environment.json]
One or more fields in your iRODS environment file (irods_environment.json) are
missing; please enter them.
Enter the host name (DNS) of the server to connect to: data.cyverse.org
Enter the port number: 1247
Enter your irods user name: kblumberg
Enter your irods zone: iplant
Those values will be added to your environment file (for use by other iCommands) if the login succeeds.
irods docs: https://docs.irods.org/4.1.0/ https://cyverse-data-store-guide.readthedocs-hosted.com/en/latest/step2.html#icommands-first-time-configuration
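Once iinit succeeds, the standard iCommands should work; a quick sanity check and recursive download, where the shared collection path is only a placeholder (see Alise's script for the real paths):

```bash
# List the iRODS home collection to confirm the connection works, then fetch a
# collection recursively; /iplant/home/shared/some_dataset is a placeholder path.
ils
iget -r /iplant/home/shared/some_dataset ./data/
```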
Check running jobs: qstat -u kblumberg
Example of how to delete a job by its number: qdel 3843824.head1
Have numbersplit be 50.
Have lists of 500 at a time (to submit 4x). Randomize these. If this doesn't work we can try groups of 200 files at a time (would need to submit 10x).
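One way to build those randomized lists, assuming the full set of file names lives in a hypothetical all_files.txt:

```bash
# Shuffle the file list and cut it into lists of 500 (list_00, list_01, ...).
shuf all_files.txt | split -l 500 -d - list_
```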
FAIR digital SI. EOSC has an ontology mapping framework: Dublin Core, DDI, schema.org.
Iadopt
Seadatanet
SeaDataCloud? For thesis.
Prefer cc0 over cc-by for licenses
OBO dashboard, OMO (useful subset of IAO).
Papers relevant to PM paper 2: https://www.frontiersin.org/articles/10.3389/fmars.2019.00440/full (DEF CITE), https://peerj.com/articles/cs-110/, https://content.iospress.com/articles/information-services-and-use/isu824 (would be good too).
For my committee: 2 pager about getting back on track:
page one
Explanation of material going into paper 2: ontology contributions supporting physicochemical data, ontology choices, CI choices (the Frictionless Data specification, analogous to NetCDF, etc.). Then use Planet Microbe to ask and answer/highlight use cases for discovery of physicochemical data across environments.
page two
Plan for integrating that data with the functional and genomic annotations and getting it all into an RDF triplestore to ask and answer deeper questions about the distribution of taxa and genes, correlating that with the physicochemistry and environment types.
Include timeline when the data should be available.
Some quick Q's:
Do Synechococcus and Prochlorococcus vs depth (or another parameter); get files from 2 projects.
Acidity of water: lots of materials and features to compare as it's in so many projects.
Redfield ratio (PO4, NO3): 1000 samples, a good amount of features and some materials. See what features/materials/biomes deviate from the standard Redfield ratio.
From Matts example PM API calls: https://github.com/hurwitzlab/planet-microbe-app/blob/master/README.md
Open Chrome developer tools and select “Network” tab
option+command+c
I think I should be able to play around with this to set up the API calls I'll want to make to build my RDF triple store. I could do them all manually with the UI's search interface (downloading CSVs then adding those to the triplestore) but I think it would be cooler, more automated and more reproducible to build the triple store from the JSON outputs of API calls. That way it shows that someone else could leverage the PM API to do something totally different with the data. So the data store would be built from a bunch of calls like:
See more examples in ~/Desktop/scratch/pm_query
curl -s -H 'Content-Type: application/json' -d '{"http://purl.obolibrary.org/obo/ENVO_09200014":"[0,10]","limit":"40"}' https://www.planetmicrobe.org/api/search | jq .sampleResults > temp.json
Then I could figure out a "correct" way of converting these JSON products to RDF, using something like ShEx or SHACL.
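As a throwaway sketch of the idea (not the eventual ShEx/SHACL-validated approach), the search response could be piped through jq into N-Triples; the field names sampleId and projectName here are guesses about the JSON structure and would need checking against a real response:

```bash
# Pull sample results from the PM search API and emit naive N-Triples.
curl -s -H 'Content-Type: application/json' \
  -d '{"http://purl.obolibrary.org/obo/ENVO_09200014":"[0,10]","limit":"40"}' \
  https://www.planetmicrobe.org/api/search \
  | jq -r '.sampleResults[]
      | "<https://www.planetmicrobe.org/sample/\(.sampleId)> <http://purl.org/dc/terms/isPartOf> \"\(.projectName)\" ."' \
  > samples.nt
```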
ESIP Marine Data meeting
Shawn Smith to Everyone (8:22 PM)
Kai - How would the units interchange system differ from software like Udunits?
Add comments to James Units for OBO.
Citation for pato https://academic.oup.com/bib/article/19/5/1008/3108819
1:1 with Bonnie
Remove some bad samples
Look at the graph by families of organisms
show overall trends big patterns to be followed up on in paper 3.
Doesn't have to be all organisms in all graphs, just a few interesting stories.
Might not need a high level pattern with all data
Not recreating a phylogeny; instead taking an existing one, e.g. for Synechococcus, and putting it together with our physicochemical and observation data.
some example figures in this paper: https://www.pnas.org/content/112/17/5443.full
Co-occurrence networks? Samples organized by depth.
Remove bad samples
Try CCA but subset by groups of taxa: are there patterns for groups of taxa? Maybe this gives insight into a bigger summary graph.
Maybe instead try correlations of all species against all chemistry, look for interesting patterns, then pick a couple for a prettier figure.
Cellfie Protégé plugin, similar to ROBOT?
Michael DeBellis Protege 5 tutorial apparently very good.
Comparison of normalization methods for the analysis of metagenomic gene abundance data: definitely read before Planet Microbe paper 3.
BCODMO/NMDC meeting https://microbiomedata.github.io/nmdc-schema/MetaproteomicsAnalysisActivity/ https://proteinportal.whoi.edu/ https://lod.bco-dmo.org/browse-all/
EBI Ontologies CoP Webinar
Website: https://bigdata.cgiar.org/communities-of-practice/ontologies/ LinkedIn: https://www.linkedin.com/groups/13707155/ Youtube: https://www.youtube.com/channel/UCpAupP30tbkcFwMBErFdDhA/videos Newsletter: https://cgiar.us15.list-manage.com/subscribe?u=59a7500ef5c94dd50e2b9e2fb&id=454e94d3f2
OLS: ontology lookup service
ZOOMA: to mine terms from text
OXO: ontology mapping. Confidence is just count of mappings (pretty rudimentary) but they'll have this follow best practices set by SSSOM.
SSSOM: mapping standard
WEBINARS: July 27: Doing a Governance Operational Model for Ontologies, GOMO with Oscar Corcho, Jose Antonio Bernabe Diaz and Edna Ruckhaus from the Technical University of Madrid and Alexander Garcia from BASF.
Register: https://lnkd.in/gb2cG2h
- September 7: Neo4J as a backend DB for WebProtégé and Towards a plugin based architecture for WebProtégé with Matthew Horridge and Mark Musen from Stanford University.
Register: https://lnkd.in/gU92u96
- September 21: Enterprise Knowledge Graph with Alexander Garcia from BASF.
Register: https://lnkd.in/gniS2Mm
paper: Matching sensor ontologies through siamese neural networks without using reference alignment
Notes from PM meeting:
TODO: add more test examples in the 10 Gb range to this for Matt to try and run: https://github.com/hurwitzlab/planet-microbe-functional-annotation/blob/master/test_data/get_test_data.sh
file handling
edit config based on the individual files and their needs
# rm results/interproscan.txt results/killed_interproscan.txt (to start up the lookup server)
Might want to consider another Kraken Database
configfile: "config/config.yml" (put the sample info into this).
config params to train:
vsearch_filter_maxee: 20
vsearch_filter_minlen: 75
frag_train_file: "illumina_10"
adapter_fasta: "data/configs/TruSeq3-PE.fa"
+ read length (if Matt can't automate it)
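If the read length does have to be set by hand, a rough way to check it from the first reads of a fastq.gz (the file name here is just one of the samples mentioned later in these notes):

```bash
# Print the read-length distribution of the first 1000 reads.
gunzip -c data/SRR5720300_1.fastq.gz \
  | head -n 4000 \
  | awk 'NR % 4 == 2 {print length($0)}' \
  | sort -n | uniq -c
```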
Interesting links: https://gtdb.ecogenomic.org/ https://thispersondoesnotexist.com/. https://medium.com/analytics-vidhya/apriori-algorithm-in-association-rule-learning-9287fe17e944
My file with the list of PM analysis is ~/Desktop/software/planet_microbe/planet-microbe-semantic-web-analysis/job_listpmo_samples_unique.xlsx
Finalizing PM database for paper2.
Currently, Matt's scripts/load_datapackage_postgres.py deals with lat/long constraints when loading into the Planet Microbe DB. Alise found this example https://data-blog.gbif.org/post/frictionless-data-and-darwin-core/ for using Frictionless Data with Darwin Core, which sets lat/long constraints:
Using Matt's validation:
conda info --envs
manage conda environments
conda activate pm
Validate:
scripts/validate_datapackage.py ../planet-microbe-datapackages/OSD/datapackage.json
scripts/validate_datapackage.py ../planet-microbe-datapackages/Tara_Oceans/datapackage.json
conda deactivate
#at the end
Script to validate PM datapackages; for usage see the readme of the repo.
{
"name": "year",
"type": "integer",
"rdfType": "http://rs.tdwg.org/dwc/terms/year",
"constraints": {
"required": true,
"minimum": 1000,
"maximum": 2050
}
},
That example makes use of the goodtables script: https://goodtables.readthedocs.io/en/latest/
example call:
goodtables MiturgidaeBE_DP/data_package.json
pip install goodtables # to install
goodtables OSD/datapackage.json # to run; it works.
{
"name": "Latitude",
"constraints": {
"required": true,
"minimum": -90,
"maximum": 90
},
...
Added this and ran goodtables OSD/datapackage.json
It works! Ran Matt's validate_datapackage.py on the constraint-modified example above and it works too. I think I'm safe to add this for all Lat/Longs in all DPs.
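A small loop for checking every datapackage once the constraints are added everywhere, assuming the datapackage repos sit next to each other as in the validate calls above:

```bash
# Run both validators over every datapackage in the sibling repo.
for dp in ../planet-microbe-datapackages/*/datapackage.json; do
    echo "== $dp"
    goodtables "$dp"
    scripts/validate_datapackage.py "$dp"
done
```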
https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a
ESIP marine meeting
https://github.com/ESIPFed/marinedata-vocabulary-guidance
from Pier: https://www.tdwg.org/community/gbwg/MIxS/
https://search.oceanbestpractices.org/
sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo -l -s -d --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
2021-08-02 09:59:40,481 ERROR (CommandRunnerBase:213) Could not find an OWLObject for id: '-s'
## Showing axiom for: null
Exception in thread "main" java.lang.NullPointerException
at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3418)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
make: *** [Makefile:431: mirror/ncbitaxon.owl] Error 1
Tried again without the -l -s -d flags and got:
sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
make: *** [Makefile:223: imports/ncbitaxon_import.owl] Error 1
rm imports/ncbitaxon_terms_combined.txt
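A possible retry for the heap-space error, assuming (unverified) that the Makefile's OWLTOOLS_MEMORY assignment can be overridden from the make command line through run.sh, and that the Docker container itself is allowed enough RAM:

```bash
# Override the memory value set in the Makefile for this one invocation.
sudo ./run.sh make all_imports OWLTOOLS_MEMORY=24G
```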
From Adam: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html which links to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
example from Adam: ssh -i /Users/.../adam.pem [email protected]
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
DIDN'T WORK
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
Also tried this didn't work.
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
did all with chmod 600 and 400 to the pem file.
Maybe this is due to the instance not being started? https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/start-instances.html has info on starting an instance; need the aws command, perhaps from awscli.
example command:
aws ec2 start-instances --instance-ids i-1234567890abcdef0
Try install of aws from https://www.youtube.com/watch?v=BNH4i7CQ4Oc
aws ec2 start-instances --instance-ids i-1234567890abcdef0
but to do this I'll need the instance id.
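Assuming working AWS credentials (e.g. set up via aws configure), the instance ID could presumably be listed with:

```bash
# List each EC2 instance with its current state.
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output table
```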
AWS management console: https://aws.amazon.com/console/ -> login -> IAM user
try entering: 504672911985
which I got from the command line. But I don't have a password, secret key doesn't work.
From aws-sec-cred-types.html they have the link: https://account_id_or_alias.signin.aws.amazon.com/console/
tried https://504672911985.signin.aws.amazon.com/console/
but that just redirects to the signin for which I don't have a password.
TODOs for PM paper: check formatting, reference formatting, push code, respond to comments from Elisha and ED, prepare Cover letter.
Cancel Student Health Insurance Plan https://health.arizona.edu/sites/default/files/Cancellation%20Guide%20SHIP.pdf
Meeting with Alise about getting Matt Miller's pipeline running
interactive
source ~/.bashrc
conda env create -f kraken2.yml
conda env create -f bracken.yml
conda env create -f pm_env.yml // this failed; make a new pm_env.yml with snakemake
# steps to create pm_env again do this in interactive
conda create -n pm_env
conda activate pm_env
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
#modify the cluster.yml and config.yml files
#bowtie index folder is in:
/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/planet-microbe-functional-annotation/data
# copy that into my version of the git repo so that the whole thing is portable
# submit the main snakemake job which will submit other jobs
# need to make sure this isn't submitting too many
sbatch run_snakemake.sh //make sure to update the path in this to the repo
squeue -u kblumberg
scancel job_ID_number
/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai
Did the above, which worked except for the pm_env; instead it made a snakemake env, so I changed run_snakemake.sh to activate snakemake instead of pm_env.
Prepared it with 1 Amazon sample SRR4831664 from the example get_test_data.sh. Tried snakemake -n and it seemed to be working, then ran sbatch run_snakemake.sh and got Submitted batch job 2037818.
started the job ~ 14:15 sep 10th. 6.5 hours out still going no outputs yet.
17 hours 42 mins still going no outputs.
Job never got past the bowtie step. See job output https://metrics.hpc.arizona.edu/index.php#job_viewer?realm=SUPREMM&recordid=0&jobid=12767280&infoid=0
After the QC pipeline couldn't load biopython and failed, I added to the snakemake conda env:
conda install -c conda-forge biopython
https://github.com/turbomam/pizzacob/tree/main http://dashboard.obofoundry.org/dashboard/index.html
FOODON workshop: the seafood workflow from annosh is pretty neat, with the E3E lookup tool to get the labels. Could be relevant to BCODMO.
https://oboacademy.github.io/obook/tutorial/icbo2021/
Meeting with Alise:
ERR1234
GO:0055114 5705414
GO:0008152 3836620
GO:0006810 1409368
#_annotations (SUM column)
ERR1234
number_basepairs 12322
Last rule: this includes a python script for each of these (a quick bash equivalent of item 4 is sketched below):
1) kraken with only the columns we want (taxID and the number of reads unique to that node)
2) InterProScan counts
3) GO counts (just drop any duplicated interpro2go annotations that are in the same MF, BP or CC family: alphabetize and drop the 2nd one)
4) qc_scan log: # BP in the fasta from step_02_qc_reads_with_vsearch (use the biopython library on the fasta to give the lengths of reads)
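That quick bash equivalent of item 4, counting total basepairs in the QC'd fasta (the path follows the pipeline layout used elsewhere in these notes; the real rule would use biopython):

```bash
# Total basepairs = every sequence character, with header lines excluded.
grep -v "^>" results/SRR4831664/step_02_qc_reads_with_vsearch/SRR4831664_trimmed_qcd.fasta \
  | tr -d '\n' | wc -c
```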
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh -appl Pfam -i results/SRR4831664/step_05_chunk_reads/SRR4831664_trimmed_qcd_frags_2047.faa -b results/SRR4831664/step_06_get_orfs/SRR4831664_trimmed_qcd_frags_2047_interpro -goterms -iprlookup -dra -cpu 4
was working when run alone but not in pipeline.
repo from Alise to download from irods https://github.com/aponsero/download_irods
meeting:
For paper: first question is benchmarking that the data look sane, then a couple of core stories we can address.
Run amazon datasets first.
va on the HPC shows the allocation remaining.
Notes for running the big PM analysis on UA HPC
Chunk sizes are, I think, in units of MB.
files bash/run_interproscan.sh (chunk size) and run_interproscan_chunk.sh (--time=)
200000 works but spawns lots of nodes, perhaps 100 per sample. Uses at most 2 hours.
20000 spawns way too many jobs.
1000000 is too large and gets a memory error, but only spawns 40 jobs for the 2.9 GB file SRR5720248_1.
Try 500000 with 6 hours: spawned 81 jobs for the same 2.9 GB file. Testing; see `err/interproscan.base=SRR5720248_1.err`
May or may not be working; check one of the chunk files, e.g. `err/ips_chunk40_SRR5720248.err`
seems to be working
~3.5 hours worked.
Started the next batch ~3pm with 16.1G of samples; it started 493 jobs, so the heuristic is pretty good. Finished at ~10pm. However it didn't work because some of the chunks timed out: I had them set to 6 hours and they needed more time. I'll set it to 12 hours just to be safe. Deleted the step 5 and 6 folders just to be safe, as it crashed and I'm not sure if snakemake will pick things up after a failure with the parallelization. Now it's at 12 hours per chunk job. Started it again at 10:10pm Sep 30th.
#run in cluster
snakemake --cluster "sbatch -A {cluster.group} -p {cluster.partition} -n {cluster.n} -t {cluster.time} -N {cluster.N} --mem={cluster.m} -e {cluster.e} -o {cluster.o}" --cluster-config config/cluster.yml -j 30 --latency-wait 30 --until `rules interproscan?`
OR to run up to step 4, do some commenting out:
rule all:
input:
expand("results/{sample}/bowtie/{sample}.fastq.gz", sample=config["samples"]),
#expand("results/{sample}/step_02_qc_reads_with_vsearch/{sample}_trimmed_qcd.fasta", sample=config["samples"]),
THIS #expand("results/{sample}/step_07_combine_tsv/{sample}_trimmed_qcd_frags_interpro_combined.tsv", sample=config["samples"]),
expand("results/{sample}/bracken/{sample}_profiles.txt", sample=config["samples"]),
"results/killed_interproscan.txt",
The rules for interproscan and for starting/stopping the server can just be commented out.
https://www.biostars.org/p/265420/
2nd PM paper: higher taxonomic resolution goes better with ecology/habitat. Prepare 2 versions of the docs file, one with the new changes and one with the original with track changes. Papers to add from review: https://doi.org/10.1111/1462-2920.15173 and https://doi.org/10.1007/s00248-020-01526-5 Alise's example response to reviewers: https://docs.google.com/document/d/17uT6JbOoyAj6tRtceHk46ZfW0sHK8t6W9KcWdPicKDI/edit#heading=h.5qbn069bmn46
Regarding the step 7 configurations, I tried 500000 for 6 hours but several didn't finish in that time, so I upped it to 12 hours for the chunk jobs and ran it last night. It seems to have worked: all the step 7 outputs are there and the job completed successfully; however, when I cat out the chunk files I get a few possible OOMs:
(puma) (base) [kblumberg@junonia err]$ cat ips_*
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 34401 Killed "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 14:52:43.646263 33724 slurmstepd 0x2b834568b340: error: Detected 1 oom-kill event(s) in StepId=2181585.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 54774 Killed "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 16:22:45.860395 54039 slurmstepd 0x2b489e9d8340: error: Detected 2 oom-kill event(s) in StepId=2181737.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
The files say they were successful for every chunk, but I don't think that accounts for the OOMs. It was only 3 events for those data files, so I don't think it's worth re-running them, but I've upped the memory to 7 GB. Resubmitted the next job at 8:08am CET.
From Simon and OGC: SOSA: A lightweight ontology for sensors, observations, samples, and actuators https://w3c.github.io/sdw/ssn/ Spatial Data on the Web Interest Group https://www.w3.org/TR/vocab-ssn/ https://github.com/opengeospatial/om-swg
Interpro job tests:
Tried 875000 chunk size: works for 50 nodes, and worked with 12 GB mem.
Testing 20 nodes with 16 GB memory didn't work. Try again with more mem.
Try 20 nodes, 2000000 chunk size, with 50 GB, again with sample SRR5720221_1 -> one sample got an out-of-memory error.
Try again with fewer nodes: 40 nodes, 1000000 chunk size, with 50 GB, again with sample SRR5720221_1. Worked! ~3.5 hours.
Try 30 nodes, i.e. 1125000 chunk size, with 50 GB, with sample SRR5720225_1 -> made 39 nodes not 30? First time this breaks the pattern from before, but it worked. Proceed with this chunk size assuming it's 40 nodes to be safe; then by my calculation we can do 35 Gb at a time.
Reducing sample sizes:
gunzip -c data/SRR5720300_1.fastq.gz | head -n 100 | gzip > test_data/SRR5720300_1_mini.fq.gz
Looking at the "5"Gb samples from my parsed sample list:
BATS -> 150bp reads
3.4G SRR5720275_1.fastq.gz //gunzip -c data/SRR5720275_1.fastq.gz | head -n 10
3.4G SRR5720249_1.fastq.gz
3.2G SRR5720285_1.fastq.gz
HOT Chisholm -> 150bp reads
3.3G SRR5720293_1.fastq.gz //gunzip -c data/SRR5720293_1.fastq.gz | head -n 10
3.4G SRR5720302_1.fastq.gz
HOT ALOHA time/depth series -> 150bp reads
3.4G SRR9178068_1.fastq.gz //gunzip -c data/SRR9178068_1.fastq.gz | head -n 10
3.2G SRR9178368_1.fastq.gz //gunzip -c data/SRR9178368_1.fastq.gz | head -n 10
3.2G SRR9178503_1.fastq.gz
6.2G SRR5002405.fastq.gz //gunzip -c data/SRR5002405.fastq.gz | head -n 10
6.2G SRR5002321.fastq.gz
Tara -> ~100bp reads
3.2G ERR599134_1.fastq.gz //gunzip -c data/ERR599134_1.fastq.gz | head -n 20
3.3G ERR599172_1.fastq.gz
3.4G ERR598972_1.fastq.gz //gunzip -c data/ERR598972_1.fastq.gz | head -n 20
It seems like the 5 Gb I parsed from NCBI might be the forward and reverse combined, because none of these are 5 Gb: most are 3.4 Gb and two are 6.2 Gb.
Regardless of that confusion, assuming we want to get to 3.5 Gb (in real file size), which is close to the median real size of the "5" Gb files, the following command works to subset down to 3.5 G:
gunzip -c data/SRR6507280_1.fastq.gz | head -n 175000000 | gzip > test_data/SRR6507280_3.5gb_test.fq.gz
Provenance for that calculation of the n value 175000000:
gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0M Oct 4 04:33 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 1000000 | gzip > test_data/SRR6507280_1_test.fq.gz
20M Oct 4 04:34 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 10000000 | gzip > test_data/SRR6507280_1_test.fq.gz
199M Oct 4 04:37 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0G Oct 4 04:57 test_data/SRR6507280_1_test.fq.gz
To count the individual files: gunzip -c test_data/SRR6507280_3.5gb_test.fq.gz | wc -l
Could make a bash script that, for each sample in the list: downloads the sample, counts the number of lines, writes that plus the sample name to a file, then deletes the downloaded file (a minimal sketch follows).
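That minimal sketch, assuming the samples are fetched with iget (the data-store path is a placeholder) and that list.txt holds one accession per line as described further below:

```bash
# For each sample: download, count reads (lines / 4), record, and delete.
while read -r sample; do
    iget -f "/iplant/home/shared/planetmicrobe/${sample}.fastq.gz" .   # placeholder path
    lines=$(gunzip -c "${sample}.fastq.gz" | wc -l)
    printf "%s\t%s\n" "$sample" "$((lines / 4))" >> read_counts.tsv
    rm "${sample}.fastq.gz"
done < list.txt
```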
If we cut at 15 million reads we lose 79 samples: Amazon Plume Metagenomes 20, HOT ALOHA time/depth series 35, Amazon River Metagenomes 13, Tara 2, BATS 4, HOT Chisholm 5.
If we cut at 10 million reads we lose 47 samples: Amazon Plume Metagenomes 16, HOT ALOHA time/depth series 21, Amazon River Metagenomes 8, Tara 2.
If we cut at 7.5 million reads we lose 33 samples: Amazon Plume Metagenomes 14, HOT ALOHA time/depth series 16, Amazon River Metagenomes 3.
If we cut at 5 million reads we lose 28 samples: Amazon Plume Metagenomes 11, HOT ALOHA time/depth series 15, Amazon River Metagenomes 2.
I think I can rarefy to 10^7 reads, because beyond that size you're not guaranteed to get a larger interproscan file, according to the files I've already run. If we want 10^7 reads (10000000) then we need that * 4 for the head -n number = 40000000, with the command:
gunzip -c tmp_data_dir/{sample}.fastq.gz | head -n 40000000 | gzip > data/{sample}.fastq.gz
example commands to build Pfam and NCBItaxon rarefaction curves:
Pfam: cut -f 5 results/SRR1790489_1/step_07_combine_tsv/*.tsv | sort | uniq | wc -l
NCBITaxon: cut -f 5 results/SRR4831663/kraken2/*_report.tsv | sort | uniq | wc -l
Alise thinks 10 million reads as the rarefaction cutoff, and keep the >=5 million read samples.
Downloading and trimming data in /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai
run:
sbatch download_trim.sh
with the list of files in list.txt
formatted like
ERR771007_1
...
modify the bash script to the correct data path after we settle on a cutoff threshold.
Want to get back to the single-threaded version of the pipeline with regard to interproscan. Commit 6ee330c54b1952b8c5e1866a83b9a046941d1f6f is where he added the rule interproscan and bash/run_interproscan.sh. In the previous commits d6f824e7d1c95770cd60389da644dbd1dc9e7975 and 3b9389f7aefc10e1c5f8b7ae048441dba803d89e he adds submit_snakemake.sh. I had it working on a single thread prior to this, so I'll revert back to e1cf3048c6f6e4685680ac1032a36a299a3b6952.
From the first answer here try:
git checkout e1cf3048c6f6e4685680ac1032a36a299a3b6952
Download and trim script: based on my initial test, first job 2021-10-07.10:47:52, last job 2021-10-07.17:49:40, i.e. 7 hours for 162 NCBI Gb. That's 23 Gb/hr; round down to 20 to be safe, * 48 hr = 960; round down to 950 to be safe.
Conclusion: submit <=950 Gb in a 48 hr job.
Nope: redid the calculation at more like 17 Gb/hr, so 55 hours for the same chunk size. So I changed the jobs to 72 hours to be safe.
Regarding test_10_million: it finished shortly after being resubmitted (after not finishing in the first 24 hours). I think I'm safe to set the job time to 48 hours.
Possible freshwater lake metagenomes:
Best option: https://www.ncbi.nlm.nih.gov/bioproject/664399 28 metagenomes almost all are >= 5M spots, filter size 0.22, all freshwater from Lake Superior. Illumina NovaSeq 6000.
https://www.ncbi.nlm.nih.gov/bioproject/479434 Maybe but metadata is pretty unclear some seem like sediments.
https://www.ncbi.nlm.nih.gov/bioproject/51219 only 5 illumina
https://www.ebi.ac.uk/metagenomics/studies/MGYS00002504 only 8 though ncbi: https://www.ncbi.nlm.nih.gov/bioproject/PRJEB27578
https://www.ncbi.nlm.nih.gov/bioproject/335374 this could work 32 wgs lake metagenomes NO all are < 5M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA400857 urban but not so many
https://www.ncbi.nlm.nih.gov/bioproject/636190 few more great lake sequences
Vents
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB19456 this could work only ~6 metagenomes but all good size and diffuse flow
https://www.ncbi.nlm.nih.gov/bioproject/243235 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB7866 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB15541 many are slightly too small 4M spots I think only the sags are the right size
https://www.ncbi.nlm.nih.gov/bioproject/306467 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA522654 not sure whats up here.
https://www.ncbi.nlm.nih.gov/bioproject/530185 maybe but only 3 samples
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB9204 slightly too small 4M spots
Directory sizes (du -sh .):
step_07_combine_tsv: 189M
step_06_get_orfs: 5.4G
step_05_chunk_reads: 361M
step_04_get_gene_reads: 1.8G
step_02_qc_reads_with_vsearch: 884M
step_01_trimming: 1.8G
bowtie: 610M
bracken: 628K
kraken2: 125M
CODATA working group meeting:
Other system similar to UOM https://umis.stuchalk.domains.unf.edu/
Links between BIPM and the #CODATA Task Group on Fundamental Constants and Digital Representation of Units, mentioned by Joachim Ulrich, of Measure: see MOU https://www.bipm.org/en/-/2021-10-11-mou-bipm-codata liaison page https://www.bipm.org/en/liaison-partners/codata-tgfc and TGFC page https://www.bipm.org/en/hosting/codata-tgfc
Scientific Vocabularies: needs, status, validity, governance and sustainability slides
SciDataCon copy of Catalogue of Vocabulary tools
Earth, Space and Environmental Sciences Data Vocabulary, Ontology and Semantic Repositories/Services
https://adyork.github.io/intro-to-apis-taxamatch/
3 scripts (a combined sketch follows below):
1) For step 2: grep > for the number of reads that passed QC, then | wc -l
2) For step 4: get the number of predicted ORFs from the .faa files, again grep for >
3) gunzip -c SRR1786608_1.fastq.gz | wc -l on the data files to see how many reads we actually have (divide by 4); we should drop samples that reduce too much between this and step 2
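A rough sketch combining those three counts into one summary row per sample; the result paths mirror the pipeline layout noted elsewhere here, but the exact file names inside step_02 and step_04 are assumptions to verify:

```bash
# Raw reads, reads after QC, and predicted ORFs for one sample, appended as a TSV row.
sample=SRR1786608_1
raw=$(( $(gunzip -c data/${sample}.fastq.gz | wc -l) / 4 ))
qc=$(grep -c "^>" results/${sample}/step_02_qc_reads_with_vsearch/${sample}_trimmed_qcd.fasta)
orfs=$(grep -c "^>" results/${sample}/step_04_get_gene_reads/${sample}_trimmed_qcd_frags.faa)
printf "%s\t%s\t%s\t%s\n" "$sample" "$raw" "$qc" "$orfs" >> pipeline_summary.tsv
```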
gunzip -c ../old/planet-microbe-functional-annotation/data/ERR315856_1.fastq.gz | head -n 4000 | gzip > data/mini_test.fastq.gz
gunzip -c data/mini_test.fastq.gz | wc -l
Matt added a new commit to planet-microbe-functional-annotation. It adds a loop to the run_cmd function in pipeline/utils.py, hence it should keep retrying until it doesn't error out. Testing it out with windfall_test_2
start wed 11.24 ~9:10am CET.
Testing step4 frag counts against step7 final merge sizes:
Small 4M step7 file:
grep -c "^>" ERR873967_1_trimmed_qcd_frags.faa
5554493
Larger 300M step7 file:
grep -c "^>" SRR9178330_1_trimmed_qcd_frags.faa
8973982
Medium 181M step7 file:
ERR594323_1
grep -c "^>" ERR594323_1_trimmed_qcd_frags.faa
8982997
Cutoff at 9000 NCBITaxon richness
Keep raw counts for GO/NCBITaxon and always query for Number of Reads Initial, Number of Reads after QC, Number of ORFs.
Globus to push to Google Drive: https://public.confluence.arizona.edu/display/UAHPC/Transferring+Files#TransferringFiles-GridFTP/Globus
Create two endpoints, one on Drive and one on the HPC, and use the interface to transfer files between the 2. Use the Kai_Blumberg shared drive.
See also https://public.confluence.arizona.edu/plugins/viewsource/viewpagesrc.action?pageId=86409320 (basically the same instructions); worked to transfer between UA HPC and Google Drive.
Trying to run the GO processing script jobs slurm-2759948.out, slurm-2759949.out, slurm-2759950.out: all got memory errors.
Figured it out: it's the samples with step 7 file sizes >=441M that are the problem. Hence the following is the list of the large samples sorted by step 7 size. These 27 samples won't run on a regular Puma compute node, so either use a high-memory node or ship them off to an Atmosphere VM instance. See the UA HPC Running Jobs with SLURM page, section on High Memory Nodes.
Can probably use these sbatch headers:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --constraint=hi_mem
SRR5002414
SRR5002356
SRR5002391
SRR5002327
SRR5002349
SRR5002365
SRR4831666
SRR5002309
SRR4831664
SRR4831655
SRR4833064
SRR4833053
SRR4833087
SRR4831657
SRR4833056
SRR4833084
SRR5123275
SRR5123277
SRR4833077
SRR4833081
SRR4831662
SRR6507279_1
SRR5720286_1
SRR5123274
SRR5720305_1
SRR5002376
SRR5720260_1
sbatch go_high_mem.sh parser_lists/high_mem/list2.txt
Submitted batch job 2766936
sbatch go_high_mem.sh parser_lists/high_mem/list3.txt
Submitted batch job 2766937
sbatch go_high_mem.sh parser_lists/high_mem/list4.txt
Submitted batch job 2766938
sbatch go_high_mem.sh parser_lists/high_mem/list5.txt
Submitted batch job 2766939
sbatch go_high_mem.sh parser_lists/high_mem/list6.txt
Submitted batch job 2766940
sbatch go_high_mem.sh parser_lists/high_mem/list7.txt
Submitted batch job 2766941
sbatch go_high_mem.sh parser_lists/high_mem/list8.txt
Submitted batch job 2766942
sbatch go_high_mem.sh parser_lists/high_mem/list9.txt
Submitted batch job 2766943
Later also:
sbatch interpro.sh
Submitted batch job 2767027
PMO commits to redo :(
Author: kaiiam <[email protected]>
Date: Thu Dec 16 15:58:55 2021 +0100
Add ENVO back to regular imports line
commit 8c23bc1dc6258e2cc563838355d9ae72d48d7fa6
Author: kaiiam <[email protected]>
Date: Thu Dec 16 15:26:10 2021 +0100
Obsolete PMO concentration of nitrate and nitrite in water
commit 19fa89056caf8c1a3eb71516e6b4ba004ede5db2
Author: kaiiam <[email protected]>
Date: Thu Dec 16 15:15:18 2021 +0100
Depricate PMO conc urea in water in favor of ENVO term
commit 28200812a0b8d98e6720b152400dda995b2ad6ae
Author: kaiiam <[email protected]>
Date: Thu Dec 16 15:01:44 2021 +0100
Depricate PMO salinity and replace with PATO term
commit 679993f37042a873016028d735b3b544bdbe8c10
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:56:03 2021 +0100
Run make pato import
commit ebb0c3db4b750c18ccc0c192e4e403f3532768d0
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:55:47 2021 +0100
Import PATO salinity
commit 564e6cc93129fca17e0c69fe103f251727caa106
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:51:21 2021 +0100
Move envo outside regular imports
commit 9c8b55b1b3d14cdbd82e15cf703cf8e38423b426
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:48:47 2021 +0100
Run imports and make env feature non obsolete
commit 3dce14982594560ead323745a6c26ce5e31d484c
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:48:20 2021 +0100
Run make imports
commit 8f4e0fb28895d023921f1873d243864866498790
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:47:57 2021 +0100
Run make imports
commit c5b3e3f2d67128e696d842ebaeba613dc3429330
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:47:48 2021 +0100
Run make imports
commit 3d208ddf81e4651467371e2441cbb5176b34a686
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:47:42 2021 +0100
Run make imports
commit 1fad55722031a2536ecbba8e792ab46ddf8e9db1
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:47:35 2021 +0100
Run make imports
commit f400357adfec4c69a2ec35ab52a979e289526c24
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:47:18 2021 +0100
Run make imports
commit ba797ad91bae55b0e8e2466b1f29a207667421b6
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:33:04 2021 +0100
Update purl for concentration of dioxygen in liquid water
commit 92c06f904024a6f8a0cb7b30c354a2c5e0d2b745
Author: kaiiam <[email protected]>
Date: Thu Dec 16 14:21:06 2021 +0100
Fix PURLs with extra pmo.owl/