2021_log
ESIP
ESSDIVE id metadata and supplemental
PM meeting:
get irods iinit
use a script like what Alise has to pull down datasets to UA-hpc
Create a script to benchmark the datasets: sanity checks and quality control in general. Flag files that are bad, look weird, or are not annotated; flag a dataset if >X% of reads are human, or if we completely lost a sample or got no annotation.
Should be a protocols.io link for irods, or just see Alise's example script.
Submit script and run script: the latter runs on the HPC and executes the Python code. Look at Matt's pipeline as an example of how to run it, and at the iget commands from Alise. The submit script will run a for loop to submit multiple jobs with args; each one executes the run script, which loads singularity, the Python exe file, etc. Split the list into chunks and run an array command.
Create submit and run scripts to download data for processing.
Create a script to sanity check the data coming out of the pipeline: QC leaves too few reads, too few annotations. Want a summary table of what went wrong; maybe pull some of Matt's intermediate steps and do additional checks.
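A minimal sketch of the sanity check described above, in Python. The field names, the `SampleSummary` record, and both thresholds are placeholders I made up for illustration (the notes only say ">X%"), so treat this as a starting point rather than the real QC script.

```python
from dataclasses import dataclass


@dataclass
class SampleSummary:
    """Per-sample counts; the field names are hypothetical."""
    sample: str
    reads_initial: int
    reads_after_qc: int
    human_reads: int
    annotations: int


def flag_sample(s, max_human_fraction=0.5, min_qc_fraction=0.1):
    """Return a list of problems found for one sample (empty list = looks sane).

    Both thresholds are assumed values, not ones agreed on in the meeting.
    """
    if s.reads_initial == 0:
        return ["no reads"]
    problems = []
    if s.human_reads / s.reads_initial > max_human_fraction:
        problems.append("too many human reads")
    if s.reads_after_qc / s.reads_initial < min_qc_fraction:
        problems.append("QC left too few reads")
    if s.annotations == 0:
        problems.append("no annotations")
    return problems
```

Running `flag_sample` over every sample and writing the non-empty results to a TSV would give the "summary table of what went wrong".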
https://github.com/ontodev/cogs https://github.com/INCATools/ontology-development-kit/releases/tag/v.1.2.26 https://github.com/ontodev/robot/blob/master/CHANGELOG.md
ESIP marine data Vocabularies Special Session
CF conventions: climate model data standard to deal with interoperability, based on the NetCDF file format. The conventions are meant to provide full spatiotemporal and spectral metadata, up to 32 dimensions on a variable, and standard names to show what is in the variable.
GCMD keywords http://gcmd.arcticlcc.org/ facepalm
seadatanet NVS exchange formats, platforms, measurement devices etc. Similar stuff. NERC L22 for instruments.
Pier's mapping of ENVO to NERC https://github.com/EnvironmentOntology/envo/issues/731
https://github.com/tanyagupta/codeblogs/blob/development/githubHowTos/howto.md
Meeting with Alise:
iinit
ERROR: environment_properties::capture: missing environment file. should be at [/home/u19/kblumberg/.irods/irods_environment.json]
One or more fields in your iRODS environment file (irods_environment.json) are
missing; please enter them.
Enter the host name (DNS) of the server to connect to: data.cyverse.org
Enter the port number: 1247
Enter your irods user name: kblumberg
Enter your irods zone: iplant
Those values will be added to your environment file (for use by other iCommands) if the login succeeds.
irods docs: https://docs.irods.org/4.1.0/ https://cyverse-data-store-guide.readthedocs-hosted.com/en/latest/step2.html#icommands-first-time-configuration
check running jobs qstat -u kblumberg
example of how to delete a job by its number: qdel 3843824.head1
have numbersplit be 50
have lists of 500 at a time (to submit 4x). Randomize these. If this doesn't work we can try groups of 200 files at a time? (would need to submit 10x)
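A quick sketch of the randomize-and-split step, in Python. The function name and the fixed seed are my own choices for illustration; the batch size of 500 comes from the note above.

```python
import random


def make_batches(samples, batch_size=500, seed=0):
    """Shuffle the sample list and split it into batches of batch_size.

    A fixed seed keeps the split reproducible between runs.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]
```

Each batch could then be written to its own list file and passed to one submit call.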
FAIR digital SI: EOSC has an ontology mapping framework (Dublin Core, DDI, schema.org)
I-ADOPT
Seadatanet
Sea data cloud? for thesis.
Prefer cc0 over cc-by for licenses
OBO dashboard, OMO (useful subset of IAO).
Papers relevant to PM paper 2: https://www.frontiersin.org/articles/10.3389/fmars.2019.00440/full (DEF CITE), https://peerj.com/articles/cs-110/, https://content.iospress.com/articles/information-services-and-use/isu824 (be good too).
For my committee: 2 pager about getting back on track:
page one
explanation of material going into paper 2: ontology contributions supporting physicochemical data, ontology choices, CI choices (frictionless data specification, analogous to NetCDF), etc. Then use Planet Microbe to ask and answer/highlight use cases for discovery of physicochem data across environments.
page two
Plan for integrating that data with the functional and genomic annotation and getting it all into an RDF triplestore to ask and answer deeper questions about the distribution of taxa and genes, correlating that with the physicochem and env types.
Include timeline when the data should be available.
Some quick Q's:
Do Synechococcus and Prochlorococcus vs depth (or another parameter); get files from 2 projects.
Acidity of water: lots of materials and features to compare as it's so many projects.
Redfield ratio (PO4, NO3): 1000 samples, good amount of features and some materials. See what features/materials/biomes deviate from the standard Redfield ratio.
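For the Redfield question, a small Python sketch of the deviation measure I have in mind. It uses the canonical N:P of 16:1 (molar); the function name and the choice of a relative deviation are my assumptions, and nitrate and phosphate have to be in the same molar units.

```python
REDFIELD_N_TO_P = 16.0  # canonical molar N:P from the Redfield ratio (C:N:P = 106:16:1)


def redfield_deviation(nitrate, phosphate):
    """Relative deviation of a sample's N:P from 16:1.

    0.0 means exactly Redfield, negative means relatively N-poor.
    Returns None when phosphate is zero or negative (ratio undefined).
    """
    if phosphate <= 0:
        return None
    return (nitrate / phosphate - REDFIELD_N_TO_P) / REDFIELD_N_TO_P
```

Grouping these values by feature/material/biome would show which ones sit furthest from the standard ratio.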
From Matts example PM API calls: https://github.com/hurwitzlab/planet-microbe-app/blob/master/README.md
Open Chrome developer tools and select “Network” tab
option+command+c
I think I should be able to play around with this to set up the API calls I’ll want to make to build my RDF triple store. I could do them all manually with the UI’s search interface (downloading csvs then adding those to the triplestore), but I think it would be cooler, more automated and more reproducible to build the triple store from the json outputs of API calls. That way it’s showing that someone else could leverage the PM API to do something totally different with the data. So the data store would be built from a bunch of calls like:
See more examples in ~/Desktop/scratch/pm_query
curl -s -H 'Content-Type: application/json' -d '{"http://purl.obolibrary.org/obo/ENVO_09200014":"[0,10]","limit":"40"}' https://www.planetmicrobe.org/api/search | jq .sampleResults > temp.json
Then I could figure out a “correct” way of converting these json products to rdf, using something like ShEx or SHACL.
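To make the API calls scriptable, the curl call above could be reproduced in Python roughly like this. It is only a sketch: the URL, payload shape, and `sampleResults` key are copied from the curl/jq example, while the function names are mine and I have not checked the response format beyond that one key.

```python
import json
import urllib.request

SEARCH_URL = "https://www.planetmicrobe.org/api/search"


def build_search_request(term_iri, value_range, limit=40):
    """Build the same POST request as the curl example (nothing is sent yet)."""
    payload = {
        # The API example passes the range as a string such as "[0,10]".
        term_iri: json.dumps(value_range, separators=(",", ":")),
        "limit": str(limit),
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def fetch_sample_results(request):
    """Send the request and return the part that `jq .sampleResults` selects."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["sampleResults"]
```

Usage would be `fetch_sample_results(build_search_request("http://purl.obolibrary.org/obo/ENVO_09200014", [0, 10]))`, with the result saved as JSON for the later RDF conversion.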
ESIP Marine Data meeting
Shawn Smith to Everyone (8:22 PM)
Kai - How would the units interchange system differ from software like Udunits?
Add comments to James Units for OBO.
Citation for pato https://academic.oup.com/bib/article/19/5/1008/3108819
1:1 with bonnie
Remove some bad samples
Look at the graph by families of organisms
show overall trends big patterns to be followed up on in paper 3.
Doesn't have to be all organisms in all graphs, just a few interesting stories.
Might not need a high level pattern with all data
Not recreating a phylogeny; instead taking an existing one, e.g. Synechococcus, and putting it together with our physicochem and observation data.
some example figures in this paper: https://www.pnas.org/content/112/17/5443.full
co-occurrence networks? samples organized by depth
Remove bad samples
Try CCA but subset by groups of taxa: are there patterns for groups of taxa? Maybe gives insight into the bigger summary graph.
Maybe instead try with correlations all species against all chemistry look for interesting patterns then pick a couple for a prettier figure.
Cellfie Protégé plugin, similar to ROBOT?
Michael DeBellis Protege 5 tutorial apparently very good.
Comparison of normalization methods for the analysis of metagenomic gene abundance data def read before planet microbe paper 3
BCODMO/NMDC meeting https://microbiomedata.github.io/nmdc-schema/MetaproteomicsAnalysisActivity/ https://proteinportal.whoi.edu/ https://lod.bco-dmo.org/browse-all/
EBI Ontologies CoP Webinar
Website: https://bigdata.cgiar.org/communities-of-practice/ontologies/ LinkedIn: https://www.linkedin.com/groups/13707155/ Youtube: https://www.youtube.com/channel/UCpAupP30tbkcFwMBErFdDhA/videos Newsletter: https://cgiar.us15.list-manage.com/subscribe?u=59a7500ef5c94dd50e2b9e2fb&id=454e94d3f2
OLS: ontology lookup service
ZOOMA: to mine terms from text
OXO: ontology mapping. Confidence is just count of mappings (pretty rudimentary) but they'll have this follow best practices set by SSSOM.
SSSOM: mapping standard
WEBINARS: July 27: Doing a Governance Operational Model for Ontologies, GOMO with Oscar Corcho, Jose Antonio Bernabe Diaz and Edna Ruckhaus from the Technical University of Madrid and Alexander Garcia from BASF.
Register: https://lnkd.in/gb2cG2h
- September 7: Neo4J as a backend DB for web protégé and Towards a plugin based architecture for web protégé with Matthew Horridge and Mark Musen from Stanford University.
Register: https://lnkd.in/gU92u96
- September 21: Enterprise Knowledge Graph with Alexander Garcia from BASF.
Register: https://lnkd.in/gniS2Mm
paper: Matching sensor ontologies through siamese neural networks without using reference alignment
Notes from PM meeting:
TODO: add more test examples in the 10 Gb range to this for Matt to try and run: https://github.com/hurwitzlab/planet-microbe-functional-annotation/blob/master/test_data/get_test_data.sh
file handling
edit config based on the individual files and their needs
#rm results/interproscan.txt results/killed_interproscan.txt to start up lookup server
Might want to consider another Kraken Database
configfile: "config/config.yml" put the sample info into this file.
config params to train:
vsearch_filter_maxee: 20
vsearch_filter_minlen: 75
frag_train_file: "illumina_10"
adapter_fasta: "data/configs/TruSeq3-PE.fa"
+ read length (if Matt can't automate)
Interesting links: https://gtdb.ecogenomic.org/ https://thispersondoesnotexist.com/. https://medium.com/analytics-vidhya/apriori-algorithm-in-association-rule-learning-9287fe17e944
My file with the list of PM analysis is ~/Desktop/software/planet_microbe/planet-microbe-semantic-web-analysis/job_listpmo_samples_unique.xlsx
Finalizing PM database for paper2.
Currently, Matt's scripts/load_datapackage_postgres.py deals with lat/long constraints when loading into the Planet Microbe DB. Alise found this example https://data-blog.gbif.org/post/frictionless-data-and-darwin-core/ for using frictionless data with Darwin Core, which sets lat/long constraints:
Using Matt's validation:
conda info --envs
manage conda environments
conda activate pm
Validate:
scripts/validate_datapackage.py ../planet-microbe-datapackages/OSD/datapackage.json
scripts/validate_datapackage.py ../planet-microbe-datapackages/Tara_Oceans/datapackage.json
conda deactivate
#at the end
Script to validate PM datapackages for usage see readme of the repo.
{
"name": "year",
"type": "integer",
"rdfType": "http://rs.tdwg.org/dwc/terms/year",
"constraints": {
"required": true,
"minimum": 1000,
"maximum": 2050
}
},
That example makes use of the goodtables script: https://goodtables.readthedocs.io/en/latest/
example call:
goodtables MiturgidaeBE_DP/data_package.json
pip install goodtables
# install
goodtables OSD/datapackage.json
#to run it works.
{
"name": "Latitude",
"constraints": {
"required": true,
"minimum": -90,
"maximum": 90
},
...
added this and run goodtables OSD/datapackage.json
it works! Ran Matt's validate_datapackage.py on the constraint modified example above and it works too. I think I'm safe to add this for all Lat/Longs in all DPs.
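To add the constraint to all lat/longs in all DPs without hand-editing, something like the following Python sketch could work. It assumes the standard datapackage layout (resources, then schema, then fields) and that the fields are named exactly "Latitude" and "Longitude"; only "Latitude" is confirmed by the example above, so the Longitude name and its -180 to 180 range are assumptions to check per package.

```python
import copy

# Field name -> (minimum, maximum). "Longitude" is an assumed field name.
LAT_LONG_LIMITS = {"Latitude": (-90, 90), "Longitude": (-180, 180)}


def add_lat_long_constraints(datapackage):
    """Return a copy of the datapackage dict with lat/long constraints added."""
    updated = copy.deepcopy(datapackage)
    for resource in updated.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            limits = LAT_LONG_LIMITS.get(field.get("name"))
            if limits is None:
                continue
            # Keep any existing constraints and add/overwrite these three.
            field.setdefault("constraints", {}).update(
                {"required": True, "minimum": limits[0], "maximum": limits[1]}
            )
    return updated
```

The result can be written back with `json.dump` and then re-checked with goodtables and validate_datapackage.py as above.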
https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a
ESIP marine meeting
https://github.com/ESIPFed/marinedata-vocabulary-guidance
from Pier: https://www.tdwg.org/community/gbwg/MIxS/
https://search.oceanbestpractices.org/
sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo -l -s -d --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
2021-08-02 09:59:40,481 ERROR (CommandRunnerBase:213) Could not find an OWLObject for id: '-s'
## Showing axiom for: null
Exception in thread "main" java.lang.NullPointerException
at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3418)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
make: *** [Makefile:431: mirror/ncbitaxon.owl] Error 1
tried again without the -l -s -d flags and got:
sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
make: *** [Makefile:223: imports/ncbitaxon_import.owl] Error 1
rm imports/ncbitaxon_terms_combined.txt
From Adam: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html which links to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
example from Adam: ssh -i /Users/.../adam.pem [email protected]
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
DIDN'T WORK
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
Also tried this didn't work.
ssh -i ~/Desktop/software/kai-ontology.pem [email protected]
did all with chmod 600 and 400 to the pem file.
Maybe this is due to the instance not being started? https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/start-instances.html has info on starting an instance; need the aws command, perhaps from awscli
example command:
aws ec2 start-instances --instance-ids i-1234567890abcdef0
Try install of aws from https://www.youtube.com/watch?v=BNH4i7CQ4Oc
aws ec2 start-instances --instance-ids i-1234567890abcdef0
but to do this I'll need the instance id.
AWS management console: https://aws.amazon.com/console/ -> login -> IAM user
try entering: 504672911985
which I got from the command line. But I don't have a password, secret key doesn't work.
From aws-sec-cred-types.html they have the link: https://account_id_or_alias.signin.aws.amazon.com/console/
tried https://504672911985.signin.aws.amazon.com/console/
but that just redirects to the signin for which I don't have a password.
TODOs for PM paper: check formatting, reference formatting, push code, respond to comments from Elisha and ED, prepare Cover letter.
Cancel Student Health Insurance Plan https://health.arizona.edu/sites/default/files/Cancellation%20Guide%20SHIP.pdf
Meeting with Alise about getting Matt Miller's pipeline running
interactive
source ~/.bashrc
conda env create -f kraken2.yml
conda env create -f bracken.yml
conda env create -f pm_env.yml // this failed make a new pm_env.yml with snakemake
# steps to create pm_env again do this in interactive
conda create -n pm_env
conda activate pm_env
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
#modify the cluster.yml and config.yml files
#bowtie index folder is in:
/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/planet-microbe-functional-annotation/data
# copy that into my version of the git repo so that the whole thing is portable
# submit the main snakemake job which will submit other jobs
# need to make sure this isn't submitting too many
sbatch run_snakemake.sh //make sure to update the path in this to the repo
squeue -u kblumberg
scancel job_ID_number
/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai
Did the above, which worked except for the pm_env; instead it made a snakemake env, so I changed run_snakemake.sh to activate snakemake instead of pm_env.
Prepared it with 1 Amazon sample, SRR4831664, from the example get_test_data.sh. Tried snakemake -n and it seemed to be working, then ran sbatch run_snakemake.sh and got Submitted batch job 2037818.
started the job ~ 14:15 sep 10th. 6.5 hours out still going no outputs yet.
17 hours 42 mins still going no outputs.
Job never got past the bowtie step. See job output https://metrics.hpc.arizona.edu/index.php#job_viewer?realm=SUPREMM&recordid=0&jobid=12767280&infoid=0
to the snakemake conda env I added:
conda install -c conda-forge biopython
after the qc pipeline couldn't load biopython and failed
https://github.com/turbomam/pizzacob/tree/main http://dashboard.obofoundry.org/dashboard/index.html
FOODON workshop the seafood workflow from annosh is pretty neat with the E3E lookup tool to get the labels. Could be relevant to BCODMO.
https://oboacademy.github.io/obook/tutorial/icbo2021/
Meeting with Alise:
ERR1234
GO:0055114 5705414
GO:0008152 3836620
GO:0006810 1409368
#_annotations (SUM column)
ERR1234
number_basepairs 12322
Last rule: which includes a python script for each of these
1) kraken with only the columns we want (taxID and the number of reads unique to that node)
2) InterProScan counts
3) GO counts (just drop any duplicated interpro2go annotations that are in the same MF BP or CC family alphabetise and drop the 2nd one)
4) qc_scan log: # BP in the fasta from step_02_qc_reads_with_vsearch (use the biopython library for fasta to give the length of reads)
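For item 4, a rough Python sketch of counting reads and base pairs in a FASTA. The note suggests biopython; this version uses plain string handling instead so it has no dependencies, and the function name is made up.

```python
def fasta_stats(lines):
    """Count records and total base pairs in FASTA text.

    `lines` is any iterable of lines, e.g. an open file handle.
    Returns (number_of_reads, number_of_basepairs).
    """
    n_reads = 0
    n_basepairs = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_reads += 1          # header line starts a new record
        else:
            n_basepairs += len(line)  # sequence line (may be wrapped)
    return n_reads, n_basepairs
```

Called as `fasta_stats(open(path))` on the step_02_qc_reads_with_vsearch output, it would give the number_basepairs value for the summary.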
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh -appl Pfam -i results/SRR4831664/step_05_chunk_reads/SRR4831664_trimmed_qcd_frags_2047.faa -b results/SRR4831664/step_06_get_orfs/SRR4831664_trimmed_qcd_frags_2047_interpro -goterms -iprlookup -dra -cpu 4
was working when run alone but not in pipeline.
repo from Alise to download from irods https://github.com/aponsero/download_irods
meeting:
for paper: first question is whether the benchmarking data look sane, then a couple of core stories we can address.
Run amazon datasets first.
va on hpc shows allocation remaining.
Notes for running the big PM analysis on UA HPC
chunk sizes I think in units of MB
files bash/run_interproscan.sh (chunk size) and run_interproscan_chunk.sh (--time=)
200000 works but spawns lots of nodes perhaps 100 per sample. uses at most 2 hours
20000 spawns way too many jobs
1000000 too large gets a memory error, but only spawns 40 jobs for the 2.9 GB file. SRR5720248_1
Try 500000 with 6 hours. spawned 81 jobs for the same 2.9 GB file. testing see `err/interproscan.base=SRR5720248_1.err`
may or may not be working; check one of the chunk files, e.g. `err/ips_chunk40_SRR5720248.err`
seems to be working
~3.5 hours worked.
Started next batch ~3pm with 16.1G of samples; it started 493 jobs, so the heuristic is pretty good. Finished at ~10pm. However, it didn't work because some of the chunks timed out: I had them set to 6 hours and they needed more time. I'll set it to 12 hours just to be safe. Deleted the step5 and 6 folders just to be safe, as it crashed and I'm not sure if snakemake will pick it up after the failure with the parallelization. Now it's at 12 hours per chunk job. Started it again at 10:10pm Sep 30th.
#run in cluster
snakemake --cluster "sbatch -A {cluster.group} -p {cluster.partition} -n {cluster.n} -t {cluster.time} -N {cluster.N} --mem={cluster.m} -e {cluster.e} -o {cluster.o}" --cluster-config config/cluster.yml -j 30 --latency-wait 30 --until `rules interproscan?`
OR to run up to step 4 do some commenting out:
rule all:
input:
expand("results/{sample}/bowtie/{sample}.fastq.gz", sample=config["samples"]),
#expand("results/{sample}/step_02_qc_reads_with_vsearch/{sample}_trimmed_qcd.fasta", sample=config["samples"]),
THIS #expand("results/{sample}/step_07_combine_tsv/{sample}_trimmed_qcd_frags_interpro_combined.tsv", sample=config["samples"]),
expand("results/{sample}/bracken/{sample}_profiles.txt", sample=config["samples"]),
"results/killed_interproscan.txt",
Rule for interproscan and start and stop server. can just comment out.
https://www.biostars.org/p/265420/
2nd PM Paper: higher taxonomic resolution better with ecology/habitat. Prepare 2 version of the docs file one with the new changes and one with the original with track changes. Papers to add from review: https://doi.org/10.1111/1462-2920.15173 and https://doi.org/10.1007/s00248-020-01526-5 Alise's example response to reviewers: https://docs.google.com/document/d/17uT6JbOoyAj6tRtceHk46ZfW0sHK8t6W9KcWdPicKDI/edit#heading=h.5qbn069bmn46
Regarding the step 7 configurations: I tried 500000 for 6 hours but several didn’t finish in that time. So I upped it to 12 hours for the chunk jobs and ran it last night. It seems to have worked: all the step 7's are there and the job completed successfully. However, when I cat out the chunk files I get a few possible OOMs:
(puma) (base) [kblumberg@junonia err]$ cat ips_*
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 34401 Killed "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 14:52:43.646263 33724 slurmstepd 0x2b834568b340: error: Detected 1 oom-kill event(s) in StepId=2181585.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 54774 Killed "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 16:22:45.860395 54039 slurmstepd 0x2b489e9d8340: error: Detected 2 oom-kill event(s) in StepId=2181737.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
The files say they were successful for every chunk, but I don’t think that accounts for the OOMs. It was only 3 events for those data files, so I don’t think it’s worth re-running them, but I’ve upped the memory to 7gb. Resubmitted the next job at 8:08am CET.
From Simon and OGC: SOSA: A lightweight ontology for sensors, observations, samples, and actuators https://w3c.github.io/sdw/ssn/ Spatial Data on the Web Interest Group https://www.w3.org/TR/vocab-ssn/ https://github.com/opengeospatial/om-swg
Interpro job tests:
Tried 875000 chunk size: works for 50 nodes and worked with 12gb mem.
Testing 20 nodes with 16gb memory didn't work. Try again with more mem.
Try 20 nodes (2000000 chunk size) with 50gb, again with sample SRR5720221_1 -> one sample got an out of memory error.
Try again with fewer nodes: try 40 nodes (1000000 chunk size) with 50gb, again with sample SRR5720221_1 -> worked! ~3.5 hours
Try 30 nodes, aka 1125000 chunk size, with 50gb with sample SRR5720225_1 -> made 39 nodes, not 30? First time this breaks the pattern from before, but it worked. Proceed with this chunk size, assuming it's 40 nodes to be safe; then by my calculation we can do 35Gb at a time.
Reducing sample sizes:
gunzip -c data/SRR5720300_1.fastq.gz | head -n 100 | gzip > test_data/SRR5720300_1_mini.fq.gz
Looking at the "5"Gb samples from my parsed sample list:
BATS -> 150bp reads
3.4G SRR5720275_1.fastq.gz //gunzip -c data/SRR5720275_1.fastq.gz | head -n 10
3.4G SRR5720249_1.fastq.gz
3.2G SRR5720285_1.fastq.gz
HOT Chisholm -> 150bp reads
3.3G SRR5720293_1.fastq.gz //gunzip -c data/SRR5720293_1.fastq.gz | head -n 10
3.4G SRR5720302_1.fastq.gz
HOT ALOHA time/depth series -> 150bp reads
3.4G SRR9178068_1.fastq.gz //gunzip -c data/SRR9178068_1.fastq.gz | head -n 10
3.2G SRR9178368_1.fastq.gz //gunzip -c data/SRR9178368_1.fastq.gz | head -n 10
3.2G SRR9178503_1.fastq.gz
6.2G SRR5002405.fastq.gz //gunzip -c data/SRR5002405.fastq.gz | head -n 10
6.2G SRR5002321.fastq.gz
Tara -> ~100bp reads
3.2G ERR599134_1.fastq.gz //gunzip -c data/ERR599134_1.fastq.gz | head -n 20
3.3G ERR599172_1.fastq.gz
3.4G ERR598972_1.fastq.gz //gunzip -c data/ERR598972_1.fastq.gz | head -n 20
It seems like that 5Gb I parsed from NCBI might be the forward and reverse? because none of these are 5Gb. most are 3.4Gb and two are 6.2Gb.
Regardless of that confusion assuming we want to get to 3.5Gb (in real file size) which is close to the real value median of the “5”Gb files then the following command works to subset down to 3.5G:
gunzip -c data/SRR6507280_1.fastq.gz | head -n 175000000 | gzip > test_data/SRR6507280_3.5gb_test.fq.gz
Provenance for that calculation of the n value 175000000:
gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0M Oct 4 04:33 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 1000000 | gzip > test_data/SRR6507280_1_test.fq.gz
20M Oct 4 04:34 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 10000000 | gzip > test_data/SRR6507280_1_test.fq.gz
199M Oct 4 04:37 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0G Oct 4 04:57 test_data/SRR6507280_1_test.fq.gz
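The scaling behind the 175000000 figure, written out as a small Python sketch. It assumes compressed size grows linearly with line count, which the four test sizes above suggest for this file but which may not hold for other samples; the function name and the rounding to whole FASTQ records are my additions.

```python
def lines_for_target_size(target_gb, calib_lines=100_000_000, calib_gb=2.0):
    """Estimate the head -n value that yields a gzipped file of target_gb.

    Calibrated on 100,000,000 lines -> 2.0G from the test above.
    The result is rounded to a multiple of 4 so no FASTQ record is cut in half.
    """
    lines = target_gb * calib_lines / calib_gb
    return int(round(lines / 4)) * 4


print(lines_for_target_size(3.5))  # 175000000
```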
To count the lines in an individual file: gunzip -c test_data/SRR6507280_3.5gb_test.fq.gz | wc -l
could make a bash script that for the list of samples: downloads the samples, counts the number of lines writes that plus the sample name to a file then deletes the file.
if we cut at 15 million reads we lose 79 samples: Amazon Plume Metagenomes 20, HOT ALOHA time/depth series 35, Amazon River Metagenomes 13, Tara 2, BATS 4, HOT Chisholm 5
if we cut at 10 million reads we lose 47 samples: Amazon Plume Metagenomes 16, HOT ALOHA time/depth series 21, Amazon River Metagenomes 8, Tara 2
if we cut at 7.5 million reads we lose 33 samples: Amazon Plume Metagenomes 14, HOT ALOHA time/depth series 16, Amazon River Metagenomes 3
if we cut at 5 million reads we lose 28 samples: Amazon Plume Metagenomes 11, HOT ALOHA time/depth series 15, Amazon River Metagenomes 2
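The per-project tallies above could be regenerated for any cutoff with a few lines of Python. This is a sketch: the input layout (sample name mapped to a project and a read count) is something I am assuming, not the format of my actual sample list.

```python
from collections import Counter


def samples_lost(read_counts, cutoff):
    """Count, per project, the samples with fewer reads than the cutoff.

    read_counts maps sample name -> (project, number_of_reads);
    this layout is hypothetical.
    """
    return Counter(
        project
        for project, n_reads in read_counts.values()
        if n_reads < cutoff
    )
```

`sum(samples_lost(counts, 10_000_000).values())` then gives the total number of samples dropped at the 10 million cutoff.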
I think I can rarefy to 10^7 reads because after that size you’re not guaranteed to get a larger interproscan file, according to the files I’ve already run. If we want 10^7 reads (10000000) then we need that * 4 for the head -n number = 40000000, with the command:
gunzip -c tmp_data_dir/{sample}.fastq.gz | head -n 40000000 | gzip > data/{sample}.fastq.gz
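A Python equivalent of the head command above, in case the trimming moves into a script. Like `head`, it keeps the first reads in the file rather than a random subsample, so it is truncation and not a statistically random rarefaction. The function name is mine.

```python
import gzip
from itertools import islice


def head_fastq(src, dst, n_reads=10_000_000):
    """Copy the first n_reads records of a gzipped FASTQ to a new gzipped file.

    Each FASTQ record is 4 lines, so this matches head -n (n_reads * 4).
    """
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        fout.writelines(islice(fin, n_reads * 4))
```

Files with fewer than `n_reads` records are copied whole, the same as with `head`.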
example commands to build Pfam and NCBItaxon rarefaction curves:
Pfam: cut -f 5 results/SRR1790489_1/step_07_combine_tsv/*.tsv | sort | uniq | wc -l
NCBITaxon: cut -f 5 results/SRR4831663/kraken2/*_report.tsv | sort | uniq | wc -l
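The same richness count as the `cut | sort | uniq | wc -l` commands above, sketched in Python so it can be called repeatedly on subsampled inputs when building the rarefaction curves. One difference from `cut`: lines with fewer than five tab-separated fields are skipped here instead of contributing an empty or whole-line value. The function name is hypothetical.

```python
def column_richness(lines, column=5):
    """Number of distinct values in a 1-based tab-separated column."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            values.add(fields[column - 1])
    return len(values)
```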
Alise thinks 10 million reads as rarefaction cutoff and keep the >=5 Million read samples.
Downloading and trimming data in /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai
run:
sbatch download_trim.sh
with the list of files in list.txt
formatted like
ERR771007_1
...
modify the bash script to the correct data path after we settle on a cutoff threshold.
Want to get back to the single-threaded version of the pipeline with regard to interproscan. Commit 6ee330c54b1952b8c5e1866a83b9a046941d1f6f is where he added the rule interproscan and bash/run_interproscan.sh. In the previous commits, d6f824e7d1c95770cd60389da644dbd1dc9e7975 and 3b9389f7aefc10e1c5f8b7ae048441dba803d89e, he adds submit_snakemake.sh. I had it working on a single thread prior to this, so I'll revert back to e1cf3048c6f6e4685680ac1032a36a299a3b6952.
From the first answer here try:
git checkout e1cf3048c6f6e4685680ac1032a36a299a3b6952
Download and trim script: based on my initial test (first job 2021-10-07.10:47:52, last job 2021-10-07.17:49:40), it took 7 hours for 162 NCBI Gb, i.e. 23 Gb/hr; round down to 20 to be safe. 20 * 48 hr = 960; round down to 950 to be safe.
Conclusion: submit <=950 Gb in a 48 hr job.
Nope, redid the calc and it's more like 17 Gb/hr, so 55 hours for the same chunk size. So I changed the jobs to be 72 hours to be safe.
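The arithmetic behind those job-time estimates, as a small Python sketch so it can be redone when the throughput changes. The numbers are the ones from the notes above; the function names are made up.

```python
def throughput_gb_per_hour(total_gb, hours):
    """Observed download-and-trim rate."""
    return total_gb / hours


def hours_needed(total_gb, gb_per_hour):
    """Wall time needed for a batch at a given rate."""
    return total_gb / gb_per_hour


rate = throughput_gb_per_hour(162, 7)   # a bit over 23 Gb/hr in the first test
budget_gb = 20 * 48                     # 960 Gb at a cautious 20 Gb/hr over 48 hr
revised = hours_needed(950, 17)         # just under 56 hr at the revised 17 Gb/hr
```

At the revised rate a 950 Gb batch no longer fits in 48 hours, which is why the jobs went to 72 hours.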
Regarding test_10_million: it finished shortly after being resubmitted (after not finishing in the first 24 hours). I think I'm safe to set the job time to 48 hours.
Possible freshwater lake metagenomes:
Best option: https://www.ncbi.nlm.nih.gov/bioproject/664399 28 metagenomes almost all are >= 5M spots, filter size 0.22, all freshwater from Lake Superior. Illumina NovaSeq 6000.
https://www.ncbi.nlm.nih.gov/bioproject/479434 Maybe but metadata is pretty unclear some seem like sediments.
https://www.ncbi.nlm.nih.gov/bioproject/51219 only 5 illumina
https://www.ebi.ac.uk/metagenomics/studies/MGYS00002504 only 8 though ncbi: https://www.ncbi.nlm.nih.gov/bioproject/PRJEB27578
https://www.ncbi.nlm.nih.gov/bioproject/335374 this could work 32 wgs lake metagenomes NO all are < 5M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA400857 urban but not so many
https://www.ncbi.nlm.nih.gov/bioproject/636190 few more great lake sequences
Vents
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB19456 this could work only ~6 metagenomes but all good size and diffuse flow
https://www.ncbi.nlm.nih.gov/bioproject/243235 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB7866 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB15541 many are slightly too small 4M spots I think only the sags are the right size
https://www.ncbi.nlm.nih.gov/bioproject/306467 slightly too small 4M spots
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA522654 not sure what's up here.
https://www.ncbi.nlm.nih.gov/bioproject/530185 maybe but only 3 samples
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB9204 slightly too small 4M spots
step_07_combine_tsv]$ du -sh .
189M
step_06_get_orfs]$ du -sh .
5.4G
step_05_chunk_reads]$ du -sh .
361M
step_04_get_gene_reads]$ du -sh .
1.8G
step_02_qc_reads_with_vsearch]$ du -sh .
884M
step_01_trimming]$ du -sh .
1.8G
bowtie]$ du -sh .
610M
bracken]$ du -sh .
628K
kraken2]$ du -sh .
125M
CODATA working group meeting:
Other system similar to UOM https://umis.stuchalk.domains.unf.edu/
Links between BIPM and the #CODATA Task Group on Fundamental Constants and Digital Representation of Units, mentioned by Joachim Ulrich, of Measure: see MOU https://www.bipm.org/en/-/2021-10-11-mou-bipm-codata liaison page https://www.bipm.org/en/liaison-partners/codata-tgfc and TGFC page https://www.bipm.org/en/hosting/codata-tgfc
Scientific Vocabularies: needs, status, validity, governance and sustainability slides
SciDataCon copy of Catalogue of Vocabulary tools
Earth, Space and Environmental Sciences Data Vocabulary, Ontology and Semantic Repositories/Services
https://adyork.github.io/intro-to-apis-taxamatch/
3 scripts
1) for step 2 to grep > for the number of reads that passed qc then | wc -l
2) For step 4 to get the number of predicted ORFS from the .faa files again grep for >
3) gunzip -c SRR1786608_1.fastq.gz | wc -l for the data files to see how many reads we actually have divide by 4 we should drop samples that reduce too much between this and step 2
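For script 3, a Python sketch of the raw read count and of the "reduced too much" check against the step 2 count. The function names are mine, and no drop threshold is implied since we have not picked one.

```python
import gzip


def count_fastq_reads(path):
    """Number of reads in a gzipped FASTQ (line count divided by 4)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def retained_fraction(raw_reads, qc_reads):
    """Fraction of raw reads that survived QC; 0.0 if there were no raw reads."""
    if raw_reads == 0:
        return 0.0
    return qc_reads / raw_reads
```

The QC count would come from counting ">" headers in the step 2 fasta, as in script 1.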
gunzip -c ../old/planet-microbe-functional-annotation/data/ERR315856_1.fastq.gz | head -n 4000 | gzip > data/mini_test.fastq.gz
gunzip -c data/mini_test.fastq.gz | wc -l
Matt added a new commit to planet-microbe-functional-annotation. That adds a loop to the run_cmd function in pipeline/utils.py, hence it should keep retrying it until it doesn't error out. Testing it out with windfall_test_2
start wed 11.24 ~9:10am CET.
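I have not copied Matt's actual change here; the following is only a generic Python sketch of what a retry loop around a shell command can look like. `run_with_retries` and its parameters are hypothetical and are not the signature of `run_cmd` in pipeline/utils.py. Unlike a loop that retries until success, this one gives up after a fixed number of attempts so a permanently failing step cannot spin forever.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, wait_seconds=0.0):
    """Run a command, retrying on a non-zero exit code.

    cmd is a list of arguments. Returns the CompletedProcess of the first
    successful attempt, or raises RuntimeError after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            time.sleep(wait_seconds)  # brief pause before the next try
    raise RuntimeError(
        f"command failed {max_attempts} times: {' '.join(cmd)}"
    )
```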
Testing step4 frag counts against step7 final merge sizes:
Small 4M step7 file:
grep -c "^>" ERR873967_1_trimmed_qcd_frags.faa
5554493
larger 300M step7 file:
grep -c "^>" SRR9178330_1_trimmed_qcd_frags.faa
8973982
Medium 181M step7 file:
ERR594323_1
grep -c "^>" ERR594323_1_trimmed_qcd_frags.faa
8982997
Cutoff at 9000 NCBITaxon richness
Keep raw counts for GO/NCBITaxon and always query for Number of Reads Initial, Number of Reads after QC, Number of ORFs
Globus to push to Google Drive: https://public.confluence.arizona.edu/display/UAHPC/Transferring+Files#TransferringFiles-GridFTP/Globus create two endpoints, on Drive and on the HPC, then use the interface to transfer files between the 2. Use the Kai_Blumberg shared drive.
https://docs.globus.org/globus-connect-server/v5.4/ version that mentions Google Drive
and https://docs.globus.org/globus-connect-server/v5.4/quickstart/ actually I might not need these the above link has a google drive section.