
2021_log


01.26

ESIP

ESSDIVE id metadata and supplemental

02.04

PM meeting:

get irods iinit

Use a script like the one Alise has to pull down datasets to UA HPC.

Create a script to benchmark/sanity check the datasets: flag bad files or ones that look weird or are not annotated, do quality control in general, and flag a dataset if >X% of reads are human or if we completely lost a sample (got no annotation).

There should be a protocols.io link for iRODS, or just see Alise's example script.

Submit script and run script: the latter runs on the HPC and executes the Python code. Look at Matt's pipeline as an example of how to run it, and at the iget commands from Alise. The submit script will run a for loop to submit multiple jobs with args and execute the run script, which loads Singularity, the Python executable file, etc. Split the list into chunks and run an array command.

Create submit and run scripts to download data for processing.

Create a script to sanity check the data coming out of the pipeline: flag samples where QC leaves too few reads or too few annotations. Want a summary table of what went wrong; maybe pull some of Matt's intermediate steps and do additional checks (see the sketch below).
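A minimal sketch of what such a sanity-check script could look like in Python; the results layout, file names, and thresholds here are assumptions, not the actual pipeline outputs:

```python
# Hypothetical post-pipeline sanity check: paths, file layout, and thresholds are
# placeholders and would need to match the real pipeline output structure.
import csv
import gzip
from pathlib import Path

MIN_READS = 1_000_000       # assumed threshold: flag samples with too few QC'd reads
MIN_ANNOTATIONS = 10_000    # assumed threshold: flag samples with too few annotations

def count_fastq_reads(path: Path) -> int:
    """Count reads in a gzipped FASTQ (4 lines per read)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

def count_annotation_rows(path: Path) -> int:
    """Count annotation rows in a TSV output file."""
    with open(path) as handle:
        return sum(1 for _ in handle)

def check_sample(results_dir: Path, sample: str) -> dict:
    qc_fastq = results_dir / sample / "qc" / f"{sample}_qcd.fastq.gz"    # assumed layout
    annot_tsv = results_dir / sample / "annotations" / f"{sample}.tsv"   # assumed layout
    n_reads = count_fastq_reads(qc_fastq) if qc_fastq.exists() else 0
    n_annot = count_annotation_rows(annot_tsv) if annot_tsv.exists() else 0
    flags = []
    if n_reads < MIN_READS:
        flags.append("too_few_reads")
    if n_annot < MIN_ANNOTATIONS:
        flags.append("too_few_annotations")
    return {"sample": sample, "reads": n_reads, "annotations": n_annot,
            "flags": ";".join(flags) or "ok"}

if __name__ == "__main__":
    results = Path("results")
    rows = [check_sample(results, d.name) for d in results.iterdir() if d.is_dir()]
    with open("sanity_summary.tsv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["sample", "reads", "annotations", "flags"],
                                delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```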

02.11

https://github.com/ontodev/cogs https://github.com/INCATools/ontology-development-kit/releases/tag/v.1.2.26 https://github.com/ontodev/robot/blob/master/CHANGELOG.md

ESIP marine data Vocabularies Special Session

CF conventions: a climate model data standard to deal with interoperability, based on the NetCDF file format. The conventions provide full spatial, temporal, and spectral metadata, up to 32 dimensions on a variable, and standard names to show what is in the variable.

GCMD keywords http://gcmd.arcticlcc.org/ facepalm

SeaDataNet NVS: exchange formats, platforms, measurement devices, etc. Similar stuff. NERC L22 for instruments.

Pier's mapping of ENVO to NERC https://github.com/EnvironmentOntology/envo/issues/731

02.16

https://github.com/tanyagupta/codeblogs/blob/development/githubHowTos/howto.md

02.18

Meeting with Alise:

iinit

ERROR: environment_properties::capture: missing environment file. should be at [/home/u19/kblumberg/.irods/irods_environment.json]
One or more fields in your iRODS environment file (irods_environment.json) are
missing; please enter them.

Enter the host name (DNS) of the server to connect to: data.cyverse.org
Enter the port number: 1247
Enter your irods user name: kblumberg
Enter your irods zone: iplant
Those values will be added to your environment file (for use by other iCommands) if the login succeeds.

irods docs: https://docs.irods.org/4.1.0/ https://cyverse-data-store-guide.readthedocs-hosted.com/en/latest/step2.html#icommands-first-time-configuration

check running jobs qstat -u kblumberg

example of how to delete a job by its number: qdel 3843824.head1

have numbersplit be 50

Have lists of 500 at a time (to submit 4x). Randomize these. If this doesn't work we can try groups of 200 files at a time (would need to submit 10x). See the sketch below.
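A minimal sketch of the randomize-and-chunk step in Python; the file names (all_files.txt, chunk_XX.txt) are placeholders, not the real list files:

```python
# Randomize a list of dataset files and split it into chunks of 500 for submission.
import random

CHUNK_SIZE = 500

with open("all_files.txt") as handle:
    files = [line.strip() for line in handle if line.strip()]

random.shuffle(files)  # randomize so each chunk is a mix of datasets

for i in range(0, len(files), CHUNK_SIZE):
    chunk = files[i:i + CHUNK_SIZE]
    with open(f"chunk_{i // CHUNK_SIZE:02d}.txt", "w") as out:
        out.write("\n".join(chunk) + "\n")
```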

02.23

FAIR digital SI; EOSC has an ontology mapping framework; Dublin Core, DDI, schema.org.

Iadopt

Seadatanet

Sea data cloud? for thesis.

Prefer cc0 over cc-by for licenses


02.24

OBO dashboard, OMO (useful subset of IAO).

Papers relevant to PM paper 2: https://www.frontiersin.org/articles/10.3389/fmars.2019.00440/full (DEF CITE), https://peerj.com/articles/cs-110/, https://content.iospress.com/articles/information-services-and-use/isu824 (be good too).

03.11

For my committee: 2 pager about getting back on track:

Page one: explanation of the material going into paper 2: ontology contributions supporting physicochemical data, ontology choices, CI choices (the Frictionless Data specification, analogous to NetCDF), etc. Then use Planet Microbe to ask and answer/highlight use cases for discovery of physicochemical data across environments.

Page two: plan for integrating that data with the functional and genomic annotations and getting it all into an RDF triplestore to ask and answer deeper questions about the distribution of taxa and genes, correlating that with the physicochemistry and environment types. Include a timeline for when the data should be available.

Some quick Q's:

Do Synechococcus and Prochlorococcus vs. depth (or another parameter); get files from 2 projects.

Acidity of water: lots of materials and features to compare since it's in so many projects.

Redfield ratio (PO4, NO3): 1000 samples, a good number of features and some materials. See what features/materials/biomes deviate from the standard Redfield ratio.

03.16

From Matt's example PM API calls: https://github.com/hurwitzlab/planet-microbe-app/blob/master/README.md

Open Chrome developer tools and select “Network” tab

option+command+c

I think I should be able to play around with this to set up the API calls I'll want to make to build my RDF triple store. I could do them all manually with the UI's search interface (downloading CSVs then adding those to the triplestore), but I think it would be cooler, more automated, and more reproducible to build the triple store from the JSON outputs of API calls. That way it shows that someone else could leverage the PM API to do something totally different with the data. So the data store would be built from a bunch of calls like:

[screenshot of example PM API calls]

See more examples in ~/Desktop/scratch/pm_query

Query by numeric attribute

curl -s -H 'Content-Type: application/json' -d '{"http://purl.obolibrary.org/obo/ENVO_09200014":"[0,10]","limit":"40"}' https://www.planetmicrobe.org/api/search | jq .sampleResults > temp.json

Then I could figure out a “correct” way of converting these JSON products to RDF, using something like ShEx or SHACL.
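A rough sketch of what that JSON-to-RDF step could look like (not the worked-out mapping): the payload mirrors the curl example above, but the shape of the sampleResults entries, the "sampleId" field name, and the PM namespace are assumptions.

```python
# Build a small RDF graph from a PM API search response (sketch, not the final mapping).
import requests
from rdflib import Graph, Literal, Namespace, URIRef

PM = Namespace("https://www.planetmicrobe.org/sample/")   # hypothetical namespace

payload = {"http://purl.obolibrary.org/obo/ENVO_09200014": "[0,10]", "limit": "40"}
resp = requests.post("https://www.planetmicrobe.org/api/search", json=payload)
resp.raise_for_status()
results = resp.json().get("sampleResults", [])

g = Graph()
for sample in results:
    subject = PM[str(sample.get("sampleId", "unknown"))]   # assumed identifier field
    for key, value in sample.items():
        if key == "sampleId":
            continue
        # Use the key directly as a predicate when it is already a purl,
        # otherwise mint a placeholder predicate in the hypothetical PM namespace.
        predicate = URIRef(key) if key.startswith("http") else PM[key]
        g.add((subject, predicate, Literal(value)))

g.serialize(destination="pm_samples.ttl", format="turtle")
```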

04.08

ESIP Marine Data meeting


Shawn Smith to Everyone (8:22 PM)

Kai - How would the units interchange system differ from software like Udunits?

04.09

Add comments to James Units for OBO.

Citation for pato https://academic.oup.com/bib/article/19/5/1008/3108819

05.19

1:1 with Bonnie

Remove some bad samples

Look at the graph by families of organisms

Show overall trends, big patterns to be followed up on in paper 3.

Doesn't have to be all organisms in all graphs, just a few interesting stories.

Might not need a high level pattern with all data

Not recreating a phylogeny; instead taking an existing one, e.g. for Synechococcus, and putting it together with our physicochemical and observation data.

some example figures in this paper: https://www.pnas.org/content/112/17/5443.full

Co-occurrence networks? Samples organized by depth.

Remove bad samples

Try CCA but subset by groups of taxa: are there patterns for groups of taxa? Maybe this gives insight into a bigger summary graph.

Maybe instead try correlations of all species against all chemistry, look for interesting patterns, then pick a couple for a prettier figure.

from Mark

CELFIE Protégé plugin, similar to ROBOT?

Michael DeBellis' Protégé 5 tutorial is apparently very good.

05.20

From Alise: https://mibwurrepo.github.io/Microbial-bioinformatics-introductory-course-Material-2018/multivariate-comparisons-of-microbial-community-composition.html

05.23

"Comparison of normalization methods for the analysis of metagenomic gene abundance data": definitely read before Planet Microbe paper 3.

06.03

BCODMO/NMDC meeting https://microbiomedata.github.io/nmdc-schema/MetaproteomicsAnalysisActivity/ https://proteinportal.whoi.edu/ https://lod.bco-dmo.org/browse-all/

06.22

EBI Ontologies CoP Webinar

Website: https://bigdata.cgiar.org/communities-of-practice/ontologies/ LinkedIn: https://www.linkedin.com/groups/13707155/ Youtube: https://www.youtube.com/channel/UCpAupP30tbkcFwMBErFdDhA/videos Newsletter: https://cgiar.us15.list-manage.com/subscribe?u=59a7500ef5c94dd50e2b9e2fb&id=454e94d3f2

OLS: ontology lookup service

ZOOMA: to mine terms from text

OXO: ontology mapping. Confidence is just count of mappings (pretty rudimentary) but they'll have this follow best practices set by SSSOM.

SSSOM: mapping standard

WEBINARS:

- July 27: Doing a Governance Operational Model for Ontologies (GOMO) with Oscar Corcho, Jose Antonio Bernabe Diaz and Edna Ruckhaus from the Technical University of Madrid and Alexander Garcia from BASF. Register: https://lnkd.in/gb2cG2h

- September 7: Neo4J as a backend DB for web protégé and Towards a plugin based architecture for web protégé with Matthew Horridge and Mark Musen from Stanford University. Register: https://lnkd.in/gU92u96

- September 21: Enterprise Knowledge Graph with Alexander Garcia from BASF. Register: https://lnkd.in/gniS2Mm

paper: Matching sensor ontologies through siamese neural networks without using reference alignment

06.24

Notes from PM meeting:

TODO: add more test examples in the 10 Gb range to this for Matt to try and run: https://github.com/hurwitzlab/planet-microbe-functional-annotation/blob/master/test_data/get_test_data.sh

file handling

edit config based on the individual files and their needs

#rm results/interproscan.txt results/killed_interproscan.txt to start up lookup server

Might want to consider another Kraken Database

configfile: "config/config.yml"; put the sample info into this.

config params to train:

vsearch_filter_maxee: 20
vsearch_filter_minlen: 75
frag_train_file: "illumina_10"
adapter_fasta: "data/configs/TruSeq3-PE.fa"

+ read length (if Matt can't automate)

Interesting links: https://gtdb.ecogenomic.org/ https://thispersondoesnotexist.com/. https://medium.com/analytics-vidhya/apriori-algorithm-in-association-rule-learning-9287fe17e944

My file with the list of PM analyses is ~/Desktop/software/planet_microbe/planet-microbe-semantic-web-analysis/job_listpmo_samples_unique.xlsx

06.29

Finalizing PM database for paper2.

Matt's current PM validation

Currently, Matt's scripts/load_datapackage_postgres.py deals with lat/long constraints when loading into the Planet Microbe DB. Alise found this example https://data-blog.gbif.org/post/frictionless-data-and-darwin-core/ for using Frictionless Data with Darwin Core, which sets lat/long constraints:

Using Matt's validation:

conda info --envs # manage conda environments

conda activate pm

Validate:

scripts/validate_datapackage.py ../planet-microbe-datapackages/OSD/datapackage.json
scripts/validate_datapackage.py ../planet-microbe-datapackages/Tara_Oceans/datapackage.json

conda deactivate #at the end

Suggested Frictionless DP validation script

Script to validate PM datapackages; for usage see the readme of the repo.

{
  "name": "year",
  "type": "integer",
  "rdfType": "http://rs.tdwg.org/dwc/terms/year",
  "constraints": {
    "required": true,
    "minimum": 1000,
    "maximum": 2050
  }
},

That example makes use of the goodtables script: https://goodtables.readthedocs.io/en/latest/

example call:

goodtables MiturgidaeBE_DP/data_package.json

pip install goodtables # install

goodtables OSD/datapackage.json #to run it works.

{
            "name": "Latitude",
            "constraints": {
              "required": true,
              "minimum": -90,
              "maximum": 90
            },
            ...

Added this and ran goodtables OSD/datapackage.json and it works! Ran Matt's validate_datapackage.py on the constraint-modified example above and it works too. I think I'm safe to add this for all lat/longs in all DPs (see the sketch below).
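A minimal sketch of how that bulk edit could be scripted in Python; the datapackage directory follows the paths used above, but the exact field names ("Latitude"/"Longitude") and the longitude bounds are assumptions:

```python
# Add lat/long constraints to every datapackage.json (sketch; field names assumed).
import json
from pathlib import Path

BOUNDS = {"Latitude": (-90, 90), "Longitude": (-180, 180)}

for dp_path in Path("../planet-microbe-datapackages").glob("*/datapackage.json"):
    dp = json.loads(dp_path.read_text())
    changed = False
    for resource in dp.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            if field.get("name") in BOUNDS:
                low, high = BOUNDS[field["name"]]
                constraints = field.setdefault("constraints", {})
                constraints.update({"required": True, "minimum": low, "maximum": high})
                changed = True
    if changed:
        dp_path.write_text(json.dumps(dp, indent=2))
        print(f"updated {dp_path}")
```

After running it, re-check each package with goodtables and Matt's validate_datapackage.py as above.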

07.16

https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a

07.20

ESIP marine meeting

https://github.com/ESIPFed/marinedata-vocabulary-guidance

from Pier: https://www.tdwg.org/community/gbwg/MIxS/

https://search.oceanbestpractices.org/

08.02

sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo -l -s -d  --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
2021-08-02 09:59:40,481 ERROR (CommandRunnerBase:213) Could not find an OWLObject for id: '-s'
## Showing axiom for: null
Exception in thread "main" java.lang.NullPointerException
	at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3418)
	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
	at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
make: *** [Makefile:431: mirror/ncbitaxon.owl] Error 1

tried again without the -l -s -d flags and got:

sudo ./run.sh make all_imports
OWLTOOLS_MEMORY=12G owltools ncbitaxon.obo --set-ontology-id http://purl.obolibrary.org/obo/ncbitaxon.owl -o mirror/ncbitaxon.owl
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
make: *** [Makefile:223: imports/ncbitaxon_import.owl] Error 1
rm imports/ncbitaxon_terms_combined.txt

08.04

From Adam: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html which links to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html

example from Adam: ssh -i /Users/.../adam.pem [email protected]

ssh -i ~/Desktop/software/kai-ontology.pem [email protected] DIDN'T WORK

ssh -i ~/Desktop/software/kai-ontology.pem [email protected] Also tried this didn't work.

ssh -i ~/Desktop/software/kai-ontology.pem [email protected]

Did all of this with chmod 600 and 400 on the .pem file.

Maybe this is due to the instance not being started? https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/start-instances.html has info on starting instances; need the aws command, perhaps from awscli.

example command:

aws ec2 start-instances --instance-ids i-1234567890abcdef0

Try install of aws from https://www.youtube.com/watch?v=BNH4i7CQ4Oc

aws ec2 start-instances --instance-ids i-1234567890abcdef0

but to do this I'll need the instance id.

AWS management console: https://aws.amazon.com/console/ -> login -> IAM user; try entering 504672911985, which I got from the command line. But I don't have a password, and the secret key doesn't work.

From aws-sec-cred-types.html they have the link https://account_id_or_alias.signin.aws.amazon.com/console/; tried https://504672911985.signin.aws.amazon.com/console/ but that just redirects to the sign-in, for which I don't have a password.

08.12

TODOs for PM paper: check formatting, reference formatting, push code, respond to comments from Elisha and ED, prepare Cover letter.

08.27

Cancel Student Health Insurance Plan https://health.arizona.edu/sites/default/files/Cancellation%20Guide%20SHIP.pdf

09.02

http://libgen.li/

09.09

Meeting with Alise about getting Matt Miller's pipeline running

interactive

source ~/.bashrc

conda env create -f kraken2.yml

conda env create -f bracken.yml

conda env create -f pm_env.yml   # this failed; make a new pm_env.yml with snakemake

# steps to create pm_env again do this in interactive
conda create -n pm_env

conda activate pm_env

conda install -n base -c conda-forge mamba

mamba create -c conda-forge -c bioconda -n snakemake snakemake

#modify the cluster.yml and config.yml files

#bowtie index folder is in:
/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/planet-microbe-functional-annotation/data
# copy that into my version of the git repo so that the whole thing is portable

# submit the main snakemake job which will submit other jobs
# need to make sure this isn't submitting too many
sbatch run_snakemake.sh # make sure to update the path in this to the repo

squeue -u kblumberg

scancel job_ID_number

09.10

/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai

Did the above, which worked except for the pm_env; instead it makes a snakemake env, so I changed run_snakemake.sh to activate snakemake instead of pm_env.

Prepared it with 1 Amazon sample, SRR4831664, from the example get_test_data.sh. Tried snakemake -n and it seemed to be working, then ran sbatch run_snakemake.sh and got Submitted batch job 2037818.

Started the job ~14:15 Sep 10th. 6.5 hours out, still going, no outputs yet.

17 hours 42 mins: still going, no outputs.

09.13

Job never got past the bowtie step. See job output https://metrics.hpc.arizona.edu/index.php#job_viewer?realm=SUPREMM&recordid=0&jobid=12767280&infoid=0

09.14

To the snakemake conda env I added conda install -c conda-forge biopython, after the QC pipeline couldn't load biopython and failed.

COB workshop

https://github.com/turbomam/pizzacob/tree/main http://dashboard.obofoundry.org/dashboard/index.html

09.15

FOODON workshop: the seafood workflow from Annosh is pretty neat, with the E3E lookup tool to get the labels. Could be relevant to BCODMO.

https://oboacademy.github.io/obook/tutorial/icbo2021/

09.16

Meeting with Alise:

ERR1234
  GO:0055114 5705414
  GO:0008152 3836620
  GO:0006810 1409368
  #_annotations (SUM column)

ERR1234
  number_basepairs 12322
  
Last rule: which includes a Python script for each of these:

1) Kraken with only the columns we want (taxID and the number of reads unique to that node)
2) InterProScan counts
3) GO counts (just drop any duplicated interpro2go annotations that are in the same MF, BP, or CC family; alphabetize and drop the 2nd one)
4) qc_scan log of # BP in the fasta from step_02_qc_reads_with_vsearch (use the biopython library on the fasta to give the length of reads; see the sketch below)
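A minimal sketch for item 4, using biopython to sum the base pairs in the QC'd fasta; the exact results path is an assumption based on the step_02_qc_reads_with_vsearch layout:

```python
# Total base pairs in the QC'd fasta for one sample (sketch; path assumed).
import sys
from Bio import SeqIO

def total_bp(fasta_path: str) -> int:
    """Sum the read lengths over every record in the fasta."""
    return sum(len(record.seq) for record in SeqIO.parse(fasta_path, "fasta"))

if __name__ == "__main__":
    sample = sys.argv[1]  # e.g. ERR1234
    path = f"results/{sample}/step_02_qc_reads_with_vsearch/{sample}_trimmed_qcd.fasta"
    print(f"{sample}\n  number_basepairs {total_bp(path)}")
```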

/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh -appl Pfam -i results/SRR4831664/step_05_chunk_reads/SRR4831664_trimmed_qcd_frags_2047.faa -b results/SRR4831664/step_06_get_orfs/SRR4831664_trimmed_qcd_frags_2047_interpro -goterms -iprlookup -dra -cpu 4 was working when run alone but not in pipeline.

09.23

repo from Alise to download from irods https://github.com/aponsero/download_irods

meeting:

For the paper: first question, benchmarking that the data look sane; then a couple of core stories we can address.

Run amazon datasets first.

va on the HPC shows the allocation remaining.

09.30

Notes for running the big PM analysis on UA HPC

Chunk sizes are, I think, in units of MB.

Files: bash/run_interproscan.sh (chunk size) and run_interproscan_chunk.sh (--time=).

200000 works but spawns lots of nodes, perhaps 100 per sample. Uses at most 2 hours.
20000 spawns way too many jobs.


1000000 is too large and gets a memory error, but only spawns 40 jobs for the 2.9 GB file SRR5720248_1.


Try 500000 with 6 hours: spawned 81 jobs for the same 2.9 GB file. Testing, see `err/interproscan.base=SRR5720248_1.err`;
may or may not be working, check one of the chunk files, e.g. `err/ips_chunk40_SRR5720248.err`.
Seems to be working.
~3.5 hours, worked.

Started the next batch ~3pm with 16.1G of samples; it started 493 jobs, so the heuristic is pretty good. Finished at ~10pm. However, it didn't work because some of the chunks timed out; I had them set to 6 hours and they needed more time. I'll set it to 12 hours just to be safe. Deleted the step 5 and 6 folders just to be safe, as it crashed and I'm not sure if the snakemake will pick it up after the failure with the parallelization. Now it's at 12 hours per chunk job. Started it again at 10:10pm Sep 30th.



#run in cluster
snakemake --cluster "sbatch -A {cluster.group} -p {cluster.partition} -n {cluster.n} -t {cluster.time} -N {cluster.N} --mem={cluster.m} -e {cluster.e} -o {cluster.o}"  --cluster-config config/cluster.yml -j 30 --latency-wait 30 --until `rules interproscan?` 

OR, to run up to step 4, do some commenting out:

rule all:
    input:
        expand("results/{sample}/bowtie/{sample}.fastq.gz", sample=config["samples"]),
        #expand("results/{sample}/step_02_qc_reads_with_vsearch/{sample}_trimmed_qcd.fasta", sample=config["samples"]),
    THIS    #expand("results/{sample}/step_07_combine_tsv/{sample}_trimmed_qcd_frags_interpro_combined.tsv", sample=config["samples"]),
        expand("results/{sample}/bracken/{sample}_profiles.txt", sample=config["samples"]),
        "results/killed_interproscan.txt",


Rule for interproscan and start and stop server. can just comment out.

https://www.biostars.org/p/265420/

2nd PM paper: higher taxonomic resolution is better with ecology/habitat. Prepare 2 versions of the docs file: one with the new changes and one with the original with track changes. Papers to add from review: https://doi.org/10.1111/1462-2920.15173 and https://doi.org/10.1007/s00248-020-01526-5. Alise's example response to reviewers: https://docs.google.com/document/d/17uT6JbOoyAj6tRtceHk46ZfW0sHK8t6W9KcWdPicKDI/edit#heading=h.5qbn069bmn46

10.01

Regarding the step 7 configurations, I tried 500000 for 6 hours but several didn't finish in that time. So I upped it to 12 hours for the chunk jobs and ran it last night. It seemed to have worked: all the step 7's are there and the job completed successfully. However, when I cat out the chunk files I get a few possible OOMs:

(puma) (base) [kblumberg@junonia err]$ cat ips_*
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 34401 Killed                  "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 14:52:43.646263 33724 slurmstepd   0x2b834568b340: error: Detected 1 oom-kill event(s) in StepId=2181585.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
/groups/bhurwitz/tools/interproscan-5.46-81.0/interproscan.sh: line 46: 54774 Killed                  "$JAVA" -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xms1028M -Xmx6072M -jar interproscan-5.jar $@ -u $USER_DIR
Sep 30 16:22:45.860395 54039 slurmstepd   0x2b489e9d8340: error: Detected 2 oom-kill event(s) in StepId=2181737.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

The files say they were successful for every chunk, but I don't think that accounts for the OOMs. It was only 3 events for those data files, so I don't think it's worth re-running them, but I've upped the memory to 7 GB. Resubmitted the next job at 8:08am CET.

From Simon and OGC: SOSA: A lightweight ontology for sensors, observations, samples, and actuators https://w3c.github.io/sdw/ssn/ Spatial Data on the Web Interest Group https://www.w3.org/TR/vocab-ssn/ https://github.com/opengeospatial/om-swg

10.04

Interpro job tests:

Tried 875000 chunk size: works for 50 nodes and worked with 12 GB mem.

Testing 20 nodes with 16gb memory didn't work. Try again with more mem.

Try 20 nodes, 2000000 chunk size, with 50 GB, again with sample SRR5720221_1 -> one sample got an out-of-memory error.

Try again with fewer nodes: 40 nodes, 1000000 chunk size, with 50 GB, again with sample SRR5720221_1 -> worked! ~3.5 hours.

Try 30 nodes, aka 1125000 chunk size, with 50 GB, with sample SRR5720225_1 -> made 39 nodes, not 30? First time this breaks the pattern from before, but it worked. Proceed with this chunk size; assuming it's 40 nodes to be safe, by my calculation we can do 35 GB at a time.

Reducing sample sizes:

gunzip -c data/SRR5720300_1.fastq.gz | head -n 100 | gzip > test_data/SRR5720300_1_mini.fq.gz

Looking at the "5"Gb samples from my parsed sample list:

BATS -> 150bp reads
3.4G SRR5720275_1.fastq.gz //gunzip -c data/SRR5720275_1.fastq.gz | head -n 10
3.4G SRR5720249_1.fastq.gz
3.2G SRR5720285_1.fastq.gz

HOT Chisholm -> 150bp reads
3.3G SRR5720293_1.fastq.gz //gunzip -c data/SRR5720293_1.fastq.gz | head -n 10
3.4G SRR5720302_1.fastq.gz

HOT ALOHA time/depth series -> 150bp reads
3.4G SRR9178068_1.fastq.gz //gunzip -c data/SRR9178068_1.fastq.gz | head -n 10
3.2G SRR9178368_1.fastq.gz //gunzip -c data/SRR9178368_1.fastq.gz | head -n 10
3.2G SRR9178503_1.fastq.gz
6.2G SRR5002405.fastq.gz    //gunzip -c data/SRR5002405.fastq.gz | head -n 10
6.2G SRR5002321.fastq.gz

Tara -> ~100bp reads
3.2G ERR599134_1.fastq.gz //gunzip -c data/ERR599134_1.fastq.gz | head -n 20
3.3G ERR599172_1.fastq.gz
3.4G ERR598972_1.fastq.gz //gunzip -c data/ERR598972_1.fastq.gz | head -n 20

It seems like the 5 GB I parsed from NCBI might be the forward and reverse combined, because none of these are 5 GB; most are 3.4 GB and two are 6.2 GB.

Regardless of that confusion, assuming we want to get to 3.5 GB (in real file size), which is close to the real median of the “5” GB files, the following command works to subset down to 3.5 GB:

gunzip -c data/SRR6507280_1.fastq.gz | head -n 175000000 | gzip > test_data/SRR6507280_3.5gb_test.fq.gz

Provenance for that calculation of the head -n value 175000000:

gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0M Oct  4 04:33 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 1000000 | gzip > test_data/SRR6507280_1_test.fq.gz
20M Oct  4 04:34 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 10000000 | gzip > test_data/SRR6507280_1_test.fq.gz
199M Oct  4 04:37 test_data/SRR6507280_1_test.fq.gz
gunzip -c data/SRR6507280_1.fastq.gz | head -n 100000000 | gzip > test_data/SRR6507280_1_test.fq.gz
2.0G Oct  4 04:57 test_data/SRR6507280_1_test.fq.gz

To count the individual files: gunzip -c test_data/SRR6507280_3.5gb_test.fq.gz | wc -l. Could make a script that, for the list of samples, downloads the sample, counts the number of lines, writes that plus the sample name to a file, then deletes the file (see the sketch below).
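A sketch of that idea in Python rather than the bash script described above; the iRODS collection path is a placeholder and list.txt is assumed to hold one sample accession per line:

```python
# For each sample: iget the fastq.gz, count its lines, record "sample<TAB>count",
# then delete the local copy to save space (sketch; collection path is a placeholder).
import gzip
import subprocess
from pathlib import Path

IRODS_PREFIX = "/iplant/home/shared/..."   # placeholder, not the real collection path

def count_lines_gz(path: Path) -> int:
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)

with open("list.txt") as handle, open("line_counts.tsv", "a") as report:
    for sample in (line.strip() for line in handle if line.strip()):
        local = Path(f"{sample}.fastq.gz")
        subprocess.run(["iget", "-f", f"{IRODS_PREFIX}/{sample}.fastq.gz", str(local)],
                       check=True)
        report.write(f"{sample}\t{count_lines_gz(local)}\n")
        local.unlink()  # delete after counting
```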

10.05

If we cut at 15 million reads we lose 79 samples: Amazon Plume Metagenomes 20, HOT ALOHA time/depth series 35, Amazon River Metagenomes 13, Tara 2, BATS 4, HOT Chisholm 5.
If we cut at 10 million reads we lose 47 samples: Amazon Plume Metagenomes 16, HOT ALOHA time/depth series 21, Amazon River Metagenomes 8, Tara 2.
If we cut at 7.5 million reads we lose 33 samples: Amazon Plume Metagenomes 14, HOT ALOHA time/depth series 16, Amazon River Metagenomes 3.
If we cut at 5 million reads we lose 28 samples: Amazon Plume Metagenomes 11, HOT ALOHA time/depth series 15, Amazon River Metagenomes 2.

I think I can rarefy to 10^7 reads because beyond that size you're not guaranteed to get more interproscan output, according to the files I've already run. If we want 10^7 reads (10000000), then we need that * 4 (FASTQ lines per read) for the head -n number = 40000000, with the command:

gunzip -c tmp_data_dir/{sample}.fastq.gz | head -n 40000000 | gzip > data/{sample}.fastq.gz

10.06

example commands to build Pfam and NCBItaxon rarefaction curves:

Pfam: cut -f 5 results/SRR1790489_1/step_07_combine_tsv/*.tsv | sort | uniq | wc -l

NCBITaxon: cut -f 5 results/SRR4831663/kraken2/*_report.tsv | sort | uniq | wc -l

Alise thinks 10 million reads as the rarefaction cutoff, and keep the >=5 million read samples.

Downloading and trimming data in /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/kai run:

sbatch download_trim.sh

with the list of files in list.txt formatted like

ERR771007_1
...

modify the bash script to the correct data path after we settle on a cutoff threshold.

10.07

Want to get back to the single-threaded version of the pipeline with regard to interproscan. Commit 6ee330c54b1952b8c5e1866a83b9a046941d1f6f is where he added the rule interproscan and bash/run_interproscan.sh. In the previous commits, d6f824e7d1c95770cd60389da644dbd1dc9e7975 and 3b9389f7aefc10e1c5f8b7ae048441dba803d89e, he adds submit_snakemake.sh; I had it working on a single thread prior to this. So I'll revert back to e1cf3048c6f6e4685680ac1032a36a299a3b6952.

From the first answer here try:

git checkout e1cf3048c6f6e4685680ac1032a36a299a3b6952

10.08

Download and trim script: based on my initial test (first job 2021-10-07.10:47:52, last job 2021-10-07.17:49:40), it took 7 hours for 162 NCBI GB. That's ~23 GB/hr; round down to 20 to be safe. * 48 hr = 960; round down to 950 to be safe.

Conclusion: submit <=950 GB in a 48 hr job. Nope, redid the calc at more like 17 GB/hr, so 55 hours for the same chunk size. So I changed the jobs to be 72 hours to be safe.

Regarding test_10_million it finished shortly after being resubmitted (after not finishing in the first 24 hours). I think I'm safe to set the job time to 48 hours.

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account

10.12

Possible freshwater lake metagenomes:

Best option: https://www.ncbi.nlm.nih.gov/bioproject/664399 28 metagenomes almost all are >= 5M spots, filter size 0.22, all freshwater from Lake Superior. Illumina NovaSeq 6000.

https://www.ncbi.nlm.nih.gov/bioproject/479434 Maybe, but the metadata is pretty unclear; some seem like sediments.

https://www.ncbi.nlm.nih.gov/bioproject/51219 only 5 illumina

https://www.ebi.ac.uk/metagenomics/studies/MGYS00002504 only 8 though ncbi: https://www.ncbi.nlm.nih.gov/bioproject/PRJEB27578

https://www.ncbi.nlm.nih.gov/bioproject/335374 This could work, 32 WGS lake metagenomes. NO: all are < 5M spots.

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA400857 urban but not so many

https://www.ncbi.nlm.nih.gov/bioproject/636190 few more great lake sequences
