PEARC24 Pelican Tutorial - Using Data from OSDF

Other Tutorial Sections

Using Data From the OSDF

Accompanying slides: here

Setup

Access a notebook here: https://notebook.ospool.osg-htc.org/

Authenticate with one of the following:

  • GitHub
  • ORCID
  • ACCESS ID
  • Your local university credentials

Select the “Basic” server and click “Start”.

Clone the repository, then open the README.ipynb file.

Alternatively, the commands are listed below.


WORKDIR=$HOME/training-origin/pelican-plugin
echo $WORKDIR

Recap: Fetching Data with Pelican

This set of commands downloads a test data file (in FASTQ, a sequence data format) from the Open Science Data Federation (OSDF).

cd $WORKDIR/data

OSDF=pelican://osg-htc.org
OBJ_PATH=ospool/uc-shared/public/osg-training/tutorial-fastqc/test.fastq
pelican object get $OSDF/$OBJ_PATH test.fastq
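
The URL passed to `pelican object get` is just the federation root joined to the object path. As a minimal sketch, a small helper (the name `pelican_url` is hypothetical and not part of the Pelican client) can do the join while tolerating a stray slash on either side:

```shell
# Join a federation root and an object path into one Pelican URL.
# pelican_url is a hypothetical helper, not a Pelican command.
pelican_url() {
    # ${1%/} drops one trailing slash; ${2#/} drops one leading slash
    printf '%s/%s\n' "${1%/}" "${2#/}"
}

OSDF=pelican://osg-htc.org
OBJ_PATH=ospool/uc-shared/public/osg-training/tutorial-fastqc/test.fastq
URL=$(pelican_url "$OSDF" "$OBJ_PATH")
echo "$URL"
```

The result can then be used as `pelican object get "$URL" test.fastq`, which is equivalent to the command above.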

The following command should display the beginning of a genomic sequence file:

head test.fastq
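
Beyond eyeballing the output of `head`, a quick structural check can confirm the download is intact. The sketch below assumes the common four-line FASTQ layout (header starting with `@`, sequence, separator starting with `+`, quality string); the function name `fastq_records` is hypothetical.

```shell
# Print the number of records in a four-line FASTQ file.
# Returns non-zero if the file is empty, has a line count that is
# not a multiple of four, or has a malformed header/separator line.
fastq_records() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        END {
            if (bad || NR == 0 || NR % 4 != 0) exit 1
            print NR / 4
        }
    ' "$1"
}
```

Running `fastq_records test.fastq` in the data directory would print the record count, or exit with an error status if the file was truncated in transit.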

Sample Job Submission

cd $WORKDIR/sample

Look at the contents of the HTCondor job submit file below. There should be some familiar elements (resource requests, where to save stdout/stderr/log files, what commands to run) and some potentially new elements (transferring files).

cat sample.submit
condor_submit sample.submit
condor_q
cat job*.output
cat output*.txt
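
For readers without the repository at hand, the elements described above (what to run, file transfer, stdout/stderr/log locations, resource requests) might be laid out roughly as follows. This is an illustrative sketch, not the contents of sample.submit: the script name, file names, and resource values are all assumptions, and it is written to a scratch file so nothing in the tutorial directory is overwritten.

```shell
# Write an illustrative submit file; every name in it is hypothetical.
SUBMIT_FILE="${TMPDIR:-/tmp}/example.submit"

cat > "$SUBMIT_FILE" <<'EOF'
# What to run (hypothetical script name)
executable = example.sh

# Input fetched by HTCondor before the job starts
transfer_input_files = pelican://osg-htc.org/ospool/uc-shared/public/osg-training/tutorial-fastqc/test.fastq

# Where stdout, stderr, and the HTCondor job log are saved
output = job.$(Cluster).output
error  = job.$(Cluster).error
log    = job.$(Cluster).log

# Resource requests
request_cpus   = 1
request_memory = 1GB
request_disk   = 1GB

queue 1
EOF

cat "$SUBMIT_FILE"
```

The quoted `'EOF'` delimiter keeps the shell from expanding `$(Cluster)`, which is an HTCondor macro filled in at submit time.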

Job Submission with Pelican and OSDF

One Job Fetching a Container and Data File

cd $WORKDIR/fastqc
ls -lh

We are now going to submit a slightly more complex job. This job fetches both the test.fastq file from the OSDF that we downloaded above and a container with the FastQC bioinformatics program.

grep "pelican" single-fastqc.submit

The job itself runs the FastQC program on the fetched data file and produces a visualization, which is written back to the results folder.

cat single-fastqc.submit
condor_submit single-fastqc.submit
condor_q
ls results/

One of the commands in the job script was ls, so the standard output file shows that test.fastq was downloaded.

cat logs/*.out

Multiple Jobs Fetching a Single Container and Unique Data Files

cd $WORKDIR/fastqc

Because the Pelican object links can be quite long, it's helpful to use intermediate variables in the submit file.

grep "OBJ_LOC" many-fastqc.submit
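
To illustrate the idea, a submit file can define the long prefix once and reuse it with HTCondor's `$(NAME)` macro syntax. The sketch below is not the contents of many-fastqc.submit: only the variable name `OBJ_LOC` comes from the tutorial, while its value, `fastq_file`, and `fastq-files.txt` are assumptions.

```shell
# Write an illustrative fragment to a scratch file.
MANY_FILE="${TMPDIR:-/tmp}/example-many.submit"

cat > "$MANY_FILE" <<'EOF'
# Define the long prefix once (the value shown is an assumption)
OBJ_LOC = pelican://osg-htc.org/ospool/uc-shared/public/osg-training/tutorial-fastqc

# HTCondor expands both macros separately for each queued job
transfer_input_files = $(OBJ_LOC)/$(fastq_file)

# One job per line of the hypothetical fastq-files.txt
queue fastq_file from fastq-files.txt
EOF

grep 'OBJ_LOC' "$MANY_FILE"
```

Keeping the prefix in one place means a change of federation or directory touches a single line instead of every transfer entry.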

Finally, we'll run the same FastQC analysis on multiple data files, each again fetched from the OSDF.

cat many-fastqc.submit
condor_submit many-fastqc.submit
condor_q
ls results/
cat logs/*.out
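
With several jobs in flight, it can be handy to check which inputs have not produced a report yet. The sketch below assumes FastQC's default naming (an input `NAME.fastq` yields `NAME_fastqc.html`) and that the HTML reports land directly in the results folder; the function name and sample names are hypothetical.

```shell
# Print the name of every sample that has no FastQC HTML report yet.
# Usage: missing_reports RESULTS_DIR SAMPLE [SAMPLE ...]
missing_reports() {
    results_dir=$1
    shift
    for sample in "$@"; do
        [ -f "$results_dir/${sample}_fastqc.html" ] || echo "$sample"
    done
}
```

For example, `missing_reports results sampleA sampleB` would print only the samples whose reports are still missing, and print nothing once all jobs have finished and returned their output.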

Job Submission with Pelican and YOUR origin

Let's go back to our sample directory and try to download a file from YOUR origin in a job!

cd $WORKDIR/sample
cat sample-origin.submit
condor_submit sample-origin.submit
condor_q