Tutorial 2: Data Access

Suchandra Thapa edited this page Jun 13, 2014 · 1 revision

Introduction

This page introduces the user to accessing remote data using SkeletonKey. After reading through this page, the user should be able to set up jobs that access data shared by a Chirp server.

Prerequisites

The following items are needed in order to complete this tutorial:

  1. Webserver where the user can place files to access using the web
  2. HTCondor Cluster (optional)
  3. A working SkeletonKey install
  4. Familiarity with basic usage of SkeletonKey (the first tutorial is sufficient)

Conventions

In the examples given in this tutorial, text in red denotes strings that should be replaced with user-specific values, for example the URL for the user's webserver. In addition, this tutorial assumes that files can be made available through the webserver by copying them to ~/public_html on the machine where SkeletonKey is installed.

Chirp Data Server

Starting and stopping Chirp

SkeletonKey installs chirp_control so that the user can control the Chirp data server that is installed. To start Chirp, run chirp_control start. To stop Chirp, run chirp_control stop.

Configuration

The user can modify the directory that Chirp exports by editing ~/.chirp/chirp_options and changing EXPORT_DIR to point to the directory that Chirp should export. If Chirp will be used to export an HDFS filesystem, EXPORT_DIR should be replaced with HDFS_URI, set to the URI that should be exported (e.g. hdfs://hdfs-namenode:9000/user_directory).
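
The edit described above could be scripted roughly as follows. This is only a sketch: it assumes that chirp_options holds one shell-style KEY=VALUE assignment per line, which this page does not confirm, so check the installed file before relying on it. The helper name set_export_dir is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: point EXPORT_DIR at a new directory.
# ASSUMPTION: the options file uses one KEY=VALUE pair per line.
set_export_dir() {
    local options_file="$1" new_dir="$2"
    # Keep a backup so the original settings can be restored.
    cp "$options_file" "$options_file.bak"
    if grep -q '^EXPORT_DIR=' "$options_file.bak"; then
        # Rewrite the existing setting, reading from the backup copy.
        sed "s|^EXPORT_DIR=.*|EXPORT_DIR=$new_dir|" "$options_file.bak" > "$options_file"
    else
        # No setting present yet, so append one.
        echo "EXPORT_DIR=$new_dir" >> "$options_file"
    fi
}
```

A call such as set_export_dir ~/.chirp/chirp_options /data/shared (the target directory is a placeholder) would leave the previous file in ~/.chirp/chirp_options.bak.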

Data access example

The next example will guide the user through creating a job that reads from and writes to a filesystem exported by Chirp.

Creating the application tarball

Since we won't be able to use a single command found on the compute nodes, we'll need to create an application tarball containing a shell script that does the data access on the compute nodes. The following steps show what needs to be done:

  1. Create a directory for the script

    [user@hostname ~]$ mkdir /tmp/data_access

  2. Create a shell script, /tmp/data_access/myapp.sh with the following lines:

    #!/bin/bash
    echo "testing output" > $CHIRP_MOUNT/data_access_test
    cat $CHIRP_MOUNT/data_access_test

  3. Next, make sure the myapp.sh script is executable and create a tarball:

    [user@hostname ~]$ chmod 755 /tmp/data_access/myapp.sh
    [user@hostname ~]$ cd /tmp
    [user@hostname ~]$ tar cvzf data_access.tar.gz data_access

  4. Then copy the tarball to your webserver

    [user@hostname ~]$ cd /tmp
    [user@hostname ~]$ cp data_access.tar.gz ~/public_html
    [user@hostname ~]$ chmod 644 ~/public_html/data_access.tar.gz

Notice the use of the $CHIRP_MOUNT variable when reading from or writing to the directory exported through Chirp. SkeletonKey defines and sets $CHIRP_MOUNT so that it corresponds to the directory being exported from the Chirp server.
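
The first three steps can also be collected into one script, sketched below. The helper name build_tarball is hypothetical, and the sketch works in a scratch directory from mktemp rather than /tmp/data_access so that it can be tried without touching existing files. The final line is a local dry run that points CHIRP_MOUNT at the scratch directory. This only checks the script itself, not access through Chirp.

```shell
#!/bin/bash
# Hypothetical helper: build data_access.tar.gz inside a given work directory.
build_tarball() {
    local workdir="$1"
    mkdir -p "$workdir/data_access"
    # The quoted heredoc delimiter keeps $CHIRP_MOUNT unexpanded in the script.
    cat > "$workdir/data_access/myapp.sh" <<'EOF'
#!/bin/bash
echo "testing output" > $CHIRP_MOUNT/data_access_test
cat $CHIRP_MOUNT/data_access_test
EOF
    chmod 755 "$workdir/data_access/myapp.sh"
    # Archive relative to workdir so that paths start with data_access/.
    tar -C "$workdir" -czf "$workdir/data_access.tar.gz" data_access
}

workdir=$(mktemp -d)
build_tarball "$workdir"
# Local dry run: CHIRP_MOUNT is set by hand here, not by SkeletonKey.
CHIRP_MOUNT="$workdir" bash "$workdir/data_access/myapp.sh"
```

The resulting $workdir/data_access.tar.gz can then be copied to ~/public_html as in step 4.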

Creating a job wrapper

You'll need to do the following on the machine where you installed SkeletonKey:

  1. Open a file called data_access.ini and add the following lines:

    [Directories]
    export_base = /tmp
    read = /
    write = /

    [Parrot]
    location = http://your.host/parrot.tar.gz

    [Application]
    location = http://your.host/data_access.tar.gz
    script = ./data_access/myapp.sh

  2. In data_access.ini, change the URL http://your.host/parrot.tar.gz to point to the URL of the Parrot tarball that you copied previously.

  3. Run SkeletonKey on data_access.ini:

    [user@hostname ~]$ skeleton_key -c data_access.ini

  4. Run the job wrapper to verify that it's working correctly

    [user@hostname ~]$ sh ./job_script.sh

The ini file used here differs from the file used in tutorial one by also including a [Directories] section. This section lets the user specify the directory being exported by Chirp and indicate which paths applications will have read or write access to. SkeletonKey uses the read setting in this section to set the directories that the application will have read access to. Note that this setting should be a comma-separated list of directories relative to the directory given in the export_base setting. For example, if read is set to /, data, data/input and export_base is set to /tmp, then Chirp will be set to give read access to /tmp, /tmp/data, and /tmp/data/input. The write setting is analogous to the read setting, except that it gives read/write permissions to the listed directories instead of just read permissions.
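
To make the relationship between export_base and the read or write list concrete, the sketch below expands a comma-separated setting into the absolute paths it refers to. The helper expand_paths is hypothetical and only mirrors the rule described above. It is not SkeletonKey's own code.

```shell
#!/bin/bash
# Hypothetical helper: print the absolute paths implied by export_base and a
# comma-separated read (or write) setting. Illustration only.
expand_paths() {
    local export_base="$1" setting="$2" entry
    local IFS=','
    for entry in $setting; do
        # Trim surrounding spaces, then drop leading and trailing slashes.
        entry=$(echo "$entry" | sed -e 's/^ *//' -e 's/ *$//' -e 's|^/*||' -e 's|/*$||')
        if [ -z "$entry" ]; then
            # An entry of "/" refers to export_base itself.
            echo "$export_base"
        else
            echo "$export_base/$entry"
        fi
    done
}

expand_paths /tmp "/, data, data/input"
```

For the values used in the paragraph above, this prints /tmp, /tmp/data, and /tmp/data/input, one per line.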

Verification

  1. On the system running Chirp, run the following to verify that the file was written correctly:

    [user@hostname ~]$ cat /tmp/data_access_test
    testing output

  2. The output should match the output given in the example above.

  3. Once the output is verified, delete the output file:

    [user@hostname ~]$ rm /tmp/data_access_test
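
The check and cleanup above can be wrapped in a small function, which is convenient when the job wrapper is tested repeatedly. This is a minimal sketch, and the name verify_and_clean is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check that a file holds the expected text, then remove it.
verify_and_clean() {
    local file="$1" expected="$2"
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 1
    fi
    if [ "$(cat "$file")" != "$expected" ]; then
        # Leave the file in place so it can be inspected.
        echo "unexpected contents in $file" >&2
        return 1
    fi
    # Contents match, so clean up as in the last step.
    rm "$file"
    echo "verified: $file"
}
```

For this tutorial the call would be verify_and_clean /tmp/data_access_test "testing output", run on the system hosting Chirp.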

Using the job wrapper

Standalone

Once the job wrapper has been verified to work, it can be copied to another system and run:

    [user@hostname ~]$ scp job_script.sh another_host:~/
    [user@hostname ~]$ ssh another_host
    [user@another_host ~]$ sh ./job_script.sh

Submitting to HTCondor (Optional)

The following part of the tutorial is optional and covers using a generated job wrapper in an HTCondor submit file.

  1. On your HTCondor submit node, create a file called sk.submit with the following contents

    universe = vanilla
    notification = never
    executable = ./job_script.sh
    output = /tmp/sk/test_$(Cluster).$(Process).out
    error = /tmp/sk/test_$(Cluster).$(Process).err
    log = /tmp/sk/test.log
    ShouldTransferFiles = YES
    when_to_transfer_output = ON_EXIT
    queue 1

  2. Next, create /tmp/sk for the HTCondor log and output files

    [user@condor-submit-node ~]$ mkdir /tmp/sk

  3. Then copy the job wrapper to the HTCondor submit node

    [user@hostname ~]$ scp job_script.sh condor-submit-node:~/

  4. Finally submit the job to HTCondor and verify that the jobs ran successfully

    [user@hostname ~]$ ssh condor-submit-node
    [user@condor-submit-node ~]$ condor_submit sk.submit
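
One way to verify that the jobs ran successfully is to look for the expected text in the output files named in sk.submit. The sketch below assumes that the job's standard output contains the line printed by myapp.sh, as it does when the wrapper is run by hand. The helper name check_job_outputs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report which job output files contain the expected text.
check_job_outputs() {
    local outdir="$1" expected="$2" failed=0 f
    for f in "$outdir"/test_*.out; do
        # With no matches the glob stays literal, so bail out on non-files.
        [ -f "$f" ] || { echo "no output files in $outdir" >&2; return 1; }
        if grep -qF "$expected" "$f"; then
            echo "ok: $f"
        else
            echo "FAILED: $f"
            failed=1
        fi
    done
    return "$failed"
}
```

Once condor_q no longer lists the job, running check_job_outputs /tmp/sk "testing output" on the submit node gives a quick pass or fail summary. The matching .err files in /tmp/sk are the place to look when a job is reported as failed.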