add docs and script to fetch rucio dataset files
garciagenrique committed Jun 25, 2024
1 parent eef00cd commit 76fc9b9
Showing 2 changed files with 58 additions and 1 deletion.
25 changes: 24 additions & 1 deletion tutorials/data-lake/pull-dataset/README.md
@@ -1 +1,24 @@
# Pull dataset from Rucio data lake
# Interact with Rucio dataset files

The following script assumes that all the files within a "Rucio Dataset" (identified by a `DID`, see below) are present in an RSE (Rucio Storage Element), and that this RSE is accessible locally.
- `DIDs` (Data Identifiers, see the Rucio [documentation](https://rucio.github.io/documentation/started/concepts/file_dataset_container/)) are composed of a scope plus a dataset name, in the `SCOPE:Name` format.
- If the files are not present in the RSE, replicate the dataset on the desired RSE before running the script.
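
Since a DID is a plain `SCOPE:Name` string, its two parts can be separated with shell parameter expansion. The following is a minimal sketch; the helper names `did_scope` and `did_name` are illustrative and are not part of Rucio or of the script in this tutorial.

```bash
#!/bin/bash
# Hypothetical helpers (not part of Rucio): split a DID of the form SCOPE:Name.

did_scope() {
    # Everything before the first colon.
    printf '%s\n' "${1%%:*}"
}

did_name() {
    # Everything after the first colon.
    printf '%s\n' "${1#*:}"
}

did="calorimeter:training_data_hdf5"
echo "scope: $(did_scope "$did")"
echo "name:  $(did_name "$did")"
```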

Run the following bash script:

```bash
./rucio_dataset_files.sh <SCOPE:Name> <output_filename> <output_dirname>
```
where
- `SCOPE:Name` is the Rucio DID. You can list all the scopes with the command `rucio list-scopes`, and the dataset names with `rucio list-dids <SCOPE>:` (note the colon).
- `output_filename` is the output file that will contain the file path of every file in the dataset.
- `output_dirname` is the output directory that will hold all the dataset files as symbolic links, which avoids duplicating files on disk. It also spares users from searching the storage themselves, which can get complicated depending on the storage type and layout. The script exits with an error if this directory already exists.
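
For reference, the script derives each local path and link name from the replica URL that Rucio reports. The sketch below reproduces those two string transformations in isolation. The URL prefix and the local mount point are the ones hard-coded in `rucio_dataset_files.sh`; the sample replica URL and the function names are made up for illustration.

```bash
#!/bin/bash
# Illustrative only: mirrors the string handling in rucio_dataset_files.sh.

# Map a replica URL to the path under the locally mounted storage.
url_to_local_path() {
    printf '%s\n' "$1" | sed 's|https://dcache.sling.si:2880|/dcache/sling.si|g'
}

# Build the symlink name: the DID with ':' replaced by '.', plus the file basename.
link_name() {
    local did=$1 path=$2
    printf '%s.%s\n' "${did/:/.}" "$(basename "$path")"
}

# Hypothetical replica URL; real paths depend on the RSE configuration.
url="https://dcache.sling.si:2880/some/dir/file_001.h5"
path=$(url_to_local_path "$url")
echo "$path"
link_name "calorimeter:training_data_hdf5" "$path"
```

Prefixing each link with the DID keeps links from different datasets distinguishable by name.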

```bash
# Example
./rucio_dataset_files.sh calorimeter:training_data_hdf5 calorimeter_files.txt calorimeter_symlink_dir

# Then check the output file and the directory with the symlinks
cat calorimeter_files.txt
ls -l calorimeter_symlink_dir
```
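
Because the output directory contains only symbolic links, a link dangles whenever its target on the RSE mount cannot be reached. As a quick sanity check, assuming the output file lists one path per line, something like the following counts the entries whose target is missing. `check_links` is a hypothetical helper, not part of the tutorial script.

```bash
#!/bin/bash
# Hypothetical helper: print the number of listed paths whose target does not exist.
check_links() {
    local list=$1 missing=0 path
    while IFS= read -r path; do
        # Ignore empty lines.
        [ -z "$path" ] && continue
        # -e follows symlinks, so a dangling link counts as missing.
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "$missing"
}

# Usage: check_links calorimeter_files.txt
```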
34 changes: 34 additions & 0 deletions tutorials/data-lake/pull-dataset/rucio_dataset_files.sh
@@ -0,0 +1,34 @@
#!/bin/bash
#
# G. Guerrieri & E. Garcia (CERN) - Jun 2024
#
# This script runs only on VEGA
#
# Usage - on a terminal run
# > ./rucio_dataset_files.sh <SCOPE:DataSet> <output_file> <output_symlink_dir>

set -e

ds=$1
name=$2
location=$3

pw=$(pwd -P)

# Start from an empty output file, using the name exactly as given
rm -f "${name}"
touch "${name}"

if [ -d "${location}" ]; then echo -e "Directory exists. Exiting\n${pw}/${location}" ; exit 1 ; fi
mkdir "${location}"

# Field 12 of each replica row is the file URL; rewrite it to the local dCache mount point
for file in $(rucio list-file-replicas --rse VEGA-DCACHE "$ds" | awk '{ print $12 }' | sed 's|https://dcache.sling.si:2880|/dcache/sling.si|g')
do
    # Rows that are not file entries can yield a bare "|" in that field
    if [[ $file == "|" ]]; then continue; fi
    fileReduced=$(basename "$file")
    echo "linking $fileReduced ..."
    link=$location/${ds/:/.}.$fileReduced
    ln -s "$file" "$link"
    echo "${pw}/$link" >> "${name}"
done

chmod -R 777 "${location}"
