Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add docs and script to fetch rucio dataset files #165

Open
wants to merge 1 commit into
base: 156-easily-access-datasets-on-rucio-data-lake
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 24 additions & 1 deletion tutorials/data-lake/pull-dataset/README.md
Original file line number Diff line number Diff line change
@@ -1 +1,24 @@
# Pull dataset from Rucio data lake
# Interact with Rucio dataset files

The following script assumes that all the files within a "Rucio Dataset" (or `DIDs` - see below) are present in a RSE (Rucio Storage Element), and that this RSE is accessible locally.
- `DIDs` (or Data Identifiers - see Rucio [documentation](https://rucio.github.io/documentation/started/concepts/file_dataset_container/)) are composed of a scope plus a dataset name in the `SCOPE:Name` format.
- If the files are not present in the RSE, replicate the dataset on the desired RSE before running the script.

Run the following bash script

```bash
> ./rucio_dataset_files.sh <SCOPE:DataSet> <output_filename> <output_dirname>
```
where
- `SCOPE:Name` is the Rucio DID. You can list all the scopes with the command `rucio list-scopes`, and the dataset name with `rucio list-did <SCOPE>:` (note the colon).
- `output_filename` is the output file that contains the "filepath" of all the files in the dataset.
- `output_dirname` is the output directory with all the dataset files (in the form of symbolic links), to avoid duplication of files on disk. It also prevents users to search within the disk, which could get complicated depending on the storage kind and model.

```bash
# Example
> ./rucio_dataset_files.sh calorimeter:training_data_hdf5 calorimeter_files.txt calorimeter_symlink_dir

# And check the output file and the directory with the symlinks
> cat calorimeter_files.txt
> ls -l calorimeter_symlink_dir
```
34 changes: 34 additions & 0 deletions tutorials/data-lake/pull-dataset/rucio_dataset_files.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
#/bin/bash
#
# G. Guerrieri & E. Garcia (CERN) - Jun 2024
#
# This script runs only on VEGA
#
# Usage - on a terminal run
# > ./rucio_dataset_files.sh <SCOPE:DataSet> <output_file> <output_symlink_dir>

set -e

ds=$1
name=$2
location=$3

pw=`pwd -P`

if [[ -f "${name}" ]]; then rm ${name}.txt; fi
garciagenrique marked this conversation as resolved.
Show resolved Hide resolved
touch ${name}.txt

if [ -d "${location}" ]; then echo -e "Directory exists. Exiting\n${pw}/${location}" ; exit 1 ; fi
mkdir $location

for file in `rucio list-file-replicas --rse VEGA-DCACHE $ds | awk '{ print $12 }' | sed 's|https://dcache.sling.si:2880|/dcache/sling.si|g'`
do
if [[ $file == "|" ]]; then continue; fi
fileReduced=`basename $file`
echo linking $fileReduced "..."
link=$location/${ds/:/.}.$fileReduced
ln -s $file $link
echo ${pw}/$link >> ${name}.txt
done

chmod -R 777 $3