add docs and script to fetch rucio dataset files #165

Open
wants to merge 1 commit into
base: 156-easily-access-datasets-on-rucio-data-lake
from


garciagenrique
Collaborator

@garciagenrique garciagenrique commented Jun 20, 2024

Summary

This PR adds a bash script, to be run on VEGA, that fetches all the files from a Rucio dataset. The script assumes that the full dataset is present on the VEGA RSE; otherwise, a replication rule would have to be run first.

The script creates one symlink per file, allowing the user to access the dataset without searching through or interacting with the Rucio file structure.
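
As a rough illustration of the symlink step only, here is a minimal Python sketch. It is not the bash script from this PR: the function name `link_dataset_files` is hypothetical, and it assumes the physical file paths on the RSE have already been obtained from Rucio and are passed in as plain strings.

```python
from pathlib import Path
from typing import Iterable, List


def link_dataset_files(file_paths: Iterable[str], symlink_dir: str) -> List[Path]:
    """Create one symlink per file inside symlink_dir and return the links.

    file_paths is assumed to hold the physical locations of the dataset
    files (hypothetical input; the lookup in Rucio is not shown here).
    """
    target_dir = Path(symlink_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    links = []
    for raw in file_paths:
        source = Path(raw).resolve()
        link = target_dir / source.name
        # Replace a stale symlink left by a previous run. A regular file
        # with the same name is left alone: symlink_to then raises
        # FileExistsError instead of overwriting it.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(source)
        links.append(link)
    return links
```

With this layout the user only sees a flat directory of links named after the files, regardless of how deeply the files are nested on the storage side.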


Related issue: #156

Co-authored-by: Giovanni Guerrieri [email protected]

@garciagenrique garciagenrique force-pushed the 156-easily-access-datasets-on-rucio-data-lake branch from 37c18fe to ef20ca2 Compare June 20, 2024 16:01
Collaborator

@matbun matbun left a comment

It would be nice to have an example with a dataset already on the interTwin data lake, e.g., using the CERN use case dataset.
I am a bit unsure about these aspects:

  • Use of the .txt file format: is this a constraint? If not, would it be possible to make it more general?
  • Does this script allow copying a whole dataset in one shot?

Also, before merging, I would kindly ask you to fix the linter problems.

@matbun
Collaborator

matbun commented Jun 22, 2024

On second thought, I would suggest the following improvements:

First, incorporating the shell script logic into itwinai would make life simpler for users. Instead of copying and executing the script, they could call it from within Python. We could also add it to the itwinai CLI. Example:

itwinai get-rucio-dataset <SCOPE:DataSet> <output_file> <output_symlink_dir>

This would spare users from managing yet another script and would let us always ship its latest version with itwinai.
If you think that this is doable, could you please implement a function under src/itwinai/rucio.py? I will take care of integrating it into the itwinai CLI.
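
To make the proposed command shape concrete, here is a minimal sketch of how its arguments could be parsed. It uses the standard-library `argparse` purely for illustration; how the actual itwinai CLI is built is not stated here, and `build_parser` and `parse_did` are hypothetical helper names.

```python
import argparse
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed `get-rucio-dataset` command."""
    parser = argparse.ArgumentParser(prog="itwinai")
    sub = parser.add_subparsers(dest="command", required=True)

    get = sub.add_parser(
        "get-rucio-dataset",
        help="Fetch the files of a Rucio dataset and symlink them locally",
    )
    get.add_argument("dataset", help="dataset identifier in SCOPE:DataSet form")
    get.add_argument("output_file", help="file that receives the list of dataset files")
    get.add_argument("output_symlink_dir", help="directory that receives one symlink per file")
    return parser


def parse_did(did: str) -> Tuple[str, str]:
    """Split a SCOPE:DataSet identifier into (scope, name)."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"expected SCOPE:DataSet, got {did!r}")
    return scope, name
```

For example, `build_parser().parse_args(["get-rucio-dataset", "myscope:mydataset", "files.txt", "links"])` yields a namespace whose `dataset` field can be passed to `parse_did`; the scope and dataset names in this call are made up.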

Second, the tutorial on how to get data from Rucio should come after the explanation of how to properly set up the Rucio client. Thus, I have created another tutorial folder under tutorials/data-lake/01-configure-rucio. I know you have some info on that, so could you please add a bit of documentation in the README file?

@garciagenrique garciagenrique force-pushed the 156-easily-access-datasets-on-rucio-data-lake branch from ef20ca2 to 76fc9b9 Compare June 25, 2024 16:40
@garciagenrique
Collaborator Author

> On second thought, I would suggest the following improvements:
>
> First, incorporating the shell script logic into itwinai would make life simpler for users. Instead of copying and executing the script, they could call it from within Python. We could also add it to the itwinai CLI. Example:
>
> itwinai get-rucio-dataset <SCOPE:DataSet> <output_file> <output_symlink_dir>

What about just running itwinai get-rucio-dataset and managing the <output_file> and the <output_symlink_dir> internally? It would basically be doing (more or less) an os.listdir(output_file).
I can implement it, of course.

> Second, the tutorial on how to get data from Rucio should come after the explanation of how to properly set up the Rucio client. Thus, I have created another tutorial folder under tutorials/data-lake/01-configure-rucio. I know you have some info on that, so could you please add a bit of documentation in the README file?

I have improved the README in this PR; I will add a more detailed tutorial to the directory you pointed to tomorrow.

@matbun
Copy link
Collaborator

matbun commented Jun 28, 2024

> What about just running itwinai get-rucio-dataset and managing the <output_file> and the <output_symlink_dir> internally? It would basically be doing (more or less) an os.listdir(output_file). I can implement it, of course.

Indeed, the simpler the better! I would still leave an optional argument to specify the output folder, so that users are not forced to cd into the target folder before executing the command.
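
A minimal sketch of that optional argument, assuming it falls back to the current working directory when omitted (`resolve_output_dir` is a hypothetical helper name, not existing itwinai code):

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(output_dir: Optional[str] = None) -> Path:
    """Return the folder to write into, defaulting to the current directory."""
    # No argument given: behave as if the user had cd'ed into the target folder.
    target = Path(output_dir) if output_dir else Path.cwd()
    # Create the folder (and any missing parents) so callers can write right away.
    target.mkdir(parents=True, exist_ok=True)
    return target.resolve()
```

Calling `resolve_output_dir()` keeps the simple behaviour, while `resolve_output_dir("data/my-dataset")` (an illustrative path) lets the command be run from anywhere.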

> I have improved the README in this PR; I will add a more detailed tutorial to the directory you pointed to tomorrow.

Thanks!

@matbun matbun linked an issue Jul 2, 2024 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Easily access datasets on Rucio data lake