Retrieve datasets to merlin #149

Open
sbliven opened this issue Dec 1, 2022 · 2 comments
@sbliven
Member

sbliven commented Dec 1, 2022

Feature Request

We would like to add an option to retrieve datasets to merlin. Currently there is a 'PSI-ra' option when retrieving from SciCat. We would like to support similar functionality for merlin and other central archiving locations.

Ra implementation

(Please edit if any of this information is incorrect)

The current PSI-ra retrieval workflow is as follows:

  1. Each ra pgroup has a 'retrieve' directory owned by the retrieval service user
  2. SciCat creates a retrieval job:
  {
    "id": "c0a7cab3-acd7-4474-be75-b81024c775c8",
    "emailJobInitiator": "[email protected]",
    "type": "retrieve",
    "jobParams": {
      "username": "oidc.bliven_s",
      "destinationPath": "/archive/retrieve",
      "option": "PSI-RA"
    },
    "jobStatusMessage": "finishedSuccessful",
    "datasetList": [
      {
        "pid": "20.500.11935/a1704aba-285b-4f95-b48d-36a10930694f",
        "files": []
      }
    ],
    "jobResultObject": {
      "result": {
        "rc": "0",
        "jobid": "76033"
      }
    }
  }
  3. Arima fetches the data from tape, places it in /das/work/<pgroup>/retrieve/<user>/<pid> and reports success
  4. Users copy/move the data to the desired destination

Permissions rely on ACLs to allow both the service user and the pgroup members to access the directory.
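
As a rough illustration of the job record above, here is a minimal Python sketch that pulls out the fields a retrieval consumer would care about. The helper name `summarize_retrieve_job` is hypothetical, and treating `rc == "0"` as success is an assumption based only on the example record (where it appears together with "finishedSuccessful").

```python
import json


def summarize_retrieve_job(job: dict) -> dict:
    """Collect the retrieval-relevant fields from a SciCat-style job record."""
    params = job.get("jobParams", {})
    result = job.get("jobResultObject", {}).get("result", {})
    return {
        "option": params.get("option"),
        "destination": params.get("destinationPath"),
        "pids": [dataset["pid"] for dataset in job.get("datasetList", [])],
        # Assumption: rc "0" means the archive system reported success.
        "succeeded": result.get("rc") == "0",
    }


# Trimmed copy of the example job shown above.
example_job = json.loads("""
{
  "type": "retrieve",
  "jobParams": {
    "username": "oidc.bliven_s",
    "destinationPath": "/archive/retrieve",
    "option": "PSI-RA"
  },
  "jobStatusMessage": "finishedSuccessful",
  "datasetList": [
    {"pid": "20.500.11935/a1704aba-285b-4f95-b48d-36a10930694f", "files": []}
  ],
  "jobResultObject": {"result": {"rc": "0", "jobid": "76033"}}
}
""")

summary = summarize_retrieve_job(example_job)
```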

Differences to merlin

Merlin does not use DUO or pgroups. Most users use a-groups and may archive from user directories or project directories, which do not correspond 1:1 with a-groups. This means that a mechanism must be added to allow users to select a path when retrieving a dataset.

Implementation steps

The minimal implementation in the backend would require:

  1. A way to grant the service user write access to the destination folder.
    • At first this could be a fixed retrieve directory for each project like ra
    • Better would be a script that sets the appropriate permissions/ACLs on whatever directory the user specifies. This could be incorporated into the datasetRetriever tool, and could validate some permissions at run time (e.g. that the user has permission to read the dataset and permission to write to the destination folder to clean up).
  2. Modify the Job model in the REST API to capture the destination server and path
  3. Modify Arima to write to the correct server and path
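
To make step 2 more concrete, here is one possible shape for the extended job parameters, sketched in Python. The `destinationServer` field, the `"MERLIN"` option value, and the example path are all hypothetical; only `username`, `option` and `destinationPath` appear in the existing job record.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RetrieveParams:
    username: str
    option: str
    destination_server: str  # hypothetical new field, not in the current model
    destination_path: str

    def __post_init__(self) -> None:
        # Reject paths that could be resolved relative to an unknown directory.
        path = PurePosixPath(self.destination_path)
        if not path.is_absolute():
            raise ValueError("destinationPath must be absolute")
        if ".." in path.parts:
            raise ValueError("destinationPath must not contain '..'")

    def to_job_params(self) -> dict:
        """Render the parameters using the camelCase keys of the job record."""
        return {
            "username": self.username,
            "option": self.option,
            "destinationServer": self.destination_server,
            "destinationPath": self.destination_path,
        }


# Placeholder values for illustration only.
params = RetrieveParams(
    username="oidc.bliven_s",
    option="MERLIN",
    destination_server="merlin",
    destination_path="/data/project/example/retrieve",
)
job_params = params.to_job_params()
```

Validating the path when the job is created would let the backend reject obviously bad destinations before Arima touches the tape system.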

Front-end changes:

  1. datasetRetriever modifications to set up the directory, validate settings, and pass the correct paths to SciCat
  2. New SciCat retrieval option with a field for the destination
  3. (Optional) File browser on SciCat to select the files. This would probably require a microservice running somewhere with access to all the central filesystems which would validate user permissions and return file lists.
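
The "validate settings" part of the datasetRetriever change could start with a check like the following. This is only a Python sketch of the idea, not the tool's actual code; `check_destination` is a hypothetical name. Note that `os.access` only tests the user running the script, so confirming that the service user can also write would need a separate check run as that user or an inspection of the ACLs.

```python
import os
import tempfile


def check_destination(path: str) -> list[str]:
    """Return a list of problems with a proposed retrieve destination."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} is not an existing directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        # Write + execute on a directory is needed to create entries in it.
        problems.append(f"{path} is not writable by the current user")
    return problems


# Demonstration on a temporary directory and on a path that no longer exists.
with tempfile.TemporaryDirectory() as tmp:
    ok_report = check_destination(tmp)
missing_report = check_destination(os.path.join(tmp, "does-not-exist"))
```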

@sbliven changed the title from "[WIP] Retrieve datasets to merlin" to "Retrieve datasets to merlin" on Dec 1, 2022

@minottic
Collaborator

@sbliven when would you need this to be implemented?
It will likely need a meeting with Krisz, Pedro and Michael (and us). Could you please schedule it depending on its urgency?
Thanks.

@sbliven
Member Author

sbliven commented Mar 13, 2023

Here's an initial diagram for how the microservice I mention above might work. This "storage service" would run on the storage system and provide endpoints for the following queries:

  • Check if a filesystem is mounted centrally from this storage
  • List writable filesystems for a particular user
  • File browser/navigation (basically wraps ls and cd for central locations, taking user permissions into account)
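
A minimal Python sketch of the file browser query, assuming the service is configured with a central root directory and receives a path relative to it. The function name `list_directory` is hypothetical. The sketch only prevents escaping the root; it does not yet take the requesting user's permissions into account, which the real service would need to do (for example by running the listing as that user).

```python
import tempfile
from pathlib import Path


def list_directory(root: Path, relative: str) -> list[str]:
    """List entries under root/relative, refusing paths that leave root."""
    root = root.resolve()
    target = (root / relative).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"{relative!r} is outside the exported root")
    # Mark directories with a trailing slash, as `ls -p` does.
    return sorted(
        entry.name + ("/" if entry.is_dir() else "") for entry in target.iterdir()
    )


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "sub").mkdir()
    (demo_root / "a.txt").write_text("example")
    listing = list_directory(demo_root, ".")
    try:
        list_directory(demo_root, "../")
        escaped = True
    except PermissionError:
        escaped = False
```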

SciCat would also need to implement an endpoint for checking which storage systems a user has access to (looking ahead to having non-PSI users in the system).

[Image: storage_service_flowchart]
