
Globus integration: further improvements/refactoring #10623

Open
landreev opened this issue Jun 10, 2024 · 3 comments · May be fixed by #10781
Labels
Feature: File Upload & Handling · Feature: Globus · FY25 Sprint 2 · FY25 Sprint 3 · FY25 Sprint 4 · GREI Year 3 · GREI 5 Use Cases · Size: 30 (a percentage of a sprint: 21 hours; formerly size:33)

Comments

@landreev
Contributor

landreev commented Jun 10, 2024

A placeholder issue for keeping track of next steps in improving Globus support after the conclusion of the initial integration effort.

At the moment I am aware of a couple of areas where improvements can be made, both in the old code originally created by the Scholars Portal. While nothing there is broken outright, this refactoring could make the workflow more reliable, especially on a busy production server.

  1. After the Globus service receives confirmation that a Globus upload has completed, it adds the file(s) to the dataset by making an actual /addFiles API call to itself, via curl. Again, this works in the current implementation and we are not aware of any real problems caused by it, but it would be much cleaner for the service to call the appropriate commands/methods directly (see the first sketch below this list).
  2. This one is potentially more important, but may also take more work to implement. As implemented, the asynchronous process of waiting for a Globus upload to complete relies on a service method continuously checking its status remotely, in a loop (with a sleep interval), for the duration of the process. If nothing else, this means that an upload in progress will not survive an application restart before its completion (i.e., the files will end up successfully deposited in the Globus collection, but will not be added to the dataset on the Dataverse side). We need to keep in mind that some real-life big-data uploads will take a very long time, days even, to complete (which is the whole point of Globus). There is also some anecdotal evidence of that looping process having died on the application side while one of the OMAMA data files was being uploaded in production, even without any restarts involved (possibly on account of something unexpected happening during one of the intermediate Globus API calls?). So it would make sense to refactor this part so that it is handled truly asynchronously: perhaps by storing state information about the ongoing uploads in a database table and having a timer job check on all such known uploads every N seconds (see the second sketch below this list).
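To make these concrete: first, a rough sketch of what the direct invocation in item 1 could look like. AddReplaceFileHelper is the helper the /addFiles API endpoint itself uses in the Dataverse code base, but the constructor arguments and the addFiles() signature shown here are assumptions for illustration, not a verified current API:

```java
// Hypothetical method inside the Globus service bean; the injected
// collaborators (ingestService, datasetService, etc.) are assumed available.
private void addFilesDirectly(String jsonData, Dataset dataset,
                              AuthenticatedUser authUser, DataverseRequest request) {
    // Instead of spawning a shell and calling our own HTTP API:
    //   Runtime.getRuntime().exec(new String[] { "bash", "-c",
    //       "curl -H \"X-Dataverse-key:...\" -X POST .../addFiles ... -F jsonData='...'" });
    // ...run the same logic the /addFiles endpoint runs, in-process:
    AddReplaceFileHelper helper = new AddReplaceFileHelper(request, ingestService,
            datasetService, fileService, permissionService, commandEngine, systemConfig);
    Response result = helper.addFiles(jsonData, dataset, authUser); // signature assumed
    logger.info("/addFiles equivalent returned status " + result.getStatus());
}
```

Besides being cleaner, this would remove the dependency on curl/bash inside the application container and the need for the service to hold a valid API token for itself.

Second, a minimal sketch of the database-plus-timer approach from item 2, assuming a hypothetical GlobusTaskInProgress entity that records each upload in flight. An EJB timer polls Globus at a fixed interval; because the state lives in the database rather than in a sleeping thread, pending uploads survive application restarts. All entity and method names here are hypothetical:

```java
import jakarta.ejb.EJB;
import jakarta.ejb.Schedule;
import jakarta.ejb.Singleton;
import jakarta.persistence.EntityManager;
import jakarta.persistence.PersistenceContext;
import java.util.List;

@Singleton
public class GlobusUploadMonitor {

    @EJB
    GlobusServiceBean globusService; // existing Dataverse service bean (assumed)

    @PersistenceContext
    EntityManager em;

    // Fires every 30 seconds. Because pending uploads live in a database
    // table rather than in a sleeping thread, they survive an application
    // restart and are simply picked up again on the next run.
    @Schedule(hour = "*", minute = "*", second = "*/30", persistent = false)
    public void checkPendingUploads() {
        List<GlobusTaskInProgress> pending = em.createQuery(
                "SELECT t FROM GlobusTaskInProgress t", GlobusTaskInProgress.class)
                .getResultList();
        for (GlobusTaskInProgress task : pending) {
            // One quick remote status check per task per timer run, instead
            // of one long-lived loop-with-sleep per upload.
            String status = globusService.getTaskStatus(task.getGlobusTaskId()); // hypothetical
            if ("SUCCEEDED".equals(status)) {
                globusService.finalizeUpload(task);      // hypothetical: add the files to the dataset
                em.remove(task);
            } else if ("FAILED".equals(status)) {
                globusService.cleanupFailedUpload(task); // hypothetical
                em.remove(task);
            }
            // otherwise the transfer is still active; check again next run
        }
    }
}
```

With something like this in place, a long-running transfer (days, in the OMAMA-scale cases) costs nothing between timer runs, and a failure of any single status check only affects that one iteration rather than killing the entire wait.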


@amberleahey

Also, I'll just mention that the Globus Dataverse App code now lives in GDCC's GitHub repository, https://github.com/gdcc/dataverse-globus, and any app-related issues (including ongoing internationalization efforts) will be recorded there.

@nightowlaz

Mentioning this here as we have been doing a lot of testing of uploads of datasets with many files (i.e., 3000+) and have run into an issue with the curl command used to register the upload with Dataverse via the API. We tried various subsets of the files (i.e., 15, 300, 500, 700, 900) and found that an upload to Dataverse using Globus was successful up to somewhere between 400 and 500 files. In every instance the files were successfully uploaded to the S3 bucket via Globus (even the full 3000-file load), but for those over 500 files the call to register them with Dataverse failed and the dataset lock never cleared. We had to manually delete the lock and delete all the files from the bucket.

Examination of the Globus logs showed that the Globus uploads themselves were successful, but the call to Dataverse with that many files failed because the curl command line was too long for the shell environment. The error I see in the Globus log is:

  <message>******* Unexpected Exception while executing api/datasets/:persistentId/add call </message>
  <message>java.io.IOException: Cannot run program "bash": error=7, Argument list too long</message>

So although the files do go on to upload successfully to the bucket via Globus, they never actually get registered in Dataverse.
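For reference, "Argument list too long" is errno 7 (E2BIG): the kernel's ARG_MAX limit on the combined size of a process's arguments and environment, hit here because the entire jsonData payload is passed inline on the bash command line. A hedged stopgap sketch, assuming the curl call is kept for now: write the payload to a temporary file and let curl read the form field from it (curl's documented "name=<file" syntax), and invoke curl directly rather than through "bash -c". Method, variable, and path names below are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper inside the Globus service bean:
private Process callAddFilesApi(String baseUrl, String persistentId,
                                String apiToken, String jsonData) throws IOException {
    // Write the (potentially huge) JSON payload to a temp file...
    Path payload = Files.createTempFile("addFiles", ".json");
    Files.writeString(payload, jsonData);

    // ...and let curl read the form-field value from that file: with the "<"
    // prefix, curl sends the file's contents as the jsonData text field, so
    // the argument vector stays small regardless of the number of files.
    String[] cmd = {
        "curl", "-s",
        "-H", "X-Dataverse-key:" + apiToken,
        "-X", "POST",
        baseUrl + "/api/datasets/:persistentId/addFiles?persistentId=" + persistentId,
        "-F", "jsonData=<" + payload
    };
    // Invoking curl directly (no "bash -c") also avoids shell-quoting pitfalls.
    return new ProcessBuilder(cmd).redirectErrorStream(true).start();
}
```

The cleaner long-term fix, per the discussion above, is of course to not shell out at all.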

@landreev
Contributor Author

Thank you for the report @nightowlaz!
Yes, this is definitely one more reason to get rid of that curl command in the implementation.
I'm going to try to get this issue prioritized ASAP; I've already spoken about it with Stefano here.
