
Globus integration: further improvements/refactoring #10623

Open
landreev opened this issue Jun 10, 2024 · 3 comments · May be fixed by #10781
Labels
Feature: File Upload & Handling · Feature: Globus · FY25 Sprint 2 · FY25 Sprint 3 · FY25 Sprint 4 · GREI Year 3 · GREI 5 Use Cases · Size: 30 (a percentage of a sprint: 21 hours; formerly size:33)

Comments

@landreev
Contributor

landreev commented Jun 10, 2024

A placeholder issue for keeping track of next steps in improving Globus support after the conclusion of the initial integration effort.

At the moment I am aware of a couple of areas where improvements can be made, both in the old code originally created by the Scholars Portal. While nothing there is broken outright, this refactoring could make the workflow more reliable, especially on a busy production server.

  1. After the Globus service receives confirmation that a Globus upload has completed, it adds the file(s) to the dataset by making an actual /addFiles API call to itself, via curl. Again, this works in the current implementation and we are not aware of any real problems caused by it, but it would be much cleaner for the service to call the appropriate commands/methods directly (see the first sketch below this list).
  2. This one is potentially more important, but may also take more work to implement. As implemented, the asynchronous process of waiting for a Globus upload to complete relies on a service method continuously checking its status remotely, in a loop (with a sleep interval), for the duration of the process. If nothing else, this means that an upload in progress will not survive an application restart before its completion (i.e., the files will end up successfully deposited in the Globus collection, but will not be added to the dataset on the Dataverse side). We need to keep in mind that some real-life big-data uploads will take a very long time, days even, to complete (which is the whole point of Globus). There is also some anecdotal evidence of that looping process having died on the application side while one of the OMAMA data files was being uploaded in production, even without any restarts involved (possibly on account of something unexpected happening during one of the intermediate Globus API calls?). So it would make sense to refactor this part so that it is handled truly asynchronously: perhaps by storing state information about the ongoing uploads in a database table and having a timer job check on all such known uploads every N seconds (see the second sketch below this list).
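To make these concrete: first, a rough sketch of what the direct invocation in item 1 could look like. AddReplaceFileHelper is the helper the /addFiles API endpoint itself uses in the Dataverse code base, but the constructor arguments and the addFiles() signature shown here are assumptions for illustration, not a verified current API:

```java
// Hypothetical method inside the Globus service bean; the injected
// collaborators (ingestService, datasetService, etc.) are assumed available.
private void addFilesDirectly(String jsonData, Dataset dataset,
                              AuthenticatedUser authUser, DataverseRequest request) {
    // Instead of spawning a shell and calling our own HTTP API:
    //   Runtime.getRuntime().exec(new String[] { "bash", "-c",
    //       "curl -H \"X-Dataverse-key:...\" -X POST .../addFiles ... -F jsonData='...'" });
    // ...run the same logic the /addFiles endpoint runs, in-process:
    AddReplaceFileHelper helper = new AddReplaceFileHelper(request, ingestService,
            datasetService, fileService, permissionService, commandEngine, systemConfig);
    Response result = helper.addFiles(jsonData, dataset, authUser); // signature assumed
    logger.info("/addFiles equivalent returned status " + result.getStatus());
}
```

Besides being cleaner, this would remove the dependency on curl/bash inside the application container and the need for the service to hold a valid API token for itself.

Second, a minimal sketch of the database-plus-timer approach from item 2, assuming a hypothetical GlobusTaskInProgress entity that records each upload in flight. An EJB timer polls Globus at a fixed interval; because the state lives in the database rather than in a sleeping thread, pending uploads survive application restarts. All entity and method names here are hypothetical:

```java
import jakarta.ejb.EJB;
import jakarta.ejb.Schedule;
import jakarta.ejb.Singleton;
import jakarta.persistence.EntityManager;
import jakarta.persistence.PersistenceContext;
import java.util.List;

@Singleton
public class GlobusUploadMonitor {

    @EJB
    GlobusServiceBean globusService; // existing Dataverse service bean (assumed)

    @PersistenceContext
    EntityManager em;

    // Fires every 30 seconds. Because pending uploads live in a database
    // table rather than in a sleeping thread, they survive an application
    // restart and are simply picked up again on the next run.
    @Schedule(hour = "*", minute = "*", second = "*/30", persistent = false)
    public void checkPendingUploads() {
        List<GlobusTaskInProgress> pending = em.createQuery(
                "SELECT t FROM GlobusTaskInProgress t", GlobusTaskInProgress.class)
                .getResultList();
        for (GlobusTaskInProgress task : pending) {
            // One quick remote status check per task per timer run, instead
            // of one long-lived loop-with-sleep per upload.
            String status = globusService.getTaskStatus(task.getGlobusTaskId()); // hypothetical
            if ("SUCCEEDED".equals(status)) {
                globusService.finalizeUpload(task);      // hypothetical: add the files to the dataset
                em.remove(task);
            } else if ("FAILED".equals(status)) {
                globusService.cleanupFailedUpload(task); // hypothetical
                em.remove(task);
            }
            // otherwise the transfer is still active; check again next run
        }
    }
}
```

With something like this in place, a long-running transfer (days, in the OMAMA-scale cases) costs nothing between timer runs, and a failure of any single status check only affects that one iteration rather than killing the entire wait.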


@amberleahey

Also, I'll just mention that the Globus Dataverse App code now lives in GDCC's GitHub repository, https://github.com/gdcc/dataverse-globus, and any app-related issues (including ongoing internationalization efforts) will be recorded there.

@nightowlaz

Mentioning this here as we have been doing a lot of testing of uploads of datasets with many files (i.e., 3000+) and have run into an issue with the curl command used to register the upload with Dataverse via the API. We tried various subsets of the files (i.e., 15, 300, 500, 700, 900) and found that an upload to Dataverse using Globus was successful up to somewhere between 400 and 500 files. In every instance the files were successfully uploaded to the S3 bucket via Globus (even the full 3000-file load), but for those over 500 files the call to register them with Dataverse failed and the dataset lock never cleared. We had to manually delete the lock and delete all the files from the bucket.

Examination of the Globus logs showed that the Globus uploads themselves were successful, but the call to Dataverse with that many files failed because the curl command line was too long for the shell environment. The error I see in the Globus log is:

  <message>******* Unexpected Exception while executing api/datasets/:persistentId/add call </message>
  <message>java.io.IOException: Cannot run program "bash": error=7, Argument list too long</message>

So although the files do go on to upload successfully to the bucket via Globus, they never actually get registered in Dataverse.
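For reference, "Argument list too long" is errno 7 (E2BIG): the kernel's ARG_MAX limit on the combined size of a process's arguments and environment, hit here because the entire jsonData payload is passed inline on the bash command line. A hedged stopgap sketch, assuming the curl call is kept for now: write the payload to a temporary file and let curl read the form field from it (curl's documented "name=<file" syntax), and invoke curl directly rather than through "bash -c". Method, variable, and path names below are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper inside the Globus service bean:
private Process callAddFilesApi(String baseUrl, String persistentId,
                                String apiToken, String jsonData) throws IOException {
    // Write the (potentially huge) JSON payload to a temp file...
    Path payload = Files.createTempFile("addFiles", ".json");
    Files.writeString(payload, jsonData);

    // ...and let curl read the form-field value from that file: with the "<"
    // prefix, curl sends the file's contents as the jsonData text field, so
    // the argument vector stays small regardless of the number of files.
    String[] cmd = {
        "curl", "-s",
        "-H", "X-Dataverse-key:" + apiToken,
        "-X", "POST",
        baseUrl + "/api/datasets/:persistentId/addFiles?persistentId=" + persistentId,
        "-F", "jsonData=<" + payload
    };
    // Invoking curl directly (no "bash -c") also avoids shell-quoting pitfalls.
    return new ProcessBuilder(cmd).redirectErrorStream(true).start();
}
```

The cleaner long-term fix, per the discussion above, is of course to not shell out at all.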

@landreev
Contributor Author

Thank you for the report @nightowlaz!
Yes, this is definitely one more reason to get rid of that curl command in the implementation.
I'm going to try to get this issue prioritized ASAP; I've already spoken about it with Stefano here.
