Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stress testing resumable transfers #83

Open
tonyhutter opened this issue Sep 29, 2020 · 4 comments
Open

Stress testing resumable transfers #83

tonyhutter opened this issue Sep 29, 2020 · 4 comments

Comments

@tonyhutter
Copy link
Collaborator

I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup
Create 100 random (0-90MB) files per CPU. For each CPU, spawn off a axl_cp -X pthreads <100 files> . killall -9 axl_cp. Resume the transfers and wait for them to finish.

Here's some of the failures I've seen:

  1. Corrupt state_file?
KVTree 1.0.0 ABORT: butte21: Failed to persist hash wrote 9121 bytes != expected 9127 @ /g/g0/hutter2/KVTree/src/kvtree.c:1132

(I've only see this error once)

  1. Can't read state_file
AXL_Create() failed (error -1)
AXL 0.3.0 ERROR: butte21: Couldn't read state file correctly @ axl_alloc_id /g/g0/hutter2/AXL/src/axl.c:89

The state_file defiantly existed since I check it with access() first. Later on I added a mutex in axl_write_state_file() thinking that would help, but still got the error. This error is semi-rare (happens roughly every 4-5 times I run my test).

  1. Files are missing

I often notice that files are missing on the destination side after the resume completes. When I look at the state_file after the killall -9 axl_cp, but before before the resume, I see that it may not have all 100 of the files it should be transferring (it may have like 60 or something). This makes me think it's getting killed in the middle of AXL_Add'ing all its files. The solution here would be to always AXL_Add() all your files before doing a resume. That way it would transfer the missing ones along with resuming the existing ones.

It seems to me that for resumes to truly work, we need either:

  • ACID compliment KVTree implementation

or

  • Don't use state_file at all for resumes. Derive the state information we need by looking at the existing ._AXL destination file. The presence of an ._AXL extension means the file didn't finish transferring or is transferring and can be resumed. And the size of the ._AXL destination file tells you how much of the file has been transferred (and thus the resume offset).
@tonyhutter
Copy link
Collaborator Author

I think not using the state_file for resumes may be the better approach. Consider that even with the state_file, you would still have to re-AXL_Add() all the old files before doing the AXL_Resume() (to protect against crashes while initially adding the files). So if we have to add the files on a resume anyway, and since we can derive all the information to resume from the partially transferred files themselves, we don't need the state_file.

@adammoody
Copy link
Contributor

I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup
Create 100 random (0-90MB) files per CPU. For each CPU, spawn off a axl_cp -X pthreads <100 files> . killall -9 axl_cp. Resume the transfers and wait for them to finish.

Here's some of the failures I've seen:

1. Corrupt state_file?
KVTree 1.0.0 ABORT: butte21: Failed to persist hash wrote 9121 bytes != expected 9127 @ /g/g0/hutter2/KVTree/src/kvtree.c:1132

(I've only see this error once)

1. Can't read state_file
AXL_Create() failed (error -1)
AXL 0.3.0 ERROR: butte21: Couldn't read state file correctly @ axl_alloc_id /g/g0/hutter2/AXL/src/axl.c:89

Nice work, @tonyhutter ! Regardless of what we decide for AXL state files, you've exposed vulnerabilities in kvtree that we should fix up. SCR uses kvtree files in lots of other places that we need to work. One thing that comes to mind is that we could modify kvtree to write to a temp file and then call rename once it is complete instead of truncating and overwriting an existing file like it does now. That should help minimize the chances of reading back corrupt kvtree files. We'll instead get old kvtree files in that case, but at least it should be valid.

Would you please also open an issue on kvtree and refer back to this?

It'd be nice to include a test like this for kvtree.

@tonyhutter
Copy link
Collaborator Author

@adammoody I've opened ECP-VeloC/KVTree#40 for the KVTree issues.

@tonyhutter
Copy link
Collaborator Author

tonyhutter commented Sep 29, 2020

I think I'm going to proceed with the "resumable transfers with no state_file" option then, since I think it will be less prone to error and easier to implement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants