Stress testing resumable transfers #83
I think not using the state_file for resumes may be the better approach. Consider that even with the state_file, you would still have to re-AXL_Add() all the old files before doing the AXL_Resume() (to protect against crashes while initially adding the files). So if we have to add the files on a resume anyway, and since we can derive all the information to resume from the partially transferred files themselves, we don't need the state_file.
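To make the "derive it from the partially transferred files" idea concrete, here is a minimal sketch in plain POSIX C. It is not AXL code: `resume_offset()` and `resume_copy()` are hypothetical helper names, and the sketch assumes the bytes already present in the destination are a correct prefix of the source (true for a sequential copy that was killed, but not verified here).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of AXL): byte offset at which a copy of
 * src to dst should resume, or -1 on error. */
off_t resume_offset(const char *src, const char *dst)
{
    struct stat s, d;
    if (stat(src, &s) != 0)
        return -1;
    if (stat(dst, &d) != 0)
        return (errno == ENOENT) ? 0 : -1;  /* no destination yet: start at 0 */
    /* Never resume past the end of the source. */
    return (d.st_size < s.st_size) ? d.st_size : s.st_size;
}

/* Hypothetical helper: copy the remainder of src into dst, starting at
 * the offset derived from the destination's current size.
 * Returns 0 on success, -1 on error. A destination that is already
 * longer than the source is left untouched (not handled here). */
int resume_copy(const char *src, const char *dst)
{
    off_t off = resume_offset(src, dst);
    if (off < 0)
        return -1;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    int rc = 0;
    if (lseek(in, off, SEEK_SET) < 0 || lseek(out, off, SEEK_SET) < 0)
        rc = -1;

    char buf[65536];
    ssize_t n = 0;
    while (rc == 0 && (n = read(in, buf, sizeof buf)) > 0) {
        char *p = buf;
        while (n > 0) {                     /* handle short writes */
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                rc = -1;
                break;
            }
            p += w;
            n -= w;
        }
    }
    if (n < 0)
        rc = -1;                            /* read error */

    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

With something like this, a resume only needs the original list of source/destination pairs: files missing on the destination start at offset 0, and partially written ones pick up where they stopped.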
Nice work, @tonyhutter! Regardless of what we decide for AXL state files, you've exposed vulnerabilities in kvtree that we should fix up. SCR uses kvtree files in lots of other places that also need to work. One thing that comes to mind is that we could modify kvtree to write to a temp file and then call

Would you please also open an issue on kvtree and refer back to this? It'd be nice to include a test like this for kvtree.
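For reference, one common POSIX pattern for the "write to a temp file" idea is to write the new contents to a temporary file in the same directory, fsync() it, and then rename() it over the real path, so a reader sees either the old file or the complete new one. Whether that is exactly what was meant above is an assumption on my part, and `write_file_atomic()` below is a hypothetical helper, not a kvtree or AXL function.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not a kvtree or AXL function): replace the file
 * at path with len bytes from data, so that a process killed mid-write
 * leaves either the old contents or the new contents, never a mix.
 * Returns 0 on success, -1 on error. */
int write_file_atomic(const char *path, const void *data, size_t len)
{
    /* Temp file lives next to the target so rename() stays on one filesystem. */
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = mkstemp(tmp);              /* unique name, safe across threads */
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {                   /* handle short writes */
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        len -= (size_t)w;
    }

    /* Flush the data to storage before making the file visible. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() atomically replaces path with the fully written temp file. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A kill -9 at any point before the rename() leaves the previous file intact (plus a stray temp file to clean up). The sketch does not fsync the containing directory, which would also be needed to survive a full system crash.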
@adammoody I've opened ECP-VeloC/KVTree#40 for the KVTree issues.
I think I'm going to proceed with the "resumable transfers with no state_file" option then, since I think it will be less prone to error and easier to implement.
I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup

1. Create 100 random (0-90MB) files per CPU.
2. For each CPU, spawn off an axl_cp -X pthreads <100 files>.
3. killall -9 axl_cp.
4. Resume the transfers and wait for them to finish.

Here are some of the failures I've seen:
(I've only seen this error once)
The state_file definitely existed, since I checked it with access() first. Later on I added a mutex in axl_write_state_file() thinking that would help, but still got the error. This error is semi-rare (it happens roughly once every 4-5 runs of my test).
I often notice that files are missing on the destination side after the resume completes. When I look at the state_file after the killall -9 axl_cp, but before the resume, I see that it may not have all 100 of the files it should be transferring (it may have only 60 or so). This makes me think it's getting killed in the middle of AXL_Add'ing all its files. The solution here would be to always AXL_Add() all your files before doing a resume. That way it would transfer the missing ones along with resuming the existing ones.

It seems to me that for resumes to truly work, we need either:
or