[Storage] Investigate rclone mount with VFS caching #3353
Comments
On one GCP node:
Writing a large file to
Writing to a non-mounted path (waited for background cloud sync triggered by above to finish): for some reason slower
#!/bin/bash
# Check if a test file name was provided
if [ -z "$1" ]; then
echo "Usage: $0 <testfile>"
exit 1
fi
# Use the provided argument as the test file name
TESTFILE=$1
# Size of the test file
FILESIZE=4294967296 # 4 GB
# Block size
BLOCKSIZE=65536
# Write test
echo "Starting write test..."
dd if=/dev/urandom of="$TESTFILE" bs="$BLOCKSIZE" count=$((FILESIZE / BLOCKSIZE)) conv=fdatasync oflag=direct 2>&1 | grep -E "copied|bytes"
# Read test
echo "Starting read test..."
dd if="$TESTFILE" of=/dev/null bs="$BLOCKSIZE" 2>&1 | grep -E "copied|bytes"
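To compare runs on mounted and non-mounted paths, it can help to reduce each dd summary line to a single number. The following is a minimal sketch, not part of the original benchmark: `dd_mib_per_sec` is a hypothetical helper, and it assumes GNU dd's usual `<bytes> bytes (...) copied, <seconds> s, <rate>` summary layout printed under `LC_ALL=C`.

```shell
# Convert GNU dd summary lines (read from stdin) into MiB/s.
# Assumes the "<bytes> bytes (...) copied, <seconds> s, <rate>" layout;
# other dd implementations or locales may print something different.
dd_mib_per_sec() {
  awk -F', ' '/copied/ {
    split($1, bytes, " ")        # first field starts with the byte count
    split($(NF - 1), secs, " ")  # second-to-last field is "<seconds> s"
    if (secs[1] > 0) printf "%.1f\n", bytes[1] / secs[1] / 1048576
  }'
}

# Hypothetical usage (benchmark.sh is the script above saved under an assumed name):
#   LC_ALL=C ./benchmark.sh /path/to/testfile | dd_mib_per_sec
```

The helper prints one line per dd summary it sees, so the script above would yield the write throughput followed by the read throughput.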
Some findings after implementing a cache via Rclone:
For reference: draft pull request with the rclone cache implementation: #3455
This is fantastic - thanks for sharing @shethhriday29! What EBS Looks like this is a "high latency high throughput" (FUSE) vs "low latency low throughput" (Rclone, and potentially CSYNC?) situation:
cc @landscapepainter - did you have similar observations for csync? Any thoughts?
Thank you, Romil! Cannot 100% recall for the
Just did some of the
@romilbhardwaj I have not particularly benchmarked the performance on read workloads, but
I'll try to update more details soon, but just wanted to give everyone a heads up.
@romilbhardwaj @shethhriday29 @concretevitamin I'll explain the data I shared in the previous comment in more depth. To measure the time it takes to write a checkpoint with each storage mode, I override the
The time it takes for the training script to write a checkpoint before continuing to train is as follows:
And you can also see this visually in the W&B plots:
The time for the training script to complete writing a checkpoint for
1. It takes quite a while to sync the checkpoints from local to cloud storage, which increases the chance of the user losing a checkpoint when the spot instance gets preempted. The following time plot shows when the training script first starts to write the larger portion of the checkpoints (~60GB) and when the corresponding checkpoint appears on cloud storage:
As you can see, fully syncing between local and cloud storage with rclone vfs takes some time (
2. Some training frameworks require all files of a single checkpoint to exist for that checkpoint to be usable; if any are missing, the framework crashes rather than failing over to another checkpoint, and
But for
As
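The sync delay described in this comment (the gap between the training script finishing a write and the object appearing in the bucket) could be measured with a small polling helper. This is a minimal sketch rather than the method used for the plots above: `wait_until` is a hypothetical function, and the bucket path in the usage comment is made up.

```shell
# Poll a check command until it succeeds, then print the seconds waited.
# Usage: wait_until <timeout_s> <interval_s> <command...>
wait_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local start now
  start="$(date +%s)"
  while ! "$@" >/dev/null 2>&1; do
    now="$(date +%s)"
    if [ $((now - start)) -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval_s"
  done
  echo $(( $(date +%s) - start ))
}

# Hypothetical usage: start this right after the checkpoint write returns.
# `gsutil -q stat` exits 0 once the object exists in the bucket.
#   wait_until 3600 5 gsutil -q stat gs://my-bucket/checkpoints/step_1000/model.pt
```

The resolution is limited by the polling interval, so a short interval is preferable when the expected lag is small.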
@landscapepainter @shethhriday29 In your benchmarks, what are the other VFS args (https://rclone.org/commands/rclone_mount/#vfs-file-caching) set to? Those may affect behavior such as the time taken to sync up to cloud storage.
I was mounting via
@concretevitamin @shethhriday29 Let me look into what other options we have to improve that performance.
@concretevitamin @shethhriday29 @romilbhardwaj I did more research on some feasible options to improve the performance of the current
On top of @shethhriday29's implementation, I added the following two options:
From the following, you can see that with the newly added options:
But we do need additional testing on CPU usage, how large the cache can get and its implications, reading checkpoints when a spot instance gets preempted (especially options related to reading), etc. We should add additional options like
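One way to start on the "how large the cache can get" question is to sample the cache directory size while a workload runs. This is a minimal sketch under assumptions: `cache_size_bytes` is a hypothetical helper, it relies on GNU `du`, and it assumes the cache lives under rclone's default cache directory (`~/.cache/rclone` on Linux) unless `--cache-dir` was set.

```shell
# Print the apparent size in bytes of a directory tree (GNU du).
# Default path is an assumption: rclone's default cache dir on Linux.
cache_size_bytes() {
  du -sb "${1:-$HOME/.cache/rclone}" 2>/dev/null | cut -f1
}

# Hypothetical usage: log the cache size every 10 seconds during training.
#   while sleep 10; do echo "$(date +%T) $(cache_size_bytes)"; done
```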
Two additional discoveries on
Wow, this is awesome @landscapepainter! Are the additional options just
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
https://rclone.org/commands/rclone_mount/#vfs-file-caching
Goal is to (1) improve the performance of dataset reading/writing and checkpoint writing, and (2) support things like appends, compared to regular bucket mounting via mode: MOUNT.
rclone mount with VFS caching seems to use local SSDs for reads/writes while also syncing up to the object storage bucket. This mode may mimic some CSYNC-like functionality (#2336).
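For readers unfamiliar with the mode, an rclone mount with VFS caching might look roughly like the sketch below. It is illustrative only and does not reproduce the options used in this thread: the function name, the `DRY_RUN` switch, the remote `gcs:my-bucket`, the mount point, and every flag value are assumptions. The flags themselves (`--vfs-cache-mode`, `--vfs-cache-max-size`, `--vfs-write-back`, `--transfers`, `--daemon`) are documented rclone options.

```shell
# Mount <remote> at <mountpoint> with the VFS file cache enabled.
# Set DRY_RUN=1 to print the command instead of running it.
mount_with_vfs_cache() {
  local remote="$1" mountpoint="$2"
  # --vfs-cache-mode full   : cache reads and writes on local disk
  # --vfs-cache-max-size    : soft cap on the local cache (value assumed)
  # --vfs-write-back        : delay between file close and upload start
  # --transfers             : number of parallel uploads
  set -- rclone mount "$remote" "$mountpoint" \
    --daemon \
    --vfs-cache-mode full \
    --vfs-cache-max-size 10G \
    --vfs-write-back 5s \
    --transfers 4
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    mkdir -p "$mountpoint" && "$@"
  fi
}

# Hypothetical usage:
#   DRY_RUN=1 mount_with_vfs_cache gcs:my-bucket /mnt/data
```

Lower `--vfs-write-back` values and more `--transfers` should shorten the window in which data exists only on local disk, at the cost of more upload traffic for files that are rewritten often.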