Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

Open
luis-alen opened this issue Dec 10, 2024 · 7 comments
Open

Comments

@luis-alen
Copy link

Describe the bug

I'm trying to migrate data between clusters. The source and dest PVCs use EFS storage class and contain hundreds of thousands of small files. The initial sync operation worked fine with

pv-migrate --ignore-mounted --source=storage --source-namespace=source-ns --source-context=source-ctx --dest-namespace=dest-ns --dest-context=dest-ctx --dest=storage -s lbsvc --lbsvc-timeout 600s --helm-set rsync.retryPeriodSeconds=60,rsync.maxRetries=20

However, when I try to re-sync with the same command, rsync keeps failing with broken pipe messages.

rsync: connection unexpectedly closed (48427915 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [generator=3.2.3]

I believe it could be due to SSH timeouts combined with no incremental recursion. If I manually try to sync with incremental recursion, it works just fine. The problem is that apparently, --no-inc-recursive can't be dynamically disabled unless you overwrite the entire rsync command, which doesn't seem to be possible with load balancer strategy as you don't know the load balancer hostname upfront.

By the way, what's the purpose of using --no-inc-recursive by default? Afaik rsync's default is to use incremental recursion. It splits the file list into manageable chunks, reducing the memory and network pressure. It’s the default for a reason... the way I see it, unless there's a very specific need for --no-inc-recursive, it’s generally better to avoid it (especially when dealing with thousands of small files).

Maybe it makes sense to change this default behavior with an option to overwrite it?

To Reproduce
Steps to reproduce the behavior:

Using the load balancer strategy, sync a pvc containing hundreds of thousands of small files and then attempt to re-sync the changes.

Expected behavior
I expected a subsequent rsync operation to work as the 1st one did.

Console output

rsync: connection unexpectedly closed (48427915 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [generator=3.2.3]

Version

  • source k8s version: v1.24.16, dest k8s version: v1.29.10-eks-7f9249a`
  • source and destination container runtimes [e.g. containerd://1.4.4-k3s2, docker://19.3.6]
  • pv-migrate version 2.0.1 (commit: dbaa3e8) (build date: 2024-05-12T00:16:44Z)
  • homebrew
  • Source and destination PVC type, size and accessModes [e.g. ReadWriteMany, 8G, kubernetes.io/gce-pd -> ReadWriteOnce, N/A, rancher.io/local-path ]
@utkuozdemir
Copy link
Owner

utkuozdemir commented Dec 10, 2024

It is hard to remember exactly, but I believe the reason I used --no-inc-recursive was to be able to get the transfer progress reliably so that the progress bar can be accurate. So I took both --info=progress2 and --no-inc-recursive from online sources on how to get rsync to display progress, e.g., https://serverfault.com/a/637634/163294

But I agree that there can be use cases where it is undesired. I don't know how the progress bar would look like without that flag. An easy solution could be to not set those two flags when --no-progress-bar is specified, but it would be an obscure solution - not discoverable.

So let me think about it - maybe exposing the whole rsync command as a raw flag and allowing it to be overridden can be a good idea, or something like that.

Suggestions are welcome.

@utkuozdemir
Copy link
Owner

utkuozdemir commented Dec 10, 2024

I think I can add a new flag for advanced users, --rsync-command-override, to allow overriding everything. If it is specified, other high-level flags like --delete etc. would be ignored.

@luis-alen
Copy link
Author

@utkuozdemir the entire rsync command can already be overwritten via helm. Can't it? The problem with this approach combined with load balancer strategy is that you can't know the load balancer hostname upfront.

An easy solution could be to not set those two flags when --no-progress-bar is specified, but it would be an obscure solution - not discoverable.

why you don't do it the other way around? add --incremental-recursion, which implicitly disables the progress bar.

By the way I tried running rsync with --no-inc-recursion combined with --timeout and ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3, but I still get the network errors :(

@luis-alen
Copy link
Author

luis-alen commented Dec 10, 2024

Interesting update: --no-inc-recursive worked with ssh options -o ServerAliveInterval=10 -o ServerAliveCountMax=3

Is it possible to overwrite ssh options?

@utkuozdemir
Copy link
Owner

Seems not: https://github.com/utkuozdemir/pv-migrate/blob/master/rsync/cmd.go

Your best bet probably is to do your changes in the code and build your own binary for now.

@utkuozdemir
Copy link
Owner

why you don't do it the other way around? add --incremental-recursion, which implicitly disables the progress bar.

Simply because these kind of requests (specifically rsync command customization actually) come a lot, and I want to address this and any future issues in one go.

In a similar way, in early versions of pv-migrate, the manifests it installed were not helm-based. So I had to add a new toggle/flag for each and every new use case. When I migrated it to helm, it addressed all future requests in a generic way.

I aim to get the same effect for the rsync args, so it'll never be an issue again :)

@luis-alen
Copy link
Author

luis-alen commented Dec 10, 2024

@utkuozdemir the root cause is related to the default AWS load balancer idle timeout (60s) VS the time the ssh session remains idle while rsync calculates files to be transferred with --no-inc-recursive. Of course, this will only happen when you have many files (which is my case with hundreds of thousands of small files)...

pv-migrate configures the ssh server with the following:

ClientAliveInterval 300
ClientAliveCountMax 3

This is greater than the ELB idle timeout setting and doesn't help with my case.

What helps is to set ServerAliveInterval with a value lower than 60s. I confirm that setting ssh -o ServerAliveInterval=59 works, whereas with ssh -o ServerAliveInterval=60, it does not.

Another option is to increase the ELB idle timeout. I also confirmed that changing the ELB timeout to a higher value works without tweaks on the ssh client. I set it to 330 (slightly higher than ssh server ClientAliveInterval - 300).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants