broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

luis-alen · 2024-12-10T11:53:08Z

Describe the bug

I'm trying to migrate data between clusters. The source and dest PVCs use EFS storage class and contain hundreds of thousands of small files. The initial sync operation worked fine with

pv-migrate --ignore-mounted --source=storage --source-namespace=source-ns --source-context=source-ctx --dest-namespace=dest-ns --dest-context=dest-ctx --dest=storage -s lbsvc --lbsvc-timeout 600s --helm-set rsync.retryPeriodSeconds=60,rsync.maxRetries=20

However, when I try to re-sync with the same command, rsync keeps failing with broken pipe messages.

rsync: connection unexpectedly closed (48427915 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [generator=3.2.3]

I believe it could be due to SSH timeouts combined with no incremental recursion. If I manually try to sync with incremental recursion, it works just fine. The problem is that apparently, --no-inc-recursive can't be dynamically disabled unless you overwrite the entire rsync command, which doesn't seem to be possible with load balancer strategy as you don't know the load balancer hostname upfront.

By the way, what's the purpose of using --no-inc-recursive by default? Afaik rsync's default is to use incremental recursion. It splits the file list into manageable chunks, reducing the memory and network pressure. It’s the default for a reason... the way I see it, unless there's a very specific need for --no-inc-recursive, it’s generally better to avoid it (especially when dealing with thousands of small files).

Maybe it makes sense to change this default behavior with an option to overwrite it?

To Reproduce
Steps to reproduce the behavior:

Using the load balancer strategy, sync a pvc containing hundreds of thousands of small files and then attempt to re-sync the changes.

Expected behavior
I expected a subsequent rsync operation to work as the 1st one did.

Console output

rsync: connection unexpectedly closed (48427915 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [generator=3.2.3]

Version

source k8s version: v1.24.16, dest k8s version: v1.29.10-eks-7f9249a`
source and destination container runtimes [e.g. containerd://1.4.4-k3s2, docker://19.3.6]
pv-migrate version 2.0.1 (commit: dbaa3e8) (build date: 2024-05-12T00:16:44Z)
homebrew
Source and destination PVC type, size and accessModes [e.g. ReadWriteMany, 8G, kubernetes.io/gce-pd -> ReadWriteOnce, N/A, rancher.io/local-path ]

The text was updated successfully, but these errors were encountered:

utkuozdemir · 2024-12-10T12:00:57Z

It is hard to remember exactly, but I believe the reason I used --no-inc-recursive was to be able to get the transfer progress reliably so that the progress bar can be accurate. So I took both --info=progress2 and --no-inc-recursive from online sources on how to get rsync to display progress, e.g., https://serverfault.com/a/637634/163294

But I agree that there can be use cases where it is undesired. I don't know how the progress bar would look like without that flag. An easy solution could be to not set those two flags when --no-progress-bar is specified, but it would be an obscure solution - not discoverable.

So let me think about it - maybe exposing the whole rsync command as a raw flag and allowing it to be overridden can be a good idea, or something like that.

Suggestions are welcome.

utkuozdemir · 2024-12-10T12:09:18Z

I think I can add a new flag for advanced users, --rsync-command-override, to allow overriding everything. If it is specified, other high-level flags like --delete etc. would be ignored.

luis-alen · 2024-12-10T12:36:40Z

@utkuozdemir the entire rsync command can already be overwritten via helm. Can't it? The problem with this approach combined with load balancer strategy is that you can't know the load balancer hostname upfront.

An easy solution could be to not set those two flags when --no-progress-bar is specified, but it would be an obscure solution - not discoverable.

why you don't do it the other way around? add --incremental-recursion, which implicitly disables the progress bar.

By the way I tried running rsync with --no-inc-recursion combined with --timeout and ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3, but I still get the network errors :(

luis-alen · 2024-12-10T12:52:58Z

Interesting update: --no-inc-recursive worked with ssh options -o ServerAliveInterval=10 -o ServerAliveCountMax=3

Is it possible to overwrite ssh options?

utkuozdemir · 2024-12-10T12:55:24Z

Seems not: https://github.com/utkuozdemir/pv-migrate/blob/master/rsync/cmd.go

Your best bet probably is to do your changes in the code and build your own binary for now.

utkuozdemir · 2024-12-10T14:27:57Z

why you don't do it the other way around? add --incremental-recursion, which implicitly disables the progress bar.

Simply because these kind of requests (specifically rsync command customization actually) come a lot, and I want to address this and any future issues in one go.

In a similar way, in early versions of pv-migrate, the manifests it installed were not helm-based. So I had to add a new toggle/flag for each and every new use case. When I migrated it to helm, it addressed all future requests in a generic way.

I aim to get the same effect for the rsync args, so it'll never be an issue again :)

luis-alen · 2024-12-10T14:45:53Z

@utkuozdemir the root cause is related to the default AWS load balancer idle timeout (60s) VS the time the ssh session remains idle while rsync calculates files to be transferred with --no-inc-recursive. Of course, this will only happen when you have many files (which is my case with hundreds of thousands of small files)...

pv-migrate configures the ssh server with the following:

ClientAliveInterval 300
ClientAliveCountMax 3

This is greater than the ELB idle timeout setting and doesn't help with my case.

What helps is to set ServerAliveInterval with a value lower than 60s. I confirm that setting ssh -o ServerAliveInterval=59 works, whereas with ssh -o ServerAliveInterval=60, it does not.

Another option is to increase the ELB idle timeout. I also confirmed that changing the ELB timeout to a higher value works without tweaks on the ssh client. I set it to 330 (slightly higher than ssh server ClientAliveInterval - 300).

jusstinas · 2025-01-17T14:34:46Z

Thank you @luis-alen - your suggestions helped a lot!

luis-alen mentioned this issue Dec 11, 2024

Would like to provide a best pratics for AWS PV Migration #257

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

luis-alen commented Dec 10, 2024

utkuozdemir commented Dec 10, 2024 •

edited

Loading

utkuozdemir commented Dec 10, 2024 •

edited

Loading

luis-alen commented Dec 10, 2024

luis-alen commented Dec 10, 2024 •

edited

Loading

utkuozdemir commented Dec 10, 2024

utkuozdemir commented Dec 10, 2024

luis-alen commented Dec 10, 2024 •

edited

Loading

jusstinas commented Jan 17, 2025

broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

broken pipe with lbsvc strategy and --no-inc-recursive when re-syncing #320

Comments

luis-alen commented Dec 10, 2024

utkuozdemir commented Dec 10, 2024 • edited Loading

utkuozdemir commented Dec 10, 2024 • edited Loading

luis-alen commented Dec 10, 2024

luis-alen commented Dec 10, 2024 • edited Loading

utkuozdemir commented Dec 10, 2024

utkuozdemir commented Dec 10, 2024

luis-alen commented Dec 10, 2024 • edited Loading

jusstinas commented Jan 17, 2025

utkuozdemir commented Dec 10, 2024 •

edited

Loading

utkuozdemir commented Dec 10, 2024 •

edited

Loading

luis-alen commented Dec 10, 2024 •

edited

Loading

luis-alen commented Dec 10, 2024 •

edited

Loading