
Errors with --workers via cron #4

Open
klemens-u opened this issue Feb 2, 2018 · 14 comments

Comments

@klemens-u
Contributor

If I run blocksync with --workers from a script in crontab, it always fails:

Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/bin/blocksync.py", line 318, in sync
    p_in.write(DIFF)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
OSError: [Errno 3] No such process

Environment:

  • Ubuntu 16.04 LTS Server
  • Using kpartx to create a block device from an image file
  • Command:
    /root/bin/blocksync.py '--extraparams=-p 2222' --cipher=aes256-ctr --blocksize=4194304 --workers=2 --script=/root/bin/blocksync.py /dev/loop1 [email protected] /srv/libvirt_images/myserver.root.img
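
The second traceback above is secondary: once a worker dies, the parent tears the others down, and killing a PID that has already exited raises [Errno 3] No such process. A minimal sketch of a more forgiving cleanup, assuming the parent simply loops os.kill(pid, signal.SIGTERM) over its workers (this is not code from blocksync.py):

import errno
import os
import signal

def terminate_quietly(pid):
    # Send SIGTERM, but ignore workers that have already exited (ESRCH).
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as exc:
        if exc.errno != errno.ESRCH:  # ESRCH == [Errno 3] No such process
            raise
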
@ct16k
Collaborator

ct16k commented Feb 22, 2018

Can you do a tcpdump while the cronjob is running? And what is your network setup like?
Seems like the script believes it successfully connected to the remote system but then can't talk to it.

@ct16k
Copy link
Collaborator

ct16k commented Mar 2, 2018

Is this related to #5 and can be closed, or is it a different issue?

@klemens-u
Contributor Author

This is a different issue. I'll try to send you the requested info soon.

@davidbartonau

I had the same issue, with the error "Error copying blocksync to the remote host!"; however, --splay 2 fixed it. PS: this is a brilliant tool, @ct16k, thanks for sharing!

@davidbartonau

@ct16k that fix doesn't always work. I have a replication that works 100% of the time when run by hand with 6 workers, and fails 100% of the time on the first status update (I run with --interval=60, so it takes a minute to fail):

[worker 2] Remote fadvise: None
[worker 2] Start syncing 1195 blocks...
[worker 1] Remote fadvise: None
[worker 1] Start syncing 1195 blocks...
[worker 0] Remote fadvise: None
[worker 0] Start syncing 1195 blocks...
[worker 4] Remote fadvise: None
[worker 5] Remote fadvise: None
[worker 3] Remote fadvise: None
[worker 4] Start syncing 1195 blocks...
[worker 5] Start syncing 1195 blocks...
[worker 3] Start syncing 1195 blocks...

Worker #4 exited with 256
Worker #1 exited with 15
Worker #5 exited with 15
Worker #2 exited with 15
Worker #3 exited with 15
Worker #0 exited with 15
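
Those numbers look like raw wait statuses; a short sketch of decoding them, assuming blocksync prints the value exactly as os.waitpid() returned it:

import os

def describe(status):
    # Decode a raw waitpid()-style status word.
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "unknown status %d" % status

print(describe(256))  # exited with code 1: the worker that hit the exception
print(describe(15))   # killed by signal 15 (SIGTERM): terminated by the parent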

@davidbartonau

Running with multiple workers definitely helps overcome network latency, even when your disks don't increase throughput with multiple workers.

@davidbartonau

I have a replication that works 100% when run by hand with 6 workers and 100% fails on the first status update (I run with --interval=60 so it takes a minute to fail)

To be clear, I mean it works when run in a terminal or even in a disconnected tmux session. When run via cron or at, it fails with multiple workers; switching to a single worker then works. As the network has high(ish) latency, that is a bit of a pain.
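
A quick probe of that environment difference, just as a sketch (not part of blocksync.py): under cron or at the standard streams stop being a TTY, while a terminal or even a detached tmux session still provides a pty, which changes buffering for the worker processes:

import sys

# Under cron/at these are typically pipes or /dev/null; in a terminal or a
# tmux session they are a pty, so isatty() comes back True.
for name, stream in (("stdin", sys.stdin), ("stdout", sys.stdout), ("stderr", sys.stderr)):
    print("%s isatty: %s" % (name, stream.isatty()))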

@davidbartonau

@theraser I have tried adding debug logging on both the sender and the receiver.

The sender ends up failing with (line numbers may differ as I have added logging):

Traceback (most recent call last):
  File "blocksync.py", line 439, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "blocksync.py", line 334, in sync
    p_in.write(DIFF)

To enable logging on the receiver, I used a custom Python wrapper: mypython.txt

The worker that fails always has much more logging to stdout (maybe coincidental?) and nothing in stderr, whereas the other workers always have a broken pipe error (presumably because the ssh process is killed).

-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.eb7YkKE9zD.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.L2OVGAD4uM.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.YAG8eaL2mp.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.77tm1d6hgK.stderr
-rw-r--r-- 1 root root     0 Sep 15 16:33 tmp.LN62mPrfFD.stderr
-rw-r--r-- 1 root root 53883 Sep 15 16:33 tmp.OdhwrXRMka.stdout
-rw-r--r-- 1 root root 46139 Sep 15 16:33 tmp.eb7YkKE9zD.stdout
-rw-r--r-- 1 root root 42427 Sep 15 16:33 tmp.YAG8eaL2mp.stdout
-rw-r--r-- 1 root root 51515 Sep 15 16:33 tmp.77tm1d6hgK.stdout
-rw-r--r-- 1 root root 41147 Sep 15 16:33 tmp.L2OVGAD4uM.stdout
-rw-r--r-- 1 root root 76667 Sep 15 16:33 tmp.LN62mPrfFD.stdout
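
The mypython.txt attachment is not reproduced here, but a hypothetical wrapper along those lines could capture the server's stderr in one temp file and tee its stdout into another while still forwarding it, since blocksync uses stdout as the data channel back to the sender (file names and paths are placeholders):

#!/usr/bin/env python
# Hypothetical logging wrapper, not the actual mypython.txt attachment.
import os
import subprocess
import sys
import tempfile

out_log = tempfile.NamedTemporaryFile(prefix="tmp.", suffix=".stdout", dir="/tmp", delete=False)
err_log = tempfile.NamedTemporaryFile(prefix="tmp.", suffix=".stderr", dir="/tmp", delete=False)

# Run the real interpreter with the same arguments, logging stderr directly.
child = subprocess.Popen([sys.executable] + sys.argv[1:],
                         stdout=subprocess.PIPE, stderr=err_log)

# Tee the protocol stream: copy everything to the log while forwarding it
# unchanged to the original stdout (the pipe back to the sender).
real_stdout = os.fdopen(os.dup(1), "wb", 0)
while True:
    chunk = child.stdout.read(65536)
    if not chunk:
        break
    out_log.write(chunk)
    real_stdout.write(chunk)
sys.exit(child.wait())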

@davidbartonau

With some further digging, it looks like the sender process is attempting to write after the receiver thinks it is finished:

[worker 0] stderr: debug1: Sending environment.^M
[worker 0] stderr: debug1: Sending env LANG = en_AU.UTF-8^M
[worker 0] stderr: debug1: Sending command: /root/mypython /tmp/tmp.T963KHKglF tmpserver /dev/vg-decrypted-ssd/prod-REMOVED-bastion-disk -b 1048576 -d 3 -1 sha512 ^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype exit-status reply 0^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0^M
[worker 0] stderr: debug1: channel 0: free: client-session, nchannels 1^M
[worker 0] stderr: debug1: fd 0 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 1 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 2 clearing O_NONBLOCK^M
[worker 0] stderr: Transferred: sent 5382364, received 107348 bytes, in 48.3 seconds^M
[worker 0] stderr: Bytes per second: sent 111324.0, received 2220.3^M
[worker 0] stderr: debug1: Exit status 0^M
[worker 0] stderr: debug1: compress outgoing: raw data 24142745, compressed 5336923, factor 0.22^M
[worker 0] stderr: debug1: compress incoming: raw data 90032, compressed 87626, factor 0.97^M

It looks like the ssh session finishes normally, but the sender keeps sending.

@davidbartonau

davidbartonau commented Sep 15, 2019

The error seems to be on the very last block:

[worker 0] Unexpected error on block 1195

It loops over and writes every other block fine, but by the last block the receiver has already closed, presumably after writing the hash.
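
That matches the traceback: a self-contained sketch of the same symptom (not using blocksync.py), where the receiver is simulated by true, which exits without reading, so the last write hits a closed pipe:

import subprocess

# Simulate a receiver that has already finished: `true` exits immediately,
# so the read end of its stdin pipe is gone by the time we write.
receiver = subprocess.Popen(["true"], stdin=subprocess.PIPE)
receiver.wait()
try:
    receiver.stdin.write(b"\0" * 1048576)  # analogous to p_in.write(DIFF) on the last block
    receiver.stdin.flush()
except IOError as exc:  # BrokenPipeError on Python 3
    print("last write failed:", exc)  # [Errno 32] Broken pipe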

@ndusart

ndusart commented Mar 13, 2021

It also fails when sending a file to a localhost path, so it is not tied to SSH.

Once a worker has finished, all other workers fail immediately.

Here is the output I get when using 8 workers (issuing python /root/blocksync.py -w 8 -T "/var/lib/libvirt/images/dev.qcow2" localhost "/nas/vm/.working.dev.qcow2"):

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #7 (pid: 953229)
[worker 7] Block size is 1.0 MB
[worker 7] Local fadvise: DONTNEED
[worker 7] Chunk size is 32004.9 MB, offset is 234917093376
[worker 7] Hash 1: sha512
[worker 7] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 7] Remote fadvise: None
[worker 7] Start syncing 32005 blocks...
[worker 7] same: 1944, diff: 30061, 32005/32005,   6.6 MB/s
[worker 7] Completed in 1:20:18

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #0 (pid: 953206)
[worker 0] Block size is 1.0 MB
[worker 0] Local fadvise: DONTNEED
[worker 0] Chunk size is 32004.9 MB, offset is 0
[worker 0] Hash 1: sha512
[worker 0] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 0] Remote fadvise: None
[worker 0] Start syncing 32005 blocks...
Traceback (most recent call last):
  File "/root/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/blocksync.py", line 318, in sync
    p_in.write(DIFF)
BrokenPipeError: [Errno 32] Broken pipe

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Worker #7 exited with 0
Worker #0 exited with 256
Worker #0 caused ABORT
Terminating worker #1
Terminating worker #2
Terminating worker #3
Terminating worker #4
Terminating worker #5
Terminating worker #6
Terminating worker #7
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe

This happens when starting the process with nohup or in a cron job. When started directly in a shell, it completes successfully.

@ndusart

ndusart commented May 25, 2021

Actually, it only takes a stdout redirection to reproduce the issue.

This command line has the exact same problem:

python /root/blocksync.py -w 4 -T src localhost dest >/dev/null

It seems like the first fork that finishes its blocks closes its file descriptor, and the other forks end up with broken pipes on their stdin/stdout.

@ct16k @theraser any clue on this?
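
A hypothetical reproduction harness for the redirection hypothesis, using the placeholder paths from the command above (src and dest need to exist for the runs to do anything): run the same command once with stdout on the terminal and once sent to /dev/null, then compare the return codes:

import subprocess

cmd = ["python", "/root/blocksync.py", "-w", "4", "-T", "src", "localhost", "dest"]

attached = subprocess.call(cmd)                               # stdout stays on the terminal
redirected = subprocess.call(cmd, stdout=subprocess.DEVNULL)  # same as ">/dev/null"
print("attached run: %d, redirected run: %d" % (attached, redirected))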

DRRDietrich added a commit to DRRDietrich/blocksync that referenced this issue Oct 8, 2021
@ndusart

ndusart commented Feb 2, 2024

As this project seems unmaintained, I'll allow myself to promote a tool I wrote to replace blocksync, which mainly solves this issue. Unfortunately I didn't find a way to fix the issue by patching this project, so I needed to start a new one.

So if this issue still affects some people, you can check out deltasync.

It also has some performance improvements to maximize IO bandwidth and the CPU usage available for checksum computation, so you should see an improvement in effective bandwidth too.

@marcin-github

There is also https://github.com/nethappen/blocksync-fast
