
Errors with --workers via cron #4

Open
klemens-u opened this issue Feb 2, 2018 · 14 comments

Comments

@klemens-u
Contributor

If I run blocksync with --workers from a script in crontab, it always fails:

Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/bin/blocksync.py", line 318, in sync
    p_in.write(DIFF)
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/bin/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
OSError: [Errno 3] No such process

Environment:

  • Ubuntu 16.04 LTS Server
  • Using kpartx to create a block device from an image file
  • Command:
    /root/bin/blocksync.py '--extraparams=-p 2222' --cipher=aes256-ctr --blocksize=4194304 --workers=2 --script=/root/bin/blocksync.py /dev/loop1 [email protected] /srv/libvirt_images/myserver.root.img
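
The second traceback above is secondary: once a worker dies, the parent tears the others down, and killing a PID that has already exited raises [Errno 3] No such process. A minimal sketch of a more forgiving cleanup, assuming the parent simply loops os.kill(pid, signal.SIGTERM) over its workers (this is not code from blocksync.py):

import errno
import os
import signal

def terminate_quietly(pid):
    # Send SIGTERM, but ignore workers that have already exited (ESRCH).
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as exc:
        if exc.errno != errno.ESRCH:  # ESRCH == [Errno 3] No such process
            raise
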
@ct16k
Collaborator

ct16k commented Feb 22, 2018

Can you do a tcpdump while the cronjob is running? And what is your network setup like?
Seems like the script believes it successfully connected to the remote system but then can't talk to it.

@ct16k
Copy link
Collaborator

ct16k commented Mar 2, 2018

Is this related to #5 and can be closed, or is it a different issue?

@klemens-u
Contributor Author

This is a different issue. I'll try to send you the requested info soon.

@davidbartonau

I had the same issue, with the error "Error copying blocksync to the remote host!"; however, --splay 2 fixed it. PS: this is a brilliant tool, @ct16k, thanks for sharing!

@davidbartonau

@ct16k that fix doesn't always work. I have a replication that works 100% of the time when run by hand with 6 workers, and fails 100% of the time on the first status update (I run with --interval=60, so it takes a minute to fail):

[worker 2] Remote fadvise: None
[worker 2] Start syncing 1195 blocks...
[worker 1] Remote fadvise: None
[worker 1] Start syncing 1195 blocks...
[worker 0] Remote fadvise: None
[worker 0] Start syncing 1195 blocks...
[worker 4] Remote fadvise: None
[worker 5] Remote fadvise: None
[worker 3] Remote fadvise: None
[worker 4] Start syncing 1195 blocks...
[worker 5] Start syncing 1195 blocks...
[worker 3] Start syncing 1195 blocks...

Worker #4 exited with 256
Worker #1 exited with 15
Worker #5 exited with 15
Worker #2 exited with 15
Worker #3 exited with 15
Worker #0 exited with 15
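
Those numbers look like raw wait statuses; a short sketch of decoding them, assuming blocksync prints the value exactly as os.waitpid() returned it:

import os

def describe(status):
    # Decode a raw waitpid()-style status word.
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "unknown status %d" % status

print(describe(256))  # exited with code 1: the worker that hit the exception
print(describe(15))   # killed by signal 15 (SIGTERM): terminated by the parent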

@davidbartonau

Running with multiple workers definitely helps overcome network latency, even when your disks don't increase throughput with multiple workers.

@davidbartonau

I have a replication that works 100% when run by hand with 6 workers and 100% fails on the first status update (I run with --interval=60 so it takes a minute to fail)

To be clear, I mean it works when run in a terminal or even in a disconnected tmux session. When run via cron or at, it fails with multiple workers; switching to a single worker then works. As the network has high(ish) latency, that is a bit of a pain.
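
A quick probe of that environment difference, just as a sketch (not part of blocksync.py): under cron or at the standard streams stop being a TTY, while a terminal or even a detached tmux session still provides a pty, which changes buffering for the worker processes:

import sys

# Under cron/at these are typically pipes or /dev/null; in a terminal or a
# tmux session they are a pty, so isatty() comes back True.
for name, stream in (("stdin", sys.stdin), ("stdout", sys.stdout), ("stderr", sys.stderr)):
    print("%s isatty: %s" % (name, stream.isatty()))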

@davidbartonau

@theraser I have tried adding debug logging on both the sender and the receiver.

The sender ends up failing with (line numbers may differ as I have added logging):

Traceback (most recent call last):
  File "blocksync.py", line 439, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "blocksync.py", line 334, in sync
    p_in.write(DIFF)

To enable logging on the receiver, I used a custom Python wrapper: mypython.txt

The worker that fails always has much more logging to stdout (maybe coincidental?) and nothing in stderr, whereas the other workers always have a broken pipe error (presumably because the ssh process is killed).

-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.eb7YkKE9zD.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.L2OVGAD4uM.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.YAG8eaL2mp.stderr
-rw-r--r-- 1 root root   222 Sep 15 16:33 tmp.77tm1d6hgK.stderr
-rw-r--r-- 1 root root     0 Sep 15 16:33 tmp.LN62mPrfFD.stderr
-rw-r--r-- 1 root root 53883 Sep 15 16:33 tmp.OdhwrXRMka.stdout
-rw-r--r-- 1 root root 46139 Sep 15 16:33 tmp.eb7YkKE9zD.stdout
-rw-r--r-- 1 root root 42427 Sep 15 16:33 tmp.YAG8eaL2mp.stdout
-rw-r--r-- 1 root root 51515 Sep 15 16:33 tmp.77tm1d6hgK.stdout
-rw-r--r-- 1 root root 41147 Sep 15 16:33 tmp.L2OVGAD4uM.stdout
-rw-r--r-- 1 root root 76667 Sep 15 16:33 tmp.LN62mPrfFD.stdout
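
The mypython.txt attachment is not reproduced here, but a hypothetical wrapper along those lines could capture the server's stderr in one temp file and tee its stdout into another while still forwarding it, since blocksync uses stdout as the data channel back to the sender (file names and paths are placeholders):

#!/usr/bin/env python
# Hypothetical logging wrapper, not the actual mypython.txt attachment.
import os
import subprocess
import sys
import tempfile

out_log = tempfile.NamedTemporaryFile(prefix="tmp.", suffix=".stdout", dir="/tmp", delete=False)
err_log = tempfile.NamedTemporaryFile(prefix="tmp.", suffix=".stderr", dir="/tmp", delete=False)

# Run the real interpreter with the same arguments, logging stderr directly.
child = subprocess.Popen([sys.executable] + sys.argv[1:],
                         stdout=subprocess.PIPE, stderr=err_log)

# Tee the protocol stream: copy everything to the log while forwarding it
# unchanged to the original stdout (the pipe back to the sender).
real_stdout = os.fdopen(os.dup(1), "wb", 0)
while True:
    chunk = child.stdout.read(65536)
    if not chunk:
        break
    out_log.write(chunk)
    real_stdout.write(chunk)
sys.exit(child.wait())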

@davidbartonau

With some further digging, it looks like the sender process is attempting to write after the receiver thinks it is finished:

[worker 0] stderr: debug1: Sending environment.^M
[worker 0] stderr: debug1: Sending env LANG = en_AU.UTF-8^M
[worker 0] stderr: debug1: Sending command: /root/mypython /tmp/tmp.T963KHKglF tmpserver /dev/vg-decrypted-ssd/prod-REMOVED-bastion-disk -b 1048576 -d 3 -1 sha512 ^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype exit-status reply 0^M
[worker 0] stderr: debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0^M
[worker 0] stderr: debug1: channel 0: free: client-session, nchannels 1^M
[worker 0] stderr: debug1: fd 0 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 1 clearing O_NONBLOCK^M
[worker 0] stderr: debug1: fd 2 clearing O_NONBLOCK^M
[worker 0] stderr: Transferred: sent 5382364, received 107348 bytes, in 48.3 seconds^M
[worker 0] stderr: Bytes per second: sent 111324.0, received 2220.3^M
[worker 0] stderr: debug1: Exit status 0^M
[worker 0] stderr: debug1: compress outgoing: raw data 24142745, compressed 5336923, factor 0.22^M
[worker 0] stderr: debug1: compress incoming: raw data 90032, compressed 87626, factor 0.97^M

It looks like the ssh session finishes normally, but the sender keeps sending.

@davidbartonau

davidbartonau commented Sep 15, 2019

The error seems to be on the very last block:

[worker 0] Unexpected error on block 1195

It loops over and writes every other block fine, but by the last block the receiver has already closed, presumably after writing the hash.
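
That matches the traceback: a self-contained sketch of the same symptom (not using blocksync.py), where the receiver is simulated by true, which exits without reading, so the last write hits a closed pipe:

import subprocess

# Simulate a receiver that has already finished: `true` exits immediately,
# so the read end of its stdin pipe is gone by the time we write.
receiver = subprocess.Popen(["true"], stdin=subprocess.PIPE)
receiver.wait()
try:
    receiver.stdin.write(b"\0" * 1048576)  # analogous to p_in.write(DIFF) on the last block
    receiver.stdin.flush()
except IOError as exc:  # BrokenPipeError on Python 3
    print("last write failed:", exc)  # [Errno 32] Broken pipe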

@ndusart

ndusart commented Mar 13, 2021

It also fails when sending a file to a localhost path, so it is not tied to SSH.

Once a worker has finished, all other workers fail immediately.

Here is the output I get when using 8 workers (issuing python /root/blocksync.py -w 8 -T "/var/lib/libvirt/images/dev.qcow2" localhost "/nas/vm/.working.dev.qcow2"):

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #7 (pid: 953229)
[worker 7] Block size is 1.0 MB
[worker 7] Local fadvise: DONTNEED
[worker 7] Chunk size is 32004.9 MB, offset is 234917093376
[worker 7] Hash 1: sha512
[worker 7] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 7] Remote fadvise: None
[worker 7] Start syncing 32005 blocks...
[worker 7] same: 1944, diff: 30061, 32005/32005,   6.6 MB/s
[worker 7] Completed in 1:20:18

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting worker #0 (pid: 953206)
[worker 0] Block size is 1.0 MB
[worker 0] Local fadvise: DONTNEED
[worker 0] Chunk size is 32004.9 MB, offset is 0
[worker 0] Hash 1: sha512
[worker 0] Running: python2 /root/blocksync.py server /nas/vm/.working.dev.qcow2 -b 1048576 -d 3 -1 sha512
[worker 0] Remote fadvise: None
[worker 0] Start syncing 32005 blocks...
Traceback (most recent call last):
  File "/root/blocksync.py", line 415, in <module>
    sync(i, srcdev, dsthost, dstdev, options)
  File "/root/blocksync.py", line 318, in sync
    p_in.write(DIFF)
BrokenPipeError: [Errno 32] Broken pipe

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                          !!!
!!! DESTINATION WILL BE PERMANENTLY CHANGED! !!!
!!!         PRESS CTRL-C NOW TO EXIT         !!!
!!!                                          !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Worker #7 exited with 0
Worker #0 exited with 256
Worker #0 caused ABORT
Terminating worker #1
Terminating worker #2
Terminating worker #3
Terminating worker #4
Terminating worker #5
Terminating worker #6
Terminating worker #7
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 431, in <module>
    os.kill(pid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/root/blocksync.py", line 384, in <module>
    server(dstdev, False, options)
  File "/root/blocksync.py", line 130, in server
    stdout.flush()
IOError: [Errno 32] Broken pipe

This happens when starting the process with nohup or in a cron job. When started directly in a shell, it completes successfully.

@ndusart

ndusart commented May 25, 2021

Actually, it only takes a stdout redirection to reproduce the issue.

This command line has the exact same problem:

python /root/blocksync.py -w 4 -T src localhost dest >/dev/null

It seems like the first fork that finishes its blocks closes its file descriptor, and the other forks end up with broken pipes on their stdin/stdout.

@ct16k @theraser any clue on this?
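
A hypothetical reproduction harness for the redirection hypothesis, using the placeholder paths from the command above (src and dest need to exist for the runs to do anything): run the same command once with stdout on the terminal and once sent to /dev/null, then compare the return codes:

import subprocess

cmd = ["python", "/root/blocksync.py", "-w", "4", "-T", "src", "localhost", "dest"]

attached = subprocess.call(cmd)                               # stdout stays on the terminal
redirected = subprocess.call(cmd, stdout=subprocess.DEVNULL)  # same as ">/dev/null"
print("attached run: %d, redirected run: %d" % (attached, redirected))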

DRRDietrich added a commit to DRRDietrich/blocksync that referenced this issue Oct 8, 2021
@ndusart

ndusart commented Feb 2, 2024

As this project seems unmaintained, I'll allow myself to promote a tool I wrote to replace blocksync, which mainly solves this issue. Unfortunately I didn't find a way to fix the issue by patching this project, so I needed to start a new one.

So if this issue still affects some people, you can check out deltasync.

It also has some performance improvements to maximize IO bandwidth and the CPU usage available for checksum computation, so you should see an improvement in effective bandwidth too.

@marcin-github

There is also https://github.com/nethappen/blocksync-fast
