Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redundant read in patch/read_cb #4

Open
asoki opened this issue Apr 29, 2014 · 1 comment
Open

redundant read in patch/read_cb #4

asoki opened this issue Apr 29, 2014 · 1 comment

Comments

@asoki
Copy link

asoki commented Apr 29, 2014

in the function patch, the read_cb callback reads the same file position repeated.
this leads to poor performance on big files

how to repeat:
to be able to see the problem, add a line to the read_cb function
see:
https://github.com/smartfile/python-librsync/blob/master/librsync/__init__.py#L221

add the follwing line before the f.seek(pos) line
print "pos:",pos," length:",length

the code should now look:

...
def read_cb(opaque, pos, length, buff):
        print "pos:",pos," length:",length
        f.seek(pos)
        block = f.read(length)
....

now store this testcase as file and execute it:

import librsync, os, shutil
#create the datfiles only once
file_1='1mb_a.dat'
file_2='1mb_b.dat'
file_new='1mb_c.dat'

if (not os.path.exists(file_1)):
    rnd = open('/dev/random','rb')
    dst = open(file_1, 'wb')
    dst.seek(0)
    dst.write(rnd.read(1000000));
    dst.close()
    rnd.close()

if (not os.path.exists(file_2)):
    src = open(file_1,'rb')
    cnt = src.read()
    src.close()

    #make a change in the content of file_1
    cnt_a=bytearray(cnt)
    if (cnt_a[10] != 'A'):
        cnt_a[10] = 'A'
    else:
        cnt_a[10] = 'B'

    #and store the changed content as file_2
    dst = open(file_2, 'wb')
    dst.seek(0)
    dst.write(str(cnt_a))
    dst.close()

#now create the signature, delta
dst = open(file_1, 'rb')
src = open(file_2, 'rb')

synced = open(file_new, 'wb')
signature = librsync.signature(dst)
delta = librsync.delta(src, signature)

# Step 3: synchronize the files.
librsync.patch(dst, delta, synced)

the output is:
python redundant-read.py
pos: 0 length: 2048
pos: 0 length: 65536
pos: 0 length: 131072
pos: 0 length: 196608
pos: 0 length: 262144
pos: 0 length: 327680
pos: 0 length: 393216
pos: 0 length: 458752
pos: 0 length: 524288
pos: 0 length: 589824
pos: 0 length: 655360
pos: 0 length: 720896
pos: 0 length: 786432
pos: 0 length: 851968
pos: 0 length: 917504
pos: 0 length: 983040

here, the read_cb callback reads the file from position 0 again and again

@btimby
Copy link
Contributor

btimby commented Apr 30, 2014

The callback is called by librsync. Thus there is no direct control over the parameters that are provided.

You could provide a kwarg to the patch function that allows the caller to provide their own callback, rather than using the default one. You could then provide your own buffering callback function.

It is possible the OS is buffering the read data already, obviating the need for actually going to disk, but YMMV.

On the other hand perhaps some changes could happen in librsync itself that would eliminate this reading pattern (algorithm optimizations). Of course librsync is old and mature so that is likely easier said (or typed) than done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants