redundant read in patch/read_cb #4

asoki · 2014-04-29T09:43:56Z

in the function patch, the read_cb callback reads the same file position repeated.
this leads to poor performance on big files

how to repeat:
to be able to see the problem, add a line to the read_cb function
see:
https://github.com/smartfile/python-librsync/blob/master/librsync/__init__.py#L221

add the follwing line before the f.seek(pos) line
print "pos:",pos," length:",length

the code should now look:

...
def read_cb(opaque, pos, length, buff):
        print "pos:",pos," length:",length
        f.seek(pos)
        block = f.read(length)
....

now store this testcase as file and execute it:

import librsync, os, shutil
#create the datfiles only once
file_1='1mb_a.dat'
file_2='1mb_b.dat'
file_new='1mb_c.dat'

if (not os.path.exists(file_1)):
    rnd = open('/dev/random','rb')
    dst = open(file_1, 'wb')
    dst.seek(0)
    dst.write(rnd.read(1000000));
    dst.close()
    rnd.close()

if (not os.path.exists(file_2)):
    src = open(file_1,'rb')
    cnt = src.read()
    src.close()

    #make a change in the content of file_1
    cnt_a=bytearray(cnt)
    if (cnt_a[10] != 'A'):
        cnt_a[10] = 'A'
    else:
        cnt_a[10] = 'B'

    #and store the changed content as file_2
    dst = open(file_2, 'wb')
    dst.seek(0)
    dst.write(str(cnt_a))
    dst.close()

#now create the signature, delta
dst = open(file_1, 'rb')
src = open(file_2, 'rb')

synced = open(file_new, 'wb')
signature = librsync.signature(dst)
delta = librsync.delta(src, signature)

# Step 3: synchronize the files.
librsync.patch(dst, delta, synced)

the output is:
python redundant-read.py
pos: 0 length: 2048
pos: 0 length: 65536
pos: 0 length: 131072
pos: 0 length: 196608
pos: 0 length: 262144
pos: 0 length: 327680
pos: 0 length: 393216
pos: 0 length: 458752
pos: 0 length: 524288
pos: 0 length: 589824
pos: 0 length: 655360
pos: 0 length: 720896
pos: 0 length: 786432
pos: 0 length: 851968
pos: 0 length: 917504
pos: 0 length: 983040

here, the read_cb callback reads the file from position 0 again and again

btimby · 2014-04-30T00:23:30Z

The callback is called by librsync. Thus there is no direct control over the parameters that are provided.

You could provide a kwarg to the patch function that allows the caller to provide their own callback, rather than using the default one. You could then provide your own buffering callback function.

It is possible the OS is buffering the read data already, obviating the need for actually going to disk, but YMMV.

On the other hand perhaps some changes could happen in librsync itself that would eliminate this reading pattern (algorithm optimizations). Of course librsync is old and mature so that is likely easier said (or typed) than done.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

redundant read in patch/read_cb #4

redundant read in patch/read_cb #4

asoki commented Apr 29, 2014

btimby commented Apr 30, 2014

redundant read in patch/read_cb #4

redundant read in patch/read_cb #4

Comments

asoki commented Apr 29, 2014

btimby commented Apr 30, 2014