ZarrWriter speedup #195
Conversation
def threadedWriteSingleArray(self, *args, **kwargs):
    self.futures.append(self.executor.submit(self.writeSingleArray, *args))
    fs = []
    with lock_timeout(self.futures_lock):
        for f in self.futures:
            if f.done():
                f.result()
            else:
                fs.append(f)
        self.futures = fs

versus the other side of the diff, which just submits and appends the new future under the lock:

    f = self.executor.submit(self.writeSingleArray, *args)
    with lock_timeout(self.futures_lock):
        self.futures.append(f)
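(For context, lock_timeout is a helper from this codebase; a minimal sketch of what such a context manager might look like, assuming it just wraps threading.Lock.acquire with a timeout - the real implementation may differ:)

# Hypothetical sketch of a lock_timeout helper; the repo's actual implementation may differ.
import threading
from contextlib import contextmanager

@contextmanager
def lock_timeout(lock: threading.Lock, timeout: float = 1.0):
    # Try to acquire the lock, raising if it can't be taken within `timeout` seconds
    acquired = lock.acquire(timeout=timeout)
    if not acquired:
        raise TimeoutError(f"could not acquire lock within {timeout} s")
    try:
        yield
    finally:
        lock.release()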
Correct me if I'm misinterpreting what's happening here! But if we're iterating through the futures and attempting to clear any finished ones every time a new image is submitted, wouldn't this cause a significant slowdown? (I had tried a simple version of this on the old ZarrWriter when I thought it was uncleared futures that were hogging memory: a basic list comprehension clearing any done() futures, and I found that fps would plummet.)

Having a growing list of these futures also doesn't seem to cause any problems (at least from what I've seen on dev_run / oracle), since they don't hold references to images once they've completed their task.
I did a quick test, removing those lines above and running a (modified) version of the test_zarrwriter.py script above. Modifications:
- Only ran the "new" zarrwriter
- Used a random image instead of 'falls.png'
- Set dt=1/53
- Wrote 10,000 imgs instead of 20,000 (didn't want to wait the whole duration)

Memory started at ~200MB at the beginning of the program and ended at 220MB by the time all the jobs were submitted to the executor (monitoring via htop). Tot_time includes the time it took to submit all the jobs plus the additional time to finish waiting on all incomplete jobs.

The graph below confuses me - the work queue appears to climb to 2300/2400, so I'm assuming that means there are that many images that have yet to be dumped to disk. In that case, I would expect memory usage to be at ~1.9GB (0.796 MB/img for AVT, 772*1032) - but like I mentioned above, htop only reported memory usage hovering steadily around 200-220MB.
But if we're iterating through futures and attempting to clear any finished ones every time a new image is submitted, wouldn't this cause a significant slowdown?
Yea you're right - clearing the futures will take a bunch of time. I didn't notice a performance change on my M1, so I was hoping that it could clear futures fast enough, but perhaps not :(. It bothers me that we don't check for exceptions here. I wonder if we could multiprocess and have another core check them?
If we do nothing with the futures, we should just not collect them at all! Save memory, time, and confusion.
Thoughts?
This graph below confuses me - the work queue appears to climb to 2300/2400, so I'm assuming that means there are that many images that have yet to be dumped to the disk. In that case, I would expect memory usage to be at ~1.9GB (0.796 MB/img for AVT, 772*1032) - but like I mentioned above, htop only reported memory usage hovering steadily around 200-220MB.
Hm it confuses me too. I'll play with these configs for a bit and I'll see if I can reproduce.
I think it is worth saving the futures - we can return them all at the end when we 'close' the experiment, and perhaps do something with them at that point (i.e. check for exceptions). Though we won't be able to do anything retroactively, at least we'll know that something went wrong and can log it.
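A rough sketch of that idea (hypothetical method name, not part of the PR; assumes the self.futures list, futures_lock, and lock_timeout helper from the diff above, plus import logging):

def checkFuturesOnClose(self):
    # Hypothetical method: drain the collected futures and surface any write exceptions at close time.
    with lock_timeout(self.futures_lock):
        pending, self.futures = self.futures, []
    errors = []
    for f in pending:
        exc = f.exception()  # blocks until this future has finished
        if exc is not None:
            errors.append(exc)
            logging.error("zarr write failed: %r", exc)
    return errors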
Done! I'll resolve this conversation once I have investigated this graph issue.
Yo @i-jey I can't reproduce - just a basic sanity check, did you set the dtype to uint8? np.random.randint's default dtype is int, which is int32 on our Pi.
Here's the test script I used! Maybe I did something silly? It'll run and output a test_zarrwriter.png at the end. (GitHub won't let me attach a .py file, so it's pasted below instead.)
#! /usr/bin/env python3
import cv2
import time
import zarr
import matplotlib.pyplot as plt
import numpy as np
import functools

from typing import Optional
from contextlib import contextmanager

from zarrwriter import ZarrWriter

ZW = ZarrWriter()
ZW.createNewFile("/media/pi/SamsungSSD/test_normal")


def measure(zw):
    # Submit a random uint8 image repeatedly and record how the executor's work queue grows.
    img = np.random.randint(0, 255, (772, 1032), dtype='uint8')
    qsizes = []
    ts = []
    dts = []
    t0 = time.perf_counter()
    dt = 1 / 53
    for i in range(int(1e4)):
        time.sleep(dt)
        T0 = time.perf_counter()
        zw.threadedWriteSingleArray(img, i)
        T1 = time.perf_counter()
        dts.append(T1 - T0)
        qsizes.append(zw.executor._work_queue.qsize())
        ts.append(time.perf_counter() - t0)
    t_fin_sub = time.perf_counter() - t0
    print(f"Submitted 1e4 imgs in {t_fin_sub} ({1e4 / t_fin_sub} imgs/s)")
    # Keep sampling the queue until all submitted jobs have been written to disk.
    while zw.executor._work_queue.qsize() > 0:
        time.sleep(dt)
        qsizes.append(zw.executor._work_queue.qsize())
        ts.append(time.perf_counter() - t0)
    t1 = time.perf_counter()
    dt_tot = t1 - t0
    print(f"Tot time: {dt_tot}, imgs/sec: {1e4 / dt_tot}")
    zw.wait_all()
    zw.closeFile()
    return dt_tot, ts, qsizes, dts, t_fin_sub


try:
    print("starting normal zarrwriter")
    dt_tot, ts, qsizes, dts, t_fin_sub = measure(ZW)
except KeyboardInterrupt:
    pass

plt.figure(figsize=(16, 12), dpi=160)
plt.title("work queue size")
plt.xlabel("time (s)")
plt.ylabel("_work_queue size")
plt.scatter(ts, qsizes, s=2, color="blue")
plt.axvline(t_fin_sub, color="blue", linewidth=2, linestyle="--")
plt.savefig("test_zarrwriter.png")
Dude this change is wild - what a massive relief and huge performance boost!!
My perspective on this is informed by some empirical/practical experience but not a ton of theory: using one huge array sounds really efficient because it minimizes any initialization overhead. I support this effort, but have a simple request (see the end).

"Back in my day" (old croaky voice), when I was doing high speed image acquisition, I was saving 500 FPS (at 2 MB/image) to a 6-disk HD RAID array while processing the images (mostly because single HDD write speed wasn't good enough). In that case, I did not have to do anything fancy at all like threaded writing queues. I simply wrote the data to a custom binary file (literally an fwrite operation) in one gigantic array, always appending and keeping the file open. Of course, the other thing to note is that this data had a very simple pipeline - no emitting it all over the place, no extra references to it in different routines, etc. This is similar to the other situation I told you about where the crop was very small but the framerate much higher (up to 40,000 FPS), which used the exact same strategy. Not really a big difference. In both cases there was a single memory buffer in software that was big enough to hold 100-10,000 frames. The camera dumped the data directly into the buffer (it was a DMA frame), then I accessed it from Matlab, processed it, then wrote it to disk (blocking operation), repeat. This worked really well for all the imaging data I acquired throughout my PhD. The computer and disks were fast enough to write over 1 GB/s.

In our case we are talking about 32 MB/s, and the SSD is rated up to several hundred MB/s. Granted, we are operating on limited compute hardware, have a single GIL for everything we are doing, and... python. But if we are not hindered in some way by file opening overhead, the fundamental write speed should be over 10x faster than we are currently writing.

@Axel-Jacobsen can you try a quick experiment? Can you try creating a basic script with no threading, that:

If fast, cool. If slow, we know that the zarr python bindings we are using are just slow AF. I just want to see what the limits are, cutting all the extra complications out. If it's slow, I wonder if we can create a C function to do our write operations, because python is... slow.
Actually, you could write in chunks of (772, 1032, n), where you vary n.
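(Not the script actually used here - just a minimal sketch of the kind of no-threading chunk-write experiment being described, writing (772, 1032, n) blocks straight into a zarr array while varying n. The path, frame count, and chunk values are illustrative.)

# Sketch of a no-threading benchmark: blocking writes of (772, 1032, n) chunks into a zarr array.
import time
import numpy as np
import zarr

n_frames = 1024
for n in (1, 8, 32, 128):
    arr = zarr.open(
        f"/media/pi/SamsungSSD/bench_n{n}.zarr", mode="w",
        shape=(772, 1032, n_frames), chunks=(772, 1032, n), dtype="uint8",
    )
    chunk = np.random.randint(0, 255, (772, 1032, n), dtype="uint8")
    t0 = time.perf_counter()
    for start in range(0, n_frames, n):
        arr[:, :, start:start + n] = chunk  # blocking write, no executor involved
    dt = time.perf_counter() - t0
    print(f"n={n}: {n_frames / dt:.1f} imgs/s")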
@paul-lebel Sure! Results coming up shortly.
Looks great, thanks for finding this fix! Added some minor comments
Interesting results so far (using this: https://gist.github.com/Axel-Jacobsen/b1a5c245fc251de84e03305f2e9a5760).

On the Raspberry Pi, results are confusing:

On my Mac, results make more sense:

Working a bit more on this still.
…-malaria-scope into zarrwriter-xtra-big
FWIW, I wrote a very simple script to test the python equivalent of binary file writing and ran it on my RPi at home, saving to an SD card via a USB port (all I had in place of an SSD). The result was about 314 FPS to a regular old SD card. On my mac (3-year-old MacBook Pro), writing locally, it was about 4,000 FPS.
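(The script itself isn't included above; a minimal sketch of the Python equivalent of that fwrite-style binary append might look like the following. The path and frame count are illustrative.)

# Sketch of the "raw binary" approach: append frames to one open file, no threading, no zarr.
import time
import numpy as np

img = np.random.randint(0, 255, (772, 1032), dtype="uint8")
n_frames = 1000

t0 = time.perf_counter()
with open("/media/pi/SamsungSSD/raw_frames.bin", "wb") as f:
    for _ in range(n_frames):
        f.write(img.tobytes())  # the Python equivalent of an fwrite of the frame buffer
    f.flush()
dt = time.perf_counter() - t0
print(f"{n_frames / dt:.1f} imgs/s ({n_frames * img.nbytes / dt / 1e6:.1f} MB/s)")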
@i-jey chunk size of n has an actual size of (772, 1032, n)

@paul-lebel we could always just drop zarr and do that?
Also, I found an error in my script - standby for actual results.
Maybe, but we would need to work on a nice way of ensuring the metadata is all linked with the appropriate frame, etc. And with the raw binary file format you obviously have to get your file indexing and byte order right so you don't scramble things up. It adds work, but I'd say it's more straightforward than all the threaded efforts?
That's wild @paul-lebel. I
Thanks for testing on Ohmu. I wonder where the discrepancy between your results and mine comes from?
Okie dokie, game plan:

When submitting the chunk, we would be submitting it by reference, so we have to either

We will have to be careful about order here too - matching the image count to its idx in the zarr array, but that shouldn't be too hard.
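(A hypothetical sketch of that bookkeeping - none of these names are from the actual code: copy the chunk before submitting so it isn't handed over by reference to a buffer that keeps being overwritten, and record which zarr indices it covers.)

# Hypothetical sketch: submit a copy of the chunk and remember where it lands in the zarr array.
import numpy as np

def submit_chunk(executor, zarr_array, chunk_buffer, frame_count, chunk_size):
    start = frame_count - chunk_size       # first zarr index this chunk covers
    chunk_copy = np.copy(chunk_buffer)     # decouple from the caller's reusable buffer

    def _write():
        zarr_array[:, :, start:start + chunk_size] = chunk_copy

    return executor.submit(_write)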
Hey @Axel-Jacobsen, it is really interesting that you see an optimal chunk size that is so small. However, I'm very hesitant to believe (without more data running in oracle/acquisition context, SSD, etc) that this is a universal / robust phenomenon that holds water in different situations. Any insight into what might cause the reduction in speed with chunks > 32? While we do expect saturation of speed, I did not expect it to drop off like that. When I test vs. n in my script it's completely flat (always the same time per frame no matter what).

Also, @i-jey incorporated a very important effect in his test script: the images are generated sequentially, each with a time lag. So in practice this might push us towards smaller chunks, because we might not want to spend time waiting for more frames to come in just so we do the final write operation faster - that would be wasting a lot of time we have between image frames.

I think in the end, the xtra-big zarr approach is super promising if we can achieve the fundamental write speed limit! But we will need to test on oracle with real hardware to see what value of n works best.

In terms of memory management approach, I think what you're describing is the same as Eduardo's strategy he told us about from his previous work, where he had many cameras all imaging at high speed. He used what he called 'ping-pong' buffers. The camera would put data into buffer A while the data storage thread wrote buffer B to disk. Then they swapped: the camera filled buffer B while data storage wrote buffer A. We could definitely do this with two queues of size 'n'!
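(A minimal sketch of such a ping-pong buffer - names and shapes are illustrative, and real code would need synchronization between the camera thread and the writer thread:)

# Sketch of a ping-pong buffer: the camera fills one buffer while the writer flushes the other.
import numpy as np

class PingPongBuffer:
    def __init__(self, n, shape=(772, 1032), dtype="uint8"):
        self.buffers = [np.empty(shape + (n,), dtype=dtype) for _ in range(2)]
        self.active = 0   # index of the buffer currently being filled
        self.fill = 0     # frames written into the active buffer so far

    def add_frame(self, img):
        """Store one frame; return a full buffer to hand to the writer thread, else None."""
        buf = self.buffers[self.active]
        buf[:, :, self.fill] = img
        self.fill += 1
        if self.fill == buf.shape[-1]:
            full = self.buffers[self.active]
            self.active = 1 - self.active   # camera switches to the other buffer
            self.fill = 0
            return full                     # writer thread flushes this one to disk
        return None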
Not sure. The precipitous drop-off surprised me too. Maybe it could be a zarr thing? I tested this on the SSD this morning as well, same results.

I'll try to write it so the chunk size is configurable, then do some testing! Here are the graphs, with the test script too (Archive.zip).

Writing to the SSD:
@paul-lebel @i-jey @mwlkhoo thoughts on merging more or less what we have? (I'll remove

I am trying to think through the best way to write chunks. We have some special constraints:

1 isn't too hard, since we are synchronously giving

Given all this, maybe we chunk this PR into two parts? The first part would be the changes as they stand now, since they will reduce the magnitude of our memory issues. The second part would be writing

Thoughts?
I can run in and test if needed, but will need to ask Aditi to let me in (my bad for delaying my UCSF badge, and the office is closed for the holidays). I think Aditi is the best person to ask to see if blood can be put on the scope.
I agree 1 is not an issue. I also anticipate 2 will not be an issue, because any skipped images will not increment

Two parts sounds good -- this PR as-is would be helpful for testing without running out of memory.
Unless I'm missing something obvious, aren't we totally in the clear with setting a chunk size of

So to your point Axel - yup, I think it's a good idea to split this PR up and merge the part that we know works. But I wonder if the second part, with all the additional complexity, is necessary if we're already achieving the constant-memory + sufficient-FPS requirement.
…rwise we get a corrupt zipfile error
…es, we can't keep up at 53fps
The results from my previous comment on doing a binary write to the SD/SSD on Ohmu are incorrect (#195 (comment)). I'm guessing when I

Re-running the simple write-to-binary-file script, I get these values:
See https://gist.github.com/Axel-Jacobsen/147c9142559a18dd1f336a58ca7c9b88 for test scripts. I used test_zarrwriter.py to test different implementations.

Inspiration for this comes from this issue comment. Essentially, you can create a zarr array of any arbitrary size (e.g. (772, 1032, 2e9)), and only chunks that are initialized are written to disk. An initialized chunk is a chunk that has data written to it.

One thing to note: this will not completely eliminate our current issues with memory buildup, but it could be a large factor. As a recap, images are written by a thread pool (with some number of max_workers); the executor._work_queue builds up because the thread(s) can not write to disk fast enough to keep up - there will always be some threshold FPS where this occurs. It seems to hover around 20-30 FPS in our use cases. These changes hopefully will let the ZarrWriter fly through the work queue very quickly, increasing the threshold FPS that is required before memory starts building up.

Although the size of the zarr array on disk is only as big as the memory written, this will change how we manage data. When you open the zarr file via AR = zarr.open('zarrfile.zip'), its size is (772, 1032, 2e9). To get the actual number of chunks written, use AR.initialized. To get the images that you've written, iterate through AR[..., i] with 0 <= i < initialized. There is probably a way to truncate uninitialized chunks; we should try to do that for convenience.
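(For example, reading back might look roughly like this - a sketch based on the description above, with an illustrative filename:)

# Sketch of reading frames back from the over-allocated zarr array.
import zarr

AR = zarr.open('zarrfile.zip')       # reported shape is (772, 1032, 2e9)
n_written = AR.initialized           # number of chunks actually written to disk
for i in range(n_written):
    img = AR[..., i]                 # only initialized chunks take up space on disk
    # ... process img ...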
Below is the size of the thread pool executor's _work_queue for each ZarrWriter, while submitting jobs with a delay of 0.0003 seconds on my M1. The vertical dashed line is the time at which we stopped submitting jobs.

On scope, with dt=1/33 (i.e. a delay of 1/33 seconds between image submissions):
Master
This Branch
Optimal number of threads
By setting the maximum number of threads for the concurrent futures
ThreadPoolExecutor
to different values, we get different performance. From tests, it seems thatmax_workers=1
is the winner:On the SD:
On the SSD:
On the SSD, but just w/ 4 options and smaller
dt
between writesA
max_workers=1
is equal to or better than a greater number of max workers.
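(For reference, that configuration is just a single-worker executor; this is a sketch, where zw, img, and i refer to the objects from the test script above.)

# Sketch: a single-worker pool serializes the zarr writes, which the tests above suggest
# is as fast or faster than multiple workers on this hardware.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(zw.writeSingleArray, img, i)  # same submit call, one writer thread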