
ZarrWriter speedup #195

Merged
merged 25 commits into from
Dec 23, 2022

Conversation

@Axel-Jacobsen Axel-Jacobsen commented Dec 15, 2022

See https://gist.github.com/Axel-Jacobsen/147c9142559a18dd1f336a58ca7c9b88 for test scripts. I used test_zarrwriter.py to test different implementations.

Inspiration for this comes from this issue comment. Essentially, you can create a zarr array of any arbitrary size (e.g. (772,1032,2e9)), and only chunks that are initialized are written to disk. An initialized chunk is a chunk that has data written to it.

One thing to note: this will not completely eliminate our current issues with memory buildup, but it could be a large factor. As a recap,

  • we are trying to save data as fast as possible, with as little blocking as possible - that is why we use the ThreadPoolExecutor with some number of threads (capped at max_workers)
  • When we flood the thread pool w/ jobs (i.e. give images at a really really high FPS), the executor._work_queue builds up, because the thread(s) can not write to disk fast enough to keep up - there will always be some threshold FPS where this occurs. It seems to hover around 20-30 FPS in our use cases.
  • When we do extra CPU work, we will lower this threshold FPS. This is simply because there are a limited number of clock cycles with which we can do the work we want to do - write to disk, calculate flowrate, whatever

These changes hopefully will let the ZarrWriter fly through the work queue very quickly, increasing the threshold FPS that is required before memory starts building up.

Although the size of the zarr array on disk is only as big as the data actually written, this will change how we manage data. When you open the zarr file via AR = zarr.open('zarrfile.zip'), its size is (772,1032,2e9). To get the number of chunks actually written, use AR.initialized. To get the images that you've written, iterate through AR[...,i] with 0 <= i < initialized. There is probably a way to truncate uninitialized chunks; we should try to do that for convenience.


Below is the size of the thread pool executor's _work_queue for each ZarrWriter, while submitting jobs with a delay of 0.0003 seconds on my M1. The vertical dashed line is the time at which we stopped submitting jobs.

result

On scope, with dt=1/33 (i.e. a delay of 1/33 seconds between image submissions):

result

Running on blood

Master

zw executor qsize_master

This Branch

zw executor qsize zarrwriter

Optimal number of threads

By setting the maximum number of threads for the concurrent futures ThreadPoolExecutor to different values, we get different performance. From tests, it seems that max_workers=1 is the winner:

On the SD:

num_workers_sd

On the SSD:

num_workers_ssd

On the SSD, but just w/ 4 options and smaller dt between writes

num_workers_variation

max_workers=1 performs equal to or better than any greater number of workers.

@Axel-Jacobsen Axel-Jacobsen marked this pull request as draft December 15, 2022 03:13
@Axel-Jacobsen Axel-Jacobsen marked this pull request as ready for review December 15, 2022 04:26
Comment on lines 95 to 107
```python
def threadedWriteSingleArray(self, *args, **kwargs):
    self.futures.append(self.executor.submit(self.writeSingleArray, *args))
    fs = []
    with lock_timeout(self.futures_lock):
        for f in self.futures:
            if f.done():
                f.result()
            else:
                fs.append(f)
        self.futures = fs

    f = self.executor.submit(self.writeSingleArray, *args)
    with lock_timeout(self.futures_lock):
        self.futures.append(f)
```
@i-jey (Collaborator) commented Dec 15, 2022
Correct me if I'm misinterpreting what's happening here!

But if we're iterating through futures and attempting to clear any finished ones every time a new image is submitted, wouldn't this cause a significant slowdown? (I had tried a simple version of this on the old ZarrWriter when I thought it was uncleared futures that were hogging memory. Doing a basic list comprehension and clearing any done() futures and I found that fps would plummet).

Having a growing list of these futures also (at least from what I've seen on dev_run / oracle) doesn't seem to cause any problems, since these don't hold references to images once they've completed their task.

@i-jey (Collaborator) commented Dec 15, 2022

I did a quick test, removing those lines above and running a (modified) version of the test_zarrwriter.py script above. Modifications:

  • Only ran the "new" zarrwriter
  • Used a random image instead of 'falls.png'
  • Set dt=1/53
  • Wrote 10,000 imgs instead of 20,000 (didn't want to wait the whole duration)

Memory started at ~200MB at the beginning of the program, ended at 220MB by the time all the jobs were submitted to the executor (monitoring via htop).
image

Tot_time includes the time it took to submit all the jobs plus the additional time to finish waiting on all incomplete jobs.
image

This graph below confuses me - the work queue appears to climb to 2300/2400, so I'm assuming that means there are that many images that have yet to be dumped to the disk. In that case, I would expect memory usage to be at ~1.9GB (0.796 MB/img for AVT, 772*1032) - but like I mentioned above, htop only reported memory usage hovering steadily around 200-220MB.
test_zarrwriter

@Axel-Jacobsen (Contributor, Author)

> But if we're iterating through futures and attempting to clear any finished ones every time a new image is submitted, wouldn't this cause a significant slowdown?

Yea you're right - clearing the futures will take a bunch of time. I didn't notice a performance change on my M1, so I was hoping that it could clear futures fast enough, but perhaps not :( . It bothers me that we don't check for exceptions here. I wonder if we could multiprocess and have another core check them?

If we do nothing with the futures, we should just not collect them at all! Save memory, time, and confusion.

Thoughts?

> This graph below confuses me - the work queue appears to climb to 2300/2400, so I'm assuming that means there are that many images that have yet to be dumped to the disk. In that case, I would expect memory usage to be at ~1.9GB (0.796 MB/img for AVT, 772*1032) - but like I mentioned above, htop only reported memory usage hovering steadily around 200-220MB.

Hm it confuses me too. I'll play with these configs for a bit and I'll see if I can reproduce.

Collaborator

I think it is worth saving the futures - we can return them all at the end when we 'close' the experiment, and perhaps do something with them at that point (i.e check for exceptions) - though we won't be able to retroactively do anything, at least we'll know that something went wrong and log that.

@Axel-Jacobsen (Contributor, Author)

Done! I'll resolve this conversation once I have investigated this graph issue

@Axel-Jacobsen (Contributor, Author)

Yo @i-jey I can't reproduce - just a basic sanity check: did you set the dtype to uint8? np.random.randint's default dtype is int, which is int32 on our Pi

@Axel-Jacobsen (Contributor, Author)

ah ok disregard!

result

Collaborator

Here's the test script I used! Maybe I did something silly? It'll run and output a test_zarrwriter.png at the end.
(GitHub won't let me attach a .py file, so it's pasted below instead)

```python
#! /usr/bin/env python3

import time

import matplotlib.pyplot as plt
import numpy as np

from zarrwriter import ZarrWriter


ZW = ZarrWriter()
ZW.createNewFile("/media/pi/SamsungSSD/test_normal")


def measure(zw):
    img = np.random.randint(0, 255, (772, 1032), dtype='uint8')

    qsizes = []
    ts = []
    dts = []

    t0 = time.perf_counter()
    dt = 1 / 53

    for i in range(int(1e4)):
        time.sleep(dt)
        T0 = time.perf_counter()
        zw.threadedWriteSingleArray(img, i)
        T1 = time.perf_counter()
        dts.append(T1 - T0)
        qsizes.append(zw.executor._work_queue.qsize())
        ts.append(time.perf_counter() - t0)
    t_fin_sub = time.perf_counter() - t0
    print(f"Submitted 1e4 imgs in {t_fin_sub} ({1e4/t_fin_sub} imgs/s)")

    while zw.executor._work_queue.qsize() > 0:
        time.sleep(dt)
        qsizes.append(zw.executor._work_queue.qsize())
        ts.append(time.perf_counter() - t0)

    t1 = time.perf_counter()
    dt_tot = t1 - t0
    print(f"Tot time: {dt_tot}, imgs/sec: {1e4 / dt_tot}")
    zw.wait_all()
    zw.closeFile()

    return dt_tot, ts, qsizes, dts, t_fin_sub


try:
    print("starting normal zarrwriter")
    dt_tot, ts, qsizes, dts, t_fin_sub = measure(ZW)
except KeyboardInterrupt:
    pass


plt.figure(figsize=(16, 12), dpi=160)

plt.title("work queue size")
plt.xlabel("time (s)")
plt.ylabel("_work_queue size")
plt.scatter(ts, qsizes, s=2, color="blue")
plt.axvline(t_fin_sub, color="blue", linewidth=2, linestyle="--")
plt.savefig("test_zarrwriter.png")
```

i-jey commented Dec 15, 2022

Dude this change is wild - what a massive relief and huge performance boost!!

paul-lebel commented Dec 15, 2022

My perspective on this is informed by some empirical/practical experience but not a ton of theory: using one huge array sounds really efficient because it minimizes any initialization overhead. I support this effort, but have a simple request (see at the end).

"Back in my day" (old croaky voice) when I was doing high speed image acquisition I was saving 500 FPS (at 2 MB / image) to a 6 disk HD RAID array while processing the images (mostly because single HDD write speed wasn't good enough). In that case, I did not have to do anything fancy at all like threaded writing queues. I simply wrote the data to a custom binary file (like literally an fwrite operation) in one gigantic array, always appending and keeping the file open. Of course, the other thing to note is that this data had a very simple pipeline - no emitting it all over the place, no extra references to it in different routines, etc. etc. This is similar to the other situation I told you about where the crop was very small but framerate much higher (up to 40,000 FPS), which used the exact same strategy. Not really a big difference.

In both cases there was a single memory buffer in software that was big enough to hold 100-10,000 frames. The camera dumped the data directly into the buffer (it was a DMA frame), then I accessed it from Matlab, processed it, then wrote it to disk (blocking operation), repeat. This worked really well for all the imaging data I acquired throughout my PhD. The computer and disks were fast enough to write over 1 GB /s.

In our case we are talking about 32 MB/s, and the SSD is rated up to several hundred MB/s. Granted, we are operating on limited compute hardware, and have a single GIL for everything we are doing, and...python. But if we are not hindered in some way by file opening overhead, the fundamental write speed should be over 10x faster than we currently are writing.

@Axel-Jacobsen can you try a quick experiment? Can you try creating a basic script with no threading, that:

  • Creates an xtra big zarr array
  • Creates a random 8 bit nd array of shape (772, 1032, int(1e3))
  • Enters a for loop, each iteration writing chunks of size (772, 1032, 100) to the zarr file
  • Time this on M1, and on RPi, using our SSD

If fast, cool. If slow, we know that the zarr python bindings we are using are just slow AF. I just want to see what the limits are, cutting all the extra complications out. If it's slow, I wonder if we can create a C function to do our write operations, because python is... slow.

@paul-lebel (Collaborator)
Actually, you could write in chunks of (772, 1032, n), where you vary n.

@Axel-Jacobsen (Contributor, Author)
@paul-lebel Sure! results coming up shortly

@mwlkhoo (Contributor) left a comment

Looks great, thanks for finding this fix! Added some minor comments

@Axel-Jacobsen (Contributor, Author)

Interesting results so far (using this https://gist.github.com/Axel-Jacobsen/b1a5c245fc251de84e03305f2e9a5760)

On the Raspberry Pi, results are confusing:

profile_result_pi

On my Mac, results make more sense:

profile_result

Working a bit more on this still

@paul-lebel (Collaborator)
> (quotes the earlier comment and experiment request above in full)

FWIW, I wrote a very simple script to test the python equivalent of binary file writing and ran it on my RPi at home, saving it to an SD card via USB port (all I had in place of an SSD):

```python
import numpy as np
import time

n = 100
img = bytearray(np.random.randint(0, 255, (772, 1032, n), dtype='uint8'))

with open("/media/pi/rootfs/home/pi/Documents/my_file.txt", "wb") as binary_file:
    # Write bytes to file
    t0 = time.perf_counter()
    binary_file.write(img)
    t1 = time.perf_counter()
    print(f"{(t1-t0)/n} seconds per image")
```

The result:

```
pi@PaulPi4b-1:~/Documents $ python3 write_test.py
0.003180479170000581 seconds per image
```

which is about 314 FPS to a regular old SD card.

On my mac (3 yr old macbook pro), writing locally:

```
(malariaEnv) paul.lebel@plebel-mac Documents % python3 test_write.py
0.00024927042899999964 seconds per image
```

or about 4,000 FPS.
"The computers are fast but you don't know it"

@Axel-Jacobsen (Contributor, Author)

@i-jey chunk size of n has an actual size of (772, 1032, n)

@paul-lebel we could always just drop zarr and do that?

@Axel-Jacobsen (Contributor, Author)

also I found an error in my script, standby for actual results

@paul-lebel (Collaborator)

Maybe, but we would need to work on a nice way of ensuring the metadata is all linked with the appropriate frame, etc. And with the raw binary file format you obviously have to get your file indexing and byte orders right so you don't scramble things up. It adds work but I'd say more straightforward than all the threaded efforts?
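To make the indexing concern concrete: with fixed-size uint8 frames, a raw binary file can be indexed purely by byte offset. A minimal sketch (the filename and layout here are hypothetical, not a proposed format):

```python
# Append fixed-size frames to a raw binary file, then read one back by
# offset with np.memmap - no per-frame metadata, so frame size must be
# known and agreed on by writer and reader.
import numpy as np

H, W = 772, 1032
FRAME_BYTES = H * W  # uint8: one byte per pixel

frames = [np.random.randint(0, 255, (H, W), dtype="uint8") for _ in range(5)]

with open("frames.raw", "wb") as f:
    for frame in frames:
        f.write(frame.tobytes())  # fixed-size records -> index by offset

# Random access to frame i without loading the whole file:
i = 3
mm = np.memmap("frames.raw", dtype="uint8", mode="r")
frame_i = mm[i * FRAME_BYTES:(i + 1) * FRAME_BYTES].reshape(H, W)
```

The fragility Paul mentions is visible here: if the frame shape, dtype, or byte order ever changes mid-file, every offset after that point is scrambled.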

i-jey commented Dec 15, 2022

That's wild @paul-lebel.

I ssh'd into Ohmu and tried that same script (both on SD card and the SamsungSSD) - I get the following:

  • SD: 0.13623486866999884 seconds / img (7.3fps)
  • SSD: 0.008962084240047262 seconds / img (111.6fps)

@paul-lebel (Collaborator)

Thanks for testing on Ohmu. I wonder where the discrepancy between your results and mine comes from?

Axel-Jacobsen commented Dec 16, 2022

Here is the same, except just up to a chunk size of 32

avg_fps_32ch
Avg FPS

time_to_write_chunk_32ch
time to write chunk

avg_time_to_save_32ch
Avg time to save each image

Maybe the optimal chunk size will be different w/ the SSD plugged in?

@Axel-Jacobsen (Contributor, Author)

okie dokie, game plan:

  • test the optimal chunk size on ssd (ill have to ssh onto ohmu or smth) - curious if the optimal "32" that we found there is SD specific, or pi specific!
  • rewrite to synchronously write to chunk, and once the chunk is full, submit it to the queue, and start writing to it again
  • test w/ original scripts before final verification

when submitting the chunk, we would be submitting it by reference, so we have to either

  • copy the chunk before submitting (slow)
  • something like this? have two chunks - when we submit chunk A, we start writing to B. once B is full, A will most certainly have been written, so start writing to A again, etc

we will have to be careful about order here too - matching the image count to its idx in the zarr array, but that shouldn't be too hard

paul-lebel commented Dec 16, 2022

> (quotes the game plan above in full)

Hey @Axel-Jacobsen , it is really interesting that you see an optimal chunk size that is so small. However, I'm very hesitant to believe (without more data running in oracle/acquisition context, SSD, etc) that this is a universal / robust phenomenon that holds water in different situations. Any insight into what might cause the reduction in speed with chunks > 32? While we do expect saturation of speed, I did not expect it to drop off like that. When I test vs. n in my script it's completely flat (always the same time per frame no matter what).

Also, @i-jey incorporated a very important effect in his test script: the images are generated sequentially, each with a time lag. So in practice this might push us towards smaller chunks because we might not want to spend time waiting for more frames to come in just so we do the final write operation faster - that would be wasting a lot of time we have between image frames.

I think in the end, the xtra-big zarr approach is super promising if we can achieve the fundamental write speed limit! But we will need to test on oracle with real hardware to see what value of n works best.

In terms of memory management approach, I think what you're describing is the same as Eduardo's strategy he told us about from his previous work where he had many cameras all imaging at high speed. He used what he called 'ping-pong' buffers. The camera would put data into buffer A while the data storage thread wrote buffer B to disk. Then they swapped and camera filled buffer B while data storage wrote buffer A. We could definitely do this with two queues of size 'n'!
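A toy version of that ping-pong scheme could look like the following. Everything here is hypothetical (the queues, buffer count, and the `written` list standing in for the disk write); it only demonstrates the buffer hand-off, not our actual ZarrWriter.

```python
# Toy 'ping-pong' double buffer: the producer fills one buffer while a
# writer thread flushes the other, swapping ownership via two queues.
import queue
import threading
import numpy as np

H, W, N = 772, 1032, 4        # N images per buffer

free = queue.Queue()          # buffers ready to be filled
full = queue.Queue()          # buffers ready to be written out
for _ in range(2):            # two buffers -> the 'ping' and the 'pong'
    free.put(np.empty((H, W, N), dtype="uint8"))

written = []                  # stands in for data flushed to disk

def writer():
    while True:
        buf = full.get()
        if buf is None:       # sentinel: shut down
            break
        written.append(buf.copy())  # stand-in for the blocking disk write
        free.put(buf)               # hand the buffer back to the producer

t = threading.Thread(target=writer)
t.start()

buf, fill = free.get(), 0
for count in range(8):        # 8 incoming images -> exactly 2 full buffers
    buf[..., fill] = count    # producer writes into the active buffer
    fill += 1
    if fill == N:             # buffer full: submit it, grab the other one
        full.put(buf)
        buf, fill = free.get(), 0

full.put(None)
t.join()
print(len(written))           # -> 2
```

Note the producer never copies: buffers circulate between the two queues, so the only blocking happens if the writer falls a full buffer behind.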

@paul-lebel (Collaborator)

Not sure how accurate this is, but I couldn't resist.
image

@Axel-Jacobsen (Contributor, Author)

> Any insight into what might cause the reduction in speed with chunks > 32?

Not sure. The precipitous drop off surprised me too. Maybe it could be a zarr thing? I tested this on the SSD this morning as well, same results

avg_time_per_img

avg_fps

time_per_chunk

> Also, @i-jey incorporated a very important effect in his test script: the images are generated sequentially, each with a time lag. So in practice this might push us towards smaller chunks because we might not want to spend time waiting for more frames to come in just so we do the final write operation faster - that would be wasting a lot of time we have between image frames.

I'll try to write it so the chunk size is configurable, then do some testing!

Here are the graphs, with the test script too

Writing to the SSD Archive.zip

@Axel-Jacobsen (Contributor, Author)

@paul-lebel @i-jey @mwlkhoo thoughts on merging more or less what we have (I'll remove ChunkWriter, and we will have to test on Ohmu - @i-jey or @mwlkhoo would you be able to? or able to at least put blood on the scope)?

I am trying to think through the best way to write chunks. We have some special constraints:

  1. images must be written in order
  2. we must write within chunk boundaries in the zarr array, otherwise we have to add a lock to the zarr array itself
  3. we want there to be minimal blocking

1 isn't too hard, since we are synchronously giving zarrwriter one image at a time. We just have to verify that we are getting the image count (given by ScopeOp.count) that we expect.
2 is OK, given that 1 goes well. If we miss images, and get an image that is out of order, things will get a little more complicated
3 is trickier. Here are some ideas on how to do this:

  • as images come in, write to a big np.array of shape (772,1032,num_imgs_per_chunk). When it is full, np.copy it, and submit it to the ThreadPoolExecutor to write. Maybe with a chunk size of like 4 or 8, we get like 80% of the benefit with very little extra complexity.
  • If np.copy takes too long for smaller num_imgs_per_chunk, or if we want more throughput, we can do something like A/B writing, or PingPong buffers. Looking at the performance graphs, latency can reach up to 100 ms for writing a chunk with num_imgs_per_chunk = 32, so most likely we would have to have more than two chunks that we are 'ping-ponging' in between. See a sketch of how we may do this here
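The first option might look roughly like this. Just a sketch of the accumulate-copy-submit flow: `write_chunk` is a hypothetical stand-in for the zarr chunk assignment, and the sizes are illustrative.

```python
# Accumulate images into a chunk buffer; when full, np.copy it and hand
# the copy to a single-worker executor. write_chunk is a hypothetical
# stand-in for "zarr_array[..., start:start + CHUNK] = chunk".
from concurrent.futures import ThreadPoolExecutor
import numpy as np

H, W, CHUNK = 772, 1032, 4
buf = np.empty((H, W, CHUNK), dtype="uint8")
executor = ThreadPoolExecutor(max_workers=1)  # 1 worker won our benchmarks
results = []

def write_chunk(chunk, start_idx):
    results.append((start_idx, chunk.mean()))  # pretend-write, keeps order

futures, fill = [], 0
for count in range(12):                       # 12 incoming images
    buf[..., fill] = count
    fill += 1
    if fill == CHUNK:
        snapshot = buf.copy()                 # copy so we can refill buf safely
        futures.append(executor.submit(write_chunk, snapshot, count - CHUNK + 1))
        fill = 0

for f in futures:
    f.result()                                # surfaces any write exceptions
executor.shutdown()
print([start for start, _ in results])        # -> [0, 4, 8]
```

With a single worker the submitted chunks are written in FIFO order, which also satisfies constraint 1 above.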

Given all this, maybe we chunk this PR into two parts? The first part would be the changes as they stand now, since they will reduce the magnitude of our memory issues. The second part would be writing ZarrWriter to deal with chunks of images to improve throughput.

Thoughts?

mwlkhoo commented Dec 16, 2022

> @paul-lebel @i-jey @mwlkhoo thoughts on merging more or less what we have (I'll remove ChunkWriter, and we will have to test on Ohmu - @i-jey or @mwlkhoo would you be able to? or able to at least put blood on the scope)?

I can run in and test if needed, but will need to ask Aditi to let me in (my bad for delaying my UCSF badge, and the office is closed for the holidays). I think Aditi is the best person to ask to see if blood can be put on the scope.

> I am trying to think through the best way to write chunks. We have some special constraints:
>
>   1. images must be written in order
>   2. we must write within chunk boundaries in the zarr array, otherwise we have to add a lock to the zarr array itself
>   3. we want there to be minimal blocking

I agree 1 is not an issue. I also anticipate 2 will not be an issue, because any skipped images will not increment scopeop.count

> Given all this, maybe we chunk this PR into two parts? The first part would be the changes as they stand now, since they will reduce the magnitude of our memory issues. The second part would be writing ZarrWriter to deal with chunks of images to improve throughput.

Two parts sounds good -- this PR as is would be helpful for testing without running out of memory

i-jey commented Dec 16, 2022

Unless I'm missing something obvious, aren't we totally in the clear with setting a chunk size of (IMG_HEIGHT, IMG_WIDTH, 1)? In the test I ran earlier in this discussion, memory stayed constant and we were able to keep pace with the camera spitting out images at 53fps.

So to your point Axel - yup, I think it's a good idea to split this PR up and merge the part that we know works. But I wonder if the second part, with all the additional complexity, is necessary if we're already achieving the constant-memory + sufficient FPS requirement.

i-jey
i-jey previously approved these changes Dec 23, 2022
@i-jey i-jey merged commit b4f213d into master Dec 23, 2022
i-jey commented Jan 20, 2023

The results from my previous comment on doing a binary write to the SD/SSD on Ohmu are incorrect (#195 (comment)).

I'm guessing when I ssh'd, there was some other process that was still running and was hogging up the CPU.

Re-running the simple write-to-binary-file-script, I get these values:

  • OS SD - 0.0207 s per image (48.29 imgs/sec)
  • SSD - 0.0031 s per image (320.54 imgs/sec)

@i-jey i-jey deleted the zarrwriter-xtra-big branch June 2, 2023 23:50