Parallel writing of zarr array from single process #20
First, here is the code used for testing. It iterates over two kinds of ROIs, and over three options for the chunk sizes (1, 2, and 4).

```python
import os
import shutil
import time

import numpy as np
import dask
import dask.array as da

# Global tmpfile
tmpfile = "tmp.dat"


# Mock of a task: read an array, process it per-ROI, write resulting array
def task(ROIs, N, chunksize, _case=0, do_graphs=False):
    # Initialize tmpfile (to count number of process_ROI calls)
    if os.path.isfile(tmpfile):
        os.remove(tmpfile)

    # Function to be applied to each ROI
    def process_ROI(ROI):
        time.sleep(1)
        with open(tmpfile, "a") as f:
            f.write("1\n")
        return ROI + 2

    delayed_process_ROI = dask.delayed(process_ROI)

    # Read array
    y = da.from_zarr("x.zarr")
    # Define output array
    z = da.empty(y.shape, chunks=y.chunks, dtype=y.dtype)
    # Per-ROI processing
    for indices in ROIs:
        sx, ex = indices[:]
        shape = (ex - sx,)
        res = delayed_process_ROI(y[sx:ex])
        z[sx:ex] = da.from_delayed(res, shape=shape, dtype=np.float32)
    if do_graphs:
        z.visualize(f"fig_graph_N{N}_c{chunksize}_case_{_case}_opt.png",
                    optimize_graph=True,
                    color="order",
                    cmap="autumn",
                    node_attr={"penwidth": "3"})
    # Write array to disk (after removing temporary folders, if needed)
    if os.path.isdir("/tmp/tmp.zarr"):
        shutil.rmtree("/tmp/tmp.zarr")
    z.to_zarr("/tmp/tmp.zarr", compute=True)


# Global parameters
N = 4
do_graphs = False

# Loop over chunk sizes
for chunksize in [1, 2, 4]:
    # Create a chunked size-N array of zeros and store it to disk
    if os.path.isdir("x.zarr"):
        shutil.rmtree("x.zarr")
    x = da.zeros(shape=N, chunks=chunksize, dtype=int)
    x.to_zarr("x.zarr", compute=True)
    # Define ROIs
    ROIs_case1 = [(i, i + 1) for i in range(N)]
    ROIs_case2 = [(i, i + 2) for i in range(N) if i + 2 <= N]
    print()
    print(f"N={N}, chunksize={chunksize}")
    print()
    # Loop over two choices of ROIs
    for iROIs, ROIs in enumerate([ROIs_case1, ROIs_case2]):
        # Print information about ROIs
        n_ROIs = len(ROIs)
        ROIs_long = [list(range(ROI[0], ROI[1])) for ROI in ROIs]
        len_ROI = len(ROIs_long[0])
        print(f"{n_ROIs} ROIs of {len_ROI} pixels each: {ROIs_long}")
        # Execute task
        t0 = time.perf_counter()
        task(ROIs, N, chunksize, _case=iROIs, do_graphs=do_graphs)
        t1 = time.perf_counter()
        # Diagnostics on output
        out = da.from_zarr("/tmp/tmp.zarr").compute()
        num_calls = np.loadtxt(tmpfile, dtype=int).sum()
        print(f" Task completed (elapsed: {t1-t0:.3f} s)")
        print(f" Output array: {out}")
        print(f" Number of ROIs: {len(ROIs)}")
        print(f" Number of process_ROI calls: {num_calls}")
        assert np.allclose(out, np.ones(N) * 2)
        assert num_calls == len(ROIs)
        print()
        print("----------")
```
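As an aside on the counting trick: the tmpfile is used because a file survives across dask worker threads/processes. For a purely single-process run, the same count can be kept in memory; a minimal sketch (not part of the original test code, with a plain-Python stand-in for `process_ROI`):

```python
import itertools

# In-memory alternative to the tmpfile-based call counter.
# NOTE: this only works within a single process; the tmpfile approach
# is what survives when dask spreads work across processes.
call_counter = itertools.count()

def process_ROI(ROI):
    next(call_counter)              # record one call
    return [v + 2 for v in ROI]     # same "+2" behavior as the mock task

ROIs = [[0], [0], [0], [0]]         # four single-pixel ROIs
results = [process_ROI(roi) for roi in ROIs]
num_calls = next(call_counter)      # 4 calls recorded so far
```

The trade-off is the same as in the test: the counter is a side effect of `process_ROI`, so any extra (unwanted) invocation scheduled by the graph shows up in the count.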
We first look at the first kind of ROIs, namely the single-pixel ones (with no shared pixels), and we see that the three statements (correct result, correct number of calls, correct timing) are all satisfied, including in cases where all ROIs belong to the same chunk.
The second case is the one where ROIs have length 2 and share some pixels. Once again, all statements are satisfied, even in the presence of "dangerous" overlaps (ROIs that share both pixels and chunks).
There's no free lunch, though: when you write multiple ROIs to the same chunk, you pay a price in dask-graph complexity. It is clear, in the figures below, that the complexity grows when multiple parts of a chunk need to be processed (this is also very reassuring, since it means that dask will not just compute-and-write everything in parallel on the same chunk). When scaling up array sizes, we may expect overhead coming from different functions acting on the same chunk; this is perfectly expected, but the size of the overhead will need to be measured at the relevant scale.
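To see where this complexity comes from, it helps to count how many ROIs land on each chunk. A small helper (hypothetical, not part of the test code) mapping chunk indices to the ROIs whose pixel range intersects them:

```python
from collections import defaultdict

def rois_per_chunk(rois, chunksize):
    """Map each chunk index to the ROIs that intersect it.

    `rois` is a list of (start, end) half-open pixel ranges, as in the
    test code above; `chunksize` is the zarr/dask chunk length.
    """
    chunk_map = defaultdict(list)
    for start, end in rois:
        first_chunk = start // chunksize
        last_chunk = (end - 1) // chunksize
        for c in range(first_chunk, last_chunk + 1):
            chunk_map[c].append((start, end))
    return dict(chunk_map)

# Overlapping length-2 ROIs with chunksize=2: chunk 0 (pixels 0-1) and
# chunk 1 (pixels 2-3) each receive two ROIs, hence the denser graph.
print(rois_per_chunk([(0, 2), (1, 3), (2, 4)], 2))
# {0: [(0, 2), (1, 3)], 1: [(1, 3), (2, 4)]}
```

Whenever a chunk appears with more than one ROI, dask has to serialize (or otherwise coordinate) the writes to that chunk, which is exactly the extra structure visible in the graphs.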
That's it for now, and IMO it closes the issue of parallel writing - until further notice. Points that we do not yet consider fully understood:
I'm closing this now. Feel free to re-open for additional comments.
Wow, great test and really good documentation of the results. Thanks a lot @tcompa ! It is very impressive to see how dask handles such processing for more complex, overlapping setups! I'm particularly impressed & surprised that it handles the overlapping ROIs! You are right, there is no free lunch. Though if the only cost of this lunch is a slight overhead, while dask handles all the complexity: well, then I'm very willing to pay the overhead for that lunch! Even if processing in arbitrary chunks were 50% slower, but it still actually works without raising the complexity of the code we need to write, that is a super good trade-off! Really interesting to eventually consider this for a fancier implementation of cellpose. See also here for distributed cellpose runs that still all come together again into a single, already combined output (without later relabeling): https://github.com/MouseLand/cellpose/blob/main/cellpose/contrib/distributed_segmentation.py
This may actually become significantly easier to do if sharding gets implemented (see #54) and we have the option to play with chunk sizes without exploding file numbers.
Very good point. Means that there is extra temporary storage overhead that Fractal will need. But this overhead is passable.
So another good argument to keep tasks running on a per-well basis then. Good to be aware of this!
A clarification:

```python
x = da.from_zarr(...)
for ROI in ROIs:
    start_x, end_x = ...
    out = delayed_process(x[start_x:end_x])
    x[start_x:end_x] = da.from_delayed(out, ...)
x.to_zarr(..)
```

This code creates a dask graph with additional calls to the time-consuming function, and that's the main reason for avoiding it (apart from it being bad dask practice). Concerning the other issue (overwriting a zarr file, as in #53), it would actually be very interesting to re-assess it after we moved away from
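To make the "additional calls" point concrete, here is a toy pure-Python model (an illustration, not actual dask graph analysis): when results are written back into the input array, each ROI's read may depend on every earlier ROI whose written range overlaps it, whereas with a fresh output array every read hits the original input and these dependencies vanish.

```python
def count_extra_deps(rois):
    # Toy model: if ROI results are written back into the input array,
    # ROI i's read range depends on every earlier ROI whose written
    # range overlaps it. With a fresh output array, every read hits the
    # original input, so none of these dependencies exist.
    extra = 0
    for i, (start, end) in enumerate(rois):
        for prev_start, prev_end in rois[:i]:
            if prev_start < end and start < prev_end:  # half-open overlap
                extra += 1
    return extra

# Single-pixel ROIs: no overlap, no extra dependencies.
print(count_extra_deps([(0, 1), (1, 2), (2, 3), (3, 4)]))  # 0
# Length-2 overlapping ROIs: two extra read-after-write dependencies.
print(count_extra_deps([(0, 2), (1, 3), (2, 4)]))          # 2
```

Each such dependency is a place where the graph can grow (or trigger recomputation), which matches the observation that writing to a separate output array keeps the graph clean.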
Here is a branch from https://github.com/fractal-analytics-platform/fractal/issues/115, to specifically address parallel writing within a "simple" example (unrelated to Fractal). I'll split our notes into a few comments, to keep it readable.
Context
We start from a 4-pixel array `[0, 0, 0, 0]` and we store it to disk, with different chunk sizes (either 1, 2 or 4). A ROI is a contiguous set of pixels; an example of a single-pixel ROI is `[0]`, and an example of a three-pixel ROI is `[0, 1, 2]`. The first set of ROIs is made of four single-pixel regions: `[0], [1], [2], [3]`. The second set of ROIs is made of three ROIs: `[0, 1], [1, 2], [2, 3]`. Notice that in the former case there is no pixel overlap, while in the latter the ROIs share some pixels. The chunk size combined with the ROI size also determines whether each chunk hosts a single ROI or several ROIs.

The task applies `process_ROI()` over all ROIs of an array. The function simply increments the array values by two, and it includes a `time.sleep(1)`. For each run we check that: the output is `[2, 2, 2, 2]`; the number of calls to `process_ROI()` matches the number of ROIs; the timing is as expected.

Important caveat
Within the task, we carefully assign the results of the delayed `process_ROI` calls to an output array which is not the same as the input. First of all, this keeps us close to dask best practices. Moreover, re-assigning values to the same array would impact the three checks above:

- the output may differ from `[2, 2, 2, 2]` in cases where two ROIs share a pixel (this is OK and expected);
- the number of calls to `process_ROI` may be larger than the actual number of ROIs, due to some suboptimal construction of the dask graph (we observed this in some examples).

For these reasons, we should always assign the result of delayed/indexed execution to a new array.
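The first point can be illustrated with plain NumPy (a sketch independent of dask; `apply_rois` is a hypothetical helper, not from the test code): applying the increment in place lets overlapping ROIs see each other's results, while writing into a fresh output array reproduces `[2, 2, 2, 2]`.

```python
import numpy as np

def apply_rois(x, rois, fn, in_place=False):
    # in_place=True mimics re-assigning results into the input array;
    # in_place=False writes into a fresh output, as the task above does.
    work = x.copy()
    out = work if in_place else np.empty_like(x)
    for start, end in rois:
        out[start:end] = fn(work[start:end])
    return out

x = np.zeros(4, dtype=int)
overlapping = [(0, 2), (1, 3), (2, 4)]
print(apply_rois(x, overlapping, lambda a: a + 2))                 # [2 2 2 2]
print(apply_rois(x, overlapping, lambda a: a + 2, in_place=True))  # [2 4 4 2]
```

In the in-place variant, the shared pixels are incremented once per ROI that touches them, which is exactly why the `[2, 2, 2, 2]` check would fail.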
After this long summary/introduction, code and results are in the following comments.