
reading nd arrays takes long time #32

Open
pr4deepr opened this issue Aug 4, 2023 · 21 comments
Labels
bug Something isn't working

Comments


pr4deepr commented Aug 4, 2023

Hi Sebastian
Thanks for making this resource available and maintaining it.
I am trying to read nd arrays from large datasets with pylibczirw_tools.read_5darray. I set the flag output_dask=True so it doesn't read the entire dataset into memory. However, it takes a very long time...

czi_img = pylibczirw_tools.read_5darray(img_path,output_order="TCZYX",
                               output_dask=True,
                               remove_adim=True
                                )

Is this the right way to do it? It takes more than 1 hour.

However, if I use the following code to initialize a dask array, it takes ~5 minutes for a 255GB dataset:

from dask import delayed as da_delayed
import dask.array as da 
from pylibCZIrw import czi as pyczi
from tqdm import trange



czidoc = pyczi.CziReader(img_path)
temp_frame = czidoc.read()
print("Array Shape: ",temp_frame.shape)
total_bbox = czidoc.total_bounding_box
z_slices = total_bbox['Z'][1]
nchannels = total_bbox['C'][1]
total_time = total_bbox['T'][1]

#initialise empty list to hold the dask arrays
img=[]


for time in trange(total_time):
    ch_stack=[]
    for ch in range(nchannels):
        z_stack=[]
        for z in range(z_slices):
            z_slice = da.from_delayed(
                da_delayed(czidoc.read)(plane={'C': ch, 'T': time, 'Z': z}),
                shape=temp_frame.shape, dtype=temp_frame.dtype
            )
            z_slice = da.squeeze(z_slice)
            z_stack.append(z_slice)
        z_stack = da.stack(z_stack, axis=0)
        ch_stack.append(z_stack)
    img.append(ch_stack)

image_stack = da.stack(img,axis=0)
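For reference, the nested loops above can be written more compactly with nested comprehensions. This is a minimal sketch, where read_plane_stub is a hypothetical stand-in for czidoc.read (not a pylibCZIrw function); building the stack only records tasks and reads nothing until compute() is called.

```python
import numpy as np
import dask.array as da
from dask import delayed

# Hypothetical stand-in for czidoc.read(plane=...): returns one (Y, X, 1) plane.
def read_plane_stub(t, c, z):
    return np.zeros((64, 64, 1), dtype=np.uint16)

T, C, Z = 3, 2, 5

# Same nested stacking as the loops above, written as comprehensions.
# Only the task graph is built here; no plane is read yet.
stack = da.stack([
    da.stack([
        da.stack([
            da.from_delayed(delayed(read_plane_stub)(t, c, z),
                            shape=(64, 64, 1), dtype=np.uint16).squeeze()
            for z in range(Z)])
        for c in range(C)])
    for t in range(T)])

print(stack.shape)  # (3, 2, 5, 64, 64)
```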

sebi06 commented Aug 4, 2023

Hi @pr4deepr

I also "played" around with the delayed approach to improve my experimental tools. I will modify your code a bit and then create a new branch where you can review how I incorporated your nice code into czitools.


sebi06 commented Aug 4, 2023

Can you test your CZI using this branch: work_in_progress


sebi06 commented Aug 4, 2023

It is basically this code:

def read_6darray(
    filepath: Union[str, os.PathLike[str]], output_order: str = "STCZYX", **kwargs: int
) -> Tuple[Optional[Union[np.ndarray, da.Array]], czimd.CziMetadata, str]:
    """Read a CZI image file as 6D numpy or dask array.
    Important: Currently, only scenes of equal size and CZIs with consistent pixel types are supported.

    Args:
        filepath (str | Path): filepath for the CZI image
        output_order (str, optional): Order of dimensions for the output array. Defaults to "STCZYX".
        kwargs (int, optional): Allowed kwargs are S, T, Z, C and values must be >=0 (zero-based).
                                Will be used to read only a substack from the array

    Returns:
        Tuple[array6d, mdata, dim_string]: output as 6D dask array, metadata and dimstring
    """

    if isinstance(filepath, Path):
        # convert to string
        filepath = str(filepath)

    # get the complete metadata at once as one big class
    mdata = czimd.CziMetadata(filepath)

    if not mdata.consistent_pixeltypes:
        print("Detected PixelTypes are not consistent. Cannot create array6d")
        return None, mdata, ""

    if mdata.consistent_pixeltypes:
        # use pixel type from first channel
        use_pixeltype = mdata.npdtype[0]

    valid_order = ["STCZYX", "STZCYX"]

    if output_order not in valid_order:
        print("Invalid dimension order 6D:", output_order)
        return None, mdata, ""

    if not mdata.scene_shape_is_consistent:
        print("Scenes have inconsistent shape. Cannot read 6D array")
        return None, mdata, ""

    # open the CZI document to read the
    with pyczi.open_czi(filepath) as czidoc:
        if mdata.image.SizeS is not None:
            # get size for a single scene using the 1st
            # works only if scene shape is consistent

            # use the size of the 1st scenes_bounding_rectangle
            size_x = czidoc.scenes_bounding_rectangle[0].w
            size_y = czidoc.scenes_bounding_rectangle[0].h

        if mdata.image.SizeS is None:
            # use the size of the total_bounding_rectangle
            size_x = czidoc.total_bounding_rectangle.w
            size_y = czidoc.total_bounding_rectangle.h

        # check if dimensions are None (because they do not exist for that image)
        size_c = misc.check_dimsize(mdata.image.SizeC, set2value=1)
        size_z = misc.check_dimsize(mdata.image.SizeZ, set2value=1)
        size_t = misc.check_dimsize(mdata.image.SizeT, set2value=1)
        size_s = misc.check_dimsize(mdata.image.SizeS, set2value=1)

        # check for additional **kwargs to create substacks
        if "S" in kwargs and mdata.image.SizeS is not None:
            size_s = kwargs["S"] + 1
            mdata.image.SizeS = 1

        if "T" in kwargs:
            size_t = kwargs["T"] + 1
            mdata.image.SizeT = 1

        if "Z" in kwargs:
            size_z = kwargs["Z"] + 1
            mdata.image.SizeZ = 1

        if "C" in kwargs:
            size_c = kwargs["C"] + 1
            mdata.image.SizeC = 1

        shape2d = (size_y, size_x, 3) if mdata.isRGB else (size_y, size_x)

        remove_adim = not mdata.isRGB

        # initialise empty list to hold the dask arrays
        img = []

        for s in trange(size_s):
            time_stack = []
            for time in range(size_t):
                ch_stack = []
                for ch in trange(size_c):
                    z_stack = []
                    for z in range(size_z):
                        z_slice = da.from_delayed(
                            read_plane(
                                czidoc,
                                s=s,
                                t=time,
                                c=ch,
                                z=z,
                                has_scenes=mdata.image.SizeS is not None,
                                remove_adim=remove_adim,
                            ),
                            shape=shape2d,
                            dtype=mdata.npdtype[0],
                        )

                        # create 2d array
                        z_slice = da.squeeze(z_slice)

                        # append to the z-stack
                        z_stack.append(z_slice)

                    # stack the array and create new dask array
                    z_stack = da.stack(z_stack, axis=0)

                    # create CZYX list of dask array
                    ch_stack.append(z_stack)

                # create TCZYX list of dask array
                time_stack.append(ch_stack)

            # create STCZYX list of dask array
            img.append(time_stack)

        # create the final STCZYX dask array
        array6d = da.stack(img, axis=0)

        # change the dimension order if needed
        if output_order == "STZCYX":
            dim_string = "STZCYXA"
            array6d = np.swapaxes(array6d, 2, 3)

        if output_order == "STCZYX":
            dim_string = "STCZYXA"

        if remove_adim:
            dim_string = dim_string.replace("A", "")

    return array6d, mdata, dim_string


@dask.delayed
def read_plane(
    czidoc: pyczi.CziReader,
    s: int = 0,
    t: int = 0,
    c: int = 0,
    z: int = 0,
    has_scenes: bool = True,
    remove_adim: bool = True,
):
    """Dask delayed function to read a 2D plane from a CZI image, which has the shape
       (Y, X, 1) or (Y, X, 3).
       If the image is a grayscale image, the last array dimension will be removed when
       the option is set to True.

    Args:
        czidoc (pyczi.CziReader): Czireader objects
        s (int, optional): Scene index. Defaults to 0.
        t (int, optional): Time index. Defaults to 0.
        c (int, optional): Channel index. Defaults to 0.
        z (int, optional): Z-Plane index. Defaults to 0.
        has_scenes (bool, optional): Defines if the CZI actually contains scenes. Defaults to True.
        remove_adim (bool, optional): Option to remove the last dimension of the 2D array. Defaults to True.

    Returns:
        np.ndarray: 2D image plane, read lazily via dask.delayed
    """

    if has_scenes:
        # read a 2D plane using the scene index
        image2d = czidoc.read(plane={"T": t, "C": c, "Z": z}, scene=s)
    else:
        # read a 2D plane when the CZI has no scenes
        image2d = czidoc.read(plane={"T": t, "C": c, "Z": z})

    # remove the trailing "A" dimension when desired
    return image2d[..., 0] if remove_adim else image2d


pr4deepr commented Aug 7, 2023

For some reason, the time is read as -1 and the channel as 0 for a multi-timepoint, two-channel CZI image. This is a file split by scene after LLS imaging.

I was trying to troubleshoot by running
mdata = czimd.CziMetadata(filepath)

but I get an mdata value of 0 when testing this in a notebook.

However, the mdata value in your function above seems to work when I debug and run the read_6darray function.


sebi06 commented Aug 7, 2023

Hi @pr4deepr

The CZI from the LLS7: how did you split the scenes? Did you use "Split Scenes" already during acquisition? If yes, this might be the reason, since this function creates individual CZIs plus a "satellite file", and the individual CZIs have strange metadata.

I have not tested my code with those, since I do not really like or recommend this function. Rather, use "Split Scenes Write Files" after the acquisition is finished. This takes longer but should at least produce correct metadata.

Can you somehow share this file?

And is it OK for you if I add the code above to this package, since a major part is based on your post?


pr4deepr commented Aug 8, 2023

Yeah, feel free to add the code to the package. An acknowledgment in the release notes would be enough...
I thought it would be easier to make changes and do this with a pull request..

With the LLS7, I was told we use the "Split Scenes" during acquisition option for now. We've had issues with incorrect metadata pertaining to the scenes. I'll see if we can share any of these files. Do you have a Zeiss dropbox or storage link where we can drop some data?


sebi06 commented Aug 9, 2023

We do not have Dropbox at ZEISS. But if you share a link anyway, I will find a way to download the file.


sebi06 commented Aug 22, 2023

Hi @pr4deepr

I incorporated your changes and released czitools 0.4.1 with the delayed reading. I also created a 100GB test CZI stack and tried to open it in napari. But what I observed was that my RAM (128GB) was filling up completely over time, until I ran into strange error messages.

What I do not understand is what is actually filling up the memory, since this should be delayed reading. But napari never even reached the state where it displayed any data.

I used this code to test it:

from czitools import read_tools
from czitools import napari_tools
from czitools import misc_tools
import napari
from pathlib import Path

# adapt to your needs
defaultdir = Path(Path(__file__).resolve().parents[2]) / "data"

# open a simple dialog to select a CZI file
filepath = misc_tools.openfile(directory=defaultdir,
                               title="Open CZI Image File",
                               ftypename="CZI Files",
                               extension="*.czi")

# return an array with dimension order STCZYX(A)
array6d, mdata, dim_string6d = read_tools.read_6darray(filepath,
                                                       output_order="STCZYX",
                                                       use_dask=True,
                                                       chunk_zyx=False,
                                                       # T=0,
                                                       # Z=0
                                                       )

# show array inside napari viewer
viewer = napari.Viewer()
layers = napari_tools.show(viewer, array6d, mdata,
                           dim_string=dim_string6d,
                           blending="additive",
                           contrast='from_czi',
                           gamma=0.85,
                           add_mdtable=True,
                           name_sliders=True)

napari.run()

Do you have any idea what this issue could be?

@pr4deepr

It's possible it's something to do with how napari handles the dask cache. What happens if you turn it on and off, or reconfigure the cache size?
https://napari.org/stable/api/napari.utils.html

@zeissmicroscopy

I will try this. While napari tries to open the 100GB dask-delayed stack, my task manager looks like this. I did set dask_cache = 0.7 (I have 128GB RAM). So far it keeps growing ...

[screenshot: Task Manager showing RAM usage climbing]

@zeissmicroscopy

5min later ...

[screenshot: Task Manager, RAM usage still growing]

@zeissmicroscopy

And then it finally opened :-) And sliders are quite responsive as well!

[screenshot: the 100GB stack open in the napari viewer]

@sebi06 sebi06 closed this as completed Oct 19, 2023

sebi06 commented Oct 19, 2023

It works much faster than before after the changes. But feel free to reopen if needed.

@pr4deepr

Sorry @sebi06 , been quite busy..
Do you still have the memory issues? How big is the entire dataset?


sebi06 commented Oct 27, 2023

No, it seems to work "good enough" now. Thanks for asking. I think it was a 100GB dataset.

@pr4deepr

Your RAM usage is 100 GB above. It looks like roughly the entire dataset has been loaded into memory?


sebi06 commented Oct 29, 2023

@pr4deepr

I did not check yet, but do you have an idea why the code should load the entire dataset into memory?
Just reading the array lazily should not do this?
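Building the stack itself should indeed stay lazy. A quick sanity check, sketched with plain dask and independent of czitools, shows that a dask array far larger than RAM can be created and sliced without being materialized:

```python
import numpy as np
import dask.array as da

# A lazy array whose logical size (~168 GB) far exceeds typical RAM;
# only the task graph exists at this point, no memory is filled.
huge = da.zeros((100, 4, 50, 2048, 2048), dtype=np.uint16)
print(huge.nbytes / 1e9)  # logical size in GB, not resident memory

# Computing a tiny slice only materializes that slice.
corner = huge[0, 0, 0, :2, :2].compute()
print(corner.shape)  # (2, 2)
```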

@sebi06 sebi06 reopened this Oct 29, 2023

sebi06 commented Oct 29, 2023

Hi @pr4deepr

I checked with a smaller file and it seems that the memory "increase" occurs inside read_tools.py here:

[screenshot: line-by-line memory profile of read_tools.py]

So creating the "final" stack actually leads to everything being loaded into memory? Does that make sense (I am not a dask expert)?

What I do not understand is why the memory only seems to increase in line 210 but not in other lines using the same command.


pr4deepr commented Nov 1, 2023

I'll have a look into this. I need to see if specifying chunk sizes explicitly earlier may help.
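As a sketch of the chunking idea: a stack assembled plane by plane ends up with one chunk per 2D plane, and rechunk can merge those into larger blocks before the array is handed to napari. The sizes below are made up for illustration:

```python
import numpy as np
import dask.array as da

# A stack built from per-plane delayed reads typically has one chunk per plane.
arr = da.zeros((1, 2, 10, 512, 512), dtype=np.uint16,
               chunks=(1, 1, 1, 512, 512))
print(arr.numblocks)        # (1, 2, 10, 1, 1)

# Merge the Z axis into a single chunk so downstream code sees ZYX blocks.
rechunked = arr.rechunk({2: 10})
print(rechunked.numblocks)  # (1, 2, 1, 1, 1)
```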

BTW, how do you profile memory for each line like that?


sebi06 commented Nov 3, 2023

Hi @pr4deepr

First you need to run pip install memory-profiler and then you use it like this:

from memory_profiler import profile


# decorate the function whose per-line
# memory usage should be monitored
@profile
def read_6darray(
    filepath: Union[str, os.PathLike[str]],
    output_order: str = "STCZYX",
    use_dask: bool = False,
    chunk_zyx=False,
    **kwargs: int
) -> Tuple[Optional[Union[np.ndarray, da.Array]], czimd.CziMetadata, str]:
...


sebi06 commented Nov 3, 2023

I searched the web and found this: dask/dask#4448

but I am not sure if that has something to do with what I see. And this one might also help:

delayed-best-practices
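One point from those links that may apply here: capturing a single large reader object in every delayed call embeds that object in the task graph. A hedged sketch of the alternative is to pass a cheap key (the filepath) and open the resource inside the task; read_plane_from_file below is hypothetical, not czitools API:

```python
import numpy as np
import dask.array as da
from dask import delayed

# Hypothetical per-task reader: open the file inside the task instead of
# closing over one shared czidoc handle (e.g. with pyczi.open_czi(path)).
def read_plane_from_file(path, z):
    # placeholder for: with pyczi.open_czi(path) as czidoc: return czidoc.read(...)
    return np.full((32, 32), z, dtype=np.uint16)

path = "example.czi"  # placeholder path
planes = [da.from_delayed(delayed(read_plane_from_file)(path, z),
                          shape=(32, 32), dtype=np.uint16)
          for z in range(4)]
stack = da.stack(planes)
print(stack.shape)  # (4, 32, 32)
```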
