
Memory usage #17

Open
fjorka opened this issue Oct 27, 2021 · 6 comments
Labels: bug (Something isn't working)

fjorka commented Oct 27, 2021

  • nd2 version: 0.1.4
  • Python version: 3.7.10
  • Operating System: Windows 10

Description

I am trying to load selected parts of nd2 files, but much more memory is allocated than the computed objects require. As a consequence, loading fails for objects larger than about a quarter of the available memory (roughly 4× the object size is allocated).

What I Did

Test on a time-lapse experiment:

[screenshot: nd2_ram]

Test on a big single-time-point image:

[screenshot: nd2_single_frame]

In the second example, the memory allocation is correct when the whole file has to be computed.

It may be related to object sizes being calculated incorrectly, as shown here:

[screenshot: nd2_measure_size]
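
(For reference, a minimal sketch of how to compare a computed object's reported size against the process's actual memory growth using psutil; the file name is a placeholder:)

import os

import psutil
import nd2

proc = psutil.Process(os.getpid())

with nd2.ND2File("example.nd2") as f:   # placeholder file name
    x = f.to_xarray()
    rss_before = proc.memory_info().rss
    a = x.isel(C=0).compute()
    rss_after = proc.memory_info().rss

print(f"object size: {a.nbytes / 1e9:.3f} GB")
print(f"RSS growth:  {(rss_after - rss_before) / 1e9:.3f} GB")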

tlambert03 added the bug label on Nov 10, 2021
@tlambert03 (Owner)

Hi @fjorka

Let's first explore our memory-profiling options. I just created a script and ran it using memory_profiler.

# script.py
import nd2
from memory_profiler import profile
import numpy as np

@profile
def main():
    f = nd2.ND2File("big.nd2")
    x = f.to_xarray()

    # instead of for loop... easier to see effect of each line in report
    a = x.isel(C=0, Z=0, T=np.arange(0, 10)).compute()
    b = x.isel(C=0, Z=0, T=np.arange(10, 20)).compute()
    c = x.isel(C=0, Z=0, T=np.arange(20, 30)).compute()
    d = x.isel(C=0, Z=0, T=np.arange(30, 40)).compute()
    e = x.isel(C=0, Z=0, T=np.arange(40, 50)).compute()
    f = x.isel(C=0, Z=0, T=np.arange(50, 60)).compute()
    g = x.isel(C=0, Z=0, T=np.arange(60, 70)).compute()
    h = x.isel(C=0, Z=0, T=np.arange(70, 80)).compute()
    i = x.isel(C=0, Z=0, T=np.arange(80, 90)).compute()
    j = x.isel(C=0, Z=0, T=np.arange(90, 100)).compute()


if __name__ == "__main__":
    main()

Then run with python script.py.

I get the following output:

Filename: /Users/talley/Dropbox (HMS)/Python/nd2/script.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     6     51.2 MiB     51.2 MiB           1   @profile
     7                                         def main():
     8     54.0 MiB      2.8 MiB           2       f = nd2.ND2File("big.nd2")
     9
    11    116.8 MiB     62.8 MiB           1       x = f.to_xarray()
    12    150.1 MiB     33.3 MiB           1       a = x.isel(C=0, Z=0, T=np.arange(0, 10)).compute()
    13    177.6 MiB     27.5 MiB           1       b = x.isel(C=0, Z=0, T=np.arange(10, 20)).compute()
    14    204.5 MiB     26.8 MiB           1       c = x.isel(C=0, Z=0, T=np.arange(20, 30)).compute()
    15    230.5 MiB     26.1 MiB           1       d = x.isel(C=0, Z=0, T=np.arange(30, 40)).compute()
    16    257.6 MiB     27.1 MiB           1       e = x.isel(C=0, Z=0, T=np.arange(40, 50)).compute()
    17    283.7 MiB     26.1 MiB           1       f = x.isel(C=0, Z=0, T=np.arange(50, 60)).compute()
    18    309.2 MiB     25.5 MiB           1       g = x.isel(C=0, Z=0, T=np.arange(60, 70)).compute()
    19    334.3 MiB     25.1 MiB           1       h = x.isel(C=0, Z=0, T=np.arange(70, 80)).compute()
    20    359.3 MiB     25.0 MiB           1       i = x.isel(C=0, Z=0, T=np.arange(80, 90)).compute()
    21    384.9 MiB     25.6 MiB           1       j = x.isel(C=0, Z=0, T=np.arange(90, 100)).compute()

Though the file is 15GB, it looks to be allocating about what I'd expect for each chunk.
Can you try this with your file? (just want to rule out that psutil is giving something funny).

If you get something dramatically different with your file, I might want to play with it? 😬... I know it's a lot to ask, but let me know if you can share it somehow (Dropbox, etc...)
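
(As an independent cross-check that doesn't go through psutil or memory_profiler, tracemalloc can report peak Python-side allocations; a minimal sketch along the lines of the script above. Note it only sees allocations made through Python's allocator, so treat it as a lower bound; numpy allocations are traced on recent numpy versions:)

import tracemalloc

import numpy as np
import nd2

tracemalloc.start()

with nd2.ND2File("big.nd2") as f:
    x = f.to_xarray()
    a = x.isel(C=0, Z=0, T=np.arange(0, 10)).compute()

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()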


fjorka commented Nov 15, 2021

Hi @tlambert03
Unfortunately, the result looks the same when I profile with memory_profiler. For example:

import os

import nd2
from memory_profiler import profile

nd2_file = r'DBSP12D20#1_20X.nd2'
file_path = os.path.join(nd2_dir, nd2_file)  # nd2_dir: the directory holding the file, defined elsewhere

@profile
def main():
    f = nd2.ND2File(file_path)
    x = f.to_xarray()

    a = x.isel(C=0).compute()
    print(f'object size: {a.nbytes/1e9} GB')

if __name__ == "__main__":
    main()

Gives the profile of:

object size: 0.807561216 GB
Filename: D:\BARC\nd2_memory\memory_test_slide.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    10     34.4 MiB     34.4 MiB           1   @profile
    11                                         def main():
    12     35.6 MiB      1.2 MiB           1       f = nd2.ND2File(file_path)
    13    101.9 MiB     66.3 MiB           1       x = f.to_xarray()
    14
    15                                             # instead of for loop... easier to see effect of each line in report
    16   2412.7 MiB   2310.8 MiB           1       a = x.isel(C=0).compute()
    17   2412.7 MiB      0.0 MiB           1       print(f'object size: {a.nbytes/1e9} GB')

I shared the file from the above example with you. The one from the previous example is ~0.5 TB (multi-position time-lapse), but I can figure out a way to share it too if you would like to work with it.

@tlambert03 (Owner)

Thanks! I downloaded it.

You know... one thing that is probably important to mention here, which I should have thought of earlier: nd2 files are not (natively) chunked along the channel axis. So when you load one channel for a given timepoint, you load them all.

You should be able to save memory by loading only a Z or T subset... but chunking in channels will require some additional functionality that isn't natively supported by the nd2 format (still possible).

One additional observation: try leaving xarray out of the loop. Use just f.asarray() or f.to_dask(). With dask, you can then sub-chunk using indexing:

print(f.sizes)  # see axes in order
d = f.to_dask()
d[0, 0].compute()  # just get the first index along the first two dimensions

...and remember that if any of those dimensions are XY or C, it won't save memory (until that's added)
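
For instance, a minimal sketch of processing one timepoint at a time this way (assuming T is the first axis, with a hypothetical process() step; the file name is a placeholder):

import nd2

f = nd2.ND2File("big.nd2")   # placeholder path
print(f.sizes)               # axes in order, e.g. {'T': ..., 'C': ..., 'Y': ..., 'X': ...}
d = f.to_dask()

for t in range(f.sizes["T"]):
    frame = d[t].compute()   # one timepoint; all channels come along, since C isn't chunked
    process(frame)           # hypothetical per-frame work; frame is freed on the next iteration
f.close()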


fjorka commented Nov 23, 2021

Thanks for the explanation @tlambert03!

I rewrote the code to load only single time points and assemble them later.
The loop looks as follows:

import numpy as np
import nd2

C_list = [1, 2, 3]
P_list = [0, 1, 2]
T = np.arange(288)

im_nd2_reader = nd2.ND2File(file_path)  # expected shape (T, P, C, Y, X) - (577, 15, 4, 2765, 2765)
im_nd2_dask = im_nd2_reader.to_dask()

for P in P_list:
    for C in C_list:

        # create an empty container for one (P, C) stack
        im = np.empty(shape=[len(T), im_nd2_reader.shape[3], im_nd2_reader.shape[4]], dtype='uint16')

        for ind in T:
            frame = im_nd2_dask[ind, P, C, :, :].compute()
            im[ind, :, :] = frame

        # save im
A single im is around 4 GB, but this loop takes ~18-24 GB of RAM to execute (never less than 18 GB after the initial loading). In my mind, it should never open more than a single time point and should require around 4 GB of RAM in total. Do you have any insights into what I can do better here?
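
(Given the note above that C isn't chunked, each per-channel .compute() re-reads every channel, so one possible restructuring is to read each (T, P) frame once and write all channels of interest straight to disk; a sketch, using the same names as above and hypothetical .npy output files:)

import numpy as np
import nd2

C_list = [1, 2, 3]
P_list = [0, 1, 2]
T = np.arange(288)

im_nd2_reader = nd2.ND2File(file_path)        # file_path as above
im_nd2_dask = im_nd2_reader.to_dask()         # (T, P, C, Y, X)
Y, X = im_nd2_reader.shape[3], im_nd2_reader.shape[4]

for P in P_list:
    # one on-disk array per channel, so only a single frame lives in RAM at a time
    outs = {
        C: np.lib.format.open_memmap(f"P{P}_C{C}.npy", mode="w+",
                                     dtype="uint16", shape=(len(T), Y, X))
        for C in C_list
    }
    for ind in T:
        frame = im_nd2_dask[ind, P].compute()  # all channels are read together anyway
        for C in C_list:
            outs[C][ind] = frame[C]
    for m in outs.values():
        m.flush()

im_nd2_reader.close()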


elgw commented Nov 28, 2022

> You should be able to save memory by loading only a Z or T subset... but chunking in channels will require some additional functionality that isn't natively supported by the nd2 format (still possible).

Something along those lines would be a nice addition to this library.

It is possible to read just one xy-plane at a time from the nd2 file and discard the data for the irrelevant channels; the downside is that all the data has to be re-read for each channel (I'm not sure how this works with time series). To reduce RAM usage even more, the data could be streamed to disk as it is read, but that might be out of scope here.
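
For example, a minimal sketch of that streaming idea using dask's built-in zarr writer (assuming zarr is installed; file names are placeholders):

import nd2

with nd2.ND2File("example.nd2") as f:   # placeholder path
    d = f.to_dask()
    # dask writes one chunk at a time, so peak RAM stays near a single chunk's size
    d.to_zarr("example.zarr")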

@tlambert03 (Owner)

> Something along those lines would be a nice addition to this library.

Thanks @elgw, this feature is being tracked at #85
