Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observed memory leakage in NanoeventsFactory.from_root() #1072

Open
green-cabbage opened this issue Apr 10, 2024 · 9 comments
Open

Observed memory leakage in NanoeventsFactory.from_root() #1072

green-cabbage opened this issue Apr 10, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@green-cabbage
Copy link
Contributor

This is a quick report of an issue I have noticed with newer version of coffea (2024.3.0) with updates from interactions I had with @lgray . Reproducer of this issue is pushed onto my public repo: https://github.com/green-cabbage/dask_mem_leak_test. Please let me know if I am unclear some certain parts, as I am not experienced in making bug reports.

The basic gist is that loading root files using XRootD links and then doing simple processing leads to dask client's memory slowly increasing until memory overflow unless regularly restarted. This happens for both local cluster as well as external cluster (screen shot of residual memory after no more jobs is added below).
The leakage is observed for both NanoEventsFactory.from_root() and uproot.dask() functions. However, memory leakage is not observed when explicitly using XRootDSource handler for uproot, as in: uproot.dask(, handler=uproot.XRootDSource). However, implementing XRootDSource handler for NanoEventsFactory.from_root() (ie NanoEventsFactory.from_root( ,uproot_options={"handler" : uproot.XRootDSource})) still leads to memory overload.

Moreover, using local filenames instead of XRootD links on uproot.dask() doesn't generate the leakage. The same with NanoEventsFactory.from_root() has not been tried.

memLeak_afterInterrupt png

@green-cabbage green-cabbage added the bug Something isn't working label Apr 10, 2024
@lgray
Copy link
Collaborator

lgray commented Apr 10, 2024

Can you please try this without using a dask distributed client and in single threaded mode. You will have to watch top to understand if there is a problem.

dask.compute(stuff, scheduler="sync") and do not make a distributed.Client() in a with block or otherwise.

I assume the 1 GB memory limit above is only to make it fail sooner?

@green-cabbage
Copy link
Contributor Author

Yes, 1 GiB memory limit is only to make it fail sooner. I suppose you would like me to use memray as well. I have never used that before, so it may take a while since I have to scan the documentations. Could you also be more specific when you say "watch top to understand if there is a problem". Like, what should I be looking for at top? The increasing memory usage?

@lgray
Copy link
Collaborator

lgray commented Apr 10, 2024

Increasing RSS over time, @yimuchen may have some helpful scripts here.

@yimuchen
Copy link
Contributor

yimuchen commented Apr 10, 2024

One tool that you can use to test locally to see if this is dask issue or a coffea issue would be python's tracemalloc [1]. This can help narrow down which function is eating up memory. And example would be something like (might need a bit of tweaking as this was helping with debugging an issue with coffea==0.7):

from coffea.nanoevents import NanoEventsFactory
snapshot = tracemalloc.take_snapshot()
for filename in [f'/srv/Ntuples/{i}_RA2AnalysisTree.root' for i in range(20)]:
  step_size = 500 # Force slicing the file into multiple chunks to see if this memory leaks appear per chunk
  for start in range(0, 30000, step_size):
    events = NanoEventsFactory.from_root(filename,
                                         treepath='TreeMaker2/PreSelection',
                                         schemaclass=BaseSchema,
                                         metadata={
                                             'dataset': 'test'
                                         },
                                         entry_start=start,
                                         entry_stop=start + step_size).events()
    run_my_analysis(events)
  s2 = tracemalloc.take_snapshot()

  for s in s2.statistics('lineno', cumulative=True)[:10]:
    print(s)

What you would want to look out for is if there is a singular function that monotonously increases in each subsequent s2.statistics print-outs. If it points to some function in coffea/awkward, that might be where a real memory leaking is happening and helps narrow down the investigation range.

[1] https://docs.python.org/3/library/tracemalloc.html

@yimuchen
Copy link
Contributor

yimuchen commented Apr 10, 2024

There is also memory-profiler [1], but this only shows if there are leaks happening, not so much where the leak is appearing. But it makes plots which might be easier on the eyes then trying to guess if there are leaks based on text outputs (remember to use the --multiprocess or --include-children flag if you are running local multithread dask)

[1] https://pypi.org/project/memory-profiler/

@nsmith-
Copy link
Member

nsmith- commented Apr 12, 2024

I'm trying your demo repository (https://github.com/green-cabbage/dask_mem_leak_test) now, with

coffea verion: 2024.4.0
distributed verion: 2024.4.1
uproot verion: 5.3.2

(from coffeateam/coffea-dask-almalinux8:2024.4.0-py3.11)
and not seeing much memory growth. After the first chunk is done it seems to stay steady at around 450MB.
Screenshot 2024-04-12 at 1 06 39 PM

Though, I do see a lot of

2024-04-12 13:01:13,086 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)

I tried with and without

uproot_options={"handler" : uproot.XRootDSource}

with no real difference.

After 11 files, I get a read error:

OSError: File did not open properly: [ERROR] Server responded with an error: [3000] Unable to open - cannot determine the prefix path to use for the given filesystem id /store/data/Run2018A/SingleMuon/NANOAOD/UL2018_MiniAODv2_NanoAODv9-v2/2550000/22893D09-9B7B-694B-968E-DC8566E2272B.root; invalid argument

@green-cabbage
Copy link
Contributor Author

green-cabbage commented Apr 14, 2024

Sorry for the late reply. The OSError IMO is primarily due to purdue XRootD not being the most reliable. I have updated the input_file.json and input_file_Big.json in the latest commit here, which should be more reliable as I have replaced purdue's XRootD path with fermilab's.

Moreover, observing lots of
2024-04-12 13:01:13,086 - distributed.utils_perf - WARNING - full garbage collections took 17% CPU time recently (threshold: 10%)
is expected, but if you let your code run longer (in my test, it was around processing 13th root file out of 92), then you would start to read statements along the lines of:
024-04-14 19:28:40,392 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 500.21 MiB -- Worker memory limit: 716.80 MiB

if you keep running the script, the unmanaged memory slowly creeps up until the client automatically goes through phases of pausing and unpausing until it completely freezes due to high unmanaged memory.

On a seperate note, I have ran the local client test with and without uproot_options={"handler" : uproot.XRootDSource} using local paths (NOT XRootD path) and I still observe the unmanaged memory statement. So I believe this issue is not related to XRootD.

Edit: I have made a dask report of the running through memleakage_demo.ipynb the with local root file paths using dask local client here, and you can see in the task stream tab that the gap between the "pt" workload increases dramatically due to significant compute time being dedicated to managing leaking memory from the client.

@green-cabbage
Copy link
Contributor Author

As a small update, it does seem like it is a XRootD problem. I re-ran the reproducer with local root files path, but with memory increased from 0.7 GiB to 1 GiB, and while I had lots warnings like distributed.utils_perf - WARNING - full garbage collections took 26% CPU time recently (threshold: 10%) , the worker never had to pause due to memory leakage.

On a different note, I am not so sure 26% CPU time being used for garbage collection is normal. Is this something I need to look at it separately to optimize my workflow?

@lgray
Copy link
Collaborator

lgray commented Apr 19, 2024

I think you've found instead the "steady state" memory usage.

Lots of GC time means you're making a bunch of temporaries.
But that's easier to diagnose with the memory tools you have at hand.

It might be worth it to give the latest coffea and dask-awkward/dask-histogram a try if you're not already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants