
Sparse mmap()s are counted as fully allocated from the start, which can be very misleading #308

Open
itamarst opened this issue Feb 6, 2022 Discussed in #297 · 12 comments
Labels
bug Something isn't working

Comments

@itamarst
Collaborator

itamarst commented Feb 6, 2022

Discussed in #297

Originally posted by fohria January 26, 2022
Hey! Thanks for this profiler, it looks very useful, if I can figure out how to use it :)

I have a short script that generates a bunch of data and then plots it. Depending on how much I generate, memory use can be many gigabytes. So I'd like to profile it to find out when and where I may have dataframes hanging around from function calls that I could delete once they're no longer needed, e.g. after they've been dumped to a file.

however, running it with fil, i get this:

[screenshot: Fil memory report with most memory attributed to <frozen importlib._bootstrap>]

The light pink areas on the left are the plotting calls, but what does it mean that it says <frozen importlib._bootstrap> all over?

TL;DR version of my code is:

data = generate_data(how_much)  # returns a pd.dataframe
figure = plotting_call(data)

(I've installed Fil into the same conda env I use for the script, if that matters.)

itamarst added the bug and NEXT labels on Feb 6, 2022
@itamarst
Collaborator Author

So far I have compiled the code locally and compared it to the last released version. The locally compiled Fil doesn't even show the import as using any memory at all (which... kinda makes sense; mostly it's an mmap of a file, with a few tiny allocations that should be filtered out).

@itamarst
Collaborator Author

The reason for the difference I was seeing, where sometimes NumPy is included and sometimes it isn't, is the threadpool changes. If numexpr is installed, NumPy gets imported as part of thread pool setup (via the numexpr import), so its memory isn't tracked because it's imported before tracking starts. If numexpr is not installed, NumPy is only imported when the user's code runs, and therefore its memory usage is tracked.

So maybe we want to check for numexpr's existence without importing it.
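
A minimal sketch of that check (assuming importlib.util.find_spec is acceptable here; it consults the import machinery's finders without executing the module, so NumPy doesn't get pulled in as a side effect):

import importlib.util

# Detect numexpr without importing it (and without importing NumPy as a
# side effect of importing numexpr).
HAS_NUMEXPR = importlib.util.find_spec("numexpr") is not None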

Regardless, however, that doesn't explain the variability in reported NumPy-import memory usage, so the next step is figuring out why it sometimes shows a huge percentage when it shouldn't.

@itamarst
Collaborator Author

I think I figured it out:

  1. Some BLAS implementations have a threadpool, likely sized to the number of CPUs.
  2. Each thread does a large anonymous mmap():
ADD MMAP 134217728    0: filpreload::add_allocation
   1: <unknown>
   2: alloc_mmap
   3: blas_memory_alloc
   4: blas_thread_server
   5: start_thread
   6: clone

Thus, depending on the detected number of CPUs and the BLAS version, the reported memory usage for importing NumPy can vary quite a bit.
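
One way to see those pools (a sketch, assuming threadpoolctl is installed) is to dump the native thread pools detected right after the import:

import numpy  # noqa: F401 -- triggers OpenBLAS/BLAS initialization
from threadpoolctl import threadpool_info

# Each entry describes a detected BLAS/OpenMP library; on OpenBLAS,
# num_threads typically defaults to the detected CPU count.
for pool in threadpool_info():
    print(pool["internal_api"], pool.get("num_threads"))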

@itamarst
Collaborator Author

itamarst commented Feb 23, 2022

Of course, in theory there should only be a single thread when using Fil. So it seems something is wrong with the threadpool-controlling code too (there were three of the above tracebacks when running under Conda).

Update: threadpoolctl does not seem to reduce the number of threads in NumPy; it's unclear why. Filed an issue: joblib/threadpoolctl#121
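
For reference, the control that was expected to work looks roughly like this (a sketch; whether it actually caps OpenBLAS's server threads in this situation is exactly what the linked issue is about):

import numpy as np
from threadpoolctl import threadpool_limits

# Expected: the already-loaded BLAS is capped to a single thread.
# Observed: multiple blas_thread_server mmaps still show up under Conda.
with threadpool_limits(limits=1, user_api="blas"):
    np.dot(np.ones((1000, 1000)), np.ones((1000, 1000)))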

@itamarst
Collaborator Author

It's not clear that the current approach of limiting to one thread is correct (assuming it can even be fixed). A zeroed-out new mmap() doesn't actually use any memory, so should we really be counting all of it? And if the user is using BLAS, the profiler will be ignoring a potentially large chunk of memory, especially on machines with a high core count.
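
A quick illustration of that point (a sketch; the kB unit of ru_maxrss assumes Linux): a fresh anonymous mapping barely moves peak RSS until its pages are actually written.

import mmap
import resource

SIZE = 256 * 1024 * 1024  # 256 MiB anonymous mapping

def peak_rss_kb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux

before = peak_rss_kb()
m = mmap.mmap(-1, SIZE)              # zero-filled, lazily backed by the kernel
mapped = peak_rss_kb()               # barely changes: no pages are resident yet
for offset in range(0, SIZE, mmap.PAGESIZE):
    m[offset] = 1                    # touching each page is what actually costs memory
touched = peak_rss_kb()
print(before, mapped, touched)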

Alternatives:

  • Fix the status quo: keep limiting BLAS to a single thread.
  • Stop limiting to one thread. Just show the memory and document the caveats. Probably very confusing, especially since many people won't be using BLAS at all.
  • Stop limiting to one thread. Poll the maps occasionally to see how much of each mmap is dirty, and only count that for purposes of memory usage. This makes "peak memory" harder to find accurately, and it may not work on macOS. The callstack would still refer to the original source of the allocation (e.g. import numpy), so it would be harder to tie it to, e.g., filling in the array contents.
  • Use userfaultfd to track when pages become dirty. This would give accurate callstack attribution. It would only work on Linux and needs the ptrace capability. Scary to implement.

@itamarst
Collaborator Author

itamarst commented Mar 3, 2022

For alternative 3, checking how much of the mmap is filled could be done whenever we check for a new peak, which should... correctly catch peaks, I think.

@itamarst
Collaborator Author

itamarst commented Mar 3, 2022

For alternative 3, it looks like the info is available on macOS via the vmmap utility. https://github.com/rbspy/proc-maps wraps the underlying API, although not with the info we'd need. The latter also claims it requires root and won't work with SIP, and yet I was able to use it on my macOS setup... possibly those restrictions only apply to inspecting arbitrary processes, which is not a use case Fil has. So it might work fine.

@itamarst
Collaborator Author

itamarst commented Mar 26, 2022

As a short-term workaround until alternative 3 above is implemented, I'm going to make sure NumPy is always imported before profiling starts. The memory used by NumPy won't get counted, but in many ways that's not under the user's control anyway. So this seems like a reasonable way to at least give consistent results (it's not like Fil guarantees it tracks everything, anyway).
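
The workaround is essentially just this, somewhere in Fil's startup path before tracking is enabled (a sketch; the exact hook point is an assumption):

# Pre-import NumPy (if present) so its import-time mmaps happen before
# tracking starts and are therefore consistently excluded from reports.
try:
    import numpy  # noqa: F401
except ImportError:
    pass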

@itamarst
Collaborator Author

Retrieving information

On Linux, the data is in /proc/self/smaps. There is a Rust parser (in the procfs crate), but it does a bunch of allocation and can be expected to be pretty slow. https://man7.org/linux/man-pages/man5/proc.5.html documents the format. We would need to parse just:

  1. The address range (the offset field only applies to file-backed maps) from the first line of each map.
  2. ...
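
A rough sketch of that minimal parse (Python here for brevity; the real implementation would presumably live in the Rust preload code), pulling just the address range from each map header plus its dirty byte counts:

import re

MAP_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) ")

def dirty_bytes_by_mapping():
    # Maps (start_address, end_address) -> bytes actually dirtied so far.
    result = {}
    current = None
    with open("/proc/self/smaps") as f:
        for line in f:
            header = MAP_HEADER.match(line)
            if header:
                current = (int(header.group(1), 16), int(header.group(2), 16))
                result[current] = 0
            elif current and line.startswith(("Private_Dirty:", "Shared_Dirty:")):
                result[current] += int(line.split()[1]) * 1024  # field is in kB
    return result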

itamarst changed the title from "Investigate why memory usage for imports is (a) huge (b) variable in some situations" to "Sparse mmap()s are counted as fully allocated from the start, which can be very misleading" on May 12, 2022
@itamarst
Collaborator Author

The data structure representation for this is probably:

  1. When allocating large items, remember the address in a set.
  2. Retrieve (somehow, sometime) a set of "here's how much memory was not actually allocated" per memory range (i.e. per mmap()).
  3. When calculating "should I store a new peak memory?", the potential new peak-memory bytes and the per-callstack numbers can be reduced using items 1 and 2.

The problem with this is that retrieving item 2 is likely to be expensive, so doing it on every free() is... not ideal. Need to measure, of course, but it's "open a file, read a file potentially as large as 1 MB, parse it". For Sciagraph this is a little less problematic since there's sampling, but still.

One possible heuristic: only do the parse if there have been minor page faults since the last check, presuming minor page faults are a good indicator (need to check) and getrusage() is sufficiently cheap (again, need to check). This may again be more viable with Sciagraph, depending on measurements.
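
The heuristic itself is cheap to sketch (assuming ru_minflt from getrusage() is the signal):

import resource

_last_minflt = None

def mappings_might_have_changed():
    # Only bother re-parsing /proc/self/smaps if the process has taken new
    # minor page faults since the last check.
    global _last_minflt
    minflt = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    changed = _last_minflt is None or minflt != _last_minflt
    _last_minflt = minflt
    return changed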

@itamarst
Collaborator Author

/proc/self/smaps_rollup is another alternative to getrusage().
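
smaps_rollup gives process-wide totals in a single small read, so a change detector could look something like this sketch (which field to watch is an assumption):

def total_private_dirty_kb():
    # /proc/self/smaps_rollup aggregates the per-mapping smaps fields, so one
    # short read yields process-wide totals.
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                return int(line.split()[1])  # value is in kB
    return 0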

@itamarst
Collaborator Author

https://github.com/javierhonduco/bookmark reads /proc/self/pagemap.
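
Reading /proc/self/pagemap directly would give per-page residency; a sketch of counting resident pages for one mapping (bit 63 of each 64-bit entry is the "present in RAM" flag):

import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def resident_pages(start_addr, length):
    # One 64-bit little-endian entry per virtual page; bit 63 set means the page
    # is present in RAM (the PFN bits may be zeroed without CAP_SYS_ADMIN).
    n_pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((start_addr // PAGE_SIZE) * 8)
        data = f.read(n_pages * 8)
    return sum(
        (int.from_bytes(data[i * 8:(i + 1) * 8], "little") >> 63) & 1
        for i in range(n_pages)
    )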
