
Sparse mmap()s are counted as fully allocated from the start, which can be very misleading #308

Open
itamarst opened this issue Feb 6, 2022 Discussed in #297 · 12 comments
Labels
bug Something isn't working

Comments

@itamarst
Collaborator

itamarst commented Feb 6, 2022

Discussed in #297

Originally posted by fohria January 26, 2022
Hey! Thanks for this profiler, it looks very useful, if I can figure out how to use it :)

I have a short script that generates a bunch of data and then plots it. Depending on how much I generate, memory use can be many gigabytes. So I'd like to profile it to find out when and where I may have dataframes hanging around from function calls that I could delete once they're no longer needed, e.g. after they've been dumped to a file.

however, running it with fil, i get this:

[screenshot: Fil memory report with most memory attributed to <frozen importlib._bootstrap>]

The light pink areas on the left are the plotting calls, but what does it mean that it says <frozen importlib._bootstrap> all over?

TL;DR version of my code is:

data = generate_data(how_much)  # returns a pd.dataframe
figure = plotting_call(data)

(I've installed Fil into the same conda env I use for the script, if that matters.)

itamarst added the bug and NEXT labels on Feb 6, 2022
@itamarst
Collaborator Author

So far I have compiled the code locally and compared it to the last released version. The locally compiled Fil doesn't even show the import as using any memory at all (which... kinda makes sense; mostly it's an mmap of a file, with a few tiny allocations that should be filtered out).

@itamarst
Collaborator Author

The reason for the difference I was seeing, where sometimes NumPy is included and sometimes it isn't, is the threadpool changes. If numexpr is installed, NumPy gets imported as part of thread pool setup (via the numexpr import), so its memory isn't tracked because it's imported before tracking starts. If numexpr is not installed, NumPy is only imported when the user's code runs, and therefore its memory usage is tracked.

So maybe we want to check for numexpr's existence without importing it.
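
A minimal sketch of that check (assuming importlib.util.find_spec is acceptable here; it consults the import machinery's finders without executing the module, so NumPy doesn't get pulled in as a side effect):

import importlib.util

# Detect numexpr without importing it (and without importing NumPy as a
# side effect of importing numexpr).
HAS_NUMEXPR = importlib.util.find_spec("numexpr") is not None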

Regardless, however, that doesn't explain the variability in reported NumPy-import memory usage, so the next step is figuring out why it sometimes shows a huge percentage when it shouldn't.

@itamarst
Collaborator Author

I think I figured it out:

  1. Some BLAS implementations have a threadpool, likely sized to the number of CPUs.
  2. Each thread does a large anonymous mmap():
ADD MMAP 134217728    0: filpreload::add_allocation
   1: <unknown>
   2: alloc_mmap
   3: blas_memory_alloc
   4: blas_thread_server
   5: start_thread
   6: clone

Thus, depending on the detected number of CPUs and the BLAS version, the reported memory usage for importing NumPy can vary quite a bit.
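
One way to see those pools (a sketch, assuming threadpoolctl is installed) is to dump the native thread pools detected right after the import:

import numpy  # noqa: F401 -- triggers OpenBLAS/BLAS initialization
from threadpoolctl import threadpool_info

# Each entry describes a detected BLAS/OpenMP library; on OpenBLAS,
# num_threads typically defaults to the detected CPU count.
for pool in threadpool_info():
    print(pool["internal_api"], pool.get("num_threads"))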

@itamarst
Collaborator Author

itamarst commented Feb 23, 2022

Of course, in theory there should only be a single thread when using Fil. So it seems something is wrong with the threadpool-controlling code too (there were three of the above tracebacks when running under Conda).

Update: threadpoolctl does not seem to reduce the number of threads in NumPy; it's unclear why. Filed an issue: joblib/threadpoolctl#121
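
For reference, the control that was expected to work looks roughly like this (a sketch; whether it actually caps OpenBLAS's server threads in this situation is exactly what the linked issue is about):

import numpy as np
from threadpoolctl import threadpool_limits

# Expected: the already-loaded BLAS is capped to a single thread.
# Observed: multiple blas_thread_server mmaps still show up under Conda.
with threadpool_limits(limits=1, user_api="blas"):
    np.dot(np.ones((1000, 1000)), np.ones((1000, 1000)))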

@itamarst
Collaborator Author

It's not clear that the current approach of limiting to one thread is correct (assuming it can even be fixed). A zeroed-out new mmap() doesn't actually use any memory, so should we really be counting all of it? And if the user is using BLAS, the profiler will be ignoring a potentially large chunk of memory, especially on machines with a high core count.
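
A quick illustration of that point (a sketch; the kB unit of ru_maxrss assumes Linux): a fresh anonymous mapping barely moves peak RSS until its pages are actually written.

import mmap
import resource

SIZE = 256 * 1024 * 1024  # 256 MiB anonymous mapping

def peak_rss_kb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux

before = peak_rss_kb()
m = mmap.mmap(-1, SIZE)              # zero-filled, lazily backed by the kernel
mapped = peak_rss_kb()               # barely changes: no pages are resident yet
for offset in range(0, SIZE, mmap.PAGESIZE):
    m[offset] = 1                    # touching each page is what actually costs memory
touched = peak_rss_kb()
print(before, mapped, touched)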

Alternatives:

  • Fix the status quo: keep limiting BLAS to a single thread.
  • Stop limiting to one thread. Just show the memory and document the caveats. Probably very confusing, especially since many people won't be using BLAS at all.
  • Stop limiting to one thread. Poll the maps occasionally to see how much of each mmap is dirty, and only count that for purposes of memory usage. This makes "peak memory" harder to find accurately, and it may not work on macOS. The callstack would still refer to the original source of the allocation (e.g. import numpy), so it would be harder to tie it to, e.g., filling in the array contents.
  • Use userfaultfd to track when pages become dirty. This would give accurate callstack attribution. It would only work on Linux and needs the ptrace capability. Scary to implement.

@itamarst
Collaborator Author

itamarst commented Mar 3, 2022

For alternative 3, checking how much of the mmap is filled could be done whenever we check for a new peak, which should... correctly catch peaks, I think.

@itamarst
Collaborator Author

itamarst commented Mar 3, 2022

For alternative 3, it looks like the info is available on macOS via the vmmap utility. https://github.com/rbspy/proc-maps wraps the underlying API, although not with the info we'd need. The latter also claims it requires root and won't work with SIP, and yet I was able to use it on my macOS setup... possibly those restrictions only apply to inspecting arbitrary processes, which is not a use case Fil has. So it might work fine.

@itamarst
Collaborator Author

itamarst commented Mar 26, 2022

As a short-term workaround until alternative 3 above is implemented, I'm going to make sure NumPy is always imported before profiling starts. The memory used by NumPy won't get counted, but in many ways that's not under the user's control anyway. So this seems like a reasonable way to at least give consistent results (it's not like Fil guarantees it tracks everything, anyway).
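
The workaround is essentially just this, somewhere in Fil's startup path before tracking is enabled (a sketch; the exact hook point is an assumption):

# Pre-import NumPy (if present) so its import-time mmaps happen before
# tracking starts and are therefore consistently excluded from reports.
try:
    import numpy  # noqa: F401
except ImportError:
    pass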

@itamarst
Collaborator Author

Retrieving information

On Linux, the data is in /proc/self/smaps. There is a Rust parser (in the procfs crate), but it does a bunch of allocation and can be expected to be pretty slow. https://man7.org/linux/man-pages/man5/proc.5.html documents the format. We would need to parse just:

  1. The address range (the offset field only applies to file-backed maps) from the first line of each map.
  2. ...
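
A rough sketch of that minimal parse (Python here for brevity; the real implementation would presumably live in the Rust preload code), pulling just the address range from each map header plus its dirty byte counts:

import re

MAP_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) ")

def dirty_bytes_by_mapping():
    # Maps (start_address, end_address) -> bytes actually dirtied so far.
    result = {}
    current = None
    with open("/proc/self/smaps") as f:
        for line in f:
            header = MAP_HEADER.match(line)
            if header:
                current = (int(header.group(1), 16), int(header.group(2), 16))
                result[current] = 0
            elif current and line.startswith(("Private_Dirty:", "Shared_Dirty:")):
                result[current] += int(line.split()[1]) * 1024  # field is in kB
    return result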

itamarst changed the title from "Investigate why memory usage for imports is (a) huge (b) variable in some situations" to "Sparse mmap()s are counted as fully allocated from the start, which can be very misleading" on May 12, 2022
@itamarst
Collaborator Author

The data structure representation for this is probably:

  1. When allocating large items, remember the address in a set.
  2. Retrieve (somehow, sometime) a set of "here's how much memory was not actually allocated" per memory range (i.e. per mmap()).
  3. When calculating "should I store a new peak memory?", the potential new peak-memory bytes and the per-callstack numbers can be reduced using items 1 and 2.

The problem with this is that retrieving item 2 is likely to be expensive, so doing it on every free() is... not ideal. Need to measure, of course, but it's "open a file, read a file potentially as large as 1 MB, parse it". For Sciagraph this is a little less problematic since there's sampling, but still.

One possible heuristic: only do the parse if there have been minor page faults since the last check, presuming minor page faults are a good indicator (need to check) and getrusage() is sufficiently cheap (again, need to check). This may again be more viable with Sciagraph, depending on measurements.
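
The heuristic itself is cheap to sketch (assuming ru_minflt from getrusage() is the signal):

import resource

_last_minflt = None

def mappings_might_have_changed():
    # Only bother re-parsing /proc/self/smaps if the process has taken new
    # minor page faults since the last check.
    global _last_minflt
    minflt = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    changed = _last_minflt is None or minflt != _last_minflt
    _last_minflt = minflt
    return changed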

@itamarst
Collaborator Author

/proc/self/smaps_rollup is another alternative to getrusage().
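
smaps_rollup gives process-wide totals in a single small read, so a change detector could look something like this sketch (which field to watch is an assumption):

def total_private_dirty_kb():
    # /proc/self/smaps_rollup aggregates the per-mapping smaps fields, so one
    # short read yields process-wide totals.
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                return int(line.split()[1])  # value is in kB
    return 0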

@itamarst
Collaborator Author

https://github.com/javierhonduco/bookmark reads /proc/self/pagemap.
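
Reading /proc/self/pagemap directly would give per-page residency; a sketch of counting resident pages for one mapping (bit 63 of each 64-bit entry is the "present in RAM" flag):

import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def resident_pages(start_addr, length):
    # One 64-bit little-endian entry per virtual page; bit 63 set means the page
    # is present in RAM (the PFN bits may be zeroed without CAP_SYS_ADMIN).
    n_pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((start_addr // PAGE_SIZE) * 8)
        data = f.read(n_pages * 8)
    return sum(
        (int.from_bytes(data[i * 8:(i + 1) * 8], "little") >> 63) & 1
        for i in range(n_pages)
    )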
