
Error backtrace fix for frame-pointer omitted code #147

Open
wants to merge 12 commits into base: master

Conversation

janmejay

The previous merge of this branch didn't account for frame-pointer-omitted code.

This PR deals with that by reading the memory mapping just before we start profiling, and not dereferencing %rbp when we believe it is not mapped.

Of course, this assumes a sane application that doesn't create and discard threads at the drop of a hat. Some applications will have such bad thread usage (mature ones won't, though).

For bad applications like that, native-bt can be disabled using a policy flag.

For well-behaved applications where all code maintains frame pointers, we are golden: we get all backtraces.

For well-behaved applications with some frame-pointer-omitted code, we will follow the backtrace as far as we think it makes sense.

E.g. if we suddenly find the base pointer pointing to an unmapped or non-writable address, we assume it is a frame-pointer-omitted callee and stop walking the backtrace (such backtraces are reported as snipped).
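The stopping rule described above can be sketched roughly as follows (Python for illustration only; the real agent is native code, and all names, the simulated stack, and the mapping layout here are hypothetical):

```python
# Illustrative sketch (not the agent's actual code): walk a frame-pointer
# chain, but stop as soon as the saved base pointer falls outside a mapped,
# writable region -- the heuristic this PR uses to detect frame-pointer-
# omitted callees. The backtrace is then reported as "snipped".

def in_writable_mapping(addr, mappings):
    """mappings: list of (start, end, writable) tuples, end exclusive."""
    return any(start <= addr < end and writable
               for start, end, writable in mappings)

def walk_backtrace(rbp, stack, mappings, max_frames=64):
    """stack: simulated memory, addr -> (saved_rbp, return_addr)."""
    frames, snipped = [], False
    for _ in range(max_frames):
        if rbp == 0:
            break  # clean end of the frame chain
        if not in_writable_mapping(rbp, mappings) or rbp not in stack:
            snipped = True  # looks like frame-pointer-omitted code: stop
            break
        saved_rbp, ret_addr = stack[rbp]
        frames.append(ret_addr)
        rbp = saved_rbp
    return frames, snipped
```

A chain whose innermost saved %rbp points into unmapped memory yields the frames walked so far with the snipped flag set, matching the behavior described above.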

…so check if it really points to a mapped place. Also expose a profiling-policy param that can be used to disable native-backtrace collection for applications that play with stack allocations too frequently, rendering our match-with-mapping strategy ineffective as protection.
…t was not readable, which would almost always happen due to frame-pointer-omitted code)
…ut of a symlink) + relaxed guesstimate-assert on txt-length of a few simple functions
…ccessfully when one of the parent dirs is a symlink
…, almost always caused by frame-pointer-omitted text) as snipped backtraces.
@@ -156,6 +156,7 @@ message CpuSampleWork {
required uint32 frequency = 1;
required uint32 max_frames = 2;
optional uint32 serialization_flush_threshold = 3 [default = 100];
optional bool capture_error_bt = 4 [default = true];

shouldn't this be capture_native_bt?

Author

Actually, even though the effect is a native backtrace, from the user's PoV it is a native backtrace captured iff we couldn't grab a Java backtrace (in other words, we had an error capturing the Java backtrace, so we ended up capturing the next best thing). Hence the name.

@anvinjain

anvinjain commented Jun 19, 2017

Discussed thread create/destroy scenarios offline. For example, popular frameworks like DW have defaults for request-handling threads like min=8 and max=1024. These can lead to on-demand creation/destruction of threads and changes in mapped memory regions. Since we update the memory map at the start of work and do not update it during the work duration (which is on the order of a few minutes), if an application has aggressive thread lifetimes, we discourage teams from enabling native bt in the policy.
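The snapshot being discussed is essentially a one-time parse of the process's memory map. A minimal sketch of that parse, in /proc/&lt;pid&gt;/maps format (Python for illustration; the sample lines are hardcoded, and the function name is hypothetical):

```python
# Illustrative sketch: parse /proc/<pid>/maps-format text into ranges the
# sampler can check %rbp against. Because the map is read once at the start
# of work, regions mapped later (e.g. new thread stacks) are missed -- the
# staleness concern raised in this thread.

def parse_maps(text):
    """Return a list of (start, end, perms) from /proc/<pid>/maps lines."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        addr_range, perms = line.split()[:2]
        start, end = (int(x, 16) for x in addr_range.split('-'))
        regions.append((start, end, perms))
    return regions

sample = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/dbus-daemon
7ffc04b10000-7ffc04b31000 rw-p 00000000 00:00 0 [stack]
"""
regions = parse_maps(sample)
```

A thread stack created after this snapshot would have no entry in `regions`, so a perfectly valid %rbp pointing into it would be treated as unmapped and the backtrace snipped.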

@janmejay
Author

About thread creation and deletion: applications would generally be tuned for peak traffic, so variability in thread count doesn't make sense.

Variability in connections, for instance, may make sense, because we can keep only so many connections warm at a certain load; but people take that concept and apply it to threads blindly, which doesn't make any sense.

For naive applications that have a variable thread count, we should encourage them to benchmark and fix upon a number that gives them the best performance.

If they don't have time to benchmark and fix on a number, then the error-backtrace feature isn't useful to them anyway, because they have much larger inefficiencies in their config, so they may as well turn the feature off.

It's possible to handle mapping changes, but it won't buy people anything. Either they benchmark and tune, or they don't. If they do, they don't need dynamism; if they don't, they have very large problems outside of this, so this feature doesn't matter to them.

@gauravAshok

Won't reading the mmaps only at the start of work create a problem? Controlling thread lifecycle can be hard if it's handled in a 3rd-party library. A library can spawn and kill threads on every invocation when completing the task at hand takes some time.

@janmejay
Author

Some libraries will be of that kind, but those often won't be the library of choice for production. I am assuming applications that need to know about CPU burn happening outside of Java code have already optimized almost everything possible on the Java side of the application. For such an application (mostly userpath stuff), those libraries won't make sense anyway.

New threads in production have a lot of issues. E.g. they are cache-cold; they don't have a scheduling history and will be placed arbitrarily, which means it'll take several migrations and context switches for the whole ecosystem to come back into a good rhythm; they make working-set size unpredictable; etc.

But if we really do see the need for it (as in, our more performance-hungry applications recycle threads too much), we can go down the libunwind path (use DWARF support, dig further into the x86_64 ABI, etc.) or stay with the frame-pointer approach but override the SIGSEGV handler and deliver to the JVM's handler after handling bad stack dereferences.

The main thing we lack right now is time to implement these things. If I manage to finish off-CPU soon enough, I don't mind re-looking at it, actually.

But I still strongly believe that if an application is dependent on such libraries and is CPU-profiling and tuning hotspots, it will get larger efficiency gains by solving its thread-management issues before trying to tune code too much, so this feature is going to be of limited value for an application that is broken at that level.

@janmejay
Author

Actually, I have a working libunwind-based impl (https://github.com/Flipkart/fk-prof/tree/omitfp_code) which we can try out. What do you guys think about rolling it out to our test cluster to see how it holds up?

I'll obviously have to port over some fixes from error_backtrace to make it ready to go.

The reason I ditched it: it was giving me much lower-quality data (it may be the overhead; I haven't checked yet). E.g. insights that come out with 90% dominance with our impl come out with ~10% presence with it.

We didn't choose it in the first place because of overhead affecting data quality. But it's worth remembering that this is an error-only scenario we are talking about here, so data quality is not "super" important.

@janmejay
Author

Another relationship worth observing: if in reality 1% of backtraces fall back to native bt, their slowness (compared to our approach) will not really affect things much overall.

But if in some workload 20% of samples fall back to native bt, it'll really matter.
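The back-of-envelope relationship above can be written down directly: average sampling cost is a weighted mix of the fast path and the fallback path. A tiny sketch (the per-sample costs are made up purely for illustration):

```python
# Back-of-envelope sketch of the point above: how much the slower native-bt
# fallback matters overall depends on what fraction of samples hit it.
# The microsecond costs below are hypothetical, not measured numbers.

def avg_sample_cost(fallback_fraction, fast_cost_us, slow_cost_us):
    """Expected per-sample cost given a fallback fraction in [0, 1]."""
    return ((1 - fallback_fraction) * fast_cost_us
            + fallback_fraction * slow_cost_us)

fast, slow = 1.0, 20.0  # hypothetical per-sample costs, microseconds
rare = avg_sample_cost(0.01, fast, slow)   # 1% of samples fall back
heavy = avg_sample_cost(0.20, fast, slow)  # 20% of samples fall back
```

With these made-up numbers, a 1% fallback rate barely moves the average, while a 20% rate multiplies it several times over, which is the distinction being drawn.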

@gauravAshok

I think it's alright for now, then. We can come back to it later. As people start to use it, we will also gain more insight.

@janmejay
Author

This PR should not be merged, because we have changed the approach in favor of libunwind (.eh_frame, etc.). We need to keep it open, though, because when we start code-cache integration this may become a problem and we may have to fall back to base-pointer-based backtrace reading.

3 participants