I find that linear sweep disassembly is unreliable, at best. I prefer my data to be recognized as data, even if it means I have to mark it as such manually (from "undefined" or "raw bytes" state).
Basically, in my reversing approach, I prefer not to mark code as such just because it matches some heuristic. Recursive descent is very precise, so rather than using linear sweep analysis, I prefer to look for code pointers and feed them back into the recursive-descent (RD) analysis queue, even if I have to do so manually.
Unfortunately, this approach often means including some context-specific techniques (for example, finding the ARM vector table in a firmware or finding an array of structures that contain function pointers). Such contexts can be specific to a platform, operating system, or even particular software package.
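To make the vector-table example concrete, here's a rough sketch of seeding an RD queue from a Cortex-M style vector table. The function name and the layout assumptions (table at offset 0, word 0 is the initial stack pointer, subsequent words are handler addresses with the Thumb bit set) are mine for illustration, not taken from any particular tool:

```python
import struct

def seed_from_cortex_m_vector_table(data, num_vectors=16):
    """Extract plausible code entry points from a Cortex-M vector table.

    Assumes the firmware image begins with the vector table:
    word 0 is the initial stack pointer, words 1..N are exception
    handler addresses. On Cortex-M all code is Thumb, so valid
    handler entries have the low bit set; we clear it to get the
    actual address to feed into a recursive-descent queue.
    """
    entries = struct.unpack_from("<%dI" % num_vectors, data, 0)
    handlers = entries[1:]  # skip the initial SP in word 0
    seeds = set()
    for h in handlers:
        if h in (0, 0xFFFFFFFF):
            continue  # unused/erased slots
        if h & 1:  # Thumb bit must be set for a valid handler
            seeds.add(h & ~1)
    return sorted(seeds)
```

Each address returned would then be defined as a function entry and disassembled recursively, rather than relying on a sweep to stumble onto the handlers.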
At the risk of sounding pompous, I think the aspiration should be for more code to be recognized correctly as code than just "more stuff marked as code". Having to undo incorrect analysis is more painful than having to do some yourself.
The text was updated successfully, but these errors were encountered:
We'll discuss it internally more, but my guess is it's highly unlikely we change this. We optimize for the "touch the fewest knobs" experience out of the box. Those who prefer this off can always change the default setting as they desire, but those who need the assistance the most are less likely to even understand the nuance, and they benefit from having it on.
You should have seen the sheer number of issues/complaints we received, before we had linear sweep, from people who didn't understand why functions were sometimes missed!
Also, FWIW, in our testing our linear sweep was actually the most accurate of any of the main tools by F1 score, accounting for both false positives and false negatives. We don't simply optimize for "most functions"; our goal is accuracy. Of course, that testing was some time ago, and we really need to revisit the research since we've made a bunch of improvements since then. I imagine the results have only gotten better, but other tools could have changed too! Here's the blog post that explains our original implementation and has a lot of the data behind that claim:
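For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, so a tool scores well only if it avoids both false positives (non-functions marked as functions) and false negatives (real functions missed). A minimal illustration (the function name is mine):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall.

    tp: real functions correctly identified
    fp: spurious functions reported (hurts precision)
    fn: real functions missed (hurts recall)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note that a tool that simply marks everything as code would maximize recall but get a poor F1, which is exactly the distinction between "accuracy" and "most functions" here.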
That's also consistent with what we hear from the majority of customers. Of course, not everyone has that experience; with different samples you'll get different results, so there will always be some individuals whose experience is worse than the mean!
I agree, though, that it absolutely could and should be better. Even if we assume we're out-performing other approaches, that's no excuse not to do better.
Now that we have workflows (which didn't exist when we first implemented linear sweep), I like the idea of being more selective about when, and with what settings, we run it. I do think we can do better than a one-size-fits-all approach, and language-specific or platform-specific settings are absolutely something we could improve on. I suspect we're at the point where the majority of improvements to our results won't come from sweeping changes across all binaries; rather, we'll begin to tune individual workflows for specific platforms and languages, as long as we can do it in a way that lets users tweak those workflows and make their own modifications, which can be tricky to get right.