I find that linear sweep disassembly is unreliable, at best. I prefer my data to be recognized as data, even if it means I have to mark it as such manually (from "undefined" or "raw bytes" state).
Basically, in my reversing approach, I prefer not to mark code as such just because it matches some heuristic. Recursive descent is very precise, so rather than using linear sweep analysis, I prefer to look for code pointers and feed them back into the recursive-descent (RD) analysis queue, even if I have to do so manually.
Unfortunately, this approach often means including some context-specific techniques (for example, finding the ARM vector table in a firmware or finding an array of structures that contain function pointers). Such contexts can be specific to a platform, operating system, or even particular software package.
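To make the vector-table example concrete, here's a rough sketch of seeding an RD queue from a Cortex-M style vector table. The function name and the layout assumptions (table at offset 0, word 0 is the initial stack pointer, subsequent words are handler addresses with the Thumb bit set) are mine for illustration, not taken from any particular tool:

```python
import struct

def seed_from_cortex_m_vector_table(data, num_vectors=16):
    """Extract plausible code entry points from a Cortex-M vector table.

    Assumes the firmware image begins with the vector table:
    word 0 is the initial stack pointer, words 1..N are exception
    handler addresses. On Cortex-M all code is Thumb, so valid
    handler entries have the low bit set; we clear it to get the
    actual address to feed into a recursive-descent queue.
    """
    entries = struct.unpack_from("<%dI" % num_vectors, data, 0)
    handlers = entries[1:]  # skip the initial SP in word 0
    seeds = set()
    for h in handlers:
        if h in (0, 0xFFFFFFFF):
            continue  # unused/erased slots
        if h & 1:  # Thumb bit must be set for a valid handler
            seeds.add(h & ~1)
    return sorted(seeds)
```

Each address returned would then be defined as a function entry and disassembled recursively, rather than relying on a sweep to stumble onto the handlers.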
At the risk of sounding pompous, I think the aspiration should be for more code to be recognized correctly as code than just "more stuff marked as code". Having to undo incorrect analysis is more painful than having to do some yourself.
The text was updated successfully, but these errors were encountered:
We'll discuss it internally more, but my guess is it's highly unlikely we change this. We optimize for the "touch the fewest knobs" experience out of the box. Those who prefer this off can always change the default setting as they desire, but those who need the assistance the most are less likely to even understand the nuance, and they benefit from having it on.
You should have seen the sheer number of issues/complaints we received, before we had linear sweep, from people who didn't understand why functions were sometimes missed!
Also, FWIW, in our testing our linear sweep was actually the most accurate of any of the main tools by F1 score, accounting for both false positives and false negatives. We don't simply optimize for "most functions"; our goal is accuracy. Of course, that testing was some time ago, and we really need to revisit the research since we've made a bunch of improvements since then. I imagine the results have only gotten better, but other tools could have changed too! Here's the blog post that explains our original implementation and has a lot of the data behind that claim:
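For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, so a tool scores well only if it avoids both false positives (non-functions marked as functions) and false negatives (real functions missed). A minimal illustration (the function name is mine):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall.

    tp: real functions correctly identified
    fp: spurious functions reported (hurts precision)
    fn: real functions missed (hurts recall)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Note that a tool that simply marks everything as code would maximize recall but get a poor F1, which is exactly the distinction between "accuracy" and "most functions" here.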
That's also consistent with what we hear from the majority of customers. Of course, not everyone has that experience; with different samples you'll get different results, so there will always be some individuals whose experience is worse than the mean!
I agree, though, that it absolutely could and should be better. Even if we assume we're out-performing other approaches, that's no excuse not to do better.
Now that we have workflows (which didn't exist when we first implemented linear sweep), I like the idea of being more selective about when, and with what settings, we run it. I do think we can do better than a one-size-fits-all approach, and language-specific or platform-specific settings are absolutely something we could improve on. I suspect we're at the point where the majority of improvements to our results won't come from sweeping changes across all binaries; rather, we'll begin to tune individual workflows for specific platforms and languages, as long as we can do it in a way that lets users tweak those workflows and make their own modifications, which can be tricky to get right.