Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match BinExport2 instructions using patterns #2353

Merged
merged 5 commits into from
Sep 12, 2024

Conversation

williballenthin
Copy link
Collaborator

This PR against the BinExport2 PR introduces a way to match instructions and extract expression values. We use this mini-language to specify how to extract features from BinExport2 instructions, such as Number, Offset, nzxor, etc. While there's a fair amount of code (~500 lines, with tests) to implement this matching language, it is:

  1. more human readable
  2. less fragile
  3. better tested
  4. and therefore more maintainable (I think).

To understand all the changes here, I'd recommend reading:

  1. 2cff633: consolidate expression tree handling. Specifically, this introduces fixes for Ghidra BinExport2 containing lsl #0 expressions that have no effect. We prune these from the expression tree.
  2. 76eed3d: introduce instruction patterns. Start by looking at the tests (maybe bottom up) so you see what the routines do, and then read how they do it. I think the performance should be similiar to hand-written code, but I'd be curious if you can prove me wrong.

Checklist

  • No CHANGELOG update needed
  • No documentation update needed

add pruning phase to expression tree building
to remove known-bad branches. This might address
some of the data we're seeing due to:
NationalSecurityAgency/ghidra#6821

Also introduces a --instruction optional argument
to dump the details of a specific instruction.
@williballenthin williballenthin force-pushed the push-xmurxlmxyyyn branch 5 times, most recently from d15bca5 to 99767ff Compare September 11, 2024 11:36
@williballenthin williballenthin mentioned this pull request Sep 11, 2024
24 tasks
Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work and I think a big improvement over the nested ifs before - the tests give further confidence that this works as expected!

return

instruction: BinExport2.Instruction = be2.instruction[ii.instruction_index]
# guaranteed to be simple int/reg operands
# so we don't have to realize the tree/list.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, great comments here throughout!

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @williballenthin . I've left some small comments for your review but overall this is a big improvement to the nested checks that we were using. LGTM 🚀

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all of the tests! This helps breakdown what the various functions are supposed to do.

children_tree_indexes: List[int] = expression_tree[tree_index]

if expression.type == BinExport2.Expression.OPERATOR:
if len(children_tree_indexes) == 0 and expression.symbol in ("lsl", "lsr"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle other extended register operands? see https://devblogs.microsoft.com/oldnewthing/20220727-00/?p=106907

capa/features/extractors/binexport2/helpers.py Outdated Show resolved Hide resolved
# - #int
# - #int0
# and a limited set of supported operators.
assert re.match(r"^(reg|#int)[0-9]?(\(stack\)|\(not-stack\))?$", o) or o in ("[", ",", "!", "+", "*")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concern with a performance hit from repeated regex matching here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used when parsing the patterns, which we expect to happen once at startup, O(1). I figured it wasn't worth optimizing that path (such a tiny impact) versus line count/context here. Would have to move the regex declaration out of the function, which makes it harder to follow as a human.

Comment on lines +81 to +88
"""
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack), #int] ; capture #int
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack), #int]! ; capture #int
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack)], #int ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack), #int] ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack), #int]! ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack)], #int ; capture #int
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neat 🚀

capa/features/extractors/binexport2/arch/arm/insn.py Outdated Show resolved Hide resolved
Introduce intruction pattern matching to declaratively
describe the instructions and operands that we want to
extract. While there's a bit more code, its much more
thoroughly tested, and is less brittle than the prior
if/else/if/else/if/else implementation.
@williballenthin williballenthin merged commit 38dab5c into mandiant:feat/1755 Sep 12, 2024
2 checks passed
@williballenthin williballenthin deleted the push-xmurxlmxyyyn branch September 12, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants