-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for analysis of source code/scripted languages #1080
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased)
section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start towards adding scripting language support to capa! thanks @adamstorek!
The code is already quite good and I don't anticipate any major issues to getting it merged; however, I have added a number of comments on regions I think should be tweaked.
One thing you should know about me is that I prefer to over-communicate review feedback with the understanding that everything is up for discussion. So, if anything feels weird or wrong, don't hesitate to ask for more details or deeper discussion.
General points:
- I really like the file range address type, I think that will work well.
- I think we can simplify the code a bit by merging the "Script" feature into "Format". thoughts?
- Some of the embedded data and configuration can be restructured into python globals.
- I'd like to hear a bit about what it takes to embed/depend on the TS languages so we can ensure its easy for people to download/use.
- Please add tests showing the features extracted on various example files
class ScriptLanguage(Feature): | ||
def __init__(self, value: str, description=None): | ||
super().__init__(value, description=description) | ||
self.name = "script language" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we use format
for this? e.g. format: C#
.
pro:
- fewer features to memorize
- less duplication
- less code
con:
- maybe slightly less precise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with overloading the file format feature is that file to language is a one-to-many mapping, e.g. there can be embedded templates that contain multiple different script languages such as C# for server-side scripts and JavaScript for client-side.
capa/features/extractors/script.py
Outdated
def get_language_from_ext(path: str): | ||
_, ext = os.path.splitext(path) | ||
if ext == ".cs": | ||
return LANG_CS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also think about maybe trying to guess the language based on the file contents if there's no extension.
also supporting things like .cs_
which some may use to prevent the file from accidentally getting executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto-identifying the script language is definitely a worthwhile feature which I have deprioritized for the minimal implementation and instead manually incorporated the extensions (now including the defanged extensions as suggested above). This is also because it might not always be very straightforward to do so (e.g. one file might include multiple scripts; sometimes context might be necessary).
2fdcac2
to
0a61d86
Compare
i think it would be worthwhile to get the tests running (and passing) in CI. this means:
|
Just submitted the pull request pull request.
On it. |
…such as byte-range address.
…rves as an interface to the language-specific tree-sitter queries.
…ng language-independent extractors).
…SitterEngine class.
…and function definition parsing for a pure C# sample.
…d fixed bugs found in the process.
…and not introduce unspecified rule-exceptions.
…default to the base extractor level).
if self.is_aspx_import_directive(node): | ||
namespace = self.get_aspx_namespace(node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document why we are only/specially handling ASPX
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to the issue discussed here: #1080 (comment).
…refactored language toolkit code; added extraction of global constants.
…ddressed most of the GH pull request comments/suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice updates! I've left a few comments, questions, and suggestions for your review 🚀
tree = _parse(ts_language, buf) | ||
except ValueError: | ||
continue | ||
if not _contains_errors(ts_language, tree.root_node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment on what are assumptions are here? I'm not overly familiar with tree-sitter but it appears that we assume it will only throw errors when encountering a language mismatch e.g. we attempt to parse Python using tree-sitter C# tooling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be more readable, what do you think?
def _parse(ts_language: Language, buf: bytes) -> Optional[Tree]:
try:
parser = Parser()
parser.set_language(ts_language)
return parser.parse(buf)
except ValueError:
return None
def _contains_errors(ts_language, node: Node) -> bool:
return ts_language.query("(ERROR) @error").captures(node)
def get_language_ts(buf: bytes) -> str:
for language, ts_language in TS_LANGUAGES.items():
tree = _parse(ts_language, buf)
if tree and not _contains_errors(ts_language, tree.root_node):
return language
raise ValueError("failed to parse the language")
@@ -0,0 +1,15 @@ | |||
from tree_sitter import Language | |||
|
|||
build_dir = "build/my-languages.so" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean we only support Linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tree-sitter needs to compile its (C) language bindings. Although I have a limited knowledge of package management, I've suggested to Moritz that we should precompile and package the supported tree-sitter bindings for each platform we support. The current state is a temporary measure.
def parse(self) -> Tree: | ||
parser = Parser() | ||
parser.set_language(self.query.language) | ||
return parser.parse(self.buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this call generate any exceptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can throw type errors (which I believe we prevent with mypy) and a value error if parsing completely fails. Then the parse method will throw a ValueError, so the engine will throw a ValueError etc.: I can handle it in the following way at the Extractor level:
try:
self.language = capa.features.extractors.ts.autodetect.get_language(path)
self.template_engine = self.get_template_engine(buf)
self.engines = self.get_engines(buf)
except ValueError as e:
raise UnsupportedFormatError(e)
def get_range(self, node: Node) -> str: | ||
return self.get_byte_range(node).decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intended use of this function appears to be decoding a string found in a specific byte range? If so, consider changing the name to something more descriptive like get_str_from_range
. Also, do we expect encoding exceptions to be thrown by the decode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, but I doubt tree-sitter would be able to parse something that we can't decode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also changed get_range
to get_str
.
def _extract_imported_constants(fn_node: Node, engine: TreeSitterExtractorEngine) -> Iterator[Tuple[Feature, Address]]: | ||
for ic_node, ic_name in engine.get_processed_imported_constants(fn_node): | ||
for name in get_imports(ic_name, engine.namespaces, engine): | ||
yield API(engine.language_toolkit.format_imported_constant(name)), engine.get_address(ic_node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need more discussion on #1125
capa/features/extractors/ts/tools.py
Outdated
signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file)) | ||
return {category: set(namespaces) for category, namespaces in signatures.items()} | ||
|
||
def _is_import(self, name: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo w/ _
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is merely a private method to handle import table lookups that the public is_import
method uses.
return int(integer, base) | ||
return int(integer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect exceptions to occur here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can raise ValueError (and does when TS labels something as an int which is not an int), but the ValueError is handled by the caller (see extract_integers).
…es in order to make rules clearer; refactored the codebase to address the latest PR comments/suggestions.
…ace, and modified tests.
…(passes all test cases but by no means perfect); further clean up, especially of the signatures; synced with new Python test cases.
This enhancement extends capa's functionality to the analysis of potentially malicious scripts and source code. A tree-sitter backend was added to parse the source files into a lightweight AST. Features akin to the PE-Vivisect capa are then extracted:
File-level:
Function-level:
To install Tree-sitter:
pip3 install tree-sitter
mkdir vendor build
cd vendor
git clone [email protected]:tree-sitter/tree-sitter-c-sharp.git
git clone [email protected]:tree-sitter/tree-sitter-embedded-template.git
git clone [email protected]:tree-sitter/tree-sitter-html.git
git clone [email protected]:tree-sitter/tree-sitter-javascript.git
Checklist