add support for analysis of source code/scripted languages #1080

Draft
wants to merge 51 commits into
base: master
Commits
bbd3f70
Added initial capa control flow for scripts in C#.
adamstorek Jun 27, 2022
8173397
Implemented some further basic TreeSitter Extractor-related concepts …
adamstorek Jun 27, 2022
428f6bc
Modified mypy config file to ignore tree-sitter's missing exports.
adamstorek Jun 28, 2022
a6d7ba2
Implemented core tree sitter engine component with C# queries that se…
adamstorek Jun 28, 2022
80bf78b
Implemented script global extraction handlers (mostly wrapping existi…
adamstorek Jun 28, 2022
cf3dc7e
Reworked format parsing to align better with the rest of capa logic.
adamstorek Jun 28, 2022
9d7f575
Implemented a large part of the C# functionality; refactored the Tree…
adamstorek Jun 29, 2022
3d4b4ec
Added function-level feature extraction.
adamstorek Jun 30, 2022
eca7ead
Bug fixes and code refactoring of the Tree Sitter extractor.
adamstorek Jun 30, 2022
5fd953f
Added tree_sitter to requirements in setup.py.
adamstorek Jun 30, 2022
1f79db9
Added tests for TreeSitterExtractorEngine initialization, new object …
adamstorek Jul 1, 2022
a58bc0b
Added more TreeSitterExtractorEngine tests for pure C#.
adamstorek Jul 1, 2022
5ddb8ba
Added last remaining tests for the TreeSitterExtractorEngine class an…
adamstorek Jul 1, 2022
31e2fb9
Reverted yielding only non-empty strings in order to stay consistent …
adamstorek Jul 5, 2022
5bf3f18
Removing functions that should not be used in tree-sitter extractor (…
adamstorek Jul 5, 2022
a4529fc
Modifying extraction of global statements to omit local function decl…
adamstorek Jul 5, 2022
d5de9a1
Added script language feature to freeze.
adamstorek Jul 5, 2022
6c10458
Added test cases for TS Extractor.
adamstorek Jul 5, 2022
9bd9824
Refactored query bindings.
adamstorek Jul 6, 2022
2594849
Added support for template parsing.
adamstorek Jul 6, 2022
619ed94
Added support for HTML parsing.
adamstorek Jul 6, 2022
5e23802
Implemented the necessary modifications to support embedded templates…
adamstorek Jul 7, 2022
5d83e8d
Added more buildings to build; minor style improvement.
adamstorek Jul 7, 2022
9570523
Further refactored the Tree-sitter queries and fixed minor template e…
adamstorek Jul 7, 2022
7c5e6e3
Refactored extractor engine tests and began adding new template tests.
adamstorek Jul 7, 2022
1e0326a
Added new tests for embedded template testing and refactored a few al…
adamstorek Jul 8, 2022
ca1939f
Bug fixes in extractor and HTML Tree-sitter engine.
adamstorek Jul 8, 2022
d7ab2db
Fixed important namespace-parsing bugs.
adamstorek Jul 11, 2022
5cfbecc
Further improvement to namespace parsing, including default namespace…
adamstorek Jul 11, 2022
26cc1bc
Added more tests and a few minor bug fixes.
adamstorek Jul 11, 2022
2a9e76f
Added language-specific integer parsing.
adamstorek Jul 12, 2022
672ca71
Fixed an important bug in FileOffsetRangeAddress comparison method.
adamstorek Jul 12, 2022
ca426ca
Added more ASPX tests.
adamstorek Jul 12, 2022
fd80277
Fixed the capa control flow to fully support capa scripts.
adamstorek Jul 12, 2022
d0c4acb
Major changes: switching imports and function names to properties, st…
adamstorek Jul 18, 2022
ad31d83
Fixed property-extraction bugs.
adamstorek Jul 19, 2022
e52a9b3
Added few more test cases.
adamstorek Jul 19, 2022
b27713b
Minor style improvements.
adamstorek Jul 19, 2022
b2df2b0
Removed deprecated parse_integer.
adamstorek Jul 19, 2022
a0379a6
Added more tests; fixed integer parsing related bugs.
adamstorek Jul 19, 2022
eeecb63
Fixing address range bug; refactoring and cleanup.
adamstorek Jul 20, 2022
cebc5e1
Incorporated more tests.
adamstorek Jul 20, 2022
d7dcc94
Added support for Python.
adamstorek Jul 26, 2022
32dc5ff
Added more python test cases; fixed a number of python bugs; further …
adamstorek Jul 29, 2022
5e85a6e
Implemented namespace aliasing; further refactored the codebase.
adamstorek Aug 2, 2022
614900f
Refactored/simplified parts of the codebase to improve readability; a…
adamstorek Aug 3, 2022
bb08181
Implemented script language auto-detection.
adamstorek Aug 3, 2022
1fd9d4a
Removed a spurious import.
adamstorek Aug 3, 2022
7ba978f
Added more test cases; moved script language feature to global featur…
adamstorek Aug 5, 2022
25cf09b
Introduced auto-detection to template-script parsing, builtins namesp…
adamstorek Aug 10, 2022
e693573
Attempted to implement the class extraction as specified last Friday …
adamstorek Aug 12, 2022
3 changes: 3 additions & 0 deletions .github/mypy/mypy.ini
@@ -76,4 +76,7 @@ ignore_missing_imports = True
ignore_missing_imports = True

[mypy-dncil.*]
ignore_missing_imports = True

[mypy-tree_sitter.*]
ignore_missing_imports = True
20 changes: 20 additions & 0 deletions capa/features/address.py
@@ -53,6 +53,26 @@ def __repr__(self):
        return f"file(0x{self:x})"


class FileOffsetRangeAddress(Address):
    """an address range relative to the start of a file"""

    def __init__(self, start_byte, end_byte):
        self.start_byte = start_byte
        self.end_byte = end_byte

    def __eq__(self, other):
        return (self.start_byte, self.end_byte) == (other.start_byte, other.end_byte)

    def __lt__(self, other):
        return (self.start_byte, self.end_byte) < (other.start_byte, other.end_byte)

    def __hash__(self):
        return hash((self.start_byte, self.end_byte))

    def __repr__(self):
        return f"file(0x{self.start_byte:x}, 0x{self.end_byte:x})"


class DNTokenAddress(Address):
    """a .NET token"""

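A minimal sketch of the tuple-based comparison that FileOffsetRangeAddress relies on, for readers who want to see the resulting ordering. `RangeAddress` is a hypothetical stand-in, not capa's class; the `isinstance` guards and `functools.total_ordering` are assumptions added for illustration and are not part of the diff above.

```python
from functools import total_ordering


@total_ordering
class RangeAddress:
    """Hypothetical stand-in for FileOffsetRangeAddress (not capa's class)."""

    def __init__(self, start_byte: int, end_byte: int):
        self.start_byte = start_byte
        self.end_byte = end_byte

    def _key(self):
        # Comparing (start, end) tuples orders by start first, then by end.
        return (self.start_byte, self.end_byte)

    def __eq__(self, other):
        # Assumption: unrelated types compare unequal instead of raising.
        if not isinstance(other, RangeAddress):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, RangeAddress):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Hash the same tuple used for equality so equal ranges collide.
        return hash(self._key())

    def __repr__(self):
        return f"file(0x{self.start_byte:x}, 0x{self.end_byte:x})"


addrs = [RangeAddress(0x10, 0x40), RangeAddress(0x10, 0x20), RangeAddress(0x0, 0x80)]
ordered = sorted(addrs)
```

Because equality and hashing use the same key, two ranges with identical bounds collapse to one entry in a set or dict, which is presumably what feature deduplication would need.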
9 changes: 8 additions & 1 deletion capa/features/common.py
@@ -405,10 +405,17 @@ def __init__(self, value: str, description=None):
        self.name = "os"


class ScriptLanguage(Feature):
    def __init__(self, value: str, description=None):
        super().__init__(value, description=description)
        self.name = "script language"
Comment on lines +408 to +411
Collaborator

could we use format for this? e.g. format: C#.

pro:

  • fewer features to memorize
  • less duplication
  • less code

con:

  • maybe slightly less precise

Author

The problem with overloading the file format feature is that file to language is a one-to-many mapping, e.g. there can be embedded templates that contain multiple different script languages such as C# for server-side scripts and JavaScript for client-side.
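To make the one-to-many point concrete, here is a rough sketch, assuming a template file has already been split into per-language regions. `Region` and `language_features` are hypothetical names used only for illustration, and the byte offsets are made up; the (language, byte range) pairs merely mirror the (feature, address) shape used by the extractors in this PR.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical: one embedded script region inside a template file."""

    language: str
    start_byte: int
    end_byte: int


def language_features(regions: List[Region]) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Yield one (language, byte range) pair per region."""
    for region in regions:
        yield region.language, (region.start_byte, region.end_byte)


# Hypothetical ASPX page: a server-side C# block and a client-side JavaScript block.
regions = [Region("C#", 0x20, 0x180), Region("JavaScript", 0x200, 0x2C0)]
features = list(language_features(regions))

file_format = "script"  # a single value describes the whole file
languages = sorted({language for language, _ in features})  # several values per file
```

A single format value cannot carry both languages, whereas a per-range language feature can.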



FORMAT_PE = "pe"
FORMAT_ELF = "elf"
FORMAT_DOTNET = "dotnet"
VALID_FORMAT = (FORMAT_PE, FORMAT_ELF, FORMAT_DOTNET)
FORMAT_SCRIPT = "script"
VALID_FORMAT = (FORMAT_PE, FORMAT_ELF, FORMAT_DOTNET, FORMAT_SCRIPT)
# internal only, not to be used in rules
FORMAT_AUTO = "auto"
FORMAT_SC32 = "sc32"
16 changes: 15 additions & 1 deletion capa/features/extractors/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,21 @@
import capa.features
import capa.features.extractors.elf
import capa.features.extractors.pefile
from capa.features.common import OS, FORMAT_PE, FORMAT_ELF, OS_WINDOWS, FORMAT_FREEZE, Arch, Format, String, Feature
from capa.features.common import (
    OS,
    FORMAT_PE,
    FORMAT_ELF,
    OS_WINDOWS,
    FORMAT_FREEZE,
    FORMAT_SCRIPT,
    Arch,
    Format,
    String,
    Feature,
)
from capa.features.freeze import is_freeze
from capa.features.address import NO_ADDRESS, Address, FileOffsetAddress
from capa.features.extractors.ts.autodetect import is_script

logger = logging.getLogger(__name__)

@@ -34,6 +46,8 @@ def extract_format(buf) -> Iterator[Tuple[Feature, Address]]:
        yield Format(FORMAT_ELF), NO_ADDRESS
    elif is_freeze(buf):
        yield Format(FORMAT_FREEZE), NO_ADDRESS
    elif is_script(buf):
        yield Format(FORMAT_SCRIPT), NO_ADDRESS
    else:
        # we likely end up here:
        # 1. handling a file format (e.g. macho)
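The ordering in the hunk above matters: the script check only runs once the binary checks have failed, so a buffer that starts with a binary magic value is never handed to the script detector. A simplified sketch of that dispatch follows. The magic-byte checks, `FORMAT_UNKNOWN`, `detect_format`, and `looks_like_text` are assumptions for illustration (the freeze check is omitted, and `looks_like_text` is a crude stand-in for `is_script`, which in the PR parses with tree-sitter).

```python
from typing import Callable

FORMAT_PE = "pe"
FORMAT_ELF = "elf"
FORMAT_SCRIPT = "script"
FORMAT_UNKNOWN = "unknown"  # hypothetical; not defined in the diff


def detect_format(buf: bytes, is_script: Callable[[bytes], bool]) -> str:
    """Try cheap binary magic checks first, then fall back to the script check."""
    if buf.startswith(b"MZ"):  # assumed PE check
        return FORMAT_PE
    if buf.startswith(b"\x7fELF"):  # assumed ELF check
        return FORMAT_ELF
    if is_script(buf):
        return FORMAT_SCRIPT
    return FORMAT_UNKNOWN


def looks_like_text(buf: bytes) -> bool:
    """Hypothetical stand-in for is_script: accepts any non-empty UTF-8 text."""
    if not buf:
        return False
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

One consequence of this ordering is that a text file that happens to begin with "MZ" would be classified as PE before the script detector ever sees it.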
41 changes: 41 additions & 0 deletions capa/features/extractors/script.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
from typing import Tuple, Iterator

from capa.features.common import OS, OS_ANY, ARCH_ANY, FORMAT_SCRIPT, Arch, Format, Feature, ScriptLanguage
from capa.features.address import NO_ADDRESS, Address, FileOffsetRangeAddress

# Can be used to instantiate tree_sitter Language objects (see ts/query.py)
LANG_CS = "c_sharp"
LANG_HTML = "html"
LANG_JS = "javascript"
LANG_PY = "python"
LANG_TEM = "embedded_template"

EXT_ASPX = ("aspx", "aspx_")
EXT_CS = ("cs", "cs_")
EXT_HTML = ("html", "html_")
EXT_PY = ("py", "py_")


LANGUAGE_FEATURE_FORMAT = {
    LANG_CS: "C#",
    LANG_HTML: "HTML",
    LANG_JS: "JavaScript",
    LANG_PY: "Python",
    LANG_TEM: "Embedded Template",
}


def extract_arch() -> Iterator[Tuple[Feature, Address]]:
    yield Arch(ARCH_ANY), NO_ADDRESS


def extract_language(language: str, addr: FileOffsetRangeAddress) -> Iterator[Tuple[Feature, Address]]:
    yield ScriptLanguage(LANGUAGE_FEATURE_FORMAT[language]), addr


def extract_os() -> Iterator[Tuple[Feature, Address]]:
    yield OS(OS_ANY), NO_ADDRESS


def extract_format() -> Iterator[Tuple[Feature, Address]]:
    yield Format(FORMAT_SCRIPT), NO_ADDRESS
Empty file.
65 changes: 65 additions & 0 deletions capa/features/extractors/ts/autodetect.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
from typing import Optional

from tree_sitter import Node, Tree, Parser, Language

from capa.features.extractors.script import EXT_CS, EXT_PY, LANG_CS, LANG_PY, EXT_ASPX, EXT_HTML, LANG_TEM, LANG_HTML
from capa.features.extractors.ts.query import TS_LANGUAGES


def is_script(buf: bytes) -> bool:
    try:
        return bool(get_language_ts(buf))
    except ValueError:
        return False


def _parse(ts_language: Language, buf: bytes) -> Optional[Tree]:
    try:
        parser = Parser()
        parser.set_language(ts_language)
        return parser.parse(buf)
    except ValueError:
        return None


def _contains_errors(ts_language, node: Node) -> bool:
    return ts_language.query("(ERROR) @error").captures(node)


def get_language_ts(buf: bytes) -> str:
    for language, ts_language in TS_LANGUAGES.items():
        tree = _parse(ts_language, buf)
        if tree and not _contains_errors(ts_language, tree.root_node):
            return language
    raise ValueError("failed to parse the language")


def get_template_language_ts(buf: bytes) -> str:
    for language, ts_language in TS_LANGUAGES.items():
        if language in [LANG_TEM, LANG_HTML]:
            continue
        tree = _parse(ts_language, buf)
        if tree and not _contains_errors(ts_language, tree.root_node):
            return language
    raise ValueError("failed to parse the language")


def get_language_from_ext(path: str) -> str:
    if path.endswith(EXT_ASPX):
        return LANG_TEM
    if path.endswith(EXT_CS):
        return LANG_CS
    if path.endswith(EXT_HTML):
        return LANG_HTML
    if path.endswith(EXT_PY):
        return LANG_PY
    raise ValueError(f"{path} has an unrecognized or an unsupported extension.")


def get_language(path: str) -> str:
    try:
        with open(path, "rb") as f:
            buf = f.read()
        return get_language_ts(buf)
    except ValueError:
        return get_language_from_ext(path)
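One thing worth noting about the extension fallback: `str.endswith` with a tuple such as `("cs", "cs_")` matches any path whose last characters are "cs", including an extensionless name like "docs". A possible alternative, sketched below under the assumption that matching on the real file suffix is what is intended, compares the suffix reported by `pathlib`. `EXT_TO_LANG` and `language_from_suffix` are hypothetical names, not part of the PR.

```python
from pathlib import PurePath

# Hypothetical lookup table built from the EXT_* and LANG_* values in script.py.
EXT_TO_LANG = {
    "aspx": "embedded_template",
    "aspx_": "embedded_template",
    "cs": "c_sharp",
    "cs_": "c_sharp",
    "html": "html",
    "html_": "html",
    "py": "python",
    "py_": "python",
}


def language_from_suffix(path: str) -> str:
    """Map the text after the final dot to a language name."""
    suffix = PurePath(path).suffix.lstrip(".")
    try:
        return EXT_TO_LANG[suffix]
    except KeyError:
        raise ValueError(f"{path} has an unrecognized or unsupported extension") from None


# The tuple form of endswith accepts an extensionless path that merely ends in "cs".
suffix_free_match = "samples/docs".endswith(("cs", "cs_"))
```

With this sketch, `language_from_suffix("webshell.aspx_")` resolves to the embedded-template language, while `"samples/docs"` is rejected with a `ValueError` instead of being treated as C#.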
15 changes: 15 additions & 0 deletions capa/features/extractors/ts/build.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
from tree_sitter import Language

build_dir = "build/my-languages.so"
Collaborator

does this mean we only support Linux?

Author

Tree-sitter needs to compile its (C) language bindings. Although I have limited knowledge of package management, I've suggested to Moritz that we precompile and package the supported tree-sitter bindings for each platform we support. The current state is a temporary measure.

languages = [
    "vendor/tree-sitter-c-sharp",
    "vendor/tree-sitter-embedded-template",
    "vendor/tree-sitter-html",
    "vendor/tree-sitter-javascript",
    "vendor/tree-sitter-python",
]


class TSBuilder:
    def __init__(self):
        Language.build_library(build_dir, languages)
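If the bindings were precompiled per platform, as suggested in the reply above, the loader would need some way to pick the right artifact at runtime. A rough sketch of one way to do that follows. The naming scheme, the suffix table, and `prebuilt_library_name` are all hypothetical; the PR does not define how prebuilt libraries would be named or shipped.

```python
import platform
from typing import Optional

# Hypothetical: conventional shared-library suffix per operating system.
_SUFFIXES = {"linux": ".so", "darwin": ".dylib", "windows": ".dll"}


def prebuilt_library_name(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Build a per-platform file name for a precompiled grammar library.

    Falls back to the running interpreter's platform when no values are given.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    if system not in _SUFFIXES:
        raise ValueError(f"no prebuilt tree-sitter library for {system!r}")
    return f"my-languages-{system}-{machine}{_SUFFIXES[system]}"
```

For example, `prebuilt_library_name("Linux", "x86_64")` gives `my-languages-linux-x86_64.so`. Raising on an unknown platform keeps the failure explicit, instead of silently loading a library built for a different system.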