Revamp magic identification for significant speed improvements (#492)
Revamp magic identification for significant speed improvements

1. `File` now inherits from `GenericBinary` to avoid duplicative Identifier runs
2. Update auto-run component logic to run all Analyzers, not just the most specific ones
3. Refactor Magic identification
4. Update registered identifiers to make use of new `MagicIdentifier`
whyitfor authored Jan 8, 2025
1 parent ebc8606 commit 8e15fc5
Showing 41 changed files with 372 additions and 256 deletions.
45 changes: 39 additions & 6 deletions docs/contributor-guide/component/identifier.md
@@ -1,17 +1,50 @@
# Registering Identifier Patterns
When [writing unpackers](./unpacker.md), OFRAK Contributors can leverage the `MagicMimeIdentifier` and `MagicDescriptionIdentifier` by registering mappings between resource tags and mime or description patterns. Doing so will ensure that `Resource.unpack` automatically calls their custom unpacker.
# Adding Identifiers
OFRAK Contributors and Users can extend the tool's identification capability in one of two ways:

For example, consider the following magic description identification registration in the file containing a `UImageUnpacker`:
1. Extend the [MagicIdentifier][ofrak.core.magic.MagicIdentifier] by registering a new magic pattern match
2. Implement a new [Identifier][ofrak.component.identifier.Identifier]

## Extend the MagicIdentifier
First, consider extending the magic identifier by registering a new magic pattern match.
The [MagicIdentifier][ofrak.core.magic.MagicIdentifier] uses three pattern matchers:

- [MagicMimePattern][ofrak.core.magic.MagicMimePattern] allows users to register matches to magic's mime output
- [MagicDescriptionPattern][ofrak.core.magic.MagicDescriptionPattern] allows users to create matching functions that run on the magic description output
- [RawMagicPattern][ofrak.core.magic.RawMagicPattern] allows users to create custom raw byte matching patterns against a resource's binary data

Combining these pattern matching strategies can provide expanded identification coverage, particularly when libmagic's output contains false negatives.
For example, all three patterns are used to identify `DeviceTreeBlob`:

```python
MagicDescriptionIdentifier.register(UImage, lambda s: s.startswith("u-boot legacy uImage"))
MagicMimePattern.register(DeviceTreeBlob, "Device Tree Blob")
MagicDescriptionPattern.register(DeviceTreeBlob, lambda s: "device tree blob" in s.lower())


def match_dtb_magic(data: bytes):
    if len(data) < 4:
        return False
    return data[:4] == DTB_MAGIC_BYTES


RawMagicPattern.register(DeviceTreeBlob, match_dtb_magic)
```

This line ensures that the `MagicDescriptionIdentifier` adds a `UImage` tag to resources matching that description pattern. As a result, any unpackers targeting a `UImage` will automatically run when `Resource.unpack` is run.
These patterns (along with all other identifier patterns) will get run when the [MagicIdentifier][ofrak.core.magic.MagicIdentifier] runs, adding a `DeviceTreeBlob` tag to matching resources.
See the docstrings for each pattern for implementation details.
Generally speaking, it makes sense to start with a magic mime or magic description pattern, implementing a raw magic pattern only when necessary.
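
The way the three pattern kinds complement each other can be sketched without OFRAK itself. The following is a minimal stand-in, not the `MagicIdentifier` implementation: the registries `mime_patterns`, `description_patterns`, and `raw_patterns`, the function `identify`, and the use of plain strings as tags are all hypothetical names chosen for illustration. It assumes a mime pattern is an exact-match lookup, while description and raw patterns are predicate functions.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in registries; OFRAK's real pattern classes live in ofrak.core.magic.
mime_patterns: Dict[str, str] = {}
description_patterns: List[Tuple[str, Callable[[str], bool]]] = []
raw_patterns: List[Tuple[str, Callable[[bytes], bool]]] = []


def identify(mime: str, description: str, data: bytes) -> Set[str]:
    """Collect every tag whose mime, description, or raw byte pattern matches."""
    tags: Set[str] = set()
    tag = mime_patterns.get(mime)  # exact match on the mime string
    if tag is not None:
        tags.add(tag)
    tags.update(t for t, matches in description_patterns if matches(description))
    tags.update(t for t, matches in raw_patterns if matches(data))
    return tags


# Register the same three kinds of pattern the DeviceTreeBlob example uses.
mime_patterns["Device Tree Blob"] = "DeviceTreeBlob"
description_patterns.append(("DeviceTreeBlob", lambda s: "device tree blob" in s.lower()))
raw_patterns.append(("DeviceTreeBlob", lambda d: d[:4] == b"\xd0\x0d\xfe\xed"))
```

In this sketch, a blob that libmagic reports only as generic data is still tagged because the raw byte pattern matches, which is the false-negative case the raw pattern is meant to cover.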

## Implement a New Identifier
Additionally, it is possible to implement a new [Identifier][ofrak.component.identifier.Identifier].
Doing so should be reserved for situations where extending the [MagicIdentifier][ofrak.core.magic.MagicIdentifier] is impractical.
The [ApkIdentifier][ofrak.core.apk.ApkIdentifier] is an example of a custom identifier implementation.

### Handling External Dependencies
!!! warning
Adding new identifiers should be done with care to minimize overall performance impact to OFRAK workflows.
Try to carefully select the resource tags the identifier targets to minimize the frequency with which
it is run: generally speaking, targeting `GenericBinary` will result in this identifier getting run on the largest
number of possible resources. `ApkIdentifier` targets `JavaArchive` and `ZipArchive` only for this reason.

### Handling External Dependencies
If the Identifier makes use of tools that are not packaged in modules installable via `pip` from
PyPI (commonly command-line tools), these dependencies must be explicitly declared as part of the
identifier's class declaration. See the [Components Using External Tools](./external_tools.md) doc
31 changes: 1 addition & 30 deletions docs/user-guide/key-concepts/component/identifier.md
@@ -2,37 +2,8 @@
## Overview
Identifiers are components that tag resources with specific resource tags.

The following is an example of the `MagicMimeIdentifier`, which uses libmagic file type identification to tag resources:
```python

class MagicMimeIdentifier(Identifier[None]):
    id = b"MagicMimeIdentifier"
    targets = (File,)
    _tags_by_mime: Dict[str, ResourceTag] = dict()

    async def identify(self, resource: Resource, config=None):
        _magic = await resource.analyze(Magic)
        magic_mime = _magic.mime
        tag = MagicMimeIdentifier._tags_by_mime.get(magic_mime)
        if tag is not None:
            resource.add_tag(tag)
    @classmethod
    def register(cls, resource: ResourceTag, mime_types: Union[Iterable[str], str]):
        if isinstance(mime_types, str):
            mime_types = [mime_types]
        for mime_type in mime_types:
            if mime_type in cls._tags_by_mime:
                raise AlreadyExistError(f"Registering already-registered mime type: {mime_type}")
            cls._tags_by_mime[mime_type] = resource


...

MagicMimeIdentifier.register(GenericText, "text/plain")

```
The most ubiquitous identifier is the [MagicIdentifier][ofrak.core.magic.MagicIdentifier].

The last line of the example, `MagicMimeIdentifier.register(GenericText, "text/plain")`, registers the "text/plain" pattern as one that maps to the `GenericText` resource tag.

## Usage
Identifiers can be explicitly run using the `Resource.identify` method:
12 changes: 12 additions & 0 deletions ofrak_core/CHANGELOG.md
@@ -17,7 +17,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Add generic DecompilationAnalysis classes. ([#453](https://github.com/redballoonsecurity/ofrak/pull/453))
- `PatchFromSourceModifier` bundles src and header files into same temporary directory with BOM and FEM ([#517](https://github.com/redballoonsecurity/ofrak/pull/517))
- Add support for running on Windows to the `Filesystem` component. ([#521](https://github.com/redballoonsecurity/ofrak/pull/521))
- Add `JavaArchive` resource tag ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))
- Add new method for allocating `.bss` sections using free space ranges that aren't mapped to data ranges. ([#505](https://github.com/redballoonsecurity/ofrak/pull/505))
- Add `JavaArchive` resource tag ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))

### Fixed
- Improved flushing of filesystem entries (including symbolic links and other types) to disk. ([#373](https://github.com/redballoonsecurity/ofrak/pull/373))
@@ -37,8 +39,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Fix bugs on Windows arising from using `os.path` methods when only forward-slashes are acceptable ([#521](https://github.com/redballoonsecurity/ofrak/pull/521))
- Made some changes to OFRAK test suite to improve test coverage on Windows ([#487](https://github.com/redballoonsecurity/ofrak/pull/487))
- Fix usage of `NamedTemporaryFile` with external tools on Windows ([#486](https://github.com/redballoonsecurity/ofrak/pull/486))
- Fixed endianness issue in DTB raw byte identifier ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))
- Fix unintentional ignoring of cpio errors introduced in [#486](https://github.com/redballoonsecurity/ofrak/pull/486) ([#555](https://github.com/redballoonsecurity/ofrak/pull/555))
- `Data` resource attribute always corresponds to value of `Resource.get_data_range_within_root` ([#559](https://github.com/redballoonsecurity/ofrak/pull/559))
- Fixed endianness issue in DTB raw byte identifier ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))

### Changed
- By default, the ofrak log is now `ofrak-YYYYMMDDhhmmss.log` rather than just `ofrak.log` and the name can be specified on the command line ([#480](https://github.com/redballoonsecurity/ofrak/pull/480))
@@ -50,6 +54,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Minor update to OFRAK Community License, add OFRAK Pro License ([#478](https://github.com/redballoonsecurity/ofrak/pull/478))
- Update python to 3.9 as main version used and tested (including in default docker image build) ([#502](https://github.com/redballoonsecurity/ofrak/pull/502))
- Update OpenJDK to version 17, remove unused qemu package ([#502](https://github.com/redballoonsecurity/ofrak/pull/502))
- Update resource tag File to inherit from GenericBinary ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))
- Update auto-run component logic to run all Analyzers, not just the most specific ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))
- Revamp magic identification for significant speed improvements ([#492](https://github.com/redballoonsecurity/ofrak/pull/492))
- Refactor magic identification to use one identifier, named `MagicIdentifier`
- Rename `MagicMimeIdentifier` to `MagicMimePattern`, as it is run by `MagicIdentifier`
- Rename `MagicDescriptionIdentifier` to `MagicDescriptionPattern`, as it is run by `MagicIdentifier`
- Add `RawMagicPattern` to efficiently run custom magic byte search logic within `MagicIdentifier`
- Update registered identifiers to make use of new `MagicIdentifier` for following resource tags: `Apk`, `Bzip2Data`, `CpioFilesystem`, `DeviceTreeBlob`, `Elf`, `Ext2Filesystem`, `Ext3Filesystem`, `Ext4Filesystem`, `GzipData`, `ISO9660Image`, `Jffs2Filesystem`, `LzmaData`, `XzData`, `LzoData`, `OpenWrtTrx`, `Pe`, `RarArchive`, `SevenZFilesystem`, `SquashfsFilesystem`, `TarArchive`, `Ubi`, `Ubifs`, `Uf2File`, `UImage`, `ZipArchive`, `ZlibData`, `ZstdData`

### Security
- Update aiohttp to 3.10.11 ([#522](https://github.com/redballoonsecurity/ofrak/pull/522))
3 changes: 3 additions & 0 deletions ofrak_core/ofrak/core/__init__.py
@@ -36,6 +36,9 @@
from ofrak.core.injector import *
from ofrak.core.instruction import *
from ofrak.core.iso9660 import *

# Why JavaArchive only? See https://github.com/redballoonsecurity/ofrak/pull/492/files#r1905582276
from ofrak.core.java import JavaArchive
from ofrak.core.label import *
from ofrak.core.lzma import *
from ofrak.core.lzo import *
67 changes: 34 additions & 33 deletions ofrak_core/ofrak/core/apk.py
@@ -6,22 +6,17 @@
from subprocess import CalledProcessError
from dataclasses import dataclass

from ofrak.core.filesystem import File, Folder

from ofrak.component.identifier import Identifier
from ofrak.component.packer import Packer

from ofrak.resource import Resource

from ofrak.component.unpacker import Unpacker
from ofrak.component.identifier import Identifier

from ofrak.model.component_model import ComponentConfig, ComponentExternalTool
from ofrak.core.filesystem import File, Folder
from ofrak.core.java import JavaArchive
from ofrak.core.magic import MagicMimePattern
from ofrak.core.zip import ZipArchive, UNZIP_TOOL
from ofrak.core.binary import GenericBinary
from ofrak.core.magic import Magic, MagicMimeIdentifier
from ofrak.model.component_model import ComponentConfig, ComponentExternalTool
from ofrak.resource import Resource
from ofrak_type.range import Range


APKTOOL = ComponentExternalTool("apktool", "https://ibotpeaches.github.io/Apktool/", "-version")
JAVA = ComponentExternalTool(
    "java",
@@ -206,30 +201,36 @@ async def pack(
        resource.queue_patch(Range(0, await resource.get_data_length()), new_data)


MagicMimePattern.register(Apk, "application/vnd.android.package-archive")


class ApkIdentifier(Identifier):
    targets = (File, GenericBinary)
    """
    Identifier for ApkArchive.
    Some Apks are recognized by the MagicMimePattern; others are tagged as JavaArchive or
    ZipArchive. This identifier inspects those files, and tags any with an androidmanifest.xml
    as an ApkArchive.
    """

    targets = (JavaArchive, ZipArchive)
    external_dependencies = (UNZIP_TOOL,)

    async def identify(self, resource: Resource, config=None) -> None:
        await resource.run(MagicMimeIdentifier)
        magic = resource.get_attributes(Magic)
        if magic.mime == "application/vnd.android.package-archive":
            resource.add_tag(Apk)
        elif magic is not None and magic.mime in ["application/java-archive", "application/zip"]:
            async with resource.temp_to_disk(suffix=".zip") as temp_path:
                unzip_cmd = [
                    "unzip",
                    "-l",
                    temp_path,
                ]
                unzip_proc = await asyncio.create_subprocess_exec(
                    *unzip_cmd,
                    stdout=asyncio.subprocess.PIPE,
                    stderr=asyncio.subprocess.PIPE,
                )
                stdout, stderr = await unzip_proc.communicate()
                if unzip_proc.returncode:
                    raise CalledProcessError(returncode=unzip_proc.returncode, cmd=unzip_cmd)
        async with resource.temp_to_disk(suffix=".zip") as temp_path:
            unzip_cmd = [
                "unzip",
                "-l",
                temp_path,
            ]
            unzip_proc = await asyncio.create_subprocess_exec(
                *unzip_cmd,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, stderr = await unzip_proc.communicate()
            if unzip_proc.returncode:
                raise CalledProcessError(returncode=unzip_proc.returncode, cmd=unzip_cmd)

                if b"androidmanifest.xml" in stdout.lower():
                    resource.add_tag(Apk)
            if b"androidmanifest.xml" in stdout.lower():
                resource.add_tag(Apk)
3 changes: 1 addition & 2 deletions ofrak_core/ofrak/core/binwalk.py
@@ -18,7 +18,6 @@
BINWALK_INSTALLED = False

from ofrak.core.binary import GenericBinary
from ofrak.core.filesystem import File
from ofrak.model.component_model import ComponentExternalTool
from ofrak.service.data_service_i import DataServiceInterface
from ofrak.service.resource_service_i import ResourceServiceInterface
@@ -45,7 +44,7 @@ class BinwalkAttributes(ResourceAttributes):


class BinwalkAnalyzer(Analyzer[None, BinwalkAttributes]):
    targets = (GenericBinary, File)
    targets = (GenericBinary,)
    outputs = (BinwalkAttributes,)
    external_dependencies = (BINWALK_TOOL,)

6 changes: 3 additions & 3 deletions ofrak_core/ofrak/core/bzip2.py
@@ -6,7 +6,7 @@
from ofrak.component.unpacker import Unpacker
from ofrak.resource import Resource
from ofrak.core.binary import GenericBinary
from ofrak.core.magic import MagicDescriptionIdentifier, MagicMimeIdentifier
from ofrak.core.magic import MagicDescriptionPattern, MagicMimePattern
from ofrak_type.range import Range

LOGGER = logging.getLogger(__name__)
@@ -64,5 +64,5 @@ async def pack(self, resource: Resource, config=None):
        resource.queue_patch(Range(0, original_size), bzip2_compressed)


MagicMimeIdentifier.register(Bzip2Data, "application/x-bzip2")
MagicDescriptionIdentifier.register(Bzip2Data, lambda s: s.startswith("BZip2 archive"))
MagicMimePattern.register(Bzip2Data, "application/x-bzip2")
MagicDescriptionPattern.register(Bzip2Data, lambda s: s.startswith("BZip2 archive"))
5 changes: 2 additions & 3 deletions ofrak_core/ofrak/core/checksum.py
@@ -2,7 +2,6 @@
from dataclasses import dataclass

from ofrak.core.binary import GenericBinary
from ofrak.core.filesystem import File

from ofrak.component.analyzer import Analyzer
from ofrak.model.resource_model import ResourceAttributes
@@ -19,7 +18,7 @@ class Sha256Analyzer(Analyzer[None, Sha256Attributes]):
    Analyze binary data and add attributes with the SHA256 checksum of the data.
    """

    targets = (File, GenericBinary)
    targets = (GenericBinary,)
    outputs = (Sha256Attributes,)

    async def analyze(self, resource: Resource, config=None) -> Sha256Attributes:
@@ -39,7 +38,7 @@ class Md5Analyzer(Analyzer[None, Md5Attributes]):
    Analyze binary data and add attributes with the MD5 checksum of the data.
    """

    targets = (File, GenericBinary)
    targets = (GenericBinary,)
    outputs = (Md5Attributes,)

    async def analyze(self, resource: Resource, config=None) -> Md5Attributes:
6 changes: 3 additions & 3 deletions ofrak_core/ofrak/core/cpio.py
@@ -10,7 +10,7 @@
from ofrak.component.unpacker import Unpacker
from ofrak.core.binary import GenericBinary
from ofrak.core.filesystem import File, Folder, FilesystemRoot, SpecialFileType
from ofrak.core.magic import MagicMimeIdentifier, MagicDescriptionIdentifier, Magic
from ofrak.core.magic import MagicMimePattern, MagicDescriptionPattern, Magic
from ofrak.model.component_model import ComponentExternalTool
from ofrak.resource import Resource
from ofrak_type.range import Range
@@ -150,5 +150,5 @@ async def pack(self, resource: Resource, config=None):
        resource.queue_patch(Range(0, await resource.get_data_length()), cpio_pack_output)


MagicMimeIdentifier.register(CpioFilesystem, "application/x-cpio")
MagicDescriptionIdentifier.register(CpioFilesystem, lambda s: "cpio archive" in s)
MagicMimePattern.register(CpioFilesystem, "application/x-cpio")
MagicDescriptionPattern.register(CpioFilesystem, lambda s: "cpio archive" in s)
38 changes: 18 additions & 20 deletions ofrak_core/ofrak/core/dtb.py
@@ -12,18 +12,23 @@
import fdt

from ofrak.component.analyzer import Analyzer
from ofrak.component.identifier import Identifier
from ofrak.component.packer import Packer
from ofrak.component.unpacker import Unpacker
from ofrak.model.viewable_tag_model import AttributesType
from ofrak.resource import Resource
from ofrak.service.resource_service_i import ResourceFilter, ResourceSort
from ofrak.core import GenericBinary, MagicMimeIdentifier, MagicDescriptionIdentifier
from ofrak.core import GenericBinary
from ofrak.core.magic import (
    MagicMimePattern,
    MagicDescriptionPattern,
    RawMagicPattern,
)
from ofrak.model.component_model import ComponentConfig
from ofrak.model.resource_model import index
from ofrak_type.range import Range

DTB_MAGIC_SIGNATURE: int = 0xD00DFEED
DTB_MAGIC_BYTES = struct.pack(">I", DTB_MAGIC_SIGNATURE)


@dataclass
@@ -332,22 +337,6 @@ async def pack(self, resource: Resource, config: ComponentConfig = None):
        resource.queue_patch(Range(0, original_size), dtb.to_dtb())


class DeviceTreeBlobIdentifier(Identifier[None]):
    """
    Identify Device Tree Blob files.
    """

    targets = (GenericBinary,)

    async def identify(self, resource: Resource, config: ComponentConfig = None) -> None:
        """
        Identify DTB files based on the first four bytes being "d00dfeed".
        """
        data = await resource.get_data(Range(0, 4))
        if data == struct.pack("<I", DTB_MAGIC_SIGNATURE):
            resource.add_tag(DeviceTreeBlob)


async def _prop_to_fdt(p: DtbProperty) -> fdt.items.Property:
    """
    Generates an fdt.items.property corresponding to a DtbProperty.
@@ -402,5 +391,14 @@ def _prop_from_fdt(p: fdt.items.Property) -> Tuple[DtbPropertyType, bytes]:
    return _p_type, _p_data


MagicMimeIdentifier.register(DeviceTreeBlob, "Device Tree Blob")
MagicDescriptionIdentifier.register(DeviceTreeBlob, lambda s: "device tree blob" in s.lower())
MagicMimePattern.register(DeviceTreeBlob, "Device Tree Blob")
MagicDescriptionPattern.register(DeviceTreeBlob, lambda s: "device tree blob" in s.lower())


def match_dtb_magic(data: bytes):
    if len(data) < 4:
        return False
    return data[:4] == DTB_MAGIC_BYTES


RawMagicPattern.register(DeviceTreeBlob, match_dtb_magic)
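
The endianness fix in this file comes down to the difference between `struct.pack("<I", ...)` in the removed identifier and `struct.pack(">I", ...)` in the new `DTB_MAGIC_BYTES`. A flattened device tree stores its header fields big-endian, so a DTB on disk begins with the bytes `d0 0d fe ed`. The standalone sketch below shows the two encodings side by side; the names `BIG_ENDIAN_MAGIC`, `LITTLE_ENDIAN_MAGIC`, and `starts_with_dtb_magic` are illustrative and do not appear in the commit.

```python
import struct

DTB_MAGIC_SIGNATURE = 0xD00DFEED

# ">I" packs an unsigned 32-bit integer big-endian: most significant byte first.
BIG_ENDIAN_MAGIC = struct.pack(">I", DTB_MAGIC_SIGNATURE)
# "<I" packs the same integer little-endian, which reverses the byte order.
LITTLE_ENDIAN_MAGIC = struct.pack("<I", DTB_MAGIC_SIGNATURE)


def starts_with_dtb_magic(data: bytes) -> bool:
    """Return True if data begins with the big-endian DTB magic bytes."""
    return data[:4] == BIG_ENDIAN_MAGIC


print(BIG_ENDIAN_MAGIC.hex())     # d00dfeed
print(LITTLE_ENDIAN_MAGIC.hex())  # edfe0dd0
```

A comparison against the little-endian encoding would only match data beginning with `ed fe 0d d0`, so it would miss a file that starts with the standard big-endian magic.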
4 changes: 2 additions & 2 deletions ofrak_core/ofrak/core/elf/model.py
@@ -17,7 +17,7 @@
    ResourceSortDirection,
    ResourceSort,
)
from ofrak.core.magic import MagicDescriptionIdentifier
from ofrak.core.magic import MagicDescriptionPattern
from ofrak_type.bit_width import BitWidth
from ofrak_type.endianness import Endianness
from ofrak_type.memory_permissions import MemoryPermissions
@@ -869,4 +869,4 @@ async def get_program_header_by_index(self, index: int) -> ElfProgramHeader:
)


MagicDescriptionIdentifier.register(Elf, lambda s: s.startswith("ELF "))
MagicDescriptionPattern.register(Elf, lambda s: s.startswith("ELF "))