rossj-en edited this page Aug 24, 2023 · 13 revisions

Introduction

Threat Report ATT&CK Mapper (TRAM) is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®.

TRAM enables researchers to test and refine Machine Learning (ML) models for identifying ATT&CK techniques in prose-based cyber threat intelligence (CTI) reports and allows CTI analysts to train ML models and validate the results.

Through research into automating the mapping of cyber threat intel reports to ATT&CK, TRAM aims to reduce cost and increase the effectiveness of integrating ATT&CK across the CTI community. Threat intel providers, threat intel platforms, and analysts can use TRAM to integrate ATT&CK more easily and consistently into their products.

The purpose of this Wiki is to describe the Center for Threat-Informed Defense research, provide reference information on the data annotation process and ML fine-tuning activities, and enable users to recreate the experiments and further the research.

Background

Mapping TTPs found in CTI reports to MITRE ATT&CK is difficult, error-prone, and time-consuming. Our goals for improving the text classification models used in TRAM fall into three areas: streamlining the generation of customized training sets through data annotation, providing high-quality training data, and incorporating the best-performing Large Language Model (LLM) into TRAM.

  • Data annotation: Recommended annotation tool features and best practices guide
  • High-quality model training data: Annotated 150 reports containing 4,070 technique-labeled sentences out of 19,011 total samples
  • TRAM tool updates: Integrated a new prediction model based on SciBERT

Getting started

See the README on center-for-threat-informed-defense/TRAM for instructions on pulling the Docker images, installing offline, and following the developer build process.

The center-for-threat-informed-defense/TRAM user_notebooks section has Jupyter notebooks for the SciBERT-based single-label model and multi-label model. There are supplemental notebooks tailored to further fine-tune each model with additional data.

Research

Previous versions of TRAM did not consider LLMs and solely relied on supervised learning classification methods such as logistic regression for prediction. We wanted to explore the advantages of incorporating LLMs to see how they could improve automated identification of adversary TTPs in threat intelligence reports.

The logistic regression model was trained against all existing ATT&CK techniques in the dataset and relied on 1) having a procedure example of each ATT&CK technique, and 2) having enough procedure examples of each to differentiate them. By narrowing the scope of ATT&CK IDs to identify, we could achieve better results in predicting adversary TTPs, which led us to select 50 ATT&CK techniques to train models. The selection of 50 techniques came from:

  • The most commonly found techniques in the embedded dataset
  • The most commonly discovered techniques from the Sightings Report
  • The most common techniques as defined by Actionability, Choke Point, and Prevalence from Top ATT&CK Techniques

Table of 50 ATT&CK Techniques

T1548.002 Abuse Elevation Control Mechanism: Bypass User Account Control | T1484.001 Domain Policy Modification: Group Policy Modification | T1070.004 Indicator Removal: File Deletion | T1566.001 Phishing: Spearphishing Attachment | T1518.001 Software Discovery: Security Software Discovery
T1557.001 Adversary-in-the-Middle: LLMNR/NBT-NS Poisoning and SMB Relay | T1573.001 Encrypted Channel: Symmetric Cryptography | T1105 Ingress Tool Transfer | T1057 Process Discovery | T1218.011 System Binary Proxy Execution: Rundll32
T1071.001 Application Layer Protocol: Web Protocols | T1041 Exfiltration Over C2 Channel | T1056.001 Input Capture: Keylogging | T1055 Process Injection | T1082 System Information Discovery
T1547.001 Boot or Logon Autostart Execution: Registry Run Keys / Startup Folder | T1190 Exploit Public-Facing Application | T1570 Lateral Tool Transfer | T1090 Proxy | T1016 System Network Configuration Discovery
T1110 Brute Force | T1068 Exploitation for Privilege Escalation | T1036.005 Masquerading: Match Legitimate Name or Location | T1012 Query Registry | T1033 System Owner/User Discovery
T1059.003 Command and Scripting Interpreter: Windows Command Shell | T1210 Exploitation of Remote Services | T1112 Modify Registry | T1219 Remote Access Software | T1569.002 System Services: Service Execution
T1543.003 Create or Modify System Process: Windows Service | T1083 File and Directory Discovery | T1106 Native API | T1021.001 Remote Services: Remote Desktop Protocol | T1552.001 Unsecured Credentials: Credentials In Files
T1074.001 Data Staged: Local Data Staging | T1564.001 Hide Artifacts: Hidden Files and Directories | T1095 Non-Application Layer Protocol | T1053.005 Scheduled Task/Job: Scheduled Task | T1204.002 User Execution: Malicious File
T1005 Data from Local System | T1574.002 Hijack Execution Flow: DLL Side-Loading | T1003.001 OS Credential Dumping: LSASS Memory | T1113 Screen Capture | T1078 Valid Accounts
T1140 Deobfuscate/Decode Files or Information | T1562.001 Impair Defenses: Disable or Modify Tools | T1027 Obfuscated Files or Information | T1072 Software Deployment Tools | T1047 Windows Management Instrumentation

Large Language Model Architectures

Large language models benefit from being pre-trained on vast amounts of data. Because these models arrive pre-trained, they require much less domain-specific training data to perform other tasks. A main benefit of using LLMs is the ability to make predictions on text that was not included in the training data. This means that LLM-based models are more robust to unseen words and are capable of perceiving subtle relationships between words that are indicative of an ATT&CK technique.

We considered three LLMs across two architectures: BERT and GPT-2. While other models and LLM architectures exist, these three are open access, have appropriate licenses for our use case, and are associated with reputable labs. We considered two BERT models: the original BERT model, which was released along with the paper that first described BERT, and SciBERT, a variation trained on scientific literature. BERT is designed to predict hidden words in text, while GPT-2 is designed to generate text, producing sequences of words by considering which next word would make sense given the words it has already produced. In both cases, the LLMs are intended for use cases other than text classification but can be adapted to it during fine-tuning.

To confirm our prediction that LLMs could perform better, we needed a way to analyze and compare results. Precision, recall, and F1 score are common metrics for comparing the performance of models. Precision penalizes false positives (a score of 1 indicates no false positives), and recall penalizes false negatives (a score of 1 indicates no false negatives). F1 is the harmonic mean of precision and recall, which means that instead of being halfway between precision and recall (as you would get from summing the two and dividing by two), the F1 score is skewed towards the lower of the two scores.
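The definitions above can be illustrated with a minimal sketch. The function name `precision_recall_f1` and the counts used here are hypothetical and chosen only for illustration; they are not taken from the TRAM code base or its results.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for one technique.

    tp: true positives, fp: false positives, fn: false negatives.
    A metric is reported as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: few false positives, many false negatives.
p, r, f1 = precision_recall_f1(tp=9, fp=1, fn=9)
arithmetic_mean = (p + r) / 2

print(f"precision={p:.3f} recall={r:.3f} "
      f"F1={f1:.3f} arithmetic mean={arithmetic_mean:.3f}")
```

With these counts, precision is high and recall is low, and the F1 score lands below the arithmetic mean of the two, closer to the lower score.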

Each of these three metrics is calculated for each individual ATT&CK technique. To describe the aggregate performance of the whole model, we can take the micro or macro average. The micro average treats each instance the same: precision, recall, and F1 are calculated from the true positives, false positives, and false negatives summed across every technique. The macro average treats each technique the same (even if it appears more or less often than other techniques): we take the per-technique precision, recall, and F1 scores that are already calculated and average each one.
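The difference between the two averages is easiest to see with one frequent and one rare technique. The sketch below is illustrative only: the per-technique counts are made up, and the helper names (`prf`, `micro_average`, `macro_average`) are hypothetical rather than part of TRAM.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 on empty denominators)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_average(counts):
    """Sum the counts over all techniques, then compute the metrics once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts):
    """Compute the metrics per technique, then take the unweighted mean."""
    per_technique = [prf(**c) for c in counts.values()]
    n = len(per_technique)
    return tuple(sum(metric) / n for metric in zip(*per_technique))


# Made-up counts: one frequent technique and one rare technique.
counts = {
    "T1059.003": {"tp": 90, "fp": 10, "fn": 10},
    "T1113": {"tp": 1, "fp": 1, "fn": 3},
}

micro = micro_average(counts)
macro = macro_average(counts)
print("micro (precision, recall, F1):", micro)
print("macro (precision, recall, F1):", macro)
```

In this example the micro average is dominated by the frequent technique, while the macro average is pulled down by the poorly predicted rare technique.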

Typical metrics for Machine Learning performance. Source: “The Role of Machine Learning in Cybersecurity”, https://doi.org/10.1145/3545574

Results

To compare the performance of each model, all three (SciBERT, BERT, GPT-2) were trained for ten epochs to perform single-label classification on a dataset that combined the TRAM tool’s embedded training data with the CTI reports annotated in this effort.

Precision, Recall, and F1-score comparison between SciBERT and Logistic Regression models

The results show that SciBERT performs best during the first epoch and reaches peak performance more quickly than the other two models. This is likely because its vocabulary is more aligned with the vocabulary of our data and, by extension, with the kinds of documents to which the final model will be applied. As a result, we selected SciBERT as the best-performing LLM to integrate into TRAM.

Precision, Recall, and F1-score comparison between SciBERT and Logistic Regression models

The fine-tuned SciBERT model shows improvement over the logistic regression model in all but one of the areas where we measured precision, recall, and F1-score. For TRAM users, this means our new LLM identified the correct ATT&CK technique 88 out of 100 times and missed 12 techniques out of 100 samples. The F1 score indicates the balance between the precision and recall scores.

TRAM Tool

Follow the instructions in the README to pull the container images.

TRAM Jupyter Notebook

The LLM functionality has been built into a Jupyter notebook that can be run locally or hosted online through Google Colab. With Colab, you can import your own data and run our LLM training code on Google’s GPU-enabled systems. This alternative approach offers advanced users a step-by-step process for executing the code behind the text classifier. To use the notebook, follow the comments in each cell to download the model, set up the analysis parameters, and then upload a report. Machine learning engineers can customize the configuration to further refine the results.

The TRAM notebook divides uploaded reports into partially overlapping ngrams. An ngram is a sequence of n adjacent words. By extracting ngrams from each document, we can produce segments that may be more similar in construction to the segments the model was trained on than complete sentences are. The notebooks allow you to specify the value of n. Because the model was trained on segments of varying length, adjusting n may allow the model to make predictions that it wouldn’t make on longer or shorter segments.
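As a rough illustration of the segmentation described above, the following sketch splits text into overlapping ngrams. It is not the notebook’s implementation: the function name `overlapping_ngrams`, the whitespace tokenization, and the `stride` parameter are assumptions made for this example, and the notebook may tokenize and step through the text differently.

```python
def overlapping_ngrams(text, n, stride=1):
    """Split text into segments of n adjacent words.

    Assumptions for this sketch: words are separated by whitespace, and
    consecutive segments start `stride` words apart, so they overlap
    whenever stride < n. With stride > 1, trailing words that do not
    fill a whole segment are dropped.
    """
    if n <= 0 or stride <= 0:
        raise ValueError("n and stride must be positive")
    words = text.split()
    if not words:
        return []
    if len(words) <= n:
        # Text shorter than one ngram: return it as a single segment.
        return [" ".join(words)]
    return [
        " ".join(words[i:i + n])
        for i in range(0, len(words) - n + 1, stride)
    ]


# Hypothetical report sentence, used only for illustration.
sentence = "The malware creates a scheduled task for persistence"
segments = overlapping_ngrams(sentence, n=5)
for segment in segments:
    print(segment)
```

Each segment would then be passed to the classifier on its own. A smaller n yields more, shorter segments, and a larger n yields fewer, longer ones.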

Model retraining and fine-tuning

One of the most important features is that the models can be extended to ingest a larger, more comprehensive dataset, or even to use customized classes to identify organization-specific indicators of interest. Two additional Jupyter notebooks are available in the center-for-threat-informed-defense/TRAM user_notebooks section.
