Home
Threat Report ATT&CK Mapper (TRAM) is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®.
TRAM enables researchers to test and refine Machine Learning (ML) models for identifying ATT&CK techniques in prose-based cyber threat intelligence (CTI) reports and allows CTI analysts to train ML models and validate the results.
Through research into automating the mapping of cyber threat intel reports to ATT&CK, TRAM aims to reduce cost and increase the effectiveness of integrating ATT&CK across the CTI community. Threat intel providers, threat intel platforms, and analysts can use TRAM to integrate ATT&CK more easily and consistently into their products.
The purpose of this wiki is to describe the Center for Threat-Informed Defense research, provide reference information on the data annotation process and ML fine-tuning activities, and enable users to recreate the experiments and further the research.
Mapping TTPs found in CTI reports to MITRE ATT&CK is difficult, error prone, and time-consuming. This release is focused on improving the models that are used in TRAM to perform text classification. The goals of our project can be divided into three areas: having a streamlined approach to generating customized training data through expert annotation, providing high quality training data, and incorporating the best-performing LLM into TRAM.
- Data annotation: Recommended annotation tool features and best practices guide
- High-quality model training data: Annotated 150 reports containing 4,070 technique-labeled sentences out of 19,011 total samples
- TRAM tool updates: integrating a new prediction model based on SciBERT
See the README on center-for-threat-informed-defense/TRAM for instructions on pulling the Docker images, installing offline, and the developer build process.
The center-for-threat-informed-defense/TRAM user_notebooks section has Jupyter notebooks for the SciBERT-based single-label model and multi-label model. There are supplemental notebooks tailored to further fine-tune each model with additional data.
Previous versions of TRAM did not consider LLMs and solely relied on supervised learning classification methods such as logistic regression for prediction. We wanted to explore the advantages of incorporating LLMs to see how they could improve automated identification of adversary TTPs in threat intelligence reports.
The logistic regression model was trained against all existing ATT&CK techniques in the dataset and relied on 1) having a procedure example of each ATT&CK technique, and 2) having enough procedure examples of each to differentiate them. By narrowing the scope of ATT&CK IDs to identify, we could achieve better results in predicting adversary TTPs, which led us to select 50 ATT&CK techniques to train models. The selection of 50 techniques came from:
- The most commonly found techniques in the embedded dataset
- The most commonly discovered techniques from the Sightings Report
- The most common techniques as defined by Actionability, Choke Point, and Prevalence from Top ATT&CK Techniques
(Table of the 50 selected ATT&CK techniques)
Large language models benefit from being pre-trained on vast amounts of data. Because these models arrive pre-built, they require a much smaller amount of domain-specific training data to perform other tasks. A main benefit of using LLMs is the ability to make predictions on text not included in the training data. This means that LLM-based models are more robust to unseen words and are capable of perceiving subtle relationships between words that are indicative of an ATT&CK technique.
We considered three different LLMs across two architectures. While other models and LLM architectures exist, these three are open access, have appropriate licenses for our use case, and are associated with reputable labs. The two architectures considered were BERT and GPT-2. In both cases, the LLMs are intended for different use cases than text classification but can be adapted during fine-tuning. We considered two BERT models: the original BERT model released with the paper that first described BERT, and SciBERT, a variation trained on scientific literature. BERT is designed to predict hidden words in text, while GPT-2 is designed to generate text, producing sequences of words by considering which next word would make sense given the words it has already produced.
To confirm our prediction that LLMs could perform better, we needed a way to analyze and compare results. Precision, recall, and F1 score are common metrics for comparing the performance of models. Precision penalizes false positives (a score of 1 indicates no false positives), and recall penalizes false negatives (a score of 1 indicates no false negatives). F1 is the harmonic mean of precision and recall, which means that instead of sitting halfway between precision and recall (as a simple average would), the F1 score is skewed toward the lower of the two scores.
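As a quick illustration, here is a minimal sketch with made-up counts showing how the F1 score skews toward the lower of precision and recall:

```python
# Minimal sketch with hypothetical counts, illustrating how F1 skews
# toward the lower of precision and recall.
true_positives = 80
false_positives = 10
false_negatives = 40

precision = true_positives / (true_positives + false_positives)  # 0.889
recall = true_positives / (true_positives + false_negatives)     # 0.667
f1 = 2 * precision * recall / (precision + recall)               # 0.762, below the 0.778 simple average

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```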
Each of these three metrics is calculated for each individual ATT&CK technique. When talking about the aggregate performance of the whole model, we can take the micro or macro average. The micro average treats every instance the same: precision, recall, and F1 are calculated from the true positives, false positives, and false negatives pooled across every technique. The macro average treats every technique the same (regardless of how often it appears): the per-technique precision, recall, and F1 scores are averaged directly.
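The sketch below shows the difference between the two averages using scikit-learn; the labels are hypothetical examples, and TRAM's own evaluation code may compute these scores differently.

```python
# Sketch of micro vs. macro averaging with scikit-learn (illustrative only).
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical technique labels predicted for six text samples.
y_true = ["T1059", "T1059", "T1566", "T1566", "T1566", "T1003"]
y_pred = ["T1059", "T1566", "T1566", "T1566", "T1003", "T1003"]

# Micro average: pool TP/FP/FN across all techniques, then compute the metrics once.
micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
# Macro average: compute the metrics per technique, then average them equally.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)

print("micro (P, R, F1):", micro[:3])
print("macro (P, R, F1):", macro[:3])
```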
To compare the performance of each model, all three (SciBERT, BERT, GPT-2) were trained to perform single-label classification for ten epochs on a dataset that combined the TRAM tool's embedded training data with the CTI reports annotated in this effort.
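For orientation, the sketch below shows what this kind of single-label fine-tuning looks like with the Hugging Face Transformers library. It is an illustration only: the checkpoint name is the public SciBERT release, while the placeholder sentences, label indices, and batch size are assumptions rather than the project's exact training configuration.

```python
# Minimal single-label fine-tuning sketch (illustrative, not TRAM's exact code).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_TECHNIQUES = 50  # the 50 ATT&CK techniques selected for this effort

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_TECHNIQUES)

class SentenceDataset(torch.utils.data.Dataset):
    """Wraps (sentence, technique-index) pairs for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Placeholder samples; in practice this is the combined embedded + annotated dataset.
train_dataset = SentenceDataset(
    ["powershell.exe was used to execute an encoded command",
     "the actor exfiltrated archives over an encrypted channel"],
    [0, 1],
)

args = TrainingArguments(output_dir="scibert-tram", num_train_epochs=10,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```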
The results show SciBERT performs best during the first epoch and reaches peak performance more quickly than the other two. This is likely because its vocabulary is more aligned with the vocabulary of our data, and by extension, the kinds of documents on which the final model will be applied. As a result, we selected SciBERT as the best performing LLM architecture to integrate into TRAM.
The fine-tuned SciBERT model shows improvement over the logistic regression model in all but one of the areas where we measured precision, recall, and F1 score. For TRAM users, this means the new LLM identified the correct ATT&CK technique 88 times out of 100 and missed 12 techniques out of 100 samples. The F1 score indicates a balance between the precision and recall scores.
Follow the instructions in the README to pull the container images.
The LLM functionality has been built into a Jupyter notebook that can be run locally or hosted online through Google Colab. With Colab, you can import your own data and use Google's GPU-enabled environment to run our LLM training code. This alternative approach offers advanced users a step-by-step process for executing the code behind the text classifier. To use the notebook, follow the comments in each cell to download the model, set up the analysis parameters, and upload a report. Machine learning engineers can customize the configuration to further refine the results.
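Once the fine-tuned model has been downloaded, prediction on a single sentence looks roughly like the sketch below; the checkpoint path is a placeholder, and the notebook walks through these steps for you.

```python
# Minimal prediction sketch, assuming a fine-tuned SciBERT checkpoint saved locally.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "scibert-tram"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

sentence = "The malware established persistence via a scheduled task."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# id2label maps the predicted class index back to a technique label,
# assuming the mapping was saved with the checkpoint.
predicted_index = logits.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_index, predicted_index))
```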
The TRAM notebook divides uploaded reports into partially overlapping ngrams. An ngram is a sequence of n adjacent words. By extracting ngrams from each document, we can produce segments that are likely closer in construction to the segments the model was trained on than complete sentences are. The notebooks allow you to specify the value of n; because the model was trained on segments of varying length, adjusting this number may allow the model to make predictions that it would not make on longer or shorter segments.
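A simple way to picture this segmentation is the sketch below; the window size and step values are illustrative, and the notebook exposes its own parameters for controlling them.

```python
# Split a report into partially overlapping word windows (ngrams).
def overlapping_ngrams(text, n=13, step=5):
    words = text.split()
    segments = []
    # Slide a window of n words across the text, advancing by `step` words each time.
    for start in range(0, max(len(words) - n + 1, 1), step):
        segments.append(" ".join(words[start:start + n]))
    return segments

report = "The actor used spearphishing attachments to deliver a loader that installed a backdoor."
for segment in overlapping_ngrams(report, n=6, step=3):
    print(segment)
```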
One of the most important features is that the models can be extended to ingest a larger, more comprehensive dataset, or even use customized classes to identify organization-specific indicators of interest. Two additional Jupyter notebooks are available in the center-for-threat-informed-defense/TRAM user_notebooks section.