Twitter Anomaly Detector
AnomalyDetection is an R package developed by Twitter that detects anomalies in time-series data. The package implements their Seasonal Hybrid ESD (S-H-ESD) algorithm, which extends the generalized ESD test to handle seasonality in the data, i.e. recurring periodic patterns that represent macro-level changes rather than micro-level anomalies. Details of the algorithm and the open-sourced R code can be found here.
Since AnomalyDetection is written in R and NAB is written in Python, we had three options: port the R code to Python, use a bridge between the two languages such as rpy2, or use the R code for anomaly detection and the Python code for evaluating its results. We decided to go with the third option, following the guidelines of "Path 3" in this NAB figure. The task thus reduced to converting the NAB datasets into structures recognized by AnomalyDetection, and then converting the output of AnomalyDetection into the results format required by NAB.
As specified in the NAB whitepaper, datasets in NAB are CSV files with a "timestamp" column and a "value" column. The values are floats or integers, and the timestamps are strings of the form `YYYY-mm-dd HH:MM:SS.s` (in Python notation), or `%Y-%m-%d %H:%M:%OS` in R notation. R provides a `read.csv` function that will load NAB data into a dataframe that AnomalyDetection can use. Converting the timestamps in the CSV file to the appropriate datatype in R requires a bit of subtlety. With the name of the CSV file stored in `dataFile`:
```r
# Coerce the timestamp strings to POSIXct while reading, so the first
# column carries a date-time type rather than plain character data.
setClass('nabDate')
setAs("character", "nabDate",
      function(from) as.POSIXct(from, format="%Y-%m-%d %H:%M:%OS"))
nab_data <- read.csv(dataFile, colClasses=c("nabDate", "numeric"))
```
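A quick sanity check (our own addition, not required by the package) confirms that the coercion produced the column types the rest of the pipeline assumes:

```r
# The timestamp column should now be POSIXct and the value column numeric.
stopifnot(inherits(nab_data$timestamp, "POSIXct"), is.numeric(nab_data$value))
str(nab_data)
```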
Now `nab_data` can be passed in to the AnomalyDetection functions:
```r
res <- AnomalyDetectionTs(nab_data, max_anoms=0.0008, direction='both', plot=FALSE)
```
or
```r
res <- AnomalyDetectionVec(nab_data[,2], period=900, max_anoms=0.0008, direction='both', plot=FALSE)
```
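Either way, the flagged points come back under `res$anoms`: a dataframe keyed by timestamp for `AnomalyDetectionTs`, or by row index for `AnomalyDetectionVec`. For a quick look at what was flagged:

```r
# One row per point flagged as anomalous; empty if nothing was flagged.
nrow(res$anoms)
head(res$anoms)
```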
For some files*, the functions throw the error "Anom detection needs at least 2 periods worth of data". One source of this error is too coarse a granularity in the timestamps of the data. In this case, we treat the results as equivalent to finding no anomalies, which ultimately contributes 0 to the normalized final score; a sketch of this handling follows the footnote below.
*This step was necessary for `realKnownCause/ambient_temperature_system_failure.csv` and `realAWSCloudwatch/iio_us-east-1_i-a2eb1cd9_NetworkIn.csv`.
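A minimal sketch of that handling (the wrapper function is ours; only the error message comes from AnomalyDetection) is to catch the error and fall back to an empty set of flagged points:

```r
# Run the detector, but treat "needs at least 2 periods" failures as
# finding no anomalies so the file still produces a results row for NAB.
runDetector <- function(nab_data) {
  tryCatch(
    AnomalyDetectionTs(nab_data, max_anoms=0.0008, direction='both', plot=FALSE)$anoms,
    error = function(e) {
      message("Treating as no anomalies: ", conditionMessage(e))
      data.frame()  # no flagged points
    }
  )
}
```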
To prepare NAB for analyzing results from a new detector, we ran the following script:
```
python scripts/create_new_detector.py --detector twitter
```
This script generates the necessary directories and creates an entry in the thresholds JSON.
NAB requires a CSV file with `timestamp`, `value`, `anomaly_score`, and `label` columns. Since AnomalyDetection identifies anomalies outright, rather than reporting an anomaly probability or raw score, we used a binary `anomaly_score` indicating whether each point was flagged by AnomalyDetection as anomalous (a sketch of this step follows the labeling code below). The `label` column is also binary, indicating whether each point is a true anomaly. The true anomalies and their durations are recorded in a JSON file. We used the jsonlite R package for handling the JSON. Labels were extracted and added to the dataframe as follows:
```r
library(jsonlite)

# Mark each row whose timestamp falls inside a labeled anomaly window
# (given as [start, end] pairs keyed by file name in the label JSON).
addLabels <- function(anomalyDataFrame, labelJSON, dataFile) {
  anomalyDataFrame$label <- 0
  labels <- fromJSON(labelJSON)
  anomalyBounds <- labels[[dataFile]]
  if (length(anomalyBounds) != 0) {
    for (i in 1:nrow(anomalyBounds)) {
      lower <- anomalyBounds[i, 1]
      upper <- anomalyBounds[i, 2]
      idx <- anomalyDataFrame$timestamp >= lower & anomalyDataFrame$timestamp <= upper
      anomalyDataFrame$label[idx] <- 1
    }
  }
  return(anomalyDataFrame)
}
```
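The binary `anomaly_score` column mentioned above can be filled in the same spirit. The helper below is a sketch of ours for the `AnomalyDetectionTs` case (the Vec variant returns row indices instead, which would be used directly); it assumes the timestamps in `res$anoms` round-trip unchanged through AnomalyDetection so they can be matched against the original dataframe. Here `labelJSON` stands for the path to the label JSON file and `dataFile` is the same key used above.

```r
# Flag rows whose timestamp appears in AnomalyDetection's output; every
# other row gets an anomaly_score of 0.
addAnomalyScores <- function(anomalyDataFrame, anoms) {
  anomalyDataFrame$anomaly_score <- 0
  if (!is.null(anoms) && nrow(anoms) > 0) {
    flagged <- anomalyDataFrame$timestamp %in% anoms$timestamp
    anomalyDataFrame$anomaly_score[flagged] <- 1
  }
  return(anomalyDataFrame)
}

results <- addAnomalyScores(nab_data, res$anoms)
results <- addLabels(results, labelJSON, dataFile)
```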
With all columns added to the dataframe, `write.csv` lets us write the results to a CSV file that can be passed into NAB. Note: each CSV file must have the name of the detector followed by an underscore at the beginning of the filename, e.g. `twitter_cpu_utilization_asg_misconfiguration.csv`.
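For example (the output path shown here is illustrative and assumes the directory layout described next):

```r
# Write the finished dataframe (timestamp, value, anomaly_score, label)
# under the detector's results directory, with the detector-name prefix.
write.csv(results,
          file="results/twitter/realAWSCloudwatch/twitter_cpu_utilization_asg_misconfiguration.csv",
          row.names=FALSE)
```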
The results CSV files were placed in `NAB/results/twitter`, in subdirectories matching the data categories. For the scoring to proceed, we first need baseline results to use for normalization, so running the entire NAB pipeline with the `baseline` detector is necessary. Afterwards, from the top level of NAB, we can run:

```
python run.py --optimize --score --normalize -d baseline,twitter
```
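If the baseline results do not exist yet, generating them first might look like the following; the exact invocation is an assumption on our part (the `--detect` step runs the detector itself, and the remaining flags match the command above):

```
python run.py -d baseline --detect --optimize --score --normalize
```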
The final scores will be printed to the screen, and results for individual files will be written to CSV files in the twitter directory. We obtained the following output for AnomalyDetectionTs with `max_anoms=0.0008`:
```
Final score for 'twitter_reward low FN rate' = 36.37
Final score for 'twitter_reward low FP rate' = 27.59
Final score for 'twitter_standard' = 32.97
```
and the following output for AnomalyDetectionVec with `period=900`:
```
Final score for 'twitter_reward_low_FP_rate_scores' = 36.18
Final score for 'twitter_reward_low_FN_rate_scores' = 43.46
Final score for 'twitter_standard_scores' = 40.18
```
The one parameter of significant consequence to the results of AnomalyDetectionTs is `max_anoms`, which caps the fraction of data points that the algorithm will label as anomalous. Akin to the NAB optimization step, we tuned this parameter manually in search of the best final scores. We found that a value of 0.0008 for `max_anoms` maximizes the scores for all three NAB scoring profiles (standard, reward low FP, reward low FN). We then tuned the `period` parameter for AnomalyDetectionVec, which specifies the length of the window within which trends should hold constant. We found that a value of 900 for `period` maximizes all three final scores.
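The manual tuning itself amounted to a small sweep over candidate values, regenerating the results files and re-scoring with NAB each time. A rough sketch of the R side of that loop (the candidate grid shown is illustrative, not the exact set of values we tried):

```r
# Re-run the detector for each candidate max_anoms and report how many
# points were flagged; each candidate's results would then be written
# out and scored with run.py as above, comparing the three profiles.
candidates <- c(0.0002, 0.0005, 0.0008, 0.001, 0.002)
for (m in candidates) {
  res <- AnomalyDetectionTs(nab_data, max_anoms=m, direction='both', plot=FALSE)
  cat(sprintf("max_anoms=%g flagged %d points\n", m, nrow(res$anoms)))
}
```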