abhishekakumar/GatorSquad

Disk Drive Failure Prediction Using RNN/LSTM

Disk drives are the backbone of large-scale systems that rely on durable storage. They are cost-effective and reliable, and they remain the primary storage medium. Even so, disk failures do happen. They are rare, with an annual failure rate of 0.3% to 3%, but costly: a failed drive can cause large amounts of data loss. Data loss can disrupt a business, prevent it from providing a product or service to its customers, and damage its reputation, and the loss of financial data can lead directly to financial losses. The cost of such a loss is greater than the cost of the hardware needed to replace the drive entirely. Disk failure prediction is a binary classification problem: each disk is labeled as either failed or healthy, and the failed class is very rare. In recent years, deep learning methods based on Recurrent Neural Networks (RNN) have shown strong classification results on complex problems. One class of RNN, Long Short-Term Memory (LSTM), is well suited to time-based classification and prediction.

Dataset Description

The dataset for our study is a labeled dataset obtained from Backblaze, an online backup service provider for individuals and businesses. Backblaze deploys over 41,000 hard drives ranging from 1 TB to 6 TB and collects data by taking snapshots of basic drive information along with SMART parameters. The CSV files have columns describing each hard disk: serial number, model, capacity, failure (0 = drive OK, 1 = drive failure), and the SMART parameters. The 2013 and 2014 files have 80 columns of SMART parameters; the 2015 file has 90. Each SMART attribute has a six-byte raw value and a one-byte normalized value. The raw value (RAW_VALUE) is converted into a normalized value (VALUE) ranging from 1 to 253 [14]. The dataset covers 70 different models from 6 different manufacturers.

Architecture

This project uses an RNN, aiming for a better prediction success rate on a large dataset. An RNN is an artificial neural network in which the output of a layer at one time step is fed back as an input to that same layer at the next time step. In other words, the output vector depends not only on the current input but on the entire sequence of inputs fed to the network in the past. LSTM is a type of RNN developed by Sepp Hochreiter and Juergen Schmidhuber as a solution to the vanishing gradient problem. An LSTM preserves the error as it is backpropagated through time and between layers. Keeping the error flow nearly constant over many time steps lets an LSTM keep learning across long sequences, where a plain RNN would stall.
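
The recurrence described above can be sketched in a few lines of NumPy. This is an illustrative forward pass for a single LSTM cell, not the project's code: the weight layout, the gate ordering, and the sizes `D`, `H`, and `T` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) biases,
    stacked in the order input gate, forget gate, candidate, output gate
    (this ordering is an assumption made for the sketch).
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell update
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c_t = f * c_prev + i * g       # additive cell-state path that carries error through time
    h_t = o * np.tanh(c_t)         # hidden state passed on to the next time step
    return h_t, c_t

def run_lstm(sequence, W, U, b):
    """Feed a (T, D) sequence through the cell and return the last hidden state."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, W, U, b)
    return h

# Hypothetical sizes: 10 input features, 4 hidden units, 5 time steps.
D, H, T = 10, 4, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4 * H, D))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

seq = rng.normal(size=(T, D))
h_last = run_lstm(seq, W, U, b)

# Perturb only the first time step: the final hidden state still changes,
# because the state carries information from earlier inputs.
seq_changed = seq.copy()
seq_changed[0] += 1.0
h_changed = run_lstm(seq_changed, W, U, b)
```

Comparing `h_last` with `h_changed` shows the point made in the text: the output at the last step depends on inputs from earlier steps, not only on the current one.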

Feature Selection

Previous research has shown that certain SMART attributes correlate more strongly with disk failure. Pinheiro et al. concluded that SMART 5, SMART 196, SMART 197 and SMART 198 are highly correlated with failure. We use 10 SMART attributes as features for this prediction. Inconsistent data was removed from the dataset, and null values were replaced with 0.
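
A minimal sketch of the column selection and null replacement with pandas follows. It assumes column names of the form `smart_<id>_raw`, as in the published Backblaze CSV files. Only the four attributes named above are used, because the full list of 10 is not given here; `select_features` and the sample rows are hypothetical.

```python
import io
import pandas as pd

# Attributes named in the text; the remaining features are not listed in this README.
SMART_IDS = [5, 196, 197, 198]
FEATURE_COLUMNS = [f"smart_{i}_raw" for i in SMART_IDS]

def select_features(df):
    """Keep the identifying columns, the label, and the selected raw SMART values."""
    keep = ["date", "serial_number", "model", "failure"] + FEATURE_COLUMNS
    out = df[keep].copy()
    # Null values are replaced with 0, as described above.
    out[FEATURE_COLUMNS] = out[FEATURE_COLUMNS].fillna(0)
    return out

# Two made-up rows standing in for a daily snapshot file.
sample_csv = io.StringIO(
    "date,serial_number,model,capacity_bytes,failure,"
    "smart_5_raw,smart_196_raw,smart_197_raw,smart_198_raw,smart_9_raw\n"
    "2015-01-01,A1,ModelX,4000787030016,0,0,,0,0,1200\n"
    "2015-01-02,A1,ModelX,4000787030016,1,8,,16,16,1224\n"
)
features = select_features(pd.read_csv(sample_csv))
```

The removal of inconsistent data is not shown, since the criteria for it are not described.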

Technology Overview

We use the Sequential model of the Keras library, a linear stack of layers, to build the RNN model. The Keras fit method takes parameters such as the number of epochs, the validation split, and callbacks, which control how the LSTM is trained. Spark is the distributed system for this project, even though Hadoop has been the usual solution for large-scale distributed data processing for the past 10 years. Hadoop MapReduce works efficiently for one-pass computation but is inefficient for multi-pass computation and complex algorithms. The LSTM model is run on Spark using the Elephas library, which uses Spark's machine learning library, MLlib, for various functionality.

LSTM Model

The model has 3 layers: one input layer, one hidden layer, and one output layer. The hidden layer has 200 cells. The RAW_VALUE of each selected SMART parameter is the input to the LSTM model. The output of the model is either 0 (drive OK) or 1 (drive failure).
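
As a rough sketch, a three-layer model of this shape could be declared with the Keras Sequential API as follows. The window length `TIMESTEPS` is a hypothetical value, since the sequence length is not stated here, and the linear activation on the output layer follows the training description in the next section. The import path assumes a TensorFlow installation that bundles Keras.

```python
from tensorflow import keras

TIMESTEPS = 5     # hypothetical window length; not stated in this README
N_FEATURES = 10   # 10 SMART attributes are used as features

model = keras.Sequential([
    keras.Input(shape=(TIMESTEPS, N_FEATURES)),   # input: one window of raw SMART values
    keras.layers.LSTM(200),                       # hidden layer with 200 LSTM cells
    keras.layers.Dense(1, activation="linear"),   # single output per window
])

model.summary()
```

With 10 input features, the LSTM layer holds 4 × (200 × (200 + 10) + 200) = 168,800 weights and the dense layer 201, which is a quick sanity check on the layer sizes.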

Training and Learning

The 3D data is divided into training and testing sets using the scikit-learn cross-validation function. The split is 70:30 so that both the training set and the testing set contain a reasonable number of failures. Drive failure data is highly imbalanced: of around 1 million records, only about 1,000 are failures, which is 0.1%. Research has suggested several strategies for handling imbalanced data, including oversampling (upsampling), undersampling (downsampling), and sample weights. We tried all three and found that downsampling suits our problem. Sample weights did not improve the prediction results. Upsampling, given the amount of data we have, led to too many training samples and a longer training time. Research has also shown undersampling to perform better than oversampling.
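
The downsampling and the 70:30 split can be sketched with NumPy alone. The helper names, the 1:1 class ratio, the toy data, and the order of the two steps (downsample first, then split) are assumptions for illustration; none of them is stated above.

```python
import numpy as np

def downsample_majority(X, y, ratio=1.0, seed=0):
    """Keep every failure (y == 1) and a random subset of healthy samples (y == 0).

    ratio is the number of healthy samples kept per failure sample.
    """
    rng = np.random.default_rng(seed)
    fail_idx = np.flatnonzero(y == 1)
    ok_idx = np.flatnonzero(y == 0)
    n_keep = min(len(ok_idx), int(ratio * len(fail_idx)))
    kept_ok = rng.choice(ok_idx, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([fail_idx, kept_ok]))
    return X[idx], y[idx]

def split_70_30(X, y, seed=0):
    """Shuffle and split along the first axis; this simple split is not stratified."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = (7 * len(y)) // 10
    train, test = order[:cut], order[cut:]
    return X[train], X[test], y[train], y[test]

# Toy 3D data: 1000 windows, 5 time steps, 10 features, 10 failures (all made up).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5, 10))
y = np.zeros(1000, dtype=int)
y[:10] = 1

X_bal, y_bal = downsample_majority(X, y, ratio=1.0)
X_train, X_test, y_train, y_test = split_70_30(X_bal, y_bal)
```

In practice, scikit-learn's `train_test_split` with `test_size=0.3` and `stratify=y` would do the split and keep the failure ratio equal in both sets.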

The model is trained using the Adam optimizer with a batch size of 200, a linear activation on the dense layer, and binary cross-entropy (log loss) as the objective function. Adam is an efficient method for stochastic optimization. It computes individual learning rates for different parameters from estimates of the first and second moments of the gradients. The parameters are learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1e-8.
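
To make the moment estimates concrete, here is a small NumPy sketch of the standard Adam update rule with the parameter values listed above. The function name `adam_step` and the toy quadratic objective are hypothetical; during training, Keras performs this update internally.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for parameters theta at step t (t starts at 1)."""
    m = beta_1 * m + (1 - beta_1) * grad         # first moment: running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)  # per-parameter step size
    return theta, m, v

# Toy objective f(theta) = sum(theta ** 2), whose gradient is 2 * theta.
start = np.array([1.0, -2.0])
first_step, _, _ = adam_step(start, 2 * start, np.zeros(2), np.zeros(2), t=1)

theta, m, v = start.copy(), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

On the first step both parameters move by almost exactly the learning rate, even though their gradients differ by a factor of two. That is the per-parameter scaling the paragraph describes.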
