This project addresses credit card fraud detection using R, a versatile programming language widely used for data analysis and machine learning. Using machine learning algorithms, I aim to build a reliable and efficient fraud detection system that can identify and help prevent fraudulent credit card transactions in real time.
The primary objective of this project is to create a predictive model that can accurately classify transactions as either fraudulent or legitimate. To achieve this, I used historical credit card transaction data, which includes a mixture of both fraudulent and non-fraudulent instances. By extracting meaningful patterns and features from the data, I trained a machine learning model capable of distinguishing between genuine transactions and fraudulent activities.
For this project I used historical credit card transaction data (creditcard.csv) taken from Kaggle. The dataset contains transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
ranger
caret
rpart
caTools
pROC
neuralnet
gbm
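Before running the analysis it can help to confirm that the packages listed above are available. The following is a minimal sketch that only reports what is missing; the variable names are illustrative, and installation is left commented out.

```r
# Packages used in this project (taken from the list above)
pkgs <- c("ranger", "caret", "rpart", "caTools", "pROC", "neuralnet", "gbm")

# requireNamespace() returns FALSE instead of failing when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are installed.")
}
```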
First, I imported the dataset of credit card transactions, reading it with the read.csv() function. I then explored the data in the creditcard_data data frame: after displaying its first and last rows using the head() and tail() functions, I went on to examine its other components.
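The import and exploration steps above can be sketched as follows. So that the example runs without the Kaggle download, it writes a tiny made-up file to a temporary path first; the toy values and the csv_path name are assumptions, and only the column names (Time, V1 to V28, Amount, Class) mirror the real creditcard.csv.

```r
# Write a tiny stand-in file so the example runs without the Kaggle download.
# The real creditcard.csv has columns Time, V1..V28, Amount and Class.
set.seed(1)
toy <- data.frame(
  Time   = seq(0, 90, by = 10),
  V1     = rnorm(10),
  V2     = rnorm(10),
  Amount = round(runif(10, 1, 500), 2),
  Class  = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)
csv_path <- tempfile(fileext = ".csv")  # hypothetical path; use "creditcard.csv" for the real data
write.csv(toy, csv_path, row.names = FALSE)

creditcard_data <- read.csv(csv_path)

dim(creditcard_data)             # number of rows and columns
head(creditcard_data, 3)         # first rows
tail(creditcard_data, 3)         # last rows
table(creditcard_data$Class)     # class counts: 0 = legitimate, 1 = fraud
summary(creditcard_data$Amount)  # distribution of transaction amounts
```

On the real data, table(creditcard_data$Class) is a quick way to see the class imbalance described earlier.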
In this section of the project, I scaled the data using the scale() function, applying it to the Amount column of creditcard_data. Scaling standardizes the column to a mean of 0 and a standard deviation of 1, so that large transaction amounts do not dominate the other features and interfere with the functioning of the model. Note that standardization rescales extreme values but does not remove them.
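A minimal sketch of the scaling step, using made-up amounts rather than the real data:

```r
# Toy transaction amounts (made-up values, not from the real dataset)
creditcard_data <- data.frame(Amount = c(2.50, 14.99, 120.00, 7.25, 999.90))

# scale() centers to mean 0 and divides by the standard deviation.
# It returns a one-column matrix, so as.numeric() flattens it back to a vector.
creditcard_data$Amount <- as.numeric(scale(creditcard_data$Amount))

mean(creditcard_data$Amount)   # approximately 0
sd(creditcard_data$Amount)     # approximately 1
range(creditcard_data$Amount)  # not bounded: the largest amount still stands out
```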
After standardizing the data, I split the dataset into a training set and a test set with a split ratio of 0.80. This means that 80% of the rows are assigned to train_data and the remaining 20% to test_data. I then checked the dimensions of both sets using the dim() function.
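The 80/20 split can be sketched in base R as shown here. This is a hand-written stratified split on made-up data, not necessarily the call used in the project; with the caTools package from the list above, sample.split(creditcard_data$Class, SplitRatio = 0.80) produces a comparable logical mask.

```r
set.seed(123)

# Toy data: 1000 rows with roughly 5% positives (made-up, for illustration only)
creditcard_data <- data.frame(
  Amount = rnorm(1000),
  Class  = rbinom(1000, 1, 0.05)
)

# Sample 80% of the row indices within each class so that both sets
# keep approximately the same fraud ratio
split_ratio <- 0.80
train_idx <- unlist(lapply(
  split(seq_len(nrow(creditcard_data)), creditcard_data$Class),
  function(idx) idx[sample.int(length(idx), size = round(split_ratio * length(idx)))]
))

train_data <- creditcard_data[train_idx, ]
test_data  <- creditcard_data[-train_idx, ]

dim(train_data)
dim(test_data)
```

Stratifying by Class matters with data this unbalanced: a plain random split could leave the test set with very few fraud cases.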
In this section of the project, I fit the first model, logistic regression, which I used to model the probability that a transaction is fraudulent. After summarising the model, I visualized it through plots and applied it to the test data. To assess its performance, I plotted the Receiver Operating Characteristic (ROC) curve. For this, I first loaded the pROC package and then plotted the ROC curve.
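A minimal sketch of fitting and applying a logistic regression with glm() from base R. The data are simulated and the object names (logistic_model, test_prob) are illustrative, so this shows the general pattern rather than the exact model from the project.

```r
set.seed(42)

# Toy data (made-up): fraud probability rises with V1 and falls with V2
n  <- 2000
V1 <- rnorm(n)
V2 <- rnorm(n)
Class <- rbinom(n, 1, plogis(-3 + 2 * V1 - 1.5 * V2))
toy <- data.frame(V1, V2, Class)

train_data <- toy[1:1600, ]
test_data  <- toy[1601:2000, ]

# Logistic regression: binomial family with the default logit link
logistic_model <- glm(Class ~ ., data = train_data, family = binomial())
summary(logistic_model)

# type = "response" returns probabilities rather than log-odds
test_prob <- predict(logistic_model, newdata = test_data, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)

# Confusion matrix on the held-out rows
table(actual = test_data$Class, predicted = test_pred)
```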
Next, I implemented a decision tree, which classifies a transaction by following a sequence of decisions down to the class it belongs to. I fit the tree using recursive partitioning (the rpart package) and plotted it using the rpart.plot() function from the rpart.plot package.
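A minimal sketch of a classification tree with rpart on simulated data. It uses base-graphics plotting so that it runs without the separate rpart.plot package; the data and object names are assumptions made for illustration.

```r
library(rpart)  # recursive partitioning; ships with standard R distributions

set.seed(7)
# Toy data (made-up): the positive class occurs mostly when V1 > 1
n  <- 1000
V1 <- rnorm(n)
V2 <- rnorm(n)
Class <- factor(ifelse(V1 > 1 & runif(n) < 0.9, 1, 0))
toy <- data.frame(V1, V2, Class)

# method = "class" requests a classification tree
tree_model <- rpart(Class ~ ., data = toy, method = "class")
print(tree_model)

# Predicted class labels and training accuracy
tree_pred <- predict(tree_model, toy, type = "class")
mean(tree_pred == toy$Class)

# Base-graphics plot; rpart.plot::rpart.plot(tree_model) draws a more
# readable figure if the rpart.plot package is installed.
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE)
```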
Artificial Neural Networks (ANNs) are machine learning models loosely modeled on the human nervous system. They learn patterns from historical data and use them to classify new inputs. I loaded the neuralnet package to fit the ANN and then plotted the network using the plot() function. The network outputs values between 0 and 1, so I set a threshold of 0.5: values above 0.5 are classified as 1 (fraud) and the rest as 0.
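The thresholding step can be sketched on its own. The scores below are made up; with the neuralnet package they would typically come from compute(ann_model, test_data)$net.result, where ann_model is a hypothetical name for the fitted network.

```r
# Hypothetical network outputs for five transactions (made-up values)
net_result <- c(0.02, 0.91, 0.50, 0.47, 0.63)

# Values strictly above 0.5 become 1 (fraud); everything else becomes 0
predicted_class <- ifelse(net_result > 0.5, 1, 0)
predicted_class  # 0 1 0 0 1
```

Because fraud is rare, a threshold below 0.5 may be worth trying when missing a fraud costs more than a false alarm.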
Gradient Boosting is a popular machine learning algorithm for classification and regression tasks. The model is an ensemble of weak learners, typically shallow decision trees, each fitted to correct the errors of the trees before it. Combined, these weak trees form a strong model. I implemented this model with the gbm package, which adds trees by gradient descent on the loss function.
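To make the idea of combining weak trees concrete, here is a hand-rolled sketch of gradient boosting for binary classification. It is not the gbm package's implementation: it fits small rpart regression trees to the residuals of a log-loss model on simulated data, and every name and setting in it (n_trees, learning_rate, the toy data) is an assumption chosen for illustration.

```r
library(rpart)  # used only to fit the small regression trees

set.seed(99)
# Toy data (made-up): a nonlinear class boundary
n  <- 1000
V1 <- rnorm(n)
V2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-2 + 2 * V1^2 - V2))
X  <- data.frame(V1, V2)

log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

n_trees       <- 50
learning_rate <- 0.1

# Start from the log-odds of the overall positive rate
f <- rep(qlogis(mean(y)), n)
loss_history <- numeric(n_trees + 1)
loss_history[1] <- log_loss(y, plogis(f))

for (m in seq_len(n_trees)) {
  # Negative gradient of the log loss with respect to f: observed minus predicted
  residual <- y - plogis(f)

  # Weak learner: a depth-2 regression tree fitted to the residuals
  tree <- rpart(residual ~ V1 + V2, data = cbind(X, residual = residual),
                method = "anova",
                control = rpart.control(maxdepth = 2, cp = 0))

  # Take a small step in the direction the tree suggests
  f <- f + learning_rate * predict(tree, X)
  loss_history[m + 1] <- log_loss(y, plogis(f))
}

round(loss_history[c(1, n_trees + 1)], 3)  # training log loss before and after boosting
```

Each tree on its own is a poor classifier, but the running sum f improves step by step, which is the sense in which weak trees combine into a strong model.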
In the last section of the project, I calculated and plotted the ROC curve, which shows the trade-off between the sensitivity and specificity of the model. The curve is plotted and the area under the curve (AUC) is printed. The AUC summarizes how well the model separates fraudulent from legitimate transactions across all thresholds.
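How sensitivity, specificity and the area under the curve relate can be sketched in base R on four made-up transactions. This is a manual calculation for illustration; with the pROC package, roc() and auc() provide the same quantities.

```r
# Made-up scores and labels for four transactions
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

# Sweep the threshold from high to low and record both rates at each step
thresholds  <- sort(unique(scores), decreasing = TRUE)
sensitivity <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
specificity <- sapply(thresholds, function(t) mean(scores[labels == 0] <  t))

# AUC via the rank (Mann-Whitney) formulation: the probability that a random
# fraud case scores higher than a random legitimate case
r     <- rank(scores)  # average ranks handle ties
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc   <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc  # 0.75

# ROC curve: sensitivity against 1 - specificity
plot(c(0, 1 - specificity), c(0, sensitivity), type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)  # chance line
```

An AUC of 0.5 corresponds to the dashed chance line, and 1 to a model that ranks every fraud above every legitimate transaction.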
In this project, I learnt how to develop a credit card fraud detection model using machine learning in R. I used a variety of ML algorithms to implement the model and plotted the respective performance curves. I also learnt how data can be analyzed and visualized to distinguish fraudulent transactions from legitimate ones.