Machine Learning models to prove the importance and the effectiveness of this technology in the industry, and how it should be implemented across the board.

chirathlv/Fraud-Detection-with-ML

Fraud Detection

AI in Finance

Table of Contents

  1. Introduction - Machine Learning and Risk mitigation in Finance
  2. Data, Technology and Coding Standards
    1. Data Sources
    2. Technology stack
    3. Coding Standards
  3. Basic Statistical Analysis
  4. Machine Learning Model Procedure / Analysis
    1. Machine Learning Pipeline
    2. Data Preparation
      1. Data Ingestion
      2. Data Wrangling
      3. Feature Engineering
    3. Model Training
    4. Model Deployment
  5. References

Introduction - Machine Learning and Risk mitigation in Finance

The financial sector, which operates throughout the world, is arguably the most dominant and influential industry there is.

As in any industry or society, however advanced, there are always opportunities to cheat the system and to exploit misinformation or weak security and operations for financial gain.

As the world evolves and technology advances, we see more and more cases of fraud and data theft within the financial sector each year, costing businesses, individuals, and the economy billions of dollars.

The purpose of this project is to develop and use machine learning models to demonstrate the importance and effectiveness of this technology in the industry, and to show how it should be implemented across the board.

Data, Technology and Coding Standards

Data Sources

  1. Credit Card Data For ML model (Kaggle)

    The data set used for the machine learning model is a credit card transactions data set. It is not obfuscated, because its more than 24 million transactions were generated by a virtual world simulation. The data covers 2000 synthetic consumers resident in the US who travel the world. For this analysis, only the 2019 data was extracted, because fraud data is missing for some years.

  2. Fraud Statistics

    1. The Ascent
    2. Federal Trade Commission

Technology stack

  • Google Colab
  • python 3.8.3
  • pandas 1.0.5
  • numpy 1.18.5
  • requests 2.24.0
  • json 2.0.9
  • panel 0.9.7
  • plotly 5.3.1
  • hvplot 0.7.3
  • seaborn 0.10.1
  • matplotlib 3.2.2
  • scikit-learn 1.0.0 (NOTE scikit-learn 1.0.1 will not work!)
  • imblearn 0.8.1
  • xgboost 1.5.1
  • jupyter lab 2.1.5

Coding Standards

The following rules were applied during code development and testing:

  1. All variable names must reflect their purpose. Use underscores where required.
  2. Each step of the code must contain comments to explain the purpose of the code.
  3. A GitHub repository called project2 must be set up with a branch for each developer.
  4. Each developer must use their own GitHub branch to write and unit test their code.
  5. Lead developer must review code prior to merge.
  6. Lead developer is responsible for merging all code.
  7. Each developer must download the most recent code from main branch before commencing code changes.
  8. Each release must provide a brief message on changes made prior to committing the code.

Basic Statistical Analysis

The largest fraudulent transaction amount is 1244 and the largest fraudulent refund is -475, which suggests that fraudsters are able to issue refund requests as well as make fraudulent purchases. This requires further investigation to pinpoint. Overall, according to the box plot, fraudulent transactions have a higher average amount per transaction (79.42 vs 42.21) with a greater degree of deviation (143 vs 80).

Figure: Statistical analysis (box plot of transaction amounts, fraudulent vs non-fraudulent)
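The per-class figures quoted above (mean, deviation, largest purchase, largest refund) can be reproduced with a grouped summary. The following is a minimal sketch on a toy table, not the project's notebook code: the column names `amount` and `is_fraud` are hypothetical, and the rows are invented apart from reusing the two extremes mentioned in the text.

```python
import pandas as pd

# Toy stand-in for the 2019 transactions. Column names are assumptions and
# the values are invented for illustration (negative amounts are refunds).
transactions = pd.DataFrame({
    "amount": [12.50, 80.00, -20.00, 45.25, 300.00, -475.00, 1244.00, 60.00],
    "is_fraud": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Per-class summary: mean and standard deviation describe the typical amount
# and its spread; min and max expose the largest refund and largest purchase.
summary = (
    transactions.groupby("is_fraud")["amount"]
    .agg(["count", "mean", "std", "min", "max"])
)
print(summary)
```

On the real data, the same grouping would be applied to the cleaned amount column and the fraud label.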

Machine Learning Model Procedure / Analysis

Machine Learning Pipeline

Figure: ML pipeline

Data Preparation

Data Ingestion

The data was extracted from the Kaggle platform (see References for details) as CSV files containing about 24 million samples, covering both fraudulent and non-fraudulent transactions. Because of limited capacity for processing such a large volume, a subset of the original data set was extracted. To avoid selection bias, a full year of data (2019) was taken without losing any information within that year. Excel and Python were used to manipulate the data and prepare it for analysis.
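One way to take a single year out of a file too large to load at once is to read it in chunks and keep only the matching rows. This is a sketch under assumptions, not the project's actual extraction step: `load_year` is a hypothetical helper, the column names `Year`, `Amount` and `Is Fraud?` are assumed, and a small in-memory CSV stands in for the Kaggle file.

```python
import io

import pandas as pd


def load_year(source, year, chunksize=100_000):
    """Read a large CSV in chunks and keep only the rows for one year."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Filter each chunk before storing it so memory use stays bounded.
        kept.append(chunk[chunk["Year"] == year])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for the real file (invented rows).
sample_csv = io.StringIO(
    "Year,Amount,Is Fraud?\n"
    "2018,$10.00,No\n"
    "2019,$25.50,No\n"
    "2019,$-475.00,Yes\n"
    "2020,$7.25,No\n"
)

data_2019 = load_year(sample_csv, 2019, chunksize=2)
print(data_2019)
```

With the real data, `source` would be the path to the downloaded CSV file.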

Data Wrangling

The following data cleansing techniques were used to process the data before going further.

1. Drop Duplicates
2. Handling Missing Values
3. Correct the data types
4. Drop unwanted columns
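The four cleansing steps above can be sketched with pandas as follows. The column names and values are assumptions made for illustration, and dropping rows is only one of several ways to handle missing values; the source does not say which approach the project used.

```python
import pandas as pd

# Toy raw data with a duplicate row, a missing amount, and string-typed columns.
raw = pd.DataFrame({
    "Amount": ["$12.50", "$12.50", "$-20.00", None, "$300.00"],
    "Is Fraud?": ["No", "No", "No", "No", "Yes"],
    "Merchant Name": ["a1", "a1", "b2", "c3", "d4"],
})

# 1. Drop exact duplicate rows.
clean = raw.drop_duplicates()

# 2. Handle missing values (here: drop rows with no amount).
clean = clean.dropna(subset=["Amount"])

# 3. Correct the data types: currency string to float, Yes/No label to 0/1.
clean = clean.assign(
    Amount=clean["Amount"].str.replace("$", "", regex=False).astype(float),
    is_fraud=(clean["Is Fraud?"] == "Yes").astype(int),
)

# 4. Drop columns that are not needed for modelling.
clean = clean.drop(columns=["Is Fraud?", "Merchant Name"]).reset_index(drop=True)
print(clean)
```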

Feature Engineering

Next, the following feature engineering techniques were applied to get a better sense of the data and to choose the most relevant features from the raw data for the machine learning model.

1. Feature Transformation
2. Feature Splitting
3. Feature Encoding
4. Feature Scaling
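As a rough illustration of the four techniques listed above, the sketch below applies one example of each to a toy table. The specific choices (a signed log transform, splitting an `HH:MM` string, one-hot encoding, standard scaling) and the column names are assumptions; the source does not state which transforms the project actually used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "Amount": [12.5, 80.0, 300.0, 1244.0],
    "Time": ["06:21", "13:45", "23:59", "00:10"],
    "Use Chip": ["Swipe", "Chip", "Online", "Online"],
})

# 1. Transformation: a signed log compresses the long tail of amounts
#    while keeping refunds negative.
df["amount_log"] = np.sign(df["Amount"]) * np.log1p(df["Amount"].abs())

# 2. Splitting: break the HH:MM string into separate numeric features.
time_parts = df["Time"].str.split(":", expand=True).astype(int)
df["hour"] = time_parts[0]
df["minute"] = time_parts[1]

# 3. Encoding: one-hot encode the categorical column.
encoded = pd.get_dummies(df.drop(columns=["Amount", "Time"]), columns=["Use Chip"])

# 4. Scaling: standardise numeric features to zero mean and unit variance.
numeric_cols = ["amount_log", "hour", "minute"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded[numeric_cols]),
    columns=numeric_cols,
    index=encoded.index,
)
features = pd.concat([scaled, encoded.drop(columns=numeric_cols)], axis=1)
print(features)
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, so that test data does not leak into the fitted statistics.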

Model Training

Before training, split the data into 60% for training and 40% for testing. Then feed the training data into the machine learning algorithms. Next, validate the predictions against metrics and improve them further by tuning hyperparameters. This is an iterative process that continues until the model is trained well enough, reducing the cost while increasing the accuracy. Since this is a fraud detection use case in the financial domain, focus further on reducing false negatives rather than relying purely on accuracy.
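The split-train-validate loop described above can be sketched with scikit-learn. This is an illustration on synthetic data, not the project's notebook: the imbalanced data set from `make_classification`, the stratified split, and `class_weight="balanced"` are all assumptions added for the example. Recall on the fraud class is reported because it falls as false negatives rise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction features (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# 60% training / 40% testing; stratify keeps the fraud ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# class_weight="balanced" is an assumption here: it penalises missed frauds more.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Unpack the 2x2 confusion matrix; fn counts frauds the model missed.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
recall = recall_score(y_test, y_pred)
print(f"false negatives: {fn}, false positives: {fp}, recall: {recall:.3f}")
```

Hyperparameter tuning would repeat the fit-and-score step with different settings and keep the configuration with the best recall at an acceptable false positive rate.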

1. Logistic Regression

Confusion Matrix

Classification Report

2. Easy Ensemble Classifier

Confusion Matrix

Classification Report

3. XGBoost Classifier

Confusion Matrix

Classification Report

4. Random Forest Classifier

Confusion Matrix

Classification Report
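To compare several classifiers on the same split, the models can be scored in one loop. The sketch below is a hedged illustration on synthetic data that uses only the two scikit-learn models from the list above. The Easy Ensemble and XGBoost classifiers come from separate packages (imblearn and xgboost) and could be added to the `candidates` dictionary, a hypothetical name, in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the prepared features.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1
)

# Candidate models; the hyperparameters are illustrative defaults, not tuned values.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Fit each model on the training split and score it on the test split.
scores = {}
for name, model in candidates.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
    }

for name, metrics in scores.items():
    print(f"{name}: recall={metrics['recall']:.3f}, precision={metrics['precision']:.3f}")
```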

Model Deployment

Google Colab was chosen as the cloud-based platform for model deployment for the following reasons:

1. Cost effectiveness
2. Reliability
3. No infrastructure overhead
4. Usability and accessibility

The hosted files can be found at the links below (open with Google Colaboratory):

https://drive.google.com/file/d/1GRwbiNPk_BRxBJpy5GpH7GHHIuS-t5-5/view?usp=sharing

https://drive.google.com/file/d/1CRn9pSCsjJ5W0YZcEpq6Sslk-JdbGVig/view?usp=sharing

https://drive.google.com/file/d/1X_-51IZRfOtzNyUspLVn-6dY8ggUzCFh/view?usp=sharing

References

https://www.kaggle.com/ealtman2019/credit-card-transactions

https://www.ftc.gov/

https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics/

https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/

https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423
