Markov Chain based fraud detection system in Spark.

namebrandon/Sparkov

Sparkov - A Markov chain-based fraud detection system built on Spark.

This code uses data generated by our Data Generation Tool to detect fraud in sliding windows of credit card transactions. The Spark-based portion of the code loads .csv files of credit card transactions and generates state-transition matrices for every user (and user segment/profile) based on their transaction history. The generation of the matrices and the aggregation of various user- and segment-level statistics are performed in Spark and are distributed. The results (matrices and aggregates) are then stored in a Redis instance.
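As a rough illustration of what a per-user state-transition matrix is, here is a minimal sketch in plain Python (no Spark). It assumes each transaction has already been reduced to a `(user_id, state)` pair in time order. The state labels, the function names `transition_matrix` and `matrices_by_user`, and the additive smoothing are assumptions made for this example, not details taken from Sparkov_AWS.py.

```python
from collections import defaultdict


def transition_matrix(states, sequence, smoothing=1.0):
    """Estimate row-normalized transition probabilities from one state sequence.

    `smoothing` is an additive pseudo-count (an assumption of this sketch) so
    that transitions never seen in the history keep a small nonzero probability.
    """
    counts = {s: {t: float(smoothing) for t in states} for s in states}
    # Count every consecutive (previous, current) pair in the history.
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1.0

    matrix = {}
    for s in states:
        total = sum(counts[s].values())
        if total == 0:
            # No observations and no smoothing: fall back to a uniform row.
            matrix[s] = {t: 1.0 / len(states) for t in states}
        else:
            matrix[s] = {t: counts[s][t] / total for t in states}
    return matrix


def matrices_by_user(transactions, states, smoothing=1.0):
    """Group time-ordered (user_id, state) pairs by user and build one matrix each."""
    sequences = defaultdict(list)
    for user_id, state in transactions:
        sequences[user_id].append(state)
    return {
        user_id: transition_matrix(states, seq, smoothing)
        for user_id, seq in sequences.items()
    }
```

In a distributed setting the grouping step would be a group-by-user over the transaction records, with the per-user counting done independently on each group; the sketch above only shows the per-user arithmetic.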

Kafka is used to listen for incoming streaming transactions (transaction_listener_AWS.py). The listener uses the pre-calculated aggregates and matrices stored in Redis to evaluate each incoming transaction for its probability of fraud. If that probability exceeds a given threshold, the transaction information is sent (via Kafka) to another listener (fraud_listener_AWS.py), which records the data to a local .csv file and creates an updated map of the United States showing the locations of fraud via Folium / Leaflet.js.
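The scoring step can be sketched as follows. This is an illustration only: the score used here (the average of one minus the transition probability over a sliding window) is one common way to score a sequence against a Markov chain and is not necessarily the metric used in transaction_listener_AWS.py. The function names, the nested-dict matrix layout, and the `threshold` default are hypothetical.

```python
def window_fraud_score(matrix, window):
    """Average of (1 - P(prev -> curr)) over consecutive states in the window.

    `matrix` is assumed to be a nested dict: matrix[prev][curr] -> probability.
    Scores near 1.0 mean the window is made of transitions the user rarely makes.
    """
    pairs = list(zip(window, window[1:]))
    if not pairs:
        # A window with fewer than two transactions has no transitions to score.
        return 0.0
    return sum(1.0 - matrix[prev][curr] for prev, curr in pairs) / len(pairs)


def is_suspicious(matrix, window, threshold=0.8):
    """Flag the window when its score exceeds the (hypothetical) threshold."""
    return window_fraud_score(matrix, window) > threshold


# Example with a hand-written two-state matrix for one user.
example_matrix = {
    "LOW": {"LOW": 0.9, "HIGH": 0.1},
    "HIGH": {"LOW": 0.8, "HIGH": 0.2},
}
usual = is_suspicious(example_matrix, ["LOW", "LOW", "LOW"])
unusual = is_suspicious(example_matrix, ["LOW", "HIGH", "HIGH"])
```

In the pipeline described above, the matrix would be read from Redis for the user on the incoming transaction, and a flagged window would be published to the Kafka topic that the fraud listener consumes.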

Streaming data is usually simulated by generating a test dataset via the data generation process. This dataset is independent of the data used in the state-transition / aggregation process, though both sets must share the same customer file.

This implementation is designed to run on Amazon Web Services Elastic MapReduce (EMR) and uses AWS ElastiCache for the Redis instance. The map visualization is served from the Hadoop namenode via the apache2 server (/var/www/html). This process will not run without opening a number of ports on your namenode instance, so plan to review your EMR security policy.

General Usage and Notes

  • Requires Python 2.7 or higher (EMR AMIs require a Python update on the namenode as well as on all data nodes)
  • Requires several Python package installations (please see import statements)
  • The location of the source data is defined in Sparkov_AWS.py and by default points to the HDFS store.
  • Sparkov_AWS requires updating a number of variables, including IP addresses. Search for #AWS in the code.
  • The transaction_listener code requires a number of modifications, including the Kafka topic as well as the Kafka and Redis node IP addresses.
  • fraud_listener requires modifications for the Kafka IP addresses and must be executed with an account that has write permissions to /var/www/html.
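
The scripts as described expect these values to be edited directly in the code. If you prefer to keep the environment-specific values in one place, one possible approach is to read them from environment variables, as in the minimal sketch below. The variable names (`SPARKOV_KAFKA_BROKERS` and so on), the defaults, and the `load_settings` helper are all hypothetical and do not exist in the repository.

```python
import os


def load_settings(env=None):
    """Collect Kafka and Redis connection settings from environment variables.

    All variable names and defaults here are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    return {
        # Comma-separated list of host:port pairs for the Kafka brokers.
        "kafka_brokers": env.get("SPARKOV_KAFKA_BROKERS", "localhost:9092").split(","),
        "kafka_topic": env.get("SPARKOV_KAFKA_TOPIC", "transactions"),
        "redis_host": env.get("SPARKOV_REDIS_HOST", "localhost"),
        "redis_port": int(env.get("SPARKOV_REDIS_PORT", "6379")),
    }
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.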
