This code utilizes data generated by our Data Generation Tool in order to detect fraud in sliding windows of credit card transactions. The Spark-based portion of the code loads .csv files of credit card transactions and generates state-transition matricies for every user (and user segment/profile) based on their transaction history. The generation of the matricies, as well as aggregation of various user and segment-level statistics are all performed via Spark and are distributed in nature. The results (matricies and aggregates) are stored in a Redis instance after being calculated.
Kafka is utilized to listen for incoming streaming transactions (transaction_listener_AWS.py) and the listener utilizes the pre-calculated aggregates / matricies stored in Redis to evaluated incoming transactions for probabilities of fraud. If the probability of fraud is over a given threshold, the transaction information is sent (via Kafka) to another listener (fraud_listener_AWS.py) which records the data to a local .csv file, as well as creates an updated map of the United States with location of fraud via Folium / Leaflet.js.
Streaming data is usually simualted by generating a test dataset via the data generation process, that is independent of the data used in the state-transition / aggregation process (though both sets must share the same customer file).
This implementation is designed to be run on Amazon Web Services Elastic MapReduce (EMR), and utilizes AWS ElasticCache for a Redis instance. Map visualization is served on the Hadoop namenode via the apache2 server (/var/www/html). This process will not run without opening a number of ports to your Namenode instance, so plan to review your EMR security policy.
- Requires Python 2.7 or higher (EMR AMI's require Python update for namenodes as well as all data nodes)
- Requires several Python package installations (please see import statements)
- Location of source data is defined in Sparkov_AWS.py, and is by default pointing to the HDFS store.
- Sparkov_AWS requires updating a number of variables, including IP addresses. Search for #AWS in the code.
- The transaction_listener code requires a number of modifications, including Kafka topic as well as Kafka and Redis node IP addresses.
- fraud_listener requires modifications for Kafka IP addresses as well as to be executed with an account that has write permissions to /var/www/html.