This project detects fraudulent credit card transactions in both real-time and batch processing scenarios using big data tools and techniques. The architecture combines streaming and batch pipelines to process transactions, update lookup tables, and flag fraudulent activity based on predefined rules.
- Amazon S3:
  - Stores raw data files (CSV format).
  - Holds card member details, scores, and transactions.
- Databricks:
  - Processes data using Spark.
  - Implements Delta Live Tables (DLT) for data transformations (a minimal sketch follows this list).
- Apache Kafka:
  - Streams new transaction data in real time.
- Delta Lake:
  - Manages transaction and lookup tables with ACID properties.
- SQL:
  - Performs data cleaning, transformations, and aggregations.
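A minimal sketch of what the Delta Live Tables piece could look like; the S3 path and the table names `bronze_transactions` and `clean_transactions` are placeholders, not names from the project:

```python
# Hypothetical DLT pipeline sketch; bucket path and table names are assumptions.
import dlt
from pyspark.sql.functions import col

RAW_PATH = "s3://<bucket>/raw/transactions/"  # assumed S3 location of the raw CSV files

@dlt.table(comment="Raw card transactions ingested from S3 as-is.")
def bronze_transactions():
    # Auto Loader incrementally picks up new CSV files from the raw bucket.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load(RAW_PATH)
    )

@dlt.table(comment="Transactions with basic type casting applied.")
def clean_transactions():
    return dlt.read_stream("bronze_transactions").withColumn("amount", col("amount").cast("double"))
```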
- Raw Data Ingestion:
  - Raw data is stored in S3 buckets as CSV files.
  - Data includes card transactions, member details, and member scores.
- Batch Processing:
  - Batch jobs load raw data into Databricks tables using Delta Lake (see the batch-load sketch after this list).
  - A lookup table is created to store aggregated statistics.
- Streaming Processing:
  - Kafka streams new transactions to Databricks.
  - Transactions are processed and appended to existing Delta tables.
- Fraud Detection:
  - Rules are applied to determine whether a transaction is fraudulent.
  - Outputs are stored in a separate fraud detection table.
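A minimal sketch of the initial batch load, assuming the raw transactions CSV lives under an S3 path like `s3://<bucket>/raw/transactions/` (the path is a placeholder; the target table name matches the SQL used below):

```python
# Batch-load raw CSVs from S3 into a Delta table.
# The S3 path is a placeholder; credit_card_db.raw_transactions matches the SQL that follows.
raw_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3://<bucket>/raw/transactions/")
)

(
    raw_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("credit_card_db.raw_transactions")
)
```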
- Remove Duplicates:

```sql
CREATE OR REPLACE TABLE credit_card_db.silver_transactions AS
SELECT DISTINCT *
FROM credit_card_db.raw_transactions;
```
- Add Primary Key: A combination of `card_id` and `transaction_dt` is created to ensure uniqueness.

```sql
ALTER TABLE credit_card_db.silver_transactions ADD COLUMN primary_key STRING;

UPDATE credit_card_db.silver_transactions
SET primary_key = CONCAT(card_id, '_', transaction_dt);
```
- Aggregate transaction statistics per `card_id` to build the lookup table. The UCL (upper control limit) is the average of the card's last 10 transaction amounts plus three standard deviations; the lookup table also keeps the member score and the time and postal code of the most recent transaction.

```sql
WITH cte_rownum AS (
    SELECT
        card_id,
        amount,
        member_id,
        transaction_dt,
        post_code,
        ROW_NUMBER() OVER (PARTITION BY card_id ORDER BY transaction_dt DESC) AS rownum
    FROM credit_card_db.silver_transactions
),
processed_data AS (
    SELECT
        c.card_id,
        c.amount,
        c.member_id,
        m.score,
        c.transaction_dt,
        c.post_code,
        STDDEV(c.amount) OVER (PARTITION BY c.card_id ORDER BY c.transaction_dt DESC) AS std
    FROM cte_rownum c
    INNER JOIN credit_card_db.member_score m
        ON c.member_id = m.member_id
    WHERE c.rownum <= 10
)
SELECT
    card_id,
    member_id,
    ROUND(AVG(amount) + 3 * MAX(std), 0) AS UCL,
    MAX(score) AS score,
    MAX(transaction_dt) AS last_txn_time,
    MAX(post_code) AS last_txn_zip
FROM processed_data
GROUP BY card_id, member_id;
```
- Stream new transactions from Kafka into Databricks Delta tables (see the sketch below).
- Use Databricks Auto Loader for efficient incremental ingestion of raw files landing in S3.
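A minimal Structured Streaming sketch for the Kafka leg; the broker address, topic name, checkpoint path, and message schema are assumptions, while the target table matches the batch tables above:

```python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Assumed message schema; adjust to the actual Kafka payload.
txn_schema = StructType([
    StructField("card_id", StringType()),
    StructField("member_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("transaction_dt", TimestampType()),
    StructField("post_code", StringType()),
])

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<broker-host>:9092")  # placeholder broker
    .option("subscribe", "transactions")                      # assumed topic name
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), txn_schema).alias("txn"))
    .select("txn.*")
)

# Append the parsed stream to the silver Delta table.
(
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "s3://<bucket>/checkpoints/transactions/")  # placeholder path
    .outputMode("append")
    .toTable("credit_card_db.silver_transactions")
)
```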
- Flag transactions whose amount exceeds the `UCL` value from the lookup table (see the sketch below).
- Cross-check member scores and transaction patterns.
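A hedged sketch of how these rules might be applied; the lookup table name `credit_card_db.card_lookup`, the output table `credit_card_db.fraud_detection`, and the score threshold of 200 are illustrative assumptions, not values taken from the project:

```python
from pyspark.sql import functions as F

# Table names and the score threshold are assumptions for illustration.
SCORE_THRESHOLD = 200

txns = spark.table("credit_card_db.silver_transactions")
lookup = spark.table("credit_card_db.card_lookup")

flagged = (
    txns.alias("t")
    .join(lookup.alias("l"), F.col("t.card_id") == F.col("l.card_id"))
    .withColumn(
        "fraud_flag",
        # Rule 1: amount above the card's upper control limit.
        # Rule 2: member score below the (assumed) threshold.
        (F.col("t.amount") > F.col("l.UCL")) | (F.col("l.score") < F.lit(SCORE_THRESHOLD)),
    )
    .select("t.card_id", "t.member_id", "t.amount", "t.transaction_dt", "t.post_code", "fraud_flag")
)

flagged.write.format("delta").mode("overwrite").saveAsTable("credit_card_db.fraud_detection")
```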
- Databricks: For data engineering and ML pipelines.
- Delta Lake: To ensure ACID compliance and versioning.
- Apache Kafka: For real-time streaming.
- Amazon S3: As the raw data storage layer.
- PySpark/SQL: For data transformations and aggregations.
- Databricks account.
- S3 bucket with raw data files.
- Kafka cluster for streaming.
- Clone this repository:

```bash
git clone <repository-url>
```
- Configure S3 bucket permissions for Databricks.
- Set up your Databricks workspace with:
  - Unity Catalog (if applicable).
  - Delta Live Tables.
- Create required databases:

```sql
CREATE DATABASE IF NOT EXISTS credit_card_db;
```
- Deploy and test the pipeline notebooks.
- Upload raw data files to S3 buckets.
- Run the batch pipelines for initial data processing.
- Enable the Kafka streaming pipeline for real-time transaction ingestion.
- Query the fraud detection table for results (see the example below).
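For a quick check, something like the following could be run in a notebook; the table name `credit_card_db.fraud_detection` and the `fraud_flag` column follow the sketch above and are assumptions:

```python
# Inspect the most recently flagged transactions.
spark.sql("""
    SELECT card_id, member_id, amount, transaction_dt, fraud_flag
    FROM credit_card_db.fraud_detection
    WHERE fraud_flag = true
    ORDER BY transaction_dt DESC
    LIMIT 20
""").show(truncate=False)
```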
- Implement ML models for fraud detection.
- Use AWS RDS or DynamoDB for faster lookup table updates.
- Introduce a dashboard for monitoring fraud metrics.
This project is licensed under the MIT License. See `LICENSE` for details.
Contributions are welcome! Please fork the repository and submit a pull request.
Feel free to reach out for any questions or suggestions!