To get started with the de_challenge_2
project, follow these steps:
- Open a terminal and navigate to the project directory: `/home/ubuntu/de_challenge_2/`
- Create the project environment. Depending on your preference, you can use either `venv` or `conda`:
  - For `venv`, run: `python -m venv .venv`
  - For `conda`, run: `conda create -n de_challenge_2`
- Activate the project environment:
  - For `venv`, run: `source .venv/bin/activate`
  - For `conda`, run: `conda activate de_challenge_2`
- Install the project dependencies by running the following command: `pip install -r requirements.txt`
- Rename the `.env.example` file to `.env` and update the environment variables as needed (see the example after these steps):
  - `PROCESS_ALL`: Set to `1` to process all data or `0` to process only new data.
  - `LOGS_LEVEL`: Set the logging level to `DEBUG` if you want to see detailed logs. Otherwise, set it to `INFO`.
  - `DATA_PATH`: The path to the `data` directory.
- Unzip the `deposit.zip` file located in the `data/input` directory.
- Run the following command to execute the project: `python app/main.py`
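For reference, a minimal `.env` could look like the sketch below. The values are illustrative only; adjust them to your own setup, in particular `DATA_PATH`, which must point to your local `data` directory.

```env
# Illustrative values only - adjust to your setup.
# 1 = process all data, 0 = process only new data.
PROCESS_ALL=1
# DEBUG for detailed logs, INFO otherwise.
LOGS_LEVEL=INFO
# Path to the data directory (must match the path used when running Docker).
DATA_PATH=/home/ubuntu/de_challenge_2/data
```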
Alternatively, to run the project with Docker:

- Open a terminal and navigate to the project directory: `/home/ubuntu/de_challenge_2/`
- Build the Docker image by running the following command: `docker build -t de_challenge_2 .`
- Run the Docker container by executing the following command: `docker run --env-file /home/ubuntu/de_challenge_2/.env -v /home/ubuntu/de_challenge_2/data:/home/ubuntu/de_challenge_2/data de_challenge_2`

Replace `/home/ubuntu/de_challenge_2/data` with your local path to the `data` directory. This has to be aligned with the `DATA_PATH` value in the `.env` file.
The Entity Relationship Diagram (ERD) for the de_challenge_2
project is as follows:
The lineage for the output tables is as follows (see the sketch after this list):

- The `transaction` table is derived from the `deposit` and `withdrawal` tables.
- The `user_login` table is derived from the `event` table.
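To make the lineage concrete, here is a minimal sketch of how the two output tables could be derived in DuckDB. It is illustrative only: the database file name, the column names (`id`, `user_id`, `amount`, `currency`, `event_timestamp`, `event_type`), and the `'login'` event type are assumptions rather than the project's actual schema; the real logic lives in `app/main.py`.

```python
import duckdb

# Hypothetical database file; assumes deposit, withdrawal, and event have
# already been loaded as DuckDB tables by the extract step.
con = duckdb.connect("warehouse.duckdb")

# transaction: union of deposits and withdrawals (column names assumed).
# "transaction" is quoted because it is also an SQL keyword.
con.execute("""
    CREATE TABLE IF NOT EXISTS "transaction" AS
    SELECT id AS transaction_id, user_id, amount, currency,
           'deposit' AS transaction_type, event_timestamp
    FROM deposit
    UNION ALL
    SELECT id AS transaction_id, user_id, amount, currency,
           'withdrawal' AS transaction_type, event_timestamp
    FROM withdrawal
""")

# user_login: login events filtered out of the event table (filter assumed).
con.execute("""
    CREATE TABLE IF NOT EXISTS user_login AS
    SELECT id AS login_id, user_id, event_timestamp
    FROM event
    WHERE event_type = 'login'
""")
```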
The main concepts applied in the project are:

- ETL Process (see the pipeline sketch at the end of this section):
  - Extract: Reading data from `deposit.csv`, `withdrawl.csv`, and `event.csv` into DuckDB tables.
  - Transform: Processing and filtering data to create consolidated tables (`transaction` and `user_login`).
  - Load: Inserting processed data into new tables for analysis.
- Data Integration: Combining multiple sources into unified tables to provide a comprehensive view of user activities for further analysis.
- Incremental Loading: Efficiently processing only new data since the last run to optimize performance.
- Dedupe: Removing duplicate records to ensure data integrity and consistency when running queries.
- Indexing and Partitioning: Enhancing query performance by creating indexes on frequently queried columns and partitioning tables based on specific criteria.
  - Indexes (`transaction` table): `transaction_type`, `user_id`, `currency`, `transaction_id`, `event_timestamp`
  - Partitioning: If we're using BigQuery, we can partition by `event_timestamp` to optimize query performance.
  - Indexes (`user_login` table): `user_id`, `login_id`
- You can find the queries that answer the challenge questions in the `queries` directory.
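The sketch below shows how the incremental-loading, dedupe, and indexing ideas above could fit together around DuckDB for one source table. It is illustrative only: the environment variable name comes from the setup section, but the database file, staging tables, raw column names (`id`, `user_id`, `amount`, `currency`, `event_timestamp`), and file paths are assumptions, not the actual `app/main.py` implementation.

```python
import os

import duckdb

# Illustrative sketch only: the database file, staging tables, and raw column
# names are assumptions. Assumes the consolidated "transaction" table already
# exists from a previous run.
con = duckdb.connect("warehouse.duckdb")

# Extract: load a raw CSV into a DuckDB staging table.
con.execute("""
    CREATE OR REPLACE TABLE deposit_raw AS
    SELECT * FROM read_csv_auto('data/input/deposit.csv')
""")

# Incremental loading: with PROCESS_ALL=0, keep only rows newer than what has
# already been loaded into the consolidated table.
if os.getenv("PROCESS_ALL", "0") != "1":
    last_ts = con.execute('SELECT max(event_timestamp) FROM "transaction"').fetchone()[0]
    if last_ts is not None:
        con.execute("DELETE FROM deposit_raw WHERE event_timestamp <= ?", [last_ts])

# Dedupe: keep a single row per raw id before loading.
con.execute("""
    CREATE OR REPLACE TABLE deposit_clean AS
    SELECT * EXCLUDE (rn) FROM (
        SELECT *, row_number() OVER (PARTITION BY id ORDER BY event_timestamp) AS rn
        FROM deposit_raw
    ) WHERE rn = 1
""")

# Load: insert the cleaned rows into the consolidated "transaction" table
# (quoted because transaction is also an SQL keyword).
con.execute("""
    INSERT INTO "transaction"
    SELECT id AS transaction_id, user_id, amount, currency,
           'deposit' AS transaction_type, event_timestamp
    FROM deposit_clean
""")

# Indexing: index the columns that are queried most often.
con.execute('CREATE INDEX IF NOT EXISTS idx_tx_user_id ON "transaction" (user_id)')
con.execute('CREATE INDEX IF NOT EXISTS idx_tx_type ON "transaction" (transaction_type)')
```

Withdrawals and events would follow the same pattern, and in BigQuery the index statements would be replaced by partitioning on `event_timestamp` as noted above.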