
Xero-data-engineer-take-home-interview

Take-home problem for a Xero interview; includes a simple ETL through Airflow and a dashboard through Dash.

Airflow ETL

Dashboard Analytics

Usage

First, make sure you have Docker (with docker-compose) installed.

Then run the following commands in the root of this project:

docker-compose build
docker-compose up -d
bash init_etl.sh

Once the init_etl script has completed, check the Airflow ETL to see the log data being processed via a batch ETL.

Once the Airflow ETL has completed, check the results of the analysis on this Plotly dashboard. NOTE: the dashboard will not run until the ETL has completed at least once; you can check the progress using the Airflow ETL link above.

Project Structure

The project is managed via docker-compose, which creates 3 services:

  1. A PostgreSQL database that acts as a local data warehouse.
  2. An Airflow scheduler that acts as the ETL process for transforming the data. The UI for this can be seen at localhost:8080. Each task logs the SQL it runs, for debugging and documentation (see the sketch after this list).
  3. A dashboard (via Dash) that shows the results of the analysis. Normally, in a production setup I'd recommend not rolling out a custom dashboarding solution and instead using an existing BI tool (Tableau, Periscope, Looker).
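
As an illustration of how an Airflow task can log the SQL it runs (point 2 above), here is a minimal sketch. The DAG id, table names, and connection details are hypothetical and not taken from this repo; the actual tasks live in the etl folder.

import logging
from datetime import datetime

import psycopg2
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

# Hypothetical transformation; the real SQL lives in the etl folder.
TRANSFORM_SQL = """
INSERT INTO page_views_daily (day, views)
SELECT date_trunc('day', event_time), count(*)
FROM raw_logs
GROUP BY 1;
"""

def run_transform():
    # Log the SQL so it shows up in the task log for debugging and documentation.
    logging.info("Running transformation:\n%s", TRANSFORM_SQL)
    conn = psycopg2.connect(host="postgres", dbname="warehouse",
                            user="airflow", password="airflow")
    with conn, conn.cursor() as cur:  # commits on successful exit
        cur.execute(TRANSFORM_SQL)
    conn.close()

dag = DAG("log_etl", start_date=datetime(2020, 1, 1), schedule_interval=None)

transform_logs = PythonOperator(
    task_id="transform_logs",
    python_callable=run_transform,
    dag=dag,
)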

The codebase is divided into 4 main folders:

  1. etl: This is the Airflow ETL component. It manages the tasks that transform and load the log data into Postgres.
  2. xero_dash: This is the visualization component. It holds the business logic and graphing code for the analysis. It is written as its own Python module.
  3. exploration: This is the first-pass exploration of the dataset. It serves as a log of the workflow/thought process of solving the problem. This folder exists mostly to support reproducible research.
    • This can be run from the root folder via: venv/bin/python exploration/explore_data.py
  4. libraries: These are reusable code snippets shared between the different services. Currently it holds the database querying code (see the sketch after this list).
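
To make item 4 concrete, here is a rough sketch of what a reusable querying helper in libraries might look like. The function name, environment variables, and connection defaults are assumptions for illustration, not the repo's actual API.

import os

import psycopg2
import psycopg2.extras

def run_query(sql, params=None):
    """Run a SQL query against the warehouse and return rows as dicts."""
    conn = psycopg2.connect(
        host=os.getenv("POSTGRES_HOST", "postgres"),
        dbname=os.getenv("POSTGRES_DB", "warehouse"),
        user=os.getenv("POSTGRES_USER", "airflow"),
        password=os.getenv("POSTGRES_PASSWORD", "airflow"),
    )
    try:
        with conn, conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()

# Example usage from the dashboard side (table name is hypothetical):
# rows = run_query("SELECT day, views FROM page_views_daily ORDER BY day")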

Development Workflow

Most dev work should be done within the docker-compose setup so that it best matches what would be deployed to production. The setup is configured so that every file change refreshes xero_airflow and xero_dashboard, allowing for a fast feedback loop.

For dev work that cannot be done within the Docker containers, you need to have the virtualenv set up:

python -m virtualenv venv --python python3
venv/bin/pip install pip-tools
venv/bin/python -m piptools compile
venv/bin/pip install -r requirements.txt

Package Management

We use pip-tools (https://github.com/jazzband/pip-tools) to manage packages. This means that to add a new package, it needs to be added to requirements.in, and then a new requirements.txt needs to be compiled:

venv/bin/python -m piptools compile

Then make sure you redo the pip install:

venv/bin/pip install -r requirements.txt

Next Steps

Testing

  • Setting up some end-to-end test cases with mock data would be a smart next step; I avoided it here because of the time commitment of setting it up. A rough sketch of what such a test could look like is shown below.
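
Sketch of the kind of end-to-end test meant above, assuming a throwaway test database and hypothetical raw_logs / page_views_daily tables (all names here are illustrative, not from this repo):

import psycopg2
import pytest

@pytest.fixture
def db():
    # Hypothetical throwaway test database; connection details are assumptions.
    conn = psycopg2.connect(host="localhost", dbname="warehouse_test",
                            user="airflow", password="airflow")
    yield conn
    conn.close()

def test_daily_rollup(db):
    with db, db.cursor() as cur:
        cur.execute("TRUNCATE raw_logs, page_views_daily")
        # Load a couple of mock log rows.
        cur.execute(
            "INSERT INTO raw_logs (event_time, page) VALUES "
            "('2020-01-01 10:00', '/home'), ('2020-01-01 11:00', '/home')"
        )
        # Run the (hypothetical) daily rollup transformation.
        cur.execute(
            "INSERT INTO page_views_daily (day, views) "
            "SELECT date_trunc('day', event_time), count(*) FROM raw_logs GROUP BY 1"
        )
        cur.execute("SELECT views FROM page_views_daily WHERE day = '2020-01-01'")
        assert cur.fetchone()[0] == 2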

Setup Prod Infrastructure

  • Currently, this entire project runs locally using a local Postgres. Moving forward, it might make sense to persist the database on a cloud service provider (AWS RDS, for example).

Clean up temp files

  • Running Airflow locally dumps all its files into the project's root directory. It would be better to use a Docker volume so that the repo doesn't fill up with noise.

