README

To get started with the de_challenge_1 project, follow these steps:

Run Locally

Open a terminal and navigate to the project directory: /home/ubuntu/de_challenge_1/
Create the project environment using either venv or conda:
- For venv, run:
```
python -m venv .venv
```
- For conda, run:
```
conda create -n de_challenge_1
```
Activate the project environment. Depending on your operating system, you can use either venv or conda:
- For venv, run:
```
source .venv/bin/activate
```
- For conda, run:
```
conda activate de_challenge_1
```
Install the project dependencies by running the following command:
```
pip install -r requirements.txt
```
Rename the .env.example file to .env and update the environment variables as needed.
- BITSO_API_URL: The URL of the Bitso API.
- LOGS_LEVEL: Set the logging level to DEBUG if you want to see detailed logs. Otherwise, set it to INFO.
Run the following command to execute the project:
```
python app/main.py
```
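The two variables described in the steps above could be read at startup roughly as follows. This is a minimal sketch, not the project's actual code: the `load_settings` helper, the fallback to INFO for missing or unrecognized levels, and the placeholder URL are assumptions made for illustration.

```python
import logging
import os


def load_settings(env=os.environ) -> dict:
    """Read the documented variables (hypothetical helper, not from the repo)."""
    level_name = env.get("LOGS_LEVEL", "INFO").upper()
    # Assumption: anything other than DEBUG or INFO falls back to INFO.
    if level_name not in ("DEBUG", "INFO"):
        level_name = "INFO"
    return {
        "api_url": env.get("BITSO_API_URL", ""),
        "log_level": getattr(logging, level_name),
    }


# Placeholder values for illustration only; real values come from the .env file.
settings = load_settings(
    {"BITSO_API_URL": "https://example.invalid", "LOGS_LEVEL": "debug"}
)
logging.basicConfig(level=settings["log_level"])
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.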

Run on Docker

Open a terminal and navigate to the project directory: /home/ubuntu/de_challenge_1/
Build the Docker image by running the following command:
```
docker build -t de_challenge_1 .
```
Run the Docker container by executing the following command:
```
docker run --env-file /home/ubuntu/de_challenge_1/.env -v /home/ubuntu/de_challenge_1/data:/home/ubuntu/de_challenge_1/data de_challenge_1
```
Replace /home/ubuntu/de_challenge_1/data with your local path to the data directory. This path must match the one set in the .env file.

Partitioning of the data

To partition the data, de_challenge_1 uses the following directory structure:

/book=book_name/year=year/month=month/day=day/hour=hour/minute=minute/order_book.csv

This partitioning scheme is hierarchical and based on the timestamp of the data. This approach is well-suited for easy analysis and querying of the data using tools like Amazon Athena. The data is partitioned by the following columns:

- book: The name of the book.
- year: The year of the timestamp.
- month: The month of the timestamp.
- day: The day of the timestamp.
- hour: The hour of the timestamp.
- minute: The minute of the timestamp.

For example, the following directory structure shows how the data is partitioned:

/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=00/order_book.csv
/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=10/order_book.csv
/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=20/order_book.csv
...
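As an illustration of the layout above, a path of this shape can be derived from a book name and a timestamp. This is a sketch under stated assumptions rather than the project's implementation: the `partition_path` function and its `root` parameter are hypothetical, and the timestamp is used as given, with no rounding to a fixed interval.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(book: str, ts: datetime, root: str = "data") -> PurePosixPath:
    """Build a key=value partition path (hypothetical helper, not from the repo)."""
    return PurePosixPath(
        root,  # assumed base directory
        f"book={book}",
        f"year={ts:%Y}",
        f"month={ts:%m}",  # zero-padded, e.g. 07
        f"day={ts:%d}",
        f"hour={ts:%H}",
        f"minute={ts:%M}",
        "order_book.csv",
    )


path = partition_path("mxn_btc", datetime(2024, 7, 15, 0, 10, tzinfo=timezone.utc))
print(path)
```

For the inputs shown, this prints `data/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=10/order_book.csv`. Zero-padding each component keeps the directories in chronological order when sorted by name.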

Advantages of this partitioning scheme:

- Query performance: The hierarchical partitioning aligns well with typical query patterns in Athena, allowing for efficient time-based filtering and analysis.
- Scalability: Keeps partitions small and manageable, which is crucial for maintaining query performance in Athena.
- Organization: The deep directory structure creates a clear hierarchy, making it easier to navigate and understand the organization of your data.
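The time-based filtering mentioned above relies on the key=value directory names: a filter on partition columns can rule out whole directories without opening any file. The sketch below illustrates that idea in plain Python. It is not how Athena is implemented; `parse_partitions` is a hypothetical helper, and the third path (day=16) is made up for the example.

```python
def parse_partitions(path: str) -> dict:
    """Extract key=value directory segments from a partitioned path."""
    segments = path.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


paths = [
    "/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=00/order_book.csv",
    "/book=mxn_btc/year=2024/month=07/day=15/hour=00/minute=10/order_book.csv",
    # Hypothetical extra partition, added only to show one being skipped.
    "/book=mxn_btc/year=2024/month=07/day=16/hour=00/minute=00/order_book.csv",
]

# Keep only the paths whose partition values match every requested key.
wanted = {"year": "2024", "month": "07", "day": "15"}
selected = [p for p in paths if wanted.items() <= parse_partitions(p).items()]
```

Here `selected` holds the two day=15 paths; the day=16 path is excluded from its directory name alone.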

Assumptions

This process relies on the following assumptions:

Latency is not a concern. If it were, we would need to consider a different approach to API calls.
