Before starting to work on the assignment, please make sure you fork this repository.
This is an assignment for a Data Engineer role. You are requested to:
- Read and understand the requirements. You may contact the interviewer for further clarification
- Write code that answers the objectives
- Deploy the code to the provided AWS account
You are a Data Engineer in a financial institution. Your task is to calculate and answer the business questions (objectives) provided by the analyst team. You are provided here with a small dataset, but in a real-world scenario you will have a huge dataset, so the code needs to be deployed and run in a cloud environment.
- The file `stocks_data.csv` contains the daily closing prices of a few stocks traded on the NYSE/NASDAQ
- Load the file as a DataFrame, Dataset, or RDD and complete the assignment objectives
- The result of each question should be saved as a separate file in an S3 Bucket
- Use only the closing price to determine returns
- If a price is missing on a given date, you can compute returns from the closest available date
- Return can be trivially computed as the % difference of two prices
Objectives:

- Compute the average daily return of all stocks for every date (a PySpark sketch follows this list)

  | date | average_return |
  |------|----------------|
  | yyyy-MM-dd | return of all stocks on that date |

- Which stock was traded with the highest worth, as measured by closing price * volume, on average?

  | ticker | value |
  |--------|-------|

- Which stock was the most volatile, as measured by the annualized standard deviation of daily returns?

  | ticker | standard_deviation |
  |--------|--------------------|

- What were the top three 30-day return dates, as measured by % increase in closing price compared to the closing price 30 days prior? Present the top three ticker and date combinations.

  | ticker | date |
  |--------|------|
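To make the expected computation concrete, here is a minimal PySpark sketch of the daily-return logic and of the first objective. The column names (`ticker`, `date`, `close`), the bucket name, and the output path are assumptions; adapt them to the actual file and to your own bucket.

```python
# Minimal sketch, assuming the CSV has `ticker`, `date` (yyyy-MM-dd) and `close`
# columns and that results go to an assumed bucket name. Daily return is the
# % difference between a close and the closest previous available close per ticker.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("stocks-assignment").getOrCreate()

prices = (
    spark.read.csv(
        "s3://data-engineer-assignment-my-name/input/stocks_data.csv",
        header=True,
        inferSchema=True,
    )
    .withColumn("date", F.to_date("date", "yyyy-MM-dd"))
)

# lag() over the dates actually present handles missing prices: the previous
# row is simply the closest available earlier date for that ticker.
w = Window.partitionBy("ticker").orderBy("date")
returns = (
    prices
    .withColumn("prev_close", F.lag("close").over(w))
    .withColumn("daily_return", (F.col("close") / F.col("prev_close") - 1) * 100)
    .where(F.col("daily_return").isNotNull())
)

# Objective 1: average daily return of all stocks for every date,
# written as a single CSV file per the "separate file" requirement.
(
    returns.groupBy("date")
    .agg(F.avg("daily_return").alias("average_return"))
    .orderBy("date")
    .coalesce(1)
    .write.csv(
        "s3://data-engineer-assignment-my-name/results/average_daily_return/",
        mode="overwrite",
        header=True,
    )
)

# The same daily_return column feeds the volatility objective, e.g.
# stddev(daily_return) * sqrt(252) per ticker for an annualized figure.
```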
- At Vi, we manage and provision cloud infrastructure through definition files (IaC). Please use IaC (such as CloudFormation) to deploy the code you created and to carry out the tasks below.
- Infrastructure tasks:
- Create a Glue job
- Create a Glue Catalog Database
- Create a Glue Catalog Table for each result file
- Create crawler(s)
- Expected result: the results of the questions (objectives) should be queryable from Athena (a Python sketch of these resources follows this list)
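For orientation only, the required Glue resources can be sketched with the AWS CDK's low-level `Cfn*` constructs in Python, which map one-to-one onto the CloudFormation types `AWS::Glue::Job`, `AWS::Glue::Database`, and `AWS::Glue::Crawler`. This is not the deliverable format: the submission should still be a stack file deployed via `create-update-stack.sh`. All names, role ARNs, and S3 paths below are placeholders.

```python
# Illustrative-only sketch using AWS CDK v2 low-level (Cfn*) constructs.
# All names, role ARNs and S3 paths are placeholders; the actual submission
# should be a stack file deployed with create-update-stack.sh.
from aws_cdk import App, Stack, aws_glue as glue
from constructs import Construct


class StocksAssignmentStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Glue Catalog Database that will hold one table per result file.
        glue.CfnDatabase(
            self, "ResultsDatabase",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                name="stocks_results_db",
            ),
        )

        # Glue job running the PySpark script that answers the objectives.
        glue.CfnJob(
            self, "StocksJob",
            name="stocks-assignment-job",
            role="arn:aws:iam::111111111111:role/placeholder-glue-job-role",
            glue_version="4.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://data-engineer-assignment-my-name/scripts/stocks_job.py",
            ),
        )

        # Crawler that populates a catalog table for each result prefix,
        # making the results queryable from Athena.
        glue.CfnCrawler(
            self, "ResultsCrawler",
            role="arn:aws:iam::111111111111:role/placeholder-glue-crawler-role",
            database_name="stocks_results_db",
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[glue.CfnCrawler.S3TargetProperty(
                    path="s3://data-engineer-assignment-my-name/results/",
                )],
            ),
        )


app = App()
StocksAssignmentStack(app, "data-engineer-assignment")
app.synth()
```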
- If you encounter issues with reading/writing from/to S3 buckets, it is recommended to include your name in the bucket's name, for example: “data-engineer-assignment-my-name”
- Use the AWS credentials provided in the email to deploy the code. DO NOT COMMIT THEM IN THE CODE.
- Deploy your resources in the Europe (Frankfurt) region (eu-central-1)
- Create a local `.env` file with the following environment variables:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - STACK_NAME
- Use the `create-update-stack.sh` script in the repo to deploy your stack file. A demo stack file is provided in the repo (an optional local sanity check is sketched below)
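Optionally, a small Python snippet (assuming `python-dotenv` and `boto3` are installed) can verify locally that the `.env` values are picked up before you run the deployment script; this is purely a convenience and not part of the required flow.

```python
# Optional local sanity check (assumes python-dotenv and boto3 are installed).
# Reads the same .env values used for deployment; nothing here is committed
# to the repository.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # loads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, STACK_NAME

session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name="eu-central-1",  # Europe (Frankfurt), per the assignment
)

# Confirm the credentials work and the target stack name is set.
print(session.client("sts").get_caller_identity()["Account"])
print("Stack:", os.environ["STACK_NAME"])
```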
Please share your GitHub repo by replying to the email, and include any assumptions you made or additional information you think is relevant.
You will be evaluated on:
- Objectives completion
- Code quality & efficiency
- AWS deployment