This starter kit is a curated collection of resources, tools, and best practices for getting started with data engineering. It is designed to teach you the basics and get you up and running with your first data engineering project.
├── Makefile | -> Allows you to run commands for setup, test, lint, etc
├── README.md | -> Documentation for the project setup and usage
│
├── automation | -> Contains scripts to automate deployment and testing
│ └── deploy_glue_job.sh | -> Script to deploy or update glue job
│
├── examples | -> Contains example scripts to demonstrate pyspark features
│ ├── 01_pyspark_dataframe | -> Create a DataFrame by reading data from a source (CSV, Parquet, database, etc)
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to read csv file and write to parquet
│ ├── 02_applying_filters | -> Apply filters on a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to apply filters on dataframe
│ ├── 03_transform_columns | -> Transform columns & manipulate data in a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to transform columns
│ ├── 04_remap_columns | -> Normalise columns in a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to normalise columns in a dataframe
│ ├── 05_complex_transformations | -> Perform complex transformations on a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to perform some complex transformations
│ ├── 06_write_dataframe | -> Write a dataframe to a target
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to write dataframe to parquet or RDBMS Database
│ ├── 07_pyspark_in_glue_jobs | -> Examples of using PySpark in AWS Glue Jobs
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to run pyspark script in glue job
│ ├── 08_glue_dynamic_frame | -> Create a DynamicFrame by reading data from a data catalog
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to create a dynamic frame from a data catalog
│ ├── 09_apply_mappings | -> Apply mappings on a dynamic frame (change column names, data types, etc)
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to apply mappings on dynamic frame
│ └── 10_write_to_target | -> Write a dynamic frame to a target (CSV, Parquet, Database, etc)
│ ├── README.md | -> Contains instructions to run the example
│ └── main.py | -> Example script to write dynamic frame to parquet and store in S3
│
└── src | -> Contains all the source code for the onboarding exercise
├── data | -> Contains data files for the onboarding exercise
│ ├── customers.csv | -> Customer Dataset CSV file
│ ├── survey_results_public.csv | -> Stackoverflow Survey CSV file
│ └── survey_results_public.parquet | -> Stackoverflow Survey Parquet file
│
└── scripts | -> Contains all the Glue scripts for the exercises
├── a_stackoverflow_survey | -> A sample glue script to read, apply mappings, transform data
│ └── main.py
├── b_fix_this_script | -> A broken glue script for you to fix
│ ├── README.md
│ └── main.py
└── c_top_spotify_tracks | -> A task for you to complete. Best of luck!
└── README.md
Step 1: Clone this repository and install required packages
$ make install
Step 2: Clone AWS Glue Python Lib
The AWS Glue libraries are not available via pip, so we need to install them manually.
# Clone the master branch for Glue 4.0
$ git clone https://github.com/awslabs/aws-glue-libs.git
$ export AWS_GLUE_HOME=$(pwd)/aws-glue-libs
Step 3: Install Apache Maven
$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz
$ tar -xvf apache-maven-3.6.0-bin.tar.gz
$ ln -s apache-maven-3.6.0-bin maven
$ export MAVEN_HOME=$(pwd)/maven
Step 4: Install Apache Spark
$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-4.0/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz -o spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
$ tar -xvf spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
$ ln -s spark-3.3.0-amzn-1-bin-3.3.3-amzn-0 spark
$ export SPARK_HOME=$(pwd)/spark
Step 5: Export Paths
$ export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
Verify the installation by running:
$ mvn --version
$ pyspark --version
Step 6: Download Glue ETL .jar files
$ cd $AWS_GLUE_HOME
$ mvn install dependency:copy-dependencies
$ cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/
$ cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/
After this step, you should be able to execute gluepyspark, gluepytest, and gluesparksubmit from your shell.
Known issue: a tools.jar error can occur at this point; see the YouTube solution.
Run a Glue script locally with:
$ gluesparksubmit src/scripts/main.py
To run all test suites run:
$ make test
To generate an HTML coverage report run:
$ python3 -m coverage html