PS: A SAM version is available on the sam branch.
This repo aims to demonstrate how to develop AWS Glue jobs efficiently:
- Be able to develop locally
- Get a fast feedback loop
- Be able to commit with no manual copy-paste between tools
In addition, this repo shows how to deploy the AWS Glue job through a proper CI/CD pipeline using infrastructure as code.
Two options are proposed here: "Use this repo" or "Do it yourself"
- Clone this repo
git clone https://github.com/flochaz/aws-glue-job-e2e-dev-life-cycle.git
cd aws-glue-job-e2e-dev-life-cycle
- Set up a virtual env
python3 -m venv .venv
source .venv/bin/activate
- Install CDK
npm install -g aws-cdk
To run the Glue job locally, we need some specific resources, such as:
- an IAM role to assume while running the local notebook
- a Glue database to store the data
- a Glue crawler to extract the schema and data from the raw source CSV files
- Trigger the crawler ...
This CDK app deploys all of them for you, so that you are ready to work on the Glue job itself.
- Install deps
pip install -r requirements.txt
- Bootstrap account
cdk bootstrap
- Deploy Glue role, crawler etc.
cdk deploy infrastructure
The AWS Glue service offers a way to run your job remotely while developing locally, through the Interactive Sessions feature.
- Set up interactive session:
pip install -r requirements-dev.txt
SITE_PACKAGES=$(pip show aws-glue-sessions | grep Location | awk '{print $2}')
jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark # Add "--user" if getting "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_spark # Add "--user" if getting "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
- Set up the Glue role by copying the output called awsConfigUPDATE of the previous cdk deploy command into ~/.aws/config under [default]
cat ~/.aws/config
[default]
glue_role_arn=xxxxxx
- Launch notebook
jupyter notebook # add "--ip 0.0.0.0" if running in a remote IDE such as Cloud9 (PS: you will also need to open your security group for TCP connections on port 8888)
- Play with glue_job_source/data_cleaning_and_lambda.ipynb
- Commit your changes to git
- Optionally deploy your changes to dev env
cdk deploy infrastructure
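The glue_role_arn step above (copying the awsConfigUPDATE output into ~/.aws/config) can also be scripted. The following is a minimal sketch using only the Python standard library; the function name set_glue_role_arn and its parameters are hypothetical and are not part of this repo or of any AWS tool.

```python
import configparser
from pathlib import Path


def set_glue_role_arn(config_path: str, role_arn: str, profile: str = "default") -> None:
    """Add or update glue_role_arn under [profile] in an AWS config file.

    Hypothetical helper: it rewrites the whole file, so comments in the
    existing file are not preserved.
    """
    path = Path(config_path).expanduser()
    parser = configparser.RawConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        parser.add_section(profile)
    parser.set(profile, "glue_role_arn", role_arn)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        parser.write(handle)
```

A call could look like set_glue_role_arn("~/.aws/config", "<ROLE ARN FROM THE cdk deploy OUTPUT>"). Because the file is rewritten, it is probably worth backing it up first if it contains comments or hand-formatted sections.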
If you are deploying to the same account and region, you first need to destroy your dev stack to avoid resource collisions (especially the Glue role, crawler, database, etc.)
cdk destroy infrastructure
- Create a repo by deploying the pipeline stack
cdk deploy GlueJobPipelineStack
- Push code to repo
# Remove the GitHub origin
git remote remove origin
# Add the CodeCommit repo as origin
git remote add origin <YOUR CODE COMMIT REPO URL (THE COMMAND SHOULD BE FOUND IN THE PREVIOUS "cdk deploy GlueJobPipelineStack" output)>
git push -u origin master
- Observe the deployment in CodePipeline
- Get into your aws account
- Set up your online IDE: Cloud9
- Add your glue job (you can take this one for instance https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.py)
- Add interactive sessions + notebook CI/CD (optional)
- https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
- Quick hack
vim ~/.aws/config # add glue_role_arn
vim ~/.aws/credentials
jupyter notebook --ip 0.0.0.0
jupyter nbconvert --to script ./data_cleaning_and_lambda.ipynb
- Create your first CDK app
- Add glue infrastructure: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_glue_alpha/README.html
- Glue database
- Glue Role
- Glue Crawler
- Glue Job
- Add CI/CD using the official doc or workshop
- Inject config (such as output_bucket, stage, database name etc ...)
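For the notebook-to-script step in the quick hack above, note that notebooks used with interactive sessions usually contain magics (lines starting with %, such as %extra_py_files) that only make sense inside Jupyter. As an alternative to jupyter nbconvert, the sketch below extracts the code cells with the standard library and drops magic and shell lines. This is not what the repo does; notebook_to_script is a hypothetical helper, and the filtering is deliberately naive (it would also drop a continuation line that happens to start with %).

```python
import json
from pathlib import Path


def notebook_to_script(notebook_path: str) -> str:
    """Return the code cells of a notebook as one Python script, without magics."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # nbformat stores "source" either as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        lines = text.splitlines()
        first = next((line.strip() for line in lines if line.strip()), "")
        if first.startswith("%%"):
            continue  # cell magic: the whole cell is not plain Python
        # Drop line magics (%...) and shell escapes (!...).
        kept = [line for line in lines if not line.lstrip().startswith(("%", "!"))]
        if any(line.strip() for line in kept):
            chunks.append("\n".join(kept).strip("\n"))
    return "\n\n".join(chunks) + "\n"
```

The returned string can then be written to a .py file next to the notebook, for example with Path("data_cleaning_and_lambda.py").write_text(notebook_to_script("data_cleaning_and_lambda.ipynb")).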
- Add dev life cycle diagram and screenshots
- Add an example of external file inclusion in the notebook with aws s3 sync and %extra_py_files etc.
- Add integration tests to pipeline
- Describe how to add stage with manual approval
- Fix CDK unit tests
Feel free to contribute !!!