Init code
sarit-si committed Apr 21, 2021
1 parent e4c06a2 commit ca991d2
Showing 24 changed files with 3,728 additions and 1 deletion.
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
.vscode
setup-airflow/logs
setup-airflow/plugins
setup-pentaho/logs
__pycache__
.meta
.env
# jdbc.properties
100 changes: 99 additions & 1 deletion README.md
@@ -1 +1,99 @@
docker-airflow-pdi-01
# Description

A step-by-step approach to easily dockerize Airflow and Pentaho Data Integration **IN SEPARATE CONTAINERS**.
Below is the high-level architecture of the setup:
- Airflow:
    - Orchestrator container
    - Sends transformation/job metadata as a task to the Pentaho container (see the example Carte call below)

- Pentaho:
    - Container receives the transformation/job details as the task to be done
    - Performs (runs) the actual task (transformation/job)
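
For illustration, the kind of Carte call such a task ultimately boils down to could look like the sketch below (the ```executeTrans``` endpoint is documented in the Carte REST API reference linked at the end of this README; the repository and transformation names here are hypothetical). From inside the Airflow containers the base URL comes from ```PDI_CONN_STR```, i.e. ```http://pdi-master:8181```; from the host it is simply ```localhost```:

    curl -u cluster:cluster \
      "http://localhost:8181/kettle/executeTrans/?rep=my-repo&trans=my_transformation&level=Basic"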


# Pre-requisites
- [Docker Engine](https://docs.docker.com/engine/install/)
- [Docker Compose](https://docs.docker.com/compose/install/)

# Versions
- Airflow 2.0
- PDI 9.1

# Setup
Change directory to the project folder before performing the steps below.

### Environment variables, files & folders for containers
- Create a .env file and add the user and group IDs for the respective containers.
This is required for the containers to have the same access privileges as the host user during docker compose.

        echo -e "PENTAHO_UID=$(id -u)\nPENTAHO_GID=0\nAIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

- If needed, append any of the optional variables below to the above .env file (a complete sample .env is shown after this list).

        echo -e "<variable name>=<value>" >> .env

    - HOST_ENV --> Run the containers as localhost/dev/qa/prod. This copies the corresponding kettle.properties into the PDI container and also enables PDI transformations to pick environment-specific DB JNDI connections during execution. It can be used by Airflow to connect to the corresponding resources.
    - CARTE_USER --> Default: cluster
    - CARTE_PASSWORD --> Default: cluster
    - AIRFLOW_ADMIN_USER --> Creates the Web UI user. Default: airflow
    - AIRFLOW_ADMIN_PASSWORD --> Default: airflow
    - AIRFLOW_ADMIN_EMAIL --> Required if a new user is to be created
    - PENTAHO_DI_JAVA_OPTIONS --> Allocates JVM memory to the PDI container, based on the host machine RAM. Increase it if the container crashes due to GC out-of-memory errors. Ex: for min. 1G and max. 4G, set this to "-Xms1g -Xmx4g"
    - CARTE_HOST_PORT --> Default: 8181
    - AIRFLOW_HOST_PORT --> Default: 8080

- Create the folders below for the container volumes to bind:

        mkdir ./setup-airflow/logs ./setup-airflow/plugins ./setup-pentaho/logs


- Source Code
Since the DAG/PDI source code files might undergo frequent updates, they are not copied into the container during the image build; instead they are mounted via docker compose. Any update to these source files on the host automatically becomes visible inside the container.

    - Airflow:
        - The default folder for DAGs on the host is ./source-code/dags
        - Replace this default folder in the docker compose file with the desired folder location on the host.
        - Place all the DAG files in that host dags folder.

    - Pentaho:
        - The default folder for ktr/kjb files on the host is ./source-code/ktrs
        - Replace this default folder in the docker compose file with the desired folder location on the host.
        - Place all the PDI files in that host ktrs folder.
        - Update the repositories.xml file accordingly, to make them visible to Carte.
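
For reference, a complete .env combining the required IDs with the optional settings might look like the sample below (all values are illustrative; the UID values come from running id -u on the host):

    PENTAHO_UID=1000
    PENTAHO_GID=0
    AIRFLOW_UID=1000
    AIRFLOW_GID=0
    HOST_ENV=qa
    CARTE_USER=cluster
    CARTE_PASSWORD=cluster
    AIRFLOW_ADMIN_USER=airflow
    AIRFLOW_ADMIN_PASSWORD=airflow
    [email protected]
    PENTAHO_DI_JAVA_OPTIONS=-Xms1g -Xmx4g
    CARTE_HOST_PORT=8181
    AIRFLOW_HOST_PORT=8080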

### Build & Deploy
The command below builds the images (on the first run) and starts all the services.

    docker-compose up

To run it as a daemon, add the -d option.
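
A typical workflow with the standard docker-compose commands, for example:

    # build (first run) and start all services in the background
    docker-compose up -d

    # check that the containers are up and healthy
    docker-compose ps

    # follow the logs of a single service, e.g. the Carte master
    docker-compose logs -f pdi-master

    # stop and remove the containers (the postgres-db-volume named volume is preserved)
    docker-compose down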

# Web UI
- If not running on localhost, replace localhost with the server endpoint URL.
- If not using the default ports below, replace them with the values set via CARTE_HOST_PORT & AIRFLOW_HOST_PORT.

Airflow Webserver

    localhost:8080/home

Carte Webserver

    localhost:8181/kettle/status
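
Both endpoints can also be checked from the command line, assuming the default ports and Carte credentials:

    # Airflow webserver health endpoint (the same check used by the docker-compose healthcheck)
    curl http://localhost:8080/health

    # Carte status page, secured with CARTE_USER/CARTE_PASSWORD (default cluster/cluster)
    curl -u cluster:cluster http://localhost:8181/kettle/status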

# Best practices
- ```jdbc.properties``` file, which contains database access credentials, has been included in this repo for reference purposes only. In actual development this should be avoided and the file added to .gitignore instead. After the first code pull to a server, update it with all the JNDI details before running docker compose.

- ```.env``` file may also contain sensitive information, such as environment-dependent access keys. It should likewise be added to the .gitignore file; instead, create it with the necessary parameters on the server at build/deploy time.

- ```HOST_ENV```: setting this parameter gives the flexibility to choose the appropriate ```kettle.properties``` file. For example, QA and PROD mailing server SMTP details may differ; these can be kept in separate kettle properties files, selected dynamically based on the host environment. In addition, if the ```jdbc.properties``` file is used, the PDI container can dynamically select the correct JNDI entry from it. For example, to test a transformation in the QA environment with a Postgres JNDI connection encoded as ```db-${HOST_ENV}```, running the PDI service with ```HOST_ENV=qa``` resolves the connection to the ```db-qa``` database JNDI, so QA data is used for testing (see the sample entries after this list).

- ```PENTAHO_DI_JAVA_OPTIONS```: having this option lets the user tweak the amount of memory PDI gets inside the container to run a task. Depending on the host machine memory and the average task complexity, it can be adjusted to avoid PDI container crashes due to "GC Out of Memory" errors. If the host machine has ample RAM and the PDI container is crashing due to the default memory limits, increase them by setting ```PENTAHO_DI_JAVA_OPTIONS=-Xms1g -Xmx4g```, 1 GB and 4 GB being the lower and upper limits respectively.
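
As an illustration of the ```HOST_ENV``` based JNDI selection, the simple-jndi ```jdbc.properties``` file could hold one entry set per environment (the connection names, hosts and credentials below are hypothetical):

    db-qa/type=javax.sql.DataSource
    db-qa/driver=org.postgresql.Driver
    db-qa/url=jdbc:postgresql://qa-db-host:5432/warehouse
    db-qa/user=qa_user
    db-qa/password=qa_password

    db-prod/type=javax.sql.DataSource
    db-prod/driver=org.postgresql.Driver
    db-prod/url=jdbc:postgresql://prod-db-host:5432/warehouse
    db-prod/user=prod_user
    db-prod/password=prod_password

A transformation whose connection is named ```db-${HOST_ENV}``` then resolves to ```db-qa``` when the stack is started with ```HOST_ENV=qa```.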

# References & Credits
- [What is Carte Server?](https://wiki.pentaho.com/display/EAI/Carte+User+Documentation)

- [Configure Carte Server](https://help.pentaho.com/Documentation/8.0/Products/Data_Integration/Carte_Clusters/060)

- [Set Repository on the Carte Server](https://help.pentaho.com/Documentation/9.1/Products/Use_Carte_Clusters)

- [Carte APIs to trigger kettle transformation/jobs](https://help.pentaho.com/Documentation/9.1/Developer_center/REST_API_Reference/Carte)

- [Scheduling a PDI job using Dockerized Airflow](https://diethardsteiner.github.io/pdi/2020/04/01/Scheduling-a-PDI-Job-on-Apache-Airflow.html)
151 changes: 151 additions & 0 deletions docker-compose.yaml
@@ -0,0 +1,151 @@
version: '3'
x-pdi-common:
  &pdi-common
  build:
    context: ./setup-pentaho
    dockerfile: Dockerfile
    args:
      PENTAHO_UID: ${PENTAHO_UID}
      PENTAHO_GID: ${PENTAHO_GID}
  image: pdi
  environment:
    &pdi-common-env
    PENTAHO_DI_JAVA_OPTIONS: ${PENTAHO_DI_JAVA_OPTIONS}
    CARTE_USER: ${CARTE_USER}
    CARTE_PASSWORD: ${CARTE_PASSWORD}
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./source-code/ktrs:/home/pentaho/repositories
    - ./setup-pentaho/logs:/opt/data-integration/logs
    - ./setup-pentaho/repositories.xml:/opt/data-integration/.kettle/repositories.xml
    - ./setup-pentaho/kettle-properties/${HOST_ENV:-localhost}-kettle.properties:/opt/data-integration/.kettle/kettle.properties
    - ./setup-pentaho/simple-jndi:/opt/data-integration/simple-jndi
  deploy:
    restart_policy:
      condition: on-failure
      max_attempts: 3

x-airflow-common:
  &airflow-common
  build: ./setup-airflow
  image: airflow
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflow-database/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow-database/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@airflow-broker:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    PDI_CONN_STR: http://${CARTE_USER:-cluster}:${CARTE_PASSWORD:-cluster}@pdi-master:${CARTE_HOST_PORT:-8181}
  volumes:
    - ./source-code/dags:/opt/airflow/dags
    - ./setup-airflow/plugins:/opt/airflow/plugins
    - ./setup-airflow/logs:/opt/airflow/logs
    - ./setup-airflow/execute-carte.sh:/opt/airflow/execute-carte.sh
    - ./setup-airflow/airflow.cfg:/opt/airflow/airflow.cfg
  user: "${AIRFLOW_UID}:${AIRFLOW_GID}"
  depends_on:
    airflow-broker:
      condition: service_healthy
    airflow-database:
      condition: service_healthy


services:
  # Airflow-DB
  airflow-database:
    image: postgres:13
    container_name: airflow-database
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  # Airflow-messenger
  airflow-broker:
    image: redis:latest
    container_name: airflow-broker
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  # Airflow-webserver
  airflow-webserver:
    <<: *airflow-common
    container_name: airflow-webserver
    command: webserver
    ports:
      - ${AIRFLOW_HOST_PORT:-8080}:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:${AIRFLOW_HOST_PORT:-8080}/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  # Airflow-scheduler
  airflow-scheduler:
    <<: *airflow-common
    container_name: airflow-scheduler
    command: scheduler
    restart: always

  # Airflow-worker
  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  # Airflow-DB-initialize
  airflow-init:
    <<: *airflow-common
    container_name: airflow-init
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${AIRFLOW_ADMIN_USER:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${AIRFLOW_ADMIN_PASSWORD:-airflow}
      _AIRFLOW_WWW_USER_EMAIL: ${AIRFLOW_ADMIN_EMAIL:[email protected]}

  # Pentaho
  pdi-master:
    << : *pdi-common
    container_name: pdi-master
    environment:
      <<: *pdi-common-env
    ports:
      - ${CARTE_HOST_PORT:-8181}:8181

  # pdi-child:
  #   << : *pdi-common
  #   container_name: pdi-child
  #   ports:
  #     - 8182
  #   depends_on:
  #     - pdi-master
  #   environment:
  #     <<: *pdi-common-env
  #     CARTE_PORT: 8182
  #     CARTE_IS_MASTER: 'N'
  #     CARTE_INCLUDE_MASTERS: 'Y'
  #     CARTE_MASTER_HOSTNAME: 'pdi-master'
  #     CARTE_MASTER_PORT: ${CARTE_HOST_PORT:-8181}

volumes:
  postgres-db-volume:
14 changes: 14 additions & 0 deletions setup-airflow/Dockerfile
@@ -0,0 +1,14 @@
FROM apache/airflow:2.0.1

USER root

# Install environment dependencies
RUN apt-get update \
    # the xmlstarlet package is required by Airflow to read the XML logs generated by the Carte server running in a separate container
    && apt-get install xmlstarlet -y \
    # Upgrade PIP
    && pip install --upgrade pip \
    # Install project specific packages
    && pip install 'apache-airflow[postgres]'

USER airflow
