Showing 24 changed files with 3,728 additions and 1 deletion.
`.gitignore` (new file):

```
.vscode
setup-airflow/logs
setup-airflow/plugins
setup-pentaho/logs
__pycache__
.meta
.env
# jdbc.properties
```
`README.md`:
docker-airflow-pdi-01

# Description

Step-by-step approach to easily dockerize Airflow and Pentaho Data Integration **IN SEPARATE CONTAINERS**.
Below is the high-level architecture of the setup:
- Airflow:
  - Orchestrator container
  - Sends transformation/job metadata as a task to the Pentaho container
- Pentaho:
  - Container receives transformation/job details as a task to be done
  - Performs (runs) the actual task (transformation/job)

# Pre-requisites
- [Docker Engine](https://docs.docker.com/engine/install/)
- [Docker Compose](https://docs.docker.com/compose/install/)

# Versions
- Airflow 2.0
- PDI 9.1

# Setup
Change directory to the project folder before performing the steps below.

### Environment variables, files & folders for containers
- Create a `.env` file and add the user and group IDs for the respective containers.
  This is required for the containers to have the same access privileges as the host user during docker compose.

  ```
  echo -e "PENTAHO_UID=$(id -u)\nPENTAHO_GID=0\nAIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
  ```

- If needed, append the below optional variables to the same `.env` file.

  ```
  echo -e "<variable name>=<value>" >> .env
  ```

  - HOST_ENV --> Run containers as localhost/dev/qa/prod. This copies the corresponding kettle.properties into the PDI container and enables PDI transformations to pick environment-specific DB JNDI connections during execution. It can also be used by Airflow to connect to the corresponding resources.
  - CARTE_USER --> Default: cluster
  - CARTE_PASSWORD --> Default: cluster
  - AIRFLOW_ADMIN_USER --> Creates a Web UI user. Default: airflow
  - AIRFLOW_ADMIN_PASSWORD --> Default: airflow
  - AIRFLOW_ADMIN_EMAIL --> Required if a new user is to be created
  - PENTAHO_DI_JAVA_OPTIONS --> Allocates JVM memory to the PDI container, based on host machine RAM. Increase this if the container crashes due to GC out-of-memory errors. Ex: for min. 1G and max. 4G, set it to "-Xms1g -Xmx4g"
  - CARTE_HOST_PORT --> Default: 8181
  - AIRFLOW_HOST_PORT --> Default: 8080
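For reference, a fully populated `.env` combining the required and optional variables might look like this (every value below is illustrative):

```
PENTAHO_UID=1000
PENTAHO_GID=0
AIRFLOW_UID=1000
AIRFLOW_GID=0
HOST_ENV=qa
CARTE_USER=cluster
CARTE_PASSWORD=cluster
AIRFLOW_ADMIN_USER=airflow
AIRFLOW_ADMIN_PASSWORD=airflow
AIRFLOW_ADMIN_EMAIL=admin@example.com
PENTAHO_DI_JAVA_OPTIONS=-Xms1g -Xmx4g
CARTE_HOST_PORT=8181
AIRFLOW_HOST_PORT=8080
```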
- Create the below folders for the container volumes to bind:

  ```
  mkdir ./setup-airflow/logs ./setup-airflow/plugins ./setup-pentaho/logs
  ```

- Source code

  Since the DAG/PDI source-code files might undergo frequent updates, they are not copied into the container during image build; instead, they are mounted via docker compose. Any update to these files on the host automatically becomes visible inside the container.

  - Airflow:
    - The default folder for DAGs on the host is ./source-code/dags
    - Replace this default folder in the docker compose file with the desired folder location on the host.
    - Place all the DAG files in that host dags folder.

  - Pentaho:
    - The default folder for ktr/kjb files on the host is ./source-code/ktrs
    - Replace this default folder in the docker compose file with the desired folder location on the host.
    - Place all the PDI files in that host ktrs folder.
    - Update the repositories.xml file accordingly, to make them visible to Carte.
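As an illustration of the hand-off described above, below is a minimal Python sketch of how an Airflow task could build the Carte request that runs one of these transformations. It assumes the `PDI_CONN_STR` value exported in the compose file; the helper name and the transformation path are hypothetical, and the actual repo delegates this step to the mounted `execute-carte.sh`.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def carte_execute_trans_url(conn_str: str, trans_path: str) -> str:
    """Build a Carte executeTrans URL from a PDI_CONN_STR-style string.

    conn_str follows the compose file's format:
    http://<carte_user>:<carte_password>@pdi-master:8181
    """
    parts = urlsplit(conn_str)
    # Carte's executeTrans endpoint runs a transformation identified by path.
    endpoint = "/kettle/executeTrans/"
    query = "trans=" + quote(trans_path, safe="")
    return urlunsplit((parts.scheme, parts.netloc, endpoint, query, ""))

# Example: the URL an Airflow task could then fetch (e.g. with curl or requests)
url = carte_execute_trans_url(
    "http://cluster:cluster@pdi-master:8181",
    "/home/pentaho/repositories/sample.ktr",  # hypothetical transformation
)
print(url)
```

In practice an Airflow operator (e.g. a BashOperator wrapping curl) would issue this request and then poll Carte's status endpoint for completion.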
### Build & Deploy
The below command will build the images (on first run) and start all the services.

```
docker-compose up
```

To run as a daemon, add the -d option.

# Web UI
- If not on localhost, replace with the server endpoint URL.
- If not using the below default ports, replace with the ones set via CARTE_HOST_PORT & AIRFLOW_HOST_PORT.

Airflow Webserver

```
localhost:8080/home
```

Carte Webserver

```
localhost:8181/kettle/status
```

# Best practices
- The ```jdbc.properties``` file, which contains database access credentials, has been included in this repo for reference purposes only. In actual development this should be avoided: add the file to .gitignore instead. After the first code pull to a server, update it with all the JNDI details before docker compose.

- The ```.env``` file may also contain sensitive information, like environment-dependent access keys. It should likewise be added to the .gitignore file; instead, create it with the necessary parameters during image build.

- Setting ```HOST_ENV``` gives the flexibility to choose the appropriate ```kettle.properties``` file. For example, QA and PROD mailing-server SMTP details may differ; these can be kept in separate kettle properties files, selected dynamically based on the host environment. Additionally, if the ```jdbc.properties``` file is used, the PDI container can dynamically select the correct JNDI from it. For example, to test a transformation in the QA environment using a Postgres JNDI connection encoded as ```db-${HOST_ENV}```, running the PDI service with ```HOST_ENV=qa``` will render the ```db-qa``` database JNDI, thus using QA data for testing.
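To make the ```db-${HOST_ENV}``` pattern concrete, a simple-jndi ```jdbc.properties``` entry for a ```db-qa``` connection might look like the fragment below (host, database name, and credentials are all illustrative):

```
# QA JNDI connection, selected when HOST_ENV=qa
db-qa/type=javax.sql.DataSource
db-qa/driver=org.postgresql.Driver
db-qa/url=jdbc:postgresql://qa-db-host:5432/analytics
db-qa/user=etl_user
db-qa/password=changeme
```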
- ```PENTAHO_DI_JAVA_OPTIONS``` lets the user tweak the amount of memory PDI gets inside the container to run a task. Depending on the host machine's memory and average task complexity, this can be tuned to avoid PDI container crashes due to "GC Out of Memory" errors. If the host machine has ample RAM and the PDI container is crashing at the default memory limits, increase them by setting ```PENTAHO_DI_JAVA_OPTIONS=-Xms1g -Xmx4g```, 1GB and 4GB being the lower and upper limits respectively.

# References & Credits
- [What is Carte Server?](https://wiki.pentaho.com/display/EAI/Carte+User+Documentation)
- [Configure Carte Server](https://help.pentaho.com/Documentation/8.0/Products/Data_Integration/Carte_Clusters/060)
- [Set Repository on the Carte Server](https://help.pentaho.com/Documentation/9.1/Products/Use_Carte_Clusters)
- [Carte APIs to trigger kettle transformations/jobs](https://help.pentaho.com/Documentation/9.1/Developer_center/REST_API_Reference/Carte)
- [Scheduling a PDI job using Dockerized Airflow](https://diethardsteiner.github.io/pdi/2020/04/01/Scheduling-a-PDI-Job-on-Apache-Airflow.html)
`docker-compose.yml` (new file):
```yaml
version: '3'
x-pdi-common:
  &pdi-common
  build:
    context: ./setup-pentaho
    dockerfile: Dockerfile
    args:
      PENTAHO_UID: ${PENTAHO_UID}
      PENTAHO_GID: ${PENTAHO_GID}
  image: pdi
  environment:
    &pdi-common-env
    PENTAHO_DI_JAVA_OPTIONS: ${PENTAHO_DI_JAVA_OPTIONS}
    CARTE_USER: ${CARTE_USER}
    CARTE_PASSWORD: ${CARTE_PASSWORD}
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./source-code/ktrs:/home/pentaho/repositories
    - ./setup-pentaho/logs:/opt/data-integration/logs
    - ./setup-pentaho/repositories.xml:/opt/data-integration/.kettle/repositories.xml
    - ./setup-pentaho/kettle-properties/${HOST_ENV:-localhost}-kettle.properties:/opt/data-integration/.kettle/kettle.properties
    - ./setup-pentaho/simple-jndi:/opt/data-integration/simple-jndi
  deploy:
    restart_policy:
      condition: on-failure
      max_attempts: 3

x-airflow-common:
  &airflow-common
  build: ./setup-airflow
  image: airflow
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflow-database/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow-database/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@airflow-broker:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    PDI_CONN_STR: http://${CARTE_USER:-cluster}:${CARTE_PASSWORD:-cluster}@pdi-master:${CARTE_HOST_PORT:-8181}
  volumes:
    - ./source-code/dags:/opt/airflow/dags
    - ./setup-airflow/plugins:/opt/airflow/plugins
    - ./setup-airflow/logs:/opt/airflow/logs
    - ./setup-airflow/execute-carte.sh:/opt/airflow/execute-carte.sh
    - ./setup-airflow/airflow.cfg:/opt/airflow/airflow.cfg
  user: "${AIRFLOW_UID}:${AIRFLOW_GID}"
  depends_on:
    airflow-broker:
      condition: service_healthy
    airflow-database:
      condition: service_healthy

services:
  # Airflow-DB
  airflow-database:
    image: postgres:13
    container_name: airflow-database
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  # Airflow-messenger
  airflow-broker:
    image: redis:latest
    container_name: airflow-broker
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  # Airflow-webserver
  airflow-webserver:
    <<: *airflow-common
    container_name: airflow-webserver
    command: webserver
    ports:
      - ${AIRFLOW_HOST_PORT:-8080}:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:${AIRFLOW_HOST_PORT:-8080}/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  # Airflow-scheduler
  airflow-scheduler:
    <<: *airflow-common
    container_name: airflow-scheduler
    command: scheduler
    restart: always

  # Airflow-worker
  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  # Airflow-DB-initialize
  airflow-init:
    <<: *airflow-common
    container_name: airflow-init
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${AIRFLOW_ADMIN_USER:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${AIRFLOW_ADMIN_PASSWORD:-airflow}
      _AIRFLOW_WWW_USER_EMAIL: ${AIRFLOW_ADMIN_EMAIL:[email protected]}

  # Pentaho
  pdi-master:
    <<: *pdi-common
    container_name: pdi-master
    environment:
      <<: *pdi-common-env
    ports:
      - ${CARTE_HOST_PORT:-8181}:8181

  # pdi-child:
  #   <<: *pdi-common
  #   container_name: pdi-child
  #   ports:
  #     - 8182
  #   depends_on:
  #     - pdi-master
  #   environment:
  #     <<: *pdi-common-env
  #     CARTE_PORT: 8182
  #     CARTE_IS_MASTER: 'N'
  #     CARTE_INCLUDE_MASTERS: 'Y'
  #     CARTE_MASTER_HOSTNAME: 'pdi-master'
  #     CARTE_MASTER_PORT: ${CARTE_HOST_PORT:-8181}

volumes:
  postgres-db-volume:
```
`setup-airflow/Dockerfile` (new file):
```dockerfile
FROM apache/airflow:2.0.1

USER root

# Install environment dependencies
RUN apt-get update \
    # xmlstarlet is required by Airflow to read the XML log generated by the Carte server running in a separate container
    && apt-get install xmlstarlet -y \
    # Upgrade pip
    && pip install --upgrade pip \
    # Install project-specific packages
    && pip install 'apache-airflow[postgres]'

USER airflow
```
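The Dockerfile installs xmlstarlet so the Airflow side can read the XML that Carte returns. As a rough Python equivalent of that parsing step, here is a sketch that extracts the fields from a Carte-style ```webresult``` document; the sample XML is an assumed typical shape, not captured output, and ```parse_webresult``` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed typical shape of a Carte webresult response (illustrative sample).
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<webresult>
  <result>OK</result>
  <message>Transformation started</message>
  <id>8c123456-aaaa-bbbb-cccc-1234567890ab</id>
</webresult>"""

def parse_webresult(xml_text: str) -> dict:
    """Extract the child elements of a webresult document into a dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}

info = parse_webresult(SAMPLE_RESPONSE)
print(info["result"], info["message"])
```

An orchestration task could use the ```result``` field to decide whether the downstream work succeeded or should be retried.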