kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster. This is useful when you want:
- Easy high availability of the Airflow scheduler
  - Running multiple schedulers for high availability isn't safe, so it isn't the way to go in the first place. Someone on the internet tried to implement a wrapper that does leader election on top of the scheduler so that only one scheduler executes tasks at a time. That is possible, but can't we just rely on a cluster manager here? This is where Kubernetes comes into play.
- Easy parallelism of task executions
  - The common way to scale out workers in Airflow is to use Celery. However, managing an H/A backend database and Celery workers just for parallelising task executions sounds like a hassle. This is where Kubernetes comes into play, again. If you already have a K8S cluster, just let K8S manage them for you.
- If you have ever considered avoiding Celery for task parallelism, yes, K8S can still help you for a while. Just keep using `LocalExecutor` instead of `CeleryExecutor` and delegate the actual tasks to Kubernetes by calling e.g. `kubectl run --restart=Never ...` from your tasks, as sketched right after this list. It will work until the concurrent `kubectl run` executions (up to the concurrency implied by the scheduler's `max_threads` and LocalExecutor's `parallelism`; see this SO question for gotchas) consume all the resources a single airflow-scheduler pod provides, which will be after a pretty long time.
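For example (a minimal sketch, not part of this repository: the pod name, image, and command are placeholders), a task could delegate its real work to a throwaway pod like this:

```
# Hypothetical one-off pod launched from an Airflow task; the heavy lifting
# runs in its own pod instead of inside the scheduler pod.
kubectl run my-heavy-task --restart=Never --image=python:3 \
  --command -- python -c "print('this runs in its own pod')"
```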
This repository contains:
- Dockerfile(.template) of airflow for Docker images published to the public Docker Hub Registry.
- airflow.all.yaml for creating Kubernetes services and deployments to run Airflow on Kubernetes
- Highly inspired by the great work puckel/docker-airflow
- Based on the official Debian Jessie image (debian:jessie); uses the official Postgres image as the backend database and RabbitMQ as the queue
- Follows the Airflow releases from the Python Package Index
Create a Kubernetes cluster on GKE for Airflow:

gcloud container clusters create airflow-cluster2 --enable-autorepair --machine-type=n1-standard-2 --num-nodes=1

Make sure the GCP service account used by Airflow has the following IAM roles (a gcloud sketch follows the list):
- BigQuery Data Owner
- BigQuery Job User
- Storage Object Admin
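If you manage IAM from the command line, granting one of these roles would look roughly like the following (the service-account e-mail is a placeholder, not something created by this repository):

```
# Hypothetical example of binding a role to the Airflow service account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:airflow@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataOwner"
```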
`git clone` this repository and then just run:
export PROJECT_ID=xxxxx
# Set the version of airflow dags to replace KUBE_AIRFLOW_VERSION in Makefile
cd ../ && export VERSION="$(TZ=Asia/Tokyo date +%Y%m%dt%H%M%S)-$(git rev-parse --short HEAD)" && echo $VERSION && cd -
GCP_JSON_PATH=~/Documents/Archive/GCPProjectID-c06dcd1e8e67-airflow.json make apply
or
GCP_JSON_PATH=~/Documents/Archive/GCPProjectID-c06dcd1e8e67-airflow.json make publish
make rolling-update
The apply task depends on the publish task, which depends on the build task.
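In other words, running the later targets triggers the earlier ones. The sketch below only illustrates that chain; the descriptions are assumptions and the authoritative recipes live in the repository's Makefile:

```
make build    # build the Docker image from the Dockerfile
make publish  # depends on build; pushes the image to the registry
make apply    # depends on publish; applies the Kubernetes manifests
```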
Build and publish the Docker image for Airflow:
make publish
Create all the deployments and services to run Airflow on Kubernetes:

vim airflow.all.yaml
make create   # first deployment
make deploy   # update
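If you prefer raw kubectl over the Make targets, applying the manifests directly should be roughly equivalent (this assumes the targets wrap the bundled airflow.all.yaml):

```
kubectl create -f airflow.all.yaml   # first deployment
kubectl apply -f airflow.all.yaml    # subsequent updates
```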
make list-pods
make list-services
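The listing targets are thin conveniences; the direct kubectl equivalents would look like this (the airflow-dev namespace is assumed from the backfill example further down):

```
kubectl get pods --namespace airflow-dev
kubectl get services --namespace airflow-dev
```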
pod_name="web-2874099158-lxgm2" make pod-login
It will create deployments for:
- postgres
- rabbitmq
- airflow-webserver
- airflow-scheduler
- airflow-flower
- airflow-worker
and services for:
- postgres
- rabbitmq
- airflow-webserver
- airflow-flower
Log in to a pod:
pod_name="scheduler-1413753147-fd3q7" make login-pod
You can browse the Airflow dashboard by running:
make browse-web
and the Flower dashboard by running:
make browse-flower
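If you would rather not use the Make targets, port-forwarding to the pods works too. The pod names and the airflow-dev namespace below are assumptions based on the other examples in this README; 8080 and 5555 are the default webserver and Flower ports:

```
kubectl port-forward web-<id> 8080:8080 --namespace airflow-dev      # Airflow UI on http://localhost:8080
kubectl port-forward flower-<id> 5555:5555 --namespace airflow-dev   # Flower on http://localhost:5555
```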
If you want to use the Ad Hoc Query feature, make sure you've configured connections:
Go to Admin -> Connections, edit "mysql_default", and set these values (equivalent to the values in config/airflow.cfg; a command-line equivalent is sketched after the list):
- Host : mysql
- Schema : airflow
- Login : airflow
- Password : airflow
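The same values can also be expressed as a single connection URI, and recent Airflow versions pick up connections from AIRFLOW_CONN_* environment variables, so an equivalent (untested here) setup would be:

```
# mysql://<login>:<password>@<host>/<schema>
export AIRFLOW_CONN_MYSQL_DEFAULT="mysql://airflow:airflow@mysql/airflow"
```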
Check the Airflow documentation.
Run an Airflow command inside a pod, e.g. backfill the tutorial DAG:

kubectl exec web-<id> --namespace airflow-dev -- airflow backfill tutorial -s 2015-05-01 -e 2015-06-01
For now, update the value of the `replicas` field for the deployment you want to scale, and then:
make apply
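Alternatively, you can scale a single deployment imperatively with kubectl; the deployment name below is a placeholder, so substitute the one you actually want to scale:

```
kubectl scale deployment worker --replicas=3 --namespace airflow-dev
```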
Fetch credentials for the cluster so that kubectl can talk to it:

gcloud container clusters get-credentials airflow-cluster --zone us-central1-a --project spinappadjust

Start a local proxy to the Kubernetes API server:

kubectl proxy
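With the proxy running, the Kubernetes API is reachable on localhost (8001 is kubectl proxy's default port), for example:

```
curl http://localhost:8001/api/v1/namespaces
```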