
Running Galaxy with Kubernetes

Galaxy is a workflow environment developed by a large bioinformatics community, mostly by people working in the context of Next Generation Sequencing (NGS) tools. The main site for Galaxy can be found here, and its main development repository on GitHub can be accessed here.

Galaxy, as a workflow environment, allows researchers with no programming ability to chain common bioinformatics tools into pipelines or workflows. Galaxy uses the original code and binaries of those tools, which are developed elsewhere, and provides tool wrappers so that Galaxy's UI and API can interact with them. In a classical installation, most tools are executed serially on the same machine where Galaxy is running. This doesn't scale for the purposes of PhenoMeNal. To address this, and given our model of using tools as microservices, the PhenoMeNal project has contributed to the community a way of connecting Galaxy to a container orchestrator.

A container orchestrator is something like "a cluster for containers". Kubernetes, an open source container orchestrator backed by Google, is today one of the most actively developed projects of this kind. The GitHub project for Kubernetes can be found here.

The Kubernetes runner for Galaxy contributed by PhenoMeNal was added to the main Galaxy Project GitHub repository through a number of pull requests (2314 and 2559, among others).

Currently, there are two ways of running Galaxy with Kubernetes:

Galaxy running within the Kubernetes cluster: easier to set up, and now also adequate for development of Galaxy tools. Currently the preferred way.
Galaxy running outside of the Kubernetes cluster: might be easier for tool wrapper development purposes, but harder to set up. This is not the preferred way for PhenoMeNal purposes, but it will be left as a reference for external users.

Common Requirements

An installation of Kubernetes. See "Creating a cluster" here for alternatives.
Currently this is necessary because the PhenoMeNal MANTL immutable images do not yet support Kubernetes. This is expected to change soon.
For a single machine development environment on Linux or Mac OS X, the official Kubernetes recommendation is currently to use minikube. This is the current preferred way.
There is now experimental support for running minikube on Windows. Please see these temporary directions.
For a single machine development environment on Mac OS X, we have also successfully worked with Kube-Cluster.
Credentials to access and use that Kubernetes installation.
You don't need to worry about this if you are using minikube for development/tool testing.
Normally in the form of a ~/.kube/config file, generated during installation. If you can issue kubectl cluster-info and you don't get an error, then you already have this in place.
Shared filesystem accessible from Kubernetes.
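The credentials check described above (running kubectl cluster-info and looking for an error) can be scripted. The following is a minimal sketch, not part of Galaxy or PhenoMeNal: the function name cluster_reachable and its timeout parameter are hypothetical, and it assumes that kubectl is on your PATH and picks up ~/.kube/config in the usual way.

```python
import shutil
import subprocess


def cluster_reachable(kubectl="kubectl", timeout=15):
    """Return True if `kubectl cluster-info` exits with status 0."""
    # No kubectl binary on the PATH means there is nothing to query.
    if shutil.which(kubectl) is None:
        return False
    try:
        result = subprocess.run(
            [kubectl, "cluster-info"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary could not be started or the cluster did not answer in time.
        return False
    return result.returncode == 0
```

Calling cluster_reachable() before creating any PV or PVC gives an early, explicit failure instead of an error halfway through the setup.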

Common setup: shared filesystem

The Kubernetes Galaxy runner makes use of a Persistent Volume Claim (PVC), which is the Kubernetes way of abstracting access to a filesystem (which can be provided on the other side in many ways, such as GlusterFS, Ceph, NFS, local FS, iSCSI, etc.). For this we first need to create a Persistent Volume (PV) in the Kubernetes cluster. How the PV is defined depends on the local Kubernetes cluster that you use.

PV for minikube on Linux/Mac

On minikube, the only persistent volume type currently supported is hostPath. As such, we can create a PV using the following yaml content:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels:
    type: local
spec:
  capacity:
    storage: 25Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/opt/mk-galaxy-data"

Copy this content into a file named pv_minikube_internal.yaml and then run kubectl create -f pv_minikube_internal.yaml. Note that /opt/mk-galaxy-data is internal to the minikube VM and is not shared with your host machine. This means that Galaxy will create all its files in the internal storage of the minikube VM and they won't be exposed to your host. The reason is that Galaxy uses hard links for handling data uploads, and hard links are not supported by VirtualBox on shared folders. We have asked for minikube to support NFS mounts instead of shared folders on the boot2docker minikube VM.

PV for Kube-cluster on OS X

You don't need this section if you created the PV for minikube as described just before this one; in that case, please skip to the next part.

For a Mac OS X development environment using Kube-Cluster, the PV is provided through the NFS share of /Users/<you> that Kube-Cluster exposes to the running Kubernetes. The yaml definition looks like this:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-galaxy-jdoe-nfs
  labels:
    type: nfs
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /Users/jdoe/galaxy_data
    server: 192.168.64.1

This needs to be adapted so that spec:nfs:path lies inside the NFS share already exported by the server at the IP given in spec:nfs:server (the IP itself also needs to be changed to match your setup). Place this content in a file named pv-for-galaxy.yaml and then create the PV in Kubernetes by running kubectl create -f pv-for-galaxy.yaml. Please note that since we will initially be running Galaxy from outside Kubernetes, we will set up Galaxy to keep all its temporary and working files on the same path in this PV (/Users/jdoe/galaxy_data).
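Since the NFS path and server IP differ per machine, it can be convenient to generate the PV definition instead of editing it by hand. The sketch below is one possible way to do that and is not part of the PhenoMeNal tooling: the helper names nfs_pv_manifest and write_manifest are hypothetical. It writes JSON rather than yaml only to stay within the Python standard library; kubectl create -f accepts JSON manifests as well.

```python
import json


def nfs_pv_manifest(name, path, server, storage="30Gi"):
    """Build an NFS PersistentVolume manifest as a plain dict."""
    return {
        "kind": "PersistentVolume",
        "apiVersion": "v1",
        "metadata": {"name": name, "labels": {"type": "nfs"}},
        "spec": {
            "capacity": {"storage": storage},
            "accessModes": ["ReadWriteMany"],
            # Kubernetes accepts Retain, Recycle or Delete here.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"path": path, "server": server},
        },
    }


def write_manifest(manifest, filename):
    """Write the manifest as JSON, which kubectl accepts alongside yaml."""
    with open(filename, "w") as handle:
        json.dump(manifest, handle, indent=2)
```

For example, write_manifest(nfs_pv_manifest("pv-galaxy-jdoe-nfs", "/Users/jdoe/galaxy_data", "192.168.64.1"), "pv-for-galaxy.json") produces a file that can be passed to kubectl create -f pv-for-galaxy.json.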

Create PVC

Once the PV is in place in Kubernetes, we can set up the Persistent Volume Claim that Galaxy and the tools will ultimately use.

To do this (applicable to any installation), the yaml file looks like this:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: galaxy-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi

To create this PVC, save the content in a file named pvc-galaxy.yaml and then run kubectl create -f pvc-galaxy.yaml. To check that the PV and PVC have been created, run kubectl get pv and kubectl get pvc; you should see the names and status of the created objects. Take note of the PVC metadata:name, as it is needed in the Kubernetes runner setup in Galaxy. Please note that the storage requested by the PVC (spec:resources:requests:storage) must not exceed the storage provided by the PV (spec:capacity:storage); otherwise the claim won't be bound to the volume.
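The size constraint between PVC and PV can be checked before submitting either definition. The following is a simplified sketch: parse_quantity and claim_fits are hypothetical helpers, and the parser only handles plain numbers with the decimal (k, M, G, ...) and binary (Ki, Mi, Gi, ...) suffixes, not the full Kubernetes quantity syntax.

```python
import re
from decimal import Decimal

# Multipliers for the decimal and binary suffixes used for storage sizes.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}


def parse_quantity(quantity):
    """Convert a storage quantity such as '25Gi' into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(Decimal(match.group(1)) * _SUFFIXES[match.group(2)])


def claim_fits(request, capacity):
    """Return True if the PVC request is no larger than the PV capacity."""
    return parse_quantity(request) <= parse_quantity(capacity)
```

With the values used on this page, claim_fits("20Gi", "25Gi") holds for the minikube PV, and claim_fits("20Gi", "30Gi") holds for the Kube-Cluster one.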

Setup

[Galaxy inside Kubernetes setup](Galaxy inside Kubernetes)
[Galaxy outside Kubernetes setup](Galaxy outside Kubernetes)

	Funded by the EC Horizon 2020 programme, grant agreement number 654241
