galaxy with k8s
Galaxy is a workflow environment developed by a large bioinformatics community, mostly by people working in the context of Next Generation Sequencing (NGS) tools. The main site for Galaxy can be found here and its main development point at GitHub can be accessed here.
Galaxy, as a workflow environment, allows researchers with no programming ability to chain common bioinformatics tools into pipelines or workflows. Galaxy uses the original code and binaries of those tools, which are developed elsewhere, and provides tool wrappers so that Galaxy's UI and API can interact with them. In a classical installation, most tools are executed serially on the same machine where Galaxy is running. This does not scale for the purposes of PhenoMeNal. To address scalability, and given our model of using tools as microservices, the PhenoMeNal project has contributed to the community a way of connecting Galaxy with a container orchestrator.
A container orchestrator is something like "a cluster for containers". Kubernetes, an open source container orchestrator backed by Google, is today one of the most actively developed projects of this kind. The GitHub project for Kubernetes can be found here.
The Kubernetes runner for Galaxy contributed by PhenoMeNal was added to the main Galaxy Project GitHub repo through a number of pull requests (2314, 2559 among others).
Currently, there are two ways of running Galaxy with Kubernetes:
- Galaxy running within the Kubernetes cluster: easier setup, and now also adequate for development of Galaxy tools. Currently the preferred way.
- Galaxy running outside of the Kubernetes cluster: might be easier for tool wrapper development purposes, but harder to set up. This is not the preferred way for PhenoMeNal purposes, but will be kept as a reference for external users.

Requirements for either setup:
- An installation of Kubernetes. See "Creating a cluster" here for alternatives.
  - Currently this is necessary as the PhenoMeNal MANTL immutable images do not yet have support for Kubernetes. This is expected to change soon.
  - For a single machine development environment on Linux or Mac OS X, the official Kubernetes recommendation is currently to use minikube. This is the current preferred way.
  - There is now experimental support on Windows for running minikube. Please see these temporary directions.
  - For a single machine development environment on a Mac OS X machine, we have also successfully worked with Kube-Cluster.
- Credentials to access and use that Kubernetes installation.
  - You don't need to worry about this if using minikube for development/tool testing.
  - Normally in the form of a `~/.kube/config` file, generated during installation. If you can issue `kubectl cluster-info` and you don't get an error, then you already have this in place.
- Shared filesystem accessible from Kubernetes.
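
The credentials check mentioned in the list above (running `kubectl cluster-info` and looking for an error) can be scripted. The following is a minimal sketch, not part of Galaxy or Kubernetes: the function name `cluster_reachable` and the 30 second timeout are assumptions made for illustration, and the only thing it relies on is that `kubectl cluster-info` exits with a non-zero status when it cannot reach the cluster.

```python
import subprocess


def cluster_reachable(command=("kubectl", "cluster-info"), timeout=30):
    """Return True if `command` runs and exits with status 0.

    `cluster_reachable` is a hypothetical helper name; the default command
    is the check suggested in the text above.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # kubectl is not installed or not executable, or the cluster
        # did not answer within the timeout.
        return False
    return result.returncode == 0
```

Calling `cluster_reachable()` with no arguments runs the real check; a `False` result suggests looking at `~/.kube/config` before continuing.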
The Kubernetes Galaxy runner makes use of a Persistent Volume Claim (PVC), which is the Kubernetes way of abstracting access to a filesystem (which can be provided on the other side in many ways, such as GlusterFS, Ceph, NFS, local FS, iSCSI, etc.). For this we first need to create a Persistent Volume (PV) in Kubernetes; its definition varies depending on the Kubernetes cluster that you use.
On `minikube`, the only supported persistent volume type is currently `hostPath`. As such, we can create a PV using the following `yaml` content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv0001
      labels:
        type: local
    spec:
      capacity:
        storage: 25Gi
      accessModes:
        - ReadWriteMany
      hostPath:
        path: "/opt/mk-galaxy-data"

Copy this content into a file named `pv_minikube_internal.yaml` and then execute `kubectl create -f pv_minikube_internal.yaml`. Notice that `/opt/mk-galaxy-data` is internal to the minikube VM and is not shared with your host machine. This means that Galaxy will create all its files in the internal storage of the minikube VM, and these files won't be exposed to your host. The reason for this is that Galaxy makes use of hard links for handling data uploads, and hard links are not supported by VirtualBox on shared folders. We have asked for minikube to support NFS mounts instead of shared folders on the boot2docker minikube VM.
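
Whether a given directory can hold Galaxy's data therefore depends on hard link support. The sketch below probes a directory by creating a file and a hard link to it inside a scratch subdirectory. The function name `supports_hard_links` is hypothetical, and the probe only says something about the filesystem as seen from wherever the Python interpreter runs (for example, the host side of a shared folder).

```python
import os
import tempfile


def supports_hard_links(directory):
    """Return True if a hard link can be created inside `directory`."""
    try:
        # Work in a scratch subdirectory that is removed afterwards.
        with tempfile.TemporaryDirectory(dir=directory) as scratch:
            source = os.path.join(scratch, "source")
            link = os.path.join(scratch, "link")
            with open(source, "w") as handle:
                handle.write("probe")
            os.link(source, link)
            # After linking, both names must refer to the same file.
            return os.path.samefile(source, link)
    except OSError:
        # Raised when the filesystem rejects hard links, or when the
        # directory is missing or not writable.
        return False
```

A `False` result for a candidate data directory indicates that Galaxy uploads relying on hard links are likely to fail there.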
You don't need the Kube-Cluster PV described next if you already created the minikube PV above; in that case, please skip to the Persistent Volume Claim part.
For a Mac OS X development environment using Kube-Cluster, this is through the NFS share of `/Users/<you>` that Kube-Cluster exposes to the running Kubernetes. The `yaml` definition for this looks like this:

    kind: PersistentVolume
    apiVersion: v1
    metadata:
      name: pv-galaxy-jdoe-nfs
      labels:
        type: nfs
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadWriteMany
      # Valid reclaim policies are Retain, Recycle and Delete.
      persistentVolumeReclaimPolicy: Retain
      nfs:
        path: /Users/jdoe/galaxy_data
        server: 192.168.64.1

Modify this so that `spec:nfs:path` lies inside the NFS share exported by the server at the IP given in `spec:nfs:server`, and amend that IP to match your setup. Place the content in a file named `pv-for-galaxy.yaml` and create the PV in Kubernetes by issuing `kubectl create -f pv-for-galaxy.yaml`. Please note that since we will initially be using Galaxy from outside Kubernetes, we will set up Galaxy to leave all its temporary and working files on the same path in this PV (`/Users/jdoe/galaxy_data`).
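
Since the NFS path and server IP differ per machine, the manifest can also be generated instead of edited by hand. The following is a sketch under a few assumptions: the helper names `nfs_persistent_volume` and `write_manifest` and the output file name are hypothetical, the manifest is written as JSON (which `kubectl create -f` accepts as well as YAML) to avoid a dependency on a YAML library, and the reclaim policy is set to `Retain`, the spelling Kubernetes accepts.

```python
import json


def nfs_persistent_volume(name, server, path, storage="30Gi"):
    """Build a PersistentVolume manifest backed by an NFS export."""
    return {
        "kind": "PersistentVolume",
        "apiVersion": "v1",
        "metadata": {"name": name, "labels": {"type": "nfs"}},
        "spec": {
            "capacity": {"storage": storage},
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            # `path` must lie inside the share exported by `server`.
            "nfs": {"path": path, "server": server},
        },
    }


def write_manifest(manifest, filename):
    """Write the manifest as JSON so it can be passed to kubectl."""
    with open(filename, "w") as handle:
        json.dump(manifest, handle, indent=2)
        handle.write("\n")
```

For the example values used above, `write_manifest(nfs_persistent_volume("pv-galaxy-jdoe-nfs", "192.168.64.1", "/Users/jdoe/galaxy_data"), "pv-for-galaxy.json")` produces a file that could then be given to `kubectl create -f`.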
Once the PV is in place in Kubernetes, we can set up the Persistent Volume Claim that Galaxy and the tools will ultimately use.
To do this (applicable to any installation), the `yaml` file looks like this:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: galaxy-pvc
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi

To create this PVC, save the content in a file named `pvc-galaxy.yaml` and execute `kubectl create -f pvc-galaxy.yaml`. To check that the PV and PVC have been created, execute `kubectl get pv` and `kubectl get pvc`; you should see the names and status of the created objects. Take note of the PVC `metadata:name`, as it is needed in the Kubernetes runner setup in Galaxy. Please note that the storage requested by the PVC (`spec:resources:requests:storage`) must not exceed the storage provided by the PV (`spec:capacity:storage`); otherwise the claim will not be bound to the volume.
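
The size constraint between the claim and the volume can be checked before creating either object. The sketch below compares two Kubernetes storage quantities such as `20Gi` and `25Gi`. The helper names `quantity_to_bytes` and `claim_fits` are hypothetical, and the parser is a simplification that handles only plain numbers and the common binary (`Ki` to `Ei`) and decimal (`k` to `E`) suffixes, not every quantity form Kubernetes accepts.

```python
from decimal import Decimal

# Multipliers for the suffixes handled by this simplified parser.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9,
    "T": 10**12, "P": 10**15, "E": 10**18,
}


def quantity_to_bytes(quantity):
    """Convert a storage quantity such as '25Gi' to a number of bytes."""
    text = quantity.strip()
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    # No known suffix: treat the value as a plain number of bytes.
    return Decimal(text)


def claim_fits(pvc_request, pv_capacity):
    """Return True if the PVC request does not exceed the PV capacity."""
    return quantity_to_bytes(pvc_request) <= quantity_to_bytes(pv_capacity)
```

With the values used on this page, `claim_fits("20Gi", "25Gi")` and `claim_fits("20Gi", "30Gi")` both hold, while a 30Gi request against the 25Gi minikube volume would not.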
- [Galaxy inside Kubernetes setup](Galaxy inside Kubernetes)
- [Galaxy outside Kubernetes setup](Galaxy outside Kubernetes)
Funded by the EC Horizon 2020 programme, grant agreement number 654241 |
---|