galaxy with k8s
If you don't need to understand the details and you aim for the default recommended case of deploying Galaxy inside Kubernetes, using Minikube for local testing, you can go directly to the quick start installation. PhenoMeNal developers are advised to read the complete tutorial at least once to understand what is going on, although day-to-day testing should follow the quick start steps. You shouldn't need to deal directly with YAML files as shown below; the following parts of this page are for illustration purposes only, so do not actually follow them.
Galaxy is a workflow environment tool developed by a large bioinformatics community, mostly by people working in the context of Next Generation Sequencing (NGS) tools. The main site for Galaxy can be found here and its main development point at GitHub can be accessed here.
Galaxy, as a workflow environment, allows researchers with no programming ability to chain common bioinformatics tools into pipelines or workflows. Galaxy uses the original code and binaries of those bioinformatics tools, developed elsewhere, and provides tool wrappers for them so that Galaxy's UI and API can interact with those tools. In a classical installation, most tools would normally be executed serially on the same machine where Galaxy is running. This doesn't scale for the purposes of PhenoMeNal. To address this, and given our model of using tools as microservices, the PhenoMeNal project has contributed to the community a way of connecting Galaxy to a container orchestrator.
A container orchestrator is something like "a cluster for containers". Kubernetes, an open source container orchestrator backed by Google, is today one of the most actively developed projects of this kind. The GitHub project for Kubernetes can be found here.
The Kubernetes runner for Galaxy contributed by PhenoMeNal was added to the main Galaxy Project GitHub repo through a number of pull requests (2314, 2559 among others).
Currently, there are two ways of running Galaxy with Kubernetes:
- Galaxy running within the Kubernetes cluster: easier setup, and now also adequate for development of Galaxy tools. Currently the preferred way.
- Galaxy running outside of the Kubernetes cluster: might be easier for tool wrapper development purposes, but harder to set up. This is not the preferred way for PhenoMeNal purposes, but is left as a reference for external users.
You will need the following:
- An installation of Kubernetes. See "Creating a cluster" here for alternatives.
  - For a single machine development environment on Linux or Mac OS X, the official Kubernetes recommendation is currently to use Minikube. This is the current preferred way.
  - There is now experimental support for running Minikube on Windows. Please see these temporary directions.
  - For a single machine development environment on a Mac OS X machine, we have also successfully worked with Kube-Cluster.
- Credentials to access and use that Kubernetes installation.
  - You don't need to worry about this if using Minikube for development/tool testing.
  - Normally in the form of a `~/.kube/config` file, generated during installation. If you can issue `kubectl cluster-info` and you don't get an error, then you already have this in place.
- Shared filesystem accessible from Kubernetes.
The Kubernetes Galaxy runner makes use of a Persistent Volume Claim (PVC), which is the Kubernetes way of abstracting access to a filesystem (which can be provided on the other side in many ways, such as GlusterFS, Ceph, NFS, local FS, iSCSI, etc.). For this we first need to create a Persistent Volume (PV) in Kubernetes, whose definition varies depending on the Kubernetes cluster that you use.
On Minikube, the only persistent volume type currently supported is `hostPath`, which is simply a directory on the hosting machine shared with the running Pods. The directory that is shared is specified as part of the PV. We can create a PV for use with the deployment using the following steps:
- In a directory of your choice, create a file named `pv_minikube_internal.yaml`. This file should be created on the machine where you installed kubectl. Add the following YAML content to the new file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels:
    type: local
spec:
  capacity:
    storage: 25Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/opt/mk-galaxy-data"
- From the same directory where the previous file was created, run the following command in a shell:
$ kubectl create -f pv_minikube_internal.yaml
- Note that `/opt/mk-galaxy-data` is internal to the Minikube VM and is not shared with your host machine. This means that Galaxy will create all its files in the internal storage of the Minikube VM, and that these won't be exposed to your host. The reason is that Galaxy makes use of hard links for handling data uploads, and hard links are not supported by VirtualBox on shared folders. We have asked for Minikube to support NFS mounts instead of shared folders on the boot2docker Minikube VM.
You can now skip to the "Create PVC" section below.
You don't need this if you used the Minikube section for the PV just before this one; in that case, please skip to the next part.
For a Mac OS X development environment using Kube-Cluster, the PV is provided through the NFS share of `/Users/<you>` that Kube-Cluster exposes to the running Kubernetes. The YAML definition looks like this:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-galaxy-jdoe-nfs
  labels:
    type: nfs
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /Users/jdoe/galaxy_data
    server: 192.168.64.1
This needs to be modified so that `spec:nfs:path` is inside the NFS share already put in place by the server at the IP defined in `spec:nfs:server` (the IP needs to be amended accordingly as well). Place this content inside a file named `pv-for-galaxy.yaml` and create the PV inside Kubernetes by issuing `kubectl create -f pv-for-galaxy.yaml`. Please note that since we will initially be using Galaxy from outside Kubernetes, we will set up Galaxy to leave its temporary and working files on the same path in this PV (`/Users/jdoe/galaxy_data`).
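Since the NFS path and server IP differ per machine, the definition can also be generated rather than edited by hand. The following is a minimal sketch, not part of the PhenoMeNal tooling: the function name `nfs_pv_manifest` is hypothetical, and it relies on the fact that `kubectl create -f` accepts JSON manifests as well as YAML.

```python
import json


def nfs_pv_manifest(name, nfs_server, nfs_path, storage="30Gi"):
    """Build an NFS-backed PersistentVolume manifest as a plain dict.

    Hypothetical helper: it mirrors the fields of the YAML definition above.
    """
    return {
        "kind": "PersistentVolume",
        "apiVersion": "v1",
        "metadata": {"name": name, "labels": {"type": "nfs"}},
        "spec": {
            "capacity": {"storage": storage},
            "accessModes": ["ReadWriteMany"],
            # "Retain" keeps the data when the claim is released.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"path": nfs_path, "server": nfs_server},
        },
    }


if __name__ == "__main__":
    # Example values taken from the definition above; adjust to your setup.
    manifest = nfs_pv_manifest(
        "pv-galaxy-jdoe-nfs", "192.168.64.1", "/Users/jdoe/galaxy_data"
    )
    print(json.dumps(manifest, indent=2))
```

Redirecting the printed output to a file such as `pv-for-galaxy.json` gives a manifest that can be passed to `kubectl create -f` in the same way as the YAML file.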
Once the PV is in place in Kubernetes, we can set up the Persistent Volume Claim that Galaxy and the tools will ultimately use.
To do this (applicable to any installation), go through the following steps:
- Create a file named `pvc-galaxy.yaml` and paste the following YAML content into it:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: galaxy-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
- From the same directory where the file was created, run the following command in the shell:
$ kubectl create -f pvc-galaxy.yaml
To check that the PV and PVC have been created, run `kubectl get pv` and `kubectl get pvc` in the same shell; you should see the names and status of the created objects. Please note that the storage requested by the PVC (`spec:resources:requests:storage`) must not exceed the storage provided by the PV (`spec:capacity:storage`), otherwise the claim won't be bound.
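The size constraint between the PVC and the PV can be checked before creating either object. The sketch below is illustrative only: the helper names are hypothetical, it handles just the common storage suffixes, and it checks size alone, whereas Kubernetes also considers other properties such as access modes when binding a claim.

```python
# Common suffixes used in Kubernetes storage quantities (binary and decimal).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def quantity_to_bytes(quantity):
    """Convert a quantity such as '20Gi' to a number of bytes."""
    # Try the two-letter suffixes first so they take priority.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain bytes


def pvc_fits_pv(pvc_request, pv_capacity):
    """Return True if the PVC request does not exceed the PV capacity."""
    return quantity_to_bytes(pvc_request) <= quantity_to_bytes(pv_capacity)
```

With the values used on this page, `pvc_fits_pv("20Gi", "25Gi")` holds for the Minikube PV, while a claim of `30Gi` against the same PV would not fit.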
- Galaxy inside Kubernetes: Both Galaxy and the tools run inside Kubernetes. This is the preferred route for both users and developers, and the path requiring the least setup. This is a legacy explanation of what would need to be done manually if not using Helm. YAML files might be outdated.
- Galaxy outside Kubernetes setup: This is a legacy of initial versions, where Galaxy submitted jobs to Kubernetes but did not itself run inside Kubernetes like the deployed tools. It requires more setup and might be outdated. If you are following this setup, take note of the PVC `metadata:name:` (in the example, it has the value `galaxy-pvc`), as this needs to be used in the Kubernetes runner setup in Galaxy.
The first step to a custom Galaxy is to create a new branch in our [container-galaxy-k8s-runtime](https://github.com/phnmnl/container-galaxy-k8s-runtime) repo.
After that has been done, you can make your modifications. When you are finished, you will have to build the Docker container image and push it to a Docker registry that can be reached by the PhenoMeNal infrastructure. In our example, we use Docker Hub:
docker build -t example/galaxy-k8s-runtime:latest .
docker push example/galaxy-k8s-runtime:latest
The next step is to deploy a Galaxy instance in the infrastructure. After that has been done, please note the IP address of the Galaxy node:
export KUBE_NODE="1.1.1.1"
Now, log in to the remote node and perform a Kubernetes rolling update:
ssh ubuntu@${KUBE_NODE}.nip.io
kubectl rolling-update --image docker.io/example/galaxy-k8s-runtime:latest --image-pull-policy=Always galaxy-k8s
The custom Galaxy is now running under:
http://galaxy.${KUBE_NODE}.nip.io
You can save the server in the environment with:
export GALAXY_SERVER="http://galaxy.${KUBE_NODE}.nip.io"
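The hostnames above work because nip.io is a wildcard DNS service: any name ending in `<IP>.nip.io` resolves to that IP, so no DNS record has to be created for the node. As a minimal sketch (the function name `galaxy_url` is hypothetical), the server URL can be derived from the node IP like this:

```python
import ipaddress


def galaxy_url(node_ip):
    """Build the Galaxy URL for a node IP, using the nip.io naming scheme."""
    # Raises ValueError if node_ip is not a valid IPv4 address.
    ipaddress.IPv4Address(node_ip)
    return f"http://galaxy.{node_ip}.nip.io"
```

Validating the address first avoids silently building a hostname that will never resolve.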
Galaxy will automatically have an admin user and will automatically generate an API key. Please log into the custom Galaxy and go to the Admin section. There should be an API key there, because it is also used to deploy the shared workflows.
Note the API key:
export GALAXY_API_KEY="33ce2cac8b7508fc6478df9991c44f0d"
Now, to add custom users to that Galaxy do the following:
git clone https://github.com/phnmnl/tutorial_utils
pip install bioblend
pip install future
./generate_users_list.py -n 9 --username-prefix=testuser --email-domain phenomenal-h2020.eu > _users.csv
./galaxy_users_manager.py register --server "${GALAXY_SERVER}" --api-key "${GALAXY_API_KEY}" --users-file _users.csv --output-file _galaxy-users.csv
After that has been done, you will have added 9 custom test users (without admin access).
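For readers who want to see what such a registration involves, here is a rough sketch using BioBlend directly. It is not the implementation of the `tutorial_utils` scripts: the function names, the `testuser1` to `testuser9` naming scheme, and the random passwords are all assumptions made for illustration. It uses BioBlend's `GalaxyInstance` and `users.create_local_user`, and needs an admin API key.

```python
import secrets


def make_users(count, username_prefix, email_domain):
    """Return (username, email, password) tuples for test accounts.

    The numbering scheme is an assumption, not taken from tutorial_utils.
    """
    users = []
    for index in range(1, count + 1):
        username = f"{username_prefix}{index}"
        email = f"{username}@{email_domain}"
        users.append((username, email, secrets.token_urlsafe(12)))
    return users


def register_users(server, api_key, users):
    """Create each user through the Galaxy API (requires an admin API key)."""
    # Imported here so make_users stays usable without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=server, key=api_key)
    for username, email, password in users:
        gi.users.create_local_user(username, email, password)
```

A call such as `register_users(server, api_key, make_users(9, "testuser", "phenomenal-h2020.eu"))`, with `server` and `api_key` taken from `GALAXY_SERVER` and `GALAXY_API_KEY`, would create nine regular accounts.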
Funded by the EC Horizon 2020 programme, grant agreement number 654241