See publication behind this experiment here: https://academic.oup.com/gigascience/article/7/4/giy036/4965114
All components required to reproduce the Field of Genes experiment are provided here.
As all running software components are published as public Docker images, it is not essential to compile the source code in order to run the experiments. For developers who wish to build the containers from source and use them in the experiment, there is a separate README under the source directory. Note that if you build and use your own Docker images, you will need to update the Kubernetes YAML files (described below) to substitute your own image names for the blawlor ones.
The rest of this README assumes that these steps are complete: that the images referenced by the kubernetes deployment files are built and published on Docker Hub.
These instructions assume the following about your environment.
- Linux: You are running a bash shell in a Linux environment.
- gcloud: You have installed the gcloud sdk - a command line utility to manage Google Cloud resources.
- kubectl: You have installed kubectl - a command line utility to manage a Kubernetes cluster instance.
- docker-machine: You have installed docker-machine. This is used to create the benchmark node.
The objective is to test the hypothesis that Kafka is a scalable data repository for bioinformatic data (in this case, the RefSeq genomic database). We compare Kafka's scalability characteristics with the flat BLAST-format files that are downloadable from NCBI. We use a GC Content calculation as a placeholder algorithm to provide the comparison, but any per-sequence processing algorithm could be substituted. When we talk about 'processing' in this text, we are referring to GC Content.
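The experiment's GC Content implementation is Java, but the calculation itself is simple enough to sketch in a few lines of shell. The following is an illustrative stand-in (not the project's code): it reports the fraction of G/C bases across all sequences in a fasta file.

```shell
# Illustrative GC Content calculation for a fasta file -- a stand-in for the
# experiment's Java implementation, not the actual project code.
gc_content() {
  awk '!/^>/ {                          # skip fasta header lines
         total += length($0)            # count all bases on this line
         gc += gsub(/[GCgc]/, "&")      # gsub returns the number of G/C hits
       }
       END { if (total > 0) printf "%.4f\n", gc / total }' "$1"
}
```

Any per-sequence algorithm with this shape (read a sequence, emit a value) could be substituted without changing the experimental setup.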
All code is run on the cloud. The benchmark is run using simple Docker containers. The experimental code is deployed using Kubernetes (k8s)-orchestrated containers. We have used Google Kubernetes Engine (GKE) as our platform, and this requires a Google Cloud account to run this experiment. Note that this is a paid platform. You will be charged by Google for resources used, so make sure to destroy your k8s cluster and any extra SSDs that are created using your account when you have finished an experimental run. For cluster sizes of 8 and larger, Google may ask you to increase your quotas.
Although the provided instructions for creating the k8s cluster are specific to GKE (i.e. they use gcloud), most of the k8s deployment instructions will work on k8s clusters hosted elsewhere (e.g. Azure, or a private k8s cluster if you have access to one). The only known exceptions are the storage configuration yml files mentioned in the instructions below, which would need to be substituted with platform-specific alternatives.
In order for the gcloud scripts (described in more detail below) to work, you will need to have created a Google Cloud project and authenticated yourself on it. Create a file under the 'deployment/k8s' directory called set-google-project.sh which sets an environment variable with the name of your Google project. Then use gcloud to authenticate using your Google account credentials:
```
gcloud auth application-default login
```
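A minimal sketch of what set-google-project.sh might contain (the variable name `GOOGLE_PROJECT` is an assumption; use whatever name the deployment scripts expect, and substitute your own project ID):

```shell
#!/usr/bin/env bash
# deployment/k8s/set-google-project.sh -- a sketch; the variable name is an
# assumption, and "my-gcp-project" must be replaced with your own project ID.
export GOOGLE_PROJECT="my-gcp-project"
```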
Alternatively, contact the author to arrange the temporary use of his account if you are reviewing this experiment for a publication.
The experiment is broken down into two sections: Benchmark and Experiment. The Benchmark measures the speed at which we can process increasing amounts of RefSeq fasta-format data on a single node (flat files residing on a Linux file system are, by their nature, not distributed). The Experiment measures the processing of these same sequences from a Kafka topic, on Kafka cluster sizes of 4, 8 and 12 nodes, using Akka actors. The Akka actors are simply vehicles for invoking the same GC Content code as the Benchmark, but in the parallel, streamed and distributed manner that the Kafka topic facilitates. We chose Akka because this is a technology we are familiar with from other research.
Benchmark and Experiment are both broken into two phases: Loader and GCContent.
- The loader phase prepares the data. In the case of the Benchmark, this involves FTPing tarred and zipped files from NCBI, and then untarring, unzipping and running the `blastdbcmd` utility to extract fasta files from the BLAST format. In the case of the Experiment, the loader must do the same as the Benchmark and then publish the sequences into a Kafka topic (called `refseq`).
- The gccontent phase is where the GC Content algorithm (our own Java-based implementation) is run on every refseq sequence.
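The loader's download-and-extract step can be sketched in shell roughly as follows. The URL layout, database name and blastdbcmd flags here are assumptions for illustration; the actual loader performs these steps from Java:

```shell
# Rough sketch of the loader's fetch/extract step for one NCBI volume.
# The URL, database name and flags are assumptions; the real loader is Java code.
fetch_refseq_volume() {
  local n="$1"                                  # volume number, e.g. 00
  wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/refseq_rna.${n}.tar.gz"
  tar -xzf "refseq_rna.${n}.tar.gz"             # untar and unzip
  # Extract every sequence from the BLAST-format database as fasta
  blastdbcmd -db "refseq_rna.${n}" -entry all -outfmt '%f' \
      > "refseq_rna.${n}.fasta"
}
```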
The experimental runs are parameterized along two dimensions:
- Number of files (~ number of sequences). This is effectively the 'size' of the experiment. Each new file downloaded from NCBI increases the number of sequences to be processed.
- Parallelization Factor. In the case of the Benchmark code, this is the number of independent threads we create to perform the loading and gc content processing. In the case of the Experiment, this corresponds to the number of Akka actors created to do the loading/gc content processing. We also use this factor to decide how many partitions to create for each topic.
The loader and gccontent benchmark are run together by launching the gccontent-benchmark Docker image as a container on a single VM. The Docker image, when run, simply invokes the gccontent executable jar, which first downloads and expands the stipulated number of files from NCBI, and then measures the time taken to run the gccontent algorithm over those files, using the stipulated number of threads. Use `docker-machine` to create the node and point your shell at it, and then the `docker` CLI to run the container. The results will be tracked and displayed on the bash shell. For example, to run the benchmark using 4 files and 4 threads:
```
./create-benchmark.sh
eval $(docker-machine env benchmark)
docker run -it blawlor/gccontent-benchmark 4 4
```
This creates a single VM with Docker installed, directs the command line at that VM and then runs the gccontent-benchmark image as a container, instructing it to download 4 files using 4 threads. We typically run this with values 4/4, 8/8, 12/12 and 16/16.
When the run is complete, and the times for downloading and processing have been recorded, don't forget to destroy the VM:
```
docker-machine rm benchmark
```
(It's generally a good idea to use the cloud provider's console to make sure the delete has worked.)
Running the experiment is more complex and can be viewed as two phases:
We must create a multi-node Kubernetes cluster (4, 8 or 12 nodes) and then bring up a multi-node Kafka cluster, including its accompanying Zookeeper instances. To do this, we have leaned heavily on the work done by yolean.
- Run the `create-kafka-cluster.sh` script, passing in the cluster size and your username. E.g.

```
./create-kafka-cluster.sh 4 [email protected]
```
Wait until this is complete. Check completion by occasionally running

```
kubectl get all
```

and verify that all the Kafka instances (4 in this case, named kafka-0 to kafka-3) are created and have a status of Running. Once the Kafka cluster is ready, move on to the experiment.
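Rather than re-running `kubectl get all` by hand, you can poll until the expected number of Kafka pods is Running. This helper is a convenience sketch; the `app=kafka` label selector is an assumption and should be adjusted to match the labels in the yolean-derived manifests:

```shell
# Poll until all N kafka pods report Running. The app=kafka label selector is
# an assumption; adjust it to match the labels used by your manifests.
wait_for_kafka() {
  local size="$1"
  until [ "$(kubectl get pods -l app=kafka --no-headers 2>/dev/null \
             | grep -c ' Running ')" -eq "$size" ]; do
    echo "waiting for $size kafka pods to be Running..."
    sleep 10
  done
}
```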
For each configuration of node size and number of files/sequences/parallelization factor, we must create the specified topics with the correct number of partitions, launch loader agents that can populate the Kafka `refseq` topic, and then launch gc-content agents capable of running the gc-content algorithm in a parallel and coordinated way on the Kafka `refseq` topic, putting the results into the `refseq-gccontent` topic.
In order to measure the run times of the experiment, we need a mechanism for triggering the runs at the same time and detecting when processing is complete. In a distributed system like this, that is not trivial. What follows is an overview of how we achieve it:
- Launch the stipulated number of loader agents. They do nothing until they read an instruction from the `loader` Kafka topic (i.e. we use Kafka itself not only as a source of the data, but also as a message channel for managing the experiment). Wait until all agents are up and running on Kubernetes before proceeding to the next step.
- Launch the loader experiment. This sends the instructions to the `loader` topic and then enters a timed loop, listening to the `refseq` topic until it stops increasing (i.e. until the loading is complete).
- Launch the stipulated number of gccontent agents. They do nothing until they read an instruction from the `gccontent` topic. Wait until all agents are up and running before proceeding to the next step.
- Launch the gccontent experiment. This sends the instructions to the `gccontent` topic and then enters a timed loop, listening to the `refseq-gccontent` topic until it stops increasing (i.e. until the processing is complete).
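The "timed loop until the topic stops increasing" idea can be sketched as follows. This is shell pseudocode for what the experiment drivers do internally in the JVM; `topic_end_offset` stands for a hypothetical helper that returns the topic's total record count (e.g. via Kafka's offset tooling):

```shell
# Sketch of the completion-detection loop: sample the topic's total record
# count until two consecutive samples are equal. topic_end_offset is a
# hypothetical helper -- the real drivers do this inside the JVM.
wait_until_stable() {
  local topic="$1" interval="${2:-30}" prev=-1 cur
  while :; do
    cur="$(topic_end_offset "$topic")"
    [ "$cur" = "$prev" ] && break     # no growth since the last sample: done
    prev="$cur"
    sleep "$interval"
  done
}
```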
- Create the topics, specifying the parallelization factor (i.e. the number of partitions). E.g. for a parallelization factor of 4:

```
cd topics
./create-topics.sh 4
cd ..
```
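What create-topics.sh does is, approximately, the following (a sketch, not the script itself: the topic-creation path, zookeeper address and replication factor are assumptions; consult the script for the real commands):

```shell
# Approximate equivalent of create-topics.sh (an illustration, not the script
# itself): create each topic with the parallelization factor as its partition
# count. The zookeeper address and replication factor are assumptions.
create_topics() {
  local p="$1"
  for topic in loader gccontent refseq refseq-gccontent; do
    kubectl exec kafka-0 -- kafka-topics.sh --create \
        --zookeeper zookeeper:2181 --topic "$topic" \
        --partitions "$p" --replication-factor 2
  done
}
```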
- Launch the required number of loader agents. E.g. to launch 4:

```
./deploy-loader-agents.sh 4
```
- Run the loader experiment, specifying the parallelization factor, and monitor the logs. E.g. for parallelization factor 4:

```
cd loader-experiment
./loader-experiment.sh 4
cd ..
```
- Launch the required number of gccontent agents. E.g. to launch 4:

```
./deploy-gccontent-agents.sh 4
```
- Run the gccontent experiment, specifying the parallelization factor, and monitor the logs. E.g. for parallelization factor 4:

```
cd gccontent-experiment
./gccontent-experiment.sh 4
cd ..
```
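For convenience, the five steps above can be wrapped in a single function (a sketch; it assumes the Kafka cluster is already up and that you run it from the deployment directory containing these scripts):

```shell
# Convenience wrapper for one full experimental run at a given parallelization
# factor. Assumes the Kafka cluster is up and that the scripts above exist
# relative to the current directory.
run_experiment() {
  local p="$1"
  (cd topics && ./create-topics.sh "$p")
  ./deploy-loader-agents.sh "$p"
  (cd loader-experiment && ./loader-experiment.sh "$p")
  ./deploy-gccontent-agents.sh "$p"
  (cd gccontent-experiment && ./gccontent-experiment.sh "$p")
}
```

Invoke it as, e.g., `run_experiment 4`, and repeat for each parallelization factor you want to measure.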
NOTE: Remember to destroy the cluster afterwards. Use

```
./delete-k8s-cluster.sh
```

and verify using GCE's console.