
Automated Integration Testing of Spark applications with Docker, Maven and Jenkins

Introduction

One of the problems with integration testing is that integration tests are usually run against a shared environment, which makes it hard to ensure that the environment is actually suitable for the tests. An environment where developers are actively working is constantly changing, so its consistency cannot be guaranteed.

Therefore it is desirable to have an environment that is created just for running integration tests. Such an environment can closely mimic the production/QA systems while still being in a pristine state.

Here we attempt to create such an integration testing environment using the Maven Failsafe plugin and Docker.

Failsafe Plugin

This plugin lets us hook into three phases of the Maven build lifecycle:

  1. pre-integration-test
  2. integration-test
  3. post-integration-test

We can execute various goals in these phases. In the pre-integration-test phase, the goals create a Docker container running Spark and execute the Spark job.

The integration test can look at the HDFS file system (using WebHDFS) and run assertions on the data produced.

The post-integration-test phase can be used to delete the Docker containers.

Maven Docker Plugin

In this project we are using the Maven Docker plugin published by wouterd:

Maven Docker Plugin by wouterd

This plugin is used to execute a Dockerfile to build the image and then copy input files into the image.

The plugin is also used to launch a container during the testing phase.

Workflow

There are a couple of moving parts in this project. First, the Maven build kicks off; this build invokes the Dockerfile using the Maven Docker plugin.

The Dockerfile launches a shell script. This shell script is responsible for copying all the input files from the container into HDFS.

The shell script then enters an infinite loop to keep the container alive so that the tests can be run against HDFS.

One tricky aspect is how the integration test knows which IP address and ports to use to query WebHDFS.

This is simplified by the following configuration:

<systemPropertyVariables>
    <webhdfsurl>http://${docker.containers.myapp.ports.50070/tcp.host}:${docker.containers.myapp.ports.50070/tcp.port}</webhdfsurl>
    <datanodeurl>http://${docker.containers.myapp.ports.50075/tcp.host}:${docker.containers.myapp.ports.50075/tcp.port}</datanodeurl>
</systemPropertyVariables>

This way we tell the plugin to dynamically determine which host ports have been mapped to the container's 50070 and 50075 ports, and the tests can pick them up at runtime.

Once the integration test receives the IP address and port of the WebHDFS REST service, querying WebHDFS and running validations against the output produced by the Spark job is straightforward.
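
As a rough illustration, such a test could read the system properties set above and hit the WebHDFS REST API directly. The sketch below is not taken from this repository: it assumes a JUnit-based test (Failsafe picks up classes ending in IT by default), and the class name, the /output path and the part-file assertion are illustrative only.

import org.junit.Assert.assertTrue
import org.junit.Test
import scala.io.Source

// Failsafe runs classes matching *IT during the integration-test phase.
class SparkJobIT {

  // Resolved by Failsafe from the Docker plugin's dynamic port mapping (see the
  // <systemPropertyVariables> configuration above).
  private val webHdfsUrl = System.getProperty("webhdfsurl")

  @Test
  def sparkJobWroteOutputToHdfs(): Unit = {
    // List the (hypothetical) output directory via the WebHDFS REST API.
    val listing = httpGet(s"$webHdfsUrl/webhdfs/v1/output?op=LISTSTATUS")
    // A successful Spark job leaves at least one part file behind.
    assertTrue("expected a part file in /output", listing.contains("part-"))
  }

  private def httpGet(url: String): String = {
    val source = Source.fromURL(url)
    try source.mkString finally source.close()
  }
}

The same pattern works for the datanodeurl property when the test needs to read file contents (op=OPEN) rather than just list the directory.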
