This example shows how to compile a Java Spark program, programmatically submit it to a BigInsights cluster, and wait for the Spark job response. An example use case for programmatically running Spark jobs on a BigInsights cluster is a client application that needs to process data in BigInsights and send the results to a third-party application. Here the client application could be an ETL tool job, a microservice, or some other custom application.
This example uses Apache Oozie to run and monitor the job. Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
BigInsights for Apache Hadoop clusters are secured using Apache Knox. Apache Knox is a REST API gateway for interacting with Apache Hadoop clusters. The Apache Knox project provides a Java client library that can be used for interacting with the Oozie REST API. Using the library is usually more productive than working directly with the REST API because the library provides a higher level of abstraction.
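To give a flavour of that abstraction, here is a minimal sketch of a Knox client session, assuming the Knox 0.x client library that BigInsights-era clusters ship (Java package `org.apache.hadoop.gateway.shell`). The gateway URL, credentials and paths are placeholders, and error handling is omitted:

```java
// A minimal sketch, not the example's own code: open an authenticated
// session against the Knox gateway and make one WebHDFS call through it.
import org.apache.hadoop.gateway.shell.Hadoop;
import org.apache.hadoop.gateway.shell.hdfs.Hdfs;

public class KnoxSessionSketch {
    public static void main(String[] args) throws Exception {
        Hadoop session = Hadoop.login(
                "https://your-cluster:8443/gateway/default",  // placeholder gateway URL
                "username", "password");                      // placeholder credentials

        // One fluent call stands in for an authenticated HTTP PUT
        // (including redirect handling) against the WebHDFS REST API.
        Hdfs.put(session)
            .file("workflow.xml")
            .to("/user/snowch/test/workflow.xml")
            .now();

        session.shutdown();
    }
}
```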
Developers will gain the most from these examples if they:
- Are comfortable using Windows, OS X or *nix command prompts
- Are familiar with compiling and running Java applications
- Can read code written in a high-level language such as Groovy
- Are familiar with the Gradle build tool
- Are familiar with Spark concepts
- Are familiar with Oozie concepts
- Are familiar with the WebHdfsGroovy Example
- Meet the pre-requisites in the top level README
- Have performed the setup instructions in the top level README
This example consists of three parts:
- WordCount.java - Spark Java code (its word-count logic is sketched below)
- Example.groovy - Groovy script to submit the Spark job to Oozie
- build.gradle - Gradle script to compile and package the Spark code and execute the Example.groovy script
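For orientation, a Spark word count typically looks like the sketch below. It is written with Java 8 lambdas for brevity, whereas the example itself compiles at source level 1.7 (note the compiler warning in the output further down) and so spells these operations out as anonymous inner classes; the class name and argument layout are illustrative, not a verbatim copy of the example's WordCount.java:

```java
// Hedged sketch of typical Spark 1.x word-count logic. Input and output
// paths arrive as program arguments (here, supplied by the Oozie workflow).
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

        JavaPairRDD<String, Integer> counts = sc.textFile(args[0])
                .flatMap(line -> Arrays.asList(line.split("\\s+")))  // one record per word
                .mapToPair(word -> new Tuple2<>(word, 1))            // (word, 1) pairs
                .reduceByKey((a, b) -> a + b);                       // sum the pairs per word

        // writes lines such as "(compliance,1)", as seen in the output below
        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}
```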
To run the example, open a command prompt window:
- change into the directory containing this example and run gradle to execute the example
./gradlew Example    (OS X / *nix)
gradlew.bat Example  (Windows)
- some output from running the command on my machine is shown below
BigInsights-on-Apache-Hadoop $ cd examples/OozieWorkflowSparkGroovy
BigInsights-on-Apache-Hadoop/examples/OozieWorkflowSparkGroovy $ ./gradlew Example
:compileJava UP-TO-DATE
:compileGroovy
warning: [options] bootstrap class path not set in conjunction with -source 1.7
1 warning
:processResources UP-TO-DATE
:classes
:jar
:Example
[Example.groovy] Delete /user/snowch/test: 200
[Example.groovy] Mkdir /user/snowch/test: 200
[Example.groovy] Uploading jar file may take some time...
[Example.groovy] Put /user/snowch/test/workflow.xml: 201
[Example.groovy] Put /user/snowch/test/input/FILE: 201
[Example.groovy] Put /user/snowch/test/lib/OozieWorkflowSparkGroovy.jar 201
[Example.groovy] Downloading spark assembly jar file, may take some time...
[Example.groovy] Uploading spark assembly jar file to job lib dir, may take some time...
[Example.groovy] Submitted job: 0000006-160804212652051-oozie-oozi-W
[Example.groovy] Polling up to 300s for job completion...
...
(compliance,1)
(comment,1)
(subsequently,1)
(replaced,1)
(mechanical,1)
(Contributor,8)
The output above shows:
- the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
- the steps performed by the Example.groovy script (output is prefixed by '[Example.groovy]')
- the Spark output, which is the number of times each word appears in an Apache 2.0 LICENSE file
The example uses a gradle build file, build.gradle, when you run `./gradlew` or `gradlew.bat`. The build.gradle does the following (an illustrative skeleton is sketched below):
- compiles the Java Spark code. The `apply plugin: 'java'` statement controls this.
- creates a jar file containing the compiled Spark code. The `jar { ... }` statement controls this.
- compiles and executes the Example.groovy script. The `task('Example' ...)` statement controls this.
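An illustrative skeleton of how those pieces could fit together is shown below; the task wiring and names are schematic, and the example's own build.gradle adds dependencies, credentials and paths that are omitted here:

```groovy
// Sketch only: the general shape of a build file for this example,
// not its actual contents.
apply plugin: 'java'    // compiles the Java Spark code
apply plugin: 'groovy'  // compiles the Groovy scripts

jar {
    // package the compiled Spark classes for upload to HDFS
    baseName = 'OozieWorkflowSparkGroovy'
}

task Example(dependsOn: jar, type: JavaExec) {
    // run the compiled Example.groovy script once the jar exists
    main = 'Example'
    classpath = sourceSets.main.runtimeClasspath
}
```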
The Example.groovy script performs the following steps (the workflow document and the submit/poll calls are sketched after this list):
- Create an Oozie workflow XML document
- Create an Oozie configuration XML document
- Create an HTTP session on the BigInsights Knox REST API
- Upload the workflow XML file over WebHDFS
- Upload an Apache 2.0 LICENSE file over WebHDFS
- Upload the jar file (containing the Java Spark code) over WebHDFS
- Download the spark-assembly jar file over WebHDFS
- Upload the spark-assembly jar file over WebHDFS to the job lib directory
- Submit the Oozie workflow job using the Knox REST API for Oozie
- Every second, check the status of the Oozie workflow job using the Knox REST API for Oozie
- When the Oozie workflow job successfully finishes, download the wordcount output from the Java Spark job
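The workflow document centres on Oozie's Spark action. A hedged sketch of its general shape follows; the element names come from the `uri:oozie:spark-action:0.1` schema, but the exact names, paths and parameters that Example.groovy writes may differ:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="SparkWordCount">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>WordCount</name>
            <class>WordCount</class>
            <!-- the jar uploaded to the job lib directory by Example.groovy -->
            <jar>${nameNode}/user/${userName}/test/lib/OozieWorkflowSparkGroovy.jar</jar>
            <arg>${nameNode}/user/${userName}/test/input/FILE</arg>
            <arg>${nameNode}/user/${userName}/test/output</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```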
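For the submit and poll steps, the Knox client's workflow classes wrap the Oozie REST API. A compressed sketch under the same placeholder assumptions as the earlier session snippet is shown below; the JSON field extraction is deliberately crude, and `configXml` stands for the Oozie configuration document built earlier:

```java
// Hedged sketch of submit-and-poll via the Knox client library; not the
// example's actual code. Error handling and robust JSON parsing omitted.
import org.apache.hadoop.gateway.shell.Hadoop;
import org.apache.hadoop.gateway.shell.workflow.Workflow;

public class SubmitAndPollSketch {
    // crude regex extraction of a string field from a JSON response body
    static String field(String json, String name) {
        return json.replaceAll("(?s).*\"" + name + "\"\\s*:\\s*\"([^\"]+)\".*", "$1");
    }

    public static void main(String[] args) throws Exception {
        Hadoop session = Hadoop.login(
                "https://your-cluster:8443/gateway/default",  // placeholder
                "username", "password");                      // placeholders
        String configXml = args[0];  // Oozie configuration XML, built elsewhere

        String jobId = field(
                Workflow.submit(session).text(configXml).now().getString(), "id");
        System.out.println("Submitted job: " + jobId);

        String status = "RUNNING";
        for (int i = 0; i < 300 && "RUNNING".equals(status); i++) {
            Thread.sleep(1000);  // poll once per second, for up to 300 seconds
            status = field(
                    Workflow.status(session).jobId(jobId).now().getString(), "status");
        }
        System.out.println("Job finished with status: " + status);
        session.shutdown();
    }
}
```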
All of the code is well commented; it is suggested that you browse build.gradle and the *.groovy scripts to understand in more detail how they work.