Apache Spark is a fast and general engine for large-scale data processing. In this recipe we'll see how to launch jobs from the Apache Spark-Shell that read and write data on a MinIO server.
- Install MinIO Server from here.
- Download Apache Spark version spark-2.3.0-bin-without-hadoop from here.
- Download Apache Hadoop version hadoop-2.8.2 from here.
- Download the other dependency jar files (the hadoop-aws, aws-java-sdk, httpclient and joda-time jars referenced in the spark-shell command below).
- Extract the spark-2.3.0-bin-without-hadoop tarball in the directory where you're planning to install Spark.
- Extract the hadoop-2.8.2 tarball in a separate directory. Copy the path to this directory.
- Create a directory called bin inside the directory where spark-2.3.0-bin-without-hadoop was unzipped, then move all the dependency jar files (downloaded in the previous step) into this directory. A shell sketch of these extraction steps follows this list.
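If it helps, here is a minimal shell sketch of the extraction steps above. The archive file names and the assumption that everything was downloaded to the current directory are illustrative, so adjust the paths to match your setup.

tar -xzf spark-2.3.0-bin-without-hadoop.tgz   # Spark installation directory
tar -xzf hadoop-2.8.2.tar.gz                  # extract this in a separate directory in practice; note its path for HADOOP_HOME
mkdir bin                                     # sits next to the Spark directory and holds the dependency jars
mv hadoop-aws-2.8.2.jar aws-java-sdk-*.jar httpclient-4.5.3.jar joda-time-2.9.9.jar bin/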
Navigate to the directory where you extracted spark-2.3.0-bin-without-hadoop, and set the following environment variables:
export SPARK_HOME=/path/to/spark-2.3.0-bin-without-hadoop
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_HOME=/path/to/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
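You can optionally confirm that the hadoop binary resolves and that SPARK_DIST_CLASSPATH was populated:

hadoop version
echo $SPARK_DIST_CLASSPATH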
Then, open the file $HADOOP_HOME/etc/hadoop/core-site.xml for editing. In this example the MinIO server is running at http://127.0.0.1:9000 with access key minio and secret key minio123. Make sure to update the relevant sections with a valid MinIO server endpoint and credentials.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <description>AWS S3 endpoint to connect to. An up-to-date list is
      provided in the AWS Documentation: regions and endpoints. Without this
      property, the standard region (s3.amazonaws.com) is assumed.
    </description>
    <value>http://127.0.0.1:9000</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID.</description>
    <value>minio</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key.</description>
    <value>minio123</value>
  </property>

  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
    <description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
      Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
    </description>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    <description>The implementation class of the S3A Filesystem</description>
  </property>
</configuration>
Then start Spark-Shell by running
./bin/spark-shell --master local[4] --jars "../bin/hadoop-aws-2.8.2.jar,../bin/httpclient-4.5.3.jar,../bin/aws-java-sdk-core-1.11.524.jar,../bin/aws-java-sdk-kms-1.11.524.jar,../bin/aws-java-sdk-1.11.524.jar,../bin/aws-java-sdk-s3-1.11.524.jar,../bin/joda-time-2.9.9.jar"
You should see a prompt like this:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
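Optionally, to confirm that the core-site.xml settings were picked up, you can query the Hadoop configuration from the prompt; it should print the endpoint you configured:

sc.hadoopConfiguration.get("fs.s3a.endpoint")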
Spark History Server provides a web UI for completed and running Spark applications. Once Spark jobs are configured to log events, the history server displays both completed and incomplete Spark jobs. If an application makes multiple attempts after failures, the failed attempts will be displayed, as well as any ongoing incomplete attempt or the final successful attempt.
MinIO can be used as the storage back-end for the Spark History Server via the s3a file system. Since the $HADOOP_HOME/etc/hadoop/core-site.xml file is already configured with the s3a file system details, we now need to set up the conf/spark-defaults.conf file so the history server uses s3a to store the log files.
By default the conf directory has a spark-defaults.conf.template file. Make a copy of the template file, rename it to spark-defaults.conf, and add the below content to the file:
spark.jars.packages net.java.dev.jets3t:jets3t:0.9.4,com.google.guava:guava:14.0.1,com.amazonaws:aws-java-sdk:1.11.234,org.apache.hadoop:hadoop-aws:2.8.2
spark.eventLog.enabled true
spark.eventLog.dir s3a://spark/
spark.history.fs.logDirectory s3a://spark/
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
The next step is to add the jar files specified under spark.jars.packages to the jars directory. Once the files are added, create a new bucket called spark in the MinIO instance specified in the $HADOOP_HOME/etc/hadoop/core-site.xml file, since we specified the log directory as s3a://spark/.
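For example, assuming an mc alias called myminio pointing at this MinIO instance (the same alias used in the read test below), the bucket can be created with:

mc mb myminio/spark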
Finally, start the Spark history server using
./sbin/start-history-server.sh
If everything works fine, you should be able to see the console at http://localhost:18080/.
For this recipe, the MinIO server is running on http://127.0.0.1:9000. To test if reads work from Spark-Shell, create a bucket called spark-test on your MinIO server and upload a test file. Here is how to do this with mc:
mc config host add myminio http://127.0.0.1:9000 minio minio123
mc mb myminio/spark-test
mc cp test.json myminio/spark-test/test.json
Now, switch to the Spark-Shell terminal and type
val b1 = sc.textFile("s3a://spark-test/test.json")
b1.collect().foreach(println)
You should be able to see the text file you uploaded to the MinIO server. If you configured the Spark history server as described in step 4, you can also see the event logs on the console at http://localhost:18080/.
To test if Spark-Shell can write back to the MinIO server, switch to the Spark-Shell terminal and run:
import spark.implicits._
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.saveAsTextFile("s3a://spark-test/test-write")
You should see a prefix called test-write created inside the bucket spark-test, with the data written under that prefix.
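If you'd like to verify the write from Spark-Shell itself, the new prefix can be read back just like the earlier test file:

sc.textFile("s3a://spark-test/test-write").collect().foreach(println)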
In case you are planning to use self-signed certificates for testing purposes, you'll need to add these certificates to the local JRE cacerts. You can do this using:
keytool -import -trustcacerts -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -noprompt -alias mycert -file /home/username/.minio/certs/public.crt
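To optionally confirm the certificate was imported, you can list it by its alias (mycert in the command above):

keytool -list -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -alias mycert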
Also, change the scheme to https in fs.s3a.endpoint in the file $HADOOP_HOME/etc/hadoop/core-site.xml.
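For example, assuming MinIO is still serving on port 9000 but now over TLS, the endpoint property would look like:

<property>
  <name>fs.s3a.endpoint</name>
  <value>https://127.0.0.1:9000</value>
</property>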