Dockerized_Demo

Harish Butani edited this page Jan 13, 2022 · 4 revisions

The following procedure describes how to bring up a Spark Oracle demo environment in a Docker container. The prerequisite is that Docker is set up on your host machine. This has been tested only on Macintosh computers. (It will probably work on a Linux host.)

This is for demonstration purposes only. We do not support the Docker environment.

Dockerbuilder tool

The dockerbuilder sub-project is a utility to construct a Dockerfile for spark-on-oracle. In an empty folder, unzip the docker builder artifact and copy spark-oracle-0.1.0-SNAPSHOT.zip into the same folder. Then run ./sparkOraDockerBuilder-0.1.0-SNAPSHOT without any options to see what is required. (You may need to run the command outside of the VPN for the Spark and Zeppelin download URL checks to work.)

Usage: sparkOraDockerBuilder [options]

  -m, --spark_mem <value>  Memory in Mb/Gb for spark; for example 512m or 2g.
                           When running the tpcds demo set it to 4g
  -c, --spark_cores <value>
                           num_cores for Spark.
                           When running the tpcds demo, set it to at least 4.
  -j, --oracle_instance_jdbc_url <value>
                           JDBC connection information for the Oracle instance.
                           For example: "jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com

                           Specify the ip-addr of host; otherwise you may need
                           to edit the /etc/resolv.conf of the docker container.
  -u, --oracle_instance_username <value>
                           Oracle username to connect the Oracle instance.
  -p, --oracle_instance_password <value>
                           Oracle password for the Oracle user.
                           Either provide the password or location of a wallet.
  -w, --oracle_instance_wallet_loc <value>
                           Oracle password for the Oracle user.
                           Either provide the password or location of a wallet.
  -s, --spark_download_url <value>
                           url to download Apache Spark. Spark version must be 3.1.0 or above.
                           For example: https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
  -z, --zeppelin_download_url <value>
                           URL to download Apache Zeppelin. Spark version must be 0.9.0 or above.
                           For example: https://downloads.apache.org/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz
  -o, --spark_ora_zip <value>
                           Location of spark-oracle package.
                           For example: ~/Downloads/spark-oracle-0.1.0-SNAPSHOT.zip

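In the usage text above, -w takes the location of the wallet (its description repeats the password text), and the version requirement listed under -z applies to Zeppelin, not Spark.

Provide the specified options: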

  • For example, to use our development environment, run:
./sparkOraDockerBuilder-0.1.0-SNAPSHOT -c 4 -m 4g \
  -j jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com \
  -u tpcds -p tpcds \
  -s https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz \
  -z https://mirrors.ocf.berkeley.edu/apache/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz \
  -o spark-oracle-0.1.0-SNAPSHOT.zip

This will create a Dockerfile and associated files for the specified options.
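Because the builder checks the Spark and Zeppelin download URLs, it can help to confirm from your current network that both URLs answer before running it. The following is a minimal sketch, assuming curl is installed on the host; url_reachable and check_download_urls are hypothetical helper names introduced here and are not part of spark-on-oracle, and the check is not necessarily the same one the builder performs.

```shell
#!/bin/sh
# Sketch: check that download URLs answer before passing them to the builder.
# url_reachable and check_download_urls are hypothetical helpers for this page.

url_reachable() {
  # -f: fail on HTTP errors, -s: quiet, -I: HEAD request only, -L: follow redirects
  curl -fsIL --max-time 20 -o /dev/null "$1"
}

check_download_urls() {
  # Report every URL that does not answer; return non-zero if any failed.
  status=0
  for url in "$@"; do
    if url_reachable "$url"; then
      echo "ok: $url"
    else
      echo "unreachable: $url" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, check_download_urls followed by the values you intend to pass to -s and -z prints one line per URL. If a URL is reported as unreachable while you are on the VPN, retrying off the VPN is the first thing to try.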

Notes on the Dockerfile and docker container:

  • For the Oracle instance JDBC URL, specify an IP address instead of a hostname. If you specify a hostname, you will have to edit /etc/resolv.conf in the docker container. For example:
-- you may need to add to /etc/resolv.conf

search us.oracle.com
nameserver 2606:B400:300:D:FEED::1
nameserver 2606:B400:300:D:FEED::2
nameserver 206.223.27.1
nameserver 206.223.27.2

The docker container

  • Is set up with Apache Spark, the Spark-Oracle extension, and Apache Zeppelin.
  • Currently we have not enabled Apache Zeppelin. We are working on notebooks for the demo.
  • The default command for the container is spark-shell. You can follow the steps in the Demo.
  • To build the docker image issue something like: docker image build -t spark_ora_demo:latest .
  • Then to run the container issue something like:
docker run -it -p 8080:8080 -p 4040:4040 --rm spark_ora_demo:latest
  • Give the port options -p 8080:8080 -p 4040:4040 so you can see the Spark UI and, when available, the Zeppelin notebooks from a host browser.
  • The Dockerfile is set up with a CMD that starts spark-shell, so you will be in the spark-shell when your terminal enters the container.
  • Once there, start by issuing sql("use oracle") and then follow the steps in the demo.
  • The container starts with an empty metadata cache, so the first time you execute a query (even in pushdown=true mode) it will take several seconds longer than usual. This is because Oracle table metadata (including partition information) is pulled into the metadata_cache on demand, the first time a query is issued against a table.
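The build and run commands above can be collected into a small script so that the image tag and port options stay consistent between runs. This is a minimal sketch: run_or_print, build_and_run_demo, and the DRY_RUN switch are hypothetical names introduced for illustration, and the docker commands are the ones shown in the list above.

```shell
#!/bin/sh
# Sketch: build the demo image and start the container with both ports published.
# IMAGE_TAG defaults to the tag used above; DRY_RUN is a hypothetical switch.
IMAGE_TAG="${IMAGE_TAG:-spark_ora_demo:latest}"

run_or_print() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

build_and_run_demo() {
  # Build from the Dockerfile in the current folder, then start the container.
  # The container's default CMD drops you into spark-shell.
  run_or_print docker image build -t "$IMAGE_TAG" . &&
  run_or_print docker run -it -p 8080:8080 -p 4040:4040 --rm "$IMAGE_TAG"
}
```

Run build_and_run_demo from the folder that contains the generated Dockerfile. Setting DRY_RUN=1 first prints the two docker commands without executing them, which is a cheap way to confirm the tag and ports before building.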