Dockerized_Demo

Harish Butani edited this page Mar 6, 2021 · 4 revisions

The following procedure describes how to bring up a Spark Oracle demo environment in a Docker container. The prerequisite is that Docker is set up on your host machine. This has only been tested on Macs (it will probably work on a Linux host).

This is purely for demo purposes. We do not support the Docker environment.

Dockerbuilder tool

As part of a release we provide a sparkOraDockerBuilder-0.1.0-SNAPSHOT utility. Download it to an empty folder, and download spark-oracle-0.1.0-SNAPSHOT.zip to the same folder. Then run ./sparkOraDockerBuilder-0.1.0-SNAPSHOT without any options to see what is required. (You may need to run the command outside of a VPN for the Spark and Zeppelin download URL checks to work.)

Usage: sparkOraDockerBuilder [options]

  -m, --spark_mem <value>  memory in Mb/Gb for spark; for example 512m or 2g.
                           when running the tpcds demo set it to 4g
  -c, --spark_cores <value>
                           num_cores for spark.
                           when running the tpcds demo set it to at-least 4
  -j, --oracle_instance_jdbc_url <value>
                           jdbc connection information for the oracle instance.
                           for example: "jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com

                           specify the ip-addr of host; otherwise you may need
                           to edit the /etc/resolv.conf of the docker container."
  -u, --oracle_instance_username <value>
                           Oracle username to connect the oracle instance
  -p, --oracle_instance_password <value>
                           Oracle password for the oracle user.
                           Either provide the password or location of a wallet.
  -w, --oracle_instance_wallet_loc <value>
                           Oracle password for the oracle user.
                           Either provide the password or location of a wallet.
  -s, --spark_download_url <value>
                           url to download apache spark. Spark version must be 3.1.0 or above.
                           for example: https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
  -z, --zeppelin_download_url <value>
                           url to download apache zeppelin. Spark version must be 0.9.0 or above.
                           for example: https://downloads.apache.org/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz
  -o, --spark_ora_zip <value>
                           location of spark-oracle package.
                           for example: ~/Downloads/spark-oracle-0.1.0-SNAPSHOT.zip
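
Before running the builder with real options, it can help to confirm that both downloads sit in the same folder, as described above. The following is a minimal sketch only: check_demo_folder is a hypothetical helper, not part of the release, and the executable-bit check assumes the builder is launched directly as ./sparkOraDockerBuilder-0.1.0-SNAPSHOT.

```shell
# Hypothetical pre-flight helper (not shipped with the release).
# Checks that the builder and the spark-oracle zip are in the given folder.
BUILDER="sparkOraDockerBuilder-0.1.0-SNAPSHOT"
ORA_ZIP="spark-oracle-0.1.0-SNAPSHOT.zip"

check_demo_folder() {
  dir="${1:-.}"
  missing=0
  for f in "$BUILDER" "$ORA_ZIP"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Assumption: the builder is run as ./<name>, so it needs the execute bit.
  if [ -f "$dir/$BUILDER" ] && [ ! -x "$dir/$BUILDER" ]; then
    echo "not executable: $BUILDER (try: chmod +x $BUILDER)"
    missing=1
  fi
  if [ "$missing" -eq 0 ]; then
    echo "ok"
  fi
  return "$missing"
}
```

Calling check_demo_folder . from the download folder prints ok when both files are present, and otherwise lists what is missing.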

Provide the specified options:

  • For example, if you want to use our dev environment, run:
./sparkOraDockerBuilder-0.1.0-SNAPSHOT -c 4 -m 4g \
  -j jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com \
  -u tpcds -p tpcds \
  -s https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz \
  -z https://mirrors.ocf.berkeley.edu/apache/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz \
  -o spark-oracle-0.1.0-SNAPSHOT.zip

This will create a Dockerfile and associated files for the specified options.

Notes on the Dockerfile and docker container:

  • For the Oracle instance JDBC URL, specify an IP address instead of a hostname. If you specify a hostname, you will have to edit /etc/resolv.conf in the Docker container. For example:
-- you may need to add to /etc/resolv.conf

search us.oracle.com
nameserver 2606:B400:300:D:FEED::1
nameserver 2606:B400:300:D:FEED::2
nameserver 206.223.27.1
nameserver 206.223.27.2
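
To catch the hostname case before building, the JDBC URL can be inspected up front. This is a rough sketch, assuming the URL has the thin-driver shape used on this page (jdbc:oracle:thin:@host:port/service). The functions jdbc_host, is_ipv4 and check_jdbc_host are hypothetical names introduced for illustration, and the IPv4 test is a loose pattern match rather than full validation.

```shell
# Hypothetical helpers: pull the host out of a thin-driver JDBC URL and
# report whether it looks like an IPv4 address.
jdbc_host() {
  rest="${1#*@}"            # drop everything up to and including '@'
  rest="${rest#//}"         # tolerate an optional '//' after '@'
  echo "${rest%%[:/]*}"     # keep the text before the first ':' or '/'
}

is_ipv4() {
  # Loose check: four dot-separated groups of 1-3 digits.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

check_jdbc_host() {
  host="$(jdbc_host "$1")"
  if is_ipv4 "$host"; then
    echo "ip: $host"
  else
    echo "hostname: $host (the container may need /etc/resolv.conf edits)"
  fi
}
```

For instance, check_jdbc_host with the URL from the example command above reports an IP address, so no resolver changes should be needed for that URL.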

The docker container

  • is set up with Apache Spark, the Spark-Oracle extension and Apache Zeppelin.
  • Currently we have not enabled Apache Zeppelin. We are working on notebooks for the Demo.
  • The default command for the container is spark-shell. You can follow the steps in the Demo.
  • To build the docker image issue something like: docker image build -t spark_ora_demo:latest .
  • Then to run the container issue something like:
docker run -it -p 8080:8080 -p 4040:4040 --rm spark_ora_demo:latest
  • Pass the port options -p 8080:8080 -p 4040:4040 so you can see the Spark UI and, when available, the Zeppelin notebooks from a host browser.
  • The Dockerfile is set up with a CMD that starts the spark-shell, so you will be in the spark-shell when your terminal enters the container.
  • Once there, start by issuing sql("use oracle") and then follow the steps in the demo.
  • The container starts with an empty metadata cache, so the first execution of a query (even in pushdown=true mode) takes several seconds longer than usual. This is because Oracle table metadata (including partition information) is pulled into the metadata cache on demand, the first time a query is issued against a table.
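
The build and run steps above can be wrapped in a small script. This is only a sketch: the run wrapper and the DRY_RUN switch are hypothetical additions for illustration, and the image tag and ports are the ones used on this page. With DRY_RUN=1 (the default here) the docker commands are printed instead of executed.

```shell
# Sketch of the build-and-run sequence from the notes above.
# DRY_RUN is a hypothetical switch: 1 prints the commands, 0 executes them.
IMAGE="spark_ora_demo:latest"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_and_run_demo() {
  # Build the image from the generated Dockerfile in the current folder.
  run docker image build -t "$IMAGE" . || return 1
  # Start the container interactively, publishing the two ports listed above;
  # --rm removes the container when the spark-shell exits.
  run docker run -it -p 8080:8080 -p 4040:4040 --rm "$IMAGE"
}
```

Run build_and_run_demo from the folder containing the generated Dockerfile. Setting DRY_RUN=0 first makes it execute the commands rather than print them.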