Dockerized_Demo
The following procedure describes how to bring up a Spark Oracle demo environment in a docker container. The prerequisite is that you have Docker set up on your host machine. This has only been tested on Macs (it will probably work on a Linux host).
This is purely for demo purposes; we are not supporting the docker environment.
The dockerbuilder sub-project is a utility to construct a Dockerfile for spark-on-oracle.
In an empty folder, unzip the docker builder artifact. You should also copy the spark-oracle-0.1.0-SNAPSHOT.zip to this folder.
Then run ./sparkOraDockerBuilder-0.1.0-SNAPSHOT without any options to see what is required.
(You may need to run the command outside of the VPN for the Spark and Zeppelin download URL checks to work.)
Usage: sparkOraDockerBuilder [options]
-m, --spark_mem <value> memory in Mb/Gb for spark; for example 512m or 2g.
when running the tpcds demo set it to 4g
-c, --spark_cores <value>
num_cores for spark.
when running the tpcds demo set it to at-least 4
-j, --oracle_instance_jdbc_url <value>
jdbc connection information for the oracle instance.
for example: "jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com
specify the ip-addr of host; otherwise you may need
to edit the /etc/resolv.conf of the docker container."
-u, --oracle_instance_username <value>
Oracle username to connect to the oracle instance.
-p, --oracle_instance_password <value>
Oracle password for the oracle user.
Either provide the password or location of a wallet.
-w, --oracle_instance_wallet_loc <value>
Location of an Oracle wallet for the oracle user.
Either provide the password or the location of a wallet.
-s, --spark_download_url <value>
url to download apache spark. Spark version must be 3.1.0 or above.
for example: https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
-z, --zeppelin_download_url <value>
url to download apache zeppelin. Zeppelin version must be 0.9.0 or above.
for example: https://downloads.apache.org/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz
-o, --spark_ora_zip <value>
location of spark-oracle package.
for example: ~/Downloads/spark-oracle-0.1.0-SNAPSHOT.zip
Provide the specified options:
- for example, if you want to use our dev. environment, run:
./sparkOraDockerBuilder-0.1.0-SNAPSHOT -c 4 -m 4g \
-j jdbc:oracle:thin:@10.89.206.230:1531/cdb1_pdb7.regress.rdbms.dev.us.oracle.com \
-u tpcds -p tpcds \
-s https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz \
-z https://mirrors.ocf.berkeley.edu/apache/zeppelin/zeppelin-0.9.0/zeppelin-0.9.0-bin-netinst.tgz \
-o spark-oracle-0.1.0-SNAPSHOT.zip
This will create a Dockerfile and associated files for the specified options.
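Once the container is running and you are in its spark-shell (see the notes below), you can sanity-check that the options you passed to the builder were wired into the Spark configuration. This is only a sketch: the spark.sql.catalog.oracle.* prefix is an assumption based on Spark 3 catalog-plugin naming and on the oracle catalog used in the demo; check the configuration files the builder generates for the exact keys.

```scala
// List the Spark conf entries for the oracle catalog.
// The "spark.sql.catalog.oracle" prefix is an assumption (Spark 3 catalog-plugin
// convention); consult the generated Spark configuration for the exact keys.
spark.conf.getAll
  .filter { case (key, _) => key.startsWith("spark.sql.catalog.oracle") }
  .foreach { case (key, value) => println(s"$key = $value") }
```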
Notes on the Dockerfile and docker container:
- for the oracle instance jdbc url, specify an ip-addr instead of a hostname.
If you specify a hostname you will have to edit the /etc/resolv.conf in the docker container. For example, you may need to add to /etc/resolv.conf:
search us.oracle.com
nameserver 2606:B400:300:D:FEED::1
nameserver 2606:B400:300:D:FEED::2
nameserver 206.223.27.1
nameserver 206.223.27.2
- the container is set up with Apache Spark, the Spark-Oracle extension, and Apache Zeppelin.
- Currently we have not enabled Apache Zeppelin. We are working on notebooks for the Demo.
- The default command for the container is spark-shell; you can follow the steps in the Demo.
- To build the docker image, issue something like:
docker image build -t spark_ora_demo:latest .
- Then to run the container, issue something like:
docker run -it -p 8080:8080 -p 4040:4040 --rm spark_ora_demo:latest
- give the port options -p 8080:8080 -p 4040:4040 so that you can see the Spark UI and, when available, the Zeppelin notebooks from a host browser.
- The Dockerfile is set up with a CMD to start the spark-shell, so you will be in the spark-shell when your terminal enters the container.
- Once there, start by issuing sql("use oracle") and then follow the steps in the demo (see the sketch after these notes).
- The container starts with an empty metadata-cache, so you will notice that the first time you execute a query (even in pushdown=true mode) it takes several seconds more than usual. This is because oracle table metadata (including partition information) is pulled into the metadata-cache on demand, the first time a query is issued against a table.
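As a concrete starting point, here is a minimal spark-shell sketch of those first steps. The store_sales table name is an assumption (a TPC-DS table, matching the tpcds user in the example above); substitute any table that exists in your Oracle instance. The timing helper is only there to make the cold-versus-warm metadata-cache behaviour visible.

```scala
// Run inside the container's spark-shell (`spark` and `sql` are predefined there).
sql("use oracle")                           // switch to the Oracle catalog
sql("show tables").show(truncate = false)   // list tables visible through the catalog

// Crude timer, just to make the metadata-cache warm-up visible.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// store_sales is an assumed TPC-DS table name; use any table in your instance.
timed("cold run") { sql("select count(*) from store_sales").show() } // pulls table metadata into the cache
timed("warm run") { sql("select count(*) from store_sales").show() } // metadata already cached
```

The second run of the same query should be noticeably faster, since the table's metadata is already in the cache.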