Spark Connector OceanBase

English | 简体中文

The Spark OceanBase Connector supports reading data stored in OceanBase through Spark, as well as writing data to OceanBase through Spark.

|           | Read | Write             |
|-----------|------|-------------------|
| DataFrame | JDBC | JDBC, Direct Load |
| SQL       | JDBC | JDBC, Direct Load |

Version compatibility

| Connector | Spark          | OceanBase                                   | Java | Scala |
|-----------|----------------|---------------------------------------------|------|-------|
| 1.0       | 2.4, 3.1 ~ 3.4 | JDBC: 3.x, 4.x; Direct Load: 4.2.x or later | 8    | 2.12  |

Note: If you need a package built against another Scala version, you can build it from source code.

Get the package

You can get the release packages at Releases Page or Maven Central.

<dependency>
    <groupId>com.oceanbase</groupId>
    <artifactId>spark-connector-oceanbase-3.4_2.12</artifactId>
    <version>${project.version}</version>
</dependency>

If you'd rather use the latest snapshots of the upcoming major version, use our Maven snapshot repository and declare the appropriate dependency version.

<dependency>
    <groupId>com.oceanbase</groupId>
    <artifactId>spark-connector-oceanbase-3.4_2.12</artifactId>
    <version>${project.version}</version>
</dependency>

<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <name>Sonatype Snapshot Repository</name>
        <url>https://s01.oss.sonatype.org/content/repositories/snapshots/</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

Of course, you can also get the package by building from source code.

git clone https://github.com/oceanbase/spark-connector-oceanbase.git
cd spark-connector-oceanbase
mvn clean package -DskipTests

  • By default, the build uses Scala 2.12.
  • After a successful build, the target JAR is generated in the target directory of the module for the corresponding Spark version, for example: spark-connector-oceanbase-3.4_2.12-1.0-SNAPSHOT.jar. Copy this file to Spark's classpath to use spark-connector-oceanbase, or attach it at session creation as sketched below.
  • If you need a package built against another Scala version, override the Scala version flags; for example, the following builds for Scala 2.11:

git clone https://github.com/oceanbase/spark-connector-oceanbase.git
cd spark-connector-oceanbase
mvn clean package -Dscala.version=2.11.12 -Dscala.binary.version=2.11 -DskipTests
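For a quick local test, instead of copying the JAR into the Spark installation, you can attach it when creating the session. A minimal sketch; the jar path here is an assumption based on the build output described above:

import org.apache.spark.sql.SparkSession

// A sketch: attach the locally built connector jar via spark.jars instead of
// copying it into Spark's classpath. The jar path is an assumption based on
// the build output described above.
val spark = SparkSession.builder()
  .appName("oceanbase-connector-test")
  .master("local[*]")
  .config("spark.jars", "spark-connector-oceanbase-3.4_2.12-1.0-SNAPSHOT.jar")
  .getOrCreate()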

Usage Examples

Read

Use Spark-SQL.

CREATE TEMPORARY VIEW spark_oceanbase
USING oceanbase
OPTIONS(
  "url"= "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false",
  "schema-name"="test",
  "table-name"="test",
  "username"="root",
  "password"="123456"
);

SELECT * FROM spark_oceanbase;

Use DataFrame.

val oceanBaseSparkDF = spark.read.format("OceanBase")
  .option("url", "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false")
  .option("username", "root")
  .option("password", "123456")
  .option("table-name", "test")
  .option("schema-name", "test")
  .load()

oceanBaseSparkDF.show(5)
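Either way, the result is an ordinary Spark DataFrame, so you can also register it as a temporary view and mix in SQL, as in this small sketch:

// Register the loaded DataFrame as a temp view and query it with SQL.
oceanBaseSparkDF.createOrReplaceTempView("ob_test")
spark.sql("SELECT COUNT(*) FROM ob_test").show()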

Write

Take synchronizing data from Hive to OceanBase as an example.

Preparation

Create the corresponding Hive and OceanBase tables to prepare for data synchronization.

  • Start spark-sql by running ${SPARK_HOME}/bin/spark-sql
CREATE TABLE test.orders (
  order_id      INT,
  order_date    TIMESTAMP,
  customer_name STRING,
  price         DOUBLE,
  product_id    INT,
  order_status  BOOLEAN
) USING parquet;

insert into test.orders values
(1, now(), 'zs', 12.2, 12, true),
(2, now(), 'ls', 121.2, 12, true),
(3, now(), 'xx', 123.2, 12, true),
(4, now(), 'jac', 124.2, 12, false),
(5, now(), 'dot', 111.25, 12, true);
  • Connect to OceanBase
CREATE TABLE test.orders (
  order_id      INT PRIMARY KEY,
  order_date    TIMESTAMP,
  customer_name VARCHAR(225),
  price         DOUBLE,
  product_id    INT,
  order_status  BOOLEAN
);

Based on JDBC

Spark-SQL
CREATE TEMPORARY VIEW test_jdbc
USING oceanbase
OPTIONS(
  "url"= "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false",
  "schema-name"="test",
  "table-name"="orders",
  "username"="root@test",
  "password"=""
);

insert into table test_jdbc
select * from test.orders;

insert overwrite table test_jdbc
select * from test.orders;
DataFrame
val df = spark.sql("select * from test.orders")

import org.apache.spark.sql.SaveMode
df.write
  .format("oceanbase")
  .mode(saveMode = SaveMode.Append)
  .option("url", "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false")
  .option("username", "root")
  .option("password", "123456")
  .option("table-name", "orders")
  .option("schema-name", "test")
  .save()
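The SQL example above demonstrates both insert into and insert overwrite; with the DataFrame API, the counterpart of insert overwrite would be SaveMode.Overwrite. A hedged sketch, assuming the connector maps save modes in the usual way:

// Assumed counterpart of `insert overwrite`: SaveMode.Overwrite replaces the
// existing rows of the target table instead of appending to them.
df.write
  .format("oceanbase")
  .mode(SaveMode.Overwrite)
  .option("url", "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false")
  .option("username", "root")
  .option("password", "123456")
  .option("table-name", "orders")
  .option("schema-name", "test")
  .save()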

Based on direct-load

Spark-SQL
CREATE TEMPORARY VIEW test_direct
USING oceanbase
OPTIONS(
  "url"="jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false",
  "schema-name"="test",
  "table-name"="orders",
  "username"="root@test",
  "password"="123456",
  "direct-load.enabled" = true,
  "direct-load.host" = "localhost",
  "direct-load.rpc-port" = "2882"
);

insert into table test_direct
select * from test.orders;

insert overwrite table test_direct
select * from test.orders;
DataFrame
val df = spark.sql("select * from test.orders")

import org.apache.spark.sql.SaveMode
df.write
  .format("oceanbase")
  .mode(saveMode = SaveMode.Append)
  .option("url", "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false")
  .option("username", "root")
  .option("password", "123456")
  .option("table-name", "orders")
  .option("schema-name", "test")
  .option("direct-load.enabled", "true")
  .option("direct-load.host", "localhost")
  .option("direct-load.rpc-port", "2882")
  .save()
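Putting the pieces together, an end-to-end batch job could look like the following self-contained sketch. Connection details mirror the examples above; enableHiveSupport is needed because the source table lives in Hive:

import org.apache.spark.sql.{SaveMode, SparkSession}

object HiveToOceanBase {
  def main(args: Array[String]): Unit = {
    // Hive support is required to read the test.orders source table.
    val spark = SparkSession.builder()
      .appName("hive-to-oceanbase")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql("select * from test.orders")

    // Write with direct-load enabled, mirroring the options shown above.
    df.write
      .format("oceanbase")
      .mode(SaveMode.Append)
      .option("url", "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false")
      .option("username", "root")
      .option("password", "123456")
      .option("schema-name", "test")
      .option("table-name", "orders")
      .option("direct-load.enabled", "true")
      .option("direct-load.host", "localhost")
      .option("direct-load.rpc-port", "2882")
      .save()

    spark.stop()
  }
}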

Configuration

General configuration

| Option      | Default | Type   | Description                               |
|-------------|---------|--------|-------------------------------------------|
| url         |         | String | The connection URL.                       |
| username    |         | String | The connection username, like 'root@sys'. |
| tenant-name |         | String | The tenant name.                          |
| password    |         | String | The password.                             |
| schema-name |         | String | The schema name or database name.         |
| table-name  |         | String | The table name.                           |
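Since all of these options are plain strings, a convenient pattern is to collect the shared connection settings once and reuse them across reads and writes. A minimal sketch using the standard options(Map) API; the values mirror the examples above:

// Shared connection options, reusable across reads and writes.
val obOptions = Map(
  "url"         -> "jdbc:mysql://localhost:2881/test?useUnicode=true&characterEncoding=UTF-8&useSSL=false",
  "username"    -> "root",
  "password"    -> "123456",
  "schema-name" -> "test",
  "table-name"  -> "orders"
)

val ordersDF = spark.read.format("oceanbase").options(obOptions).load()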

Direct load configuration

| Option                         | Default | Type     | Description                                                                                                     |
|--------------------------------|---------|----------|-----------------------------------------------------------------------------------------------------------------|
| direct-load.enabled            | false   | Boolean  | Whether to enable direct-load writing.                                                                           |
| direct-load.host               |         | String   | The hostname used in direct-load.                                                                                |
| direct-load.rpc-port           | 2882    | Integer  | The RPC port used in direct-load.                                                                                |
| direct-load.parallel           | 8       | Integer  | The parallelism of the direct-load server. This parameter determines how much CPU the server uses to process this import task. |
| direct-load.batch-size         | 10240   | Integer  | The number of rows written to OceanBase in one batch.                                                            |
| direct-load.max-error-rows     | 0       | Long     | The maximum tolerable number of error rows.                                                                      |
| direct-load.dup-action         | REPLACE | String   | The action taken on duplicate records in a direct-load task. Can be STOP_ON_DUP, REPLACE, or IGNORE.             |
| direct-load.timeout            | 7d      | Duration | The timeout of the direct-load task.                                                                             |
| direct-load.heartbeat-timeout  | 60s     | Duration | The client heartbeat timeout of the direct-load task.                                                            |
| direct-load.heartbeat-interval | 10s     | Duration | The client heartbeat interval of the direct-load task.                                                           |
| direct-load.load-method        | full    | String   | The direct-load mode: full, inc, or inc_replace.                                                                 |

  • full: full direct-load, the default.
  • inc: normal incremental direct-load. A primary-key conflict check is performed; supported by observer 4.3.2 and later. direct-load.dup-action REPLACE is not supported for the time being.
  • inc_replace: special replace-mode incremental direct-load. No primary-key conflict check is performed; old data is directly overwritten (equivalent to the effect of replace), and the direct-load.dup-action parameter is ignored. Supported by observer 4.3.2 and later.
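For example, a direct-load write that tunes parallelism, batch size, and error tolerance could look like the sketch below; the tuning values are illustrative, not recommendations, and obOptions is the shared option map from the sketch above:

// Direct-load write with illustrative tuning values.
df.write
  .format("oceanbase")
  .mode(SaveMode.Append)
  .options(obOptions)                              // shared connection options from the sketch above
  .option("direct-load.enabled", "true")
  .option("direct-load.host", "localhost")
  .option("direct-load.rpc-port", "2882")
  .option("direct-load.parallel", "16")            // illustrative value
  .option("direct-load.batch-size", "20480")       // illustrative value
  .option("direct-load.max-error-rows", "100")     // illustrative value
  .option("direct-load.load-method", "inc_replace")
  .save()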

JDBC configuration

  • This connector is implemented on top of Spark's JDBC To Other Databases data source.
  • It supports OceanBase's MySQL and Oracle modes:
    • For MySQL mode, you need to add the MySQL Connector/J driver to Spark's classpath.
    • For Oracle mode, you need to add the OceanBase Connector/J driver to Spark's classpath.
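To check that the required driver is actually on the classpath before submitting a job, a quick probe like the following can help. com.mysql.cj.jdbc.Driver is MySQL Connector/J's documented driver class; com.oceanbase.jdbc.Driver is an assumption for OceanBase Connector/J:

// Probe the classpath for the JDBC driver classes.
// com.mysql.cj.jdbc.Driver is MySQL Connector/J 8.x; com.oceanbase.jdbc.Driver
// is assumed here for OceanBase Connector/J.
for (driver <- Seq("com.mysql.cj.jdbc.Driver", "com.oceanbase.jdbc.Driver")) {
  try {
    Class.forName(driver)
    println(s"$driver found on classpath")
  } catch {
    case _: ClassNotFoundException => println(s"$driver missing from classpath")
  }
}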