External Tables

We will support the creation of Spark external tables, which are defined in the Spark grammar as:

    | createTableHeader ('(' colTypeList ')')? tableProvider
        createTableClauses
        (AS? query)?                                                   #createTable
  • We will focus on the parquet provider. We may also certify orc, avro, and csv.
  • These kinds of Spark DDLs will be translated to set up Oracle External Tables.

Creation of Hive external tables in Spark DDL is routed to the Session Catalog, so we will not support translating these. In the Spark grammar they are of the form:

    | createTableHeader ('(' columns=colTypeList ')')?
        (commentSpec |
        (PARTITIONED BY '(' partitionColumns=colTypeList ')' |
        PARTITIONED BY partitionColumnNames=identifierList) |
        bucketSpec |
        skewSpec |
        rowFormat |
        createFileFormat |
        locationSpec |
        (TBLPROPERTIES tableProps=tablePropertyList))*
        (AS? query)?                                                   #createHiveTable

Creation of Non-partitioned Tables

Something like this:

create table t2(id long, p string)
using parquet
location "https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t1.parquet"

We will translate the Spark DDL specification into an equivalent DBMS_CLOUD.CREATE_EXTERNAL_TABLE invocation (see the sketch after this list):

  • The CatalogPlugin must be configured with the credential_name to use with the object store.
    • TBD: parameter_name and an example.
  • The location parameter must be provided; it will be interpreted as an 'Oracle Cloud Infrastructure Object Storage Native URI Format'.
  • The file_uri_list will be populated based on the contents of the folder.
    • Contents are inferred using:
      • the dbms_cloud functions list_objects or list_files
      • alternatively, if we make the Spark Catalog aware of object store credentials, this can be done using oci-hdfs functionality. This is probably the preferred approach (TBD).
  • The column_list parameter is populated from the schema specified in the Spark DDL.
  • The format parameter is populated from the options provided in the Spark DDL.
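
As a rough sketch, the non-partitioned DDL above could map to a call like the following. The credential name, the Oracle column types derived from the Spark schema, and the format options are illustrative assumptions, not a finalized mapping:

    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name      => 'T2',
        -- assumed credential; supplied via the CatalogPlugin configuration
        credential_name => 'OBJ_STORE_CRED',
        -- expanded from the folder contents (here a single parquet file)
        file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t1.parquet',
        -- derived from the Spark provider ("using parquet")
        format          => '{"type" : "parquet"}',
        -- derived from the Spark schema; the Spark-to-Oracle type mapping shown is illustrative
        column_list     => 'ID NUMBER(19), P VARCHAR2(4000)'
      );
    END;
    /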

Creation of Partitioned Tables

Something like this:

create table t2(id long, p string)
using parquet
partitioned by (p)
location "https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t2/"

We will translate the Spark DDL specification into an equivalent DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE invocation (see the sketch after this list):

  • The partitioning_clause is constructed by introspecting the object store location.
    • Using the dbms_cloud functions list_objects or list_files, or oci-hdfs.
    • oci-hdfs will probably be preferred (TBD).
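
A rough sketch of the corresponding call follows. The credential name, the type mapping, and the single discovered partition are illustrative assumptions; the actual partition list would come from introspecting the t2/ folder:

    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
        table_name      => 'T2',
        -- assumed credential; supplied via the CatalogPlugin configuration
        credential_name => 'OBJ_STORE_CRED',
        format          => '{"type" : "parquet"}',
        column_list     => 'ID NUMBER(19), P VARCHAR2(4000)',
        -- partition values and locations would be discovered by listing the t2/ folder;
        -- only one illustrative partition is shown
        partitioning_clause =>
          'partition by list (P) (
             partition p1 values (''p1'') location (
               ''https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t2/p=p1/*.parquet''
             )
           )'
      );
    END;
    /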

Scanning External Tables

Once the table is set up in the Oracle dictionary, no special action is needed in normal cases. Spark queries involving only external tables can be set up with physical plans that directly read the data using oci-hdfs. TBD when and if we do this.
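
For example, once t2 is registered, an ordinary Spark SQL query over it needs no special handling (the query below is purely illustrative):

    select p, count(*)
    from t2
    where id > 100
    group by p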

Writing External Tables

  • There doesn't seem to be a way to write parquet/orc/avro from Oracle.
  • So the write part of the plan would have to be done in Spark (see the sketch below):
    • Move input data to Spark: initially to Spark on Oracle; later to Spark in Oracle.
    • Have Spark tasks write the OCI files. This implies that, for write support, the Catalog must be configured with access to the object store.
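
As a rough sketch of the eventual write path, an ordinary Spark SQL insert would be executed by Spark tasks that write new parquet files under the table's object store folder (the statement and the file layout are assumptions about the eventual design):

    -- executed by Spark tasks; new parquet files land under .../o/t2/,
    -- where the external table definition picks them up
    insert into t2 values (10, 'p1')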