Expression Translation

Currently only Oracle Catalog tables with the following Oracle datatypes are supported.

  • The Catalyst DataType column lists the Catalyst datatype each Oracle type is mapped to.
Oracle Type               Catalyst DataType
CHAR(n)                   StringType
VARCHAR(n)                StringType
NCHAR(n)                  StringType
NVARCHAR(n)               StringType
NUMBER(p), p <= 2         ByteType
NUMBER(p), p <= 4         ShortType
NUMBER(p), p <= 9         IntegerType
NUMBER(p), p <= 18        LongType
NUMBER(p), p > 18         DecimalType(p, 0)
NUMBER(p, s), s > p       Not supported
NUMBER(p, s), s < 0       Not supported
NUMBER(p, s), p < 38      DecimalType(p, s)
NUMBER(p, s), p >= 38     DecimalType(38, 18)
FLOAT                     broken, TODO
LONG                      StringType
BINARY_FLOAT              FloatType
BINARY_DOUBLE             DoubleType
DATE                      DateType
TIMESTAMP(s)              TimestampType
TIMESTAMP WITH TZ         TODO
TIMESTAMP WITH LOCAL TZ   TODO

Spark Literal values are converted to Oracle literals using the following rules, based on the Oracle SQL Language Reference:

  • Strings are wrapped in single quotes (') and embedded single quotes are escaped by doubling them ('').
  • Date and Timestamp values are converted to strings using the format yyyy-MM-dd HH:mm:ss.SSSSSS and emitted as DATE/TIMESTAMP literals, for example TIMESTAMP '2000-03-11 08:00:00.000000' and TRUNC(TIMESTAMP '2000-03-11 08:00:00.000000').
  • Other datatypes are converted to their string representation.
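
For instance, under these rules the Spark string O'Brien and a Spark Date value render as shown below (a minimal illustration; the values are made up):

  select 'O''Brien' from dual;                                     -- embedded quote doubled
  select trunc(timestamp '2000-03-11 08:00:00.000000') from dual;  -- DATE literal via TRUNC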

The following are currently translated: Cast, PromotePrecision, and CheckOverflow.

  • For PromotePrecision, simply return the child translation.
  • For CheckOverflow:
    • if nullOnOverflow is true, add a case check (see the sketch after this list);
    • if nullOnOverflow is false, do nothing; the translated oExpr is expected to throw on overflow.
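
A minimal sketch of the nullOnOverflow = true case check, assuming a DecimalType(5, 2) target; the column col1 and the rendered bound are illustrative, not the exact generated SQL:

  -- yield null instead of raising an error when col1 cannot fit DECIMAL(5,2)
  case when abs(col1) < 1000 then cast(col1 as number(5,2))
       else null
  end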

The following helper SQL snippet generator functions are referenced in the sections below:

  val epochTS = {
    // epoch start as a TIMESTAMP WITH TIME ZONE in UTC
    // osql"to_timestamp_tz('1970-01-01 00:00:00 00:00', 'YYYY-MM-DD HH24:MI:SS TZH:TZM')"
    osql"from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')"
  }
  // epoch start as a DATE
  val epochDt = osql"date '1970-01-01'"

  // epoch start as a TIMESTAMP in the session time zone
  val epochTSAtSessionTZ = osql"to_timestamp('1970-01-01', 'YYYY-MM-DD')"

  // the timestamp used to represent boolean true: 1 millisecond after epoch
  val true_bool_TS =
    osql"from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') + interval '0.001' second(0,3)"

dtToTimestamp

  • If there is no catalystExpr#zoneId, generate cast(${oraE} as timestamp).
  • Otherwise generate cast(${oraE} as timestamp) at time zone ${zoneOE} (see the example below).
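
For example, assuming oraE is a hypothetical DATE column hire_date and zoneOE is 'UTC', the two forms are:

  -- no zoneId on the Catalyst expression
  cast(hire_date as timestamp)
  -- zoneId present
  cast(hire_date as timestamp) at time zone 'UTC'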

timestampToDt

  • If catalystExpr#zoneId is present:
    • cast the input to a timestamp and set its time zone, then cast the result to a date.
    • Translation expression: cast(from_tz(cast({oraE} as timestamp), {zoneOE}) as date)
  • Otherwise the translation expression is cast({oraE} as date) (see the example below).
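
For example, assuming oraE is a hypothetical TIMESTAMP column order_ts and zoneOE is 'UTC':

  -- zoneId present
  cast(from_tz(cast(order_ts as timestamp), 'UTC') as date)
  -- no zoneId
  cast(order_ts as date)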

epochToTimestamp

Translation logic is:

millisToInterval = numtodsinterval({oraE}/1000, 'SECOND')
millisToIntervalWithTZOffset = {millisToInterval} + {epochTS} - {epochTSAtSessionTZ}
result = {epochTSAtSessionTZ} + ({millisToIntervalWithTZOffset})

For example, for oraE = 1603425037802, the generated SQL is:

  to_timestamp('1970-01-01', 'YYYY-MM-DD') +
       (numtodsinterval(1603425037802/1000, 'SECOND') +
        from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') -
        to_timestamp('1970-01-01', 'YYYY-MM-DD')
       )

epochToDate

Translation logic is:

millisToInterval = numtodsinterval({oraE}/1000, 'SECOND')
millisToIntervalWithTZOffset = {millisToInterval} + {epochTS} - {epochTSAtSessionTZ}
epoch_ts = {epochTSAtSessionTZ} + {millisToIntervalWithTZOffset}
result = trunc({epoch_ts}, 'DD')

For example, for oraE = 1603425037802, the generated SQL is:

  trunc(
    to_timestamp('1970-01-01', 'YYYY-MM-DD') +
       (numtodsinterval(1603425037802/1000, 'SECOND') +
        from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') -
        to_timestamp('1970-01-01', 'YYYY-MM-DD')
       ),
     'DD'
    )

timestampToEpoch

Translation logic is:

  // using ora date arithmetic: ora_ts - ora_ts -> ora_interval
  days = extract(day from ({oraE} - {epochTS})) * 24 * 60 * 60
  hours = extract(hour from ({oraE} - {epochTS})) * 60 * 60
  mins = extract(minute from ({oraE} - {epochTS})) * 60
  secs = extract(second from ({oraE} - {epochTS}))
  result = ({days} + {hours} + {mins} + {secs}) * 1000

For example, for oraE = systimestamp, the generated SQL is:

  (extract(day from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 24 * 60 * 60 +
   extract(hour from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 60 * 60 +
   extract(minute from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 60 +
   extract(second from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')))
  ) * 1000

dateToEpoch

Translation logic is:

  trunc_to_days = trunc({oraE}, 'DD')
  // using ora date arithmetic: ora_date - ora_ts -> ora_interval
  interval_from_epoch = trunc_to_days - {epochTS}
  num_hours = extract(day from interval_from_epoch) * 24 +
              extract(hour from interval_from_epoch)
  result = num_hours * 60 * 60 * 1000

For example, for oraE = sysdate, the generated SQL is:

  (extract(day from (trunc(sysdate, 'DD') - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 24 +
   extract(hour from (trunc(sysdate, 'DD') - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')))
  ) * 60 * 60 * 1000

NumericCasting

For a widening Cast: do nothing; return {childOE}.

For a narrowing Cast: use the following SQL expression template:

case when {childOE} > {toDT.MinValue} and {childOE} < {toDT.MaxValue}
         then cast({childOE} as {toDT})
     else null
end
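
For example, a narrowing cast of a hypothetical NUMBER column col1 to Catalyst's ByteType (MinValue -128, MaxValue 127) would fill the template as follows (rendering ByteType as number(3) is an assumption):

  case when col1 > -128 and col1 < 127
           then cast(col1 as number(3))
       else null
  end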

StringCasting

from string:

  • to numeric:
    • apply the TO_NUMBER Oracle function
    • so the SQL template is to_number({childOE})
  • to date:
    • Spark uses DateTimeUtils.stringToDate; this tries a number of date formats
    • When translating, we use the default date format of the Oracle connection. From the Oracle TO_DATE function documentation: the default date format is determined implicitly by the NLS_TERRITORY initialization parameter, or can be set explicitly by the NLS_DATE_FORMAT parameter.
    • so translation is to_date({childOE})
  • to timestamp:
    • Spark uses DateTimeUtils.stringToTimestamp; this tries a number of timestamp formats.
    • When translating, we use the default timestamp format of the Oracle connection. From the Oracle TO_TIMESTAMP function documentation: the default format of the TIMESTAMP data type is determined by the NLS_TIMESTAMP_FORMAT initialization parameter.
    • so translation is to_timestamp({childOE})
  • to boolean:
    • Spark uses
      • StringUtils.isTrueString to translate to true
      • StringUtils.isFalseString to translate to false
      • else null
    • sql template:
      (case when ${childOE} in ('t', 'true', 'y', 'yes', '1') then 1
           when ${childOE} in ('f', 'false', 'n', 'no', '0') then 0
           else null
       end) = 1
    • The examples below show the Boolean translation in action:
      -- This returns 1 row
      select 1
      from dual
      where (case when 't' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 't' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

      -- These return 0 rows:
      select 1
      from dual
      where (case when 'f' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 'f' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

      select 1
      from dual
      where (case when 'u' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 'u' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

to string:

  • from numeric:
    • Spark applies UTF8String.fromString({childExpr.val}.toString)
    • translate using TO_CHAR(number) Oracle function
    • So the translation template is to_char({childOE})
    • For example to_char(12345678912345.345678900000) returns 12345678912345.345678900000
  • from date:
    • Spark uses DateFormatter for the timeZoneId of the castExpr
      • The date pattern used is defaultPattern: String = "yyyy-MM-dd"
    • Translate to SQL template using TO_CHAR(date) Oracle function.
      • The template is to_char({childOE}).
      • This uses the default date format of the Oracle connection. This can be changed in Oracle by setting the 'NLS_TERRITORY' initialization parameter or can be set explicitly by the 'NLS_DATE_FORMAT' parameter.
  • from timestamp:
    • Spark uses FractionTimestampFormatter for the timeZoneId of the castExpr
      • The timestamp pattern used is formatter = DateTimeFormatterHelper.fractionFormatter, which parses/formats timestamps according to the pattern yyyy-MM-dd HH:mm:ss.[..fff..]
    • Translate to a SQL template using the TO_CHAR(datetime) Oracle function.
      • The template is to_char({childOE}).
      • This uses the default timestamp format of the Oracle connection, which can be changed in Oracle by setting the 'NLS_TERRITORY' initialization parameter or set explicitly by the 'NLS_TIMESTAMP_FORMAT' parameter.
      • For example, to_char(systimestamp) returns something like 22-OCT-20 01.16.32.740812 PM.
  • from boolean:
    • use template: case when {childOE} then 'true' else 'false' end
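
As a quick check of the boolean template, with the literal predicate 1=1 standing in for a translated boolean {childOE}:

  -- returns 'true'
  select case when 1=1 then 'true' else 'false' end from dual;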

BooleanCasting

from boolean:

to boolean:

DateCasting

from date:

  • to numeric:
    • In Spark: num_of_days since epoch.
    • translate to: {oraE} - {epochDt}. Based on Oracle's date arithmetic (oraDt - oraDt -> number), this represents the number of days since the start of the epoch (see the example after this list).
  • to string:
  • to timestamp:
    • In Spark: DatetimeUtils.daysToMicros(d, zoneId)
      • Converts days since 1970-01-01 at the given zone ID to microseconds since 1970-01-01 00:00:00Z.
    • Translate to: cast({oraE} as timestamp) with additional at time zone {castE.timeZoneId}. See dtToTimestamp() method.
  • to boolean:
    • In Spark: null
    • translate to: null
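
For example, the 'to numeric' rule above for a hypothetical DATE column order_dt:

  -- Oracle date arithmetic: date - date -> number of days
  select order_dt - date '1970-01-01' as days_since_epoch
  from orders;  -- orders and order_dt are hypothetical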

to date:

  • from numeric:
    • In Spark it is undefined
    • translate to: {epochDt} + {oraE}. Based on Oracle's date arithmetic, this represents the date that is {oraE} days from the epoch (see the example after this list).
  • from string:
  • from timestamp:
    • In Spark: convert timestamp at given time zone to date
    • Translate to: cast({oraE} as date); if {castE.timeZoneId} is specified first convert to timestamp in timeZone. See timestampToDt()
  • from boolean:
    • In Spark it is undefined
    • We throw during translation.
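
For example, the 'from numeric' rule above for the value 18649:

  -- epochDt + n days; 18649 days after the epoch is 2021-01-22
  select date '1970-01-01' + 18649 from dual;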

TimestampCasting

from timestamp:

  • to numeric:
  • to string: Same as Timestamp -> String in StringCasting
  • to date: Same as Timestamp -> Date in DateCasting
  • to boolean:
    • In Spark: millis_since_epoch != 0. Translate to: timestampToEpoch({oraE}) != 0. See timestampToEpoch().

to timestamp:

  • from numeric:
  • from string: same as String -> Timestamp in StringCasting
  • from date:
    • In Spark: convert the date at the given time zone to a timestamp
    • Translate to: see dtToTimestamp()
  • from boolean:
    • In Spark: true is interpreted as 1L millis_since_epoch, and false as 0L millis_since_epoch.
    • translate to: case when {oraE} then ${true_bool_TS} else ${epochTS} end (see the sketch below)
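
A sketch of the boolean template with the helper snippets expanded inline, using 1=1 as a stand-in for a translated boolean {oraE}:

  select case when 1=1
              then from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') + interval '0.001' second(0,3)
              else from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')
         end
  from dual;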

The following are currently translated: UnaryMinus, UnaryPositive, Abs, +, -, *, /, Remainder, Pmod, Least and Greatest.

  • AttributeReference is translated to a column reference.
  • Alias is translated to an alias expression.

The following are currently translated: If and CaseWhen.

  • IsNull and IsNotNull are translated into is null and is not null Oracle predicates.
  • IsNull on a SubqueryExpression is translated into not exists (subquery sql).
  • IsNotNull on a SubqueryExpression is translated into exists (subquery sql), as in the example below.
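
For instance, an IsNotNull predicate over a scalar subquery might render as follows (the tables and columns are hypothetical):

  select c1
  from t1
  where exists (select max(c2) from t2 where t2.fk = t1.id);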

The following are currently translated: Not, And, Or, In, InSet, EqualNullSafe, and BinaryComparison.

The following are currently translated: ScalarSubquery and ListQuery.

Identifiers

  • If spark.sql.caseSensitive=false (the default setting)
    • Table, Schema and Column names are resolved in a case-insensitive manner.
  • If spark.sql.caseSensitive=true
    • Table and Schema names are resolved in a case-sensitive manner; if resolution fails, a case-insensitive resolution is attempted.
    • Column names are resolved in a case-sensitive manner.