Expression Translation

Currently only Oracle Catalog tables with the following Oracle datatypes are supported.

  • The Catalyst DataType column lists the Catalyst datatype each Oracle type is mapped to.
Oracle Type               Catalyst DataType
CHAR(n)                   StringType
VARCHAR(n)                StringType
NCHAR(n)                  StringType
NVARCHAR(n)               StringType
NUMBER(p), p <= 2         ByteType
NUMBER(p), p <= 4         ShortType
NUMBER(p), p <= 9         IntegerType
NUMBER(p), p <= 18        LongType
NUMBER(p), p > 18         DecimalType(p, 0)
NUMBER(p, s), s > p       Not supported
NUMBER(p, s), s < 0       Not supported
NUMBER(p, s), p < 38      DecimalType(p, s)
NUMBER(p, s), p >= 38     DecimalType(38, 18)
FLOAT                     broken, TODO
LONG                      StringType
BINARY_FLOAT              FloatType
BINARY_DOUBLE             DoubleType
DATE                      DateType
TIMESTAMP(s)              TimestampType
TIMESTAMP WITH TZ         TODO
TIMESTAMP WITH LOCAL TZ   TODO

Spark Literal values are converted to Oracle literals using the following rules, based on the Oracle SQL Language Reference:

  • Strings are wrapped in single quotes (') and embedded single quotes are escaped by doubling them ('').
  • Date and Timestamp values are converted to strings using the format yyyy-MM-dd HH:mm:ss.SSSSSS and emitted as DATE/TIMESTAMP literals, for example TIMESTAMP '2000-03-11 08:00:00.000000' and TRUNC(TIMESTAMP '2000-03-11 08:00:00.000000').
  • Other datatypes are converted to their string representation.
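
For instance, under these rules the Spark string O'Brien and a Spark Date value render as shown below (a minimal illustration; the values are made up):

  select 'O''Brien' from dual;                                     -- embedded quote doubled
  select trunc(timestamp '2000-03-11 08:00:00.000000') from dual;  -- DATE literal via TRUNC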

The following are currently translated: Cast, PromotePrecision, and CheckOverflow.

  • For PromotePrecision, simply return the child translation.
  • For CheckOverflow:
    • if nullOnOverflow is true, add a case check (see the sketch after this list);
    • if nullOnOverflow is false, do nothing; the translated oExpr is expected to throw on overflow.
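
A minimal sketch of the nullOnOverflow = true case check, assuming a DecimalType(5, 2) target; the column col1 and the rendered bound are illustrative, not the exact generated SQL:

  -- yield null instead of raising an error when col1 cannot fit DECIMAL(5,2)
  case when abs(col1) < 1000 then cast(col1 as number(5,2))
       else null
  end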

The following helper SQL snippet generator functions are referenced in the sections below:

  val epochTS = {
    // epoch start as a TIMESTAMP WITH TIME ZONE in UTC
    // osql"to_timestamp_tz('1970-01-01 00:00:00 00:00', 'YYYY-MM-DD HH24:MI:SS TZH:TZM')"
    osql"from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')"
  }
  // epoch start as a DATE
  val epochDt = osql"date '1970-01-01'"

  // epoch start as a TIMESTAMP in the session time zone
  val epochTSAtSessionTZ = osql"to_timestamp('1970-01-01', 'YYYY-MM-DD')"

  // the timestamp used to represent boolean true: 1 millisecond after epoch
  val true_bool_TS =
    osql"from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') + interval '0.001' second(0,3)"

dtToTimestamp

  • If there is no catalystExpr#zoneId, generate cast(${oraE} as timestamp).
  • Otherwise generate cast(${oraE} as timestamp) at time zone ${zoneOE} (see the example below).
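
For example, assuming oraE is a hypothetical DATE column hire_date and zoneOE is 'UTC', the two forms are:

  -- no zoneId on the Catalyst expression
  cast(hire_date as timestamp)
  -- zoneId present
  cast(hire_date as timestamp) at time zone 'UTC'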

timestampToDt

  • If catalystExpr#zoneId is present:
    • cast the input to a timestamp and set its time zone, then cast the result to a date.
    • Translation expression: cast(from_tz(cast({oraE} as timestamp), {zoneOE}) as date)
  • Otherwise the translation expression is cast({oraE} as date) (see the example below).
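
For example, assuming oraE is a hypothetical TIMESTAMP column order_ts and zoneOE is 'UTC':

  -- zoneId present
  cast(from_tz(cast(order_ts as timestamp), 'UTC') as date)
  -- no zoneId
  cast(order_ts as date)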

epochToTimestamp

Translation logic is:

millisToInterval = numtodsinterval({oraE}/1000, 'SECOND')
millisToIntervalWithTZOffset = {millisToInterval} + {epochTS} - {epochTSAtSessionTZ}
result = {epochTSAtSessionTZ} + ({millisToIntervalWithTZOffset})

For example, for oraE = 1603425037802, the generated SQL is:

  to_timestamp('1970-01-01', 'YYYY-MM-DD') +
       (numtodsinterval(1603425037802/1000, 'SECOND') +
        from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') -
        to_timestamp('1970-01-01', 'YYYY-MM-DD')
       )

epochToDate

Translation logic is:

millisToInterval = numtodsinterval({oraE}/1000, 'SECOND')
millisToIntervalWithTZOffset = {millisToInterval} + {epochTS} - {epochTSAtSessionTZ}
epoch_ts = {epochTSAtSessionTZ} + {millisToIntervalWithTZOffset}
result = trunc({epoch_ts}, 'DD')

For example, for oraE = 1603425037802, the generated SQL is:

  trunc(
    to_timestamp('1970-01-01', 'YYYY-MM-DD') +
       (numtodsinterval(1603425037802/1000, 'SECOND') +
        from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') -
        to_timestamp('1970-01-01', 'YYYY-MM-DD')
       ),
     'DD'
    )

timestampToEpoch

Translation logic is:

  // using ora date arithmetic: ora_ts - ora_ts -> ora_interval
  days = extract(day from ({oraE} - {epochTS})) * 24 * 60 * 60
  hours = extract(hour from ({oraE} - {epochTS})) * 60 * 60
  mins = extract(minute from ({oraE} - {epochTS})) * 60
  secs = extract(second from ({oraE} - {epochTS}))
  result = ({days} + {hours} + {mins} + {secs}) * 1000

For example, for oraE = systimestamp, the generated SQL is:

  (extract(day from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 24 * 60 * 60 +
   extract(hour from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 60 * 60 +
   extract(minute from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 60 +
   extract(second from (systimestamp - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')))
  ) * 1000

dateToEpoch

Translation logic is:

  trunc_to_days = trunc({oraE}, 'DD')
  // using ora date arithmetic: ora_date - ora_ts -> ora_interval
  interval_from_epoch = trunc_to_days - {epochTS}
  num_hours = extract(day from interval_from_epoch) * 24 +
              extract(hour from interval_from_epoch)
  result = num_hours * 60 * 60 * 1000

For example, for oraE = sysdate, the generated SQL is:

  (extract(day from (trunc(sysdate, 'DD') - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC'))) * 24 +
   extract(hour from (trunc(sysdate, 'DD') - from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')))
  ) * 60 * 60 * 1000

NumericCasting

For a widening Cast: do nothing; return {childOE}.

For a narrowing Cast: use the following SQL expression template:

case when {childOE} > {toDT.MinValue} and {childOE} < {toDT.MaxValue}
         then cast({childOE} as {toDT})
     else null
end
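
For example, a narrowing cast of a hypothetical NUMBER column col1 to Catalyst's ByteType (MinValue -128, MaxValue 127) would fill the template as follows (rendering ByteType as number(3) is an assumption):

  case when col1 > -128 and col1 < 127
           then cast(col1 as number(3))
       else null
  end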

StringCasting

from string:

  • to numeric:
    • apply the TO_NUMBER Oracle function
    • so the SQL template is to_number({childOE})
  • to date:
    • Spark uses DateTimeUtils.stringToDate; this tries a number of date formats
    • When translating, we use the default date format of the Oracle connection. From the Oracle TO_DATE function documentation: the default date format is determined implicitly by the NLS_TERRITORY initialization parameter, or can be set explicitly by the NLS_DATE_FORMAT parameter.
    • so translation is to_date({childOE})
  • to timestamp:
    • Spark uses DateTimeUtils.stringToTimestamp; this tries a number of timestamp formats.
    • When translating, we use the default timestamp format of the Oracle connection. From the Oracle TO_TIMESTAMP function documentation: the default format of the TIMESTAMP data type is determined by the NLS_TIMESTAMP_FORMAT initialization parameter.
    • so translation is to_timestamp({childOE})
  • to boolean:
    • Spark uses
      • StringUtils.isTrueString to translate to true
      • StringUtils.isFalseString to translate to false
      • else null
    • sql template:
      (case when ${childOE} in ('t', 'true', 'y', 'yes', '1') then 1
           when ${childOE} in ('f', 'false', 'n', 'no', '0') then 0
           else null
       end) = 1
    • The examples below show the Boolean translation in action:
      -- This returns 1 row
      select 1
      from dual
      where (case when 't' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 't' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

      -- These return 0 rows:
      select 1
      from dual
      where (case when 'f' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 'f' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

      select 1
      from dual
      where (case when 'u' in ('t', 'true', 'y', 'yes', '1') then 1
                  when 'u' in ('f', 'false', 'n', 'no', '0') then 0
                  else null
             end) = 1;

to string:

  • from numeric:
    • Spark applies UTF8String.fromString({childExpr.val}.toString)
    • translate using TO_CHAR(number) Oracle function
    • So the translation template is to_char({childOE})
    • For example to_char(12345678912345.345678900000) returns 12345678912345.345678900000
  • from date:
    • Spark uses DateFormatter for the timeZoneId of the castExpr
      • The date pattern used is defaultPattern: String = "yyyy-MM-dd"
    • Translate to SQL template using TO_CHAR(date) Oracle function.
      • The template is to_char({childOE}).
      • This uses the default date format of the Oracle connection. This can be changed in Oracle by setting the 'NLS_TERRITORY' initialization parameter or can be set explicitly by the 'NLS_DATE_FORMAT' parameter.
  • from timestamp:
    • Spark uses FractionTimestampFormatter for the timeZoneId of the castExpr
      • The timestamp pattern used is formatter = DateTimeFormatterHelper.fractionFormatter, which parses/formats timestamps according to the pattern yyyy-MM-dd HH:mm:ss.[..fff..]
    • Translate to a SQL template using the TO_CHAR(datetime) Oracle function.
      • The template is to_char({childOE}).
      • This uses the default timestamp format of the Oracle connection, which can be changed in Oracle by setting the 'NLS_TERRITORY' initialization parameter or set explicitly by the 'NLS_TIMESTAMP_FORMAT' parameter.
      • For example, to_char(systimestamp) returns something like 22-OCT-20 01.16.32.740812 PM.
  • from boolean:
    • use template: case when {childOE} then 'true' else 'false' end
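
As a quick check of the boolean template, with the literal predicate 1=1 standing in for a translated boolean {childOE}:

  -- returns 'true'
  select case when 1=1 then 'true' else 'false' end from dual;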

BooleanCasting

from boolean:

to boolean:

DateCasting

from date:

  • to numeric:
    • In Spark: num_of_days since epoch.
    • translate to: {oraE} - {epochDt}. Based on Oracle's date arithmetic (oraDt - oraDt -> number), this represents the number of days since the start of the epoch (see the example after this list).
  • to string:
  • to timestamp:
    • In Spark: DatetimeUtils.daysToMicros(d, zoneId)
      • Converts days since 1970-01-01 at the given zone ID to microseconds since 1970-01-01 00:00:00Z.
    • Translate to: cast({oraE} as timestamp) with additional at time zone {castE.timeZoneId}. See dtToTimestamp() method.
  • to boolean:
    • In Spark: null
    • translate to: null
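
For example, the 'to numeric' rule above for a hypothetical DATE column order_dt:

  -- Oracle date arithmetic: date - date -> number of days
  select order_dt - date '1970-01-01' as days_since_epoch
  from orders;  -- orders and order_dt are hypothetical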

to date:

  • from numeric:
    • In Spark it is undefined
    • translate to: {epochDt} + {oraE}. Based on Oracle's date arithmetic, this represents the date that is {oraE} days from the epoch (see the example after this list).
  • from string:
  • from timestamp:
    • In Spark: convert timestamp at given time zone to date
    • Translate to: cast({oraE} as date); if {castE.timeZoneId} is specified first convert to timestamp in timeZone. See timestampToDt()
  • from boolean:
    • In Spark it is undefined
    • We throw during translation.
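
For example, the 'from numeric' rule above for the value 18649:

  -- epochDt + n days; 18649 days after the epoch is 2021-01-22
  select date '1970-01-01' + 18649 from dual;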

TimestampCasting

from timestamp:

  • to numeric:
  • to string: Same as Timestamp -> String in StringCasting
  • to date: Same as Timestamp -> Date in DateCasting
  • to boolean:
    • In Spark: millis_since_epoch != 0. Translate to: timestampToEpoch({oraE}) != 0. See timestampToEpoch().

to timestamp:

  • from numeric:
  • from string: same as String -> Timestamp in StringCasting
  • from date:
    • In Spark: convert the date at the given time zone to a timestamp
    • Translate to: see dtToTimestamp()
  • from boolean:
    • In Spark: true is interpreted as 1L millis_since_epoch, and false as 0L millis_since_epoch.
    • translate to: case when {oraE} then ${true_bool_TS} else ${epochTS} end (see the sketch below)
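
A sketch of the boolean template with the helper snippets expanded inline, using 1=1 as a stand-in for a translated boolean {oraE}:

  select case when 1=1
              then from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC') + interval '0.001' second(0,3)
              else from_tz(to_timestamp('1970-01-01', 'YYYY-MM-DD'), 'UTC')
         end
  from dual;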

The following are currently translated: UnaryMinus, UnaryPositive, Abs, +, -, *, /, Remainder, Pmod, Least and Greatest.

  • AttributeReference is translated to a column reference.
  • Alias is translated to an alias expression.

The following are currently translated: If and CaseWhen.

  • IsNull and IsNotNull are translated into is null and is not null Oracle predicates.
  • IsNull on a SubqueryExpression is translated into not exists (subquery sql).
  • IsNotNull on a SubqueryExpression is translated into exists (subquery sql), as in the example below.
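
For instance, an IsNotNull predicate over a scalar subquery might render as follows (the tables and columns are hypothetical):

  select c1
  from t1
  where exists (select max(c2) from t2 where t2.fk = t1.id);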

The following are currently translated: Not, And, Or, In, InSet, EqualNullSafe, and BinaryComparison.

The following are currently translated: ScalarSubquery and ListQuery.

Identifiers

  • If spark.sql.caseSensitive=false (the default setting)
    • Table, Schema and Column names are resolved in a case-insensitive manner.
  • If spark.sql.caseSensitive=true
    • Table and Schema names are resolved in a case-sensitive manner; if resolution fails, a case-insensitive resolution is attempted.
    • Column names are resolved in a case-sensitive manner.