Skip to content
Angus Miller edited this page Jun 12, 2020 · 6 revisions

A common requirement during ETL processing is to search a table column for a given (unique) value, find the corresponding row, then get the value of a second column in that same row.

This can be very valuable to:

  • Lookup dimension keys using names e.g. find the department named Sales Department and get the primary key value e.g 101
  • Replace a given business key by the corresponding surrogate key (see wikipedia page on surrogate keys) in e.g. a datawarehouse.
  • Help handle relationships conversion (foreign keys) between tables during migrations between databases (e.g. one could keep a legacy_id in a new product table, then an invoice line item referring to the legacy_id can easily replace that id with the new target id)

Kiba Pro SQLBulkLookup provides an easy and efficient way to replace those relations. It handles a large group of rows (batch) at once to avoid N queries, but rather 1 per rows batch.

Currently tested against: PostgreSQL 9.5+, MySQL 5.5+, MRI Ruby 2.4-2.7.

Requirements: make sure to add those to your Gemfile:

  • sequel gem
  • pg gem (if you connect to Postgres)
  • mysql2 gem (if you connect to MySQL)

Typical use

Given the following products table:

id legacy_product_id
1 100
2 107
3 126

One can setup a transform like this:

require 'kiba-common/sources/enumerable' # for demo
require 'kiba-pro/transforms/sql_bulk_lookup'

job = Kiba.parse do
  # fake a source which would refer to some db specific ids, lacking the primary key we need
  source Kiba::Common::Sources::Enumerable, -> [{external_id: 100}, {external_id: 107}, {external_id: 126}]

  transform Kiba::Pro::Transforms::SQLBulkLookup,
    # NOTE: a live Sequel connection must be passed here
    database: db,
    table: :products,
    # number of rows to batch in a single SQL lookup query
    buffer_size: 2_500,
    row_input: :external_id, # name of the Hash key above
    sql_input: :legacy_product_id, # name of the SQL column we'll be searching for
    sql_output: :id, # name of the SQL column we want to retrieve
    row_output: :product_id # name of the Hash key to add in the row with the retrieved value
  
  # SNIP
end

Such a transform will give us rows with the following content:

{external_id: 100, product_id: 1}
{external_id: 107, product_id: 2}
{external_id: 126, product_id: 3}

Behaviour on lookups

By default, to be safe, the transform will raise an error and interrupt the processing if:

  • No matching record is found for the lookup criteria.
  • More than one matching record is found for the lookup criteria.
  • A matching record is found, but the column referred to by sql_output: has a NULL value

This fail-fast behaviour is here by design to ensure that by default, you will not unknowingly create orphan records, which can be harmful in your reports or datawarehouses.

If your scenario can safely accept missing ids, you can turn on the following option:

transform Kiba::Pro::Transforms::SQLBulkLookup,
  # SNIP
  raise_if_no_match: false

There is no option at the moment to support cases where you would have more than one match. A future "joiner" component may handle that. Please get in touch if you are interested!