Skip to content
Thibaut Barrère edited this page May 13, 2020 · 6 revisions

⚠️ This document is a draft

A common requirement during ETL work is to search a SQL column for a given (unique) value, find the corresponding row, then get the value of a second column in that same row. This can be very valuable to:

  • Replace a given business key by the corresponding surrogate key (see wikipedia page on surrogate keys) in e.g. a data warehouse.
  • Help handle relationships conversion (foreign keys) between tables during migrations between databases (e.g. one could keep a legacy_id in a new product table, then an invoice line item referring to the legacy_id can easily replace that id with the new target id)

Kiba Pro SQLBulkLookup provides an easy and efficient way to replace those relations. It handles a large group of rows (batch) at once to avoid N queries, but rather 1 per rows batch.

Currently tested against: PostgreSQL 9.5+, MySQL 5.5+, MRI Ruby 2.4-2.7.

Requirements: make sure to add those to your Gemfile:

  • sequel gem
  • pg gem (if you connect to Postgres)
  • mysql2 gem (if you connect to MySQL)

Typical use

Given the following products table:

id legacy_product_id
1 100
2 107
3 126

One can setup a transform like this:

require 'kiba-common/sources/enumerable' # for demo
require 'kiba-pro/transforms/sql_bulk_lookup'

job = Kiba.parse do
  # fake a source which would refer to some db specific ids, lacking the primary key we need
  source Kiba::Common::Sources::Enumerable, -> [{external_id: 100}, {external_id: 107}, {external_id: 126}]

  transform Kiba::Pro::Transforms::SQLBulkLookup,
    # NOTE: a live Sequel connection must be passed here
    database: db,
    table: :products,
    # number of rows to batch in a single SQL lookup query
    buffer_size: 2_500,
    row_input: :external_id, # name of the Hash key above
    sql_input: :legacy_product_id, # name of the SQL column we'll be searching for
    sql_output: :id, # name of the SQL column we want to retrieve
    row_output: :product_id # name of the Hash key to add in the row with the retrieved value
  
  # SNIP
end

Such a transform will give us rows with the following content:

{external_id: 100, product_id: 1}
{external_id: 107, product_id: 2}
{external_id: 126, product_id: 3}