The objective of this assessment is to showcase our ability to:
- Design and build a basic ELT pipeline, moving and processing raw data into aggregated, analytics-ready datasets.
- Showcase our ability to use modern tooling to efficiently generate the desired results
- Have fun!
We are presented with an Excel spreadsheet storing sightings of Carmen Sandiego and are asked to answer a few questions with respect to this dataset.
In this version of the assignment, the solution leverages a dbt project linked to a local instance of PostgreSQL database. It organizes the tables and views following medallion architecture.
An alternative, spark-based approach leveraing Databricks DLT and Laktory was also provided here.
The problem to be solved is an excellent candidate for a medallion architecture, consisting of 3 layers:
- Bronze tables: raw and unaltered data
- Silver tables: filtered, cleaned and modeled data
- Gold tables: aggregated data, ready for analytics consumption and specific use cases
The bronze tables are an exact representation of the data found in the Excel
spreadsheets. There is one table per region (or per sheet) using the following
naming convention: brz_carmen_sigthings_{region}
. As an example, here is the
bronze table for asia:
The column names have not been altered in any way and no transformation have been applied. The tables are created from csv using dbt seed function.
The csv files were generated from a python script using pandas reader and writer:
import pandas as pd
REGIONS = ["africa", "america", "asia", "atlantic", "australia", "europe",
"indian", "pacific"]
for region in REGIONS:
# Read
df = pd.read_excel("../data/carmen_sightings_20220629061307.xlsx", sheet_name=region.upper())
df["region"] = region
# Write
df.to_csv(f"../seeds/brz_carmen_sightings_{region}.csv", index=False)
The next step is to define the silver as a transformation of the bronze tables. The transformation in this case consisted of :
- Standardize column names
- Selecting column of interest
- Creating normalization ids
A preliminary analysis of the raw data has shown that no duplicates were found in any of the regions. Here is what it looks like for the Asia region.
We used a dbt macro to apply the same transformation to all regions.
{% macro create_silver(source_table, renames) %}
select
{{ renames.date_witness }}::date as date_witness,
{{ renames.witness }}::varchar as witness,
{{ renames.agent }}::varchar as agent,
{{ renames.date_agent }}::date as date_agent,
{{ renames.city_agent }}::varchar as city_agent,
{{ renames.country }}::varchar as country,
{{ renames.city }}::varchar as city,
{{ renames.latitude }}::float as latitude,
{{ renames.longitude }}::float as longitude,
{{ renames.has_weapon }}::boolean as has_weapon,
{{ renames.has_hat }}::boolean as has_hat,
{{ renames.has_jacket }}::boolean as has_jacket,
{{ renames.behavior }}::varchar as behavior,
region::varchar as region,
date_trunc('month', {{ renames.date_witness }}::date) as month_witness,
{{ dbt_utils.generate_surrogate_key([renames.city]) }} as location_id,
{{ dbt_utils.generate_surrogate_key([renames.agent, renames.date_agent, renames.city_agent]) }} as report_id
from
{{ source_table }}
{% endmacro %}
The renames
dictionary is region-specific and maps the region-specific column
names to the standard column names. Columns location_id
and report_id
will
be used for normalization of the data model.
Once we have all the region-specific silver tables, we create
slv_carmen_sightings
which is a union of all other silver tables.
WITH
africa AS (SELECT * FROM {{ ref('slv_carmen_sightings_africa') }}),
america AS (SELECT * FROM {{ ref('slv_carmen_sightings_america') }}),
asia AS (SELECT * FROM {{ ref('slv_carmen_sightings_asia') }}),
atlantic AS (SELECT * FROM {{ ref('slv_carmen_sightings_atlantic') }}),
australia AS (SELECT * FROM {{ ref('slv_carmen_sightings_australia') }}),
europe AS (SELECT * FROM {{ ref('slv_carmen_sightings_europe') }}),
indian AS (SELECT * FROM {{ ref('slv_carmen_sightings_indian') }}),
pacific AS (SELECT * FROM {{ ref('slv_carmen_sightings_pacific') }})
SELECT
{{ dbt_utils.generate_surrogate_key(['date_witness', 'location_id', 'report_id']) }} as id,
*
FROM
(
SELECT * FROM africa
UNION
SELECT * FROM america
UNION
SELECT * FROM asia
UNION
SELECT * FROM atlantic
UNION
SELECT * FROM australia
UNION
SELECT * FROM europe
UNION
SELECT * FROM indian
UNION
SELECT * FROM pacific
)
To reduce redundancy and better organize the data for analytics, the wide table is transformed in 3 tables:
- 1 fact table
slv_fact_sightings
listing all sightings - 1 fact table
slv_fact_reports
listing all reports (agent, date city) - 1 dimension
slv_dim_locations
table providing the details of a location (city, country, region, latitude and longitude)
The relationships are summarized in this entity relationships diagram:
Finally, 3 gold tables (and a view) were defined to the analytics questions.
The first gold table gld_sightings_by_behavior
counts the number of occurrence
of each behavior. It is defined as
select
behavior,
count(id) as count
from
{{ref("slv_carmen_sightings")}}
group by
behavior
order by
count desc
The second table gld_carmen_sightings_by_month
is the monthly aggregation defined as:
SELECT
month,
count(id) as count,
sum(has_target_behavior::int) as target_behavior_count,
sum(has_target_outfit::int) as target_outfit_count,
sum(has_target_behavior::int)::float / count(id) as target_behavior_probability,
sum(has_target_outfit::int)::float / count(id) as target_outfit_probability
from (
select
*,
behavior IN ('out-of-control', 'complaining', 'happy') as has_target_behavior,
has_weapon AND has_jacket AND NOT has_hat AS has_target_outfit
from
{{ref("slv_fact_sightings")}}
)
group by
month
order by
month desc
This one is a bit more involving as it:
- Create a new column
has_target_behavior
that is1
when the behavior is in the top 3 identified previously - Create a new column
has_target_outfit
that is1
when the outfit (armed, hat, jacket) matches the description provided in the assessment - Group all rows of a month and
- Count the total number of rows for each month
- Count the match for target behavior
- Count the match for target outfit
- Calculate the probability of seeing target behavior
target_behavior_probability
- Calculate the probability of seeing target outfit
target_outfit_probability
To identify the region most frequently visited by Carmen each month, we proceed in two steps:
- Create a view
gld_sightings_by_month_region
that counts the number of sightings per region per month - Create a view
gld_most_visited_regions
on top of that view that keeps only the row associated with the region with most occurrence for each month
They are defined as:
select
month,
region,
count(id)
from (
select
*
from
{{ref("slv_fact_sightings")}} as f
join
{{ref("slv_dim_locations")}} as d
on f.location_id = d.location_id
)
group by
month, region
order by
month desc, count desc
and
select
distinct on (month) month, region, count
from
{{ref("gld_sightings_by_month_region")}}
order by
month desc, count desc
As an added bonus, dbt builds a data lineage graphs that helps understand the relationship between the different tables.
The answer is provided in column region
of table gld_most_visited_regions
.
Here are the 6 most recent results:
month | region |
---|---|
2022-06 | america |
2022-05 | asia |
2022-04 | america |
2022-03 | europe |
2022-02 | america |
2022-01 | america |
The answer is provided in column target_outfit_probability
of table gld_sightings_by_month
.
Here are the 6 most recent results:
month | probability |
---|---|
2022-06 | 3.448% |
2022-05 | 3.226% |
2022-04 | 3.333% |
2022-03 | 6.452% |
2022-02 | 3.571% |
2022-01 | 0.000% |
Overall, the probability of Carmen to be armed, wearing a jacket, but no hat are very low with 3.91% on average. The only general observation that can be drawn from this is that this specific combination is very rare. Further investigation would be required to identify if:
- One of the contributing factor is low in itself (she could be armed on very rare occasions for example)
- She tends to wear a hat when she is armed
- We have a situation of survivorship bias where wearing a jacket means being able to better conceal the weapon and hiding it from the agent.
The answer is provided in column count
of table gld_carmen_sightings_by_behavior
.
Here are the 3 most frequent results:
- out-of-control
- complaining
- happy
The answer is provided in column target_behavior_probability
of table gld_carmen_sightings_by_month
.
Here are the 6 most recent results:
month | probability |
---|---|
2022-06 | 24.14% |
2022-05 | 22.58% |
2022-04 | 20.00% |
2022-03 | 16.13% |
2022-02 | 10.71% |
2022-01 | 12.90% |