Challenge 01 - Ingest Data From Blob Storage

Home - Next Challenge >

Pre-requisites

Introduction

In this challenge you will import several parquet files from Blob Storage into your data warehouse, and then move the data from the staging area to the production one.

You can find all the files for this challenge in the Zip file provided by your coach, at this path: ./Resources/Challenge-01/

Description

Learning objectives:

  • Understand table architecture in a Dedicated SQL Pool:
    • Hash-distributed
    • Round Robin
    • Replicated
  • How to identify Data Skew and how to fix it

Ingest Data from Azure Data Lake Gen2 – Blob container

Define the staging table architecture, then import all the parquet files using the link your coach will provide. Data should land in a staging area (schema “Staging”) optimized to ingest data at maximum speed using the “COPY INTO” T-SQL command.

  • Open C1_1_Import_into_staging_tables.sql
    • Complete the provided T-SQL code to create the staging tables, choosing the proper structure and storage.
    • Import data from the Parquet files into the staging tables using the suggested COPY INTO T-SQL command (see the sketch below).
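
For orientation, here is a minimal sketch of the pattern, assuming a hypothetical Staging.FactInternetSales table with a trimmed-down column list; the storage URL and SAS token are placeholders, and the provided script defines the real names and schemas:

```sql
-- A round-robin heap is typically the fastest structure for raw loads:
-- no hash computation and no columnstore compression during ingest.
CREATE TABLE [Staging].[FactInternetSales]
(
    ProductKey   INT   NOT NULL,
    CustomerKey  INT   NOT NULL,
    OrderDateKey INT   NOT NULL,
    SalesAmount  MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- COPY INTO pulls the parquet files straight from Blob Storage;
-- the URL and SAS token below are placeholders, not real values.
COPY INTO [Staging].[FactInternetSales]
FROM 'https://<storage-account>.blob.core.windows.net/<container>/FactInternetSales.parquet'
WITH
(
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
```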

Move data from staging area to production tables

Optimize each table structure considering whether it is a “dimension” or a “fact” table. Production tables should belong to the “Sales” schema. Check the database diagram (Download available here) to identify the relationships between tables and decide on the proper distribution method. Consider also that tables will be queried by filtering on the CustomerKey, ProductKey and DateKey columns. Choose the column that guarantees an even distribution of data across all distributions. Use the suggested “CREATE TABLE AS” T-SQL command.

  • Open C1_2_Create_Actual_Tables.sql and complete the T-SQL code to move data from staging to production tables, distributing data using the most efficient method given the table relationships described in the database diagram (Download available here). A sketch follows this list.
    • Are dimension tables (DimAccount, DimCustomer, etc.) good candidates to be replicated?
    • Most of your queries will join tables using the CustomerKey, ProductKey, CurrencyKey and FinanceKey fields. Choose the most selective column.
    • Example: check the FactInternetSales table. Is it better to distribute it using the CustomerKey or the ProductKey column?
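
A minimal “CREATE TABLE AS” (CTAS) sketch, under the assumption that ProductKey is the better-spread key for the fact table and that DimCustomer is small enough to replicate; the database diagram and the script determine the real choices:

```sql
-- Fact table: hash-distributed on ProductKey (illustrative choice),
-- with a clustered columnstore index for analytic scans.
CREATE TABLE [Sales].[FactInternetSales]
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [Staging].[FactInternetSales];

-- Small dimension table: replicated to every distribution,
-- so joins against it require no data movement.
CREATE TABLE [Sales].[DimCustomer]
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [Staging].[DimCustomer];
```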

Investigate slowness due to data skew

Users are complaining that a query is taking too long to complete and need your assistance to fix the issue. Investigate why the query is so slow and make it faster.

  • Open C1_3_Check_and_Fix_Table_Skew.sql and investigate the issue using the suggested set of T-SQL commands, then fix it (one possible approach is sketched below).
    • Check for table skew.
    • Is the distribution column for the FactSales table good enough?
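
One possible approach, sketched under the assumptions that FactSales lives in the Sales schema and that re-hashing on a different column (illustrative here) fixes the skew; the script’s suggested commands are authoritative:

```sql
-- Compare space and row counts across the 60 distributions of the pool;
-- a large spread between the smallest and largest counts indicates skew.
DBCC PDW_SHOWSPACEUSED('Sales.FactSales');

-- Redistribute with CTAS on a better-spread column (illustrative choice),
-- then swap the old and new tables.
CREATE TABLE [Sales].[FactSales_new]
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [Sales].[FactSales];

RENAME OBJECT [Sales].[FactSales] TO [FactSales_old];
RENAME OBJECT [Sales].[FactSales_new] TO [FactSales];
```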

Success Criteria

  • Identify the best design for Staging tables to improve performance during load
  • Verify all the methods you can use to load data from Blob storage
  • Choose the best design (distribution method) for production tables
  • Check Data Skew and fix it

Learning Resources