WW Importers keeps missing the SLAs for its nightly data load process. The loads take six hours to complete and start each night at 1:00AM. They must complete by 8:00AM, but frequently these jobs take longer than planned. In addition, a few critical stakeholders are asking for the data to be delivered even more frequently. Since these business units are the key stakeholders, they have the funding to help replatform the data pipelines. WW Importers realizes it needs to leverage its new Data Lake to scale, and to load this data into the Data Warehouse for Stage 3. These data pipelines must be ELT (Extract, Load & Transform) so they can quickly write the data to the cloud and scale out the compute to transform it.
The objective of this lab is to modernize the ETL pipeline that was originally built in SSIS; the current workflow is summarized below. We need to rebuild this pipeline in Azure, leveraging a scale-out architecture to transform the data. The data flow will include steps to extract the data from the OLTP platform, store it in the Azure Data Lake, and bulk ingest it into Azure Synapse Analytics. This will run on a nightly basis and will need to leverage Azure Data Factory or Synapse Pipelines as the job orchestration and scheduling tool.
Below is a summary of each of the tasks in the existing SSIS package. Note that we will be able to re-use the existing scripts for all of these tasks except for step 6.
1. The first step of the pipeline is to retrieve the “ETL Cutoff Date”. This date can be found in the [Integration].[Load_Control] table in the Azure Synapse SQL Pool that should have been created as part of Challenge 1. (This step should have already been recreated in Challenge 2.)
2. The next step ensures that the [Dimension].[Date] table is current by executing [Integration].[PopulateDateDimensionForYear] in the Synapse SQL Pool.
3. The [Integration].[GetLineageKey] procedure is executed to create a record for each activity in the [Integration].[Lineage Key] table.
4. The [Integration].[[Table]_Staging] tables are truncated to prepare them for new data.
5. The cutoff date for the last successful load of each table is retrieved from the [Integration].[ETL Cutoffs] table (see the example query after this list).
6. New data is read from the OLTP source (using the [Integration].[Get[Table]Updates] procedures) and copied into the [Integration].[[Table]_Staging] tables in the target DW.
7. Finally, the staged data is merged into the [Dimension] and [Fact] tables by executing the [Integration].[MigrateStaged[Table]] stored procedures in the target DW.
   - NOTE: As part of this step, surrogate keys are generated for new attributes in Dimension tables (tables in the [Dimension] schema), so Dimension tables must be loaded before Fact tables to maintain data integrity.
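For steps 1 and 5, a pipeline Lookup activity can run a simple query against the Synapse SQL Pool to retrieve the cutoff value. The query below is only a sketch; the [Table Name] and [Cutoff Time] column names are assumptions and should be adjusted to match the schema you created in Challenge 1.

```sql
-- Sketch of a cutoff-date lookup; column names are assumed, not taken from the lab schema.
SELECT [Cutoff Time]                -- assumed column holding the last successful load time
FROM   [Integration].[ETL Cutoffs]
WHERE  [Table Name] = 'City';       -- assumed column identifying the target table
```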
Note: This challenge is intended to build upon the previous 2 challenges, and you should try to reuse content wherever possible
Create a data pipeline for the [Dimension].[City] table using the logic described above. Follow these steps for this pipeline:
- Add a new activity to your Synapse Pipeline to load data from the new Azure Data Lake into the [Integration].[City_Staging] table in the Data Warehouse in Azure Synapse via PolyBase (this correlates to Step 6 in the existing SSIS package described above). Note: Be sure that the table exists and is empty prior to loading. An example of this load is sketched after these steps.
- Add an activity to execute the [Integration].[GetLineageKey] stored procedure so that the process can be logged (this correlates to Step 3 in the existing SSIS package described above)
- Create another activity to merge the new data into the target table ([Dimension].[City]) from your staging table [Integration].[City_Staging] via the existing stored procedure (this correlates to Step 7 in the existing SSIS package described above)
- Add another new activity to move the files to the .\RAW\WWIDB[TABLE]{YY}{MM}{DD}\ directory in your data lake once they have been loaded into your DW table (this is a new task that will allow your data to be persisted in your new data lake for further exploration and integration into downstream systems)
- Test your new Synapse Pipeline. You can execute your pipeline by clicking on "Debug" or by adding a Trigger in the design UI. If done correctly, the 11 new City records should now appear in the [Dimension].[City] table in the Azure Synapse Analytics DW.
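One way to implement the PolyBase-style load into [Integration].[City_Staging] is the COPY statement in the dedicated SQL pool. The sketch below is a starting point only: the storage account, container, folder path, CSV options, and the managed identity credential are all assumptions that must be adapted to your environment.

```sql
-- Minimal sketch of loading staged City files from the data lake into the staging table.
-- The storage URL, folder path, credential, and CSV options are assumptions, not lab-provided values.
COPY INTO [Integration].[City_Staging]
FROM 'https://<yourstorageaccount>.dfs.core.windows.net/<yourcontainer>/RAW/WWIDB/City/'
WITH (
    FILE_TYPE       = 'CSV',
    CREDENTIAL      = (IDENTITY = 'Managed Identity'),  -- assumes the workspace identity can read the lake
    FIELDTERMINATOR = ',',
    FIRSTROW        = 2                                  -- assumes a header row in the source files
);
```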
- Load data into DW via Polybase
- Incrementally load multiple tables in Azure Data Factory
- Azure Data Factory Copy Activity
- There are multiple ways to load data via PolyBase. You could potentially use:
  - "CTAS" with "External Tables" (a sketch of this approach is shown below)
  - the "COPY Command" in Azure Synapse Analytics, or
  - the Copy Data activity in Azure Data Factory
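As an illustration of the CTAS-with-external-tables option, the sketch below assumes an external data source and external file format have already been created; the object names (WWIDataLake, CsvFormat), folder path, and column list are placeholders rather than objects provided by this challenge. Note that CTAS creates the table it loads, so if [Integration].[City_Staging] already exists you would either drop it first or INSERT...SELECT from the external table instead.

```sql
-- Illustrative only: external data source, file format, path, and columns are placeholders.
CREATE EXTERNAL TABLE [Integration].[City_External]
(
    [WWI City ID]    int,
    [City]           nvarchar(50),
    [State Province] nvarchar(50)
)
WITH (
    LOCATION    = '/RAW/WWIDB/City/',   -- placeholder folder in the data lake
    DATA_SOURCE = WWIDataLake,          -- assumed pre-created external data source
    FILE_FORMAT = CsvFormat             -- assumed pre-created external file format
);

-- Land the data with a round-robin CTAS (creates the target table as part of the load).
CREATE TABLE [Integration].[City_Staging]
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM [Integration].[City_External];
```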
- For the MERGE process, you should reuse or replicate the logic found in the Integration.MigrateStaged____Data stored procedures (a generic sketch of the pattern is shown below)
- Optimize where possible by using dynamic code, and executing tasks in parallel.
- Additional information on using Lookup activities and expressions in Azure Data Factory can be found in the Azure Data Factory documentation
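The existing MigrateStaged procedures already contain the merge logic, so the block below is only a generic illustration of the staging-to-dimension pattern; the join key, column list, and [Lineage Key] handling are assumptions, not the actual procedure code. If MERGE is restricted in your dedicated SQL pool, an UPDATE of matched rows followed by an INSERT of unmatched rows achieves the same result.

```sql
-- Generic staging-to-dimension merge pattern; keys and columns are placeholders,
-- not the real logic inside the MigrateStaged procedures.
MERGE [Dimension].[City] AS tgt
USING [Integration].[City_Staging] AS src
    ON tgt.[WWI City ID] = src.[WWI City ID]          -- assumed business key
WHEN MATCHED THEN
    UPDATE SET tgt.[City]           = src.[City],
               tgt.[State Province] = src.[State Province],
               tgt.[Lineage Key]    = src.[Lineage Key]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([WWI City ID], [City], [State Province], [Lineage Key])
    VALUES (src.[WWI City ID], src.[City], src.[State Province], src.[Lineage Key]);
```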
- Enhance the pipeline so that it can be used to load all tables. Steps required for this would include:
- Update Azure Data Factory to use expressions and parameters wherever possible
- Add a ForEach Loop to iterate through all tables and execute your pipeline for each one (note: Dimensions need to be loaded prior to Facts; a sketch of a metadata-driven table list is shown below)
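One common way to drive the ForEach is a Lookup activity that returns the list of tables from a small metadata table, ordered so dimensions are processed before facts. Everything below, including the [Integration].[ETL Tables] table, its columns, and the load-order convention, is a hypothetical illustration rather than an object that ships with this lab. Also note that ForEach iterations run in parallel by default, so you may want a sequential ForEach, or separate loops for dimensions and facts.

```sql
-- Hypothetical metadata table and query used to feed a ForEach activity.
-- Lower [Load Order] values (dimensions) are processed before higher ones (facts).
CREATE TABLE [Integration].[ETL Tables]
(
    [Table Name]  nvarchar(128) NOT NULL,
    [Schema Name] nvarchar(128) NOT NULL,   -- 'Dimension' or 'Fact'
    [Load Order]  int           NOT NULL
);

-- Lookup activity source query: dimensions first, then facts.
SELECT [Schema Name], [Table Name]
FROM   [Integration].[ETL Tables]
ORDER BY [Load Order];
```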
- Leverage partition switching for tables with large-scale modifications (UPDATES); a sketch is shown at the end of this list
- Refactor the T-SQL code in Polybase to leverage Python or Scala
- Build out these data pipelines using Azure Mapping Data Flows
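For the partition-switching option above, the general pattern is to load the affected data into a staging table that has an identical schema, distribution, and partition scheme to the target, then switch the partition in. The sketch below is illustrative only; the table names, partition number, and the use of TRUNCATE_TARGET are assumptions to adapt to your own design.

```sql
-- Illustrative partition switch in a dedicated SQL pool; names and partition number are assumptions.
-- Source and target must share schema, distribution, and partition boundaries.
ALTER TABLE [Integration].[Sale_Staging]
    SWITCH PARTITION 3 TO [Fact].[Sale] PARTITION 3
    WITH (TRUNCATE_TARGET = ON);   -- overwrites the existing data in the target partition
```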