diff --git a/source/documentation/tools/create-a-derived-table/data-modelling-concepts-probate-case-study.md b/source/documentation/tools/create-a-derived-table/data-modelling-concepts-probate-case-study.md
new file mode 100644
index 00000000..81484559
--- /dev/null
+++ b/source/documentation/tools/create-a-derived-table/data-modelling-concepts-probate-case-study.md
@@ -0,0 +1,76 @@
+# Data Modelling Probate Case Study
+
+## Overview
+
+A “probate caveat” is a legal notice to the court that a person has or may have an interest in a decedent’s estate, and they want to be notified before the court takes certain actions, like granting probate to a will. The primary purpose of filing a caveat is to prevent the probate of a will, at least temporarily, so that concerns or objections about the validity of the will or the appointment of a particular personal representative can be addressed.
+
+Here’s a simplified breakdown:
+
+ 1. Notification: A person (often a potential heir or beneficiary) files the probate caveat to alert the court of their interest in the estate or concerns about the will.
+ 2. Probate Hold: Once a caveat is filed, the court typically won’t grant probate to the will without first addressing the concerns raised by the caveator (the person who filed the caveat).
+ 3. Resolution: The issues brought up by the caveat might be resolved through legal hearings or other proceedings, where evidence is presented and the validity of the will or the suitability of the personal representative is determined.
+
+## Understanding The Data
+
+We have produced [summary statistics](https://alpha-mojap-ccd.s3.eu-west-1.amazonaws.com/ccd-analysis/results/caveat/ccd_probate_analysis.html) for the probate data that is available on Athena to gain an overall understanding of the data and to identify any empty variables, or variables that contain only blocks of free text that are not very useful for analysis. The data shows us that all the variables at event level have a "ce" prefix (e.g. ce_id, ce_event_id) whereas those relevant at case level have a "cd" prefix (e.g. cd_reference, cd_latest_state). So the first question we should be asking is whether we want our dimensional model to be at case or event level. There are around 300 duplicate rows that need to be removed. We can also see that there are 87 unique events within 17 possible states.
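+
+To make this profiling step concrete, here is a minimal sketch of the kind of Athena (Presto) SQL involved. The table name `probate_caveat_events` is an illustrative assumption rather than the real source name.
+
+```sql
+-- Overall shape of the event-level data
+select
+    count(*)                        as total_rows,
+    count(distinct ce_event_id)     as unique_event_types,   -- around 87 in the caveat data
+    count(distinct cd_latest_state) as unique_states         -- around 17 in the caveat data
+from probate_caveat_events;
+
+-- Exact duplicate rows, which need removing before modelling
+select
+    count(*) - (select count(*) from (select distinct * from probate_caveat_events) as deduped) as duplicate_rows
+from probate_caveat_events;
+```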
+
+Here is a flowchart of the overall caveat process.
+
+ Probate Caveat Flowchart
+
+## Identifying Dimensions and Measures
+
+Next, we need to identify the key dimensions and measures within our data. Dimensions represent different ways to categorize our data, while measures are the numerical values we want to aggregate or analyze. We began by examining the JSON [code](https://github.com/moj-analytical-services/airflow-hmcts-sdp-load/blob/main/services/ccd/metadata/raw/probate_caveat.json) provided in the HMCTS probate repo. It contains the columns and their corresponding attributes, providing valuable insights into our data structure. We will leverage this information to create our dimensional model.
+
+Some questions to ask when deciding whether a column belongs in a fact table or a dimension table:
+
+1. Is the column a measure of business performance? If yes, it’s likely a fact. For example, “Total Sales”.
+2. Is the column descriptive or categorical in nature? If yes, it’s likely a dimension attribute. For example, “Product Color”.
+3. Is the column an identifier or transaction number without other related attributes? It might be a degenerate dimension in the fact table.
+4. Does the column help to slice and dice the measures? If yes, it’s likely a dimension. For example, the “Country” dimension helps to analyze “Sales” by country.
+
+However, always keep in mind the specific business requirements and context. Sometimes, design choices are influenced by factors like query performance, ease of use for end users, and the nature of the reporting.
+
+## Designing Dimension Tables
+
+Based on the identified dimensions, we will create our dimension tables. These will capture the unique values of each dimension and any additional attributes. For example, we will have dimension tables for applicantOrganisationPolicy, bulkPrintId, caseMatches, scannedDocuments, bulkScanEnvelopes, documentsGenerated, documentsReceived, documentsScanned, caseLegacyId, events, and payments. In the JSON file we can also see several variables about the deceased or the caveator, which we can split out into their own dimension tables.
+
+## Designing Fact Tables
+
+Moving on to designing the fact tables, which is the crucial part of this process, we have agreed to create two tables: one at ce (case event) level and the other at cd (case data) level.
+
+The ce_fact table represents the fact data related to the case event information. It captures the primary key, ce_id, which uniquely identifies each ce event. This fact table will include foreign keys referencing the dimension tables associated with ce data. It allows us to analyze and measure various aspects of ce events, such as state and event details, along with their corresponding attributes.
+
+The cd_fact table represents the fact data related to the cd (case data) information over time. It captures the primary key, cd_reference, which uniquely identifies each cd data entry. This fact table will include foreign keys referencing the dimension tables associated with cd data. It allows us to analyze and measure various aspects of cd data, such as creation dates, jurisdiction, references, and security classifications.
+
+The main reason for having separate fact tables is to maintain the granularity and integrity of the data. In some scenarios, ce events and cd data may have different attributes, timeframes, or levels of detail. By separating them into distinct fact tables, we can analyze and report on them independently and accurately. This approach provides flexibility and avoids data redundancy or confusion when querying or aggregating the data. Having two fact tables also allows us to capture and analyze different aspects of the overall process. For example, the ce_fact table can focus on event-related metrics, such as event types, user information, and state transitions, while the cd_fact table can focus on case-specific details. Separating the facts into two tables therefore enables a clearer understanding and analysis of the distinct dimensions associated with each aspect of the data.
+
+## Populating Dimension Tables
+
+Once the dimension tables have been designed, we populate them with the relevant data from the JSON file. This step ensures that we have complete and accurate dimension tables to reference in our dimensional model.
+
+## Establishing Relationships
+
+The next stage is to establish relationships with the use of foreign keys. These keys connect the fact tables with the appropriate dimension tables, enabling us to combine and analyze data from different dimensions seamlessly. An example here is the user dimension table, which is linked to the ce fact table by user_id: ce_id is the primary key of the fact table and user_id is the foreign key referencing the user dimension.
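+
+As an illustration of how these relationships are used, the hedged sketch below joins the ce fact table to the user dimension to count events per user. The names follow the design described above, but `user_dim` is an assumed name and the exact names in the deployed models may differ.
+
+```sql
+-- Count case events per user by following the user_id foreign key
+select
+    du.user_id,
+    count(fc.ce_id) as number_of_events
+from ce_fact as fc
+left join user_dim as du
+    on fc.user_id = du.user_id
+group by du.user_id
+order by number_of_events desc;
+```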
+
+For cases where there is no foreign key to link a dimension table with the fact table, we have a couple of options to consider. We can build a bridge table, which acts as a mediator between the fact and dimension tables, or introduce a surrogate key in the dimension table. A surrogate key is a system-generated unique identifier that serves as a substitute for a missing or inadequate foreign key. By assigning a surrogate key to the dimension table, we can establish a relationship with the fact table.
+
+In our example, we have a caveator dimension table but no caveator_id in the overall probate data, which means that we need to create a surrogate key to link this table with the relevant fact table. dbt has an in-built [macro](https://github.com/dbt-labs/dbt-utils#generate_surrogate_key-source) where you can specify which columns to use to create the key (a minimal sketch of this is included at the end of this page).
+
+The two fact tables are linked via cd_id, which is a surrogate key generated from cd_reference and cd_last_modified.
+
+It’s important to note that the absence of a foreign key can impact the integrity and completeness of the dimensional model. It’s recommended to thoroughly assess the implications and consult with subject matter experts and stakeholders to determine the best approach for handling such scenarios.
+
+## Validating and Refining the Model
+
+It's important to note that this whole process is iterative. Things will keep changing, and before finalizing the dimensional model we need to validate and refine it. We will review the model and make any necessary adjustments based on user feedback and specific business requirements, but it is also important to acknowledge the limitations of our own data.
+
+## Creating the Star Schema
+
+The diagram below shows a possible star schema. In the star schema, the fact tables (ce_fact and cd_fact) serve as the central points, connected to various dimension tables. Each dimension table captures specific attributes related to the ce and cd data, enabling efficient analysis and reporting.
+After various discussions we have omitted the caveatorAddress and deceasedAddress dimension tables, as we agreed that usually only the postcode is required for any geospatial or statistical analysis, so we kept the postcode fields in the caveator and deceased person dimension tables.
+
+Looking at the cd fact table, you may wonder why some ce variables appear there. The reason is that these variables are consistent throughout the case and are not affected by a change in event, and as our goal is to keep tables as simple and consistent as possible we decided to move them into the cd fact table. You may also decide that some of these variables are not needed for the analysis at all, so we can omit them completely.
+
+ Probate Caveat Star Schema
+
+## Closure
+
+That concludes our explanation of the end-to-end process for building the probate dimensional model based on the provided JSON code. By following this process and leveraging the star schema, we can easily analyze and gain valuable insights from our data.
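+
+To make the surrogate key step described under "Establishing Relationships" concrete, here is a minimal sketch of how the keys could be generated with the dbt-utils macro. The staging model and column names (`stg_probate_caveat`, `caveator_forenames`, `caveator_surname`, `caveator_post_code`) are illustrative assumptions; the exact columns will depend on the source JSON.
+
+```sql
+-- Illustrative surrogate keys built in a staging/intermediate model
+select
+    -- key that links the two fact tables
+    {{ dbt_utils.generate_surrogate_key(['cd_reference', 'cd_last_modified']) }} as cd_id,
+
+    -- key for the caveator dimension, which has no natural identifier in the source data
+    {{ dbt_utils.generate_surrogate_key(['caveator_forenames', 'caveator_surname', 'caveator_post_code']) }} as caveator_id,
+
+    stg.*
+from {{ ref('stg_probate_caveat') }} as stg
+```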
diff --git a/source/documentation/tools/create-a-derived-table/data-modelling-concepts.md b/source/documentation/tools/create-a-derived-table/data-modelling-concepts.md
index 96b4682a..1ffbb62a 100644
--- a/source/documentation/tools/create-a-derived-table/data-modelling-concepts.md
+++ b/source/documentation/tools/create-a-derived-table/data-modelling-concepts.md
@@ -1,5 +1,98 @@
-# Data Modelling Concepts - placeholder
+# Data Modelling Overview
 
 ⚠️ This service is in beta ⚠️
 
-This page is intended to give users a brief introduction to Data Modelling concepts and why we are using `dbt` as the backend for `create-a-derived-table`. Please post suggestions to improve this document in our slack channel [#ask-data-modelling](https://asdslack.slack.com/archives/C03J21VFHQ9), or edit and raise a PR.
\ No newline at end of file
+This page is intended to give users a brief introduction to Dimensional Modelling concepts, the process the Data Modelling team take to create Dimensional Models, and why we are using `dbt` as the backend for `create-a-derived-table`. Please post suggestions to improve this document in our slack channel [#ask-data-modelling](https://asdslack.slack.com/archives/C03J21VFHQ9), or edit and raise a PR.
+
+## Dimensional Modelling: Key Concepts
+
+Data modelling is the process of creating a structured representation of data. There are several approaches to creating a data model, but the data modelling team tends to use and endorse the dimensional modelling approach introduced by Ralph Kimball. If you have heard the term data modelling used by members of the team then it is likely they are referring to dimensional modelling.
+
+This section covers several important concepts related to dimensional modelling, and the explanations are heavily influenced by those given in 'The Data Warehouse Toolkit, 3rd Edition' by Ralph Kimball and Margy Ross. This book, as well as ['Kimball Dimensional Modeling Techniques'](http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf), is a great place to start if you want to dive deeper into dimensional modelling.
+
+### What is Dimensional Modelling
+
+Dimensional Modelling involves designing the data structure in a way that optimizes querying and analysis. This is done by organising data into easily understandable "dimensions" (descriptive categories, such as time, geography, or product) and "facts" (measurable metrics, such as sales or revenue). The core guiding principle behind dimensional modelling is simplicity. Simplicity is critical because it ensures that users can easily understand the data, and allows software to navigate the data and deliver results quickly and efficiently.
+
+### What are Fact Tables
+
+The fact table in a dimensional model stores the measurements or outcomes which result from an event. You should strive to store the low-level measurement data resulting from a business process in a single dimensional model.
+
+Each row in a fact table corresponds to a measurement event. The data on each row is at a specific level of detail, referred to as the grain, such as one row per disposal. One of the core tenets of dimensional modelling is that all the measurement rows in a fact table must be at the same grain. Having the discipline to create fact tables with a single level of detail ensures that measurements aren’t inappropriately double-counted.
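+
+To make the idea of grain concrete, the hedged sketch below shows what a dbt model for a fact table declared at the grain of one row per disposal might look like. The model and column names (`stg_disposals`, `disposal_id`, `sentence_length_days` and so on) are purely illustrative.
+
+```sql
+-- Illustrative fact table at the grain of one row per disposal:
+-- every column on the row describes or measures that single disposal.
+select
+    disposal_id,            -- identifies the measurement event
+    disposal_date,          -- descriptive context, later linked to dimensions
+    court_code,
+    defendant_id,
+    offence_code,
+    sentence_length_days,   -- measures recorded at this grain
+    fine_amount
+from {{ ref('stg_disposals') }}
+```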
+
+### What are Dimension Tables
+
+Dimension tables are integral companions to a fact table. They contain the descriptive context associated with a fact table event, describing the “who, what, where, when, how, and why” associated with the event.
+
+### What is a Star Schema
+
+A star schema is a dimensional model which contains a single fact table, containing the measurements/outcomes of a business process, connected to several dimension tables that provide the descriptive context for that measurement/outcome. For example, for a disposal dimensional model we would have a single fact table containing the disposals that occurred, and then several dimension tables such as defendant, offence and court. A visual example can be seen here [LINK to Khristiania's diagram]
+
+### What is meant by the "Grain" of the data
+
+The grain is the level of detail that a row of data in a fact table represents. We often work from the principle that it is best to use the atomic grain, which is the lowest level of grain, as it is easier to aggregate up than to disaggregate. A fact table must have a consistent grain, though different fact tables can have different grains, e.g. we can have a disposal fact table and a cases fact table.
+
+## Process of creating a Dimensional Model
+
+The following is the process that the Data Modelling team takes when creating dimensional models and derived tables for different source systems. The majority of these steps will, however, also be useful for the creation of derived tables and dimensional models in general.
+
+Dimensional Model Process
+
+1. In conjunction with the analytical user community (AUC), we gather analytical requirements through interviews and workshops. We document our findings from these, including lists of extracts, reports and fields; data definitions; metric / KPI definitions; and existing data transformations being applied to raw / source data, as well as quality expectations. We attempt to capture how users and analysts describe the entities, business processes, dimensions, facts and granularity associated with the system(s) in question.
+
+1. At the same time as the step above, we develop an understanding of who the system owners of the source database are and develop relationships with them.
+
+1. We work with these data suppliers / operational users to understand the system structure, operational usage and business processes. We also attempt to see if they have any useful materials such as training guides. We make sure to document these and any limitations or caveats, of both used and unused fields, which we do via the data discovery tool and/or dbt docs. Understanding the structure of the database may require the use of data modelling software and/or the creation of DDL scripts to reverse-engineer / represent relationships between operational tables. Then, in conjunction with capturing how system users and managers describe the business processes and entities (in terms of dimensions, facts and granularity) that the system(s) underpin, and the knowledge and requirements gathered, we identify relevant core concepts as the basis of a simple, effective dimensional data structure that is recognisable and intuitive for stakeholders.
+
+1. We then develop a conceptual star-schema-based view of the domain which de-normalises the operational database / source data. This view is based around business processes, facts (additive; semi-additive; non-additive), and hierarchical dimensions at the atomic (lowest) level of granularity. We then document and present this content to all stakeholders via the bus matrix and basic diagrams. As well as de-normalising operational database content into facts and dimensions, transformations / designs may require a) extraction of structured data from document-oriented storage such as JSON or XML; or b) modelling / extraction / separation of entities from flat files into a more normalised dimensional structure.
+
+1. We then test and agree the conceptual design with all relevant stakeholders.
+
+1. A big part of what the Data Modelling team does is to ensure uniformity of data across the MoJ where possible. As such, we cross reference our new design with broader content via the bus matrix and identify additional steps needed to conform content across domains. This enterprise-level perspective facilitates integration / joins between concepts from disparate data sources to underpin exploratory analysis and insight discovery.
+
+1. To develop the model we then break down the conceptual design into high-level tasks for Jira / augment existing dimension development tasks where necessary. We determine the data source(s), structure and slowly changing dimension approach for each dimension.
+
+1. For each atomic / lowest-level concept, we develop logical content / metadata and translate this into physical content using dbt code. Content is developed through a series of staging, intermediate, fact and dimension tables, building sequentially upon earlier models using dbt functionality and materialised in DEV Athena databases.
+
+1. We provide regular demonstrations to the user and governance communities to explore issues / conflicts and to refine content. Ideally this would be done through a BI tool, which would require its design, development, deployment and maintenance.
+
+1. Throughout development we seek peer review of SQL / dbt code to ensure an appropriate and meaningful level of decomposition.
+
+1. We also add to the metadata repo that is used to document fields and variables, and which has the aim of being tool agnostic so that it can be adapted for any documentation tool that we may use in the future.
+
+1. As the models become more mature we develop tests that are applied to each transformation / load. In the future we would like to have data quality dashboards that include, for example, record counts across key dimensions such as time, geography and relevant categories to identify levels of missingness and value distributions. We would also like to adopt machine learning techniques to explore and identify areas of potential interest in the data across relevant variable interactions; to flag areas of interest to users / owners as part of data transformation and load activities; and to support data and service quality improvement efforts.
+
+1. Following on from these data quality checks, we push specific data quality issues to system owners / data stewards to address in source systems; we also collate findings to identify recurring issues where design changes to underlying systems may be required to prevent them from occurring.
+
+1. Once we have the models approved by the AUC, we develop and implement through dbt the update frequency, snapshot requirements, and security / role-based access design as part of preparing to deploy content to PROD.
+
+1. We then deploy atomic / lowest-level concept content to PROD with the agreed update frequency, snapshot creation schedule and data quality content / fix processes.
+
+1. We will work with the relevant AUC to migrate their outputs and processes to use (and further validate) the newly created content.
+Depending on skills / experience / available resources, this step may include the development of extracts / data marts which combine atomic concepts to replicate existing content. This requires exploration and understanding of existing processes, potentially including R, Python, SAS, Airflow, Excel, etc. skills to enable new content to be spliced neatly into production / dissemination pipelines. In any event, we describe and support for key users:
+a. the core development processes and how users can augment base content (with governed review / approval from the data modelling team); and
+b. how users can develop extracts / data marts / OLAP cubes themselves that bring together core concepts / create aggregated metrics and key performance indicators to address specific analytical use cases (including documentation to support re-use).
+
+1. A key part of our role is to present / champion core and integrated content regularly at forums across Data & Analysis to encourage standardisation in data source and metric usage. We also want to support the user community to connect with / make best use of content, and to provide drop-in sessions for creators to connect users with content and/or to provide advice and support on new content creation, where required.
+
+1. Finally, we use these forums, and other regular communication routes, to capture new requirements for further prioritisation and development, starting the cycle over again.
+
+## The benefits of dbt
+
+There are several key benefits of using dbt for creating data models.
+
+* **Modular Data Transformations**: dbt enables users to define data transformations as modular and reusable SQL-based models (a minimal sketch is shown after this list). This is key as it promotes a more organised and maintainable approach to creating models, making it easier to manage complex transformation pipelines.
+
+* **Code Reusability**: With dbt, users can create custom macros and packages, allowing them to reuse common logic, calculations, and transformation patterns across different projects. This is helpful as it leads to consistent and efficient data transformation practices.
+
+* **Version Control**: dbt integrates well with Git, allowing users to track changes to their data transformation code over time. This promotes collaboration, documentation, and the ability to revert to previous versions if needed.
+
+* **Testing and Validation**: A key part of data modelling is ensuring high data quality, and dbt aids with this through its strong testing functionality, which enables users to write tests for their data transformations. These tests ensure data quality, accuracy, and consistency, helping to identify issues early in the transformation process.
+
+* **Automated Documentation**: dbt automatically generates documentation for data models, transformations, and datasets. This documentation is easily accessible and helps teams understand the purpose, logic, and lineage of different components in the data pipelines.
+
+* **Dependency Management**: dbt allows users to define dependencies between different data models, ensuring that transformations are executed in the correct order. This helps manage complex data workflows and avoids issues related to dependencies.
+
+* **Data Lineage**: dbt provides a data lineage graph showing how different datasets and models are connected. This is valuable for understanding the impact of changes, identifying potential bottlenecks, and troubleshooting issues.
+
+* **Community and Ecosystem**: dbt has a strong user community, allowing us to benefit from shared knowledge, best practices, and resources. The ecosystem includes plugins, integrations, and extensions that can enhance our data transformation capabilities.
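+
+As a minimal, illustrative sketch of this modular approach, the hypothetical model below builds a court dimension on top of a staging model. The `ref()` call declares the dependency, so dbt can build models in the right order, draw the lineage graph and include the relationship in the generated documentation. The model names (`dim_court`, `stg_courts`) and columns are assumptions for the purposes of the example rather than real models in the repository.
+
+```sql
+-- models/dimensions/dim_court.sql (hypothetical example)
+-- A dimension model built from a staging model via ref(), keeping the
+-- transformation small, reusable and testable.
+select
+    court_code,
+    court_name,
+    court_type,
+    region
+from {{ ref('stg_courts') }}
+where court_code is not null
+```
+
+Tests such as `unique` and `not_null` on `court_code`, along with column descriptions, would then be declared against this model in its YAML properties file.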
+
diff --git a/source/documentation/tools/create-a-derived-table/index.md b/source/documentation/tools/create-a-derived-table/index.md
index e0afe716..43c45b7f 100644
--- a/source/documentation/tools/create-a-derived-table/index.md
+++ b/source/documentation/tools/create-a-derived-table/index.md
@@ -18,6 +18,7 @@ Create a Derived Table is a tool for creating persistent derived tables in Athen
 
 - [Data Modelling Concepts](/tools/create-a-derived-table/data-modelling-concepts)
 - [Project Structure](/tools/create-a-derived-table/project-structure)
+- [Probate Case Study](/tools/create-a-derived-table/data-modelling-concepts-probate-case-study)
 
 ## Creating Models
diff --git a/source/images/create-a-derived-table/caveat_table.png b/source/images/create-a-derived-table/caveat_table.png
new file mode 100644
index 00000000..ec71abe8
Binary files /dev/null and b/source/images/create-a-derived-table/caveat_table.png differ
diff --git a/source/images/create-a-derived-table/caveate_state.png b/source/images/create-a-derived-table/caveate_state.png
new file mode 100644
index 00000000..6c37c2e4
Binary files /dev/null and b/source/images/create-a-derived-table/caveate_state.png differ
diff --git a/source/images/create-a-derived-table/data_modelling_process.png b/source/images/create-a-derived-table/data_modelling_process.png
new file mode 100644
index 00000000..cacd4fb4
Binary files /dev/null and b/source/images/create-a-derived-table/data_modelling_process.png differ
diff --git a/source/tools/create-a-derived-table/data-modelling-concepts-probate-case-study/index.html.md.erb b/source/tools/create-a-derived-table/data-modelling-concepts-probate-case-study/index.html.md.erb
new file mode 100644
index 00000000..1428b2e2
--- /dev/null
+++ b/source/tools/create-a-derived-table/data-modelling-concepts-probate-case-study/index.html.md.erb
@@ -0,0 +1,11 @@
+---
+title: Data Modelling Concepts Probate Caveat Case Study
+weight: 37
+last_reviewed_on: 2022-09-15
+review_in: 1 year
+show_expiry: true
+owner_slack: "#ask-data-modelling"
+owner_slack_workspace: "mojdt"
+---
+
+<%= partial 'documentation/tools/create-a-derived-table/data-modelling-concepts-probate-case-study' %>