diff --git a/docs/website/blog/2023-10-03-data-product-docs.md b/docs/website/blog/2023-10-03-data-product-docs.md
index 9c6fa41c57..229ddee5d6 100644
--- a/docs/website/blog/2023-10-03-data-product-docs.md
+++ b/docs/website/blog/2023-10-03-data-product-docs.md
@@ -13,7 +13,7 @@ tags: [data product, data as a product, data documentation, data product documen

 # Lip service to docs

-We often see people talk about data products or data as a product and they usually tackle the topics of:
+We often see people talk about data products or data as a product, and they usually tackle the topics of:

 - Concepts and how to think about data products
 - Users and producers: Roles, responsibilities and blame around data products
@@ -21,38 +21,45 @@ We often see people talk about data products or data as a product and they usual
 - Code: The code or technology powering the pipelines
 - Infra: the infrastructure data products are run on

-What we do not see is any practical advices or examples of how to implement these products. While the concepts often cover definition of data products as something with a use case, they fail to discuss the importance a user manual, or documentation.
+What we do **not** see is any practical advice or examples of how to implement these products.
+While the concepts often define data products as something with a use case,
+they fail to discuss the importance of a user manual, or documentation.

 # The role of the user manual

 ### So what is a data product?

-A data product is a self-contained piece of data-powered software that serves a single use case. For example, it could be a pipeline that loads Salesforce data to Snowflake, or it could be a ML model hosted behind an api.
+A data product is a self-contained piece of data-powered software that serves a single use case. For example, it could be a pipeline that loads Salesforce data to Snowflake, or it could be an ML model hosted behind an API.

 ### What makes a data pipeline a data product?

-The term product assumes more than just some code. A product is something that you can pick up and use and is thus different from someone’s python spaghetti.
+The term product assumes more than just some code.
+A "quick and dirty" pipeline is what the product world would call a "proof of concept", and it is far from a product.

-For example, a product is:
+![Who the duck wrote this garbage??? Ah nvm… it was me…](/img/parrot-baby.gif)
+> Who the duck wrote this trash??? Ahhhhh it was me :( ...

-- Reusable: The first thing needed here is a **solid documentation** that will enable other users to understand how to use the product
-- Robust: Nothing kills the trust in data faster than bad numbers. To be maintainable, code must be simple, explicit, tested ****and **documented** :)
+To create a product, you need to consider how it will be used and by whom, and then enable that usage by others.

-![Who the duck wrote this garbage??? Ah nvm… it was me…](/img/parrot-baby.gif)
+A product is something that you can pick up and use, and is thus different from someone’s Python spaghetti.
+For example, a product is:
+
+- Reusable: The first thing needed here is **solid documentation** that will enable other users to understand how to use the product
+- Robust: Nothing kills trust in data faster than bad numbers. To be maintainable, code must be simple, explicit, tested, and **documented** :)
 - Secure: Everything from credentials to data should be secure. Depending on their requirements, that could mean keeping data on your side (no 3rd party tools), controlling data access, using SOC2 compliant credential stores, etc.
-- Observable: Is it working? how do you know? you can automate a large part of this question by monitoring volume of data and schema changes, or whatever other important run parameters or changes you might have.
+- Observable: Is it working? How do you know? You can automate a large part of this question by monitoring the volume of data and schema changes, or whatever other important run parameters or changes you might have.
 - Operationizable: Can we use it? do we need a rocket scientist, or can [little Bobby Tables](https://xkcd.com/327/) use it? That will largely depend on docs and the product itself

 ### So what is a data product made of?

 Let’s look at the high level components

-1. Structured data: A data product needs data. The code and data are tightly connected - a ML model or data pipeline cannot be trained or operate without data. Why structured? because our code will expect a structured input, so the data is going to be either explicitly structured upfront (”schema on write”), or structured implicitly on read (”schema on read”).
+1. Structured data: A data product needs data. The code and data are tightly connected: an ML model cannot be trained and a data pipeline cannot operate without data. Why structured? Because our code expects structured input, so the data is either explicitly structured upfront ("schema on write") or structured implicitly on read ("schema on read"). See the sketch after this list.
 2. Code
 3. Docs for usage - Without a user manual, a complex piece of code is next to unusable.
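+
+To make the "schema on write" vs "schema on read" distinction concrete, here is a minimal, illustrative sketch in plain Python (the `Event` class and the field names are made up for the example):
+
+```python
+import json
+from dataclasses import dataclass
+from datetime import datetime
+
+raw = '{"id": "1", "ts": "2023-10-03T12:00:00"}'
+
+# Schema on read: structure stays implicit until usage time, so every
+# consumer re-implements (and may disagree on) the parsing rules.
+record = json.loads(raw)
+user_id = int(record["id"])  # each reader re-interprets the types
+
+# Schema on write: structure is declared once, upfront, so downstream
+# code and the data dictionary can rely on it.
+@dataclass
+class Event:
+    id: int
+    ts: datetime
+
+event = Event(id=int(record["id"]), ts=datetime.fromisoformat(record["ts"]))
+```
+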
-### And where are the docs needed?
+### And which docs are needed?

 We will need top level docs, plus some for each of the parts described above.

@@ -60,21 +67,22 @@ We will need top level docs, plus some for each of the parts described above.
 2. Structured data:
    1. A **data dictionary** enables users to gain literacy on the dataset.
    2. **Maintenance info:** information about the source, schema, tests, responsible person, how to monitor, etc.
-3. Code & Usage manual: This one is harder. You need to convey a lot of information in an effective manner, and depending on who your user is, you need to convey that information in a different format. According to the **[leading brains on the topic of docs](https://www.writethedocs.org/videos/eu/2017/the-four-kinds-of-documentation-and-why-you-need-to-understand-what-they-are-daniele-procida/)**, these are the **4 relevant formats you should consider.** They will enable you to write high quality, comprehensive and understandable docs that cover the user’s intention.
+3. Code & Usage manual: This one is harder. You need to convey a lot of information effectively, and depending on who your user is, you need to convey that information in a different format. According to the **[leading brains on the topic of docs](https://www.writethedocs.org/videos/eu/2017/the-four-kinds-of-documentation-and-why-you-need-to-understand-what-they-are-daniele-procida/)**, these are the **4 relevant formats you should consider**. They will enable you to write high-quality, comprehensive, and understandable docs that cover the user’s intent.
    - learning-oriented tutorials
    - goal-oriented how-to guides
    - understanding-oriented discussions
    - information-oriented reference material

-# dlt pipeline as an example
-Dlt is a library that enables us to build data products. By building with dlt, you benefit from simple code and accessible docs. Here’s how a dlt pipeline documentation could be structured:
+# Some examples from dlt
+
+Dlt is a library that enables us to build data products. By building with dlt, you benefit from simple declarative code and accessible docs for anyone maintaining it later.
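+
+As a taste of that declarative style, here is a minimal sketch of such a pipeline. The resource, pipeline, and dataset names are made up, and we assume Snowflake credentials live in `.dlt/secrets.toml`:
+
+```python
+import dlt
+
+@dlt.resource(name="customers", write_disposition="merge", primary_key="id")
+def customers():
+    # A real pipeline would page through the Salesforce API here.
+    yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
+
+pipeline = dlt.pipeline(
+    pipeline_name="salesforce_to_snowflake",
+    destination="snowflake",
+    dataset_name="salesforce",
+)
+
+# dlt infers and enforces the schema on load and returns load info you can log.
+print(pipeline.run(customers()))
+```
+
+The declarative parts (merge disposition, primary key) double as documentation of how the pipeline behaves.
+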
+Here’s how a dlt pipeline’s documentation could be structured:
+
-- Top level: [intro doc](https://dlthub.com/docs/intro)
+- Top level: Here is our attempt for dlt itself: the [intro doc](https://dlthub.com/docs/intro). You could describe the problem or use case that the pipeline solves.
 - Data dictionary: Schema info belongs to each pipeline and can be found [here](https://dlthub.com/docs/blog/dlt-lineage-support). To get sample values, you could write a query. We plan to enable its generation in the future via a “describe” command.
-- Maintenance info: See here [how to set up schema evolution alerts](https://dlthub.com/docs/blog/schema-evolution#whether-you-are-aware-or-not-you-are-always-getting-structured-data-for-usage). You can also capture load info such as row counts to monitor loaded volume for abnormalities, as described in the post under data dictionary.
-- Code and usage: We are structuring all our [docs](https://dlthub.com/docs/intro) to follow the best practices around the 4 types of docs, generating a comprehensive, recognisable documentation. We also have GPT assistant on docs and we answer questions in slack for conversational help.
+- Maintenance info: See [how to set up schema evolution alerts](https://dlthub.com/docs/blog/schema-evolution#whether-you-are-aware-or-not-you-are-always-getting-structured-data-for-usage). You can also capture load info such as row counts to monitor loaded volume for abnormalities (see the sketch after this list).
+- Code and usage: We are structuring all our [docs](https://dlthub.com/docs/intro) to follow the best practices around the 4 types of docs, generating comprehensive, recognisable documentation. We also have a GPT assistant on the docs, and we answer questions in Slack for conversational help.
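+
+For the row-count monitoring mentioned in the maintenance bullet, here is a minimal sketch, assuming a recent dlt version where the trace of the last run exposes per-table row counts (the destination and names are illustrative):
+
+```python
+import dlt
+
+pipeline = dlt.pipeline(
+    pipeline_name="users_pipeline", destination="duckdb", dataset_name="users"
+)
+pipeline.run([{"id": 1}, {"id": 2}], table_name="users")
+
+# Forward these counts to your alerting to catch abnormal load volumes.
+row_counts = pipeline.last_trace.last_normalize_info.row_counts
+for table, count in row_counts.items():
+    print(f"{table}: {count} rows")
+```

 # In conclusion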