From 36821a0ae43a7f00acd39c8b2d43cdd702e21e2b Mon Sep 17 00:00:00 2001
From: Marcin Rudolf
Date: Sun, 17 Sep 2023 16:11:51 +0200
Subject: [PATCH] post merge fixes

---
 .../website/docs/user-guides/data-beginner.md | 130 ---------------
 .../docs/user-guides/data-scientist.md        | 129 ---------------
 .../docs/user-guides/engineering-manager.md   | 155 ------------------
 3 files changed, 414 deletions(-)
 delete mode 100644 docs/website/docs/user-guides/data-beginner.md
 delete mode 100644 docs/website/docs/user-guides/data-scientist.md
 delete mode 100644 docs/website/docs/user-guides/engineering-manager.md

diff --git a/docs/website/docs/user-guides/data-beginner.md b/docs/website/docs/user-guides/data-beginner.md
deleted file mode 100644
index e6dd8b8d22..0000000000
--- a/docs/website/docs/user-guides/data-beginner.md
+++ /dev/null
@@ -1,130 +0,0 @@
----
-title: Data Beginner
-description: A guide to using dlt for aspiring data professionals
-keywords: [beginner, analytics, machine learning]
----
-
-# Data Beginner
-
-If you are an aspiring data professional, here are some ways you can showcase your understanding and value to data teams with the help of `dlt`.
-
-## Analytics: Empowering decision-makers
-
-Operational users at a company need general business analytics capabilities to make decisions, e.g. dashboards, a data warehouse, self-service, etc.
-
-### Show you can deliver results, not numbers
-
-The goal of such a project is to get you into the top 5% of candidates, so you get invited to an interview, and to help you understand pragmatically what is expected of you.
-
-Depending on whether you want to be more in engineering or analytics, you can focus on different parts of this project. If you showcase that you are able to deliver end to end, there remains little reason for a potential employer not to hire you.
-
-Someone hiring folks on this business analytics path will be looking for the following skills:
-
-- Can you load data to a db?
-  - Can you do incremental loading?
-  - Are your pipelines maintainable?
-  - Are your pipelines reusable? Do they take meaningful arguments?
-- Can you transform the data to a standard architecture?
-  - Do you know dimensional modelling architecture?
-  - Does your model make the data accessible to a business user via a user-facing tool?
-  - Can you translate a business requirement into a technical requirement?
-- Can you identify a use case and prepare reporting?
-  - Are you displaying a sensible use case?
-  - Are you taking a pragmatic approach as to what should be displayed and why?
-  - Did you hard-code charts in a notebook that the end user cannot use, or did you use a user-facing dashboard tool?
-  - Is the user able to answer follow-up questions by changing the dimensions in a tool, or did you hard-code queries?
-
-Project idea:
-
-1. Choose an API that produces data. If this data is in some way business-relevant, even better. Many business apps offer free developer accounts that allow you to develop business apps with them.
-1. Choose a use case for this data. Make sure this use case makes some business sense and is not completely theoretical. Business understanding and pragmatism are key for such roles, so do not waste your chance to show them. Keep the use case simple; otherwise it will not be pragmatic right off the bat, and you handicap your chances of a good outcome. A few examples are ranking leads in a sales CRM, clustering users, and something around customer lifetime value predictions.
-1. Build a dlt pipeline that loads data from the API for your use case (a minimal sketch follows this list). Keep the case simple and your code clean. Use explicit variable and method names. Tell a story with your code. For the loading mode, use incremental loading and don’t hardcode parameters that are subject to change.
-1. Build a [dbt package](../dlt-ecosystem/transformations/dbt.md) for this pipeline.
-1. Build a visualization. Focus on usability more than code. Remember, your goal is to empower a business user to self-serve, so hard-coded dashboards are usually seen as liabilities that need to be maintained. On the other hand, dashboard tools can be adjusted by business users too. For example, the free “Looker Studio” from Google is relatable to business users, while notebooks might make them feel insecure. Your evaluator will likely not take time to set up and run your things, so make sure your outcomes are well documented with images. Make sure they are self-explanatory, and explain how you intend the business user to use this visualization to fulfil the use case.
-1. Make it presentable somewhere public, such as GitHub, and add docs. Show it to someone for feedback. You will find like-minded people in [our Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g) that will happily give their opinion.
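-
-To make the pipeline step concrete, here is a minimal sketch of what such a pipeline could look like. The endpoint, table name, and `updated_at` cursor field are illustrative placeholders; the pattern to note is a `dlt.resource` combined with `dlt.sources.incremental`, so that only new or changed records are fetched on each run:
-
-```python
-import dlt
-import requests
-
-@dlt.resource(table_name="issues", write_disposition="append")
-def issues(updated_at=dlt.sources.incremental("updated_at", initial_value="2023-01-01T00:00:00Z")):
-    # hypothetical endpoint; replace with the API you chose for your use case
-    response = requests.get("https://api.example.com/issues", params={"since": updated_at.last_value})
-    response.raise_for_status()
-    yield response.json()
-
-pipeline = dlt.pipeline(pipeline_name="showcase", destination="duckdb", dataset_name="raw")
-load_info = pipeline.run(issues())
-print(load_info)
-```
-
-Run the script repeatedly and only records newer than the last seen `updated_at` value are loaded, which is exactly the incremental behaviour evaluators look for.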
-
-## Machine Learning: Automating decisions
-
-Solving specific business problems with data products that generate further insights and sometimes automate decisions.
-
-### Show you can solve business problems
-
-Here the challenges might seem different from the business analytics path, but they are often quite similar. Many courses focus on statistics and data science, but very few focus on pragmatic approaches to solving business problems in organizations. Most of the time, the largest obstacles to solving a problem with ML are not purely algorithmic but rather about the semantics of the business, the data, and the people who need to use the data products.
-
-Employers look for a project that showcases both technical ability and business pragmatism in a use case. In reality, data does not typically come in files but via APIs serving fresh data, which you usually have to grab and move somewhere before you can use it, so show your ability to deliver end to end.
-
-Project idea:
-
-1. Choose an API that produces data. If this data is in some way business-relevant, even better. Many business apps offer free developer accounts that allow you to develop business apps with them.
-1. Choose a use case for this data. Make sure this use case makes some business sense and is not completely theoretical. Business understanding and pragmatism are key for such roles, so do not waste your chance to show them. Keep the use case simple; otherwise it will not be pragmatic right off the bat, and you handicap your chances of a good outcome. A few examples are ranking leads in a sales CRM, clustering users, and something around customer lifetime value predictions.
-1. Build a dlt pipeline that loads data from the API for your use case. Keep the case simple and your code clean. Use explicit variable and method names. Tell a story with your code. For the loading mode, use incremental loading and don’t hardcode parameters that are subject to change.
-1. Build a data model with SQL (see the sketch after this list). If you are ambitious, you could try running the SQL with a [dbt package](../dlt-ecosystem/transformations).
-1. Showcase your chosen use case that uses ML or statistics to achieve your goal. Don’t forget to mention how you plan to do this “in production”. Choose a case that is simple so you don’t end up overcomplicating your solution. Focus on outcomes and next steps. Describe what the company needs to do to use your results, demonstrating that you understand the costs of your propositions.
-1. Make it presentable somewhere public, such as GitHub, and add docs. Show it to someone for feedback. You will find like-minded people in [our Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g) that will happily give their opinion.
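-
-To illustrate the SQL modelling step, here is a minimal sketch that derives a table on top of the loaded data using the pipeline’s SQL client. The table and column names are made up; `sql_client()` and `execute_sql` are the assumed helpers for issuing SQL against the destination, and a dbt package is the more maintainable option once the model grows:
-
-```python
-import dlt
-
-pipeline = dlt.pipeline(pipeline_name="showcase", destination="duckdb", dataset_name="raw")
-
-# hypothetical model: aggregate the loaded `issues` table per author
-with pipeline.sql_client() as client:
-    client.execute_sql("""
-        CREATE OR REPLACE TABLE issues_per_author AS
-        SELECT author, COUNT(*) AS issue_count
-        FROM issues
-        GROUP BY author
-    """)
-```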
-
-## Further reading
-
-Good docs pages to check out:
-
-- [Getting started.](../getting-started)
-- [Create a pipeline.](../walkthroughs/create-a-pipeline)
-- [Run a pipeline.](../walkthroughs/run-a-pipeline)
-- [Deploy a pipeline with GitHub Actions.](../walkthroughs/deploy-a-pipeline/deploy-with-github-actions)
-- [Understand the loaded data.](../general-usage/destination-tables.md)
-- [Explore the loaded data in Streamlit.](../dlt-ecosystem/visualizations/exploring-the-data.md)
-- [Transform the data with SQL or python.](../dlt-ecosystem/transformations)
-- [Contribute a pipeline.](https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md)
-
-Here are some example projects:
-
-- [Is DuckDB a database for ducks? Using DuckDB to explore the DuckDB open source community.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing)
-- [Using DuckDB to explore the Rasa open source community.](https://colab.research.google.com/drive/1c9HrNwRi8H36ScSn47m3rDqwj5O0obMk?usp=sharing)
-- [MRR and churn calculations on Stripe data.](../dlt-ecosystem/verified-sources/stripe.md)
-
-Please [open a PR](https://github.com/dlt-hub/verified-sources) to add projects that use `dlt` here!
diff --git a/docs/website/docs/user-guides/data-scientist.md b/docs/website/docs/user-guides/data-scientist.md
deleted file mode 100644
index b8415937e4..0000000000
--- a/docs/website/docs/user-guides/data-scientist.md
+++ /dev/null
@@ -1,129 +0,0 @@
----
-title: Data Scientist
-description: A guide to using dlt for Data Scientists
-keywords: [data scientist, data science, machine learning, machine learning engineer]
----
-
-# Data Scientist
-
-Data Load Tool (`dlt`) can be highly useful for Data Scientists in several ways. Here are three potential use cases:
-
-## Use case #1: Efficient Data Ingestion and Optimized Workflow
-
-Data Scientists often deal with large volumes of data from various sources. `dlt` can help streamline the process of data ingestion by providing a robust and scalable tool for loading data into their analytics environment. It can handle diverse data formats, such as CSV, JSON, or database dumps, and efficiently load them into a data lake or a data warehouse.
-
-![dlt-main](images/dlt-main.png)
-
-By using `dlt`, Data Scientists can save time and effort on data extraction and transformation tasks, allowing them to focus more on data analysis and model training. The tool is designed as a library that can be added to their code, making it easy to integrate into existing workflows.
-
-`dlt` can facilitate a seamless transition from data exploration to production deployment. Data Scientists can leverage `dlt` capabilities to load data in the format that matches the production environment while exploring and analyzing the data. This streamlines the process of moving from the exploration phase to the actual implementation of models, saving time and effort. By using `dlt` throughout the workflow, Data Scientists can ensure that the data is properly prepared and aligned with the production environment, leading to smoother integration and deployment of their models.
-
-- [Use existing Verified Sources](../walkthroughs/add-a-verified-source) and pipeline examples or [create your own](../walkthroughs/create-a-pipeline) quickly.
-
-- [Deploy the pipeline](../walkthroughs/deploy-a-pipeline), so that the data is automatically loaded on a schedule.
-
-- Transform the [loaded data](../dlt-ecosystem/transformations) with dbt or in Pandas DataFrames.
-
-- Learn how to [run](../running-in-production/running), [monitor](../running-in-production/monitoring), and [alert](../running-in-production/alerting) when you put your pipeline in production.
-
-- Use `dlt` when doing exploration in a Jupyter Notebook and move more easily to production. Explore our [Colab Demo for Chess.com API](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) to see how easy it is to create and use `dlt` in your projects:
-
-  ![colab-demo](images/colab-demo.png)
-
-### `dlt` is optimized for local use on laptops
-
-- It offers a seamless [integration with Streamlit](../dlt-ecosystem/visualizations/exploring-the-data.md). This integration enables a smooth and interactive data analysis experience, where Data Scientists can leverage the power of `dlt` alongside Streamlit's intuitive interface and visualization capabilities.
-- In addition to Streamlit, `dlt` natively supports [DuckDB](https://dlthub.com/docs/blog/is-duckdb-a-database-for-ducks), an in-process SQL OLAP database management system. This native support ensures efficient data processing and querying within `dlt`, leveraging the capabilities of DuckDB. By integrating DuckDB, Data Scientists can benefit from fast and scalable data operations, enhancing the overall performance of their analytical workflows.
-- Moreover, `dlt` provides resources that can directly return data in the form of [Pandas DataFrames from an SQL client](../dlt-ecosystem/visualizations/exploring-the-data). This feature simplifies data retrieval and allows Data Scientists to seamlessly work with data in the familiar Pandas DataFrame format. With this capability, Data Scientists can leverage the rich ecosystem of Python libraries and tools that support Pandas (see the sketch after this list).
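-
-For example, here is a minimal sketch of pulling loaded data into a DataFrame. The pipeline, dataset, and table names are illustrative; `execute_query` with the cursor's `df()` helper is the interface assumed from the exploring-the-data docs:
-
-```python
-import dlt
-
-pipeline = dlt.pipeline(pipeline_name="chess_pipeline", destination="duckdb", dataset_name="chess_data")
-
-with pipeline.sql_client() as client:
-    # illustrative query against a previously loaded table
-    with client.execute_query("SELECT * FROM players_games LIMIT 1000") as cursor:
-        df = cursor.df()
-
-print(df.describe())
-```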
-
-With `dlt`, the transition from local storage to remote is quick and easy. For example, read the documentation [Share a dataset: DuckDB -> BigQuery](../walkthroughs/share-a-dataset).
-
-## Use case #2: Structured Data and Enhanced Data Understanding
-
-### Structured data
-
-Data Scientists often prefer structured data lakes over unstructured ones to facilitate efficient data analysis and modeling. `dlt` can help in this regard by offering seamless integration with structured data storage systems, allowing Data Scientists to easily load and organize their data in a structured format. This enables them to access and analyze the data more effectively, improving their understanding of the underlying data structure.
-
-![structured-data](images/structured-data.png)
-
-A `dlt` pipeline is made of a source, which contains resources, and a connection to the destination, which we call the pipeline. So in the simplest use case, you can pass your unstructured data to the `pipeline` and it will automatically be structured at the destination, as sketched below. See how to do that in our [pipeline documentation](../general-usage/pipeline).
-
-Besides sturdiness, this also adds convenience by automatically converting JSON types to database types, such as timestamps, etc.
-
-Read more about schema evolution in our blog: **[The structured data lake: How schema evolution enables the next generation of data platforms](https://dlthub.com/docs/blog/next-generation-data-platform).**
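-
-A minimal sketch of that idea (the records and table name are made up for illustration): passing nested Python dicts to `pipeline.run()` lets `dlt` infer a schema, create typed columns, and unpack nested lists into child tables:
-
-```python
-import dlt
-
-# made-up, semi-structured records as they might come from an API
-users = [
-    {"id": 1, "name": "Ada", "signed_up": "2023-05-01T10:00:00Z", "orders": [{"order_id": "a-1", "amount": 12.5}]},
-    {"id": 2, "name": "Grace", "signed_up": "2023-06-11T09:30:00Z", "orders": []},
-]
-
-pipeline = dlt.pipeline(pipeline_name="structuring_demo", destination="duckdb", dataset_name="crm")
-load_info = pipeline.run(users, table_name="users")
-print(load_info)
-# dlt creates a typed `users` table and a `users__orders` child table for the nested list
-```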
-
-### Data exploration
-
-Data Scientists require a comprehensive understanding of their data to derive meaningful insights and build accurate models. `dlt` can contribute to this by providing intuitive and user-friendly features for data exploration. It allows Data Scientists to quickly gain insights into their data by visualizing data summaries, statistics, and distributions. With `dlt`, data understanding becomes clearer and more accessible, enabling Data Scientists to make informed decisions throughout the analysis process.
-
-In addition, the schema imposed on the data acts as a technical description of the data, accelerating the discovery process.
-
-See [Destination tables](../general-usage/destination-tables.md) and [Exploring the data](../dlt-ecosystem/visualizations/exploring-the-data) in our documentation.
-
-## Use case #3: Data Preprocessing and Transformation
-
-Data preparation is a crucial step in the data science workflow. `dlt` can facilitate data preprocessing and transformation tasks by providing a range of built-in features. It simplifies various tasks like data cleaning, anonymizing, handling missing values, data type conversion, feature scaling, and feature engineering. Data Scientists can leverage these capabilities to clean and transform their datasets efficiently, making them suitable for subsequent analysis and modeling.
-
-Python-first users can heavily customize how `dlt` sources produce data, as `dlt` supports selecting, [filtering](../general-usage/resource#filter-transform-and-pivot-data), [renaming](../general-usage/customising-pipelines/renaming_columns), [anonymizing](../general-usage/customising-pipelines/pseudonymizing_columns), and just about any custom operation.
-
-Compliance is another case where preprocessing is the way to solve the issue: besides being Python-friendly, the ability to apply transformation logic before loading data allows us to separate, filter, or transform sensitive data.
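-
-As a minimal sketch of such preprocessing (the resource, column names, and hashing choice are illustrative; `add_map` and `add_filter` are the assumed hooks for transforming items before they are loaded):
-
-```python
-import dlt
-import hashlib
-
-@dlt.resource(table_name="users")
-def users():
-    # made-up records standing in for an API response
-    yield from [
-        {"id": 1, "email": "ada@example.com", "country": "DE"},
-        {"id": 2, "email": "grace@example.com", "country": "US"},
-    ]
-
-def pseudonymize(item):
-    # replace the sensitive column with a stable hash before loading
-    item["email"] = hashlib.sha256(item["email"].encode()).hexdigest()
-    return item
-
-pipeline = dlt.pipeline(pipeline_name="compliance_demo", destination="duckdb", dataset_name="staging")
-pipeline.run(users().add_map(pseudonymize).add_filter(lambda item: item["country"] == "DE"))
-```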
diff --git a/docs/website/docs/user-guides/engineering-manager.md b/docs/website/docs/user-guides/engineering-manager.md
deleted file mode 100644
index 70e23eb2c1..0000000000
--- a/docs/website/docs/user-guides/engineering-manager.md
+++ /dev/null
@@ -1,155 +0,0 @@
----
-title: Staff Data Engineer
-description: A guide to using dlt for Staff Data Engineers
-keywords: [staff data engineer, senior data engineer, ETL engineer, head of data platform, data platform engineer]
----
-
-# Staff Data Engineer
-
-Staff data engineers create data pipelines, data warehouses, and data lakes in order to democratize access to data in their organizations.
-
-With `dlt` we offer a library and building blocks that data tool builders can use to create modern data infrastructure for their companies. Staff Data Engineer, Senior Data Engineer, ETL Engineer, Head of Data Platform - data tool builders go by a variety of titles in different companies.
-
-## What does this role do in an organisation?
-
-The responsibilities of this senior role vary, but often revolve around building and maintaining a robust data infrastructure:
-
-- Tech: They design and implement scalable data architectures, data pipelines, and data processing frameworks.
-- Governance: They ensure data integrity, reliability, and security across the data stack. They manage data governance, including data quality, data privacy, and regulatory compliance.
-- Strategy: Additionally, they evaluate and adopt new technologies, tools, and methodologies to improve the efficiency, performance, and scalability of data processes.
-- Team skills and staffing: Their responsibilities also involve providing technical leadership, mentoring team members, driving innovation, and aligning the data strategy with the organization's overall goals.
-- Return on investment focus: Ultimately, their focus is on empowering the organization to derive actionable insights, make data-driven decisions, and unlock the full potential of their data assets.
-
-## Choosing a Data Stack
-
-These roles are critical in choosing the right data stack for their organization. When selecting a data stack, they need to consider several factors. These include:
-
-- The organization's data requirements.
-- Scalability, performance, data governance and security needs.
-- Integration capabilities with existing systems and tools.
-- Team skill sets, budget, and long-term strategic goals.
-
-They evaluate the pros and cons of various technologies, frameworks, and platforms, considering factors such as ease of use, community support, vendor reliability, and compatibility with their specific use cases. The goal is to choose a data stack that aligns with the organization's needs, enables efficient data processing and analysis, promotes data governance and security, and empowers teams to deliver valuable insights and solutions.
-
-## What does a senior architect or engineer consider when choosing a tech stack?
-
-- Company Goals and Strategy.
-- Cost and Return on Investment (ROI).
-- Staffing and Skills.
-- Employee Happiness and Productivity.
-- Maintainability and Long-term Support.
-- Integration with Existing Systems.
-- Scalability and Performance.
-- Data Security and Compliance.
-- Vendor Reliability and Ecosystem.
-
-## What makes dlt a must-have for your data stack or platform?
-
-For starters, `dlt` is the first data pipeline solution that is built for your data team's ROI. Our vision is to add value, not gatekeep it.
-
-By being a library built to enable free usage, we are uniquely positioned to run in existing stacks without replacing them. This enables us to disrupt and revolutionise the industry in ways that only open source communities can.
-
-## dlt massively reduces pipeline maintenance, increases efficiency and ROI
-
-- Reduce engineering effort by as much as 5x via a paradigm shift: structure data automatically instead of doing it manually. Read about the [structured data lake](https://dlthub.com/docs/blog/next-generation-data-platform), and [how to do schema evolution](../reference/explainers/schema-evolution.md).
-- Better Collaboration and Communication: Structured data promotes better collaboration and communication among team members. Since everyone operates on a shared understanding of the data structure, it becomes easier to discuss and align on data-related topics. Queries, reports, and analysis can be easily shared and understood by others, enhancing collaboration and teamwork.
-- Faster time to build pipelines: After extracting data, if you pass it to `dlt`, you are done. If not, it needs to be structured. Because structuring is hard, it has to be curated: curation involves at least the producer and the consumer, but often also an analyst and the engineer, and is a long, friction-filled process.
-- Usage focus improves ROI: To use data, we need to understand what it is. Structured data already contains a technical description, accelerating usage.
-- Lower cost: Reading structured data is cheaper and faster because we can specify which parts of a document we want to read.
-- Removing friction: By alerting the producer and stakeholders to schema changes, and by automating structuring, we can keep the data engineer out of curation and remove the bottleneck. [Notify maintenance events](../running-in-production/running#inspect-save-and-alert-on-schema-changes) (see the sketch after this list).
-- Improving quality: No more garbage in, garbage out. Because `dlt` structures data and alerts on schema changes, we can have better governance.
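-
-A minimal sketch of that alerting idea, assuming the `load_packages` / `schema_update` attributes described in the running-in-production guide (exact names may differ between `dlt` versions); the print call stands in for whatever notification channel you use:
-
-```python
-import dlt
-
-pipeline = dlt.pipeline(pipeline_name="schema_alerts_demo", destination="duckdb", dataset_name="raw")
-# illustrative data; in production this would be your source
-load_info = pipeline.run([{"id": 1, "status": "open", "priority": 3}], table_name="tickets")
-
-# collect human-readable messages about new tables and columns, e.g. to post to Slack
-for package in load_info.load_packages:
-    for table_name, table in package.schema_update.items():
-        for column_name, column in table["columns"].items():
-            print(f"table {table_name}: column {column_name} ({column['data_type']}) was added")
-```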
-
-## dlt makes your team happy
-
-- Spend more time using data, less time loading it. When you build a `dlt` pipeline, you only build the extraction part, automating the tedious structuring and loading.
-- Data meshing to reduce friction: By structuring data before loading, the engineer is no longer involved in curation. This makes both the engineer and the others happy.
-- Better governance with end to end pipelining via dbt: [run dbt packages on the fly](../dlt-ecosystem/transformations/dbt.md), [lineage out of the box](../general-usage/destination-tables.md#data-lineage).
-- Zero learning curve: Declarative loading, simple functional programming. By using `dlt`'s declarative, standard approach to loading data, there is no complicated code to maintain, and the analysts can thus maintain the code.
-- Autonomy and Self-service: Customising pipelines is easy, whether you want to plug in an anonymiser, rename things, or curate what you load. [Anonymisers, renamers](../general-usage/customising-pipelines/pseudonymizing_columns.md).
-- Easy discovery and governance: By tracking metadata like data lineage, describing data with schemas, and alerting on changes, we stay on top of the data.
-- Simplified access: Querying structured data can be done by anyone with their tools of choice.
-
-## dlt is a library that you can run in unprecedented places
-
-Before `dlt` existed, all loading tools were built either
-
-- as SaaS (Fivetran, Stitch, etc.);
-- as installed apps with their own orchestrator: Pentaho, Talend, Airbyte;
-- or as abandonware frameworks meant to be unrunnable without help (Singer was released without orchestration and not intended for public use).
-
-`dlt` is the first Python library in this space, which means you can just run it wherever the rest of your Python stack runs, without adding complexity.
-
-- You can run `dlt` in [Airflow](../dlt-ecosystem/deployments/orchestrators/airflow-deployment.md) - this is the first ingestion tool that does this.
-- You can run `dlt` in small spaces like [Cloud Functions](../dlt-ecosystem/deployments/running-in-cloud-functions.md) or [GitHub Actions](../dlt-ecosystem/deployments/orchestrators/github-actions.md) - so you could easily set up webhooks, etc. (see the sketch at the end of this page).
-- You can run `dlt` in your Jupyter Notebook and load data to [DuckDB](../dlt-ecosystem/destinations/duckdb.md).
-- You can run `dlt` on large machines; it will attempt to make the best use of the resources available to it.
-- You can [run `dlt` locally](../walkthroughs/run-a-pipeline.md) just like you run any Python script.
-
-The implications:
-
-- Empowering Data Teams and Collaboration: You can discover or prototype in notebooks, run in cloud functions, and deploy the same scalable, robust code to production. No more friction between roles. [Colab demo.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing#scrollTo=A3NRS0y38alk)
-- Rapid Data Exploration and Prototyping: By running in Colab with DuckDB, you can explore semi-structured data much faster by structuring it with `dlt` and analysing it in SQL. [Schema inference](../general-usage/schema#data-normalizer), [exploring the loaded data](../dlt-ecosystem/visualizations/exploring-the-data.md).
-- No vendor limits: `dlt` is forever free, with no vendor strings attached. We do not create value by creating a pain for you and then solving it; we create value by supporting you beyond that.
-- `dlt` removes complexity: You can use `dlt` in your existing stack with no overheads, no race conditions, and full observability. Other tools add complexity.
-- `dlt` can be leveraged by AI: Because it is a low-complexity library, large language models can produce `dlt` code for your pipelines.
-- Ease of adoption: If you are running Python, you can adopt `dlt`. `dlt` is orchestrator- and destination-agnostic.
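-
-As a minimal sketch of the webhook idea mentioned above: a small HTTP-triggered function that loads each incoming payload with `dlt`. The handler signature assumes a GCP-style cloud function receiving a Flask request; the pipeline, destination, and table names are illustrative:
-
-```python
-import dlt
-
-def load_webhook_event(request):
-    """HTTP-triggered entry point (signature assumed for a GCP-style cloud function)."""
-    event = request.get_json()  # the incoming webhook payload
-    pipeline = dlt.pipeline(pipeline_name="webhooks", destination="bigquery", dataset_name="events")
-    load_info = pipeline.run([event], table_name="webhook_events")
-    return str(load_info)
-```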