From 5cd87e19d331462f3a915f3899091bda6fd84d37 Mon Sep 17 00:00:00 2001
From: Marcin Rudolf
Date: Fri, 15 Sep 2023 18:36:18 +0200
Subject: [PATCH] removes content that is not a tutorial, how to, reference or dlt relate knowledge from docs

---
 ...evolution-next-generation-data-platform.md | 2 +-
 .../2023-06-10-schema-evolution.md} | 21 ++-
 .../2023-06-15-automating-data-engineers.md | 2 +-
 .../website/docs/build-a-pipeline-tutorial.md | 2 -
 .../docs/dlt-ecosystem/deployments/index.md | 17 ++
 .../orchestrators/airflow-deployment.md | 40 -----
 .../orchestrators/choosing-an-orchestrator.md | 120 --------------
 .../orchestrators/github-actions.md | 31 ----
 .../deployments/running-in-cloud-functions.md | 49 ------
 .../deployments/where-can-dlt-run.md | 56 -------
 docs/website/docs/general-usage/import-dlt.md | 23 ---
 docs/website/docs/general-usage/state.md | 20 ++-
 .../airflow-gcp-cloud-composer.md | 2 +-
 docs/website/docs/reference/performance.md | 36 +++-
 .../docs/user-guides/analytics-engineer.md | 79 ---------
 .../website/docs/user-guides/data-beginner.md | 130 ---------------
 .../website/docs/user-guides/data-engineer.md | 97 -----------
 .../docs/user-guides/data-scientist.md | 129 ---------------
 .../docs/user-guides/engineering-manager.md | 155 ------------------
 .../docs/user-guides/images/colab-demo.png | Bin 99180 -> 0 bytes
 .../docs/user-guides/images/dlt-main.png | Bin 190444 -> 0 bytes
 .../user-guides/images/structured-data.png | Bin 343160 -> 0 bytes
 .../deploy-with-airflow-composer.md | 2 +-
 docs/website/package-lock.json | 4 +-
 docs/website/sidebars.js | 47 +-----
 docs/website/src/css/custom.css | 68 +++-----
 26 files changed, 115 insertions(+), 1017 deletions(-)
 rename docs/website/{docs/reference/explainers/schema-evolution.md => blog/2023-06-10-schema-evolution.md} (88%)
 create mode 100644 docs/website/docs/dlt-ecosystem/deployments/index.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/running-in-cloud-functions.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/where-can-dlt-run.md
 delete mode 100644 docs/website/docs/general-usage/import-dlt.md
 rename docs/website/docs/reference/{ => explainers}/airflow-gcp-cloud-composer.md (95%)
 delete mode 100644 docs/website/docs/user-guides/analytics-engineer.md
 delete mode 100644 docs/website/docs/user-guides/data-beginner.md
 delete mode 100644 docs/website/docs/user-guides/data-engineer.md
 delete mode 100644 docs/website/docs/user-guides/data-scientist.md
 delete mode 100644 docs/website/docs/user-guides/engineering-manager.md
 delete mode 100644 docs/website/docs/user-guides/images/colab-demo.png
 delete mode 100644 docs/website/docs/user-guides/images/dlt-main.png
 delete mode 100644 docs/website/docs/user-guides/images/structured-data.png

diff --git a/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md b/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
index 09a558555a..d47b9a5b72 100644
--- a/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
+++ b/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
@@ -108,5 +108,5 @@ To try out schema evolution with `dlt`, check out our [colab demo.](https://cola
 ### Want more?
 
 - Join our [Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
-- Read our [docs on implementing schema evolution](https://dlthub.com/docs/reference/explainers/schema-evolution)
+- Read our [schema evolution blog post](https://dlthub.com/docs/blog/schema-evolution)
 - Stay tuned for the next article in the series: *How to do schema evolution with* `dlt` *in the most effective way*
\ No newline at end of file
diff --git a/docs/website/docs/reference/explainers/schema-evolution.md b/docs/website/blog/2023-06-10-schema-evolution.md
similarity index 88%
rename from docs/website/docs/reference/explainers/schema-evolution.md
rename to docs/website/blog/2023-06-10-schema-evolution.md
index f0a0e9e222..2347e95bfd 100644
--- a/docs/website/docs/reference/explainers/schema-evolution.md
+++ b/docs/website/blog/2023-06-10-schema-evolution.md
@@ -1,7 +1,12 @@
 ---
-title: Schema evolution
-description: Schema evolution with dlt
-keywords: [schema evolution, schema versioning, data contracts]
+slug: schema-evolution
+title: "Schema Evolution"
+authors:
+  name: Adrian Brudaru
+  title: Schema Evolution
+  url: https://github.com/adrianbr
+  image_url: https://avatars.githubusercontent.com/u/5762770?v=4
+tags: [data engineer shortage, structured data, schema evolution]
 ---
 
 # Schema evolution
@@ -131,10 +136,10 @@ business-logic tests, you would still need to implement them in a custom way.
 ## The implementation recipe
 
 1. Use `dlt`. It will automatically infer and version schemas, so you can simply check if there are
-   changes. You can just use the [normaliser + loader](../../general-usage/pipeline.md) or
-   [build extraction with dlt](../../general-usage/resource.md). If you want to define additional
-   constraints, you can do so in the [schema](../../general-usage/schema.md).
-1. [Define your slack hook](../../running-in-production/running.md#using-slack-to-send-messages) or
+   changes. You can just use the [normaliser + loader](https://dlthub.com/docs/general-usage/pipeline.md) or
+   [build extraction with dlt](https://dlthub.com/docs/general-usage/resource.md). If you want to define additional
+   constraints, you can do so in the [schema](https://dlthub.com/docs/general-usage/schema.md).
+1. [Define your slack hook](https://dlthub.com/docs/running-in-production/running.md#using-slack-to-send-messages) or
    create your own notification function. Make sure the slack channel contains the data producer and
    any stakeholders.
-1. [Capture the load job info and send it to the hook](../../running-in-production/running#inspect-save-and-alert-on-schema-changes).
+1. [Capture the load job info and send it to the hook](https://dlthub.com/docs/running-in-production/running#inspect-save-and-alert-on-schema-changes).
diff --git a/docs/website/blog/2023-06-15-automating-data-engineers.md b/docs/website/blog/2023-06-15-automating-data-engineers.md
index 6228a067ab..eef291cdbe 100644
--- a/docs/website/blog/2023-06-15-automating-data-engineers.md
+++ b/docs/website/blog/2023-06-15-automating-data-engineers.md
@@ -122,5 +122,5 @@ Not only that, but doing things this way lets your team focus on what they do be
 2. Notify stakeholders and producers of data changes, so they can curate it.
 3. Don’t explore json with data engineers - let analyst explore structured data.
 
-Ready to stop the pain? Read [this explainer on how to do schema evolution with dlt](/docs/reference/explainers/schema-evolution).
+Ready to stop the pain? Read [this explainer on how to do schema evolution with dlt](https://dlthub.com/docs/blog/schema-evolution).
 Want to discuss? Join our [slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1n5193dbq-rCBmJ6p~ckpSFK4hCF2dYA).
\ No newline at end of file
diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md
index 14c3a78411..078ef5999e 100644
--- a/docs/website/docs/build-a-pipeline-tutorial.md
+++ b/docs/website/docs/build-a-pipeline-tutorial.md
@@ -413,8 +413,6 @@ These governance features in `dlt` pipelines contribute to better data managemen
 compliance adherence, and overall data governance, promoting data consistency, traceability, and
 control throughout the data processing lifecycle.
 
-Read more about [schema evolution.](reference/explainers/schema-evolution.md)
-
 ### Scaling and finetuning
 
 `dlt` offers several mechanism and configuration options to scale up and finetune pipelines:
diff --git a/docs/website/docs/dlt-ecosystem/deployments/index.md b/docs/website/docs/dlt-ecosystem/deployments/index.md
new file mode 100644
index 0000000000..62157a7e18
--- /dev/null
+++ b/docs/website/docs/dlt-ecosystem/deployments/index.md
@@ -0,0 +1,17 @@
+---
+title: Deployments
+description: dlt can run on almost any python environment and hardware
+keywords: [dlt, running, environment]
+---
+import DocCardList from '@theme/DocCardList';
+
+# Deployments
+`dlt` runs wherever Python runs. Below you see walkthroughs for Airflow, GCP Cloud Functions, AWS Lambda and GitHub Actions but our users run us on
+* in any notebook environment including Colab
+* any orchestrator including Kestra, Prefect or Dagster
+* local laptops with `duckdb` or `weaviate` as destinations
+* Github codespaces and other devcontainers
+* regular VMs from all major providers: AWS, GCP or Azure
+* in containers via Docker, docker-compose and Kubernetes
+
+
diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
deleted file mode 100644
index 15f00731e2..0000000000
--- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
+++ /dev/null
@@ -1,40 +0,0 @@
----
-title: Airflow with Cloud Composer
-description: How to run dlt pipeline with Airflow
-keywords: [dlt, webhook, serverless, airflow, gcp, cloud composer]
----
-
-# Deployment with Airflow and Google Cloud Composer
-
-[Airflow](https://airflow.apache.org) is like your personal assistant for managing data workflows.
-It's a cool open-source platform that lets you create and schedule complex data pipelines. You can
-break down your tasks into smaller chunks, set dependencies between them, and keep an eye on how
-everything's running.
-
-[Google Cloud Composer](https://cloud.google.com/composer) is a Google Cloud managed Airflow, which
-allows you to use Airflow without having to deploy it. It costs the same as you would run your own,
-except all the kinks and inefficiencies have mostly been ironed out. The latest version they offer
-features autoscaling which helps reduce cost further by shutting down unused workers.
-
-Combining Airflow, `dlt`, and Google Cloud Composer is a game-changer. You can supercharge your data
-pipelines by leveraging Airflow's workflow management features, enhancing them with `dlt`'s
-specialized templates, and enjoying the scalability and reliability of Google Cloud Composer's
-managed environment. It's the ultimate combo for handling data integration, transformation, and
-loading tasks like a pro.
-
-`dlt` makes it super convenient to deploy your data load script and integrate it seamlessly with
-your Airflow workflow in Google Cloud Composer. It's all about simplicity and getting things done
-with just a few keystrokes.
-
-For this easy style of deployment, `dlt` supports
-[the cli command](../../../reference/command-line-interface.md#airflow-composer):
-
-```bash
-dlt deploy {pipeline_script}.py airflow-composer
-```
-
-which generates the necessary code and instructions.
-
-Read our
-[Walkthroughs: Deploy a pipeline with Airflow and Google Composer](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md)
-to find out more.
diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md
deleted file mode 100644
index 76d6389a1e..0000000000
--- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md
+++ /dev/null
@@ -1,120 +0,0 @@
----
-title: Choosing an orchestrator
-description: How to choose an orchestrator to deploy dlt pipeline
-keywords: [orchestrator, airflow, github actions]
----
-
-# Choosing an orchestrator
-
-Orchestrators enable developers to quickly and easily deploy and manage applications in the cloud.
-
-## What is an orchestrator?
-
-An orchestrator is a software system that automates the deployment, scaling, and management of
-applications and services.
-
-It provides a single platform for managing and coordinating the components of distributed
-applications, including containers, microservices, and other cloud-native resources.
-
-## Do I need an orchestrator?
-
-No, but if you do not use one, you will need to consider how to solve the problems one would:
-
-- Monitoring, Alerting;
-- deployment and execution;
-- scheduling complex workflows;
-- task triggers, dependencies, retries;
-- UI for visualising workflows;
-- secret vaults/storage;
-
-So in short, unless you need something very lightweight, you can benefit from an orchestrator.
-
-## So which one?
-
-### **Airflow**
-
-Airflow is a market standard that’s hard to beat. Any shortcomings it might have is more than
-compensated for by being an open source mature product with a large community.
-
-`dlt` supports **airflow deployments**, meaning it’s particularly easy to deploy `dlt` pipelines to
-Airflow via simple commands.
-
-Your Airflow options are:
-
-1. Broadly used managed Airflow vendors:
-
-   - [GCP Cloud Composer](https://cloud.google.com/composer?hl=en) is
-     recommended due to GCP being easy to use for non engineers and Google BigQuery being a popular
-     solution.
-   - [Astronomer.io](http://Astronomer.io) (recommended for non GCP users). CI/CD out of the box
-   - AWS has a managed Airflow too, though it is the hardest to use.
-
-1. Self-managed Airflow:
-
-   - You can self-host and run your own Airflow. This is not recommended unless the team plans to
-     have the skills to work with this in house.
-
-Limitations:
-
-- Airflow manages large scale setups and small alike.
-
-To deploy a pipeline on Airflow with Google Composer, read our
-[step-by-step tutorial](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer) about
-using the `dlt deploy` command.
-
-### **GitHub Actions**
-
-GitHub Actions is not a full-blown orchestrator, but it works. It supports simple workflows and
-scheduling, allowing for a visual, lightweight deployment with web-based monitoring (i.e. you can
-see runs in GitHub Actions). It has a free tier, but its pricing is not convenient for large jobs.
-
-To deploy a pipeline on GitHub Actions, read
-[here](../../../walkthroughs/deploy-a-pipeline/deploy-with-github-actions) about using the
-`dlt deploy` command and
-[here](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration)
-about the limitations of GitHub Actions and how their billing works.
-
-### **Other orchestrators**
-
-Other orchestrators can also do the job, so if you are limited in choice or prefer something else,
-choose differently. If your team prefers a different tool which affects their work positively,
-consider that as well.
-
-What do you need to consider when using other orchestrators?
-
-## Source - Resource decomposition:
-
-You can decompose a pipeline into strongly connected components with
-`source().decompose(strategy="scc")`. The method returns a list of dlt sources each containing a
-single component. Method makes sure that no resource is executed twice.
-
-Serial decomposition:
-
-You can load such sources as tasks serially in order present of the list. Such DAG is safe for
-pipelines that use the state internally.
-[It is used internally by our Airflow mapper to construct DAGs.](https://github.com/dlt-hub/dlt/blob/devel/dlt/helpers/airflow_helper.py)
-
-Custom decomposition:
-
-- When decomposing pipelines into tasks, be mindful of shared state.
-- Dependent resources pass data to each other via hard disk - so they need to run on the same
-  worker. Group them in a task that runs them together.
-- State is per-pipeline. The pipeline identifier is the pipeline name. A single pipeline state
-  should be accessed serially to avoid losing details on parallel runs.
-
-Parallel decomposition:
-
-If you are using only the resource state (which most of the pipelines really should!) you can run
-your tasks in parallel.
-
-- Perform the `scc` decomposition.
-- Run each component in a pipeline with different but deterministic `pipeline_name` (same component
-  \- same pipeline, you can use names of selected resources in source to construct unique id).
-
-Each pipeline will have its private state in the destination and there won't be any clashes. As all
-the components write to the same schema you may observe a that loader stage is attempting to migrate
-the schema, that should be a problem though as long as your data does not create variant columns.
-
-## Credentials
-
-[See credentials section for passing credentials to or from dlt.](../../../general-usage/credentials.md)
diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md
deleted file mode 100644
index 8a5d693117..0000000000
--- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md
+++ /dev/null
@@ -1,31 +0,0 @@
----
-title: GitHub Actions
-description: How to run dlt in GitHub Actions
-keywords: [dlt, webhook, serverless, deploy, github actions]
----
-
-# Native deployment to GitHub Actions
-
-What is a native deployment to dlt? A native deployment is a deployment where dlt will generate the
-glue code and credentials instructions.
-
-[GitHub Actions](https://docs.github.com/en/actions) is an automation tool by GitHub for building,
-testing, and deploying software projects. It simplifies development with pre-built actions and
-custom workflows, enabling tasks like testing, documentation generation, and event-based triggers.
-It streamlines workflows and saves time for developers.
-
-When `dlt` and GitHub Actions join forces, data loading and software development become a breeze.
-Say goodbye to manual work and enjoy seamless data management.
-
-For this easy style of deployment, dlt supports the
-[cli command](../../../reference/command-line-interface#github-action):
-
-```shell
-dlt deploy