From 24d1f15f2770621ad6b716939f93a8c6e880af77 Mon Sep 17 00:00:00 2001
From: Marcin Rudolf
Date: Fri, 15 Sep 2023 18:36:18 +0200
Subject: [PATCH] removes content that is not a tutorial, how-to, reference, or
 dlt-related knowledge from docs

---
 ...evolution-next-generation-data-platform.md | 2 +-
 .../2023-06-10-schema-evolution.md} | 21 ++-
 .../website/docs/build-a-pipeline-tutorial.md | 2 -
 .../docs/dlt-ecosystem/deployments/index.md | 17 ++
 .../orchestrators/airflow-deployment.md | 40 -----
 .../orchestrators/choosing-an-orchestrator.md | 120 --------------
 .../orchestrators/github-actions.md | 31 ----
 .../deployments/running-in-cloud-functions.md | 49 ------
 .../deployments/where-can-dlt-run.md | 56 -------
 docs/website/docs/general-usage/import-dlt.md | 23 ---
 .../airflow-gcp-cloud-composer.md | 2 +-
 docs/website/docs/reference/performance.md | 34 ++++
 .../docs/user-guides/analytics-engineer.md | 79 ---------
 .../website/docs/user-guides/data-beginner.md | 130 ---------------
 .../website/docs/user-guides/data-engineer.md | 97 -----------
 .../docs/user-guides/data-scientist.md | 129 ---------------
 .../docs/user-guides/engineering-manager.md | 155 ------------------
 .../docs/user-guides/images/colab-demo.png | Bin 99180 -> 0 bytes
 .../docs/user-guides/images/dlt-main.png | Bin 190444 -> 0 bytes
 .../user-guides/images/structured-data.png | Bin 343160 -> 0 bytes
 docs/website/sidebars.js | 47 +-----
 docs/website/src/css/custom.css | 68 +++-----
 22 files changed, 99 insertions(+), 1003 deletions(-)
 rename docs/website/{docs/reference/explainers/schema-evolution.md => blog/2023-06-10-schema-evolution.md} (88%)
 create mode 100644 docs/website/docs/dlt-ecosystem/deployments/index.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/running-in-cloud-functions.md
 delete mode 100644 docs/website/docs/dlt-ecosystem/deployments/where-can-dlt-run.md
 delete mode 100644 docs/website/docs/general-usage/import-dlt.md
 rename docs/website/docs/reference/{ => explainers}/airflow-gcp-cloud-composer.md (95%)
 delete mode 100644 docs/website/docs/user-guides/analytics-engineer.md
 delete mode 100644 docs/website/docs/user-guides/data-beginner.md
 delete mode 100644 docs/website/docs/user-guides/data-engineer.md
 delete mode 100644 docs/website/docs/user-guides/data-scientist.md
 delete mode 100644 docs/website/docs/user-guides/engineering-manager.md
 delete mode 100644 docs/website/docs/user-guides/images/colab-demo.png
 delete mode 100644 docs/website/docs/user-guides/images/dlt-main.png
 delete mode 100644 docs/website/docs/user-guides/images/structured-data.png

diff --git a/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md b/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
index 09a558555a..d47b9a5b72 100644
--- a/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
+++ b/docs/website/blog/2023-05-26-structured-data-lakes-through-schema-evolution-next-generation-data-platform.md
@@ -108,5 +108,5 @@ To try out schema evolution with `dlt`, check out our [colab demo.](https://cola
 
 ### Want more?
 - Join our [Slack](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
-- Read our [docs on implementing schema evolution](https://dlthub.com/docs/reference/explainers/schema-evolution)
+- Read our [schema evolution blog post](https://dlthub.com/docs/blog/schema-evolution)
 - Stay tuned for the next article in the series: *How to do schema evolution with* `dlt` *in the most effective way*
\ No newline at end of file
diff --git a/docs/website/docs/reference/explainers/schema-evolution.md b/docs/website/blog/2023-06-10-schema-evolution.md
similarity index 88%
rename from docs/website/docs/reference/explainers/schema-evolution.md
rename to docs/website/blog/2023-06-10-schema-evolution.md
index f0a0e9e222..2347e95bfd 100644
--- a/docs/website/docs/reference/explainers/schema-evolution.md
+++ b/docs/website/blog/2023-06-10-schema-evolution.md
@@ -1,7 +1,12 @@
 ---
-title: Schema evolution
-description: Schema evolution with dlt
-keywords: [schema evolution, schema versioning, data contracts]
+slug: schema-evolution
+title: "Schema Evolution"
+authors:
+  name: Adrian Brudaru
+  title: Schema Evolution
+  url: https://github.com/adrianbr
+  image_url: https://avatars.githubusercontent.com/u/5762770?v=4
+tags: [data engineer shortage, structured data, schema evolution]
 ---
 
 # Schema evolution
@@ -131,10 +136,10 @@ business-logic tests, you would still need to implement them in a custom way.
 ## The implementation recipe
 1. Use `dlt`. It will automatically infer and version schemas, so you can simply check if there are
-   changes. You can just use the [normaliser + loader](../../general-usage/pipeline.md) or
-   [build extraction with dlt](../../general-usage/resource.md). If you want to define additional
-   constraints, you can do so in the [schema](../../general-usage/schema.md).
-1. [Define your slack hook](../../running-in-production/running.md#using-slack-to-send-messages) or
+   changes. You can just use the [normaliser + loader](https://dlthub.com/docs/general-usage/pipeline) or
+   [build extraction with dlt](https://dlthub.com/docs/general-usage/resource). If you want to define additional
+   constraints, you can do so in the [schema](https://dlthub.com/docs/general-usage/schema).
+1. [Define your Slack hook](https://dlthub.com/docs/running-in-production/running#using-slack-to-send-messages) or
   create your own notification function. Make sure the slack channel contains the data producer
   and any stakeholders.
-1. [Capture the load job info and send it to the hook](../../running-in-production/running#inspect-save-and-alert-on-schema-changes).
+1. [Capture the load job info and send it to the hook](https://dlthub.com/docs/running-in-production/running#inspect-save-and-alert-on-schema-changes).
diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md
index 14c3a78411..078ef5999e 100644
--- a/docs/website/docs/build-a-pipeline-tutorial.md
+++ b/docs/website/docs/build-a-pipeline-tutorial.md
@@ -413,8 +413,6 @@ These governance features in `dlt` pipelines contribute to better data managemen
 compliance adherence, and overall data governance, promoting data consistency,
 traceability, and control throughout the data processing lifecycle.
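
The implementation recipe in the hunk above stays at the level of links, so here is a minimal sketch of steps 1 and 3 in code. The `users` resource, pipeline name, and webhook URL are illustrative placeholders rather than part of the docs being patched; `send_slack_message`, `load_info.load_packages`, and `schema_update` follow the dlt API that the linked running-in-production page documents.

```python
import dlt
from dlt.common.runtime.slack import send_slack_message

# Illustrative resource; any dlt source or resource works the same way.
@dlt.resource(name="users")
def users():
    yield [{"id": 1, "name": "alice", "created_at": "2023-09-15"}]

# Assumed Slack incoming-webhook URL (step 2 of the recipe).
SLACK_HOOK = "https://hooks.slack.com/services/..."

pipeline = dlt.pipeline(
    pipeline_name="schema_alerts", destination="duckdb", dataset_name="raw"
)
load_info = pipeline.run(users())

# Step 3: each load package records the schema updates it applied;
# alert the channel on every new or changed column.
for package in load_info.load_packages:
    for table_name, table in package.schema_update.items():
        for column_name, column in table["columns"].items():
            send_slack_message(
                SLACK_HOOK,
                f"Table updated: {table_name}, column: {column_name} ({column['data_type']})",
            )
```

On the first run every inferred table and column shows up as an update; afterwards, only genuine schema evolution triggers a message.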
-Read more about [schema evolution.](reference/explainers/schema-evolution.md)
-
 ### Scaling and finetuning
 
 `dlt` offers several mechanism and configuration options to scale up and finetune pipelines:
diff --git a/docs/website/docs/dlt-ecosystem/deployments/index.md b/docs/website/docs/dlt-ecosystem/deployments/index.md
new file mode 100644
index 0000000000..d5adb597c7
--- /dev/null
+++ b/docs/website/docs/dlt-ecosystem/deployments/index.md
@@ -0,0 +1,17 @@
+---
+title: Deployments
+description: dlt can run on almost any Python environment and hardware
+keywords: [dlt, running, environment]
+---
+import DocCardList from '@theme/DocCardList';
+
+# Deployments
+`dlt` runs wherever Python runs. Below you will find walkthroughs for Airflow, GCP Cloud Functions, AWS Lambda and GitHub Actions, but our users also run `dlt` in:
+* any notebook environment, including Colab
+* any orchestrator, including Kestra, Prefect or Dagster
+* local laptops, with `duckdb` or `weaviate` as destinations
+* GitHub Codespaces and other devcontainers
+* regular VMs from all major providers: AWS, GCP or Azure
+* containers, via Docker, docker-compose and Kubernetes
+
+<DocCardList />
\ No newline at end of file
diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
deleted file mode 100644
index 15f00731e2..0000000000
--- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/airflow-deployment.md
+++ /dev/null
@@ -1,40 +0,0 @@
----
-title: Airflow with Cloud Composer
-description: How to run dlt pipeline with Airflow
-keywords: [dlt, webhook, serverless, airflow, gcp, cloud composer]
----
-
-# Deployment with Airflow and Google Cloud Composer
-
-[Airflow](https://airflow.apache.org) is like your personal assistant for managing data workflows.
-It's a cool open-source platform that lets you create and schedule complex data pipelines. You can
-break down your tasks into smaller chunks, set dependencies between them, and keep an eye on how
-everything's running.
-
-[Google Cloud Composer](https://cloud.google.com/composer) is a Google Cloud managed Airflow, which
-allows you to use Airflow without having to deploy it. It costs the same as you would run your own,
-except all the kinks and inefficiencies have mostly been ironed out. The latest version they offer
-features autoscaling which helps reduce cost further by shutting down unused workers.
-
-Combining Airflow, `dlt`, and Google Cloud Composer is a game-changer. You can supercharge your data
-pipelines by leveraging Airflow's workflow management features, enhancing them with `dlt`'s
-specialized templates, and enjoying the scalability and reliability of Google Cloud Composer's
-managed environment. It's the ultimate combo for handling data integration, transformation, and
-loading tasks like a pro.
-
-`dlt` makes it super convenient to deploy your data load script and integrate it seamlessly with
-your Airflow workflow in Google Cloud Composer. It's all about simplicity and getting things done
-with just a few keystrokes.
-
-For this easy style of deployment, `dlt` supports
-[the cli command](../../../reference/command-line-interface.md#airflow-composer):
-
-```bash
-dlt deploy {pipeline_script}.py airflow-composer
-```
-
-which generates the necessary code and instructions.
-
-Read our
-[Walkthroughs: Deploy a pipeline with Airflow and Google Composer](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md)
-to find out more.
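
The new deployments index above claims that `dlt` runs wherever Python runs. As a concrete sketch of one of the listed environments, here is what a pipeline wrapped as a GCP Cloud Function HTTP entry point could look like. The function, resource, and dataset names are hypothetical, and the `bigquery` destination assumes credentials are supplied by the runtime (for example, the function's service account).

```python
import dlt

# Illustrative resource standing in for a real source.
@dlt.resource(name="events")
def events():
    yield [{"id": 1, "kind": "example"}]

def load_events(request):
    # Standard Cloud Functions HTTP signature: a flask.Request comes in,
    # a response body goes out. Pipeline state lives in the destination,
    # so a stateless, short-lived worker is fine.
    pipeline = dlt.pipeline(
        pipeline_name="cloud_fn_events", destination="bigquery", dataset_name="raw"
    )
    load_info = pipeline.run(events())
    return str(load_info)
```

The same body works almost unchanged as an AWS Lambda handler; only the entry-point signature differs.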
diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md deleted file mode 100644 index 76d6389a1e..0000000000 --- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/choosing-an-orchestrator.md +++ /dev/null @@ -1,120 +0,0 @@ ---- -title: Choosing an orchestrator -description: How to choose an orchestrator to deploy dlt pipeline -keywords: [orchestrator, airflow, github actions] ---- - -# Choosing an orchestrator - -Orchestrators enable developers to quickly and easily deploy and manage applications in the cloud. - -## What is an orchestrator? - -An orchestrator is a software system that automates the deployment, scaling, and management of -applications and services. - -It provides a single platform for managing and coordinating the components of distributed -applications, including containers, microservices, and other cloud-native resources. - -## Do I need an orchestrator? - -No, but if you do not use one, you will need to consider how to solve the problems one would: - -- Monitoring, Alerting; -- deployment and execution; -- scheduling complex workflows; -- task triggers, dependencies, retries; -- UI for visualising workflows; -- secret vaults/storage; - -So in short, unless you need something very lightweight, you can benefit from an orchestrator. - -## So which one? - -### **Airflow** - -Airflow is a market standard that’s hard to beat. Any shortcomings it might have is more than -compensated for by being an open source mature product with a large community. - -`dlt` supports **airflow deployments**, meaning it’s particularly easy to deploy `dlt` pipelines to -Airflow via simple commands. - -Your Airflow options are: - -1. Broadly used managed Airflow vendors: - - - [GCP Cloud Composer](https://cloud.google.com/composer?hl=en) is - recommended due to GCP being easy to use for non engineers and Google BigQuery being a popular - solution. - - [Astronomer.io](http://Astronomer.io) (recommended for non GCP users). CI/CD out of the box - - AWS has a managed Airflow too, though it is the hardest to use. - -1. Self-managed Airflow: - - - You can self-host and run your own Airflow. This is not recommended unless the team plans to - have the skills to work with this in house. - -Limitations: - -- Airflow manages large scale setups and small alike. - -To deploy a pipeline on Airflow with Google Composer, read our -[step-by-step tutorial](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer) about -using the `dlt deploy` command. - -### **GitHub Actions** - -GitHub Actions is not a full-blown orchestrator, but it works. It supports simple workflows and -scheduling, allowing for a visual, lightweight deployment with web-based monitoring (i.e. you can -see runs in GitHub Actions). It has a free tier, but its pricing is not convenient for large jobs. - -To deploy a pipeline on GitHub Actions, read -[here](../../../walkthroughs/deploy-a-pipeline/deploy-with-github-actions) about using the -`dlt deploy` command and -[here](https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration) -about the limitations of GitHub Actions and how their billing works. - -### **Other orchestrators** - -Other orchestrators can also do the job, so if you are limited in choice or prefer something else, -choose differently. If your team prefers a different tool which affects their work positively, -consider that as well. 
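
The deleted page above notes that `dlt` supports Airflow deployments particularly well. As a sketch of what that looks like with dlt's Airflow helper, based on the walkthrough the page links to: the source, schedule, and names here are illustrative, and `PipelineTasksGroup` / `add_run` are used as the dlt docs describe them.

```python
import dlt
import pendulum
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup

@dlt.source
def data_source():
    # Placeholder source with a single trivial resource.
    @dlt.resource(name="items")
    def items():
        yield [{"id": 1, "value": "example"}]

    return items

@dag(
    schedule_interval="@daily",
    start_date=pendulum.datetime(2023, 9, 1),
    catchup=False,
    max_active_runs=1,
)
def load_data():
    # The task group wraps pipeline execution; "serialize" decomposition
    # runs the source's resources as serial tasks, which is safe for
    # pipelines that rely on shared pipeline state.
    tasks = PipelineTasksGroup("my_pipeline", use_data_folder=False, wipe_local_data=True)

    pipeline = dlt.pipeline(
        pipeline_name="my_pipeline", dataset_name="my_data", destination="bigquery"
    )
    tasks.add_run(
        pipeline,
        data_source(),
        decompose="serialize",
        trigger_rule="all_done",
        retries=0,
        provide_context=True,
    )

load_data()
```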
- -What do you need to consider when using other orchestrators? - -## Source - Resource decomposition: - -You can decompose a pipeline into strongly connected components with -`source().decompose(strategy="scc")`. The method returns a list of dlt sources each containing a -single component. Method makes sure that no resource is executed twice. - -Serial decomposition: - -You can load such sources as tasks serially in order present of the list. Such DAG is safe for -pipelines that use the state internally. -[It is used internally by our Airflow mapper to construct DAGs.](https://github.com/dlt-hub/dlt/blob/devel/dlt/helpers/airflow_helper.py) - -Custom decomposition: - -- When decomposing pipelines into tasks, be mindful of shared state. -- Dependent resources pass data to each other via hard disk - so they need to run on the same - worker. Group them in a task that runs them together. -- State is per-pipeline. The pipeline identifier is the pipeline name. A single pipeline state - should be accessed serially to avoid losing details on parallel runs. - -Parallel decomposition: - -If you are using only the resource state (which most of the pipelines really should!) you can run -your tasks in parallel. - -- Perform the `scc` decomposition. -- Run each component in a pipeline with different but deterministic `pipeline_name` (same component - \- same pipeline, you can use names of selected resources in source to construct unique id). - -Each pipeline will have its private state in the destination and there won't be any clashes. As all -the components write to the same schema you may observe a that loader stage is attempting to migrate -the schema, that should be a problem though as long as your data does not create variant columns. - -## Credentials - -[See credentials section for passing credentials to or from dlt.](../../../general-usage/credentials.md) diff --git a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md b/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md deleted file mode 100644 index 8a5d693117..0000000000 --- a/docs/website/docs/dlt-ecosystem/deployments/orchestrators/github-actions.md +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: GitHub Actions -description: How to run dlt in GitHub Actions -keywords: [dlt, webhook, serverless, deploy, github actions] ---- - -# Native deployment to GitHub Actions - -What is a native deployment to dlt? A native deployment is a deployment where dlt will generate the -glue code and credentials instructions. - -[GitHub Actions](https://docs.github.com/en/actions) is an automation tool by GitHub for building, -testing, and deploying software projects. It simplifies development with pre-built actions and -custom workflows, enabling tasks like testing, documentation generation, and event-based triggers. -It streamlines workflows and saves time for developers. - -When `dlt` and GitHub Actions join forces, data loading and software development become a breeze. -Say goodbye to manual work and enjoy seamless data management. - -For this easy style of deployment, dlt supports the -[cli command](../../../reference/command-line-interface#github-action): - -```shell -dlt deploy
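
To make the parallel decomposition recipe above concrete, here is a minimal sketch under its stated assumption that only resource state is used. The two-resource source is illustrative; `decompose(strategy="scc")` and the deterministic-name trick follow the deleted text, and in an orchestrator each loop iteration would become an independent, parallelizable task.

```python
import dlt

@dlt.source
def my_source():
    # Two independent resources form two strongly connected components.
    @dlt.resource(name="users")
    def users():
        yield [{"id": 1, "name": "alice"}]

    @dlt.resource(name="orders")
    def orders():
        yield [{"id": 10, "user_id": 1}]

    return users, orders

# One dlt source per strongly connected component; no resource runs twice.
components = my_source().decompose(strategy="scc")

for component in components:
    # Deterministic name per component: same component -> same pipeline name,
    # so each component keeps its own private state in the destination.
    name = "my_pipeline_" + "_".join(sorted(component.resources.keys()))
    pipeline = dlt.pipeline(
        pipeline_name=name, destination="duckdb", dataset_name="my_data"
    )
    pipeline.run(component)
```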