Commit

docs: fix absolute links (#1834)
* search and replace absolute links

* fix after automatic replacement

* fix devel links

* add docs preprocessing step to ci docs tests

* add check for devel and absolute links

* post merge fix

* add line number to error output

* install node 20

* fix all root links in docs
sh-rp authored Sep 18, 2024
1 parent 61ba65d commit c96ce7b
Showing 30 changed files with 118 additions and 58 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/test_doc_snippets.yml
@@ -67,6 +67,11 @@ jobs:
with:
python-version: "3.10.x"

- name: Setup node 20
uses: actions/setup-node@v4
with:
node-version: 20

- name: Install Poetry
uses: snok/install-poetry@v1
with:
@@ -81,6 +86,9 @@ jobs:
path: .venv
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}

- name: run docs preprocessor
run: make preprocess-docs

- name: Install dependencies
# if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres -E lancedb --with docs,sentry-sdk --without airflow
4 changes: 3 additions & 1 deletion Makefile
@@ -107,4 +107,6 @@ test-build-images: build-library
docker build -f deploy/dlt/Dockerfile.airflow --build-arg=COMMIT_SHA="$(shell git log -1 --pretty=%h)" --build-arg=IMAGE_VERSION="$(shell poetry version -s)" .
# docker build -f deploy/dlt/Dockerfile --build-arg=COMMIT_SHA="$(shell git log -1 --pretty=%h)" --build-arg=IMAGE_VERSION="$(shell poetry version -s)" .


preprocess-docs:
# run docs preprocessing to run a few checks and ensure examples can be parsed
cd docs/website && npm i && npm run preprocess-docs
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/bigquery.md
@@ -220,7 +220,7 @@ When staging is enabled:

## Supported Column Hints

BigQuery supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
BigQuery supports the following [column hints](../../general-usage/schema#tables-and-columns):

* `partition` - creates a partition with a day granularity on the decorated column (`PARTITION BY DATE`).
May be used with `datetime`, `date`, and `bigint` data types.
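
For illustration, a minimal sketch of setting the `partition` hint through the `columns` argument of `dlt.resource`; the resource, column names, and sample row below are placeholders rather than content from the linked page:

```py
import dlt

# Hedged sketch: the `partition` hint declared via the `columns` argument.
# Resource name, column names, and the sample row are placeholders.
@dlt.resource(
    name="events",
    columns={"created_at": {"data_type": "timestamp", "partition": True}},
)
def events():
    yield {"id": 1, "created_at": "2024-09-18T00:00:00Z"}

pipeline = dlt.pipeline(pipeline_name="events_pipeline", destination="bigquery", dataset_name="events_data")
# pipeline.run(events())  # requires configured BigQuery credentials
```
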
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
@@ -220,7 +220,7 @@ To enable this, GCS provides an S3
compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration.

For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to
the [dlt documentation on filesystem destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage).
the [dlt documentation on filesystem destinations](../../dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage).

To set up GCS staging with HMAC authentication in dlt:
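
The step-by-step instructions are collapsed in this view. As a rough sketch only — the bucket name, HMAC keys, and environment-variable layout below are assumptions based on the standard `filesystem` credential sections, not the documented steps:

```py
import os

# Hedged sketch: GCS accessed through its S3 compatibility mode as the
# filesystem staging destination. All values below are placeholders.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "s3://my-gcs-bucket"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "<gcs-hmac-access-key>"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = "<gcs-hmac-secret>"
# GCS exposes its S3-compatible API at this endpoint.
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL"] = "https://storage.googleapis.com"
```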

2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -414,7 +414,7 @@ disable_compression=true

- To decompress a `gzip` file, you can use tools like `gunzip`. This will convert the compressed file back to its original format, making it readable.

For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](https://dlthub.com/docs/reference/performance#disabling-and-enabling-file-compression).
For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](../../reference/performance#disabling-and-enabling-file-compression).
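
As a hedged illustration, disabling compression can also be expressed from Python via an environment variable, assuming the standard `[normalize.data_writer]` config section name:

```py
import os
import dlt

# Hedged sketch: turn off gzip compression of load files so they stay readable as-is.
os.environ["NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION"] = "true"

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="filesystem", dataset_name="chess_players_games_data"
)
```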

## Files layout
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of the `pipeline`. In our example chess pipeline, it is **chess_players_games_data**.
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/snowflake.md
@@ -194,7 +194,7 @@ Which will read, `|` delimited file, without header and will continue on errors.
Note that we ignore missing columns (`ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE`) and insert NULL into them.

## Supported column hints
Snowflake supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
Snowflake supports the following [column hints](../../general-usage/schema#tables-and-columns):
* `cluster` - creates one or more cluster columns. Multiple cluster columns per table are supported, but only when a new table is created.
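
For illustration, a minimal sketch of the `cluster` hint set through the `columns` argument; the table, columns, and data below are placeholders:

```py
import dlt

# Hedged sketch: two cluster columns declared on a made-up table.
orders = dlt.resource(
    [{"customer_id": 42, "country": "DE", "amount": 100}],
    name="orders",
    columns={"customer_id": {"cluster": True}, "country": {"cluster": True}},
)
```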

## Table and column identifiers
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/synapse.md
@@ -173,7 +173,7 @@ Possible values:
## Supported column hints

Synapse supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
Synapse supports the following [column hints](../../general-usage/schema#tables-and-columns):

* `primary_key` - creates a `PRIMARY KEY NONCLUSTERED NOT ENFORCED` constraint on the column
* `unique` - creates a `UNIQUE NOT ENFORCED` constraint on the column
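
For illustration, a minimal sketch of the `primary_key` and `unique` hints on a made-up table; names and data are placeholders:

```py
import dlt

# Hedged sketch: `primary_key` via the resource argument, `unique` via a column hint.
@dlt.resource(primary_key="id", columns={"email": {"unique": True}})
def users():
    yield {"id": 1, "email": "user@example.com"}
```
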
@@ -102,4 +102,4 @@ DBT_CLOUD__ACCOUNT_ID
DBT_CLOUD__JOB_ID
```

For more information, read the [Credentials](https://dlthub.com/docs/general-usage/credentials) documentation.
For more information, read the [Credentials](../../../general-usage/credentials) documentation.
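
For illustration, a hedged sketch of providing these values from Python through environment variables; the IDs below are placeholders:

```py
import os

# Hedged sketch: placeholder dbt Cloud identifiers for the helper to pick up.
os.environ["DBT_CLOUD__ACCOUNT_ID"] = "123456"
os.environ["DBT_CLOUD__JOB_ID"] = "654321"
```
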
@@ -15,7 +15,7 @@ streams of data in real time.

Our AWS Kinesis [verified source](https://github.com/dlt-hub/verified-sources/tree/master/sources/kinesis)
loads messages from Kinesis streams to your preferred
[destination](https://dlthub.com/docs/dlt-ecosystem/destinations/).
[destination](../../dlt-ecosystem/destinations/).

Resources that can be loaded using this verified source are:

@@ -95,7 +95,7 @@ Keep in mind that enabling these incurs some performance overhead:
## Incremental loading with Arrow tables

You can use incremental loading with Arrow tables as well.
Usage is the same as without other dlt resources. Refer to the [incremental loading](/general-usage/incremental-loading.md) guide for more information.
Usage is the same as with other dlt resources. Refer to the [incremental loading](../../general-usage/incremental-loading.md) guide for more information.

Example:
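
(The original example is collapsed in this view; the following is a minimal, hedged sketch of the pattern with made-up table and column names.)

```py
import dlt
import pyarrow as pa
from datetime import datetime

# Hedged sketch: incremental loading over an Arrow table using an "updated_at"
# cursor column. Table contents and names are placeholders.
@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental("updated_at", initial_value=datetime(2024, 1, 1))
):
    # In a real pipeline this table would come from a database, parquet file, etc.
    yield pa.table({
        "id": [1, 2],
        "updated_at": [datetime(2024, 9, 1), datetime(2024, 9, 2)],
    })

pipeline = dlt.pipeline(pipeline_name="arrow_demo", destination="duckdb", dataset_name="orders_data")
# pipeline.run(orders())
```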

@@ -329,7 +329,7 @@ verified source.
```

> Loads all the data to date in the first run, and then
> [incrementally](https://dlthub.com/docs/general-usage/incremental-loading) in subsequent runs.
> [incrementally](../../general-usage/incremental-loading) in subsequent runs.
1. To load data from a specific start date:

@@ -340,7 +340,7 @@ verified source.
```

> Loads data starting from the specified date during the first run, and then
> [incrementally](https://dlthub.com/docs/general-usage/incremental-loading) in subsequent runs.
> [incrementally](../../general-usage/incremental-loading) in subsequent runs.
<!--@@@DLT_TUBA google_analytics-->

@@ -441,11 +441,11 @@ dlt.resource(
`name`: Denotes the table name, set here as "spreadsheet_info".

`write_disposition`: Dictates how data is loaded to the destination.
[Read more](https://dlthub.com/docs/general-usage/incremental-loading#the-3-write-dispositions).
[Read more](../../general-usage/incremental-loading#the-3-write-dispositions).

`merge_key`: Specifies the column used to identify records for merging. In this
case, "spreadsheet_id" means that records will be merged based on the values in this column.
[Read more](https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental_loading).
[Read more](../../general-usage/incremental-loading#merge-incremental_loading).
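
Putting these parameters together, a hedged sketch of the call described above — the data below is a placeholder, not the spreadsheet metadata the source actually yields:

```py
import dlt

# Hedged sketch: merge on "spreadsheet_id" for a made-up metadata record.
spreadsheet_info = dlt.resource(
    [{"spreadsheet_id": "abc123", "title": "Budget"}],
    name="spreadsheet_info",
    write_disposition="merge",
    merge_key="spreadsheet_id",
)
```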

## Customization
### Create your own pipeline
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/verified-sources/jira.md
@@ -190,7 +190,7 @@ above.

1. Configure the pipeline by specifying the pipeline name, destination, and dataset. To read more
about pipeline configuration, please refer to our documentation
[here](https://dlthub.com/docs/general-usage/pipeline):
[here](../../general-usage/pipeline):

```py
pipeline = dlt.pipeline(
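    # Hedged sketch of the remaining arguments, which are collapsed in this view;
    # the values below are placeholders, not the documented ones.
    pipeline_name="jira_pipeline",
    destination="duckdb",
    dataset_name="jira_data",
)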
6 changes: 3 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/matomo.md
@@ -150,7 +150,7 @@ def matomo_reports(

`site_id`: Website's Site ID as per Matomo account.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
### Source `matomo_visits`:

@@ -183,7 +183,7 @@ def matomo_visits(

`get_live_event_visitors`: Retrieve unique visitor data, defaulting to False.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
### Resource `get_last_visits`

@@ -214,7 +214,7 @@ def get_last_visits(

`rows_per_page`: Number of rows on each page.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) resource method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) resource method and loads the "last_date" from the state of last pipeline run.

### Transformer `visitors`
@@ -9,7 +9,7 @@ import Header from './_source-info-header.md';

<Header/>

Our OpenAPI source generator - `dlt-init-openapi` - generates [`dlt`](https://dlthub.com/docs) data pipelines from [OpenAPI 3.x specs](https://swagger.io/specification/) using the [rest_api verified source](./rest_api) to extract data from any REST API. If you are not familiar with the `rest_api` source, please read [rest_api](./rest_api) to learn how our `rest_api` source works.
Our OpenAPI source generator - `dlt-init-openapi` - generates [`dlt`](../../intro) data pipelines from [OpenAPI 3.x specs](https://swagger.io/specification/) using the [rest_api verified source](./rest_api) to extract data from any REST API. If you are not familiar with the `rest_api` source, please read [rest_api](./rest_api) to learn how our `rest_api` source works.

:::tip
We also have a cool [Google Colab example](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI?usp=sharing#scrollTo=LHGxzf1Ev_yr) that demonstrates this generator. 😎
@@ -65,10 +65,10 @@ To get started with your data pipeline, follow these steps:
dlt init pg_replication duckdb
```

It will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/pg_replication_pipeline.py) with a Postgres replication as the [source](https://dlthub.com/docs/general-usage/source) and [DuckDB](https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb) as the [destination](https://dlthub.com/docs/dlt-ecosystem/destinations).
It will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/pg_replication_pipeline.py) with a Postgres replication as the [source](../../general-usage/source) and [DuckDB](../../dlt-ecosystem/destinations/duckdb) as the [destination](../../dlt-ecosystem/destinations).


2. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](https://dlthub.com/docs/dlt-ecosystem/destinations).
2. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../../dlt-ecosystem/destinations).
3. This source uses the `sql_database` source; you can init it as follows:
@@ -81,7 +81,7 @@ To get started with your data pipeline, follow these steps:
4. After running these two commands, a new directory will be created with the necessary files and configuration settings to get started.
For more information, read the guide on [how to add a verified source](https://dlthub.com/docs/walkthroughs/add-a-verified-source).
For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).
:::note
You can omit the `[sql.sources.credentials]` section in `secrets.toml` as it is not required.
@@ -109,9 +109,9 @@ To get started with your data pipeline, follow these steps:
sources.pg_replication.credentials="postgresql://username:password@host:port/database"
```

3. Finally, follow the instructions in [Destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/) to add credentials for your chosen destination. This will ensure that your data is properly routed.
3. Finally, follow the instructions in [Destinations](../../dlt-ecosystem/destinations/) to add credentials for your chosen destination. This will ensure that your data is properly routed.

For more information, read the [Configuration section.](https://dlthub.com/docs/general-usage/credentials)
For more information, read the [Configuration section.](../../general-usage/credentials)

## Run the pipeline

@@ -130,12 +130,12 @@ For more information, read the [Configuration section.](https://dlthub.com/docs/
For example, the `pipeline_name` for the above pipeline example is `pg_replication_pipeline`; you may also use any custom name instead.
For more information, read the guide on [how to run a pipeline](https://dlthub.com/docs/walkthroughs/run-a-pipeline).
For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).
## Sources and resources
`dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) and [resources](https://dlthub.com/docs/general-usage/resource).
`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource).
### Resource `replication_resource`
@@ -14,7 +14,7 @@ import Header from '../_source-info-header.md';

Efficient data management often requires loading only new or updated data from your SQL databases, rather than reprocessing the entire dataset. This is where incremental loading comes into play.

Incremental loading uses a cursor column (e.g., timestamp or auto-incrementing ID) to load only data newer than a specified initial value, enhancing efficiency by reducing processing time and resource use. Read [here](https://dlthub.com/docs/walkthroughs/sql-incremental-configuration) for more details on incremental loading with `dlt`.
Incremental loading uses a cursor column (e.g., timestamp or auto-incrementing ID) to load only data newer than a specified initial value, enhancing efficiency by reducing processing time and resource use. Read [here](../../../walkthroughs/sql-incremental-configuration) for more details on incremental loading with `dlt`.


#### How to configure
@@ -51,7 +51,7 @@ certain range.
```

Behind the scenes, the loader generates a SQL query filtering rows with `last_modified` values greater than the incremental value. In the first run, this is the initial value (midnight (00:00:00) on January 1, 2024).
In subsequent runs, it is the latest value of `last_modified` that `dlt` stores in [state](https://dlthub.com/docs/general-usage/state).
In subsequent runs, it is the latest value of `last_modified` that `dlt` stores in [state](../../../general-usage/state).
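
For illustration, a hedged sketch of such a cursor configuration — the import path assumes the `sql_database` source bundled with recent `dlt` versions, and the connection string and table name are placeholders:

```py
from datetime import datetime

import dlt
from dlt.sources.sql_database import sql_table  # adjust the import for standalone verified-source setups

# Hedged sketch: incremental cursor on "last_modified" starting at midnight, January 1, 2024.
table = sql_table(
    credentials="postgresql://loader:password@localhost:5432/dlt_data",
    table="orders",
    incremental=dlt.sources.incremental("last_modified", initial_value=datetime(2024, 1, 1)),
)

pipeline = dlt.pipeline(pipeline_name="sql_table_incremental", destination="duckdb", dataset_name="sql_data")
# pipeline.run(table)
```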

2. **Incremental loading with the source `sql_database`**.

@@ -177,9 +177,9 @@ The examples below show how you can set arguments in any of the `.toml` files (`
database = sql_database()
```

You'll be able to configure all the arguments this way (except adapter callback function). [Standard dlt rules apply](https://dlthub.com/docs/general-usage/credentials/configuration#configure-dlt-sources-and-resources).
You'll be able to configure all the arguments this way (except the adapter callback function). [Standard dlt rules apply](../../../general-usage/credentials/setup).

It is also possible to set these arguments as environment variables [using the proper naming convention](https://dlthub.com/docs/general-usage/credentials/config_providers#toml-vs-environment-variables):
It is also possible to set these arguments as environment variables [using the proper naming convention](../../../general-usage/credentials/setup#naming-convention):
```sh
SOURCES__SQL_DATABASE__CREDENTIALS="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
SOURCES__SQL_DATABASE__BACKEND=pandas
@@ -130,7 +130,7 @@ There are several options for adding your connection credentials into your `dlt`

#### 1. Setting them in `secrets.toml` or as environment variables (Recommended)

You can set up credentials using [any method](https://dlthub.com/docs/devel/general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials read [here](https://dlthub.com/docs/devel/general-usage/credentials/setup).
You can set up credentials using [any method](../../../general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials read [here](../../../general-usage/credentials/setup).


#### 2. Passing them directly in the script
@@ -41,7 +41,7 @@ The PyArrow backend does not yield individual rows rather loads chunks of data a


Examples:
1. Pseudonymizing data to hide personally identifiable information (PII) before loading it to the destination. (See [here](https://dlthub.com/docs/general-usage/customising-pipelines/pseudonymizing_columns) for more information on pseudonymizing data with `dlt`)
1. Pseudonymizing data to hide personally identifiable information (PII) before loading it to the destination. (See [here](../../../general-usage/customising-pipelines/pseudonymizing_columns) for more information on pseudonymizing data with `dlt`)

```py
import dlt
@@ -99,10 +99,10 @@ Examples:

## Deploying the sql_database pipeline

You can deploy the `sql_database` pipeline with any of the `dlt` deployment methods, such as [GitHub Actions](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-github-actions), [Airflow](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [Dagster](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster) etc. See [here](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline) for a full list of deployment methods.
You can deploy the `sql_database` pipeline with any of the `dlt` deployment methods, such as [GitHub Actions](../../../walkthroughs/deploy-a-pipeline/deploy-with-github-actions), [Airflow](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [Dagster](../../../walkthroughs/deploy-a-pipeline/deploy-with-dagster) etc. See [here](../../../walkthroughs/deploy-a-pipeline) for a full list of deployment methods.

### Running on Airflow
When running on Airflow:
1. Use the `dlt` [Airflow Helper](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) to create tasks from the `sql_database` source. (If you want to run table extraction in parallel, then you can do this by setting `decompose = "parallel-isolated"` when doing the source->DAG conversion. See [here](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file) for code example.)
1. Use the `dlt` [Airflow Helper](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) to create tasks from the `sql_database` source. (If you want to run table extraction in parallel, then you can do this by setting `decompose = "parallel-isolated"` when doing the source->DAG conversion. See [here](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file) for code example.)
2. Reflect tables at runtime with `defer_table_reflect` argument.
3. Set `allow_external_schedulers` to load data using [Airflow intervals](../../../general-usage/incremental-loading.md#using-airflow-schedule-for-backfill-and-incremental-loading).
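
For illustration, a hedged sketch of the Airflow wiring described in this list — the DAG id, schedule, source import, and connection details are placeholders and assumptions, not the documented example:

```py
from datetime import datetime

import dlt
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup
from dlt.sources.sql_database import sql_database  # adjust for standalone verified-source setups

# Hedged sketch: run the sql_database source on Airflow with isolated per-table tasks.
@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def load_sql_data():
    tasks = PipelineTasksGroup("sql_database_tasks", use_data_folder=False, wipe_local_data=True)

    pipeline = dlt.pipeline(pipeline_name="sql_pipeline", destination="duckdb", dataset_name="sql_data")
    # defer_table_reflect reflects tables at task runtime, as recommended above
    source = sql_database(defer_table_reflect=True)

    # "parallel-isolated" extracts each table in its own isolated task
    tasks.add_run(pipeline, source, decompose="parallel-isolated", trigger_rule="all_done")

load_sql_data()
```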