Commit

docs: fix absolute links (#1834)
* search and replace absolute links

* fix after automatic replacement

* fix devel links

* add docs preprocessing step to ci docs tests

* add check for devel and absolute links

* post merge fix

* add line number to error output

* install node 20

* fix all root links in docs
sh-rp authored Sep 18, 2024
1 parent 61ba65d commit c96ce7b
Showing 30 changed files with 118 additions and 58 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/test_doc_snippets.yml
@@ -67,6 +67,11 @@ jobs:
with:
python-version: "3.10.x"

- name: Setup node 20
uses: actions/setup-node@v4
with:
node-version: 20

- name: Install Poetry
uses: snok/install-poetry@v1
with:
@@ -81,6 +86,9 @@ jobs:
path: .venv
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}

- name: run docs preprocessor
run: make preprocess-docs

- name: Install dependencies
# if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres -E lancedb --with docs,sentry-sdk --without airflow
4 changes: 3 additions & 1 deletion Makefile
@@ -107,4 +107,6 @@ test-build-images: build-library
docker build -f deploy/dlt/Dockerfile.airflow --build-arg=COMMIT_SHA="$(shell git log -1 --pretty=%h)" --build-arg=IMAGE_VERSION="$(shell poetry version -s)" .
# docker build -f deploy/dlt/Dockerfile --build-arg=COMMIT_SHA="$(shell git log -1 --pretty=%h)" --build-arg=IMAGE_VERSION="$(shell poetry version -s)" .


preprocess-docs:
# run docs preprocessing to run a few checks and ensure examples can be parsed
cd docs/website && npm i && npm run preprocess-docs
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/bigquery.md
@@ -220,7 +220,7 @@ When staging is enabled:

## Supported Column Hints

BigQuery supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
BigQuery supports the following [column hints](../../general-usage/schema#tables-and-columns):

* `partition` - creates a partition with a day granularity on the decorated column (`PARTITION BY DATE`).
May be used with `datetime`, `date`, and `bigint` data types.
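
For illustration, a minimal sketch of setting the `partition` hint through the `columns` argument of `dlt.resource`; the resource, column names, and sample row below are placeholders rather than content from the linked page:

```py
import dlt

# Hedged sketch: the `partition` hint declared via the `columns` argument.
# Resource name, column names, and the sample row are placeholders.
@dlt.resource(
    name="events",
    columns={"created_at": {"data_type": "timestamp", "partition": True}},
)
def events():
    yield {"id": 1, "created_at": "2024-09-18T00:00:00Z"}

pipeline = dlt.pipeline(pipeline_name="events_pipeline", destination="bigquery", dataset_name="events_data")
# pipeline.run(events())  # requires configured BigQuery credentials
```
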
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
@@ -220,7 +220,7 @@ To enable this, GCS provides an S3
compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration.

For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to
the [dlt documentation on filesystem destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage).
the [dlt documentation on filesystem destinations](../../dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage).

To set up GCS staging with HMAC authentication in dlt:
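
The step-by-step instructions are collapsed in this view. As a rough sketch only — the bucket name, HMAC keys, and environment-variable layout below are assumptions based on the standard `filesystem` credential sections, not the documented steps:

```py
import os

# Hedged sketch: GCS accessed through its S3 compatibility mode as the
# filesystem staging destination. All values below are placeholders.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "s3://my-gcs-bucket"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "<gcs-hmac-access-key>"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = "<gcs-hmac-secret>"
# GCS exposes its S3-compatible API at this endpoint.
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL"] = "https://storage.googleapis.com"
```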

2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -414,7 +414,7 @@ disable_compression=true

- To decompress a `gzip` file, you can use tools like `gunzip`. This will convert the compressed file back to its original format, making it readable.

For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](https://dlthub.com/docs/reference/performance#disabling-and-enabling-file-compression).
For more details on managing file compression, please visit our documentation on performance optimization: [Disabling and Enabling File Compression](../../reference/performance#disabling-and-enabling-file-compression).
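
As a hedged illustration, disabling compression can also be expressed from Python via an environment variable, assuming the standard `[normalize.data_writer]` config section name:

```py
import os
import dlt

# Hedged sketch: turn off gzip compression of load files so they stay readable as-is.
os.environ["NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION"] = "true"

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="filesystem", dataset_name="chess_players_games_data"
)
```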

## Files layout
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of the `pipeline`. In our example chess pipeline, it is **chess_players_games_data**.
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/snowflake.md
@@ -194,7 +194,7 @@ Which will read, `|` delimited file, without header and will continue on errors.
Note that we ignore missing columns (`ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE`) and insert NULL into them.

## Supported column hints
Snowflake supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
Snowflake supports the following [column hints](../../general-usage/schema#tables-and-columns):
* `cluster` - creates one or more cluster columns. Multiple cluster columns per table are supported, but only when a new table is created.
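
For illustration, a minimal sketch of the `cluster` hint set through the `columns` argument; the table, columns, and data below are placeholders:

```py
import dlt

# Hedged sketch: two cluster columns declared on a made-up table.
orders = dlt.resource(
    [{"customer_id": 42, "country": "DE", "amount": 100}],
    name="orders",
    columns={"customer_id": {"cluster": True}, "country": {"cluster": True}},
)
```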

## Table and column identifiers
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/synapse.md
@@ -173,7 +173,7 @@ Possible values:
## Supported column hints

Synapse supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns):
Synapse supports the following [column hints](../../general-usage/schema#tables-and-columns):

* `primary_key` - creates a `PRIMARY KEY NONCLUSTERED NOT ENFORCED` constraint on the column
* `unique` - creates a `UNIQUE NOT ENFORCED` constraint on the column
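
For illustration, a minimal sketch of the `primary_key` and `unique` hints on a made-up table; names and data are placeholders:

```py
import dlt

# Hedged sketch: `primary_key` via the resource argument, `unique` via a column hint.
@dlt.resource(primary_key="id", columns={"email": {"unique": True}})
def users():
    yield {"id": 1, "email": "user@example.com"}
```
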
@@ -102,4 +102,4 @@ DBT_CLOUD__ACCOUNT_ID
DBT_CLOUD__JOB_ID
```

For more information, read the [Credentials](https://dlthub.com/docs/general-usage/credentials) documentation.
For more information, read the [Credentials](../../../general-usage/credentials) documentation.
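
For illustration, a hedged sketch of providing these values from Python through environment variables; the IDs below are placeholders:

```py
import os

# Hedged sketch: placeholder dbt Cloud identifiers for the helper to pick up.
os.environ["DBT_CLOUD__ACCOUNT_ID"] = "123456"
os.environ["DBT_CLOUD__JOB_ID"] = "654321"
```
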
@@ -15,7 +15,7 @@ streams of data in real time.

Our AWS Kinesis [verified source](https://github.com/dlt-hub/verified-sources/tree/master/sources/kinesis)
loads messages from Kinesis streams to your preferred
[destination](https://dlthub.com/docs/dlt-ecosystem/destinations/).
[destination](../../dlt-ecosystem/destinations/).

Resources that can be loaded using this verified source are:

@@ -95,7 +95,7 @@ Keep in mind that enabling these incurs some performance overhead:
## Incremental loading with Arrow tables

You can use incremental loading with Arrow tables as well.
Usage is the same as without other dlt resources. Refer to the [incremental loading](/general-usage/incremental-loading.md) guide for more information.
Usage is the same as with other dlt resources. Refer to the [incremental loading](../../general-usage/incremental-loading.md) guide for more information.

Example:
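
(The original example is collapsed in this view; the following is a minimal, hedged sketch of the pattern with made-up table and column names.)

```py
import dlt
import pyarrow as pa
from datetime import datetime

# Hedged sketch: incremental loading over an Arrow table using an "updated_at"
# cursor column. Table contents and names are placeholders.
@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental("updated_at", initial_value=datetime(2024, 1, 1))
):
    # In a real pipeline this table would come from a database, parquet file, etc.
    yield pa.table({
        "id": [1, 2],
        "updated_at": [datetime(2024, 9, 1), datetime(2024, 9, 2)],
    })

pipeline = dlt.pipeline(pipeline_name="arrow_demo", destination="duckdb", dataset_name="orders_data")
# pipeline.run(orders())
```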

@@ -329,7 +329,7 @@ verified source.
```

> Loads all the data to date in the first run, and then
> [incrementally](https://dlthub.com/docs/general-usage/incremental-loading) in subsequent runs.
> [incrementally](../../general-usage/incremental-loading) in subsequent runs.
1. To load data from a specific start date:

@@ -340,7 +340,7 @@ verified source.
```

> Loads data starting from the specified date during the first run, and then
> [incrementally](https://dlthub.com/docs/general-usage/incremental-loading) in subsequent runs.
> [incrementally](../../general-usage/incremental-loading) in subsequent runs.
<!--@@@DLT_TUBA google_analytics-->

@@ -441,11 +441,11 @@ dlt.resource(
`name`: Denotes the table name, set here as "spreadsheet_info".

`write_disposition`: Dictates how data is loaded to the destination.
[Read more](https://dlthub.com/docs/general-usage/incremental-loading#the-3-write-dispositions).
[Read more](../../general-usage/incremental-loading#the-3-write-dispositions).

`merge_key`: Specifies the column used to identify records for merging. In this
case, "spreadsheet_id" means that records will be merged based on the values in this column.
[Read more](https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental_loading).
[Read more](../../general-usage/incremental-loading#merge-incremental_loading).
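
Putting these parameters together, a hedged sketch of the call described above — the data below is a placeholder, not the spreadsheet metadata the source actually yields:

```py
import dlt

# Hedged sketch: merge on "spreadsheet_id" for a made-up metadata record.
spreadsheet_info = dlt.resource(
    [{"spreadsheet_id": "abc123", "title": "Budget"}],
    name="spreadsheet_info",
    write_disposition="merge",
    merge_key="spreadsheet_id",
)
```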

## Customization
### Create your own pipeline
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/verified-sources/jira.md
@@ -190,7 +190,7 @@ above.

1. Configure the pipeline by specifying the pipeline name, destination, and dataset. To read more
about pipeline configuration, please refer to our documentation
[here](https://dlthub.com/docs/general-usage/pipeline):
[here](../../general-usage/pipeline):

```py
pipeline = dlt.pipeline(
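    # Hedged sketch of the remaining arguments, which are collapsed in this view;
    # the values below are placeholders, not the documented ones.
    pipeline_name="jira_pipeline",
    destination="duckdb",
    dataset_name="jira_data",
)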
6 changes: 3 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/matomo.md
@@ -150,7 +150,7 @@ def matomo_reports(

`site_id`: Website's Site ID as per Matomo account.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
### Source `matomo_visits`:

@@ -183,7 +183,7 @@ def matomo_visits(

`get_live_event_visitors`: Retrieve unique visitor data, defaulting to False.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) source method and loads the "last_date" from the state of last pipeline run.
### Resource `get_last_visits`

@@ -214,7 +214,7 @@ def get_last_visits(

`rows_per_page`: Number of rows on each page.

>Note: This is an [incremental](https://dlthub.com/docs/general-usage/incremental-loading) resource method and loads the "last_date" from the state of last pipeline run.
>Note: This is an [incremental](../../general-usage/incremental-loading) resource method and loads the "last_date" from the state of last pipeline run.

### Transformer `visitors`
@@ -9,7 +9,7 @@ import Header from './_source-info-header.md';

<Header/>

Our OpenAPI source generator - `dlt-init-openapi` - generates [`dlt`](https://dlthub.com/docs) data pipelines from [OpenAPI 3.x specs](https://swagger.io/specification/) using the [rest_api verified source](./rest_api) to extract data from any REST API. If you are not familiar with the `rest_api` source, please read [rest_api](./rest_api) to learn how our `rest_api` source works.
Our OpenAPI source generator - `dlt-init-openapi` - generates [`dlt`](../../intro) data pipelines from [OpenAPI 3.x specs](https://swagger.io/specification/) using the [rest_api verified source](./rest_api) to extract data from any REST API. If you are not familiar with the `rest_api` source, please read [rest_api](./rest_api) to learn how our `rest_api` source works.

:::tip
We also have a cool [Google Colab example](https://colab.research.google.com/drive/1MRZvguOTZj1MlkEGzjiso8lQ_wr1MJRI?usp=sharing#scrollTo=LHGxzf1Ev_yr) that demonstrates this generator. 😎
@@ -65,10 +65,10 @@ To get started with your data pipeline, follow these steps:
dlt init pg_replication duckdb
```

It will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/pg_replication_pipeline.py) with a Postgres replication as the [source](https://dlthub.com/docs/general-usage/source) and [DuckDB](https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb) as the [destination](https://dlthub.com/docs/dlt-ecosystem/destinations).
It will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/pg_replication_pipeline.py) with a Postgres replication as the [source](../../general-usage/source) and [DuckDB](../../dlt-ecosystem/destinations/duckdb) as the [destination](../../dlt-ecosystem/destinations).


2. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](https://dlthub.com/docs/dlt-ecosystem/destinations).
2. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../../dlt-ecosystem/destinations).
3. This source uses the `sql_database` source; you can init it as follows:
@@ -81,7 +81,7 @@ To get started with your data pipeline, follow these steps:
4. After running these two commands, a new directory will be created with the necessary files and configuration settings to get started.
For more information, read the guide on [how to add a verified source](https://dlthub.com/docs/walkthroughs/add-a-verified-source).
For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source).
:::note
You can omit the `[sql.sources.credentials]` section in `secrets.toml` as it is not required.
@@ -109,9 +109,9 @@ To get started with your data pipeline, follow these steps:
sources.pg_replication.credentials="postgresql://username:password@host:port/database"
```

3. Finally, follow the instructions in [Destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/) to add credentials for your chosen destination. This will ensure that your data is properly routed.
3. Finally, follow the instructions in [Destinations](../../dlt-ecosystem/destinations/) to add credentials for your chosen destination. This will ensure that your data is properly routed.

For more information, read the [Configuration section.](https://dlthub.com/docs/general-usage/credentials)
For more information, read the [Configuration section.](../../general-usage/credentials)

## Run the pipeline

@@ -130,12 +130,12 @@ For more information, read the [Configuration section.](https://dlthub.com/docs/
For example, the `pipeline_name` for the above pipeline example is `pg_replication_pipeline`; you may also use any custom name instead.
For more information, read the guide on [how to run a pipeline](https://dlthub.com/docs/walkthroughs/run-a-pipeline).
For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline).
## Sources and resources
`dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) and [resources](https://dlthub.com/docs/general-usage/resource).
`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource).
### Resource `replication_resource`
@@ -14,7 +14,7 @@ import Header from '../_source-info-header.md';

Efficient data management often requires loading only new or updated data from your SQL databases, rather than reprocessing the entire dataset. This is where incremental loading comes into play.

Incremental loading uses a cursor column (e.g., timestamp or auto-incrementing ID) to load only data newer than a specified initial value, enhancing efficiency by reducing processing time and resource use. Read [here](https://dlthub.com/docs/walkthroughs/sql-incremental-configuration) for more details on incremental loading with `dlt`.
Incremental loading uses a cursor column (e.g., timestamp or auto-incrementing ID) to load only data newer than a specified initial value, enhancing efficiency by reducing processing time and resource use. Read [here](../../../walkthroughs/sql-incremental-configuration) for more details on incremental loading with `dlt`.


#### How to configure
@@ -51,7 +51,7 @@ certain range.
```

Behind the scenes, the loader generates a SQL query filtering rows with `last_modified` values greater than the incremental value. In the first run, this is the initial value (midnight (00:00:00) on January 1, 2024).
In subsequent runs, it is the latest value of `last_modified` that `dlt` stores in [state](https://dlthub.com/docs/general-usage/state).
In subsequent runs, it is the latest value of `last_modified` that `dlt` stores in [state](../../../general-usage/state).
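
For illustration, a hedged sketch of such a cursor configuration — the import path assumes the `sql_database` source bundled with recent `dlt` versions, and the connection string and table name are placeholders:

```py
from datetime import datetime

import dlt
from dlt.sources.sql_database import sql_table  # adjust the import for standalone verified-source setups

# Hedged sketch: incremental cursor on "last_modified" starting at midnight, January 1, 2024.
table = sql_table(
    credentials="postgresql://loader:password@localhost:5432/dlt_data",
    table="orders",
    incremental=dlt.sources.incremental("last_modified", initial_value=datetime(2024, 1, 1)),
)

pipeline = dlt.pipeline(pipeline_name="sql_table_incremental", destination="duckdb", dataset_name="sql_data")
# pipeline.run(table)
```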

2. **Incremental loading with the source `sql_database`**.

@@ -177,9 +177,9 @@ The examples below show how you can set arguments in any of the `.toml` files (`
database = sql_database()
```

You'll be able to configure all the arguments this way (except adapter callback function). [Standard dlt rules apply](https://dlthub.com/docs/general-usage/credentials/configuration#configure-dlt-sources-and-resources).
You'll be able to configure all the arguments this way (except the adapter callback function). [Standard dlt rules apply](../../../general-usage/credentials/setup).

It is also possible to set these arguments as environment variables [using the proper naming convention](https://dlthub.com/docs/general-usage/credentials/config_providers#toml-vs-environment-variables):
It is also possible to set these arguments as environment variables [using the proper naming convention](../../../general-usage/credentials/setup#naming-convention):
```sh
SOURCES__SQL_DATABASE__CREDENTIALS="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
SOURCES__SQL_DATABASE__BACKEND=pandas
@@ -130,7 +130,7 @@ There are several options for adding your connection credentials into your `dlt`

#### 1. Setting them in `secrets.toml` or as environment variables (Recommended)

You can set up credentials using [any method](https://dlthub.com/docs/devel/general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials read [here](https://dlthub.com/docs/devel/general-usage/credentials/setup).
You can set up credentials using [any method](../../../general-usage/credentials/setup#available-config-providers) supported by `dlt`. We recommend using `.dlt/secrets.toml` or the environment variables. See Step 2 of the [setup](./setup) for how to set credentials inside `secrets.toml`. For more information on passing credentials read [here](../../../general-usage/credentials/setup).


#### 2. Passing them directly in the script
@@ -41,7 +41,7 @@ The PyArrow backend does not yield individual rows rather loads chunks of data a


Examples:
1. Pseudonymizing data to hide personally identifiable information (PII) before loading it to the destination. (See [here](https://dlthub.com/docs/general-usage/customising-pipelines/pseudonymizing_columns) for more information on pseudonymizing data with `dlt`)
1. Pseudonymizing data to hide personally identifiable information (PII) before loading it to the destination. (See [here](../../../general-usage/customising-pipelines/pseudonymizing_columns) for more information on pseudonymizing data with `dlt`)

```py
import dlt
@@ -99,10 +99,10 @@ Examples:

## Deploying the sql_database pipeline

You can deploy the `sql_database` pipeline with any of the `dlt` deployment methods, such as [GitHub Actions](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-github-actions), [Airflow](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [Dagster](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster) etc. See [here](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline) for a full list of deployment methods.
You can deploy the `sql_database` pipeline with any of the `dlt` deployment methods, such as [GitHub Actions](../../../walkthroughs/deploy-a-pipeline/deploy-with-github-actions), [Airflow](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [Dagster](../../../walkthroughs/deploy-a-pipeline/deploy-with-dagster) etc. See [here](../../../walkthroughs/deploy-a-pipeline) for a full list of deployment methods.

### Running on Airflow
When running on Airflow:
1. Use the `dlt` [Airflow Helper](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) to create tasks from the `sql_database` source. (If you want to run table extraction in parallel, then you can do this by setting `decompose = "parallel-isolated"` when doing the source->DAG conversion. See [here](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file) for code example.)
1. Use the `dlt` [Airflow Helper](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) to create tasks from the `sql_database` source. (If you want to run table extraction in parallel, then you can do this by setting `decompose = "parallel-isolated"` when doing the source->DAG conversion. See [here](../../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file) for code example.)
2. Reflect tables at runtime with `defer_table_reflect` argument.
3. Set `allow_external_schedulers` to load data using [Airflow intervals](../../../general-usage/incremental-loading.md#using-airflow-schedule-for-backfill-and-incremental-loading).
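
For illustration, a hedged sketch of the Airflow wiring described in this list — the DAG id, schedule, source import, and connection details are placeholders and assumptions, not the documented example:

```py
from datetime import datetime

import dlt
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup
from dlt.sources.sql_database import sql_database  # adjust for standalone verified-source setups

# Hedged sketch: run the sql_database source on Airflow with isolated per-table tasks.
@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def load_sql_data():
    tasks = PipelineTasksGroup("sql_database_tasks", use_data_folder=False, wipe_local_data=True)

    pipeline = dlt.pipeline(pipeline_name="sql_pipeline", destination="duckdb", dataset_name="sql_data")
    # defer_table_reflect reflects tables at task runtime, as recommended above
    source = sql_database(defer_table_reflect=True)

    # "parallel-isolated" extracts each table in its own isolated task
    tasks.add_run(pipeline, source, decompose="parallel-isolated", trigger_rule="all_done")

load_sql_data()
```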