# Add watsonx spark config and setup files #6895
---
title: "IBM watsonx.data Spark setup"
description: "Read this guide to learn about the IBM watsonx.data Spark setup in dbt."
id: "watsonx-spark-setup"
meta:
  maintained_by: IBM
  authors: Bayan Albunayan, Reema Alzaid, Manjot Sidhu
  github_repo: 'IBM/dbt-watsonx-spark'
  pypi_package: 'dbt-watsonx-spark'
  min_core_version: v0.0.8
  cloud_support: 'Not Supported'
  min_supported_version: 'n/a'
  slack_channel_name:
  slack_channel_link:
  platform_name: IBM watsonx.data
  config_page: /reference/resource-configs/watsonx-Spark-config
---
import SetUpPages from '/snippets/_setup-pages-intro.md';

<SetUpPages meta={frontMatter.meta}/>

The `dbt-watsonx-spark` adapter allows you to use dbt to transform and manage data on IBM watsonx.data Spark, leveraging its distributed SQL query engine capabilities.
Before proceeding, ensure you have the following:

- An active IBM watsonx.data instance. See the getting-started guides for [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started) and [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=installing-watsonxdata-developer-version).
- A provisioned **Native Spark engine** in watsonx.data. See the provisioning guides for [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-prov_nspark) and [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-native-engine).
- An active **Spark query server** in your **Native Spark engine**.

Read the official documentation for using **watsonx.data** with `dbt-watsonx-spark`:

- [Documentation for IBM Cloud and SaaS offerings](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-dbt_watsonx_spark_inst)
- [Documentation for IBM watsonx.data software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=integration-data-build-tool-adapter-spark)
## Installing dbt-watsonx-spark

Note: Beginning in dbt v1.8, installing an adapter no longer installs `dbt-core` automatically. Adapter and dbt Core versions are decoupled to avoid overwriting existing `dbt-core` installations. Use the following command to install both:

```sh
python -m pip install dbt-core dbt-watsonx-spark
```
## Configuring `dbt-watsonx-spark`

For IBM watsonx.data-specific configuration, refer to [IBM watsonx.data configs](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-configuration-setting-up-your-profile).
## Connecting to IBM watsonx.data Spark

To connect dbt with watsonx.data Spark, configure a profile in your `profiles.yml` file located in the `.dbt/` directory of your home folder. The following is an example configuration for connecting to IBM watsonx.data SaaS and Software instances:
<File name='~/.dbt/profiles.yml'>

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```

</File>
## Host parameters

To get the profile details for an IBM watsonx.data SaaS or Software instance, click **View connect details** when the query server is in **RUNNING** status in watsonx.data. The **Connection details** page opens with the profile configuration. Copy the connection details into the `profiles.yml` file located in the `.dbt` directory of your home folder.

The following profile fields are required to configure watsonx.data Spark connections:
| Option | Required/Optional | <div style={{width:'200px'}}>Description</div> | <div style={{width:'300px'}}>Example</div> |
| ---------- | ----------------------------- | ------------------------------------------------------------------------- | ----------------- |
| `method` | Required | Specifies the connection method to the Spark query server. Use `http`. | `http` |
| `schema` | Required | The name of an existing schema in the Spark engine, or a new schema to create. | `spark_schema` |
| `host` | Required | Hostname of the watsonx.data console. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `https://dataplatform.cloud.ibm.com` |
| `uri` | Required | URI of your query server running on watsonx.data. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `/lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice` |
| `catalog` | Required | The catalog associated with the Spark engine. | `my_catalog` |
| `use_ssl` | Optional (default: **false**) | Specifies whether to use SSL. | `true` or `false` |
| `instance` | Required | For **SaaS**, the CRN of the watsonx.data instance. For **Software**, the instance ID of the watsonx.data instance. | `1726574045872688` |
| `user` | Required | Username for the watsonx.data instance. For SaaS, use your email address as the username. | `username` or `[email protected]` |
| `apikey` | Required | Your API key. For more information, see the API key documentation for [SaaS](https://www.ibm.com/docs/en/software-hub/5.1.x?topic=started-generating-api-keys) and [Software](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#manage-user-keys). | `API key` |
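Putting these fields together, a filled-in profile might look like the following sketch. All values are illustrative placeholders based on the examples above (including the hypothetical `spark123` and `qs456` IDs), not real endpoints or credentials:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: spark_schema
      host: https://dataplatform.cloud.ibm.com
      uri: /lakehouse/api/v2/spark_engines/spark123/query_servers/qs456/connect/cliservice
      catalog: my_catalog
      use_ssl: true                    # HTTPS endpoint, so SSL is enabled
      auth:
        instance: 1726574045872688     # CRN for SaaS, instance ID for Software
        user: username                 # for SaaS, use your email address
        apikey: MY_API_KEY             # placeholder; never commit real keys
```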
### Schemas and catalogs

When selecting the catalog, ensure the user has read and write access. This selection does not limit your ability to query into schemas specified or created by other users; it also serves as the default location for materialized `table`, `view`, and `incremental` models.
### SSL verification

- If the Spark instance uses an unsecured HTTP connection, set `use_ssl` to `false`.
- If the instance uses `HTTPS`, set it to `true`.
## Additional parameters

The following profile fields are optional. Use them to configure the instance session and dbt connection behavior.

| Profile field | Description | Example |
| ------------------------ | ------------------------------------------------------------ | --------------------------------- |
| `threads` | How many threads dbt should use (default is `1`). | `8` |
| `retry_all` | Enables automatic retries for transient connection failures. | `true` |
| `connect_timeout` | Timeout for establishing a connection (in seconds). | `5` |
| `connect_retries` | Number of retry attempts for connection failures. | `3` |
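As a sketch, these optional fields sit at the same level as the required connection fields in the target output (the values shown are illustrative):

```yaml
dev:
  type: watsonx_spark
  method: http
  # ...required fields as shown in the host parameters section...
  threads: 8            # run up to 8 models in parallel
  retry_all: true       # retry transient connection failures
  connect_timeout: 5    # seconds to wait when opening a connection
  connect_retries: 3    # attempts before giving up
```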
## Limitations and considerations

- **Supports only HTTP**: No support for ODBC, Thrift, or session-based connections.
- **Limited dbt Cloud support**: Not fully compatible with dbt Cloud.
- **Metadata persistence**: Some dbt features, such as column descriptions, may not persist in all table formats.
---
title: "IBM watsonx.data Spark configurations"
id: "watsonx-spark-config"
---

## Instance requirements

To use IBM watsonx.data Spark with the `dbt-watsonx-spark` adapter, ensure the instance has an attached catalog that supports creating, renaming, altering, and dropping objects such as tables and views. The user connecting to the instance via the `dbt-watsonx-spark` adapter must have the necessary permissions for the target catalog.

For detailed setup instructions, including setting up watsonx.data, adding the Spark engine, configuring storages, registering data sources, and managing permissions, refer to the official IBM documentation:

- watsonx.data Software documentation: [IBM watsonx.data Software Guide](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x)
- watsonx.data SaaS documentation: [IBM watsonx.data SaaS Guide](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started)
## Session properties

With an IBM watsonx.data SaaS or Software instance, you can [set session properties](https://sparkdb.io/docs/current/sql/set-session.html) to modify the current configuration for your user session.

To temporarily adjust session properties for a specific dbt model or a group of models, use a [dbt hook](/reference/resource-configs/pre-hook-post-hook). For example:

```sql
{{
  config(
    pre_hook="set session query_max_run_time='10m'"
  )
}}
```
## Connector properties

IBM watsonx.data SaaS/Software supports various Spark-specific connector properties to control data representation, execution performance, and storage format.

For more details on supported configurations for each data source, refer to:

- [watsonx.data SaaS Catalog](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-reg_database)
- [watsonx.data Software Catalog](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-adding-data-source)
### Additional configuration

The `dbt-watsonx-spark` adapter allows additional configurations to be set in the catalog profile:

- `catalog`: Specifies the catalog to use for the Spark connection. The plugin can automatically detect the file format type (Iceberg, Hive, or Delta) based on the catalog type.
- `use_ssl`: Enables SSL encryption for secure connections.

Example configuration:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```
---

### File format configuration

The supported file formats depend on the catalog type:

- **Iceberg catalog:** Supports **Iceberg** tables.
- **Hive catalog:** Supports **Hive** tables.
- **Delta Lake catalog:** Supports **Delta** tables.
- **Hudi catalog:** Supports **Hudi** tables.

The plugin **automatically** detects the file format type based on the catalog specified in the configuration.

You can also specify the file format explicitly in your dbt models. For example:
```sql
{{
  config(
    materialized='table',
    file_format='iceberg'  -- or 'hive', 'delta', 'hudi', depending on the catalog
  )
}}
```
**For more details**, refer to the [documentation](https://spark.apache.org/docs/3.5.3/sql-ref-syntax.html#sql-syntax).

## Seeds and prepared statements

You can configure column data types either in the `dbt_project.yml` file or in property files, as supported by dbt. For more details on seed configuration and best practices, refer to the [dbt seed configuration documentation](/reference/seed-configs).
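For example, a minimal sketch of seed column type overrides in `dbt_project.yml` (the seed name `country_codes` and its columns are hypothetical):

```yaml
seeds:
  project_name:
    country_codes:
      +column_types:
        country_code: string    # override the inferred type
        population: bigint
```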
## Materializations

The `dbt-watsonx-spark` adapter supports table materializations, allowing you to manage how your data is stored and queried in watsonx.data Spark.

For further information on configuring materializations, refer to the [dbt materializations documentation](/reference/resource-configs/materialized).

### Table

The `dbt-watsonx-spark` adapter enables you to create and update tables through table materialization, making it easier to work with data in watsonx.data Spark.
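A minimal sketch of a table model follows; the upstream model `stg_payments` and its columns are hypothetical:

```sql
{{ config(materialized='table') }}

select
    customer_id,
    sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by customer_id
```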
### View

The adapter creates views by default if no materialization is explicitly specified.
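Because views are the default, a model file with no `config` block, like the hypothetical sketch below, is built as a view:

```sql
-- models/customer_orders.sql: no config block, so dbt builds this as a view
select
    order_id,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
```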
### Incremental

Incremental materialization is supported but requires additional configuration for partitioning and performance tuning.
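As a hedged sketch (the source model `stg_events` and the `unique_key` and `file_format` choices are illustrative assumptions, not required values), an incremental model might look like this:

```sql
{{
  config(
    materialized='incremental',
    file_format='iceberg',
    unique_key='event_id'
  )
}}

select
    event_id,
    event_type,
    updated_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than what the target already holds
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```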
#### Recommendations

- **Check permissions:** Ensure that the necessary permissions for table creation are enabled in the catalog or schema.
- **Check connector documentation:** Review the watsonx.data Spark documentation on [data ingestion in watsonx.data](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=data-overview-ingestion) to ensure it supports table creation and modification.
## Unsupported features

Despite its extensive capabilities, the `dbt-watsonx-spark` adapter has some limitations:

- **Incremental materialization**: Supported, but requires additional configuration for partitioning and performance tuning.
- **Materialized views**: Not natively supported in Spark SQL within watsonx.data.
- **Snapshots**: Not supported due to Spark’s lack of built-in snapshot functionality.
- **Performance considerations**:
  - Large datasets may require tuning of Spark configurations such as shuffle partitions and memory allocation.
  - Some transformations may be expensive due to Spark’s in-memory processing model.

By understanding these capabilities and constraints, users can maximize the effectiveness of dbt with watsonx.data Spark for scalable data transformations and analytics.
Review comment: Did you mean to say "This selection does not limit your ability to query into the schema specified/created by other users"?

Reply: exactly