# Add watsonx spark config and setup files #6895
Changes from 3 commits
522fcd3
11183b3
c44afe4
fb5bfdd
bd59429
4411403
1986925
35de721
149b231
1cd139d
ef3b100
27a31ff
1dc0af8
e87b6f5
05d793c
c6e9fdb
6e2db29
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
---
title: "IBM watsonx.data Spark setup"
description: "Read this guide to learn about the IBM watsonx.data Spark setup in dbt."
id: "watsonx-spark-setup"
meta:
  maintained_by: IBM
  authors: Bayan Albunayan, Reema Alzaid, Manjot Sidhu
  github_repo: 'IBM/dbt-watsonx-spark'
  pypi_package: 'dbt-watsonx-spark'
  min_core_version: v0.0.8
  cloud_support: 'Not Supported'
  min_supported_version: 'n/a'
  slack_channel_name:
  slack_channel_link:
  platform_name: IBM watsonx.data
  config_page: /reference/resource-configs/watsonx-spark-config
---

**The `dbt-watsonx-spark` adapter allows you to use dbt to transform and manage data on IBM watsonx.data Spark, leveraging its distributed SQL query engine capabilities.**

Before proceeding, ensure you have the following:

- An active IBM watsonx.data instance: [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started) or [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=installing-watsonxdata-developer-version)
- A provisioned **Native Spark engine** in watsonx.data: [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-prov_nspark) or [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-native-engine)
- An active **Spark query server** in your **Native Spark engine**

Read the official documentation for using **watsonx.data** with `dbt-watsonx-spark`:

- [Documentation for IBM Cloud and SaaS offerings](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-dbt_watsonx_spark_inst)
- [Documentation for IBM watsonx.data software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=integration-data-build-tool-adapter-spark)

## Installing dbt-watsonx-spark

Use the following command to install the adapter:

```sh
python -m pip install dbt-core dbt-watsonx-spark
```

Note: From dbt v1.8, installing an adapter no longer installs `dbt-core` automatically. This is because adapters and dbt Core versions are decoupled to avoid overwriting existing `dbt-core` installations.

## Configuring `dbt-watsonx-spark`

For IBM watsonx.data-specific configuration, refer to [IBM watsonx.data configs](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-configuration-setting-up-your-profile).

## Connecting to IBM watsonx.data Spark

To connect dbt with watsonx.data Spark, configure a profile in your `profiles.yml` file located in the `.dbt/` directory of your home folder. The following is an example configuration for connecting to IBM watsonx.data SaaS and Software instances:

<File name='~/.dbt/profiles.yml'>

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```

</File>
## Host parameters

To get the profile details, click **View connect details** when the query server is in **RUNNING** status in watsonx.data (both SaaS and Software). The Connection details page opens with the profile configuration.

Copy and paste the connection details into the `profiles.yml` file located in the `.dbt/` directory of your home folder.
The following profile fields are required to configure watsonx.data Spark connections:

| Option | Required/Optional | Description | Example |
| ---------- | ----------------------------- | ------------------------------------------------------------------------- | ----------------- |
| `method` | Required | Specifies the connection method to the Spark query server. Use `http`. | `http` |
| `schema` | Required | An existing schema within the Spark engine, or a new schema to create. | `spark_schema` |
| `host` | Required | Hostname of the watsonx.data console. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `https://dataplatform.cloud.ibm.com` |
| `uri` | Required | URI of the query server that is running on watsonx.data. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `/lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice` |
| `catalog` | Required | The catalog that is associated with the Spark engine. | `my_catalog` |
| `use_ssl` | Optional (default: `false`) | Specifies whether to use SSL. | `true` or `false` |
| `instance` | Required | For **SaaS**, set the CRN of watsonx.data. For **Software**, set the instance ID of watsonx.data. | `1726574045872688` |
| `user` | Required | Username for the watsonx.data instance. Authentication details differ between SaaS and Software instances. | `[email protected]` |
| `apikey` | Required | Your API key. For more information, see [Software](https://www.ibm.com/docs/en/software-hub/5.1.x?topic=started-generating-api-keys) or [SaaS](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#manage-user-keys). | `API key` |
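
Putting these fields together, a filled-in profile might look like the following sketch. All values are illustrative placeholders taken from the examples above; using dbt's built-in `env_var` function for the API key is one way to avoid committing credentials in plain text:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: spark_schema
      host: https://dataplatform.cloud.ibm.com
      uri: /lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice
      catalog: my_catalog
      use_ssl: true
      auth:
        instance: 1726574045872688
        user: [email protected]
        # reads the key from the WATSONX_APIKEY environment variable at runtime
        apikey: "{{ env_var('WATSONX_APIKEY') }}"
```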

### Schemas and Catalogs

When selecting the catalog, ensure the user has read and write access. This selection does not limit your ability to query schemas specified or created by other users; it also serves as the default location for materialized `tables`, `views`, and `incremental` models.
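
If a particular model should land in a schema other than the profile default, dbt's standard per-model `schema` config can be used. This is a minimal sketch; `marts` is a hypothetical schema name:

```sql
{{
  config(
    materialized='table',
    schema='marts'
  )
}}

select 1 as id
```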

### SSL verification

- If the Spark instance uses an unsecured HTTP connection, set `use_ssl` to `false`.
- If the instance uses `HTTPS`, set it to `true`.

## Additional parameters

The following profile fields are optional. You can configure the instance session and dbt for the connection.

| Profile field | Description | Example |
| ------------------------ | ------------------------------------------------------------ | --------------------------------- |
| `threads` | The number of threads dbt uses (default: `1`). | `8` |
| `retry_all` | Enables automatic retries for transient connection failures. | `true` |
| `connect_timeout` | Timeout for establishing a connection (in seconds). | `5` |
| `connect_retries` | Number of retry attempts for connection failures. | `3` |
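
As a sketch, these optional fields sit at the same level as the required fields in the profile target:

```yaml
dev:
  type: watsonx_spark
  method: http
  # ...required fields as shown earlier...
  threads: 8
  retry_all: true
  connect_timeout: 5
  connect_retries: 3
```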

## Limitations & Considerations

- **Supports only HTTP**: No support for ODBC, Thrift, or session-based connections.
- **Limited dbt Cloud Support**: Not fully compatible with dbt Cloud.
- **Metadata Persistence**: Some dbt features, such as column descriptions, may not persist in all table formats.
---
title: "IBM watsonx.data Spark configurations"
id: "watsonx-spark-config"
---

## Instance requirements

To use IBM watsonx.data Spark with the `dbt-watsonx-spark` adapter, ensure the instance has an attached catalog that supports creating, renaming, altering, and dropping objects such as tables and views. The user connecting to the instance via the `dbt-watsonx-spark` adapter must have the necessary permissions for the target catalog.

For detailed setup instructions, including setting up watsonx.data, adding the Spark engine, configuring storage, registering data sources, and managing permissions, refer to the official IBM documentation:

- watsonx.data Software documentation: [IBM watsonx.data Software Guide](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x)
- watsonx.data SaaS documentation: [IBM watsonx.data SaaS Guide](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started)
## Session properties

With an IBM watsonx.data SaaS or Software instance, you can [set session properties](https://sparkdb.io/docs/current/sql/set-session.html) to modify the current configuration for your user session.

To temporarily adjust session properties for a specific dbt model or a group of models, use a [dbt hook](../../reference/resource-configs/pre-hook-post-hook). For example:

```sql
{{
  config(
    pre_hook="set session query_max_run_time='10m'"
  )
}}
```

## Connector properties

IBM watsonx.data SaaS/Software supports various Spark-specific connector properties to control data representation, execution performance, and storage format.

For more details on supported configurations for each data source, refer to:

- [watsonx.data SaaS Catalog](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-reg_database)
- [watsonx.data Software Catalog](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-adding-data-source)

### Extra configuration

The `dbt-watsonx-spark` adapter allows additional configurations to be set in the catalog profile:

- `catalog`: Specifies the catalog to use for the Spark connection. The plugin can automatically detect the file format type (Iceberg, Hive, or Delta) based on the catalog type.
- `use_ssl`: Enables SSL encryption for secure connections.

Example configuration:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```

---

### File format configuration

The supported file formats depend on the catalog type:

- **Iceberg Catalog:** Supports **Iceberg** tables.
- **Hive Catalog:** Supports **Hive** tables.
- **Delta Lake Catalog:** Supports **Delta** tables.
- **Hudi Catalog:** Supports **Hudi** tables.

The plugin **automatically** detects the file format type based on the catalog specified in the configuration.

You can also specify the file format explicitly in your dbt models. For example:

```sql
{{
  config(
    materialized='table',
    file_format='iceberg'
  )
}}
```

where `file_format` can be `iceberg`, `hive`, `delta`, or `hudi`.

**For more details**, refer to the [documentation](https://spark.apache.org/docs/3.5.3/sql-ref-syntax.html#sql-syntax).

## Seeds and prepared statements

You can configure column data types either in the `dbt_project.yml` file or in property files, as supported by dbt. For more details on seed configuration and best practices, refer to the [dbt seed configuration documentation](https://docs.getdbt.com/reference/seed-configs).
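
As an illustrative sketch, column types for a hypothetical `country_codes` seed could be pinned in `dbt_project.yml` like this (the seed name and the `country_code`/`country_name` columns are assumptions):

```yaml
seeds:
  project_name:
    country_codes:
      # force explicit types instead of letting dbt infer them
      +column_types:
        country_code: varchar(2)
        country_name: varchar(32)
```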

## Materializations

The `dbt-watsonx-spark` adapter supports table materializations, allowing you to manage how your data is stored and queried in watsonx.data Spark.

For further information on configuring materializations, refer to the [dbt materializations documentation](https://docs.getdbt.com/reference/resource-configs/materialized).

### Table

The `dbt-watsonx-spark` adapter enables you to create and update tables through table materialization, making it easier to work with data in watsonx.data Spark.
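
A minimal table model might look like the following sketch; the model body is illustrative, and `stg_orders` is a hypothetical upstream model:

```sql
{{
  config(
    materialized='table'
  )
}}

select
  order_id,
  sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by order_id
```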

### View

If no materialization is explicitly specified, the adapter creates views by default.

### Incremental

Incremental materialization is supported but requires additional configuration for partitioning and performance tuning.
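
A sketch of an incremental model using dbt's standard incremental pattern follows; the column names, the `unique_key`, and the upstream `stg_events` model are assumptions, so check the adapter's supported incremental strategies before relying on this:

```sql
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

select
  event_id,
  event_type,
  created_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```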

#### Recommendations

- **Check Permissions:** Ensure that the necessary permissions for table creation are enabled in the catalog or schema.
- **Check Connector Documentation:** Review [data ingestion in watsonx.data](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=data-overview-ingestion) to ensure it supports table creation and modification.

## Unsupported features

Despite its extensive capabilities, the `dbt-watsonx-spark` adapter has some limitations:

- **Incremental Materialization**: Supported but requires additional configuration for partitioning and performance tuning.
- **Materialized Views**: Not natively supported in Spark SQL within watsonx.data.
- **Snapshots**: Not supported due to Spark's lack of built-in snapshot functionality.
- **Performance Considerations**:
  - Large datasets may require tuning of Spark configurations such as shuffle partitions and memory allocation.
  - Some transformations may be expensive due to Spark's in-memory processing model.

By understanding these capabilities and constraints, users can maximize the effectiveness of dbt with watsonx.data Spark for scalable data transformations and analytics.