# Add watsonx spark config and setup files #6895
Changes from 3 commits
522fcd3
11183b3
c44afe4
fb5bfdd
bd59429
4411403
1986925
35de721
149b231
1cd139d
ef3b100
27a31ff
1dc0af8
e87b6f5
05d793c
c6e9fdb
6e2db29
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
---
title: "IBM watsonx.data Spark setup"
description: "Read this guide to learn about the IBM watsonx.data Spark setup in dbt."
id: "watsonx-spark-setup"
meta:
  maintained_by: IBM
  authors: Bayan Albunayan, Reema Alzaid, Manjot Sidhu
  github_repo: 'IBM/dbt-watsonx-spark'
  pypi_package: 'dbt-watsonx-spark'
  min_core_version: v0.0.8
  cloud_support: 'Not Supported'
  min_supported_version: 'n/a'
  slack_channel_name:
  slack_channel_link:
  platform_name: IBM watsonx.data
  config_page: /reference/resource-configs/watsonx-spark-config
---

**The `dbt-watsonx-spark` adapter allows you to use dbt to transform and manage data on IBM watsonx.data Spark, leveraging its distributed SQL query engine capabilities.**

Before proceeding, ensure you have the following:

- An active IBM watsonx.data instance: [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started) or [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=installing-watsonxdata-developer-version)
- A provisioned **Native Spark engine** in watsonx.data: [IBM Cloud (SaaS)](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-prov_nspark) or [Software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-native-engine)
- An active **Spark query server** in your **Native Spark engine**

Read the official documentation for using **watsonx.data** with `dbt-watsonx-spark`:

- [Documentation for IBM Cloud and SaaS offerings](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-dbt_watsonx_spark_inst)
- [Documentation for IBM watsonx.data software](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=integration-data-build-tool-adapter-spark)

## Installing dbt-watsonx-spark

Use the following command to install the adapter:

```sh
python -m pip install dbt-core dbt-watsonx-spark
```

Note: From dbt v1.8, installing an adapter no longer installs `dbt-core` automatically. This is because adapters and dbt Core versions are decoupled to avoid overwriting existing `dbt-core` installations.

## Configuring `dbt-watsonx-spark`

For IBM watsonx.data-specific configuration, refer to [IBM watsonx.data configs](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=spark-configuration-setting-up-your-profile).

## Connecting to IBM watsonx.data Spark

To connect dbt with watsonx.data Spark, configure a profile in your `profiles.yml` file located in the `.dbt/` directory of your home folder. The following is an example configuration for connecting to IBM watsonx.data SaaS and Software instances:

<File name='~/.dbt/profiles.yml'>

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```

</File>
## Host parameters

To get the profile details, click **View connect details** when the query server is in **RUNNING** status in watsonx.data (both SaaS and Software). The Connection details page opens with the profile configuration.

Copy and paste the connection details into the `profiles.yml` file located in the `.dbt/` directory of your home folder.
The following profile fields are required to configure watsonx.data Spark connections:

| Option | Required/Optional | Description | Example |
| ---------- | ----------------------------- | ------------------------------------------------------------------------- | ----------------- |
| `method` | Required | Specifies the connection method to the Spark query server. Use `http`. | `http` |
| `schema` | Required | An existing schema within the Spark engine, or a new schema to create. | `spark_schema` |
| `host` | Required | Hostname of the watsonx.data console. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `https://dataplatform.cloud.ibm.com` |
| `uri` | Required | URI of the query server that is running on watsonx.data. For more information, see [Getting connection information](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=references-getting-connection-information#connection_info__conn_info_). | `/lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice` |
| `catalog` | Required | The catalog that is associated with the Spark engine. | `my_catalog` |
| `use_ssl` | Optional (default: `false`) | Specifies whether to use SSL. | `true` or `false` |
| `instance` | Required | For **SaaS**, set the CRN of watsonx.data. For **Software**, set the instance ID of watsonx.data. | `1726574045872688` |
| `user` | Required | Username for the watsonx.data instance. Authentication details differ between SaaS and Software instances. | `[email protected]` |
| `apikey` | Required | Your API key. For more information, see [Software](https://www.ibm.com/docs/en/software-hub/5.1.x?topic=started-generating-api-keys) or [SaaS](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#manage-user-keys). | `API key` |
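
Putting these fields together, a filled-in profile might look like the following sketch. All values are illustrative placeholders taken from the examples above; using dbt's built-in `env_var` function for the API key is one way to avoid committing credentials in plain text:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: spark_schema
      host: https://dataplatform.cloud.ibm.com
      uri: /lakehouse/api/v2/spark_engines/<sparkID>/query_servers/<queryID>/connect/cliservice
      catalog: my_catalog
      use_ssl: true
      auth:
        instance: 1726574045872688
        user: [email protected]
        # reads the key from the WATSONX_APIKEY environment variable at runtime
        apikey: "{{ env_var('WATSONX_APIKEY') }}"
```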

### Schemas and Catalogs

When selecting the catalog, ensure the user has read and write access. This selection does not limit your ability to query schemas specified or created by other users; it also serves as the default location for materialized `tables`, `views`, and `incremental` models.
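
If a particular model should land in a schema other than the profile default, dbt's standard per-model `schema` config can be used. This is a minimal sketch; `marts` is a hypothetical schema name:

```sql
{{
  config(
    materialized='table',
    schema='marts'
  )
}}

select 1 as id
```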

### SSL verification

- If the Spark instance uses an unsecured HTTP connection, set `use_ssl` to `false`.
- If the instance uses `HTTPS`, set it to `true`.

## Additional parameters

The following profile fields are optional. You can configure the instance session and dbt for the connection.

| Profile field | Description | Example |
| ------------------------ | ------------------------------------------------------------ | --------------------------------- |
| `threads` | The number of threads dbt uses (default: `1`). | `8` |
| `retry_all` | Enables automatic retries for transient connection failures. | `true` |
| `connect_timeout` | Timeout for establishing a connection (in seconds). | `5` |
| `connect_retries` | Number of retry attempts for connection failures. | `3` |
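
As a sketch, these optional fields sit at the same level as the required fields in the profile target:

```yaml
dev:
  type: watsonx_spark
  method: http
  # ...required fields as shown earlier...
  threads: 8
  retry_all: true
  connect_timeout: 5
  connect_retries: 3
```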

## Limitations & Considerations

- **Supports only HTTP**: No support for ODBC, Thrift, or session-based connections.
- **Limited dbt Cloud Support**: Not fully compatible with dbt Cloud.
- **Metadata Persistence**: Some dbt features, such as column descriptions, may not persist in all table formats.
---
title: "IBM watsonx.data Spark configurations"
id: "watsonx-spark-config"
---

## Instance requirements

To use IBM watsonx.data Spark with the `dbt-watsonx-spark` adapter, ensure the instance has an attached catalog that supports creating, renaming, altering, and dropping objects such as tables and views. The user connecting to the instance via the `dbt-watsonx-spark` adapter must have the necessary permissions for the target catalog.

For detailed setup instructions, including setting up watsonx.data, adding the Spark engine, configuring storage, registering data sources, and managing permissions, refer to the official IBM documentation:

- watsonx.data Software documentation: [IBM watsonx.data Software Guide](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x)
- watsonx.data SaaS documentation: [IBM watsonx.data SaaS Guide](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started)
## Session properties

With an IBM watsonx.data SaaS or Software instance, you can [set session properties](https://sparkdb.io/docs/current/sql/set-session.html) to modify the current configuration for your user session.

To temporarily adjust session properties for a specific dbt model or a group of models, use a [dbt hook](../../reference/resource-configs/pre-hook-post-hook). For example:

```sql
{{
  config(
    pre_hook="set session query_max_run_time='10m'"
  )
}}
```

## Connector properties

IBM watsonx.data SaaS/Software supports various Spark-specific connector properties to control data representation, execution performance, and storage format.

For more details on supported configurations for each data source, refer to:

- [watsonx.data SaaS Catalog](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-reg_database)
- [watsonx.data Software Catalog](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-adding-data-source)

### Extra configuration

The `dbt-watsonx-spark` adapter allows additional configurations to be set in the catalog profile:

- `catalog`: Specifies the catalog to use for the Spark connection. The plugin can automatically detect the file format type (Iceberg, Hive, or Delta) based on the catalog type.
- `use_ssl`: Enables SSL encryption for secure connections.

Example configuration:

```yaml
project_name:
  target: "dev"
  outputs:
    dev:
      type: watsonx_spark
      method: http
      schema: [schema name]
      host: [hostname]
      uri: [uri]
      catalog: [catalog name]
      use_ssl: false
      auth:
        instance: [Watsonx.data Instance ID]
        user: [username]
        apikey: [apikey]
```

---

### File format configuration

The supported file formats depend on the catalog type:

- **Iceberg Catalog:** Supports **Iceberg** tables.
- **Hive Catalog:** Supports **Hive** tables.
- **Delta Lake Catalog:** Supports **Delta** tables.
- **Hudi Catalog:** Supports **Hudi** tables.

The plugin **automatically** detects the file format type based on the catalog specified in the configuration.

You can also specify the file format explicitly in your dbt models. For example:

```sql
{{
  config(
    materialized='table',
    file_format='iceberg'
  )
}}
```

where `file_format` can be `iceberg`, `hive`, `delta`, or `hudi`.

**For more details**, refer to the [documentation](https://spark.apache.org/docs/3.5.3/sql-ref-syntax.html#sql-syntax).

## Seeds and prepared statements

You can configure column data types either in the `dbt_project.yml` file or in property files, as supported by dbt. For more details on seed configuration and best practices, refer to the [dbt seed configuration documentation](https://docs.getdbt.com/reference/seed-configs).
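
As an illustrative sketch, column types for a hypothetical `country_codes` seed could be pinned in `dbt_project.yml` like this (the seed name and the `country_code`/`country_name` columns are assumptions):

```yaml
seeds:
  project_name:
    country_codes:
      # force explicit types instead of letting dbt infer them
      +column_types:
        country_code: varchar(2)
        country_name: varchar(32)
```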

## Materializations

The `dbt-watsonx-spark` adapter supports table materializations, allowing you to manage how your data is stored and queried in watsonx.data Spark.

For further information on configuring materializations, refer to the [dbt materializations documentation](https://docs.getdbt.com/reference/resource-configs/materialized).

### Table

The `dbt-watsonx-spark` adapter enables you to create and update tables through table materialization, making it easier to work with data in watsonx.data Spark.
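
A minimal table model might look like the following sketch; the model body is illustrative, and `stg_orders` is a hypothetical upstream model:

```sql
{{
  config(
    materialized='table'
  )
}}

select
  order_id,
  sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by order_id
```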

### View

If no materialization is explicitly specified, the adapter creates views by default.

### Incremental

Incremental materialization is supported but requires additional configuration for partitioning and performance tuning.
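
A sketch of an incremental model using dbt's standard incremental pattern follows; the column names, the `unique_key`, and the upstream `stg_events` model are assumptions, so check the adapter's supported incremental strategies before relying on this:

```sql
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

select
  event_id,
  event_type,
  created_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```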

#### Recommendations

- **Check Permissions:** Ensure that the necessary permissions for table creation are enabled in the catalog or schema.
- **Check Connector Documentation:** Review [data ingestion in watsonx.data](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=data-overview-ingestion) to ensure it supports table creation and modification.

## Unsupported features

Despite its extensive capabilities, the `dbt-watsonx-spark` adapter has some limitations:

- **Incremental Materialization**: Supported but requires additional configuration for partitioning and performance tuning.
- **Materialized Views**: Not natively supported in Spark SQL within watsonx.data.
- **Snapshots**: Not supported due to Spark's lack of built-in snapshot functionality.
- **Performance Considerations**:
  - Large datasets may require tuning of Spark configurations such as shuffle partitions and memory allocation.
  - Some transformations may be expensive due to Spark's in-memory processing model.

By understanding these capabilities and constraints, users can maximize the effectiveness of dbt with watsonx.data Spark for scalable data transformations and analytics.