Merge branch 'main' into new/use-case-databricks-workflow

sodadata · Aug 12, 2024 · 98aad90 · 98aad90
2 parents fbbfab2 + e3d8145
Showing 22 changed files with 140 additions and 12 deletions.
diff --git a/_includes/python-versions.md b/_includes/python-versions.md
@@ -0,0 +1,8 @@
+<details>
+  <summary style="color:#00BC7E">Python versions Soda supports</summary>
+  Soda officially supports Python versions 3.8, 3.9, and 3.10. <br />
+  Efforts to fully support Python 3.11 and 3.12 are ongoing.
+  <br /><br />
+  With Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built. <br /><br />
+  The same applies to Python 3.12, although anecdotal evidence indicates that 3.12 might not work in all scenarios due to dependency constraints.
+</details>
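
Review note on `_includes/python-versions.md`: the supported-version statement above could be enforced with a small guard before installing or running Soda. The following is a minimal sketch, not part of Soda Library; the names `SUPPORTED` and `is_supported_python` are hypothetical and exist only for illustration.

```python
import sys

# (major, minor) pairs that the include above lists as officially supported.
SUPPORTED = {(3, 8), (3, 9), (3, 10)}


def is_supported_python(version=None) -> bool:
    """Return True if the given (or running) interpreter is in SUPPORTED.

    `version` may be any sequence whose first two items are major and minor,
    such as sys.version_info or a plain tuple like (3, 9, 0).
    """
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Warning: Python {current} is outside the officially supported 3.8-3.10 range.")
```

A script could call `is_supported_python()` at startup and stop or warn when it returns `False`, rather than failing later on a dependency build.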
diff --git a/_release-notes/soda-agent-1.1.22.md b/_release-notes/soda-agent-1.1.22.md
@@ -0,0 +1,10 @@
+---
+name: "1.1.22"
+date: 2024-08-01
+products:
+  - soda-agent
+---
+## 1.1.22
+
+This release maps to [Soda Library 1.5.22]({% link release-notes/soda-library.md %}). <br />
+Access [Soda documentation]({% link soda/upgrade.md %}#upgrade-a-soda-agent) for instructions to upgrade a Soda Agent helm chart to use the latest version of Soda Library.
diff --git a/_release-notes/soda-agent-1.1.23.md b/_release-notes/soda-agent-1.1.23.md
@@ -0,0 +1,10 @@
+---
+name: "1.1.23"
+date: 2024-08-02
+products:
+  - soda-agent
+---
+## 1.1.23
+
+This release maps to [Soda Library 1.5.23]({% link release-notes/soda-library.md %}). <br />
+Access [Soda documentation]({% link soda/upgrade.md %}#upgrade-a-soda-agent) for instructions to upgrade a Soda Agent helm chart to use the latest version of Soda Library.
diff --git a/_release-notes/soda-library-1.5.21.md b/_release-notes/soda-library-1.5.21.md
@@ -0,0 +1,16 @@
+---
+name: "1.5.21"
+date: 2024-07-31
+products:
+  - soda-library
+---
+
+## 1.5.21 Fixes
+
+* Add nchar, nvarchar and binary to text types for profiling. by @jzalucki in #281
+* CLOUD 8061: alias table names in sql queries by @jzalucki in #280
+* Oracle data source properties prefix should be None instead of "None" when no service name is provided. by @jzalucki in #283
+* Sqlserver: use appropriate aggregate methods to build queries by @jzalucki in #284
+* Cross row count check should support custom identity. by @jzalucki in #285
+* Copyedit on frequency detection error message by @janet-can in #287
+* Chore: update auto-assignments by @milanaleksic in #289
diff --git a/_release-notes/soda-library-1.5.22.md b/_release-notes/soda-library-1.5.22.md
@@ -0,0 +1,14 @@
+---
+name: "1.5.22"
+date: 2024-08-01
+products:
+  - soda-library
+---
+
+## 1.5.22 Fixes
+
+* Observability: minimize metadata retrieval, do not push data into dis… by @m1n0 in #282
+* Handle SQL exception nicely for failed rows and user-defined check. by @jzalucki in #286
+* Spark: send discovery data despite errors. by @jzalucki in #290
+* Quote column names during observability partition detection. by @jzalucki in #288
+* Spark: failed rows should not be limited to max 100 total results. by @jzalucki in #292
diff --git a/_release-notes/soda-library-1.5.23.md b/_release-notes/soda-library-1.5.23.md
@@ -0,0 +1,11 @@
+---
+name: "1.5.23"
+date: 2024-08-02
+products:
+  - soda-library
+---
+
+## 1.5.23 Fixes
+
+* Attempt to always show freshness even if last 24 hours partition does not return data. by @jzalucki in #291
+* Observability: Always add partition column to profiling result by @m1n0 in #293
diff --git a/api-docs/reporting-api-to-overview-dashboards.md b/api-docs/reporting-api-to-overview-dashboards.md
@@ -25,7 +25,7 @@ This article offers an example for building a data quality reporting dashboard u
 
 ## Prerequisites and limitations
 * You have some knowledge of Python and are familiar with `pandas` and HTTP request libraries such as `httpx`.
-* You have installed Python 3.8 or later.
+* You have installed Python 3.8, 3.9, or 3.10.
 * You have a Soda Cloud account.
 * You have [installed Soda Library]({% link soda-library/install.md %}) in your environment and [connected]({% link soda-library/install.md %}#configure-soda) it to your Soda Cloud account.
 * You have used Soda Library to run at least one scan against data in a dataset.

diff --git a/soda-cl/sample-datasets.md b/soda-cl/sample-datasets.md
@@ -96,6 +96,8 @@ Note that you cannot use an `exclude_columns` configuration to disable sample ro
 
 ## Specify columns for failed row sampling
 
+{% include banner-upgrade.md %}
+
 Beyond collecting samples of data from datasets, you can also add a `samples columns` configuration to an individual check to specify the columns for which Soda must implicitly collect failed row sample values. Soda only collects the check's failed row samples for the columns you specify in the list, as in the `duplicate_count` example below. 
 
 Soda implicitly collects failed row samples for the following checks:

diff --git a/soda-cl/troubleshoot.md b/soda-cl/troubleshoot.md
@@ -22,6 +22,7 @@ parent: SodaCL reference
 [Errors when using in-check filters](#errors-when-using-in-check-filters)<br />
 [Using reference checks with Spark DataFrames](#using-reference-checks-with-spark-dataframes)<br />
 [Single quotes in valid values list result in error](#single-quotes-in-valid-values-list-result-in-error)<br />
+[Databricks issue with column names that begin with a number](#databricks-issue-with-column-names-that-being-with-a-number)<br />
 <br />
 
 <hr/>
@@ -242,6 +243,18 @@ checks for my_dataset:
 {% include single-quotes.md %}
 
 
+## Databricks issue with column names that begin with a number {#databricks-issue-with-column-names-that-being-with-a-number}
+
+**Problem:** When running scans on Databricks, Soda encounters an error on columns that begin with a number.
+
+**Solution:** In Databricks, when dealing with column names that start with numbers or contain special characters such as spaces, you typically need to use backticks to enclose the column identifier. This is because Databricks uses a SQL dialect that is similar to Hive SQL, which supports backticks for escaping identifiers. For example:
+```yaml
+checks for soda_test:
+  - missing_count(`1_bigint`):
+      name: test
+      fail: when > 0
+```
+
 ## Go further
 
 * Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.
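
Review note on the Databricks section added in `soda-cl/troubleshoot.md`: when checks are generated programmatically, the backtick rule described above can be applied automatically. The following is a minimal sketch under stated assumptions; `quote_column` and `missing_count_check` are hypothetical helpers, not Soda functions, and the rule for which names count as "plain" (letters, digits, and underscores, not starting with a digit) is an assumption made for illustration.

```python
import re

# Assumption: a name needs no quoting if it contains only letters, digits,
# and underscores and does not start with a digit.
_PLAIN_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_column(name: str) -> str:
    """Wrap a column name in backticks when it is not a plain identifier."""
    if _PLAIN_IDENTIFIER.fullmatch(name):
        return name
    # Inside a backtick-quoted identifier, a literal backtick is doubled.
    return "`" + name.replace("`", "``") + "`"


def missing_count_check(column: str) -> str:
    """Build the metric expression used as the key of a SodaCL check."""
    return f"missing_count({quote_column(column)})"


print(missing_count_check("1_bigint"))     # missing_count(`1_bigint`)
print(missing_count_check("customer_id"))  # missing_count(customer_id)
```

The first call reproduces the check key from the YAML example in the section; the second shows that ordinary column names are left untouched.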

diff --git a/soda-cloud/anomaly-dashboard.md b/soda-cloud/anomaly-dashboard.md
@@ -38,6 +38,7 @@ After establishing these patterns, Soda automatically detects anomalies relative
 &nbsp;&nbsp;&nbsp;&nbsp;[Activate an anomaly dashboard to an existing dataset](#activate-an-anomaly-dashboard-to-an-existing-dataset)<br />
 [About the anomaly dashboard](#about-the-anomaly-dashboard)<br />
 &nbsp;&nbsp;&nbsp;&nbsp;[Empty metrics tiles](#empty-metrics-tiles)<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[Known issues and limitations](#known-issues-and-limitations)<br />
 [Add anomaly notifications](#add-anomaly-notifications)<br />
 [About profiling and partitioning](#about-profiling-and-partitioning)<br />
 &nbsp;&nbsp;&nbsp;&nbsp;[Change the time partitioning column](#change-the-time-partitioning-column)<br />
@@ -105,6 +106,10 @@ If, after the anomaly detection algorithm has completed its pattern training, th
 * There is no column that contains NUMBER type data (INT, FLOAT, etc.) which an average metric check requires. Where it cannot detect a column with the necessary data type, Soda leaves the **Average** tile blank.
 
 
+### Known issues and limitations
+
+* The Soda anomaly dashboard does not profile columns that contain timestamps or dates. For such columns, Soda executes only a freshness check to validate data freshness; it does not detect anomalies in the values of date or timestamp columns.
+
 ## Add anomaly notifications
 
 {% include anomaly-notifs.md %}

diff --git a/soda-library/install.md b/soda-library/install.md
@@ -39,11 +39,13 @@ As a step in the **Get started roadmap**, this guide offers instructions to set
 
 To use Soda Library, you must have installed the following on your system.
 
-* Python 3.8 or greater. To check your existing version, use the CLI command: `python --version` or `python3 --version` <br /> 
+* Python 3.8, 3.9, or 3.10. To check your existing version, use the CLI command: `python --version` or `python3 --version` <br /> 
 If you have not already installed Python, consider using <a href="https://github.com/pyenv/pyenv/wiki" target="_blank">pyenv</a> to manage multiple versions of Python in your environment.
 * Pip 21.0 or greater. To check your existing version, use the CLI command: `pip --version`
 * A Soda Cloud account; see next section.
 
+{% include python-versions.md %}
+
 ## Create a Soda Cloud account
 
 1. In a browser, navigate to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in. 

diff --git a/soda-library/programmatic.md b/soda-library/programmatic.md
@@ -34,10 +34,12 @@ As a step in the **Get started roadmap**, this guide offers instructions to set
 
 To use Soda Library, you must have installed the following on your system.
 
-* Python 3.8 or greater
+* Python 3.8, 3.9, or 3.10
 * Pip 21.0 or greater
 * A Soda Cloud account; see next section.
 
+{% include python-versions.md %}
+
 ## Create a Soda Cloud account
 
 1. In a browser, navigate to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

diff --git a/soda/connect-dask.md b/soda/connect-dask.md
@@ -88,6 +88,8 @@ scan.set_verbose(True)
 scan.execute()
 ```
 
+<br />
+
 ### Load JSON file into Dataframe
 
 {% include code-header.html %}
@@ -103,6 +105,13 @@ df = pd.read_json('your_file.json')
 
 ...
 ```
+<br />
+
+## Troubleshoot
+
+**Problem:** You encounter errors when trying to install `soda-dask-pandas` in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, `Pre-scan validation failed, see logs for details.`
+
+**Workaround:** Uninstall the `soda-dask-pandas` package, then downgrade the version of Python your environment uses to Python 3.9. Install the `soda-dask-pandas` package again. 
 
 <br />
 <br />

diff --git a/soda/connect-troubleshoot.md b/soda/connect-troubleshoot.md
@@ -12,6 +12,7 @@ Last modified on {% last_modified_at %}
 [Snowflake proxy connection error](#go-further)<br />
 [Spark DataFrame object error](#spark-dataframe-object-error)<br />
 [ImportError during programmatic scan](#importerror-during-programmatic-scan)<br />
+[Scan error with Soda Dask and Pandas](#scan-error-with-soda-dask-and-pandas)<br />
 [Go further](#go-further)<br />
 <br />
 
@@ -41,6 +42,14 @@ Last modified on {% last_modified_at %}
 
 <br />
 
+## Scan error with Soda Dask and Pandas
+
+**Problem:** You encounter errors when trying to install `soda-dask-pandas` in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, `Pre-scan validation failed, see logs for details.`
+
+**Workaround:** Uninstall the `soda-dask-pandas` package, then downgrade the version of Python your environment uses to Python 3.9. Install the `soda-dask-pandas` package again. 
+
+<br />
+
 ## Go further
 
 * Access [Troubleshoot SodaCL]({% link soda-cl/troubleshoot.md %}) for help resolving issues running scans with SodaCL.

diff --git a/soda/integrate-alation.md b/soda/integrate-alation.md
@@ -92,7 +92,7 @@ Retrieve the Alation `datasource_container_id` for the `datasource_container_nam
 
 ## Enable API access to Alation with SSO
 
-If your Alation account employs single sign-on (SSO) access, you must <a href="https://developer.alation.com/dev/docs/creating-an-api-service-account" target="_blank">Create an API service account</a> for Soda to integrate with Alation. 
+If your Alation account employs single sign-on (SSO) access, you must <a href="https://developer.alation.com/dev/docs/creating-an-api-service-account" target="_blank">Create an API service account</a> for Soda to integrate with Alation.
 
 If your Alation account does not use SSO, skip this step and proceed to [Customize the catalog](#customize-the-catalog).
 
@@ -101,7 +101,7 @@ If your Alation account does not use SSO, skip this step and proceed to [Customi
 1. Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details.
 <br />
 In your Alation account, navigate to **Settings** > **Catalog Admin** > **Customize Catalog**. In the **Custom Fields** tab, create the following fields:
-* Under the **Pickers** heading, create a field for “Has DQ” with Options “True” and “False”.
+* Under the **Pickers** heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case-sensitive, so be sure to use these exact values.
 * Under the **Dates** heading, create a field for “Profile - Last Run”.
 * Under the **Rich Texts** heading, create the following fields:
   * “Soda DQ Overview”

diff --git a/soda/new-documentation.md b/soda/new-documentation.md
@@ -9,6 +9,20 @@ parent: Learning resources
 
 <br /> 
 
+#### August 8, 2024
+* Added content to clarify that Soda Library officially supports Python 3.8, 3.9, and 3.10.
+
+#### August 2, 2024
+* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.23 and Soda Agent 1.1.23.
+
+#### August 1, 2024
+* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.22 and Soda Agent 1.1.22.
+
+#### July 31, 2024
+* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.21.
+* Added [troubleshooting tip]({% link soda-cl/troubleshoot.md %}#databricks-issue-with-column-names-that-being-with-a-number) for running Soda scans on Databricks where column names begin with numbers.
+* Added [Known issues and limitations]({% link soda-cloud/anomaly-dashboard.md %}#known-issues-and-limitations) section to anomaly dashboard content.
+
 #### July 29, 2024
 * Added [release notes]({% link release-notes/all.md %}) documentation for Soda Core 3.3.13.
 

diff --git a/soda/quick-start-databricks.md b/soda/quick-start-databricks.md
@@ -34,9 +34,12 @@ To validate your account license or free trial, Soda Library must communicate wi
 ## Set up Soda
 
 Soda Library has the following requirements:
-* Python 3.8 or greater
+* Python 3.8, 3.9, or 3.10
 * Pip 21.0 or greater
 
+{% include python-versions.md %}
+
+<br />
 Download the notebook: <a href="soda-databricks-notebook.ipynb" download>Soda Databricks notebook</a>
 
 {% include code-header.html %}

diff --git a/soda/quick-start-migration.md b/soda/quick-start-migration.md
@@ -51,7 +51,7 @@ What follows is an abridged version of installing and configuring Soda for Postg
 
 1. In a browser, navigate to <a href="https://cloud.soda.io/signup" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
 2. Navigate to **your avatar** > **Profile**, then access the **API keys** tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.
-3. With Python 3.8 or greater and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment. 
+3. With Python 3.8, 3.9, or 3.10 and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment. 
 ```shell
 python3 -m venv .venv
 source .venv/bin/activate 

diff --git a/soda/quick-start-prod.md b/soda/quick-start-prod.md
@@ -44,7 +44,7 @@ Borrow from this guide to connect to your own data source, set up scan points in
 
 ## Install Soda from the command-line
 
-With Python 3.8 installed, the Engineer creates a virtual environment in Terminal, then installs the Soda package for PostgreSQL using the following command.
+With Python 3.8, 3.9, or 3.10 installed, the Engineer creates a virtual environment in Terminal, then installs the Soda package for PostgreSQL using the following command.
 
 {% include code-header.html %}
 ```shell

diff --git a/soda/quick-start-sip.md b/soda/quick-start-sip.md
@@ -30,7 +30,7 @@ Use the example data in this quick tutorial to set up and run a simple Soda scan
 This tutorial references a MacOS environment.
 
 1. Check the following prerequisites:
-* You have installed <a href="https://www.python.org/downloads/" target="_blank">Python 3.8</a> or greater. 
+* You have installed Python 3.8, 3.9, or 3.10. 
 * You have installed Pip 21.0 or greater.
 * (Optional) You have installed <a href="https://www.docker.com/products/docker-desktop/" target="_blank">Docker Desktop</a> and have access to <a href="https://github.com/" target="_blank">GitHub</a>, to set up an example data source.
 2. Visit <a href="https://cloud.soda.io/signup" target="_blank">https://cloud.soda.io/signup</a> to sign up for a Soda Cloud account which is free for a 45-day trial.<br />

diff --git a/soda/route-failed-rows.md b/soda/route-failed-rows.md
@@ -22,7 +22,7 @@ See also: [Examine failed row samples]({% link soda-cloud/failed-rows.md %})<br
 
 ## Prerequisites
 * a code or text editor such as PyCharm or Visual Studio Code
-* Python 3.8 or greater
+* Python 3.8, 3.9, or 3.10
 * Pip 21.0 or greater
 
 

diff --git a/soda/setup-guide.md b/soda/setup-guide.md
@@ -60,7 +60,7 @@ Use this setup for: <br />
 ✅ **Data migration**: Migrate good-quality data from one data source to another. See: [Test before data migration]({% link soda/quick-start-migration.md %})<br />
 
 Requirements:
-* Python 3.8 or greater
+* Python 3.8, 3.9, or 3.10
 * Pip 21.0 or greater
 * Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
 
@@ -122,7 +122,7 @@ Use this setup for:<br />
 ✅ **Databricks Notebook**: Invoke Soda data quality scans in a Databricks Notebook. See: [Add Soda to a Databricks notebook]({% link soda/quick-start-databricks.md %})<br />
 
 Requirements:
-* Python 3.8 or greater
+* Python 3.8, 3.9, or 3.10
 * Pip 21.0 or greater
 * Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)