Merge branch 'main' into new/use-case-databricks-workflow
janet-can committed Aug 12, 2024
2 parents fbbfab2 + e3d8145 commit 98aad90
Showing 22 changed files with 140 additions and 12 deletions.
8 changes: 8 additions & 0 deletions _includes/python-versions.md
@@ -0,0 +1,8 @@
<details>
<summary style="color:#00BC7E">Python versions Soda supports</summary>
Soda officially supports Python versions 3.8, 3.9, and 3.10. <br />
Efforts to fully support Python 3.11 and 3.12 are ongoing.
<br /><br />
With Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built. <br /><br />
The same applies to Python 3.12, although anecdotal evidence indicates that 3.12 might not work in all scenarios due to dependency constraints.
</details>
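As a rough illustration of the version guidance above, the following sketch checks whether the running interpreter falls in the officially supported range. The names `SUPPORTED_MINORS` and `is_supported_python` are hypothetical helpers for this example and are not part of Soda Library.

```python
import sys

# Assumption: the supported set mirrors the text above (3.8, 3.9, 3.10).
SUPPORTED_MINORS = {(3, 8), (3, 9), (3, 10)}


def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter's major.minor is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    if is_supported_python():
        print(f"Python {current} is in the officially supported range.")
    else:
        # Per the note above, newer versions may need dependencies built from source.
        print(f"Python {current} is outside the officially supported range.")
```

Running a check like this before installation may help surface a version mismatch earlier than a dependency resolution error would.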
10 changes: 10 additions & 0 deletions _release-notes/soda-agent-1.1.22.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
name: "1.1.22"
date: 2024-08-01
products:
- soda-agent
---
## 1.1.22

This release maps to [Soda Library 1.5.22]({% link release-notes/soda-library.md %}). <br />
Access [Soda documentation]({% link soda/upgrade.md %}#upgrade-a-soda-agent) for instructions to upgrade a Soda Agent helm chart to use the latest version of Soda Library.
10 changes: 10 additions & 0 deletions _release-notes/soda-agent-1.1.23.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
name: "1.1.23"
date: 2024-08-02
products:
- soda-agent
---
## 1.1.23

This release maps to [Soda Library 1.5.23]({% link release-notes/soda-library.md %}). <br />
Access [Soda documentation]({% link soda/upgrade.md %}#upgrade-a-soda-agent) for instructions to upgrade a Soda Agent helm chart to use the latest version of Soda Library.
16 changes: 16 additions & 0 deletions _release-notes/soda-library-1.5.21.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
---
name: "1.5.21"
date: 2024-07-31
products:
- soda-library
---

## 1.5.21 Fixes

* Add nchar, nvarchar and binary to text types for profiling. by @jzalucki in #281
* CLOUD 8061: alias table names in sql queries by @jzalucki in #280
* Oracle data source properties prefix should be None instead of "None" when no service name is provided. by @jzalucki in #283
* Sqlserver: use appropriate aggregate methods to build queries by @jzalucki in #284
* Cross row count check should support custom identity. by @jzalucki in #285
* Copyedit on frequency detection error message by @janet-can in #287
* Chore: update auto-assignments by @milanaleksic in #289
14 changes: 14 additions & 0 deletions _release-notes/soda-library-1.5.22.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
---
name: "1.5.22"
date: 2024-08-01
products:
- soda-library
---

## 1.5.22 Fixes

* Observability: minimize metadata retrieval, do not push data into dis… by @m1n0 in #282
* Handle SQL exception nicely for failed rows and user-defined check. by @jzalucki in #286
* Spark: send discovery data despite errors. by @jzalucki in #290
* Quote column names during observability partition detection. by @jzalucki in #288
* Spark: failed rows should not be limited to max 100 total results. by @jzalucki in #292
11 changes: 11 additions & 0 deletions _release-notes/soda-library-1.5.23.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
---
name: "1.5.23"
date: 2024-08-02
products:
- soda-library
---

## 1.5.23 Fixes

* Attempt to always show freshness even if last 24 hours partition does not return data. by @jzalucki in #291
* Observability: Always add partition column to profiling result by @m1n0 in #293
2 changes: 1 addition & 1 deletion api-docs/reporting-api-to-overview-dashboards.md
@@ -25,7 +25,7 @@ This article offers an example for building a data quality reporting dashboard u

## Prerequisites and limitations
* You have some knowledge of Python and are familiar with `pandas` and HTTP request libraries such as `httpx`.
* You have installed Python 3.8 or later.
* You have installed Python 3.8, 3.9, or 3.10.
* You have a Soda Cloud account.
* You have [installed Soda Library]({% link soda-library/install.md %}) in your environment and [connected]({% link soda-library/install.md %}#configure-soda) it to your Soda Cloud account.
* You have used Soda Library to run at least one scan against data in a dataset.
2 changes: 2 additions & 0 deletions soda-cl/sample-datasets.md
@@ -96,6 +96,8 @@ Note that you cannot use an `exclude_columns` configuration to disable sample ro

## Specify columns for failed row sampling

{% include banner-upgrade.md %}

Beyond collecting samples of data from datasets, you can also use a `samples columns` configuration to an individual check to specify the columns for which Soda must implicitly collect failed row sample values. Soda only collects the check's failed row samples for the columns you specify in the list, as in the `duplicate_count` example below.

Soda implicitly collects failed row samples for the following checks:
13 changes: 13 additions & 0 deletions soda-cl/troubleshoot.md
@@ -22,6 +22,7 @@ parent: SodaCL reference
[Errors when using in-check filters](#errors-when-using-in-check-filters)<br />
[Using reference checks with Spark DataFrames](#using-reference-checks-with-spark-dataframes)<br />
[Single quotes in valid values list result in error](#single-quotes-in-valid-values-list-result-in-error)<br />
[Databricks issue with column names that begin with a number](#databricks-issue-with-column-names-that-being-with-a-number)<br />
<br />

<hr/>
@@ -242,6 +243,18 @@ checks for my_dataset:
{% include single-quotes.md %}


## Databricks issue with column names that being with a number

**Problem:** When running scans on Databricks, Soda encounters an error on columns that begin with a number.

**Solution:** In Databricks, when dealing with column names that start with numbers or contain special characters such as spaces, you typically need to use backticks to enclose the column identifier. This is because Databricks uses a SQL dialect that is similar to Hive SQL, which supports backticks for escaping identifiers. For example:
```yaml
checks for soda_test:
  - missing_count(`1_bigint`):
      name: test
      fail: when > 0
```
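When check files are generated programmatically, the quoting described above could be applied with a small helper. The sketch below is illustrative only: `quote_databricks_identifier` and `missing_count_check` are hypothetical names, and doubling an embedded backtick follows Spark SQL's identifier escaping, which is assumed here to pass through SodaCL unchanged.

```python
def quote_databricks_identifier(name: str) -> str:
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def missing_count_check(dataset: str, column: str) -> str:
    """Build SodaCL text shaped like the example above (hypothetical helper)."""
    quoted = quote_databricks_identifier(column)
    return (
        f"checks for {dataset}:\n"
        f"  - missing_count({quoted}):\n"
        f"      name: test\n"
        f"      fail: when > 0\n"
    )


# Print the generated check for a column name that begins with a number.
print(missing_count_check("soda_test", "1_bigint"))
```

Quoting every column name, rather than only those that start with a digit, keeps the helper simple and avoids having to detect which names need escaping.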
## Go further
* Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.
5 changes: 5 additions & 0 deletions soda-cloud/anomaly-dashboard.md
@@ -38,6 +38,7 @@ After establishing these patterns, Soda automatically detects anomalies relative
&nbsp;&nbsp;&nbsp;&nbsp;[Activate an anomaly dashboard to an existing dataset](#activate-an-anomaly-dashboard-to-an-existing-dataset)<br />
[About the anomaly dashboard](#about-the-anomaly-dashboard)<br />
&nbsp;&nbsp;&nbsp;&nbsp;[Empty metrics tiles](#empty-metrics-tiles)<br />
&nbsp;&nbsp;&nbsp;&nbsp;[Known issues and limitations](#known-issues-and-limitations)<br />
[Add anomaly notifications](#add-anomaly-notifications)<br />
[About profiling and partitioning](#about-profiling-and-partitioning)<br />
&nbsp;&nbsp;&nbsp;&nbsp;[Change the time partitioning column](#change-the-time-partitioning-column)<br />
@@ -105,6 +106,10 @@ If, after the anomaly detection algorithm has completed its pattern training, th
* There is no column that contains NUMBER type data (INT, FLOAT, etc.) which an average metric check requires. Where it cannot detect a column with the necessary data type, Soda leaves the **Average** tile blank.
### Known issues and limitations
* The Soda anomaly dashboard does not profile columns that contain timestamps or dates. For such columns, Soda executes only a freshness check to validate data freshness; it does not detect anomalies in the date or timestamp values themselves.
## Add anomaly notifications
{% include anomaly-notifs.md %}
4 changes: 3 additions & 1 deletion soda-library/install.md
@@ -39,11 +39,13 @@ As a step in the **Get started roadmap**, this guide offers instructions to set

To use Soda Library, you must have installed the following on your system.

* Python 3.8 or greater. To check your existing version, use the CLI command: `python --version` or `python3 --version` <br />
* Python 3.8, 3.9, or 3.10. To check your existing version, use the CLI command: `python --version` or `python3 --version` <br />
If you have not already installed Python, consider using <a href="https://github.com/pyenv/pyenv/wiki" target="_blank">pyenv</a> to manage multiple versions of Python in your environment.
* Pip 21.0 or greater. To check your existing version, use the CLI command: `pip --version`
* A Soda Cloud account; see next section.

{% include python-versions.md %}

## Create a Soda Cloud account

1. In a browser, navigate to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
4 changes: 3 additions & 1 deletion soda-library/programmatic.md
@@ -34,10 +34,12 @@ As a step in the **Get started roadmap**, this guide offers instructions to set

To use Soda Library, you must have installed the following on your system.

* Python 3.8 or greater
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater
* A Soda Cloud account; see next section.

{% include python-versions.md %}

## Create a Soda Cloud account

1. In a browser, navigate to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
9 changes: 9 additions & 0 deletions soda/connect-dask.md
@@ -88,6 +88,8 @@ scan.set_verbose(True)
scan.execute()
```

<br />

### Load JSON file into Dataframe

{% include code-header.html %}
@@ -103,6 +105,13 @@ df = pd.read_json('your_file.json')

...
```
<br />

## Troubleshoot

**Problem:** You encounter errors when trying to install `soda-dask-pandas` in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, `Pre-scan validation failed, see logs for details.`

**Workaround:** Uninstall the `soda-dask-pandas` package, then downgrade the version of Python your environment uses to Python 3.9. Install the `soda-dask-pandas` package again.
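Before applying the workaround, it may help to confirm which Python version and package version the environment actually uses. The following is a minimal diagnostic sketch using only the standard library; `installed_version` and `describe_environment` are hypothetical helper names, not Soda functions.

```python
import sys
from importlib import metadata

PACKAGE = "soda-dask-pandas"  # package name taken from the text above


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(package: str = PACKAGE) -> str:
    """Summarize the interpreter version and the package's install status."""
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    version = installed_version(package)
    status = f"version {version}" if version else "not installed"
    return f"Python {py}; {package} {status}"


print(describe_environment())
```

If the summary reports Python 3.11 alongside an installed package, the environment matches the problem described above.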

<br />
<br />
9 changes: 9 additions & 0 deletions soda/connect-troubleshoot.md
@@ -12,6 +12,7 @@ Last modified on {% last_modified_at %}
[Snowflake proxy connection error](#go-further)<br />
[Spark DataFrame object error](#spark-dataframe-object-error)<br />
[ImportError during programmatic scan](#importerror-during-programmatic-scan)<br />
[Scan error with Soda Dask and Pandas](#scan-error-with-soda-dask-and-pandas)<br />
[Go further](#go-further)<br />
<br />

@@ -41,6 +42,14 @@ Last modified on {% last_modified_at %}

<br />

## Scan error with Soda Dask and Pandas

**Problem:** You encounter errors when trying to install `soda-dask-pandas` in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, `Pre-scan validation failed, see logs for details.`

**Workaround:** Uninstall the `soda-dask-pandas` package, then downgrade the version of Python your environment uses to Python 3.9. Install the `soda-dask-pandas` package again.

<br />

## Go further

* Access [Troubleshoot SodaCL]({% link soda-cl/troubleshoot.md %}) for help resolving issues running scans with SodaCL.
4 changes: 2 additions & 2 deletions soda/integrate-alation.md
@@ -92,7 +92,7 @@ Retrieve the Alation `datasource_container_id` for the `datasource_container_nam

## Enable API access to Alation with SSO

If your Alation account employs single sign-on (SSO) access, you must <a href="https://developer.alation.com/dev/docs/creating-an-api-service-account" target="_blank">Create an API service account</a> for Soda to integrate with Alation.
If your Alation account employs single sign-on (SSO) access, you must <a href="https://developer.alation.com/dev/docs/creating-an-api-service-account" target="_blank">Create an API service account</a> for Soda to integrate with Alation.

If your Alation account does not use SSO, skip this step and proceed to [Customize the catalog](#customize-the-catalog).

@@ -101,7 +101,7 @@ If your Alation account does not use SSO, skip this step and proceed to [Customi
1. Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details.
<br />
In your Alation account, navigate to **Settings** > **Catalog Admin** > **Customize Catalog**. In the **Custom Fields** tab, create the following fields:
* Under the **Pickers** heading, create a field for “Has DQ” with Options “True” and “False”.
* Under the **Pickers** heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case-sensitive, so be sure to use these exact values.
* Under the **Dates** heading, create a field for “Profile - Last Run”.
* Under the **Rich Texts** heading, create the following fields:
* “Soda DQ Overview”
14 changes: 14 additions & 0 deletions soda/new-documentation.md
@@ -9,6 +9,20 @@ parent: Learning resources

<br />

#### August 8, 2024
* Added content to clarify that Soda Library officially supports Python 3.8, 3.9, and 3.10.

#### August 2, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.23 and Soda Agent 1.1.23.

#### August 1, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.22 and Soda Agent 1.1.22.

#### July 31, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.5.21.
* Added [troubleshooting tip]({% link soda-cl/troubleshoot.md %}#databricks-issue-with-column-names-that-being-with-a-number) for running Soda scans on Databricks where column names begin with numbers.
* Added [Known issues and limitations]({% link soda-cloud/anomaly-dashboard.md %}#known-issues-and-limitations) section to anomaly dashboard content.

#### July 29, 2024
* Added [release notes]({% link release-notes/all.md %}) documentation for Soda Core 3.3.13.

5 changes: 4 additions & 1 deletion soda/quick-start-databricks.md
@@ -34,9 +34,12 @@ To validate your account license or free trial, Soda Library must communicate wi
## Set up Soda

Soda Library has the following requirements:
* Python 3.8 or greater
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater

{% include python-versions.md %}

<br />
Download the notebook: <a href="soda-databricks-notebook.ipynb" download>Soda Databricks notebook</a>

{% include code-header.html %}
2 changes: 1 addition & 1 deletion soda/quick-start-migration.md
@@ -51,7 +51,7 @@ What follows is an abridged version of installing and configuring Soda for Postg

1. In a browser, navigate to <a href="https://cloud.soda.io/signup" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
2. Navigate to **your avatar** > **Profile**, then access the **API keys** tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.
3. With Python 3.8 or greater and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment.
3. With Python 3.8, 3.9, or 3.10 and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment.
```shell
python3 -m venv .venv
source .venv/bin/activate
2 changes: 1 addition & 1 deletion soda/quick-start-prod.md
@@ -44,7 +44,7 @@ Borrow from this guide to connect to your own data source, set up scan points in

## Install Soda from the command-line

With Python 3.8 installed, the Engineer creates a virtual environment in Terminal, then installs the Soda package for PostgreSQL using the following command.
With Python 3.8, 3.9, or 3.10 installed, the Engineer creates a virtual environment in Terminal, then installs the Soda package for PostgreSQL using the following command.

{% include code-header.html %}
```shell
2 changes: 1 addition & 1 deletion soda/quick-start-sip.md
@@ -30,7 +30,7 @@ Use the example data in this quick tutorial to set up and run a simple Soda scan
This tutorial references a MacOS environment.

1. Check the following prerequisites:
* You have installed <a href="https://www.python.org/downloads/" target="_blank">Python 3.8</a> or greater.
* You have installed Python 3.8, 3.9, or 3.10.
* You have installed Pip 21.0 or greater.
* (Optional) You have installed <a href="https://www.docker.com/products/docker-desktop/" target="_blank">Docker Desktop</a> and have access to <a href="https://github.com/" target="_blank">GitHub</a>, to set up an example data source.
2. Visit <a href="https://cloud.soda.io/signup" target="_blank">https://cloud.soda.io/signup</a> to sign up for a Soda Cloud account which is free for a 45-day trial.<br />
2 changes: 1 addition & 1 deletion soda/route-failed-rows.md
@@ -22,7 +22,7 @@ See also: [Examine failed row samples]({% link soda-cloud/failed-rows.md %})<br

## Prerequisites
* a code or text editor such as PyCharm or Visual Studio Code
* Python 3.8 or greater
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater


4 changes: 2 additions & 2 deletions soda/setup-guide.md
@@ -60,7 +60,7 @@ Use this setup for: <br />
**Data migration**: Migrate good-quality data from one data source to another. See: [Test before data migration]({% link soda/quick-start-migration.md %})<br />

Requirements:
* Python 3.8 or greater
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater
* Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)

@@ -122,7 +122,7 @@ Use this setup for:<br />
**Databricks Notebook**: Invoke Soda data quality scans in a Databricks Notebook. See: [Add Soda to a Databricks notebook]({% link soda/quick-start-databricks.md %})<br />

Requirements:
* Python 3.8 or greater
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater
* Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
