
[FEATURE] ColumnValuesNonNull and ColumnValuesNonNullCount metrics #10959

Merged · 88 commits · Feb 25, 2025

Conversation

@NathanFarmer (Contributor) commented Feb 21, 2025:

  • Implement ColumnValuesNonNull and ColumnValuesNonNullCount metrics
  • Add special case for return type of Metrics with names that end in condition so they don't also return domain and value kwargs
  • Use new backend-specific testing pattern established by Expectation integration testing framework
  • Move integration test files into tests/integration/metrics like Expectation integration testing framework

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

NathanFarmer and others added 30 commits February 18, 2025 21:33
…ctations/great_expectations into f/gx-40/batch-compute-metrics
@billdirks (Contributor) left a comment:

Thanks for doing this! Overall it looks good. Do we need to delete between? That is, can we have the null metrics in addition to between?

Comment on lines 1329 to 1333
value = raw_metrics[metric_name][1]
# "condition" metrics return the domain and value kwargs
# we just want the value, which is the first item in the tuple
if metric_name.endswith(MetricNameSuffix.CONDITION) and isinstance(value, tuple):
value = value[0]
Contributor:

nit: I'd move this logic into a helper method, e.g. parse_metric_value.

I'd make an equivalent one for parse_metric_config_id.

These seem like lower-level details than the rest of this method.

Contributor (Author):

I think I understand the ask here. I refactored some of this into 2 helper methods and added NamedTuples to make it clearer what's going on.
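A minimal sketch of how the suggested parse_metric_value helper might look. The logic and tuple layout come from the snippet above; CONDITION_SUFFIX is a plain string stand-in for the PR's MetricNameSuffix.CONDITION enum member, so the example is self-contained:

```python
from typing import Any

# Stand-in for MetricNameSuffix.CONDITION; the real code uses an enum.
CONDITION_SUFFIX = "condition"


def parse_metric_value(metric_name: str, raw_value: Any) -> Any:
    """Return the bare metric value.

    Per the snippet above, "condition" metrics return a tuple whose first
    item is the value (the rest are domain and value kwargs); every other
    metric returns its value directly.
    """
    if metric_name.endswith(CONDITION_SUFFIX) and isinstance(raw_value, tuple):
        return raw_value[0]
    return raw_value
```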

@@ -1,3 +1,3 @@
from .batch.row_count import BatchRowCount
from .column_values.between import ColumnValuesBetween
from .column_values.non_null import ColumnValuesNonNull, ColumnValuesNonNullCount
Contributor:

Why are we deleting between? Isn't this a new metric in addition to between?

Contributor (Author):

Between was there to demonstrate how it would all look, but then I learned we needed to add this (different) metric to support another epic. Between was never tested, and that testing is really the bulk of the remaining work at this point.

Comment on lines +45 to +71
class ConditionValues(MetricResult[Union[pd.Series, "pyspark.sql.Column", "BinaryExpression"]]):
@classmethod
def validate_value_type(cls, value):
if isinstance(value, pd.Series):
return value

try:
from great_expectations.compatibility.pyspark import pyspark

if isinstance(value, pyspark.sql.Column):
return value
except (ImportError, AttributeError):
pass

try:
from great_expectations.compatibility.sqlalchemy import BinaryExpression

if isinstance(value, BinaryExpression):
return value
except (ImportError, AttributeError):
pass

raise ConditionValuesValueError(type(value))

@classmethod
def __get_validators__(cls):
yield cls.validate_value_type
Member:

👏 nice workaround!
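The workaround relies on pydantic v1's `__get_validators__` hook: pydantic calls this classmethod and runs each yielded callable in order when a field of that type is populated. A dependency-free sketch of the mechanism (ConditionLike and run_validators are illustrative stand-ins, not the PR's code; plain tuples play the role of the backend-specific types):

```python
from typing import Any, Callable, Iterator


class ConditionLike:
    """Illustrative stand-in for ConditionValues; tuples play the role of
    pd.Series / pyspark Column / SQLAlchemy BinaryExpression."""

    @classmethod
    def validate_value_type(cls, value: Any) -> Any:
        if isinstance(value, tuple):
            return value
        raise TypeError(f"unsupported condition value: {type(value)!r}")

    @classmethod
    def __get_validators__(cls) -> Iterator[Callable[[Any], Any]]:
        # pydantic v1 collects the validators yielded here and applies
        # them whenever a model field of this type is populated.
        yield cls.validate_value_type


def run_validators(cls: type, value: Any) -> Any:
    """Mimic (roughly) how pydantic v1 consumes __get_validators__."""
    for validator in cls.__get_validators__():
        value = validator(value)
    return value
```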

@joshua-stauffer (Member) left a comment:

looks good, thanks for fixing that gnarly type

@NathanFarmer NathanFarmer added this pull request to the merge queue Feb 25, 2025
Comment on lines +34 to +50
PANDAS_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
PandasFilesystemCsvDatasourceTestConfig(),
PandasDataFrameDatasourceTestConfig(),
]

SPARK_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
SparkFilesystemCsvDatasourceTestConfig(),
]

SQL_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
BigQueryDatasourceTestConfig(),
DatabricksDatasourceTestConfig(),
MSSQLDatasourceTestConfig(),
PostgreSQLDatasourceTestConfig(),
SnowflakeDatasourceTestConfig(),
SqliteDatasourceTestConfig(),
]
Member:

I suggest we move these to a common conftest module so we can test every metric against a standard set of backends.

We should probably restrict this list to only test against data sources that are officially supported by core. As far as I know, MSSQL is not on that list.
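One way that consolidation could look — a hypothetical tests/integration/metrics/conftest.py, with a plain dataclass standing in for the real per-backend config classes so the sketch is self-contained, and MSSQL dropped per the comment above:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DataSourceTestConfig:
    """Stand-in for the PR's per-backend test-config classes."""

    name: str


# Shared, core-supported backends only (MSSQL intentionally omitted).
PANDAS_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    DataSourceTestConfig("pandas_filesystem_csv"),
    DataSourceTestConfig("pandas_dataframe"),
]
SPARK_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    DataSourceTestConfig("spark_filesystem_csv"),
]
SQL_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    DataSourceTestConfig(n)
    for n in ("bigquery", "databricks", "postgresql", "snowflake", "sqlite")
]

# Every metric test can parametrize over this one standard set.
ALL_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    *PANDAS_DATA_SOURCES,
    *SPARK_DATA_SOURCES,
    *SQL_DATA_SOURCES,
]
```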

data_source_configs=PANDAS_DATA_SOURCES,
data=DATA_FRAME,
)
def test_success_pandas(self, batch_for_datasource) -> None:
Member:

can we type these test params?

Merged via the queue into develop with commit 7790b6c Feb 25, 2025
100 checks passed
@NathanFarmer NathanFarmer deleted the f/gx-43/add-column-values-metric branch February 25, 2025 15:07