[FEATURE] ColumnValuesNonNull and ColumnValuesNonNullCount metrics #10959

Conversation
Thanks for doing this! Overall it looks good. Do we need to delete between? That is, can we have the null metrics in addition to between?
```python
value = raw_metrics[metric_name][1]
# "condition" metrics return the domain and value kwargs
# we just want the value, which is the first item in the tuple
if metric_name.endswith(MetricNameSuffix.CONDITION) and isinstance(value, tuple):
    value = value[0]
```
nit: I'd move this logic into a helper method, e.g. `parse_metric_value`. I'd make an equivalent one for `parse_metric_config_id`. These seem like lower-level details than this method reads.
I think I understand the ask here. I refactored some of this into 2 helper methods and added `NamedTuple`s to make it clearer what's going on.
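For readers following along, one possible shape for that refactor, built from the snippet above. The names `ParsedMetric` and `parse_metric_value`, the `NamedTuple` fields, and the stubbed `MetricNameSuffix` value are illustrative assumptions, not necessarily what the PR landed:

```python
from typing import Any, NamedTuple


class MetricNameSuffix:
    # Stubbed for a self-contained sketch; the real enum lives in GX.
    CONDITION = ".condition"  # assumed suffix value


class ParsedMetric(NamedTuple):  # hypothetical name
    metric_name: str
    value: Any


def parse_metric_value(metric_name: str, raw_metrics: dict) -> ParsedMetric:
    """Pull a metric's value out of the raw results, unwrapping condition metrics."""
    value = raw_metrics[metric_name][1]
    # "condition" metrics return the domain and value kwargs;
    # we just want the value, which is the first item in the tuple
    if metric_name.endswith(MetricNameSuffix.CONDITION) and isinstance(value, tuple):
        value = value[0]
    return ParsedMetric(metric_name=metric_name, value=value)
```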
```diff
@@ -1,3 +1,3 @@
 from .batch.row_count import BatchRowCount
-from .column_values.between import ColumnValuesBetween
+from .column_values.non_null import ColumnValuesNonNull, ColumnValuesNonNullCount
```
Why are we deleting between? Isn't this a new metric in addition to between?
Between was there to demonstrate how it would all look, but then I learned we needed to add this (different) metric to support another epic. Between was never tested, which is really the bulk of the remaining work at this point.
```python
class ConditionValues(MetricResult[Union[pd.Series, "pyspark.sql.Column", "BinaryExpression"]]):
    @classmethod
    def validate_value_type(cls, value):
        if isinstance(value, pd.Series):
            return value

        try:
            from great_expectations.compatibility.pyspark import pyspark

            if isinstance(value, pyspark.sql.Column):
                return value
        except (ImportError, AttributeError):
            pass

        try:
            from great_expectations.compatibility.sqlalchemy import BinaryExpression

            if isinstance(value, BinaryExpression):
                return value
        except (ImportError, AttributeError):
            pass

        raise ConditionValuesValueError(type(value))

    @classmethod
    def __get_validators__(cls):
        yield cls.validate_value_type
```
👏 nice workaround!
looks good, thanks for fixing that gnarly type
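For context, the workaround being praised is pydantic v1's `__get_validators__` hook, which lets a custom type run its own validation instead of relying on a type annotation pydantic can't resolve (such as the optional pyspark/sqlalchemy types above). A minimal standalone sketch of the pattern; `Point` and `Model` here are made up for illustration:

```python
from pydantic import BaseModel  # pydantic v1 API (GX imports it via a compatibility layer)


class Point:
    """A custom type pydantic can't validate on its own."""

    def __init__(self, x: float, y: float) -> None:
        self.x, self.y = x, y

    @classmethod
    def __get_validators__(cls):
        # pydantic v1 calls this hook to collect validators for the custom type
        yield cls.validate

    @classmethod
    def validate(cls, value):
        if isinstance(value, Point):
            return value
        if isinstance(value, (list, tuple)) and len(value) == 2:
            return cls(*value)
        raise TypeError(f"cannot coerce {type(value)} to Point")


class Model(BaseModel):
    location: Point


m = Model(location=(1.0, 2.0))
assert isinstance(m.location, Point)
```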
```python
PANDAS_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    PandasFilesystemCsvDatasourceTestConfig(),
    PandasDataFrameDatasourceTestConfig(),
]

SPARK_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    SparkFilesystemCsvDatasourceTestConfig(),
]

SQL_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    BigQueryDatasourceTestConfig(),
    DatabricksDatasourceTestConfig(),
    MSSQLDatasourceTestConfig(),
    PostgreSQLDatasourceTestConfig(),
    SnowflakeDatasourceTestConfig(),
    SqliteDatasourceTestConfig(),
]
```
I suggest we move these to a common `conftest` module so we can test every metric against a standard set of backends. We should probably also restrict this list to only the data sources that are officially supported by core; as far as I know, MSSQL is not on that list.
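A sketch of what that suggestion could look like, assuming the lists move to a shared conftest. The module path and import location below are assumptions, and only MSSQL is dropped since it's the one backend the comment flags:

```python
# tests/integration/metrics/conftest.py (hypothetical location)
from typing import Sequence

from tests.integration.test_utils.data_source_config import (  # assumed import path
    BigQueryDatasourceTestConfig,
    DatabricksDatasourceTestConfig,
    DataSourceTestConfig,
    PandasDataFrameDatasourceTestConfig,
    PandasFilesystemCsvDatasourceTestConfig,
    PostgreSQLDatasourceTestConfig,
    SnowflakeDatasourceTestConfig,
    SparkFilesystemCsvDatasourceTestConfig,
    SqliteDatasourceTestConfig,
)

PANDAS_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    PandasFilesystemCsvDatasourceTestConfig(),
    PandasDataFrameDatasourceTestConfig(),
]

SPARK_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    SparkFilesystemCsvDatasourceTestConfig(),
]

# MSSQL dropped per the review comment above; the rest kept as-is.
SQL_DATA_SOURCES: Sequence[DataSourceTestConfig] = [
    BigQueryDatasourceTestConfig(),
    DatabricksDatasourceTestConfig(),
    PostgreSQLDatasourceTestConfig(),
    SnowflakeDatasourceTestConfig(),
    SqliteDatasourceTestConfig(),
]
```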
```python
    data_source_configs=PANDAS_DATA_SOURCES,
    data=DATA_FRAME,
)
def test_success_pandas(self, batch_for_datasource) -> None:
```
can we type these test params?
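One way that could look, assuming the fixture yields a fluent `Batch`; the import path and class name below are assumptions for illustration:

```python
from great_expectations.datasource.fluent.interfaces import Batch  # assumed import path


class TestColumnValuesNonNull:  # illustrative class name
    def test_success_pandas(self, batch_for_datasource: Batch) -> None: ...
```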
Summary of changes:

- Adds `ColumnValuesNonNull` and `ColumnValuesNonNullCount` metrics
- Parses `condition` metric values so they don't also return domain and value kwargs
- Adds integration tests under `tests/integration/metrics`, like the Expectation integration testing framework
- Passes `invoke lint` (uses `ruff format` + `ruff check`)