Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-45680: Allow boolean columns to be used in query 'where' #1051

Merged
merged 6 commits into from
Aug 12, 2024

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Aug 8, 2024

Fix an issue where boolean metadata columns (like exposure.can_see_sky and exposure.has_simulated) were not usable in "where" clauses for query functions.

These column names can now be used as a boolean expression, for example where="exposure.can_see_sky or where="NOT exposure.can_see_sky".

In the new query system's ExpressionFactory, you can convert a boolean column to a usable boolean Predicate by calling as_boolean().

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Aug 9, 2024

Codecov Report

Attention: Patch coverage is 93.04348% with 8 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (8e57637) to head (7aa86be).

Files Patch % Lines
...hon/lsst/daf/butler/queries/_expression_strings.py 84.21% 0 Missing and 3 partials ⚠️
python/lsst/daf/butler/queries/tree/_predicate.py 81.25% 2 Missing and 1 partial ⚠️
...thon/lsst/daf/butler/queries/expression_factory.py 89.47% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1051      +/-   ##
==========================================
+ Coverage   89.40%   89.44%   +0.04%     
==========================================
  Files         360      360              
  Lines       45733    45844     +111     
  Branches     9382     9406      +24     
==========================================
+ Hits        40888    41006     +118     
+ Misses       3519     3506      -13     
- Partials     1326     1332       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@classmethod
def from_bool_expression(cls, value: ColumnExpression) -> Predicate:
"""Construct a predicate that wraps a boolean ColumnExpression, taking
on the value of the underlying ColumnExpression.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had hoped to never have to allow a boolean ColumnExpression to be constructed (going to Predicate immediately instead). But I think I've figured out why you didn't do that, and I agree with it: we'd need Predicate-subclass versions of DimensionFieldReference and DatasetFieldReference, and we'd need to include them in all of the cases where column references of other types already appear, and that'd be a much bigger problem. Is that right?

If so, I think instead we need to be extra vigilant about putting those column expressions in their predicate wrappers as early as possible, and keeping boolean operators out of those column expressions, because that would subvert all of the boolean-expression normalization and reasoning we do. I suspect typing is already going to do prevent problems pretty effectively, since we don't have any boolean-operator expressions, but we could lock it down a bit further by declaring that value here (and in BooleanWrapper) must be ColumnReference, not merely ColumnExpression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking wasn't actually that involved, I just observed that:

  • Boolean ColumnExpressions already existed, and it didn't occur to me to go out of my way to create a separate representation for boolean SQL columns than the one used for int or string.
  • Most of the other predicates worked by wrapping one or more column expressions with a little bit of information specifying how they convert to booleans -- and "no-op" seemed like a reasonable addition to the set of conversions.

So I just went with the grain of what was already there and it slotted in easily with minimal code.

I do share your concern about this problem potentially coming back since any place that references a boolean column will need to do the same thing, but that's sort of inherent to treating booleans as a special case. Changing the type is a good suggestion, I'll do that.

I suspect handling boolean_column = NULL is going to get ugly though so I may need to figure something else out -- or just not support it, or special case it to only work for a single boolean column and not arbitrary boolean expressions.

@dhirving dhirving changed the title DM-45680: Allow boolean columns to be used in where strings DM-45680: Allow boolean columns to be used in query 'where' Aug 10, 2024
def is_null(self) -> tree.Predicate:
return ResolvedScalarExpressionProxy(self._boolean_expression).is_null

def as_boolean(self) -> tree.Predicate:
Copy link
Contributor Author

@dhirving dhirving Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like forcing users to call as_boolean() to get a Predicate out of this, but I didn't really have a better idea.

We can't just return a Predicate directly from DimensionElementProxy.__getattr__ because it would completely destroy the static typing. The return type would become ScalarExpressionProxy | Predicate and those types have almost no methods in common.

This at least throws a semi-helpful runtime error if you fail to call as_boolean() or attempt to do a comparison like boolean_column == 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish Python typing had something like Any[A | B], where it'd happily do an implicit cast to one of the options you give it without losing the typing entirely (the way Any does).

I'm not convinced the static typing is so important here that it's worth doing even slight harm to the runtime interface - I'm imagining users in Jupyter notebooks doing tab completion as the most important use case for ExpressionFactory. But I do expect boolean columns to be very rare, and this is a small change that's easy to revisit later, and the same cannot be said of trying to put the entire Predicate interface into ScalarExpressionProxy.

Fix an issue where boolean metadata columns (like exposure.can_see_sky and exposure.has_simulated) were not usable in "where" clauses for Registry query functions.

These column names can now be used as a boolean expression, for example `where="exposure.can_see_sky"` or `where="NOT exposure.can_see_sky"`.
We need roughly the same test for both registry and the new query system, so move it out of the registry tests to a place where we have access to both query systems.
Add an as_boolean() method to the expression proxies used to reference dimension fields, so boolean-valued fields can be converted to Predicate instances that can be used to constrain a query.
Boolean columns in the DB are always nullable (and exposure.can_see_sky has many NULLs in the real databases.)  So add a way to search for them in the query system.
Add support for `= NULL` and `!= NULL` syntax for boolean columns in query where strings.
@dhirving dhirving marked this pull request as ready for review August 10, 2024 00:50
def is_null(self) -> tree.Predicate:
return ResolvedScalarExpressionProxy(self._boolean_expression).is_null

def as_boolean(self) -> tree.Predicate:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish Python typing had something like Any[A | B], where it'd happily do an implicit cast to one of the options you give it without losing the typing entirely (the way Any does).

I'm not convinced the static typing is so important here that it's worth doing even slight harm to the runtime interface - I'm imagining users in Jupyter notebooks doing tab completion as the most important use case for ExpressionFactory. But I do expect boolean columns to be very rare, and this is a small change that's easy to revisit later, and the same cannot be said of trying to put the entire Predicate interface into ScalarExpressionProxy.

@dhirving dhirving merged commit a0830e1 into main Aug 12, 2024
18 checks passed
@dhirving dhirving deleted the tickets/DM-45680 branch August 12, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants