
DM-45429: Initial implementation of a general query result #1047

Merged
merged 10 commits into main from tickets/DM-45429 on Aug 14, 2024

Conversation

@andy-slac (Contributor) commented Aug 5, 2024

Registry method queryDatasetAssociations is reimplemented (for both direct and remote butler) to use the new query system and the new general query result class.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@andy-slac (Contributor, Author) commented:

@TallJimbo, I have an initial version of the general query result class, and I reimplemented queryDatasetAssociations to use it to make sure it can work. Could you check that what I've done so far matches your model for it?


codecov bot commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 92.88136% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.48%. Comparing base (20b5ab1) to head (5f358c5).

Files Patch % Lines
python/lsst/daf/butler/column_spec.py 77.77% 8 Missing and 4 partials ⚠️
python/lsst/daf/butler/queries/_query.py 77.27% 3 Missing and 2 partials ⚠️
...hon/lsst/daf/butler/direct_query_driver/_driver.py 50.00% 0 Missing and 1 partial ⚠️
.../lsst/daf/butler/queries/_general_query_results.py 98.21% 1 Missing ⚠️
python/lsst/daf/butler/remote_butler/_registry.py 87.50% 0 Missing and 1 partial ⚠️
...ote_butler/server/handlers/_query_serialization.py 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1047      +/-   ##
==========================================
+ Coverage   89.44%   89.48%   +0.04%     
==========================================
  Files         360      361       +1     
  Lines       45842    46095     +253     
  Branches     9406     9454      +48     
==========================================
+ Hits        41004    41249     +245     
- Misses       3506     3513       +7     
- Partials     1332     1333       +1     


@timj requested a review from TallJimbo on August 6, 2024 16:51
@TallJimbo (Member) left a comment:

As I mentioned in the stand-up, I had imagined adding a to-be-public Query.general method rather than Query.dataset_associations, as sort of a power-user escape valve. I'm not opposed to also having Query.dataset_associations.


def iter_refs(self, dataset_type: DatasetType) -> Iterator[tuple[DatasetRef, dict[ResultColumn, Any]]]:
    """Iterate over result rows and return DatasetRef constructed from each
    row and an original row.
@TallJimbo (Member) commented Aug 6, 2024:

There's one place in QG generation where we'd want a method sort of like this that yields tuples of DatasetRef and DataCoordinate (it currently calls DataCoordinateQueryResults.findRelatedDatasets).

I had originally planned to do that via a special dict-like row object that had single-row get_ref, get_data_id etc. methods, but I think I like your iteration adapter better, as it allows work common to all rows to be done up front. But I'd like to extend this to support at least DatasetRef + DataCoordinate + [extra stuff] somehow, and maybe even DatasetRef objects for multiple dataset types. Maybe something like this?

def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[tuple[DataCoordinate, dict[str, DatasetRef], dict[ResultColumn, Any]]]:

where it always returns the full data ID, a dict of DatasetRef keyed by dataset type name, and then whatever columns aren't included in one of those? To be honest I'm not thrilled with this, either, but maybe it'll inspire you to find something better?

@andy-slac (Contributor, Author) replied:

The return annotation is hard to read. Would it make sense to introduce a helper NamedTuple like this:

class GeneralResultTuple(typing.NamedTuple):

    data_id: DataCoordinate
    refs: dict[str, DatasetRef]  # or maybe just list[DatasetRef]?
    raw_row: dict[str, Any]

class GeneralQueryResults(QueryResultsBase):

    def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[GeneralResultTuple]:
        ...

@TallJimbo (Member) replied:

👍 I do like NamedTuple for this; you can still unpack it if you want to, but it's also very clear what's in it and you can hang documentation on it. And I like list rather than dict for the DatasetRefs, too, since we can make that ordering consistent with the argument rather than arbitrary.
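
For concreteness, a minimal sketch of where this thread seems to land: a NamedTuple with the data ID, a list of refs ordered like the arguments, and the leftover columns keyed by strings. The class body mirrors the proposal above; the usage comment is purely illustrative and not the final daf_butler API.

from typing import Any, NamedTuple

from lsst.daf.butler import DataCoordinate, DatasetRef


class GeneralResultTuple(NamedTuple):
    """One general-query result row, unpacked into convenient pieces."""

    data_id: DataCoordinate
    # Ordered to match the dataset types passed to iter_tuples, per the comment above.
    refs: list[DatasetRef]
    # Remaining columns, keyed by "<dimension-or-dataset-type-name>.<column>" strings.
    raw_row: dict[str, Any]


# Illustrative usage (names hypothetical), assuming `results` came from a query
# that joined in one dataset type:
#
#     for data_id, refs, row in results.iter_tuples(dataset_type):
#         print(data_id, refs[0].run, row)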

Yields
------
row_dict : `dict` [`ResultColumn`, `Any`]
    Result row as dictionary, the keys are `ResultColumn` instances.
@TallJimbo (Member) commented:

Another possibility for the key type here is a <dimension-or-dataset-type-name>.<column> string; that might be better if we make this type more publicly accessible. In either case it should probably match however we let the user specify the desired result columns in any public interface that returns one of these.
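
To illustrate the string-key idea, a row keyed that way might look something like this (the specific dimension, dataset type, and values are made up):

# Hypothetical raw_row using "<dimension-or-dataset-type-name>.<column>" string keys.
row = {
    "exposure.obs_id": "XX_X_20240805_000123",
    "my_dataset.run": "u/someone/some-run",
    "my_dataset.timespan": None,  # whatever timespan object the query yields
}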

@andy-slac (Contributor, Author) replied:

I switched to str as the key; it is indeed more natural, since the API uses strings as parameters. I also replaced dataset_associations with general, which takes the same parameters as GeneralResultSpec.

self,
dimensions: DimensionGroup,
dimension_fields: Mapping[str, set[str]] = {},
dataset_fields: Mapping[str, set[DatasetFieldName]] = {},
@TallJimbo (Member) commented:

Since we don't need to mutate them, better to make the value types collections.abc.Set rather than set.

What do you think about making all of these keyword-only arguments, and adding a *args: str variadic argument that we'd filter through interpret_identifier (in queries/_identifiers.py) and use to update the existing categorized values?

It might also be nice to have a simple way to say "all of the fields necessary for DatasetRefs of this dataset type", maybe datasets={name: ...} or something?
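
One way the suggested signature could look, sketched under the assumptions in this comment: keyword-only categorized arguments, a *args: str variadic routed through interpret_identifier, and the datasets={name: ...} shorthand shown only as a commented option. This is a sketch, not the final API.

from __future__ import annotations

from collections.abc import Mapping, Set

# DimensionGroup, DatasetFieldName and GeneralQueryResults are the types from the
# quoted signature above; they are left as forward references in this sketch.


class Query:  # sketch only, not the real lsst.daf.butler Query class
    def general(
        self,
        dimensions: DimensionGroup,
        *args: str,  # free-form identifiers, interpreted via interpret_identifier
        dimension_fields: Mapping[str, Set[str]] = {},
        dataset_fields: Mapping[str, Set[DatasetFieldName]] = {},
        # possible shorthand: datasets={name: ...} meaning "all fields needed for refs"
    ) -> GeneralQueryResults:
        ...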

Comment on lines +519 to +528
resolved_collections = self.queryCollections(
    collections, datasetType=datasetType, collectionTypes=collectionTypes, flattenChains=flattenChains
)
with self._butler._query() as query:
    query = query.join_dataset_search(datasetType, resolved_collections)
    result = query.general(
        datasetType.dimensions,
        dataset_fields={datasetType.name: {"dataset_id", "run", "collection", "timespan"}},
    )
    yield from DatasetAssociation.from_query_result(result, datasetType)
@andy-slac (Contributor, Author) commented:

@timj, this is how the replacement code for queryDatasetAssociations is going to look. I checked the RepoExportContext class to see if I could replace the queryDatasetAssociations usage there, but that class does not even have a butler instance now; it uses the registry for everything. We can do it later and keep queryDatasetAssociations in the registry for now. Both the direct and remote registries use the new query system.

@TallJimbo (Member) replied:

Great. I had forgotten that we need to change the export code so that it doesn't use registry.

@andy-slac marked this pull request as ready for review on August 9, 2024 16:52
Review threads:
  • python/lsst/daf/butler/queries/_general_query_results.py (outdated, resolved)
  • python/lsst/daf/butler/queries/_query.py (outdated, resolved)
  • python/lsst/daf/butler/column_spec.py (resolved)
@andy-slac force-pushed the tickets/DM-45429 branch 3 times, most recently from 5f358c5 to fdc1b0f on August 13, 2024 21:23
Commit messages:

  • The code was joining the calib table twice, with and without an alias, which caused a cartesian join with that table.
  • Postgres sometimes needs help in guessing the column type when doing a UNION of several queries.
  • Registry method `queryDatasetAssociations` is reimplemented (for both direct and remote butler) to use the new query system and the new general query result class.
  • …tions. This enables testing of the remote registry implementation, which uncovered a problem with deserialization of general result pages. Had to add additional code to serialize/deserialize that correctly.
  • …hod. The general result iterator now returns dictionaries indexed by strings. A special NamedTuple class documents the items in the returned tuples.
  • Adding unit tests exposed an issue with the serialization of ingest_time. We still create the schema with a native time type for that column, while its corresponding ColumnSpec says it should be an astropy time. The code that I added recently used SerializableTime with a type adapter to serialize it, which did not work with datetime. I had to reimplement the serialization methods to allow more flexible handling of ingest_date types.
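
The last commit message above is about letting ingest_date serialization cope with both the native datetime values the current schema produces and the astropy time the ColumnSpec declares. A hypothetical illustration of that kind of dual-path handling, not the actual daf_butler implementation:

from datetime import datetime
from typing import Any

import astropy.time


def serialize_ingest_date(value: Any) -> str | None:
    # Accept either representation rather than assuming one of them.
    if value is None:
        return None
    if isinstance(value, astropy.time.Time):
        return value.isot  # ISOT string round-trips cleanly
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"unexpected ingest_date type: {type(value)}")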
@andy-slac merged commit b716002 into main on Aug 14, 2024
16 checks passed
@andy-slac deleted the tickets/DM-45429 branch on August 14, 2024 16:48