
DM-45429: Initial implementation of a general query result #1047

Merged
merged 10 commits into main from tickets/DM-45429 on Aug 14, 2024

Conversation

@andy-slac (Contributor) commented Aug 5, 2024

Registry method queryDatasetAssociations is reimplemented (for both direct and remote butler) to use the new query system and the new general query result class.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@andy-slac (Contributor, Author) commented:

@TallJimbo, I have an initial version of the general query result class, and I reimplemented queryDatasetAssociations to use it to make sure it can work. Could you check that what I've done so far matches your model for it?


codecov bot commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 92.88136% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.48%. Comparing base (20b5ab1) to head (5f358c5).

Files Patch % Lines
python/lsst/daf/butler/column_spec.py 77.77% 8 Missing and 4 partials ⚠️
python/lsst/daf/butler/queries/_query.py 77.27% 3 Missing and 2 partials ⚠️
...hon/lsst/daf/butler/direct_query_driver/_driver.py 50.00% 0 Missing and 1 partial ⚠️
.../lsst/daf/butler/queries/_general_query_results.py 98.21% 1 Missing ⚠️
python/lsst/daf/butler/remote_butler/_registry.py 87.50% 0 Missing and 1 partial ⚠️
...ote_butler/server/handlers/_query_serialization.py 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1047      +/-   ##
==========================================
+ Coverage   89.44%   89.48%   +0.04%     
==========================================
  Files         360      361       +1     
  Lines       45842    46095     +253     
  Branches     9406     9454      +48     
==========================================
+ Hits        41004    41249     +245     
- Misses       3506     3513       +7     
- Partials     1332     1333       +1     


@timj requested a review from TallJimbo on August 6, 2024 16:51
@TallJimbo (Member) left a comment:

As I mentioned in the stand-up, I had imagined adding a to-be-public Query.general method rather than Query.dataset_associations, as sort of a power-user escape valve. I'm not opposed to also having Query.dataset_associations.


def iter_refs(self, dataset_type: DatasetType) -> Iterator[tuple[DatasetRef, dict[ResultColumn, Any]]]:
    """Iterate over result rows and return DatasetRef constructed from each
    row and an original row.
@TallJimbo (Member) commented Aug 6, 2024:

There's one place in QG generation where we'd want a method sort of like this that yields tuples of DatasetRef and DataCoordinate (it currently calls DataCoordinateQueryResults.findRelatedDatasets).

I had originally planned to do that via a special dict-like row object that had single-row get_ref, get_data_id etc. methods, but I think I like your iteration adapter better, as it allows work common to all rows to be done up front. But I'd like to extend this to support at least DatasetRef + DataCoordinate + [extra stuff] somehow, and maybe even DatasetRef objects for multiple dataset types. Maybe something like this?

def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[tuple[DataCoordinate, dict[str, DatasetRef], dict[ResultColumn, Any]]]:

where it always returns the full data ID, a dict of DatasetRef keyed by dataset type name, and then whatever columns aren't included in one of those? To be honest I'm not thrilled with this, either, but maybe it'll inspire you to find something better?

@andy-slac (Contributor, Author) replied:

The return annotation is hard to read. Would it make sense to introduce a helper NamedTuple like this:

class GeneralResultTuple(typing.NamedTuple):

    data_id: DataCoordinate
    refs: dict[str, DatasetRef]  # or maybe just list[DatasetRef]?
    raw_row: dict[str, Any]

class GeneralQueryResults(QueryResultsBase):

    def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[GeneralResultTuple]:
        ...

@TallJimbo (Member) replied:

👍 I do like NamedTuple for this; you can still unpack it if you want to, but it's also very clear what's in it and you can hang documentation on it. And I like list rather than dict for the DatasetRefs, too, since we can make that ordering consistent with the argument rather than arbitrary.
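
For concreteness, a minimal sketch of where this thread seems to land: a NamedTuple with the data ID, a list of refs ordered like the arguments, and the leftover columns keyed by strings. The class body mirrors the proposal above; the usage comment is purely illustrative and not the final daf_butler API.

from typing import Any, NamedTuple

from lsst.daf.butler import DataCoordinate, DatasetRef


class GeneralResultTuple(NamedTuple):
    """One general-query result row, unpacked into convenient pieces."""

    data_id: DataCoordinate
    # Ordered to match the dataset types passed to iter_tuples, per the comment above.
    refs: list[DatasetRef]
    # Remaining columns, keyed by "<dimension-or-dataset-type-name>.<column>" strings.
    raw_row: dict[str, Any]


# Illustrative usage (names hypothetical), assuming `results` came from a query
# that joined in one dataset type:
#
#     for data_id, refs, row in results.iter_tuples(dataset_type):
#         print(data_id, refs[0].run, row)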

Yields
------
row_dict : `dict` [`ResultColumn`, `Any`]
    Result row as dictionary, the keys are `ResultColumn` instances.
@TallJimbo (Member) commented:

Another possibility for the key type here is a <dimension-or-dataset-type-name>.<column> string; that might be better if we make this type more publicly accessible. In either case it should probably match however we let the user specify the desired result columns in any public interface that returns one of these.
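
To illustrate the string-key idea, a row keyed that way might look something like this (the specific dimension, dataset type, and values are made up):

# Hypothetical raw_row using "<dimension-or-dataset-type-name>.<column>" string keys.
row = {
    "exposure.obs_id": "XX_X_20240805_000123",
    "my_dataset.run": "u/someone/some-run",
    "my_dataset.timespan": None,  # whatever timespan object the query yields
}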

@andy-slac (Contributor, Author) replied:

I switched to str as the key; it is indeed more natural, since the API uses strings as parameters. I also replaced dataset_associations with general, which takes the same parameters as GeneralResultSpec.

self,
dimensions: DimensionGroup,
dimension_fields: Mapping[str, set[str]] = {},
dataset_fields: Mapping[str, set[DatasetFieldName]] = {},
@TallJimbo (Member) commented:

Since we don't need to mutate them, better to make the value types collections.abc.Set rather than set.

What do you think about making all of these keyword-only arguments, and adding a *args: str variadic argument that we'd filter through interpret_identifier (in queries/_identifiers.py) and use to update the existing categorized values?

It might also be nice to have a simple way to say "all of the fields necessary for DatasetRefs of this dataset type", maybe datasets={name: ...} or something?
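
One way the suggested signature could look, sketched under the assumptions in this comment: keyword-only categorized arguments, a *args: str variadic routed through interpret_identifier, and the datasets={name: ...} shorthand shown only as a commented option. This is a sketch, not the final API.

from __future__ import annotations

from collections.abc import Mapping, Set

# DimensionGroup, DatasetFieldName and GeneralQueryResults are the types from the
# quoted signature above; they are left as forward references in this sketch.


class Query:  # sketch only, not the real lsst.daf.butler Query class
    def general(
        self,
        dimensions: DimensionGroup,
        *args: str,  # free-form identifiers, interpreted via interpret_identifier
        dimension_fields: Mapping[str, Set[str]] = {},
        dataset_fields: Mapping[str, Set[DatasetFieldName]] = {},
        # possible shorthand: datasets={name: ...} meaning "all fields needed for refs"
    ) -> GeneralQueryResults:
        ...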

Comment on lines +519 to +528
resolved_collections = self.queryCollections(
    collections, datasetType=datasetType, collectionTypes=collectionTypes, flattenChains=flattenChains
)
with self._butler._query() as query:
    query = query.join_dataset_search(datasetType, resolved_collections)
    result = query.general(
        datasetType.dimensions,
        dataset_fields={datasetType.name: {"dataset_id", "run", "collection", "timespan"}},
    )
    yield from DatasetAssociation.from_query_result(result, datasetType)
@andy-slac (Contributor, Author) commented:

@timj, this is how the replacement code for queryDatasetAssociations is going to look. I checked the RepoExportContext class to see if I could replace the queryDatasetAssociations usage there, but that class does not even have a butler instance now; it uses the registry for everything. We can do it later and keep queryDatasetAssociations in the registry for now. Both the direct and remote registries use the new query system.

@TallJimbo (Member) replied:

Great. I had forgotten that we need to change the export code so that it doesn't use registry.

@andy-slac marked this pull request as ready for review on August 9, 2024 16:52
Review threads:
  • python/lsst/daf/butler/queries/_general_query_results.py (outdated, resolved)
  • python/lsst/daf/butler/queries/_query.py (outdated, resolved)
  • python/lsst/daf/butler/column_spec.py (resolved)
@andy-slac force-pushed the tickets/DM-45429 branch 3 times, most recently from 5f358c5 to fdc1b0f on August 13, 2024 21:23
Commit messages:

  • The code was joining the calib table twice, with and without an alias, which caused a cartesian join with that table.
  • Postgres sometimes needs help in guessing the column type when doing a UNION of several queries.
  • Registry method `queryDatasetAssociations` is reimplemented (for both direct and remote butler) to use the new query system and the new general query result class.
  • …tions. This enables testing of the remote registry implementation, which uncovered a problem with deserialization of general result pages. Had to add additional code to serialize/deserialize that correctly.
  • …hod. The general result iterator now returns dictionaries indexed by strings. A special NamedTuple class documents the items in the returned tuples.
  • Adding unit tests exposed an issue with the serialization of ingest_time. We still create the schema with a native time type for that column, while its corresponding ColumnSpec says it should be an astropy time. The code that I added recently used SerializableTime with a type adapter to serialize it, which did not work with datetime. I had to reimplement the serialization methods to allow more flexible handling of ingest_date types.
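
The last commit message above is about letting ingest_date serialization cope with both the native datetime values the current schema produces and the astropy time the ColumnSpec declares. A hypothetical illustration of that kind of dual-path handling, not the actual daf_butler implementation:

from datetime import datetime
from typing import Any

import astropy.time


def serialize_ingest_date(value: Any) -> str | None:
    # Accept either representation rather than assuming one of them.
    if value is None:
        return None
    if isinstance(value, astropy.time.Time):
        return value.isot  # ISOT string round-trips cleanly
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"unexpected ingest_date type: {type(value)}")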
@andy-slac merged commit b716002 into main on Aug 14, 2024
16 checks passed
@andy-slac deleted the tickets/DM-45429 branch on August 14, 2024 16:48