-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-45429: Initial implementation of a general query result #1047
Conversation
@TallJimbo, I have an initial version of a general query result calss and I reimplemented |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1047 +/- ##
==========================================
+ Coverage 89.44% 89.48% +0.04%
==========================================
Files 360 361 +1
Lines 45842 46095 +253
Branches 9406 9454 +48
==========================================
+ Hits 41004 41249 +245
- Misses 3506 3513 +7
- Partials 1332 1333 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in the stand-up, I had imagined adding a to-be-public Query.general
method rather than Query.dataset_associations
, as sort of a power-user escape valve. I'm not opposed to also having Query.dataset_associations
.
|
||
def iter_refs(self, dataset_type: DatasetType) -> Iterator[tuple[DatasetRef, dict[ResultColumn, Any]]]: | ||
"""Iterate over result rows and return DatasetRef constructed from each | ||
row and an original row. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's one place in QG generation where we'd want a method sort of like this that yields tuples of DatasetRef
and DataCoordinate
(it currently calls DataCoordinateQueryResults.findRelatedDatasets
).
I had originally planned to do that via a special dict-like row object that had single-row get_ref
, get_data_id
etc. methods, but I think I like your iteration adapter better as it allows for doing work common to all rows up front. But I'd like to extend this to be able to support at least DatasetRef
+ DataCoordinate
+ [extra stuff] somehow, and maybe even support DatassetRef
objects for multiple dataset. Maybe something like this?
def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[tuple[DataCoordinate, dict[str, DatasetRef], dict[ResultColumn, Any]]
where it always returns the full data ID, a dict of DatasetRef
keyed by dataset type name, and then whatever columns aren't included in one of those? To be honest I'm not thrilled with this, either, but maybe it'll inspire you to find something better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return annotation is hard to read. Would it make sense to introduce a helper NamedTuple like this:
class GeneralResultTuple(typing.NamedTuple):
data_id: DataCoordinate
refs: dict[str, DtasetRef] # or maybe just list[DatasetRef]?
raw_row: dict[str, Any]
class GeneralQueryResults(QueryResultsBase):
def iter_tuples(self, *dataset_type: DatasetType) -> Iterator[GeneralResultTuple]:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I do like NamedTuple
for this; you can still unpack it if you want to, but it's also very clear what's in it and you can hang documentation on it. And I like list
rather than dict
for the DatasetRefs
, too, since we can make that ordering consistent with the argument rather than arbitrary.
Yields | ||
------ | ||
row_dict : `dict` [`ResultColumn`, `Any`] | ||
Result row as dictionary, the keys are `ResultColumn` instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility for the key type here is a <dimension-or-dataset-type-name>.<column>
string; that might be better if we make this type more publicly accessible. In either case it should probably match however we let the user specify the desired result columns in any public interface that returns one of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to str
as a key, it is indeed more natural as API uses strings as parameters. I also replaced dataset_associations
with general
which takes the same parameters as GeneralResultSpec
.
519d90b
to
12f5406
Compare
self, | ||
dimensions: DimensionGroup, | ||
dimension_fields: Mapping[str, set[str]] = {}, | ||
dataset_fields: Mapping[str, set[DatasetFieldName]] = {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't need to mutate them, better to make the value types collections.abc.Set
rather than set
.
What do you think about making all of these keyword-only arguments, and adding a *args: str
variadic argument that we'd filter though interpret_identifier
(in queries/_identifiers.py
) and use to update the existing categorized values?
It might also be nice to have a simple way to say "all of the fields necessary for DatasetRefs
of this dataset type", maybe datasets={name: ...}
or something?
resolved_collections = self.queryCollections( | ||
collections, datasetType=datasetType, collectionTypes=collectionTypes, flattenChains=flattenChains | ||
) | ||
with self._butler._query() as query: | ||
query = query.join_dataset_search(datasetType, resolved_collections) | ||
result = query.general( | ||
datasetType.dimensions, | ||
dataset_fields={datasetType.name: {"dataset_id", "run", "collection", "timespan"}}, | ||
) | ||
yield from DatasetAssociation.from_query_result(result, datasetType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timj, this is how replacement code for queryDatasetAssociations
is going to look like. I checked RepoExportContext
class to see if I can replace queryDatasetAssociations
usage there, but that class does not even have butler instance now, and it uses registry for everything. We can do it later and keep queryDatasetAssociations
in registry for now. Both direct and remote registries use new query system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. I had forgotten that we need to change the export code so that it doesn't use registry.
5782b72
to
03f406e
Compare
5f358c5
to
fdc1b0f
Compare
The code was joining calib table twice, with and without alias, which caused a cartesian join with that table.
Postgres sometimes needs help in guessing column type when doing UNION of several queries.
Registry method `queryDatasetAssociations` is reimplemented (for both direct and remote butler) to use new query system and new general query result class.
…tions. This enables testing of the remote registry implementation, which uncovered a problem with deserialization of general result pages. Had to add additional code to serialize/deserialize that thing correctly.
…hod. General result iterator now returns dictionaries indexed by strings.
Special NamedTuple class documents the items in the returned tuples.
Adding unit tests exposed the issue with serialization of ingest_time. We still create schema with a native time type for that column, while its corresponding ColumnSpec says it should be astropy time. The code that I added recently used SerializableTime with type adapter to serialize it which did not work with datetime. I had to reimplement serialization methods to allow more flexible handling of ingest_date types.
67df37b
to
0fbe458
Compare
Registry method
queryDatasetAssociations
is reimplemented (for bothdirect and remote butler) to use new query system and new general query
result class.
Checklist
doc/changes
configs/old_dimensions