DM-46129: Make collections.query_info a single HTTP call #1074

dhirving · 2024-09-09T17:58:34Z

Previously RemoteButlerCollections.query_info would make an HTTP call for each resolved collection. It now does everything in a single call to the server instead.
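
A minimal sketch of the two request patterns described above. `FakeServer`, `query_info_per_collection`, and `query_info_batched` are hypothetical names made up for this illustration; they are not the RemoteButler or server API, and the "HTTP calls" are only simulated by a counter.

```python
from dataclasses import dataclass


@dataclass
class FakeServer:
    """Hypothetical stand-in for the Butler server; counts simulated round trips."""

    collections: dict  # collection name -> collection type
    calls: int = 0

    def get_one(self, name):
        # One simulated HTTP round trip per collection.
        self.calls += 1
        return {"name": name, "type": self.collections[name]}

    def get_many(self, names):
        # One simulated HTTP round trip for the whole batch.
        self.calls += 1
        return [{"name": n, "type": self.collections[n]} for n in names]


def query_info_per_collection(server, names):
    """Old pattern: N resolved collections cost N round trips."""
    return [server.get_one(n) for n in names]


def query_info_batched(server, names):
    """New pattern: N resolved collections cost a single round trip."""
    return server.get_many(list(names))
```

Both functions return the same records; only the number of round trips differs.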

RemoteButlerCollections.get_info and other collection methods in RemoteButlerRegistry have also been updated to re-use the new endpoint.

Also fixed an issue where the summary_datasets parameter to DirectButlerCollections.query_info was not doing anything.

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes
(if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov · 2024-09-09T18:12:36Z

Codecov Report

Attention: Patch coverage is 95.52239% with 6 lines in your changes missing coverage. Please review.

Project coverage is 89.68%. Comparing base (5a6ff82) to head (af34091).
Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
...butler/remote_butler/_remote_butler_collections.py	89.28%	1 Missing and 2 partials ⚠️
python/lsst/daf/butler/_dataset_type.py	66.66%	1 Missing and 1 partial ⚠️
...on/lsst/daf/butler/remote_butler/_remote_butler.py	92.30%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1074      +/-   ##
==========================================
+ Coverage   89.66%   89.68%   +0.01%     
==========================================
  Files         359      360       +1     
  Lines       46935    46988      +53     
  Branches     9652     9653       +1     
==========================================
+ Hits        42086    42139      +53     
+ Misses       3481     3480       -1     
- Partials     1368     1369       +1

andy-slac · 2024-09-10T05:28:21Z

DM-45993 was merged; please rebase and add forwarding of the new query_info parameters to the remote server.

Previously collections.query_info would make an HTTP call for each resolved collection. It now does everything in a single call to the server instead.

There is no reason to add a duplicate API for this, because get_info is identical to calling query_info with a single collection name.

The various Butler sub-objects (registry, collections) need to access the registry defaults. Added a small helper class so that they don't need a reference to RemoteButler or each other in order to fetch the defaults.

Now that there is an independent implementation of ButlerCollections for RemoteButler, forward the equivalent Registry methods to it to reduce duplication.

Update RemoteButlerCollections to pass the new optimization parameters added in DM-45993 to the server.

When attempting to add a test for the summary_datasets parameter to ButlerCollectionInfo for RemoteButler, it turned out that the parameter did not do anything in the DirectButler version. This occurred because a caching context causes the parameter to be ignored, and the cache was previously always enabled in query_info. The cache provides no benefit for the new implementation.
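
As a generic illustration of this kind of failure (not the actual DirectButler code), the sketch below shows how a cached code path can silently drop a filter argument. `SummarySource`, `enable_cache`, and `fetch` are hypothetical names invented for this example, and the mechanism shown is an assumption about the general pattern only.

```python
class SummarySource:
    """Hypothetical summary provider with an optional cache."""

    def __init__(self, data):
        self._data = data    # collection name -> set of dataset type names
        self._cache = None   # None means caching is disabled

    def enable_cache(self):
        self._cache = {}

    def fetch(self, collection, dataset_types=None):
        if self._cache is not None:
            # Cached path: the full summary is stored and returned as-is,
            # so the dataset_types filter is never applied.
            if collection not in self._cache:
                self._cache[collection] = set(self._data[collection])
            return self._cache[collection]
        # Uncached path: the filter is honored.
        full = set(self._data[collection])
        if dataset_types is None:
            return full
        return full & set(dataset_types)
```

With the cache always enabled, a caller passing `dataset_types` gets the unfiltered summary back, which matches the symptom of a parameter that "does nothing".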

andy-slac

Looks great. Thanks for adding a unit test for new query_info parameters! Maybe we should try to simplify things by passing dataset type names instead of dataset types, but that is up to you.

andy-slac · 2024-09-10T23:19:23Z

python/lsst/daf/butler/_dataset_type.py

@@ -796,3 +796,21 @@ def _unpickle_via_factory(factory: Callable, args: Any, kwargs: Any) -> DatasetT
    arguments as well as positional arguments.
    """
    return factory(*args, **kwargs)
+
+
+def get_dataset_type_name(datasetTypeOrName: DatasetType | str) -> str:


Maybe make it a classmethod of DatasetType to avoid extra imports?

andy-slac · 2024-09-10T23:21:11Z

python/lsst/daf/butler/registry/datasets/byDimensions/_manager.py

 from ...._dataset_ref import DatasetId, DatasetIdGenEnum, DatasetRef, DatasetType
+from ...._dataset_type import get_dataset_type_name


DatasetType above should also be coming from _dataset_type.

andy-slac · 2024-09-10T23:26:10Z

python/lsst/daf/butler/_butler_collections.py

@@ -280,7 +283,7 @@ def query_info(
        include_parents: bool = False,
        include_summary: bool = False,
        include_doc: bool = False,
-        summary_datasets: Iterable[DatasetType] | None = None,
+        summary_datasets: Iterable[DatasetType] | Iterable[str] | None = None,


Should we allow Iterable[str] only to avoid all complications? I think that clients of this method that actually pass summary_datasets should already have a list of names for other purposes. I guess this also means that datasets manager has to accept Iterable[str] instead of Iterable[DatasetType]?

dhirving · 2024-09-11

I was thinking the same thing, but the convention in the rest of the new query system is that dataset types can be specified as either the name or a DatasetType instance, so it makes sense to follow that. It's also common for this parameter to come from the result of a queryDatasetTypes call.
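
A minimal sketch of the "name or instance" convention discussed in this thread. `DatasetTypeStub` is a hypothetical stand-in for the real `DatasetType` class (only its `name` attribute is modeled), and `to_dataset_type_names` is an illustrative helper, not the `get_dataset_type_name` function added in this PR, whose body is not shown here.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetTypeStub:
    """Hypothetical stand-in for DatasetType; only the name is modeled."""

    name: str


def to_dataset_type_names(
    items: Iterable[DatasetTypeStub | str] | None,
) -> list[str] | None:
    """Normalize a mix of names and dataset type objects to plain names."""
    if items is None:
        # Preserve "no filter requested" as None.
        return None
    return [item if isinstance(item, str) else item.name for item in items]
```

Normalizing once at the API boundary lets callers pass either form while downstream code, such as a request sent to a server, only has to deal with strings.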

andy-slac · 2024-09-10T23:31:53Z

python/lsst/daf/butler/_dataset_type.py

+    datasetTypeOrName : `DatasetType` | `str`
+        A DatasetType, or the name of a DatasetType. This union is a common
+        parameter in many `Butler` methods.
+    """


Docstring also needs Returns.

dhirving force-pushed the tickets/DM-46129 branch 4 times, most recently from f42382b to cdb1bfe Compare September 9, 2024 21:52

dhirving mentioned this pull request Sep 9, 2024

DM-45993: Optimize DirectButlerCollections.query_info to avoid too many queries #1075

Merged

3 tasks

dhirving added 6 commits September 10, 2024 10:45

Make collections.query_info a single HTTP call

bcf3cd2

Previously collections.query_info would make an HTTP call for each resolved collection. It now does everything in a single call to the server instead.

Make get_info use the same endpoint as query_info

063d572

There is no reason to add a duplicate API for this, because get_info is identical to calling query_info with a single collection name.
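
A minimal sketch of this relationship, assuming a simplified in-memory lookup. `CollectionsFacade` and its record format are hypothetical and do not reflect the real ButlerCollections signatures.

```python
from __future__ import annotations

from collections.abc import Iterable


class CollectionsFacade:
    """Hypothetical collections object backed by an in-memory mapping."""

    def __init__(self, known: dict[str, str]) -> None:
        self._known = known  # collection name -> collection type

    def query_info(self, names: str | Iterable[str]) -> list[dict[str, str]]:
        # Accept a single name or several names.
        if isinstance(names, str):
            names = [names]
        return [{"name": n, "type": self._known[n]} for n in names]

    def get_info(self, name: str) -> dict[str, str]:
        # Thin wrapper: a single-name query that must yield exactly one record.
        (info,) = self.query_info(name)
        return info
```

Because `get_info` delegates to `query_info`, both share one code path and, in the remote case, one endpoint.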

Break RegistryDefaults circular dependency

694d360

The various Butler sub-objects (registry, collections) need to access the registry defaults. Added a small helper class so that they don't need a reference to RemoteButler or each other in order to fetch the defaults.
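
One way such a helper could look, as a sketch only. `DefaultsHolder`, `RegistrySketch`, and `CollectionsSketch` are hypothetical names; the actual helper class in the PR may differ in name and shape.

```python
from dataclasses import dataclass


@dataclass
class DefaultsHolder:
    """Hypothetical shared container for the current registry defaults."""

    collections: tuple = ()

    def get(self):
        return self.collections

    def set(self, collections):
        self.collections = tuple(collections)


class RegistrySketch:
    """Sub-object that reads and updates defaults through the holder."""

    def __init__(self, defaults):
        self._defaults = defaults

    def set_default_collections(self, collections):
        self._defaults.set(collections)


class CollectionsSketch:
    """Sub-object that only reads defaults; it never references the registry."""

    def __init__(self, defaults):
        self._defaults = defaults

    @property
    def defaults(self):
        return self._defaults.get()
```

Each sub-object depends only on the small holder, so neither needs a reference to the parent butler or to the other sub-object.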

Forward RemoteButlerRegistry to ButlerCollections

a7ea9f2

Now that there is an independent implementation of ButlerCollections for RemoteButler, forward the equivalent Registry methods to it to reduce duplication.

Mark deprecated REST endpoints

56d6a27

Handle collection query optimization parameters

8c3ca9c

Update RemoteButlerCollections to pass the new optimization parameters added in DM-45993 to the server.

dhirving force-pushed the tickets/DM-46129 branch from cdb1bfe to 8c3ca9c Compare September 10, 2024 17:49

dhirving marked this pull request as ready for review September 10, 2024 18:20

dhirving force-pushed the tickets/DM-46129 branch from c2d6947 to cc654d1 Compare September 10, 2024 22:50

dhirving force-pushed the tickets/DM-46129 branch from cc654d1 to 701bd48 Compare September 10, 2024 22:52

andy-slac approved these changes Sep 10, 2024

Fix minor review comments

af34091

dhirving merged commit fb8c666 into main Sep 11, 2024
18 checks passed

dhirving deleted the tickets/DM-46129 branch September 11, 2024 16:23
