-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
311 telemetry #312
base: main
Are you sure you want to change the base?
311 telemetry #312
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #312 +/- ##
=======================================
Coverage 98.75% 98.75%
=======================================
Files 15 15
Lines 1205 1205
=======================================
Hits 1190 1190
Misses 15 15 ☔ View full report in Codecov by Sentry. |
Closing this as the testing infrastructure required is too different to the current setup & the functionality can be generalised - it's better off as it's own package. |
Tests failing due to access-ipy-telemetry package not being available on PyPI yet - will need to upload to PyPI when I get my hands on the org credentials. Changing to a git blob for now. |
9b73b03
to
ca8bb09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the git
dependency is breaking my editable install so I can't run the tests:
Obtaining file:///Users/marc/Documents/ACCESS-NRI/access-nri-intake-catalog/access-nri-intake-catalog
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: cftime in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (1.6.4)
Requirement already satisfied: ecgtools>=2023.7.13 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.7.31)
Requirement already satisfied: intake>=0.7.0 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (0.7.0)
Requirement already satisfied: intake-dataframe-catalog>=0.2.4 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (0.2.4)
Requirement already satisfied: intake-esm>=2023.11.10 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.2.6)
Requirement already satisfied: jsonschema in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (4.23.0)
Requirement already satisfied: pooch in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (1.8.2)
Requirement already satisfied: xarray in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.10.0)
INFO: pip is looking at multiple versions of access-nri-intake to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement access-ipy-telemetry>=0.1.0 (from access-nri-intake) (from versions: none)
ERROR: No matching distribution found for access-ipy-telemetry>=0.1.0
Weird... will look into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My attempt to replicate the install is not going well:
pip install https://github.com/ACCESS-NRI/access-ipy-telemetry.git 1 ✘ 4s access-nri-intake-test Py 13:09:59
Collecting https://github.com/ACCESS-NRI/access-ipy-telemetry.git
Downloading https://github.com/ACCESS-NRI/access-ipy-telemetry.git
- 388.3 kB 10.8 MB/s 0:00:00
ERROR: Cannot unpack file /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-unpack-cstb5xru/access-ipy-telemetry.git (downloaded from /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-req-build-bes3kd5a, content-type: text/html; charset=utf-8); cannot detect archive format
ERROR: Cannot determine archive format of /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-req-build-bes3kd5a
Let's leave this for now - we'll need to add that package to PyPI at some point anyway, so might as well get that done and come back to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the code updates themselves look pretty straightforward, so no real comments there (modulo wondering what happens with conda builds).
However, I do have another concern about personal data management & tracking. I noticed on the access_py_telemetry
docs that the telemetry is on by default on the conda environments. In that case, I think there need to be plenty of loud, obvious messages that come up saying that that's the case, and including the instructions for disabling it. I've also added a comment where the telemetry is (re-)enabled by this package.
@@ -22,6 +22,7 @@ dependencies = [ | |||
"jsonschema", | |||
"pooch", | |||
"xarray", | |||
"access-py-telemetry>=0.1.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to find this on conda.... will that break the conda package build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I don't know the answer. I've put the package onto PyPI and set the CI environments to find it with pip. I'll find out what I need to do to add it to conda too for completeness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this end up getting covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, I haven't got a chance yet - I'll take a look at it now.
cat_version = data._captured_init_kwargs.get("metadata", {}).get( | ||
"version", "latest" | ||
) # Get the catalog version number and set it to "latest" if it can't be found | ||
configure_telemetry(["--enable", "--silent"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a message be emitted here to warn the user that the telemetry has been turned on, and how to turn it back off if they so wish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there's a wider question here about how we implement telemetry. I'd defer to @aidanheerdegen's judgement here on:
- Whether we alert users about telemetry - I don't see any reason not to let users know, but I wonder about whether letting users know about telemetry at import time is the right place to do so. We have added some telemetry notices elsewhere, maybe adding it to the docs here would be appropriate?.
- Whether we allow them to turn it off. Allow is a bit of a misnomer here - you can always turn off telemetry if it's configurable through a python function with the right permissions (as it is here), but if we inform users & tell them how to disable it, we run the risk of losing a lot of information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B I am a bit of tinfoil hat and if I got a notice regarding telemetry and how to disable it, I would always do so. I'm assuming that is pretty common, but I don't know for sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps flash the message up at load time saying we're using telemetry, but leave the instructions/details to the docs? Then the true tin-foilers can look into how to disable it, but most might not bother?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this might be the best option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @charles-turner-1 I think if we flash messages every time a user loads the package, they will just deactivate telemetry and we will use important data.
Any chance we could do that at install time?
Otherwise flagging it in the documentation is good enough for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where should we add in the documentation? The usage FAQs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want to let users know what we're collecting and why. For data users it's a very compelling case: we will rely on the telemetry to make important resource allocation decisions. If they turn off the telemetry we won't know they're using a dataset, which means it risks being removed/de-prioritised.
That sort of involved argument can only be made in the docs I think.
I wonder if there is a possibility that we could also give users some useful information from the telemetry so they would see some immediate value in it?
Like could they call a function that reports a summary of what they've accessed in a session? For their own interest?
Or do we need to worry about people gaming the system so the datasets they're interested in are well used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want to let users know what we're collecting and why. For data users it's a very compelling case: we will rely on the telemetry to make important resource allocation decisions. If they turn off the telemetry we won't know they're using a dataset, which means it risks being removed/de-prioritised.
That sort of involved argument can only be made in the docs I think.
I wonder if there is a possibility that we could also give users some useful information from the telemetry so they would see some immediate value in it?
Like could they call a function that reports a summary of what they've accessed in a session? For their own interest?
This is definitely doable & should be relatively straightforward.
Or do we need to worry about people gaming the system so the datasets they're interested in are well used?
My guess is probably not - I imagine that the amount of noisy signal generated from playing around with datasets to obtain what a user would want is going to be pretty large anyway, just from the way notebooks tend to be used. Gaming that would be kind of tough.
With that said, we can easily enough deduplicate queries when we look at telemetry - I imagine if someone is trying to game the telemetry, it would be something like
for _ in range(1e6):
catalog.search(experiment='my_very_important_experiment', variable='my_very_important_variable')
All those calls will leave identical records in the telemetry (besides timestamp & primary key) - so should be straightforward to deduplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should implement a throttle mechanism to prevent recording identical telemetry calls within a short time window (e.g., a few seconds). This will help mitigate excessive server requests and protect against potential overload.
fa47f7e
to
a13e5b8
Compare
I've had a prospective data owner ask about the possibility of getting usage statistics/information for the dataset they're proposing to add to the catalog. Does this PR track the usage down to that level? |
I would need to double check, but we'll definitely know how many times a particular datastore has been searched for/opened. Potentially with some more work we can make things more granular - the telemetry package is designed to be pretty general and I think we should extend it in whatever direction we find helps collect useful data. |
Hooks in telemetry IPython extension upon accessing
intake.cat.access_nri
- and sends searches performed on eitherDfFileCatalog
objects (ie. the metacatalog), oresm_datastore
objects to a telemetry server.Relies on access-ipy-telemetryto provide the implementation of telemetry.