Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

311 telemetry #312

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

311 telemetry #312

wants to merge 15 commits into from

Conversation

charles-turner-1
Copy link
Collaborator

@charles-turner-1 charles-turner-1 commented Dec 17, 2024

Hooks in telemetry IPython extension upon accessing intake.cat.access_nri - and sends searches performed on either DfFileCatalog objects (ie. the metacatalog), or esm_datastore objects to a telemetry server.

Relies on access-ipy-telemetryto provide the implementation of telemetry.

Copy link

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.75%. Comparing base (f331cb4) to head (6502a7f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #312   +/-   ##
=======================================
  Coverage   98.75%   98.75%           
=======================================
  Files          15       15           
  Lines        1205     1205           
=======================================
  Hits         1190     1190           
  Misses         15       15           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@charles-turner-1
Copy link
Collaborator Author

Closing this as the testing infrastructure required is too different to the current setup & the functionality can be generalised - it's better off as it's own package.

@charles-turner-1
Copy link
Collaborator Author

charles-turner-1 commented Jan 2, 2025

Tests failing due to access-ipy-telemetry package not being available on PyPI yet - will need to upload to PyPI when I get my hands on the org credentials.

Changing to a git blob for now.

@charles-turner-1 charles-turner-1 marked this pull request as ready for review January 2, 2025 01:26
@charles-turner-1 charles-turner-1 mentioned this pull request Jan 2, 2025
Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the git dependency is breaking my editable install so I can't run the tests:

Obtaining file:///Users/marc/Documents/ACCESS-NRI/access-nri-intake-catalog/access-nri-intake-catalog
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: cftime in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (1.6.4)
Requirement already satisfied: ecgtools>=2023.7.13 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.7.31)
Requirement already satisfied: intake>=0.7.0 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (0.7.0)
Requirement already satisfied: intake-dataframe-catalog>=0.2.4 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (0.2.4)
Requirement already satisfied: intake-esm>=2023.11.10 in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.2.6)
Requirement already satisfied: jsonschema in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (4.23.0)
Requirement already satisfied: pooch in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (1.8.2)
Requirement already satisfied: xarray in /Users/marc/anaconda3/envs/access-nri-intake-test/lib/python3.11/site-packages (from access_nri_intake==1.0.0+14.g9c67e7f) (2024.10.0)
INFO: pip is looking at multiple versions of access-nri-intake to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement access-ipy-telemetry>=0.1.0 (from access-nri-intake) (from versions: none)
ERROR: No matching distribution found for access-ipy-telemetry>=0.1.0

@charles-turner-1
Copy link
Collaborator Author

Weird... will look into it.

Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My attempt to replicate the install is not going well:

pip install https://github.com/ACCESS-NRI/access-ipy-telemetry.git                                                                                                      1 ✘  4s  access-nri-intake-test Py  13:09:59 
Collecting https://github.com/ACCESS-NRI/access-ipy-telemetry.git
  Downloading https://github.com/ACCESS-NRI/access-ipy-telemetry.git
     - 388.3 kB 10.8 MB/s 0:00:00
  ERROR: Cannot unpack file /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-unpack-cstb5xru/access-ipy-telemetry.git (downloaded from /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-req-build-bes3kd5a, content-type: text/html; charset=utf-8); cannot detect archive format
ERROR: Cannot determine archive format of /private/var/folders/93/b5zp1z_j4r35lgsbh92wkyy80000gp/T/pip-req-build-bes3kd5a

@charles-turner-1
Copy link
Collaborator Author

Let's leave this for now - we'll need to add that package to PyPI at some point anyway, so might as well get that done and come back to it.

Copy link
Collaborator

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the code updates themselves look pretty straightforward, so no real comments there (modulo wondering what happens with conda builds).

However, I do have another concern about personal data management & tracking. I noticed on the access_py_telemetry docs that the telemetry is on by default on the conda environments. In that case, I think there need to be plenty of loud, obvious messages that come up saying that that's the case, and including the instructions for disabling it. I've also added a comment where the telemetry is (re-)enabled by this package.

@@ -22,6 +22,7 @@ dependencies = [
"jsonschema",
"pooch",
"xarray",
"access-py-telemetry>=0.1.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find this on conda.... will that break the conda package build?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I don't know the answer. I've put the package onto PyPI and set the CI environments to find it with pip. I'll find out what I need to do to add it to conda too for completeness.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this end up getting covered?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, I haven't got a chance yet - I'll take a look at it now.

cat_version = data._captured_init_kwargs.get("metadata", {}).get(
"version", "latest"
) # Get the catalog version number and set it to "latest" if it can't be found
configure_telemetry(["--enable", "--silent"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a message be emitted here to warn the user that the telemetry has been turned on, and how to turn it back off if they so wish?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there's a wider question here about how we implement telemetry. I'd defer to @aidanheerdegen's judgement here on:

  1. Whether we alert users about telemetry - I don't see any reason not to let users know, but I wonder about whether letting users know about telemetry at import time is the right place to do so. We have added some telemetry notices elsewhere, maybe adding it to the docs here would be appropriate?.
  2. Whether we allow them to turn it off. Allow is a bit of a misnomer here - you can always turn off telemetry if it's configurable through a python function with the right permissions (as it is here), but if we inform users & tell them how to disable it, we run the risk of losing a lot of information.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B I am a bit of tinfoil hat and if I got a notice regarding telemetry and how to disable it, I would always do so. I'm assuming that is pretty common, but I don't know for sure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps flash the message up at load time saying we're using telemetry, but leave the instructions/details to the docs? Then the true tin-foilers can look into how to disable it, but most might not bother?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this might be the best option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @charles-turner-1 I think if we flash messages every time a user loads the package, they will just deactivate telemetry and we will use important data.

Any chance we could do that at install time?

Otherwise flagging it in the documentation is good enough for me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where should we add in the documentation? The usage FAQs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want to let users know what we're collecting and why. For data users it's a very compelling case: we will rely on the telemetry to make important resource allocation decisions. If they turn off the telemetry we won't know they're using a dataset, which means it risks being removed/de-prioritised.

That sort of involved argument can only be made in the docs I think.

I wonder if there is a possibility that we could also give users some useful information from the telemetry so they would see some immediate value in it?

Like could they call a function that reports a summary of what they've accessed in a session? For their own interest?

Or do we need to worry about people gaming the system so the datasets they're interested in are well used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want to let users know what we're collecting and why. For data users it's a very compelling case: we will rely on the telemetry to make important resource allocation decisions. If they turn off the telemetry we won't know they're using a dataset, which means it risks being removed/de-prioritised.

That sort of involved argument can only be made in the docs I think.

I wonder if there is a possibility that we could also give users some useful information from the telemetry so they would see some immediate value in it?

Like could they call a function that reports a summary of what they've accessed in a session? For their own interest?

This is definitely doable & should be relatively straightforward.

Or do we need to worry about people gaming the system so the datasets they're interested in are well used?

My guess is probably not - I imagine that the amount of noisy signal generated from playing around with datasets to obtain what a user would want is going to be pretty large anyway, just from the way notebooks tend to be used. Gaming that would be kind of tough.

With that said, we can easily enough deduplicate queries when we look at telemetry - I imagine if someone is trying to game the telemetry, it would be something like

for _ in range(1e6):
   catalog.search(experiment='my_very_important_experiment', variable='my_very_important_variable')

All those calls will leave identical records in the telemetry (besides timestamp & primary key) - so should be straightforward to deduplicate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should implement a throttle mechanism to prevent recording identical telemetry calls within a short time window (e.g., a few seconds). This will help mitigate excessive server requests and protect against potential overload.

@marc-white
Copy link
Collaborator

I've had a prospective data owner ask about the possibility of getting usage statistics/information for the dataset they're proposing to add to the catalog. Does this PR track the usage down to that level?

@charles-turner-1
Copy link
Collaborator Author

I would need to double check, but we'll definitely know how many times a particular datastore has been searched for/opened. Potentially with some more work we can make things more granular - the telemetry package is designed to be pretty general and I think we should extend it in whatever direction we find helps collect useful data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

4 participants