Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove percentile counter caching #111

Merged

Conversation

copperlight
Copy link
Collaborator

A metrics reporting issue was observed where infrequent updates to percentile timers resulted in inconsistent reporting of metrics values. Unusual peaks and valleys were seen on the query side. Capturing measurement reporting from the spectatord process revealed that the base timer values were being reported in a consistent manner, but some of the percentile counters were missing.

The percentile timer meters kept a local cache of the percentile counters, with shared pointer references to corresponding meters in the Registry. When the Registry expires meters, it does so based upon last update time, rather than the number of pointers still held. Thus, infrequently used percentile counters were removed from the Registry, but remained within the local percentile timer cache. The cached values would be updated, but this would not translate to the Registry, and there was no opportunity to recreate them.

The fix is to remove the local percentile counter cache from the percentile timer meter, and rely on the Registry to keep track of all counters. If one is expired, it will then be recreated. This will add a little bit of overhead, but it should be minimal.

The same applies to the percentile distribution summaries.

@copperlight copperlight force-pushed the remove-percentile-caching branch from 336b0bb to 6461d38 Compare December 2, 2024 21:15
@copperlight copperlight force-pushed the remove-percentile-caching branch 5 times, most recently from aa857b6 to 075e1c2 Compare December 2, 2024 23:38
A metrics reporting issue was observed where infrequent updates to percentile
timers resulted in inconsistent reporting of metrics values. Unusual peaks and
valleys were seen on the query side. Capturing measurement reporting from the
spectatord process revealed that the base timer values were being reported in
a consistent manner, but some of the percentile counters were missing.

The percentile timer meters kept a local cache of the percentile counters, with
shared pointer references to corresponding meters in the Registry. When the
Registry expires meters, it does so based upon last update time, rather than
the number of pointers still held. Thus, infrequently used percentile counters
were removed from the Registry, but remained within the local percentile timer
cache. The cached values would be updated, but this would not translate to the
Registry, and there was no opportunity to recreate them.

The fix is to remove the local percentile counter cache from the percentile
timer meter, and rely on the Registry to keep track of all counters. If one is
expired, it will then be recreated. This will add a little bit of overhead, but
it should be minimal.

The same applies to the percentile distribution summaries.
@copperlight copperlight force-pushed the remove-percentile-caching branch from 075e1c2 to 6942162 Compare December 2, 2024 23:41
@copperlight copperlight merged commit 81c0cae into Netflix-Skunkworks:main Dec 3, 2024
1 check passed
@copperlight copperlight deleted the remove-percentile-caching branch December 3, 2024 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants