Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MVP: Cost attribution #10269

Open
wants to merge 84 commits into
base: main
Choose a base branch
from
Open

MVP: Cost attribution #10269

wants to merge 84 commits into from

Conversation

ying-jeanne
Copy link
Contributor

@ying-jeanne ying-jeanne commented Dec 17, 2024

What this PR does

This is the follow up of #9733,

The PR intent to export extra attributed metrics in distributor and ingester, in order to get sample received, sample discarded and active_series attributed by cost attribution label.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@ying-jeanne ying-jeanne force-pushed the final-cost-attribution branch from 5165a5b to 6f36b5f Compare December 17, 2024 21:07
@ying-jeanne ying-jeanne changed the title Final cost attribution MVP: Cost attribution Dec 17, 2024
@ying-jeanne ying-jeanne requested a review from colega December 17, 2024 21:10
@ying-jeanne ying-jeanne force-pushed the final-cost-attribution branch from 6f36b5f to 077a94a Compare December 17, 2024 22:01
@ying-jeanne ying-jeanne force-pushed the final-cost-attribution branch from 077a94a to f04c28f Compare December 17, 2024 22:08
@ying-jeanne ying-jeanne marked this pull request as ready for review December 17, 2024 22:13
@ying-jeanne ying-jeanne requested review from tacole02 and a team as code owners December 17, 2024 22:13
pkg/costattribution/manager.go Outdated Show resolved Hide resolved
pkg/costattribution/manager.go Outdated Show resolved Hide resolved
pkg/costattribution/manager.go Outdated Show resolved Hide resolved
pkg/costattribution/manager.go Outdated Show resolved Hide resolved
pkg/ingester/activeseries/active_series.go Outdated Show resolved Hide resolved
@@ -502,6 +525,18 @@ func (s *seriesStripe) remove(ref storage.SeriesRef) {
}

s.active--
if s.cat != nil {
if idx == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we should assume this isn't nil. Just skipping the removal will break the numbers forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vendor and update in commit 4706bde

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the active series tracker tests with the costattribution.Tracker, otherwise the new code isn't tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 17b64a9

Comment on lines 3276 to 3280
idx, err := db.Head().Index()
if err != nil {
level.Warn(i.logger).Log("msg", "failed to get the index of the TSDB head", "user", userID, "err", err)
idx = nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented previously, we should never proceed without an index.

If you check the implementation of db.Head().Index() it never returns an error. We have three options here:

  1. Skip tenants if they don't have index: this is the least effort one.
  2. Panic if err is not nil, this is ugly
  3. Update mimir-prometheus to add a MustIndex() IndexReader method that does not return an error, and use that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR in mimir-prometheus grafana/mimir-prometheus#811

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vendor and update in commit 4706bde

if t.Cfg.CostAttributionRegistryPath != "" {
reg := prometheus.NewRegistry()
var err error
t.CostAttributionManager, err = costattribution.NewManager(3*time.Minute, time.Minute, t.Cfg.CostAttributionEvictionInterval, util_log.Logger, t.Overrides, reg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these values should not be hardcoded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed unused parameter b27e379

Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docs! I left a few suggestions.

Comment on lines 164 to 173
func (t *Tracker) IncrementReceivedSamples(req *mimirpb.WriteRequest, now time.Time) {
if t == nil {
return
}

dict := make(map[string]int)
for _, ts := range req.Timeseries {
lvs := t.extractLabelValuesFromLabelAdapater(ts.Labels)
dict[t.hashLabelValues(lvs)] += len(ts.TimeSeries.Samples) + len(ts.TimeSeries.Histograms)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the hottest path on our application, we should optimize it as much as possible.
Why do we need to build a new data structure (which escapes to heap) with holds []mimirpb.LabelAdapter slices that escape to heap, which create a string that escapes to heap just to put it into a dict that I don't think should be a map, even if we need some data structure.

Can we just extract the labelValues byte slices, recycled from a pool, and process each one separately in the loop below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as paired, addressed in this commit a2ffe5a

Comment on lines 127 to 129
out <- prometheus.MustNewConstMetric(t.activeSeriesPerUserAttribution, prometheus.GaugeValue, t.overflowCounter.activeSerie.Load(), t.overflowLabels[:len(t.overflowLabels)-1]...)
out <- prometheus.MustNewConstMetric(t.receivedSamplesAttribution, prometheus.CounterValue, t.overflowCounter.receivedSample.Load(), t.overflowLabels[:len(t.overflowLabels)-1]...)
out <- prometheus.MustNewConstMetric(t.discardedSampleAttribution, prometheus.CounterValue, t.overflowCounter.totalDiscarded.Load(), t.overflowLabels...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing the sub-slicing here to the length of the slice? That sounds like a noop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the discarded sample metric includes an additional "reason" label compared to the other two, it is the only one that contains the full set of overflow labels. The other two metrics will have one fewer label (i.e., len(overflow)-1).

Comment on lines 310 to 312
if _, exists := t.observed[key]; exists {
return
}
Copy link
Contributor

@colega colega Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. Sounds like we should still increment the numbers, right? Otherwise we didn't count this activeSerie, and when we delete it, we'll go into negative numbers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in f28d672


// Aggregate active series from all keys into the overflow counter.
for _, o := range t.observed {
if o != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can o be nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in f28d672

Comment on lines 251 to 254
o.lastUpdate.Store(ts)
if activeSeriesIncrement != 0 {
o.activeSerie.Add(activeSeriesIncrement)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't check the overflow here, so we're incremeting something that isn't being used anymore, which means that the overflow number is wrong.

If we want to keep the overflow number correct, we need to handle these race conditions (and I don't think it will be easy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as paired, it is addressed in a2ffe5a and f28d672

previousOverflow = t.isOverflow.Swap(true)
if !previousOverflow {
// Initialize the overflow counter.
t.overflowCounter = &observation{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be some kind of concurrency coordination here on setting this property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is gone in f28d672

Copy link
Contributor

github-actions bot commented Jan 13, 2025

💻 Deploy preview available: https://deploy-preview-mimir-10269-zb444pucvq-vp.a.run.app/docs/mimir/latest/
🏗️ Updating deploy preview...

CHANGELOG.md Outdated
@@ -2,6 +2,8 @@

## main / unreleased

* [FEATURE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in Grafana Mimir below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 68410c8

@ying-jeanne ying-jeanne requested a review from colega January 14, 2025 11:25
Comment on lines 32 to 38
mstx sync.RWMutex
sampleTrackersByUserID map[string]*SampleTracker
inactiveTimeout time.Duration
cleanupInterval time.Duration

matx sync.RWMutex
activeTrackersByUserID map[string]*ActiveSeriesTracker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly confused here: what do mstx and matx stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed stmtx and atmtx represent sample tracker mutex and active tracker mutex in commit 68410c8

Comment on lines 34 to 35
inactiveTimeout time.Duration
cleanupInterval time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These don't need to be under the mutex, right? Can you move them up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to put it near sampleTracker since they are not used by activeSeriesTracker. I can move them to the common area.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved in commit 68410c8

pkg/distributor/distributor.go Outdated Show resolved Hide resolved
Comment on lines +821 to +823
func (o *Overrides) MaxCostAttributionLabelsPerUser(userID string) int {
return o.getOverridesForUser(userID).MaxCostAttributionLabelsPerUser
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a user setting? I can we make a fixed upper bound to this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capped to 4 in this commit and test updated 68410c8

@ying-jeanne ying-jeanne requested a review from colega January 15, 2025 11:23
@@ -183,6 +185,11 @@ limits:
ha_cluster_label: ha_cluster
ha_replica_label: ha_replica
ha_max_clusters: 10

cost_attribution_labels: "container"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an array, right? Not a single string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be string seperated by comma, would update the config to make it more clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants