MVP: Cost attribution #10269
@@ -502,6 +525,18 @@ func (s *seriesStripe) remove(ref storage.SeriesRef) {
}

s.active--
if s.cat != nil {
	if idx == nil {
Same here, we should assume this isn't nil. Just skipping the removal will break the numbers forever.
vendor and update in commit 4706bde
Please update the active series tracker tests with the costattribution.Tracker, otherwise the new code isn't tested.
addressed in 17b64a9
pkg/ingester/ingester.go
Outdated
idx, err := db.Head().Index()
if err != nil {
	level.Warn(i.logger).Log("msg", "failed to get the index of the TSDB head", "user", userID, "err", err)
	idx = nil
}
As commented previously, we should never proceed without an index.
If you check the implementation of db.Head().Index(), it never returns an error. We have three options here:
- Skip tenants if they don't have an index: this is the least-effort one.
- Panic if err is not nil; this is ugly.
- Update mimir-prometheus to add a MustIndex() IndexReader method that does not return an error, and use that one.
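To illustrate the third option, here is a minimal sketch of what such an accessor could look like. `Head`, `IndexReader`, and `headIndex` below are hypothetical stand-ins, not the real mimir-prometheus types; the only assumption taken from the comment above is that building the index reader cannot fail.

```go
package main

import "fmt"

// IndexReader is a hypothetical stand-in for the TSDB index reader interface.
type IndexReader interface {
	Name() string
}

// headIndex is a hypothetical concrete reader used only by this sketch.
type headIndex struct{}

func (headIndex) Name() string { return "head-index" }

// Head is a hypothetical stand-in for the TSDB head.
type Head struct{}

// index builds the reader. In this sketch it cannot fail, which mirrors the
// observation that Index() never returns a non-nil error.
func (h *Head) index() IndexReader { return headIndex{} }

// Index keeps the existing error-returning signature for current callers.
func (h *Head) Index() (IndexReader, error) { return h.index(), nil }

// MustIndex exposes the same reader without an error, so callers have no
// "index is nil" branch to handle.
func (h *Head) MustIndex() IndexReader { return h.index() }

func main() {
	h := &Head{}
	fmt.Println(h.MustIndex().Name())
}
```

With an accessor like this, the caller in the ingester would not need the warning log or the `idx = nil` fallback at all.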
PR in mimir-prometheus grafana/mimir-prometheus#811
vendor and update in commit 4706bde
pkg/mimir/modules.go
Outdated
if t.Cfg.CostAttributionRegistryPath != "" {
	reg := prometheus.NewRegistry()
	var err error
	t.CostAttributionManager, err = costattribution.NewManager(3*time.Minute, time.Minute, t.Cfg.CostAttributionEvictionInterval, util_log.Logger, t.Overrides, reg)
I think these values should not be hardcoded.
removed unused parameter b27e379
Thanks for updating the docs! I left a few suggestions.
pkg/costattribution/tracker.go
Outdated
func (t *Tracker) IncrementReceivedSamples(req *mimirpb.WriteRequest, now time.Time) {
	if t == nil {
		return
	}

	dict := make(map[string]int)
	for _, ts := range req.Timeseries {
		lvs := t.extractLabelValuesFromLabelAdapater(ts.Labels)
		dict[t.hashLabelValues(lvs)] += len(ts.TimeSeries.Samples) + len(ts.TimeSeries.Histograms)
	}
This is the hottest path in our application, so we should optimize it as much as possible.
Why do we need to build a new data structure (which escapes to heap) which holds []mimirpb.LabelAdapter slices that escape to heap, which creates a string that escapes to heap, just to put it into a dict that I don't think should be a map, even if we need some data structure?
Can we just extract the labelValues byte slices, recycled from a pool, and process each one separately in the loop below?
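A rough sketch of the pooled approach suggested above, for illustration only. `Label`, `bufPool`, `writeKey`, `countSamples`, and the `0xff` separator are all hypothetical names and choices, not the code that landed in the PR. The idea is to reuse one buffer per request and hand each series' key bytes to a callback instead of building an intermediate map.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Label is a hypothetical stand-in for mimirpb.LabelAdapter.
type Label struct{ Name, Value string }

// bufPool recycles buffers so that building a key does not allocate per series.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// sep is an assumed separator byte; 0xff never appears in valid UTF-8.
const sep = byte(0xff)

// writeKey appends the values of the tracked labels, in order, to buf.
// A tracked label that is missing from lbls contributes an empty value.
func writeKey(buf *bytes.Buffer, tracked []string, lbls []Label) {
	for i, name := range tracked {
		if i > 0 {
			buf.WriteByte(sep)
		}
		for _, l := range lbls {
			if l.Name == name {
				buf.WriteString(l.Value)
				break
			}
		}
	}
}

// countSamples processes each series separately and passes its key bytes and
// sample count to observe. The key slice is only valid during the callback.
func countSamples(tracked []string, series [][]Label, samples []int, observe func(key []byte, n int)) {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	for i, lbls := range series {
		buf.Reset()
		writeKey(buf, tracked, lbls)
		observe(buf.Bytes(), samples[i])
	}
}

func main() {
	totals := map[string]int{}
	series := [][]Label{
		{{"team", "a"}, {"env", "prod"}},
		{{"team", "b"}},
		{{"team", "a"}, {"env", "prod"}},
	}
	countSamples([]string{"team", "env"}, series, []int{2, 3, 5}, func(key []byte, n int) {
		totals[string(key)] += n // the map here is only for the demo output
	})
	fmt.Println(len(totals), "distinct attribution keys")
}
```

In a real tracker the callback would look up the existing observation for the key and add to an atomic counter, so the per-request map from the original hunk disappears entirely.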
as paired, addressed in this commit a2ffe5a
pkg/costattribution/tracker.go
Outdated
out <- prometheus.MustNewConstMetric(t.activeSeriesPerUserAttribution, prometheus.GaugeValue, t.overflowCounter.activeSerie.Load(), t.overflowLabels[:len(t.overflowLabels)-1]...)
out <- prometheus.MustNewConstMetric(t.receivedSamplesAttribution, prometheus.CounterValue, t.overflowCounter.receivedSample.Load(), t.overflowLabels[:len(t.overflowLabels)-1]...)
out <- prometheus.MustNewConstMetric(t.discardedSampleAttribution, prometheus.CounterValue, t.overflowCounter.totalDiscarded.Load(), t.overflowLabels...)
Why are we doing the sub-slicing here to the length of the slice? That sounds like a noop.
Since the discarded sample metric includes an additional "reason" label compared to the other two, it is the only one that contains the full set of overflow labels. The other two metrics will have one fewer label (i.e., len(overflow)-1).
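A tiny illustration of why the sub-slice is not a no-op: slicing to `len-1` drops the last element. `withoutReason` and the label values are made up for this example; the only assumption taken from the reply is that the trailing entry belongs to the extra "reason" label.

```go
package main

import "fmt"

// withoutReason returns the label values minus the trailing "reason" value.
// It assumes the slice holds at least one element.
func withoutReason(overflowLabels []string) []string {
	return overflowLabels[:len(overflowLabels)-1]
}

func main() {
	// Hypothetical values: attribution label, tenant, and the trailing reason.
	overflowLabels := []string{"__overflow__", "tenant-1", "some-reason"}

	fmt.Println(len(overflowLabels))                // full set, used by the discarded samples metric
	fmt.Println(len(withoutReason(overflowLabels))) // one fewer, used by the other two metrics
}
```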
pkg/costattribution/tracker.go
Outdated
if _, exists := t.observed[key]; exists {
	return
}
I think this is wrong. Sounds like we should still increment the numbers, right? Otherwise we didn't count this activeSerie, and when we delete it, we'll go into negative numbers.
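The concern can be reproduced with a small single-goroutine sketch. The types and method names below are hypothetical, and the buggy variant assumes the first sighting of a key creates the observation with a count of one; the actual code in the PR may have differed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type observation struct{ activeSeries atomic.Int64 }

// tracker is a simplified, non-concurrent stand-in for the real tracker.
type tracker struct{ observed map[string]*observation }

// incrementBuggy mimics the early return: a second series with the same
// attribution key is never counted.
func (t *tracker) incrementBuggy(key string) {
	if _, exists := t.observed[key]; exists {
		return
	}
	o := &observation{}
	o.activeSeries.Add(1)
	t.observed[key] = o
}

// increment counts every series, creating the observation on first use.
func (t *tracker) increment(key string) {
	o, exists := t.observed[key]
	if !exists {
		o = &observation{}
		t.observed[key] = o
	}
	o.activeSeries.Add(1)
}

// decrement is called once per removed series.
func (t *tracker) decrement(key string) {
	if o, exists := t.observed[key]; exists {
		o.activeSeries.Add(-1)
	}
}

// addTwoRemoveTwo adds two series with the same key, removes both, and
// returns the resulting active series count.
func addTwoRemoveTwo(inc func(*tracker, string)) int64 {
	t := &tracker{observed: map[string]*observation{}}
	inc(t, "team-a")
	inc(t, "team-a")
	t.decrement("team-a")
	t.decrement("team-a")
	return t.observed["team-a"].activeSeries.Load()
}

func main() {
	fmt.Println("early return:", addTwoRemoveTwo((*tracker).incrementBuggy))
	fmt.Println("always increment:", addTwoRemoveTwo((*tracker).increment))
}
```

With the early return the count ends below zero after both series are removed, while the always-increment variant returns to zero.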
fixed in f28d672
pkg/costattribution/tracker.go
Outdated
// Aggregate active series from all keys into the overflow counter.
for _, o := range t.observed {
	if o != nil {
How can o be nil?
removed in f28d672
pkg/costattribution/tracker.go
Outdated
o.lastUpdate.Store(ts)
if activeSeriesIncrement != 0 {
	o.activeSerie.Add(activeSeriesIncrement)
}
We didn't check the overflow here, so we're incrementing something that isn't being used anymore, which means that the overflow number is wrong.
If we want to keep the overflow number correct, we need to handle these race conditions (and I don't think it will be easy).
pkg/costattribution/tracker.go
Outdated
previousOverflow = t.isOverflow.Swap(true)
if !previousOverflow {
	// Initialize the overflow counter.
	t.overflowCounter = &observation{}
There should be some kind of concurrency coordination here on setting this property.
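One possible form of that coordination, sketched under assumptions: instead of a separate boolean plus a plain pointer field, the counter itself is published through an `atomic.Pointer`, so a non-nil pointer doubles as the "in overflow" flag. The `tracker` and `observation` types, `enterOverflow`, and `hammer` are hypothetical and simplified; this is not what the PR ended up doing, since the logic was later removed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type observation struct{ receivedSample atomic.Int64 }

// tracker is a simplified stand-in; a non-nil overflowCounter means the
// tracker is in overflow state.
type tracker struct {
	overflowCounter atomic.Pointer[observation]
}

// enterOverflow publishes the overflow counter at most once and returns the
// counter that every goroutine shares.
func (t *tracker) enterOverflow() *observation {
	if o := t.overflowCounter.Load(); o != nil {
		return o
	}
	// Only one goroutine wins the swap; the others pick up the winner's value.
	t.overflowCounter.CompareAndSwap(nil, &observation{})
	return t.overflowCounter.Load()
}

// hammer calls enterOverflow from n goroutines (n must be at least 1) and
// returns the total recorded on the shared counter.
func hammer(tr *tracker, n int) int64 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tr.enterOverflow().receivedSample.Add(1)
		}()
	}
	wg.Wait()
	return tr.overflowCounter.Load().receivedSample.Load()
}

func main() {
	fmt.Println(hammer(&tracker{}, 100))
}
```

Because readers only ever see nil or a fully constructed counter, no increment can land on a counter that is later replaced.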
this logic is gone in f28d672
💻 Deploy preview available: https://deploy-preview-mimir-10269-zb444pucvq-vp.a.run.app/docs/mimir/latest/
CHANGELOG.md
Outdated
@@ -2,6 +2,8 @@

## main / unreleased

* [FEATURE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
This should go in Grafana Mimir below.
addressed in 68410c8
pkg/costattribution/manager.go
Outdated
mstx sync.RWMutex
sampleTrackersByUserID map[string]*SampleTracker
inactiveTimeout time.Duration
cleanupInterval time.Duration

matx sync.RWMutex
activeTrackersByUserID map[string]*ActiveSeriesTracker
I'm slightly confused here: what do mstx and matx stand for?
Renamed to stmtx and atmtx, which stand for sample tracker mutex and active tracker mutex, in commit 68410c8
pkg/costattribution/manager.go
Outdated
inactiveTimeout time.Duration
cleanupInterval time.Duration
These don't need to be under the mutex, right? Can you move them up?
I wanted to put it near sampleTracker since they are not used by activeSeriesTracker. I can move them to the common area.
moved in commit 68410c8
func (o *Overrides) MaxCostAttributionLabelsPerUser(userID string) int {
	return o.getOverridesForUser(userID).MaxCostAttributionLabelsPerUser
}
Is this a user setting? Can we set a fixed upper bound for this?
capped to 4 in this commit and test updated 68410c8
Co-authored-by: Oleg Zaytsev <[email protected]>
@@ -183,6 +185,11 @@ limits:
ha_cluster_label: ha_cluster
ha_replica_label: ha_replica
ha_max_clusters: 10

cost_attribution_labels: "container"
This should be an array, right? Not a single string.
It would be a string separated by commas; I'll update the config to make that clearer.
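For readers wondering how a comma-separated value maps to a list of labels, here is a minimal sketch. `parseCostAttributionLabels` is a hypothetical helper, not the parser used by Mimir's limits; trimming whitespace and dropping empty entries are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCostAttributionLabels splits a comma-separated setting into label
// names, trimming surrounding whitespace and dropping empty entries.
func parseCostAttributionLabels(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(parseCostAttributionLabels("container"))
	fmt.Println(parseCostAttributionLabels("container, namespace"))
}
```

Under this reading, a single label such as "container" is simply a one-element list.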
What this PR does
This is the follow-up to #9733.
This PR intends to export extra attributed metrics in the distributor and ingester, in order to get samples received, samples discarded, and active series attributed by cost attribution label.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.