Write custom errors package with stack trace functionality #5239
[WIP] How does the output look, @matej-g?
This looks good overall! Awesome work! 💫
Would love to see some more examples of this in action.
Just some suggestions.
// The idea of writing errors package in thanos is highly motivated from the Tast project of Chromium OS Authors. However, instead of
// copying the package, we end up writing our own simplified logic borrowing some ideas from the errors and github.com/pkg/errors.
// A big thanks to all of them.
I wonder if we need to mention Chromium's license here too. I believe we do it for some Prometheus code that we use.
Hi, thanks for the review. I am wondering about that too. To be honest, the current program is not an exact replica of the Chromium OS package: here and there I got rid of some unnecessary code and interfaces, used strings.Builder for efficient formatting, and changed the return signature of a few functions.
Here's their license https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast/LICENSE
What do you think we should do here?
I think it's fine if we just add a line pointing to this license here, as we don't copy but rather adapt code from Chromium. But I don't feel too strongly about this either way. 🙂
WDYT @matej-g?
Makefile
Outdated
@@ -360,7 +360,8 @@ go-lint: check-git deps $(GOLANGCI_LINT) $(FAILLINT)
 	$(call require_clean_work_tree,'detected not clean work tree before running lint, previous job changed something?')
 	@echo ">> verifying modules being imported"
 	@# TODO(bwplotka): Add, Printf, DefaultRegisterer, NewGaugeFunc and MustRegister once exception are accepted. Add fmt.{Errorf}=github.com/pkg/errors.{Errorf} once https://github.com/fatih/faillint/issues/10 is addressed.
-	@$(FAILLINT) -paths "errors=github.com/pkg/errors,\
+	@$(FAILLINT) -paths "errors=github.com/thanos-io/thanos/pkg/errors,\
+github.com/pkg/errors=github.com/thanos-io/thanos/pkg/errors,\
I think we should replace all the instances of /pkg/errors in a separate PR instead of unnecessarily bloating this one.
💥➜ thanos git:(pkg/errors) ✗ make lint 2>&1 | wc -l
144
Your views @matej-g?
Definitely a separate PR, yes.
Do I understand this implementation correctly, that it's mostly a shim for the stdlib errors packages that supports stack traces as part of errors now?
Hi @metalmatze, yes you are right. It extends the functionality of the errors package by combining
In simple words, a very minimalistic, simple and readable replacement of
I wonder if we should not start by adding this error lib in https://github.com/efficientgo/tools/tree/main/core to keep util things as ultra small module?
Hi, thanks for the suggestion @bwplotka and I am sorry for the late reply. I just took a look at the package and especially the
// merr := merrors.New(err1)
// merr.Add(err2, errOrNil3)
// for _, err := range errs {
// merr.Add(err)
// }
// return merr.Err()
Hey, I am new to the community and I don't have much context about the discussions and decisions : ) Here I just tried to do some research and share my findings if the Thanks : )
Hey @bisakhmondal , thanks for the detailed analysis! :) To address some of your concerns,
Hmm, I'd say there is definitely utility in having For example, say in our docs CI, we have a tool running called mdox, which checks each markdown file and ensures that they are correctly formatted and have correct links. You run it once via If we were to just return the very first error encountered, users would have to run it again and again until they have fixed each error, which is definitely not optimal and would be a painful process. There are use cases for this in Thanos too, for example, the Line 49 in 19dcc79
You'd also find such cases scattered throughout the codebase. So there are definitely use cases for this! I think there are also other implementations of this like https://github.com/hashicorp/go-multierror. But a question I have for your point is, why does the existence of
Not sure what you mean here. I think @bwplotka's point was to add this as a separate "errors" package in the tools repo. Not as an extension or modification to the existing
Hmm. Why would it destroy backward compatibility? We aren't changing/removing any of the existing functionality, right? If this package is added to I might be understanding something differently about your points here, so correct me if I'm wrong :).
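The argument above for collecting every error instead of returning the first one can be sketched as follows. This is only an illustration: the multiError type, its methods, and the checkFiles rule are hypothetical names made up for this example, and they are not the merrors API from efficientgo/tools.

```go
package main

import (
	"fmt"
	"strings"
)

// multiError is a hypothetical aggregate error used for illustration only.
type multiError struct {
	errs []error
}

// Add records every non-nil error it is given.
func (m *multiError) Add(errs ...error) {
	for _, err := range errs {
		if err != nil {
			m.errs = append(m.errs, err)
		}
	}
}

// Err returns nil when nothing was recorded, so callers can keep the usual
// `if err != nil` check.
func (m *multiError) Err() error {
	if len(m.errs) == 0 {
		return nil
	}
	return m
}

// Error joins all recorded messages into a single line.
func (m *multiError) Error() string {
	msgs := make([]string, 0, len(m.errs))
	for _, err := range m.errs {
		msgs = append(msgs, err.Error())
	}
	return fmt.Sprintf("%d errors: %s", len(m.errs), strings.Join(msgs, "; "))
}

// checkFiles validates every file and reports all failures at once.
// The rule (names must end in ".md") is invented for the example.
func checkFiles(names []string) error {
	var merr multiError
	for _, name := range names {
		if !strings.HasSuffix(name, ".md") {
			merr.Add(fmt.Errorf("%s: not a markdown file", name))
		}
	}
	return merr.Err()
}

func main() {
	fmt.Println(checkFiles([]string{"a.md", "b.txt", "c.go"}))
}
```

With this shape a user sees both failing files in one run, rather than fixing one file and re-running to discover the next.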
Ahh, Thanks for the clarification @saswatamcode. I misinterpreted @bwplotka's comment thinking we are going to use the error package ( Definitely, there are certain use cases where combining multiple errors makes sense. As I said, I misinterpreted that we are going to make this a default throughout the thanos project which actually doesn't make sense.
Yep, I got it now : ) I am totally fine with keeping the error package as a part of thanos project or inside efficientgo/tools. I have no strong opinion here : )
Correct. I just mentioned utility. (:
But we can start small with just implementing it here for now
Great! Would you mind giving it a quick look and sharing some reviews? I'll write the test suite then : )
Great job. I have a few proposals - feel free to challenge old habits that people follow but that don't make sense. Especially Wrap vs Wrapf is something we could improve 🤔
Let's make sure it's clear and ready.
We also need at least some basic tests. Thanks!
pkg/errors/errors.go
Outdated
// Errorf creates a new error with the given message and a stacktrace in details.
// An alternative to fmt.Errorf function.
func Errorf(format string, args ...interface{}) error {
Same here - again - it's our package, we can innovate.
What about only errors.Newf and errors.Wrapf?
Or even naming them without the f but still supporting fmt-style formatting? No harm here, but a simpler API for everyone. So errors.New(format string, args ...interface{}) and errors.Wrap(err, format string, args ...interface{}) would work too.
Yeah, as it's going to be our custom error package, having simpler APIs would be very useful in the long run.
I was just a bit worried about the performance penalty of the extra iteration over the format string (though it's linear, O(N) (ref)). So I ran a quick benchmark, and the results are equivalent in terms of memory and time.
func benchWrap(errorMsg *string, b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = errors.Wrap(errors.New("random error"), *errorMsg)
	}
}
func benchWrapf(errorMsg *string, b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Exercise the formatting variant so the two benchmarks actually differ.
		_ = errors.Wrapf(errors.New("random error"), *errorMsg)
	}
}
// >>> len("something terrible has happened.")
// 32
var (
len32 = "something terrible has happened."
len64 = "something terrible has happened.something terrible has happened."
len128 = "something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened."
len512 = "something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened.something terrible has happened."
)
func BenchmarkWrap_32(b *testing.B) { benchWrap(&len32, b) }
func BenchmarkWrapf_32(b *testing.B) { benchWrapf(&len32, b) }
func BenchmarkWrap_64(b *testing.B) { benchWrap(&len64, b) }
func BenchmarkWrapf_64(b *testing.B) {
benchWrapf(&len64, b)
}
func BenchmarkWrap_512(b *testing.B) {
benchWrap(&len512, b)
}
func BenchmarkWrapf_512(b *testing.B) {
benchWrapf(&len512, b)
}
//🔥➜ errors git:(pkg/errors) ✗ go test -bench=. -benchmem
//goos: linux
//goarch: amd64
//pkg: github.com/thanos-io/thanos/pkg/errors
//cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
//BenchmarkWrap_32-8 909074 1301 ns/op 256 B/op 2 allocs/op
//BenchmarkWrapf_32-8 942794 1297 ns/op 256 B/op 2 allocs/op
//BenchmarkWrap_64-8 936003 1289 ns/op 256 B/op 2 allocs/op
//BenchmarkWrapf_64-8 941866 1288 ns/op 256 B/op 2 allocs/op
//BenchmarkWrap_512-8 964816 1286 ns/op 256 B/op 2 allocs/op
//BenchmarkWrapf_512-8 883635 1285 ns/op 256 B/op 2 allocs/op
As you have suggested, it's no harm to trim the extra f from this package : )
Updating
Fun fact: it's actually faster and more memory-efficient than github.com/pkg/errors ^_^
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/errors
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkWrap_32-8 776246 1312 ns/op 256 B/op 2 allocs/op
BenchmarkWrap_pkg_errors_32-8 805689 1498 ns/op 640 B/op 7 allocs/op
BenchmarkWrap_64-8 767688 1324 ns/op 256 B/op 2 allocs/op
BenchmarkWrap_pkg_errors_64-8 831584 1523 ns/op 640 B/op 7 allocs/op
BenchmarkWrap_512-8 952338 1342 ns/op 256 B/op 2 allocs/op
BenchmarkWrap_pkg_errors_512-8 813520 1508 ns/op 640 B/op 7 allocs/op
Nice.
After some thinking, let's keep the f suffix so we are consistent with the lints/tooling that try to verify format strings. Just removing the non-f variants, while still being able to pass a format string without variables, is fine. WDYT?
Also next time, just in case, use b.ReportAllocs():
func benchWrapf(errorMsg *string, b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = errors.Wrapf(errors.New("random error"), *errorMsg)
	}
}
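A minimal, self-contained sketch of the advice above, assuming the standard testing package only. fmt.Errorf with %w stands in for the package's own Wrapf, which is not reproduced here, and allocsPerOp is a hypothetical helper added so the benchmark can run outside `go test`.

```go
package main

import (
	"errors"
	"fmt"
	"testing"
)

// benchWrapf measures wrapping an error with a format string.
// fmt.Errorf is a stand-in for the package's Wrapf (an assumption).
func benchWrapf(b *testing.B) {
	base := errors.New("random error")
	b.ReportAllocs() // include B/op and allocs/op in the result
	b.ResetTimer()   // exclude the setup above from the timing
	for i := 0; i < b.N; i++ {
		_ = fmt.Errorf("something terrible has happened: %w", base)
	}
}

// allocsPerOp runs the benchmark programmatically and returns allocs/op.
func allocsPerOp() int64 {
	return testing.Benchmark(benchWrapf).AllocsPerOp()
}

func main() {
	fmt.Println("allocs/op:", allocsPerOp())
}
```

Calling b.ReportAllocs() makes the allocation columns appear even when the -benchmem flag is not passed on the command line.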
Great! Thanks
// Is is a wrapper of built-in errors.Is. It reports whether any error in err's
// chain matches target. The chain consists of err itself followed by the sequence
// of errors obtained by repeatedly calling Unwrap.
func Is(err, target error) bool {
	return errors.Is(err, target)
}
I wonder if an alias, var Is = errors.Is, would not work? (And the same for As.)
It would work, updating
Hey, sorry, I have reverted to the old version.
Reason: using var Is = errors.Is would definitely work, but users would get a slightly worse IDE experience, because the IDE treats those functions as variables (which is what they actually are in the modified package). So auto-completion behaves a bit oddly (see the screenshot).
I dug in a little to find an alternative approach, but it seems popular packages do use wrapper functions when exposing client-side APIs from internal packages.
ref: https://github.com/temporalio/sdk-go/blob/fd0d1eb548eb0621a5395581cfe2c418704b007c/client/client.go#L435-L476
I hope you are okay with it, @bwplotka?
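The two options discussed above can be compared in a small sketch. The names IsVar, errNotFound and lookup are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// IsVar is a package-level variable that holds a function value. It can be
// called like a function, but tools list it as a variable and other code in
// the package could reassign it.
var IsVar = errors.Is

// Is is a thin wrapper function. It costs one extra declaration, but it is
// documented and completed as a regular function.
func Is(err, target error) bool {
	return errors.Is(err, target)
}

// errNotFound and lookup are hypothetical, used only to exercise both forms.
var errNotFound = errors.New("not found")

func lookup() error {
	return fmt.Errorf("lookup block: %w", errNotFound)
}

func main() {
	err := lookup()
	fmt.Println(IsVar(err, errNotFound), Is(err, errNotFound))
}
```

Both calls behave identically at run time; the difference is in how the symbol is declared. A declaration such as `type Is = errors.Is` would not compile, because a type alias needs a type on its right-hand side and errors.Is is a function.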
Makes sense, but it's worth adding a comment on why we did it this way 🙃
Sure : )
BTW I know why it did not work for you. We can do type Is = errors.Is I think. Let's merge and iterate (:
Hi, sure 👍 thanks for the suggestion, we can tackle it in the next PR. On second thought, I doubt Go will allow declaring a new type from a function definition (it can be done for a function signature).
func newStackTrace() stacktrace {
	const stackDepth = 16 // record maximum 16 frames (if available)

	pc := make([]uintptr, stackDepth)
This implementation has the same issue as pkg/errors, i.e. pc escapes to the heap, so all 16 uintptr values are kept in memory even if the stack is only 1 call frame deep. I have explained the issue in detail here, based on a real-world scenario.
Please try go build -gcflags='-m -m' ./pkg/errors/stacktrace.go and you will find:
pkg/errors/stacktrace.go:21:12: make([]uintptr, stackDepth) escapes to heap:
pkg/errors/stacktrace.go:21:12: flow: pc = &{storage for make([]uintptr, stackDepth)}:
pkg/errors/stacktrace.go:21:12: from make([]uintptr, stackDepth) (spill) at pkg/errors/stacktrace.go:21:12
pkg/errors/stacktrace.go:21:12: from pc := make([]uintptr, stackDepth) (assign) at
The solution is to allocate the buffer pc on the stack, like the original implementation, and then copy what's needed off it to prevent it from escaping to the heap.
Suggested change, replacing pc := make([]uintptr, stackDepth):
const maxDepth = 16
var pcs [maxDepth]uintptr // allocate on stack
n := runtime.Callers(3, pcs[:])
st := make(stacktrace, n)
copy(st, pcs[:n])
return st
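The suggestion above can be tried in isolation with the sketch below. It assumes a stacktrace type defined as a slice of uintptr and uses skip value 2 because the helper is called directly from main here; the PR's own skip count depends on its call depth. Whether the array really stays on the stack is decided by the compiler's escape analysis, so it is worth confirming with -gcflags='-m'.

```go
package main

import (
	"fmt"
	"runtime"
)

// stacktrace is assumed to be a slice of program counters.
type stacktrace []uintptr

// newStackTrace records up to maxDepth frames in a fixed-size array and
// copies only the used part into a right-sized slice.
func newStackTrace() stacktrace {
	const maxDepth = 16
	var pcs [maxDepth]uintptr // fixed-size buffer, a candidate for the stack

	// Skip runtime.Callers and newStackTrace itself.
	n := runtime.Callers(2, pcs[:])

	st := make(stacktrace, n) // heap allocation sized to the frames found
	copy(st, pcs[:n])
	return st
}

func main() {
	st := newStackTrace()
	fmt.Println(len(st) > 0, len(st) == cap(st))
}
```

The returned slice owns exactly n elements, so a shallow stack does not keep a 16-element array alive.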
Another thought: do we really need to hold on to the entire call frame until String() is called? At the expense of some compute, we could calculate the stack trace string immediately.
Nice nitpick and great catch @sthaha.
the pc escapes to heap thus the entirety of the 16 uintptr is kept in memory even if only the stack is only 1 call frame deep.
Yes, you are right - there is no point in holding an extra chunk of unused heap memory. And I think you are copying to another slice because, even though we return stacktrace(pc[:n]), the capacity of pc still remains 16 (allocated memory), right?
But this additional copy adds some extra complexity : ) (though it's very minor and negligible, as we are dealing with a length of 16). According to the compiler's -m output, the function's inlining cost increases from 88 to 96:
./stacktrace.go:20:6: cannot inline newStackTrace: function too complex: cost 96 exceeds budget 80
.
.
./stacktrace.go:20:6: cannot inline newStackTrace: function too complex: cost 88 exceeds budget 80
Since go1.2 we have 3 index slice capability where the third param can define the capacity of the newly created slice.
So just updating the return statement to return stacktrace(pc[:n:n]) would yield the same effect of the suggestion you proposed.
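A short sketch of the three-index (full) slice expression mentioned above, using a hypothetical helper named callers. It shows what the third index changes: the capacity visible through the returned slice.

```go
package main

import (
	"fmt"
	"runtime"
)

// callers returns the same recorded frames twice: once with a plain slice
// expression and once with a full slice expression that caps the capacity.
func callers() (plain, full []uintptr) {
	pc := make([]uintptr, 16)
	n := runtime.Callers(1, pc)
	return pc[:n], pc[:n:n]
}

func main() {
	plain, full := callers()
	fmt.Println(cap(plain) == 16, cap(full) == len(full))
}
```

One caveat worth keeping in mind: the full slice expression only limits the capacity that callers can see. The 16-element backing array created by make stays allocated for as long as the slice is reachable, so this is not identical in memory terms to copying into a slice of length n.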
Thanks a lot. That was really awesome.
Another thought is if we really need to hold on to the entire CallFrame until String() is called? At the expense of some compute, we could calculate Stacktrace string immediately.
I think we shouldn't do it, for the following reasons:
- It's compute expensive - the runtime needs to retrieve caller information.
- More memory usage - the String method is meant for human consumption, so naturally it adds a lot of extra information, text, etc. compared to the 16-element slice of type uintptr. During error chaining it would be worse.
- Not every error needs the stacktrace. Only for specific format verbs like %+v does the stacktrace get dumped recursively (only then is String called).
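The lazy behaviour described in the list above can be sketched with fmt.Formatter. This is an illustration under assumptions, not the package's actual code: tracedError, newTraced and render are hypothetical names, and only the output format string is borrowed from the discussion.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// stacktrace stores raw program counters; they are cheap to keep around.
type stacktrace []uintptr

// String resolves the program counters into readable frames. This is the
// expensive step, so it runs only when a caller asks for it.
func (s stacktrace) String() string {
	if len(s) == 0 {
		return ""
	}
	var b strings.Builder
	frames := runtime.CallersFrames(s)
	for {
		frame, more := frames.Next()
		fmt.Fprintf(&b, "> %s\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
	return b.String()
}

// tracedError is a hypothetical error type carrying a message and a stack.
type tracedError struct {
	msg   string
	stack stacktrace
}

func (e *tracedError) Error() string { return e.msg }

// Format prints the stack only for %+v; every other verb prints the message.
func (e *tracedError) Format(s fmt.State, verb rune) {
	if verb == 'v' && s.Flag('+') {
		fmt.Fprintf(s, "%s\n%s", e.msg, e.stack.String())
		return
	}
	fmt.Fprint(s, e.msg)
}

// newTraced captures the caller's stack without formatting it.
func newTraced(msg string) error {
	var pcs [16]uintptr
	n := runtime.Callers(2, pcs[:])
	return &tracedError{msg: msg, stack: append(stacktrace(nil), pcs[:n]...)}
}

// render formats err with the given verb; it exists to make the demo short.
func render(format string, err error) string {
	return fmt.Sprintf(format, err)
}

func main() {
	err := newTraced("boom")
	fmt.Println(render("%v", err))
	fmt.Print(render("%+v", err))
}
```

With this layout an error that is only ever printed with %v or %s never pays for frame resolution.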
Since go1.2 we have 3 index slice capability where the third param can define the capacity of the newly created slice.
💯 .. Thank you, I wasn't aware :)
I think we shouldn't do it for the following reasons ...
It's compute expensive - runtime needs to retrieve caller information
Have you tried a benchmark?
More memory usage - the String method is meant for human consumption so naturally, this adds a lot of extra information, text etc compared to the 16 element slice of type uintptr. During error chaining it would be worse.
What I meant was to hold on only to the information that you need, rather than to uintptr values which point to the call frame. IIUC, holding on to the call frame prevents the GC from cleaning up the call frame info until the error itself is garbage collected.
It may be worth looking at the in-use allocations if we don't hold on to the call frames.
I think we shouldn't do it for the following reasons ...
It's compute expensive - runtime needs to retrieve caller information
Have you tried a benchmark?
IIUC, the new approach you are suggesting would call String immediately, at the expense of an extra call that might not be required if the error is never logged with the "%+v" format verb. So we would be committing to extra computation that might be of no use. It will therefore be more expensive in time than the current approach, which only calls String when it's required.
More memory usage - the String method is meant for human consumption so naturally, this adds a lot of extra information, text etc compared to the 16 element slice of type uintptr. During error chaining it would be worse.
What I meant was to only hold on to information that you need than have uintptr which now points to the callframe and IIUC, holding onto the callframe will now prevent GC from cleaning the callframe info until the error is garbage collected as well.
Now coming to the memory footprint: IMHO storing unstructured data (the stack trace output string) is generally a bad idea. If we change the output string format from fmt.Sprintf("> %s\t%s:%d\n", frame.Func.Name(), frame.File, frame.Line) to something else, the memory footprint changes. So I think this optimization won't hold in the long run.
After giving it some thought, I think it's a tradeoff. Why?
Memory usage:
- current approach: 16*8 bytes of uintptr (on a 64-bit platform) + unreleased memory from GC (not sure what is stored at those PC addresses; it might be some function-related metadata used to populate the call frames. It definitely releases all the resources used inside the function)
- proposed approach: a large unstructured string + still some unreleased memory (not sure when the next Go GC cycle will kick off; it depends on the config).
I am not sure whether we can benchmark this at all.
the error is garbage collected as well.
If you see, in the Go world an err has a comparatively short lifespan. In a well-written program, errors get handled immediately by being returned to the caller or logged to a sink. So I think we are good here. After all, it's an internal package, and if we find some untoward runtime behaviour we can always optimize differently.
Thank you : )
If you see, in Go world err has the comparatively shortest lifespan than anything else.
Not always. Check our fixed issue: #5257
Anyway, I asked @sthaha to remind us about the optimization emperor did - to make sure we are aware. As usual, we can iterate on it; we won't get it perfect the first time. The main thing is to get the APIs as good as possible - it's hard to change them later.
It looks like the Lift and DCO CIs are pending forever. Does it usually take this long, especially the code analysis by Lift? Is there a way we can optimize it? I'd love to.
Hi guys, could you please take another look at this PR when you have some time : ) cc @bwplotka
This is an amazing job. LGTM. Before merging, my only question is really about choosing between Newf and Wrapf (A) vs New and Wrap (B).
I like the fact that the no-f (B) version is cleaner and shows that we never support Wrap/New without Sprintf-like formatting. On the other hand, people are used to the existing names, and there might be nice tooling which works only if the method has the f suffix (I don't have an example - I saw linters for Printf formatting errors).
I think I am leaning towards (A) actually... Thoughts?
// Copyright (c) The Thanos Authors.
// Licensed under the Apache License 2.0.

//nolint
Again, why nolint?
Tast project: the spellchecker thinks the project name should be Taste instead of the keyword Tast 🙃
Let's try to push (or repush) another commit to get DCO unstuck. You can ignore
Hi, thanks for the nice feedback and the reviews. Yes, in support of approach A, what you said about developer experience and linters definitely makes sense. People are accustomed to having both Okay then, do let me know your final thoughts - we are going with approach A, right? I'll make the necessary changes ASAP : )
Yes, I think A is safer. 🤗
Sure, updating : )
pkg/errors/errors.go
Outdated
// with a stacktrace containing recent call frames.
//
// If cause is nil, this is the same as New.
func Wrap(cause error, msg string) error {
I see we left both the non-f and f methods in the end? 🤔
Hey, I thought we were focusing on the end-user experience, which is why I ended up keeping both.
I have changed my mind; let's simplify things and stick to only the f versions 😄
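The f-only API agreed on above could look roughly like the sketch below. It is a simplified illustration: stack capture is left out, the base type is hypothetical, and joining the message and cause with ": " is an assumption rather than the package's confirmed format.

```go
package main

import (
	"errors"
	"fmt"
)

// base is a hypothetical error type holding a message and an optional cause.
type base struct {
	msg   string
	cause error
}

// Error joins the message with the cause, if there is one.
func (b *base) Error() string {
	if b.cause != nil {
		return b.msg + ": " + b.cause.Error()
	}
	return b.msg
}

// Unwrap exposes the cause so errors.Is and errors.As can walk the chain.
func (b *base) Unwrap() error { return b.cause }

// Newf creates an error from a format string. Calling it without extra
// arguments yields a plain message, so no separate New is needed.
func Newf(format string, args ...interface{}) error {
	return &base{msg: fmt.Sprintf(format, args...)}
}

// Wrapf wraps cause with a formatted message. With a nil cause it behaves
// like Newf.
func Wrapf(cause error, format string, args ...interface{}) error {
	return &base{msg: fmt.Sprintf(format, args...), cause: cause}
}

func main() {
	root := Newf("block %d missing", 7)
	err := Wrapf(root, "sync %s", "bucket")
	fmt.Println(err)
	fmt.Println(errors.Is(err, root))
}
```

Keeping a single formatted variant per operation means reviewers never have to ask why Errorf was used for a message with no format arguments.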
Nice, LGTM, just minor nits. We had an opportunity to simplify this package, but we ended up having all of New, Errorf, Wrap and Wrapf. Again, there is an opportunity to simplify this, so reviewers do not need to be a pain by asking "Why did you use Errorf("non formattable error") and not New?" (and similarly for Wrap).
Signed-off-by: Bisakh Mondal <[email protected]>
Head branch was pushed to by a user without write access
Done. Thanks to you both for the awesome review : )
Hi Guys, all the CIs (well, as usual except for the lift) have passed. Please approve and merge this. Then I'll proceed with the rest of the refactoring : )
Sorry for the massive lag! KubeCon toil. LGTM!
Let's try to use it and improve on the way. Ideally at some point move to efficientgo/core - or better - separate repo for others to import it.
Thank you! 🎉 Amazing work!
Thank you everyone for those awesome reviews :)
* Remove debug line (#5245) Signed-off-by: Matej Gera <[email protected]> * e2e: fix compact test's flakiness (#5246) Fix the compact test's by running this sub-test sequentially. The further steps depend on this test's results so it's wrong to run it as a sub-test. Signed-off-by: Giedrius Statkevičius <[email protected]> * bump prometheus version to v2.33.5 (#5256) Signed-off-by: Ben Ye <[email protected]> * info: Return store info only when the service is ready (#5255) * return store info only when the service is ready Signed-off-by: Ben Ye <[email protected]> * fix test Signed-off-by: Ben Ye <[email protected]> * Merge release 0.25 to main (#5210) * Cut 0.25.0-rc.0 (#5184) Signed-off-by: Matej Gera <[email protected]> * Cut v0.25.0 (#5209) Signed-off-by: Matej Gera <[email protected]> * Create v0.25.1 built with Go 1.17.8 (#5226) The binaries published with this release are built with Go1.17.8 to avoid [CVE-2022-24921](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-24921). Signed-off-by: Matthias Loibl <[email protected]> * *: Cut 0.25.2 rc.0 (#5247) * fix: add null check to exemplar data (#5202) Signed-off-by: Thomas Mota <[email protected]> * Ruler: Fix WAL directory in stateless mode (#5242) Signed-off-by: Matej Gera <[email protected]> * Update CHANGELOG, VERSION Signed-off-by: Matej Gera <[email protected]> * Updates busybox SHA (#5234) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> Co-authored-by: Tomás Mota <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: yeya24 <[email protected]> * Cut v0.25.2 Signed-off-by: Matej Gera <[email protected]> Update tutorials Signed-off-by: Matej Gera <[email protected]> Co-authored-by: Matthias Loibl <[email protected]> Co-authored-by: Tomás Mota <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: yeya24 <[email protected]> 
* Implement GRPC query API (#5250) With the current GRPC APIs, layering Thanos Queriers results in the root querier getting all of the samples and executing the query in memory. As a result, the intermediary Queriers do not do any intensive work and merely transport samples from the Stores to the root Querier. When data is perfectly sharded, users can implement a pattern where the root Querier instructs the intermediary ones to execute the queries from their stores and return back results. The results can then be concatenated by the root querier and returned to the user. In order to support this use case, this commit implements a GRPC API in the Querier which is analogous to the HTTP Query API exposed by Prometheus. Signed-off-by: fpetkovski <[email protected]> * Change error cleanup in `objstore.DownloadDir` to delete files not destination dir (#5229) * Change error cleanup in objstore.DownloadDir to delete files not directories Dst is always a directory. If any file after the first fails to download, the cleanup will fail because the destination already contains at least one file. This commit changes the cleanup logic to clean up successfully downloaded files one by one instead of attempting to clean up the whole dst directory. Signed-off-by: Dimitar Dimitrov <[email protected]> * Add cleanup of root dst directory. Signed-off-by: Dimitar Dimitrov <[email protected]> * Add unit test for cleanup of DownloadDir Signed-off-by: Dimitar Dimitrov <[email protected]> * Fix linter Signed-off-by: Dimitar Dimitrov <[email protected]> * Update index.html (#5264) * Add SumUp logo to adopters (#5267) Signed-off-by: Guilherme Souza <[email protected]> * receive: Added tenant ID error handling of remote write requests. (#5269) Plus better explanation. 
Signed-off-by: Bartlomiej Plotka <[email protected]> * Add TIXnGO logo to adopters (#5273) Signed-off-by: Pierre Hanselmann <[email protected]> * Fix miekgdns resolver to work with CNAME records too (#5271) * Fix miekgdns resolver to work with CNAME records too Signed-off-by: Marco Pracucci <[email protected]> * Remove unused context Signed-off-by: Marco Pracucci <[email protected]> * Update pkg/discovery/dns/miekgdns/resolver.go Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Lucas Servén Marín <[email protected]> Co-authored-by: Lucas Servén Marín <[email protected]> * UI: Remove old ui (#5145) * remove old ui Signed-off-by: Augustin Husson <[email protected]> * add changelog Signed-off-by: Augustin Husson <[email protected]> * update assets Signed-off-by: Augustin Husson <[email protected]> * Updates busybox SHA (#5283) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> * build(deps): bump moment from 2.29.1 to 2.29.2 in /pkg/ui/react-app (#5274) Bumps [moment](https://github.com/moment/moment) from 2.29.1 to 2.29.2. - [Release notes](https://github.com/moment/moment/releases) - [Changelog](https://github.com/moment/moment/blob/develop/CHANGELOG.md) - [Commits](https://github.com/moment/moment/compare/2.29.1...2.29.2) --- updated-dependencies: - dependency-name: moment dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * docs: fix URLs preventing generation and unblock CI (#5285) * docs: fix Ian Billett's GitHub handle I noticed that CI was failing [0] for PR https://github.com/thanos-io/thanos/pull/5284 because Ian had changed his GitHub handle from @ianbillett to @bill3tt. This commit fixes this. 
[0] https://github.com/thanos-io/thanos/runs/6050355497?check_suite_focus=true#step:5:135 Signed-off-by: Lucas Servén Marín <[email protected]> * docs: fix broken links to GitHub docs Currently, documentation generation is failing because mdox can't fetch some GitHub documentation pages since the URLs for the help content has changed. This commit updates the links to use the correct URLs. Signed-off-by: Lucas Servén Marín <[email protected]> * MAINTAINERS.md: regenerate Signed-off-by: Lucas Servén Marín <[email protected]> * UI: Update vulnerable dependencies (#5233) * refactor global window typings Use declaration merging for better window types Signed-off-by: Gabriel Bernal <[email protected]> * bump vulnerable react-scripts version Signed-off-by: Gabriel Bernal <[email protected]> * Add Vestiaire Collective as adopter (#5289) Signed-off-by: claude ebaneck <[email protected]> Co-authored-by: claude ebaneck <[email protected]> * Implement Query API discovery (#5291) A recent commit (#5250) added a GRPC API to Thanos Query which allows executing PromQL over GRPC. This API is currently not discoverable through endpointsets which makes it hard for other Thanos components to use it. This commit extends endpointsets with a GetQueryAPIClients method which returns Query API clients to all components which support this API. Signed-off-by: fpetkovski <[email protected]> * Added support for ppc64le (#5290) * Added support for ppc64le Signed-off-by: Marvin Giessing <[email protected]> * Updated Changelog Signed-off-by: Marvin Giessing <[email protected]> * Updated promu & protoc Signed-off-by: Marvin Giessing <[email protected]> * Updated Makefile comment Signed-off-by: Marvin Giessing <[email protected]> * Added target API tests (+goleak). (#5260) Attempted to repro https://github.com/thanos-io/thanos/issues/5257, but no good luck. Signed-off-by: Bartlomiej Plotka <[email protected]> * Revert "Added target API tests (+goleak). 
(#5260)" (#5297) This reverts commit 955ea6dcae2529ad5b5b97a6a11150a5906d775a. Signed-off-by: Giedrius Statkevičius <[email protected]> * Use correct filesystem/network path separators when uploading blocks (#5281) Signed-off-by: Arve Knudsen <[email protected]> * query-frontend: Don't cache request with dedup=false (#5300) * query-frontend: Added repro for dedup affecting precision of querying. Signed-off-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Douglas Camata <[email protected]> * QFE does not cache request with dedup=false. Signed-off-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Douglas Camata <[email protected]> * Move info about queries that skip cache logic to docs Signed-off-by: Douglas Camata <[email protected]> * Update CHANGELOG Signed-off-by: Douglas Camata <[email protected]> * Run docs formatter Signed-off-by: Douglas Camata <[email protected]> * Fix e2e tests where caching logic is desired Signed-off-by: Douglas Camata <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> * mixin: Fix typo in ThanosCompactHalted alert (#5306) Signed-off-by: Pedro Araujo <[email protected]> * Avoid starting goroutines for memcached batch requests before gate (#5301) Use the doWithBatch function to avoid starting goroutines to fetch batched results from memcached before they are allowed to run via the concurrency Gate. This avoids starting many goroutines which cannot make any progress due to a concurrency limit. 
Fixes #4967 Signed-off-by: Nick Pillitteri <[email protected]> * Cut readme for 0.26 (#5311) Co-authored-by: Wiard van Rij <[email protected]> * Reviewed and updated Changelog for 0.26-rc0 (#5313) Signed-off-by: Wiard van Rij <[email protected]> Co-authored-by: Wiard van Rij <[email protected]> * Cut 0.26.0-rc.0 set version correctly (#5317) Signed-off-by: Wiard van Rij <[email protected]> Co-authored-by: Wiard van Rij <[email protected]> * docs: Fix broken link to introduction blog (#5319) Signed-off-by: jmjf <[email protected]> * Ensure memcached batched requests handle context cancelation (#5314) * Ensure memcached batched requests handle context cancellation Ensure that when the context used for Memcached GetMulti is cancelled, getMultiBatched does not hang waiting for results that will never be generated (since the batched requests will not run if the context has been cancelled). Fixes an issue introduced in #5301 Signed-off-by: Nick Pillitteri <[email protected]> * Lint fixes Signed-off-by: Nick Pillitteri <[email protected]> * Code review changes: run batches unconditionally Signed-off-by: Nick Pillitteri <[email protected]> * stalebot: add generic label to avoid stalebot (#5322) Add a generic label which tells stalebot not to close issues marked with it. Signed-off-by: Giedrius Statkevičius <[email protected]> * Use proper replicalabels in GRPC Query API (#5308) The GRPC Query API uses only the replica labels coming from the RPC request and ignores the ones configured when starting the querier. This commit ensures that the API falls back on the preconfigured replica labels when they are not provided in the request. Signed-off-by: Filip Petkovski <[email protected]> * groupcache: reduce log severity (#5323) Sometimes certain operations can fail with some error(-s) being expected e.g. a deletion marker might or might not exist. Thus, these log lines could get triggered even though nothing bad is happening. 
Since the expected errors are known only at the very end, near the call site, and because `error`s are already logged in other places, and because these Fetch()/Store() functions are working in best-effort scenario, I propose reducing the severity of these log lines to `debug`. Fixes https://github.com/thanos-io/thanos/issues/5265. Signed-off-by: Giedrius Statkevičius <[email protected]> * Update release process (#5325) * update release process Signed-off-by: Wiard van Rij <[email protected]> * Add info about VERSION file Signed-off-by: Wiard van Rij <[email protected]> * query-frontend: improve docs on requestes excluded from cache (#5326) Signed-off-by: Douglas Camata <[email protected]> * cut release 0.26.0 (#5330) Signed-off-by: Wiard van Rij <[email protected]> * Updates busybox SHA (#5336) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> * receive: fix deadlock on interrupt in routerOnly mode (#5339) * fix receive router deadlock on interrupt Signed-off-by: François Gouteroux <[email protected]> * Update changelog Signed-off-by: François Gouteroux <[email protected]> * docs: Updated information about our community call. (#5309) Signed-off-by: Bartlomiej Plotka <[email protected]> * reloader: Force trigger reload when config rollbacked (#5324) * Add Cache metrics to groupcache (#5352) Add metrics about the hot and main caches[0]. * Number of bytes in each cache. * Number of items in each cache. * Counter of evictions from each cache. [0]: https://pkg.go.dev/github.com/vimeo/galaxycache#CacheStats Signed-off-by: SuperQ <[email protected]> * e2e: Refactored service helpers to be consistent with new API. (#5348) * test: Added Alert compatibilty test. Signed-off-by: Bartlomiej Plotka <[email protected]> * Tmp. Signed-off-by: Bartlomiej Plotka <[email protected]> * Update. Signed-off-by: Bartlomiej Plotka <[email protected]> * update. Signed-off-by: Bartlomiej Plotka <[email protected]> * update. 
Signed-off-by: Bartlomiej Plotka <[email protected]> * e2e: Refactored service helpers for newest e2e version. Signed-off-by: Bartlomiej Plotka <[email protected]> * Removed alert combatibiltiy test for now. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixed lint. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixed lint2. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixed nginx service. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixes. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fix. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fix. Signed-off-by: Bartlomiej Plotka <[email protected]> * fix. Signed-off-by: Bartlomiej Plotka <[email protected]> * Refactored ruler. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixed test. Signed-off-by: Bartlomiej Plotka <[email protected]> * fixes. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fix. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fixed compactor. Signed-off-by: Bartlomiej Plotka <[email protected]> * Fix. Signed-off-by: Bartlomiej Plotka <[email protected]> * What about now? Signed-off-by: Bartlomiej Plotka <[email protected]> * groupcache: fix handling of slashes (#5357) Use https://github.com/julienschmidt/httprouter#catch-all-parameters for the groupcache route otherwise slashes in the cache's key gets interpreted by the router and thus groupcache's function never gets invoked, and all clients get 404. Remove test regarding cache hit because now Thanos Store during test constantly generates cache hits due to 1s delay between block information refreshes. Signed-off-by: Giedrius Statkevičius <[email protected]> * Adds more info about the formatting part. (#5347) * Adds more info about the formatting part. 
Closes #5282 Signed-off-by: Wiard van Rij <[email protected]> * adds extra newline Signed-off-by: Wiard van Rij <[email protected]> * Update promdoc to solve #5344 (#5345) Signed-off-by: Wiard van Rij <[email protected]> * e2e: Refactored Receive Builder to be consistent with other helpers. (#5358) * e2e: Refactored Receive Builder to be consistent with other helpers. Signed-off-by: Bartlomiej Plotka <[email protected]> * Addressed comments. Signed-off-by: Bartlomiej Plotka <[email protected]> * Updates busybox SHA (#5365) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> * e2e: Fixed exemplar support in receive helper. (#5372) Signed-off-by: Bartlomiej Plotka <[email protected]> * Enforce memcached concurrency limit with unbatched requests (#5360) * Enforce memcached concurrency limit with unbatched requests This ensures that requests that are _not_ split into batches still count towards the concurrency limit that the client enforces. This fixes an issue introduced in #5301 Signed-off-by: Nick Pillitteri <[email protected]> * Lint fix Signed-off-by: Nick Pillitteri <[email protected]> * docs: fix link (#5379) I think I've found a replacement for the dead link. Signed-off-by: Giedrius Statkevičius <[email protected]> * cache: do not copy data in groupcache (#5378) Add a unsafe codec which uses the given byte slices directly to avoid copying - we are doing ioutil.ReadAll() either way so there is no need to copy anything. 
Signed-off-by: Giedrius Statkevičius <[email protected]>
* fix ruler send empty alerts (#5377)
Signed-off-by: Ben Ye <[email protected]>
* Add custom `errors` package with stack trace functionality (#5239)
* feat: a simple stacktrace utility
Signed-off-by: Bisakh Mondal <[email protected]>
* feat: custom errors package with new, errorf, wrapping, unwrapping and stacktrace
Signed-off-by: Bisakh Mondal <[email protected]>
* chore: update existing errors import (small subset)
Signed-off-by: Bisakh Mondal <[email protected]>
* chore: update comments
Signed-off-by: Bisakh Mondal <[email protected]>
* add errors into skip-files linter config
Signed-off-by: Bisakh Mondal <[email protected]>
* intoduce UnwrapTillCause to suffice the limitation of Unwrap
Signed-off-by: Bisakh Mondal <[email protected]>
* Revert "chore: update existing errors import (small subset)"
This reverts commit d27f0177fe6c8a357ba10e4ac8bfee87c8bf985c.
Signed-off-by: Bisakh Mondal <[email protected]>
* revert makefile && golangcilint file
Signed-off-by: Bisakh Mondal <[email protected]>
* apply PR feedbacks
Signed-off-by: Bisakh Mondal <[email protected]>
* stacktrace and errors test
Signed-off-by: Bisakh Mondal <[email protected]>
* fix typo
Signed-off-by: Bisakh Mondal <[email protected]>
* update stacktrace testing regex
Signed-off-by: Bisakh Mondal <[email protected]>
* add lint ignore for standard errors import inside errors pkg
Signed-off-by: Bisakh Mondal <[email protected]>
* [test files] add copyright headers
Signed-off-by: Bisakh Mondal <[email protected]>
* add no lint to avoid false misspell detection of keyword Tast
Signed-off-by: Bisakh Mondal <[email protected]>
* update stacktrace output test line number with regex pattern
Signed-off-by: Bisakh Mondal <[email protected]>
* return pc slice with reduced capacity
Signed-off-by: Bisakh Mondal <[email protected]>
* segregate formatted vs non formatted methods
Signed-off-by: Bisakh Mondal <[email protected]>
* update with only f functions
Signed-off-by: Bisakh Mondal <[email protected]> * Group memcached keys based on server when performing batch gets (#5356) * Group memcached keys based on server when performing batch gets Order and group keys during batch get operations based on the memcached server they will be sharded to. This reduces the number of connections that must be made within each batch of get operations. Fixes #5353 Signed-off-by: Nick Pillitteri <[email protected]> * Code review changes Signed-off-by: Nick Pillitteri <[email protected]> * Fix error in testutil method added Signed-off-by: Nick Pillitteri <[email protected]> * Code review: comments for selector interface Signed-off-by: Nick Pillitteri <[email protected]> * QueryFrontend: pre-compile regexp (#5383) * pre compile regexp Signed-off-by: Jin Dong <[email protected]> * rename oppattern to labelvaluespattern Signed-off-by: Jin Dong <[email protected]> * [FEAT] adding thanos consul blogpost (#5387) Signed-off-by: Nicolas Takashi <[email protected]> * Fix empty $externalLabels when templating labels in rule. (#5394) Signed-off-by: Rostislav Benes <[email protected]> Co-authored-by: Rostislav Benes <[email protected]> * support series relabeling on Thanos receiver (#5391) * support series relabeling on Thanos receiver Signed-off-by: Ben Ye <[email protected]> * add changelog Signed-off-by: Ben Ye <[email protected]> * fix lint Signed-off-by: Ben Ye <[email protected]> * update lint Signed-off-by: Ben Ye <[email protected]> * fix e2e test Signed-off-by: Ben Ye <[email protected]> * fix relabel config pass Signed-off-by: Ben Ye <[email protected]> * cleanup white space Signed-off-by: Ben Ye <[email protected]> * address review comments Signed-off-by: Ben Ye <[email protected]> * address comments Signed-off-by: Ben Ye <[email protected]> * update comment Signed-off-by: Ben Ye <[email protected]> * Expose GatherFileStats. 
(#5400) Signed-off-by: Peter Štibraný <[email protected]> * Rule: Error out earlier when building alertmanager config (#5405) * Error out earlier when building alertmanager config Signed-off-by: Jéssica Lins <[email protected]> * Add test case for empty host Signed-off-by: Jéssica Lins <[email protected]> * [5130] [.*:] Upgrade Minio used for local development and e2e tests (#5392) * add updated bingo .gitignore Signed-off-by: B0go <[email protected]> * update bingo minio version to commit 91130e884b5df59d66a45a0aad4f48db88f5ca63 Signed-off-by: B0go <[email protected]> * trigger CI Signed-off-by: B0go <[email protected]> * Submit a proposal for vertical query sharding (#5350) Signed-off-by: fpetkovski <[email protected]> * query: Close() after using query (#5410) * query: Close() after using query This should reduce memory usage because Close() returns points back to a sync.Pool. Signed-off-by: Giedrius Statkevičius <[email protected]> * CHANGELOG: add item Signed-off-by: Giedrius Statkevičius <[email protected]> * query: call Close() in gRPC API too Signed-off-by: Giedrius Statkevičius <[email protected]> * avoided potential panic due to divide by 0 (#5412) Signed-off-by: Aditi Ahuja <[email protected]> * sidecar/compact/store/receiver - Add the prefix option to buckets (#5337) * Create prefixed bucket Signed-off-by: jademcosta <[email protected]> * started PrefixedBucket tests Signed-off-by: Maria Eduarda Duarte <[email protected]> * finish objstore tests Signed-off-by: Maria Eduarda Duarte <[email protected]> * Simplify string removal logic Signed-off-by: jademcosta <[email protected]> * Test more prefix cases on PrefixedBucket Signed-off-by: jademcosta <[email protected]> * Only use a prefixedbucket if we have a valid prefix Signed-off-by: jademcosta <[email protected]> * Add single unit test for prefixedBucket prefix Signed-off-by: jademcosta <[email protected]> * test other prefixes on UsesPrefixTest Signed-off-by: Maria Eduarda Duarte <[email protected]> * 
add remaining methods to UsesPrefixTest Signed-off-by: Maria Eduarda Duarte <[email protected]> * add prefix to docs examples Signed-off-by: Maria Eduarda Duarte <[email protected]> * Simplify Iter method Signed-off-by: jademcosta <[email protected]> * add prefix explanation to S3 docs Signed-off-by: Maria Eduarda Duarte <[email protected]> * Conclusion of prefix sentence on docs Signed-off-by: jademcosta <[email protected]> * Use DirDelim instead of magic string Signed-off-by: jademcosta <[email protected]> * Add log when using prefixed bucket Signed-off-by: jademcosta <[email protected]> * Remove "@" from test string to make them simpler Signed-off-by: jademcosta <[email protected]> * fix BucketConfig Config type - back to interface Signed-off-by: Maria Eduarda Duarte <[email protected]> * add changelog Signed-off-by: Maria Eduarda Duarte <[email protected]> * add missing checks in UsesPrefixTest Signed-off-by: Maria Eduarda Duarte <[email protected]> * fix linter and test errors Signed-off-by: Maria Eduarda Duarte <[email protected]> * Add license to new files Signed-off-by: jademcosta <[email protected]> * Remove autogenerated docs Signed-off-by: jademcosta <[email protected]> * Remove duplicated transformation of string->[]byte Signed-off-by: jademcosta <[email protected]> * Add prefixed bucket on all e2e tests for S3 The idea is that if it works, we can add for all other providers. Signed-off-by: jademcosta <[email protected]> * Add e2e tests using prefixed bucket to all providers Signed-off-by: jademcosta <[email protected]> * refactor: move validPrefix to prefixed_bucket logic Signed-off-by: Maria Eduarda Duarte <[email protected]> * Enhance the documentation about prefix. 
Signed-off-by: jademcosta <[email protected]> * Fix format Signed-off-by: jademcosta <[email protected]> * Add prefix entry on bucket config example Signed-off-by: jademcosta <[email protected]> * Removing redundancies on prefix checks and tests We already check if the prefix if not empty when creating the bucket. Signed-off-by: jademcosta <[email protected]> * Remove redundant YAML unmarshal Signed-off-by: jademcosta <[email protected]> * Remove unused parameter Signed-off-by: jademcosta <[email protected]> * Remove docs that should be auto-geneated Signed-off-by: jademcosta <[email protected]> * refactor: move prefix to config root level Signed-off-by: Maria Eduarda Duarte <[email protected]> * add auto generated docs Signed-off-by: Maria Eduarda Duarte <[email protected]> * fix changelog Signed-off-by: Maria Eduarda Duarte <[email protected]> Co-authored-by: Maria Eduarda Duarte <[email protected]> * Ruler: Change default evaluation interval to 1m (#5417) * Change default eval interval to 1m Signed-off-by: Matej Gera <[email protected]> * Update CHANGELOG Signed-off-by: Matej Gera <[email protected]> * Updates busybox SHA (#5423) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> * receive: Added Ketamo Consistent hashing (#5408) * Add support for consistent hashing in receivers This commit adds support for distributing series in Receivers using consistent hashing based on the libketama algorithm. 
Signed-off-by: Filip Petkovski <[email protected]> * Use require package for test assertions Signed-off-by: Filip Petkovski <[email protected]> * Rename algorithm from consistent to ketama Signed-off-by: Filip Petkovski <[email protected]> * S3: Add config option to enforce the minio DNS lookup (#5409) * Add config option to enforce the minio DNS lookup Signed-off-by: Jakob Hahn <[email protected]> * Useenums instead of boolean for bucket_lookup_type Signed-off-by: Jakob Hahn <[email protected]> * Expose tsdb status in receiver (#5402) * Expose tsdb status in receiver This commit implements the api/v1/status/tsdb API in the Receiver. Signed-off-by: Filip Petkovski <[email protected]> * Add docs and todo Signed-off-by: Filip Petkovski <[email protected]> * Fix tests Signed-off-by: Filip Petkovski <[email protected]> * Receive: option to extract tenant from client certificate (#5153) * added option to extract tenant from client certificate Signed-off-by: Magnus Kaiser <[email protected]> * added suggestions from PR Signed-off-by: Magnus Kaiser <[email protected]> * removed else cases Signed-off-by: Magnus Kaiser <[email protected]> * corrected location of certificate field check Signed-off-by: Magnus Kaiser <[email protected]> * fixed issue with err definition Signed-off-by: Magnus Kaiser <[email protected]> * updated docs Signed-off-by: Magnus Kaiser <[email protected]> * corrected comment Signed-off-by: Magnus Kaiser <[email protected]> Co-authored-by: Magnus Kaiser <[email protected]> * Improve ketama hashring replication (#5427) With the Ketama hashring, replication is currently handled by choosing subsequent nodes in the list of endpoints. This can lead to existing nodes getting more series when the hashring is scaled. This commit changes replication to choose subsequent nodes from the hashring which should not create new series in old nodes when the hashring is scaled. 
Signed-off-by: Filip Petkovski <[email protected]> * Cut readme for 0.27 (#5429) Signed-off-by: Wiard van Rij <[email protected]> * Added alert compliance test for Thanos (#5315) * test: Added Alert compatibilty test. Signed-off-by: Bartlomiej Plotka <[email protected]> * Tmp. Signed-off-by: Bartlomiej Plotka <[email protected]> * Update. Signed-off-by: Bartlomiej Plotka <[email protected]> * update. Signed-off-by: Bartlomiej Plotka <[email protected]> * update. Signed-off-by: Bartlomiej Plotka <[email protected]> * e2e: Refactored service helpers for newest e2e version. Signed-off-by: Bartlomiej Plotka <[email protected]> * Removed alert combatibiltiy test for now. Signed-off-by: Bartlomiej Plotka <[email protected]> * e2e: Added test for compatibility. Signed-off-by: Bartlomiej Plotka <[email protected]> * Added Querier /alerts API. Signed-off-by: Bartlomiej Plotka <[email protected]> * e2e:Added replica labels. Signed-off-by: Bartlomiej Plotka <[email protected]> * Option to remove replica-label. Signed-off-by: Bartlomiej Plotka <[email protected]> * skip. 
Signed-off-by: Bartlomiej Plotka <[email protected]> * Use stateful ruler and default resend delay Signed-off-by: Matej Gera <[email protected]> * Update docs Signed-off-by: Matej Gera <[email protected]> Co-authored-by: Matej Gera <[email protected]> * 0.27-rc0 Update readme and version (#5430) * Update readme and version Signed-off-by: Wiard van Rij <[email protected]> * Fix newlines Signed-off-by: Wiard van Rij <[email protected]> * Fixes typo Signed-off-by: Wiard van Rij <[email protected]> * fixes noise Signed-off-by: Wiard van Rij <[email protected]> * Alert Compliance: Fix wrong ruler configuration (#5433) * [receive] Export metrics about remote write requests per tenant (#5424) * Add write metrics to Thanos Receive Signed-off-by: Douglas Camata <[email protected]> * Let the middleware count inflight HTTP requests Signed-off-by: Douglas Camata <[email protected]> * Update Receive write metrics type & definition Signed-off-by: Douglas Camata <[email protected]> * Put option back in its place to avoid big diff Signed-off-by: Douglas Camata <[email protected]> * Fetch tenant from headers instead of context It might not be in the context in some cases. 
Signed-off-by: Douglas Camata <[email protected]> * Delete unnecessary tenant parser middleware Signed-off-by: Douglas Camata <[email protected]> * Refactor & reuse code for HTTP instrumentation Signed-off-by: Douglas Camata <[email protected]> * Add missing copyright to some files Signed-off-by: Douglas Camata <[email protected]> * Add changelog entry for Receive & new HTTP metrics Signed-off-by: Douglas Camata <[email protected]> * Remove TODO added by accident Signed-off-by: Douglas Camata <[email protected]> * Make error handling code shorter Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Douglas Camata <[email protected]> * Make switch statement simpler Signed-off-by: Douglas Camata <[email protected]> * Remove method label from timeseries' metrics Signed-off-by: Douglas Camata <[email protected]> * Count samples of all series instead of each Signed-off-by: Douglas Camata <[email protected]> * Remove in-flight requests metric Will add this in a follow-up PR to keep this small. Signed-off-by: Douglas Camata <[email protected]> * Change timeseries/samples metrics to histograms The buckets were picked based on the fact that Prometheus' default remote write configuration (see https://prometheus.io/docs/practices/remote_write/#memory-usage) set a max of 500 samples sent per second. Signed-off-by: Douglas Camata <[email protected]> * Fix Prometheus registry for histograms Signed-off-by: Douglas Camata <[email protected]> * Fix comment in NewHandler functions There are now four metrics instead of five. 
Signed-off-by: Douglas Camata <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> * remove unused block-sync-concurrency flag (#5426) * remove unused block-sync-concurrency flag Signed-off-by: Ben Ye <[email protected]> * add changelog Signed-off-by: Ben Ye <[email protected]> * update Signed-off-by: Ben Ye <[email protected]> * fix e2e test Signed-off-by: Ben Ye <[email protected]> * fix tests Signed-off-by: Ben Ye <[email protected]> * fix docs typo in metric thanos_compact_halted (#5448) Signed-off-by: Nikita Matveenko <[email protected]> * Implement tenant expiration (#5420) * Implement tenant expiration This commit adds dynamic TSDB pruning for tenants which have not received new samples within a certain period of time. Signed-off-by: Filip Petkovski <[email protected]> * Add link to receiver tenant-lifecycle-management Signed-off-by: Filip Petkovski <[email protected]> * Docs: Remove Katacoda links (#5454) * Remove Katacoda links Signed-off-by: Matej Gera <[email protected]> * Remove one more reference Signed-off-by: Matej Gera <[email protected]> * Fixed lint on Go 1.18.3+ (#5459) Signed-off-by: bwplotka <[email protected]> * Add HTTP metrics for in-flight requests (#5440) * Add HTTP metrics for in-flight requests Signed-off-by: Douglas Camata <[email protected]> * Fix changelog entry after PR creation Signed-off-by: Douglas Camata <[email protected]> * Fix link in old CHANGELOG entry Signed-off-by: Douglas Camata <[email protected]> * Fix style in the CHANGELOG All the entries should end up with a period. 
Signed-off-by: Douglas Camata <[email protected]> * Improve help for in-flight htttp requests metric Signed-off-by: Douglas Camata <[email protected]> * Move changelog entry pending PR Signed-off-by: Douglas Camata <[email protected]> * Add a method label to the in-flight HTTP requests Signed-off-by: Douglas Camata <[email protected]> * docs: Fix heading level of "Excluded from caching" (#5455) * Refactor DefaultTransport() from objstore to package exthttp (#5447) * Refactoring the DefaultTransport func in package exthttp Signed-off-by: Srushti Sapkale <[email protected]> * Refactoring the DefaultTransport func from s3 in package exthttp Signed-off-by: Srushti Sapkale <[email protected]> * Updated helpers.go corrected argument for DefaultTransport() in helpers.go Signed-off-by: Srushti (sroo-sh-tee) <[email protected]> * Changed the argument type in getContainerURL Signed-off-by: Srushti Sapkale <[email protected]> * Update pkg/exthttp/transport.go Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Srushti (sroo-sh-tee) <[email protected]> * Update pkg/exthttp/transport.go Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Srushti (sroo-sh-tee) <[email protected]> * Removed the use of NewTransport() in cos.go Signed-off-by: Srushti Sapkale <[email protected]> * Moved TLSConfig struct and functions that need it from objstore to exthttp Signed-off-by: Srushti Sapkale <[email protected]> * Changed s3.go Signed-off-by: Srushti Sapkale <[email protected]> * Kept s3.go and helpers.go unchanged to not break the cortex deps Signed-off-by: Srushti Sapkale <[email protected]> * Consistency changed made while pair++ programming. 
Signed-off-by: bwplotka <[email protected]> * Created a new tlsconfig in exthttp and minor changes in cos.go Signed-off-by: Srushti Sapkale <[email protected]> * Commented in s3.go Signed-off-by: Srushti Sapkale <[email protected]> * Minor changes in transport.go Signed-off-by: Srushti Sapkale <[email protected]> * Changed transport.go Signed-off-by: Srushti Sapkale <[email protected]> * Changed transport.go and tlsconfig.go Signed-off-by: Srushti Sapkale <[email protected]> * Removed changes from prometheus.mod and prometheus.sum Signed-off-by: Srushti Sapkale <[email protected]> * Minor updation in cos.go Signed-off-by: Srushti Sapkale <[email protected]> Co-authored-by: bwplotka <[email protected]> * receive: Fix race condition when pruning tenants (#5460) Pruning Receiver tenants has a race condition caused by concurrently removing items from the tenants map. This commit addresses the issue by using a mutex to guard the tenants map. Signed-off-by: fpetkovski <[email protected]> * Adding SCMP as an adopter (#5466) Signed-off-by: Chris Ng <[email protected]> * Updated busybox version. (#5471) Signed-off-by: bwplotka <[email protected]> * Refactor endpoint ref clients Signed-off-by: Matej Gera <[email protected]> * Fix E2E test env name clash Signed-off-by: Matej Gera <[email protected]> * Build with Go 1.18 (#5258) * Build with Go 1.18 Signed-off-by: Sylvain Rabot <[email protected]> * Try something Signed-off-by: Sylvain Rabot <[email protected]> * Upgrade minio Signed-off-by: Sylvain Rabot <[email protected]> * Replace json-iterator/reflect2 in bingo Signed-off-by: Sylvain Rabot <[email protected]> * Ignore 405 errors for prometheus buildVersion API requests (#5477) Older versions of prometheus (such as 2.7 which is shipped by Debian buster) return a 405 error for non-existent API endpoints instead of the 404 returned by more recent versions. 
Signed-off-by: Nicolas Dandrimont <[email protected]> * *: Cut 0.27.0 (#5473) * Cut 0.27.0 Signed-off-by: Matej Gera <[email protected]> * Updated busybox version. (#5471) Signed-off-by: bwplotka <[email protected]> Signed-off-by: Matej Gera <[email protected]> * Docs: Remove Katacoda links (#5454) * Remove Katacoda links Signed-off-by: Matej Gera <[email protected]> * Remove one more reference Signed-off-by: Matej Gera <[email protected]> Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Matej Gera <[email protected]> * Update compact.md (#5465) * During 1h downsampling skip XOR chunks that may erroneously be present in 5m resolution blocks (#5453) * Add fpetkovski to triage list Signed-off-by: Filip Petkovski <[email protected]> * Use Azure BlobURL.Download instead of in-memory buffer (#5451) Modify the azure.Bucket get methods to use BlobURL.Download for fetching blobs and blob ranges. This avoids the need to allocate a buffer for storing the entire expected size of the object in memory. Instead, use a ReaderCloser view of the body returned by the download method. 
See grafana/mimir#2229 Signed-off-by: Nick Pillitteri <[email protected]> * Update storage.md (#5486) * [receive] Add per-tenant charts to Receive's example dashboard (#5472) * Start to add tenant charts to Receive Signed-off-by: Douglas Camata <[email protected]> * Properly filter HTTP status codes Signed-off-by: Douglas Camata <[email protected]> * Fix tenant error rate chart Signed-off-by: Douglas Camata <[email protected]> * Refactor to improve readability and consistency Signed-off-by: Douglas Camata <[email protected]> * Refactor one more usage of code and tenant labels Signed-off-by: Douglas Camata <[email protected]> * Filter tenant metrics to the Receive handler Signed-off-by: Douglas Camata <[email protected]> * Format math expression properly Signed-off-by: Douglas Camata <[email protected]> * Update CHANGELOG Signed-off-by: Douglas Camata <[email protected]> * Add samples charts to series & samples row Signed-off-by: Douglas Camata <[email protected]> * Bump Go version in all the GH Actions (#5487) * Bump go version in go mod This is a follow up to #5258, which made the project be built with Go 1.18. 
Signed-off-by: Douglas Camata <[email protected]> * Update Go version in all GH Actions Signed-off-by: Douglas Camata <[email protected]> * Run go mod tidy Signed-off-by: Douglas Camata <[email protected]> * Added changelog entry Signed-off-by: Douglas Camata <[email protected]> * Put back Go 1.17 in go.mod Because we don't use any Go 1.18 feature yet, so it's not needed Signed-off-by: Douglas Camata <[email protected]> * Update go.sum after changing go.mod to go 1.17 Signed-off-by: Douglas Camata <[email protected]> * Remove non-user-impacting entry for changelog Signed-off-by: Douglas Camata <[email protected]> * objstore: Download and Upload block files in parallel (#5475) * Parallel Chunks Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * test Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Changelog Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * making ApplyDownloadOptions private Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * upload concurrency Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Upload Test Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * update change log Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Change comments Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio 
<[email protected]> * Address comments Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Remove duplicate entries on changelog Signed-off-by: Alan Protasio <[email protected]> Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Addressing Comments Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * update golang.org/x/sync Signed-off-by: alanprot <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Adding Commentts Signed-off-by: Alan Protasio <[email protected]> * Use default HTTP config for E2E S3 tests (#5483) Signed-off-by: Matej Gera <[email protected]> * chore: Included githubactions in the dependabot config (#5364) This should help with keeping the GitHub actions updated on new releases. This will also help with keeping it secure. Dependabot helps in keeping the supply chain secure https://docs.github.com/en/code-security/dependabot GitHub actions up to date https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot https://github.com/ossf/scorecard/blob/main/docs/checks.md#dependency-update-tool Signed-off-by: naveensrinivasan <[email protected]> * bump codemirror and promql editor to the last version (#5491) Signed-off-by: Augustin Husson <[email protected]> * receiver: Expose stats for all tenants (#5470) * receiver: Expose stats for all tenants Thanos Receiver supports the Prometheus tsdb status API and can expose TSDB stats for a single tenant. This commit extends that functionality and allows users to request TSDB stats for all tenants using the all_tenants=true query parameter. 
Signed-off-by: Filip Petkovski <[email protected]> * Add back chunk count Signed-off-by: Filip Petkovski <[email protected]> * Simplify TSDBStats interface Signed-off-by: Filip Petkovski <[email protected]> * Return empty result for no stats Signed-off-by: Filip Petkovski <[email protected]> * CHANGELOG.md: regenerate (#5495) * receive: Fix stats nil pointer panic (#5494) When fetching TSDB stats from receivers, certain TSDBs might not be initialized yet. This can lead to a nil pointer access when the status endpoint is accessed before all TSDBs are initialized. This commit adds an explicit check for each tenant's TSDB when exporting TSDB stats. Signed-off-by: Filip Petkovski <[email protected]> * Update query.md (#5496) Fix typo of parameter --store.sd-files Signed-off-by: Firxiao <[email protected]> * Parallel download blocks - Follow up of #5475 (#5493) * Download blocks in parallel Signed-off-by: Alan Protasio <[email protected]> * remove the go func Signed-off-by: Alan Protasio <[email protected]> * Doc Signed-off-by: Alan Protasio <[email protected]> * CHANGELOG Signed-off-by: Alan Protasio <[email protected]> * doc Signed-off-by: alanprot <[email protected]> * AddressComments Signed-off-by: alanprot <[email protected]> * fix typo Signed-off-by: Alan Protasio <[email protected]> * Upgrade mdox with cache and some http settings to reduce CI failures (#5500) * Pin mdox to latest master commit It suppors now a cache for link validation and some HTTP configuration that can be used to help avoid intermittent CI failures. Signed-off-by: Douglas Camata <[email protected]> * Add mdox cache and HTTP configuration The cache has a default TTL (5 days) A timeout of 1m and 10 connections per host at transport level should help us reduce the intermittent failures if we have to invalidate the cache. Signed-off-by: Douglas Camata <[email protected]> * Add Github Action cache for the mdox cache Using the hash of the md files as cache key. 
Signed-off-by: Douglas Camata <[email protected]> * Upgrade cache actions to v3 and add restore key Signed-off-by: Douglas Camata <[email protected]> * Empty commit to test CI build cache Signed-off-by: GitHub <[email protected]> * Use 2.5 days as jitter for mdox cache Signed-off-by: Douglas Camata <[email protected]> * Fix bad editor auto-formating again Signed-off-by: Douglas Camata <[email protected]> * Updated minio-go to latest; removed fork. (#5474) * Updated minio-go fork to latest. NOTE: Optimization is propopsed to upstream to avoid fork in future. Relates to https://github.com/thanos-io/thanos/issues/5101 and https://github.com/thanos-io/thanos/issues/5130 Signed-off-by: bwplotka <[email protected]> # Conflicts: # go.mod # go.sum * Removed fork. Signed-off-by: bwplotka <[email protected]> * Added comment. Signed-off-by: bwplotka <[email protected]> * Receiver: Handle storage exemplar multi-error (#5502) * Handle exemplar store errors as conflict Signed-off-by: Matej Gera <[email protected]> * Adjust tests Signed-off-by: Matej Gera <[email protected]> * Update CHANGELOG Signed-off-by: Matej Gera <[email protected]> * Fixing Race condition Introduced by #5493 (#5503) * Update busybox image versions (#5506) Signed-off-by: Kemal Akkoyun <[email protected]> * Updates busybox SHA (#5507) Signed-off-by: GitHub <[email protected]> Co-authored-by: yeya24 <[email protected]> * chore: Update Prometheus dependency (#5484) * chore: Update Prometheus dependency Update Prometheus from v2.33.5 to v2.36.2. Signed-off-by: SuperQ <[email protected]> * Update query tests for cortex changes. Signed-off-by: SuperQ <[email protected]> * Use the default rules.RuleGroupPostProcessFunc. Signed-off-by: SuperQ <[email protected]> * Update QueryStats use. Signed-off-by: SuperQ <[email protected]> * Update Cortex. Signed-off-by: SuperQ <[email protected]> * Update queryfrontend for Cortex changes. Signed-off-by: SuperQ <[email protected]> * Bump pprof. 
Signed-off-by: SuperQ <[email protected]> * Add changelog entry. Signed-off-by: SuperQ <[email protected]> * Adapt to changed query stats API Signed-off-by: Kemal Akkoyun <[email protected]> * Sync dependencies Signed-off-by: Kemal Akkoyun <[email protected]> * Reflect changed metric names Signed-off-by: Kemal Akkoyun <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> * chore: Vendor Cortex dependency as an internal package (#5504) * Vendor Cortex dependency as an internal package Signed-off-by: Kemal Akkoyun <[email protected]> * Add gitattributes Signed-off-by: Kemal Akkoyun <[email protected]> * Skip checks for vendored directory Signed-off-by: Kemal Akkoyun <[email protected]> * Add copyright headers for Cortex Signed-off-by: Kemal Akkoyun <[email protected]> * *: Move objstore out of repo (#5510) * *: Move objstore out of repo Signed-off-by: Kemal Akkoyun <[email protected]> * Fix doc checks Signed-off-by: Kemal Akkoyun <[email protected]> * chore: Update Prometheus to v2.37.0 (#5511) * chore: Update Prometheus to v2.37.0 Update Prometheus to the latest release. Note that Prometheus upstream now tags v0.x.y to map to the 2.x.y releases. Signed-off-by: SuperQ <[email protected]> * Cleanup direct/indirect go.mod requirements. Signed-off-by: SuperQ <[email protected]> * chore: Update Go modules (#5516) * Update weaveworks/common to remove node_exporter indirect dep. * Update simonpasquier/klog-gokit/v2. * Update google.golang.org/grpc lock to v1.45.0. * Cleanup replacements that are now handled by indirect requirements. * Fixup grpc.WithInsecure() use. Signed-off-by: SuperQ <[email protected]> * chore: Update Go modules (#5518) * Reuse upstream TSDB status structs (#5526) This commit replaces the copied TSDB status structs with direct references from prometheus/prometheus. 
Signed-off-by: Filip Petkovski <[email protected]> * Fix proposal on website (#5530) Signed-off-by: Saswata Mukherjee <[email protected]> * Update all bingo dependencies (#5525) This commit updates all bingo dependencies to their latest versions. It pins golang.org/x/sys to v0.0.0-20220715151400-c0bba94af5f8 for the github.com/google/go-jsonnet dependency in order to prevent failures when running make docs on Mac OS. Signed-off-by: Filip Petkovski <[email protected]> * delete_katacoda (#5529) Signed-off-by: Akshit42-hue <[email protected]> * Remove empty RuleGroups in api/v1/rules when using matchers (#5537) * Remove empty RuleGroups in api/v1/rules Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestion Signed-off-by: Saswata Mukherjee <[email protected]> * Rename variables Signed-off-by: Saswata Mukherjee <[email protected]> * fix(api): When querying api query on endpoint alerts return a json struct with alerts in lowercase. (#5534) To be same result as prometheus api Signed-off-by: Guillaume audic <[email protected]> * Receiver: Add benchmark for receive writer (#5533) * Add benchmark for receive writer Signed-off-by: Matej Gera <[email protected]> * Incorporate feedback - Clearer parameter naming; use a separate temp dir for bench Signed-off-by: Matej Gera <[email protected]> * Submit a proposal for Active Series Limiting for Hashring Topology (#5415) * Add proposal for Active Series Limiting for Hashring Topology Signed-off-by: Saswata Mukherjee <[email protected]> * Resize images Signed-off-by: Saswata Mukherjee <[email protected]> * Add Observatorium as an alternative Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestions; add TODO Signed-off-by: Saswata Mukherjee <[email protected]> * Update proposal Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestions: add sections numbers Signed-off-by: Saswata Mukherjee <[email protected]> * Refactor EndpointSet (#5538) * Refactor EndpointSet This commit 
refactors the EndpointSet struct in order to make it easier to understand and work with. Signed-off-by: Filip Petkovski <[email protected]> * Handle context cancellation in endpoint mock Signed-off-by: Filip Petkovski <[email protected]> * Make additions and removals of refs atomic. Signed-off-by: Filip Petkovski <[email protected]> * Fix changed-docs grep regex (#5556) Signed-off-by: Saswata Mukherjee <[email protected]> * Added Vertical Query Sharding to Query-Frontend (#5342) * Update faillint to v1.10.0 Signed-off-by: Filip Petkovski <[email protected]> * Implement query sharding This commit implements query sharding for grouping PromQL expressions. Sharding is initiated by analyzing the PromQL and extracting grouping labels. Extracted labels are propagated down to Stores which partition the response by hashmoding all series on those labels. If a query is shardable, the partitioning and merging process will be initiated by the Query Frontend. The Query Frontend will make N distinct queries across a set of Queriers and merge the results back before presenting them to the user. 
Signed-off-by: Filip Petkovski <[email protected]> * First code review pass Signed-off-by: Filip Petkovski <[email protected]> * Use sync pool to reuse sharding buffers Signed-off-by: Filip Petkovski <[email protected]> * Add test for binary expression with constant Signed-off-by: Filip Petkovski <[email protected]> * Include external labels in series sharding Signed-off-by: Filip Petkovski <[email protected]> * Rule: Fix e2e test flake (#5558) * Rule: Fix e2e test flake Signed-off-by: Saswata Mukherjee <[email protected]> * Fix lint Signed-off-by: Saswata Mukherjee <[email protected]> * Check errors Signed-off-by: Saswata Mukherjee <[email protected]> * Change to github.com/thanos-io/thanos/pkg/errors Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestion Signed-off-by: Saswata Mukherjee <[email protected]> * Fix multi-tenant exemplar matchers (#5554) * Fix multi-tenant exemplar matchers The exemplar proxy synthesizes a query based on PromQL expression matchers and individual store's label sets. When a store has multiple label sets with same label names but different values (e.g. multitenant Receivers), each exemplar matcher will be repeated once for each label set. Because of this, a receiver hosting 200 tenants can get the same exemplar matcher 200 times. This leads to the underlying stores slowing down and timing out when asked for exemplars. This commit modifies the exemplar proxy to deduplicate matchers and only send a matcher once to an underlying store. 
Signed-off-by: Filip Petkovski <[email protected]> * Address CR comments Signed-off-by: Filip Petkovski <[email protected]> * Receive: add per request limits for remote write (#5527) * Add per request limits for remote write Signed-off-by: Douglas Camata <[email protected]> * Remove useless TODO item Signed-off-by: Douglas Camata <[email protected]> * Refactor write request limits test Signed-off-by: Douglas Camata <[email protected]> * Add write concurrency limit to Receive Signed-off-by: Douglas Camata <[email protected]> * Change write limits config option name Signed-off-by: Douglas Camata <[email protected]> * Document remote write concurrenty limit Signed-off-by: Douglas Camata <[email protected]> * Add changelog entry Signed-off-by: Douglas Camata <[email protected]> * Format docs Signed-off-by: Douglas Camata <[email protected]> * Extract request limiting logic from handler Signed-off-by: Douglas Camata <[email protected]> * Add copyright header Signed-off-by: Douglas Camata <[email protected]> * Add a TODO for per-tenant limits Signed-off-by: Douglas Camata <[email protected]> * Add default value and hide the request limit flags Signed-off-by: Douglas Camata <[email protected]> * Improve TODO comment in request limits Signed-off-by: Douglas Camata <[email protected]> * Update Receive docs after flags wre made hidden Signed-off-by: Douglas Camata <[email protected]> * Add note about WIP in Receive request limits doc Signed-off-by: Douglas Camata <[email protected]> * Fix typo in Receive docs Co-authored-by: Filip Petkovski <[email protected]> Signed-off-by: Douglas Camata <[email protected]> * Fix help text for concurrent request limit Signed-off-by: Douglas Camata <[email protected]> * Use byte unit helpers for improved readability Signed-off-by: Douglas Camata <[email protected]> * Removed check for nil writeGate The constructor sets the writeGate to a noopGate. 
Signed-off-by: Douglas Camata <[email protected]> * Better organize linebreaks Signed-off-by: Douglas Camata <[email protected]> * Fix help text for limits hit metric Signed-off-by: Douglas Camata <[email protected]> * Apply some english feedback Signed-off-by: Douglas Camata <[email protected]> * Improve limits & gates documentationb Signed-off-by: Douglas Camata <[email protected]> * Fix import clause Signed-off-by: Douglas Camata <[email protected]> * Use a 3 node hashring for write limits test This should ensure the request fanout logic cannot somehow interfere with the request limit logic. Signed-off-by: Douglas Camata <[email protected]> * Fix comment Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Douglas Camata <[email protected]> * Announce sharding in ruler and store proxy (#5560) The ruler and store proxy currently support series sharding through the components that they use. However, this capability is not announced to the querier. This commit modifies their Info calls to indicate to the querier that it doesn't need to shard the response it receives from rulers and other store proxies. 
Signed-off-by: Filip Petkovski <[email protected]> * Fix flaky e2e tests (#5563) * Tools: Fix e2e test flake Signed-off-by: Saswata Mukherjee <[email protected]> * Metadata: Fix flaky e2e test Signed-off-by: Saswata Mukherjee <[email protected]> * Compact: Fix flaky e2e test Signed-off-by: Saswata Mukherjee <[email protected]> * Bumping actions/cache to v3 for e2e tests Signed-off-by: Saswata Mukherjee <[email protected]> * Add missing e2e.WaitMissingMetrics Signed-off-by: Saswata Mukherjee <[email protected]> * Meta-monitoring based active series limiting (#5520) * Add initial PoC for meta-monitoring Receive active series limits Signed-off-by: Saswata Mukherjee <[email protected]> * Add e2e tests, rebase Signed-off-by: Saswata Mukherjee <[email protected]> * Add multitenant test + remake diagrams Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestions; Make naming consistent; Rm/Add metrics Signed-off-by: Saswata Mukherjee <[email protected]> * Reuse meta-monitoring client Signed-off-by: Saswata Mukherjee <[email protected]> * Fix panic Signed-off-by: Saswata Mukherjee <[email protected]> * Cache meta-monitoring query result Signed-off-by: Saswata Mukherjee <[email protected]> * Fix lint Signed-off-by: Saswata Mukherjee <[email protected]> * Fail fast when limiting Signed-off-by: Saswata Mukherjee <[email protected]> * Implement suggestions: docs + mutex + struct Signed-off-by: Saswata Mukherjee <[email protected]> * Add interface and no-op Signed-off-by: Saswata Mukherjee <[email protected]> * Add changelog entry Signed-off-by: Saswata Mukherjee <[email protected]> * Add seriesLimitSupported to handler Signed-off-by: Saswata Mukherjee <[email protected]> * Remove tools fork Signed-off-by: Saswata Mukherjee <[email protected]> * Change docs header Signed-off-by: Saswata Mukherjee <[email protected]> * Remove usage of ioutil (#5564) Signed-off-by: Saswata Mukherjee <[email protected]> * docs/contribution.md: Update required Go version (#5557) * 
delete_katacoda Signed-off-by: Akshit42-hue <[email protected]> * updated go version Signed-off-by: Akshit42-hue <[email protected]> * update golang version Signed-off-by: Akshit42-hue <[email protected]> * updated Signed-off-by: Akshit42-hue <[email protected]> * Retrigger CI Signed-off-by: Akshit42-hue <[email protected]> * Retrigger CI Signed-off-by: Akshit42-hue <[email protected]> * fix an expression param in a link to an alert in the rules page (#5562) Signed-off-by: Rostislav Benes <[email protected]> Co-authored-by: Rostislav Benes <[email protected]> * Receiver: Validate labels in write requests (#5508) * Add label set validation method Signed-off-by: Matej Gera <matejgera@g…
To test
closes #5176
Changes
Verification
TODO: