Release v0.37 #109

jnyi · 2024-12-02T18:15:20Z

Merge the db_main branch, which has been running for a few weeks, into the release branch. A few highlights to call out:

cuckoo_filters: reduce DB CPU usage
pgw counter resets: fixed in [ES-1292925] Fix metrics with reusable counter resets #107
tenant race condition: fixed in [ES-1314123] race condition for memorize tsdb client #108
fix querying non-existent blocks: Querier ignore 'The specified key does not exist' error #97

I added CHANGELOG entry for this change.
Change is not relevant to the end user.

Changes

Verification

* fix serverAsClient goroutines leak Signed-off-by: Thibault Mange <[email protected]> * fix lint Signed-off-by: Thibault Mange <[email protected]> * update changelog Signed-off-by: Thibault Mange <[email protected]> * delete invalid comment Signed-off-by: Thibault Mange <[email protected]> * remove temp dev test Signed-off-by: Thibault Mange <[email protected]> * remove timer channel drain Signed-off-by: Thibault Mange <[email protected]> --------- Signed-off-by: Thibault Mange <[email protected]>

If we account stats for remote write and local writes we will count them twice since the remote write will be counted locally again by the remote receiver instance. Signed-off-by: Michael Hoffmann <[email protected]>

We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice. This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel. Signed-off-by: Filip Petkovski <[email protected]>
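A minimal sketch of the timeout guard described above, assuming a plain Go channel stands in for the collection channel. The names `sendWithTimeout` and `errCollectTimeout` are hypothetical and are not taken from the Thanos code base.

```go
package main

import (
	"errors"
	"time"
)

// errCollectTimeout is a hypothetical sentinel error used only in this sketch.
var errCollectTimeout = errors.New("collect: receiver did not read in time")

// sendWithTimeout delivers each value on out, but gives up once the overall
// timeout elapses. A receiver that never reads can therefore no longer block
// the caller (and any lock the caller holds) forever.
func sendWithTimeout(out chan<- float64, values []float64, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for _, v := range values {
		select {
		case out <- v:
			// Delivered; continue with the next value.
		case <-timer.C:
			// The receiver stalled; stop instead of hanging.
			return errCollectTimeout
		}
	}
	return nil
}
```

A caller that holds a lock while collecting can release it as soon as `sendWithTimeout` returns, whether or not every value was delivered.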

…ne (thanos-io#7382) * *: Ensure objstore flag values are masked & disable debug/pprof/cmdline Signed-off-by: Saswata Mukherjee <[email protected]> * small fix Signed-off-by: Saswata Mukherjee <[email protected]> --------- Signed-off-by: Saswata Mukherjee <[email protected]>

In LabelNames and LabelValues, gRPC calls were not pruned properly. While results are not wrong, this leads to inefficient fan-out for setups with many endpoints. We took the opportunity to unify the store filtering and generally also the larger layout of the gRPC methods, including logging and tracing. Signed-off-by: Michael Hoffmann <[email protected]>

Signed-off-by: Pedro Tanaka <[email protected]>

* Appending warn to changelog about breaking change Signed-off-by: Pedro Tanaka <[email protected]> * Including warning emoji Signed-off-by: Pedro Tanaka <[email protected]> --------- Signed-off-by: Pedro Tanaka <[email protected]>

…7392) If we have a new querier it will create query hints even without the pushdown feature being present anymore. Old sidecars will then trigger query pushdown which leads to broken max,min,max_over_time and min_over_time. Signed-off-by: Michael Hoffmann <[email protected]>

* *: Using native histograms for grpc middleware metrics Since we updated the middleware library, we can now use native histograms to keep track of latencies in grpc calls. This is a semi-breaking change if people enabled native histogram collection on their Prometheus monitoring Thanos instances. Signed-off-by: Pedro Tanaka <[email protected]> adding change log Signed-off-by: Pedro Tanaka <[email protected]> * removing empty space; Signed-off-by: Pedro Tanaka <[email protected]> * Put full disclaimer in changelog Signed-off-by: Pedro Tanaka <[email protected]> --------- Signed-off-by: Pedro Tanaka <[email protected]>

* compact: recover from panics (thanos-io#7318) For thanos-io#6775, it would be useful to know the exact block IDs to aid debugging. Signed-off-by: Giedrius Statkevičius <[email protected]> * Sidecar: wait for prometheus on startup (thanos-io#7323) Signed-off-by: Michael Hoffmann <[email protected]> * Receive: fix serverAsClient.Series goroutines leak (thanos-io#6948) * fix serverAsClient goroutines leak Signed-off-by: Thibault Mange <[email protected]> * fix lint Signed-off-by: Thibault Mange <[email protected]> * update changelog Signed-off-by: Thibault Mange <[email protected]> * delete invalid comment Signed-off-by: Thibault Mange <[email protected]> * remove temp dev test Signed-off-by: Thibault Mange <[email protected]> * remove timer channel drain Signed-off-by: Thibault Mange <[email protected]> --------- Signed-off-by: Thibault Mange <[email protected]> * Receive: fix stats (thanos-io#7373) If we account stats for remote write and local writes we will count them twice since the remote write will be counted locally again by the remote receiver instance. Signed-off-by: Michael Hoffmann <[email protected]> * *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (thanos-io#7382) * *: Ensure objstore flag values are masked & disable debug/pprof/cmdline Signed-off-by: Saswata Mukherjee <[email protected]> * small fix Signed-off-by: Saswata Mukherjee <[email protected]> --------- Signed-off-by: Saswata Mukherjee <[email protected]> * Query: dont pass query hints to avoid triggering pushdown (thanos-io#7392) If we have a new querier it will create query hints even without the pushdown feature being present anymore. Old sidecars will then trigger query pushdown which leads to broken max,min,max_over_time and min_over_time. 
Signed-off-by: Michael Hoffmann <[email protected]> * Cut patch release v0.35.1 Signed-off-by: Saswata Mukherjee <[email protected]> --------- Signed-off-by: Giedrius Statkevičius <[email protected]> Signed-off-by: Michael Hoffmann <[email protected]> Signed-off-by: Thibault Mange <[email protected]> Signed-off-by: Saswata Mukherjee <[email protected]> Co-authored-by: Giedrius Statkevičius <[email protected]> Co-authored-by: Michael Hoffmann <[email protected]> Co-authored-by: Thibault Mange <[email protected]>

Previously we deferred starting the gRPC server by blocking the whole startup until we could ping Prometheus. This breaks use cases that rely on the config reloader to start Prometheus. We fix it by using a channel to defer starting the gRPC server and loading external labels in an actor concurrently. Signed-off-by: Michael Hoffmann <[email protected]>
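The channel-based deferral described above can be sketched roughly as follows. This is an illustration under assumptions, not the sidecar's actual code: `runDeferred`, `waitForUpstream` and `serve` are hypothetical names, with `waitForUpstream` standing in for pinging Prometheus and loading external labels, and `serve` for starting the gRPC server.

```go
package main

// runDeferred starts two concurrent actors. The first waits for the upstream
// and then closes the ready channel; the second blocks only on that channel
// before serving, so the rest of the startup is not held up.
// The returned channel is closed once serve has returned.
func runDeferred(waitForUpstream func(), serve func()) <-chan struct{} {
	ready := make(chan struct{})
	finished := make(chan struct{})

	// Actor 1: wait for the upstream, then signal readiness.
	go func() {
		waitForUpstream()
		close(ready)
	}()

	// Actor 2: start serving only after readiness was signalled.
	go func() {
		defer close(finished)
		<-ready
		serve()
	}()

	return finished
}
```

Because `runDeferred` returns immediately, other components (for example a config reloader) can start while the upstream is still unavailable.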

* Update Prometheus Signed-off-by: alanprot <[email protected]> * Updating prometheus to 4e664035e84e Signed-off-by: alanprot <[email protected]> * Temporarily pinning prometheus common Signed-off-by: alanprot <[email protected]> * fixing lint Signed-off-by: alanprot <[email protected]> * Using jsoniter to encode promql responses Signed-off-by: alanprot <[email protected]> * Removing e2e test case with invalid hyphen on a matcher -> prometheus now supports this use case Signed-off-by: alanprot <[email protected]> * Updating prometheus to v0.52.2-0.20240606174736-edd558884b24 Signed-off-by: alanprot <[email protected]> * pinning grpc to v1.63.2 Signed-off-by: alanprot <[email protected]> --------- Signed-off-by: alanprot <[email protected]> Co-authored-by: EC2 Default User <[email protected]>

Signed-off-by: Michael Hoffmann <[email protected]>

Allow suppressing environment variables expansion errors when unset, and thus keep the reloader from crashing. Instead leave them as is. Signed-off-by: Pranshu Srivastava <[email protected]>
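As an illustration of the behaviour described above, here is a sketch of tolerant environment variable expansion. Everything in it is an assumption made for the example: the `$(NAME)` reference syntax, the `expandEnv` function, and the `tolerateUnset` switch are hypothetical and may differ from the reloader's real implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// envRef matches references of the form $(NAME). The syntax is an assumption
// made for this sketch.
var envRef = regexp.MustCompile(`\$\(([a-zA-Z_][a-zA-Z0-9_]*)\)`)

// expandEnv replaces every $(NAME) reference using lookup. When a variable is
// unset and tolerateUnset is true, the reference is left as is; otherwise an
// error is returned.
func expandEnv(in string, lookup func(string) (string, bool), tolerateUnset bool) (string, error) {
	var firstErr error
	out := envRef.ReplaceAllStringFunc(in, func(ref string) string {
		name := envRef.FindStringSubmatch(ref)[1]
		if val, ok := lookup(name); ok {
			return val
		}
		if !tolerateUnset && firstErr == nil {
			firstErr = fmt.Errorf("environment variable %q is not set", name)
		}
		return ref // leave the unresolved reference untouched
	})
	if firstErr != nil {
		return "", firstErr
	}
	return out, nil
}
```

Passing `os.LookupEnv` as `lookup` reads from the process environment; taking the lookup as a parameter just keeps the sketch easy to test.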

* Update adopters.yml Signed-off-by: Rishabh Soni <[email protected]> * Add files via upload Signed-off-by: Rishabh Soni <[email protected]> --------- Signed-off-by: Rishabh Soni <[email protected]>

Signed-off-by: Vasiliy Rumyantsev <[email protected]>

Signed-off-by: Pedro Tanaka <[email protected]>

Recently ran into an issue with Istio in particular, where leaving the trailing dot on the SRV record returned by `dnssrvnoa` lookups led to an inability to connect to the endpoint. Removing the trailing dot fixes this behaviour. Now, technically, this is a valid URL, and it shouldn't be a problem. One could definitely argue that Istio should be responsible here for ensuring that the traffic is delivered. The problem seems rooted in how Istio attempts to do wildcard matching on URLs it receives: including the dot leads it to look up an empty DNS field, which is invalid. The approach I take here is actually copied from how Prometheus does it. Therefore I hope we can sneak this through with the argument that 'this is how Prometheus does it', regardless of whether or not this is philosophically correct... Signed-off-by: verejoel <[email protected]>
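The trailing-dot removal can be sketched in a few lines. The helper name `srvAddress` and the example host are hypothetical; only standard library calls are used.

```go
package main

import (
	"net"
	"strconv"
	"strings"
)

// srvAddress builds a dialable host:port from an SRV target. SRV targets are
// usually returned as fully qualified names with a trailing dot, which this
// helper strips before joining host and port.
func srvAddress(target string, port uint16) string {
	host := strings.TrimSuffix(target, ".")
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}
```

For example, `srvAddress("thanos-store.monitoring.svc.cluster.local.", 10901)` yields `thanos-store.monitoring.svc.cluster.local:10901`.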

Bumps [go.opentelemetry.io/contrib/propagators/autoprop](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.38.0 to 0.53.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go-contrib@zpages/v0.38.0...zpages/v0.53.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/contrib/propagators/autoprop dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [go.opentelemetry.io/contrib/samplers/jaegerremote](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.7.0 to 0.22.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go-contrib@v0.7.0...v0.22.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/contrib/samplers/jaegerremote dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…hanos-io#7492) * compact: Update filtered blocks list before second downsample pass If the second downsampling pass is given the same filteredMetas list as the first pass, it will create duplicates of blocks created in the first pass. It will also not be able to do further downsampling e.g 5m->1h using blocks created in the first pass, as it will not be aware of them. The metadata was already being synced before the second pass, but not updated into the filteredMetas list. Signed-off-by: Thomas Hartland <[email protected]> * Update changelog Signed-off-by: Thomas Hartland <[email protected]> * e2e/compact: Fix number of blocks cleaned assertion The value was increased in 2ed48f7 to fix the test, with the reasoning that the hardcoded value must have been taken from a run of the CI that didn't reach the max value due to CI worker lag. More likely the real reason is that commit 68bef3f the day before had caused blocks to be duplicated during downsampling. The duplicate block is immediately marked for deletion, causing an extra +1 in the number of blocks cleaned. Subtracting one from the value again now that the block duplication issue is fixed. Signed-off-by: Thomas Hartland <[email protected]> * e2e/compact: Revert change to downsample count assertion Combined with the previous commit this effectively reverts all of 2ed48f7, in which two assertions were changed to (unknowingly) account for a bug which had just been introduced in the downsampling code, causing duplicate blocks. This assertion change I am less sure on the reasoning for, but after running through the e2e tests several times locally, it is consistent that the only downsampling happens in the "compact-working" step, and so all other steps would report 0 for their total downsamples metric. Signed-off-by: Thomas Hartland <[email protected]> --------- Signed-off-by: Thomas Hartland <[email protected]>

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>

…s.go (thanos-io#7552) Signed-off-by: Nishant Bansal <[email protected]>

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.24.0 to 0.25.0. - [Commits](golang/crypto@v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

thanos-io#7528) Bumps [go.opentelemetry.io/otel/bridge/opentracing](https://github.com/open-telemetry/opentelemetry-go) from 1.21.0 to 1.28.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.21.0...v1.28.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/otel/bridge/opentracing dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

This commit adds the option of filtering rules by rule name, rule group, or file. This brings the rule API closer in line with the current Prometheus API. Signed-off-by: Jacob Baungard Hansen <[email protected]>
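To make the filtering semantics concrete, here is a small sketch under assumptions. The `rule` and `ruleGroup` types and the `filterGroups` function are hypothetical and much simpler than the real API types; an empty filter list is assumed to mean "no restriction".

```go
package main

// rule and ruleGroup are simplified, hypothetical stand-ins for the API types.
type rule struct{ Name string }

type ruleGroup struct {
	Name  string
	File  string
	Rules []rule
}

// toSet turns a list of strings into a set for constant-time membership checks.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// filterGroups keeps only groups matching groupNames and files, and inside
// those only rules matching names. Groups left without rules are dropped.
// The input slice is not modified.
func filterGroups(groups []ruleGroup, names, groupNames, files []string) []ruleGroup {
	nameSet, groupSet, fileSet := toSet(names), toSet(groupNames), toSet(files)

	var out []ruleGroup
	for _, g := range groups {
		if _, ok := groupSet[g.Name]; len(groupSet) > 0 && !ok {
			continue
		}
		if _, ok := fileSet[g.File]; len(fileSet) > 0 && !ok {
			continue
		}
		if len(nameSet) == 0 {
			out = append(out, g)
			continue
		}
		var kept []rule
		for _, r := range g.Rules {
			if _, ok := nameSet[r.Name]; ok {
				kept = append(kept, r)
			}
		}
		if len(kept) > 0 {
			g.Rules = kept // g is a copy, so the caller's group is untouched
			out = append(out, g)
		}
	}
	return out
}
```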

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.26.0 to 0.27.0. - [Commits](golang/net@v0.26.0...v0.27.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…hanos-io#7525) Bumps [go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc](https://github.com/open-telemetry/opentelemetry-go) from 1.27.0 to 1.28.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.27.0...v1.28.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Signed-off-by: Yi Jin <[email protected]>

Signed-off-by: Yuchen Wang <[email protected]>

* support hedged requests in store Signed-off-by: milinddethe15 <[email protected]> * hedged roundtripper with tdigest for dynamic delay Signed-off-by: milinddethe15 <[email protected]> * refactor struct and fix lint Signed-off-by: milinddethe15 <[email protected]> * Improve hedging implementation Signed-off-by: milinddethe15 <[email protected]> * Improved hedging implementation Signed-off-by: milinddethe15 <[email protected]> * Update store doc Signed-off-by: milinddethe15 <[email protected]> * fix white space Signed-off-by: milinddethe15 <[email protected]> * add enabled field Signed-off-by: milinddethe15 <[email protected]> --------- Signed-off-by: milinddethe15 <[email protected]>
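For readers unfamiliar with hedged requests, the idea is to send a second copy of a slow request after a delay and use whichever response arrives first. The sketch below is a simplified illustration, not the store implementation: it uses a fixed delay where the commits above mention a t-digest-derived dynamic delay, and `hedged`, `result` and the `stop` channel are hypothetical names.

```go
package main

import "time"

// result carries the outcome of one attempt.
type result struct {
	val string
	err error
}

// hedged runs do once and, if no response arrives within delay, launches a
// second attempt. The first response to arrive wins, even if it is an error.
// The stop channel passed to do is closed when hedged returns, so a losing
// attempt can abort early.
func hedged(delay time.Duration, do func(stop <-chan struct{}) (string, error)) (string, error) {
	stop := make(chan struct{})
	defer close(stop)

	// Buffered for both attempts so a losing goroutine never blocks on send.
	results := make(chan result, 2)
	launch := func() {
		go func() {
			v, err := do(stop)
			results <- result{v, err}
		}()
	}

	launch()
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case r := <-results:
		return r.val, r.err // fast path: no hedge needed
	case <-timer.C:
		launch() // the first attempt is slow; send the hedge
	}

	r := <-results
	return r.val, r.err
}
```

A dynamic delay would replace the fixed `delay` argument with a value derived from observed latencies, for example a high percentile.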

I always get this in logs: ``` err: receive capnp conn: close tcp ...: use of closed network connection ``` This is also visible in the e2e test. After Done() returns, the connection is closed either way so no need to close it again. Signed-off-by: Giedrius Statkevičius <[email protected]>

* Fix a storage GW bug that loses TSDB infos when joining them * E2E test demonstrating a bug in the MinT calculation in distributed Engine Signed-off-by: Michael Hoffmann <[email protected]>

Signed-off-by: Saswata Mukherjee <[email protected]>

…o#7915) * always close block series client at the end Signed-off-by: Ben Ye <[email protected]> * add back close for loser tree Signed-off-by: Ben Ye <[email protected]> --------- Signed-off-by: Ben Ye <[email protected]>

* Update objstore and promql-engine to latest Signed-off-by: Saswata Mukherjee <[email protected]> * Fixes after upgrade Signed-off-by: Saswata Mukherjee <[email protected]> --------- Signed-off-by: Saswata Mukherjee <[email protected]>

Signed-off-by: Saswata Mukherjee <[email protected]>

Signed-off-by: Yi Jin <[email protected]>

thibaultmg and others added 30 commits October 16, 2024 14:50

Receive: fix stats (thanos-io#7373)

9d27e07

If we account stats for remote write and local writes we will count them twice since the remote write will be counted locally again by the remote receiver instance. Signed-off-by: Michael Hoffmann <[email protected]>

Adding changelog

806e741

Signed-off-by: Pedro Tanaka <[email protected]>

fixing details

afc5020

Signed-off-by: Pedro Tanaka <[email protected]>

CHANGELOG: Mark 0.36 as in progress (thanos-io#7486)

52ee266

Signed-off-by: Michael Hoffmann <[email protected]>

reloader: allow suppressing envvar errors (thanos-io#7429)

77859b7

Allow suppressing environment variables expansion errors when unset, and thus keep the reloader from crashing. Instead leave them as is. Signed-off-by: Pranshu Srivastava <[email protected]>

chore: Add nirmata to adopters (thanos-io#7506)

3dba461

* Update adopters.yml Signed-off-by: Rishabh Soni <[email protected]> * Add files via upload Signed-off-by: Rishabh Soni <[email protected]> --------- Signed-off-by: Rishabh Soni <[email protected]>

removed mention of unused pkg (thanos-io#7515)

acd49f3

Signed-off-by: Vasiliy Rumyantsev <[email protected]>

QFE: disable double compression middleware (thanos-io#7511)

6722e67

Signed-off-by: Pedro Tanaka <[email protected]>

Build with Go 1.22 (thanos-io#7559)

0740758

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>

Fix issue thanos-io#7550: Bug fix and complete test coverage for tool…

4b93a1f

…s.go (thanos-io#7552) Signed-off-by: Nishant Bansal <[email protected]>

Update prometheus and promql-engine dependencies (thanos-io#7558)

cde14c9

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>

api/rules: Add filtering on rule name/group/file (thanos-io#7560)

2fe95c6

This commit adds the option of filtering rules by rule name, rule group, or file. This brings the rule API closer in line with the current Prometheus API. Signed-off-by: Jacob Baungard Hansen <[email protected]>

yuchen-db and others added 26 commits November 13, 2024 16:03

only do top metric in db (#101)

f1b1919

silence conflict samples

d58a1ac

Signed-off-by: Yi Jin <[email protected]>

silence conflict samples (#102)

17f9d7c

Merge branch 'db_main' into yuchen-db/scaledown-with-operator

5bee491

Signed-off-by: Yuchen Wang <[email protected]>

return tenants in http header

662332a

Add receiver downscale endpoint (#88)

6cf9daa

store, query: remote engine bug (thanos-io#7904)

caa972f

* Fix a storage GW bug that loses TSDB infos when joining them * E2E test demonstrating a bug in the MinT calculation in distributed Engine Signed-off-by: Michael Hoffmann <[email protected]>

Skip TestDistributedEngineWithDisjointTSDBs (thanos-io#7911)

2a975d3

Signed-off-by: Saswata Mukherjee <[email protected]>

measure pre-aggregated metrics write latency

87b8554

fix metric label name

8b4e854

Changelog: Mark v0.37 release in progress (thanos-io#7920)

8c49344

Signed-off-by: Saswata Mukherjee <[email protected]>

docs: Add link to ignore (thanos-io#7926)

6a2be98

Signed-off-by: Saswata Mukherjee <[email protected]>

docs: Fix formatting again (thanos-io#7928)

fd06432

Signed-off-by: Saswata Mukherjee <[email protected]>

measure pre-aggregated metrics write latency (#104)

457616e

merge oss main on 2024-11-19

7a8a541

Signed-off-by: Yi Jin <[email protected]>

[0.37-nov 19] (#106)

b0aff72

[ES-1292925] fix reusable counter resets

0168fd1

Signed-off-by: Yi Jin <[email protected]>

fix bugs and add new test cases

77347ce

Signed-off-by: Yi Jin <[email protected]>

rename to quorum

ab31d53

Signed-off-by: Yi Jin <[email protected]>

[ES-1292925] Fix metrics with reusable counter resets (#107)

67f5336

[ES-1314123] race condition for memorize tsdb client

ea908b1

Signed-off-by: Yi Jin <[email protected]>

[ES-1314123] race condition for memorize tsdb client (#108)

1c69c7e

jnyi requested review from hczhu-db, yuchen-db and yulong-db December 2, 2024 18:15

jnyi merged commit 149364c into release Dec 2, 2024
184 of 185 checks passed
