Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release v0.37 #109

Merged
merged 706 commits into from
Dec 2, 2024
Merged

Release v0.37 #109

merged 706 commits into from
Dec 2, 2024

Conversation

jnyi
Copy link
Collaborator

@jnyi jnyi commented Dec 2, 2024

merge db_main branch to release branch which has been running for a few weeks, a few highlights to call out:

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

thibaultmg and others added 30 commits October 16, 2024 14:50
* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <[email protected]>

* fix lint

Signed-off-by: Thibault Mange <[email protected]>

* update changelog

Signed-off-by: Thibault Mange <[email protected]>

* delete invalid comment

Signed-off-by: Thibault Mange <[email protected]>

* remove temp dev test

Signed-off-by: Thibault Mange <[email protected]>

* remove timer channel drain

Signed-off-by: Thibault Mange <[email protected]>

---------

Signed-off-by: Thibault Mange <[email protected]>
If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <[email protected]>
We have seen deadlocks with endpoint discovery caused by the metric
collector hanging and not releasing the store labels lock. This causes
the endpoint update to hang, which also makes all endpoint readers hang on
acquiring a read lock for the resolved endpoints slice.

This commit makes sure the Collect method on the metrics collector has
a built in timeout to guard against cases where an upstream call never
reads from the collection channel.

Signed-off-by: Filip Petkovski <[email protected]>
…ne (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <[email protected]>

* small fix

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>
In LabelNames and LabelValues gRPC calls were not pruned properly. While
results are not wrong, this leads to inefficient fan-out for setups with
many endpoints.
We took the opportunity to unify the store filtering and generally also
the larger layout of the gRPC methods, including logging and tracing.

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
* Appending warn to changelog about breaking change

Signed-off-by: Pedro Tanaka <[email protected]>

* Including warning emoji

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
…7392)

If we have a new querier it will create query hints even without the
pushdown feature being present anymore. Old sidecars will then trigger
query pushdown which leads to broken max,min,max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <[email protected]>
* *: Using native histograms for grpc middleware metrics

Since we updated the middleware library, we can now use native histograms to keep track of latencies in grpc calls.
This is a semi-breaking change if people enabled native histogram collection on their Prometheus monitoring Thanos instances.

Signed-off-by: Pedro Tanaka <[email protected]>

adding change log

Signed-off-by: Pedro Tanaka <[email protected]>

* removing empty space;

Signed-off-by: Pedro Tanaka <[email protected]>

* Put full disclaimer in changelog

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
* compact: recover from panics (thanos-io#7318)

For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* Sidecar: wait for prometheus on startup (thanos-io#7323)

Signed-off-by: Michael Hoffmann <[email protected]>

* Receive: fix serverAsClient.Series goroutines leak (thanos-io#6948)

* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <[email protected]>

* fix lint

Signed-off-by: Thibault Mange <[email protected]>

* update changelog

Signed-off-by: Thibault Mange <[email protected]>

* delete invalid comment

Signed-off-by: Thibault Mange <[email protected]>

* remove temp dev test

Signed-off-by: Thibault Mange <[email protected]>

* remove timer channel drain

Signed-off-by: Thibault Mange <[email protected]>

---------

Signed-off-by: Thibault Mange <[email protected]>

* Receive: fix stats (thanos-io#7373)

If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <[email protected]>

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <[email protected]>

* small fix

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>

* Query: dont pass query hints to avoid triggering pushdown (thanos-io#7392)

If we have a new querier it will create query hints even without the
pushdown feature being present anymore. Old sidecars will then trigger
query pushdown which leads to broken max,min,max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <[email protected]>

* Cut patch release v0.35.1

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Co-authored-by: Giedrius Statkevičius <[email protected]>
Co-authored-by: Michael Hoffmann <[email protected]>
Co-authored-by: Thibault Mange <[email protected]>
Previously we defered starting the gRPC server by blocking the whole
startup until we could ping prometheus. This breaks usecases that rely
on the config reloader to start prometheus.
We fix it by using a channel to defer starting the grpc server
and loading external labels in an actor concurrently.

Signed-off-by: Michael Hoffmann <[email protected]>
* Uupdate Prometheus

Signed-off-by: alanprot <[email protected]>

* Updating prometheus to 4e664035e84e

Signed-off-by: alanprot <[email protected]>

* Temporarily pinning prometheus common

Signed-off-by: alanprot <[email protected]>

* fixing lint

Signed-off-by: alanprot <[email protected]>

* Using jsoniter to encode promql responses

Signed-off-by: alanprot <[email protected]>

* Removing e2e test case with unvalid hifen on a matcher -> prometheus now support this use case

Signed-off-by: alanprot <[email protected]>

* Updating prometheus to v0.52.2-0.20240606174736-edd558884b24

Signed-off-by: alanprot <[email protected]>

* pinning grpc to v1.63.2

Signed-off-by: alanprot <[email protected]>

---------

Signed-off-by: alanprot <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Allow suppressing environment variables expansion errors when unset, and
thus keep the reloader from crashing. Instead leave them as is.

Signed-off-by: Pranshu Srivastava <[email protected]>
* Update adopters.yml

Signed-off-by: Rishabh Soni <[email protected]>

* Add files via upload

Signed-off-by: Rishabh Soni <[email protected]>

---------

Signed-off-by: Rishabh Soni <[email protected]>
Recently ran into an issue with Istio in particular, where leaving the
trailing dot on the SRV record returned by `dnssrvnoa` lookups led to an
inability to connect to the endpoint. Removing the trailing dot fixes
this behaviour.

Now, technically, this is a valid URL, and it shouldn't be a problem.
One could definitely argue that Istio should be responsible here for
ensuring that the traffic is delivered. The problem seems rooted in how
Istio attempts to do wildcard matching or URLs it receives - including
the dot leads it to lookup an empty DNS field, which is invalid.

The approach I take here is actually copied from how Prometheus does it.
Therefore I hope we can sneak this through with the argument that 'this
is how Prometheus does it', regardless of whether or not this is
philosophically correct...

Signed-off-by: verejoel <[email protected]>
Bumps [go.opentelemetry.io/contrib/propagators/autoprop](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.38.0 to 0.53.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go-contrib@zpages/v0.38.0...zpages/v0.53.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/contrib/propagators/autoprop
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [go.opentelemetry.io/contrib/samplers/jaegerremote](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.7.0 to 0.22.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go-contrib@v0.7.0...v0.22.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/contrib/samplers/jaegerremote
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hanos-io#7492)

* compact: Update filtered blocks list before second downsample pass

If the second downsampling pass is given the same filteredMetas
list as the first pass, it will create duplicates of blocks
created in the first pass.

It will also not be able to do further downsampling e.g 5m->1h
using blocks created in the first pass, as it will not be aware
of them.

The metadata was already being synced before the second pass,
but not updated into the filteredMetas list.

Signed-off-by: Thomas Hartland <[email protected]>

* Update changelog

Signed-off-by: Thomas Hartland <[email protected]>

* e2e/compact: Fix number of blocks cleaned assertion

The value was increased in 2ed48f7 to fix the test,
with the reasoning that the hardcoded value must
have been taken from a run of the CI that didn't
reach the max value due to CI worker lag.

More likely the real reason is that commit 68bef3f
the day before had caused blocks to be duplicated
during downsampling.

The duplicate block is immediately marked for deletion,
causing an extra +1 in the number of blocks cleaned.

Subtracting one from the value again now that the
block duplication issue is fixed.

Signed-off-by: Thomas Hartland <[email protected]>

* e2e/compact: Revert change to downsample count assertion

Combined with the previous commit this effectively reverts
all of 2ed48f7, in which two assertions were changed to
(unknowingly) account for a bug which had just been
introduced in the downsampling code, causing duplicate blocks.

This assertion change I am less sure on the reasoning for,
but after running through the e2e tests several times locally,
it is consistent that the only downsampling happens in the
"compact-working" step, and so all other steps would report 0
for their total downsamples metric.

Signed-off-by: Thomas Hartland <[email protected]>

---------

Signed-off-by: Thomas Hartland <[email protected]>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.24.0 to 0.25.0.
- [Commits](golang/crypto@v0.24.0...v0.25.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
thanos-io#7528)

Bumps [go.opentelemetry.io/otel/bridge/opentracing](https://github.com/open-telemetry/opentelemetry-go) from 1.21.0 to 1.28.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.21.0...v1.28.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/bridge/opentracing
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This commits adds the option of filtering rules by rule name, rule
group, or file. This brings the rule API closer in-line with the current
Prometheus api.

Signed-off-by: Jacob Baungard Hansen <[email protected]>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.26.0 to 0.27.0.
- [Commits](golang/net@v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hanos-io#7525)

Bumps [go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc](https://github.com/open-telemetry/opentelemetry-go) from 1.27.0 to 1.28.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.27.0...v1.28.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
yuchen-db and others added 26 commits November 13, 2024 16:03
* support hedged requests in store

Signed-off-by: milinddethe15 <[email protected]>

* hedged roundtripper with tdigest for dynamic delay

Signed-off-by: milinddethe15 <[email protected]>

* refactor struct and fix lint

Signed-off-by: milinddethe15 <[email protected]>

* Improve hedging implementation

Signed-off-by: milinddethe15 <[email protected]>

* Improved hedging implementation

Signed-off-by: milinddethe15 <[email protected]>

* Update store doc

Signed-off-by: milinddethe15 <[email protected]>

* fix white space

Signed-off-by: milinddethe15 <[email protected]>

* add enabled field

Signed-off-by: milinddethe15 <[email protected]>

---------

Signed-off-by: milinddethe15 <[email protected]>
I always get this in logs:
```
err: receive capnp conn: close tcp ...: use of closed network connection
```

This is also visible in the e2e test.

After Done() returns, the connection is closed either way so no need to
close it again.

Signed-off-by: Giedrius Statkevičius <[email protected]>
* Fix a storage GW bug that loses TSDB infos when joining them
* E2E test demonstrating a bug in the MinT calculation in distributed
  Engine

Signed-off-by: Michael Hoffmann <[email protected]>
…o#7915)

* always close block series client at the end

Signed-off-by: Ben Ye <[email protected]>

* add back close for loser tree

Signed-off-by: Ben Ye <[email protected]>

---------

Signed-off-by: Ben Ye <[email protected]>
* Update objstore and promql-engine to latest

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fixes after upgrade

Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: Yi Jin <[email protected]>
@jnyi jnyi merged commit 149364c into release Dec 2, 2024
184 of 185 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.