Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -ingest-storage.kafka.target-consumer-lag-at-startup support #8579

Merged
merged 6 commits into from
Jul 3, 2024

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jul 2, 2024

What this PR does

As soon as an ingester switches to Running, the reported end-to-end latency spikes high. We can see the correlation here:

344916329-ff5993f2-544c-4c99-a03b-06dd99a3b5a1

Why? Because the ingesters are configured with -ingest-storage.kafka.max-consumer-lag-at-startup=15s (default), which means an ingester is guaranteed to have an end-to-end latency of up to 15s when switched to Running but there's no guarantee that the latency will be significantly lower than that. Significantly reducing -ingest-storage.kafka.max-consumer-lag-at-startup setting (e.g. to a couple of seconds) is tricky because if the ingester can't catch up to that low latency then it will never startup.

In this PR I'm proposing to add a new config option called -ingest-storage.kafka.target-consumer-lag-at-startup and making it a best-effort. How it works:

  1. At startup, the ingester waits until -ingest-storage.kafka.max-consumer-lag-at-startup (as in main)
  2. Then we give the ingester a "grace period" to reach -ingest-storage.kafka.target-consumer-lag-at-startup (defaults 2s). The grace period is equal to -ingest-storage.kafka.max-consumer-lag-at-startup; the rationale here is that we expect at least a 1x replay speed from Kafka.

This means that -ingest-storage.kafka.target-consumer-lag-at-startup is a best-effort, and -ingest-storage.kafka.max-consumer-lag-at-startup is guaranteed (as in main).

I've tested it in a Mimir dev cluster and looks fixing the issue:

344992679-3e533d72-f4da-4683-b6dc-ee85ff0860ef

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@pracucci pracucci force-pushed the add-target-consumer-lag-at-startup branch 2 times, most recently from a86cb02 to c3e0fc3 Compare July 2, 2024 11:31
}
}

return boff.Err()
// TODO should be moved to dskit's backoff
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I will open a PR to add context.Cause() support to dskit's backoff but I prefer to not block on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci pracucci marked this pull request as ready for review July 2, 2024 12:47
@pracucci pracucci requested review from tacole02 and a team as code owners July 2, 2024 12:47
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no major concerns, LGTM

docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
pkg/storage/ingest/reader.go Outdated Show resolved Hide resolved
pkg/storage/ingest/reader.go Show resolved Hide resolved
boff.Wait()
continue
}

// Send a direct request to the Kafka backend to fetch the last produced offset.
// We intentionally don't use WaitNextFetchLastProducedOffset() to not introduce further
// latency.
lastProducedOffsetRequestedAt := time.Now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was a subtle bug previously 👍

@pracucci pracucci force-pushed the add-target-consumer-lag-at-startup branch from 3919855 to d2bd338 Compare July 2, 2024 13:40
@tacole02 tacole02 added the type/docs Improvements or additions to documentation label Jul 2, 2024
Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docs! I've left a few comments.

docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
@pracucci pracucci force-pushed the add-target-consumer-lag-at-startup branch from 0cf787f to 6de7564 Compare July 3, 2024 06:33
pracucci and others added 2 commits July 3, 2024 08:33
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci
Copy link
Collaborator Author

pracucci commented Jul 3, 2024

Thanks for updating the docs! I've left a few comments.

Thanks @tacole02 for the review!

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, thank you.

// written to the partition, we give the reader maxLag time to replay the backlog + ingest the new data
// written in the meanwhile.
func() (time.Duration, error) {
timedCtx, cancel := context.WithTimeoutCause(ctx, maxLag, errWaitTargetLagDeadlineExceeded)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about context.WithTimeoutCause, nice.

@pracucci
Copy link
Collaborator Author

pracucci commented Jul 3, 2024

Going to merge this. Will address the TODO once dskit PR is merged.

@pracucci pracucci merged commit 74c72c8 into main Jul 3, 2024
29 checks passed
@pracucci pracucci deleted the add-target-consumer-lag-at-startup branch July 3, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants