Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restart when manager and agent cannnot connect to kafka #1249

Merged
merged 1 commit into from
Dec 3, 2024

Conversation

ldpliu
Copy link
Contributor

@ldpliu ldpliu commented Nov 29, 2024

Summary

Related issue(s)

https://issues.redhat.com/browse/ACM-14562
Fixes #

Tests

  • Unit/function tests have been added and incorporated into make unit-tests.
  • Integration tests have been added and incorporated into make integration-test.
  • E2E tests have been added and incorporated into make e2e-test-all.
  • List other manual tests you have done.
    • tested delete kafka cluster, and stop operator. manager and agent will restart.
    • set transport-failure-threshold=9999 in operator deployment, manager and agent will not be restart.

@ldpliu
Copy link
Contributor Author

ldpliu commented Nov 29, 2024

/cc @clyang82 @yanmxa

@openshift-ci openshift-ci bot requested review from clyang82 and yanmxa November 29, 2024 01:05
Copy link
Contributor

@clyang82 clyang82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, let us add a back door in case we do not handle all error cases. FYI @ldpliu

@ldpliu
Copy link
Contributor Author

ldpliu commented Dec 2, 2024

As we discussed offline, let us add a back door in case we do not handle all error cases. FYI @ldpliu
Done

@ldpliu ldpliu force-pushed the panic-kaka branch 3 times, most recently from c6023b8 to ccd1009 Compare December 3, 2024 04:36

errorCount++
if errorCount >= transportFailureThreshold {
log.Panicf("Thransport producer error > 10 in 5 minutes, error: %v", ev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Thransport/transport

@@ -212,6 +215,16 @@ func handleProducerEvents(log *zap.SugaredLogger, eventChan chan kafka.Event) {
log.Warnw("Transport producer client error, ignore it for most cases", "error", ev)
} else {
log.Warnw("Thransport producer client error", "error", ev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Thransport/transport

@@ -124,6 +124,8 @@ func parseFlags() *configs.AgentConfig {
"The agent running namespace, also used as leader election namespace")
pflag.IntVar(&agentConfig.SpecWorkPoolSize, "consumer-worker-pool-size", 10,
"The goroutine number to propagate the bundles on managed cluster.")
pflag.IntVar(&agentConfig.TransportConfig.FailureThreshold, "transport-failure-threshold", 10,
"Restart pod when tansport error count > transport-failure-threshold in 5 mins.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restart the pod if the transport error count exceeds the transport-failure-threshold within 5 minutes

} else {
log.Warnw("Thransport producer client error", "error", ev)
log.Warnw("thransport producer client error", "error", ev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/thransport/transport


errorCount++
if errorCount >= transportFailureThreshold {
log.Panicf("thransport producer error > 10 in 5 minutes, error: %v", ev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/thransport/transport

Copy link
Contributor

@clyang82 clyang82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci openshift-ci bot added the lgtm label Dec 3, 2024
Copy link

openshift-ci bot commented Dec 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clyang82, ldpliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 3, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 1b84835 into stolostron:main Dec 3, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants