Enable OOB tests #2187

Open
wants to merge 9 commits into
base: improvement/ZENKO-4414

Conversation

Kerkesni
Contributor

@Kerkesni Kerkesni commented Jan 24, 2025

  • Enable OOB tests
  • Increase the timeout of the OOB tests: replication of operations between the repds can be slow when the CI VM is saturated, and an operation only becomes visible in the raft log once it has been replicated to a majority of the repds.
  • Wait for the ingestion consumer group to reach a stable state before starting the tests. Zenko is reconciled just before the tests are launched, which restarts the ingestion pods and triggers a consumer group rebalance. The group stays in the rebalance state for about 30 seconds, which is long enough for the tests to start. If we do not wait, the first messages pushed to the topic can be skipped: the consumers start consuming late because of the rebalance, and they are configured to start from the latest offset when no offset is found (which is the case at startup).
  • While waiting for the consumer group to become stable, also check that the expected number of consumers is connected to it. When the ingestion pods restart after the reconcile, the previous consumers remain in the consumer group until their session times out. Not waiting for this can also cause messages to be skipped: the new consumer(s) are only assigned some of the partitions, and the rest stay assigned to the "dead" consumer until a future rebalance. At startup, no offset has generally been stored yet for the partitions assigned to the "dead" consumer, so when the rebalance occurs and the new consumer gets assigned to them, it starts from the latest offset and the earlier messages are skipped.
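
For illustration, the two checks described above (group state is stable, member count matches) could be sketched as below. This is a minimal sketch and not the script from this PR: the function names `group_is_ready` and `wait_for_group` are hypothetical, and it assumes the describe command prints a table whose last two columns are STATE and #MEMBERS, as `kafka-consumer-groups.sh --describe --state` does.

```shell
# Sketch only: helper names are hypothetical, not taken from this PR.

# group_is_ready EXPECTED_MEMBERS < describe-output
# Succeeds only if every data row reports state "Stable" and the expected
# number of members. Assumes STATE and #MEMBERS are the last two columns.
group_is_ready() {
    awk -v expected="$1" '
        # Data rows end with a numeric member count; the header row does not.
        NF >= 2 && $NF ~ /^[0-9]+$/ {
            found = 1
            if ($(NF-1) != "Stable" || $NF != expected) bad = 1
        }
        END { exit ((found && !bad) ? 0 : 1) }
    '
}

# wait_for_group DESCRIBE_CMD EXPECTED_MEMBERS [RETRIES] [INTERVAL_SECONDS]
# Polls DESCRIBE_CMD until the group is ready or the retries are exhausted.
wait_for_group() {
    describe_cmd="$1"
    expected_members="$2"
    retries="${3:-24}"
    interval="${4:-5}"
    attempt=0
    while [ "$attempt" -lt "$retries" ]; do
        # DESCRIBE_CMD is intentionally unquoted so it may carry arguments.
        if $describe_cmd 2>/dev/null | group_is_ready "$expected_members"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$interval"
    done
    return 1
}
```

A caller would pass whatever command prints the group description in its environment, together with the number of ingestion consumers it expects; checking the member count as well as the state is what guards against the "dead" consumers still holding partitions.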

Issue: ZENKO-4286

@bert-e
Contributor

bert-e commented Jan 24, 2025

Hello kerkesni,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Jan 24, 2025

Incorrect fix version

The Fix Version/s in issue ZENKO-4286 contains:

  • 2.11.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 2.10.12

Please check the Fix Version/s of ZENKO-4286, or the target
branch of this pull request.

@Kerkesni Kerkesni changed the base branch from development/2.10 to improvement/ZENKO-4414 January 24, 2025 16:03
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4286 branch from 9179cff to ef5fc35 Compare January 24, 2025 16:06
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4414 branch from 0efddbf to 9055c5f Compare January 24, 2025 16:06
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4286 branch 2 times, most recently from a4e8c1d to f46740b Compare January 27, 2025 17:30
@Kerkesni Kerkesni marked this pull request as ready for review January 27, 2025 17:31
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4286 branch 3 times, most recently from e4c754b to 6be55fc Compare January 28, 2025 11:44
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4414 branch 3 times, most recently from 8e060ea to 5a84228 Compare January 28, 2025 11:55
A raft log entry is considered committed only after
it has been replicated to the majority of the repds.

Increasing OOB test timeout to allow this replication
to happen.

Issue: ZENKO-4286
Zenko gets reconciled just before the tests are launched, which makes
the first two ingestion tests flaky: the consumer group is not stable
(it is rebalancing) when the tests start. This causes the consumers to
connect to the topic late (after the producer has pushed the first
message), so the first messages are skipped (consumers are configured
to start consuming from the latest offset when no offset is stored in
Kafka).

Issue: ZENKO-4286
"data" is the actual body of the response which
does not contain the error code.

Issue: ZENKO-4286
When a pod restarts, the previous consumers are kept in the consumer group
until their session times out. This can lead to the new consumer skipping
some messages on the partitions not yet assigned to it, as no offset is
stored for those initially (consumers are configured to start consuming
from the latest offset).

Issue: ZENKO-4286
@Kerkesni Kerkesni force-pushed the improvement/ZENKO-4286 branch from 6be55fc to c47e76d Compare January 28, 2025 12:02


if [ $ENABLE_RING_TESTS = true ]; then
# wait for ingestion processor to start consuming from Kafka
Should this (eventually) get baked into backbeat?
--> If so, we should probably add a ticket (we can put it in the EPIC "cleanup kafka management": https://scality.atlassian.net/browse/ARTESCA-9180)

3 participants