Failpoint raftAfterSave=sleep(1s) is flaking in robustness test #18240

Closed
serathius opened this issue Jun 27, 2024 · 10 comments
Labels: area/robustness-testing, area/testing, priority/important-soon, type/flake

Comments

@serathius
Member

Which GitHub Action / Prow Jobs are flaking?

https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-amd64

Which tests are flaking?

TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3

GitHub Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-main-amd64/1806029033377370112

Reason for failure (if possible)

    logger.go:146: 2024-06-26T19:45:53.423Z	INFO	goFailpoint deactivate failed	{"failpoint": "raftAfterSave=sleep(1s)", "error": "Delete \"http://127.0.0.1:12381/raftAfterSave\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
    logger.go:146: 2024-06-26T19:45:53.423Z	ERROR	Failed to trigger failpoint	{"failpoint": "raftAfterSave=sleep(1s)", "error": "goFailpoint raftAfterSave=sleep(1s) deactivate failed, err: Delete \"http://127.0.0.1:12381/raftAfterSave\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
    main_test.go:122: failed triggering failpoint, err: goFailpoint raftAfterSave=sleep(1s) deactivate failed, err: Delete "http://127.0.0.1:12381/raftAfterSave": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Anything else we need to know?

No response

@jmhbnz added the priority/important-soon and area/robustness-testing labels on Jun 28, 2024
@MadhavJivrajani
Contributor

MadhavJivrajani commented Jul 3, 2024

validate.go:34: Broken validation assumptions: non empty database at start or first write didn't succeed, required by model implementation

@serathius I also see this in the logs ^

From

func checkValidationAssumptions(reports []report.ClientReport, persistedRequests []model.EtcdRequest) error {

@MadhavJivrajani
Contributor

For the failpoint flake, we have the option of increasing the client timeout for that particular failpoint, but I'm still not entirely sure why it's timing out to begin with.

@siyuanfoundation
Contributor

/cc @henrybear327

@serathius
Member Author

/assign @henrybear327

@k8s-ci-robot

@serathius: GitHub didn't allow me to assign the following users: henrybear327.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @henrybear327

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@henrybear327
Contributor

/assign @henrybear327

@serathius
Member Author

@henrybear327 when will you be able to send the PR? The timeout increase alone should be a one-line change.

henrybear327 added a commit to henrybear327/etcd that referenced this issue Aug 1, 2024
henrybear327 added a commit to henrybear327/etcd that referenced this issue Aug 1, 2024
Reference:
- etcd-io#18240

Signed-off-by: Chun-Hung Tseng <[email protected]>
@henrybear327
Contributor

@henrybear327 when will you be able to send the PR? The timeout increase alone should be a one-line change.

Done just now!

I will continue investigating the gofail library to see whether any issues stem from the gofail v0.2.0 changes that I made.

@henrybear327
Contributor

I think we can close this issue, right @serathius? IIRC #18397 fixed the issue.

@serathius
Member Author

Right, thanks @henrybear327
