github: Use Canonical runners for system tests #469

Open · wants to merge 7 commits into main from self_hosted_runners
Conversation

@roosterfish (Contributor) commented Nov 8, 2024

This PR switches the system tests from the runner group GitHubMicrocloud to our own self-hosted runners.

@roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 8af0edc to 4f38ed2 on November 8, 2024 at 11:07
@roosterfish marked this pull request as ready for review on November 8, 2024 at 13:58
@roosterfish (Contributor, Author) commented:
@masnax I did some tests regarding this error.

Unfortunately, the timeout in the MicroCeph GetConfig client function is only 5s.
Based on your suggestion in the meeting earlier, can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

I have bootstrapped a single-node MicroCloud and fired requests to /1.0/services/microceph in parallel.
Right around when the MicroCloud is bootstrapped I saw a delay in the response, which could prove that something is going on there.
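The parallel probe described here could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual commands used: the socket path, endpoint, and the use of `curl` are assumptions on my part.

```shell
# Assumed MicroCloud control socket path; the real location may differ.
SOCKET="/var/snap/microcloud/common/state/control.socket"

probe_once() {
  # Run the given command and print how long it took, in milliseconds.
  local start end
  start="$(date +%s%3N)"
  "$@" > /dev/null 2>&1
  end="$(date +%s%3N)"
  echo "$(( end - start ))"
}

# Run in parallel with `microcloud init`, e.g.:
# while sleep 0.2; do
#   probe_once curl -s --unix-socket "$SOCKET" http://localhost/1.0/services/microceph
# done
```

A spike in the printed latencies around the moment the cluster is bootstrapped would match the delay described above.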

@masnax (Contributor) commented Nov 8, 2024

> I have bootstrapped a single-node MicroCloud and fired requests to /1.0/services/microceph in parallel. Right around when the MicroCloud is bootstrapped I saw a delay in the response, which could prove that something is going on there.

Well, this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions. In the bootstrap case, the only delay would be related to refreshing the truststore and waiting for the lock, but even that wouldn't happen on a single-node request, as it all goes through the unix socket, which skips truststore verification.

When bootstrapping, the listeners also restart, so that could be the delay you're seeing locally. But again that wouldn't affect the test failure since it's not during bootstrap.

> can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

This is the whole local proxy block in MicroCloud, so it's definitely not waiting for anything here.

Since it's a network request, there is the additional overhead of authHandlerMTLS pulling the truststore.

Ensure MicroCeph is fully started after bootstrapping to prevent running into timeouts
if the test suite is too fast.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish (Contributor, Author) commented:
> Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions

Hm, it looks like we can fix it by waiting for microceph cluster bootstrap to settle and only continuing once it's done.
I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single-node cluster services are present. Waiting for this condition appears to be enough.
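A wrapper along these lines could do the job. This is a hypothetical sketch, not the commit's actual helper: the function name, the default timeout, and the `Services:` grep pattern for `microceph status` output are all assumptions.

```shell
# Hypothetical wait helper; name, timeout, and grep pattern are assumptions.
wait_for_microceph() {
  local timeout="${1:-60}"
  local start
  start="$(date +%s)"
  # Poll until `microceph status` reports services for the local node,
  # or give up after the timeout (in seconds).
  until microceph status 2>/dev/null | grep -qF "Services:"; do
    if [ "$(( $(date +%s) - start ))" -ge "${timeout}" ]; then
      echo "Timed out waiting for MicroCeph services" >&2
      return 1
    fi
    sleep 1
  done
}
```

Calling `wait_for_microceph` right after `microceph cluster bootstrap` in the test suite would then block until the services show up, instead of racing the 5s client timeout.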

@masnax (Contributor) commented Nov 12, 2024

> > Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions
>
> Hm, it looks like we can fix it by waiting for microceph cluster bootstrap to settle and only continuing once it's done. I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single-node cluster services are present. Waiting for this condition appears to be enough.

Is this something that can be checked over the API? Perhaps microceph cluster bootstrap shouldn't return until all of its services are finished, or there could be a ready API that we can check against before sending requests.
