hfab console connectivity failure during switch reinstall #321

Open
pau-hedgehog opened this issue Jan 16, 2025 · 8 comments
Assignees
Labels
bug (Something isn't working), ci, ci-hw (Run hardware CI job), flaky

Comments

pau-hedgehog commented Jan 16, 2025

https://github.com/githedgehog/fabricator/actions/runs/12813204547/job/35726707779

In the CI run linked above, I observed failures when running hhfab vlab serial:

20:29:03 DBG ds3000-02: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds4000-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG sse-c4632-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds3000-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds4000-02: 20:29:03 ERR serial: failed to run command: exit status 255

But due to #317 hhfab doesn't catch this error and the CI continues:

20:30:50 INF All switches placed into NOS Install Mode took=2m20.407841459s

Then there are additional unhandled errors (this would be a separate issue, IMO) that keep the CI from ending as fast as possible:

20:45:20 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:45:35 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:45:50 ERR Unhandled Error logger=UnhandledError err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:108: Failed to watch *v1beta1.VLANNamespace: context deadline exceeded"
20:45:50 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:46:05 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:46:20 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"

The CI should fail fast to prevent wasting valuable CI-HW cycles.

As an extra safeguard, we should set a timeout (e.g., 1h) if possible, @Frostman:

Image

pau-hedgehog self-assigned this Jan 16, 2025
pau-hedgehog changed the title from "CI-HW f" to "CI-HW failure during switch reinstall" Jan 16, 2025
pau-hedgehog added the ci, flaky, and ci-hw (Run hardware CI job) labels Jan 16, 2025
pau-hedgehog commented

Another one: https://github.com/githedgehog/fabricator/actions/runs/12834731651/job/35792652175

The code that fails is in the Remote Serial VLAB helper:

return fmt.Errorf("failed to run command: %w", err)

I've reproduced this in another environment by issuing repeated remote serial connections:

ubuntu@env-3:~/hhfab$ ./hhfab vlab serial --name as4630-01 -v
19:27:18 INF Hedgehog Fabricator version=v0.32.1-34-gfbf494c4-dirty-be1939
19:27:18 INF Wiring hydrated successfully mode=if-not-present
19:27:18 INF VLAB config loaded file=vlab/config.yaml
19:27:18 INF Remote serial (hardware) name=as4630-01 remote=192.168.88.10:9004
19:27:18 DBG Running cmd="ssh -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 9004 192.168.88.10"

Type the hot key to suspend the connection: <CTRL>Z

as4630-01 login: 
as4630-01 login: 
--:- AS4630-01 cli-> 19:29:43 ERR serial: failed to run command: exit status 255
ubuntu@env-3:~/hhfab$ ./hhfab vlab serial --name as4630-01 -v
19:29:51 INF Hedgehog Fabricator version=v0.32.1-34-gfbf494c4-dirty-be1939
19:29:51 INF Wiring hydrated successfully mode=if-not-present
19:29:51 INF VLAB config loaded file=vlab/config.yaml
19:29:51 INF Remote serial (hardware) name=as4630-01 remote=192.168.88.10:9004
19:29:51 DBG Running cmd="ssh -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 9004 192.168.88.10"

This connection is in use. User(s) currently connected: ubuntu@1130.
                                                                    You need privilege to make a simultaneous session.
The connection was unsuccessful.
19:29:51 ERR serial: failed to run command: exit status 255

pau-hedgehog changed the title from "CI-HW failure during switch reinstall" to "CI-HW console connectivity failure during switch reinstall" Jan 20, 2025
pau-hedgehog changed the title from "CI-HW console connectivity failure during switch reinstall" to "hfab console connectivity failure during switch reinstall" Jan 21, 2025
pau-hedgehog commented

I haven't seen any hits on this one during the last week. Closing.

pau-hedgehog reopened this Jan 29, 2025

pau-hedgehog commented Jan 31, 2025

@sonoble, can we do something about the console server session idle timeout or concurrency? It is causing some switch reinstalls to fail in our CI with:

This connection is in use. User(s) currently connected: ubuntu@1130.
You need privilege to make a simultaneous session.
The connection was unsuccessful.

sonoble commented Jan 31, 2025

Concurrency: no. The timeout can be adjusted here:

Image

Frostman added the bug (Something isn't working) label Jan 31, 2025