[zosv3light] dhcp-zos losing carrier on deployment #2531
Comments
zos light doesn't support ygg, so I believe this deployment shouldn't be possible in the first place. Maybe this needs to be handled in terraform. Can you please try with mycelium only and check whether the behavior is the same?
I tried to follow your steps:
@coesensbert What is your terraform version?
The entrypoint is commented out because I want a full VM; it was already hard enough to put together a main file that produces a full VM.
@coesensbert I mean our terraform provider version.
Okay, then I cannot reproduce the problem! It worked fine, as you can see.
Another example: the terraform below was deployed on mainnet node 7400 (a random node picked from LiriaFarm). Logs for the last hour (click Live for live logs): https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D I'll leave the deployment, and thus this node, in a faulty state for the devs to investigate. terraform:
At first, node 7400 sends logs to Loki periodically as the carrier comes and goes. After a while the node goes offline. I have to reset it; then it comes back and the pattern repeats. If we remove the workload after a reset, the carrier issue is resolved.
Any progress? This basically stops us from using the 40 nodes we rent.
Describe the bug
zosv3light node 7403, running on a Hetzner dedicated server, was running fine until I deployed a workload via terraform: a full VM with mycelium and ygg enabled.
I saw this on 2 other nodes as well, with exactly the same behavior. Once deployed, the VM works over mycelium for a few minutes and then becomes unreachable; however, the issue does not seem to be related to mycelium. In Loki one can see that dhcp-zos lost its carrier and therefore removed its default routes, etc.
As a result, it becomes impossible to remove the workload via terraform, since the node is unreachable. I have to reboot the zos node and then remove the deployment. If one waits a few minutes after a reboot, the same pattern repeats until the deployment is removed. Deploying the same terraform on other nodes does not cause this issue.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Normal node and deployment operation.
Screenshots
Loki logs from my last deployment: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%221738754325781%22,%22to%22:%221738754987799%22%7D%7D
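For anyone who cannot open the Explore link directly, the query it encodes can be recovered from the URL itself. Below is a minimal Python sketch, assuming the `left` query parameter carries URL-encoded JSON with `datasource`, `queries`, and `range` keys (which is what the link above contains); the helper names `decode_explore_link` and `ms_to_utc` are made up for illustration and are not part of any grid or Grafana tooling.

```python
import json
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Explore link copied verbatim from this issue, split only for readability.
EXPLORE_URL = (
    "https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,"
    "%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,"
    "network%3D%5C%22production%5C%22,"
    "node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,"
    "%22range%22:%7B%22from%22:%221738754325781%22,%22to%22:%221738754987799%22%7D%7D"
)


def decode_explore_link(url):
    """Return (datasource, list of query expressions, time range) from the link.

    Assumption: the 'left' parameter is percent-encoded JSON, as in the URL above.
    """
    params = parse_qs(urlsplit(url).query)  # parse_qs also undoes the %XX escapes
    state = json.loads(params["left"][0])
    exprs = [q["expr"] for q in state["queries"]]
    return state["datasource"], exprs, state["range"]


def ms_to_utc(ms):
    """Convert an epoch-milliseconds string (as used in 'range') to a UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


if __name__ == "__main__":
    datasource, exprs, time_range = decode_explore_link(EXPLORE_URL)
    print("datasource:", datasource)
    for expr in exprs:
        print("selector:  ", expr)
    print("from:", ms_to_utc(time_range["from"]).isoformat())
    print("to:  ", ms_to_utc(time_range["to"]).isoformat())
```

The printed selector can be pasted into any Loki query UI that has access to the same datasource. Note that `ms_to_utc` only applies to absolute ranges; a relative range such as `now-1h` would need to be handled separately.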
node metrics:
https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=2025-02-05T10:57:50.069Z&to=2025-02-05T11:34:24.989Z&timezone=browser
terraform: