[zosv3light] dhcp-zos losing carrier on deployment #2531

Open
coesensbert opened this issue Feb 5, 2025 · 9 comments
Comments

@coesensbert

Describe the bug

zosv3light node 7403, running on a Hetzner dedicated server, was running fine until I deployed a workload via Terraform: a full VM with mycelium and Yggdrasil enabled.
I saw the same exact behavior on 2 other nodes. Once deployed, the VM works over mycelium for a few minutes and then becomes unreachable; however, the issue does not seem to be related to mycelium itself. In Loki one can see that dhcp-zos lost its carrier and therefore removed its default routes, etc.
This makes it impossible to remove the workload via Terraform, since the node is unreachable. I have to reboot the zos node and then remove the deployment. If one waits a few minutes after a reboot, the same pattern repeats until the deployment is removed. Deploying the same Terraform on other nodes does not trigger this issue.
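
Since the symptom is dhcp-zos reporting a lost carrier, it may help to watch the interface's carrier flag directly on the node and correlate flaps with the Loki timestamps. Below is a minimal sketch, assuming shell access to the node and iproute2 being available; using `zos` as the interface name is an assumption taken from the log messages, so adjust it for your setup.

```shell
#!/bin/sh
# Sketch of a carrier-flap check. IFACE defaults to "zos", which is an
# assumption based on the dhcp-zos log messages in this thread.
IFACE="${IFACE:-zos}"

carrier_state() {
  # Reads `ip link show <iface>` output on stdin and prints "down" when
  # the kernel reports NO-CARRIER, "up" otherwise.
  if grep -q 'NO-CARRIER'; then echo down; else echo up; fi
}

# One-shot check:
#   ip link show "$IFACE" | carrier_state
# Continuous watch, to correlate flap times with Loki:
#   ip monitor link | grep --line-buffered "$IFACE"
```

The commented `ip monitor link` variant prints a line on every link-state change, which makes it easy to line up carrier losses with the dhcp-zos log entries.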

To Reproduce

Steps to reproduce the behavior:

1. Deploy the Terraform below on mainnet node 7403.
2. Watch the Loki logs: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-5m%22,%22to%22:%22now%22%7D%7D
3. Watch the metrics stop: https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=now-1h&to=now&timezone=browser
4. Try to reach your deployment and test whether it stays online.
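
For step 4, a simple probe loop can pin down exactly when the VM drops off the network. The sketch below pings the VM's mycelium IPv6 address and logs a timestamped up/down line per interval; the default address is the example value from the outputs later in this thread, so substitute your own deployment's `vm1_mycelium_ip`.

```shell
#!/bin/sh
# Hypothetical reachability probe for a deployed VM over mycelium.
# TARGET defaults to the example mycelium IP reported in this thread;
# pass your own VM's address as the first argument.
TARGET="${1:-599:f44c:8e22:e74:ff0f:5f7c:2ec2:2a53}"
INTERVAL=10

probe_line() {
  # $1 = "up" or "down"; prints one timestamped status line.
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$TARGET"
}

monitor() {
  while :; do
    if ping -6 -c 1 -W 2 "$TARGET" >/dev/null 2>&1; then
      probe_line up
    else
      probe_line down
    fi
    sleep "$INTERVAL"
  done
}

# Uncomment to run continuously (requires mycelium running locally):
# monitor
```

The resulting log gives exact transition times to compare against the dhcp-zos carrier-loss entries in Loki.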

Expected behavior

normal node / deployment operation

Screenshots

Loki logs from my last deployment: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%221738754325781%22,%22to%22:%221738754987799%22%7D%7D

node metrics:
https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=2025-02-05T10:57:50.069Z&to=2025-02-05T11:34:24.989Z&timezone=browser

terraform:

```hcl
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

resource "random_bytes" "mycelium_ip_seed" {
  length = 6
}

resource "random_bytes" "mycelium_key" {
  length = 32
}

resource "grid_network" "net1" {
  nodes         = [7403]
  ip_range      = "10.212.0.0/16"
  name          = "myceiperf2"
  description   = "myceiperf2"
  add_wg_access = true
  mycelium_keys = {
    format("%s", 7403) = random_bytes.mycelium_key.hex
  }
}

resource "grid_deployment" "d1" {
  node         = 7403
  network_name = grid_network.net1.name
  disks {
    name = "root"
    size = 25
  }
  vms {
    name      = "myceiperf2"
    flist     = "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist"
    cpu       = 4
    planetary = true
    publicip  = false
    publicip6 = false
    memory    = 8192
    # entrypoint = "/sbin/zinit init"
    mounts {
      name        = "root"
      mount_point = "/data"
    }
    env_vars = {
      SSH_KEY = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
    }
    mycelium_ip_seed = random_bytes.mycelium_ip_seed.hex
  }
}

output "wg_config" {
  value = grid_network.net1.access_wg_config
}
output "node1_vm1_ip" {
  value = grid_deployment.d1.vms[0].ip
}
output "public_ip" {
  value = grid_deployment.d1.vms[0].computedip
}
output "public_ip6" {
  value = grid_deployment.d1.vms[0].computedip6
}
output "planetary_ip" {
  value = grid_deployment.d1.vms[0].planetary_ip
}
output "vm1_mycelium_ip" {
  value = grid_deployment.d1.vms[0].mycelium_ip
}
```
@ashraffouda
Collaborator

ZOS light doesn't support Yggdrasil, so I believe this deployment shouldn't happen in the first place; maybe this needs to be handled in Terraform. Can you please try with mycelium only and check whether the same thing happens?

@rawdaGastan
Contributor

rawdaGastan commented Feb 6, 2025

I tried to follow your steps:

1. Ygg IP, public IPs and WireGuard config are not supported in light deployments in Terraform, so the result looks like this:

```
Outputs:

node1_vm1_ip = "10.212.2.2"
planetary_ip = ""
public_ip = ""
public_ip6 = ""
vm1_mycelium_ip = "599:f44c:8e22:e74:ff0f:5f7c:2ec2:2a53"
wg_config = ""
```

2. You commented out the entrypoint line (`# entrypoint = "/sbin/zinit init"`); I'm not sure why, but the deployment works anyway.
3. I tried the file you provided here (just changed the SSH key), waited for more than 25 minutes, and the VM was still reachable.
4. I could remove the deployment after more than 25 minutes.

@coesensbert What is your terraform version?

@coesensbert
Author

```
➜ terraform --version
Terraform v1.9.8
on linux_amd64
```

The entrypoint is commented out because I want a full VM; it was already hard enough to put together a main file that works to get a full VM.

@rawdaGastan
Contributor

@coesensbert I meant our Terraform provider version.

@coesensbert
Author

```
➜ terraform init -upgrade
Initializing the backend...
Initializing provider plugins...
- Finding latest version of threefoldtech/grid...
- Finding latest version of hashicorp/random...
- Using previously-installed threefoldtech/grid v1.11.3
- Using previously-installed hashicorp/random v3.6.3
```

@rawdaGastan
Contributor

Okay, then I cannot reproduce the problem. It worked fine, as you can see.

@coesensbert
Author

coesensbert commented Feb 7, 2025

Another example: the Terraform below was deployed on mainnet node 7400 (I just picked a random one in LiriaFarm).

logs of when it happened: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%221738941545739%22,%22to%22:%221738945043221%22%7D%7D

Last hour (click live for live logs): https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

metrics: https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=now-24h&to=now&timezone=browser

I'll leave the deployment, and thus this node, in a faulty state for dev to investigate.

terraform:

```hcl
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

resource "random_bytes" "mycelium_ip_seed" {
  length = 6
}

resource "random_bytes" "mycelium_key" {
  length = 32
}

resource "grid_network" "net1" {
  nodes       = [7400]
  ip_range    = "10.211.0.0/16"
  name        = "myceiperf5"
  description = "myceiperf5"
  # add_wg_access = true
  mycelium_keys = {
    format("%s", 7400) = random_bytes.mycelium_key.hex
  }
}

resource "grid_deployment" "d1" {
  node         = 7400
  network_name = grid_network.net1.name
  disks {
    name = "root"
    size = 25
  }
  vms {
    name   = "myceiperf5"
    flist  = "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist"
    cpu    = 4
    memory = 8192
    mounts {
      name        = "root"
      mount_point = "/data"
    }
    env_vars = {
      SSH_KEY = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
    }
    mycelium_ip_seed = random_bytes.mycelium_ip_seed.hex
  }
}

output "node1_vm1_ip" {
  value = grid_deployment.d1.vms[0].ip
}
output "vm1_mycelium_ip" {
  value = grid_deployment.d1.vms[0].mycelium_ip
}
```

@coesensbert
Author

At first node 7400 sends some logs to Loki periodically, as the carrier comes and goes. After a while the node goes offline. I have to reset it; then it comes back and the pattern repeats. If we remove the workload after a reset, the carrier issue resolves.

@coesensbert
Author

Any progress? This basically stops us from using the 40 nodes we rent.


3 participants