[zosv3light] dhcp-zos losing carrier on deployment #2531

Open
coesensbert opened this issue Feb 5, 2025 · 9 comments
Comments

@coesensbert

Describe the bug

zosv3light node 7403, running on a Hetzner dedicated server, was running fine until I deployed a workload via Terraform: a full VM with mycelium and Yggdrasil enabled.
I saw the same exact behavior on 2 other nodes. Once deployed, the VM works over mycelium for a few minutes and then becomes unreachable; however, the issue does not seem to be related to mycelium itself. In Loki one can see that dhcp-zos lost its carrier and therefore removed its default routes, etc.
This makes it impossible to remove the workload via Terraform, since the node is unreachable. I have to reboot the zos node and then remove the deployment. If one waits a few minutes after a reboot, the same pattern repeats until the deployment is removed. Deploying the same Terraform on other nodes does not trigger this issue.
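
Since the symptom is dhcp-zos reporting a lost carrier, it may help to watch the interface's carrier flag directly on the node and correlate flaps with the Loki timestamps. Below is a minimal sketch, assuming shell access to the node and iproute2 being available; using `zos` as the interface name is an assumption taken from the log messages, so adjust it for your setup.

```shell
#!/bin/sh
# Sketch of a carrier-flap check. IFACE defaults to "zos", which is an
# assumption based on the dhcp-zos log messages in this thread.
IFACE="${IFACE:-zos}"

carrier_state() {
  # Reads `ip link show <iface>` output on stdin and prints "down" when
  # the kernel reports NO-CARRIER, "up" otherwise.
  if grep -q 'NO-CARRIER'; then echo down; else echo up; fi
}

# One-shot check:
#   ip link show "$IFACE" | carrier_state
# Continuous watch, to correlate flap times with Loki:
#   ip monitor link | grep --line-buffered "$IFACE"
```

The commented `ip monitor link` variant prints a line on every link-state change, which makes it easy to line up carrier losses with the dhcp-zos log entries.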

To Reproduce

Steps to reproduce the behavior:

1. Deploy the Terraform below on mainnet node 7403.
2. Watch the Loki logs: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-5m%22,%22to%22:%22now%22%7D%7D
3. Watch the metrics stop: https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=now-1h&to=now&timezone=browser
4. Try to reach your deployment and test whether it stays online.
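
For step 4, a simple probe loop can pin down exactly when the VM drops off the network. The sketch below pings the VM's mycelium IPv6 address and logs a timestamped up/down line per interval; the default address is the example value from the outputs later in this thread, so substitute your own deployment's `vm1_mycelium_ip`.

```shell
#!/bin/sh
# Hypothetical reachability probe for a deployed VM over mycelium.
# TARGET defaults to the example mycelium IP reported in this thread;
# pass your own VM's address as the first argument.
TARGET="${1:-599:f44c:8e22:e74:ff0f:5f7c:2ec2:2a53}"
INTERVAL=10

probe_line() {
  # $1 = "up" or "down"; prints one timestamped status line.
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$TARGET"
}

monitor() {
  while :; do
    if ping -6 -c 1 -W 2 "$TARGET" >/dev/null 2>&1; then
      probe_line up
    else
      probe_line down
    fi
    sleep "$INTERVAL"
  done
}

# Uncomment to run continuously (requires mycelium running locally):
# monitor
```

The resulting log gives exact transition times to compare against the dhcp-zos carrier-loss entries in Loki.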

Expected behavior

normal node / deployment operation

Screenshots

Loki logs from my last deployment: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%221738754325781%22,%22to%22:%221738754987799%22%7D%7D

node metrics:
https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5C8DMBKpg88NM1XRS91ET9BYo26bo1YoVPyMQMeXRhijyFVm&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=2025-02-05T10:57:50.069Z&to=2025-02-05T11:34:24.989Z&timezone=browser

terraform:

```hcl
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

resource "random_bytes" "mycelium_ip_seed" {
  length = 6
}

resource "random_bytes" "mycelium_key" {
  length = 32
}

resource "grid_network" "net1" {
  nodes         = [7403]
  ip_range      = "10.212.0.0/16"
  name          = "myceiperf2"
  description   = "myceiperf2"
  add_wg_access = true
  mycelium_keys = {
    format("%s", 7403) = random_bytes.mycelium_key.hex
  }
}

resource "grid_deployment" "d1" {
  node         = 7403
  network_name = grid_network.net1.name
  disks {
    name = "root"
    size = 25
  }
  vms {
    name      = "myceiperf2"
    flist     = "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist"
    cpu       = 4
    planetary = true
    publicip  = false
    publicip6 = false
    memory    = 8192
    # entrypoint = "/sbin/zinit init"
    mounts {
      name        = "root"
      mount_point = "/data"
    }
    env_vars = {
      SSH_KEY = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
    }
    mycelium_ip_seed = random_bytes.mycelium_ip_seed.hex
  }
}

output "wg_config" {
  value = grid_network.net1.access_wg_config
}
output "node1_vm1_ip" {
  value = grid_deployment.d1.vms[0].ip
}
output "public_ip" {
  value = grid_deployment.d1.vms[0].computedip
}
output "public_ip6" {
  value = grid_deployment.d1.vms[0].computedip6
}
output "planetary_ip" {
  value = grid_deployment.d1.vms[0].planetary_ip
}
output "vm1_mycelium_ip" {
  value = grid_deployment.d1.vms[0].mycelium_ip
}
```
@ashraffouda
Collaborator

ZOS light doesn't support Yggdrasil, so I believe this deployment shouldn't happen in the first place; maybe this needs to be handled in Terraform. Can you please try with mycelium only and check whether the same thing happens?

@rawdaGastan
Contributor

rawdaGastan commented Feb 6, 2025

I tried to follow your steps:

1. Ygg IP, public IPs and WireGuard config are not supported in light deployments in Terraform, so the result looks like this:

```
Outputs:

node1_vm1_ip = "10.212.2.2"
planetary_ip = ""
public_ip = ""
public_ip6 = ""
vm1_mycelium_ip = "599:f44c:8e22:e74:ff0f:5f7c:2ec2:2a53"
wg_config = ""
```

2. You commented out the entrypoint line (`# entrypoint = "/sbin/zinit init"`); I'm not sure why, but the deployment works anyway.
3. I tried the file you provided here (just changed the SSH key), waited for more than 25 minutes, and the VM was still reachable.
4. I could remove the deployment after more than 25 minutes.

@coesensbert What is your terraform version?

@coesensbert
Author

```
➜ terraform --version
Terraform v1.9.8
on linux_amd64
```

The entrypoint is commented out because I want a full VM; it was already hard enough to put together a main file that works to get a full VM.

@rawdaGastan
Contributor

@coesensbert I meant our Terraform provider version.

@coesensbert
Author

```
➜ terraform init -upgrade
Initializing the backend...
Initializing provider plugins...
- Finding latest version of threefoldtech/grid...
- Finding latest version of hashicorp/random...
- Using previously-installed threefoldtech/grid v1.11.3
- Using previously-installed hashicorp/random v3.6.3
```

@rawdaGastan
Contributor

Okay, then I cannot reproduce the problem. It worked fine, as you can see.

@coesensbert
Author

coesensbert commented Feb 7, 2025

Another example: the Terraform below was deployed on mainnet node 7400 (I just picked a random one in LiriaFarm).

logs of when it happened: https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%221738941545739%22,%22to%22:%221738945043221%22%7D%7D

Last hour (click live for live logs): https://mon.grid.tf/explore?orgId=1&left=%7B%22datasource%22:%22Loki-ZOS%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bfarm%3D%5C%223997%5C%22,network%3D%5C%22production%5C%22,node%3D%5C%225D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

metrics: https://metrics.grid.tf/d/rYdddlPWkfqwf/zos-host-metrics?orgId=2&refresh=30s&var-network=production&var-farm=3997&var-node=5D5BymmTtcsm3CxxfQxFHjMvWGp53ZLQi34S82pXeM2cKVS8&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=now-24h&to=now&timezone=browser

I'll leave the deployment, and thus this node, in a faulty state for dev to investigate.

terraform:

```hcl
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

resource "random_bytes" "mycelium_ip_seed" {
  length = 6
}

resource "random_bytes" "mycelium_key" {
  length = 32
}

resource "grid_network" "net1" {
  nodes       = [7400]
  ip_range    = "10.211.0.0/16"
  name        = "myceiperf5"
  description = "myceiperf5"
  # add_wg_access = true
  mycelium_keys = {
    format("%s", 7400) = random_bytes.mycelium_key.hex
  }
}

resource "grid_deployment" "d1" {
  node         = 7400
  network_name = grid_network.net1.name
  disks {
    name = "root"
    size = 25
  }
  vms {
    name   = "myceiperf5"
    flist  = "https://hub.grid.tf/tf-official-vms/ubuntu-24.04-latest.flist"
    cpu    = 4
    memory = 8192
    mounts {
      name        = "root"
      mount_point = "/data"
    }
    env_vars = {
      SSH_KEY = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
    }
    mycelium_ip_seed = random_bytes.mycelium_ip_seed.hex
  }
}

output "node1_vm1_ip" {
  value = grid_deployment.d1.vms[0].ip
}
output "vm1_mycelium_ip" {
  value = grid_deployment.d1.vms[0].mycelium_ip
}
```

@coesensbert
Author

At first node 7400 sends some logs to Loki periodically, as the carrier comes and goes. After a while the node goes offline. I have to reset it; then it comes back and the pattern repeats. If we remove the workload after a reset, the carrier issue resolves.

@coesensbert
Author

Any progress? This basically stops us from using the 40 nodes we rent.


3 participants