
Repository containing my personal Talos Kubernetes configurations


nazarewk-iac/talos-configs


talos-configs

Repository containing my personal Talos Kubernetes setup.

Usage overview

nix develop --command fish

# (re-)generate configurations
talos-gen

# generate an ISO and write it to the USB disk
talos-installer-disk-write /dev/disk/by-id/usb-Samsung_Portable_SSD_T5_1234567D585A-0:0 pwet iso

# in BIOS enter Secure Boot "Setup Mode" (becomes visible when doing custom secure boot)
# plug in the USB
# select from Talos boot menu: `Enroll Secure Boot keys: auto`
# confirm presence of the new keys in BIOS (Secure Boot active and/or custom keys present)

# first node
talos-node-apply --insecure pwet

# bootstrap only the first node
talosctl-node pwet bootstrap

talosctl-node pwet kubeconfig --force
kubectl get node
kubectl get pod -A

# rest of nodes
talos-node-apply --insecure turo yost
kubectl get node
kubectl get pod -A
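The kubectl get node checks above are done by eye. A minimal sketch, assuming the default kubectl get node table layout (NAME, STATUS, ...), that turns the check into something a script can act on; not_ready_nodes and all_nodes_ready are hypothetical helpers, not part of this repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get node` output on stdin and prints
# the names of nodes whose STATUS column is anything other than "Ready"
# (e.g. "NotReady" or "Ready,SchedulingDisabled").
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Hypothetical helper: returns non-zero (and lists the offenders on stderr)
# when any node read from stdin is not Ready.
all_nodes_ready() {
  local pending
  pending=$(not_ready_nodes)
  if [ -n "$pending" ]; then
    printf 'not ready: %s\n' $pending >&2
    return 1
  fi
}
```

For example, kubectl get node | all_nodes_ready && echo "all nodes Ready" could gate applying configs to the rest of the nodes.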

Overview

It runs dual-stack (IPv4 + IPv6) on both LAN and WAN, over a mix of 10/2.5/1 GbE (mostly managed) switches.

Primary nodes

Runs on 3x self-built Mini ITX NAS boxes, each consisting of:

  • SzBox N100 mobo aka CWWK N100:
    • 4x 2.5GbE ethernet
    • 2x NVMe connectors
    • 6x SATA connectors
  • Jonsbo N2 case:
    • 5x 3.5" removable SATA
    • 2x 2.5" SATA mounted inside
  • 32GB RAM
  • drives:
    • 250GB NVMe as Talos system drive
    • 1TB NVMe as the primary cache/local storage
    • 1TB SSD for Ceph storage
    • 4TB HDD for Ceph storage

Raspberry Pi4 nodes

Not running yet, but I'm keeping the section.

Runs on 3x Raspberry Pi 4 4GB, each holding:

  • 1x (any) SD card holding ONLY the RPi4 UEFI-boot configuration and nothing else (see Boot sequence issues)
  • 1x small ~256 GB SSD holding the encrypted Talos system partitions: STATE and EPHEMERAL

See RPi4 scope.

Scope

  • integrate with Nix-based development environment
  • securely store sensitive data/configuration using pass:
    • read/write/sync using talos-pass
  • networking:
    • dual-stack (IPv4 + IPv6)
    • use Cilium CNI
    • make (cluster) IPv6 addresses accessible from LAN
    • run Netbird client
    • expose Kubernetes controlplane:
      • to LAN
      • over Netbird
      • over WAN if Netbird takes too much time
    • expose Ceph Block Pool:
      • to LAN
      • over Netbird
      • over WAN if Netbird takes too much time
    • expose Ceph Object Storage:
      • to LAN
      • over Netbird
      • over WAN if Netbird takes too much time
  • figure out update/upgrade/reconfiguration procedures:
    • reconfigure nodes using talos-node-apply
    • upgrade (Talos) nodes using talos-node-upgrade
    • use Image Factory to customize Talos images using talos-image
    • add ZFS system extension
    • pin Kubernetes version to upgrade separately from Talos
  • set up ZFS on LUKS on the 1TB drive for local storage
  • run arbitrary Nix tooling within the cluster
    • see k8s-nix-disks or nix-system/nix-disks daemonset configuration
    • put container gcroots (maybe profiles?) into subdirectories on host
    • write some controller/operator to inject Nix configs into Pod automatically?
  • resolve *.pic.kdn.im DNS names
  • check whether the 4x 2.5GbE ports could be bonded into a 10GbE link
  • set up Rook/Ceph
    • FAILED: set up CephCluster on RPi4s
    • set up CephCluster on CWWK N100s
    • set up Block Pool
    • set up Object Storage
    • set up Filesystem (failed, see rook-cluster)
    • separate Ceph configurations for:
      • SSDs: replicated frequent/lower latency access
      • HDDs: infrequent/large files access
  • run Nextcloud?
  • offline-synced backup solution?

RPi4 scope

  • set up Talos on multi-disk Raspberry Pi 4:
    • RPi4 UEFI-boot SD card having ONLY boot configuration (see Boot sequence issues)
      • reconfigure the RPi4 boot order (BOOT_ORDER)
      • investigate whether config.txt should be copied to UEFI from Talos partition?
    • prepare & boot Talos installer USB disk
    • bootstrapping the cluster

First time setup

Based on the following materials:

Preparing CWWK N100

CWWK N100 setup checklist:

  • boot the Talos Installer SecureBoot ISO & select Enroll Secure Boot keys: auto from boot menu
  • reboot into the Talos Installer ISO and note down the node's IP, or get it from the router (e.g. fd31:e17c:f07f:1:aab8:e0ff:fe04:130d)
  • set up hostname.yaml
  • set up install-disk.yaml:
    • talosctl -n fd31:e17c:f07f:1:aab8:e0ff:fe04:130d disks --insecure
  • set up networking.yaml:
    • generate a DUID with uuidgen | tr -d '-' and store it here and in config.json
    • talosctl -n fd31:e17c:f07f:1:aab8:e0ff:fe04:130d get addresses --insecure
  • fill in all network interfaces:
    • talosctl -n fd31:e17c:f07f:1:aab8:e0ff:fe04:130d get link --insecure
  • set enable: true in config.json
  • run talos-gen
  • apply node config:
    • talos-node-apply --insecure turo
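The DUID step can be wrapped in a small function so the value is generated and checked the same way every time. This is a sketch, not part of the repository: gen_duid and is_valid_duid are hypothetical names, the /proc fallback assumes a Linux host, and it only reproduces the checklist's uuidgen | tr -d '-' recipe without verifying which DUID format Talos expects:

```shell
#!/usr/bin/env bash
# Hypothetical helper reproducing the checklist's "uuidgen | tr -d '-'" step:
# prints a 32-character lowercase hex string.
gen_duid() {
  local uuid
  if command -v uuidgen >/dev/null 2>&1; then
    uuid=$(uuidgen)
  else
    # Assumption: Linux fallback when uuidgen is not installed.
    uuid=$(cat /proc/sys/kernel/random/uuid)
  fi
  # Drop dashes and normalise case (macOS uuidgen prints uppercase).
  printf '%s\n' "$uuid" | tr -d '-' | tr 'A-F' 'a-f'
}

# Hypothetical check before pasting the value into networking.yaml and config.json.
is_valid_duid() {
  [[ $1 =~ ^[0-9a-f]{32}$ ]]
}
```

Usage: duid=$(gen_duid) && is_valid_duid "$duid" && echo "$duid".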

Preparing Raspberry Pi 4

On Nix+Sway, RPi Imager (e.g. to configure the EEPROM) can be run with _sway-root-gui --enable and sudo nix run 'nixpkgs#rpi-imager'.

  1. load an SD card with #rpi4-uefi (can be done through #rpi-imager)
  2. load a USB drive/disk with the metal-rpi_generic-arm64.raw.xz #talos release using #rpi-imager or dd
  3. load and boot another SD card with the SD > USB boot EEPROM using #rpi-imager
  4. boot it to configure the board for SD card boot
  5. enter UEFI setup:
    • (optionally?) disconnect all USB drives
    • boot the #rpi4-uefi SD card (it should stay as the primary boot option forever)
      • TODO: possibly copy over the config.txt from the Talos partition?
    • wait for the Raspberry Pi logo on a black background
    • press ESC immediately (before the loader times out)
    • you are now at UEFI setup (looks like a BIOS setup)
  6. (optionally?) connect all USB drives
  7. make sure the boot USB drive is connected
  8. set up #rpi4-uefi (go back with F10 > Y > ESC to save settings whenever possible)
    • Device Manager
      • Raspberry Pi Configuration
        • Display Configuration
        • Advanced Configuration
          • disable Limit RAM to 3 GB
    • Boot Maintenance Manager
      • Boot Discovery Policy
        • set to Minimal
      • (optionally) change Auto Boot Time-out from 5 to 1 for faster boot
      • Boot Options
        • Change Boot Order
          • make sure the correct disk is first (it's the same as #talos config machine.install.diskSelector.wwid in case of SK Hynix drives over NVMe to USB adapters)
          • optionally delete all the other boot options
    • Reset
  9. reboot the RPi4 into Talos
    • wait for everything to boot (no pool.ntp.org errors after ~2-3 minutes)
    • figure out:
      • IP address
        • try nc -w 2 -z <ip> 50000
      • MAC address
    • set up a DNS name on drek (router)
    • add an entry to config.json
  10. it should now be possible to run talos-node-apply on the controlplane node
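After the reboot into Talos, the node is probed on the Talos API port (50000) with nc until it answers. A minimal polling sketch of that wait, assuming bash with /dev/tcp support and coreutils timeout; wait_for_port and its defaults (port 50000, 90 attempts, 2 s apart) are illustrative choices, not part of this repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll host:port until it accepts a TCP connection.
# Uses bash's /dev/tcp instead of nc, so netcat does not need to be installed.
wait_for_port() {
  local host=$1 port=${2:-50000} attempts=${3:-90}
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # Open (and immediately close) a connection; each try gives up after 2s.
    if timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
      return 0
    fi
    # Only sleep if another attempt follows.
    if (( i < attempts )); then
      sleep 2
    fi
  done
  return 1
}
```

Usage: wait_for_port <ip> 50000 && echo "Talos API is up".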

Raspberry Pi 4 boot sequence issues

Raspberry Pi 4 boots the first USB disk discovered by the board and does not try any other disk. Everything worked fine for me until I plugged in a second disk.

I have worked around the issue by:

  1. loading up a dedicated SD card with https://github.com/pftf/RPi4
  2. making sure SD card boots first by modifying BOOT_ORDER

RPi4 UEFI has these characteristics:

  • attempts to boot EVERY disk at least once before failing
  • for some reason it very often tries 4 network boots (HTTP/iPXE + IPv4/IPv6), adding a few minutes of delay before finally getting to the local USB boot
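BOOT_ORDER in the Raspberry Pi 4 bootloader EEPROM config is a hex value whose digits are tried from right to left, so 0xf41 means SD card, then USB mass storage, then restart. A small decoder sketch to double-check an SD-first order before flashing the EEPROM; decode_boot_order is a hypothetical name, and the table covers only the common mode codes from the Raspberry Pi bootloader documentation, printing UNKNOWN for anything else:

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode a Raspberry Pi 4 EEPROM BOOT_ORDER value into
# boot modes, in the order the bootloader tries them (rightmost digit first).
decode_boot_order() {
  local value=${1#0[xX]}
  local modes=() i digit
  for (( i = ${#value} - 1; i >= 0; i-- )); do
    digit=${value:i:1}
    case ${digit,,} in
      1) modes+=(SD) ;;
      2) modes+=(NETWORK) ;;
      3) modes+=(RPIBOOT) ;;
      4) modes+=(USB-MSD) ;;
      5) modes+=(BCM-USB-MSD) ;;
      6) modes+=(NVME) ;;
      e) modes+=(STOP) ;;
      f) modes+=(RESTART) ;;
      *) modes+=("UNKNOWN($digit)") ;;
    esac
  done
  echo "${modes[*]}"
}
```

For example, decode_boot_order 0xf41 prints SD USB-MSD RESTART, which is the SD-first order the workaround above relies on.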

Preparing a new Kubernetes node

  • prepare a Talos node (see above)
  • locate a *-local device and put it into /k8s/05-1-nix-disks/nix/disks.nix:
    • talosctl disks - quickly identify the disk
    • k8s-nix-disks yost lsblk -OJ | jq '.blockdevices' | gron - locate lsblk parameters
    • uuidgen to create new LUKS UUID
  • set up /src/node*.yaml.d/local-storage.yaml with zpool.storage.kdn.im/pic-local label
    • talos-gen & talos-node-apply
  • add the disks meant for Ceph to /k8s/15-1-network-storage/templates:
    • k8s-nix-disks pwet ls -la /dev/disk/by-id
    • k8s-nix-disks pwet lsblk -b --output SIZE -n -d /dev/disk/by-id/wwn-0x500a0751e8764b59
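When picking Ceph disks from /dev/disk/by-id, the partition symlinks add noise. A minimal filter sketch, assuming the list comes from plain ls with one name per line (e.g. k8s-nix-disks pwet ls /dev/disk/by-id); whole_disk_ids is a hypothetical name, and listing wwn-* entries first reflects the assumption that they are often the most stable identifiers:

```shell
#!/usr/bin/env bash
# Hypothetical filter for `ls /dev/disk/by-id` output read from stdin:
# drop "-partN" partition symlinks, sort, and print wwn-* entries first.
whole_disk_ids() {
  grep -Ev -- '-part[0-9]+$' \
    | sort \
    | awk '/^wwn-/ { print; next } { rest[++n] = $0 } END { for (i = 1; i <= n; i++) print rest[i] }'
}
```

Usage: k8s-nix-disks pwet ls /dev/disk/by-id | whole_disk_ids, then feed the chosen wwn-* path to the lsblk size check above.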

Day two operations

Updating/upgrading/debugging.

  • talosctl-node rant dmesg
  • talosctl-node rant health
  • talosctl etcd members

Disk management

non-Talos disk management

kubectl apply -k k8s/nix-system
k8s-nix-disks rant

Talos disk identification

In the talosctl disks output (the actual inputs for cluster configs), USB adapters identify as the same device, with no meaningful difference between them.

Once the udev service is (probably) up, you can list all disks and their symlinks with talosctl list /dev/disk/by-id --long.

USB3.0 to powered SATA adapter

Best identified with wwid: *DD56419883014*

NODE        DEV            MODEL         SERIAL       TYPE   UUID   WWID                                              MODALIAS      NAME    SIZE     BUS_PATH                                                                                       SUBSYSTEM          READ_ONLY   SYSTEM_DISK
rant.lan.   /dev/sda       USB3.0        -            HDD    -      t10.ANKEJE  USB3.0          DD56419883014\0\0\0   scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:0/           /sys/class/block
hurl.lan.   /dev/sdb       USB3.0        -            HDD    -      t10.ANKEJE  USB3.0          DD56419883014\0\0\0   scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1.1/3-1.1:1.0/host1/target1:0:0/1:0:0:0/   /sys/class/block
jhal.lan.   /dev/sdb       USB3.0        -            HDD    -      t10.ANKEJE  USB3.0          DD56419883014\0\0\0   scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1.3/3-1.3:1.0/host1/target1:0:0/1:0:0:0/   /sys/class/block

USB3.0 to M.2 SSD adapter

Best identified with wwid: *DD564198838A3*:

NODE        DEV            MODEL         SERIAL       TYPE   UUID   WWID                                              MODALIAS      NAME    SIZE     BUS_PATH                                                                                       SUBSYSTEM          READ_ONLY   SYSTEM_DISK
rant.lan.   /dev/sdb       Super Speed   -            HDD    -      t10.USB3.0  Super Speed     DD564198838A3\0\0\0   scsi:t-0x00   -       256 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host1/target1:0:0/1:0:0:0/           /sys/class/block               *
hurl.lan.   /dev/sda       Super Speed   -            HDD    -      t10.USB3.0  Super Speed     DD564198838A3\0\0\0   scsi:t-0x00   -       256 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host0/target0:0:0/0:0:0:0/           /sys/class/block               *
jhal.lan.   /dev/sda       Super Speed   -            HDD    -      t10.USB3.0  Super Speed     DD564198838A3\0\0\0   scsi:t-0x00   -       256 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host0/target0:0:0/0:0:0:0/           /sys/class/block               *
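The WWIDs above carry literal \0 padding and runs of spaces. A sketch for normalizing a WWID and testing it against a glob such as *DD56419883014*, using plain bash pattern matching; normalize_wwid and wwid_matches are hypothetical names, and whether Talos matches diskSelector.wwid in exactly the same way is not confirmed here:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip trailing literal "\0" padding and collapse
# whitespace in a WWID as printed by `talosctl disks`.
normalize_wwid() {
  printf '%s\n' "$1" | sed -e 's/\(\\0\)*$//' -e 's/[[:space:]]\{1,\}/ /g' -e 's/ $//'
}

# Hypothetical helper: return 0 when the normalized WWID matches a shell glob.
wwid_matches() {
  local pattern=$1 wwid
  wwid=$(normalize_wwid "$2")
  # Unquoted $pattern on the right-hand side of == is treated as a glob.
  [[ $wwid == $pattern ]]
}
```

With the SATA adapter's WWID from the first table, normalize_wwid yields t10.ANKEJE USB3.0 DD56419883014, which matches *DD56419883014* but not *DD564198838A3*.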

All connected disks

talosctl disks
[TALOSCTL] /home/kdn/dev/github.com/nazarewk-iac/talos-configs/talos-1.7.5/talosctl-linux-amd64 --cluster=pic disks
NODE        DEV            MODEL             SERIAL       TYPE   UUID   WWID                   MODALIAS      NAME    SIZE     BUS_PATH                                                                               SUBSYSTEM          READ_ONLY   SYSTEM_DISK
rant.lan.   /dev/mmcblk1   -                 0x28fbade5   SD     -      -                      -             SA32G   31 GB    /system/container/ACPI0004:01/BRCME88C:00/mmc_host/mmc1/mmc1:1234/                     /sys/class/block
rant.lan.   /dev/sda       500SSD1           -            HDD    -      -                      scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:0/   /sys/class/block
rant.lan.   /dev/sdb       001-2MA101        -            HDD    -      -                      scsi:t-0x00   -       4.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:1/   /sys/class/block
rant.lan.   /dev/sdc       Portable SSD T5   -            SSD    -      naa.5000000000000001   scsi:t-0x00   -       250 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host1/target1:0:0/1:0:0:0/   /sys/class/block               *
rant.lan.   /dev/sdd       500SSD1           -            HDD    -      -                      scsi:t-0x00   -       2.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:2/   /sys/class/block
jhal.lan.   /dev/mmcblk1   -                 0x28fbace4   SD     -      -                      -             SA32G   31 GB    /system/container/ACPI0004:01/BRCME88C:00/mmc_host/mmc1/mmc1:1234/                     /sys/class/block
jhal.lan.   /dev/sda       500SSD1           -            HDD    -      -                      scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:0/   /sys/class/block
jhal.lan.   /dev/sdb       001-2MA101        -            HDD    -      -                      scsi:t-0x00   -       4.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:1/   /sys/class/block
jhal.lan.   /dev/sdc       Super Speed       -            HDD    -      -                      scsi:t-0x00   -       256 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host1/target1:0:0/1:0:0:0/   /sys/class/block               *
jhal.lan.   /dev/sdd       500SSD1           -            HDD    -      -                      scsi:t-0x00   -       2.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:2/   /sys/class/block
hurl.lan.   /dev/mmcblk1   -                 0x28fbad8b   SD     -      -                      -             SA32G   31 GB    /system/container/ACPI0004:01/BRCME88C:00/mmc_host/mmc1/mmc1:1234/                     /sys/class/block
hurl.lan.   /dev/sda       500SSD1           -            HDD    -      -                      scsi:t-0x00   -       1.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:0/   /sys/class/block
hurl.lan.   /dev/sdb       Super Speed       -            HDD    -      -                      scsi:t-0x00   -       256 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-2/3-2:1.0/host1/target1:0:0/1:0:0:0/   /sys/class/block               *
hurl.lan.   /dev/sdc       00AAKS-00A7B0     -            HDD    -      -                      scsi:t-0x00   -       500 GB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:1/   /sys/class/block
hurl.lan.   /dev/sdd       001-1ER164        -            HDD    -      -                      scsi:t-0x00   -       2.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:2/   /sys/class/block
hurl.lan.   /dev/sde       P210 2048GB       -            HDD    -      -                      scsi:t-0x00   -       2.0 TB   /system/container/ACPI0004:02/PNP0D10:00/usb3/3-1/3-1:1.0/host0/target0:0:0/0:0:0:3/   /sys/class/block
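The listing shows the system disk landing on different device names per node (/dev/sdc on rant and jhal, /dev/sdb on hurl), so /dev/sdX names are not stable selectors. A small awk sketch, assuming the column layout shown above with the SYSTEM_DISK marker as the last field, that prints each node's system disk; system_disks is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `talosctl disks` output on stdin and print
# "NODE DEV" for every row whose last field is the SYSTEM_DISK "*" marker.
system_disks() {
  awk '$NF == "*" { print $1, $2 }'
}
```

Usage: talosctl disks | system_disks.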

updating machine config

talos-node-apply --dry-run hurl
talos-node-apply hurl

checking machine config drift

talos-node-apply --dry-run '*'
talos-node-apply --check

upgrading install image (extensions etc.)

this command upgrades the node to the latest configured install image:

talos-node-upgrade hurl

you might need to run talos-node-apply hurl after the reboot to load the ZFS kernel module before the boot finishes

debug system messages with talosctl-node hurl dmesg --follow

Discovered issues

  • most cheap SATA -> USB disk bays require physically pressing a button to turn back on after losing power; affected so far:

    • ORICO-6648US3-C-V1
    • ORICO-6558US3-C
    • StarTech SDOCK4U313
    • Fantec QB-35US3R
  • ORICO-6648US3-C-V1 seems to garble the drives' metadata:

    • all 3 Crucial BX500 1TB drives plugged into different RPi4s appear EXACTLY the same in lsblk -OJ

improvement ideas
