@khuedoan khuedoan commented Oct 4, 2025

Migrate to a pure Nix/NixOS setup, including the PXE boot process. This is still in a very early draft stage.

Obviously this will need a full cluster rebuild. Ideally, the end result should be smaller, with fewer lines of code than the current setup (excluding the lock/sum files), but this is not strictly required.

  • Custom NixOS installer for netboot
  • PXE boot server to boot the above NixOS installer
  • Install NixOS with nixos-anywhere
  • Callback mechanism for the installer
  • Wake-on-LAN
  • Static IP management for bare-metal machines (or possibly mDNS)
  • k3s clustering
  • Load balancing: try kube-vip and see whether Cilium L2/MetalLB is still required (or the other way around)
  • New secrets management scheme for bare metal (root SSH key, cluster join token), likely with sops-nix
  • Rewrite some scripts with a more robust implementation in Go or Rust (likely Go: I haven’t been able to get a PXE server implementation working in Rust yet, mostly a skill issue, and Go already has the excellent pixiecore library)
  • Disk management with raw partition for Ceph using the remaining space
  • Remove unused legacy components
  • Update documentation and decision records
  • Cilium may be removed in favor of Istio ambient mesh (I know it’s an apples-to-oranges comparison, but after running Istio ambient mesh for a while in my other clusters, it seems stable and achieves the goals of observability, security, and traffic control)
  • Bare metal deploy orchestration (e.g., master first, then worker, including draining and cordoning nodes). Check if nixos-rebuild is sufficient or if deploy-rs is needed.
  • Pure Nix GitOps bootstrap, ideally without Kubernetes API access, to avoid leaking the admin KUBECONFIG
  • Kernel tuning for k3s
  • Test cluster in QEMU using NixOS's testing framework
  • Maybe split the NixOS PXE boot implementation into a separate repository to allow reuse and make upgrades easier for forks, making this repo less convoluted

Draft on decisions:

  • NixOS: atomic configuration changes, truly declarative, very steep learning curve but worth it
  • Callback installer: easier to observe, event-based instead of polling, provides more control over the installation process. I'm not 100% sure about this yet and it needs more experimentation
  • Custom Go CLI for PXE server instead of Docker containers: I want the installer to be a single binary with native OS processes instead of separate Docker containers, which require more network troubleshooting and cleanup. Now, instead of multiple containers, we just have multiple goroutines to serve DHCP, TFTP, HTTP, and the callback API, which cleanly exit once all hosts are installed.
  • Embed pixiecore into the installer instead of running it as a separate process: While we can shell out to the pixiecore CLI, we need an API for dynamic configuration anyway, so it makes sense to use the pixiecore library and manage it with function calls instead of API calls.
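
To illustrate the callback idea and the "exit once all hosts are installed" behaviour, here is a rough sketch using only the Go standard library. Everything in it is hypothetical: the `Tracker` type, the `/installed?host=...` endpoint, and the host names `metal0` and `metal1` are assumptions for the example, and the DHCP/TFTP side (pixiecore) is left out entirely. It is not the implementation in this PR.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"sync"
	"time"
)

// Tracker records which hosts still have to report a finished installation.
type Tracker struct {
	mu      sync.Mutex
	pending map[string]bool
	done    chan struct{}
}

// NewTracker starts with every given host marked as pending.
func NewTracker(hosts []string) *Tracker {
	t := &Tracker{pending: make(map[string]bool), done: make(chan struct{})}
	for _, h := range hosts {
		t.pending[h] = true
	}
	if len(t.pending) == 0 {
		close(t.done)
	}
	return t
}

// Report marks a host as installed. It returns false for unknown hosts
// and for hosts that have already reported.
func (t *Tracker) Report(host string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if !t.pending[host] {
		return false
	}
	delete(t.pending, host)
	if len(t.pending) == 0 {
		close(t.done) // event-based: wake up anyone waiting on Done()
	}
	return true
}

// Done is closed once every host has reported.
func (t *Tracker) Done() <-chan struct{} { return t.done }

// Handler exposes the callback endpoint (hypothetical path and query parameter).
func (t *Tracker) Handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/installed", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		if !t.Report(r.URL.Query().Get("host")) {
			http.Error(w, "unknown or duplicate host", http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
	return mux
}

func main() {
	tracker := NewTracker([]string{"metal0", "metal1"})

	// Listen on a free local port; a real installer would use a fixed address.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: tracker.Handler()}
	go srv.Serve(ln) // the callback API runs in its own goroutine

	// Simulate the installed machines calling back.
	for _, h := range []string{"metal0", "metal1"} {
		resp, err := http.Post("http://"+ln.Addr().String()+"/installed?host="+h, "text/plain", nil)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}

	<-tracker.Done() // block until every host has reported, no polling
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		panic(err)
	}
	fmt.Println("all hosts installed, server stopped")
}
```

The same `Done()` channel could be used to stop the other goroutines (DHCP, TFTP, HTTP), so the whole binary exits without leaving anything to clean up.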
