Dual-Mode NixOS Workstation AI Node — Unified Planning and Mode State Machine


Implementation Checklist Plan

This is structured to get you from doc → bootable system with minimal thrash.


Phase 0 — Ground Truth (before touching Nix)

Hardware + constraints

  • Confirm GPU topology (which is GPU0 vs GPU1)
  • Confirm display wiring (which GPU drives monitor)
  • Confirm audio interface + latency requirements
  • Validate NVIDIA driver compatibility with NixOS + Wayland/Hyprland

Decisions to lock

  • Use runtime switching (no specialisations yet)

  • Studio = studio-local (conditional policy overlay on desktop, not a first-class target in v1)

  • Source of truth = /run/mode-controller/desired

  • mode request is synchronous: it returns success only after convergence

  • Choose vLLM unit model for v1:

    • v1 fast path: single compute-only vllm.service
    • target architecture: vllm@desktop.service and vllm@compute.service
  • k3s policy for v1:

    • keep k3s.service running across modes
    • change slice budgets and allowed workload intensity by mode
    • defer full k3s mode switching unless operational evidence justifies it
  • desktop-mode AI policy for v1:

    • keep vLLM off in desktop for the first convergence milestone
    • only add bounded desktop-mode AI after desktop ↔ compute switching is reliable
  • studio-local overlay representation for v1:

    • studio-local-policy.service
    • audio-priority.service
  • capability-placement.json source for v1:

    • generated from Nix configuration
    • no runtime override unless a real need emerges
  • defer mode force in v1

  • GUI teardown policy for compute transitions:

    • require graphical session absence
    • require explicit GPU-release verification
    • only add display-manager/greeter stop logic if testing proves it necessary
  • desktop.target should not directly own greeter/login in v1

  • studio-local-policy.service should be:

    • a reliable marker for observation
    • a light policy-application unit
    • not a giant all-in-one Studio controller
  • observe-current implementation for v1:

    • shell first
    • stable plain-text + JSON output contract
    • replace with typed helper later only if complexity justifies it
  • package mode tools in pkgs/ and install them through the module

  • controller trigger model for v1:

    • parameterized oneshot only
    • no timer/path-triggered reconcile until manual transitions are proven
  • boot policy for v1:

    • normalize to desktop on boot
    • defer persistent desired-state replay across reboot
  • Define hard vs soft guards before automation


Phase 0.5 — Control Contract (before full workload integration)

Runtime state contract

  • Define /run/mode-controller/

    • desired
    • current
    • lock
    • last-transition.json
    • last-guards.json
    • capability-placement.json
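
A minimal sketch of how the state files could be written so that readers never see a half-written value. It assumes Python for illustration; the helper names write_state and read_state are hypothetical, and the directory is passed in so the code can be exercised outside /run/mode-controller/.

```python
import os
import tempfile
from pathlib import Path

# File names come from the runtime state contract; everything else is illustrative.
STATE_FILES = (
    "desired", "current", "lock",
    "last-transition.json", "last-guards.json", "capability-placement.json",
)


def write_state(state_dir: Path, name: str, value: str) -> None:
    """Atomically replace one state file: write a temp file, then rename it."""
    if name not in STATE_FILES:
        raise ValueError(f"unknown state file: {name}")
    state_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=state_dir, prefix=f".{name}.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(value.rstrip("\n") + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, state_dir / name)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp)  # do not leave temp files behind on failure
        raise


def read_state(state_dir: Path, name: str, default: str = "") -> str:
    """Return the stripped contents of a state file, or default if it is absent."""
    try:
        return (state_dir / name).read_text().strip()
    except FileNotFoundError:
        return default
```

Since /run is a tmpfs, the files vanish on reboot, which fits the v1 decision to normalize to desktop on boot rather than replay desired state.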

CLI contract

  • Implement or stub:

    • mode request
    • mode status
    • mode reconcile
    • mode current
    • mode desired
    • mode dry-run
    • mode explain
  • defer mode force until guard policy is battle-tested
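
One way to stub the CLI surface before any subcommand does real work is a plain argparse skeleton. This is a sketch only: which subcommands take a target argument is an assumption (the source only shows "mode request compute"), and build_parser is a hypothetical name.

```python
import argparse

# The three steady-state modes named in the observation contract.
MODES = ("desktop", "studio-local", "compute")


def build_parser() -> argparse.ArgumentParser:
    """Build the `mode` CLI with the v1 subcommands; `force` is deliberately absent."""
    parser = argparse.ArgumentParser(prog="mode")
    sub = parser.add_subparsers(dest="command", required=True)

    # Assumption: these subcommands take the target mode as a positional argument.
    for name in ("request", "dry-run", "explain"):
        p = sub.add_parser(name)
        p.add_argument("target", choices=MODES)

    # Assumption: these subcommands take no arguments in v1.
    for name in ("status", "reconcile", "current", "desired"):
        sub.add_parser(name)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"stub: {args.command} {getattr(args, 'target', '')}".strip())
```

Keeping force out of the parser entirely means nobody can reach for it before the guard policy is proven.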

Observation contract

  • Classifier can return:

    • desktop
    • studio-local
    • compute
    • transitioning
    • failed-transition

Guard contract

  • Add check_target_reachable
  • Standardize exit codes
  • Standardize structured output
  • Mark guards as hard vs soft
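
The guard contract could be pinned down as one small record type that every guard emits. A minimal sketch follows; the exit-code values and field names are assumptions, since the checklist only says they should be standardized.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical convention: 0 = passed, 1 = guard says "no", 2 = guard could not run.
EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


@dataclass(frozen=True)
class GuardResult:
    name: str              # e.g. "check_target_reachable"
    passed: bool
    hard: bool             # hard guards block a transition; soft guards only warn
    errored: bool = False  # True when the guard itself failed to evaluate
    detail: str = ""

    def exit_code(self) -> int:
        """Map the result onto the standardized exit codes."""
        if self.errored:
            return EXIT_ERROR
        return EXIT_PASS if self.passed else EXIT_FAIL

    def to_json(self) -> str:
        """Structured output, suitable for collecting into last-guards.json."""
        return json.dumps(asdict(self), sort_keys=True)
```

For example, GuardResult("check_audio_idle", passed=False, hard=True, detail="stream active") maps to exit code 1 and serializes with "hard": true.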

Phase 1 — Base NixOS System

Core system

  • Create flake repo (if not already)
  • Install NixOS (minimal)
  • Enable flakes + nix-command
  • Add SSH + basic hardening

GPU + CUDA

  • Install NVIDIA drivers (matching kernel)
  • Validate nvidia-smi
  • Validate CUDA runtime

Desktop

  • Install Hyprland
  • Configure login/session (greetd or similar)
  • Validate Wayland stability with NVIDIA

Audio

  • Install PipeWire + WirePlumber
  • Validate low-latency config
  • Test REAPER baseline

Phase 2 — systemd Mode Skeleton

Targets / policy markers

  • Define first-class targets:

    • desktop.target
    • compute.target
  • Define studio-local as a policy overlay on desktop

  • Add explicit policy marker/service for studio-local

  • Decide whether studio-local is represented by:

    • audio-priority.service
    • studio-local-policy.service layered over desktop
    • another lightweight marker unit

Relationships

  • Add Conflicts= between:

    • compute ↔ graphical targets
  • Add Wants= / After= dependencies

Slices

  • Define:

    • interactive.slice
    • ai.slice
    • platform.slice
  • Assign services to slices


Phase 3 — Mode Controller (Core)

Core controller

  • mode-controller@.service
  • observe-current
  • reconcile
  • lock handling
  • state-file updates
  • dry-run path
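
Lock handling can stay very small if the kernel does the work. The sketch below assumes an flock-based lock on the lock file from the state contract; transition_lock is a hypothetical helper name, and the lock path is a parameter so it can be tried outside /run.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def transition_lock(path: str):
    """Hold an exclusive, non-blocking flock for the duration of one transition.

    Raises BlockingIOError immediately if another transition holds the lock,
    so a second `mode request` fails fast instead of queueing.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)  # closing also drops the lock if the unlock was skipped
```

Because the kernel releases an flock when the holding process exits, a crashed controller cannot leave a stale lock behind; that is the main reason to prefer it over a PID file here.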

Failure model

  • Record failed-transition
  • Record prior mode
  • Record guard/action failures
  • Verify abort-to-safe-state behavior

Phase 4 — Workload Layer

AI / vLLM

  • Package or install vLLM

  • Create profile-specific config/env for:

    • desktop profile
    • compute profile
  • Implement either:

    • v1 fast path: single vllm.service
    • target path: vllm@desktop.service + vllm@compute.service
  • keep vLLM disabled in desktop for the first bootable transition milestone

  • Validate single-GPU mode

  • Validate dual-GPU mode

  • Keep controller actions profile-aware so later split is mechanical

Platform / k3s

  • Install k3s

  • Configure control node

  • Validate cluster health

  • Deploy minimal workload

  • Keep k3s.service stable across desktop and compute in v1

  • Express mode differences via:

    • platform.slice budgets
    • workload policy / allowed intensity
    • optional node labels / taints later

Phase 5 — State Observation

Implement classifier

  • observe-current script

Detect:

  • graphical session (loginctl / process)
  • PipeWire / audio activity
  • vLLM service state
  • GPU usage (optional: nvidia-smi)

Output

  • plain mode
  • optional JSON (debug)
  • classify transitioning
  • classify failed-transition
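
The plan is shell first for observe-current, but the classification rule is easier to review as a pure function. The Python sketch below shows one possible rule set; the Signals fields and the precedence order are assumptions, and PipeWire activity and GPU usage are left out for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    active_target: Optional[str]   # "desktop", "compute", or None if neither is active
    graphical_session: bool        # e.g. derived from loginctl
    studio_policy_active: bool     # studio-local-policy.service is active
    vllm_active: bool              # reported in the JSON output only, not classified on
    lock_held: bool                # a transition currently holds the lock
    last_transition_failed: bool


def classify(s: Signals) -> str:
    """Map observed signals onto one of the five observation-contract states."""
    if s.lock_held:
        return "transitioning"
    if s.last_transition_failed:
        return "failed-transition"
    if s.active_target == "compute":
        # Compute requires graphical session absence; anything else is inconsistent.
        return "failed-transition" if s.graphical_session else "compute"
    if s.active_target == "desktop":
        return "studio-local" if s.studio_policy_active else "desktop"
    return "failed-transition"  # no recognized target is active


def report(s: Signals) -> str:
    """Debug JSON: the classified mode plus the raw signals behind it."""
    return json.dumps({"mode": classify(s), "signals": asdict(s)}, sort_keys=True)
```

Treating every inconsistent combination as failed-transition keeps the classifier from guessing; the controller then has exactly one place to handle "I do not know what state this is".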

Phase 6 — Guards

Implement guards (scripts)

  • check_target_reachable
  • check_audio_idle
  • check_gpu_display_released
  • check_cpu_load_safe
  • check_user_jobs_safe
  • check_memory_headroom
  • check_vllm_drainable
  • check_studio_capability_local

Standardize

  • exit codes
  • JSON output
  • logging
  • hard vs soft guard policy
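
To make the hard vs soft distinction concrete, here is a minimal sketch of a guard runner. It assumes that a failing hard guard blocks the transition while a failing soft guard is only recorded, and that a guard which raises counts as failed; evaluate and Outcome are hypothetical names, and the checks are stand-in callables rather than the real scripts.

```python
from typing import Callable, NamedTuple


class Outcome(NamedTuple):
    name: str
    passed: bool
    hard: bool


def evaluate(
    guards: list[tuple[str, bool, Callable[[], bool]]],
) -> tuple[bool, list[Outcome]]:
    """Run every guard; allow the transition unless a hard guard fails.

    Each guard is (name, hard, check). All guards run even after a failure,
    so the recorded outcomes list every blocker, not just the first one.
    """
    outcomes = []
    for name, hard, check in guards:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a guard that cannot run is treated as failing
        outcomes.append(Outcome(name, passed, hard))
    allowed = all(o.passed for o in outcomes if o.hard)
    return allowed, outcomes
```

Running all guards instead of short-circuiting costs a little time, but it is what lets mode status show the full set of blocking guards in one pass.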

Phase 7 — Transition Execution

Implement transition flows

  • Desktop → StudioLocal
  • StudioLocal → Desktop
  • Desktop → Compute
  • Compute → Desktop
  • StudioLocal → Compute

Verify explicitly

  • graphical session absence before compute promotion
  • GPU release after GUI shutdown
  • vLLM profile switching
  • audio protection works
  • transitions are idempotent
  • failed guard returns to prior safe state
  • failed action records failed-transition
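
The flows and checks above reduce to a small table plus three outcomes. The sketch below encodes exactly the five listed transitions; the transition function, its callable parameters, and the choice to reject unlisted pairs with an error are assumptions for illustration.

```python
from typing import Callable

# The five flows listed above; any other pair (e.g. compute -> studio-local) is rejected.
ALLOWED = {
    ("desktop", "studio-local"),
    ("studio-local", "desktop"),
    ("desktop", "compute"),
    ("compute", "desktop"),
    ("studio-local", "compute"),
}


def transition(
    current: str,
    target: str,
    guards_ok: Callable[[], bool],
    apply: Callable[[str, str], None],
) -> str:
    """Attempt one transition and return the resulting mode."""
    if current == target:
        return current  # idempotent: already converged, nothing to do
    if (current, target) not in ALLOWED:
        raise ValueError(f"no transition flow defined for {current} -> {target}")
    if not guards_ok():
        return current  # failed guard: nothing was changed, stay in the prior safe state
    try:
        apply(current, target)  # e.g. stop the GUI, verify GPU release, start vLLM
    except Exception:
        return "failed-transition"  # failed action: the system may be half-switched
    return target
```

The asymmetry is deliberate: a failed guard is cheap because nothing has been touched yet, while a failed action is recorded as failed-transition because the system can no longer be assumed to be in either mode.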

Phase 8 — Idle + Automation

Idle detection

  • implement idle signal (input + audio + load)
  • threshold tuning

Policy

  • idle → mode request compute
  • guard failures → no transition

Safety

  • never auto-promote from studio-local
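
The idle policy is small enough to write as one predicate. In this sketch the thresholds (15 minutes without input, 1-minute load at or below 1.0) are placeholder values to tune, and IdleSample and should_request_compute are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleSample:
    seconds_since_input: float  # time since last keyboard/mouse activity
    audio_active: bool          # any active audio stream
    load1: float                # 1-minute load average


def should_request_compute(
    current: str,
    sample: IdleSample,
    idle_after_s: float = 900.0,  # placeholder threshold, to be tuned
    max_load: float = 1.0,        # placeholder threshold, to be tuned
) -> bool:
    """Decide whether the idle automation should issue `mode request compute`."""
    if current != "desktop":
        # Never auto-promote from studio-local; in any other state there is
        # nothing for the idle policy to do.
        return False
    return (
        sample.seconds_since_input >= idle_after_s
        and not sample.audio_active
        and sample.load1 <= max_load
    )
```

This predicate only decides whether to ask. The request still goes through the normal guards, so a guard failure still means no transition.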

Phase 9 — Observability

Logging

  • structured logs for:

    • transitions
    • guards
    • failures

Status

  • mode status shows:

    • desired
    • current
    • last transition
    • blocking guards
    • capability placement

Phase 10 — Hardening

Failure handling

  • retry logic (bounded)
  • failed-transition state handling
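
A sketch of what bounded retry could look like, assuming exponential backoff and a fixed attempt cap. The name retry_bounded and its defaults are hypothetical; the sleep function is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_bounded(
    action: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action up to `attempts` times, doubling the delay between tries.

    Re-raises the last exception once the budget is spent, so the caller
    can record failed-transition instead of looping forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for i in range(attempts):
        try:
            return action()
        except Exception:
            if i + 1 == attempts:
                raise  # budget exhausted: surface the final failure
            sleep(delay_s * (2 ** i))
    raise AssertionError("unreachable")
```

Retrying only makes sense for steps that are safe to repeat, such as polling for GPU release; a step with side effects should fail once and hand over to failed-transition handling.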

Resource tuning

  • CPU quotas per slice
  • memory limits
  • I/O priority
  • tune platform.slice conservatively for desktop / studio-local, relaxed for compute

Security

  • restrict mode controller to root
  • audit transitions
  • isolate AI services

Phase 11 — Optional Evolution

If runtime switching is insufficient

  • introduce specialisation.compute
  • keep same mode interface
  • optionally promote studio-local overlay into a stronger first-class target only if operational evidence justifies the added complexity
  • consider stronger k3s mode-switching only if slice-governed steady-state behavior is inadequate

If Studio moves to Mac mini

  • set capability-placement.json
  • disable studio-local
  • keep controller intact

Critical Path (short version)

If you want the fastest path to something real:

  1. Base NixOS + GPU + Hyprland
  2. vLLM working (single GPU)
  3. Define targets (desktop, compute)
  4. Simple mode CLI + desired file
  5. Hardcoded transitions (no guards yet)
  6. Add guards + observation
  7. Add idle automation
  8. Add studio-local last

Where this can go wrong (worth calling out)

  • GPU release is the hardest boundary → don’t assume, always verify

  • Audio is fragile → treat StudioLocal invariants as strict

  • systemctl isolate can surprise you → test with minimal configs first

  • too much cleverness early → get a dumb working version first, then refine