Dual-Mode NixOS Workstation AI Node — Unified Planning and Mode State Machine
Implementation Checklist Plan
This is structured to get you from doc → bootable system with minimal thrash.
Phase 0 — Ground Truth (before touching Nix)
Hardware + constraints
- Confirm GPU topology (which is GPU0 vs GPU1)
- Confirm display wiring (which GPU drives monitor)
- Confirm audio interface + latency requirements
- Validate NVIDIA driver compatibility with NixOS + Wayland/Hyprland
Decisions to lock
- Use runtime switching (no specialisations yet)
- Studio = studio-local (conditional policy overlay on desktop, not a first-class target in v1)
- Source of truth = /run/mode-controller/desired
- mode request is synchronous: return success only after convergence
- Choose vLLM unit model for v1:
  - v1 fast path: single compute-only vllm.service
  - target architecture: vllm@desktop.service and vllm@compute.service
- k3s policy for v1:
  - keep k3s.service running across modes
  - change slice budgets and allowed workload intensity by mode
  - defer full k3s mode switching unless operational evidence justifies it
- desktop-mode AI policy for v1:
  - keep vLLM off in desktop for the first convergence milestone
  - only add bounded desktop-mode AI after desktop↔compute switching is reliable
- studio-local overlay representation for v1:
  - studio-local-policy.service
  - audio-priority.service
- capability-placement.json source for v1:
  - generated from Nix configuration
  - no runtime override unless a real need emerges
- defer mode force in v1
- GUI teardown policy for compute transitions:
  - require graphical session absence
  - require explicit GPU-release verification
  - only add display-manager/greeter stop logic if testing proves it necessary
- desktop.target should not directly own greeter/login in v1
- studio-local-policy.service should be:
  - a reliable marker for observation
  - a light policy-application unit
  - not a giant all-in-one Studio controller
- observe-current implementation for v1:
  - shell first
  - stable plain-text + JSON output contract
  - replace with typed helper later only if complexity justifies it
- package mode tools in pkgs/ and install them through the module
- controller trigger model for v1:
  - parameterized oneshot only
  - no timer/path-triggered reconcile until manual transitions are proven
- boot policy for v1:
  - normalize to desktop on boot
  - defer persistent desired-state replay across reboot
- Define hard vs soft guards before automation
Phase 0.5 — Control Contract (before full workload integration)
Runtime state contract
- Define /run/mode-controller/
  - desired
  - current
  - lock
  - last-transition.json
  - last-guards.json
  - capability-placement.json
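A minimal sketch of how the desired file might be written and read, assuming plain-text files holding a single mode name. The helper names (write_atomic, set_desired, read_mode) and the set of accepted mode names are illustrative assumptions, not part of the plan. The state directory is passed in as a parameter so the sketch can be exercised outside /run.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: requestable modes mirror the steady states named in this plan.
VALID_DESIRED = {"desktop", "studio-local", "compute"}


def write_atomic(path: Path, text: str) -> None:
    """Write via temp file + rename so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def set_desired(state_dir: Path, mode: str) -> None:
    """Record the requested mode in <state_dir>/desired."""
    if mode not in VALID_DESIRED:
        raise ValueError(f"unknown mode: {mode!r}")
    write_atomic(state_dir / "desired", mode + "\n")


def read_mode(state_dir: Path, name: str) -> Optional[str]:
    """Return the contents of 'desired' or 'current', or None if absent/empty."""
    try:
        return (state_dir / name).read_text().strip() or None
    except FileNotFoundError:
        return None
```

Writing through a rename keeps observers such as mode status from ever reading a half-written file, which matters once the controller and the CLI touch the same directory.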
CLI contract
- Implement or stub:
  - mode request
  - mode status
  - mode reconcile
  - mode current
  - mode desired
  - mode dry-run
  - mode explain
- defer mode force until guard policy is battle-tested
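The CLI surface above could be stubbed with a small argument parser before any behavior exists. This is only a sketch: which subcommands take a target mode argument (here request, dry-run, and explain) is an assumption, as is the build_parser name. The force subcommand is deliberately absent, matching the v1 deferral.

```python
import argparse

# Assumption: these are the modes a caller may name as a target.
MODES = ("desktop", "studio-local", "compute")
# Assumption: these subcommands take a target mode; the rest take no arguments.
NEEDS_TARGET = ("request", "dry-run", "explain")
NO_ARGS = ("status", "reconcile", "current", "desired")


def build_parser() -> argparse.ArgumentParser:
    """Build the 'mode' parser with one subparser per contract subcommand."""
    parser = argparse.ArgumentParser(prog="mode")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in NEEDS_TARGET:
        cmd = sub.add_parser(name)
        cmd.add_argument("target", choices=MODES)
    for name in NO_ARGS:
        sub.add_parser(name)
    return parser
```

For example, build_parser().parse_args(["request", "compute"]) yields command "request" and target "compute", while an unknown subcommand such as force makes argparse exit with a usage error.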
Observation contract
- Classifier can return:
  - desktop
  - studio-local
  - compute
  - transitioning
  - failed-transition
Guard contract
- Add check_target_reachable
- Standardize exit codes
- Standardize structured output
- Mark guards as hard vs soft
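One way to pin down the guard contract is a single result type that carries the exit code, the hard/soft flag, and a JSON rendering. The sketch below assumes a three-value exit-code convention (pass, fail, error) and assumes that only hard guards block a transition; neither is specified by the plan, and the names GuardExit, GuardResult, and blocking are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from enum import IntEnum


class GuardExit(IntEnum):
    # Hypothetical convention; the plan only requires that codes be standardized.
    PASS = 0
    FAIL = 1
    ERROR = 2  # the guard itself could not run


@dataclass(frozen=True)
class GuardResult:
    name: str
    hard: bool
    exit_code: GuardExit
    detail: str = ""

    def to_json(self) -> str:
        """Structured output: one JSON object per guard run."""
        data = asdict(self)
        data["exit_code"] = int(self.exit_code)
        return json.dumps(data, sort_keys=True)


def blocking(results):
    """Names of hard guards that did not pass; soft guards never block here."""
    return [r.name for r in results if r.hard and r.exit_code != GuardExit.PASS]
```

Treating a guard ERROR like a FAIL for hard guards is the conservative choice: a guard that cannot run gives no evidence that the transition is safe.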
Phase 1 — Base NixOS System
Core system
- Create flake repo (if not already)
- Install NixOS (minimal)
- Enable flakes + nix-command
- Add SSH + basic hardening
GPU + CUDA
- Install NVIDIA drivers (matching kernel)
- Validate nvidia-smi
- Validate CUDA runtime
Desktop
- Install Hyprland
- Configure login/session (greetd or similar)
- Validate Wayland stability with NVIDIA
Audio
- Install PipeWire + WirePlumber
- Validate low-latency config
- Test REAPER baseline
Phase 2 — systemd Mode Skeleton
Targets / policy markers
- Define first-class targets:
  - desktop.target
  - compute.target
- Define studio-local as a policy overlay on desktop
- Add explicit policy marker/service for studio-local
- Decide whether studio-local is represented by:
  - audio-priority.service
  - studio-local-policy.service layered over desktop
  - another lightweight marker unit
Relationships
- Add Conflicts= between:
  - compute ↔ graphical targets
- Add Wants= / After= dependencies
Slices
- Define:
  - interactive.slice
  - ai.slice
  - platform.slice
- Assign services to slices
Phase 3 — Mode Controller (Core)
Core controller
- mode-controller@.service
- observe-current
- reconcile
- lock handling
- state-file updates
- dry-run path
Failure model
- Record failed-transition
- Record prior mode
- Record guard/action failures
- Verify abort-to-safe-state behavior
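The controller pieces above (lock, observe, reconcile, state-file updates, failure recording) can be sketched as one function. This is an illustration under assumptions: observe and apply_transition are injected callables standing in for the real observation script and transition actions, the record fields (from, to, result, error) are a hypothetical schema, and abort-to-safe-state recovery is left out entirely.

```python
import fcntl
import json
from pathlib import Path


def reconcile(state_dir: Path, observe, apply_transition) -> str:
    """Drive the observed mode toward 'desired'; return the resulting mode."""
    state_dir.mkdir(parents=True, exist_ok=True)
    with open(state_dir / "lock", "w") as lock:
        # Non-blocking exclusive lock: raises BlockingIOError if another
        # controller run already holds it. Released when the file closes.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        desired = (state_dir / "desired").read_text().strip()
        prior = observe()
        record = {"from": prior, "to": desired, "result": "noop"}
        if prior != desired:
            try:
                apply_transition(prior, desired)
                record["result"] = "ok"
            except Exception as exc:
                # Failure model: keep the prior mode and the error on record.
                record.update(result="failed-transition", error=str(exc))
        if record["result"] == "failed-transition":
            current = "failed-transition"
        else:
            current = observe()  # re-observe rather than trusting the action
        (state_dir / "current").write_text(current + "\n")
        (state_dir / "last-transition.json").write_text(json.dumps(record))
        return current
```

Re-observing after the action, instead of assuming the action worked, keeps the current file grounded in what the system actually reports.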
Phase 4 — Workload Layer
AI / vLLM
- Package or install vLLM
- Create profile-specific config/env for:
  - desktop profile
  - compute profile
- Implement either:
  - v1 fast path: single vllm.service
  - target path: vllm@desktop.service + vllm@compute.service
- keep vLLM disabled in desktop for the first bootable transition milestone
- Validate single-GPU mode
- Validate dual-GPU mode
- Keep controller actions profile-aware so later split is mechanical
Platform / k3s
- Install k3s
- Configure control node
- Validate cluster health
- Deploy minimal workload
- Keep k3s.service stable across desktop and compute in v1
- Express mode differences via:
  - platform.slice budgets
  - workload policy / allowed intensity
  - optional node labels / taints later
Phase 5 — State Observation
Implement classifier
- observe-current script
Detect:
- graphical session (loginctl / process)
- PipeWire / audio activity
- vLLM service state
- GPU usage (optional: nvidia-smi)
Output
- plain mode
- optional JSON (debug)
- classify transitioning
- classify failed-transition
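The classification step can be kept separate from the probing step, so the decision logic is testable without a live system. The sketch below assumes the probes have already been reduced to booleans. The Observation fields, the precedence order, and the fallback to failed-transition when nothing matches are all assumptions rather than settled policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    graphical_session: bool       # e.g. derived from loginctl
    studio_policy_active: bool    # studio-local-policy.service is active
    vllm_active: bool             # vLLM service state
    transition_in_progress: bool  # hypothetical: a controller run holds the lock
    last_transition_failed: bool  # hypothetical: read from last-transition.json


def classify(obs: Observation) -> str:
    """Map observed facts to one of the five classifier outputs."""
    if obs.transition_in_progress:
        return "transitioning"
    if obs.last_transition_failed:
        return "failed-transition"
    if obs.graphical_session:
        return "studio-local" if obs.studio_policy_active else "desktop"
    if obs.vllm_active:
        return "compute"
    # Assumption: no GUI and no compute workload matches no steady state.
    return "failed-transition"
```

Because classify is a pure function, the plain-text output is just its return value and the JSON debug output can be the Observation fields serialized alongside it.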
Phase 6 — Guards
Implement guards (scripts)
- check_target_reachable
- check_audio_idle
- check_gpu_display_released
- check_cpu_load_safe
- check_user_jobs_safe
- check_memory_headroom
- check_vllm_drainable
- check_studio_capability_local
Standardize
- exit codes
- JSON output
- logging
- hard vs soft guard policy
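As a concrete instance of one guard, check_memory_headroom could read MemAvailable from /proc/meminfo and compare it to a threshold. This sketch takes the meminfo text as an argument so it can be tested without touching the host. The 0/1 return convention and the threshold values are assumptions; the plan does not set them.

```python
def parse_mem_available_kib(meminfo_text: str) -> int:
    """Extract the MemAvailable value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def check_memory_headroom(meminfo_text: str, required_gib: float) -> int:
    """Return 0 when available memory meets the threshold, 1 otherwise."""
    available_gib = parse_mem_available_kib(meminfo_text) / (1024 * 1024)
    return 0 if available_gib >= required_gib else 1
```

A wrapper script would read /proc/meminfo, call check_memory_headroom with the chosen threshold, and exit with the returned code, so every guard presents the same shape to the controller.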
Phase 7 — Transition Execution
Implement transition flows
- Desktop → StudioLocal
- StudioLocal → Desktop
- Desktop → Compute
- Compute → Desktop
- StudioLocal → Compute
Verify explicitly
- graphical session absence before compute promotion
- GPU release after GUI shutdown
- vLLM profile switching
- audio protection works
- transitions are idempotent
- failed guard returns to prior safe state
- failed action records failed-transition
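The five flows and the idempotency requirement can be captured in a small lookup before any real actions are wired in. In this sketch, rejecting pairs that are not listed (for example compute to studio-local) is an assumption; the plan lists five flows but does not say what happens to other requests. The plan function name and its three return values are hypothetical.

```python
# The five transition flows listed in this phase.
ALLOWED = {
    ("desktop", "studio-local"),
    ("studio-local", "desktop"),
    ("desktop", "compute"),
    ("compute", "desktop"),
    ("studio-local", "compute"),
}


def plan(current: str, desired: str) -> str:
    """Decide what a request means given the current mode."""
    if current == desired:
        return "noop"  # idempotent: repeating a request changes nothing
    if (current, desired) in ALLOWED:
        return "transition"
    return "reject"  # assumption: unlisted pairs are refused outright
```

Keeping the table explicit makes hardcoded transitions easy to audit and gives the dry-run path something concrete to report.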
Phase 8 — Idle + Automation
Idle detection
- implement idle signal (input + audio + load)
- threshold tuning
Policy
- idle → mode request compute
- guard failures → no transition
Safety
- never auto-promote from studio-local
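The idle policy reduces to one predicate over the idle signal and the current mode. The sketch below assumes the signal has three components (input idle time, audio activity, load average) as listed above. The threshold defaults are placeholders to be tuned, and the IdleSignal and should_request_compute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleSignal:
    seconds_since_input: float
    audio_active: bool
    load_1m: float


def should_request_compute(current: str, signal: IdleSignal,
                           min_idle_s: float = 900.0,
                           max_load: float = 1.0) -> bool:
    """True only when desktop mode has been idle on all three signals."""
    if current != "desktop":
        # Covers the safety rule: never auto-promote from studio-local.
        return False
    return (signal.seconds_since_input >= min_idle_s
            and not signal.audio_active
            and signal.load_1m <= max_load)
```

When this returns True the automation would only issue mode request compute; the guards still decide whether the transition happens, so a guard failure leaves the system where it was.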
Phase 9 — Observability
Logging
- structured logs for:
  - transitions
  - guards
  - failures
Status
- mode status shows:
  - desired
  - current
  - last transition
  - blocking guards
  - capability placement
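A status view can be assembled purely from the runtime state files. This sketch assumes last-guards.json holds a list of objects with name and blocking keys, which is a hypothetical schema, and it reports "unknown" or None for anything missing rather than failing.

```python
import json
from pathlib import Path


def status(state_dir: Path) -> dict:
    """Collect the five status fields from the state directory."""

    def read_text(name: str) -> str:
        try:
            return (state_dir / name).read_text().strip()
        except FileNotFoundError:
            return "unknown"

    def read_json(name: str):
        try:
            return json.loads((state_dir / name).read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    # Assumed schema: [{"name": "...", "blocking": true}, ...]
    guards = read_json("last-guards.json") or []
    return {
        "desired": read_text("desired"),
        "current": read_text("current"),
        "last_transition": read_json("last-transition.json"),
        "blocking_guards": [g["name"] for g in guards if g.get("blocking")],
        "capability_placement": read_json("capability-placement.json"),
    }
```

Returning a dict keeps the plain-text and JSON renderings as two thin formatters over the same data.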
Phase 10 — Hardening
Failure handling
- retry logic (bounded)
- failed-transition state handling
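Bounded retry can be sketched as a small wrapper with a fixed attempt cap and exponential backoff. The attempt count, the base delay, and the retry_bounded name are assumptions. The sleep function is injectable so the backoff schedule can be checked without waiting.

```python
import time


def retry_bounded(action, attempts: int = 3, base_delay_s: float = 1.0,
                  sleep=time.sleep):
    """Call action until it succeeds or attempts run out; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # bounded: give up and let the caller record the failure
            sleep(base_delay_s * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
```

Once the cap is hit the exception propagates, which is where the controller would record failed-transition instead of looping indefinitely.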
Resource tuning
- CPU quotas per slice
- memory limits
- I/O priority
- tune platform.slice conservatively for desktop/studio-local, relaxed for compute
Security
- restrict mode controller to root
- audit transitions
- isolate AI services
Phase 11 — Optional Evolution
If runtime switching is insufficient
- introduce specialisation.compute
- keep same mode interface
- optionally promote studio-local overlay into a stronger first-class target only if operational evidence justifies the added complexity
- consider stronger k3s mode-switching only if slice-governed steady-state behavior is inadequate
If Studio moves to Mac mini
- set capability-placement.json
- disable studio-local
- keep controller intact
Critical Path (short version)
If you want the fastest path to something real:
- Base NixOS + GPU + Hyprland
- vLLM working (single GPU)
- Define targets (desktop, compute)
- Simple mode CLI + desired file
- Hardcoded transitions (no guards yet)
- Add guards + observation
- Add idle automation
- Add studio-local last
Where this can go wrong (worth calling out)
- GPU release is the hardest boundary → don’t assume, always verify
- Audio is fragile → treat StudioLocal invariants as strict
- systemd isolate can surprise you → test with minimal configs first
- too much cleverness early → get a dumb working version first, then refine