Dual-Mode NixOS Workstation AI Node — Unified Planning and Mode State Machine

Implementation Checklist Plan

This plan is ordered to get you from design doc to a bootable system with minimal thrash.

Phase 0 — Ground Truth (before touching Nix)

Hardware + constraints

Confirm GPU topology (which is GPU0 vs GPU1)
Confirm display wiring (which GPU drives monitor)
Confirm audio interface + latency requirements
Validate NVIDIA driver compatibility with NixOS + Wayland/Hyprland

Decisions to lock

Use runtime switching (no specialisations yet)
Studio = studio-local (conditional policy overlay on desktop, not a first-class target in v1)
Source of truth = /run/mode-controller/desired
mode request is synchronous: it returns success only after the system has converged on the requested mode
Choose vLLM unit model for v1:
- v1 fast path: single compute-only vllm.service
- target architecture: vllm@desktop.service and vllm@compute.service
k3s policy for v1:
- keep k3s.service running across modes
- change slice budgets and allowed workload intensity by mode
- defer full k3s mode switching unless operational evidence justifies it
desktop-mode AI policy for v1:
- keep vLLM off in desktop for the first convergence milestone
- only add bounded desktop-mode AI after desktop ↔ compute switching is reliable
studio-local overlay representation for v1:
- studio-local-policy.service
- audio-priority.service
capability-placement.json source for v1:
- generated from Nix configuration
- no runtime override unless a real need emerges
defer mode force in v1
GUI teardown policy for compute transitions:
- require graphical session absence
- require explicit GPU-release verification
- only add display-manager/greeter stop logic if testing proves it necessary
desktop.target should not directly own greeter/login in v1
studio-local-policy.service should be:
- a reliable marker for observation
- a light policy-application unit
- not a giant all-in-one Studio controller
observe-current implementation for v1:
- shell first
- stable plain-text + JSON output contract
- replace with typed helper later only if complexity justifies it
package mode tools in pkgs/ and install them through the module
controller trigger model for v1:
- parameterized oneshot only
- no timer/path-triggered reconcile until manual transitions are proven
boot policy for v1:
- normalize to desktop on boot
- defer persistent desired-state replay across reboot
Define hard vs soft guards before automation

Phase 0.5 — Control Contract (before full workload integration)

Runtime state contract

Define /run/mode-controller/
- desired
- current
- lock
- last-transition.json
- last-guards.json
- capability-placement.json
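
A minimal sketch of how the state files above could be read and written, assuming each plain file holds a single line and that writers replace files atomically. The helper names (write_state, read_state) and the "unknown" default are hypothetical, and the directory is a parameter so the sketch runs against a scratch path instead of /run/mode-controller.

```python
import os
import tempfile
from pathlib import Path

# File names from the runtime state contract above.
STATE_FILES = ("desired", "current", "lock", "last-transition.json",
               "last-guards.json", "capability-placement.json")


def write_state(state_dir: Path, name: str, value: str) -> None:
    """Atomically replace one state file (write a temp file, then rename)."""
    if name not in STATE_FILES:
        raise ValueError(f"unknown state file: {name}")
    state_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=state_dir, prefix=f".{name}.")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(value.rstrip("\n") + "\n")
        os.replace(tmp, state_dir / name)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp)  # do not leave a half-written temp file behind
        raise


def read_state(state_dir: Path, name: str, default: str = "unknown") -> str:
    """Return the stripped contents, or `default` if the file is missing."""
    try:
        return (state_dir / name).read_text().strip()
    except FileNotFoundError:
        return default
```

Writing through a rename means a reader such as mode status never sees a partially written desired or current file.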

CLI contract

Implement or stub:
- mode request
- mode status
- mode reconcile
- mode current
- mode desired
- mode dry-run
- mode explain
defer mode force until guard policy is battle-tested
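
The CLI surface above can be stubbed before any behaviour exists. This is a sketch only: the plan does not fix argument shapes, so giving request, dry-run and explain a positional target is an assumption, as is the list of requestable modes.

```python
import argparse

# Assumption: only these three modes can be requested by a user.
REQUESTABLE_MODES = ("desktop", "studio-local", "compute")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mode")
    sub = parser.add_subparsers(dest="command", required=True)
    # Subcommands that take a target mode (argument shape is an assumption).
    for name in ("request", "dry-run", "explain"):
        p = sub.add_parser(name)
        p.add_argument("target", choices=REQUESTABLE_MODES)
    # Subcommands that only read or re-apply existing state.
    for name in ("status", "reconcile", "current", "desired"):
        sub.add_parser(name)
    # "force" is deliberately not registered in v1.
    return parser
```

For example, build_parser().parse_args(["request", "compute"]) yields command="request" and target="compute", while mode force is rejected by the parser itself.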

Observation contract

Classifier can return:
- desktop
- studio-local
- compute
- transitioning
- failed-transition

Guard contract

Add check_target_reachable
Standardize exit codes
Standardize structured output
Mark guards as hard vs soft
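
One possible shape for the guard contract, sketched under assumptions: the exit-code values (0 pass, 1 fail, 2 guard error) and the field names are not specified by the plan, and GuardResult / evaluate are hypothetical names. The only rule taken from the plan is that guards are marked hard or soft.

```python
from dataclasses import dataclass, asdict
from enum import IntEnum


class GuardExit(IntEnum):
    # Assumed convention; the plan only says exit codes must be standardized.
    PASS = 0
    FAIL = 1
    ERROR = 2  # the guard itself could not run


@dataclass(frozen=True)
class GuardResult:
    name: str
    exit_code: GuardExit
    hard: bool  # hard guards block a transition; soft guards only warn
    detail: str = ""


def evaluate(results: list) -> dict:
    """Aggregate guard results into one JSON-serializable verdict."""
    blocking = [r.name for r in results
                if r.hard and r.exit_code != GuardExit.PASS]
    warnings = [r.name for r in results
                if not r.hard and r.exit_code != GuardExit.PASS]
    return {
        "allowed": not blocking,
        "blocking": blocking,
        "warnings": warnings,
        "guards": [asdict(r) for r in results],
    }
```

A verdict like this could be dumped as-is into last-guards.json, and the blocking list is what mode status would surface as blocking guards.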

Phase 1 — Base NixOS System

Core system

Create flake repo (if not already)
Install NixOS (minimal)
Enable flakes + nix-command
Add SSH + basic hardening

GPU + CUDA

Install NVIDIA drivers (matching kernel)
Validate nvidia-smi
Validate CUDA runtime

Desktop

Install Hyprland
Configure login/session (greetd or similar)
Validate Wayland stability with NVIDIA

Audio

Install PipeWire + WirePlumber
Validate low-latency config
Test REAPER baseline

Phase 2 — systemd Mode Skeleton

Targets / policy markers

Define first-class targets:
- desktop.target
- compute.target
Define studio-local as a policy overlay on desktop
Add explicit policy marker/service for studio-local
Confirm how studio-local is represented (Phase 0 locks the first two for v1):
- audio-priority.service
- studio-local-policy.service layered over desktop
- another lightweight marker unit

Relationships

Add Conflicts= between:
- compute ↔ graphical targets
Add Wants= / After= dependencies

Slices

Define:
- interactive.slice
- ai.slice
- platform.slice
Assign services to slices

Phase 3 — Mode Controller (Core)

Core controller

mode-controller@.service
observe-current
reconcile
lock handling
state-file updates
dry-run path
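
For the lock handling item, a sketch of a non-blocking exclusive lock using flock(2) through Python's fcntl module. The plan does not say how the lock file is used, so treating it as an advisory flock target is an assumption; controller_lock and LockHeld are hypothetical names.

```python
import fcntl
import os
from contextlib import contextmanager


class LockHeld(RuntimeError):
    """Another controller instance already holds the lock."""


@contextmanager
def controller_lock(path: str):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # LOCK_NB: fail immediately instead of queueing behind another run.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise LockHeld(f"lock already held: {path}") from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```

Because the kernel drops the lock when the holding process exits, a crashed controller cannot leave a stale lock behind, which a plain "file exists" check would.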

Failure model

Record failed-transition
Record prior mode
Record guard/action failures
Verify abort-to-safe-state behavior
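
The failure model above can be captured as one record per transition attempt. This sketch assumes guards run before any action, so a guard failure leaves the system in its prior mode while an action failure yields failed-transition. The field names and the outcome labels ("blocked", "failed", "converged") are hypothetical.

```python
import time
from typing import Optional


def transition_record(prior: str, target: str,
                      guard_failures: list, action_failures: list,
                      now: Optional[float] = None) -> dict:
    """Build the record written to last-transition.json (field names assumed)."""
    if guard_failures:
        # Guards run before any action, so the prior mode is still intact.
        outcome, current = "blocked", prior
    elif action_failures:
        # An action failed part-way; the resulting state is not trusted.
        outcome, current = "failed", "failed-transition"
    else:
        outcome, current = "converged", target
    return {
        "timestamp": time.time() if now is None else now,
        "prior": prior,
        "target": target,
        "outcome": outcome,
        "current": current,
        "guard_failures": list(guard_failures),
        "action_failures": list(action_failures),
    }
```

Keeping prior in every record is what makes abort-to-safe-state checkable afterwards: the record says where the system was supposed to fall back to.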

Phase 4 — Workload Layer

AI / vLLM

Package or install vLLM
Create profile-specific config/env for:
- desktop profile
- compute profile
Implement either:
- v1 fast path: single vllm.service
- target path: vllm@desktop.service + vllm@compute.service
keep vLLM disabled in desktop for the first bootable transition milestone
Validate single-GPU mode
Validate dual-GPU mode
Keep controller actions profile-aware so later split is mechanical

Platform / k3s

Install k3s
Configure control node
Validate cluster health
Deploy minimal workload
Keep k3s.service stable across desktop and compute in v1
Express mode differences via:
- platform.slice budgets
- workload policy / allowed intensity
- optional node labels / taints later

Phase 5 — State Observation

Implement classifier

observe-current script

Detect:

graphical session (via loginctl or a process check)
PipeWire / audio activity
vLLM service state
GPU usage (optional: nvidia-smi)

Output

plain mode
optional JSON (debug)
classify transitioning
classify failed-transition
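
A sketch of the classification step only, kept separate from signal collection. The plan calls for a shell implementation first; this Python version just illustrates the decision order. The rule table is an assumption: a held lock means transitioning, a failed last transition wins next, and vLLM running alongside a graphical session is treated as inconsistent because v1 keeps vLLM off in desktop. Signals and classify are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    graphical_session: bool       # e.g. from loginctl
    studio_marker_active: bool    # e.g. studio-local-policy.service is active
    vllm_active: bool             # vLLM service state
    lock_held: bool               # a transition currently holds the lock
    last_transition_failed: bool  # derived from last-transition.json


def classify(s: Signals) -> str:
    """Map observed signals to exactly one mode name (rule order is assumed)."""
    if s.lock_held:
        return "transitioning"
    if s.last_transition_failed:
        return "failed-transition"
    if s.vllm_active:
        # v1 keeps vLLM off in desktop, so vLLM plus a GUI is inconsistent.
        return "failed-transition" if s.graphical_session else "compute"
    return "studio-local" if s.studio_marker_active else "desktop"
```

Keeping classify a pure function of already-collected signals makes it easy to table-test every combination before wiring in loginctl or nvidia-smi.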

Phase 6 — Guards

Implement guards (scripts)

check_target_reachable
check_audio_idle
check_gpu_display_released
check_cpu_load_safe
check_user_jobs_safe
check_memory_headroom
check_vllm_drainable
check_studio_capability_local

Standardize

exit codes
JSON output
logging
hard vs soft guard policy

Phase 7 — Transition Execution

Implement transition flows

Desktop → StudioLocal
StudioLocal → Desktop
Desktop → Compute
Compute → Desktop
StudioLocal → Compute

Verify explicitly

graphical session absence before compute promotion
GPU release after GUI shutdown
vLLM profile switching
audio protection works
transitions are idempotent
failed guard returns to prior safe state
failed action records failed-transition
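
The five flows above, plus the idempotence requirement, can be expressed as a small lookup. This is a sketch with hypothetical names (ALLOWED_FLOWS, plan_transition). Compute → StudioLocal is not in the list above, so the sketch rejects it; routing it through desktop instead would be a separate design decision.

```python
from typing import Optional

# Exactly the flows listed above, keyed as (current, desired).
ALLOWED_FLOWS = {
    ("desktop", "studio-local"),
    ("studio-local", "desktop"),
    ("desktop", "compute"),
    ("compute", "desktop"),
    ("studio-local", "compute"),
}


def plan_transition(current: str, desired: str) -> Optional[tuple]:
    """Return the flow to run, or None when there is nothing to do."""
    if current == desired:
        return None  # idempotent: re-requesting the current mode is a no-op
    if (current, desired) not in ALLOWED_FLOWS:
        raise ValueError(f"no transition flow defined: {current} -> {desired}")
    return (current, desired)
```

Rejecting unknown pairs up front also keeps failed-transition and transitioning from ever being treated as a valid starting point for a flow.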

Phase 8 — Idle + Automation

Idle detection

implement idle signal (input + audio + load)
threshold tuning

Policy

idle → mode request compute
guard failures → no transition

Safety

never auto-promote from studio-local
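
The idle policy and its safety rule fit in one predicate. A sketch under assumptions: the thresholds (15 minutes of input idle, 1-minute load of at most 1.0) are placeholders to be tuned, and should_request_compute is a hypothetical name. Guards still run afterwards; this only decides whether to issue mode request compute at all.

```python
def should_request_compute(current: str, input_idle_s: float,
                           audio_active: bool, load1: float, *,
                           idle_threshold_s: float = 900.0,
                           max_load: float = 1.0) -> bool:
    """Decide whether idle automation should issue `mode request compute`."""
    if current != "desktop":
        # Never auto-promote from studio-local; other states are also skipped.
        return False
    if audio_active:
        return False  # audio activity counts as "in use" even without input
    return input_idle_s >= idle_threshold_s and load1 <= max_load
```

Checking the current mode first means the studio-local rule cannot be bypassed by tuning the thresholds.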

Phase 9 — Observability

Logging

structured logs for:
- transitions
- guards
- failures

Status

mode status shows:
- desired
- current
- last transition
- blocking guards
- capability placement

Phase 10 — Hardening

Failure handling

retry logic (bounded)
failed-transition state handling
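
Bounded retry can be sketched as a fixed number of attempts with exponential backoff. The attempt count and delays are assumptions, and retry_bounded is a hypothetical helper. The sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable


def retry_bounded(action: Callable[[], bool], attempts: int = 3,
                  base_delay_s: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `action` until it returns True, at most `attempts` times."""
    for i in range(attempts):
        if action():
            return True
        if i < attempts - 1:
            sleep(base_delay_s * 2 ** i)  # 1s, 2s, 4s, ... between attempts
    return False  # caller records failed-transition
```

When this returns False the caller should stop and record failed-transition instead of looping, which keeps a stuck GPU release from turning into an endless reconcile cycle.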

Resource tuning

CPU quotas per slice
memory limits
I/O priority
tune platform.slice budgets: conservative for desktop / studio-local, relaxed for compute

Security

restrict mode controller to root
audit transitions
isolate AI services

Phase 11 — Optional Evolution

If runtime switching is insufficient

introduce specialisation.compute
keep same mode interface
optionally promote studio-local overlay into a stronger first-class target only if operational evidence justifies the added complexity
consider stronger k3s mode-switching only if slice-governed steady-state behavior is inadequate

If Studio moves to Mac mini

set capability-placement.json
disable studio-local
keep controller intact

Critical Path (short version)

If you want the fastest path to something real:

Base NixOS + GPU + Hyprland
vLLM working (single GPU)
Define targets (desktop, compute)
Simple mode CLI + desired file
Hardcoded transitions (no guards yet)
Add guards + observation
Add idle automation
Add studio-local last

Where this can go wrong (worth calling out)

GPU release is the hardest boundary → don’t assume, always verify
Audio is fragile → treat StudioLocal invariants as strict
systemd isolate can surprise you → test with minimal configs first
too much cleverness early → get a dumb working version first, then refine
