System Implementation Plan

Status: living plan

This plan is for implementing Dubnium on the actual workstation host. It expands the short bring-up checklist into a cautious, evidence-driven rollout. The goal is not to turn everything on at once. The goal is to prove one layer at a time: hardware facts, Nix evaluation, boot baseline, observer honesty, overlay mode, compute mode, rollback, then hardening.

Current V1 Assumptions

These assumptions come from the current repo configuration and should be confirmed before the first live switch:

| Area | Current assumption |
| --- | --- |
| Host flake target | `.#workstation` |
| Hostname | `dubnium-workstation` |
| Boot default | `desktop` |
| Studio placement | `local` |
| `studio-local` representation | desktop overlay using `studio-local-policy.service` and `audio-priority.service` |
| vLLM lifecycle | compute-only in v1 |
| vLLM model | `Qwen/Qwen2.5-Coder-14B-Instruct` |
| Current GPU phase | planned 2 GPUs, currently present `[ 0 ]` |
| Display GPU | `0` |
| Compute GPUs | `[ 0 ]` until second GPU is present |
| k3s | disabled in current host config |
| Bootloader | systemd-boot with EFI variable access |
| Runtime state | `/run/mode-controller` |

Do not proceed to live transition testing until the hardware facts are confirmed against the actual host.

Phase 0: Safety and Ground Truth

Objective: know enough about the machine to avoid destructive or confusing changes.

0.1 Confirm Installation Path

Decide which path applies:

existing NixOS machine: use nixos-rebuild build, then nixos-rebuild switch
fresh install from live USB: use the fresh install runbook first
non-NixOS current OS: do not use this plan directly until disk/install strategy is decided

Exit criteria:

install path is explicit
target disk and boot mode are known if fresh installing
rollback access path is known

0.2 Confirm Remote/Recovery Access

Before switching system configuration:

ip addr
systemctl status sshd

Confirm:

local keyboard/display access works
SSH is enabled or a local console is available
you know how to select an older NixOS generation at boot
important local data is backed up

Failure mode to avoid:

switching into a broken graphical/session state with no recovery path

0.3 Capture Hardware Facts

Run on the target host:

lspci -nn | grep -E 'VGA|3D|Audio|USB'
nvidia-smi
lsblk -f
findmnt
bootctl status

Record:

actual GPU count
which GPU drives display
GPU PCI IDs
NVIDIA driver visibility through nvidia-smi
boot disk/filesystem layout
EFI/systemd-boot status
audio interface and whether REAPER/local studio is still needed on-host

Exit criteria:

dubnium.hardware.presentGpus matches real visible GPUs
dubnium.hardware.displayGpu matches the display path
dubnium.hardware.computeGpus only references present GPUs
bootloader assumptions match the host
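
The GPU-related exit criteria can be checked mechanically by comparing the indices the driver reports with the indices the host config names. The following is a minimal sketch, not part of the repo: the helper names gpu_indices and gpus_present are hypothetical, and it assumes nvidia-smi -L prints one "GPU <index>: ..." line per device.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the Dubnium repo).

# Read `nvidia-smi -L` output on stdin and print one GPU index per line.
# Assumes each device line starts with "GPU <index>:".
gpu_indices() {
  sed -n 's/^GPU \([0-9][0-9]*\):.*/\1/p'
}

# Succeed only if every index in $1 appears in $2 (both space-separated lists).
gpus_present() {
  local wanted="$1" detected="$2" idx
  for idx in $wanted; do
    case " $detected " in
      *" $idx "*) ;;
      *) echo "GPU $idx is configured but not visible" >&2; return 1 ;;
    esac
  done
}

# Example (run on the target host):
#   detected="$(nvidia-smi -L | gpu_indices | tr '\n' ' ')"
#   gpus_present "0" "$detected" && echo "computeGpus [ 0 ] is backed by real hardware"
```

Pass the values of presentGpus, displayGpu, and computeGpus in turn as the first argument. A non-zero exit means the config describes hardware the driver cannot see.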

0.4 Decide First Compute Profile

For first live validation, choose the least surprising compute profile:

with one GPU: compute may terminate the desktop and use GPU 0
with two GPUs: compute can target both GPUs, but only after single-GPU behavior is proven
vLLM should stay compute-only

If VRAM is tight, add vLLM guardrails before compute testing:

dubnium.vllm.extraArgs = [
  "--max-model-len" "8192"
  "--gpu-memory-utilization" "0.70"
  "--enforce-eager"
];
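
To judge whether VRAM is tight, it helps to turn the --gpu-memory-utilization fraction into an absolute budget and compare it with a rough weight size. The sketch below is illustrative only: the helper names are hypothetical, the 24576 MiB card is an assumed example rather than a recorded hardware fact, and the 14 billion parameter figure is the nominal size suggested by the model name.

```bash
# Hypothetical helpers (not part of the Dubnium repo).

# MiB that vLLM may claim on one GPU for a given utilization fraction.
# Arguments: total GPU memory in MiB, fraction (for example 0.70).
vram_budget_mib() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%d\n", total * util }'
}

# Rough weight size in MiB: parameters (in billions) times bytes per parameter.
# Ignores KV cache, activations, and runtime overhead.
weights_mib() {
  awk -v b="$1" -v bytes="$2" 'BEGIN { printf "%d\n", b * 1e9 * bytes / 1048576 }'
}

# Example with assumed numbers:
#   vram_budget_mib 24576 0.70   # budget on a 24576 MiB card
#   weights_mib 14 2             # nominal 14B parameters at 2 bytes each
```

Compare the two numbers before compute testing. If the weight estimate exceeds the budget, the guardrails above will not be enough on their own and the memory plan needs revisiting.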

Do not add desktop AI in the first rollout.

0.5 Seed Local Model Bundle

Preferred path:

copy the selected materialized model bundle from the Dubnium USB seed into /var/lib/dubnium/models
keep model weights out of Git and out of the Nix store

See docs/runbooks/model-seeding.md for the exact operator flow.

Phase 1: Repo and Host Configuration Review

Objective: make the flake match the real system before any switch.

1.1 Generate Hardware Configuration

On the target NixOS machine:

sudo nixos-generate-config --dir ./hosts/workstation

Review:

root filesystem and boot filesystem entries
EFI mount point
generated hardware imports
NVIDIA-related hardware detection

Do not preserve the placeholder hardware file if it does not match the target.

1.2 Review Host Config

Inspect:

sed -n '1,220p' hosts/workstation/default.nix

Confirm or update:

networking.hostName
bootloader settings
services.openssh.enable
dubnium.capabilityPlacement.studio
dubnium.vllm.enable
dubnium.vllm.model
dubnium.vllm.extraArgs
dubnium.hardware.presentGpus
dubnium.hardware.displayGpu
dubnium.hardware.computeGpus
dubnium.k3s.enable

Recommended first-system stance:

keep boot.defaultMode = "desktop"
keep enableDesktopProfile = false
keep k3s.enable = false until mode control is proven
keep computeGpus = [ 0 ] if only one GPU is currently installed

1.3 Confirm Module Assertions

The module already asserts:

display GPU must be present
desktop AI GPUs must be present
compute GPUs must be present
vLLM package and model must be set when vLLM is enabled

These assertions are useful. If they fail, fix the host facts rather than bypassing them.

Exit criteria:

host config expresses real hardware, not planned hardware
planned hardware is represented only in plannedGpuCount
actual services enabled match the first rollout scope

Phase 2: Build Without Switching

Objective: prove Nix evaluation and build before mutating the live system.

Run:

sudo nixos-rebuild build --flake .#workstation

If it fails, classify the failure:

hardware config mismatch
unfree/NVIDIA package issue
vLLM package evaluation issue
missing module import
syntax or option error

Do not run switch until build succeeds.
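
A first-pass classification can be automated by matching the build log against typical error wording. This is a rough sketch under assumptions: the helper name is hypothetical, and the patterns reflect common Nix/NixOS messages that may differ between versions, so treat the result as a hint and still read the log.

```bash
# Hypothetical helper (not part of the Dubnium repo).
# Read a build log on stdin and print a coarse failure class.
# The patterns are assumptions about typical Nix/NixOS error wording.
classify_build_failure() {
  local log
  log="$(cat)"
  case "$log" in
    *"unfree license"*)               echo "unfree/NVIDIA package issue" ;;
    *"Failed assertions:"*)           echo "host option assertion" ;;
    *"syntax error"*)                 echo "syntax or option error" ;;
    *"The option"*"does not exist"*)  echo "syntax or option error" ;;
    *"No such file or directory"*)    echo "missing module import" ;;
    *)                                echo "unclassified: read the full log" ;;
  esac
}

# Example (keeps the full log in build.log and prints only the class):
#   sudo nixos-rebuild build --flake .#workstation 2>&1 | tee build.log | classify_build_failure
```

Hardware config mismatches and vLLM package evaluation issues have no reliable single pattern, so they fall through to the unclassified branch here.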

Useful follow-up checks:

nix flake check
nix build .#packages.x86_64-linux.mode-tools

Exit criteria:

flake builds successfully
mode-tools package builds
no host option assertion is failing

Phase 3: First Switch to Desktop Baseline

Objective: switch only into the safe desktop-default posture.

Run:

sudo nixos-rebuild switch --flake .#workstation

Immediately check:

hostname
mode status
mode current
mode desired
sudo ls -la /run/mode-controller
systemctl status desktop.target
systemctl status compute.target
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl status vllm.service

Expected:

host boots or remains usable
desired mode is desktop
current mode is desktop, or a clearly explained non-desktop state
vllm.service is inactive in desktop
studio-local-policy.service is inactive
audio-priority.service is inactive
/run/mode-controller exists

If mode current reports compute or studio-local unexpectedly, stop and fix observation before testing transitions.
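
The "inactive" expectations above can be checked in one pass rather than by reading three status screens. A minimal sketch, assuming systemctl is-active prints a single state word per unit; the helper name check_inactive is hypothetical.

```bash
# Hypothetical helper (not part of the Dubnium repo).
# Read "unit state" pairs on stdin, one per line, and fail if any unit
# that should be inactive in desktop mode reports another state.
check_inactive() {
  local unit state rc=0
  while read -r unit state; do
    [ -n "$unit" ] || continue
    if [ "$state" != "inactive" ]; then
      echo "unexpected: $unit is '$state', expected inactive" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example (run after the switch):
#   for u in vllm.service studio-local-policy.service audio-priority.service; do
#     echo "$u $(systemctl is-active "$u")"
#   done | check_inactive && echo "overlay and compute services are quiet"
```

A "failed" unit is reported as unexpected too, which is intentional: a failed overlay service in the desktop baseline deserves a look before any transition testing.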

Exit criteria:

desktop baseline is usable
mode CLI works
observer output matches visible reality

Phase 4: Control-Plane Inspection Before Transitions

Objective: prove the controller can explain the system before it mutates the system.

Run:

mode status
mode current --refresh
mode current --json
mode explain desktop
mode explain studio-local
mode explain compute
sudo cat /run/mode-controller/capability-placement.json
sudo cat /run/mode-controller/hardware-topology.json

Check that the JSON/evidence shape is useful enough to diagnose:

graphical session active or not
studio policy active or not
compute target active or not
vLLM active or not
last transition status

If mode current --json is too thin, harden observer output before running compute transitions. The observer is the foundation of safe switching.
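
One cheap way to decide whether the output is "too thin" is to list which expected evidence fields are absent. The sketch below is an assumption-heavy illustration: the helper is hypothetical, and the key names in the example are placeholders, not the observer's real schema. Substitute whatever fields the observer actually emits.

```bash
# Hypothetical helper (not part of the Dubnium repo).
# Report which expected keys do not appear in a JSON file. This is a plain
# text match: it shows whether the key name occurs anywhere in the file,
# not where it occurs or what value it has.
missing_keys() {
  local file="$1" key rc=0
  shift
  for key in "$@"; do
    if ! grep -q "\"$key\"" "$file"; then
      echo "missing: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Example. The key names are placeholders, not the real schema:
#   mode current --json > current.json
#   missing_keys current.json graphical_session studio_policy compute_target vllm last_transition
```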

Exit criteria:

status output distinguishes desired and current
current state is derived from facts
hardware and placement files match host configuration

Phase 5: Test `desktop -> studio-local -> desktop`

Objective: prove the low-risk overlay path before terminating the GUI for compute.

Run:

sudo mode request studio-local
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight

Expected:

observed mode becomes studio-local
studio-local-policy.service is active
audio-priority.service is active
interactive slice weights are raised
AI/platform slice weights are lowered
vLLM remains inactive

Return to desktop:

sudo mode request desktop
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight

Expected:

observed mode becomes desktop
overlay services are inactive
slice weights return to baseline

Exit criteria:

overlay activation and cleanup are repeatable
observer accurately distinguishes desktop and studio-local
failure records are useful if a command fails
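
"Slice weights return to baseline" is easier to verify by diffing snapshots than by eye. A minimal sketch with hypothetical helper names; it reuses the systemctl show invocation from this phase and assumes the property output is stable between calls when nothing has changed.

```bash
# Hypothetical helpers (not part of the Dubnium repo).

# Print the CPU and IO weights of the three slices, labelled per slice.
snapshot_weights() {
  local s
  for s in interactive.slice ai.slice platform.slice; do
    echo "# $s"
    systemctl show "$s" -p CPUWeight -p IOWeight
  done
}

# Succeed if two snapshot files are identical; otherwise print the differences.
weights_restored() {
  diff "$1" "$2"
}

# Example:
#   snapshot_weights > before.txt
#   sudo mode request studio-local    # confirm the overlay is active, then:
#   sudo mode request desktop
#   snapshot_weights > after.txt
#   weights_restored before.txt after.txt && echo "slice weights returned to baseline"
```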

Phase 6: Precompute Guard Validation

Objective: test compute guards without trusting the full transition yet.

Before running a real compute transition:

mode status
systemctl status vllm.service
loginctl list-sessions

Manually confirm:

no active REAPER project
no live audio session you care about
no long-running foreground job
model store path has enough space
vLLM model choice fits current GPU memory plan

If the CLI exposes guards, run or inspect them directly. If it does not yet, use the existing transition path cautiously and read the guard results from /run/mode-controller/last-guards.json.

Compute should be blocked when:

audio is active
graphical session is not terminable
memory headroom is insufficient
target is not reachable
required persistence paths are missing

Exit criteria:

you know which guards are hard blocks
guard failures are visible in last-guards.json
no guard silently assumes success
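
The last criterion amounts to "fail closed": a guard with no measurement must block, not pass. The sketch below illustrates that shape for the memory-headroom guard. It is not the controller's implementation; the helper names and the 16384 MiB threshold are hypothetical placeholders.

```bash
# Hypothetical guard sketch (not the controller's real guard code).

# Print MemAvailable in MiB from a meminfo-style file (default /proc/meminfo).
mem_available_mib() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# guard_min NAME REQUIRED ACTUAL
# Blocks when ACTUAL is empty, non-numeric, or below REQUIRED, so a missing
# measurement is never treated as success.
guard_min() {
  local name="$1" required="$2" actual="$3"
  case "$actual" in
    ''|*[!0-9]*) echo "BLOCK $name: no usable measurement"; return 1 ;;
  esac
  if [ "$actual" -lt "$required" ]; then
    echo "BLOCK $name: $actual is below $required"
    return 1
  fi
  echo "PASS $name: $actual >= $required"
}

# Example (the threshold is an arbitrary placeholder):
#   guard_min memory-headroom 16384 "$(mem_available_mib)"
```

The same guard_min shape would work for free space on the model store path, with the measurement taken from df instead.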

Phase 7: First `desktop -> compute` Transition

Objective: prove one real promotion into compute, accepting that the first attempt may reveal NVIDIA/session behavior.

Preconditions:

desktop baseline has already been verified
studio overlay path has already been verified
no critical local work is running
local console or SSH recovery is available

Run:

sudo mode request compute

Then inspect:

mode status
systemctl status compute.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
journalctl -u 'mode-controller@*' -b
journalctl -u vllm.service -b

Expected success:

observed mode is compute
graphical session is absent or non-authoritative
compute.target is active
vllm.service is active if enabled
GPU process evidence matches compute expectations
transition record says success
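
The plan does not say whether mode request blocks until the target has settled. If it returns early, checking the list above immediately can produce a false failure. A small polling helper avoids that; this is a sketch with a hypothetical name and an arbitrary timeout.

```bash
# Hypothetical helper (not part of the Dubnium repo).
# Retry a command once per second until it succeeds or the timeout
# (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example (120 seconds is a placeholder):
#   wait_until 120 systemctl is-active --quiet compute.target \
#     || echo "compute.target did not become active; read last-transition.json"
```

A timeout here is evidence, not a verdict: record it and then read the transition and guard records before drawing conclusions.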

Acceptable first degraded outcomes:

vLLM starts, but only on a reduced GPU profile
a residual display allocation remains, below a documented threshold
a non-critical desktop unit remains active without causing a resource conflict

Hard failures:

observer cannot classify final state
audio or GUI conflict remains
GPU release is indeterminate
vLLM fails repeatedly, so the compute contract cannot be met
rollback cannot restore desktop

If the transition fails, do not keep retrying blindly. Diagnose the first failed predicate.

Phase 8: First `compute -> desktop` Return

Objective: prove rollback/restoration before treating compute as usable.

Run:

sudo mode request desktop

Then inspect:

mode status
systemctl status desktop.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi

Expected:

observed mode is desktop
vllm.service is inactive
graphical session path is usable
audio returns to ordinary desktop behavior
no compute-only state remains authoritative

If desktop is only partially restored, classify the result as degraded and fix the observer/controller before more compute testing.

Exit criteria:

one complete desktop -> compute -> desktop loop works or fails with a clear documented reason
rollback is evidence-backed

Phase 9: Repeatability and Soak

Objective: distinguish a one-time success from a reliable operating model.

Repeat:

sudo mode request studio-local
sudo mode request desktop
sudo mode request compute
sudo mode request desktop

For each run, record:

final mode status
transition duration
guard output
whether GPU release was clean
whether desktop restoration was clean
whether vLLM startup was reliable

Minimum repeatability bar before broader usage:

3 clean studio overlay round trips
3 clean compute round trips
no false-success observer classifications
no unexplained stale locks
no manual cleanup needed between runs
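
Recording each run in a flat log makes the repeatability bar countable. The following is a sketch with hypothetical helper names and log format. Note that it only records the exit status and duration of the request command; a zero exit status is one signal, and the final mode status still has to be checked for each run.

```bash
# Hypothetical helpers (not part of the Dubnium repo).

# timed_run LOG LABEL CMD...
# Runs CMD and appends "LABEL EXIT_CODE SECONDS" to LOG.
timed_run() {
  local log="$1" label="$2" start end rc=0
  shift 2
  start="$(date +%s)"
  "$@" || rc=$?
  end="$(date +%s)"
  echo "$label $rc $((end - start))" >> "$log"
  return "$rc"
}

# count_clean LOG LABEL
# Prints how many runs recorded under LABEL exited with status 0.
count_clean() {
  awk -v label="$2" '$1 == label && $2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Example:
#   timed_run soak.log compute-enter sudo mode request compute
#   timed_run soak.log compute-exit  sudo mode request desktop
#   [ "$(count_clean soak.log compute-enter)" -ge 3 ] && echo "compute entry bar met"
```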

Phase 10: Hardening Backlog

Only after the first transition loop is proven, prioritize hardening in this order:

1. Richer observe-current --json evidence and conflicts.
2. Persistent audit log at /var/lib/mode-controller/events.jsonl.
3. Explicit GPU release predicate and thresholds.
4. Degraded state classification for desktop and compute.
5. Guard CLI surface such as mode guards <target>.
6. vLLM runtime guardrails and model store persistence.
7. k3s enablement and platform.slice policy.
8. Optional impermanence and /persist mapping.
9. Bounded desktop AI after second GPU and stable transitions.
10. Specialisation evaluation only if runtime switching fails repeatedly.

Stop Conditions

Stop implementation and return to planning if any of these occur:

the observer reports false success
desktop cannot be restored through the controller
GPU release is repeatedly indeterminate
target isolation stops recovery-critical services
vLLM causes repeated OOM or driver instability
failures require undocumented manual cleanup

The correct response to any stop condition is not more automation. First improve observation, logs, predicates, and rollback.

Evidence to Keep

For each major milestone, keep the following:

mode status
mode current --json
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
systemctl status desktop.target compute.target vllm.service
nvidia-smi
journalctl -u 'mode-controller@*' -b

For repeated failures, copy the relevant evidence into an issue, planning note, or future runbook update before changing more code.
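
Collecting the evidence list into one timestamped directory makes it easy to attach to an issue. A minimal sketch: the capture helper and the directory layout are hypothetical, not a repo convention. Failures of individual commands are recorded rather than aborting the capture, since a failing command is itself evidence.

```bash
# Hypothetical helper (not part of the Dubnium repo).
# Run a command, save its combined output under $EVIDENCE_DIR, and record
# its exit status in index.txt without aborting when the command fails.
capture() {
  local name="$1" rc=0
  shift
  "$@" > "$EVIDENCE_DIR/$name.txt" 2>&1 || rc=$?
  echo "$name exit=$rc" >> "$EVIDENCE_DIR/index.txt"
}

# Example (the directory layout is an assumption):
#   EVIDENCE_DIR="$HOME/dubnium-evidence/$(date +%Y%m%dT%H%M%S)"
#   mkdir -p "$EVIDENCE_DIR"
#   capture mode-status      mode status
#   capture mode-current     mode current --json
#   capture last-transition  sudo cat /run/mode-controller/last-transition.json
#   capture last-guards      sudo cat /run/mode-controller/last-guards.json
#   capture units            systemctl status desktop.target compute.target vllm.service
#   capture nvidia-smi       nvidia-smi
#   capture controller-log   journalctl -u 'mode-controller@*' -b
```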
