System Implementation Plan
Status: living plan
This plan is for implementing Dubnium on the actual workstation host. It expands the short bring-up checklist into a cautious, evidence-driven rollout. The goal is not to turn everything on at once. The goal is to prove one layer at a time: hardware facts, Nix evaluation, boot baseline, observer honesty, overlay mode, compute mode, rollback, then hardening.
Current V1 Assumptions
These assumptions come from the current repo configuration and should be confirmed before the first live switch:
| Area | Current assumption |
|---|---|
| Host flake target | .#workstation |
| Hostname | dubnium-workstation |
| Boot default | desktop |
| Studio placement | local |
| studio-local representation | desktop overlay using studio-local-policy.service and audio-priority.service |
| vLLM lifecycle | compute-only in v1 |
| vLLM model | Qwen/Qwen2.5-Coder-14B-Instruct |
| Current GPU phase | planned 2 GPUs, currently present [ 0 ] |
| Display GPU | 0 |
| Compute GPUs | [ 0 ] until second GPU is present |
| k3s | disabled in current host config |
| Bootloader | systemd-boot with EFI variable access |
| Runtime state | /run/mode-controller |
Do not proceed to live transition testing until the hardware facts are confirmed against the actual host.
Phase 0: Safety and Ground Truth
Objective: know enough about the machine to avoid destructive or confusing changes.
0.1 Confirm Installation Path
Decide which path applies:
- existing NixOS machine: use nixos-rebuild build, then switch
- fresh install from live USB: use the fresh install runbook first
- non-NixOS current OS: do not use this plan directly until disk/install strategy is decided
Exit criteria:
- install path is explicit
- target disk and boot mode are known if fresh installing
- rollback access path is known
0.2 Confirm Remote/Recovery Access
Before switching system configuration:
ip addr
systemctl status sshd
Confirm:
- local keyboard/display access works
- SSH is enabled or a local console is available
- you know how to select an older NixOS generation at boot
- important local data is backed up
Failure mode to avoid:
- switching into a broken graphical/session state with no recovery path
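Knowing the rollback target by number makes the boot-menu choice less error-prone. The sketch below parses the output of nix-env --list-generations for the system profile. The helper names current_generation and previous_generation are hypothetical, and the "(current)" marker identifies the generation the system profile points at, which is not necessarily the one that was booted.

```sh
#!/bin/sh
# Hypothetical helpers: read `nix-env --list-generations` output on stdin.

# Print the generation number marked "(current)".
current_generation() {
  awk '/\(current\)/ { print $1 }'
}

# Print the generation listed immediately before the current one.
previous_generation() {
  awk '/\(current\)/ { print prev; exit } { prev = $1 }'
}

# Usage on the target host:
#   sudo nix-env --list-generations --profile /nix/var/nix/profiles/system | previous_generation
```

Write the number down before the first switch so it can be picked from the systemd-boot menu without guessing.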
0.3 Capture Hardware Facts
Run on the target host:
lspci -nn | grep -E 'VGA|3D|Audio|USB'
nvidia-smi
lsblk -f
findmnt
bootctl status
Record:
- actual GPU count
- which GPU drives display
- GPU PCI IDs
- NVIDIA driver visibility through nvidia-smi
- boot disk/filesystem layout
- EFI/systemd-boot status
- audio interface and whether REAPER/local studio is still needed on-host
Exit criteria:
- dubnium.hardware.presentGpus matches real visible GPUs
- dubnium.hardware.displayGpu matches the display path
- dubnium.hardware.computeGpus only references present GPUs
- bootloader assumptions match the host
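The GPU exit criteria can be checked mechanically. This is a minimal sketch, assuming the operator copies the configured index lists out of the host config by hand; count_visible_gpus and gpus_are_present are hypothetical helper names, not part of mode-tools.

```sh
#!/bin/sh
# Count GPUs from `nvidia-smi -L` output on stdin.
# Each GPU is listed on a line of the form "GPU 0: <name> (UUID: ...)".
count_visible_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}

# Succeed only if every index in $2 (configured) also appears in $1 (present).
# Both arguments are space-separated index lists, for example "0 1".
gpus_are_present() {
  local present=" $1 " idx
  for idx in $2; do
    case "$present" in
      *" $idx "*) ;;
      *) echo "GPU $idx is configured but not present" >&2; return 1 ;;
    esac
  done
}

# Usage on the target host (the second argument mirrors computeGpus):
#   nvidia-smi -L | count_visible_gpus
#   gpus_are_present "$(nvidia-smi --query-gpu=index --format=csv,noheader | tr '\n' ' ')" "0"
```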
0.4 Decide First Compute Profile
For first live validation, choose the least surprising compute profile:
- with one GPU: compute may terminate the desktop and use GPU 0
- with two GPUs: compute can target both GPUs, but only after single-GPU behavior is proven
- vLLM should stay compute-only
If VRAM is tight, add vLLM guardrails before compute testing:
dubnium.vllm.extraArgs = [
"--max-model-len" "8192"
"--gpu-memory-utilization" "0.70"
"--enforce-eager"
];
Do not add desktop AI in the first rollout.
0.5 Seed Local Model Bundle
Preferred path:
- copy the selected materialized model bundle from the Dubnium USB seed into /var/lib/dubnium/models
- keep model weights out of Git and out of the Nix store
See docs/runbooks/model-seeding.md for the exact operator flow.
Phase 1: Repo and Host Configuration Review
Objective: make the flake match the real system before any switch.
1.1 Generate Hardware Configuration
On the target NixOS machine:
sudo nixos-generate-config --dir ./hosts/workstation
Review:
- root filesystem and boot filesystem entries
- EFI mount point
- generated hardware imports
- NVIDIA-related hardware detection
Do not preserve the placeholder hardware file if it does not match the target.
1.2 Review Host Config
Inspect:
sed -n '1,220p' hosts/workstation/default.nix
Confirm or update:
- networking.hostName
- bootloader settings
- services.openssh.enable
- dubnium.capabilityPlacement.studio
- dubnium.vllm.enable
- dubnium.vllm.model
- dubnium.vllm.extraArgs
- dubnium.hardware.presentGpus
- dubnium.hardware.displayGpu
- dubnium.hardware.computeGpus
- dubnium.k3s.enable
Recommended first-system stance:
- keep boot.defaultMode = "desktop"
- keep enableDesktopProfile = false
- keep k3s.enable = false until mode control is proven
- keep computeGpus = [ 0 ] if only one GPU is currently installed
1.3 Confirm Module Assertions
The module already asserts:
- display GPU must be present
- desktop AI GPUs must be present
- compute GPUs must be present
- vLLM package and model must be set when vLLM is enabled
These assertions are useful. If they fail, fix the host facts rather than bypassing them.
Exit criteria:
- host config expresses real hardware, not planned hardware
- planned hardware is represented only in plannedGpuCount
- actual services enabled match the first rollout scope
Phase 2: Build Without Switching
Objective: prove Nix evaluation and build before mutating the live system.
Run:
sudo nixos-rebuild build --flake .#workstation
If it fails, classify the failure:
- hardware config mismatch
- unfree/NVIDIA package issue
- vLLM package evaluation issue
- missing module import
- syntax or option error
Do not run switch until build succeeds.
Useful follow-up checks:
nix flake check
nix build .#packages.x86_64-linux.mode-tools
Exit criteria:
- flake builds successfully
- mode-tools package builds
- no host option assertion is failing
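The build-before-switch rule can be wrapped so that the switch command is only ever suggested after a successful build. A minimal sketch: build_before_switch and the REBUILD_CMD override are hypothetical and exist only so the gate can be dry-run without touching the system.

```sh
#!/bin/sh
# Hypothetical gate: build the flake target and only then print the switch command.
# REBUILD_CMD is an assumed override for dry runs; it defaults to the real command.
build_before_switch() {
  local flake="${1:-.#workstation}"
  # Intentionally unquoted so "sudo nixos-rebuild" splits into two words.
  if ${REBUILD_CMD:-sudo nixos-rebuild} build --flake "$flake"; then
    echo "build ok; next step: sudo nixos-rebuild switch --flake $flake"
  else
    echo "build failed; classify the failure before any switch" >&2
    return 1
  fi
}

# Usage from the repo root:
#   build_before_switch .#workstation
```

The function never runs switch itself. Keeping the mutating step manual matches the plan's intent of proving one layer at a time.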
Phase 3: First Switch to Desktop Baseline
Objective: switch only into the safe desktop-default posture.
Run:
sudo nixos-rebuild switch --flake .#workstation
Immediately check:
hostname
mode status
mode current
mode desired
sudo ls -la /run/mode-controller
systemctl status desktop.target
systemctl status compute.target
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl status vllm.service
Expected:
- host boots or remains usable
- desired mode is desktop
- current mode is desktop, or a clearly explained non-desktop state
- vllm.service is inactive in desktop
- studio-local-policy.service is inactive
- audio-priority.service is inactive
- /run/mode-controller exists
If mode current reports compute or studio-local unexpectedly, stop and fix
observation before testing transitions.
Exit criteria:
- desktop baseline is usable
- mode CLI works
- observer output matches visible reality
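The expected unit states for the desktop baseline can be compared against systemd directly, independent of the mode CLI, which gives a second opinion on observer honesty. A minimal sketch using systemctl is-active; expect_unit_state and check_desktop_baseline are hypothetical helper names.

```sh
#!/bin/sh
# Compare a unit's `systemctl is-active` state with the expected state.
expect_unit_state() {
  local unit="$1" expected="$2" actual
  # is-active exits non-zero for inactive units, so only its output is used.
  actual="$(systemctl is-active "$unit" 2>/dev/null)"
  if [ "$actual" = "$expected" ]; then
    echo "ok   $unit is $actual"
  else
    echo "FAIL $unit is ${actual:-unknown}, expected $expected"
    return 1
  fi
}

# Check every expectation and report all mismatches rather than stopping at the first.
check_desktop_baseline() {
  local rc=0
  expect_unit_state vllm.service inactive || rc=1
  expect_unit_state studio-local-policy.service inactive || rc=1
  expect_unit_state audio-priority.service inactive || rc=1
  [ -d /run/mode-controller ] || { echo "FAIL /run/mode-controller is missing"; rc=1; }
  return "$rc"
}
```

A unit in the failed state reports "failed" rather than "inactive", so it is flagged instead of passing silently.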
Phase 4: Control-Plane Inspection Before Transitions
Objective: prove the controller can explain the system before it mutates the system.
Run:
mode status
mode current --refresh
mode current --json
mode explain desktop
mode explain studio-local
mode explain compute
sudo cat /run/mode-controller/capability-placement.json
sudo cat /run/mode-controller/hardware-topology.json
Check that the JSON/evidence shape is useful enough to diagnose:
- graphical session active or not
- studio policy active or not
- compute target active or not
- vLLM active or not
- last transition status
If mode current --json is too thin, harden observer output before running
compute transitions. The observer is the foundation of safe switching.
Exit criteria:
- status output distinguishes desired and current
- current state is derived from facts
- hardware and placement files match host configuration
Phase 5: Test desktop -> studio-local -> desktop
Objective: prove the low-risk overlay path before terminating the GUI for compute.
Run:
sudo mode request studio-local
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight
Expected:
- observed mode becomes studio-local
- studio-local-policy.service is active
- audio-priority.service is active
- interactive slice weights are raised
- AI/platform slice weights are lowered
- vLLM remains inactive
Return to desktop:
sudo mode request desktop
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight
Expected:
- observed mode becomes desktop
- overlay services are inactive
- slice weights return to baseline
Exit criteria:
- overlay activation and cleanup are repeatable
- observer accurately distinguishes desktop and studio-local
- failure records are useful if a command fails
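"Slice weights return to baseline" is easier to prove with snapshots than by eye. The sketch below records the same systemctl show properties used above and compares two snapshots. The helper names and the snapshot file paths in the usage comment are assumptions.

```sh
#!/bin/sh
# Print "<slice> <Property>=<value>" lines for the three slices in this plan.
snapshot_slice_weights() {
  local slice
  for slice in interactive.slice ai.slice platform.slice; do
    systemctl show "$slice" -p CPUWeight -p IOWeight | sed "s/^/$slice /"
  done
}

# Succeed if two snapshot files are byte-identical.
weights_match() {
  cmp -s "$1" "$2"
}

# Usage for one overlay round trip:
#   snapshot_slice_weights > /tmp/weights.before
#   sudo mode request studio-local
#   snapshot_slice_weights > /tmp/weights.overlay
#   sudo mode request desktop
#   snapshot_slice_weights > /tmp/weights.after
#   weights_match /tmp/weights.before /tmp/weights.after && echo "baseline restored"
```

The overlay snapshot should differ from the baseline. If it does not, the overlay services started without applying any policy.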
Phase 6: Precompute Guard Validation
Objective: test compute guards without trusting the full transition yet.
Before running a real compute transition:
mode status
systemctl status vllm.service
loginctl list-sessions
Manually confirm:
- no active REAPER project
- no live audio session you care about
- no long-running foreground job
- model store path has enough space
- vLLM model choice fits current GPU memory plan
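The model store space check in the list above can be scripted. A minimal sketch using POSIX df; has_free_gib is a hypothetical helper, and the 40 GiB threshold in the usage comment is an illustrative assumption, not a measured requirement for the selected model.

```sh
#!/bin/sh
# Succeed if the filesystem holding path $1 has at least $2 GiB available.
has_free_gib() {
  local path="$1" min_gib="$2" avail_kib
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kib="$(df -Pk "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')"
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $((min_gib * 1024 * 1024)) ]
}

# Usage (threshold is an assumption; size it to the actual model bundle):
#   has_free_gib /var/lib/dubnium/models 40 || echo "model store is low on space"
```

A missing path counts as a failure, which matches the rule that required persistence paths must exist before compute is allowed.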
Run or inspect guards if exposed through the CLI. If not yet exposed, use the
existing transition path cautiously and rely on last-guards.json.
Compute should be blocked when:
- audio is active
- graphical session is not terminable
- memory headroom is insufficient
- target is not reachable
- required persistence paths are missing
Exit criteria:
- you know which guards are hard blocks
- guard failures are visible in last-guards.json
- no guard silently assumes success
Phase 7: First desktop -> compute Transition
Objective: prove one real promotion into compute, accepting that the first attempt may reveal NVIDIA/session behavior.
Preconditions:
- desktop baseline has already been verified
- studio overlay path has already been verified
- no critical local work is running
- local console or SSH recovery is available
Run:
sudo mode request compute
Then inspect:
mode status
systemctl status compute.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
journalctl -u 'mode-controller@*' -b
journalctl -u vllm.service -b
Expected success:
- observed mode is compute
- graphical session is absent or non-authoritative
- compute.target is active
- vllm.service is active if enabled
- GPU process evidence matches compute expectations
- transition record says success
Acceptable first degraded outcomes:
- vLLM starts but only on reduced GPU profile
- residual display allocation remains below a documented threshold
- non-critical desktop unit remains active without resource conflict
Hard failures:
- observer cannot classify final state
- audio or GUI conflict remains
- GPU release is indeterminate
- vLLM fails repeatedly and prevents compute contract
- rollback cannot restore desktop
If the transition fails, do not keep retrying blindly. Diagnose the first failed predicate.
Phase 8: First compute -> desktop Return
Objective: prove rollback/restoration before treating compute as usable.
Run:
sudo mode request desktop
Then inspect:
mode status
systemctl status desktop.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi
Expected:
- observed mode is desktop
- vllm.service is inactive
- graphical session path is usable
- audio returns to ordinary desktop behavior
- no compute-only state remains authoritative
If desktop is only partially restored, classify the result as degraded and fix the observer/controller before more compute testing.
Exit criteria:
- one complete desktop -> compute -> desktop loop works or fails with a clear documented reason
- rollback is evidence-backed
Phase 9: Repeatability and Soak
Objective: distinguish a one-time success from a reliable operating model.
Repeat:
sudo mode request studio-local
sudo mode request desktop
sudo mode request compute
sudo mode request desktop
For each run, record:
- final mode status
- transition duration
- guard output
- whether GPU release was clean
- whether desktop restoration was clean
- whether vLLM startup was reliable
Minimum repeatability bar before broader usage:
- 3 clean studio overlay round trips
- 3 clean compute round trips
- no false-success observer classifications
- no unexplained stale locks
- no manual cleanup needed between runs
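The round trips and their durations can be recorded with a small loop. This sketch assumes that mode request blocks until the transition finishes and that its exit status reflects the outcome; neither is confirmed by this plan, so verify both before trusting the numbers. timed_request and soak_compute are hypothetical helper names.

```sh
#!/bin/sh
# Request a mode and print "<mode> <exit-status> <seconds>".
# Assumption: `mode request` is synchronous and exits non-zero on failure.
timed_request() {
  local target="$1" start end rc
  start="$(date +%s)"
  sudo mode request "$target"
  rc=$?
  end="$(date +%s)"
  echo "$target $rc $((end - start))"
  return "$rc"
}

# Run N compute round trips (default 3), stopping at the first failure
# so the failed state is left in place for diagnosis.
soak_compute() {
  local runs="${1:-3}" i=1
  while [ "$i" -le "$runs" ]; do
    timed_request compute || return 1
    timed_request desktop || return 1
    i=$((i + 1))
  done
}

# Usage:
#   soak_compute 3 | tee -a soak.log
```

Stopping at the first failure is deliberate: retrying past it would overwrite the evidence the plan asks to keep.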
Phase 10: Hardening Backlog
Only after the first transition loop is proven, prioritize hardening in this order:
- Richer observe-current --json evidence and conflicts.
- Persistent audit log at /var/lib/mode-controller/events.jsonl.
- Explicit GPU release predicate and thresholds.
- Degraded state classification for desktop and compute.
- Guard CLI surface such as mode guards <target>.
- vLLM runtime guardrails and model store persistence.
- k3s enablement and platform.slice policy.
- Optional impermanence and /persist mapping.
- Bounded desktop AI after second GPU and stable transitions.
- Specialisation evaluation only if runtime switching fails repeatedly.
Stop Conditions
Stop implementation and return to planning if any of these occur:
- the observer reports false success
- desktop cannot be restored through the controller
- GPU release is repeatedly indeterminate
- target isolation stops recovery-critical services
- vLLM causes repeated OOM or driver instability
- failures require undocumented manual cleanup
The correct response to any stop condition is not more automation. First improve observation, logs, predicates, and rollback.
Evidence to Keep
For each major milestone, keep the following:
mode status
mode current --json
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
systemctl status desktop.target compute.target vllm.service
nvidia-smi
journalctl -u 'mode-controller@*' -b
For repeated failures, copy the relevant evidence into an issue, planning note, or future runbook update before changing more code.
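The evidence commands above can be captured in one step so that each milestone leaves a timestamped directory behind. A minimal sketch: capture_evidence, the ./evidence default, and the output file names are assumptions, while the commands themselves are the ones listed in this section.

```sh
#!/bin/sh
# Hypothetical helper: save milestone evidence under <base>/<timestamp>/.
capture_evidence() {
  local dir
  dir="${1:-./evidence}/$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$dir" || return 1
  # stdout goes to one file per command; stderr is collected separately so
  # the JSON files stay parseable. Failures are tolerated on purpose: a
  # missing file or failing command is itself evidence.
  mode status         > "$dir/mode-status.txt"   2>> "$dir/errors.log" || true
  mode current --json > "$dir/mode-current.json" 2>> "$dir/errors.log" || true
  sudo cat /run/mode-controller/last-transition.json \
                      > "$dir/last-transition.json" 2>> "$dir/errors.log" || true
  sudo cat /run/mode-controller/last-guards.json \
                      > "$dir/last-guards.json"  2>> "$dir/errors.log" || true
  systemctl status desktop.target compute.target vllm.service \
                      > "$dir/units.txt"         2>> "$dir/errors.log" || true
  nvidia-smi          > "$dir/nvidia-smi.txt"    2>> "$dir/errors.log" || true
  journalctl -u 'mode-controller@*' -b \
                      > "$dir/journal.txt"       2>> "$dir/errors.log" || true
  echo "$dir"
}

# Usage:
#   capture_evidence ./evidence
```

The function prints the directory it created, which makes it easy to attach the path to an issue or planning note.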