System Implementation Plan

Status: living plan

This plan is for implementing Dubnium on the actual workstation host. It expands the short bring-up checklist into a cautious, evidence-driven rollout. The goal is not to turn everything on at once. The goal is to prove one layer at a time: hardware facts, Nix evaluation, boot baseline, observer honesty, overlay mode, compute mode, rollback, then hardening.

Current V1 Assumptions

These assumptions come from the current repo configuration and should be confirmed before the first live switch:

| Area | Current assumption |
| --- | --- |
| Host flake target | .#workstation |
| Hostname | dubnium-workstation |
| Boot default | desktop |
| Studio placement | local |
| studio-local representation | desktop overlay using studio-local-policy.service and audio-priority.service |
| vLLM lifecycle | compute-only in v1 |
| vLLM model | Qwen/Qwen2.5-Coder-14B-Instruct |
| Current GPU phase | planned 2 GPUs, currently present [ 0 ] |
| Display GPU | 0 |
| Compute GPUs | [ 0 ] until second GPU is present |
| k3s | disabled in current host config |
| Bootloader | systemd-boot with EFI variable access |
| Runtime state | /run/mode-controller |

Do not proceed to live transition testing until the hardware facts are confirmed against the actual host.

Phase 0: Safety and Ground Truth

Objective: know enough about the machine to avoid destructive or confusing changes.

0.1 Confirm Installation Path

Decide which path applies:

  • existing NixOS machine: use nixos-rebuild build then switch
  • fresh install from live USB: use the fresh install runbook first
  • non-NixOS current OS: do not use this plan directly until disk/install strategy is decided

Exit criteria:

  • install path is explicit
  • target disk and boot mode are known if fresh installing
  • rollback access path is known
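
The path decision above can be partly backed by facts from the running system. The following is a minimal sketch, assuming a standard os-release file (NixOS sets ID=nixos there) and that UEFI boot shows up as the directory /sys/firmware/efi. The function names are illustrative and are not part of the repo. A NixOS live USB also reports ID=nixos, so this cannot tell an installed system from the installer.

```shell
# Print the ID field of an os-release style file (default: /etc/os-release).
os_id() (
  file="${1:-/etc/os-release}"
  # Strip the key and any surrounding quotes from the value.
  sed -n 's/^ID=//p' "$file" 2>/dev/null | tr -d '"' | head -n 1
)

# Print "uefi" when the firmware directory exists, otherwise "bios".
boot_mode() (
  efi_dir="${1:-/sys/firmware/efi}"
  if [ -d "$efi_dir" ]; then echo uefi; else echo bios; fi
)

# Map the detected OS onto the install paths listed in this plan.
install_path() (
  case "$(os_id "$1")" in
    nixos) echo "nixos (installed system or live USB)" ;;
    *)     echo "not-nixos: decide disk/install strategy first" ;;
  esac
)
```

Running install_path and boot_mode with no arguments on the target host gives two facts to record next to the chosen path.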

0.2 Confirm Remote/Recovery Access

Before switching system configuration:

ip addr
systemctl status sshd

Confirm:

  • local keyboard/display access works
  • SSH is enabled or a local console is available
  • you know how to select an older NixOS generation at boot
  • important local data is backed up

Failure mode to avoid:

  • switching into a broken graphical/session state with no recovery path
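
One way to make "you know how to select an older NixOS generation" concrete is to confirm that a generation exists to fall back to. This sketch assumes the standard NixOS profile layout, where each system generation is a link named /nix/var/nix/profiles/system-<N>-link. The function names are hypothetical.

```shell
# Count system generations by their profile links.
count_generations() (
  dir="${1:-/nix/var/nix/profiles}"
  n=0
  for link in "$dir"/system-*-link; do
    # An unmatched glob stays literal, so confirm the entry really exists.
    if [ -e "$link" ] || [ -L "$link" ]; then n=$((n + 1)); fi
  done
  echo "$n"
)

# Before a switch, the running generation becomes the rollback target,
# so at least one generation must already exist.
rollback_ready() (
  [ "$(count_generations "$1")" -ge 1 ]
)
```

For a readable list with dates, sudo nix-env --list-generations --profile /nix/var/nix/profiles/system prints the same generations.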

0.3 Capture Hardware Facts

Run on the target host:

lspci -nn | grep -E 'VGA|3D|Audio|USB'
nvidia-smi
lsblk -f
findmnt
bootctl status

Record:

  • actual GPU count
  • which GPU drives display
  • GPU PCI IDs
  • NVIDIA driver visibility through nvidia-smi
  • boot disk/filesystem layout
  • EFI/systemd-boot status
  • audio interface and whether REAPER/local studio is still needed on-host

Exit criteria:

  • dubnium.hardware.presentGpus matches real visible GPUs
  • dubnium.hardware.displayGpu matches the display path
  • dubnium.hardware.computeGpus only references present GPUs
  • bootloader assumptions match the host
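
The first three exit criteria can be checked mechanically once the configured values are copied out of the host config by hand. This is a sketch only. It assumes the dubnium.hardware.* options use the same GPU numbering as nvidia-smi, which should be confirmed, and the function names are illustrative.

```shell
# Turn newline-separated GPU indices on stdin into a sorted, space-separated list.
normalize_indices() (
  tr -d ' \r' | grep -E '^[0-9]+$' | sort -n | tr '\n' ' ' | sed 's/ *$//'
)

# Succeed when every index in $1 also appears in $2 (both space-separated).
is_subset() (
  for i in $1; do
    case " $2 " in
      *" $i "*) ;;
      *) exit 1 ;;
    esac
  done
  exit 0
)

# $1: configured presentGpus, $2: configured computeGpus, $3: configured displayGpu
# stdin: output of `nvidia-smi --query-gpu=index --format=csv,noheader`
check_gpu_facts() (
  visible="$(normalize_indices)"
  status=0
  [ "$visible" = "$1" ] || { echo "presentGpus '$1' != visible '$visible'"; status=1; }
  is_subset "$2" "$1" || { echo "computeGpus '$2' not within presentGpus '$1'"; status=1; }
  is_subset "$3" "$1" || { echo "displayGpu '$3' not within presentGpus '$1'"; status=1; }
  exit "$status"
)
```

With the current single-GPU assumption, the call would look like: nvidia-smi --query-gpu=index --format=csv,noheader | check_gpu_facts "0" "0" "0". Which GPU actually drives the display still has to be confirmed by eye.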

0.4 Decide First Compute Profile

For first live validation, choose the least surprising compute profile:

  • with one GPU: compute may terminate the desktop and use GPU 0
  • with two GPUs: compute can target both GPUs, but only after single-GPU behavior is proven
  • vLLM should stay compute-only

If VRAM is tight, add vLLM guardrails before compute testing:

dubnium.vllm.extraArgs = [
  "--max-model-len" "8192"
  "--gpu-memory-utilization" "0.70"
  "--enforce-eager"
];
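
Whether VRAM is tight can be estimated with rough arithmetic before any compute test. The sketch below assumes about 14 billion parameters (taken from the model name) stored at 2 bytes each, and uses a 24576 MiB GPU purely as a hypothetical example, since this plan does not state the card's memory. It counts weights only and ignores KV cache and activations, so it is a lower bound. GB and MiB are mixed loosely on purpose.

```shell
# Approximate weight memory in GB: parameters (billions) x bytes per parameter.
weights_gb() ( echo $(( $1 * $2 )) )

# Memory vLLM may claim on one GPU, in MB: total MB x utilization percent.
budget_mb() ( echo $(( $1 * $2 / 100 )) )

# Succeed when the weights alone fit within the combined budget of N GPUs.
# $1 params (billions), $2 bytes/param, $3 per-GPU MB, $4 utilization %, $5 GPU count
weights_fit() (
  need_mb=$(( $(weights_gb "$1" "$2") * 1000 ))
  have_mb=$(( $(budget_mb "$3" "$4") * $5 ))
  [ "$need_mb" -le "$have_mb" ]
)

weights_gb 14 2        # prints 28
budget_mb 24576 70     # prints 17203
```

Under these assumed numbers, weights_fit 14 2 24576 70 1 fails and weights_fit 14 2 24576 70 2 succeeds. If the real card gives a similar result, the GPU memory plan deserves a second look before Phase 7.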

Do not add desktop AI in the first rollout.

0.5 Seed Local Model Bundle

Preferred path:

  • copy the selected materialized model bundle from the Dubnium USB seed into /var/lib/dubnium/models
  • keep model weights out of Git and out of the Nix store

See docs/runbooks/model-seeding.md for the exact operator flow.

Phase 1: Repo and Host Configuration Review

Objective: make the flake match the real system before any switch.

1.1 Generate Hardware Configuration

On the target NixOS machine:

sudo nixos-generate-config --dir ./hosts/workstation

Review:

  • root filesystem and boot filesystem entries
  • EFI mount point
  • generated hardware imports
  • NVIDIA-related hardware detection

Do not preserve the placeholder hardware file if it does not match the target.

1.2 Review Host Config

Inspect:

sed -n '1,220p' hosts/workstation/default.nix

Confirm or update:

  • networking.hostName
  • bootloader settings
  • services.openssh.enable
  • dubnium.capabilityPlacement.studio
  • dubnium.vllm.enable
  • dubnium.vllm.model
  • dubnium.vllm.extraArgs
  • dubnium.hardware.presentGpus
  • dubnium.hardware.displayGpu
  • dubnium.hardware.computeGpus
  • dubnium.k3s.enable

Recommended first-system stance:

  • keep boot.defaultMode = "desktop"
  • keep enableDesktopProfile = false
  • keep k3s.enable = false until mode control is proven
  • keep computeGpus = [ 0 ] if only one GPU is currently installed

1.3 Confirm Module Assertions

The module already asserts:

  • display GPU must be present
  • desktop AI GPUs must be present
  • compute GPUs must be present
  • vLLM package and model must be set when vLLM is enabled

These assertions are useful. If they fail, fix the host facts rather than bypassing them.

Exit criteria:

  • host config expresses real hardware, not planned hardware
  • planned hardware is represented only in plannedGpuCount
  • actual services enabled match the first rollout scope

Phase 2: Build Without Switching

Objective: prove Nix evaluation and build before mutating the live system.

Run:

sudo nixos-rebuild build --flake .#workstation

If it fails, classify the failure:

  • hardware config mismatch
  • unfree/NVIDIA package issue
  • vLLM package evaluation issue
  • missing module import
  • syntax or option error

Do not run switch until build succeeds.

Useful follow-up checks:

nix flake check
nix build .#packages.x86_64-linux.mode-tools

Exit criteria:

  • flake builds successfully
  • mode-tools package builds
  • no host option assertion is failing

Phase 3: First Switch to Desktop Baseline

Objective: switch only into the safe desktop-default posture.

Run:

sudo nixos-rebuild switch --flake .#workstation

Immediately check:

hostname
mode status
mode current
mode desired
sudo ls -la /run/mode-controller
systemctl status desktop.target
systemctl status compute.target
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl status vllm.service

Expected:

  • host boots or remains usable
  • desired mode is desktop
  • current mode is desktop, or a clearly explained non-desktop state
  • vllm.service is inactive in desktop
  • studio-local-policy.service is inactive
  • audio-priority.service is inactive
  • /run/mode-controller exists

If mode current reports compute or studio-local unexpectedly, stop and fix observation before testing transitions.

Exit criteria:

  • desktop baseline is usable
  • mode CLI works
  • observer output matches visible reality
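
The unit expectations above can be checked in one pass rather than read off several status screens. This is a minimal sketch built on systemctl is-active. The function names are illustrative, and the directory test may need root if /run/mode-controller is not world-readable.

```shell
# Report whether a unit's actual state matches the expected one.
# Returns non-zero on mismatch so callers can count failures.
expect_state() (
  unit="$1"; expected="$2"; actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "ok   $unit is $actual"
  else
    echo "FAIL $unit is $actual, expected $expected"
    exit 1
  fi
)

# Check the units this phase expects to be inactive in the desktop baseline.
check_desktop_baseline() (
  failures=0
  for unit in vllm.service studio-local-policy.service audio-priority.service; do
    # `systemctl is-active` prints the state and exits non-zero when not active.
    actual="$(systemctl is-active "$unit" 2>/dev/null)" || true
    expect_state "$unit" inactive "$actual" || failures=$((failures + 1))
  done
  [ -d /run/mode-controller ] || { echo "FAIL /run/mode-controller missing"; failures=$((failures + 1)); }
  [ "$failures" -eq 0 ]
)
```

A unit in the failed state is reported as FAIL here, which is intended: it is not the clean inactive state the baseline expects. The mode CLI output still has to be compared against visible reality by hand.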

Phase 4: Control-Plane Inspection Before Transitions

Objective: prove the controller can explain the system before it mutates the system.

Run:

mode status
mode current --refresh
mode current --json
mode explain desktop
mode explain studio-local
mode explain compute
sudo cat /run/mode-controller/capability-placement.json
sudo cat /run/mode-controller/hardware-topology.json

Check that the JSON/evidence shape is useful enough to diagnose:

  • graphical session active or not
  • studio policy active or not
  • compute target active or not
  • vLLM active or not
  • last transition status

If mode current --json is too thin, harden observer output before running compute transitions. The observer is the foundation of safe switching.

Exit criteria:

  • status output distinguishes desired and current
  • current state is derived from facts
  • hardware and placement files match host configuration

Phase 5: Test desktop -> studio-local -> desktop

Objective: prove the low-risk overlay path before terminating the GUI for compute.

Run:

sudo mode request studio-local
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight

Expected:

  • observed mode becomes studio-local
  • studio-local-policy.service is active
  • audio-priority.service is active
  • interactive slice weights are raised
  • AI/platform slice weights are lowered
  • vLLM remains inactive

Return to desktop:

sudo mode request desktop
mode status
systemctl status studio-local-policy.service
systemctl status audio-priority.service
systemctl show interactive.slice -p CPUWeight -p IOWeight
systemctl show ai.slice -p CPUWeight -p IOWeight
systemctl show platform.slice -p CPUWeight -p IOWeight

Expected:

  • observed mode becomes desktop
  • overlay services are inactive
  • slice weights return to baseline

Exit criteria:

  • overlay activation and cleanup are repeatable
  • observer accurately distinguishes desktop and studio-local
  • failure records are useful if a command fails
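
"Slice weights return to baseline" is easier to prove with saved snapshots than by eye. The sketch below parses the KEY=VALUE lines that systemctl show prints and compares two snapshot files. The function names and file names are hypothetical.

```shell
# Extract one property value from `systemctl show` output on stdin.
get_prop() (
  sed -n "s/^$1=//p" | head -n 1
)

# Print "slice CPUWeight IOWeight" for each slice named in this phase.
snapshot_weights() (
  for slice in interactive.slice ai.slice platform.slice; do
    out="$(systemctl show "$slice" -p CPUWeight -p IOWeight)"
    cpu="$(printf '%s\n' "$out" | get_prop CPUWeight)"
    io="$(printf '%s\n' "$out" | get_prop IOWeight)"
    echo "$slice $cpu $io"
  done
)

# Succeed when two snapshot files are identical; print the differences otherwise.
same_weights() (
  diff "$1" "$2"
)
```

A possible flow: snapshot_weights > before.txt while in desktop, snapshot_weights > overlay.txt after requesting studio-local, snapshot_weights > after.txt after returning. Then same_weights before.txt after.txt should succeed, and same_weights before.txt overlay.txt should show the raised and lowered weights.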

Phase 6: Precompute Guard Validation

Objective: test compute guards without trusting the full transition yet.

Before running a real compute transition:

mode status
systemctl status vllm.service
loginctl list-sessions

Manually confirm:

  • no active REAPER project
  • no live audio session you care about
  • no long-running foreground job
  • model store path has enough space
  • vLLM model choice fits current GPU memory plan
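
The model store space item can be checked with df instead of estimated. A minimal sketch, assuming the store lives at /var/lib/dubnium/models as described in the seeding step. The function names and the size threshold in the usage line are hypothetical.

```shell
# Print available space in KB for the filesystem holding a path.
free_kb() (
  # -P forces one line per filesystem, so field 4 is always "Available".
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
)

# Succeed when the path has at least the required number of GB free.
# $1: path, $2: required GB
has_free_gb() (
  need_kb=$(( $2 * 1024 * 1024 ))
  [ "$(free_kb "$1")" -ge "$need_kb" ]
)
```

Example use: has_free_gb /var/lib/dubnium/models 40 || echo "model store is short on space". The 40 is a placeholder; the real figure depends on the seeded bundle.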

Run or inspect the guards directly if the CLI exposes them. If it does not yet, use the existing transition path cautiously and rely on /run/mode-controller/last-guards.json for the guard results.

Compute should be blocked when:

  • audio is active
  • graphical session is not terminable
  • memory headroom is insufficient
  • target is not reachable
  • required persistence paths are missing

Exit criteria:

  • you know which guards are hard blocks
  • guard failures are visible in last-guards.json
  • no guard silently assumes success

Phase 7: First desktop -> compute Transition

Objective: prove one real promotion into compute, accepting that the first attempt may reveal NVIDIA/session behavior.

Preconditions:

  • desktop baseline has already been verified
  • studio overlay path has already been verified
  • no critical local work is running
  • local console or SSH recovery is available

Run:

sudo mode request compute

Then inspect:

mode status
systemctl status compute.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
journalctl -u 'mode-controller@*' -b
journalctl -u vllm.service -b

Expected success:

  • observed mode is compute
  • graphical session is absent or non-authoritative
  • compute.target is active
  • vllm.service is active if enabled
  • GPU process evidence matches compute expectations
  • transition record says success

Acceptable first degraded outcomes:

  • vLLM starts, but only on a reduced GPU profile
  • residual display allocation remains below a documented threshold
  • a non-critical desktop unit remains active without resource conflict

Hard failures:

  • observer cannot classify final state
  • audio or GUI conflict remains
  • GPU release is indeterminate
  • vLLM fails repeatedly, so the compute contract cannot be met
  • rollback cannot restore desktop

If the transition fails, do not keep retrying blindly. Diagnose the first failed predicate.

Phase 8: First compute -> desktop Return

Objective: prove rollback/restoration before treating compute as usable.

Run:

sudo mode request desktop

Then inspect:

mode status
systemctl status desktop.target
systemctl status vllm.service
loginctl list-sessions
nvidia-smi

Expected:

  • observed mode is desktop
  • vllm.service is inactive
  • graphical session path is usable
  • audio returns to ordinary desktop behavior
  • no compute-only state remains authoritative

If desktop is only partially restored, classify the result as degraded and fix the observer/controller before more compute testing.

Exit criteria:

  • one complete desktop -> compute -> desktop loop works or fails with a clear documented reason
  • rollback is evidence-backed
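
For the return path, "no compute-only state remains" can be backed by asking the driver which compute processes still hold the GPU. This sketch assumes that nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader prints nothing when no compute process exists. It is a stand-in, not the explicit release predicate listed in the hardening backlog, and desktop applications that use CUDA would also be counted.

```shell
# Count compute processes from the nvidia-smi query output on stdin.
# Blank lines are ignored.
count_compute_apps() (
  # grep -c exits non-zero on a zero count, so mask that status.
  grep -c '[^[:space:]]' || true
)

# Succeed when no compute process is left on the GPUs.
gpu_released() (
  [ "$(count_compute_apps)" -eq 0 ]
)
```

Example use after returning to desktop: nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | gpu_released || echo "compute processes still present".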

Phase 9: Repeatability and Soak

Objective: distinguish a one-time success from a reliable operating model.

Repeat:

sudo mode request studio-local
sudo mode request desktop
sudo mode request compute
sudo mode request desktop

For each run, record:

  • final mode status
  • transition duration
  • guard output
  • whether GPU release was clean
  • whether desktop restoration was clean
  • whether vLLM startup was reliable

Minimum repeatability bar before broader usage:

  • 3 clean studio overlay round trips
  • 3 clean compute round trips
  • no false-success observer classifications
  • no unexplained stale locks
  • no manual cleanup needed between runs
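
Transition duration and final status are easier to compare across runs when they are recorded the same way each time. The sketch below times each step and stops at the first failure rather than retrying. It assumes that mode request blocks until the transition finishes and exits non-zero on failure, which this plan does not state and should be verified first. The function names are hypothetical.

```shell
# Run a command, then print "label exit=<status> seconds=<elapsed>".
# Returns the command's own exit status.
timed() (
  label="$1"; shift
  start="$(date +%s)"
  "$@"
  status=$?
  end="$(date +%s)"
  echo "$label exit=$status seconds=$((end - start))"
  exit "$status"
)

# Run the steps in order and stop at the first failure.
# Each step is one quoted string, e.g. "sudo mode request compute".
run_sequence() (
  for step in "$@"; do
    # Word-splitting on $step is intentional: the string holds a whole command.
    timed "$step" $step || { echo "stopped at: $step"; exit 1; }
  done
)
```

One round trip could then be logged with: run_sequence "sudo mode request studio-local" "sudo mode request desktop" "sudo mode request compute" "sudo mode request desktop" | tee -a roundtrips.log. Guard output and GPU release still need to be recorded separately for each run.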

Phase 10: Hardening Backlog

Only after the first transition loop is proven, prioritize hardening in this order:

  1. Richer observe-current --json evidence and conflicts.
  2. Persistent audit log at /var/lib/mode-controller/events.jsonl.
  3. Explicit GPU release predicate and thresholds.
  4. Degraded state classification for desktop and compute.
  5. Guard CLI surface such as mode guards <target>.
  6. vLLM runtime guardrails and model store persistence.
  7. k3s enablement and platform.slice policy.
  8. Optional impermanence and /persist mapping.
  9. Bounded desktop AI after second GPU and stable transitions.
  10. Specialisation evaluation only if runtime switching fails repeatedly.

Stop Conditions

Stop implementation and return to planning if any of these occur:

  • the observer reports false success
  • desktop cannot be restored through the controller
  • GPU release is repeatedly indeterminate
  • target isolation stops recovery-critical services
  • vLLM causes repeated OOM or driver instability
  • failures require undocumented manual cleanup

The correct response to any stop condition is not more automation. First improve observation, logs, predicates, and rollback.

Evidence to Keep

For each major milestone, keep the following:

mode status
mode current --json
sudo cat /run/mode-controller/last-transition.json
sudo cat /run/mode-controller/last-guards.json
systemctl status desktop.target compute.target vllm.service
nvidia-smi
journalctl -u 'mode-controller@*' -b

For repeated failures, copy the relevant evidence into an issue, planning note, or future runbook update before changing more code.
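
Collecting the evidence set by hand is easy to skip under pressure, so a small capture helper may help. This is a sketch, not part of the repo: the function names, the ./evidence default, and the file names are all illustrative. A failing command is recorded with its exit status instead of aborting the capture.

```shell
# Run a command and save its combined output under $EVIDENCE_DIR/<name>.txt.
capture() (
  name="$1"; shift
  out="$EVIDENCE_DIR/$name.txt"
  {
    printf '%s\n' "\$ $*"
    "$@" 2>&1
    echo "[exit $?]"
  } > "$out"
)

# Collect the evidence set listed in this section into a timestamped directory.
capture_milestone() (
  EVIDENCE_DIR="${1:-./evidence}/$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$EVIDENCE_DIR"
  capture mode-status        mode status
  capture mode-current       mode current --json
  capture last-transition    sudo cat /run/mode-controller/last-transition.json
  capture last-guards        sudo cat /run/mode-controller/last-guards.json
  capture units              systemctl status desktop.target compute.target vllm.service
  capture nvidia-smi         nvidia-smi
  capture controller-journal journalctl -u 'mode-controller@*' -b
  echo "$EVIDENCE_DIR"
)
```

Calling capture_milestone after each phase prints the directory it wrote, which can then be attached to an issue or planning note.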