Rolling Implementation Design
Status: living draft
This file captures the current implementation design for Dubnium as a rolling reference. It should be updated as hardware facts, control-plane contracts, and mode-transition behavior are validated on the real host.
Documentation framework:
- architecture docs live under docs/architecture/
- accepted decisions live under docs/decisions/
- operator procedures live under docs/runbooks/
- this file remains the rolling synthesis, gap register, and implementation backlog
Architecture Summary
Dubnium is a NixOS host that must behave as one physical machine with multiple operational contracts:
- desktop: normal Hyprland workstation/dev mode. GUI and ordinary audio are active. The display GPU is protected. AI is off or tightly bounded in v1.
- studio-local: conditional low-latency audio profile. It is a policy overlay on desktop, not the center of the architecture. If studio/audio moves to a Mac mini, the host-local state machine should still make sense.
- compute: headless throughput mode. GUI is absent or non-authoritative. vLLM and platform workloads may use more of the machine, including both GPUs when present.
The key design rule is that desired state and current state are different things:
- Desired state is operator or automation intent, written under /run/mode-controller.
- Current state is observation-derived from runtime facts, not copied from desired state.
- A reconciler moves the system toward desired state through guarded transitions.
- systemd targets, services, and slices are the enforcement layer.
- Transitions must be bounded, logged, idempotent, and able to report blocked, degraded, or failed outcomes explicitly.
The normative source is the Dubnium control-plane specification. Desired state
is authoritative intent, current state is observer output, no transition runs
without a lock, and success requires post-action re-observation. The local docs
and current repo scaffold already align with the main direction: runtime
switching first, no specialisations yet, desktop.target and compute.target
as first-class targets, studio-local as a desktop overlay, vLLM compute-only
in v1, and k3s stable across modes.
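
A minimal shell sketch of the rule that no transition runs without a lock and that success requires post-action re-observation. The function names (acquire_lock, reconcile_once), the mkdir-based lock, the file name desired, and the reuse of exit codes 10 and 20 are illustrative assumptions, not the scaffold's actual interface. apply_fn and observe_fn stand in for the real systemd actions and the observer.

```shell
# Sketch only: names and paths below are hypothetical stand-ins.
STATE_DIR="${STATE_DIR:-/run/mode-controller}"

# mkdir is atomic, so exactly one caller can create the lock directory.
acquire_lock() { mkdir "$STATE_DIR/lock" 2>/dev/null; }
release_lock() { rmdir "$STATE_DIR/lock" 2>/dev/null || true; }

# reconcile_once <apply_fn> <observe_fn>
# Prints one of: noop | reached | failed-transition | locked | no-desired
reconcile_once() {
  local apply_fn="$1" observe_fn="$2" desired observed
  desired="$(cat "$STATE_DIR/desired" 2>/dev/null)" || { echo "no-desired"; return 20; }

  # Idempotence: nothing to do when observation already matches intent.
  if [ "$("$observe_fn")" = "$desired" ]; then echo "noop"; return 0; fi

  # No transition runs without the lock.
  acquire_lock || { echo "locked"; return 10; }

  # The action's own exit status is deliberately ignored here:
  # only re-observation decides whether the target was reached.
  "$apply_fn" "$desired" || true
  observed="$("$observe_fn")"
  release_lock

  if [ "$observed" = "$desired" ]; then echo "reached"; return 0; fi
  echo "failed-transition"; return 20
}
```

The design choice worth noting is that an apply step that exits 0 but leaves the observed state unchanged is still reported as failed-transition.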
Gaps / Risks
The goal is to keep this section operational. Items should either be resolved for v1, converted into implementation work, or left as explicit open questions with an owner before the first live build.
Contradictions to Resolve
Resolved for v1:
| Topic | Decision | Follow-up |
|---|---|---|
| studio-local.target vs overlay | Do not create a first-class studio-local.target in v1. Use studio-local-policy.service and audio-priority.service as a desktop overlay. | Update older checklist wording when touching that file. |
| Root-on-RAM / impermanence | Defer Root-on-RAM, /persist, Home Manager, sops-nix, and impermanence until the base bootable control loop works. | Keep persistent path design compatible with adding /persist later. |
| modectl vs mode | Keep the local command name mode. | Treat modectl in upstream notes as an older name unless a rename is explicitly requested. |
| Desktop AI vs compute-only vLLM | Keep vLLM compute-only in v1. | Revisit bounded desktop AI only after reliable desktop <-> compute transitions. |
| Maintenance mode | Do not implement maintenance mode in the first milestone. | Reserve state names and avoid enum designs that make maintenance hard to add later. |
Open compatibility item:
- Desired/current state format remains plain text in the current scaffold. This
is acceptable for the first bootable milestone only if transition records
carry structured metadata. The next hardening pass should move toward
desired.json and current.json, or explicitly document why the plain-text files remain the stable interface.
Missing Decisions
Resolved for v1:
| Decision | V1 stance |
|---|---|
| Authority model | Require privileged transition execution. The initial operator path is sudo mode request <mode> or root-owned mode-controller@.service. Unprivileged users must not be able to forge desired/current state or transition success. |
| Reboot policy | Boot normalizes to desktop. Do not replay last desired mode across reboot in v1. |
| vLLM service shape | Use one vllm.service, compute-only. Keep the controller and options shaped so vllm@compute.service can replace it later. |
| k3s lifecycle | Keep k3s.service stable across modes in v1. Express mode pressure through platform.slice budgets before adding start/stop behavior. |
Still open before live compute testing:
| Open item | Concrete next step |
|---|---|
| GPU release predicate | Define a target-host predicate using loginctl, compositor absence, nvidia-smi process evidence, and an acceptable residual VRAM threshold. Record both pass and indeterminate outcomes. |
| Degraded thresholds | Define degraded-compute as safe but incomplete compute operation, such as vLLM active on a reduced GPU profile or residual non-critical display allocation below the configured threshold. Define failed-transition for unsafe, conflicting, or unclassified post-action states. |
| Persistent audit location | Choose /var/lib/mode-controller/events.jsonl now, with an option to move it under /persist/var/lib/mode-controller/events.jsonl when impermanence lands. |
| k3s compute policy | Decide whether v1 only changes platform.slice weights or also applies k3s labels/taints for workload intensity. Do not do both until there is a real workload that needs it. |
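
One way the GPU release predicate could be shaped, as a sketch under stated assumptions. The function name gpu_release_verdict, its argument order, and the mapping of verdicts to exit codes 0, 10, and 20 are hypothetical. The caller is assumed to have already gathered the evidence (session count from loginctl, compositor presence, residual display-GPU VRAM in MiB from nvidia-smi), so the predicate itself stays pure and testable.

```shell
# gpu_release_verdict <graphical_sessions> <compositor_running:0|1> <residual_mib> <threshold_mib>
# Prints pass | blocked | indeterminate (hypothetical verdict names).
gpu_release_verdict() {
  local sessions="$1" compositor="$2" residual="$3" threshold="$4" v

  # Missing or non-numeric evidence is indeterminate, never a pass.
  for v in "$sessions" "$compositor" "$residual" "$threshold"; do
    case "$v" in
      ''|*[!0-9]*) echo "indeterminate"; return 20 ;;
    esac
  done

  # A live session or compositor means the display GPU is still owned.
  if [ "$sessions" -gt 0 ] || [ "$compositor" -ne 0 ]; then
    echo "blocked"; return 10
  fi

  # Residual allocation above the configured threshold also blocks.
  if [ "$residual" -gt "$threshold" ]; then
    echo "blocked"; return 10
  fi

  echo "pass"; return 0
}
```

For example, `gpu_release_verdict 0 0 unknown 64` reports indeterminate rather than guessing, which matches the requirement to record indeterminate outcomes alongside passes.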
Risky Assumptions
| Risk | Failure mode | Mitigation |
|---|---|---|
| NVIDIA/Wayland GPU release is sticky | Compute promotion terminates the GUI but leaves display GPU allocations or ambiguous CUDA/display ownership. | Treat GPU release as an observation predicate, not an assumption. Add bounded timeout, residual threshold, and escalation criteria for specialization/reboot-mediated compute. |
| systemctl isolate compute.target stops too much | Important baseline services disappear because target dependencies are incomplete. | Keep compute.target minimal and explicitly list required base services. Test with systemctl list-dependencies compute.target before live switching. |
| Shell observer misclassifies mixed states | Status reports compute while GUI, audio, or conflicting services are still active. | Prefer unknown, transitioning, degraded-*, or failed-transition over false success. Add JSON evidence output and snapshot tests. |
| Rollback does not restore a usable desktop | desktop.target starts but graphical session/audio/display remain broken. | Make rollback success require post-rollback observation, not just successful systemctl commands. Record degraded desktop if partially restored. |
| /run loses state on reboot | Recent desired/current files disappear and audit history is lost. | Keep live lock/current/desired in /run; write transition history to /var/lib/mode-controller/events.jsonl before introducing impermanence. |
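
A sketch of how transition history might be appended as one JSON object per line. The helper names (json_escape, append_event) and the record fields (ts, from, to, outcome, reason) are assumptions for illustration; the document has not fixed an event schema. The escaping covers only backslashes and double quotes, so it assumes field values contain no newlines or control characters.

```shell
# Escape backslashes and double quotes so a value is safe inside a JSON string.
# Assumption: values contain no newlines or other control characters.
json_escape() { printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'; }

# append_event <file> <from> <to> <outcome> <reason>
# Appends a single JSON line; field names are hypothetical.
append_event() {
  local file="$1" from="$2" to="$3" outcome="$4" reason="$5"
  mkdir -p "$(dirname "$file")" || return 20
  printf '{"ts":"%s","from":"%s","to":"%s","outcome":"%s","reason":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$from")" "$(json_escape "$to")" \
    "$(json_escape "$outcome")" "$(json_escape "$reason")" >> "$file"
}
```

A caller would pass the audit path explicitly, for example `append_event /var/lib/mode-controller/events.jsonl desktop compute blocked "audio active"`, which keeps a later move under /persist a one-argument change.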
Gap Closure Backlog
These are the smallest useful implementation/doc tasks to close the current gaps without broadening scope:
- Update older checklist references so studio-local is consistently described as a desktop overlay, not a v1 target.
- Add a short docs/control-plane-decisions.md or extend this file with a dated decision log for authority model, reboot policy, vLLM shape, and audit location.
- Define the exact observe-current --json schema before adding more transition logic.
- Define the GPU release predicate in docs, then implement it in check_gpu_display_released.
- Add persistent audit output to /var/lib/mode-controller/events.jsonl.
- Add observer classifications for degraded-compute, degraded-desktop, and failed-transition before relying on rollback.
- Keep k3s mode behavior limited to platform.slice weights until a concrete platform workload proves that labels, taints, or service restarts are needed.
Proposed Repo Structure
Use the existing scaffold and keep it simple:
.
├── flake.nix
├── hosts/
│   └── workstation/
│       ├── default.nix
│       └── hardware-configuration.nix
├── modules/
│   ├── dubnium/
│   │   ├── default.nix
│   │   ├── options.nix
│   │   ├── state.nix
│   │   ├── targets.nix
│   │   ├── slices.nix
│   │   ├── services.nix
│   │   ├── controller.nix
│   │   └── guards.nix
│   └── workloads/
│       ├── hyprland.nix
│       ├── audio.nix
│       ├── nvidia.nix
│       ├── vllm.nix
│       └── k3s.nix
├── pkgs/
│   └── mode-tools.nix
├── scripts/
│   ├── mode
│   ├── reconcile
│   ├── observe-current
│   ├── lib.sh
│   └── guards/
│       ├── check_audio_idle
│       ├── check_gpu_display_released
│       ├── check_graphical_session_terminable
│       ├── check_vllm_drainable
│       ├── check_compute_capability_local
│       ├── check_studio_capability_local
│       ├── check_memory_headroom
│       └── check_persistence_paths_ready
└── docs/
Flake Design
- nixosConfigurations.workstation imports hosts/workstation/default.nix.
- nixosModules.default exposes the Dubnium module.
- packages.x86_64-linux.mode-tools packages the CLI, observer, reconciler, and guards.
- Add home-manager, sops-nix, and impermanence later only when the base transition loop is proven.
Module Layout
- options.nix: all host policy knobs: default mode, GPU topology, vLLM model/profile, studio placement, slice weights.
- state.nix: creates /run/mode-controller, writes generated topology and placement files, initializes boot default.
- targets.nix: defines desktop.target and compute.target; no v1 studio-local.target.
- slices.nix: defines interactive.slice, ai.slice, platform.slice.
- services.nix: marker/policy services like studio-local-policy.service, audio-priority.service, mode-observe.service.
- controller.nix: mode-controller@.service, boot normalization unit, permissions.
- guards.nix: installs guard scripts and documents exit-code contract.
- workloads/*.nix: workload-specific units, not mode policy.
systemd Targets and Dependencies
desktop.target
Wants=graphical.target
After=graphical.target
compute.target
Conflicts=graphical.target desktop.target
Wants=vllm.service
After=multi-user.target network-online.target
For studio-local, use:
studio-local-policy.service
Type=oneshot
RemainAfterExit=true
Slice=interactive.slice
audio-priority.service
Type=oneshot
RemainAfterExit=true
ExecStart=systemctl set-property --runtime ...
ExecStop=reset slice weights
Slice Structure
- interactive.slice: Hyprland/session-adjacent services, audio priority policy, desktop-critical work.
- ai.slice: vLLM and future AI workloads.
- platform.slice: k3s and platform/background services.
- Optional later: maintenance.slice if maintenance mode becomes real.
Service Layout
- vllm.service: compute-only in v1, Slice=ai.slice, WantedBy=compute.target, persistent model/cache path outside the Nix store.
- k3s.service: stable across modes in v1, Slice=platform.slice; mode differences are resource budgets/policy, not start/stop.
- Hyprland/display stack: owned by normal graphical/session machinery; desktop.target should depend on it but not become a giant desktop controller.
- Audio/PipeWire: normal desktop user services; studio-local only applies priority policy and blocks compute promotion when active audio is detected.
Control Plane Shape
Mode CLI
mode status
mode request <desktop|studio-local|compute>
mode reconcile [--target <mode>]
mode current [--refresh] [--json]
mode desired
mode dry-run <mode>
mode explain [<mode>]
Recommended additions after the first scaffold:
mode guards <target>
mode history
mode last-transition
mode doctor
mode request should be synchronous in v1: return success only after
post-transition observation satisfies the target. Otherwise it should return
non-zero and show the failed or blocking reason.
Observer / Classifier
The observer should be conservative and evidence-first. It should inspect:
- active graphical sessions via loginctl
- compositor/display-manager state
- compute.target and vllm.service
- studio-local-policy.service
- PipeWire/JACK/REAPER indicators
- NVIDIA process/VRAM evidence where available
- controller lock/transition marker
- last failed transition marker
Output should support plain mode for scripts and JSON for status/debug:
{
  "observed_state": "desktop",
  "confidence": "high",
  "degraded": false,
  "signals": {
    "graphical_session_active": true,
    "compute_target_active": false,
    "vllm_active": false,
    "studio_policy_active": false
  },
  "conflicts": [],
  "timestamp": "..."
}
Classification rule: if signals conflict, report transitioning,
degraded-*, or failed-transition; do not pretend the desired target was
reached.
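
The classification rule can be sketched as a pure function over the observer signals. This is an illustration, not the scaffold's classifier: the function name classify_state, the argument order, the extra lock_held input, and the specific mappings (for example, compute.target active without vLLM reported as degraded-compute) are assumptions that the real thresholds may replace.

```shell
# classify_state <graphical_session_active> <compute_target_active> <vllm_active> \
#                <studio_policy_active> <lock_held>
# Each argument is the literal string "true" or anything else (treated as false).
classify_state() {
  local gui="$1" compute="$2" vllm="$3" studio="$4" lock="$5"

  # A held controller lock means a transition is in flight.
  if [ "$lock" = true ]; then echo "transitioning"; return 0; fi

  # GUI evidence together with compute-side evidence is a conflict, not a success.
  if [ "$gui" = true ] && { [ "$compute" = true ] || [ "$vllm" = true ]; }; then
    echo "failed-transition"; return 0
  fi

  if [ "$compute" = true ]; then
    # The studio overlay belongs to desktop; seeing it under compute is a conflict.
    if [ "$studio" = true ]; then echo "failed-transition"; return 0; fi
    # Assumed mapping: compute without vLLM is safe but incomplete.
    if [ "$vllm" = true ]; then echo "compute"; else echo "degraded-compute"; fi
    return 0
  fi

  if [ "$gui" = true ]; then
    if [ "$studio" = true ]; then echo "studio-local"; else echo "desktop"; fi
    return 0
  fi

  # No positive evidence for any mode: prefer unknown over a guess.
  echo "unknown"
}
```

Keeping the classifier free of systemctl or loginctl calls makes it straightforward to cover with snapshot tests, since every mixed state is just a tuple of inputs.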
Guard Layout
- Guards are standalone scripts or subcommands.
- Exit codes:
  - 0: pass
  - 10-19: policy block
  - 20+: execution/check error
- Each guard emits structured JSON or stable key/value output.
- Guards should check one thing each.
Initial guard set:
- check_audio_idle: REAPER/PipeWire/JACK activity blocks compute.
- check_graphical_session_terminable: pre-action check before killing GUI.
- check_gpu_display_released: post-action validation after GUI teardown.
- check_vllm_drainable: compute -> desktop.
- check_compute_capability_local: placement check.
- check_studio_capability_local: blocks studio-local if externalized.
- check_memory_headroom: avoids launching compute under obvious pressure.
- check_persistence_paths_ready: model store/runtime paths exist and are writable.
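
A small sketch of how a caller might interpret the guard exit-code contract. The helper names classify_guard_exit and run_guards are hypothetical. The contract above leaves codes 1-9 unassigned; this sketch treats them as errors so that an unexpected exit fails closed, which is an assumption rather than a documented rule.

```shell
# Map a guard's exit code onto the contract: 0 pass, 10-19 policy block, 20+ error.
classify_guard_exit() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "pass"
  elif [ "$code" -ge 10 ] && [ "$code" -le 19 ]; then
    echo "policy-block"
  else
    # Covers 20+ and, by assumption, the unassigned range 1-9.
    echo "error"
  fi
}

# run_guards <guard>...
# Runs guards in order, prints "<guard> <code> <verdict>" per guard,
# and stops at the first non-pass, returning that guard's exit code.
run_guards() {
  local guard code verdict
  for guard in "$@"; do
    # Guard output goes to stderr so stdout carries only the verdict lines.
    "$guard" >&2
    code=$?
    verdict="$(classify_guard_exit "$code")"
    printf '%s %s %s\n' "$guard" "$code" "$verdict"
    [ "$verdict" = "pass" ] || return "$code"
  done
  return 0
}
```

Returning the failing guard's own code lets the caller distinguish a policy block from a check error without parsing the printed lines.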
First Milestone
The smallest bootable milestone should be narrower than “all modes implemented.”
Goal: boot the flake-managed workstation into desktop, expose the control
plane, and prove an observable/auditable desktop baseline before deep workload
switching.
- Generate real hardware config into hosts/workstation/hardware-configuration.nix.
- Confirm host options:
  - dubnium.boot.defaultMode = "desktop"
  - dubnium.hardware.presentGpus
  - dubnium.hardware.displayGpu
  - dubnium.hardware.computeGpus
  - vLLM disabled or compute-only
  - studio placement set to local only if local audio is still intended
- Build without switching:
      sudo nixos-rebuild build --flake .#workstation
- Switch only after evaluation succeeds:
      sudo nixos-rebuild switch --flake .#workstation
- Verify boot/control-plane files:
      mode status
      mode current
      mode desired
      sudo ls -la /run/mode-controller
- Verify systemd skeleton:
      systemctl status desktop.target
      systemctl status compute.target
      systemctl status studio-local-policy.service
      systemctl status audio-priority.service
      systemctl status vllm.service
- Prove observer honesty:
  - In desktop, mode current should say desktop.
  - vllm.service should be inactive.
  - studio-local-policy.service should be inactive unless requested.
  - If evidence conflicts, status should show conflict/degraded/failed rather than silently reporting success.
- Test the safe overlay first:
      sudo mode request studio-local
      mode status
      sudo mode request desktop
      mode status
- Only then test desktop -> compute with vLLM either disabled, stubbed, or known-good:
      sudo mode request compute
      mode status
      sudo mode request desktop
      mode status
- Milestone success criteria:
  - The machine boots from the flake.
  - mode status/current/desired work.
  - Desired/current separation is visible.
  - The controller lock prevents concurrent transitions.
  - Guard failures are reported distinctly from execution errors.
  - desktop -> studio-local -> desktop works as an overlay.
  - desktop -> compute -> desktop either works or fails with a clear guard/action/post-observation reason.
  - No failed transition is reported as a successful target mode.
The next milestone after that should be a real desktop <-> compute control
loop with vLLM active, structured audit records, rollback to desktop, and
explicit degraded-compute thresholds.