Rolling Implementation Design

Status: living draft

This file captures the current implementation design for Dubnium as a rolling reference. It should be updated as hardware facts, control-plane contracts, and mode-transition behavior are validated on the real host.

Documentation framework:

  • architecture docs live under docs/architecture/
  • accepted decisions live under docs/decisions/
  • operator procedures live under docs/runbooks/
  • this file remains the rolling synthesis, gap register, and implementation backlog

Architecture Summary

Dubnium is a NixOS host that must behave as one physical machine with multiple operational contracts:

  • desktop: normal Hyprland workstation/dev mode. GUI and ordinary audio are active. The display GPU is protected. AI is off or tightly bounded in v1.
  • studio-local: conditional low-latency audio profile. It is a policy overlay on desktop, not the center of the architecture. If studio/audio moves to a Mac mini, the host-local state machine should still make sense.
  • compute: headless throughput mode. GUI is absent or non-authoritative. vLLM and platform workloads may use more of the machine, including both GPUs when present.

The key design rule is that desired state and current state are different things:

  • Desired state is operator or automation intent, written under /run/mode-controller.
  • Current state is observation-derived from runtime facts, not copied from desired state.
  • A reconciler moves the system toward desired state through guarded transitions.
  • systemd targets, services, and slices are the enforcement layer.
  • Transitions must be bounded, logged, idempotent, and able to report blocked, degraded, or failed outcomes explicitly.
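A minimal shell sketch of the desired/current separation described above. The file names `desired` and `current`, the `STATE_DIR` variable, and both function names are assumptions for illustration; the design only fixes the directory /run/mode-controller.

```shell
#!/usr/bin/env bash
# Sketch only: file names "desired" and "current" are assumptions.
STATE_DIR="${MODE_STATE_DIR:-/run/mode-controller}"

# Record operator intent. This never touches the current-state file.
write_desired() {
  local mode="$1" tmp
  case "$mode" in
    desktop|studio-local|compute) ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  tmp="$(mktemp "$STATE_DIR/desired.XXXXXX")" || return 1
  printf '%s\n' "$mode" > "$tmp"
  mv -f "$tmp" "$STATE_DIR/desired"   # rename is atomic within one filesystem
}

# Succeeds (exit 0) while the reconciler still has work to do.
# A missing file counts as "work to do" rather than as agreement.
needs_reconcile() {
  local desired current
  desired="$(cat "$STATE_DIR/desired" 2>/dev/null)" || return 0
  current="$(cat "$STATE_DIR/current" 2>/dev/null)" || return 0
  [ "$desired" != "$current" ]
}
```

In this sketch only the observer would ever write `current`, so a successful `write_desired compute` changes intent without claiming that the machine is in compute mode.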

The normative source is the Dubnium control-plane specification. Its core rules are that desired state is authoritative intent, current state is observer output, no transition runs without a lock, and success requires post-action re-observation. The local docs and current repo scaffold already align with the main direction: runtime switching first, no specialisations yet, desktop.target and compute.target as first-class targets, studio-local as a desktop overlay, vLLM compute-only in v1, and k3s stable across modes.

Gaps / Risks

The goal is to keep this section operational. Items should either be resolved for v1, converted into implementation work, or left as explicit open questions with an owner before the first live build.

Contradictions to Resolve

Resolved for v1:

| Topic | Decision | Follow-up |
|---|---|---|
| studio-local.target vs overlay | Do not create a first-class studio-local.target in v1. Use studio-local-policy.service and audio-priority.service as a desktop overlay. | Update older checklist wording when touching that file. |
| Root-on-RAM / impermanence | Defer Root-on-RAM, /persist, Home Manager, sops-nix, and impermanence until the base bootable control loop works. | Keep persistent path design compatible with adding /persist later. |
| modectl vs mode | Keep the local command name mode. | Treat modectl in upstream notes as an older name unless a rename is explicitly requested. |
| Desktop AI vs compute-only vLLM | Keep vLLM compute-only in v1. | Revisit bounded desktop AI only after reliable desktop <-> compute transitions. |
| Maintenance mode | Do not implement maintenance mode in the first milestone. | Reserve state names and avoid enum designs that make maintenance hard to add later. |

Open compatibility item:

  • Desired/current state format remains plain text in the current scaffold. This is acceptable for the first bootable milestone only if transition records carry structured metadata. The next hardening pass should move toward desired.json and current.json, or explicitly document why the plain-text files remain the stable interface.

Missing Decisions

Resolved for v1:

| Decision | V1 stance |
|---|---|
| Authority model | Require privileged transition execution. The initial operator path is sudo mode request <mode> or root-owned mode-controller@.service. Unprivileged users must not be able to forge desired/current state or transition success. |
| Reboot policy | Boot normalizes to desktop. Do not replay last desired mode across reboot in v1. |
| vLLM service shape | Use one vllm.service, compute-only. Keep the controller and options shaped so vllm@compute.service can replace it later. |
| k3s lifecycle | Keep k3s.service stable across modes in v1. Express mode pressure through platform.slice budgets before adding start/stop behavior. |

Still open before live compute testing:

| Open item | Concrete next step |
|---|---|
| GPU release predicate | Define a target-host predicate using loginctl, compositor absence, nvidia-smi process evidence, and an acceptable residual VRAM threshold. Record both pass and indeterminate outcomes. |
| Degraded thresholds | Define degraded-compute as safe but incomplete compute operation, such as vLLM active on a reduced GPU profile or residual non-critical display allocation below the configured threshold. Define failed-transition for unsafe, conflicting, or unclassified post-action states. |
| Persistent audit location | Choose /var/lib/mode-controller/events.jsonl now, with an option to move it under /persist/var/lib/mode-controller/events.jsonl when impermanence lands. |
| k3s compute policy | Decide whether v1 only changes platform.slice weights or also applies k3s labels/taints for workload intensity. Do not do both until there is a real workload that needs it. |
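One way the GPU release predicate could be shaped, as a sketch rather than a settled definition. The function name, argument order, the `fail` verdict, and the example GPU index and compositor process name in the comments are all assumptions. The decision logic is kept separate from evidence gathering so it can be tested without a GPU.

```shell
#!/usr/bin/env bash
# Sketch only: argument order, the "fail" verdict name, and any threshold
# value passed in are assumptions, not part of the design yet.

# gpu_release_verdict GRAPHICAL_SESSIONS COMPOSITOR_PROCS USED_MIB THRESHOLD_MIB
# Prints one of: pass, fail, indeterminate.
gpu_release_verdict() {
  local sessions="$1" compositors="$2" used_mib="$3" threshold_mib="$4"
  local v
  # Any empty or non-numeric input means evidence is missing: refuse to guess.
  for v in "$sessions" "$compositors" "$used_mib" "$threshold_mib"; do
    case "$v" in
      ''|*[!0-9]*) echo indeterminate; return 0 ;;
    esac
  done
  if [ "$sessions" -gt 0 ] || [ "$compositors" -gt 0 ]; then
    echo fail            # something graphical is still alive
  elif [ "$used_mib" -gt "$threshold_mib" ]; then
    echo fail            # residual VRAM on the display GPU is above the limit
  else
    echo pass
  fi
}

# One possible way to gather inputs on the target host (not executed here;
# GPU index 0 and the process name Hyprland are assumptions):
#   used_mib=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
#   compositors=$(pgrep -cx Hyprland)
```

For example, `gpu_release_verdict 0 0 12 64` prints `pass`, while an unreadable VRAM figure such as `N/A` yields `indeterminate` instead of a false pass.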

Risky Assumptions

| Risk | Failure mode | Mitigation |
|---|---|---|
| NVIDIA/Wayland GPU release is sticky | Compute promotion terminates the GUI but leaves display GPU allocations or ambiguous CUDA/display ownership. | Treat GPU release as an observation predicate, not an assumption. Add bounded timeout, residual threshold, and escalation criteria for specialization/reboot-mediated compute. |
| systemctl isolate compute.target stops too much | Important baseline services disappear because target dependencies are incomplete. | Keep compute.target minimal and explicitly list required base services. Test with systemctl list-dependencies compute.target before live switching. |
| Shell observer misclassifies mixed states | Status reports compute while GUI, audio, or conflicting services are still active. | Prefer unknown, transitioning, degraded-*, or failed-transition over false success. Add JSON evidence output and snapshot tests. |
| Rollback does not restore a usable desktop | desktop.target starts but graphical session/audio/display remain broken. | Make rollback success require post-rollback observation, not just successful systemctl commands. Record degraded desktop if partially restored. |
| /run loses state on reboot | Recent desired/current files disappear and audit history is lost. | Keep live lock/current/desired in /run; write transition history to /var/lib/mode-controller/events.jsonl before introducing impermanence. |
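The dependency check suggested for compute.target could be scripted roughly as follows. This is a sketch: the helper name `missing_units` is hypothetical, and the unit names in the usage comment are placeholders for whatever base services the host actually requires.

```shell
#!/usr/bin/env bash
# Sketch only: missing_units is a hypothetical helper name.

# missing_units REQUIRED_UNIT... < dependency-list
# Reads a dependency listing on stdin and prints each required unit that
# does not appear in it. Leading tree/bullet characters are stripped, so
# unit names that begin with a non-alphanumeric character are not handled.
missing_units() {
  local deps unit
  deps="$(sed 's/^[^[:alnum:]]*//; s/[[:space:]]*$//')"
  for unit in "$@"; do
    printf '%s\n' "$deps" | grep -qxF -- "$unit" || printf '%s\n' "$unit"
  done
}

# Possible use on the host (unit names are placeholders):
#   systemctl list-dependencies --plain compute.target \
#     | missing_units sshd.service k3s.service
```

Empty output would mean every listed unit is pulled in by the target. Note that `systemctl isolate` only accepts units that set AllowIsolate=yes, so that property is worth checking in the same pass.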

Gap Closure Backlog

These are the smallest useful implementation/doc tasks to close the current gaps without broadening scope:

  1. Update older checklist references so studio-local is consistently described as a desktop overlay, not a v1 target.
  2. Add a short docs/control-plane-decisions.md or extend this file with a dated decision log for authority model, reboot policy, vLLM shape, and audit location.
  3. Define the exact observe-current --json schema before adding more transition logic.
  4. Define the GPU release predicate in docs, then implement it in check_gpu_display_released.
  5. Add persistent audit output to /var/lib/mode-controller/events.jsonl.
  6. Add observer classifications for degraded-compute, degraded-desktop, and failed-transition before relying on rollback.
  7. Keep k3s mode behavior limited to platform.slice weights until a concrete platform workload proves that labels, taints, or service restarts are needed.
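For the persistent audit item, a minimal append-only writer might look like the sketch below. The JSON field names and the helper names are assumptions; only the file path comes from the decisions above. Escaping covers backslashes, double quotes, and newlines only, which is enough for mode names and short reason strings but not for arbitrary input.

```shell
#!/usr/bin/env bash
# Sketch only: field names are assumptions; the default path is the v1 choice.
EVENTS_FILE="${EVENTS_FILE:-/var/lib/mode-controller/events.jsonl}"

# Escape backslashes and double quotes, and flatten newlines to spaces.
json_escape() {
  printf '%s' "$1" | tr '\n' ' ' | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# append_event FROM TO OUTCOME REASON
# Appends exactly one JSON object per line.
append_event() {
  local ts
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"timestamp":"%s","from":"%s","to":"%s","outcome":"%s","reason":"%s"}\n' \
    "$ts" "$(json_escape "$1")" "$(json_escape "$2")" \
    "$(json_escape "$3")" "$(json_escape "$4")" >> "$EVENTS_FILE"
}
```

A call such as `append_event desktop compute blocked "audio active"` adds one line to the log. Because transitions already run under the controller lock, this sketch does not add its own locking around the append.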

Proposed Repo Structure

Use the existing scaffold and keep it simple:

.
├── flake.nix
├── hosts/
│   └── workstation/
│       ├── default.nix
│       └── hardware-configuration.nix
├── modules/
│   ├── dubnium/
│   │   ├── default.nix
│   │   ├── options.nix
│   │   ├── state.nix
│   │   ├── targets.nix
│   │   ├── slices.nix
│   │   ├── services.nix
│   │   ├── controller.nix
│   │   └── guards.nix
│   └── workloads/
│       ├── hyprland.nix
│       ├── audio.nix
│       ├── nvidia.nix
│       ├── vllm.nix
│       └── k3s.nix
├── pkgs/
│   └── mode-tools.nix
├── scripts/
│   ├── mode
│   ├── reconcile
│   ├── observe-current
│   ├── lib.sh
│   └── guards/
│       ├── check_audio_idle
│       ├── check_gpu_display_released
│       ├── check_graphical_session_terminable
│       ├── check_vllm_drainable
│       ├── check_compute_capability_local
│       ├── check_studio_capability_local
│       ├── check_memory_headroom
│       └── check_persistence_paths_ready
└── docs/

Flake Design

  • nixosConfigurations.workstation imports hosts/workstation/default.nix.
  • nixosModules.default exposes the Dubnium module.
  • packages.x86_64-linux.mode-tools packages the CLI, observer, reconciler, and guards.
  • Add home-manager, sops-nix, and impermanence later only when the base transition loop is proven.

Module Layout

  • options.nix: all host policy knobs: default mode, GPU topology, vLLM model/profile, studio placement, slice weights.
  • state.nix: creates /run/mode-controller, writes generated topology and placement files, initializes boot default.
  • targets.nix: defines desktop.target and compute.target; no v1 studio-local.target.
  • slices.nix: defines interactive.slice, ai.slice, platform.slice.
  • services.nix: marker/policy services like studio-local-policy.service, audio-priority.service, mode-observe.service.
  • controller.nix: mode-controller@.service, boot normalization unit, permissions.
  • guards.nix: installs guard scripts and documents exit-code contract.
  • workloads/*.nix: workload-specific units, not mode policy.

systemd Targets and Dependencies

desktop.target
  Wants=graphical.target
  After=graphical.target

compute.target
  Conflicts=graphical.target desktop.target
  Wants=vllm.service
  After=multi-user.target network-online.target

For studio-local, use:

studio-local-policy.service
  Type=oneshot
  RemainAfterExit=true
  Slice=interactive.slice

audio-priority.service
  Type=oneshot
  RemainAfterExit=true
  ExecStart=systemctl set-property --runtime ...
  ExecStop=reset slice weights

Slice Structure

  • interactive.slice: Hyprland/session-adjacent services, audio priority policy, desktop-critical work.
  • ai.slice: vLLM and future AI workloads.
  • platform.slice: k3s and platform/background services.
  • Optional later: maintenance.slice if maintenance mode becomes real.

Service Layout

  • vllm.service: compute-only in v1, Slice=ai.slice, WantedBy=compute.target, persistent model/cache path outside the Nix store.
  • k3s.service: stable across modes in v1, Slice=platform.slice; mode differences are resource budgets/policy, not start/stop.
  • Hyprland/display stack: owned by normal graphical/session machinery; desktop.target should depend on it but not become a giant desktop controller.
  • Audio/PipeWire: normal desktop user services; studio-local only applies priority policy and blocks compute promotion when active audio is detected.

Control Plane Shape

Mode CLI

mode status
mode request <desktop|studio-local|compute>
mode reconcile [--target <mode>]
mode current [--refresh] [--json]
mode desired
mode dry-run <mode>
mode explain [<mode>]

Recommended additions after the first scaffold:

mode guards <target>
mode history
mode last-transition
mode doctor

mode request should be synchronous in v1: return success only after post-transition observation satisfies the target. Otherwise it should return non-zero and show the failed or blocking reason.
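The synchronous contract can be sketched as guard, act, then re-observe. The hook names `run_guards`, `apply_target`, and `observe_current`, and the specific non-zero exit codes, are hypothetical; the placeholder hook bodies exist only so the sketch runs standalone.

```shell
#!/usr/bin/env bash
# Sketch only: the three hooks are hypothetical stand-ins for the real
# guard runner, systemd action, and observer. Replace them with real logic.
run_guards()      { return 0; }
apply_target()    { return 0; }
observe_current() { echo desktop; }

# request_mode TARGET
# Exit status (assumed values): 0 target observed, 10 blocked by a guard,
# 20 action failed, 30 action ran but observation did not match the target.
request_mode() {
  local target="$1" observed
  if ! run_guards "$target"; then
    echo "blocked: a guard refused the transition to $target" >&2
    return 10
  fi
  if ! apply_target "$target"; then
    echo "failed: action for $target returned an error" >&2
    return 20
  fi
  # Success is decided by what is observed, not by the action's exit code.
  observed="$(observe_current)"
  if [ "$observed" != "$target" ]; then
    echo "failed: requested $target but observed ${observed:-unknown}" >&2
    return 30
  fi
  echo "ok: $target"
}
```

The point of the shape is that `apply_target` returning 0 is never sufficient on its own: the caller only sees success once the observer agrees with the request.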

Observer / Classifier

The observer should be conservative and evidence-first. It should inspect:

  • active graphical sessions via loginctl
  • compositor/display-manager state
  • compute.target and vllm.service
  • studio-local-policy.service
  • PipeWire/JACK/REAPER indicators
  • NVIDIA process/VRAM evidence where available
  • controller lock/transition marker
  • last failed transition marker

Output should support plain mode for scripts and JSON for status/debug:

{
  "observed_state": "desktop",
  "confidence": "high",
  "degraded": false,
  "signals": {
    "graphical_session_active": true,
    "compute_target_active": false,
    "vllm_active": false,
    "studio_policy_active": false
  },
  "conflicts": [],
  "timestamp": "..."
}

Classification rule: if signals conflict, report transitioning, degraded-*, or failed-transition; do not pretend the desired target was reached.
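A sketch of how the classification rule might be applied to the four signals in the JSON example plus the controller lock. The function name, the argument order, and the exact mapping of conflicts to states are assumptions; in particular, this version reports every unlocked conflict as failed-transition and does not yet model degraded-* states.

```shell
#!/usr/bin/env bash
# Sketch only: one possible reading of the classification rule.

# classify GRAPHICAL COMPUTE_TARGET VLLM STUDIO_POLICY LOCK_HELD
# Each argument is "true" or "false". Prints the observed state.
classify() {
  local gui="$1" compute="$2" vllm="$3" studio="$4" locked="$5"
  local conflict=false
  # compute.target conflicts with the graphical session, vLLM is
  # compute-only in v1, and studio-local is an overlay on desktop.
  if [ "$gui" = true ] && [ "$compute" = true ]; then conflict=true; fi
  if [ "$vllm" = true ] && [ "$compute" != true ]; then conflict=true; fi
  if [ "$studio" = true ] && [ "$gui" != true ]; then conflict=true; fi

  if [ "$locked" = true ]; then echo transitioning
  elif [ "$conflict" = true ]; then echo failed-transition
  elif [ "$compute" = true ]; then echo compute
  elif [ "$gui" = true ] && [ "$studio" = true ]; then echo studio-local
  elif [ "$gui" = true ]; then echo desktop
  else echo unknown
  fi
}
```

With these rules, a graphical session alongside an active compute.target never classifies as compute or desktop, and a host with no evidence at all reports unknown rather than a default mode.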

Guard Layout

  • Guards are standalone scripts or subcommands.
  • Exit codes:
    • 0: pass
    • 10-19: policy block
    • 20+: execution/check error
  • Each guard emits structured JSON or stable key/value output.
  • Guards should check one thing each.
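The exit-code contract above could be interpreted by the reconciler along these lines. The helper names are hypothetical, and treating the unassigned codes 1-9 as errors is an assumption made here for safety.

```shell
#!/usr/bin/env bash
# Sketch only: guard_outcome and run_guard are hypothetical helper names.

# guard_outcome EXIT_CODE
# Prints pass, blocked, or error.
guard_outcome() {
  local rc="$1"
  if [ "$rc" -eq 0 ]; then
    echo pass
  elif [ "$rc" -ge 10 ] && [ "$rc" -le 19 ]; then
    echo blocked         # policy block
  else
    echo error           # 20+ per the contract; 1-9 is unassigned, so it is
                         # treated as an error here rather than as a pass
  fi
}

# run_guard GUARD_COMMAND [ARGS...]
# Runs one guard and prints its classified outcome.
run_guard() {
  local rc=0
  "$@" || rc=$?
  guard_outcome "$rc"
}
```

Keeping the mapping in one function means a guard that crashes with an unexpected status is reported as an execution error, which keeps it distinct from a policy block.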

Initial guard set:

  • check_audio_idle: REAPER/PipeWire/JACK activity blocks compute.
  • check_graphical_session_terminable: pre-action check before killing GUI.
  • check_gpu_display_released: post-action validation after GUI teardown.
  • check_vllm_drainable: compute -> desktop.
  • check_compute_capability_local: placement check.
  • check_studio_capability_local: blocks studio-local if externalized.
  • check_memory_headroom: avoids launching compute under obvious pressure.
  • check_persistence_paths_ready: model store/runtime paths exist and are writable.

First Milestone

The smallest bootable milestone should be narrower than “all modes implemented.”

Goal: boot the flake-managed workstation into desktop, expose the control plane, and prove an observable/auditable desktop baseline before deep workload switching.

  1. Generate real hardware config into hosts/workstation/hardware-configuration.nix.

  2. Confirm host options:

    • dubnium.boot.defaultMode = "desktop"
    • dubnium.hardware.presentGpus
    • dubnium.hardware.displayGpu
    • dubnium.hardware.computeGpus
    • vLLM disabled or compute-only
    • studio placement set to local only if local audio is still intended
  3. Build without switching:

    sudo nixos-rebuild build --flake .#workstation
    
  4. Switch only after the build succeeds:

    sudo nixos-rebuild switch --flake .#workstation
    
  5. Verify boot/control-plane files:

    mode status
    mode current
    mode desired
    sudo ls -la /run/mode-controller
    
  6. Verify systemd skeleton:

    systemctl status desktop.target
    systemctl status compute.target
    systemctl status studio-local-policy.service
    systemctl status audio-priority.service
    systemctl status vllm.service
    
  7. Prove observer honesty:

    • In desktop, mode current should say desktop.
    • vllm.service should be inactive.
    • studio-local-policy.service should be inactive unless requested.
    • If evidence conflicts, status should show conflict/degraded/failed rather than silently reporting success.
  8. Test the safe overlay first:

    sudo mode request studio-local
    mode status
    sudo mode request desktop
    mode status
    
  9. Only then test desktop -> compute with vLLM either disabled, stubbed, or known-good:

    sudo mode request compute
    mode status
    sudo mode request desktop
    mode status
    
  10. Milestone success criteria:

    • The machine boots from the flake.
    • mode status/current/desired work.
    • Desired/current separation is visible.
    • The controller lock prevents concurrent transitions.
    • Guard failures are reported distinctly from execution errors.
    • desktop -> studio-local -> desktop works as an overlay.
    • desktop -> compute -> desktop either works or fails with a clear guard/action/post-observation reason.
    • No failed transition is reported as a successful target mode.
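The lock criterion could be exercised with a wrapper like the following sketch. The lock path, the exit code 11, and the use of an atomic mkdir (rather than, say, flock) are assumptions; the design only requires that no transition runs without a lock. Stale-lock recovery after a crash is not handled here.

```shell
#!/usr/bin/env bash
# Sketch only: lock path and mkdir-based mechanism are assumptions.
LOCK_DIR="${LOCK_DIR:-/run/mode-controller/lock}"

# with_transition_lock COMMAND [ARGS...]
# Runs COMMAND while holding the lock and returns its exit status.
# Returns 11 if the lock directory cannot be created, which includes the
# case where another transition already holds it.
with_transition_lock() {
  local rc=0
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "blocked: could not take lock $LOCK_DIR" >&2
    return 11
  fi
  "$@" || rc=$?
  rmdir "$LOCK_DIR"      # release even when the command failed
  return "$rc"
}
```

Wrapping the whole guard, action, and re-observation sequence in one call would make a second concurrent request return a policy-block style code instead of racing the first.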

The next milestone after that should be a real desktop <-> compute control loop with vLLM active, structured audit records, rollback to desktop, and explicit degraded-compute thresholds.