Runbook: vLLM Runtime

Status: living

Use this when Dubnium’s NixOS configuration manages vllm.service, but the vLLM Python/CUDA runtime is installed outside the Nix store.

NixOS owns:

vllm.service
/var/lib/vllm
/var/lib/dubnium/models
CUDA_VISIBLE_DEVICES
ai.dubnium
Tailscale-only firewall exposure

The external runtime owns:

/var/lib/vllm/venv
Python, PyTorch, vLLM, and CUDA wheel packages inside that venv

This keeps rebuilds fast and avoids compiling PyTorch, CUDA, CuPy, MAGMA, OpenCV CUDA, or vLLM during nixos-rebuild.

Scope

This runbook covers the current hybrid-Nix phase. NixOS is authoritative for the service contract, host alias, firewall exposure, users, directories, environment, and health checks. The Python/CUDA package runtime is mutable operator-managed state under /var/lib/vllm/venv.

A pure-Nix vLLM runtime is a separate later phase. That phase should be treated as build-infrastructure work: it likely needs a dedicated CUDA builder, an Attic/Cachix/nix-serve cache, or an upstream Nixpkgs packaging path that avoids rebuilding the full CUDA/PyTorch/vLLM stack on every workstation.

Preconditions

the host has been switched to a Dubnium generation with dubnium.vllm.runtime = "external"
uv is available in the operator shell
NVIDIA GPU access works on the host
model weights are already seeded under /var/lib/dubnium/models

Check GPU visibility first:

nvidia-smi
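
The tool preconditions above can be checked in one pass. This is a minimal sketch, not part of the Dubnium configuration; the function name missing_cmds is hypothetical.

```shell
# Print each required command that is not on PATH, one per line.
# Prints nothing when every command is present.
# (missing_cmds is an illustrative helper, not a Dubnium tool.)
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: missing_cmds uv nvidia-smi
```

An empty result means both tools resolve in the current shell. It does not prove that the GPU itself is usable, so still run nvidia-smi.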

1. Create The Runtime And Model Directories

sudo install -d -m 0755 -o root -g root /var/lib/vllm
sudo install -d -m 0755 -o root -g root /var/lib/dubnium/models

The NixOS module also declares these directories. These commands are safe to run before or after nixos-rebuild switch.

2. Install vLLM Into The Managed venv

Create a fresh venv:

sudo uv venv --python /run/current-system/sw/bin/python3.12 --python-preference only-system /var/lib/vllm/venv

Install vLLM with CUDA/PyTorch wheels selected by uv:

sudo env UV_TORCH_BACKEND=auto uv pip install --python /var/lib/vllm/venv/bin/python vllm

This is intentionally the only default install command. Do not install audio, JAX, TPU, or broad framework extras during workstation bring-up. In particular, avoid commands that reinstall torchvision, torchaudio, or jax unless a specific workload requires them and the host has enough memory to resolve, download, install, and import that dependency set. The default Dubnium vLLM path is text inference against a local model bundle.

The upstream vLLM GPU install docs recommend uv pip install vllm --torch-backend=auto so uv can select the PyTorch backend from the installed CUDA driver. If that flag is not supported by the installed uv, use the environment variable form above or update uv.

If the installed uv supports newer PyTorch backends, use a specific CUDA backend that matches the host driver. For CUDA 13.0:

sudo uv pip install --python /var/lib/vllm/venv/bin/python --torch-backend=cu130 vllm

Some packaged uv versions may not list cu130 yet. On those versions, keep the default install command above, or upgrade uv to a version that supports the host CUDA backend. Do not use a broad PyTorch-family reinstall as a workstation bring-up workaround; it can pull optional packages such as torchaudio and exceed available memory.

If PyTorch CUDA selection is wrong after the default install, recreate the venv and rerun the vLLM install with a supported UV_TORCH_BACKEND or --torch-backend value rather than layering more framework packages into the same environment.
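
One way to keep a venv rebuild reviewable is to print the commands before running them. The sketch below only prints; it reuses the venv and install commands from this section and assumes that removing the old venv directory is an acceptable way to recreate it. The function name print_venv_rebuild is hypothetical, and the backend value must be one that the installed uv accepts.

```shell
# Print (do not run) the commands to rebuild the venv with an explicit backend.
# $1 = backend value such as auto or cu130
# $2 = venv path, defaulting to the path used in this runbook
# Assumption: deleting the old venv is how "recreate" is carried out.
print_venv_rebuild() {
  backend=$1
  venv=${2:-/var/lib/vllm/venv}
  printf 'sudo rm -rf -- %s\n' "$venv"
  printf 'sudo uv venv --python /run/current-system/sw/bin/python3.12 --python-preference only-system %s\n' "$venv"
  printf 'sudo env UV_TORCH_BACKEND=%s uv pip install --python %s/bin/python vllm\n' "$backend" "$venv"
}

# Example: print_venv_rebuild cu130
```

Review the printed commands, stop vllm.service if it is running, then run them by hand.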

Host config adds the venv’s PyTorch and NVIDIA wheel library directories to LD_LIBRARY_PATH. That is required because the external venv is outside the Nix store and vLLM’s CUDA extension must be able to find libtorch, libcudart, and the CUDA wheel libraries at runtime.
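
To see which directories such a path would need to cover, the sketch below lists the library directories found in a venv. It assumes the common wheel layout of site-packages/torch/lib and site-packages/nvidia/*/lib; the actual list used by the service is whatever the host config declares. The function name venv_cuda_lib_path is hypothetical.

```shell
# Print a colon-separated list of wheel library directories inside a venv.
# $1 = venv path
# Assumption: PyTorch and NVIDIA wheels install shared libraries under
# site-packages/torch/lib and site-packages/nvidia/<package>/lib.
venv_cuda_lib_path() {
  venv=$1
  out=""
  for d in "$venv"/lib/python*/site-packages/torch/lib \
           "$venv"/lib/python*/site-packages/nvidia/*/lib; do
    # Unmatched globs stay literal, so skip anything that is not a directory.
    [ -d "$d" ] || continue
    out="${out:+$out:}$d"
  done
  printf '%s\n' "$out"
}

# Example: venv_cuda_lib_path /var/lib/vllm/venv
```

Comparing this output with the LD_LIBRARY_PATH value in the unit's environment can help spot a directory that the service does not see.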

The service also sets CC to Nix’s C compiler wrapper. Triton may compile a small runtime helper during vLLM startup even when vLLM itself is installed in the external venv.

Keep dubnium.vllm.runtime = "package" available for the future pure-Nix phase, but do not use it for this external-runtime path.

3. Verify The Runtime

Check the executable:

/var/lib/vllm/venv/bin/vllm --version

Check CUDA through PyTorch:

/var/lib/vllm/venv/bin/python -c "import torch; print(torch.cuda.is_available())"

Expected:

True

If this prints False, fix the venv/PyTorch/CUDA wheel selection before debugging Dubnium’s systemd service.

4. Verify The Local Model Bundle

Dubnium keeps model weights out of Git and out of the Nix store. The vLLM service should point at a local model bundle.

MODEL_DIR=/var/lib/dubnium/models/qwen2.5-coder-14b-instruct

If the model bundle was seeded from removable media, verify that the local bundle exists:

test -f "$MODEL_DIR/config.json"
test -f "$MODEL_DIR/model.safetensors.index.json" || test -f "$MODEL_DIR/model.safetensors"

If SHA256SUMS exists, verify it:

cd "$MODEL_DIR"
sudo sha256sum -c SHA256SUMS

If vLLM tries to download model files on first start, the configured model path or local bundle is wrong.
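
The bundle checks in this step can be combined into one function. This is a minimal sketch of the same tests; the function name check_model_bundle is hypothetical, and reading the bundle may still need sudo depending on file permissions.

```shell
# Return 0 when a directory looks like a complete local model bundle.
# $1 = model directory
check_model_bundle() {
  dir=$1
  [ -f "$dir/config.json" ] || { echo "missing config.json in $dir" >&2; return 1; }
  [ -f "$dir/model.safetensors.index.json" ] || [ -f "$dir/model.safetensors" ] || {
    echo "no safetensors weights or index in $dir" >&2; return 1; }
  # Verify checksums only when the bundle ships a manifest.
  if [ -f "$dir/SHA256SUMS" ]; then
    (cd "$dir" && sha256sum -c SHA256SUMS >/dev/null) || return 1
  fi
  return 0
}

# Example: check_model_bundle "$MODEL_DIR" && echo "bundle looks complete"
```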

5. Start The Service

Either start compute mode or restart the service directly. The two commands below are alternatives:

sudo systemctl start compute.target
sudo systemctl restart vllm.service

Inspect service state:

systemctl status vllm --no-pager
journalctl -u vllm -n 100 --no-pager
systemctl show vllm.service -p ExecStart --value
systemctl show vllm.service -p Environment --value
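
The Environment output is a single long line, so a small helper can confirm that a given variable is set. This sketch assumes space-separated KEY=VALUE assignments with no spaces inside the values; the function name env_has_var is hypothetical.

```shell
# Return 0 when a systemd Environment string defines the named variable.
# $1 = output of: systemctl show vllm.service -p Environment --value
# $2 = variable name to look for
# Assumption: assignments are space-separated and values contain no spaces.
env_has_var() {
  case " $1" in
    *" $2="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   env_str=$(systemctl show vllm.service -p Environment --value)
#   env_has_var "$env_str" LD_LIBRARY_PATH || echo "LD_LIBRARY_PATH is not set"
```

The same check applies to CC and CUDA_VISIBLE_DEVICES.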

If /var/lib/vllm/venv/bin/vllm does not exist or is not executable, vllm.service should fail before startup with an executable check error. That means the NixOS service contract is present but the external runtime has not been installed yet.
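
The same condition can be tested by hand before starting the service. This is a sketch of an equivalent check, not the check the NixOS module actually runs; the function name runtime_ready is hypothetical.

```shell
# Return 0 when the runtime entry point exists as an executable regular file.
# $1 = path to the vllm executable, defaulting to the runbook's venv path
runtime_ready() {
  exe=${1:-/var/lib/vllm/venv/bin/vllm}
  if [ ! -f "$exe" ] || [ ! -x "$exe" ]; then
    echo "external runtime missing or not executable: $exe" >&2
    return 1
  fi
}

# Example: runtime_ready || echo "install the external runtime first (step 2)"
```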

6. Verify The API

From the Dubnium host:

getent hosts ai.dubnium
curl http://ai.dubnium:8000/v1/models

From another tailnet machine:

curl http://<dubnium-tailnet-name>:8000/v1/models

ai.dubnium is host-local unless the tailnet DNS or client hosts file also maps that name to the Dubnium node’s Tailscale IP.
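
The API does not answer until vLLM has finished loading the model, so the first curl after a restart may fail even when the service is healthy. A retry loop along these lines can cover that window. This is a minimal sketch; the function name wait_until and the attempt and delay values in the example are assumptions, not Dubnium defaults.

```shell
# Retry a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts, rest = command
wait_until() {
  tries=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$tries" ]; do
    "$@" >/dev/null 2>&1 && return 0
    n=$((n + 1))
    # Sleep only when another attempt will follow.
    [ "$n" -le "$tries" ] && sleep "$delay"
  done
  return 1
}

# Example: wait_until 30 10 curl -fsS http://ai.dubnium:8000/v1/models
```

If the loop gives up, check the journal for vllm.service before retrying.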

References

vLLM GPU installation docs: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/
Model seeding policy: ADR-0008
Tailscale exposure: Tailscale
