ADR-0010: Keep Persistent Memory Separate From vLLM Runtime

Status: accepted

Context

Dubnium is evolving from a local vLLM compute node toward longer-lived conversational and agentic workflows. Those workflows need durable recall, replayability, externally observable metadata, lifecycle hooks, and scoped retrieval.

vLLM is already the inference runtime for Dubnium’s compute mode. It is built to serve tokens with batching, prefix caching, streaming, model lifecycle control, and GPU-aware scheduling. It is not the right owner for durable user memory, agent task state, retention policy, or governance metadata.

The target hardware is constrained. Dual 12 GB RTX 3060 GPUs leave little room for oversized context windows, high concurrency, or unnecessary KV-cache pressure. Treating persistent memory as “keep all context in the model” would make latency, reliability, and recovery worse.

Persistent memory also changes the security posture. Model output, retrieved documents, tool results, artifacts, and prior conversation summaries are all untrusted inputs when they cross a new session boundary. Without structured metadata and lifecycle events, a future governance layer cannot inspect, constrain, attest, or replay memory behavior.

Decision

Keep vLLM as the inference runtime only.

Build persistent memory as a separate subsystem owned by orchestration, retrieval, storage, summarization, and compaction layers. Orchestrators assemble prompts from working context, retrieved memories, task state, and artifact references before calling vLLM.
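As a rough illustration of that assembly step, the following sketch builds a bounded prompt from the four input kinds named above. The `PromptInputs` and `assemble_prompt` names, the section order, and the character budget (a stand-in for a real token budget) are all assumptions made for illustration, not part of Dubnium.

```python
from dataclasses import dataclass, field


@dataclass
class PromptInputs:
    """Hypothetical container for what the orchestrator gathers before inference."""
    working_context: list[str] = field(default_factory=list)
    retrieved_memories: list[str] = field(default_factory=list)
    task_state: str = ""
    artifact_refs: list[str] = field(default_factory=list)


def assemble_prompt(inputs: PromptInputs, user_message: str, max_chars: int = 4000) -> str:
    """Build a prompt whose item text stays within an approximate character budget.

    Section headers are not counted against the budget, so the limit is a
    soft bound. The user message is always included.
    """
    sections = [
        ("Task state", [inputs.task_state] if inputs.task_state else []),
        ("Retrieved memories", inputs.retrieved_memories),
        ("Artifact references", inputs.artifact_refs),
        ("Working context", inputs.working_context),
    ]
    remaining = max_chars - len(user_message)
    parts: list[str] = []
    for title, items in sections:
        kept: list[str] = []
        for item in items:
            cost = len(item) + 3  # "- " prefix plus a newline
            if cost > remaining:
                break  # stop filling this section once an item no longer fits
            kept.append(f"- {item}")
            remaining -= cost
        if kept:
            parts.append(f"{title}:\n" + "\n".join(kept))
    parts.append(f"User:\n{user_message}")
    return "\n\n".join(parts)
```

The resulting string is what the orchestrator would send to vLLM. Nothing in this step requires the inference runtime to know where the memories came from.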

Keep the future governance layer external to the memory/runtime architecture. Dubnium memory/runtime should expose structured records, metadata, and lifecycle hooks for governance to inspect later, but vLLM, vector stores, artifact stores, and MemGPT-style runtimes should not depend directly on that future substrate.

Do not persist transformer KV state as the durable memory mechanism. KV cache state can remain an inference optimization inside vLLM, but durable memory must be replayable from stored events, summaries, artifacts, metadata, and retrieval records.
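One way to read "replayable from stored events" is that the current memory state can always be rebuilt by folding an ordered event log. The sketch below shows that idea under stated assumptions: the `MemoryEvent` shape and the three event kinds are hypothetical, and real events would carry far more metadata.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MemoryEvent:
    """Hypothetical stored event; `seq` gives a total order for replay."""
    seq: int
    kind: str  # assumed kinds: "write", "supersede", "delete"
    memory_id: str
    content: str = ""


def replay(events: Iterable[MemoryEvent]) -> dict[str, str]:
    """Rebuild memory state deterministically from the event log alone."""
    state: dict[str, str] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.kind in ("write", "supersede"):
            state[event.memory_id] = event.content
        elif event.kind == "delete":
            state.pop(event.memory_id, None)
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")
    return state
```

Because the result depends only on the stored events, the same log replayed on another machine yields the same state, which is not true of opaque KV cache contents.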

Use separate memory classes:

working context for current session continuity
episodic memory for meaningful historical interactions
semantic memory for normalized stable facts and conventions
task state for active workflows, checkpoints, and execution graphs
artifacts for external files, logs, generated outputs, and large payloads
metadata for provenance, trust hints, retention hints, sensitivity hints, and scope

The first implementation milestone should use a conservative local stack:

Postgres for structured memory, sessions, tasks, artifacts, and provenance
pgvector for local vector search
Redis for transient working context and queues where useful
a small embedding model such as bge-small or nomic-embed
rolling summaries instead of transcript replay
scoped retrieval before prompt assembly
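The "rolling summaries instead of transcript replay" item can be sketched as a fold that keeps a bounded summary and only calls a summarizer when the bound is exceeded. The `roll_summary` name, the character limit, and the injected `summarize` callable (in practice probably a call into the inference runtime) are assumptions for illustration.

```python
from typing import Callable, Sequence


def roll_summary(
    summary: str,
    new_turns: Sequence[str],
    summarize: Callable[[str], str],
    max_chars: int = 1200,
) -> str:
    """Fold new turns into the running summary without replaying old transcript."""
    if not new_turns:
        return summary
    pieces = [summary, *new_turns] if summary else list(new_turns)
    combined = "\n".join(pieces)
    if len(combined) <= max_chars:
        return combined  # still small enough to carry verbatim
    # Compress, then hard-truncate in case the summarizer overshoots the bound.
    return summarize(combined)[:max_chars]
```

The prompt cost of session history then stays bounded by `max_chars` regardless of how long the session runs.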

Treat MemGPT-style self-editing memory as a later orchestration upgrade path, not the first storage substrate. The currently maintained framework from that lineage is Letta; evaluate it after Dubnium has stable local memory storage, retrieval filters, redaction, provenance, and replay checks. If adopted, it should sit above the persistent memory subsystem and vLLM runtime instead of replacing Dubnium’s metadata, lifecycle hooks, or runtime-secret boundaries.

Boundaries

The inference layer owns token generation, batching, streaming, prefix caching, model startup, GPU assignment, and service health.

The memory subsystem owns storage, retrieval, summarization, embedding, compaction, artifact references, provenance records, and replay inputs.

The orchestration layer owns prompt assembly, scoped retrieval requests, tool coordination, and task workflow progression.

The future governance layer is adjacent to this architecture, not part of it. It may later evaluate policy, provenance, trust, retention, audit, and replay concerns by inspecting the structured records emitted by the memory and runtime layers, but it is not embedded in the vLLM runtime, memory database, vector store, artifact store, or MemGPT-style runtime.
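The ownership boundaries above can be sketched as narrow interfaces, with the orchestrator as the only component that talks to both sides. All names and method signatures here are hypothetical; the point is the dependency direction, not a proposed API.

```python
from typing import Protocol


class InferenceRuntime(Protocol):
    """Owns token generation only (for example, a client for the vLLM service)."""
    def generate(self, prompt: str) -> str: ...


class MemorySubsystem(Protocol):
    """Owns storage and retrieval; knows nothing about the inference runtime."""
    def retrieve(self, namespace: str, query: str) -> list[str]: ...
    def record(self, namespace: str, content: str, source: str) -> None: ...


class Orchestrator:
    """Owns prompt assembly and workflow; the only layer that sees both sides."""

    def __init__(self, runtime: InferenceRuntime, memory: MemorySubsystem) -> None:
        self.runtime = runtime
        self.memory = memory

    def handle(self, namespace: str, message: str) -> str:
        memories = self.memory.retrieve(namespace, message)
        prompt = "\n".join([*memories, message])
        reply = self.runtime.generate(prompt)
        # Model output is untrusted, so it is recorded with its source label.
        self.memory.record(namespace, reply, source="model_output")
        return reply
```

Neither protocol mentions governance. A later governance layer would read what `record` persists rather than sit in the call path.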

Security Model

Assume all inputs are untrusted, including model output and retrieved memories.

Trust boundaries include:

user and agent prompts entering the orchestrator
model output entering summarization or memory extraction
tool output entering task state or memory storage
external documents entering retrieval indexes
retrieved memory entering prompt assembly
retrieval metadata controlling visibility and retention

Durable memory objects must carry enough metadata to support later policy and audit decisions:

source identity
provenance
validation status or validation hints
trust score
sensitivity classification
retention hint or TTL
namespace or project scope
agent boundary
replay lineage
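A minimal sketch of a record carrying these fields might look like the following. The field types, the assumed 0.0 to 1.0 range for the trust score, and the extra `created_at` field (needed to evaluate a TTL) are illustrative assumptions, not a schema defined by this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class MemoryMetadata:
    """Hypothetical metadata attached to every durable memory object."""
    source_identity: str
    provenance: str
    validation_status: str          # e.g. "unvalidated" or "validated" (assumed values)
    trust_score: float              # assumed to be normalized to [0.0, 1.0]
    sensitivity: str
    namespace: str
    agent_boundary: str
    replay_lineage: tuple[str, ...]  # ids of the events this memory derives from
    created_at: datetime            # assumption: required so a TTL can be evaluated
    ttl: Optional[timedelta] = None

    def __post_init__(self) -> None:
        if not 0.0 <= self.trust_score <= 1.0:
            raise ValueError("trust_score must be within [0.0, 1.0]")
        if not self.namespace:
            raise ValueError("namespace is required for scoped retrieval")
        if self.created_at.tzinfo is None:
            raise ValueError("created_at must be timezone-aware")

    def is_expired(self, now: datetime) -> bool:
        """True once the retention TTL, if any, has elapsed."""
        return self.ttl is not None and now >= self.created_at + self.ttl
```

Rejecting incomplete metadata at construction time means a record that reaches storage can always be evaluated against a later policy.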

The first milestone must emit enough structure to support mitigating:

memory poisoning, using confidence and validation metadata
persistent prompt injection, using instruction classification metadata
cross-agent leakage, using scoped namespaces and retrieval events
sensitive data retention, using redaction markers and TTL metadata
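To make the link between metadata and mitigation concrete, here is a hedged sketch of a scoped retrieval filter that emits one retrieval event per decision. The `StoredMemory` fields, the decision labels, and the choice to withhold redacted or instruction-classified memories are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredMemory:
    """Hypothetical stored memory with the markers the mitigations rely on."""
    memory_id: str
    namespace: str
    content: str
    redacted: bool = False        # redaction marker: content must not be surfaced
    is_instruction: bool = False  # set by an assumed instruction classifier


def scoped_retrieve(
    memories: list[StoredMemory],
    namespace: str,
    agent_id: str,
    audit_log: list[dict],
) -> list[StoredMemory]:
    """Return only memories visible in `namespace`, logging every decision."""
    visible: list[StoredMemory] = []
    for memory in memories:
        if memory.namespace != namespace:
            decision = "denied_scope"             # cross-agent leakage guard
        elif memory.redacted:
            decision = "denied_redacted"          # sensitive data guard
        elif memory.is_instruction:
            decision = "quarantined_instruction"  # persistent injection guard
        else:
            decision = "allowed"
            visible.append(memory)
        audit_log.append(
            {"agent": agent_id, "memory_id": memory.memory_id, "decision": decision}
        )
    return visible
```

The audit log entries are the retrieval events. An external reviewer can reconstruct which memories reached a prompt, and why others did not, without access to the model.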

Do not store credentials, raw secret payloads, or private tokens as memories. Secret values remain governed by the runtime-secret policy in ADR-0009.
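A coarse pre-storage check can catch the most obvious violations of this rule. The sketch below is an assumption-heavy illustration: the patterns are examples only and far from exhaustive, the function names are hypothetical, and nothing here reflects what ADR-0009 actually specifies.

```python
import re

# Illustrative patterns only; a real deployment would need a maintained detector.
_SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(
        r"\b(?:api[_-]?key|password|passwd|secret|token)\b\s*[:=]\s*\S+",
        re.IGNORECASE,
    ),
]


def looks_like_secret(content: str) -> bool:
    """Heuristic check for credential-like content before it becomes a memory."""
    return any(pattern.search(content) for pattern in _SECRET_PATTERNS)


def assert_storable(content: str) -> str:
    """Refuse to persist content that matches a secret pattern."""
    if looks_like_secret(content):
        raise ValueError("refusing to store credential-like content as memory")
    return content
```

A heuristic like this reduces accidental retention but does not replace the runtime-secret policy. It will miss secrets that do not match a pattern.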

Consequences

vLLM workers can stay mostly stateless and focused on low-latency inference.
Memory behavior can be tested, replayed, audited, and evolved without changing the inference service contract.
Prompt size stays bounded by retrieval and compression rather than by naive transcript replay.
Future governance remains possible because memory, retrieval, artifact, and runtime events are structured and externally observable.
Governance does not become an embedded runtime dependency.
More infrastructure is required before memory-backed agents are production ready.
Retrieval quality, memory drift, stale facts, and hallucinated recall become explicit validation targets.
Binary artifacts remain externalized and are referenced through metadata or on-demand multimodal inference rather than injected into prompts by default.

Escalation Criteria

Reconsider this policy if:

vLLM gains a production-grade durable memory interface with replayable external metadata
local hardware changes enough that long-context replay is cheaper than external memory retrieval
a dedicated Anthesis-aligned memory service becomes the primary Dubnium memory provider
Letta or another MemGPT-style agent framework can integrate with Dubnium’s storage, metadata, and replay contracts without becoming the source of truth
compliance requirements demand a concrete external governance authority, attestation system, or retention architecture

References

Persistent Context Memory Architecture
