Runbook: vLLM Persistent Memory Prototype

Status: planning

Use this when designing or validating a Dubnium memory subsystem around the local vLLM runtime.

vLLM owns inference. The memory subsystem owns persistence, retrieval, summarization, compaction, artifact references, and replay inputs. Do not make durable memory depend on serialized transformer KV state.

Scope

This runbook covers the first prototype milestone:

durable conversation and event storage
rolling summaries
embeddings for retrieval
scoped retrieval
externally observable metadata on every stored memory
bounded prompt assembly for vLLM

It does not cover multi-agent federation, distributed workflow engines, cryptographic memory attestation, or a pure-Nix packaging path for all services. It also does not adopt Letta or another MemGPT-style agent framework in the first milestone; those belong after the local storage, retrieval, and governance contracts are proven.

Future governance remains external to this runbook. The prototype records metadata and lifecycle events so a later governance substrate can inspect, constrain, attest, or replay behavior, but the prototype does not implement the governance authority itself.

Target Shape

flowchart TD
    U[User or Agent] --> O[Orchestrator]
    O --> W[Working Context]
    O --> R[Retriever]
    O --> T[Task State]
    R --> V[(pgvector)]
    R --> M[(Postgres Memory Tables)]
    O --> L[vLLM]
    L --> S[Summarizer]
    S --> E[Embedding Worker]
    E --> V
    S --> M

Prototype Components

Use conservative local services first:

Concern	Prototype choice
Inference	existing `vllm.service`
Structured store	Postgres
Vector search	pgvector
Working context	Redis or Postgres
Queueing	Redis Streams initially
Object storage	local filesystem first, MinIO later
Embeddings	bge-small or nomic-embed

Keep large artifacts outside prompt assembly. Store references to files, logs, and generated outputs, then retrieve and compress only the relevant excerpts.

Data Classes

Working context is transient session state: recent messages, current objective, active plan, unresolved references, and recent tool outputs.

Episodic memory records meaningful historical interactions, such as debugging sessions, deployment history, design discussions, and operational incidents.

Semantic memory records normalized facts, preferences, project conventions, infrastructure topology, and architecture decisions. Do not treat raw transcripts as semantic memory.

Task state records active workflow state: queued work, checkpoints, execution graphs, pending validations, and unresolved actions.

Metadata records where a memory came from, how trusted it appears, how sensitive it appears, how long it should live, and which scopes may retrieve it. A later governance layer can evaluate that metadata, but the Phase 1 memory service only records and exposes it.

Minimum Schema Direction

The first schema should keep memory objects and embeddings separate so memory metadata can evolve without rewriting vector payloads.

Suggested tables:

sessions
memories
memory_embeddings
tasks
artifacts
provenance

Each memory row should include:

{
  "id": "uuid",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "scope": "project:dubnium",
  "importance": 0.82,
  "confidence": 0.76,
  "sensitivity": "internal",
  "validation_status": "unverified",
  "ttl": null,
  "source": "conversation",
  "created_at": "ISO8601",
  "provenance": {
    "origin": "agent",
    "model": "qwen",
    "extractor_version": "1"
  }
}
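As a minimal sketch of how the row above might look in application code, the dataclass below mirrors the same fields. The class name `MemoryRecord`, the use of a `timedelta` for `ttl`, and the assumption that `importance` and `confidence` fall in the range 0 to 1 are illustrative choices, not part of the schema itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional
import uuid


@dataclass
class MemoryRecord:
    """Hypothetical in-process mirror of one memory row."""

    session_id: str
    memory_type: str
    summary: str
    scope: str
    importance: float
    confidence: float
    sensitivity: str = "internal"
    validation_status: str = "unverified"
    ttl: Optional[timedelta] = None  # assumption: None means "no expiry"
    source: str = "conversation"
    provenance: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # Assumption: both scores are normalized to the closed interval [0, 1].
        for name in ("importance", "confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def is_expired(self, now: datetime) -> bool:
        # A record with no TTL never expires on its own.
        return self.ttl is not None and self.created_at + self.ttl <= now
```

Keeping the embedding out of this object matches the schema direction above: the vector lives in `memory_embeddings` and is joined by `id` when needed.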

Retrieval Contract

The retriever should take a scoped request from the orchestrator and return scoped context candidates, not final prompts.

Required filters:

project or session scope
agent namespace
TTL expiration
sensitivity
recency

Recommended ranking inputs:

vector similarity
keyword match
recency
importance
source authority
validation status
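One way to combine these ranking inputs is a weighted sum, sketched below. The weights, the 30-day recency half-life, the candidate field names, and the `verified` status value are all placeholders chosen for illustration; the prototype would need to tune or replace them during validation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {
    "similarity": 0.40,
    "keyword": 0.15,
    "recency": 0.15,
    "importance": 0.15,
    "authority": 0.10,
    "validation": 0.05,
}

# Assumption: only "unverified" appears in the schema; "verified" is illustrative.
VALIDATION_SCORES = {"verified": 1.0, "unverified": 0.5}


def recency_score(created_at, now, half_life_days=30.0):
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life."""
    age_days = max((now - created_at).total_seconds(), 0.0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


def rank_score(candidate, now):
    """Combine the ranking inputs into one score; each input is expected in [0, 1]."""
    signals = {
        "similarity": candidate["similarity"],
        "keyword": candidate["keyword_match"],
        "recency": recency_score(candidate["created_at"], now),
        "importance": candidate["importance"],
        "authority": candidate["source_authority"],
        "validation": VALIDATION_SCORES.get(candidate["validation_status"], 0.0),
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())
```

A linear blend is easy to audit and replay because every input is recorded on the candidate, which is the main reason to start with it over a learned re-ranker.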

The context builder should compress results before prompt assembly and preserve citations, artifact references, retrieval event ids, or memory ids so a response can be audited later.
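A rough sketch of bounded context assembly that keeps memory ids attached to each excerpt follows. It assumes candidates arrive already ranked and uses a character budget as a crude stand-in for a token budget; the function name `build_context` and the `[memory:<id>]` tag format are hypothetical.

```python
def build_context(candidates, max_chars=2000):
    """Pack ranked candidates into a bounded text block.

    Returns the context text and the list of memory ids that were included,
    so the caller can log exactly which memories reached the prompt.
    """
    lines = []
    used_ids = []
    used = 0
    for candidate in candidates:
        line = f"[memory:{candidate['id']}] {candidate['summary']}"
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break  # stop at the first candidate that does not fit
        lines.append(line)
        used_ids.append(candidate["id"])
        used += cost
    return "\n".join(lines), used_ids
```

Returning the included ids separately from the text means the audit trail does not depend on parsing the prompt later.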

Storage Path

Capture a conversation, tool event, task event, or artifact reference.
Classify the event and reject data that should not become durable memory.
Redact secrets and sensitive payloads.
Summarize the event into a typed memory candidate.
Attach provenance, sensitivity, scope, confidence, and retention metadata.
Embed the memory summary.
Store structured memory and vector data.
Schedule expiration or revalidation when retention metadata requires it.
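The classify, redact, and summarize steps above can be sketched as follows. The event shape, the set of durable event types, and the single regular expression are illustrative assumptions; real redaction would need a much broader set of detectors, and `summarize` stands in for whatever summarizer the prototype calls through vLLM.

```python
import re

# Assumption: a single key=value style pattern, for illustration only.
SECRET_PATTERN = re.compile(
    r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)(\S+)"
)

# Hypothetical event types that are allowed to become durable memory.
DURABLE_EVENT_TYPES = {"conversation", "tool_event", "task_event", "artifact_reference"}


def redact(text):
    """Replace the value part of obvious key=value secrets."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)


def build_candidate(event, summarize):
    """Turn a captured event into a typed memory candidate, or None if rejected."""
    if event.get("type") not in DURABLE_EVENT_TYPES:
        return None  # classification step: not eligible for durable memory
    clean_text = redact(event["text"])  # redact before anything is summarized
    return {
        "memory_type": "episodic",
        "summary": summarize(clean_text),
        "scope": event["scope"],
        "source": event["type"],
        "sensitivity": event.get("sensitivity", "internal"),
        "confidence": event.get("confidence", 0.5),
        "ttl": event.get("ttl"),
        "provenance": {"origin": event.get("origin", "agent"), "extractor_version": "1"},
    }
```

Redaction runs before summarization so that the summarizer, and therefore the embedding, never sees the raw secret. Embedding and storage are left out of this sketch.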

Retrieval Path

Receive a query and current task scope from the orchestrator.
Embed the query.
Search the vector index and apply any structured filters.
Apply scope, TTL, and sensitivity filters before re-ranking.
Re-rank by relevance, recency, importance, source authority, and validation status.
Compress selected context.
Return context candidates with ids, scope, and provenance.
Assemble the final vLLM prompt outside the retriever.
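The ordering in this path, hard filters first and re-ranking second, can be sketched over an in-memory list. The allowed sensitivity levels, the dictionary field names, and the `score` callable are assumptions for illustration; in the prototype the filters would more likely be SQL predicates against Postgres.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sensitivity levels that may reach a prompt.
ALLOWED_SENSITIVITY = frozenset({"public", "internal"})


def passes_filters(memory, scope, now, allowed_sensitivity=ALLOWED_SENSITIVITY):
    """Hard filters: a memory failing any of these is never ranked."""
    if memory["scope"] != scope:
        return False
    if memory["sensitivity"] not in allowed_sensitivity:
        return False
    ttl = memory.get("ttl")
    if ttl is not None and memory["created_at"] + ttl <= now:
        return False
    return True


def retrieve(memories, scope, now, score, limit=5):
    """Filter, then re-rank, then return candidates rather than a prompt."""
    eligible = [m for m in memories if passes_filters(m, scope, now)]
    ranked = sorted(eligible, key=score, reverse=True)[:limit]
    return [
        {
            "id": m["id"],
            "scope": m["scope"],
            "summary": m["summary"],
            "provenance": m.get("provenance", {}),
        }
        for m in ranked
    ]
```

Filtering before ranking means an out-of-scope or expired memory cannot win on similarity alone, which is the property the isolation checks need.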

Validation Checks

Before treating the prototype as useful, test:

latency impact on vLLM request path
recall quality for prior sessions
false recall and hallucinated-memory rate
memory poisoning resistance
prompt-injection persistence resistance
cross-project and cross-agent isolation
secret redaction before storage
TTL expiration and revalidation behavior
replay from stored events and memory ids

Acceptance Criteria

The first milestone is complete when:

vLLM can answer with retrieved context without changing `vllm.service`
memory storage survives service restart
retrieval can be scoped to one project
expired or sensitive memories are excluded from prompt assembly
summaries can be traced back to source events or artifacts
a replay can reconstruct which memories were available to a response
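For the replay criterion, one possible approach is an append-only log of retrieval events that records which memory ids were offered for each response. The sketch below uses a plain Python list of JSON lines as the log; the function names, the event fields, and the storage medium are all hypothetical, and a real prototype would more likely write these events to Postgres.

```python
import json
import uuid
from datetime import datetime, timezone


def record_retrieval_event(log, query, scope, memory_ids):
    """Append one retrieval event to the log and return its event id."""
    event = {
        "event_id": str(uuid.uuid4()),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "scope": scope,
        "memory_ids": list(memory_ids),
    }
    log.append(json.dumps(event, sort_keys=True))
    return event["event_id"]


def replay_memory_ids(log, event_id):
    """Return the memory ids recorded for an event, or None if it is unknown."""
    for line in log:
        event = json.loads(line)
        if event["event_id"] == event_id:
            return event["memory_ids"]
    return None
```

Attaching the returned event id to the generated response is what lets a later audit answer which memories were available when it was produced.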

Artifact Handling

Artifacts are not memory. Store raw binaries outside prompts and retrieve derived context by default:

captions
OCR text
extracted metadata
embeddings
content hashes
artifact references

Use on-demand multimodal inference only when a task needs the binary itself. The retrieval result should carry an artifact reference rather than copying the artifact into ordinary text prompt memory.
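A minimal sketch of an artifact reference for the local-filesystem case is shown below. The `ArtifactRef` fields and the choice of SHA-256 for the content hash are assumptions for illustration; the runbook only requires that a content hash and a reference exist.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ArtifactRef:
    """Hypothetical pointer to an artifact; the binary itself stays on disk."""

    path: str
    sha256: str
    size_bytes: int
    caption: str = ""


def make_artifact_ref(path, caption=""):
    """Hash the file in chunks so large artifacts are not loaded into memory."""
    file_path = Path(path)
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return ArtifactRef(
        path=str(file_path),
        sha256=digest.hexdigest(),
        size_bytes=file_path.stat().st_size,
        caption=caption,
    )
```

A retrieval result can then carry the `ArtifactRef` and its caption, and the orchestrator opens the file only when a task needs the binary.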

Incremental Upgrade: MemGPT / Letta

After the Phase 1 substrate is stable, evaluate MemGPT-style self-editing memory as an orchestration-layer upgrade. Use current Letta documentation when testing concrete framework integration; reserve “MemGPT” for the research pattern unless a legacy component explicitly uses that name.

The evaluation should answer:

whether Letta can use Dubnium’s Postgres/pgvector-backed memory stores without bypassing scope, sensitivity, TTL, validation, or provenance filters
whether agent-managed memory edits can be audited and replayed
whether archival and recall memory operations can preserve Dubnium memory ids and source lineage
whether the framework can call local vLLM without requiring model-hosted memory persistence
whether rejected, expired, or sensitive memories stay out of generated prompts

Do not adopt the framework if it requires storing ungoverned transcripts, credentials, or tool outputs in durable memory.
