Runbook: vLLM Persistent Memory Prototype
Status: planning
Use this when designing or validating a Dubnium memory subsystem around the local vLLM runtime.
vLLM owns inference. The memory subsystem owns persistence, retrieval, summarization, compaction, artifact references, and replay inputs. Do not make durable memory depend on serialized transformer KV state.
Scope
This runbook covers the first prototype milestone:
- durable conversation and event storage
- rolling summaries
- embeddings for retrieval
- scoped retrieval
- externally observable metadata on every stored memory
- bounded prompt assembly for vLLM
It does not cover multi-agent federation, distributed workflow engines, cryptographic memory attestation, or a pure-Nix packaging path for all services. It also does not adopt Letta or another MemGPT-style agent framework in the first milestone; those belong after the local storage, retrieval, and governance contracts are proven.
Future governance remains external to this runbook. The prototype records metadata and lifecycle events so a later governance substrate can inspect, constrain, attest, or replay behavior, but the prototype does not implement the governance authority itself.
Target Shape
flowchart TD
U[User or Agent] --> O[Orchestrator]
O --> W[Working Context]
O --> R[Retriever]
O --> T[Task State]
R --> V[(pgvector)]
R --> M[(Postgres Memory Tables)]
O --> L[vLLM]
L --> S[Summarizer]
S --> E[Embedding Worker]
E --> V
S --> M
Prototype Components
Use conservative local services first:
| Concern | Prototype choice |
|---|---|
| Inference | existing vllm.service |
| Structured store | Postgres |
| Vector search | pgvector |
| Working context | Redis or Postgres |
| Queueing | Redis Streams initially |
| Object storage | local filesystem first, MinIO later |
| Embeddings | bge-small or nomic-embed |
Keep large artifacts outside prompt assembly. Store references to files, logs, and generated outputs, then retrieve and compress only the relevant excerpts.
Data Classes
Working context is transient session state: recent messages, current objective, active plan, unresolved references, and recent tool outputs.
Episodic memory records meaningful historical interactions, such as debugging sessions, deployment history, design discussions, and operational incidents.
Semantic memory records normalized facts, preferences, project conventions, infrastructure topology, and architecture decisions. Do not treat raw transcripts as semantic memory.
Task state records active workflow state: queued work, checkpoints, execution graphs, pending validations, and unresolved actions.
Metadata records where a memory came from, how trusted it appears, how sensitive it appears, how long it should live, and which scopes may retrieve it. A later governance layer can evaluate that metadata, but the Phase 1 memory service only records and exposes it.
Minimum Schema Direction
The first schema should keep memory objects and embeddings separate so memory metadata can evolve without rewriting vector payloads.
Suggested tables:
- sessions
- memories
- memory_embeddings
- tasks
- artifacts
- provenance
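As a rough illustration of keeping memory objects and embeddings separate, the sketch below uses Python's standard-library sqlite3 as a stand-in for Postgres and stores the vector as JSON text where pgvector would normally be used. The column names, the `open_store` and `add_memory` helpers, and the composite key on `memory_embeddings` are all assumptions made for illustration; this runbook does not fix a schema.

```python
import json
import sqlite3

# Hypothetical DDL. sqlite3 stands in for Postgres, and the vector is stored
# as JSON text where a real deployment would use a pgvector column.
SCHEMA = """
CREATE TABLE memories (
    id          TEXT PRIMARY KEY,
    memory_type TEXT NOT NULL,
    summary     TEXT NOT NULL,
    scope       TEXT NOT NULL
);
CREATE TABLE memory_embeddings (
    memory_id TEXT NOT NULL REFERENCES memories(id),
    model     TEXT NOT NULL,
    vector    TEXT NOT NULL,
    PRIMARY KEY (memory_id, model)
);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the two tables kept separate."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_memory(conn, memory_id, memory_type, summary, scope, model, vector):
    """Insert the memory row and its embedding as two independent rows."""
    conn.execute(
        "INSERT INTO memories (id, memory_type, summary, scope) VALUES (?, ?, ?, ?)",
        (memory_id, memory_type, summary, scope),
    )
    conn.execute(
        "INSERT INTO memory_embeddings (memory_id, model, vector) VALUES (?, ?, ?)",
        (memory_id, model, json.dumps(vector)),
    )


conn = open_store()
add_memory(conn, "m1", "episodic", "Fixed the deploy script",
           "project:dubnium", "bge-small", [0.1, 0.2])

# Metadata evolves on the memories table only; stored vectors are untouched.
conn.execute("ALTER TABLE memories ADD COLUMN sensitivity TEXT DEFAULT 'internal'")
row = conn.execute(
    "SELECT m.sensitivity, e.vector FROM memories m "
    "JOIN memory_embeddings e ON e.memory_id = m.id"
).fetchone()
```

Keying embeddings by `(memory_id, model)` is one possible way to let a memory be re-embedded with a different model without rewriting the memory row.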
Each memory row should include:
{
  "id": "uuid",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "scope": "project:dubnium",
  "importance": 0.82,
  "confidence": 0.76,
  "sensitivity": "internal",
  "validation_status": "unverified",
  "ttl": null,
  "source": "conversation",
  "created_at": "ISO8601",
  "provenance": {
    "origin": "agent",
    "model": "qwen",
    "extractor_version": "1"
  }
}
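The row shape above could be mirrored in application code roughly as follows. This is a minimal sketch: the `MemoryRecord` class name, the defaults, the reading of `ttl` as seconds, and the assumption that `importance` and `confidence` are normalized to [0, 1] are illustrative choices, not part of the schema direction itself.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class MemoryRecord:
    session_id: str
    memory_type: str
    summary: str
    scope: str
    importance: float
    confidence: float
    sensitivity: str = "internal"
    validation_status: str = "unverified"
    ttl: Optional[int] = None  # assumption: seconds; None means no expiry
    source: str = "conversation"
    provenance: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(default_factory=_now_iso)

    def __post_init__(self) -> None:
        # Assumption: both scores are normalized to [0, 1].
        for name in ("importance", "confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if not self.scope:
            raise ValueError("scope is required so retrieval can be filtered")


record = MemoryRecord(
    session_id=str(uuid.uuid4()),
    memory_type="episodic",
    summary="Condensed interaction summary",
    scope="project:dubnium",
    importance=0.82,
    confidence=0.76,
    provenance={"origin": "agent", "model": "qwen", "extractor_version": "1"},
)
```

Rejecting out-of-range scores and empty scopes at construction time keeps malformed rows from reaching storage, where they would be harder to filter later.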
Retrieval Contract
The retriever should take a scoped request from the orchestrator and return scoped context candidates, not final prompts.
Required filters:
- project or session scope
- agent namespace
- TTL expiration
- recency
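The required filters can be sketched as a single predicate that a candidate must pass before it is ranked. The field names `agent_namespace`, `expires_at`, `created_at`, and `max_age`, and the function name `passes_required_filters`, are hypothetical; the runbook names the filters but not their representation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def passes_required_filters(memory: dict, request: dict, now: datetime) -> bool:
    """Return True only if the memory clears every required filter."""
    # Project or session scope.
    if memory["scope"] not in request["scopes"]:
        return False
    # Agent namespace.
    if memory["agent_namespace"] != request["agent_namespace"]:
        return False
    # TTL expiration, modeled here as an absolute expiry timestamp.
    expires_at: Optional[datetime] = memory.get("expires_at")
    if expires_at is not None and expires_at <= now:
        return False
    # Recency, modeled here as an optional maximum age on the request.
    max_age: Optional[timedelta] = request.get("max_age")
    if max_age is not None and now - memory["created_at"] > max_age:
        return False
    return True


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
request = {
    "scopes": {"project:dubnium"},
    "agent_namespace": "planner",
    "max_age": timedelta(days=30),
}
fresh = {
    "scope": "project:dubnium",
    "agent_namespace": "planner",
    "created_at": now - timedelta(days=1),
    "expires_at": None,
}
expired = {**fresh, "expires_at": now - timedelta(hours=1)}
foreign = {**fresh, "scope": "project:other"}
stale = {**fresh, "created_at": now - timedelta(days=90)}
```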
Recommended ranking inputs:
- vector similarity
- keyword match
- recency
- importance
- source authority
- validation status
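One simple way to combine these ranking inputs is a weighted sum over normalized features. The weights, the validation status scores, the one-week half-life, and the feature field names below are placeholders chosen for illustration and would need tuning against the recall tests; nothing here is prescribed by the runbook.

```python
# Hypothetical weights and status scores; tune against recall tests.
WEIGHTS = {
    "similarity": 0.45,
    "keyword": 0.15,
    "recency": 0.15,
    "importance": 0.10,
    "authority": 0.10,
    "validation": 0.05,
}
VALIDATION_SCORE = {"verified": 1.0, "unverified": 0.5, "rejected": 0.0}


def recency_score(age_seconds: float, half_life_seconds: float = 7 * 24 * 3600) -> float:
    """Exponential decay: 1.0 when new, 0.5 after one half-life."""
    return 0.5 ** (age_seconds / half_life_seconds)


def rank_score(candidate: dict) -> float:
    """Weighted sum of features, each assumed to lie in [0, 1]."""
    features = {
        "similarity": candidate["similarity"],
        "keyword": candidate["keyword"],
        "recency": recency_score(candidate["age_seconds"]),
        "importance": candidate["importance"],
        "authority": candidate["authority"],
        "validation": VALIDATION_SCORE.get(candidate["validation_status"], 0.0),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


recent_verified = {
    "similarity": 0.8, "keyword": 0.5, "age_seconds": 3600,
    "importance": 0.7, "authority": 0.6, "validation_status": "verified",
}
old_unverified = {**recent_verified, "age_seconds": 60 * 24 * 3600,
                  "validation_status": "unverified"}
ranked = sorted([old_unverified, recent_verified], key=rank_score, reverse=True)
```

A linear score is easy to audit and replay, which fits the prototype's emphasis on externally observable behavior; a learned re-ranker could replace it later.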
The context builder should compress results before prompt assembly and preserve citations, artifact references, retrieval event ids, or memory ids so a response can be audited later.
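A minimal sketch of the id-preserving part of the context builder follows. It only packs already-ranked candidates into a bounded budget and tags each line with its memory id; it does not summarize. The `build_context` name and the use of a character count as a proxy for a token budget are assumptions.

```python
def build_context(candidates: list, max_chars: int) -> tuple:
    """Pack ranked candidates into a bounded context, keeping memory ids.

    Returns the context text and the list of memory ids actually included,
    so the caller can log exactly what reached the prompt.
    """
    lines = []
    used_ids = []
    used = 0
    for candidate in candidates:
        line = f"[{candidate['id']}] {candidate['summary']}"
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            continue  # skip this one; a shorter later candidate may still fit
        lines.append(line)
        used_ids.append(candidate["id"])
        used += cost
    return "\n".join(lines), used_ids


candidates = [
    {"id": "m1", "summary": "alpha"},
    {"id": "m2", "summary": "x" * 50},
    {"id": "m3", "summary": "beta"},
]
context, used_ids = build_context(candidates, max_chars=25)
```

Returning `used_ids` alongside the text gives the orchestrator what it needs to record a retrieval event for later audit.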
Storage Path
- Capture a conversation, tool event, task event, or artifact reference.
- Classify the event and reject data that should not become durable memory.
- Redact secrets and sensitive payloads.
- Summarize the event into a typed memory candidate.
- Attach provenance, sensitivity, scope, confidence, and retention metadata.
- Embed the memory summary.
- Store structured memory and vector data.
- Schedule expiration or revalidation when retention metadata requires it.
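The classify, redact, summarize, and attach-metadata steps above might look roughly like this. The regular expressions are illustrative only and are not a substitute for a vetted secret scanner; the event shape, the `DURABLE_KINDS` set, the truncation used in place of a real summarizer, and the default metadata values are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical set of event kinds allowed to become durable memory.
DURABLE_KINDS = {"conversation", "tool", "task", "artifact"}

# Illustrative patterns only; real redaction needs a vetted scanner.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(?:api[_-]?key|token|password)\s*[=:]\s*\S+"),
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.S,
    ),
]


def redact(text: str) -> str:
    """Replace anything matching a secret pattern with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def to_candidate(event: dict, scope: str) -> Optional[dict]:
    """Turn a captured event into a typed memory candidate, or reject it."""
    if event.get("kind") not in DURABLE_KINDS:
        return None  # classified as not durable
    clean = redact(event["text"])  # redact before anything is summarized
    return {
        "memory_type": "episodic",
        "summary": clean[:200],  # placeholder for a real summarizer
        "scope": scope,
        "sensitivity": "internal",
        "confidence": 0.5,
        "validation_status": "unverified",
        "ttl": None,
        "source": event["kind"],
        "provenance": {
            "origin": event.get("origin", "unknown"),
            "extractor_version": "1",
        },
    }


event = {"kind": "tool", "origin": "agent",
         "text": "Deploy failed with password=hunter2 in env"}
candidate = to_candidate(event, "project:dubnium")
rejected = to_candidate({"kind": "heartbeat", "text": "ping"}, "project:dubnium")
```

Redaction runs before summarization so that a secret never reaches the summarizer, the embedding model, or the stored row.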
Retrieval Path
- Receive a query and current task scope from the orchestrator.
- Embed the query.
- Search the vector index and apply any structured filters.
- Apply scope, TTL, and sensitivity filters before re-ranking.
- Re-rank by relevance, recency, importance, and source hints.
- Compress selected context.
- Return context candidates with ids, scope, and provenance.
- Assemble the final vLLM prompt outside the retriever.
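The ordering of the retrieval steps, with filtering applied before ranking, can be sketched with a brute-force cosine search over in-memory rows. In the prototype the search would run against pgvector; the `retrieve` function, the row fields, and the use of epoch seconds for `expires_at` are assumptions made so the example is self-contained.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, memories, scope, allowed_sensitivity, now, k=3):
    """Filter by scope, TTL, and sensitivity first, then rank by similarity."""
    eligible = [
        m for m in memories
        if m["scope"] == scope
        and m["sensitivity"] in allowed_sensitivity
        and (m["expires_at"] is None or m["expires_at"] > now)
    ]
    ranked = sorted(eligible, key=lambda m: cosine(query_vec, m["vector"]), reverse=True)
    # Return candidates, not a prompt: ids, scope, and provenance travel along.
    return [
        {"id": m["id"], "scope": m["scope"], "summary": m["summary"],
         "provenance": m["provenance"]}
        for m in ranked[:k]
    ]


def _row(mem_id, scope, sensitivity, vector, expires_at=None):
    return {"id": mem_id, "scope": scope, "sensitivity": sensitivity,
            "vector": vector, "expires_at": expires_at,
            "summary": f"summary of {mem_id}", "provenance": {"origin": "agent"}}


memories = [
    _row("a", "project:dubnium", "internal", [1.0, 0.0]),
    _row("b", "project:other", "internal", [1.0, 0.0]),      # wrong scope
    _row("c", "project:dubnium", "secret", [1.0, 0.0]),      # too sensitive
    _row("d", "project:dubnium", "internal", [0.0, 1.0]),
    _row("e", "project:dubnium", "internal", [1.0, 0.0], expires_at=50.0),  # expired
]
results = retrieve([1.0, 0.1], memories, "project:dubnium", {"internal"}, now=100.0)
```

Filtering before ranking means an out-of-scope or expired memory cannot win a slot on similarity alone, however close its vector is to the query.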
Validation Checks
Before treating the prototype as useful, test:
- latency impact on vLLM request path
- recall quality for prior sessions
- false recall and hallucinated-memory rate
- memory poisoning resistance
- prompt-injection persistence resistance
- cross-project and cross-agent isolation
- secret redaction before storage
- TTL expiration and revalidation behavior
- replay from stored events and memory ids
Acceptance Criteria
The first milestone is complete when:
- vLLM can answer with retrieved context without changing vllm.service
- memory storage survives service restart
- retrieval can be scoped to one project
- expired or sensitive memories are excluded from prompt assembly
- summaries can be traced back to source events or artifacts
- a replay can reconstruct which memories were available to a response
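The replay criterion implies some record of which memory ids were offered to each response. A minimal sketch of such a record is shown below, using an in-memory append-only list as a stand-in for a durable table; the `RetrievalLog` class and its field names are hypothetical.

```python
import json


class RetrievalLog:
    """Append-only log of which memories were available to each response."""

    def __init__(self) -> None:
        self._events = []  # stand-in for a durable, append-only table

    def record(self, response_id: str, query: str, memory_ids: list) -> None:
        """Store one retrieval event as an immutable JSON string."""
        event = {"response_id": response_id, "query": query,
                 "memory_ids": list(memory_ids)}
        self._events.append(json.dumps(event, sort_keys=True))

    def replay(self, response_id: str) -> list:
        """Return the memory ids recorded for a response, or an empty list."""
        for raw in self._events:
            event = json.loads(raw)
            if event["response_id"] == response_id:
                return event["memory_ids"]
        return []


log = RetrievalLog()
log.record("resp-1", "why did the deploy fail?", ["m1", "m3"])
log.record("resp-2", "what is the db topology?", ["m7"])
available = log.replay("resp-1")
```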
Artifact Handling
Artifacts are not memory. Store raw binaries outside prompts and retrieve derived context by default:
- captions
- OCR text
- extracted metadata
- embeddings
- content hashes
- artifact references
Use on-demand multimodal inference only when a task needs the binary itself. The retrieval result should carry an artifact reference rather than copying the artifact into ordinary text prompt memory.
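A retrieval result that carries a reference instead of the binary might be modeled as below. The `ArtifactRef` fields, the `make_ref` and `to_prompt_line` helpers, and the example path are assumptions for illustration; only the idea of passing a content hash and derived text instead of raw bytes comes from this section.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRef:
    path: str        # where the raw binary lives, outside the prompt
    sha256: str      # content hash for integrity and deduplication
    size_bytes: int
    caption: str     # derived text that is safe to place in a prompt


def make_ref(path: str, data: bytes, caption: str) -> ArtifactRef:
    """Build a reference from the artifact bytes without retaining them."""
    return ArtifactRef(
        path=path,
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
        caption=caption,
    )


def to_prompt_line(ref: ArtifactRef) -> str:
    """Render only the short hash prefix and caption, never the binary."""
    return f"[artifact sha256:{ref.sha256[:12]}] {ref.caption}"


data = b"\x89PNG fake image bytes"
ref = make_ref("artifacts/example.png", data, "Grafana panel showing a latency spike")
line = to_prompt_line(ref)
```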
Incremental Upgrade: MemGPT / Letta
After the Phase 1 substrate is stable, evaluate MemGPT-style self-editing memory as an orchestration-layer upgrade. Use current Letta documentation when testing concrete framework integration; reserve “MemGPT” for the research pattern unless a legacy component explicitly uses that name.
The evaluation should answer:
- whether Letta can use Dubnium’s Postgres/pgvector-backed memory stores without bypassing scope, sensitivity, TTL, validation, or provenance filters
- whether agent-managed memory edits can be audited and replayed
- whether archival and recall memory operations can preserve Dubnium memory ids and source lineage
- whether the framework can call local vLLM without requiring model-hosted memory persistence
- whether rejected, expired, or sensitive memories stay out of generated prompts
Do not adopt the framework if it requires storing ungoverned transcripts, credentials, or tool outputs in durable memory.