Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Runbook: vLLM Persistent Memory Prototype

Status: planning

Use this when designing or validating a Dubnium memory subsystem around the local vLLM runtime.

vLLM owns inference. The memory subsystem owns persistence, retrieval, summarization, compaction, artifact references, and replay inputs. Do not make durable memory depend on serialized transformer KV state.

Scope

This runbook covers the first prototype milestone:

  • durable conversation and event storage
  • rolling summaries
  • embeddings for retrieval
  • scoped retrieval
  • externally observable metadata on every stored memory
  • bounded prompt assembly for vLLM

It does not cover multi-agent federation, distributed workflow engines, cryptographic memory attestation, or a pure-Nix packaging path for all services. It also does not adopt Letta or another MemGPT-style agent framework in the first milestone; those belong after the local storage, retrieval, and governance contracts are proven.

Future governance remains external to this runbook. The prototype records metadata and lifecycle events so a later governance substrate can inspect, constrain, attest, or replay behavior, but the prototype does not implement the governance authority itself.

Target Shape

flowchart TD
    U[User or Agent] --> O[Orchestrator]
    O --> W[Working Context]
    O --> R[Retriever]
    O --> T[Task State]
    R --> V[(pgvector)]
    R --> M[(Postgres Memory Tables)]
    O --> L[vLLM]
    L --> S[Summarizer]
    S --> E[Embedding Worker]
    E --> V
    S --> M

Prototype Components

Use conservative local services first:

ConcernPrototype choice
Inferenceexisting vllm.service
Structured storePostgres
Vector searchpgvector
Working contextRedis or Postgres
QueueingRedis Streams initially
Object storagelocal filesystem first, MinIO later
Embeddingsbge-small or nomic-embed

Keep large artifacts outside prompt assembly. Store references to files, logs, and generated outputs, then retrieve and compress only the relevant excerpts.

Data Classes

Working context is transient session state: recent messages, current objective, active plan, unresolved references, and recent tool outputs.

Episodic memory records meaningful historical interactions, such as debugging sessions, deployment history, design discussions, and operational incidents.

Semantic memory records normalized facts, preferences, project conventions, infrastructure topology, and architecture decisions. Do not treat raw transcripts as semantic memory.

Task state records active workflow state: queued work, checkpoints, execution graphs, pending validations, and unresolved actions.

Metadata records where a memory came from, how trusted it appears, how sensitive it appears, how long it should live, and which scopes may retrieve it. A later governance layer can evaluate that metadata, but the Phase 1 memory service only records and exposes it.

Minimum Schema Direction

The first schema should keep memory objects and embeddings separate so memory metadata can evolve without rewriting vector payloads.

Suggested tables:

  • sessions
  • memories
  • memory_embeddings
  • tasks
  • artifacts
  • provenance

Each memory row should include:

{
  "id": "uuid",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "scope": "project:dubnium",
  "importance": 0.82,
  "confidence": 0.76,
  "sensitivity": "internal",
  "validation_status": "unverified",
  "ttl": null,
  "source": "conversation",
  "created_at": "ISO8601",
  "provenance": {
    "origin": "agent",
    "model": "qwen",
    "extractor_version": "1"
  }
}

Retrieval Contract

The retriever should take a scoped request from the orchestrator and return scoped context candidates, not final prompts.

Required filters:

  • project or session scope
  • agent namespace
  • TTL expiration
  • recency

Recommended ranking inputs:

  • vector similarity
  • keyword match
  • recency
  • importance
  • source authority
  • validation status

The context builder should compress results before prompt assembly and preserve citations, artifact references, retrieval event ids, or memory ids so a response can be audited later.

Storage Path

  1. Capture a conversation, tool event, task event, or artifact reference.
  2. Classify the event and reject data that should not become durable memory.
  3. Redact secrets and sensitive payloads.
  4. Summarize the event into a typed memory candidate.
  5. Attach provenance, sensitivity, scope, confidence, and retention metadata.
  6. Embed the memory summary.
  7. Store structured memory and vector data.
  8. Schedule expiration or revalidation when retention metadata requires it.

Retrieval Path

  1. Receive a query and current task scope from the orchestrator.
  2. Embed the query.
  3. Search the vector index and any structured filters.
  4. Apply scope, TTL, and sensitivity filters before re-ranking.
  5. Re-rank by relevance, recency, importance, and source hints.
  6. Compress selected context.
  7. Return context candidates with ids, scope, and provenance.
  8. Assemble the final vLLM prompt outside the retriever.

Validation Checks

Before treating the prototype as useful, test:

  • latency impact on vLLM request path
  • recall quality for prior sessions
  • false recall and hallucinated-memory rate
  • memory poisoning resistance
  • prompt-injection persistence resistance
  • cross-project and cross-agent isolation
  • secret redaction before storage
  • TTL expiration and revalidation behavior
  • replay from stored events and memory ids

Acceptance Criteria

The first milestone is complete when:

  • vLLM can answer with retrieved context without changing vllm.service
  • memory storage survives service restart
  • retrieval can be scoped to one project
  • expired or sensitive memories are excluded from prompt assembly
  • summaries can be traced back to source events or artifacts
  • a replay can reconstruct which memories were available to a response

Artifact Handling

Artifacts are not memory. Store raw binaries outside prompts and retrieve derived context by default:

  • captions
  • OCR text
  • extracted metadata
  • embeddings
  • content hashes
  • artifact references

Use on-demand multimodal inference only when a task needs the binary itself. The retrieval result should carry an artifact reference rather than copying the artifact into ordinary text prompt memory.

Incremental Upgrade: MemGPT / Letta

After the Phase 1 substrate is stable, evaluate MemGPT-style self-editing memory as an orchestration-layer upgrade. Use current Letta documentation when testing concrete framework integration; reserve “MemGPT” for the research pattern unless a legacy component explicitly uses that name.

The evaluation should answer:

  • whether Letta can use Dubnium’s Postgres/pgvector-backed memory stores without bypassing scope, sensitivity, TTL, validation, or provenance filters
  • whether agent-managed memory edits can be audited and replayed
  • whether archival and recall memory operations can preserve Dubnium memory ids and source lineage
  • whether the framework can call local vLLM without requiring model-hosted memory persistence
  • whether rejected, expired, or sensitive memories stay out of generated prompts

Do not adopt the framework if it requires storing ungoverned transcripts, credentials, or tool outputs in durable memory.

References