Persistent Context Memory Architecture

Status: planning

This document describes the long-term persistent context memory architecture for Dubnium’s local vLLM runtime.

Goals

The architecture should:

  • support long-lived conversational and agentic workflows
  • preserve low-latency vLLM inference characteristics
  • separate inference runtime concerns from memory persistence
  • expose enough structure for replay, audit, and policy enforcement
  • operate efficiently on constrained local GPU hardware
  • leave room for Anthesis-style governed agent systems

Future Governance Boundary

A future governance layer remains external to this memory/runtime architecture.

The memory/runtime layer stores, retrieves, summarizes, compacts, and serves context. It records structured metadata and lifecycle events so another layer can inspect, constrain, attest, or replay behavior later.

The future governance layer evaluates policy, provenance, trust, retention, audit, and replay concerns. This document does not define that governance authority.

Dubnium memory/runtime layer
    = stores, retrieves, summarizes, compacts, and serves context

Future governance layer
    = evaluates policy, provenance, trust, retention, audit, and replay concerns

Design implication: memory records, artifacts, retrieval events, and runtime transitions must be structured and externally observable, but vLLM, vector stores, artifact stores, and MemGPT-style runtimes must not depend directly on a future governance substrate.

Core Principle

vLLM is the inference runtime.

Persistent memory is a separate subsystem.

Do not persist transformer KV state as durable memory. KV state can remain an inference optimization inside vLLM. Durable memory must be reconstructable from stored events, summaries, artifacts, metadata, and retrieval records.

flowchart TD
    U[User or Agent] --> O[Orchestrator]
    O --> W[Working Context Buffer]
    O --> R[Retriever]
    O --> T[Task State Store]
    R --> V[(Vector Store)]
    R --> M[(Structured Memory Store)]
    O --> L[vLLM]
    L --> S[Summarizer]
    S --> E[Embedding Pipeline]
    E --> V
    S --> M
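
One way to read the "reconstructable" requirement is as an append-only event log that can be replayed into the current memory view, with no dependence on KV state. The sketch below is a minimal illustration under that assumption. `MemoryEvent`, the `remember`/`forget` event kinds, and `replay` are hypothetical names, not part of Dubnium or vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEvent:
    """Hypothetical append-only event; field names are illustrative."""
    seq: int            # monotonically increasing sequence number
    kind: str           # assumed event kinds: "remember" or "forget"
    memory_id: str
    summary: str = ""


def replay(events: list[MemoryEvent]) -> dict[str, str]:
    """Rebuild the durable memory view purely from stored events."""
    view: dict[str, str] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.kind == "remember":
            view[event.memory_id] = event.summary
        elif event.kind == "forget":
            view.pop(event.memory_id, None)
    return view


log = [
    MemoryEvent(1, "remember", "m1", "Primary workstation uses NixOS"),
    MemoryEvent(2, "remember", "m2", "Temporary debugging note"),
    MemoryEvent(3, "forget", "m2"),
]
print(replay(log))
```

Because the view is derived only from the log, an inference worker can be restarted or replaced without losing durable memory.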

Layers

Inference

Responsibilities:

  • token generation
  • batching
  • prefix caching
  • streaming
  • model lifecycle management

Recommended components:

| Component | Recommendation |
| --- | --- |
| Inference runtime | vLLM |
| Primary models | Qwen, DeepSeek, Llama-family |
| Embeddings | bge-small or nomic-embed |
| Quantization | AWQ or GPTQ initially |

Inference nodes should remain stateless where possible. Durable memory logic does not belong inside inference workers.

Working Context

Working context maintains immediate conversational and task continuity.

It contains recent messages, tool outputs, current objectives, active plans, and unresolved references.

Storage options:

| Option | Use |
| --- | --- |
| Redis | fast transient sessions |
| SQLite | single-user local setups |
| Postgres | unified durable stack |

Recommended strategy:

  • keep the last N conversational turns verbatim
  • keep a rolling summary for older turns
  • keep external references outside the prompt
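
The recommended strategy can be sketched as a bounded buffer that folds evicted turns into a rolling summary. This is an illustration only: `WorkingContext` and `naive_summarize` are hypothetical names, and the summarizer here is a placeholder for a real model call.

```python
from collections import deque
from typing import Callable


class WorkingContext:
    """Keeps the last N turns verbatim and folds older turns into a summary."""

    def __init__(self, max_turns: int, summarize: Callable[[str, str], str]):
        self.max_turns = max_turns
        self.summarize = summarize  # hypothetical: (old_summary, turn) -> new_summary
        self.turns: deque[str] = deque()
        self.summary = ""

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict the oldest turns into the rolling summary once over budget.
        while len(self.turns) > self.max_turns:
            evicted = self.turns.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def build_prompt(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier turns: {self.summary}")
        parts.extend(self.turns)
        return "\n".join(parts)


def naive_summarize(old: str, turn: str) -> str:
    # Placeholder only: a real system would call a summarizer model here.
    return f"{old} | {turn}" if old else turn


ctx = WorkingContext(max_turns=2, summarize=naive_summarize)
for t in ["user: hi", "assistant: hello", "user: fix the build"]:
    ctx.add_turn(t)
print(ctx.build_prompt())
```

External references (artifact URIs, record ids) would be stored alongside the buffer and resolved by the retriever, so they never count against the verbatim turn budget.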

Episodic Memory

Episodic memory stores meaningful historical interactions, such as debugging sessions, deployment history, design discussions, incidents, and user preferences.

Example shape:

{
  "id": "uuid",
  "timestamp": "ISO8601",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "importance": 0.82,
  "ttl": null,
  "source": "conversation",
  "provenance": {
    "model": "qwen",
    "extractor_version": "1"
  }
}
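
A typed version of this shape might look like the sketch below. It assumes `ttl` is a number of seconds and that `null` means the record never expires; the document does not specify either, so treat `ttl_seconds` and `is_expired` as hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EpisodicMemory:
    summary: str
    session_id: str
    importance: float
    ttl_seconds: Optional[int] = None  # assumption: seconds; None = never expires
    source: str = "conversation"
    memory_type: str = "episodic"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    provenance: dict = field(default_factory=dict)

    def is_expired(self, now: datetime) -> bool:
        """True once the TTL window measured from `timestamp` has passed."""
        if self.ttl_seconds is None:
            return False
        return now >= self.timestamp + timedelta(seconds=self.ttl_seconds)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = EpisodicMemory(
    summary="Condensed interaction summary",
    session_id="session-1",
    importance=0.82,
    ttl_seconds=3600,
    timestamp=created,
    provenance={"model": "qwen", "extractor_version": "1"},
)
print(record.is_expired(created + timedelta(hours=2)))
```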

Semantic Memory

Semantic memory stores normalized stable facts and reusable knowledge: infrastructure topology, user preferences, architecture decisions, project conventions, and coding standards.

Semantic memory is not raw transcript storage.

Instead of storing “user mentioned NixOS several times”, store:

{
  "fact": "Primary workstation uses NixOS",
  "confidence": 0.94,
  "scope": "personal-preference"
}

Task State

Task state is active execution state, not conversational memory.

Examples:

  • queued work
  • workflow checkpoints
  • active RFC generation
  • agent plans
  • unresolved actions
  • execution graphs

Task state should be strongly structured. Do not embed executable workflow state inside vector stores.

| Component | Recommendation |
| --- | --- |
| Structured store | Postgres |
| Queueing | RabbitMQ or Redis Streams |
| Workflow engine | Temporal later |
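
"Strongly structured" can be illustrated with an explicit status enum and a transition table, which is the kind of state that fits a relational store and not a vector index. The statuses and allowed transitions below are assumptions made for the example, not a defined Dubnium workflow.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CHECKPOINTED = "checkpointed"
    DONE = "done"


# Assumed transition table; a real workflow engine would own this.
ALLOWED = {
    TaskStatus.QUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.CHECKPOINTED, TaskStatus.DONE},
    TaskStatus.CHECKPOINTED: {TaskStatus.RUNNING},
    TaskStatus.DONE: set(),
}


@dataclass
class TaskState:
    task_id: str
    status: TaskStatus = TaskStatus.QUEUED
    history: list = field(default_factory=list)  # (from, to) pairs for audit and replay

    def transition(self, new_status: TaskStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.history.append((self.status, new_status))
        self.status = new_status


task = TaskState("rfc-draft")
task.transition(TaskStatus.RUNNING)
task.transition(TaskStatus.CHECKPOINTED)
print(task.status, len(task.history))
```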

Retrieval

Retrieval responsibilities:

  • semantic search
  • scoped retrieval
  • ranking
  • filtering
  • relevance compression

flowchart LR
    Q[Query] --> E[Embed Query]
    E --> S[Vector Search]
    S --> R[Re-ranker]
    R --> C[Context Builder]

Retrieval constraints:

| Constraint | Example |
| --- | --- |
| Session scope | only current project |
| TTL | exclude expired memories |
| Agent boundary | isolate agents |
| Recency weighting | prioritize recent events |

The orchestrator constrains retrieval scope and memory assembly. Future governance can inspect the retrieval event stream and stored metadata, but the retriever must remain useful without embedding a governance engine.
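
The constraints above can be applied as hard filters before ranking, with recency as a soft weight. The sketch below uses plain Python and no vector database. The `Candidate` fields, the exponential half-life decay, and the default of 30 days are assumptions chosen for illustration; the document only says to prioritize recent events.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str
    embedding: list[float]
    project: str
    agent: str
    age_days: float
    expires_in_days: Optional[float] = None  # None = no TTL


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, candidates, project, agent, half_life_days=30.0, k=3):
    scored = []
    for c in candidates:
        # Hard filters: session scope and agent boundary.
        if c.project != project or c.agent != agent:
            continue
        # Hard filter: drop expired memories.
        if c.expires_in_days is not None and c.expires_in_days <= 0:
            continue
        # Soft weight: similarity halves every `half_life_days`.
        recency = 0.5 ** (c.age_days / half_life_days)
        scored.append((cosine(query_vec, c.embedding) * recency, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


pool = [
    Candidate("fresh match", [1.0, 0.0], "dubnium", "planner", age_days=0),
    Candidate("old match", [1.0, 0.0], "dubnium", "planner", age_days=30),
    Candidate("other project", [1.0, 0.0], "other", "planner", age_days=0),
    Candidate("expired", [1.0, 0.0], "dubnium", "planner", age_days=0, expires_in_days=-1),
    Candidate("unrelated", [0.0, 1.0], "dubnium", "planner", age_days=0),
]
print([c.text for c in retrieve([1.0, 0.0], pool, "dubnium", "planner", k=2)])
```

Keeping scope, TTL, and agent checks as plain filters in the orchestrator path means the retriever works on its own, while the same fields stay visible to any later governance layer.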

Minimal Stack

| Concern | Technology |
| --- | --- |
| Inference | vLLM |
| Structured data | Postgres |
| Vector search | pgvector |
| Session cache | Redis |
| Object storage | local filesystem first, MinIO later |
| Queueing | Redis Streams first, RabbitMQ later |

Artifact And Binary Memory

Artifacts and memory are distinct concepts.

| Concept | Meaning |
| --- | --- |
| Memory | semantic or cognitive abstraction |
| Artifact | raw external object |
| Evidence | immutable referenced source |
| Context | transient prompt state |
| Knowledge | validated normalized facts |

Raw binaries should not be first-class prompt memory. Instead, binaries remain externalized, semantic extraction feeds retrieval systems, agents retrieve references and derived context, and multimodal inference runs only on demand.

Initial artifact types:

| Type | Examples |
| --- | --- |
| Images | screenshots, whiteboards, diagrams |
| Documents | PDFs, Office docs |
| Audio | recordings, meetings |
| Video | demos, walkthroughs |
| Source bundles | archives, repos |
| Logs | runtime and system logs |
| Structured data | CSV, JSON, YAML |

flowchart TD
    A[Artifact Upload] --> B[Object Storage]
    A --> C[Extraction Pipeline]
    C --> D[OCR]
    C --> E[Captioning]
    C --> F[Metadata Extraction]
    C --> G[Embedding Generation]
    D --> H[Semantic Records]
    E --> H
    F --> H
    G --> H
    H --> I[(Vector Store)]
    H --> J[(Structured Metadata Store)]

Artifact metadata should include content hashes, storage URIs, MIME type, derived captions or OCR, embedding references, provenance, trust hints, and sensitivity hints.
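
A minimal sketch of building such a metadata record at upload time, using only the Python standard library. The `ArtifactMetadata` fields, the `describe_artifact` helper, and the `file:///tmp/artifacts` storage root are hypothetical and cover only part of the list above; captions, OCR, embeddings, and trust hints would be filled in by the extraction pipeline.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field


@dataclass
class ArtifactMetadata:
    content_hash: str
    storage_uri: str
    mime_type: str
    size_bytes: int
    derived_text: str = ""                                  # caption or OCR, filled later
    sensitivity_hints: list = field(default_factory=list)   # e.g. "may-contain-credentials"


def describe_artifact(
    name: str,
    data: bytes,
    storage_root: str = "file:///tmp/artifacts",  # hypothetical location
) -> ArtifactMetadata:
    digest = hashlib.sha256(data).hexdigest()
    mime, _ = mimetypes.guess_type(name)
    return ArtifactMetadata(
        content_hash=f"sha256:{digest}",
        # Content-addressed path: identical bytes map to the same URI.
        storage_uri=f"{storage_root}/{digest}",
        mime_type=mime or "application/octet-stream",
        size_bytes=len(data),
    )


meta = describe_artifact("screenshot.png", b"not a real image")
print(meta.mime_type, meta.content_hash[:15])
```

Addressing the stored object by its hash gives deduplication for free and lets evidence references stay immutable even if the file name changes.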

Binary artifacts create operational risk: screenshots can contain credentials, EXIF metadata can leak location, visual data can be sensitive, retrieved artifacts can amplify exposure, and malicious files can poison extraction pipelines. Controls for these risks belong in the external governance/security layer, but the memory layer must expose enough metadata and hooks to support them.

Multimodal Retrieval

For normal text prompts, retrieve captions, OCR text, semantic embeddings, metadata, and artifact references rather than injecting raw binaries.

When multimodal reasoning is required:

  1. Semantic retrieval locates relevant artifacts.
  2. Artifact references are resolved.
  3. Binaries are attached to VLM requests.
  4. Multimodal inference runs on demand.
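
The four steps can be sketched as a small function that takes each stage as a callable, which keeps the flow independent of any particular vector store or VLM client. All names here (`multimodal_answer`, `search`, `resolve`, `vlm`) are hypothetical, and the stubs stand in for real components.

```python
from typing import Callable


def multimodal_answer(
    question: str,
    search: Callable[[str], list[str]],       # step 1: returns artifact references
    resolve: Callable[[str], bytes],          # step 2: reference -> binary content
    vlm: Callable[[str, list[bytes]], str],   # steps 3-4: request with attachments
    max_artifacts: int = 2,
) -> str:
    refs = search(question)[:max_artifacts]
    binaries = [resolve(ref) for ref in refs]
    return vlm(question, binaries)


# Stubs for illustration; real implementations would hit the vector store,
# object storage, and a vision-language model respectively.
def fake_search(question: str) -> list[str]:
    return ["artifact-a", "artifact-b", "artifact-c"]


def fake_resolve(ref: str) -> bytes:
    return ref.encode("utf-8")


def fake_vlm(question: str, attachments: list[bytes]) -> str:
    return f"answered {question!r} using {len(attachments)} attachment(s)"


print(multimodal_answer("What does the diagram show?", fake_search, fake_resolve, fake_vlm))
```

Capping the number of attached binaries keeps the on-demand path bounded on constrained local GPU hardware.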

Candidate model classes:

| Model | Purpose |
| --- | --- |
| Qwen-VL | local multimodal reasoning |
| CLIP or SigLIP | image-text embeddings |
| Whisper | audio transcription |
| OCR pipelines | document extraction |

OCI-Compatible Future

Dubnium should stay compatible with OCI-style cognition and artifact distribution.

OCI registries are a strong long-term fit for content addressing, distribution, deduplication, signing, provenance layering, immutable references, artifact versioning, and registry federation.

Candidate future artifact classes:

| Artifact class | Example |
| --- | --- |
| Model artifacts | GGUF, safetensors |
| Embedding indexes | vector snapshots |
| Prompt bundles | governed prompts and system policies |
| Memory bundles | exported episodic memory sets |
| Workflow definitions | agent workflows |
| Execution traces | replayable sessions |
| Multimodal artifacts | image, document, and audio evidence |
| Tool contracts | MCP capability manifests |

Long-term direction:

OCI artifact
    = versioned governed cognition object

This allows Dubnium to evolve toward replayable cognition, portable agent state, attestable workflows, signed memory exports, reproducible multimodal sessions, and distributed cognition registries without coupling cognition storage to one database implementation.
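
As a rough illustration of content addressing, the sketch below builds an OCI-style descriptor (`mediaType`, `digest`, `size`) for an exported memory bundle. The media type string and the canonical-JSON serialization are assumptions made for the example; no Dubnium media types are defined in this document.

```python
import hashlib
import json

# Hypothetical media type, used only for illustration.
MEMORY_BUNDLE_MEDIA_TYPE = "application/vnd.example.memory.bundle.v1+json"


def oci_descriptor(payload: dict, media_type: str = MEMORY_BUNDLE_MEDIA_TYPE) -> dict:
    """Serialize deterministically, then describe the blob by digest and size."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


bundle = {"memories": [{"id": "m1", "summary": "Primary workstation uses NixOS"}]}
print(oci_descriptor(bundle))
```

Sorting keys before hashing means two exports with the same logical content produce the same digest, which is what makes deduplication and immutable references workable.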

MemGPT-Style Runtime Evolution

MemGPT-style runtimes remain an incremental upgrade path after the persistent memory substrate is stable. Current Letta documentation describes this lineage as agents with in-context core memory, recall memory, archival memory, and self-editing memory tools.

Do not couple Dubnium directly to Letta or MemGPT internals early. Define stable interfaces first:

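from typing import Any, Protocol

class MemoryRuntime(Protocol):
    # Argument shapes are deliberately left open until the interfaces settle.
    def retrieve(self, *args: Any, **kwargs: Any) -> Any: ...
    def summarize(self, *args: Any, **kwargs: Any) -> Any: ...
    def compact(self, *args: Any, **kwargs: Any) -> Any: ...
    def promote(self, *args: Any, **kwargs: Any) -> Any: ...
    def classify(self, *args: Any, **kwargs: Any) -> Any: ...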

Evolution path:

| Phase | Capability |
| --- | --- |
| 1 | governed retrieval with explicit schemas |
| 2 | rolling summaries, compaction, and bounded working context |
| 3 | reflection, summarization loops, memory promotion, relevance scoring |
| 4 | adaptive retrieval, workflow-aware recall, retrieval planning |
| 5 | portable cognitive runtime artifacts and OCI-packaged memory overlays |

Preserve the distinction between runtime cognition and durable external state. MemGPT-style runtimes should remain replaceable, capability-scoped, inspectable, and externally configurable.

Phases

Phase 1: Minimal Viable Memory

Deliver durable conversation storage, semantic retrieval, and basic summarization on Postgres plus pgvector, with an embedding pipeline, a retrieval API, and rolling conversation summaries.

Phase 2: Structured Memory

Deliver episodic and semantic separation, retrieval filtering, scoped namespaces, metadata tagging, and confidence scoring.

Phase 3: Multi-Agent Coordination

Deliver isolated agent memory, shared collaborative memory, workflow continuity, capability-scoped retrieval, memory federation, execution checkpoints, and task orchestration.

Non-Goals

Avoid initially:

  • serialized GPU KV persistence
  • distributed GPU cache coherence
  • infinite-context simulation
  • recurrent-memory transformer experimentation
  • fully autonomous self-modifying memory

These add substantial complexity and operational instability.

First Milestone

Build a local prototype with:

  • vLLM
  • Qwen coder model
  • Postgres
  • pgvector
  • Redis
  • bge-small embeddings
  • retrieval middleware
  • rolling summaries

Then validate latency, retrieval quality, memory drift, and hallucinated recall before expanding into multi-agent memory systems.
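
Two of these checks can be made concrete with simple metrics over memory ids. The definitions below are assumptions for illustration: retrieval quality is measured as recall@k against a hand-labelled relevant set, and hallucinated recall as the share of memory ids the model cites that do not exist in the store.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hallucinated_recall_rate(cited: list[str], stored: set[str]) -> float:
    """Fraction of cited memory ids that are absent from the store."""
    if not cited:
        return 0.0
    return sum(1 for memory_id in cited if memory_id not in stored) / len(cited)


print(recall_at_k(["m1", "m4", "m2"], {"m1", "m2"}, k=2))
print(hallucinated_recall_rate(["m1", "m9"], {"m1", "m2"}))
```

Tracking both over a fixed query set across releases gives an early signal of memory drift before multi-agent features are added.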