Persistent Context Memory Architecture

Status: planning

This document describes the long-term persistent context memory architecture for Dubnium’s local vLLM runtime.

Goals

The architecture should:

support long-lived conversational and agentic workflows
preserve low-latency vLLM inference characteristics
separate inference runtime concerns from memory persistence
expose enough structure for replay, audit, and policy enforcement
operate efficiently on constrained local GPU hardware
leave room for Anthesis-style governed agent systems

Future Governance Boundary

A future governance layer remains external to this memory/runtime architecture.

The memory/runtime layer stores, retrieves, summarizes, compacts, and serves context. It records structured metadata and lifecycle events so another layer can inspect, constrain, attest, or replay behavior later.

The future governance layer evaluates policy, provenance, trust, retention, audit, and replay concerns. This document does not define that governance authority.

Dubnium memory/runtime layer
    = stores, retrieves, summarizes, compacts, and serves context

Future governance layer
    = evaluates policy, provenance, trust, retention, audit, and replay concerns

Design implication: memory records, artifacts, retrieval events, and runtime transitions must be structured and externally observable. At the same time, vLLM, vector stores, artifact stores, and MemGPT-style runtimes must not depend directly on a future governance substrate.

Core Principle

vLLM is the inference runtime.

Persistent memory is a separate subsystem.

Do not persist transformer KV state as durable memory. KV state can remain an inference optimization inside vLLM. Durable memory must be reconstructable from stored events, summaries, artifacts, metadata, and retrieval records.

flowchart TD
    U[User or Agent] --> O[Orchestrator]
    O --> W[Working Context Buffer]
    O --> R[Retriever]
    O --> T[Task State Store]
    R --> V[(Vector Store)]
    R --> M[(Structured Memory Store)]
    O --> L[vLLM]
    L --> S[Summarizer]
    S --> E[Embedding Pipeline]
    E --> V
    S --> M
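
The reconstructability requirement can be illustrated with a small event-replay sketch. The event kinds, field names, and sample entries below are hypothetical and not part of any defined Dubnium schema; the point is only that durable state is derived from an append-only log, never from KV cache contents.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MemoryEvent:
    """One append-only log entry. Field names are illustrative assumptions."""
    seq: int
    kind: str  # "fact_stored", "fact_retracted", or "summary_written"
    key: str
    payload: Any = None


def rebuild_memory(events: list[MemoryEvent]) -> dict[str, Any]:
    """Replay events in sequence order to reconstruct durable memory state."""
    state: dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.kind in ("fact_stored", "summary_written"):
            state[event.key] = event.payload
        elif event.kind == "fact_retracted":
            state.pop(event.key, None)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return state


# Hypothetical log: the retraction at seq 4 removes the fact stored at seq 3.
log = [
    MemoryEvent(1, "fact_stored", "workstation.os", "NixOS"),
    MemoryEvent(2, "summary_written", "session-1.summary", "Debugged vLLM startup"),
    MemoryEvent(3, "fact_stored", "editor", "vim"),
    MemoryEvent(4, "fact_retracted", "editor"),
]
memory = rebuild_memory(log)
```

Because the function depends only on the log, an external layer could replay the same events and compare results without touching the inference runtime.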

Layers

Inference

Responsibilities:

token generation
batching
prefix caching
streaming
model lifecycle management

Recommended components:

Component	Recommendation
Inference runtime	vLLM
Primary models	Qwen, DeepSeek, Llama-family
Embeddings	bge-small or nomic-embed
Quantization	AWQ or GPTQ initially

Inference nodes should remain stateless where possible. Durable memory logic does not belong inside inference workers.

Working Context

Working context maintains immediate conversational and task continuity.

It contains recent messages, tool outputs, current objectives, active plans, and unresolved references.

Storage options:

Option	Use
Redis	fast transient sessions
SQLite	single-user local setups
Postgres	unified durable stack

Recommended strategy:

keep the last N conversational turns verbatim
keep a rolling summary for older turns
keep external references outside the prompt
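
A minimal sketch of the strategy above, assuming a summarizer callback with the shape `(old_summary, evicted_turn) -> new_summary`. The class name, the callback shape, and the truncating stand-in summarizer are all assumptions; a real deployment would likely call a model for summarization.

```python
from collections import deque
from typing import Callable


class WorkingContext:
    """Keeps the last N turns verbatim and folds older turns into a summary."""

    def __init__(self, max_turns: int, summarize: Callable[[str, str], str]):
        self.recent: deque[str] = deque()
        self.max_turns = max_turns
        self.summary = ""
        self.summarize = summarize  # (old_summary, evicted_turn) -> new_summary

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        # Evict the oldest turns into the rolling summary once over budget.
        while len(self.recent) > self.max_turns:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def render(self) -> str:
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier turns: {self.summary}")
        parts.extend(self.recent)
        return "\n".join(parts)


def naive_summarize(old: str, turn: str) -> str:
    # Stand-in for a model call: keep the first 40 characters of each evicted turn.
    snippet = turn[:40]
    return f"{old} | {snippet}" if old else snippet


ctx = WorkingContext(max_turns=2, summarize=naive_summarize)
for t in ["user: hi", "assistant: hello", "user: deploy status?"]:
    ctx.add_turn(t)
```

External references (artifact URIs, record ids) would be stored alongside this buffer rather than inside the rendered prompt.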

Episodic Memory

Episodic memory stores meaningful historical interactions, such as debugging sessions, deployment history, design discussions, incidents, and user preferences.

Example shape:

{
  "id": "uuid",
  "timestamp": "ISO8601",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "importance": 0.82,
  "ttl": null,
  "source": "conversation",
  "provenance": {
    "model": "qwen",
    "extractor_version": "1"
  }
}
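
One way to hold this shape in code is a dataclass with basic validation. This is a sketch under assumptions the example does not state: `ttl` is treated as a number of seconds (named `ttl_seconds` here), and `importance` is assumed to lie in [0, 1].

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EpisodicMemory:
    summary: str
    session_id: str
    importance: float
    ttl_seconds: Optional[int] = None  # assumption: None means never expires
    source: str = "conversation"
    provenance: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    memory_type: str = "episodic"

    def __post_init__(self) -> None:
        # Assumed range; the example value 0.82 suggests a 0..1 score.
        if not 0.0 <= self.importance <= 1.0:
            raise ValueError("importance must be within [0, 1]")

    def is_expired(self, now: datetime) -> bool:
        if self.ttl_seconds is None:
            return False
        return now >= self.timestamp + timedelta(seconds=self.ttl_seconds)


# Illustrative record with a fixed timestamp so the expiry check is deterministic.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = EpisodicMemory(
    summary="Condensed interaction summary",
    session_id=str(uuid.uuid4()),
    importance=0.82,
    ttl_seconds=3600,
    provenance={"model": "qwen", "extractor_version": "1"},
    timestamp=t0,
)
```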

Semantic Memory

Semantic memory stores normalized stable facts and reusable knowledge: infrastructure topology, user preferences, architecture decisions, project conventions, and coding standards.

Semantic memory is not raw transcript storage.

Instead of storing “user mentioned NixOS several times”, store:

{
  "fact": "Primary workstation uses NixOS",
  "confidence": 0.94,
  "scope": "personal-preference"
}
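
Normalization can be sketched as an upsert keyed on scope and fact text, so repeated mentions collapse into one record. The merge rule used here (keep the highest confidence seen) and the `SemanticStore` name are assumptions for illustration, not a decided policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact: str
    confidence: float
    scope: str


class SemanticStore:
    """In-memory stand-in for a structured fact table."""

    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], Fact] = {}

    def upsert(self, new: Fact) -> Fact:
        # Normalize case and surrounding whitespace so repeats share one key.
        key = (new.scope, new.fact.strip().lower())
        current = self._facts.get(key)
        if current is None or new.confidence > current.confidence:
            self._facts[key] = new
        return self._facts[key]

    def facts(self) -> list[Fact]:
        return list(self._facts.values())


store = SemanticStore()
store.upsert(Fact("Primary workstation uses NixOS", 0.70, "personal-preference"))
store.upsert(Fact("  primary workstation uses nixos ", 0.94, "personal-preference"))
kept = store.upsert(Fact("Primary workstation uses NixOS", 0.50, "personal-preference"))
```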

Task State

Task state is active execution state, not conversational memory.

Examples:

queued work
workflow checkpoints
active RFC generation
agent plans
unresolved actions
execution graphs

Task state should be strongly structured. Do not embed executable workflow state inside vector stores.

Component	Recommendation
Structured store	Postgres
Queueing	RabbitMQ or Redis Streams
Workflow engine	Temporal later
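
"Strongly structured" can mean, for example, an explicit state machine whose transitions are validated before being written. The status names and the transition table below are hypothetical; they only show the kind of structure that does not fit in a vector store.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CHECKPOINTED = "checkpointed"
    DONE = "done"
    FAILED = "failed"


# Hypothetical transition table: which statuses may follow each status.
ALLOWED = {
    TaskStatus.QUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.CHECKPOINTED, TaskStatus.DONE, TaskStatus.FAILED},
    TaskStatus.CHECKPOINTED: {TaskStatus.RUNNING, TaskStatus.FAILED},
    TaskStatus.DONE: set(),
    TaskStatus.FAILED: set(),
}


@dataclass
class Task:
    task_id: str
    status: TaskStatus = TaskStatus.QUEUED
    history: list[TaskStatus] = field(default_factory=list)

    def transition(self, new: TaskStatus) -> None:
        if new not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status.value} -> {new.value}")
        self.history.append(self.status)  # keep prior states for replay and audit
        self.status = new


task = Task("rfc-draft-1")
task.transition(TaskStatus.RUNNING)
task.transition(TaskStatus.CHECKPOINTED)
```

In a Postgres-backed version, the same table could be enforced with a check constraint or inside the transaction that updates the row.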

Retrieval

Retrieval responsibilities:

semantic search
scoped retrieval
ranking
filtering
relevance compression

flowchart LR
    Q[Query] --> E[Embed Query]
    E --> S[Vector Search]
    S --> R[Re-ranker]
    R --> C[Context Builder]
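
The pipeline above can be sketched end to end with toy components. The hand-written two-dimensional vectors stand in for embedding model output, the term-overlap re-ranker stands in for a real re-ranking model, and the character budget stands in for a token budget; all three are simplifying assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query_vec, index, k):
    """Return the k (similarity, text) pairs closest to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rerank(query, candidates):
    """Toy re-ranker: boost candidates that share terms with the query."""
    terms = set(query.lower().split())

    def score(candidate):
        similarity, text = candidate
        overlap = len(terms & set(text.lower().split()))
        return similarity + 0.1 * overlap

    return sorted(candidates, key=score, reverse=True)


def build_context(ranked, max_chars):
    """Take ranked texts in order until the character budget is exhausted."""
    selected, used = [], 0
    for _, text in ranked:
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return selected


index = [
    ("vLLM uses prefix caching", [1.0, 0.0]),
    ("Postgres stores task state", [0.0, 1.0]),
    ("Redis holds session cache", [0.6, 0.8]),
]
query = "prefix caching in vLLM"
query_vec = [1.0, 0.1]  # stand-in for the embedded query

hits = vector_search(query_vec, index, k=2)
context = build_context(rerank(query, hits), max_chars=30)
```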

Retrieval constraints:

Constraint	Example
Session scope	only current project
TTL	exclude expired memories
Agent boundary	isolate agents
Recency weighting	prioritize recent events

The orchestrator constrains retrieval scope and memory assembly. Future governance can inspect the retrieval event stream and stored metadata, but the retriever must remain useful without embedding a governance engine.
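
The constraints in the table can be applied as hard filters followed by a recency-weighted score. This sketch assumes exponential decay with a configurable half-life and multiplies it into the similarity score; the field names, the 30-day default, and the scoring formula are assumptions, not settled design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Candidate:
    text: str
    project: str
    agent_id: str
    created_at: datetime
    expires_at: Optional[datetime]
    similarity: float


def apply_constraints(candidates, *, project, agent_id, now, half_life_days=30.0):
    scored = []
    for c in candidates:
        # Session scope and agent boundary are hard filters.
        if c.project != project or c.agent_id != agent_id:
            continue
        # TTL: drop anything already expired.
        if c.expires_at is not None and c.expires_at <= now:
            continue
        # Recency weighting: the weight halves every half_life_days.
        age_days = max((now - c.created_at).total_seconds() / 86400.0, 0.0)
        recency = 0.5 ** (age_days / half_life_days)
        scored.append((c.similarity * recency, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
pool = [
    Candidate("old", "dubnium", "a1", now - timedelta(days=60), None, 0.90),
    Candidate("fresh", "dubnium", "a1", now, None, 0.50),
    Candidate("expired", "dubnium", "a1", now, now - timedelta(days=1), 0.99),
    Candidate("other-project", "elsewhere", "a1", now, None, 0.99),
    Candidate("other-agent", "dubnium", "a2", now, None, 0.99),
]
result = apply_constraints(pool, project="dubnium", agent_id="a1", now=now)
```

Keeping the filters in the orchestrator, as plain data comparisons, leaves the retriever usable without any governance engine while still producing inputs and outputs that can be logged as retrieval events.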

Minimal Stack

Concern	Technology
Inference	vLLM
Structured data	Postgres
Vector search	pgvector
Session cache	Redis
Object storage	local filesystem first, MinIO later
Queueing	Redis Streams first, RabbitMQ later

Artifact And Binary Memory

Artifacts and memory are distinct concepts.

Concept	Meaning
Memory	semantic or cognitive abstraction
Artifact	raw external object
Evidence	immutable referenced source
Context	transient prompt state
Knowledge	validated normalized facts

Raw binaries should not be first-class prompt memory. Binaries stay in external storage; semantic extraction feeds the retrieval systems; agents retrieve references and derived context; and multimodal inference runs only on demand.

Initial artifact types:

Type	Examples
Images	screenshots, whiteboards, diagrams
Documents	PDFs, Office docs
Audio	recordings, meetings
Video	demos, walkthroughs
Source bundles	archives, repos
Logs	runtime and system logs
Structured data	CSV, JSON, YAML

flowchart TD
    A[Artifact Upload] --> B[Object Storage]
    A --> C[Extraction Pipeline]
    C --> D[OCR]
    C --> E[Captioning]
    C --> F[Metadata Extraction]
    C --> G[Embedding Generation]
    D --> H[Semantic Records]
    E --> H
    F --> H
    G --> H
    H --> I[(Vector Store)]
    H --> J[(Structured Metadata Store)]

Artifact metadata should include content hashes, storage URIs, MIME type, derived captions or OCR, embedding references, provenance, trust hints, and sensitivity hints.
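
A minimal sketch of such a metadata record, built with the standard library's `hashlib` and `mimetypes`. The field names, the default hint values, and the content-addressed URI layout are assumptions for illustration.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArtifactMetadata:
    content_hash: str
    storage_uri: str
    mime_type: str
    size_bytes: int
    derived_text: str = ""  # caption or OCR output, filled in by extraction
    embedding_ref: Optional[str] = None
    provenance: dict = field(default_factory=dict)
    trust_hint: str = "unverified"  # assumed default
    sensitivity_hint: str = "unknown"  # assumed default


def describe_artifact(data: bytes, filename: str, store_root: str) -> ArtifactMetadata:
    digest = hashlib.sha256(data).hexdigest()
    mime, _encoding = mimetypes.guess_type(filename)
    return ArtifactMetadata(
        content_hash=f"sha256:{digest}",
        # Hypothetical layout: address the object by its content hash.
        storage_uri=f"{store_root}/sha256/{digest}",
        mime_type=mime or "application/octet-stream",
        size_bytes=len(data),
        provenance={"original_filename": filename},
    )


meta = describe_artifact(b"fake image bytes", "screenshot.png", "file:///tmp/artifacts")
```

Addressing objects by hash gives deduplication for free and makes the stored reference immutable, which matches the "Evidence" row in the table above.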

Binary artifacts create operational risk: screenshots can contain credentials, EXIF metadata can leak location, visual data can be sensitive, retrieved artifacts can amplify exposure, and malicious files can poison extraction pipelines. Controls for these risks belong in the external governance/security layer, but the memory layer must expose enough metadata and hooks to support them.

Multimodal Retrieval

For normal text prompts, retrieve captions, OCR text, semantic embeddings, metadata, and artifact references rather than injecting raw binaries.

When multimodal reasoning is required:

Semantic retrieval locates relevant artifacts.
Artifact references are resolved.
Binaries are attached to VLM requests.
Multimodal inference runs on demand.
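
The two paths can be sketched as one assembly function: text prompts receive only derived text and references, and binaries are resolved only when vision is requested. The `ArtifactHit` type, the `resolve` callback, and the returned dictionary shape are hypothetical and do not correspond to any particular VLM request format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ArtifactHit:
    uri: str
    caption: str


def assemble_inputs(
    hits: list[ArtifactHit],
    needs_vision: bool,
    resolve: Callable[[str], bytes],
) -> dict:
    # Derived text and references are always safe to include in the prompt.
    texts = [f"{hit.caption} (ref: {hit.uri})" for hit in hits]
    if not needs_vision:
        return {"texts": texts, "attachments": []}
    # Only the multimodal path touches object storage.
    return {"texts": texts, "attachments": [resolve(hit.uri) for hit in hits]}


# Toy object store keyed by URI.
blobs = {"file:///tmp/artifacts/a1": b"\x89PNG..."}
hits = [ArtifactHit("file:///tmp/artifacts/a1", "Whiteboard sketch of the retrieval flow")]

text_only = assemble_inputs(hits, needs_vision=False, resolve=blobs.__getitem__)
with_vision = assemble_inputs(hits, needs_vision=True, resolve=blobs.__getitem__)
```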

Candidate model classes:

Model	Purpose
Qwen-VL	local multimodal reasoning
CLIP or SigLIP	image-text embeddings
Whisper	audio transcription
OCR pipelines	document extraction

OCI-Compatible Future

Dubnium should stay compatible with OCI-style distribution of artifacts and cognition objects.

OCI registries are a strong long-term fit for content addressing, distribution, deduplication, signing, provenance layering, immutable references, artifact versioning, and registry federation.

Candidate future artifact classes:

Artifact class	Example
Model artifacts	GGUF, safetensors
Embedding indexes	vector snapshots
Prompt bundles	governed prompts and system policies
Memory bundles	exported episodic memory sets
Workflow definitions	agent workflows
Execution traces	replayable sessions
Multimodal artifacts	image, document, and audio evidence
Tool contracts	MCP capability manifests

Long-term direction:

OCI artifact
    = versioned governed cognition object

This allows Dubnium to evolve toward replayable cognition, portable agent state, attestable workflows, signed memory exports, reproducible multimodal sessions, and distributed cognition registries without coupling cognition storage to one database implementation.
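
Content addressing is the property that makes this workable. The sketch below builds an OCI-style descriptor (`mediaType`, `digest`, `size`) for an exported memory bundle. The media type string and the bundle layout are hypothetical, and the canonical JSON encoding is an assumption chosen so that logically equal bundles hash identically.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal data hashes equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_descriptor(blob: bytes, media_type: str) -> dict:
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# Hypothetical media type; no such type is registered.
BUNDLE_MEDIA_TYPE = "application/vnd.dubnium.memory.bundle.v1+json"

# Same content, different key order.
bundle_a = {"memories": [{"id": "m1", "summary": "Chose pgvector"}], "schema": 1}
bundle_b = {"schema": 1, "memories": [{"summary": "Chose pgvector", "id": "m1"}]}

desc_a = make_descriptor(canonical_bytes(bundle_a), BUNDLE_MEDIA_TYPE)
desc_b = make_descriptor(canonical_bytes(bundle_b), BUNDLE_MEDIA_TYPE)
```

Both descriptors carry the same digest, so a registry would store the bundle once and any signature over the digest would cover both exports.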

MemGPT-Style Runtime Evolution

MemGPT-style runtimes remain an incremental upgrade path after the persistent memory substrate is stable. Current Letta documentation describes this lineage as agents with in-context core memory, recall memory, archival memory, and self-editing memory tools.

Do not couple Dubnium directly to Letta or MemGPT internals early. Define stable interfaces first:

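from typing import Any, Protocol

class MemoryRuntime(Protocol):
    # Placeholder signatures: argument and return types are not yet specified.
    def retrieve(self, *args: Any, **kwargs: Any) -> Any: ...
    def summarize(self, *args: Any, **kwargs: Any) -> Any: ...
    def compact(self, *args: Any, **kwargs: Any) -> Any: ...
    def promote(self, *args: Any, **kwargs: Any) -> Any: ...
    def classify(self, *args: Any, **kwargs: Any) -> Any: ...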

Evolution path:

Phase	Capability
1	governed retrieval with explicit schemas
2	rolling summaries, compaction, and bounded working context
3	reflection, summarization loops, memory promotion, relevance scoring
4	adaptive retrieval, workflow-aware recall, retrieval planning
5	portable cognitive runtime artifacts and OCI-packaged memory overlays

Preserve the distinction between runtime cognition and durable external state. MemGPT-style runtimes should remain replaceable, capability-scoped, inspectable, and externally configurable.

Avoid the following for now:

serialized GPU KV persistence
distributed GPU cache coherence
infinite-context simulation
recurrent-memory transformer experimentation
fully autonomous self-modifying memory

These add substantial complexity and operational instability.

First Milestone

Build a local prototype with:

vLLM
Qwen coder model
Postgres
pgvector
Redis
bge-small embeddings
retrieval middleware
rolling summaries

Then validate latency, retrieval quality, memory drift, and hallucinated recall before expanding into multi-agent memory systems.
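
Two of these checks are easy to start measuring early. The sketch below computes recall@k against a hand-labelled set of relevant memory ids and collects simple latency statistics for any callable. The function names and the sample ids are illustrative; memory drift and hallucinated recall need labelled evaluation data and are not covered here.

```python
import statistics
import time
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def measure_latency_ms(fn: Callable[[], object], runs: int = 20) -> dict:
    """Call fn repeatedly and report median and worst-case wall-clock time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"p50_ms": statistics.median(samples), "max_ms": max(samples)}


# Illustrative ids: two of the three relevant memories were retrieved at all,
# and only one of them landed in the top two.
retrieved_ids = ["m1", "m7", "m3"]
relevant_ids = {"m1", "m3", "m9"}
score_at_2 = recall_at_k(retrieved_ids, relevant_ids, k=2)
score_at_3 = recall_at_k(retrieved_ids, relevant_ids, k=3)

stats = measure_latency_ms(lambda: sum(range(1000)), runs=5)
```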
