Persistent Context Memory Architecture
Status: planning
This document describes the long-term persistent context memory architecture for Dubnium’s local vLLM runtime.
Goals
The architecture should:
- support long-lived conversational and agentic workflows
- preserve low-latency vLLM inference characteristics
- separate inference runtime concerns from memory persistence
- expose enough structure for replay, audit, and policy enforcement
- operate efficiently on constrained local GPU hardware
- leave room for Anthesis-style governed agent systems
Future Governance Boundary
A future governance layer remains external to this memory/runtime architecture.
The memory/runtime layer stores, retrieves, summarizes, compacts, and serves context. It records structured metadata and lifecycle events so another layer can inspect, constrain, attest, or replay behavior later.
The future governance layer evaluates policy, provenance, trust, retention, audit, and replay concerns. This document does not define that governance authority.
Dubnium memory/runtime layer
= stores, retrieves, summarizes, compacts, and serves context
Future governance layer
= evaluates policy, provenance, trust, retention, audit, and replay concerns
Design implication: memory records, artifacts, retrieval events, and runtime transitions must be structured and externally observable, but vLLM, vector stores, artifact stores, and MemGPT-style runtimes must not depend directly on a future governance substrate.
Core Principle
vLLM is the inference runtime.
Persistent memory is a separate subsystem.
Do not persist transformer KV state as durable memory. KV state can remain an inference optimization inside vLLM. Durable memory must be reconstructable from stored events, summaries, artifacts, metadata, and retrieval records.
flowchart TD
U[User or Agent] --> O[Orchestrator]
O --> W[Working Context Buffer]
O --> R[Retriever]
O --> T[Task State Store]
R --> V[(Vector Store)]
R --> M[(Structured Memory Store)]
O --> L[vLLM]
L --> S[Summarizer]
S --> E[Embedding Pipeline]
E --> V
S --> M
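One way to read the reconstructability requirement is that durable memory is a pure function of an append-only event log. The sketch below is illustrative only: `MemoryEvent`, the three event kinds, and `rebuild_memory` are hypothetical names, not part of Dubnium.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MemoryEvent:
    """One append-only lifecycle event (hypothetical shape)."""
    seq: int                 # monotonically increasing sequence number
    kind: str                # "stored" | "summarized" | "expired" (assumed kinds)
    memory_id: str
    payload: dict[str, Any]


def rebuild_memory(events: list[MemoryEvent]) -> dict[str, dict[str, Any]]:
    """Replay events in sequence order to reconstruct the durable memory view.

    No transformer KV state is involved: the result depends only on the log.
    """
    state: dict[str, dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.kind == "stored":
            state[event.memory_id] = dict(event.payload)
        elif event.kind == "summarized":
            # A summary overwrites the fields it carries and keeps the rest.
            state.setdefault(event.memory_id, {}).update(event.payload)
        elif event.kind == "expired":
            state.pop(event.memory_id, None)
    return state


log = [
    MemoryEvent(1, "stored", "m1", {"text": "long transcript", "source": "conversation"}),
    MemoryEvent(2, "stored", "m2", {"text": "scratch note", "source": "conversation"}),
    MemoryEvent(3, "summarized", "m1", {"text": "short summary"}),
    MemoryEvent(4, "expired", "m2", {}),
]
memory = rebuild_memory(log)
```

Because the replay sorts by `seq`, the same log yields the same memory view regardless of the order in which events are read back, which is the property replay and audit would rely on.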
Layers
Inference
Responsibilities:
- token generation
- batching
- prefix caching
- streaming
- model lifecycle management
Recommended components:
| Component | Recommendation |
|---|---|
| Inference runtime | vLLM |
| Primary models | Qwen, DeepSeek, Llama-family |
| Embeddings | bge-small or nomic-embed |
| Quantization | AWQ or GPTQ initially |
Inference nodes should remain stateless where possible. Durable memory logic does not belong inside inference workers.
Working Context
Working context maintains immediate conversational and task continuity.
It contains recent messages, tool outputs, current objectives, active plans, and unresolved references.
Storage options:
| Option | Use |
|---|---|
| Redis | fast transient sessions |
| SQLite | single-user local setups |
| Postgres | unified durable stack |
Recommended strategy:
- keep the last N conversational turns verbatim
- keep a rolling summary for older turns
- keep external references outside the prompt
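The recommended strategy can be sketched as a bounded buffer that folds evicted turns into a rolling summary. This is a minimal sketch under assumptions: `WorkingContext` and `naive_summarize` are hypothetical names, and the summarizer is a stand-in for a real model call.

```python
from collections import deque
from typing import Callable


class WorkingContext:
    """Keeps the last N turns verbatim and folds older turns into a summary."""

    def __init__(self, max_turns: int, summarize: Callable[[str, str], str]) -> None:
        self.max_turns = max_turns
        self.summarize = summarize      # (old_summary, evicted_turn) -> new_summary
        self.turns: deque[str] = deque()
        self.summary = ""

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        while len(self.turns) > self.max_turns:
            evicted = self.turns.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def render(self) -> str:
        """Build the prompt fragment: summary first, then verbatim turns."""
        parts = [f"[summary] {self.summary}"] if self.summary else []
        return "\n".join(parts + list(self.turns))


def naive_summarize(old_summary: str, turn: str) -> str:
    # Stand-in for a model call: keep the first 20 characters of each turn.
    return (old_summary + " | " if old_summary else "") + turn[:20]


ctx = WorkingContext(max_turns=2, summarize=naive_summarize)
for message in ["user: hi", "assistant: hello", "user: deploy status?"]:
    ctx.add_turn(message)
```

After the third turn, the oldest turn has moved into `ctx.summary` and only the two most recent turns remain verbatim.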
Episodic Memory
Episodic memory stores meaningful historical interactions, such as debugging sessions, deployment history, design discussions, incidents, and user preferences.
Example shape:
{
  "id": "uuid",
  "timestamp": "ISO8601",
  "session_id": "uuid",
  "memory_type": "episodic",
  "summary": "Condensed interaction summary",
  "importance": 0.82,
  "ttl": null,
  "source": "conversation",
  "provenance": {
    "model": "qwen",
    "extractor_version": "1"
  }
}
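A typed record makes the shape above checkable at write time. The sketch below mirrors the example fields; the class name `EpisodicMemory`, the TTL unit (seconds), and the 0 to 1 range for `importance` are assumptions, not decisions recorded in this document.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EpisodicMemory:
    summary: str
    session_id: str
    importance: float
    source: str = "conversation"
    ttl: Optional[int] = None                 # assumed unit: seconds; None = no expiry
    memory_type: str = "episodic"
    provenance: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Assumption: importance is a score in [0, 1], as the 0.82 example suggests.
        if not 0.0 <= self.importance <= 1.0:
            raise ValueError("importance must be between 0 and 1")


record = EpisodicMemory(
    summary="Condensed interaction summary",
    session_id=str(uuid.uuid4()),
    importance=0.82,
    provenance={"model": "qwen", "extractor_version": "1"},
)
serialized = json.dumps(asdict(record))
```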
Semantic Memory
Semantic memory stores normalized stable facts and reusable knowledge: infrastructure topology, user preferences, architecture decisions, project conventions, and coding standards.
Semantic memory is not raw transcript storage.
Instead of storing “user mentioned NixOS several times”, store:
{
  "fact": "Primary workstation uses NixOS",
  "confidence": 0.94,
  "scope": "personal-preference"
}
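To illustrate the difference between storing mentions and storing a normalized fact, the sketch below keeps one record per scope and normalized statement. `Fact`, `SemanticStore`, and the "keep the highest confidence" policy are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact: str
    confidence: float
    scope: str


class SemanticStore:
    """Keeps one record per (scope, normalized fact), not one per mention."""

    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], Fact] = {}

    @staticmethod
    def _key(fact: Fact) -> tuple[str, str]:
        # Lowercase and collapse whitespace so trivial variants share a key.
        return (fact.scope, " ".join(fact.fact.lower().split()))

    def upsert(self, fact: Fact) -> None:
        key = self._key(fact)
        current = self._facts.get(key)
        # Assumed policy: repeated mentions keep the highest-confidence version.
        if current is None or fact.confidence > current.confidence:
            self._facts[key] = fact

    def all(self) -> list[Fact]:
        return list(self._facts.values())


store = SemanticStore()
store.upsert(Fact("Primary workstation uses NixOS", 0.80, "personal-preference"))
store.upsert(Fact("primary workstation uses  NixOS", 0.94, "personal-preference"))
store.upsert(Fact("Primary workstation uses NixOS", 0.60, "personal-preference"))
```

Three mentions collapse into a single stored fact, which is the behavior the section asks for.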
Task State
Task state is active execution state, not conversational memory.
Examples:
- queued work
- workflow checkpoints
- active RFC generation
- agent plans
- unresolved actions
- execution graphs
Task state should be strongly structured. Do not embed executable workflow state inside vector stores.
| Component | Recommendation |
|---|---|
| Structured store | Postgres |
| Queueing | RabbitMQ or Redis Streams |
| Workflow engine | Temporal later |
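"Strongly structured" can be read as: task state has an explicit schema and explicit legal transitions, which a vector store cannot enforce. The following is a sketch only; the status names and the transition graph are assumptions, and a workflow engine would own this logic in a later phase.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CHECKPOINTED = "checkpointed"
    DONE = "done"


# Assumed transition graph; a real workflow engine would own this.
ALLOWED = {
    TaskStatus.QUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.CHECKPOINTED, TaskStatus.DONE},
    TaskStatus.CHECKPOINTED: {TaskStatus.RUNNING},
    TaskStatus.DONE: set(),
}


@dataclass
class TaskState:
    task_id: str
    status: TaskStatus = TaskStatus.QUEUED
    history: list[TaskStatus] = field(default_factory=list)

    def transition(self, new_status: TaskStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        self.history.append(self.status)
        self.status = new_status


task = TaskState(task_id="rfc-draft")
task.transition(TaskStatus.RUNNING)
task.transition(TaskStatus.CHECKPOINTED)
```

An illegal jump, such as moving from `CHECKPOINTED` straight to `DONE`, raises instead of silently corrupting the workflow.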
Retrieval
Retrieval responsibilities:
- semantic search
- scoped retrieval
- ranking
- filtering
- relevance compression
flowchart LR
Q[Query] --> E[Embed Query]
E --> S[Vector Search]
S --> R[Re-ranker]
R --> C[Context Builder]
Retrieval constraints:
| Constraint | Example |
|---|---|
| Session scope | only current project |
| TTL | exclude expired memories |
| Agent boundary | isolate agents |
| Recency weighting | prioritize recent events |
The orchestrator constrains retrieval scope and memory assembly. Future governance can inspect the retrieval event stream and stored metadata, but the retriever must remain useful without embedding a governance engine.
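The four constraints in the table can be applied as plain filters plus a ranking function, with no governance engine involved. The sketch below assumes hypothetical names (`Candidate`, `constrain`) and an exponential recency decay with a one-day half-life; the actual weighting scheme is not specified in this document.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    memory_id: str
    project: str
    agent_id: str
    similarity: float            # score from vector search, assumed in [0, 1]
    age_seconds: float
    ttl_seconds: Optional[float] = None


def constrain(
    candidates: list[Candidate],
    project: str,
    agent_id: str,
    half_life_seconds: float = 86_400.0,
) -> list[Candidate]:
    """Apply scope, TTL, and agent filters, then rank with recency decay."""
    kept = [
        c for c in candidates
        if c.project == project
        and c.agent_id == agent_id
        and (c.ttl_seconds is None or c.age_seconds < c.ttl_seconds)
    ]

    def score(c: Candidate) -> float:
        # Exponential decay: the weight halves every half_life_seconds.
        return c.similarity * math.pow(0.5, c.age_seconds / half_life_seconds)

    return sorted(kept, key=score, reverse=True)


candidates = [
    Candidate("old", "dubnium", "a1", similarity=0.9, age_seconds=172_800),
    Candidate("new", "dubnium", "a1", similarity=0.6, age_seconds=0),
    Candidate("expired", "dubnium", "a1", 0.99, age_seconds=100, ttl_seconds=60),
    Candidate("other-agent", "dubnium", "a2", 0.99, age_seconds=0),
    Candidate("other-project", "misc", "a1", 0.99, age_seconds=0),
]
ranked = [c.memory_id for c in constrain(candidates, "dubnium", "a1")]
```

The two-day-old memory has the higher raw similarity, but after two half-lives its score drops to 0.225, so the fresh memory ranks first.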
Minimal Stack
| Concern | Technology |
|---|---|
| Inference | vLLM |
| Structured data | Postgres |
| Vector search | pgvector |
| Session cache | Redis |
| Object storage | local filesystem first, MinIO later |
| Queueing | Redis Streams first, RabbitMQ later |
Artifact And Binary Memory
Artifacts and memory are distinct concepts.
| Concept | Meaning |
|---|---|
| Memory | semantic or cognitive abstraction |
| Artifact | raw external object |
| Evidence | immutable referenced source |
| Context | transient prompt state |
| Knowledge | validated normalized facts |
Raw binaries should not be first-class prompt memory. Instead, binaries stay in external storage, semantic extraction feeds the retrieval systems, agents retrieve references and derived context, and multimodal inference runs only on demand.
Initial artifact types:
| Type | Examples |
|---|---|
| Images | screenshots, whiteboards, diagrams |
| Documents | PDFs, Office docs |
| Audio | recordings, meetings |
| Video | demos, walkthroughs |
| Source bundles | archives, repos |
| Logs | runtime and system logs |
| Structured data | CSV, JSON, YAML |
flowchart TD
A[Artifact Upload] --> B[Object Storage]
A --> C[Extraction Pipeline]
C --> D[OCR]
C --> E[Captioning]
C --> F[Metadata Extraction]
C --> G[Embedding Generation]
D --> H[Semantic Records]
E --> H
F --> H
G --> H
H --> I[(Vector Store)]
H --> J[(Structured Metadata Store)]
Artifact metadata should include content hashes, storage URIs, MIME type, derived captions or OCR, embedding references, provenance, trust hints, and sensitivity hints.
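A metadata record along these lines might look like the sketch below. The field names, the `sha256:` prefix convention, the hint vocabularies, and the `file:///var/lib/dubnium` root are all assumptions made for illustration.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field


@dataclass
class ArtifactMetadata:
    content_hash: str                   # "sha256:<hex>"
    storage_uri: str
    mime_type: str
    size_bytes: int
    derived_text: str = ""              # caption or OCR output, filled in later
    embedding_ref: str = ""
    provenance: dict = field(default_factory=dict)
    trust_hint: str = "unverified"      # assumed vocabulary
    sensitivity_hint: str = "unknown"   # assumed vocabulary


def describe_artifact(data: bytes, filename: str, store_root: str) -> ArtifactMetadata:
    digest = hashlib.sha256(data).hexdigest()
    mime_type, _ = mimetypes.guess_type(filename)
    return ArtifactMetadata(
        content_hash=f"sha256:{digest}",
        # Content-addressed path: identical bytes map to the same URI.
        storage_uri=f"{store_root}/sha256/{digest}",
        mime_type=mime_type or "application/octet-stream",
        size_bytes=len(data),
        provenance={"original_filename": filename},
    )


meta = describe_artifact(b"fake image bytes", "whiteboard.png", "file:///var/lib/dubnium")
```

Deriving the storage URI from the content hash means re-uploading the same bytes under a different filename does not create a second object.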
Binary artifacts create operational risk: screenshots can contain credentials, EXIF metadata can leak location, visual data can be sensitive, retrieved artifacts can amplify exposure, and malicious files can poison extraction pipelines. Controls for these risks belong in the external governance/security layer, but the memory layer must expose enough metadata and hooks to support them.
Multimodal Retrieval
For normal text prompts, retrieve captions, OCR text, semantic embeddings, metadata, and artifact references rather than injecting raw binaries.
When multimodal reasoning is required:
- Semantic retrieval locates relevant artifacts.
- Artifact references are resolved.
- Binaries are attached to VLM requests.
- Multimodal inference runs on demand.
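The steps above can be sketched as a request builder that always includes captions and references but resolves binaries only when vision is needed. Everything here is hypothetical: `ArtifactHit`, `build_request`, the `artifact://` URI scheme, and the request dictionary shape are illustration only and do not correspond to any VLM API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ArtifactHit:
    uri: str
    caption: str


def build_request(
    prompt: str,
    hits: list[ArtifactHit],
    needs_vision: bool,
    resolve: Callable[[str], bytes],
) -> dict:
    """Text prompts get captions and references; vision prompts also get bytes."""
    request = {
        "prompt": prompt,
        "context": [f"{h.caption} (ref: {h.uri})" for h in hits],
        "attachments": [],
    }
    if needs_vision:
        # Binaries are fetched only at this point, never kept in prompt memory.
        request["attachments"] = [resolve(h.uri) for h in hits]
    return request


fake_store = {"artifact://a1": b"\x89PNG..."}
hits = [ArtifactHit("artifact://a1", "Whiteboard sketch of the retrieval flow")]

text_request = build_request("Summarize the design", hits, False, fake_store.__getitem__)
vision_request = build_request("What is drawn top-left?", hits, True, fake_store.__getitem__)
```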
Candidate model classes:
| Model | Purpose |
|---|---|
| Qwen-VL | local multimodal reasoning |
| CLIP or SigLIP | image-text embeddings |
| Whisper | audio transcription |
| OCR pipelines | document extraction |
OCI-Compatible Future
Dubnium should stay compatible with OCI-style cognition and artifact distribution.
OCI registries are a strong long-term fit for content addressing, distribution, deduplication, signing, provenance layering, immutable references, artifact versioning, and registry federation.
Candidate future artifact classes:
| Artifact class | Example |
|---|---|
| Model artifacts | GGUF, safetensors |
| Embedding indexes | vector snapshots |
| Prompt bundles | governed prompts and system policies |
| Memory bundles | exported episodic memory sets |
| Workflow definitions | agent workflows |
| Execution traces | replayable sessions |
| Multimodal artifacts | image, document, and audio evidence |
| Tool contracts | MCP capability manifests |
Long-term direction:
OCI artifact
= versioned governed cognition object
This allows Dubnium to evolve toward replayable cognition, portable agent state, attestable workflows, signed memory exports, reproducible multimodal sessions, and distributed cognition registries without coupling cognition storage to one database implementation.
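Content addressing is the piece of this direction that can be prototyped without a registry. The sketch below builds an OCI-style content descriptor (`mediaType`, `digest`, `size`) for an exported memory bundle; the media type string and the bundle layout are hypothetical, since Dubnium has not defined either.

```python
import hashlib
import json


def canonical_bytes(bundle: dict) -> bytes:
    """Serialize deterministically so equal bundles hash to the same digest."""
    return json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")


def describe_bundle(bundle: dict, media_type: str) -> dict:
    """Build an OCI-style content descriptor: mediaType, digest, size."""
    blob = canonical_bytes(bundle)
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# Hypothetical media type; Dubnium has not defined one.
MEDIA_TYPE = "application/vnd.dubnium.memory.bundle.v1+json"

bundle_a = {"memories": [{"id": "m1", "summary": "Deployed v2"}], "version": 1}
bundle_b = {"version": 1, "memories": [{"id": "m1", "summary": "Deployed v2"}]}

descriptor = describe_bundle(bundle_a, MEDIA_TYPE)
```

Because serialization is canonical, `bundle_a` and `bundle_b` produce the same digest despite different key order, which is what makes deduplication and immutable references workable.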
MemGPT-Style Runtime Evolution
MemGPT-style runtimes remain an incremental upgrade path after the persistent memory substrate is stable. Current Letta documentation describes this lineage as agents with in-context core memory, recall memory, archival memory, and self-editing memory tools.
Do not couple Dubnium directly to Letta or MemGPT internals early. Define stable interfaces first:
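class MemoryRuntime:
    """Stable memory interface; signatures are intentionally left open here."""

    def retrieve(self, *args, **kwargs): ...
    def summarize(self, *args, **kwargs): ...
    def compact(self, *args, **kwargs): ...
    def promote(self, *args, **kwargs): ...
    def classify(self, *args, **kwargs): ...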
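To show why the interface comes first, the sketch below has an orchestrator-side function depend only on a structural interface, with a toy backend behind it. It covers just two of the five methods, and the signatures, `MemoryRuntimeLike`, and `InMemoryRuntime` are assumptions for illustration rather than a proposed API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MemoryRuntimeLike(Protocol):
    # Only two of the five methods, with assumed signatures, to keep the sketch short.
    def retrieve(self, query: str, limit: int) -> list[str]: ...
    def compact(self, max_items: int) -> int: ...


class InMemoryRuntime:
    """Toy backend; a Letta- or MemGPT-backed adapter could replace it."""

    def __init__(self) -> None:
        self.items: list[str] = []

    def retrieve(self, query: str, limit: int) -> list[str]:
        # Naive substring match stands in for real retrieval.
        matches = [item for item in self.items if query.lower() in item.lower()]
        return matches[:limit]

    def compact(self, max_items: int) -> int:
        """Drop the oldest items beyond max_items; return how many were dropped."""
        dropped = max(0, len(self.items) - max_items)
        self.items = self.items[dropped:]
        return dropped


def answer_with_memory(runtime: MemoryRuntimeLike, query: str) -> list[str]:
    # The orchestrator depends only on the interface, not on the backend.
    return runtime.retrieve(query, limit=2)


runtime = InMemoryRuntime()
runtime.items = ["NixOS workstation", "Postgres upgrade", "nixos flake pinning"]
found = answer_with_memory(runtime, "nixos")
dropped = runtime.compact(max_items=2)
```

Swapping the backend then means writing another class with the same methods; `answer_with_memory` does not change.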
Evolution path:
| Phase | Capability |
|---|---|
| 1 | governed retrieval with explicit schemas |
| 2 | rolling summaries, compaction, and bounded working context |
| 3 | reflection, summarization loops, memory promotion, relevance scoring |
| 4 | adaptive retrieval, workflow-aware recall, retrieval planning |
| 5 | portable cognitive runtime artifacts and OCI-packaged memory overlays |
Preserve the distinction between runtime cognition and durable external state. MemGPT-style runtimes should remain replaceable, capability-scoped, inspectable, and externally configurable.
Phases
Phase 1: Minimal Viable Memory
Deliver durable conversation storage, semantic retrieval, basic summarization, and rolling conversation summaries, built on Postgres plus pgvector with an embedding pipeline and a retrieval API.
Phase 2: Structured Memory
Deliver episodic and semantic separation, retrieval filtering, scoped namespaces, metadata tagging, and confidence scoring.
Phase 3: Multi-Agent Coordination
Deliver isolated agent memory, shared collaborative memory, workflow continuity, capability-scoped retrieval, memory federation, execution checkpoints, and task orchestration.
Non-Goals
Avoid initially:
- serialized GPU KV persistence
- distributed GPU cache coherence
- infinite-context simulation
- recurrent-memory transformer experimentation
- fully autonomous self-modifying memory
These add substantial complexity and operational instability.
First Milestone
Build a local prototype with:
- vLLM
- Qwen coder model
- Postgres
- pgvector
- Redis
- bge-small embeddings
- retrieval middleware
- rolling summaries
Then validate latency, retrieval quality, memory drift, and hallucinated recall before expanding into multi-agent memory systems.
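The validation step could start from a few small measurements. The sketch below is one possible framing under assumptions: it treats retrieval quality as recall at k, hallucinated recall as cited memory ids that do not exist in the store, and latency as a 95th-percentile wall-clock time. The function names are hypothetical, and memory drift is not covered.

```python
import time
from statistics import quantiles
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant memories that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hallucinated_recall(cited_ids: list[str], stored_ids: set[str]) -> list[str]:
    """Memory ids the model cited that do not exist in the store."""
    return [m for m in cited_ids if m not in stored_ids]


def p95_latency_ms(fn: Callable[[], object], runs: int = 20) -> float:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


recall = recall_at_k(["m3", "m1", "m9"], relevant={"m1", "m2"}, k=2)
phantoms = hallucinated_recall(["m1", "m7"], stored_ids={"m1", "m2", "m3"})
latency = p95_latency_ms(lambda: sum(range(1000)))
```

In a real run, the lambda passed to `p95_latency_ms` would wrap a full retrieve-and-generate call against the prototype stack.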