Roadmap — AOSIQ

§ 01 / Legend

How to read this

Six status values describe each item's position in the work pipeline. They do not describe calendar dates. An item marked next ships before an item marked later regardless of when either actually arrives.

Shipped

Live in the released runtime

Active

Currently being built

Next

Committed to upcoming release cycle

Later

Committed but not next

Researching

Exploring whether or how to build

Out of scope

Deliberately not pursued, with reason

§ 02 / Position

Where we are

AOSIQ is at v0.8.0 — a production-shaped alpha. The governance substrate is complete for both actor types: capability narrowing, tamper-evident audit, approval gates, composite crash recovery, cost ledger, multi-backend LLM abstraction, anti-hallucination evidence stack, prompt-injection defenses, sandboxed code execution, and the DeterministicActor primitive are all shipped. The runtime now governs reasoning agents and scheduled / on-demand deterministic actors under one regime.

With the substrate complete, the recent architectural focus has moved to first verticals — domain agent classes that compose the substrate into solutions an operator recognizes. OperationalAdviceAgent v1 (item 18) shipped as the canonical pattern: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Every future vertical follows this same shape; the substrate doesn't change.

The runtime is no longer the bottleneck. The current focus is empirical validation (a historical-incident replay harness for the advisor), the long tail of native tools, and customer-side integration patterns.

§ 05 of the threat model covers security-mitigation roadmap items specifically. This page is broader, covering the full product trajectory across both actor types.

§ 03 / Foundation

What's shipped

The substrate the rest of the platform builds on. All items below are in the released runtime and verified against tests.

Core runtime

Capability authorization

JWT capability tokens with intersection-narrowing delegation. Verification fires before every tool call, every memory operation, every spawn.

Shipped
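A minimal sketch of intersection narrowing, assuming capabilities can be modelled as a set of strings. The real tokens are JWTs; the function names `narrow` and `authorize` and the capability strings are hypothetical, chosen only for illustration.

```python
def narrow(parent_caps: frozenset[str], requested: frozenset[str]) -> frozenset[str]:
    """Return the capabilities a delegated child may hold."""
    # Intersection: the child receives only what it requested AND the parent holds,
    # so delegation can shrink a grant but never widen it.
    return parent_caps & requested


def authorize(caps: frozenset[str], capability: str) -> None:
    """Raise before the guarded operation runs if the capability is missing."""
    if capability not in caps:
        raise PermissionError(f"capability not granted: {capability}")


# Hypothetical capability names.
parent = frozenset({"tool:log_query", "tool:db_query", "memory:read"})
child = narrow(parent, frozenset({"tool:log_query", "tool:send_email"}))
```

Here the child ends up with `tool:log_query` only: it asked for `tool:send_email`, but the parent never held it.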

Tamper-evident audit chain

Per-session SHA-256 hash chain with anchor objects in independently-credentialed object storage. Mid-chain tampering detectable.

Shipped
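The hash-chain idea can be sketched as follows. This is an illustration under assumptions, not the runtime's schema: the row layout, the `GENESIS` value, and the function names are hypothetical, and anchoring to separate object storage is omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for an empty chain


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash a row together with the hash of the row before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Add a row whose hash commits to the previous row's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edited payload breaks its own stored hash."""
    prev = GENESIS
    for row in chain:
        if row["hash"] != entry_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True
```

A check like this catches an edit to a single row. Someone who can rewrite every hash after the edit would still pass it, which is why the anchors live in independently credentialed storage.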

Mandatory approval gate

Tools registered reversible=False require explicit operator approval bound to (tool, args_hash). Single-use, replay-safe.

Shipped
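A sketch of what binding an approval to (tool, args_hash) and making it single-use might look like. The `ApprovalStore` class, its methods, and the canonical-JSON hashing are assumptions for illustration; the runtime's actual storage and API are not described on this page.

```python
import hashlib
import json


def args_hash(args: dict) -> str:
    """Hash a canonical JSON form so equal arguments always hash equally."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalStore:
    """Hypothetical in-memory store of operator approvals."""

    def __init__(self) -> None:
        self._pending: set[tuple[str, str]] = set()

    def approve(self, tool: str, args: dict) -> None:
        # The approval covers this exact tool with these exact arguments.
        self._pending.add((tool, args_hash(args)))

    def consume(self, tool: str, args: dict) -> bool:
        # Single-use: a matching approval is removed as it is spent,
        # so replaying the same call needs a fresh approval.
        key = (tool, args_hash(args))
        if key in self._pending:
            self._pending.remove(key)
            return True
        return False
```

Because the key includes the argument hash, an approval to delete one path cannot be reused to delete another.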

Composite crash recovery

LangGraph thread state, agent control block, and working memory captured atomically. Worker heartbeat + orphan reaper handle worker crashes.

Shipped

Cost ledger with hard ceilings

Per-call recording with model, tokens, and computed USD. Configurable session ceilings raise exceptions before the API call.

Shipped
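One way a pre-call ceiling check could work, sketched under assumptions: the class names, the use of an estimated cost, and the row fields are hypothetical. The page only states that rows record model, tokens, and USD, and that the ceiling raises before the API call.

```python
from decimal import Decimal


class CostCeilingExceeded(RuntimeError):
    """Raised instead of making a call that would breach the session ceiling."""


class CostLedger:
    def __init__(self, ceiling_usd: Decimal) -> None:
        self.ceiling_usd = ceiling_usd
        self.rows: list[dict] = []

    @property
    def spent_usd(self) -> Decimal:
        return sum((row["usd"] for row in self.rows), Decimal("0"))

    def check(self, estimated_usd: Decimal) -> None:
        # Called BEFORE the backend request, so an over-budget call is never sent.
        if self.spent_usd + estimated_usd > self.ceiling_usd:
            raise CostCeilingExceeded(
                f"{self.spent_usd} spent + {estimated_usd} estimated "
                f"exceeds ceiling {self.ceiling_usd}"
            )

    def record(self, model: str, tokens: int, usd: Decimal) -> None:
        # Called after the request returns, with the computed cost.
        self.rows.append({"model": model, "tokens": tokens, "usd": usd})
```

`Decimal` is used here to avoid floating-point drift when many small charges are summed.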

Three-mode HTTP authentication

enforced / warn / disabled via env var. Production deployments require enforced; dev runs in warn with a response header tripwire.

Shipped

LLM abstraction & agents

Six LLM backends

Anthropic, AWS Bedrock, OpenAI, Google Gemini (AI Studio), local Ollama, Claude Code CLI shim — selected at construction via factory. Each backend lazy-imports its optional dependency; Gemini ships with the [google_genai] extra.

Shipped

Per-class data-egress envelopes

allowed_backends per agent class. Mismatches raise at construction; an Ollama-only class cannot accidentally route to Anthropic.

Shipped
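A construction-time check of this kind could look like the sketch below. The base class, the subclass, and the backend identifier strings are hypothetical; only the `allowed_backends` attribute name comes from this page.

```python
class AgentClass:
    """Hypothetical base class; subclasses declare their egress envelope."""

    allowed_backends: frozenset[str] = frozenset()

    def __init__(self, backend: str) -> None:
        # Fail at construction, before any prompt or data leaves the process.
        if backend not in self.allowed_backends:
            raise ValueError(
                f"{type(self).__name__} may not use backend {backend!r}; "
                f"allowed: {sorted(self.allowed_backends)}"
            )
        self.backend = backend


class LocalOnlyAgent(AgentClass):
    # Backend identifiers here are illustrative strings.
    allowed_backends = frozenset({"ollama"})
```

Checking in the constructor means a misconfiguration surfaces when the agent is built rather than on its first model call.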

Seven-layer anti-hallucination stack

Schema-bound tool calls, Pydantic validation, audit-row evidence verification, evidence stamps, loop guards, force-investigate gate, abandonment after refusal.

Shipped

Direct prompt-injection defense

Untrusted-content delimiters around every tool result, plus pattern detection for tool-call syntax, role prefixes, and known injection phrases.

Shipped
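The two mechanisms, delimiters and pattern detection, can be illustrated with a short sketch. The delimiter tag, the three patterns, and the function names are assumptions made for this example; the runtime's actual delimiter format and pattern list are not given on this page.

```python
import re

OPEN, CLOSE = "<untrusted_content>", "</untrusted_content>"  # hypothetical tags

# Illustrative patterns only: a role prefix, a known injection phrase,
# and tool-call-like syntax.
PATTERNS = [
    re.compile(r"(?im)^\s*(system|assistant|user)\s*:"),
    re.compile(r"(?i)ignore (all |any )?(previous|prior) instructions"),
    re.compile(r"(?i)<\s*/?\s*tool_call\b"),
]


def wrap_untrusted(text: str) -> str:
    """Delimit a tool result so the model can tell data from instructions."""
    # Remove embedded delimiters so the content cannot close the wrapper early.
    cleaned = text.replace(OPEN, "[delimiter removed]").replace(CLOSE, "[delimiter removed]")
    return f"{OPEN}\n{cleaned}\n{CLOSE}"


def flag_injection(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in PATTERNS if p.search(text)]
```

Pattern matching of this sort only catches known shapes in a single result, which is why multi-turn injection is listed separately under research.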

Multi-persona swarm orchestration

Parent agents fan out one child per declared persona, each running with persona-overlay system prompts; structured aggregation across children's proposals.

Shipped

Five swarm agent classes

BugHunt, CodeReview, ArchitectureDecision, FeatureDesign, ReleaseReadiness. All read-only; all produce structured proposals via BaseProposal.

Shipped

Knowledge & integration

Knowledge base substrate

pgvector-backed document store with HNSW index, semantic search, and audit-evidence integration. Operator-loaded corpora.

Shipped

MCP server (governance as MCP tools)

Eleven AOSIQ governance operations exposed as MCP tools so any MCP client (Claude Code, Cursor, Cline) can dispatch governed swarms.

Shipped

MCP client bridge for external tools

Agents reach external knowledge bases, internal APIs, and consumer services through MCP. Bridged tools inherit capability, audit, and approval.

Shipped

Knowledge base production layer

Metadata filtering, hybrid vector + keyword search via pgvector and PostgreSQL tsvector, markdown-aware chunking with stable citation anchors (heading_path, heading_anchor, position_in_doc), incremental ingestion via content-hash, corpus introspection, and source-URI-prefix delete. Migrations 017–020. Closes the gap between minimum-viable retrieval and production-grade RAG.

Shipped
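Incremental ingestion via content hash can be sketched as a planning step that compares stored digests with incoming documents. The function names and the dictionary shapes are hypothetical; the actual ingestion pipeline and schema are not described here.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(stored: dict[str, str], incoming: dict[str, str]) -> dict[str, list[str]]:
    """Classify incoming documents by source URI.

    stored maps source URI to the digest recorded at the last ingest;
    incoming maps source URI to the current document text.
    """
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": []}
    for uri, text in incoming.items():
        digest = content_hash(text)
        if uri not in stored:
            plan["new"].append(uri)
        elif stored[uri] != digest:
            plan["changed"].append(uri)  # re-chunk and re-embed
        else:
            plan["unchanged"].append(uri)  # skip: no embedding cost
    return plan
```

Only documents in the new and changed buckets would need chunking and embedding again.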

Actor model & execution

Sandboxed code execution (run_python)

Container-isolated Python with no network, ephemeral filesystem, hard resource limits, non-root, and a curated package set. Replaces over-broad bash grants for the common case. Reversible by construction. Factored so additional sandboxed languages (e.g. bash_sandboxed, item 55) plug in as language-specific subdirectories — the language-agnostic invoker stays unchanged. Compute-time attribution via a new cost_model field on the handler registry, billed through the same ledger that records LLM token cost. Execution surface for both reasoning agents and deterministic actors that need isolated compute.

Shipped

DeterministicActor primitive

First-class governed entity for non-reasoning automation: scheduled jobs, monitoring scripts, ETL pipelines. Registered Python functions are dispatched through a runner that sits alongside AgentRunner and shares the same scheduler, capability tokens, audit chain, approval gate, and cost ledger. A new compute_ms cost type bills wall-clock execution. The contract is idempotency up to the first irreversible call: the function re-runs from the top after a human approves. This completes the runtime's actor model so governance properties extend to all automation, not just LLM-backed agents.

Shipped
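The re-run-from-the-top contract can be illustrated with a toy model. Everything below is hypothetical (the `Gate` class, the `ApprovalRequired` exception, the `run` helper); it shows only why steps before the first irreversible call must be safe to repeat.

```python
class ApprovalRequired(Exception):
    """Signals that an irreversible call is waiting on a human."""


class Gate:
    def __init__(self) -> None:
        self.approved = False

    def irreversible(self, action: str) -> str:
        if not self.approved:
            raise ApprovalRequired(action)
        return f"done: {action}"


def rotate_logs(gate: Gate, trace: list[str]) -> str:
    trace.append("scan")  # idempotent step: repeating it changes nothing
    return gate.irreversible("delete old logs")


def run(actor, gate: Gate, trace: list[str]):
    """Run the actor from the top; return None if it paused for approval."""
    try:
        return actor(gate, trace)
    except ApprovalRequired:
        return None
```

The first run pauses at the irreversible call. After approval the function starts again from its first line, so the scan step executes twice, which is acceptable only because it is idempotent.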

OperationalAdviceAgent v1 (first vertical)

First domain vertical built on the runtime's full surface. A reasoning agent diagnoses operational incidents across six named scenarios (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression) by spawning a pack of deterministic diagnostic actors — log query, threshold evaluation, configuration introspection. The actors gather and structure data; the advisor forms hypotheses and emits a single structured proposal humans review. Read-only by capability; remediation is a separate downstream agent. Out-of-scope is a first-class outcome — the advisor refuses honestly when an incident doesn't match the six scenarios, schema-enforced to carry no proposed actions. Locks in the canonical pattern every subsequent vertical will follow: reasoning advisor over deterministic actor pack, structured proposal as terminal output.

Shipped

Plus 37 additional shipped items not individually listed: capability templates, cookbook examples, dashboard views, threat model document, migrations 001–021, test infrastructure, deployment scaffolding. The full set is verifiable in the codebase.

§ 04 / Active

What's in flight now

Nothing. The OperationalAdviceAgent v1 vertical (item 18) shipped alongside the two architectural sprints it depended on — sandboxed execution (item 17) and the DeterministicActor primitive (item 19). The next priority is the historical-incident replay harness that validates the advisor's recommendations against a corpus of known incidents (queued in § 05 / Next); it didn't block the v1 vertical from landing because the cookbook example and end-to-end tests work against synthetic data.

§ 05 / Next

What's committed to next

Items scoped, prioritized, and waiting in the queue behind active sprints. Each is independently shippable; the order reflects leverage and dependency rather than calendar.

Native tool catalog

Document fetcher with format-aware extraction

PDF, DOCX, HTML, CSV, XLSX → structured content (title, body, metadata, tables). Closes the "agents can't read documents reliably" gap.

Time and date tool family

Deterministic date arithmetic, timezone reasoning, elapsed-time computation. Eliminates a known class of LLM failure on temporal questions.

Notification dispatcher

Slack, email, PagerDuty, webhook channels. All reversible=False by default — sending a message is irreversible — so the approval gate prevents agent-driven message spam.

Database query tool

Read-only, schema-aware, capability-scoped per connection. Operator-registered connections become db_query@reporting_replica-style scoped tools.

Structured logging tool

Queryable observation log distinct from the immutable forensic audit chain. Operators get console.log-style observability without compromising audit integrity.

Validation tool family

JSON schema, regex, URL, email, IBAN, format validation. Lets agents self-check output against deterministic validators before emitting proposals.
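As an example of the kind of deterministic validator this family could include, here is a sketch of the standard IBAN mod-97 check. The function name is hypothetical, and per-country length rules are deliberately left out, so this is a checksum test rather than a full validator.

```python
def iban_is_valid(iban: str) -> bool:
    """Checksum-only IBAN validation (mod-97); country formats are not checked."""
    compact = iban.replace(" ", "").upper()
    # Loose structural bounds; real validators also check per-country lengths.
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

An agent could call a validator like this on a proposed value and include the pass or fail result as evidence, instead of asserting validity from the model's own judgment.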

Sandboxed shell execution (bash_sandboxed)

Docker-isolated shell command execution. Same isolation pattern as run_python (item 17): no network, no host filesystem, hard resource limits, non-root. Replaces broad bash grants for agents that need CLI tools (jq, awk, kubectl, etc.) without unrestricted shell access. Second instance of the sandboxed-execution primitive family the runtime is converging on.

Backend & integration

Knowledge base ingest scanner

Pre-embedding scan for prompt-injection patterns, secrets, PII, malicious content. Operator review queue for flagged documents. Closes the KB-poisoning gap from the threat model.

Agents & cookbook

CodeChangeAgent

Second vertical class. Produces structured change descriptions for developer review — affected programs, current code, proposed approach, test plan. Read-only; never writes code.

Field-test feedback loop

Operator worked / partial / failed buttons on every proposal; reinforces or weakens the underlying experiential memory. Weekly aggregation surfaces calibration drift.

50-incident replay harness

Validates new agent classes against curated historical scenarios before promotion. Pass-rate threshold and confidence-calibration measurement are part of the definition of done.

§ 06 / Later

What comes after

Items scoped and committed, but not in the next release cycle. Sequenced behind the work above.

Tool catalog continued

Source code reader

Tree-sitter-aware extraction returning symbol graphs rather than raw text. Lets agents navigate codebases by structure, not by token-count.

Later

Diff and patch tools

Structured diffs as first-class proposal artifacts. Diff is the proposal; patch application happens through standard change-control, not the agent.

Later

Allow-listed safe shell

Operator-defined command allow-list as defense-in-depth over capability checks. Bash with grep, find, git log, kubectl get — but not arbitrary commands.

Later

Structured HTTP client

Per-domain rate limits, response-size caps, optional caching, structured response objects. Cuts repeat calls and gives operators visibility into outbound traffic.

Later

Operator infrastructure

Replay harness

aos-replay re-runs a recorded session against a different LLM backend or different prompt and compares outcomes. Critical for evaluating model upgrades.

Later

Capability auditor

Walks registered tools, agent classes, and production tokens; reports unused capabilities and over-broad grants. Helps operators tighten capabilities over time.

Later

Cost report generator

Queryable rollups across sessions by agent class, time window, and operator. For finance and capacity planning, not per-session inspection.

Later

Approval queue exporter

Daily and weekly export of pending and resolved approvals for compliance teams: every destructive action an agent attempted in a window, with outcome.

Later

Production hardening

Performance characterization

Published p50/p99 latency, soak test results, capacity planning under realistic concurrency. Required for first-customer production deployment at scale.

Later

Multi-key authentication with rotation

Per-caller bearer tokens, independent audit attribution, rotation without downtime. Resolves the residual exposure noted in the threat model.

Later

OIDC and mTLS for production

Federated identity for API callers and mutual TLS as an alternative to bearer tokens. Targets enterprise deployments where bearer tokens alone are insufficient.

Later

§ 07 / Research

Where we're exploring

Open questions where the right answer isn't yet clear. We're prototyping and learning rather than committing. Items here graduate to next or later when scope is confidently bounded — or to out of scope if the right answer is "not us."

Judge-model pattern for reasoning-redirection injection

High-severity proposals are routed to a second LLM call that sees only the proposal and evidence. This would close the largest open category in the prompt-injection threat surface, but the architecture has significant ergonomic and cost implications.

Researching

Multi-turn injection detection

Adversaries who spread content across many tool calls, steering reasoning gradually, evade single-result pattern detection. No good general defense exists today; we're tracking the research literature.

Researching

Capability-token-bound output redaction

Detect JWT-shaped or API-key-shaped strings in tool output and redact before LLM exposure. Prevents agents being convinced to leak their own credentials.

Researching
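Since this item is still in research, the following is only an illustration of the idea, not a proposed design. It assumes a JWT-shaped string means three dot-separated base64url segments whose first segment starts with `eyJ` (the base64url encoding of `{"`); the pattern, placeholder text, and function name are hypothetical, and API-key shapes are not covered.

```python
import re

# Three base64url segments separated by dots, with a header starting "eyJ".
JWT_SHAPED = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def redact(text: str) -> str:
    """Replace JWT-shaped substrings before the text is shown to the model."""
    return JWT_SHAPED.sub("[REDACTED_TOKEN]", text)


sample = "token=eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJhIn0.c2ln end"
cleaned = redact(sample)
```

A shape-based filter will produce both false positives and false negatives, which is part of what keeps this item in research.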

Per-class sandbox resource limits via token claims

Currently sandbox limits are global env vars. Per-class overrides through capability-token claims would let critical agents get more resources than experimental ones, but the audit shape needs design work.

Researching

Domain-specific cookbook entries

Beyond the generic incident-response cookbook, full reference implementations for specific verticals. Which verticals first depends on customer signal.

Researching

Hybrid actors

An actor with a deterministic main execution path and a reasoning hop at one or more specific decision points. Common shape: a deterministic monitor that calls an LLM only to classify ambiguous signals. Cost-efficient and architecturally cleaner than forcing every actor into one camp; needs design work on capability scoping and audit semantics across the hop.

Researching

§ 08 / Out of scope

What we won't build

Items deliberately not pursued. Each has a reason. Listing them is a credibility move: a roadmap that claims to do everything is one that has stopped thinking about trade-offs.

Mid-inference preemption

The scheduler preempts between LangGraph node boundaries, not inside an LLM call. True mid-inference preemption requires model-side cooperation that doesn't exist; we won't pretend otherwise.

Out of scope

Fully-compromised infrastructure protection

If an attacker holds both PostgreSQL and audit-anchor credentials, the chain can be rewritten. Mitigation requires external append-only audit, which is the operator's responsibility, not ours.

Out of scope

End-user authentication

AOSIQ authenticates API callers via bearer tokens. User identity, single sign-on, and role-based access at the application layer are the host application's responsibility, not the runtime's.

Out of scope

Durable execution as a primary product

Temporal, DBOS, Restate, and Inngest serve this category. AOSIQ includes durability as one property among many; we don't compete with specialists on durability alone.

Out of scope

Multi-language sandboxes (Node, Ruby)

The v1 sandbox is Python only, and bash_sandboxed (item 55, § 05) is the one committed addition. Other languages are straightforward to add, but each carries its own threat model and security review. Deferred until a real customer workflow demands it.

Out of scope

Custom package install at sandbox call time

Operators define the curated package set in the Dockerfile. Agent-controlled pip install is a supply-chain attack surface we deliberately do not open.

Out of scope

Actor logic itself — models, prompts, business code

AOSIQ governs actors; it does not implement them. Bring your own LangGraph agent definitions, your own deterministic scripts, your own business logic. The runtime provides the substrate for governed action; the application is yours.

Out of scope

§ 09 / Cadence

How this updates

This roadmap is updated when items ship, are scoped, or move between statuses. There are no calendar dates. The runtime moves at the pace of correctness: when a piece of work is correct enough to ship, it ships.

Items move through statuses in one direction: researching → later → next → active → shipped. Items don't move backward in public unless scope is materially reduced; in that case the change appears in the changelog with reasoning.

This page was last updated May 2026. The full version history of this document — including what changed and when — lives in the project repository.

Recent changes: OperationalAdviceAgent v1 (item 18) shipped — the first domain vertical built on the runtime's full surface, landing alongside the two architectural primitives it depended on (sandboxed execution, item 17, and the DeterministicActor primitive, item 19). The vertical locks in the canonical AOSIQ pattern every subsequent vertical will follow: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Six named operational scenarios are covered (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression); an explicit out-of-scope outcome is schema-enforced so the advisor refuses honestly when an incident falls outside its scope. The Active section is empty for the first time since the foundation phase; the next priority is the historical-incident replay harness (queued in § 05). bash_sandboxed (item 55) remains the next instance of the sandbox primitive family. The Google Gemini backend (previously item 26 / Next) was verified shipped and folded into item 07's six-backend coverage; the registry's known-backends set was repaired so agent classes can pin google_genai for data-residency.

Building something that depends on specific items here?

If your evaluation hinges on a specific roadmap item — performance numbers, a particular tool, a backend addition — that's worth a conversation. Roadmap order can shift in response to real customer signal in a way that reading public documents alone cannot.
