§ 01 / Legend
How to read this
Six status values describe each item's position in the work pipeline. They do not describe calendar dates. An item marked next ships before an item marked later regardless of when either actually arrives.
§ 02 / Position
Where we are
AOSIQ is at v0.8.0 — a production-shaped alpha. The governance substrate is complete for both actor types: capability narrowing, tamper-evident audit, approval gates, composite crash recovery, cost ledger, multi-backend LLM abstraction, anti-hallucination evidence stack, prompt-injection defenses, sandboxed code execution, and the DeterministicActor primitive are all shipped. The runtime now governs reasoning agents and scheduled / on-demand deterministic actors under one regime.
With the substrate complete, architectural focus has shifted to the first verticals — domain agent classes that compose the substrate into solutions an operator recognizes. OperationalAdviceAgent v1 (item 18) shipped as the canonical pattern: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Every future vertical follows this same shape; the substrate doesn't change.
The runtime is no longer the bottleneck. The current focus is empirical validation (a historical-incident replay harness for the advisor), the long tail of native tools, and customer-side integration patterns.
§ 05 of the threat model covers security-mitigation roadmap items specifically. This page is broader, covering the full product trajectory across both actor types.
§ 03 / Foundation
What's shipped
The substrate the rest of the platform builds on. All items below are in the released runtime and verified against tests.
Core runtime
JWT capability tokens with intersection-narrowing delegation. Verification fires before every tool call, every memory operation, every spawn.
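The narrowing rule itself is simple enough to sketch in a few lines. This is an illustrative model, not the actual AOSIQ API (the real runtime carries capabilities inside JWT claims): a delegated token's capability set is the intersection of what the parent holds and what the child requests, so a child can never hold a capability its parent lacks.

```python
# Hypothetical sketch of intersection-narrowing delegation. Capability
# names and the `narrow` function are illustrative assumptions.

def narrow(parent_caps: set[str], requested: set[str]) -> set[str]:
    # A child token receives only capabilities the parent already holds.
    return parent_caps & requested

parent = {"kb.search", "db.query:reporting_replica", "notify.slack"}
child = narrow(parent, {"kb.search", "fs.write"})
# fs.write is dropped: the parent never held it, so it cannot be delegated.
```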
Per-session SHA-256 hash chain with anchor objects in independently-credentialed object storage. Mid-chain tampering detectable.
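The tamper-evidence property can be sketched with a minimal chain (illustrative; the runtime's real row format and anchoring are richer): each entry commits to the previous entry's hash, so editing any mid-chain payload breaks every subsequent link.

```python
import hashlib
import json

def append_entry(chain: list[dict], payload: dict) -> dict:
    # Each entry's hash covers the previous hash plus its own payload.
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(payload, sort_keys=True)
    entry = {
        "prev": prev,
        "payload": payload,
        "hash": hashlib.sha256((prev + body).encode()).hexdigest(),
    }
    chain.append(entry)
    return entry

def verify(chain: list[dict]) -> bool:
    # Recompute every link; any rewritten payload breaks the chain.
    prev = "0" * 64
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Anchoring the latest hash in independently-credentialed storage is what turns "detectable by whoever holds the chain" into "detectable even if the database is compromised".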
Tools registered reversible=False require explicit operator approval bound to (tool, args_hash). Single-use, replay-safe.
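The (tool, args_hash) binding and the single-use contract can be sketched as follows. Class and method names here are assumptions, not the shipped API; the point is that an approval authorizes one exact invocation and is consumed by it.

```python
import hashlib
import json

class ApprovalGate:
    """Illustrative sketch: single-use approvals bound to (tool, args_hash)."""

    def __init__(self) -> None:
        self._approved: set[tuple[str, str]] = set()

    @staticmethod
    def args_hash(args: dict) -> str:
        # Canonical-JSON hash so the approval binds to exact arguments.
        return hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()).hexdigest()

    def approve(self, tool: str, args: dict) -> None:
        self._approved.add((tool, self.args_hash(args)))

    def consume(self, tool: str, args: dict) -> bool:
        key = (tool, self.args_hash(args))
        if key in self._approved:
            self._approved.discard(key)  # single-use: replay is refused
            return True
        return False
```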
LangGraph thread state, agent control block, and working memory captured atomically. Worker heartbeat + orphan reaper handle worker crashes.
Per-call recording with model, tokens, and computed USD. Configurable session ceilings raise exceptions before the API call.
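The ceiling semantics are worth being precise about: the check fires before the provider call, so the budget is never breached retroactively. A minimal sketch (field and method names are assumptions):

```python
class BudgetExceeded(Exception):
    """Raised before the provider API call, so the ceiling is never crossed."""

class CostLedger:
    # Illustrative sketch of per-call recording with a session ceiling.
    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0
        self.calls: list[dict] = []

    def precheck(self, estimated_usd: float) -> None:
        # Fires *before* the LLM call: if the estimate would cross the
        # ceiling, the call is never made.
        if self.spent_usd + estimated_usd > self.ceiling_usd:
            raise BudgetExceeded(f"would exceed ${self.ceiling_usd:.2f} ceiling")

    def record(self, model: str, tokens: int, usd: float) -> None:
        self.calls.append({"model": model, "tokens": tokens, "usd": usd})
        self.spent_usd += usd
```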
Runtime enforcement mode (enforced / warn / disabled) selected via env var. Production deployments require enforced; dev runs in warn with a response-header tripwire.
LLM abstraction & agents
Anthropic, AWS Bedrock, OpenAI, Google Gemini (AI Studio), local Ollama, Claude Code CLI shim — selected at construction via factory. Each backend lazy-imports its optional dependency; Gemini ships with the [google_genai] extra.
allowed_backends per agent class. Mismatches raise at construction; an Ollama-only class cannot accidentally route to Anthropic.
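Construction-time rejection is the important property: a misrouted agent never reaches the point of making an LLM call. A sketch under assumed names (the backend identifiers mirror the six listed above; the class shape is illustrative):

```python
class BackendNotAllowed(Exception):
    pass

# The six shipped backends; string identifiers here are assumptions.
KNOWN_BACKENDS = {"anthropic", "bedrock", "openai",
                  "google_genai", "ollama", "claude_code"}

class GovernedAgent:
    allowed_backends: frozenset = frozenset(KNOWN_BACKENDS)

    def __init__(self, backend: str) -> None:
        # Check fires at construction, not at call time.
        if backend not in self.allowed_backends:
            raise BackendNotAllowed(
                f"{backend!r} is not permitted for {type(self).__name__}")
        self.backend = backend

class AirGappedAgent(GovernedAgent):
    # An Ollama-only class cannot accidentally route to a hosted provider.
    allowed_backends = frozenset({"ollama"})
```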
Schema-bound tool calls, Pydantic validation, audit-row evidence verification, evidence stamps, loop guards, force-investigate gate, abandonment after refusal.
Untrusted-content delimiters around every tool result, plus pattern detection for tool-call syntax, role prefixes, and known injection phrases.
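A minimal sketch of the two layers, with intentionally simplified patterns (the shipped detector covers more shapes; the delimiter format and function name are assumptions):

```python
import re

# Illustrative injection-shape patterns only.
INJECTION_PATTERNS = [
    re.compile(r"(?i)ignore (all |any )?previous instructions"),
    re.compile(r"(?im)^\s*(system|assistant)\s*:"),   # role-prefix shapes
    re.compile(r"<tool_call\b"),                       # tool-call syntax
]

def wrap_untrusted(tool_name: str, result: str) -> tuple[str, list[str]]:
    # Delimit the tool result so the model is instructed to treat it as
    # data, and surface any patterns that look like injection attempts.
    flags = [p.pattern for p in INJECTION_PATTERNS if p.search(result)]
    wrapped = f"<untrusted source={tool_name!r}>\n{result}\n</untrusted>"
    return wrapped, flags
```

Delimiters alone are not a guarantee — that is why pattern detection, the approval gate, and read-only capabilities sit behind them.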
Parent agents fan out one child per declared persona, each running with persona-overlay system prompts; structured aggregation across children's proposals.
BugHunt, CodeReview, ArchitectureDecision, FeatureDesign, ReleaseReadiness. All read-only; all produce structured proposals via BaseProposal.
Knowledge & integration
pgvector-backed document store with HNSW index, semantic search, and audit-evidence integration. Operator-loaded corpora.
Eleven AOSIQ governance operations exposed as MCP tools so any MCP client (Claude Code, Cursor, Cline) can dispatch governed swarms.
Agents reach external knowledge bases, internal APIs, and consumer services through MCP. Bridged tools inherit capability, audit, and approval.
Metadata filtering, hybrid vector + keyword search via pgvector and PostgreSQL tsvector, markdown-aware chunking with stable citation anchors (heading_path, heading_anchor, position_in_doc), incremental ingestion via content-hash, corpus introspection, and source-URI-prefix delete. Migrations 017–020. Closes the gap between minimum-viable retrieval and production-grade RAG.
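The content-hash gate behind incremental ingestion can be sketched in a few lines (illustrative; the real store keeps the hash column per source URI alongside the chunk rows):

```python
import hashlib

def needs_reingest(source_uri: str, doc_text: str, stored: dict) -> bool:
    # Re-chunk and re-embed only when the document's content hash has
    # changed since the last run. `stored` stands in for the per-URI
    # hash column the real document store keeps.
    digest = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    if stored.get(source_uri) == digest:
        return False  # unchanged: skip chunking and embedding entirely
    stored[source_uri] = digest
    return True
```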
Actor model & execution
Sandboxed execution (run_python)
Container-isolated Python with no network, ephemeral filesystem, hard resource limits, non-root, and a curated package set. Replaces over-broad bash grants for the common case. Reversible by construction. Factored so additional sandboxed languages (e.g. bash_sandboxed, item 55) plug in as language-specific subdirectories — the language-agnostic invoker stays unchanged. Compute-time attribution via a new cost_model field on the handler registry, billed through the same ledger that records LLM token cost. Execution surface for both reasoning agents and deterministic actors that need isolated compute.
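The stated isolation properties map naturally onto container flags. A hedged sketch — the image name, limits, and wrapper function are assumptions, not AOSIQ's actual configuration:

```python
import subprocess

def build_sandbox_cmd(code: str, image: str = "python-sandbox:latest") -> list[str]:
    # Illustrative docker flags mirroring the stated properties; exact
    # limits and the image name are assumptions.
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network
        "--read-only",                # ephemeral filesystem...
        "--tmpfs", "/tmp:size=64m",   # ...with a small scratch space
        "--memory", "256m",           # hard resource limits
        "--cpus", "0.5",
        "--pids-limit", "64",
        "--user", "65534:65534",      # non-root (nobody)
        image,
        "python", "-c", code,
    ]

def run_python_sandboxed(code: str, timeout_s: int = 30):
    # Separate from build_sandbox_cmd so the command shape is testable
    # without a Docker daemon present.
    return subprocess.run(build_sandbox_cmd(code),
                          capture_output=True, text=True, timeout=timeout_s)
```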
First-class governed entity for non-reasoning automation — scheduled jobs, monitoring scripts, ETL pipelines. Registered Python functions dispatched through a sibling runner to AgentRunner, sharing the same scheduler, capability tokens, audit chain, approval gate, and cost ledger. New compute_ms cost type bills wall-clock execution. Idempotency-up-to-first-irreversible-call is the contract; the function re-runs from the top after a human approves. Completes the runtime's actor model so governance properties extend to all automation, not just LLM-backed agents.
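The registration shape and the approval contract can be sketched as follows. This is a coarse illustration — the real gate fires at the first irreversible tool call inside the run, not before the run starts — and every name here is an assumption:

```python
class ApprovalRequired(Exception):
    pass

REGISTRY: dict = {}

def deterministic_actor(name: str, reversible: bool = True):
    # Illustrative registration decorator.
    def register(fn):
        REGISTRY[name] = {"fn": fn, "reversible": reversible}
        return fn
    return register

def run_actor(name: str, approvals: set, **kwargs):
    entry = REGISTRY[name]
    if not entry["reversible"] and name not in approvals:
        # The run halts at the irreversible boundary. After a human
        # approves, the function re-runs *from the top*: hence the
        # idempotency-up-to-first-irreversible-call contract.
        raise ApprovalRequired(name)
    return entry["fn"](**kwargs)

@deterministic_actor("rotate_keys", reversible=False)
def rotate_keys(env: str) -> str:
    # Stand-in for a job whose work is irreversible.
    return f"rotated:{env}"
```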
First domain vertical built on the runtime's full surface. A reasoning agent diagnoses operational incidents across six named scenarios (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression) by spawning a pack of deterministic diagnostic actors — log query, threshold evaluation, configuration introspection. The actors gather and structure data; the advisor forms hypotheses and emits a single structured proposal humans review. Read-only by capability; remediation is a separate downstream agent. Out-of-scope is a first-class outcome — the advisor refuses honestly when an incident doesn't match the six scenarios, schema-enforced to carry no proposed actions. Locks in the canonical pattern every subsequent vertical will follow: reasoning advisor over deterministic actor pack, structured proposal as terminal output.
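The schema-enforced refusal is the distinctive part and is easy to illustrate. The sketch below uses stdlib dataclasses where the shipped runtime uses Pydantic, and the field names are assumptions; the rule it demonstrates is the one described above — an out-of-scope outcome is forbidden from carrying proposed actions.

```python
from dataclasses import dataclass, field

SCENARIOS = {
    "job_queue_backlog", "disk_space_exhaustion", "db_pool_exhaustion",
    "configuration_drift", "recurring_error_spike", "performance_regression",
    "out_of_scope",
}

@dataclass
class AdvisorProposal:
    # Illustrative stand-in for the advisor's terminal proposal schema.
    scenario: str
    hypothesis: str
    proposed_actions: list = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario {self.scenario!r}")
        # Out-of-scope is a first-class outcome: an honest refusal is
        # schema-forbidden from smuggling in recommendations.
        if self.scenario == "out_of_scope" and self.proposed_actions:
            raise ValueError("out_of_scope proposals must carry no actions")
```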
Plus 37 additional shipped items not individually listed: capability templates, cookbook examples, dashboard views, threat model document, migrations 001–021, test infrastructure, deployment scaffolding. The full set is verifiable in the codebase.
§ 04 / Active
What's in flight now
Nothing. The OperationalAdviceAgent v1 vertical (item 18) shipped alongside the two architectural sprints it depended on — sandboxed execution (item 17) and the DeterministicActor primitive (item 19). The next priority is the historical-incident replay harness that validates the advisor's recommendations against a corpus of known incidents (queued in § 05 / Next); it didn't block the v1 vertical from landing because the cookbook example and end-to-end tests work against synthetic data.
§ 05 / Next
What's committed to next
Items scoped, prioritized, and waiting in the queue behind active sprints. Each is independently shippable; the order reflects leverage and dependency rather than calendar.
Native tool catalog
PDF, DOCX, HTML, CSV, XLSX → structured content (title, body, metadata, tables). Closes the "agents can't read documents reliably" gap.
Deterministic date arithmetic, timezone reasoning, elapsed-time computation. Eliminates a known class of LLM failure on temporal questions.
Slack, email, PagerDuty, webhook channels. All reversible=False by default — sending a message is irreversible — so the approval gate prevents agent-driven message spam.
Read-only, schema-aware, capability-scoped per connection. Operator-registered connections become db_query@reporting_replica-style scoped tools.
Queryable observation log distinct from the immutable forensic audit chain. Operators get console.log-style observability without compromising audit integrity.
JSON schema, regex, URL, email, IBAN, format validation. Lets agents self-check output against deterministic validators before emitting proposals.
Sandboxed shell (bash_sandboxed)
Docker-isolated shell command execution. Same isolation pattern as run_python (item 17): no network, no host filesystem, hard resource limits, non-root. Replaces broad bash grants for agents that need CLI tools (jq, awk, kubectl, etc.) without unrestricted shell access. Second instance of the sandboxed-execution primitive family the runtime is converging on.
Backend & integration
Pre-embedding scan for prompt-injection patterns, secrets, PII, malicious content. Operator review queue for flagged documents. Closes the KB-poisoning gap from the threat model.
Agents & cookbook
Second vertical class. Produces structured change descriptions for developer review — affected programs, current code, proposed approach, test plan. Read-only; never writes code.
Operator worked / partial / failed buttons on every proposal; reinforces or weakens the underlying experiential memory. Weekly aggregation surfaces calibration drift.
Validates new agent classes against curated historical scenarios before promotion. Pass-rate threshold and confidence-calibration measurement are part of the definition of done.
§ 06 / Later
What comes after
Items scoped and committed, but not in the next release cycle. Sequenced behind the work above.
Tool catalog continued
Tree-sitter-aware extraction returning symbol graphs rather than raw text. Lets agents navigate codebases by structure, not by token-count.
Structured diffs as first-class proposal artifacts. Diff is the proposal; patch application happens through standard change-control, not the agent.
Operator-defined command allow-list as defense-in-depth over capability checks. Bash with grep, find, git log, kubectl get — but not arbitrary commands.
Per-domain rate limits, response-size caps, optional caching, structured response objects. Cuts repeat calls and gives operators visibility into outbound traffic.
Operator infrastructure
aos-replay re-runs a recorded session against a different LLM backend or different prompt and compares outcomes. Critical for evaluating model upgrades.
Walks registered tools, agent classes, and production tokens; reports unused capabilities and over-broad grants. Helps operators tighten capabilities over time.
Queryable rollups across sessions by agent class, time window, and operator. For finance and capacity planning, not per-session inspection.
Daily and weekly export of pending and resolved approvals for compliance teams: every destructive action an agent attempted in a window, with outcome.
Production hardening
Published p50/p99 latency, soak test results, capacity planning under realistic concurrency. Required for first-customer production deployment at scale.
Per-caller bearer tokens, independent audit attribution, rotation without downtime. Resolves the residual exposure noted in the threat model.
Federated identity for API callers and mutual-TLS as alternative bearer-token mechanism. Targets enterprise deployments where bearer tokens alone are insufficient.
§ 07 / Research
Where we're exploring
Open questions where the right answer isn't yet clear. We're prototyping and learning rather than committing. Items here graduate to next or later when scope is confidently bounded — or to out of scope if the right answer is "not us."
High-severity proposals routed to a second LLM call with only the proposal and evidence. Closes the largest open category in prompt-injection threat surface, but the architecture has significant ergonomic and cost implications.
Adversaries who spread content across many tool calls, steering reasoning gradually, evade single-result pattern detection. No good general defense exists today; we're tracking the research literature.
Detect JWT-shaped or API-key-shaped strings in tool output and redact before LLM exposure. Prevents agents being convinced to leak their own credentials.
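Shape-based redaction is easy to prototype, which is partly why it sits in research rather than next: the hard part is coverage and false positives, not the mechanism. A sketch with intentionally loose, illustrative patterns:

```python
import re

# Shape-based detectors; these patterns are illustrative assumptions.
CREDENTIAL_PATTERNS = [
    re.compile(r"eyJ[\w-]{5,}\.[\w-]{5,}\.[\w-]{5,}"),  # JWT-shaped (3 b64url segments)
    re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),             # common API-key prefix shape
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                # AWS access-key-ID shape
]

def redact_credentials(text: str) -> str:
    # Runs over tool output before it reaches the LLM context, so the
    # model never sees the credential it might be coaxed into repeating.
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```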
Currently sandbox limits are global env vars. Per-class overrides through capability-token claims would let critical agents get more resources than experimental ones, but the audit shape needs design work.
Beyond the generic incident-response cookbook, full reference implementations for specific verticals. Which verticals first depends on customer signal.
An actor with a deterministic main execution path and a reasoning hop at one or more specific decision points. Common shape: a deterministic monitor that calls an LLM only to classify ambiguous signals. Cost-efficient and architecturally cleaner than forcing every actor into one camp; needs design work on capability scoping and audit semantics across the hop.
§ 08 / Out of scope
What we won't build
Items deliberately not pursued. Each has a reason. Listing them is a credibility move: a roadmap that claims to do everything is one that has stopped thinking about trade-offs.
The scheduler preempts between LangGraph node boundaries, not inside an LLM call. True mid-inference preemption requires model-side cooperation that doesn't exist; we won't pretend otherwise.
If an attacker holds both PostgreSQL and audit-anchor credentials, the chain can be rewritten. Mitigation requires external append-only audit, which is the operator's responsibility, not ours.
AOSIQ authenticates API callers via bearer tokens. User identity, single sign-on, and role-based access at the application layer are the host application's responsibility, not the runtime's.
Temporal, DBOS, Restate, and Inngest serve this category. AOSIQ includes durability as one property among many; we don't compete with specialists on durability alone.
Python only in v1 sandbox. Other languages are straightforward to add but each is its own threat model and security review. Defer until a real customer workflow demands it.
Operators define the curated package set in the Dockerfile. Agent-controlled pip install is a supply-chain attack surface we deliberately do not open.
AOSIQ governs actors; it does not implement them. Bring your own LangGraph agent definitions, your own deterministic scripts, your own business logic. The runtime provides the substrate for governed action; the application is yours.
§ 09 / Cadence
How this updates
This roadmap is updated when items ship, scope, or move between statuses. There are no calendar dates. The runtime moves at the pace of correctness — when a piece of work is correct enough to ship, it ships.
Items move through statuses in one direction: researching → later → next → active → shipped. Items don't move backward in public unless scope is materially reduced; in that case the change appears in the changelog with reasoning.
This page was last updated May 2026. The full version history of this document — including what changed and when — lives in the project repository.
Recent changes: OperationalAdviceAgent v1 (item 18) shipped — the first domain vertical built on the runtime's full surface, landing alongside the two architectural primitives it depended on (sandboxed execution, item 17, and the DeterministicActor primitive, item 19). The vertical locks in the canonical AOSIQ pattern every subsequent vertical will follow: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Six named operational scenarios are covered (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression); an explicit out-of-scope outcome is schema-enforced so the advisor refuses honestly when an incident falls outside its scope.

The Active section is empty for the first time since the foundation phase; the next priority is the historical-incident replay harness (queued in § 05). bash_sandboxed (item 55) remains the next instance of the sandbox primitive family. The Google Gemini backend (previously item 26 / Next) was verified shipped and folded into item 07's six-backend coverage; the registry's known-backends set was repaired so agent classes can pin google_genai for data-residency.
Building something that depends on specific items here?
If your evaluation hinges on a specific roadmap item — performance numbers, a particular tool, a backend addition — that's worth a conversation. Roadmap order can shift in response to real customer signal in a way that reading public documents alone cannot.