Roadmap — AOSIQ

§ 01 / Legend

How to read this

Six status values describe each item's position in the work pipeline. They do not describe calendar dates. An item marked next ships before an item marked later regardless of when either actually arrives.

Shipped

Live in the released runtime

Active

Currently being built

Next

Committed to upcoming release cycle

Later

Committed but not next

Researching

Exploring whether or how to build

Out of scope

Deliberately not pursued, with reason

§ 02 / Position

Where we are

AOSIQ is at v0.9.0 — a production-shaped alpha. The governance substrate is complete for both actor types: capability narrowing, tamper-evident audit, approval gates, composite crash recovery, cost ledger, multi-backend LLM abstraction, anti-hallucination evidence stack, prompt-injection defenses, sandboxed code execution, and the DeterministicActor primitive are all shipped. The runtime now governs reasoning agents and scheduled / on-demand deterministic actors under one regime.

With the substrate complete, the recent architectural focus has moved to first verticals — domain agent classes that compose the substrate into solutions an operator recognizes. OperationalAdviceAgent v1 (item 18) shipped as the canonical pattern: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Every future vertical follows this same shape; the substrate doesn't change.

The runtime is no longer the bottleneck. Empirical validation is in place: the cross-vertical replay harness publishes multi-backend baselines per oracle version, the air-gapped local-LLM tier is at parity with API backends, and corpus oracles are versioned so revisions never destroy prior baselines. The current focus is the long tail of native tools, customer-side integration patterns, and additional vertical classes built on the canonical advisor-over-actor-pack shape.

§ 05 of the threat model covers security-mitigation roadmap items specifically. This page is broader, covering the full product trajectory across both actor types.

§ 03 / Foundation

What's shipped

The substrate the rest of the platform builds on. All items below are in the released runtime and verified against tests.

Core runtime

Capability authorization

JWT capability tokens with intersection-narrowing delegation. Verification fires before every tool call, every memory operation, every spawn.

Shipped
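
The intersection-narrowing delegation described under Capability authorization can be illustrated with a minimal sketch. It models a capability as a plain set of tool names rather than a signed JWT; the `Capability` class and its methods are hypothetical names chosen for illustration, not the runtime's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability: the set of tool names a holder may call."""

    tools: frozenset

    def delegate(self, requested: set) -> "Capability":
        # A child receives only what the parent holds AND what was requested,
        # so authority can shrink along a delegation chain but never grow.
        return Capability(self.tools & frozenset(requested))

    def allows(self, tool: str) -> bool:
        # In the text, a check like this runs before every tool call.
        return tool in self.tools


parent = Capability(frozenset({"kb_search", "run_python"}))
child = parent.delegate({"run_python", "send_slack_message"})
```

Here the child asked for a tool the parent never held, and the intersection silently drops it; a further delegation from `child` can only narrow the set again.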

Tamper-evident audit chain

Per-session SHA-256 hash chain with anchor objects in independently-credentialed object storage. Mid-chain tampering detectable.

Shipped
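
To make "mid-chain tampering detectable" concrete, the following is a minimal sketch of a SHA-256 hash chain. The row layout, the all-zero genesis value, and the function names are assumptions for illustration; the anchor objects in separate object storage are not modeled.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed starting value for the first row


def row_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) keeps the hash stable.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": row_hash(prev, payload)})


def first_broken_index(chain: list) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return None


chain: list = []
for step in ("spawn", "tool_call", "proposal"):
    append(chain, {"event": step})
```

Editing the payload of any row changes its recomputed hash, so verification stops at that row even if later rows are untouched.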

Mandatory approval gate

Tools registered reversible=False require explicit operator approval bound to (tool, args_hash). Single-use, replay-safe.

Shipped
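
The binding of an approval to (tool, args_hash) and its single-use property can be sketched as follows. `ApprovalGate`, `ApprovalRequired`, and the canonical-JSON hashing are illustrative assumptions, not the shipped implementation.

```python
import hashlib
import json


class ApprovalRequired(Exception):
    """Raised when no matching, unused approval exists."""


def args_hash(args: dict) -> str:
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalGate:
    def __init__(self) -> None:
        self._approved: set = set()

    def approve(self, tool: str, args: dict) -> None:
        # The operator approves one exact call, not the tool in general.
        self._approved.add((tool, args_hash(args)))

    def consume(self, tool: str, args: dict) -> None:
        key = (tool, args_hash(args))
        if key not in self._approved:
            raise ApprovalRequired(f"{tool} needs operator approval")
        # Single use: the approval is spent, so a replay fails.
        self._approved.remove(key)


gate = ApprovalGate()
gate.approve("restart_workers", {"pool": "etl", "count": 4})
```

Because the key includes the argument hash, an approval for one argument set does not cover a call with different arguments, and a second call with the same arguments needs a fresh approval.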

Composite crash recovery

LangGraph thread state, agent control block, and working memory captured atomically. Worker heartbeat + orphan reaper handle worker crashes.

Shipped

Cost ledger with hard ceilings

Per-call recording with model, tokens, and computed USD. Configurable session ceilings raise exceptions before the API call.

Shipped
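
A hard ceiling that raises before the API call, as described for the cost ledger, might look like the sketch below. The class and method names are hypothetical, and the estimate passed to `precheck` is assumed to come from the caller.

```python
from decimal import Decimal


class CostCeilingExceeded(RuntimeError):
    pass


class CostLedger:
    def __init__(self, ceiling_usd: str) -> None:
        self.ceiling = Decimal(ceiling_usd)
        self.rows: list = []

    @property
    def spent(self) -> Decimal:
        return sum((row["usd"] for row in self.rows), Decimal("0"))

    def precheck(self, estimated_usd: str) -> None:
        # Called BEFORE the LLM request, so an over-budget call is never sent.
        if self.spent + Decimal(estimated_usd) > self.ceiling:
            raise CostCeilingExceeded(
                f"session ceiling {self.ceiling} USD would be exceeded"
            )

    def record(self, model: str, tokens: int, usd: str) -> None:
        self.rows.append({"model": model, "tokens": tokens, "usd": Decimal(usd)})


ledger = CostLedger("1.00")
ledger.precheck("0.40")
ledger.record("example-model", 1200, "0.40")
```

`Decimal` is used so that repeated small charges do not accumulate floating-point error against the ceiling.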

Three-mode HTTP authentication

enforced / warn / disabled via env var. Production deployments require enforced; dev runs in warn with a response header tripwire.

Shipped

LLM abstraction & agents

Six LLM backends

Anthropic, AWS Bedrock, OpenAI, Google Gemini (AI Studio), local Ollama, Claude Code CLI shim — selected at construction via factory. Each backend lazy-imports its optional dependency; Gemini ships with the [google_genai] extra.

Shipped

Per-class data-egress envelopes

allowed_backends per agent class. Mismatches raise at construction; an Ollama-only class cannot accidentally route to Anthropic.

Shipped
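
The construction-time check behind per-class egress envelopes can be sketched in a few lines. `AgentClass`, `AirGappedAdvisor`, and `BackendNotAllowed` are hypothetical names; only the `allowed_backends` attribute comes from the text.

```python
class BackendNotAllowed(ValueError):
    pass


class AgentClass:
    # Each agent class declares which backends it may route to.
    allowed_backends: frozenset = frozenset()

    def __init__(self, backend: str) -> None:
        # Fail at construction, before any prompt or data can leave the host.
        if backend not in self.allowed_backends:
            raise BackendNotAllowed(
                f"{type(self).__name__} may not use backend {backend!r}"
            )
        self.backend = backend


class AirGappedAdvisor(AgentClass):
    allowed_backends = frozenset({"ollama"})


local = AirGappedAdvisor("ollama")
```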

Seven-layer anti-hallucination stack

Schema-bound tool calls, Pydantic validation, audit-row evidence verification, evidence stamps, loop guards, force-investigate gate, abandonment after refusal.

Shipped

Direct prompt-injection defense

Untrusted-content delimiters around every tool result, plus pattern detection for tool-call syntax, role prefixes, and known injection phrases.

Shipped
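
The two halves of the direct prompt-injection defense, delimiters plus pattern detection, can be sketched as below. The delimiter strings and the three regular expressions are illustrative assumptions; the shipped pattern set is not reproduced here.

```python
import re

OPEN, CLOSE = "<untrusted_content>", "</untrusted_content>"

# Hypothetical patterns, one per category named in the text.
PATTERNS = {
    "role_prefix": re.compile(r"^\s*(system|assistant|user)\s*:", re.I | re.M),
    "tool_call_syntax": re.compile(r"<\s*tool_call\b", re.I),
    "injection_phrase": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.I
    ),
}


def scan(text: str) -> list:
    """Return the names of every pattern that matches, sorted."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


def wrap_untrusted(text: str) -> str:
    # Strip delimiter look-alikes so the payload cannot close the block early.
    cleaned = text.replace(OPEN, "").replace(CLOSE, "")
    return f"{OPEN}\n{cleaned}\n{CLOSE}"
```

A tool result would pass through both functions: `scan` produces the flags to record, and `wrap_untrusted` produces the text the model actually sees.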

Multi-persona swarm orchestration

Parent agents fan out one child per declared persona, each running with persona-overlay system prompts; structured aggregation across children's proposals.

Shipped

Five swarm agent classes

BugHunt, CodeReview, ArchitectureDecision, FeatureDesign, ReleaseReadiness. All read-only; all produce structured proposals via BaseProposal.

Shipped

Knowledge & integration

Knowledge base substrate

pgvector-backed document store with HNSW index, semantic search, and audit-evidence integration. Operator-loaded corpora.

Shipped

MCP server (governance as MCP tools)

Eleven AOSIQ governance operations exposed as MCP tools so any MCP client (Claude Code, Cursor, Cline) can dispatch governed swarms.

Shipped

MCP client bridge for external tools

Agents reach external knowledge bases, internal APIs, and consumer services through MCP. Bridged tools inherit capability, audit, and approval.

Shipped

Knowledge base production layer

Metadata filtering, hybrid vector + keyword search via pgvector and PostgreSQL tsvector, markdown-aware chunking with stable citation anchors (heading_path, heading_anchor, position_in_doc), incremental ingestion via content-hash, corpus introspection, and source-URI-prefix delete. Migrations 017–020. Closes the gap between minimum-viable retrieval and production-grade RAG.

Shipped

Actor model & execution

Sandboxed code execution (run_python)

Container-isolated Python with no network, ephemeral filesystem, hard resource limits, non-root, and a curated package set. Replaces over-broad bash grants for the common case. Reversible by construction. Factored so additional sandboxed languages (e.g. bash_sandboxed, item 55) plug in as language-specific subdirectories — the language-agnostic invoker stays unchanged. Compute-time attribution via a new cost_model field on the handler registry, billed through the same ledger that records LLM token cost. Execution surface for both reasoning agents and deterministic actors that need isolated compute.

Shipped

DeterministicActor primitive

First-class governed entity for non-reasoning automation — scheduled jobs, monitoring scripts, ETL pipelines. Registered Python functions dispatched through a sibling runner to AgentRunner, sharing the same scheduler, capability tokens, audit chain, approval gate, and cost ledger. New compute_ms cost type bills wall-clock execution. Idempotency-up-to-first-irreversible-call is the contract; the function re-runs from the top after a human approves. Completes the runtime's actor model so governance properties extend to all automation, not just LLM-backed agents.

Shipped

OperationalAdviceAgent v1 (first vertical)

First domain vertical built on the runtime's full surface. A reasoning agent diagnoses operational incidents across six named scenarios (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression) by spawning a pack of deterministic diagnostic actors — log query, threshold evaluation, configuration introspection. The actors gather and structure data; the advisor forms hypotheses and emits a single structured proposal humans review. Read-only by capability; remediation is a separate downstream agent. Out-of-scope is a first-class outcome — the advisor refuses honestly when an incident doesn't match the six scenarios, schema-enforced to carry no proposed actions. Locks in the canonical pattern every subsequent vertical will follow: reasoning advisor over deterministic actor pack, structured proposal as terminal output.

Shipped

Knowledge-base ingest scanner

The prompt-injection-pattern scanner that runs on tool results at read time now runs on ingested chunks at write time too. Every chunk passing through DocumentStore.add_documents is scanned before embedding; the result is stamped into chunk metadata regardless of match outcome, so operators can distinguish "scanned and clean" from "pre-scanner row" by querying the JSONB field. Two policies: warn (default — ingest proceeds, the chunk's metadata records the patterns, an audit event fires per flagged chunk, and kb_search later surfaces an inline warning when the chunk is returned) and reject (the whole source document is refused atomically before any embed or DB write). A backfill scanner remediates corpora ingested before this shipped — one CLI invocation brings a pre-scanner corpus up to the new contract. The read-side and write-side share the same pattern set, so future improvements to the detector benefit both directions uniformly. Closes the most visible "deliberately partial mitigation" claim in the threat-model document; KB-mediated injection moves from "operators should vet ingested corpora" to "scanned at ingest with operator-configurable enforcement."

Shipped

RemediationAgent v1 (execution arm)

Completes the operational-advice workflow from "diagnose" to "act." A deterministic dispatcher consumes an operator-approved OperationalAdviceProposal and walks proposed_actions in proposal order, spawning a per-action handler with capability narrowed to one tool. No reasoning at execution time — the proposal IS the plan; an LLM on the execution path would compound the advisor's accuracy gaps. Four v1 handlers cover the common action shapes (restart workers, scale pool, rotate logs, revert config); each is idempotent by construction so re-run-on-resume is safe. The per-action approval gate fires on every irreversible handler tool call through the existing review queue — a wrong proposal becomes a not-executed wrong proposal, not a wrong action. Every audit row carries the originating proposal's invocation_id via a new indexed column, so the operator's "show me everything this proposal triggered" forensic query is one filter, not a JSONB scan. Operator initiates via an explicit "Remediate" button — separate consent from "approved the diagnosis." The dispatch + handler pattern shipped here is the template every future vertical's execution arm will follow.

Shipped

ComplianceAuditAgent (second vertical)

Proves the canonical AOSIQ pattern generalizes beyond operational incidents. A second reasoning advisor diagnoses compliance findings across six named scenarios (control failure, regulatory deviation, audit-trail gap, segregation-of-duties violation, retention violation, evidence-collection gap) over three frameworks (SOC2, HIPAA, ISO27001) by spawning a pack of deterministic diagnostic actors — control-state check, structured rule evaluation, evidence-vault lookup. Same shape as the first vertical: reasoning advisor over deterministic actor pack, structured proposal as terminal output, out-of-scope as a first-class refusal. The infrastructure investment amortizes — scheduler, audit chain, capability tokens, approval gate, telemetry capture, replay harness, reinforcement worker, dashboard, and remediation dispatcher are all shared between the two verticals with zero per-vertical duplication. The cross-vertical RemediationAgent dispatcher reads the proposal's vertical and selects the right handler pack at runtime; four compliance handlers (evidence backfill, policy correction, control remediation, compliance-officer notification) plug into the same approval-gated execution path the operational handlers use. The dashboard surfaces per-vertical pass rates side-by-side; the replay harness defaults to walking both corpora and prints a per-vertical breakdown so "the pattern generalizes" is one report away. Adding the third vertical now reduces to one advisor module + one actor pack + one corpus + entries in two registration tables — no new infrastructure.

Shipped

Generic agent framework (substrate for future verticals)

Turns the "every vertical follows the same shape" claim from rhetorical to structural. The reasoning advisor's LangGraph machinery — node sequence, system-prompt loading, tool dispatch, evidence-marker rewriting, telemetry capture, cost accounting — extracts into a parameterized base class; each shipped vertical (operational, compliance) becomes a thin configuration carrying its scenario list, actor pack, proposal subclass, and two small adapters for the proposal fields that genuinely differ across domains. The operational module shrinks from ~1000 lines to under 400; the compliance module to under 350. Three additional reference families ship alongside — a read-only reviewer that produces findings + recommendation, a guardrail that consults the centralized approval-policy resolver and emits a per-action decision record, a verifier that compares expected against actual evidence and reports succeeded / failed / partially-succeeded / needs-more-info. Each family carries a corresponding role-based capability template (read-only-reviewer, advisor, guardrail, executor-scoped, verifier, reporter) operators compose with a vertical-specific overlay at capability-mint time — separating the role contract from the domain identity. The audit chain gains five new event types linked through a new proposal-lifecycle UUID column so an operator's "show me everything that happened on this proposal" query reads one indexed column instead of scanning JSONB payloads. The tool registry gains explicit risk-level metadata so the approval gate routes high-risk reversible actions to human review without an operator having to label every tool ad-hoc. Reuse cost for the next advisor vertical drops to roughly one configuration class plus a registration helper — comparable in size to writing a new dataclass rather than a new module.

Shipped

CodeReviewAgent (third vertical, first authored on the generic framework)

Proves the framework's "next vertical fits in a configuration class plus a registration helper" claim with new code rather than retrofitted code. The third vertical reviews code artifacts a human is about to ship — a pasted snippet, or a unified diff piped from git diff main...HEAD — and emits a structured proposal with categorized findings + a recommendation. Reviewer-family pattern (not advisor): the agent doesn't dispatch diagnostic actors, it looks at the artifact and classifies. Each finding is tagged with one of seven categories — security issue, likely bug, correctness concern, style violation, performance concern, missing test, or out of scope — and one of four severities chosen against an explicit anti-inflation rubric. The default recommendation derives from finding severities (any critical → block, any other finding → revise, no findings → approve), with an override path for cases like style-only diffs that shouldn't gate the merge. Honest refusal is first-class: when the artifact isn't reviewable (binary content, generated code, vendored dependencies), the reviewer emits a single out-of-scope finding rather than fabricating issues. Customer profile is the dogfood loop — the operator running Claude Code on their workstation pipes the diff through AOS and gets a governance surface for the review. The third vertical's module is roughly 330 lines (the framework's claim was about 150-200 for a thin vertical; the residual is genuine domain content — the seven categories, the rubric, the parse-failure path). The cost of vertical four is now expected to land closer to the framework's projection as more shared machinery shakes out.

Shipped
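
The default-recommendation rule for CodeReviewAgent (any critical finding blocks, any other finding asks for revision, no findings approves) is small enough to sketch directly. The function name and the `override` parameter are assumptions; the text names only the `critical` severity, so the other severity strings in the usage are placeholders.

```python
from typing import Optional


def recommend(severities: list, override: Optional[str] = None) -> str:
    # The override path covers cases like style-only diffs
    # that should not gate the merge.
    if override is not None:
        return override
    if "critical" in severities:
        return "block"
    if severities:
        return "revise"
    return "approve"
```

For example, `recommend(["low"], override="approve")` models a style-only diff that the reviewer lets through despite carrying a finding.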

Field-test feedback loop

Closes the loop between proposals the advisor emits and the calibration signal it learns from. Every advisor invocation writes a structured telemetry row (vertical, scenario, confidence, severity, evidence count, operator outcome) — captured at proposal time, completed when the operator acts. The operator-facing surface is three buttons on every proposal: approved, edited, rejected — bound to the proposal's audit row so the feedback is the same shape regardless of which vertical surfaced the proposal. A reinforcement worker reads the feedback stream and updates experiential memory: rejected proposals weaken the underlying recall pattern, approved proposals strengthen it. Two materialized views aggregate the telemetry — a daily metrics rollup for the operator dashboard and a calibration view that exposes per-scenario confidence-vs-accuracy drift over the trailing window. The dashboard surfaces both views as the advisor-trends page; the per-vertical filter lets an operator compare operational vs compliance calibration side-by-side. A telemetry-to-corpus exporter promotes operator-confirmed incidents into the replay corpus with the operator's response and any scenario correction preserved as YAML, so production feedback extends the regression set over time without manual curation.

Shipped

Replay harness with audit-clean corpus revision

Validates verticals against curated historical scenarios before promotion. Cross-vertical: walks both operational and compliance corpora by default; per-vertical pass rates surface in one report. Pass-rate measurement is structural — five binary axes per incident (scenario, severity, confidence, actions, evidence) with all five required to pass, so "almost right" never inflates the headline number. Multi-backend baselines published per model: the air-gapped local-LLM tier (mistral-small:22b on Ollama, 80% pass rate) is at parity with API backends (Gemini Flash 78%, OpenAI gpt-4o-mini 76%, Claude Haiku 70%) for operator deployments that cannot route sensitive incident data to a third-party API. Corpus oracles are versioned per case: when an audit identifies an expectation as mis-calibrated (too strict or too lenient on a specific case), revisions bump the case's oracle version, the prior baseline file stays in git byte-for-byte, and an offline rescore tool reproduces the historical number against the prior corpus state without paying for another LLM run. The contract is that every baseline number is well-defined under its specific oracle version — no silent drift, no destroyed history, no manual archaeology required to answer "which oracle was this scored against."

Shipped
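
The structural pass-rate rule of the replay harness, five binary axes with all five required, can be sketched as follows. The axis names come from the text; the function names and the dictionary shape of a scored incident are assumptions.

```python
AXES = ("scenario", "severity", "confidence", "actions", "evidence")


def incident_passes(result: dict) -> bool:
    # A missing axis counts as a failure; four out of five is still a fail.
    return all(result.get(axis, False) for axis in AXES)


def pass_rate(results: list) -> float:
    if not results:
        return 0.0
    return sum(incident_passes(r) for r in results) / len(results)


scored = [
    {axis: True for axis in AXES},
    {**{axis: True for axis in AXES}, "evidence": False},
]
```

The second incident matches on four axes but contributes nothing to the headline number, which is the "almost right never inflates" property.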

Action execution framework v1

Closes the lifecycle the framework had left half-open: after an advisor proposes and a guardrail gates, the executor changes state, a verifier confirms, and a reporter summarizes for the operator's runbook. Promotes two reserved family slots (executor and reporter) to shipped, taking the family catalog from four shipped + three reserved to six shipped + one reserved (triage remains, awaiting the customer-triage vertical). The executor family is the first write-side member: a deterministic dispatcher that walks an operator-approved proposal's actions in order, capability-narrows per action against a base template that ships zero tools (every authority comes from the per-action overlay), records an idempotency cursor for crash recovery, and routes failure modes through dedicated audit events. No LLM runs at execution time: the proposal is the plan, and an LLM at the dispatch boundary would compound the advisor's accuracy gaps without adding value. The existing RemediationAgent was refactored onto the generic dispatcher, shrinking from roughly 390 lines to 240 while preserving its public surface bit-identically: a depth-β change that proves the framework absorbs write-side verticals the same way it absorbed read-side ones. The reporter family closes the chain: it consumes the verifier's structured result and the audit chain for the proposal's lifecycle, then emits a runbook-ready summary plus a list of concrete next steps. Status mirrors the verifier by default, because a chipper summary on a failed verification is the failure mode this family exists to prevent; only the reporter's own parse-fallback path downgrades to needs-more-info, distinguishing reporter trouble from underlying remediation outcomes. End-to-end lineage threads through one indexed proposal-id column on the audit log so a single SQL filter returns every event in the lifecycle without payload scanning.
Six smoke cases (three executor + three reporter, pass/fail/edge per family) plus a cross-handler capability-isolation matrix (eight handlers, fifty-six pairs) pin the safety properties; sixty-plus pre-existing remediation tests still pass against the refactored dispatcher.

Shipped
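
The cross-handler capability-isolation matrix (eight handlers, fifty-six pairs) corresponds to every ordered pair of distinct handlers. The sketch below enumerates those pairs and checks that no handler's narrowed capability reaches another handler's tool. Handler names follow the eight handlers named on this page; the tool identifiers and function names are hypothetical.

```python
from itertools import permutations

# One hypothetical tool per handler: four operational, four compliance.
HANDLER_TOOLS = {
    "restart_workers": "tool.restart_workers",
    "scale_pool": "tool.scale_pool",
    "rotate_logs": "tool.rotate_logs",
    "revert_config": "tool.revert_config",
    "evidence_backfill": "tool.evidence_backfill",
    "policy_correction": "tool.policy_correction",
    "control_remediation": "tool.control_remediation",
    "officer_notification": "tool.officer_notification",
}


def allowed_tools(handler: str) -> frozenset:
    # The base template grants nothing; the per-action overlay adds one tool.
    base: frozenset = frozenset()
    return base | {HANDLER_TOOLS[handler]}


def isolation_violations() -> list:
    """Ordered pairs (a, b) where handler a could call handler b's tool."""
    return [
        (a, b)
        for a, b in permutations(HANDLER_TOOLS, 2)
        if HANDLER_TOOLS[b] in allowed_tools(a)
    ]


pairs = list(permutations(HANDLER_TOOLS, 2))
```

Eight handlers yield 8 × 7 = 56 ordered pairs, which matches the matrix size quoted above.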

LLM turn audit (full reasoning traceability)

Closes the last forensic gap in the audit chain. Until this sprint the runtime recorded every tool call with its arguments, results, capability token, and timing — but the LLM's reasoning that produced each tool call was discarded after dispatch. Operators reviewing a bad decision could see the action; they could not read why the model chose it. The sprint adds a new audit event type (LLM_TURN) emitted once per reasoning step, capturing the full assistant message text. The text itself lives in object storage under the existing audit-anchor bucket so the primary audit table stays lean even for agents that produce hundreds of reasoning steps; only the canonical-JSON SHA-256 plus the object key land on the audit row, giving operators an integrity proof without bloating the hot path. Every subsequent tool-call audit row sets a causation pointer back to the reasoning turn that produced it. The reasoning → action chain becomes a single SQL JOIN over two partial indexes, indexed both ways: for an agent's full timeline, and for a specific tool call's originating reasoning step. The same data flows through three operator surfaces: a Reasoning Trace section on the agent detail page with collapsible per-turn cards linking each turn to the tool calls it dispatched; a server-sent-event stream that pushes a 200-character preview of each turn as the agent reasons; and a pair of REST endpoints (list-without-content plus per-turn-content fetch) backed by an MCP tool for programmatic post-hoc forensic review. The scope is helper-first: emission is wired into the two reasoning paths that cover every shipped vertical today (the generic agent substrate and the parameterized advisor base that drives both shipped advisor verticals), with a documented one-line adoption pattern for reviewer-family and swarm reasoning paths to opt in as those workflows need it. 
The MinIO upload at write time falls back to a sentinel rather than failing the audit append, so the chain stays unbroken even when object storage is degraded.

Shipped

Prompt management (versioned prompts + profile resolution)

Closes the three coupled problems that made prompt iteration painful. First, prompts lived in Python source — the same tool-call protocol existed in two files with two different contents, and updating either required a code change plus a deploy. Second, the model-selection knob was the transport (api / cli / ollama) rather than the model itself, so every model on a given transport got identical instructions despite known differences in verbosity tolerance, thinking-mode handling, and formatting preference. Third, pass-rate changes had no record of which prompt version produced them, so cause and effect could not be separated. This sprint extracts every prompt into a versioned Markdown file declared in a single manifest, routes resolution through a singleton that picks the right variant by model profile, and threads the resolved prompt's identity onto every reasoning-step audit row so a forensic SQL filter can attribute pass-rate diffs to a prompt change or a model change with the same query. Three style profiles ship: terse for small local models, verbose for API-grade models, and a no-think profile that prepends the directive structurally for thinking-mode models so prompt authors do not need to know which models require it. The resolver enforces the manifest at server startup, surfacing every missing-file gap in one consolidated error rather than as a confusing first-spawn failure. An AST-based CI guard walks production source and rejects any multi-line string that looks like a prompt — the rule survives future PRs that try to land an inline prompt back into Python source. The mechanical refactor lands without changing what any model actually reads; the deferred follow-up writes terse-variant rewrites for the three shipped advisor families and lands the variant-comparison corpus eval tool alongside, so style changes can be measured against the baseline cleanly.

Shipped

KB defense hardening

Closes the residual gap that the earlier ingest-time scanner left open. The regex pre-filter catches injection-shaped syntax synchronously at ingest with zero latency — but the regex layer cannot judge whether a prose paragraph is instructing the agent to leak its capability token, bypass the approval gate, or mirror every proposal to an attacker-controlled URL. This sprint adds a second stage: an LLM classifier fires asynchronously per chunk after commit, stamps a supplementary verdict onto the same chunk metadata, and degrades fail-closed when the classifier backend is unavailable so an outage cannot block ingest. The same sprint adds an entailment judge on every proposal emission — before a proposal is recorded the judge reconstructs the cited evidence and verifies that the proposal's claim actually follows from it. For proposals carrying a high or critical risk level the check is synchronous: a verdict of entails-false (or confidence below an operator-configurable threshold, default 0.75) raises a typed gate error and the proposal is refused at the audit-engine boundary, forcing the agent to re-cite or narrow the claim. Lower-risk proposals get the same verdict written as a fire-and-forget audit event so operators can run forensic queries against the baseline of "claims that drifted past their evidence" without paying the gate's latency. Operators get a per-chunk quarantine surface: any flagged chunk can be hidden from default search results via a bearer-auth-gated endpoint or a per-chunk-CSRF-protected dashboard button, and released chunks return to the flagged state rather than clean so the regex flag stands even after operator review. An adversarial corpus of twenty-four cases (ten syntactic, eight semantic, six benign with one documented false-positive) ships alongside as the regression target — the probe emits a confusion matrix and pins the regex layer's TP/FN/FP/TN. 
Closes the largest remaining "deliberately partial mitigation" entry in the threat-model document; reasoning-redirection injection moves from "judge-model pattern deferred to a future sprint" to "structural surface in place, judge accuracy continues to evolve with the underlying model."

Shipped
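
The risk-dependent behavior of the entailment judge can be sketched as a single decision function. The 0.75 default and the high/critical split come from the text; `EntailmentGateError`, `check_entailment`, and the verdict dictionary are hypothetical names, and the judge model itself is replaced by its already-computed verdict.

```python
BLOCKING_RISK = frozenset({"high", "critical"})


class EntailmentGateError(Exception):
    pass


def check_entailment(
    risk_level: str, entails: bool, confidence: float, threshold: float = 0.75
) -> dict:
    # A claim fails if the judge says it does not follow from the evidence,
    # or if the judge is not confident enough.
    failed = (not entails) or confidence < threshold
    verdict = {
        "risk_level": risk_level,
        "entails": entails,
        "confidence": confidence,
        "failed": failed,
    }
    if failed and risk_level in BLOCKING_RISK:
        # Synchronous path: the proposal is refused.
        raise EntailmentGateError(f"proposal refused: {verdict}")
    # Lower-risk path: the caller records the verdict as an audit event.
    return verdict
```

A low-risk proposal with a failing verdict still returns normally, so the verdict can be logged for later forensic queries without blocking the agent.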

Workflow primitive (Cornerstone)

The substrate for governed multi-phase, multi-day, multi-agent processes. Single-agent sessions are the wrong primitive when the work spans multiple unbounded human decision points (a CAB review window, a regulator's quarterly sign-off, a postmortem stakeholder cycle), requires different agents at different stages, or must produce a single regulator-visible audit chain from instantiation to closure. The workflow primitive is a durable declarative state machine over phases. Phases are typed (agent spawns an agent session, gate waits for resolution, engine runs a deterministic handler inline). Gates are typed (human_approval, multi_resolver_threshold, time_bounded, sub_workflow_complete, condition, auto_pass) — six resolution shapes the reference workflows together force. The engine ticks in the background, serialises per workflow via Postgres advisory locks, and scopes its scan to templates the registry owns so multiple engines coexist (test isolation, blue/green deploys). The audit chain gains a partial-indexed workflow_id column so "everything for this workflow" is one filter, not a JOIN. Two creation paths: Path A instantiates from a YAML template (operator or system event); Path B is the differentiator — an agent with the propose_workflow capability emits a typed WorkflowAdaptation artifact (eight named adaptation kinds: insert/remove/modify phase, add/remove transition, set resolver, override field), the operator reviews the structural diff at /workflows/proposed, and approval activates the adapted workflow. Three structural refusals are enforced regardless of operator approval (no agent identity in resolver_members, no removing backout from irreversible phases, post-adaptation template re-validation). Five built-in templates ship: IBMi fix cycle with CAB approval and backout sub-workflow, compliance audit cycle with parent-spawns-children fanout, compliance remediation cycle, incident postmortem with time-bounded auto-advance, IBMi fix backout. 
Dashboard surfaces parallel the existing agents view; nine API endpoints cover instantiation, listing, detail, cancellation, Path B approval, in-flight gate approval, and template catalog management. Approximately two thousand eight hundred lines of code under aos/workflow/ plus migration 037 (three new tables and ten new audit event types) plus around two hundred ten tests including end-to-end against all three reference workflow shapes.

Shipped

Notification dispatcher

Closes the substrate gap the workflow primitive's incident-postmortem template opened — agents that detect a critical condition can now page humans, not just surface a proposal on a dashboard nobody is watching. Four outbound channels (Slack, email, PagerDuty, generic webhook) ship as native tool handlers; every tool is reversible=False so the approval gate fires before any message leaves the system. An agent cannot spam operators — a human must approve each individual send, and the gate's args-hash binding means replay-by-re-submission is blocked. Credentials never sit in plaintext. Operators declare channels in a YAML file with environment-variable interpolation, the loader resolves the variables at lifespan startup, and an AES-256-GCM ciphertext lands in a new notification_channels table. Agents pass a channel-id slug they read from the operator's documented surface; they never see webhook URLs, SMTP credentials, or PagerDuty routing keys. The HTTP transport retries once on a 5xx or transport error; a 4xx returns an explicit failure rather than raising, so the agent's evidence chain captures the failure context. Channel-type mismatches refuse pre-send (an agent calling send_slack_message against a type=email channel gets a structured refusal rather than a wrong-channel post). A composable with_notifications capability template grants the four tools with memory-ops:read and no child agents — notification is a leaf operation, fan-out composes at the workflow level. Closes the postmortem cycle's last loose end: the workflow primitive's incident-postmortem template now wires its publication phase directly to these tools, fanning out one send per registered channel and recording each outcome on the workflow's payload so operators can read partial-failure detail without grepping the audit chain. 
Engine actions skip the per-call approval gate because the workflow template was already operator-approved at instantiation; the agent-driven path still routes through the gate.

Shipped
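
The transport rule described for the notification dispatcher (retry once on a 5xx or transport error, return an explicit failure on a 4xx rather than raising) can be sketched with the HTTP call injected as a callable. `send_with_retry` and the result shape are assumptions, and `ConnectionError` stands in for whatever transport exception the real client raises.

```python
from typing import Callable


def send_with_retry(post: Callable[[], int]) -> dict:
    """Call post() up to twice; post returns an HTTP status code."""
    last_error = ""
    for attempt in (1, 2):
        try:
            status = post()
        except ConnectionError as exc:
            last_error = f"transport error: {exc}"
            continue  # retryable
        if status >= 500:
            last_error = f"HTTP {status}"
            continue  # retryable
        if status >= 400:
            # Client errors are not retried and not raised: the structured
            # failure lands in the agent's evidence chain.
            return {"ok": False, "error": f"HTTP {status}", "attempts": attempt}
        return {"ok": True, "error": None, "attempts": attempt}
    return {"ok": False, "error": last_error, "attempts": 2}
```

Passing the request as a callable keeps the sketch independent of any particular HTTP library.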

Sandboxed shell execution (run_bash)

The shell sibling of run_python. Same Docker isolation contract — no network by default, read-only rootfs, non-root uid, dropped capabilities, resource caps, hard wall-clock — and a curated CLI toolset baked into the image (jq, gawk, yq, curl, kubectl, postgresql-client, bc, coreutils). No apt-get at runtime; the image is the supply-chain surface and operators rebuild it when a new tool is needed. Closes the gap where the unrestricted bash grant was the only path for an agent that legitimately needs jq on a log payload or kubectl against a cluster API. The host-filesystem case (scanning /var/log/*) deliberately stays on the unrestricted bash grant — the sandbox cannot reach the host filesystem by construction, which is the whole point. Network access is per-agent-class opt-in via a new RegistryEntry.sandbox_network_mode field; the only documented non-default value is bridge, and every AGENT_SPAWN audit payload records the elected mode so a regulator-visible filter answers "which spawns elected outbound reachability" in one SQL query. Agents pass a script/stdin_data/env shape; the env dict is filtered against an operator-managed allow-list before the envelope leaves the host, so an LLM that smuggles a sensitive key into env sees it dropped pre-flight. The with_bash_sandbox capability template grants exactly the run_bash tool plus read/write memory ops — composable with any vertical overlay. Second instance of the sandboxed-execution primitive family the runtime is converging on; future sandboxed languages (Node, Ruby, …) plug in as sibling directories under aos/tools/sandbox/ without touching the language-agnostic invoker.

Sprint spec →

Shipped

Time and date tool family

Closes the temporal-reasoning failure mode that bites every shipped vertical. LLMs fail predictably on date arithmetic, timezone conversion, elapsed-time, and day-of-week reasoning — and the failures appear in proposals (a compliance agent miscalculates a retention deadline, a postmortem renders a timestamp in the wrong timezone). Five pure-stdlib tools delegate every computation to Python's datetime, zoneinfo, and calendar: get_current_time (UTC and optional IANA timezone), parse_date (string → ISO 8601 with explicit refusal for ambiguous slash-formatted inputs rather than guessing), date_diff (days/weeks/months/years with correct month-length and leap-year handling), format_date (long/short/relative styles), and convert_timezone (IANA-to-IANA, naive inputs interpreted in from_tz). All five register reversible=True; inputs and outputs are ISO 8601 strings throughout; every response carries a display field for the operator-facing surface. The with_time_tools capability template grants all five in one template (splitting them is registry surface without security benefit). Composes with any vertical overlay — a compliance audit asking "is this evidence in window" reads the computed value rather than the model's guess.
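
A minimal sketch of two of these behaviours, assuming a simplified return shape: parse_date refusing slash-formatted input instead of guessing, and a whole-month difference that respects month lengths. The months_between helper and its "subtract one month if the end day is earlier" convention are illustrative choices, not the shipped date_diff.

```python
import re
from datetime import date

# 03/04/2025 could be 3 April or 4 March, so refuse rather than guess.
SLASH_DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")


def parse_date(text: str) -> dict:
    text = text.strip()
    if SLASH_DATE.match(text):
        return {"ok": False, "error": "ambiguous slash-formatted date; supply ISO 8601 (YYYY-MM-DD)"}
    try:
        return {"ok": True, "iso": date.fromisoformat(text).isoformat()}
    except ValueError as exc:
        return {"ok": False, "error": str(exc)}


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (assumes end >= start)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


ambiguous = parse_date("03/04/2025")
iso = parse_date("2025-03-04")
leap_days = (date(2024, 2, 29) - date(2023, 2, 28)).days
```

Delegating to datetime means leap years and month lengths are handled by the standard library rather than by the model.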

Sprint spec →

Shipped

Validation tool family

Sibling sprint to the time/date tools, killing a different systemic LLM failure mode: format fabrication. Agents asked "is this a valid email" or "does this JSON match the schema" will sometimes assert validity when the answer is no — the downstream cost is proposals with invalid data that operators rubber-stamp or systems reject. Six pure-Python tools delegate the question to deterministic code: validate_json_schema (draft-7 with dotted-path error messages), validate_regex (returns the match plus capture groups on success), validate_url (format only — no reachability — with an optional require_https guard), validate_email (RFC 5322 format with no SMTP or DNS lookup), validate_format (named-format dispatcher covering UUID, ISO 8601 date and datetime, IBAN with the ISO 13616 mod-97 checksum, and E.164 phone numbers), and validate_json (parse-only check with the failing line and column). All six register reversible=True and return a uniform {valid, errors, match_groups} shape; error messages are human-readable strings rather than codes so the agent's reasoning trail carries the failure directly. The with_validation capability template grants all six in one template, composable with any vertical overlay. The explicit boundary on validate_url and validate_email is documented on the operator surface: format checks only, no network calls — mixing them would collapse two distinct concerns into one ambiguous tool.
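
The IBAN branch of validate_format can be sketched as follows. The function name validate_iban and the error strings are assumptions; the mod-97 check itself is the standard one, and the return value follows the {valid, errors, match_groups} shape described above. Per-country length rules are omitted.

```python
import re

# Two-letter country code, two check digits, then 11 to 30 alphanumerics.
IBAN_SHAPE = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")


def validate_iban(value: str) -> dict:
    iban = value.replace(" ", "").upper()
    if not IBAN_SHAPE.match(iban):
        return {"valid": False, "errors": ["value is not shaped like an IBAN"], "match_groups": []}
    # Move the first four characters to the end, then map letters to numbers (A=10 ... Z=35).
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    if int(digits) % 97 != 1:
        return {"valid": False, "errors": ["mod-97 checksum failed"], "match_groups": []}
    return {"valid": True, "errors": [], "match_groups": [iban[:2], iban[2:4], iban[4:]]}


good = validate_iban("GB82 WEST 1234 5698 7654 32")
bad = validate_iban("GB83 WEST 1234 5698 7654 32")
```

The error strings are readable sentences rather than codes, matching the stated goal that the agent's reasoning trail carries the failure directly.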

Sprint spec →

Shipped

Structured logging tool

Adds a third channel for agent-emitted trace, sitting between the immutable forensic audit chain and ephemeral stdout. Operational commentary — "scanning 47 matching files in QSYSOPR", "evidence vault returned 4 controls without timestamp metadata, falling back to created_at", "primary log backend returned 503; switching to secondary" — belongs in neither: it isn't legally evidentiary, but operators want to read it while the session runs. The new log_observation(level, message, context) tool writes to a dedicated agent_observations table with no hash chain and no MinIO anchor, so rows are rotatable without affecting forensic integrity. The level field is part of the args (debug / info / warn / error) rather than a tool-family split, matching Python's stdlib logging convention. The syscall layer injects session_id, agent_id, and agent_class from the calling ACB — agents cannot smuggle writes into another session's log via the args dict. A live-polled /observations dashboard surfaces the table with filters for session, agent, and level; color-coded level badges; and collapsible context JSON. The operator-facing surface explicitly documents the audit-chain-vs-observation distinction so a vertical author picks the right channel without re-reading the sprint spec. Composes with any vertical overlay via the with_observation_log capability template.
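
The identity-injection rule can be sketched like this. The ACB dataclass fields and the in-memory OBSERVATIONS list are stand-ins for the real control block and the agent_observations table; only the rule itself (identity comes from the caller's ACB, never from args) is taken from the description above.

```python
from dataclasses import dataclass

LEVELS = {"debug", "info", "warn", "error"}


@dataclass(frozen=True)
class ACB:
    """Stand-in for the calling agent's control block (fields assumed)."""
    session_id: str
    agent_id: str
    agent_class: str


OBSERVATIONS: list[dict] = []  # stand-in for the agent_observations table


def log_observation(acb: ACB, args: dict) -> dict:
    level = args.get("level")
    if level not in LEVELS:
        return {"ok": False, "error": f"unknown level: {level!r}"}
    row = {
        "level": level,
        "message": str(args.get("message", "")),
        "context": args.get("context") or {},
        # Identity is taken from the ACB only; same-named keys in args are ignored.
        "session_id": acb.session_id,
        "agent_id": acb.agent_id,
        "agent_class": acb.agent_class,
    }
    OBSERVATIONS.append(row)
    return {"ok": True}


caller = ACB(session_id="s-1", agent_id="a-7", agent_class="investigator")
result = log_observation(
    caller,
    {"level": "info", "message": "scanning 47 matching files", "session_id": "s-999"},
)
```

The smuggled session_id in args never reaches the stored row, which is the property the syscall layer is described as enforcing.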

Sprint spec →

Shipped

Database query tool

Read-only structured database access with per-connection capability scoping. Operators declare connections in a YAML file with environment-variable interpolated DSNs; agents pass a connection-id slug and never see credentials. Two tools: db_query for SELECT-only execution and db_schema for column-name introspection. Six defense-in-depth layers: the LLM only sees the broad tool names; the syscall capability verify checks the broad grant; the handler then decodes the agent's token and refuses unless db_query@<connection_id> is ALSO present for the specific call (the per-connection grants are added at token-mint time on top of the template, narrowing the agent's blast radius to exactly the connections it needs); the handler strips SQL comments, walks past leading parens, and rejects anything that isn't SELECT or WITH ... SELECT; per-connection row and timeout caps clamp the result size and wall-clock budget; and operators are expected to back each DSN with a SELECT-only database user as a final layer. Postgres connections use the existing psycopg pool family; DB2 is supported via a thin shim around the lazy-imported ibm_db_dbi library. Non-JSON-native column types (uuid, timestamp, decimal, bytes) are coerced to strings so the agent receives a uniform shape regardless of column types. Writes are explicitly v2 — they require approval-gate integration per call and an operator opt-in per grant, modeled separately as a future db_write tool.
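
The statement-shape check in the fourth layer might look roughly like the sketch below. The function name and regular expressions are assumptions. It is a prefix check only: in PostgreSQL a WITH clause can wrap a data-modifying statement, which is one reason the SELECT-only database user remains the final layer.

```python
import re

# Line comments (-- ...) and block comments (/* ... */).
_COMMENTS = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
_READ_PREFIX = re.compile(r"(?:SELECT|WITH)\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """True when the first keyword, after comments and leading parens, is SELECT or WITH."""
    stripped = _COMMENTS.sub(" ", sql)
    stripped = stripped.lstrip(" \t\r\n(")
    return _READ_PREFIX.match(stripped) is not None


checks = {
    "/* report */ (SELECT 1)": is_read_only("/* report */ (SELECT 1)"),
    "WITH a AS (SELECT 1) SELECT * FROM a": is_read_only("WITH a AS (SELECT 1) SELECT * FROM a"),
    "-- SELECT\nDELETE FROM t": is_read_only("-- SELECT\nDELETE FROM t"),
}
```

Stripping comments first matters: without it, a leading comment containing the word SELECT could mask a destructive statement.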

Sprint spec →

Shipped

Structured HTTP client

Replaces the prior ungoverned outbound HTTP path: agents could previously call any URL through the in-line http_get with no rate limit, no response size cap, no domain allow-list, and no HTTPS enforcement. The new version registers domains in a YAML allow-list with per-domain config (rate limit RPS, max response bytes, optional TTL cache, path-prefix allow-list, HTTPS enforcement). Calling an unregistered host returns ok:false BEFORE the network sees the call. Two tools ship: http_get is reversible=True and cached per the domain's TTL; http_post is reversible=False so the approval gate fires before every call and the reviewing operator sees the exact URL, headers, and body. Six defense-in-depth layers: domain allow-list, scheme check (require_https), path-prefix membership, async token-bucket rate limit, response size cap with explicit truncated flag, optional per-URL TTL LRU cache. Non-2xx responses return ok:true with the actual status code, because some APIs treat 404 or 409 as valid signals and the handler doesn't force a verdict. Binary content types come back as base64 strings so the agent always receives JSON-safe content. Breaking change for the two capability templates that grant the old http_get (with_log_query_diagnostic and with_threshold_diagnostic): their target domains must be added to config/http_domains.yaml before existing deployments resume working.
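
The first three layers (domain, scheme, path prefix) run before any socket is opened. A minimal sketch, assuming a dict that mirrors what the YAML allow-list might load into; the key names and the example domain are hypothetical:

```python
from urllib.parse import urlsplit

# Hypothetical in-memory form of the per-domain config.
DOMAINS = {
    "api.example.com": {
        "require_https": True,
        "path_prefixes": ["/v1/metrics", "/v1/logs"],
    },
}


def preflight(url: str, domains: dict = DOMAINS) -> dict:
    """Refuse a request before any network I/O if it fails the static checks."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    cfg = domains.get(host)
    if cfg is None:
        return {"ok": False, "error": f"domain not registered: {host}"}
    if cfg.get("require_https", True) and parts.scheme != "https":
        return {"ok": False, "error": "https required for this domain"}
    prefixes = cfg.get("path_prefixes")
    if prefixes and not any(parts.path.startswith(prefix) for prefix in prefixes):
        return {"ok": False, "error": f"path not allowed: {parts.path}"}
    return {"ok": True}


allowed = preflight("https://api.example.com/v1/metrics/cpu")
unknown_host = preflight("https://evil.example.net/")
```

Rate limiting, the size cap, and the cache would sit after this check, since they only apply to requests that are permitted to leave the host.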

Sprint spec →

Shipped

Document fetcher with format-aware extraction

Closes a capability gap with no prior path — agents could not read binary or rich-text documents at all. A PDF evidence attachment returned raw bytes; a DOCX contract was unreadable; an XLSX metrics export was opaque. The new fetch_document(source) tool reads a URL or local path, detects the format (Content-Type header for URLs with extension fallback), dispatches to a per-format extractor, and returns a uniform DocumentResult with title, body text, metadata, and any extracted tables. Five formats ship in v1: PDF via pymupdf (first 200 pages, metadata preserved), DOCX via python-docx (paragraphs as body plus native tables), HTML via BeautifulSoup with main/article preference and nav/footer/script noise stripped, CSV via the stdlib (single table, first row as headers), and XLSX via openpyxl (one table per sheet, sheet name carried through). Each extractor is lazy-imported so the package loads cleanly even when an optional library is missing — a call into an unavailable extractor returns ok:false with the install command rather than crashing at import. Size caps are explicit and operator-visible: 50 MB per document (pre-flight via Content-Length on URLs, streamed-bytes total tracked too so a lying server doesn't sneak through), 100,000 characters of body text, configurable per-table row cap (default 500, ceiling 2000), 200 PDF pages with the actual count preserved on metadata so an auditor can see both numbers. The truncated field fires whenever any cap activates, so operators reviewing a proposal that rests on partial document content can spot the elision from one field. The with_document_reader capability template grants the tool with read+write memory ops (so an agent can stash a long extracted body into working memory without re-fetching) and a larger token budget than the other utility tools because document body content flows directly into the reasoning context. 
Explicit non-goals: OCR for scanned PDFs (use a vision model), PowerPoint, document diff, MinIO evidence anchoring, encrypted documents.
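
The lazy-import behaviour can be sketched as below. The EXTRACTORS table and load_extractor function are illustrative assumptions; only the module and package pairs for python-docx and openpyxl are taken as known, and the error wording is invented.

```python
import importlib

# format -> (importable module, pip package). An illustrative subset only.
EXTRACTORS = {
    "csv": ("csv", "(stdlib)"),
    "docx": ("docx", "python-docx"),
    "xlsx": ("openpyxl", "openpyxl"),
}


def load_extractor(fmt: str) -> dict:
    """Import the extractor's library on first use; never fail at package import time."""
    entry = EXTRACTORS.get(fmt)
    if entry is None:
        return {"ok": False, "error": f"unsupported format: {fmt}"}
    module_name, package = entry
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {"ok": False, "error": f"extractor unavailable; install with: pip install {package}"}
    return {"ok": True, "module": module}


csv_result = load_extractor("csv")
pptx_result = load_extractor("pptx")
```

Deferring the import to call time is what lets the package load cleanly on a host where an optional library was never installed.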

Sprint spec →

Shipped

Plus 43 additional shipped items not individually listed: capability templates, cookbook examples, dashboard views, threat model document, migrations 001–042, test infrastructure, deployment scaffolding, the LLM-backend resilience layer (retry-on-transient, per-backend timeouts), operator-driven capability-secret rotation with per-token revocation, the observability surface (Prometheus metrics, OpenTelemetry tracing, structured JSON logging, and continuously-verified audit-chain anchor sweep), a hardening pass on the deterministic-actor model (single-entry compute-ms billing, dequeue index for million-row scale, audit-id-backed evidence citation in the advisor, native bind_tools on Anthropic and Gemini backends), the field-test feedback loop (per-call advisor telemetry capture, materialized rollup views, operator approve/reject/edit reinforcement signal), and the cross-vertical infrastructure that lets the second vertical reuse everything (vertical column on the telemetry table, per-vertical handler packs in RemediationAgent, per-vertical filter on the dashboard, cross-vertical replay harness). The full set is verifiable in the codebase.

§ 04 / ActiveWhat's in flight now

Nothing in flight at the moment. The workflow-primitive cornerstone (item 72) shipped on 2026-05-19. It is the substrate for governed multi-phase, multi-day, multi-agent processes, with three new tables (workflow, workflow_phase, workflow_gate), six gate kinds, three phase kinds, an eight-kind Path B agent-instantiation protocol, dashboard surfaces parallel to the existing agents view, and end-to-end coverage against three reference workflow shapes (IBMi fix cycle, compliance audit-cycle with parent-spawns-children fanout, incident postmortem with time-bounded auto-advance). Five built-in templates ship under config/workflows. With the cornerstone landed, the next sprint will come off the Next queue below: the IBMi-fix vertical and the compliance-as-workflow refactor that this primitive was blocking are now unblocked. The notification dispatcher (item 22) shipped the following day to close the substrate gap the postmortem template's publish_postmortem engine action depended on: Slack, email, PagerDuty, and generic webhook tools, all routed through the approval gate.

§ 05 / NextWhat's committed to next

Items scoped, prioritized, and waiting in the queue behind active sprints. Each is independently shippable; the order reflects leverage and dependency rather than calendar.

Native tool catalog

Agents & cookbook

CodeChangeAgent

Second vertical class. Produces structured change descriptions for developer review — affected programs, current code, proposed approach, test plan. Read-only; never writes code.

Sprint spec →

§ 06 / LaterWhat comes after

Items scoped and committed, but not in the next release cycle. Sequenced behind the work above.

Tool catalog continued

Source code reader

Tree-sitter-aware extraction returning symbol graphs rather than raw text. Three reversible-by-contract tools — one returns the symbol graph for a file with bodies opt-in, one searches symbols by name pattern and kind across a registered codebase, one returns the body of a specific symbol. First call per codebase builds an in-memory index in a few seconds; later calls hit the cache and return in well under a tenth of a second. The cache invalidates on file change via mtime hashing. Four languages parse out of the box: Python, JavaScript, Go, and SQL. An agent can navigate a 50-thousand-line codebase in a handful of tool calls instead of burning the budget on raw file reads. Codebases are operator-registered; the registry is deploy-time-immutable. Pairs naturally with the non-self-modifying runtime principle — agents can read project source freely under capability grant, but the write_file path-prefix denylist still refuses any attempt to mutate those same trees.
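
One way to picture the mtime-based cache invalidation is a fingerprint over sorted path and modification-time pairs. The function names and the in-process dict cache are assumptions; the build callable stands in for the tree-sitter indexing step.

```python
import hashlib
from pathlib import Path


def codebase_fingerprint(root: str, suffixes: tuple[str, ...] = (".py", ".js", ".go", ".sql")) -> str:
    """Hash the sorted (relative path, mtime) pairs of every source file under root."""
    entries = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            entries.append(f"{path.relative_to(root).as_posix()}:{path.stat().st_mtime_ns}")
    return hashlib.sha256("\n".join(sorted(entries)).encode("utf-8")).hexdigest()


_INDEX_CACHE: dict[str, tuple[str, dict]] = {}


def get_index(root: str, build) -> dict:
    """Return the cached index unless any file was added, removed, or modified."""
    fingerprint = codebase_fingerprint(root)
    cached = _INDEX_CACHE.get(root)
    if cached is not None and cached[0] == fingerprint:
        return cached[1]
    index = build(root)  # the expensive step; a real build would parse symbols
    _INDEX_CACHE[root] = (fingerprint, index)
    return index
```

Computing the fingerprint only needs a directory walk and stat calls, which is far cheaper than re-parsing, so later calls stay fast while still noticing edits.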

Sprint spec →

Shipped

Diff and patch tools

Structured diffs as first-class proposal artifacts. Three read-side tools — one computes a unified diff between two text strings, one parses a unified diff into structured metadata, one validates that a diff would apply cleanly against a target's current content. The agent reads the file's symbol graph through the source-code reader, mentally applies its proposed edit, computes the diff between original and modified, validates the diff against the current file content, and carries the diff in its proposal body for human review. No apply_patch — writing a patched file is an irreversible operation that belongs in an executor vertical behind the approval gate, and the write_file path-prefix denylist still refuses writes into protected trees regardless. The diff is the proposal; patch application happens through standard change-control, not the agent. Validation surfaces the first mismatched line by number so an operator reading the proposal can locate exactly what shifted in the target since the diff was generated.
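
The compute and validate steps can be sketched with the standard library's difflib. The function names and result fields are assumptions, and the validator handles single-file diffs only: it checks that every context and removed line still matches the target and reports the first line that does not.

```python
import difflib
import re

HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def compute_diff(original: str, modified: str, path: str = "file") -> str:
    lines = list(difflib.unified_diff(
        original.splitlines(), modified.splitlines(),
        fromfile=f"a/{path}", tofile=f"b/{path}", lineterm="",
    ))
    return "\n".join(lines) + "\n" if lines else ""


def validate_diff(diff: str, target: str) -> dict:
    """Check that context and removed lines match the target's current content."""
    target_lines = target.splitlines()
    cursor = None  # 0-based position in target_lines within the current hunk
    for line in diff.splitlines():
        header = HUNK.match(line)
        if header:
            cursor = max(int(header.group(1)) - 1, 0)
            continue
        if cursor is None or line.startswith("\\") or line.startswith("+"):
            continue  # file headers, "\ No newline" markers, and added lines
        expected = line[1:]  # strip the leading " " or "-"
        actual = target_lines[cursor] if cursor < len(target_lines) else None
        if actual != expected:
            return {"ok": False, "first_mismatch_line": cursor + 1,
                    "expected": expected, "actual": actual}
        cursor += 1
    return {"ok": True, "first_mismatch_line": None}


original = "a\nb\nc\n"
diff = compute_diff(original, "a\nB\nc\n")
clean = validate_diff(diff, original)
drifted = validate_diff(diff, "a\nx\nc\n")
```

When the target has changed since the diff was generated, the mismatch line number tells the reviewing operator exactly where the two diverged.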

Sprint spec →

Shipped

Allow-listed safe shell

Operator-defined command allow-list as defense-in-depth over capability checks. Bash with grep, find, git log, kubectl get — but not arbitrary commands.

Sprint spec →

Later

Operator infrastructure

Replay harness

aos-replay re-runs a recorded session against a different LLM backend or different prompt and compares outcomes. Critical for evaluating model upgrades.

Sprint spec →

Later

Capability auditor

Walks registered tools, agent classes, and production tokens; reports unused capabilities and over-broad grants. Helps operators tighten capabilities over time.

Sprint spec →

Later

Cost report generator

Queryable rollups across sessions by agent class, time window, and operator. For finance and capacity planning, not per-session inspection.

Sprint spec →

Later

Approval queue exporter

Daily and weekly export of pending and resolved approvals for compliance teams: every destructive action an agent attempted in a window, with outcome.

Sprint spec →

Later

LLM backend strategy

Backend Router

Strategy-based LLM selection per agent class: economy (cheapest model above a configurable quality floor), quality (highest pass rate), local_first (air-gapped by default, escalate only on explicit opt-in). Cascade mode runs the economy model first and escalates to a higher-quality model when response confidence falls below a threshold — achieving near-quality-tier results at economy-tier cost for the majority of calls. Scores seeded from corpus baselines; operator updates them as model quality evolves. Per-agent-class strategy declaration in RegistryEntry.
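
Cascade mode reduces to a small control flow. In the sketch below, the Answer type, the Backend callable signature, and the 0.7 default threshold are all assumptions; how response confidence is scored is not specified here.

```python
from typing import Callable, NamedTuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0; the scoring method is an open design question


Backend = Callable[[str], Answer]


def cascade(prompt: str, economy: Backend, quality: Backend,
            threshold: float = 0.7) -> tuple[Answer, str]:
    """Try the economy backend first; escalate only when confidence is below threshold."""
    first = economy(prompt)
    if first.confidence >= threshold:
        return first, "economy"
    return quality(prompt), "quality"


# Stub backends standing in for real LLM calls.
def cheap(prompt: str) -> Answer:
    return Answer("restart the queue worker", 0.55 if "ambiguous" in prompt else 0.9)


def strong(prompt: str) -> Answer:
    return Answer("drain the queue, then restart the worker", 0.95)


easy = cascade("job queue backlog", cheap, strong)
hard = cascade("ambiguous signal on job queue", cheap, strong)
```

The cost saving comes from the first branch: every call that clears the threshold never touches the higher-priced backend.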

Sprint spec →

Later

Sandbox egress allowlist

When bash_sandboxed runs with network_mode="bridge", all outbound destinations are currently permitted. The egress allowlist restricts outbound connections to a named set declared in the agent's RegistryEntry — so kubectl can reach the cluster API but not arbitrary internet hosts. Enforced via iptables rules inserted before container start. Allowlist visible in AGENT_SPAWN audit payload.

Sprint spec →

Later

Production hardening

Performance characterization

Published p50/p99 latency, soak test results, capacity planning under realistic concurrency. Required for first-customer production deployment at scale.

Sprint spec →

Later

Multi-key authentication with rotation

Per-caller bearer tokens, independent audit attribution, rotation without downtime. Resolves the residual exposure noted in the threat model.

Sprint spec →

Later

OIDC and mTLS for production

Federated identity for API callers and mutual TLS as an alternative to the bearer-token mechanism. Targets enterprise deployments where bearer tokens alone are insufficient.

Sprint spec →

Later

§ 07 / ResearchWhere we're exploring

Open questions where the right answer isn't yet clear. We're prototyping and learning rather than committing. Items here graduate to next or later when scope is confidently bounded — or to out of scope if the right answer is "not us."

Judge-model pattern for reasoning-redirection injection

High-severity proposals routed to a second LLM call that sees only the proposal and evidence. Closes the largest open category in the prompt-injection threat surface, but the architecture has significant ergonomic and cost implications.

Sprint spec →

Researching

Multi-turn injection detection

Adversaries who spread content across many tool calls, steering reasoning gradually, evade single-result pattern detection. No good general defense exists today; we're tracking the research literature.

Sprint spec →

Researching

Capability-token-bound output redaction

Detect JWT-shaped or API-key-shaped strings in tool output and redact before LLM exposure. Prevents agents being convinced to leak their own credentials.
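
Since this item is still in research, the following is only one possible shape for the detector. The JWT pattern relies on the fact that a base64url-encoded JSON header starts with "eyJ"; the API-key pattern and its prefixes are purely hypothetical.

```python
import re

# Three dot-separated base64url segments, the first starting with "eyJ".
JWT_SHAPED = re.compile(r"\beyJ[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}\.[A-Za-z0-9_-]{5,}")
# Hypothetical key shape: a short prefix, a separator, then a long token.
API_KEY_SHAPED = re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9_-]{20,}")


def redact(text: str) -> tuple[str, int]:
    """Replace credential-shaped substrings and return the redaction count."""
    text, jwt_count = JWT_SHAPED.subn("[REDACTED:jwt]", text)
    text, key_count = API_KEY_SHAPED.subn("[REDACTED:api-key]", text)
    return text, jwt_count + key_count


cleaned, count = redact("token=eyJhbGciOi.eyJzdWIiOi.c2lnbmF0dXJl ok")
```

Shape-based matching will produce false positives and misses, which is part of why the audit and operator-visibility design needs research before this becomes a committed item.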

Sprint spec →

Researching

Per-class sandbox resource limits via token claims

Currently sandbox limits are global env vars. Per-class overrides through capability-token claims would let critical agents get more resources than experimental ones, but the audit shape needs design work.

Sprint spec →

Researching

Domain-specific cookbook entries

Beyond the generic incident-response cookbook, full reference implementations for specific verticals. Which verticals first depends on customer signal.

Sprint spec →

Researching

Hybrid actors

An actor with a deterministic main execution path and a reasoning hop at one or more specific decision points. Common shape: a deterministic monitor that calls an LLM only to classify ambiguous signals. Cost-efficient and architecturally cleaner than forcing every actor into one camp; needs design work on capability scoping and audit semantics across the hop.

Sprint spec →

Researching

Local model quality floor

The default model (qwen2.5:7b) passes 52% of corpus cases in a fully air-gapped deployment — below the 70% threshold for unassisted production use. Research question: how much of the gap closes through prompt engineering (terse variants) and cascade configuration versus requiring a larger model? Categorize the failure modes in the remaining 48%, calibrate the corpus oracle against human judgment, and publish a minimum-VRAM recommendation for each quality tier (70%, 80%). Output: a recommendation, not a feature.

Sprint spec →

Researching

Production feedback loop into corpus

The field-test feedback loop (item 29) captures operator approve/reject/edit signals. Those signals are not currently used to extend the replay corpus automatically. Without auto-promotion, the corpus distribution drifts from production incident distribution and pass rates become optimistic over time. Research question: what is the minimum-viable corpus-candidate schema, how do we prevent contamination (model output ≠ ground truth), and what sampling rate keeps the corpus manageable? Output: a workflow design for the implementation sprint.

Sprint spec →

Researching

Operational observability and alerting

The runtime is currently silent between dashboard opens. No signal fires when the review queue has been pending 30 minutes, the cost budget is 80% exhausted, or the anchor sweep has fallen behind. Research question: what are the SLOs worth defining, what are calibrated thresholds grounded in baseline data (not guesses), and does alerting require the notification dispatcher (item 22) first? Output: 4–6 SLOs with thresholds, runbooks for each, and a dependency decision.

Sprint spec →

Researching

Agent task persistence across restarts

LangGraph checkpoints are written after each node; the resume path is not implemented. When the server restarts, in-flight agents are orphaned and the work is lost. The checkpoint table also grows forever. Research question: what is the correct resume semantic (auto for read-only, operator-consent for agents with irreversible tools remaining), how does a new AGENT_RESUME event type maintain audit chain integrity across the restart gap, and what is the checkpoint pruning policy? Output: design spec and implementation plan.

Sprint spec →

Researching

Swarm resilience and failure playbook

Multi-persona swarms pass all authored tests but have no documented failure taxonomy, no tested recovery path, and no operator runbook. Three failure modes to characterize: persona-level crash (one of five personas errors mid-investigation), contested memory state (two personas write conflicting conclusions), and aggregation failure (parent receives partial results). Research question: what is the correct policy for each failure mode, what does contested memory resolution look like as an operator workflow, and what is the safe swarm size upper bound? Output: failure taxonomy, runbooks, and implementation spec for recovery paths.

Sprint spec →

Researching

Multi-tenancy architecture

All agents currently share one database, one audit log, one cost ledger, and one capability JWT root. No isolation boundary exists between callers or teams. Research question: what is the minimum tenant concept (caller-based partitioning, row-level security, or separate deployments), what does the JWT capability model need for a tenant_id claim, and what is the smallest change that opens the multi-tenancy path without a big-bang schema migration? Output: isolation architecture recommendation and first-sprint implementation scope.

Sprint spec →

Researching

§ 08 / Out of scopeWhat we won't build

Items deliberately not pursued. Each has a reason. Listing them is a credibility move: a roadmap that claims to do everything is one that has stopped thinking about trade-offs.

Mid-inference preemption

The scheduler preempts between LangGraph node boundaries, not inside an LLM call. True mid-inference preemption requires model-side cooperation that doesn't exist; we won't pretend otherwise.

Out of scope

Fully-compromised infrastructure protection

If an attacker holds both PostgreSQL and audit-anchor credentials, the chain can be rewritten. Mitigation requires external append-only audit, which is the operator's responsibility, not ours.

Out of scope

End-user authentication

AOSIQ authenticates API callers via bearer tokens. User identity, single sign-on, and role-based access at the application layer are the host application's responsibility, not the runtime's.

Out of scope

Durable execution as a primary product

Temporal, DBOS, Restate, and Inngest serve this category. AOSIQ includes durability as one property among many; we don't compete with specialists on durability alone.

Out of scope

Multi-language sandboxes (Node, Ruby)

Python and sandboxed shell (bash_sandboxed, item 55) cover the immediate execution surface. Node and Ruby are straightforward to add with the same invoker pattern but each requires its own threat model and security review. Deferred until a real customer workflow demands it.

Out of scope

Custom package install at sandbox call time

Operators define the curated package set in the Dockerfile. Agent-controlled pip install is a supply-chain attack surface we deliberately do not open.

Out of scope

Actor logic itself — models, prompts, business code

AOSIQ governs actors; it does not implement them. Bring your own LangGraph agent definitions, your own deterministic scripts, your own business logic. The runtime provides the substrate for governed action; the application is yours.

Out of scope

§ 09 / CadenceHow this updates

This roadmap is updated when items ship, scope, or move between statuses. There are no calendar dates. The runtime moves at the pace of correctness — when a piece of work is correct enough to ship, it ships.

Items move through statuses in one direction: researching → later → next → active → shipped. Items don't move backward in public unless scope is materially reduced; in that case the change appears in the changelog with reasoning.

This page was last updated May 2026. The full version history of this document — including what changed and when — lives in the project repository.

Most recent change: role-scoped tool registry, a second enforcement layer at every tool dispatch. The capability discipline that ships in the runtime today is delegation-bound: a parent agent mints a child token whose tool grants are the intersection of the parent's grants and what the child asked for. That mechanism is intact and load-bearing. The gap this change closes is that capability narrowing was entirely delegation-bound: the registry itself did not know that "a researcher agent shouldn't have a destructive admin tool." A misconfigured parent (or one that's been jailbroken into requesting a broad grant set) could hand the child whatever the parent itself held. The new layer adds an intrinsic role-allowlist ceiling: for every agent_class an operator declares which tools the role may ever invoke, in a single deploy-time-immutable YAML. At every tool call both checks must pass (the token must grant the tool AND the role must list it in allowed_tools), so even a prompt-injection-driven over-grant cannot exceed the role's intrinsic boundary. The role definitions support single-parent inheritance so a senior-investigator role can extend the base investigator role with one extra capability rather than re-listing the entire set. A wildcard fallback role preserves backward compatibility: any agent_class not explicitly declared falls through to the wildcard and only the existing token check applies, logged once per process at INFO so operators see which classes are still on the fallback path. A distinct audit event (CAPABILITY_ROLE_DENIED) fires when the second layer refuses, separate from the existing CAPABILITY_DENY for token-gate refusals, so a forensic query answers "which gate fired" in one filter instead of grepping a free-form reason string. The new exception subclasses the existing one, so every existing capability-denial handler still catches role denials operationally; only audit emission and the dashboard need to know the difference.
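
The two-check dispatch described above can be sketched as follows. The ROLES dict is a hypothetical in-memory form of the YAML, and the "ALLOW" return value is invented; CAPABILITY_DENY and CAPABILITY_ROLE_DENIED are the audit event names from the text.

```python
from typing import Optional

# Hypothetical loaded form of the role YAML; "*" is the wildcard fallback.
ROLES = {
    "investigator": {"parent": None, "allowed_tools": {"db_query", "http_get"}},
    "senior-investigator": {"parent": "investigator", "allowed_tools": {"run_bash"}},
    "*": {"parent": None, "allowed_tools": None},  # None means token check only
}


def role_ceiling(agent_class: str, roles: dict = ROLES) -> Optional[set]:
    """Resolve a role's allowed tools, following single-parent inheritance."""
    entry = roles.get(agent_class, roles["*"])
    if entry["allowed_tools"] is None:
        return None
    tools = set(entry["allowed_tools"])
    seen = {agent_class}
    parent = entry["parent"]
    while parent is not None:
        if parent in seen:
            raise ValueError(f"inheritance cycle at role {parent!r}")
        seen.add(parent)
        tools |= roles[parent]["allowed_tools"] or set()
        parent = roles[parent]["parent"]
    return tools


def check_dispatch(agent_class: str, tool: str, token_grants: set) -> str:
    if tool not in token_grants:
        return "CAPABILITY_DENY"         # first layer: the token gate
    ceiling = role_ceiling(agent_class)
    if ceiling is not None and tool not in ceiling:
        return "CAPABILITY_ROLE_DENIED"  # second layer: the role ceiling
    return "ALLOW"


over_granted = check_dispatch("investigator", "run_bash", {"run_bash", "db_query"})
inherited = check_dispatch("senior-investigator", "db_query", {"db_query"})
```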
Migration 041 widens the audit-event CHECK constraint with the one new reserved type. No new runtime dependencies; the registry loader is ~300 lines of pure-Python YAML parsing and inheritance resolution. A new HTTP route surfaces the second-layer denials per agent so a red-team operator can answer "which calls did the role gate refuse" without writing SQL.

Prior change: MAST reliability guards in the advisor finalize path. The 2025 multi-agent failure taxonomy (Cemri et al., 1,600+ annotated traces, κ = 0.88) identifies four failure modes that account for over half of observed multi-agent failures. Three of them are now caught before a proposal becomes a committed result. Each guard attaches a structured flag to the proposal and emits one audit row so operators can track failure-mode frequency over time without a separate dashboard. The first guard measures token overlap between the proposal's reasoning text and the text of its proposed actions: a proposal that diagnoses "disk exhaustion" but proposes restarting an unrelated service leaves a measurable mismatch. The second compares the proposal's vocabulary against the original task description; an investigation asked to analyse a job-queue backlog that emits a disk-quota proposal trips the scope-drift guard. The third is opt-in and lives on the verifier family slot: structural completeness checks on the proposal's required fields and confidence range, with an operator-promotable hard-stop mode that returns an honest emergency refusal instead of the malformed proposal when the verifier rejects. All three guards are pure-Python (no LLM-as-judge, no per-call cost, microseconds per finalize), so they run on every advisor session without budget concerns. Thresholds are operator-tunable (one env var per guard) and the strict hard-stop mode is gated by a third env var, off by default because warn-severity flags carry signal an operator may legitimately accept.
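
A rough sketch of the first guard's overlap measurement. The text above does not say which overlap metric is used; Jaccard similarity, the tiny stop-word set, and the 0.1 threshold are all illustrative assumptions.

```python
import re

# Tiny illustrative stop-word set; the shipped tokenizer uses a larger one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "on", "in", "for"}


def tokens(text: str) -> set:
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def overlap(reasoning: str, actions: str) -> float:
    """Jaccard similarity between the two token sets (0.0 when either is empty)."""
    a, b = tokens(reasoning), tokens(actions)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reasoning_action_flag(reasoning: str, actions: str, threshold: float = 0.1) -> dict:
    score = overlap(reasoning, actions)
    return {"flag": score < threshold, "score": score}


mismatch = reasoning_action_flag(
    "disk exhaustion on /var filling from logs",
    "restart the billing service",
)
aligned = reasoning_action_flag(
    "disk exhaustion on db-1",
    "purge old logs to relieve disk exhaustion",
)
```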
Migration 040 widens the audit-event CHECK constraint with two new reserved types (AGENT_RELIABILITY_FLAG, AGENT_RELIABILITY_VERIFIED). No new runtime dependencies; the tokenizer ships as a small intersection-of-NLTK-spaCy stop-word set in roughly 60 lines of pure-stdlib Python.

Prior change: startup drift detector for installed Ollama models. Closes the natural follow-up to the airgap-ledger alignment: the runtime's per-model capability table is hand-maintained, but operators pull new Ollama tags independently. Without a probe, the operator finds out a tag is unregistered the first time they try to use it, either via a silent gate refusal (the default for unknown tags is to refuse tool-requiring agents) or via an empirical hang several minutes into a run. At server startup the runtime now diffs the installed model list against the ledger and emits one structured log line in one of three shapes: skipped when Ollama isn't reachable (the operator hasn't started it, or the configured URL is wrong), happy when every installed tag is registered, and a warning naming each unregistered tag and pointing at the registration protocol when the lists diverge. The probe is intentionally asymmetric: it warns only on installed-but-unregistered, never on registered-but-not-installed. The inverse warning would fire on essentially every deployment (no operator installs the full ledger) and train operators to ignore the probe entirely. The whole path is diagnostic-only: every failure mode (connection refused, timeout, malformed response, HTTP error) returns a structured outcome rather than raising, and an outer paranoid try/except swallows anything that escapes anyway, so the probe cannot block startup even in pathological cases. No new environment variables, no new audit events, no new runtime dependencies.
The whole addition is roughly forty lines of probe code plus a small log-shape helper, with fifteen tests covering the eleven failure modes of the underlying HTTP call plus the three log shapes plus the ledger-key filter rule.

Prior change: Ollama tool-capability ledger aligned to airgap measurements. A small-PR-shaped follow-up to an airgap sample sweep across seven local models that surfaced a divergence between what the runtime's per-model tool-calling table claimed and what models actually do under the production reasoning loop. Five entries flipped to refused: the Qwen2.5-Coder family (the prior comment claimed "better at structured tool emission for code tasks"; measurement shows the model converges to a syntactically valid proposal that classifies every operational incident as out-of-scope), Qwen2.5:32b and the Phi4-mini family and Gemma4-MoE-26B (all three emit prose with no parseable tool call and hang the reasoning loop on the operational-advisor corpus), and DeepSeek-R1:32b (Ollama's API refuses the tools field outright on reasoning-model tags). Each refused entry now carries a comment citing the specific airgap measurement rather than a vendor's marketing claim. The gate behaviour is the same as before (the runtime refused models with no ledger entry by default), but the error message moves from "no tool-capability entry" (reads like an oversight) to "registered as supports_tool_calling=False" with the empirical citation (reads like a deliberate decision an operator can investigate). Pin tests guard against a future well-meaning edit silently re-enabling a known-bad model without re-measuring.

Prior change: AOS-native replay backend shipped.
The replay harness gains a second execution path that runs corpus incidents through the production AOS spawn / poll / fetch API rather than the in-process runner, so the scheduler, audit chain, capability gate, approval policy, and cost ledger all participate in the measurement the same way they would in production. A per-incident wall-clock timeout (default 180 seconds) lets the harness record an incident as errored rather than hanging the run when a model fails to emit any parseable tool call; the operator-visible result is a clean completion with one row marked errored, instead of a seven-minute silent stall. A per-vertical corpus-to-spawn mapper translates corpus rows into the agent_class + task shape the AOS spawn API expects, and the report now separates pass rate (model gave a correct answer) from errored count (runtime could not get an answer at all) so a regression in one signal does not mask the other. The same report carries LLM_TURN economics (a per-incident token and cost summary), so the cost story is a property of the harness output, not a post-hoc query.

Prior change: sensible Ollama context default shipped. The bundled docker-compose now sets OLLAMA_NUM_CTX=8192 by default, closing the root cause that drove three of the airgap sweep's hung-incident rows. Several model files declare a 131072-token context window, and Ollama pre-allocates the full KV cache up front on first load, spilling roughly 15 GB of VRAM for a 2.5 GB model file and dramatically slowing inference. A direct probe with the 8K override returns a one-token response from phi4-mini in five seconds where the default times out at three minutes. The recommendation now lands as a default rather than a runbook footnote.

Prior change: diff and patch tools shipped (item 32). Three read-side tools that complete the proposal-side tooling pair with the source-code reader.
One computes a unified diff between two text strings using the standard library; one parses a unified diff into structured metadata (files changed, per-file hunks, line counts, new-file and deleted-file flags); one validates that a proposed diff would apply cleanly against a target's current content and names the first mismatched line by number when it wouldn't. The agent reads source through the reader, computes the diff between the current file and its mental edit, validates the diff against the target's actual current content, and surfaces the diff in its proposal body for human review. There is deliberately no apply-patch tool: writing a patched file is an irreversible operation that belongs in an executor vertical behind the approval gate. The diff is the proposal; the operator applies it through standard change control. The read-write split now holds end-to-end at the tool layer: agents read source freely, produce structured diffs as proposals, and the runtime self-protection denylist still refuses writes into protected trees regardless of grant.

Prior change: source-code reader shipped (item 31). Three read-side tools that let agents navigate large codebases by symbol graph rather than raw file bodies. The first call per codebase builds an in-memory index in a few seconds. A hash over the sorted file paths and modification times keys the cache, so subsequent calls return in well under a tenth of a second and the cache invalidates automatically on the next file change. Four languages parse out of the box: Python, JavaScript, Go, and SQL. An agent looking for the spawn function in a fifty-thousand-line codebase searches by name to locate the file, reads the file's symbol graph to see what else lives there, and pulls the body of the one target symbol: a handful of tool calls instead of burning the budget on sequential reads.
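One way to build a cache key of the kind described above (file paths plus modification times) is sketched here. The function name `index_cache_key`, the choice of SHA-256, and the exact byte layout fed to the hash are assumptions for illustration; the shipped reader may key its cache differently.

```python
import hashlib
from pathlib import Path


def index_cache_key(root: str) -> str:
    """Hash every file's relative path and mtime under root, in sorted order."""
    entries = sorted(
        (str(path.relative_to(root)), path.stat().st_mtime_ns)
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    digest = hashlib.sha256()
    for relative_path, mtime_ns in entries:
        # A NUL separator keeps "a" + "b1" distinct from "ab" + "1".
        digest.update(f"{relative_path}\0{mtime_ns}\n".encode("utf-8"))
    return digest.hexdigest()
```

Because the key changes whenever any file's modification time changes, or a file is added or removed, a stale index is never looked up under the new key. That is the automatic invalidation the entry describes.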
All three tools register read-only; the recently shipped write_file path-prefix denylist refuses writes into the same source trees the reader navigates, so an agent granted both reader and writer still cannot mutate the source it just read. The capability template grants the three tools narrowly; the codebase registry is operator-managed and deploy-time-immutable, following the same precedent as the shell allow-list and notification channels.

Prior change: write_file path-prefix denylist shipped. Closes the explicit follow-up the non-self-modifying runtime principle named in its consequences section. Two protected surfaces (the runtime config directory and the runtime source tree) were previously held only by absence of grant: no agent class had write permission on those paths, but a misconfigured custom template could have granted it. The denylist refuses such writes at the handler level before any disk activity and before an operator would be asked to approve, so the runtime's structural self-protection now survives operator grant errors rather than depending on them being absent. Symlink resolution runs before the prefix check so an attacker-controlled symlink dance cannot bypass the boundary. Operators can extend the list at deploy time; canonical entries cannot be disabled at runtime, because disabling is an architecture-decision amendment, not a configuration toggle. Twenty-two tests pin every protected prefix plus the symlink-escape path.

Prior change: reviewer-family scheduler alignment shipped. Closes the half of composition-layer-v1 that was honestly deferred last commit. The reviewer family was intentionally non-LangGraph-driven: operators called the registration helper directly and invoked the reviewer's `review` method themselves rather than going through the scheduler.
After this sprint a family-level LangGraph wrapper turns any reviewer into a graph the scheduler can dispatch, the third vertical (code review) reaches the scheduler through the same `register_*` shape as the advisors with the same audit / capability / approval / cost / reasoning-trace forensics, and a uniform proposal-retrieval endpoint reads `agent_processes.result` for every vertical through one stable envelope. The dogfood loop ships as a thin CLI that pipes stdin to the running server, polls until terminal, and renders findings, with no in-process imports; the exit code maps to the recommendation. Future reviewer verticals (SQL, IaC, configuration) drop in as thin subclasses on the family-level graph wrapper.

Prior change: composition-layer-v1 shipped (ADR-006 Phase 1). Wiring-debt hardening sprint that closes a class of drift the agent-class deep review surfaced: AOS shipped code for three verticals (operational advice, compliance audit, code review) but only one was actually wired into the running server's startup. After this sprint, the compliance vertical and both remediation handler packs (four operational, four compliance) register at lifespan startup, so the documented surface and the running surface come back into alignment. A new env-var-gated startup assertion turns the wiring contract into a boot-time check so future drift surfaces before any operator hits a 4xx. The synopsis vocabulary half of the sprint adopts the 14-layer positioning model in the architecture doc and on this site's architecture section, with an honest per-layer scorecard: the layers that score low (Intent, Planning) are positioning vocabulary, not yet shipped surfaces, and the doc says so plainly.
The third vertical's scheduler-driven registration alignment turned out to be a bigger architectural change than the sprint scoped (the reviewer family is intentionally non-LangGraph-driven), and was honestly deferred to its own follow-up sprint in the queue rather than padded into this one.

Prior change: prompt management (item 62) shipped. Pulls every prompt out of Python source into versioned Markdown files declared in a single manifest. Resolution routes by model profile rather than transport, so a small local model and a frontier API model on the same backend see different prompts when that helps. The resolved prompt's identity threads onto every reasoning-step audit row alongside backend and model, so a forensic SQL filter on payload->>'prompt_version' isolates whether a corpus pass-rate change tracks a prompt change or a model change with one query. An AST-based CI guard walks production source and rejects multi-line strings on the known prompt target names; a documented per-line escape exists for the rare correct-by-construction case. The mechanical refactor landed without changing what any model actually reads; the deferred follow-up writes terse-variant rewrites for the three shipped advisor families and lands the variant-comparison corpus eval tool alongside, so style changes can be measured against the baseline cleanly.

Prior change: LLM turn audit (item 61) shipped. Closes the last forensic gap in the audit chain: until that sprint the runtime recorded every tool call but discarded the LLM's reasoning that produced it. The new LLM_TURN event captures each reasoning step's assistant message; the hash plus an object-storage key land on the audit row while the full text lives under the same bucket the tamper-evidence anchors use. Every subsequent tool-call row sets a causation pointer back to the reasoning turn that produced it, so the reasoning → action chain becomes a single indexed SQL JOIN.
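The causation-pointer join can be illustrated with a toy schema. Everything below is an assumption made for the sketch: the table and column names (`audit_events`, `causation_id`, `object_key`), the `TOOL_CALL` event name, and the use of in-memory SQLite as a stand-in for the production database. Only the LLM_TURN event type and the pointer-back idea come from the description above.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a tiny audit table with a self-referencing causation pointer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE audit_events (
            id             INTEGER PRIMARY KEY,
            agent_id       TEXT NOT NULL,
            event_type     TEXT NOT NULL,
            causation_id   INTEGER REFERENCES audit_events(id),
            tool_name      TEXT,
            object_key     TEXT  -- where the full reasoning text would live
        );
        CREATE INDEX idx_audit_causation ON audit_events(causation_id);
        """
    )
    conn.executemany(
        "INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "agent-1", "LLM_TURN", None, None, "turns/1.txt"),
            (2, "agent-1", "TOOL_CALL", 1, "read_file", None),
            (3, "agent-1", "TOOL_CALL", 1, "compute_diff", None),
            (4, "agent-2", "LLM_TURN", None, None, "turns/4.txt"),
            (5, "agent-2", "TOOL_CALL", 4, "read_file", None),
        ],
    )
    return conn


def reasoning_action_pairs(conn: sqlite3.Connection, agent_id: str) -> list:
    """Return (turn id, action id, tool name) for every tool call of one agent."""
    return conn.execute(
        """
        SELECT turn.id, action.id, action.tool_name
        FROM audit_events AS action
        JOIN audit_events AS turn ON turn.id = action.causation_id
        WHERE action.event_type = 'TOOL_CALL'
          AND turn.event_type = 'LLM_TURN'
          AND action.agent_id = ?
        ORDER BY action.id
        """,
        (agent_id,),
    ).fetchall()
```

Calling `reasoning_action_pairs(demo_db(), "agent-1")` pairs each tool call with the reasoning turn that produced it, without any application-side stitching.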
The same data flows through three operator surfaces: a Reasoning Trace section on the agent detail page, server-sent-event previews as the agent reasons, and a pair of REST endpoints plus an MCP tool for programmatic post-hoc review. Emission is wired in the two reasoning paths that cover every shipped vertical today; a documented one-line adoption pattern brings reviewer-family and swarm reasoning paths into the same chain on demand.

Prior change: action execution framework (item 60) shipped. Two reserved family slots, executor and reporter, were promoted to shipped, taking the family catalog from four shipped + three reserved to six shipped + one reserved. The executor family is the first write-side member: a deterministic dispatcher that walks an operator-approved proposal's actions in order, capability-narrows per action against a base template that ships zero tools, records an idempotency cursor for crash-recovery, and routes failure modes through dedicated audit events. The existing RemediationAgent was refactored onto the dispatcher, shrinking by roughly forty percent while preserving its public surface bit-identically. The reporter family closes the lifecycle: it consumes the verifier's structured result and the audit chain for the proposal, then emits a runbook-ready summary plus concrete next steps. Its status mirrors the verifier by default, and only the reporter's own parse-fallback path downgrades to needs-more-info, distinguishing reporter trouble from underlying remediation outcomes. End-to-end lineage threads through one indexed proposal-id column so a single SQL filter returns every event in the lifecycle.

Prior change: roadmap expanded to reflect the full sprint queue and product vision. Three implementation-ready sprints remain queued (items 55, 63, 64): sandboxed shell execution, KB defense hardening, and backend router.
Six study sprints sit in the research section (items 65–71) covering the known architectural gaps: local model quality floor, production corpus feedback loop, operational observability, agent task persistence, swarm resilience, and multi-tenancy. Study sprints are research-and-design briefs, not committed features — each graduates to next or later when scope is confidently bounded.

Prior change: CodeReviewAgent (item 59) shipped as the third vertical. First vertical authored from scratch on the generic agent framework — proving the framework's "next vertical reduces to a configuration class" claim with new code rather than retrofitted code. CodeReviewAgent reviews either a pasted code snippet or a unified git diff and emits a structured proposal with categorized findings (one of: security issue, likely bug, correctness concern, style violation, performance concern, missing test, out of scope) at one of four severities, plus a recommendation derived from the severities by default. Honest refusal stays first-class: when the artifact isn't reviewable (binary content, generated code, vendored dependencies), the reviewer emits a single out-of-scope finding rather than fabricating issues. Customer profile is the dogfood loop — the operator running Claude Code on their workstation pipes the diff through AOS for governance before merging. The third vertical landed in roughly 330 lines of vertical-specific code (configuration + LLM parse logic + domain content) on top of the framework's substrate, with an eight-case smoke corpus exercising the pass/fail/edge matrix across both artifact shapes at 100% under the scripted backend.
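A minimal sketch of how a recommendation might be derived from finding severities follows. The seven categories are the ones listed above; the four severity names, the recommendation strings, and the thresholds are hypothetical, since the roadmap does not name them.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Category(Enum):
    SECURITY_ISSUE = "security issue"
    LIKELY_BUG = "likely bug"
    CORRECTNESS_CONCERN = "correctness concern"
    STYLE_VIOLATION = "style violation"
    PERFORMANCE_CONCERN = "performance concern"
    MISSING_TEST = "missing test"
    OUT_OF_SCOPE = "out of scope"


class Severity(IntEnum):
    # Hypothetical names; the source only says there are four severities.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Finding:
    category: Category
    severity: Severity
    message: str


def derive_recommendation(findings: list) -> str:
    """Default recommendation: driven by the worst severity present."""
    if any(f.category is Category.OUT_OF_SCOPE for f in findings):
        # Honest refusal: the artifact was not reviewable at all.
        return "not_reviewable"
    if not findings:
        return "approve"
    worst = max(f.severity for f in findings)
    return "request_changes" if worst >= Severity.HIGH else "approve_with_comments"
```

Keeping the out-of-scope case ahead of the severity check means a refusal is never mistaken for a clean review.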

Prior change: the generic agent framework (item 58) shipped. The LangGraph machinery behind both reasoning verticals was extracted into a parameterized base class; each shipped vertical reduced to a configuration class plus a registration helper. The operational and compliance modules shrank to under 400 lines each, from roughly 1000. Three additional reference families shipped alongside (reviewer, guardrail, verifier), each with a corresponding role-based capability template that operators compose with a vertical-specific overlay at capability-mint time. The audit chain gained five new event types linked through a proposal-lifecycle UUID column, so the "show me everything that happened on this proposal" query reads one indexed column. The tool registry gained explicit risk-level metadata, so the approval gate routes high-risk reversible actions to human review without per-deployment manual labelling.
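The risk-level routing mentioned above might look roughly like this. The `ToolSpec` fields, the example tool names, and the fail-closed treatment of unknown tools are assumptions for illustration; the roadmap states only that risk-level metadata lets the gate route high-risk reversible actions to human review.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    risk_level: RiskLevel
    reversible: bool


# Hypothetical registry entries, not the shipped tool catalog.
REGISTRY = {
    "read_source": ToolSpec("read_source", RiskLevel.LOW, reversible=True),
    "restart_service": ToolSpec("restart_service", RiskLevel.HIGH, reversible=True),
    "delete_volume": ToolSpec("delete_volume", RiskLevel.HIGH, reversible=False),
}


def requires_human_review(tool_name: str, registry: dict = REGISTRY) -> bool:
    """Decide from registry metadata alone whether a call needs approval."""
    spec = registry.get(tool_name)
    if spec is None:
        return True  # assumption: unknown tools fail closed
    if not spec.reversible:
        return True  # irreversible calls always go through the gate
    return spec.risk_level is RiskLevel.HIGH
```

Because the decision reads only registry metadata, a new deployment inherits the routing without labelling each tool by hand.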

Earlier change: the empirical-validation arc closed. The 50-incident replay harness (item 30) and the field-test feedback loop (item 29) both moved from next to shipped. The harness now publishes cross-vertical, multi-backend baselines per oracle version: the air-gapped local-LLM tier (mistral-small:22b on Ollama) reaches 80% pass rate — at parity with the API tier (Gemini Flash 78%, OpenAI gpt-4o-mini 76%, Claude Haiku 70%) — for operator deployments that cannot route incident data to a third-party API. Corpus oracles are versioned per case, and an offline rescore tool reproduces any historical baseline against its prior oracle state, so revisions never destroy prior numbers. The feedback loop is now end-to-end: structured per-call advisor telemetry, three-button operator outcome capture, a reinforcement worker over experiential memory, two materialized rollup views, a dashboard advisor-trends page with a per-vertical filter, and a telemetry-to-corpus exporter that promotes operator-confirmed incidents into the regression set automatically. The advisor's pass rate is now a function of (backend, oracle version) rather than a single number, and the methodology to audit, revise, and rescore is itself an artifact in the repo.
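To make the "(backend, oracle version)" framing concrete, here is a small sketch of offline rescoring against per-case versioned oracles. The `Case` shape, the integer version keys, and the exact-match pass criterion are assumptions; the real harness may compare answers differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    oracles: dict  # oracle version -> expected answer at that version


def rescore(cases: list, recorded_answers: dict, oracle_version: int) -> float:
    """Pass rate of one recorded run against one historical oracle state."""
    scored = [c for c in cases if oracle_version in c.oracles]
    if not scored:
        raise ValueError(f"no case carries oracle version {oracle_version}")
    passed = sum(
        1 for c in scored
        if recorded_answers.get(c.case_id) == c.oracles[oracle_version]
    )
    return passed / len(scored)


def baseline_table(cases: list, runs: dict, versions: list) -> dict:
    """Pass rate keyed by (backend, oracle version) instead of a single number."""
    return {
        (backend, version): rescore(cases, answers, version)
        for backend, answers in runs.items()
        for version in versions
    }
```

Since old oracle states stay attached to each case, revising an oracle adds a new version key rather than overwriting the answer that earlier baselines were scored against.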

Prior changes: ComplianceAuditAgent (item 57) shipped as the second vertical class, proving the canonical pattern (reasoning advisor over deterministic actor pack, structured proposal, capability + approval gate) generalizes beyond operational incidents. Six new scenarios across SOC2, HIPAA, and ISO27001 frameworks; cross-vertical machinery (per-vertical telemetry column, per-vertical handler packs in RemediationAgent, per-vertical dashboard filter, harness that walks both corpora) means a third vertical now reduces to one advisor + one actor pack + one corpus + two registration-table entries. Earlier in the same arc, RemediationAgent v1 (item 56) shipped as the execution arm of the OperationalAdvice vertical: a deterministic dispatcher walks an operator-approved proposal's proposed_actions in order, spawning per-action handlers whose capability is narrowed to one tool, with the existing approval gate firing on every irreversible handler call. OperationalAdviceAgent v1 (item 18) shipped before either as the first domain vertical, landing alongside the two architectural primitives it depended on (sandboxed execution, item 17, and the DeterministicActor primitive, item 19). Three foundational hardening sprints followed the first vertical and preceded the second: the LLM-backend resilience layer (retry-with-backoff on transient cloud errors, per-backend timeouts, scheduler-level rate-limit recovery rather than silent zombification); operator-driven credential hygiene (per-session HMAC-secret rotation with a configurable grace window plus per-token revocation via a JTI denylist); and the observability surface (Prometheus metrics, OpenTelemetry tracing, structured JSON logging with correlation IDs, and a periodic anchor-sweep job that continuously verifies the audit chain). Item 27 (Knowledge-base ingest scanner) shipped immediately after, closing the most visible "deliberately partial mitigation" claim in the threat-model document.
bash_sandboxed (item 55) remains the next instance of the sandbox primitive family.
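The RemediationAgent dispatcher pattern described above (ordered walk, one-tool capability per action, approval on irreversible calls) can be sketched as follows. All names here are hypothetical, and stopping the walk on a denied approval is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Callable


class CapabilityError(Exception):
    """Raised when a handler reaches for a tool outside its narrowed grant."""


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict
    irreversible: bool = False


def make_invoker(capability: frozenset, tools: dict) -> Callable:
    """Return a tool-calling function restricted to the given capability."""
    def invoke(tool: str, **kwargs):
        if tool not in capability:
            raise CapabilityError(tool)
        return tools[tool](**kwargs)
    return invoke


def dispatch(actions: list, handlers: dict, tools: dict, approve: Callable) -> list:
    """Walk approved actions in order; each handler sees exactly one tool."""
    results = []
    for index, action in enumerate(actions):
        if action.irreversible and not approve(action):
            results.append((index, action.tool, "denied"))
            break  # assumption: a denied step halts the remaining actions
        invoke = make_invoker(frozenset({action.tool}), tools)
        outcome = handlers[action.tool](invoke, action.args)
        results.append((index, action.tool, outcome))
    return results
```

A handler that tries to call any tool other than the one named in its action raises `CapabilityError`, which is the point of narrowing per action rather than per agent.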

Building something that depends on specific items here?

If your evaluation hinges on a specific roadmap item (performance numbers, a particular tool, a backend addition), that's worth a conversation. Roadmap order can shift in response to real customer signal, and a conversation conveys that signal in a way that reading public documents alone cannot.
