§ 01 / LegendHow to read this
Six status values describe each item's position in the work pipeline. They do not describe calendar dates. An item marked next ships before an item marked later regardless of when either actually arrives.
§ 02 / PositionWhere we are
AOSIQ is at v0.9.0 — a production-shaped alpha. The governance substrate is complete for both actor types: capability narrowing, tamper-evident audit, approval gates, composite crash recovery, cost ledger, multi-backend LLM abstraction, anti-hallucination evidence stack, prompt-injection defenses, sandboxed code execution, and the DeterministicActor primitive are all shipped. The runtime now governs reasoning agents and scheduled / on-demand deterministic actors under one regime.
With the substrate complete, the recent architectural focus has moved to first verticals — domain agent classes that compose the substrate into solutions an operator recognizes. OperationalAdviceAgent v1 (item 18) shipped as the canonical pattern: a reasoning advisor over a pack of deterministic diagnostic actors, emitting a single structured proposal humans review. Every future vertical follows this same shape; the substrate doesn't change.
The runtime is no longer the bottleneck. Empirical validation is in place — the cross-vertical replay harness publishes multi-backend baselines per oracle version, the air-gapped local-LLM tier is at parity with API backends, and corpus oracles are versioned so revisions never destroy prior baselines. The current focus is the long tail of native tools, customer-side integration patterns, and additional vertical classes onto the canonical advisor-over-actor-pack shape.
Items in § 05 of the threat model cover security-mitigation roadmap items specifically. This page is broader, covering the full product trajectory across both actor types.
§ 03 / FoundationWhat's shipped
The substrate the rest of the platform builds on. All items below are in the released runtime and verified against tests.
Core runtime
JWT capability tokens with intersection-narrowing delegation. Verification fires before every tool call, every memory operation, every spawn.
Per-session SHA-256 hash chain with anchor objects in independently-credentialed object storage. Mid-chain tampering detectable.
Tools registered reversible=False require explicit operator approval bound to (tool, args_hash). Single-use, replay-safe.
LangGraph thread state, agent control block, and working memory captured atomically. Worker heartbeat + orphan reaper handle worker crashes.
Per-call recording with model, tokens, and computed USD. Configurable session ceilings raise exceptions before the API call.
enforced / warn / disabled via env var. Production deployments require enforced; dev runs in warn with a response header tripwire.
LLM abstraction & agents
Anthropic, AWS Bedrock, OpenAI, Google Gemini (AI Studio), local Ollama, Claude Code CLI shim — selected at construction via factory. Each backend lazy-imports its optional dependency; Gemini ships with the [google_genai] extra.
allowed_backends per agent class. Mismatches raise at construction; an Ollama-only class cannot accidentally route to Anthropic.
Schema-bound tool calls, Pydantic validation, audit-row evidence verification, evidence stamps, loop guards, force-investigate gate, abandonment after refusal.
Untrusted-content delimiters around every tool result, plus pattern detection for tool-call syntax, role prefixes, and known injection phrases.
Parent agents fan out one child per declared persona, each running with persona-overlay system prompts; structured aggregation across children's proposals.
BugHunt, CodeReview, ArchitectureDecision, FeatureDesign, ReleaseReadiness. All read-only; all produce structured proposals via BaseProposal.
Knowledge & integration
pgvector-backed document store with HNSW index, semantic search, and audit-evidence integration. Operator-loaded corpora.
Eleven AOSIQ governance operations exposed as MCP tools so any MCP client (Claude Code, Cursor, Cline) can dispatch governed swarms.
Agents reach external knowledge bases, internal APIs, and consumer services through MCP. Bridged tools inherit capability, audit, and approval.
Metadata filtering, hybrid vector + keyword search via pgvector and PostgreSQL tsvector, markdown-aware chunking with stable citation anchors (heading_path, heading_anchor, position_in_doc), incremental ingestion via content-hash, corpus introspection, and source-URI-prefix delete. Migrations 017–020. Closes the gap between minimum-viable retrieval and production-grade RAG.
Actor model & execution
run_python)
Container-isolated Python with no network, ephemeral filesystem, hard resource limits, non-root, and a curated package set. Replaces over-broad bash grants for the common case. Reversible by construction. Factored so additional sandboxed languages (e.g. bash_sandboxed, item 55) plug in as language-specific subdirectories — the language-agnostic invoker stays unchanged. Compute-time attribution via a new cost_model field on the handler registry, billed through the same ledger that records LLM token cost. Execution surface for both reasoning agents and deterministic actors that need isolated compute.
First-class governed entity for non-reasoning automation — scheduled jobs, monitoring scripts, ETL pipelines. Registered Python functions dispatched through a sibling runner to AgentRunner, sharing the same scheduler, capability tokens, audit chain, approval gate, and cost ledger. New compute_ms cost type bills wall-clock execution. Idempotency-up-to-first-irreversible-call is the contract; the function re-runs from the top after a human approves. Completes the runtime's actor model so governance properties extend to all automation, not just LLM-backed agents.
First domain vertical built on the runtime's full surface. A reasoning agent diagnoses operational incidents across six named scenarios (job queue backlog, disk space exhaustion, DB connection pool exhaustion, configuration drift, recurring error spike, performance regression) by spawning a pack of deterministic diagnostic actors — log query, threshold evaluation, configuration introspection. The actors gather and structure data; the advisor forms hypotheses and emits a single structured proposal humans review. Read-only by capability; remediation is a separate downstream agent. Out-of-scope is a first-class outcome — the advisor refuses honestly when an incident doesn't match the six scenarios, schema-enforced to carry no proposed actions. Locks in the canonical pattern every subsequent vertical will follow: reasoning advisor over deterministic actor pack, structured proposal as terminal output.
The prompt-injection-pattern scanner that runs on tool results at read time now runs on ingested chunks at write time too. Every chunk passing through DocumentStore.add_documents is scanned before embedding; the result is stamped into chunk metadata regardless of match outcome, so operators can distinguish "scanned and clean" from "pre-scanner row" by querying the JSONB field. Two policies: warn (default — ingest proceeds, the chunk's metadata records the patterns, an audit event fires per flagged chunk, and kb_search later surfaces an inline warning when the chunk is returned) and reject (the whole source document is refused atomically before any embed or DB write). A backfill scanner remediates corpora ingested before this shipped — one CLI invocation brings a pre-scanner corpus up to the new contract. The read-side and write-side share the same pattern set, so future improvements to the detector benefit both directions uniformly. Closes the most visible "deliberately partial mitigation" claim in the threat-model document; KB-mediated injection moves from "operators should vet ingested corpora" to "scanned at ingest with operator-configurable enforcement."
Completes the operational-advice workflow from "diagnose" to "act." A deterministic dispatcher consumes an operator-approved OperationalAdviceProposal and walks proposed_actions in proposal order, spawning a per-action handler with capability narrowed to one tool. No reasoning at execution time — the proposal IS the plan; an LLM on the execution path would compound the advisor's accuracy gaps. Four v1 handlers cover the common action shapes (restart workers, scale pool, rotate logs, revert config); each is idempotent by construction so re-run-on-resume is safe. The per-action approval gate fires on every irreversible handler tool call through the existing review queue — a wrong proposal becomes a not-executed wrong proposal, not a wrong action. Every audit row carries the originating proposal's invocation_id via a new indexed column, so the operator's "show me everything this proposal triggered" forensic query is one filter, not a JSONB scan. Operator initiates via an explicit "Remediate" button — separate consent from "approved the diagnosis." The dispatch + handler pattern shipped here is the template every future vertical's execution arm will follow.
Proves the canonical AOSIQ pattern generalizes beyond operational incidents. A second reasoning advisor diagnoses compliance findings across six named scenarios (control failure, regulatory deviation, audit-trail gap, segregation-of-duties violation, retention violation, evidence-collection gap) over three frameworks (SOC2, HIPAA, ISO27001) by spawning a pack of deterministic diagnostic actors — control-state check, structured rule evaluation, evidence-vault lookup. Same shape as the first vertical: reasoning advisor over deterministic actor pack, structured proposal as terminal output, out-of-scope as a first-class refusal. The infrastructure investment amortizes — scheduler, audit chain, capability tokens, approval gate, telemetry capture, replay harness, reinforcement worker, dashboard, and remediation dispatcher are all shared between the two verticals with zero per-vertical duplication. The cross-vertical RemediationAgent dispatcher reads the proposal's vertical and selects the right handler pack at runtime; four compliance handlers (evidence backfill, policy correction, control remediation, compliance-officer notification) plug into the same approval-gated execution path the operational handlers use. The dashboard surfaces per-vertical pass rates side-by-side; the replay harness defaults to walking both corpora and prints a per-vertical breakdown so "the pattern generalizes" is one report away. Adding the third vertical now reduces to one advisor module + one actor pack + one corpus + entries in two registration tables — no new infrastructure.
Turns the "every vertical follows the same shape" claim from rhetorical to structural. The reasoning advisor's LangGraph machinery — node sequence, system-prompt loading, tool dispatch, evidence-marker rewriting, telemetry capture, cost accounting — extracts into a parameterized base class; each shipped vertical (operational, compliance) becomes a thin configuration carrying its scenario list, actor pack, proposal subclass, and two small adapters for the proposal fields that genuinely differ across domains. The operational module shrinks from ~1000 lines to under 400; the compliance module to under 350. Three additional reference families ship alongside — a read-only reviewer that produces findings + recommendation, a guardrail that consults the centralized approval-policy resolver and emits a per-action decision record, a verifier that compares expected against actual evidence and reports succeeded / failed / partially-succeeded / needs-more-info. Each family carries a corresponding role-based capability template (read-only-reviewer, advisor, guardrail, executor-scoped, verifier, reporter) operators compose with a vertical-specific overlay at capability-mint time — separating the role contract from the domain identity. The audit chain gains five new event types linked through a new proposal-lifecycle UUID column so an operator's "show me everything that happened on this proposal" query reads one indexed column instead of scanning JSONB payloads. The tool registry gains explicit risk-level metadata so the approval gate routes high-risk reversible actions to human review without an operator having to label every tool ad-hoc. Reuse cost for the next advisor vertical drops to roughly one configuration class plus a registration helper — comparable in size to writing a new dataclass rather than a new module.
Proves the framework's "next vertical fits in a configuration class plus a registration helper" claim with new code rather than retrofitted code. The third vertical reviews code artifacts a human is about to ship — a pasted snippet, or a unified diff piped from git diff main...HEAD — and emits a structured proposal with categorized findings + a recommendation. Reviewer-family pattern (not advisor): the agent doesn't dispatch diagnostic actors, it looks at the artifact and classifies. Each finding is tagged with one of seven categories — security issue, likely bug, correctness concern, style violation, performance concern, missing test, or out of scope — and one of four severities chosen against an explicit anti-inflation rubric. The default recommendation derives from finding severities (any critical → block, any other finding → revise, no findings → approve), with an override path for cases like style-only diffs that shouldn't gate the merge. Honest refusal is first-class: when the artifact isn't reviewable (binary content, generated code, vendored dependencies), the reviewer emits a single out-of-scope finding rather than fabricating issues. Customer profile is the dogfood loop — the operator running Claude Code on their workstation pipes the diff through AOS and gets a governance surface for the review. The third vertical's module is roughly 330 lines (the framework's claim was about 150-200 for a thin vertical; the residual is genuine domain content — the seven categories, the rubric, the parse-failure path). The cost of vertical four is now expected to land closer to the framework's projection as more shared machinery shakes out.
Closes the loop between proposals the advisor emits and the calibration signal it learns from. Every advisor invocation writes a structured telemetry row (vertical, scenario, confidence, severity, evidence count, operator outcome) — captured at proposal time, completed when the operator acts. The operator-facing surface is three buttons on every proposal: approved, edited, rejected — bound to the proposal's audit row so the feedback is the same shape regardless of which vertical surfaced the proposal. A reinforcement worker reads the feedback stream and updates experiential memory: rejected proposals weaken the underlying recall pattern, approved proposals strengthen it. Two materialized views aggregate the telemetry — a daily metrics rollup for the operator dashboard and a calibration view that exposes per-scenario confidence-vs-accuracy drift over the trailing window. The dashboard surfaces both views as the advisor-trends page; the per-vertical filter lets an operator compare operational vs compliance calibration side-by-side. A telemetry-to-corpus exporter promotes operator-confirmed incidents into the replay corpus with the operator's response and any scenario correction preserved as YAML, so production feedback extends the regression set over time without manual curation.
Validates verticals against curated historical scenarios before promotion. Cross-vertical: walks both operational and compliance corpora by default; per-vertical pass rates surface in one report. Pass-rate measurement is structural — five binary axes per incident (scenario, severity, confidence, actions, evidence) with all five required to pass, so "almost right" never inflates the headline number. Multi-backend baselines published per model: the air-gapped local-LLM tier (mistral-small:22b on Ollama, 80% pass rate) is at parity with API backends (Gemini Flash 78%, OpenAI gpt-4o-mini 76%, Claude Haiku 70%) for operator deployments that cannot route sensitive incident data to a third-party API. Corpus oracles are versioned per case: when an audit identifies an expectation as mis-calibrated (too strict or too lenient on a specific case), revisions bump the case's oracle version, the prior baseline file stays in git byte-for-byte, and an offline rescore tool reproduces the historical number against the prior corpus state without paying for another LLM run. The contract is that every baseline number is well-defined under its specific oracle version — no silent drift, no destroyed history, no manual archaeology required to answer "which oracle was this scored against."
Closes the lifecycle the framework had left half-open: after an advisor proposes and a guardrail gates, the executor changes state, a verifier confirms, and a reporter summarizes for the operator's runbook. Promotes two reserved family slots — executor and reporter — to shipped, taking the family catalog from four shipped + three reserved to six shipped + one reserved (triage remains, awaiting the customer-triage vertical). The executor family is the first write-side member: a deterministic dispatcher that walks an operator-approved proposal's actions in order, capability-narrows per action against a base template that ships zero tools (every authority comes from the per-action overlay), records an idempotency cursor for crash-recovery, and routes failure modes through dedicated audit events. No LLM at execution time — the proposal is the plan, and an LLM at the dispatch boundary would compound the advisor's accuracy gaps without adding value. The existing RemediationAgent refactored onto the generic dispatcher, shrinking from roughly 390 lines to 240 while preserving its public surface bit-identically: a depth-β change that proves the framework absorbs write-side verticals the same way it absorbed read-side ones. The reporter family closes the chain: it consumes the verifier's structured result and the audit chain for the proposal's lifecycle, then emits a runbook-ready summary plus a list of concrete next steps. Status mirrors the verifier by default — a chipper summary on a failed verification is the failure mode this family exists to prevent — and only the reporter's own parse-fallback path downgrades to needs-more-info, distinguishing reporter trouble from underlying remediation outcomes. End-to-end lineage threads through one indexed proposal-id column on the audit log so a single SQL filter returns every event in the lifecycle without payload scanning. Six smoke cases (three executor + three reporter, pass/fail/edge per family) plus a cross-handler capability-isolation matrix (eight handlers, fifty-six pairs) pin the safety properties; sixty-plus pre-existing remediation tests still pass against the refactored dispatcher.
Closes the last forensic gap in the audit chain. Until this sprint the runtime recorded every tool call with its arguments, results, capability token, and timing — but the LLM's reasoning that produced each tool call was discarded after dispatch. Operators reviewing a bad decision could see the action; they could not read why the model chose it. The sprint adds a new audit event type (LLM_TURN) emitted once per reasoning step, capturing the full assistant message text. The text itself lives in object storage under the existing audit-anchor bucket so the primary audit table stays lean even for agents that produce hundreds of reasoning steps; only the canonical-JSON SHA-256 plus the object key land on the audit row, giving operators an integrity proof without bloating the hot path. Every subsequent tool-call audit row sets a causation pointer back to the reasoning turn that produced it. The reasoning → action chain becomes a single SQL JOIN over two partial indexes, indexed both ways: for an agent's full timeline, and for a specific tool call's originating reasoning step. The same data flows through three operator surfaces: a Reasoning Trace section on the agent detail page with collapsible per-turn cards linking each turn to the tool calls it dispatched; a server-sent-event stream that pushes a 200-character preview of each turn as the agent reasons; and a pair of REST endpoints (list-without-content plus per-turn-content fetch) backed by an MCP tool for programmatic post-hoc forensic review. The scope is helper-first: emission is wired into the two reasoning paths that cover every shipped vertical today (the generic agent substrate and the parameterized advisor base that drives both shipped advisor verticals), with a documented one-line adoption pattern for reviewer-family and swarm reasoning paths to opt in as those workflows need it. The MinIO upload at write time falls back to a sentinel rather than failing the audit append, so the chain stays unbroken even when object storage is degraded.
Closes the three coupled problems that made prompt iteration painful. First, prompts lived in Python source — the same tool-call protocol existed in two files with two different contents, and updating either required a code change plus a deploy. Second, the model-selection knob was the transport (api / cli / ollama) rather than the model itself, so every model on a given transport got identical instructions despite known differences in verbosity tolerance, thinking-mode handling, and formatting preference. Third, pass-rate changes had no record of which prompt version produced them, so cause and effect could not be separated. This sprint extracts every prompt into a versioned Markdown file declared in a single manifest, routes resolution through a singleton that picks the right variant by model profile, and threads the resolved prompt's identity onto every reasoning-step audit row so a forensic SQL filter can attribute pass-rate diffs to a prompt change or a model change with the same query. Three style profiles ship: terse for small local models, verbose for API-grade models, and a no-think profile that prepends the directive structurally for thinking-mode models so prompt authors do not need to know which models require it. The resolver enforces the manifest at server startup, surfacing every missing-file gap in one consolidated error rather than as a confusing first-spawn failure. An AST-based CI guard walks production source and rejects any multi-line string that looks like a prompt — the rule survives future PRs that try to land an inline prompt back into Python source. The mechanical refactor lands without changing what any model actually reads; the deferred follow-up writes terse-variant rewrites for the three shipped advisor families and lands the variant-comparison corpus eval tool alongside, so style changes can be measured against the baseline cleanly.
Sprint spec →Closes the residual gap that the earlier ingest-time scanner left open. The regex pre-filter catches injection-shaped syntax synchronously at ingest with zero latency — but the regex layer cannot judge whether a prose paragraph is instructing the agent to leak its capability token, bypass the approval gate, or mirror every proposal to an attacker-controlled URL. This sprint adds a second stage: an LLM classifier fires asynchronously per chunk after commit, stamps a supplementary verdict onto the same chunk metadata, and degrades fail-closed when the classifier backend is unavailable so an outage cannot block ingest. The same sprint adds an entailment judge on every proposal emission — before a proposal is recorded the judge reconstructs the cited evidence and verifies that the proposal's claim actually follows from it. For proposals carrying a high or critical risk level the check is synchronous: a verdict of entails-false (or confidence below an operator-configurable threshold, default 0.75) raises a typed gate error and the proposal is refused at the audit-engine boundary, forcing the agent to re-cite or narrow the claim. Lower-risk proposals get the same verdict written as a fire-and-forget audit event so operators can run forensic queries against the baseline of "claims that drifted past their evidence" without paying the gate's latency. Operators get a per-chunk quarantine surface: any flagged chunk can be hidden from default search results via a bearer-auth-gated endpoint or a per-chunk-CSRF-protected dashboard button, and released chunks return to the flagged state rather than clean so the regex flag stands even after operator review. An adversarial corpus of twenty-four cases (ten syntactic, eight semantic, six benign with one documented false-positive) ships alongside as the regression target — the probe emits a confusion matrix and pins the regex layer's TP/FN/FP/TN. Closes the largest remaining "deliberately partial mitigation" entry in the threat-model document; reasoning-redirection injection moves from "judge-model pattern deferred to a future sprint" to "structural surface in place, judge accuracy continues to evolve with the underlying model."
Sprint spec →The substrate for governed multi-phase, multi-day, multi-agent processes. Single-agent sessions are the wrong primitive when the work spans multiple unbounded human decision points (a CAB review window, a regulator's quarterly sign-off, a postmortem stakeholder cycle), requires different agents at different stages, or must produce a single regulator-visible audit chain from instantiation to closure. The workflow primitive is a durable declarative state machine over phases. Phases are typed (agent spawns an agent session, gate waits for resolution, engine runs a deterministic handler inline). Gates are typed (human_approval, multi_resolver_threshold, time_bounded, sub_workflow_complete, condition, auto_pass) — six resolution shapes the reference workflows together force. The engine ticks in the background, serialises per workflow via Postgres advisory locks, and scopes its scan to templates the registry owns so multiple engines coexist (test isolation, blue/green deploys). The audit chain gains a partial-indexed workflow_id column so "everything for this workflow" is one filter, not a JOIN. Two creation paths: Path A instantiates from a YAML template (operator or system event); Path B is the differentiator — an agent with the propose_workflow capability emits a typed WorkflowAdaptation artifact (eight named adaptation kinds: insert/remove/modify phase, add/remove transition, set resolver, override field), the operator reviews the structural diff at /workflows/proposed, and approval activates the adapted workflow. Three structural refusals are enforced regardless of operator approval (no agent identity in resolver_members, no removing backout from irreversible phases, post-adaptation template re-validation). Five built-in templates ship: IBMi fix cycle with CAB approval and backout sub-workflow, compliance audit cycle with parent-spawns-children fanout, compliance remediation cycle, incident postmortem with time-bounded auto-advance, IBMi fix backout. Dashboard surfaces parallel the existing agents view; nine API endpoints cover instantiation, listing, detail, cancellation, Path B approval, in-flight gate approval, and template catalog management. Approximately two thousand eight hundred lines of code under aos/workflow/ plus migration 037 (three new tables and ten new audit event types) plus around two hundred ten tests including end-to-end against all three reference workflow shapes.
Sprint spec →Closes the substrate gap the workflow primitive's incident-postmortem template opened — agents that detect a critical condition can now page humans, not just surface a proposal on a dashboard nobody is watching. Four outbound channels (Slack, email, PagerDuty, generic webhook) ship as native tool handlers; every tool is reversible=False so the approval gate fires before any message leaves the system. An agent cannot spam operators — a human must approve each individual send, and the gate's args-hash binding means replay-by-re-submission is blocked. Credentials never sit in plaintext. Operators declare channels in a YAML file with environment-variable interpolation, the loader resolves the variables at lifespan startup, and an AES-256-GCM ciphertext lands in a new notification_channels table. Agents pass a channel-id slug they read from the operator's documented surface; they never see webhook URLs, SMTP credentials, or PagerDuty routing keys. The HTTP transport retries once on a 5xx or transport error; a 4xx returns an explicit failure rather than raising, so the agent's evidence chain captures the failure context. Channel-type mismatches refuse pre-send (an agent calling send_slack_message against a type=email channel gets a structured refusal rather than a wrong-channel post). A composable with_notifications capability template grants the four tools with memory-ops:read and no child agents — notification is a leaf operation, fan-out composes at the workflow level. Closes the postmortem cycle's last loose end: the workflow primitive's incident-postmortem template now wires its publication phase directly to these tools, fanning out one send per registered channel and recording each outcome on the workflow's payload so operators can read partial-failure detail without grepping the audit chain. Engine actions skip the per-call approval gate because the workflow template was already operator-approved at instantiation; the agent-driven path still routes through the gate.
run_bash)
The shell sibling of run_python. Same Docker isolation contract — no network by default, read-only rootfs, non-root uid, dropped capabilities, resource caps, hard wall-clock — and a curated CLI toolset baked into the image (jq, gawk, yq, curl, kubectl, postgresql-client, bc, coreutils). No apt-get at runtime; the image is the supply-chain surface and operators rebuild it when a new tool is needed. Closes the gap where the unrestricted bash grant was the only path for an agent that legitimately needs jq on a log payload or kubectl against a cluster API. The host-filesystem case (scanning /var/log/*) deliberately stays on the unrestricted bash grant — the sandbox cannot reach the host filesystem by construction, which is the whole point. Network access is per-agent-class opt-in via a new RegistryEntry.sandbox_network_mode field; the only documented non-default value is bridge, and every AGENT_SPAWN audit payload records the elected mode so a regulator-visible filter answers "which spawns elected outbound reachability" in one SQL query. Agents pass a script/stdin_data/env shape; the env dict is filtered against an operator-managed allow-list before the envelope leaves the host, so an LLM that smuggles a sensitive key into env sees it dropped pre-flight. The with_bash_sandbox capability template grants exactly the run_bash tool plus read/write memory ops — composable with any vertical overlay. Second instance of the sandboxed-execution primitive family the runtime is converging on; future sandboxed languages (Node, Ruby, …) plug in as sibling directories under aos/tools/sandbox/ without touching the language-agnostic invoker.
Closes the temporal-reasoning failure mode that bites every shipped vertical. LLMs fail predictably on date arithmetic, timezone conversion, elapsed-time, and day-of-week reasoning — and the failures appear in proposals (a compliance agent miscalculates a retention deadline, a postmortem renders a timestamp in the wrong timezone). Five pure-stdlib tools delegate every computation to Python's datetime, zoneinfo, and calendar: get_current_time (UTC and optional IANA timezone), parse_date (string → ISO 8601 with explicit refusal for ambiguous slash-formatted inputs rather than guessing), date_diff (days/weeks/months/years with correct month-length and leap-year handling), format_date (long/short/relative styles), and convert_timezone (IANA-to-IANA, naive inputs interpreted in from_tz). All five register reversible=True; inputs and outputs are ISO 8601 strings throughout; every response carries a display field for the operator-facing surface. The with_time_tools capability template grants all five in one template (splitting them is registry surface without security benefit). Composes with any vertical overlay — a compliance audit asking "is this evidence in window" reads the computed value rather than the model's guess.
Sibling sprint to the time/date tools, killing a different systemic LLM failure mode: format fabrication. Agents asked "is this a valid email" or "does this JSON match the schema" will sometimes assert validity when the answer is no — the downstream cost is proposals with invalid data that operators rubber-stamp or systems reject. Six pure-Python tools delegate the question to deterministic code: validate_json_schema (draft-7 with dotted-path error messages), validate_regex (returns the match plus capture groups on success), validate_url (format only — no reachability — with an optional require_https guard), validate_email (RFC 5322 format with no SMTP or DNS lookup), validate_format (named-format dispatcher covering UUID, ISO 8601 date and datetime, IBAN with the ISO 13616 mod-97 checksum, and E.164 phone numbers), and validate_json (parse-only check with the failing line and column). All six register reversible=True and return a uniform {valid, errors, match_groups} shape; error messages are human-readable strings rather than codes so the agent's reasoning trail carries the failure directly. The with_validation capability template grants all six in one template, composable with any vertical overlay. The explicit boundary on validate_url and validate_email is documented on the operator surface: format checks only, no network calls — mixing them would collapse two distinct concerns into one ambiguous tool.
Adds a third channel for agent-emitted trace, sitting between the immutable forensic audit chain and ephemeral stdout. Operational commentary — "scanning 47 matching files in QSYSOPR", "evidence vault returned 4 controls without timestamp metadata, falling back to created_at", "primary log backend returned 503; switching to secondary" — belongs in neither: it isn't legally evidentiary, but operators want to read it while the session runs. The new log_observation(level, message, context) tool writes to a dedicated agent_observations table with no hash chain and no MinIO anchor, so rows are rotatable without affecting forensic integrity. The level field is part of the args (debug / info / warn / error) rather than a tool-family split, matching Python's stdlib logging convention. The syscall layer injects session_id, agent_id, and agent_class from the calling ACB — agents cannot smuggle writes into another session's log via the args dict. A live-polled /observations dashboard surfaces the table with filters for session, agent, and level; color-coded level badges; and collapsible context JSON. The operator-facing surface explicitly documents the audit-chain-vs-observation distinction so a vertical author picks the right channel without re-reading the sprint spec. Composes with any vertical overlay via the with_observation_log capability template.
Read-only structured database access with per-connection capability scoping. Operators declare connections in a YAML file with environment-variable interpolated DSNs; agents pass a connection-id slug and never see credentials. Two tools: db_query for SELECT-only execution and db_schema for column-name introspection. Six defense-in-depth layers: the LLM only sees the broad tool names; the syscall capability verify checks the broad grant; the handler then decodes the agent's token and refuses unless db_query@<connection_id> is ALSO present for the specific call (the per-connection grants are added at token-mint time on top of the template, narrowing the agent's blast radius to exactly the connections it needs); the handler strips SQL comments, walks past leading parens, and rejects anything that isn't SELECT or WITH ... SELECT; per-connection row and timeout caps clamp the result size and wall-clock budget; and operators are expected to back each DSN with a SELECT-only database user as a final layer. Postgres connections use the existing psycopg pool family; DB2 is supported via a thin shim around the lazy-imported ibm_db_dbi library. Non-JSON-native column types (uuid, timestamp, decimal, bytes) are coerced to strings so the agent receives a uniform shape regardless of column types. Writes are explicitly v2 — they require approval-gate integration per call and an operator opt-in per grant, modeled separately as a future db_write tool.
Replaces the prior ungoverned outbound HTTP path — agents could previously call any URL through the in-line http_get with no rate limit, no response size cap, no domain allow-list, no HTTPS enforcement. The new version registers domains in a YAML allow-list with per-domain config (rate limit RPS, max response bytes, optional TTL cache, path-prefix allow-list, HTTPS enforcement). Calling an unregistered host returns ok:false BEFORE the network sees the call. Two tools ship: http_get is reversible=true and cached per the domain's TTL; http_post is reversible=false so the approval gate fires before every call and the reviewing operator sees the exact URL, headers, and body. Six defense-in-depth layers: domain allowlist, scheme check (require_https), path-prefix membership, async token-bucket rate limit, response size cap with explicit truncated flag, optional per-URL TTL LRU cache. Non-2xx responses return ok=true with the actual status code — some APIs treat 404 or 409 as valid signals, and the handler doesn't force a verdict. Binary content types come back as base64 strings so the agent always receives JSON-safe content. Breaking change for the two capability templates that grant the old http_get (with_log_query_diagnostic and with_threshold_diagnostic) — their target domains must be added to config/http_domains.yaml before existing deployments resume working.
Closes a capability gap with no prior path — agents could not read binary or rich-text documents at all. A PDF evidence attachment returned raw bytes; a DOCX contract was unreadable; an XLSX metrics export was opaque. The new fetch_document(source) tool reads a URL or local path, detects the format (Content-Type header for URLs with extension fallback), dispatches to a per-format extractor, and returns a uniform DocumentResult with title, body text, metadata, and any extracted tables. Five formats ship in v1: PDF via pymupdf (first 200 pages, metadata preserved), DOCX via python-docx (paragraphs as body plus native tables), HTML via BeautifulSoup with main/article preference and nav/footer/script noise stripped, CSV via the stdlib (single table, first row as headers), and XLSX via openpyxl (one table per sheet, sheet name carried through). Each extractor is lazy-imported so the package loads cleanly even when an optional library is missing — a call into an unavailable extractor returns ok:false with the install command rather than crashing at import. Size caps are explicit and operator-visible: 50 MB per document (pre-flight via Content-Length on URLs, streamed-bytes total tracked too so a lying server doesn't sneak through), 100,000 characters of body text, configurable per-table row cap (default 500, ceiling 2000), 200 PDF pages with the actual count preserved on metadata so an auditor can see both numbers. The truncated field fires whenever any cap activates, so operators reviewing a proposal that rests on partial document content can spot the elision from one field. The with_document_reader capability template grants the tool with read+write memory ops (so an agent can stash a long extracted body into working memory without re-fetching) and a larger token budget than the other utility tools because document body content flows directly into the reasoning context. Explicit non-goals: OCR for scanned PDFs (use a vision model), PowerPoint, document diff, MinIO evidence anchoring, encrypted documents.
Plus 43 additional shipped items not individually listed: capability templates, cookbook examples, dashboard views, threat model document, migrations 001–042, test infrastructure, deployment scaffolding, the LLM-backend resilience layer (retry-on-transient, per-backend timeouts), operator-driven capability-secret rotation with per-token revocation, the observability surface (Prometheus metrics, OpenTelemetry tracing, structured JSON logging, and continuously-verified audit-chain anchor sweep), a hardening pass on the deterministic-actor model (single-entry compute-ms billing, dequeue index for million-row scale, audit-id-backed evidence citation in the advisor, native bind_tools on Anthropic and Gemini backends), the field-test feedback loop (per-call advisor telemetry capture, materialized rollup views, operator approve/reject/edit reinforcement signal), and the cross-vertical infrastructure that lets the second vertical reuse everything (vertical column on the telemetry table, per-vertical handler packs in RemediationAgent, per-vertical filter on the dashboard, cross-vertical replay harness). The full set is verifiable in the codebase.
§ 04 / ActiveWhat's in flight now
Nothing in flight at the moment. The workflow-primitive cornerstone (item 72) shipped on 2026-05-19 — the substrate for governed multi-phase, multi-day, multi-agent processes, with three new tables (workflow, workflow_phase, workflow_gate), six gate kinds, three phase kinds, an eight-kind Path B agent-instantiation protocol, dashboard surfaces parallel to the existing agents view, and end-to-end coverage against three reference workflow shapes (IBMi fix cycle, compliance audit-cycle with parent-spawns- children fanout, incident postmortem with time-bounded auto-advance). Five built-in templates ship under config/workflows. With the cornerstone landed, the next sprint will come off the Next queue below — the IBMi-fix vertical and the compliance-as-workflow refactor that this primitive was blocking are now unblocked. The notification dispatcher (item 22) shipped the following day to close the substrate gap the postmortem template's publish_postmortem engine action depended on — Slack, email, PagerDuty, and generic webhook tools, all routed through the approval gate.
§ 05 / NextWhat's committed to next
Items scoped, prioritized, and waiting in the queue behind active sprints. Each is independently shippable; the order reflects leverage and dependency rather than calendar.
Native tool catalog
Agents & cookbook
Second vertical class. Produces structured change descriptions for developer review — affected programs, current code, proposed approach, test plan. Read-only; never writes code.
Sprint spec →§ 06 / LaterWhat comes after
Items scoped and committed, but not in the next release cycle. Sequenced behind the work above.
Tool catalog continued
Tree-sitter-aware extraction returning symbol graphs rather than raw text. Three reversible-by-contract tools — one returns the symbol graph for a file with bodies opt-in, one searches symbols by name pattern and kind across a registered codebase, one returns the body of a specific symbol. First call per codebase builds an in-memory index in a few seconds; later calls hit the cache and return in well under a tenth of a second. The cache invalidates on file change via mtime hashing. Four languages parse out of the box: Python, JavaScript, Go, and SQL. An agent can navigate a 50-thousand-line codebase in a handful of tool calls instead of burning the budget on raw file reads. Codebases are operator-registered; the registry is deploy-time-immutable. Pairs naturally with the non-self-modifying runtime principle — agents can read project source freely under capability grant, but the write_file path-prefix denylist still refuses any attempt to mutate those same trees.
Sprint spec →Structured diffs as first-class proposal artifacts. Three read-side tools — one computes a unified diff between two text strings, one parses a unified diff into structured metadata, one validates that a diff would apply cleanly against a target's current content. The agent reads the file's symbol graph through the source-code reader, mentally applies its proposed edit, computes the diff between original and modified, validates the diff against the current file content, and carries the diff in its proposal body for human review. No apply_patch — writing a patched file is an irreversible operation that belongs in an executor vertical behind the approval gate, and the write_file path-prefix denylist still refuses writes into protected trees regardless. The diff is the proposal; patch application happens through standard change-control, not the agent. Validation surfaces the first mismatched line by number so an operator reading the proposal can locate exactly what shifted in the target since the diff was generated.
Sprint spec →Operator-defined command allow-list as defense-in-depth over capability checks. Bash with grep, find, git log, kubectl get — but not arbitrary commands.
Sprint spec →Operator infrastructure
aos-replay re-runs a recorded session against a different LLM backend or different prompt and compares outcomes. Critical for evaluating model upgrades.
Walks registered tools, agent classes, and production tokens; reports unused capabilities and over-broad grants. Helps operators tighten capabilities over time.
Sprint spec →Queryable rollups across sessions by agent class, time window, and operator. For finance and capacity planning, not per-session inspection.
Sprint spec →Daily and weekly export of pending and resolved approvals for compliance teams: every destructive action an agent attempted in a window, with outcome.
Sprint spec →LLM backend strategy
Strategy-based LLM selection per agent class: economy (cheapest model above a configurable quality floor), quality (highest pass rate), local_first (air-gapped by default, escalate only on explicit opt-in). Cascade mode runs the economy model first and escalates to a higher-quality model when response confidence falls below a threshold — achieving near-quality-tier results at economy-tier cost for the majority of calls. Scores seeded from corpus baselines; operator updates them as model quality evolves. Per-agent-class strategy declaration in RegistryEntry.
When bash_sandboxed runs with network_mode="bridge", all outbound destinations are currently permitted. The egress allowlist restricts outbound connections to a named set declared in the agent's RegistryEntry — so kubectl can reach the cluster API but not arbitrary internet hosts. Enforced via iptables rules inserted before container start. Allowlist visible in AGENT_SPAWN audit payload.
Production hardening
Published p50/p99 latency, soak test results, capacity planning under realistic concurrency. Required for first-customer production deployment at scale.
Sprint spec →Per-caller bearer tokens, independent audit attribution, rotation without downtime. Resolves the residual exposure noted in the threat model.
Sprint spec →Federated identity for API callers and mutual-TLS as alternative bearer-token mechanism. Targets enterprise deployments where bearer tokens alone are insufficient.
Sprint spec →§ 07 / ResearchWhere we're exploring
Open questions where the right answer isn't yet clear. We're prototyping and learning rather than committing. Items here graduate to next or later when scope is confidently bounded — or to out of scope if the right answer is "not us."
High-severity proposals routed to a second LLM call with only the proposal and evidence. Closes the largest open category in prompt-injection threat surface, but the architecture has significant ergonomic and cost implications.
Sprint spec →Adversaries who place content across many tool calls steering reasoning gradually evade single-result pattern detection. No good general defense exists today; we're tracking the research literature.
Sprint spec →Detect JWT-shaped or API-key-shaped strings in tool output and redact before LLM exposure. Prevents agents being convinced to leak their own credentials.
Sprint spec →Currently sandbox limits are global env vars. Per-class overrides through capability-token claims would let critical agents get more resources than experimental ones, but the audit shape needs design work.
Sprint spec →Beyond the generic incident-response cookbook, full reference implementations for specific verticals. Which verticals first depends on customer signal.
Sprint spec →An actor with a deterministic main execution path and a reasoning hop at one or more specific decision points. Common shape: a deterministic monitor that calls an LLM only to classify ambiguous signals. Cost-efficient and architecturally cleaner than forcing every actor into one camp; needs design work on capability scoping and audit semantics across the hop.
Sprint spec →The default model (qwen2.5:7b) passes 52% of corpus cases in a fully air-gapped deployment — below the 70% threshold for unassisted production use. Research question: how much of the gap closes through prompt engineering (terse variants) and cascade configuration versus requiring a larger model? Categorize the failure modes in the remaining 48%, calibrate the corpus oracle against human judgment, and publish a minimum-VRAM recommendation for each quality tier (70%, 80%). Output: a recommendation, not a feature.
Sprint spec →The field-test feedback loop (item 29) captures operator approve/reject/edit signals. Those signals are not currently used to extend the replay corpus automatically. Without auto-promotion, the corpus distribution drifts from production incident distribution and pass rates become optimistic over time. Research question: what is the minimum-viable corpus-candidate schema, how do we prevent contamination (model output ≠ ground truth), and what sampling rate keeps the corpus manageable? Output: a workflow design for the implementation sprint.
Sprint spec →The runtime is currently silent between dashboard opens. No signal fires when the review queue has been pending 30 minutes, the cost budget is 80% exhausted, or the anchor sweep has fallen behind. Research question: what are the SLOs worth defining, what are calibrated thresholds grounded in baseline data (not guesses), and does alerting require the notification dispatcher (item 22) first? Output: 4–6 SLOs with thresholds, runbooks for each, and a dependency decision.
Sprint spec →LangGraph checkpoints are written after each node; the resume path is not implemented. When the server restarts, in-flight agents are orphaned and the work is lost. The checkpoint table also grows forever. Research question: what is the correct resume semantic (auto for read-only, operator-consent for agents with irreversible tools remaining), how does a new AGENT_RESUME event type maintain audit chain integrity across the restart gap, and what is the checkpoint pruning policy? Output: design spec and implementation plan.
Multi-persona swarms pass all authored tests but have no documented failure taxonomy, no tested recovery path, and no operator runbook. Three failure modes to characterize: persona-level crash (one of five personas errors mid-investigation), contested memory state (two personas write conflicting conclusions), and aggregation failure (parent receives partial results). Research question: what is the correct policy for each failure mode, what does contested memory resolution look like as an operator workflow, and what is the safe swarm size upper bound? Output: failure taxonomy, runbooks, and implementation spec for recovery paths.
Sprint spec →All agents currently share one database, one audit log, one cost ledger, and one capability JWT root. No isolation boundary exists between callers or teams. Research question: what is the minimum tenant concept (caller-based partitioning, row-level security, or separate deployments), what does the JWT capability model need for a tenant_id claim, and what is the smallest change that opens the multi-tenancy path without a big-bang schema migration? Output: isolation architecture recommendation and first-sprint implementation scope.
§ 08 / Out of scopeWhat we won't build
Items deliberately not pursued. Each has a reason. Listing them is a credibility move: a roadmap that claims to do everything is one that has stopped thinking about trade-offs.
The scheduler preempts between LangGraph node boundaries, not inside an LLM call. True mid-inference preemption requires model-side cooperation that doesn't exist; we won't pretend otherwise.
If an attacker holds both PostgreSQL and audit-anchor credentials, the chain can be rewritten. Mitigation requires external append-only audit, which is the operator's responsibility, not ours.
AOSIQ authenticates API callers via bearer tokens. User identity, single sign-on, and role-based access at the application layer are the host application's responsibility, not the runtime's.
Temporal, DBOS, Restate, and Inngest serve this category. AOSIQ includes durability as one property among many; we don't compete with specialists on durability alone.
Python and sandboxed shell (bash_sandboxed, item 55) cover the immediate execution surface. Node and Ruby are straightforward to add with the same invoker pattern but each requires its own threat model and security review. Deferred until a real customer workflow demands it.
Operators define the curated package set in the Dockerfile. Agent-controlled pip install is a supply-chain attack surface we deliberately do not open.
AOSIQ governs actors; it does not implement them. Bring your own LangGraph agent definitions, your own deterministic scripts, your own business logic. The runtime provides the substrate for governed action; the application is yours.
§ 09 / CadenceHow this updates
This roadmap is updated when items ship, scope, or move between statuses. There are no calendar dates. The runtime moves at the pace of correctness — when a piece of work is correct enough to ship, it ships.
Items move through statuses in one direction: researching → later → next → active → shipped. Items don't move backward in public unless scope is materially reduced; in that case the change appears in the changelog with reasoning.
This page was last updated May 2026. The full version history of this document — including what changed and when — lives in the project repository.
Most recent change: role-scoped tool registry — a
second enforcement layer at every tool dispatch.
The capability discipline that ships in the runtime today is
delegation-bound: a parent agent mints a child token whose
tool grants are the intersection of the parent's grants and
what the child asked for. That mechanism is intact and
load-bearing. The gap this change closes is that capability
narrowing was entirely delegation-bound — the registry
itself did not know that "a researcher agent shouldn't have
a destructive admin tool." A misconfigured parent (or one
that's been jailbroken into requesting a broad grant set)
could hand the child whatever the parent itself held. The
new layer adds an intrinsic role-allowlist ceiling: for
every agent_class an operator declares which tools the role
may ever invoke, in a single deploy-time-immutable YAML.
At every tool call both checks must pass — the token must
grant the tool AND the role must list it in allowed_tools —
so even a prompt-injection-driven over-grant cannot exceed
the role's intrinsic boundary. The role definitions support
single-parent inheritance so a senior-investigator role can
extend the base investigator role with one extra capability
rather than re-listing the entire set. A wildcard fallback
role preserves backward compatibility: any agent_class not
explicitly declared falls through to the wildcard and only
the existing token check applies, logged once per process at
INFO so operators see which classes are still on the
fallback path. A distinct audit event (CAPABILITY_ROLE_DENIED)
fires when the second layer refuses — separate from the
existing CAPABILITY_DENY for token-gate refusals — so a
forensic query answers "which gate fired" in one filter
instead of greppping a free-form reason string. The new
exception subclasses the existing one, so every existing
capability-denial handler still catches role denials
operationally; only audit emission and the dashboard need
to know the difference. Migration 041 widens the audit-event
CHECK constraint with the one new reserved type. No new
runtime dependencies; the registry loader is ~300 lines of
pure-Python YAML parsing and inheritance resolution. A new
HTTP route surfaces the second-layer denials per agent so a
red-team operator can answer "which calls did the role gate
refuse" without writing SQL. Prior change:
MAST reliability guards in the
advisor finalize path. The 2025 multi-agent failure
taxonomy (Cemri et al., 1,600+ annotated traces, κ = 0.88)
identifies four failure modes that account for over half of
observed multi-agent failures. Three of them are now caught
before a proposal becomes a committed result. Each guard
attaches a structured flag to the proposal and emits one
audit row so operators can track failure-mode frequency over
time without a separate dashboard. The first guard measures
token overlap between the proposal's reasoning text and the
text of its proposed actions — a proposal that diagnoses
"disk exhaustion" but proposes restarting an unrelated service
leaves a measurable mismatch. The second compares the
proposal's vocabulary against the original task description;
an investigation asked to analyse a job-queue backlog that
emits a disk-quota proposal trips the scope-drift guard. The
third is opt-in and lives on the verifier family slot:
structural completeness checks on the proposal's required
fields and confidence range, with an operator-promotable
hard-stop mode that returns an honest emergency refusal
instead of the malformed proposal when the verifier rejects.
All three guards are pure-Python — no LLM-as-judge, no
per-call cost, microseconds per finalize — so they run on
every advisor session without budget concerns. Thresholds are
operator-tunable (one env var per guard) and the strict
hard-stop mode is gated by a third env var, off by default
because warn-severity flags carry signal an operator may
legitimately accept. Migration 040 widens the audit-event
CHECK constraint with two new reserved types
(AGENT_RELIABILITY_FLAG, AGENT_RELIABILITY_VERIFIED). No new
runtime dependencies; the tokenizer ships as a small
intersection-of-NLTK-spaCy stop-word set in roughly 60 lines
of pure-stdlib Python. Prior change: startup drift
detector for installed Ollama models. Closes the
natural follow-up to the airgap-ledger alignment: the
runtime's per-model capability table is hand-maintained, but
operators pull new Ollama tags independently. Without a
probe, the operator finds out a tag is unregistered the
first time they try to use it — either via a silent
gate refusal (the default for unknown tags is to refuse
tool-requiring agents) or via an empirical hang several
minutes into a run. At server startup the runtime now
diffs the installed model list against the ledger and
emits one structured log line — three shapes: skipped
when Ollama isn't reachable (the operator hasn't started
it, or the configured URL is wrong), happy when every
installed tag is registered, and a warning naming each
unregistered tag and pointing at the registration
protocol when the lists diverge. The probe is
intentionally asymmetric: it warns only on
installed-but-unregistered, never on registered-but-not-
installed. The inverse warning would fire on essentially
every deployment (no operator installs the full ledger)
and train operators to ignore the probe entirely. The
whole path is diagnostic-only — every failure mode
(connection refused, timeout, malformed response, HTTP
error) returns a structured outcome rather than raising,
and an outer paranoid try/except swallows anything that
escapes anyway, so the probe cannot block startup even
in pathological cases. No new environment variables, no
new audit events, no new runtime dependencies. The whole
addition is roughly forty lines of probe code plus a
small log-shape helper, with fifteen tests covering the
eleven failure modes of the underlying HTTP call plus
the three log shapes plus the ledger-key filter rule.
Prior change: Ollama tool-capability ledger
aligned to airgap measurements. A small-PR-shaped
follow-up to an airgap sample sweep across seven local
models that surfaced a divergence between what the
runtime's per-model tool-calling table claimed and what
models actually do under the production reasoning loop.
Five entries flipped to refused: the Qwen2.5-Coder family
(the prior comment claimed "better at structured tool
emission for code tasks" — measurement shows the model
converges to a syntactically valid proposal that classifies
every operational incident as out-of-scope), Qwen2.5:32b
and the Phi4-mini family and Gemma4-MoE-26B (all three
emit prose with no parseable tool call and hang the
reasoning loop on the operational-advisor corpus), and
DeepSeek-R1:32b (Ollama's API refuses the tools field
outright on reasoning-model tags). Each refused entry now
carries a comment citing the specific airgap measurement
rather than a vendor's marketing claim. The gate behaviour
is the same as before — the runtime refused models with no
ledger entry by default — but the error message moves
from "no tool-capability entry" (reads like an oversight)
to "registered as supports_tool_calling=False" with the
empirical citation (reads like a deliberate decision an
operator can investigate). Pin tests guard against a
future well-meaning edit silently re-enabling a known-bad
model without re-measuring. Prior change:
AOS-native replay backend shipped. The
replay harness gains a second execution path that runs
corpus incidents through the production AOS spawn / poll
/ fetch API rather than the in-process runner — so the
scheduler, audit chain, capability gate, approval policy,
and cost ledger all participate in the measurement the
same way they would in production. A per-incident
wall-clock timeout (default 180 seconds) lets the harness
record an incident as errored rather than
hanging the run when a model fails to emit any parseable
tool call — the operator-visible result is a clean
completion with one row marked errored, instead of a
seven-minute silent stall. A per-vertical
corpus-to-spawn mapper translates corpus rows into the
agent_class + task shape the AOS spawn API expects, and
the report now separates pass rate (model gave a correct
answer) from errored count (runtime could not get an
answer at all) so a regression in one signal does not
mask the other. The same report carries LLM_TURN
economics — per-incident token and cost summary — so the
cost story is a property of the harness output, not a
post-hoc query. Prior change: sensible Ollama
context default shipped. The bundled
docker-compose now sets OLLAMA_NUM_CTX=8192
by default, closing the root cause that drove three of
the airgap sweep's hung-incident rows. Several model
files declare a 131072-token context window, and Ollama
pre-allocates the full KV cache up front on first load —
spilling roughly 15 GB of VRAM for a 2.5 GB model file
and dramatically slowing inference. A direct probe with
the 8K override returns a one-token response from
phi4-mini in five seconds where the default times out
at three minutes. The recommendation now lands as a
default rather than a runbook footnote. Prior change:
diff and patch tools shipped (item 32).
Three read-side tools that complete the proposal-side tooling
pair with the source-code reader. One computes a unified diff
between two text strings using the standard library; one
parses a unified diff into structured metadata — files
changed, per-file hunks, line counts, new-file and
deleted-file flags; one validates that a proposed diff would
apply cleanly against a target's current content and names
the first mismatched line by number when it wouldn't. The
agent reads source through the reader, computes the diff
between the current file and its mental edit, validates the
diff against the target's actual current content, and
surfaces the diff in its proposal body for human review.
Deliberately no apply patch — writing a patched file is an
irreversible operation that belongs in an executor vertical
behind the approval gate. The diff is the proposal; the
operator applies it through standard change control. The
read-write split now holds end-to-end at the tool layer:
agents read source freely, produce structured diffs as
proposals, and the runtime self-protection denylist still
refuses writes into protected trees regardless of grant.
Prior change:
source-code reader shipped (item 31).
Three read-side tools that let agents navigate large
codebases by symbol graph rather than raw file bodies. The
first call per codebase builds an in-memory index in a few
seconds — a sorted hash of file paths and modification times
keys the cache, so subsequent calls return in well under a
tenth of a second and the cache invalidates on the next file
change automatically. Four languages parse out of the box:
Python, JavaScript, Go, and SQL. An agent looking for the
spawn function in a fifty-thousand-line codebase searches by
name to locate the file, reads the file's symbol graph to
see what else lives there, and pulls the body of the one
target symbol — a handful of tool calls instead of burning
the budget on sequential reads. All three tools register
read-only; the recently shipped write_file path-prefix
denylist refuses writes into the same source trees the
reader navigates, so an agent granted both reader and writer
still cannot mutate the source it just read. The capability
template grants the three tools narrowly; the codebase
registry is operator-managed and deploy-time-immutable —
same precedent as the shell allow-list and notification
channels. Prior change:
write_file path-prefix denylist shipped. Closes the explicit follow-up the
non-self-modifying runtime principle named in its consequences
section. Two protected surfaces (the runtime config directory
and the runtime source tree) were previously held only by
absence of grant — no agent class had write permission on
those paths, but a misconfigured custom template could have
granted it. The denylist refuses such writes at the handler
level before any disk activity and before an operator would
be asked to approve, so the runtime's structural
self-protection now survives operator grant errors rather
than depending on them being absent. Symlink resolution runs
before the prefix check so an attacker-controlled symlink
dance cannot bypass the boundary. Operators can extend the
list at deploy time; canonical entries cannot be disabled at
runtime — disabling is an architecture-decision amendment,
not a configuration toggle. Twenty-two tests pin every
protected prefix plus the symlink-escape path. Prior change:
reviewer-family scheduler alignment shipped.
Closes the half of composition-layer-v1 that was honestly
deferred last commit. The reviewer family was intentionally
non-LangGraph-driven by design — operators called the
registration helper directly and invoked the reviewer's
`review` method themselves rather than going through the
scheduler. After this sprint a family-level LangGraph wrapper
turns any reviewer into a graph the scheduler can dispatch,
the third vertical (code review) reaches the scheduler
through the same `register_*` shape as the advisors with the
same audit / capability / approval / cost / reasoning-trace
forensics, and a uniform proposal-retrieval endpoint reads
`agent_processes.result` for every vertical through one
stable envelope. The dogfood loop ships as a thin CLI that
pipes stdin to the running server, polls until terminal, and
renders findings — no in-process imports, exit code maps to
recommendation. Future reviewer verticals (SQL, IaC,
configuration) drop in as thin subclasses on the family-level
graph wrapper. Prior change: composition-layer-v1 shipped (ADR-006 Phase 1).
Wiring-debt hardening sprint that closes a class of drift the
agent-class deep review surfaced: AOS shipped code for three
verticals (operational advice, compliance audit, code review)
but only one was actually wired into the running server's
startup. After this sprint, the compliance vertical and both
remediation handler packs (four operational, four compliance)
register at lifespan startup — the documented surface and the
running surface come back into alignment. A new env-var-gated
startup assertion turns the wiring contract into a boot-time
check so future drift surfaces before any operator hits a
4xx. The synopsis vocabulary half of the sprint adopts the
14-layer positioning model in the architecture doc and on
this site's architecture section, with an honest per-layer
scorecard — the layers that score low (Intent, Planning)
are positioning vocabulary, not yet shipped surfaces, and
the doc says so plainly. The third vertical's scheduler-driven
registration alignment turned out to be a bigger architectural
change than the sprint scoped (the reviewer family is
intentionally non-LangGraph-driven by design), and was
honestly deferred to its own follow-up sprint in the queue
rather than padded into this one. Prior change:
prompt management (item 62) shipped.
Pulls every prompt out of Python source into versioned
Markdown files declared in a single manifest. Resolution
routes by model profile rather than transport, so a small
local model and a frontier API model on the same backend
see different prompts when that helps. The resolved prompt's
identity threads onto every reasoning-step audit row alongside
backend and model, so a forensic SQL filter on
payload->>'prompt_version' isolates whether a
corpus pass-rate change tracks a prompt change or a model
change with one query. An AST-based CI guard walks production
source and rejects multi-line strings on the known prompt
target names; a documented per-line escape exists for the
rare correct-by-construction case. The mechanical refactor
landed without changing what any model actually reads; the
deferred follow-up writes terse-variant rewrites for the
three shipped advisor families and lands the variant-
comparison corpus eval tool alongside, so style changes can
be measured against the baseline cleanly. Prior change:
LLM turn audit (item 61) shipped. Closes
the last forensic gap in the audit chain — until that sprint
the runtime recorded every tool call but discarded the
LLM's reasoning that produced it. The new LLM_TURN event
captures each reasoning step's assistant message; the hash
plus an object-storage key land on the audit row while the
full text lives under the same bucket the tamper-evidence
anchors use. Every subsequent tool-call row sets a causation
pointer back to the reasoning turn that produced it, so the
reasoning → action chain becomes a single indexed SQL JOIN.
The same data flows through three operator surfaces: a
Reasoning Trace section on the agent detail page,
server-sent-event previews as the agent reasons, and a pair
of REST endpoints plus an MCP tool for programmatic post-hoc
review. Emission is wired in the two reasoning paths that
cover every shipped vertical today; a documented one-line
adoption pattern brings reviewer-family and swarm reasoning
paths into the same chain on demand.
Prior change: action execution framework (item 60)
shipped. Two reserved family slots — executor and
reporter — promoted to shipped, taking the family catalog from
four shipped + three reserved to six shipped + one reserved.
The executor family is the first write-side member: a
deterministic dispatcher that walks an operator-approved
proposal's actions in order, capability-narrows per action
against a base template that ships zero tools, records an
idempotency cursor for crash-recovery, and routes failure
modes through dedicated audit events. The existing
RemediationAgent refactored onto the dispatcher, shrinking by
roughly forty percent while preserving its public surface
bit-identically. The reporter family closes the lifecycle: it
consumes the verifier's structured result and the audit chain
for the proposal, then emits a runbook-ready summary plus
concrete next steps — status mirrors the verifier by default,
and only the reporter's own parse-fallback path downgrades to
needs-more-info, distinguishing reporter trouble from
underlying remediation outcomes. End-to-end lineage threads
through one indexed proposal-id column so a single SQL filter
returns every event in the lifecycle. Prior change:
roadmap expanded to reflect the full sprint queue and
product vision. Three implementation-ready sprints
remain queued (items 55, 63, 64): sandboxed shell execution,
KB defense hardening, and backend router.
Six study sprints sit in the research section (items 65–71)
covering the known architectural gaps: local model quality
floor, production corpus feedback loop, operational
observability, agent task persistence, swarm resilience, and
multi-tenancy. Study sprints are research-and-design briefs,
not committed features — each graduates to next or
later when scope is confidently bounded.
Prior change: CodeReviewAgent (item 59) shipped as
the third vertical. First vertical authored from
scratch on the generic agent framework — proving the framework's
"next vertical reduces to a configuration class" claim with new
code rather than retrofitted code. CodeReviewAgent reviews
either a pasted code snippet or a unified git diff
and emits a structured proposal with categorized findings (one
of: security issue, likely bug, correctness concern, style
violation, performance concern, missing test, out of scope) at
one of four severities, plus a recommendation derived from the
severities by default. Honest refusal stays first-class: when
the artifact isn't reviewable (binary content, generated code,
vendored dependencies), the reviewer emits a single
out-of-scope finding rather than fabricating issues. Customer
profile is the dogfood loop — the operator running Claude Code
on their workstation pipes the diff through AOS for governance
before merging. The third vertical landed in roughly 330 lines
of vertical-specific code (configuration + LLM parse logic +
domain content) on top of the framework's substrate, with an
eight-case smoke corpus exercising the pass/fail/edge matrix
across both artifact shapes at 100% under the scripted backend.
Prior change: the generic agent framework (item 58) shipped. The LangGraph machinery behind both reasoning verticals extracted into a parameterized base class; each shipped vertical reduced to a configuration class plus a registration helper. The operational and compliance modules shrunk to under 400 lines each, from roughly 1000. Three additional reference families shipped alongside (reviewer, guardrail, verifier) — each with a corresponding role-based capability template operators compose with a vertical-specific overlay at capability-mint time. The audit chain gained five new event types linked through a proposal-lifecycle UUID column so the "show me everything that happened on this proposal" query reads one indexed column. The tool registry gained explicit risk-level metadata so the approval gate routes high-risk reversible actions to human review without per-deployment manual labelling.
Earlier change: the empirical-validation arc closed. The 50-incident replay harness (item 30) and the field-test feedback loop (item 29) both moved from next to shipped. The harness now publishes cross-vertical, multi-backend baselines per oracle version: the air-gapped local-LLM tier (mistral-small:22b on Ollama) reaches 80% pass rate — at parity with the API tier (Gemini Flash 78%, OpenAI gpt-4o-mini 76%, Claude Haiku 70%) — for operator deployments that cannot route incident data to a third-party API. Corpus oracles are versioned per case, and an offline rescore tool reproduces any historical baseline against its prior oracle state, so revisions never destroy prior numbers. The feedback loop is now end-to-end: structured per-call advisor telemetry, three-button operator outcome capture, a reinforcement worker over experiential memory, two materialized rollup views, a dashboard advisor-trends page with a per-vertical filter, and a telemetry-to-corpus exporter that promotes operator-confirmed incidents into the regression set automatically. The advisor's pass rate is now a function of (backend, oracle version) rather than a single number, and the methodology to audit, revise, and rescore is itself an artifact in the repo.
Prior changes: ComplianceAuditAgent (item 57)
shipped as the second vertical class, proving the canonical
pattern (reasoning advisor over deterministic actor pack,
structured proposal, capability + approval gate) generalizes
beyond operational incidents. Six new scenarios across SOC2,
HIPAA, and ISO27001 frameworks; cross-vertical machinery
(per-vertical telemetry column, per-vertical handler packs in
RemediationAgent, per-vertical dashboard filter, harness that
walks both corpora) means a third vertical now reduces to one
advisor + one actor pack + one corpus + two registration-
table entries. Earlier in the same arc, RemediationAgent
v1 (item 56) shipped as the execution arm of the
OperationalAdvice vertical: a deterministic dispatcher walks
an operator-approved proposal's proposed_actions
in order, spawning per-action handlers whose capability is
narrowed to one tool, with the existing approval gate firing
on every irreversible handler call. OperationalAdviceAgent
v1 (item 18) shipped before either as the first
domain vertical, landing alongside the two architectural
primitives it depended on (sandboxed execution, item 17, and
the DeterministicActor primitive, item 19). Three foundational
hardening sprints followed the first vertical and preceded
the second: the LLM-backend resilience layer (retry-with-
backoff on transient cloud errors, per-backend timeouts,
scheduler-level rate-limit recovery rather than silent
zombification); operator-driven credential hygiene (per-
session HMAC-secret rotation with a configurable grace window
plus per-token revocation via a JTI denylist); and the
observability surface (Prometheus metrics, OpenTelemetry
tracing, structured JSON logging with correlation IDs, and a
periodic anchor-sweep job that continuously verifies the
audit chain). Item 27 (Knowledge-base ingest scanner) shipped
immediately after, closing the most visible "deliberately
partial mitigation" claim in the threat-model document.
bash_sandboxed (item 55) remains the next
instance of the sandbox primitive family.
Building something that depends on specific items here?
If your evaluation hinges on a specific roadmap item — performance numbers, a particular tool, a backend addition — that's worth a conversation. Roadmap order can shift in response to real customer signal in a way that reading public documents alone cannot.