IRIS scores every output from clinical AI agents in real time, using seven safety evaluators grounded in RxNorm and openFDA, with full tracing to Arize Phoenix. It then closes the loop: it detects recurring failure patterns from its own observability data, rewrites the failing prompt, validates the fix against real failures, and redeploys it autonomously.
Clinical AI agents are entering operating rooms and ICUs, often without a runtime safety layer. When they hallucinate, they do so silently, and the consequences fall on patients.
Every answer your clinical AI gives passes through IRIS — checked against trusted medical sources, watched end to end, and made safer over time. Automatically.
IRIS operates as a continuous supervisor — evaluating every AI output in real time, detecting failure patterns from observability data, and autonomously improving clinical prompts through a validated self-healing loop.
POST /event. Includes the query, output text, retrieved patient context, and surgical phase.
[…] get-spans, get-span-annotations) to identify recurring failure clusters across recent traces.
[…] production in Phoenix.
Built on Google ADK and the Arize Phoenix MCP server: IRIS's agents introspect the system's own observability data, hunt for failure clusters, diagnose root causes, and drive the self-healing loop end to end, with hard validation gates where safety counts.
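
The validation gate in the self-healing loop can be illustrated with a minimal sketch. Everything named here is hypothetical: `heal`, `HealResult`, the `rewrite` and `score` callables, and the `min_gain` threshold are stand-ins for illustration, not IRIS's actual API. The sketch assumes that higher scores are safer and that a candidate prompt is only promoted when it measurably improves on the recorded failures.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class HealResult:
    deployed: bool          # True if the candidate passed the gate
    prompt: str             # prompt that should be live after this run
    baseline_score: float   # mean score of the current prompt on the failures
    candidate_score: float  # mean score of the rewritten prompt on the failures


def heal(
    current_prompt: str,
    failures: Sequence[str],
    rewrite: Callable[[str, Sequence[str]], str],
    score: Callable[[str, str], float],
    min_gain: float = 1.0,  # hypothetical threshold, not taken from IRIS
) -> HealResult:
    """Rewrite a failing prompt and promote it only if it beats the baseline."""
    if not failures:
        # Nothing to validate against, so keep the current prompt.
        return HealResult(False, current_prompt, 0.0, 0.0)

    candidate = rewrite(current_prompt, failures)

    # Score both prompts on the same real failure examples.
    baseline = mean(score(current_prompt, f) for f in failures)
    improved = mean(score(candidate, f) for f in failures)

    # Hard gate: a candidate that does not clear the margin is never deployed.
    if improved - baseline >= min_gain:
        return HealResult(True, candidate, baseline, improved)
    return HealResult(False, current_prompt, baseline, improved)
```

In a real deployment, `rewrite` would call the mutation engine and `score` would run the evaluators; the point of the sketch is only that deployment is conditional on a measured gain.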
Every AI output is scored 0–10 by seven concurrent evaluators — LLM-as-judge verdicts grounded in authoritative drug databases. The worst severity wins, and criticals stream to the dashboard instantly.
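
A rough sketch of the "worst severity wins" aggregation, using `asyncio.gather` to run evaluators concurrently. The `Severity` levels, the score thresholds in `severity_for`, and the `stub_evaluator` helper are assumptions made for illustration; the source does not specify how scores map to severities, and the sketch assumes that a lower score means a less safe output.

```python
import asyncio
from dataclasses import dataclass
from enum import IntEnum
from typing import Awaitable, Callable, List, Sequence, Tuple


class Severity(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Verdict:
    evaluator: str
    score: float        # 0-10; lower is assumed to mean less safe
    severity: Severity


def severity_for(score: float) -> Severity:
    # Hypothetical thresholds, chosen only for this example.
    if score < 4:
        return Severity.CRITICAL
    if score < 7:
        return Severity.WARNING
    return Severity.OK


Evaluator = Callable[[str], Awaitable[Verdict]]


def stub_evaluator(name: str, score: float) -> Evaluator:
    """Stand-in for an LLM-as-judge evaluator that returns a fixed score."""
    async def judge(output_text: str) -> Verdict:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return Verdict(name, score, severity_for(score))
    return judge


async def evaluate(
    output_text: str, evaluators: Sequence[Evaluator]
) -> Tuple[Severity, List[Verdict]]:
    # Run every evaluator concurrently and wait for all verdicts.
    verdicts = await asyncio.gather(*(ev(output_text) for ev in evaluators))
    # The overall result is the single worst severity across evaluators.
    worst = max(v.severity for v in verdicts)
    return worst, list(verdicts)
```

Using an `IntEnum` keeps the aggregation to a single `max` call, so one critical verdict outranks any number of passing ones.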
Every layer of IRIS is purpose-selected — from the agent runtime to the observability backend to the mutation engine.
| Component | Role in IRIS | Category |
|---|---|---|
| Google ADK 2.x | Agent runtime — LlmAgent, Runner, McpToolset with before/after tool callbacks for arg clamping and span filtering | Agent Runtime |
| Gemini 2.5 Flash + Pro | Flash drives the evaluator judges and healing pipeline; Pro drives the MCP tool-calling agents — all routed through a shared throttled gateway with retry | LLM |
| Vertex AI | Model serving platform for every Gemini call — service-account auth, regional endpoints, production quota management | AI Platform |
| Arize Phoenix Cloud | Observability backend — span storage, prompt versioning & tagging, dataset management, experiments, eval annotations | Observability |
| @arizeai/phoenix-mcp | MCP server exposing 10 Phoenix tools consumed by pattern_detector and mcp_probe at runtime | MCP |
| RxNorm + openFDA | Ground truth for drug validation — NIH RxNorm verifies drug names exist; openFDA labels bound safe dose ranges | Clinical Knowledge |
| Mutation Engine | TextGrad-style textual gradients computed by Gemini from real failure examples — candidate prompts gated by evaluator-scored validation | Prompt Optimization |
| FastAPI | Async event ingestion, SSE alert stream, simulation & comparison endpoints, healing approval API | API |
| OpenTelemetry + OpenInference | Span pipeline to Phoenix Cloud via OTLP — evaluation results recorded as span attributes and REST annotations | Telemetry |
| Google Cloud Run | Deployment target — containerized FastAPI service with Vertex AI service-account auth | Infrastructure |
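
The "shared throttled gateway with retry" from the table could look roughly like the sketch below. `ThrottledGateway`, `TransientError`, and all parameter defaults are hypothetical names and values for illustration; the actual gateway's limits and retry policy are not described in the source. The sketch caps concurrency with a semaphore and retries transient failures with exponential backoff plus jitter.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as rate limits."""


class ThrottledGateway:
    def __init__(self, max_concurrent: int = 4, max_attempts: int = 3,
                 base_delay: float = 0.5) -> None:
        # Create the gateway inside a running event loop so the semaphore
        # binds to that loop on older Python versions.
        self._sem = asyncio.Semaphore(max_concurrent)
        self.max_attempts = max_attempts
        self.base_delay = base_delay

    async def call(self, fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        # The slot is held across retries, so backoff also throttles callers.
        async with self._sem:
            for attempt in range(1, self.max_attempts + 1):
                try:
                    return await fn(*args)
                except TransientError:
                    if attempt == self.max_attempts:
                        raise  # out of attempts: surface the error
                    # Exponential backoff with a little random jitter.
                    delay = self.base_delay * 2 ** (attempt - 1)
                    delay += random.uniform(0, self.base_delay)
                    await asyncio.sleep(delay)
```

Routing every model call through one object like this means the concurrency limit and retry policy are set in a single place rather than per evaluator.
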
Open the live dashboard: run failure scenarios, watch all seven evaluators score them in real time, trigger an autonomous heal, then re-run the same scenarios under the healed prompt and see the measured improvement.