Google Rapid Agent × Arize Hackathon

Self-Healing Observability Layer for Clinical AI.

IRIS scores every output from clinical AI agents in real time — seven safety evaluators grounded in RxNorm and openFDA, full tracing to Arize Phoenix — then closes the loop: it detects recurring failure patterns from its own observability data, rewrites the failing prompt, validates the fix against real failures, and redeploys it autonomously.

7 Evaluators
2 ADK Agents
10 MCP Tools
SSE Real-time Alerts
IRIS Clinical AI Safety Dashboard
The Problem

The Stakes in the OR Have Never Been Higher

Clinical AI agents are entering operating rooms and ICUs without any runtime safety layer. When they hallucinate, they do so silently — and the consequences fall on patients.

91.8%
of clinicians surveyed reported encountering medical AI hallucinations — and 84.7% said those hallucinations were capable of causing direct patient harm
MIT Media Lab et al., medRxiv 2025 · n=70 clinicians
83%
of cases where leading LLMs repeated or elaborated on a single planted clinical error — fabricated lab values and conditions propagating into unsafe recommendations
Communications Medicine (Nature), 2025 · n=300 vignettes
$42B
annual global cost of medication errors — with AI agents now recommending drugs and doses in these same clinical environments without a dedicated runtime safety layer
WHO: Medication Without Harm
Errors propagate silently through agent pipelines
A single upstream hallucination cascades through every downstream agent with no circuit breaker. Without a supervisor, multi-agent clinical pipelines have no mechanism to stop compounding errors.
Polypharmacy is the #1 ICU medication error
OR medication error rates reach 7.3–12%. AI agents recommending drug regimens must be checked against current medications and documented allergies in real time — not post-hoc.
No self-improvement loop for clinical AI
Failures accumulate invisibly. There is no mechanism to detect recurring failure patterns in deployed AI and autonomously improve the clinical prompts that generate them.
No audit trail for FDA PCCP compliance
FDA's Predetermined Change Control Plan requires documented evidence that AI modifications don't degrade safety. Most deployments produce zero versioned audit trail of prompt changes and validated outcomes.
Introducing IRIS

The layer between your AI and your patients.

Every answer your clinical AI gives passes through IRIS — checked against trusted medical sources, watched end to end, and made safer over time. Automatically.

IRIS sends improved instructions back — your agents get safer over time
Your clinical AI
Care recommendations
Medication & dosing
Clinical documentation
IRIS
Seven clinical safety checks. Every answer. In real time.
RxNorm
openFDA
Arize Phoenix
Trusted ground truth
What reaches care teams
Safe dosing, confirmed
Interactions cleared
Every answer traceable
Catches unsafe answers instantly
Hallucinated drugs, dangerous doses, missed allergies — flagged the moment they happen, before they reach a patient.
Grounded in medical truth
Every verdict is checked against the same drug databases clinicians trust — not just one AI grading another.
Improves your AI — by itself
When IRIS sees the same failure pattern recur, it rewrites the agent's instructions, proves the fix works, and ships it. No engineers required.
Architecture

A Closed-Loop Safety System

IRIS operates as a continuous supervisor — evaluating every AI output in real time, detecting failure patterns from observability data, and autonomously improving clinical prompts through a validated self-healing loop.

01
Ingest IrisEvent
Clinical AI agents submit every output as a structured IrisEvent via POST /event. Includes the query, output text, retrieved patient context, and surgical phase.
FastAPI · REST API
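As a rough illustration of the payload described in step 01, the sketch below models an IrisEvent as a plain dataclass. The field names (`agent_id`, `query`, `output_text`, `patient_context`, `surgical_phase`) and the sample values are assumptions inferred from the description, not the project's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json

# Phase names taken from the surgical_phase evaluator description.
SURGICAL_PHASES = {"pre-op", "induction", "maintenance", "closure", "emergence"}

@dataclass
class IrisEvent:
    """Hypothetical shape of the body sent to POST /event."""
    agent_id: str          # assumed field: which source agent produced the output
    query: str
    output_text: str
    surgical_phase: str
    patient_context: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject phases outside the five named in the document.
        if self.surgical_phase not in SURGICAL_PHASES:
            raise ValueError(f"unknown surgical phase: {self.surgical_phase!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# Illustrative event only; the clinical text is sample data.
event = IrisEvent(
    agent_id="dosing_agent",
    query="Recommended antibiotic prophylaxis?",
    output_text="Cefazolin 2 g IV before incision.",
    surgical_phase="pre-op",
    patient_context={"weight_kg": 82, "allergies": []},
)
payload = event.to_json()
```

A client would send `payload` as the JSON body of `POST /event`; validating the phase at construction time keeps malformed events out of the evaluators.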
02
7-Evaluator Safety Check
Seven concurrent clinical safety checks on every output: dosage boundaries, hallucinations, drug–drug interactions, allergy contraindications, attribution, context gaps, and surgical phase alignment — grounded in RxNorm and openFDA.
Gemini · RxNorm · openFDA · LLM-as-Judge
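One plausible way to run the checks concurrently is `asyncio.gather`. The sketch below uses two stand-in evaluators with hard-coded logic; the `Verdict` type, the function bodies, and the event dictionary keys are all illustrative assumptions, and the real system runs seven evaluators backed by Gemini, RxNorm, and openFDA.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Verdict:
    evaluator: str
    score: float    # 0-10 scale, as described in the text
    severity: str   # "INFO", "WARNING", or "CRITICAL"

async def dosage_boundary(event: dict) -> Verdict:
    # Placeholder: a real check would consult openFDA label data.
    return Verdict("dosage_boundary", 9.0, "INFO")

async def allergy_contraindication(event: dict) -> Verdict:
    # Placeholder rule: flag if the output names a drug listed as an allergy.
    text = event["output_text"].lower()
    hit = any(a.lower() in text for a in event["allergies"])
    return Verdict("allergy_contraindication",
                   1.0 if hit else 10.0,
                   "CRITICAL" if hit else "INFO")

# Two stand-ins here; the described system registers seven.
EVALUATORS = [dosage_boundary, allergy_contraindication]

async def evaluate(event: dict) -> list:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(ev(event) for ev in EVALUATORS))

verdicts = asyncio.run(evaluate({
    "output_text": "Give amoxicillin 500 mg PO.",
    "allergies": ["amoxicillin"],
}))
```

Because `gather` preserves order, each verdict can be matched back to its evaluator by position as well as by name.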
03
Observe to Phoenix
Every evaluation — including reasoning chain (CoT steps) and self-assessed confidence — is logged to Arize Phoenix as an OTel span annotation via OpenInference instrumentation.
OpenInference · OTel · Phoenix Cloud
04
Pattern Detection via MCP
Pattern Detector queries Arize Phoenix via MCP (get-spans, get-span-annotations) to identify recurring failure clusters across recent traces.
Phoenix MCP · get-spans
05
Diagnose from Real Failures
Gemini root-causes each failure cluster from the actual failing traces, fetches the current prompt version from Phoenix, and logs the failure examples to a per-agent Phoenix dataset for validation.
Phoenix Datasets · Prompt Registry
06
Mutate, Validate & Deploy
A TextGrad-style mutation engine computes textual gradients from the failures and rewrites the prompt. Validation regenerates answers under the candidate prompt and re-scores them with the same 7 evaluators. Only a candidate that shows measurable improvement is deployed, versioned and tagged as production in Phoenix.
Textual Gradients · Improvement Gate · Phoenix Deploy
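The improvement gate in step 06 can be pictured as a comparison of paired evaluator scores before and after mutation. The following is a minimal sketch under stated assumptions: the `min_gain` threshold and the "no individual case may regress" rule are hypothetical choices, since the page does not specify how "measurable improvement" is defined.

```python
from statistics import mean

def passes_improvement_gate(baseline_scores, candidate_scores, min_gain=0.5):
    """Return True only if the candidate prompt's mean score beats the
    baseline by at least `min_gain` and no paired case scores lower.

    Scores are 0-10 evaluator scores for the same failure examples,
    first under the current prompt, then under the candidate.
    """
    if len(baseline_scores) != len(candidate_scores) or not baseline_scores:
        raise ValueError("need paired, non-empty score lists")
    gain = mean(candidate_scores) - mean(baseline_scores)
    # Assumed safety rule: reject candidates that make any single case worse.
    no_regressions = all(c >= b for b, c in zip(baseline_scores, candidate_scores))
    return gain >= min_gain and no_regressions
```

For example, `passes_improvement_gate([2, 3, 4], [6, 7, 8])` passes, while a candidate that raises the average but lowers one case is rejected.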
Agents & MCPs

Agents That Read Their Own Traces and Fix Their Own Prompts.

Built on Google ADK and the Arize Phoenix MCP server: IRIS's agents introspect the system's own observability data, hunt for failure clusters, diagnose root causes, and drive the self-healing loop end to end — with hard validation gates where safety counts.

ADK Agent · Phoenix MCP
pattern_detector
Queries Arize Phoenix via MCP to find recurring failure clusters across recent spans — grouping by agent, prompt version, and query type. Sets healing in motion when a cluster crosses the failure-rate threshold, with a built-in fallback so detection never goes dark.
get-spans · get-span-annotations · Gemini 2.5 Pro
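The grouping and threshold logic described for pattern_detector might look roughly like the sketch below. It works on plain dictionaries rather than real Phoenix spans; the span keys, the `threshold` of 0.3, and the `min_events` floor are assumptions chosen for illustration.

```python
from collections import defaultdict

def find_failure_clusters(spans, threshold=0.3, min_events=5):
    """Group spans by (agent, prompt_version, query_type) and return the
    groups whose failure rate is at or above `threshold`.

    Groups with fewer than `min_events` spans are ignored so that one or
    two unlucky traces do not trigger a heal.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for span in spans:
        key = (span["agent"], span["prompt_version"], span["query_type"])
        totals[key] += 1
        if span["failed"]:
            failures[key] += 1
    return {
        key: failures[key] / totals[key]
        for key in totals
        if totals[key] >= min_events and failures[key] / totals[key] >= threshold
    }

# Synthetic spans: one agent failing 3 of 5 renal queries, one healthy agent.
spans = (
    [{"agent": "dosing", "prompt_version": "v3", "query_type": "renal", "failed": True}] * 3
    + [{"agent": "dosing", "prompt_version": "v3", "query_type": "renal", "failed": False}] * 2
    + [{"agent": "notes", "prompt_version": "v1", "query_type": "summary", "failed": False}] * 5
)
clusters = find_failure_clusters(spans)
```

Keying on prompt version matters here: once a healed prompt ships under a new version, its traces form a fresh group and old failures no longer count against it.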
ADK Agent · Phoenix MCP
mcp_probe
Powers the dashboard's MCP chat — free-form natural-language exploration of Phoenix observability data with the full 10-tool surface: traces, spans, annotations, datasets, experiments, and prompt versions.
10 MCP tools · list-prompts · get-dataset-experiments · Gemini 2.5 Pro
Safety Engine
evaluation_service
Runs all 7 evaluators concurrently on every event, with zero sampling. Each verdict carries a 0–10 score, severity, reasoning chain, and confidence; results land on the OTel span and as Phoenix annotations.
7 evaluators · RxNorm · openFDA · Gemini judges
Autonomous Heal Loop
healing_pipeline
Python-orchestrated heal loop: diagnose the cluster from real failing traces, mutate the prompt via textual gradients, validate by re-scoring regenerated answers with the same 7 evaluators, then version and tag the winner in Phoenix.
diagnose → mutate → validate → deploy · Improvement gate
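The four-stage loop can be sketched as a small orchestrator that takes each stage as a callable. Everything here is hypothetical scaffolding: the `heal` function, the cluster dictionary shape, and the lambda stages stand in for the Gemini-driven diagnosis, textual-gradient mutation, evaluator-scored validation, and Phoenix deployment the page describes.

```python
from typing import Callable, Optional

def heal(cluster: dict,
         diagnose: Callable[[dict], str],
         mutate: Callable[[str, str], str],
         validate: Callable[[str], bool],
         deploy: Callable[[str], None]) -> Optional[str]:
    """Run diagnose -> mutate -> validate -> deploy.

    Returns the deployed prompt, or None when the candidate fails
    validation and the current prompt stays in place.
    """
    root_cause = diagnose(cluster)
    candidate = mutate(cluster["prompt"], root_cause)
    if not validate(candidate):
        return None  # gate closed: nothing is deployed
    deploy(candidate)
    return candidate

# Toy stages so the sketch runs end to end.
deployed = []
result = heal(
    {"prompt": "Recommend a dose.", "failures": ["missing renal adjustment"]},
    diagnose=lambda c: c["failures"][0],
    mutate=lambda prompt, cause: f"{prompt} Always address: {cause}.",
    validate=lambda candidate: "renal" in candidate,
    deploy=deployed.append,
)
```

Passing the stages in as callables keeps the ordering and the validation gate in plain Python, where they are easy to test, while the model-backed steps can be swapped independently.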
Closed-Loop Demo
live_agent
Proves the loop is closed: pulls the production-tagged healed prompt from Phoenix at run time and regenerates answers with Gemini. Re-running the same scenarios shows a measured before/after improvement — real outputs, not staged.
Phoenix prompt registry · Reproducible generation
Real-Time Alerting
alert_dispatch
Severity-routed alerting with zero added latency. INFO feeds the live trace feed, WARNING raises a badge, CRITICAL streams an instant SSE alert — and can fire an autonomous healing scan the moment a failure cluster crosses the threshold.
SSE stream · Event-driven scan trigger
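To make the severity routing concrete, here is a minimal sketch that maps each severity to a dashboard channel and serializes an alert as a Server-Sent Events frame. The channel names and the `alert` event name are assumptions; only the three severity levels and the use of SSE come from the description above.

```python
import json

def route_alert(severity: str, message: str) -> dict:
    """Map a verdict severity to a dashboard channel (channel names are hypothetical)."""
    channels = {"INFO": "trace_feed", "WARNING": "badge", "CRITICAL": "sse_alert"}
    if severity not in channels:
        raise ValueError(f"unknown severity: {severity!r}")
    return {"channel": channels[severity], "message": message}

def to_sse_frame(alert: dict, event_name: str = "alert") -> str:
    """Serialize an alert as one SSE frame.

    An SSE frame is a series of 'field: value' lines terminated by a blank line.
    """
    return f"event: {event_name}\ndata: {json.dumps(alert)}\n\n"

frame = to_sse_frame(route_alert("CRITICAL", "allergy contraindication detected"))
```

A streaming endpoint would write `frame` to each connected dashboard client; the browser's `EventSource` API then dispatches it as an `alert` event.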
Evaluation Layer - Heart of the System

Seven Clinical Safety Evaluators

Every AI output is scored 0–10 by seven concurrent evaluators — LLM-as-judge verdicts grounded in authoritative drug databases. The worst severity wins, and criticals stream to the dashboard instantly.
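The "worst severity wins" rule amounts to taking a maximum over an ordered severity scale. A minimal sketch, assuming the three levels INFO, WARNING, and CRITICAL used by the alerting layer and a hypothetical helper name:

```python
# Ordering of severities from least to most serious.
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "CRITICAL": 2}

def overall_severity(severities):
    """Return the most severe level present; default to INFO when the list is empty."""
    return max(severities, key=SEVERITY_RANK.__getitem__, default="INFO")
```

With seven verdicts per event, a single CRITICAL from any evaluator is enough to mark the whole event CRITICAL, regardless of how the other six scored.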

factual_hallucination
Detects fabricated drug names and non-existent medications. Cross-validates every drug mention against RxNorm, so an unrecognized name is flagged by a deterministic lookup rather than a probabilistic judgment. A broad LLM judge then checks for impossible values and context contradictions.
RxNorm · LLM judge
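A drug-name lookup against RxNorm could be built on the public RxNav REST API. The sketch below only constructs the request URL and interprets an already-parsed response, so it runs without network access. It assumes the `rxcui.json?name=` endpoint returns an `idGroup` object whose `rxnormId` list is present when the name is recognized; that response shape is our understanding and should be checked against the RxNav documentation. The helper names are hypothetical.

```python
from urllib.parse import urlencode

RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST/rxcui.json"

def rxcui_lookup_url(drug_name: str) -> str:
    """Build the RxNav request URL for finding an RxCUI by drug name."""
    return f"{RXNAV_BASE}?{urlencode({'name': drug_name})}"

def is_known_drug(response: dict) -> bool:
    """True if the parsed JSON response lists at least one RxCUI.

    Assumed shape: {"idGroup": {"rxnormId": ["..."]}} for a recognized
    name, with "rxnormId" missing or empty otherwise.
    """
    return bool(response.get("idGroup", {}).get("rxnormId"))
```

A caller would fetch `rxcui_lookup_url(name)`, parse the JSON body, and treat `is_known_drug(...) == False` as a signal to escalate the mention to the LLM judge or flag it outright.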
dosage_boundary
Verifies doses are within safe clinical ranges for the patient's weight, age, and renal function (CrCl) — checked against openFDA label data. Catches gross magnitude errors and missing renal adjustments.
openFDA · RxNorm · LLM judge
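Two building blocks of a dosage check can be written down directly: the Cockcroft-Gault estimate of creatinine clearance, and a weight-based range test. This is a sketch, not the evaluator itself. The function names are hypothetical, and the per-kg limits must be supplied by the caller (for instance from openFDA label data); no real dosing limits are encoded here.

```python
def cockcroft_gault_crcl(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Estimated creatinine clearance in mL/min using Cockcroft-Gault:
    ((140 - age) * weight) / (72 * SCr), multiplied by 0.85 for women."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

def dose_within_bounds(dose_mg, weight_kg, min_mg_per_kg, max_mg_per_kg):
    """True if a weight-based dose falls inside the supplied per-kg range."""
    return min_mg_per_kg * weight_kg <= dose_mg <= max_mg_per_kg * weight_kg
```

With placeholder limits of 10 to 20 mg/kg for a 70 kg patient, a 1000 mg dose is inside the range and a 5000 mg dose is caught as a gross magnitude error; those limits are illustrative numbers only.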
drug_interaction
Detects clinically significant drug–drug interactions between recommended drugs and the patient's current medication list — including CYP450-mediated interactions like warfarin–metronidazole.
Current meds context · LLM judge
allergy_contraindication
Checks whether a recommended drug belongs to a class the patient is allergic to — including same-class substitutions, like prescribing amoxicillin to a patient with a documented penicillin allergy.
Allergy context · LLM judge
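The same-class check can be illustrated with a small lookup table. The `DRUG_CLASS` map below is a tiny hand-written stand-in with hypothetical class labels; the actual evaluator is described as an LLM judge working from allergy context, so this deterministic version is only a sketch of the idea.

```python
# Tiny illustrative class map; a real system would use a curated drug database.
DRUG_CLASS = {
    "amoxicillin": "penicillins",
    "ampicillin": "penicillins",
    "penicillin": "penicillins",
    "ibuprofen": "nsaids",
}

def allergy_conflicts(recommended, allergies):
    """Return recommended drugs that match an allergy by name or by drug class."""
    allergy_names = {a.lower() for a in allergies}
    allergy_classes = {DRUG_CLASS[a] for a in allergy_names if a in DRUG_CLASS}
    conflicts = []
    for drug in recommended:
        name = drug.lower()
        # Direct match, or the drug's class matches an allergy's class.
        if name in allergy_names or DRUG_CLASS.get(name) in allergy_classes:
            conflicts.append(drug)
    return conflicts
```

For a patient with a documented penicillin allergy, `allergy_conflicts(["Amoxicillin", "ibuprofen"], ["penicillin"])` flags only the amoxicillin, which is the same-class substitution case described above.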
attribution
Verifies that every clinical recommendation cites specific patient values (labs, weight, allergies) from the retrieved context. Penalizes responses that infer or invent patient data not in the record.
Grounding check · LLM judge
context_gap
Flags responses where the agent answered without critical patient data in the record, such as missing CrCl, weight, or labs. Penalizes confident recommendations when required context is absent.
Required-field inference · LLM judge
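A simple way to picture the context-gap check is a required-field lookup per query type. In the sketch below the `REQUIRED_FIELDS` table is a static, hypothetical stand-in for the required-field inference the evaluator performs, and the field and query-type names are assumptions.

```python
# Hypothetical mapping from query type to the patient fields it depends on.
REQUIRED_FIELDS = {
    "renal_dosing": ["crcl_ml_min", "weight_kg"],
    "weight_based_dosing": ["weight_kg"],
}

def missing_context(query_type, patient_context):
    """List required fields that are absent or None in the patient record."""
    required = REQUIRED_FIELDS.get(query_type, [])
    return [name for name in required if patient_context.get(name) is None]
```

A non-empty result for a response that still gives a confident recommendation is the situation this evaluator penalizes: `missing_context("renal_dosing", {"weight_kg": 80})` reports that CrCl is absent.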
surgical_phase
Checks that recommendations are appropriate for the current surgical phase — pre-op, induction, maintenance, closure, emergence. Flags protocol violations tied to surgical timing.
Phase context · LLM judge
Workflow

From Failure to Fix, End to End

The complete closed loop — source agents, real-time evaluation, Phoenix tracing, MCP-driven pattern detection, prompt mutation, experiment-gated validation, and measurable improvement.

IRIS end-to-end workflow diagram — from source agent event through evaluation, Phoenix tracing, MCP pattern detection, mutation, validation, and prompt deployment
Infrastructure

Built on Production-Grade AI Infrastructure

Every layer of IRIS is purpose-selected — from the agent runtime to the observability backend to the mutation engine.

Component | Role in IRIS | Category
Google ADK 2.x | Agent runtime: LlmAgent, Runner, McpToolset with before/after tool callbacks for arg clamping and span filtering | Agent Runtime
Gemini 2.5 Flash + Pro | Flash drives the evaluator judges and healing pipeline; Pro drives the MCP tool-calling agents; all calls are routed through a shared throttled gateway with retry | LLM
Vertex AI | Model serving platform for every Gemini call: service-account auth, regional endpoints, production quota management | AI Platform
Arize Phoenix Cloud | Observability backend: span storage, prompt versioning & tagging, dataset management, experiments, eval annotations | Observability
@arizeai/phoenix-mcp | MCP server exposing 10 Phoenix tools consumed by pattern_detector and mcp_probe at runtime | MCP
RxNorm + openFDA | Ground truth for drug validation: NIH RxNorm verifies drug names exist; openFDA labels bound safe dose ranges | Clinical Knowledge
Mutation Engine | TextGrad-style textual gradients computed by Gemini from real failure examples; candidate prompts gated by evaluator-scored validation | Prompt Optimization
FastAPI | Async event ingestion, SSE alert stream, simulation & comparison endpoints, healing approval API | API
OpenTelemetry + OpenInference | Span pipeline to Phoenix Cloud via OTLP; evaluation results recorded as span attributes and REST annotations | Telemetry
Google Cloud Run | Deployment target: containerized FastAPI service with Vertex AI service-account auth | Infrastructure
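The "shared throttled gateway with retry" mentioned in the Gemini row could take a shape like the sketch below: a semaphore caps concurrent calls and failures are retried with exponential backoff plus jitter. The class name, defaults, and the `flaky_model` stand-in are all hypothetical; the page does not describe the gateway's actual limits or retry policy.

```python
import asyncio
import random

class ThrottledGateway:
    """Caps concurrent calls and retries failures with exponential backoff."""

    def __init__(self, max_concurrency=4, max_attempts=3, base_delay=0.01):
        self._sem = asyncio.Semaphore(max_concurrency)
        self._max_attempts = max_attempts
        self._base_delay = base_delay

    async def call(self, fn, *args):
        # The slot is held across retries, so a struggling call still counts
        # against the concurrency cap.
        async with self._sem:
            for attempt in range(1, self._max_attempts + 1):
                try:
                    return await fn(*args)
                except Exception:
                    if attempt == self._max_attempts:
                        raise
                    # Delay doubles each attempt, with random jitter added.
                    delay = self._base_delay * 2 ** (attempt - 1)
                    await asyncio.sleep(delay + random.uniform(0, delay))

calls = {"n": 0}

async def flaky_model(prompt):
    # Stand-in for a model call that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return f"ok: {prompt}"

async def main():
    gateway = ThrottledGateway()  # created inside the running event loop
    return await gateway.call(flaky_model, "ping")

reply = asyncio.run(main())
```

Routing every model call through one gateway object means the evaluators, the healing pipeline, and the MCP agents share a single concurrency budget instead of each hitting quota limits independently.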

See IRIS in Action

Open the live dashboard: run failure scenarios, watch all seven evaluators score them in real time, trigger an autonomous heal, then re-run the same scenarios under the healed prompt and see the measured improvement.