Google Rapid Agent × Arize Hackathon

Self-Healing Observability Layer for Clinical AI.

IRIS scores every output from clinical AI agents in real time — seven safety evaluators grounded in RxNorm and openFDA, full tracing to Arize Phoenix — then closes the loop: it detects recurring failure patterns from its own observability data, rewrites the failing prompt, validates the fix against real failures, and redeploys it autonomously.

Live Dashboard Overview

7 Evaluators

2 ADK Agents

10 MCP Tools

SSE Real-time Alerts

The Problem

The Stakes in the OR Have Never Been Higher

Clinical AI agents are entering operating rooms and ICUs without any runtime safety layer. When they hallucinate, they do so silently — and the consequences fall on patients.

91.8%

of clinicians surveyed reported encountering medical AI hallucinations — and 84.7% said those hallucinations were capable of causing direct patient harm

MIT Media Lab et al., medRxiv 2025 · n=70 clinicians

83%

of cases where leading LLMs repeated or elaborated on a single planted clinical error — fabricated lab values and conditions propagating into unsafe recommendations

Communications Medicine (Nature), 2025 · n=300 vignettes

$42B

annual global cost of medication errors — with AI agents now recommending drugs and doses in these same clinical environments without a dedicated runtime safety layer

WHO, Medication Without Harm

Errors propagate silently through agent pipelines

A single upstream hallucination cascades through every downstream agent with no circuit breaker. Without a supervisor, multi-agent clinical pipelines have no mechanism to stop compounding errors.

Polypharmacy is the #1 ICU medication error

OR medication error rates reach 7.3–12%. AI agents recommending drug regimens must be checked against current medications and documented allergies in real time — not post-hoc.

No self-improvement loop for clinical AI

Failures accumulate invisibly. There is no mechanism to detect recurring failure patterns in deployed AI and autonomously improve the clinical prompts that generate them.

No audit trail for FDA PCCP compliance

FDA's Predetermined Change Control Plan requires documented evidence that AI modifications don't degrade safety. Most deployments produce zero versioned audit trail of prompt changes and validated outcomes.

Introducing IRIS

The layer between your AI and your patients.

Every answer your clinical AI gives passes through IRIS — checked against trusted medical sources, watched end to end, and made safer over time. Automatically.

IRIS sends improved instructions back — your agents get safer over time

Your clinical AI

Care recommendations

Medication & dosing

Clinical documentation

IRIS

Seven clinical safety checks. Every answer. In real time.

RxNorm

openFDA

Arize Phoenix

Trusted ground truth

What reaches care teams

Safe dosing, confirmed

Interactions cleared

Every answer traceable

Catches unsafe answers instantly

Hallucinated drugs, dangerous doses, missed allergies — flagged the moment they happen, before they reach a patient.

Grounded in medical truth

Every verdict is checked against the same drug databases clinicians trust — not just one AI grading another.

Improves your AI — by itself

When IRIS sees the same mistake twice, it rewrites the agent's instructions, proves the fix works, and ships it. No engineers required.

Architecture

A Closed-Loop Safety System

IRIS operates as a continuous supervisor — evaluating every AI output in real time, detecting failure patterns from observability data, and autonomously improving clinical prompts through a validated self-healing loop.

Ingest IrisEvent

Clinical AI agents submit every output as a structured IrisEvent via POST /event. Includes the query, output text, retrieved patient context, and surgical phase.

FastAPI · REST API
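As a rough illustration of the ingestion step, the event body could be modelled as below. This is a minimal sketch, not the project's schema: the source names only the four contents (query, output text, retrieved patient context, surgical phase), so every field name here, plus `agent_id` and the `validate` helper, is an assumption. It uses a standard-library dataclass to stay self-contained; a FastAPI service would more likely use a Pydantic model.

```python
from dataclasses import dataclass, field
from typing import Any

# Phase names as listed in the surgical_phase evaluator description.
SURGICAL_PHASES = {"pre-op", "induction", "maintenance", "closure", "emergence"}


@dataclass
class IrisEvent:
    """Hypothetical shape of the body sent to POST /event."""
    agent_id: str          # assumed field: identifies the source agent
    query: str
    output_text: str
    surgical_phase: str
    patient_context: dict[str, Any] = field(default_factory=dict)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the event is usable."""
        problems = []
        if not self.query.strip():
            problems.append("query is empty")
        if not self.output_text.strip():
            problems.append("output_text is empty")
        if self.surgical_phase not in SURGICAL_PHASES:
            problems.append(f"unknown surgical_phase: {self.surgical_phase!r}")
        return problems
```

Rejecting malformed events at the door keeps the evaluators from scoring outputs whose context cannot be interpreted.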

7-Evaluator Safety Check

Seven concurrent clinical safety checks on every output: dosage boundaries, hallucinations, drug–drug interactions, allergy contraindications, attribution, context gaps, and surgical phase alignment — grounded in RxNorm and openFDA.

Gemini · RxNorm · openFDA · LLM-as-Judge
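One way to run the seven checks concurrently is `asyncio.gather`. The sketch below assumes each evaluator is an async function; `stub_evaluator` and the `Verdict` fields are hypothetical stand-ins for the real Gemini, RxNorm, and openFDA calls, which the source does not show.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    evaluator: str
    score: float      # 0-10 scale
    severity: str     # "INFO", "WARNING" or "CRITICAL"


async def stub_evaluator(name: str, output_text: str) -> Verdict:
    """Placeholder for a real check (LLM judge or database lookup)."""
    await asyncio.sleep(0)  # stands in for network I/O
    flagged = "unknown-drug" in output_text
    return Verdict(name, 2.0 if flagged else 9.0, "CRITICAL" if flagged else "INFO")


EVALUATORS = [
    "dosage_boundary", "factual_hallucination", "drug_interaction",
    "allergy_contraindication", "attribution", "context_gap", "surgical_phase",
]


async def evaluate(output_text: str) -> list[Verdict]:
    """Run all seven checks concurrently; results keep the evaluator order."""
    tasks = [stub_evaluator(name, output_text) for name in EVALUATORS]
    return await asyncio.gather(*tasks)
```

Because the checks are I/O-bound, total latency under this design is close to that of the slowest evaluator rather than the sum of all seven.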

Observe to Phoenix

Every evaluation — including reasoning chain (CoT steps) and self-assessed confidence — is logged to Arize Phoenix as an OTel span annotation via OpenInference instrumentation.

OpenInference · OTel · Phoenix Cloud

Pattern Detection via MCP

Pattern Detector queries Arize Phoenix via MCP (get-spans, get-span-annotations) to identify recurring failure clusters across recent traces.

Phoenix MCP · get-spans

Diagnose from Real Failures

Gemini root-causes each failure cluster from the actual failing traces, fetches the current prompt version from Phoenix, and logs the failure examples to a per-agent Phoenix dataset for validation.

Phoenix Datasets · Prompt Registry

Mutate, Validate & Deploy

A TextGrad-style mutation engine computes textual gradients from the failures and rewrites the prompt. Validation regenerates answers under the candidate prompt and re-scores them with the same 7 evaluators. Only a candidate that shows measurable improvement is deployed, as a new version tagged "production" in Phoenix.

Textual Gradients · Improvement Gate · Phoenix Deploy
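The improvement gate can be pictured as a comparison of evaluator scores before and after the rewrite. The sketch below is an assumption about how such a gate might work: the source says only that measurable improvement is required, so the mean-score criterion and the `min_gain` margin are illustrative choices.

```python
from statistics import mean


def passes_improvement_gate(
    baseline_scores: list[float],
    candidate_scores: list[float],
    min_gain: float = 0.5,   # assumed margin on the 0-10 scale
) -> bool:
    """Allow deployment only if the candidate's mean score beats the baseline by min_gain."""
    if not baseline_scores or not candidate_scores:
        return False  # no evidence, no deploy
    return mean(candidate_scores) - mean(baseline_scores) >= min_gain
```

For example, `passes_improvement_gate([4, 5, 3], [7, 8, 6])` passes, while a candidate that moves the mean by only a fraction of a point does not.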

Agents & MCPs

Agents That Read Their Own Traces and Fix Their Own Prompts.

Built on Google ADK and the Arize Phoenix MCP server: IRIS's agents introspect the system's own observability data, hunt for failure clusters, diagnose root causes, and drive the self-healing loop end to end — with hard validation gates where safety counts.

ADK Agent · Phoenix MCP

pattern_detector

Queries Arize Phoenix via MCP to find recurring failure clusters across recent spans — grouping by agent, prompt version, and query type. Sets healing in motion when a cluster crosses the failure-rate threshold, with a built-in fallback so detection never goes dark.

get-spans · get-span-annotations · Gemini 2.5 Pro
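The grouping and threshold logic described for pattern_detector might look like the following once spans have been fetched. This is a sketch under stated assumptions: the span dictionary keys, the `threshold` value, and the `min_spans` floor are all hypothetical, and the MCP calls themselves are left out.

```python
from collections import defaultdict


def find_failure_clusters(spans, threshold=0.3, min_spans=5):
    """Group spans by (agent, prompt_version, query_type) and return the
    groups whose failure rate meets the threshold.

    Each span is a dict with hypothetical keys: agent, prompt_version,
    query_type and failed (bool).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for span in spans:
        key = (span["agent"], span["prompt_version"], span["query_type"])
        totals[key] += 1
        failures[key] += bool(span["failed"])

    clusters = {}
    for key, total in totals.items():
        rate = failures[key] / total
        # Ignore tiny groups so one or two bad spans do not trigger healing.
        if total >= min_spans and rate >= threshold:
            clusters[key] = rate
    return clusters
```

Keying clusters on prompt version means a prompt that has already been healed starts with a clean failure count.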

ADK Agent · Phoenix MCP

mcp_probe

Powers the dashboard's MCP chat — free-form natural-language exploration of Phoenix observability data with the full 10-tool surface: traces, spans, annotations, datasets, experiments, and prompt versions.

10 MCP tools · list-prompts · get-dataset-experiments · Gemini 2.5 Pro

Safety Engine

evaluation_service

Runs all 7 evaluators concurrently on every output of every event, with no sampling. Each verdict carries a 0–10 score, a severity, a reasoning chain, and a confidence value; results are recorded on the OTel span and as Phoenix annotations.

7 evaluators · RxNorm · openFDA · Gemini judges

Autonomous Heal Loop

healing_pipeline

Python-orchestrated heal loop: diagnose the cluster from real failing traces, mutate the prompt via textual gradients, validate by re-scoring regenerated answers with the same 7 evaluators, then version and tag the winner in Phoenix.

diagnose → mutate → validate → deploy · Improvement gate
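The four stages can be sequenced in plain Python with each stage passed in as a callable. The sketch below is hypothetical: the function signatures and the `max_attempts` retry budget are assumptions, and the real stages would call Gemini and Phoenix rather than the stubs a caller supplies here.

```python
from typing import Callable, Optional


def heal(
    cluster: dict,
    current_prompt: str,
    diagnose: Callable[[dict], str],
    mutate: Callable[[str, str], str],
    validate: Callable[[str], bool],
    deploy: Callable[[str], None],
    max_attempts: int = 3,   # assumed retry budget
) -> Optional[str]:
    """Run diagnose -> mutate -> validate -> deploy; return the deployed prompt or None."""
    diagnosis = diagnose(cluster)
    for _ in range(max_attempts):
        candidate = mutate(current_prompt, diagnosis)
        if validate(candidate):      # the improvement gate
            deploy(candidate)
            return candidate
    return None  # nothing passed the gate, so the current prompt stays live
```

Returning `None` when no candidate passes makes the safe outcome explicit: a failed heal leaves production untouched.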

Closed-Loop Demo

live_agent

Proves the loop is closed: pulls the production-tagged healed prompt from Phoenix at run time and regenerates answers with Gemini. Re-running the same scenarios shows a measured before/after improvement — real outputs, not staged.

Phoenix prompt registry · Reproducible generation

Real-Time Alerting

alert_dispatch

Severity-routed alerting with zero added latency. INFO feeds the live trace feed, WARNING raises a badge, CRITICAL streams an instant SSE alert — and can fire an autonomous healing scan the moment a failure cluster crosses the threshold.

SSE stream · Event-driven scan trigger
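Severity routing and the SSE wire format can be sketched as follows. The destination labels in `ROUTES` are hypothetical names for the three behaviours described above; the frame layout follows the standard Server-Sent Events text format (an `event:` line, one `data:` line per line of payload, then a blank line).

```python
from enum import Enum


class Severity(Enum):
    INFO = "INFO"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


# Destination names are hypothetical labels, not identifiers from the project.
ROUTES = {
    Severity.INFO: "trace_feed",
    Severity.WARNING: "badge",
    Severity.CRITICAL: "sse_alert",
}


def route(severity: Severity) -> str:
    """Map a severity to the channel that should receive it."""
    return ROUTES[severity]


def to_sse_frame(event_name: str, data: str) -> str:
    """Format one Server-Sent Events frame, splitting multi-line data."""
    data_lines = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event_name}\n{data_lines}\n"
```

A browser `EventSource` listening for the `critical` event type would receive each such frame as soon as the server flushes it.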

Evaluation Layer - Heart of the System

Seven Clinical Safety Evaluators

Every AI output is scored 0–10 by seven concurrent evaluators — LLM-as-judge verdicts grounded in authoritative drug databases. The worst severity wins, and criticals stream to the dashboard instantly.
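The "worst severity wins" rule amounts to taking a maximum over an ordered set of labels. A minimal sketch, assuming the three severity names used by the alerting component and a simple tuple layout for verdicts (both the ranking table and the tuple shape are illustrative):

```python
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "CRITICAL": 2}


def overall_severity(verdicts):
    """Return the most severe label among (evaluator, score, severity) tuples."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return max((sev for _, _, sev in verdicts), key=SEVERITY_RANK.__getitem__)
```

With this rule a single CRITICAL verdict determines the overall result, however well the other six evaluators score the output.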

factual_hallucination

Detects fabricated drug names and non-existent medications. Every drug mention is cross-validated against RxNorm, so an unrecognized name is caught by a deterministic lookup rather than a probabilistic judgment. A broader LLM judge then checks for impossible values and contradictions with the patient context.

RxNorm · LLM judge
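A name lookup against RxNorm could be built on the public RxNav REST service. The sketch below reflects my understanding of that API (a `rxcui.json` endpoint taking a `name` parameter and returning an `idGroup` object with an optional `rxnormId` list); treat the URL and response shape as assumptions to confirm against the RxNav documentation. The helper names are hypothetical, and the HTTP call itself is left to the caller.

```python
from urllib.parse import urlencode

RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST/rxcui.json"


def rxcui_lookup_url(drug_name: str) -> str:
    """Build the name-to-RxCUI lookup URL for one drug mention."""
    return f"{RXNAV_BASE}?{urlencode({'name': drug_name})}"


def is_known_drug(payload: dict) -> bool:
    """True if the parsed JSON response lists at least one RxNorm identifier."""
    ids = payload.get("idGroup", {}).get("rxnormId") or []
    return len(ids) > 0
```

Keeping the parsing separate from the network call makes the check easy to unit-test with recorded responses.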

dosage_boundary

Verifies doses are within safe clinical ranges for the patient's weight, age, and renal function (CrCl) — checked against openFDA label data. Catches gross magnitude errors and missing renal adjustments.

openFDA · RxNorm · LLM judge
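The arithmetic behind a weight-based boundary check is simple once a trusted range is available. The sketch below is illustrative only: the function name, the result labels, and the tenfold cutoff for a "gross" error are assumptions, and the mg/kg limits must be supplied by the caller from label data rather than hard-coded.

```python
def check_weight_based_dose(dose_mg, weight_kg, min_mg_per_kg, max_mg_per_kg):
    """Classify a dose against a caller-supplied mg/kg range.

    The range must come from a trusted source such as label data; this
    function only does the arithmetic.
    """
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    per_kg = dose_mg / weight_kg
    if per_kg < min_mg_per_kg:
        return "below_range"
    if per_kg > 10 * max_mg_per_kg:
        return "gross_overdose"   # order-of-magnitude error, e.g. a unit mix-up
    if per_kg > max_mg_per_kg:
        return "above_range"
    return "within_range"
```

Separating a gross magnitude error from a modest overshoot lets the two cases map to different severities.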

drug_interaction

Detects clinically significant drug–drug interactions between recommended drugs and the patient's current medication list — including CYP450-mediated interactions like warfarin–metronidazole.

Current meds context · LLM judge

allergy_contraindication

Checks whether a recommended drug belongs to a class the patient is allergic to — including same-class substitutions, like prescribing amoxicillin to a patient with a documented penicillin allergy.

Allergy context · LLM judge

attribution

Verifies that every clinical recommendation cites specific patient values — labs, weight, allergies — from the retrieved context. Penalises responses that infer or invent patient data not in the record.

Grounding check · LLM judge

context_gap

Flags responses where the agent answered without critical patient data in the record — missing CrCl, weight, or labs. Penalises confident recommendations when required context is simply absent.

Required-field inference · LLM judge
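A simplified version of the context-gap check is a lookup of required fields per query type. This is a deliberately static sketch: the evaluator described above infers required fields with an LLM judge, whereas the `REQUIRED_FIELDS` table, its keys, and the field names here are hypothetical.

```python
# Hypothetical mapping from query type to the patient fields it needs.
REQUIRED_FIELDS = {
    "renal_dosing": ["crcl_ml_min", "weight_kg"],
    "weight_based_dosing": ["weight_kg"],
}


def missing_context(query_type, patient_context):
    """List required fields that are absent or None in the patient record."""
    required = REQUIRED_FIELDS.get(query_type, [])
    return [name for name in required if patient_context.get(name) is None]
```

A non-empty result paired with a confident recommendation is the situation this evaluator is meant to penalise.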

surgical_phase

Checks that recommendations are appropriate for the current surgical phase — pre-op, induction, maintenance, closure, emergence. Flags protocol violations tied to surgical timing.

Phase context · LLM judge

Workflow

From Failure to Fix, End to End

The complete closed loop — source agents, real-time evaluation, Phoenix tracing, MCP-driven pattern detection, prompt mutation, experiment-gated validation, and measurable improvement.

IRIS end-to-end workflow diagram — from source agent event through evaluation, Phoenix tracing, MCP pattern detection, mutation, validation, and prompt deployment


Infrastructure

Built on Production-Grade AI Infrastructure

Every layer of IRIS is purpose-selected — from the agent runtime to the observability backend to the mutation engine.

Component	Role in IRIS	Category
Google ADK 2.x	Agent runtime — LlmAgent, Runner, McpToolset with before/after tool callbacks for arg clamping and span filtering	Agent Runtime
Gemini 2.5 Flash + Pro	Flash drives the evaluator judges and healing pipeline; Pro drives the MCP tool-calling agents — all routed through a shared throttled gateway with retry	LLM
Vertex AI	Model serving platform for every Gemini call — service-account auth, regional endpoints, production quota management	AI Platform
Arize Phoenix Cloud	Observability backend — span storage, prompt versioning & tagging, dataset management, experiments, eval annotations	Observability
@arizeai/phoenix-mcp	MCP server exposing 10 Phoenix tools consumed by pattern_detector and mcp_probe at runtime	MCP
RxNorm + openFDA	Ground truth for drug validation — NIH RxNorm verifies drug names exist; openFDA labels bound safe dose ranges	Clinical Knowledge
Mutation Engine	TextGrad-style textual gradients computed by Gemini from real failure examples — candidate prompts gated by evaluator-scored validation	Prompt Optimization
FastAPI	Async event ingestion, SSE alert stream, simulation & comparison endpoints, healing approval API	API
OpenTelemetry + OpenInference	Span pipeline to Phoenix Cloud via OTLP — evaluation results recorded as span attributes and REST annotations	Telemetry
Google Cloud Run	Deployment target — containerized FastAPI service with Vertex AI service-account auth	Infrastructure
