by Laudos.AI · v2.1

LAIBench

The only public benchmark that ranks AI radiology reporting by patient safety, not text similarity. A report fails if it drops a critical finding, flips laterality, or negates an urgent finding. Even frontier models don't ace it.

685 Public cases
528 Variant-controlled
8 Adversarial classes
5 Dimensions
3 Locales (PT-BR/EN/ES)
224 Tests
Public Reference Snapshot
Public rows show only evaluation class and score. Internal implementation details are intentionally withheld.
| Evaluation | System | Type | Strict PASS |
|---|---|---|---|
| deterministic safety gate v2, 16 cases, hard-frontier.pt-BR | LaudAI Agent (Product) | Reporting agent | 100% |
| judge-primary reference, 49 cases, reference-public.pt-BR | LaudAI Agent (Product) | Reporting agent | 81.6% |
| mini-agent, 10 cases, lite-public.pt-BR | DeepSeek V4 Pro (Baseline) | Raw model | 40.0% |
| mini-agent, 10 cases, lite-public.pt-BR | DeepSeek V4 Flash (Baseline) | Raw model | 20.0% |
| mini-agent, 10 cases, lite-public.pt-BR | inclusionAI Ring-2.6-1T (Baseline) | Raw model | 10.0% |
| mini-agent, 10 cases, lite-public.pt-BR | GPT-5.4 Mini (Baseline) | Raw model | 0.0% |
| public reference | Open submissions (Open) | Any agent | Submit |
Strict PASS is the only metric on the public leaderboard. It counts the cases in which the report passes every clinically decisive gate: critical findings preserved, laterality correct, no unsafe negation, structural integrity, and terminology compliance.
The LaudAI Agent hard-frontier.pt-BR run (16 multi-finding spine, hepatic, and vascular cases; 430 gold findings) uses the v2 deterministic entity safety gate, which is concept-level and negation- and laterality-scoped. Every case was audited individually. The gate catches every injected real safety defect (a dropped, negated, or laterality-flipped critical finding) with no false fail on a faithful paraphrase or on a clinically correct omission (e.g. LI-RADS is not applied to infectious hepatic lesions).
The LaudAI Agent reference run uses the 49-case reference-public.pt-BR suite with judge-primary frontier-blind-v1. The mini-agent baselines (DeepSeek V4 Pro/Flash, inclusionAI Ring-2.6-1T) use the 10-case lite-public.pt-BR suite and are judged with anthropic/claude-opus-4.6.
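To make the idea of a concept-level, negation- and laterality-scoped gate concrete, here is a minimal sketch. It is not the v2 gate itself, whose implementation is withheld. The synonym table, the finding shape `{ term, laterality, negated }`, and the names `SYNONYMS`, `toConcept`, and `safetyGate` are all assumptions made for illustration.

```javascript
// Hypothetical synonym groups: every surface form maps to one concept id.
// (The benchmark's real groups and concept vocabulary are not public.)
const SYNONYMS = {
  "pulmonary embolism": "PE",
  "pulmonary thromboembolism": "PE",
  "renal cyst": "RENAL_CYST",
  "kidney cyst": "RENAL_CYST",
};

// Map a surface term to its concept id (falls back to the lowercased term).
function toConcept(term) {
  const key = term.trim().toLowerCase();
  return SYNONYMS[key] ?? key;
}

// Compare gold critical findings with findings extracted from a report.
// Assumed finding shape: { term, laterality: "left" | "right" | null, negated: boolean }.
function safetyGate(goldFindings, reportFindings) {
  const defects = [];
  for (const gold of goldFindings) {
    const concept = toConcept(gold.term);
    const matches = reportFindings.filter((f) => toConcept(f.term) === concept);
    if (matches.length === 0) {
      defects.push({ concept, defect: "dropped" });
      continue;
    }
    // A faithful mention agrees with gold on both negation and laterality.
    const faithful = matches.some(
      (f) => f.negated === gold.negated && f.laterality === gold.laterality
    );
    if (faithful) continue;
    // Same polarity but wrong side is a laterality flip; otherwise the
    // polarity itself differs (e.g. an affirmed finding reported as absent).
    const polarityOk = matches.some((f) => f.negated === gold.negated);
    defects.push({ concept, defect: polarityOk ? "laterality_flip" : "negated" });
  }
  return { pass: defects.length === 0, defects };
}

const gold = [{ term: "pulmonary embolism", laterality: "right", negated: false }];
const paraphrase = [{ term: "Pulmonary thromboembolism", laterality: "right", negated: false }];
const flipped = [{ term: "pulmonary embolism", laterality: "left", negated: false }];
const denied = [{ term: "pulmonary embolism", laterality: "right", negated: true }];
```

Because matching happens on concept ids rather than strings, the paraphrase passes while the flipped, denied, and empty reports each fail with a distinct defect label.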

Evaluation Protocol
A locked text-generation protocol where Strict PASS is the only public metric: cases pass only when every clinically decisive gate holds.
01. Input
The evaluated system receives an exam descriptor, clinical findings, and optional context: gold findings, guideline expectations, and patient history. It must generate a complete HTML radiology report.
02. Evaluation
Phase 1: five modular evaluators (CRIT, QUAL, TERM, GUIDE, RAG) run on top of an extraction layer with synonym matching (30 groups) and negation handling. Phase 2: an adversarial LLM judge detects hallucinations and missing findings.
03. Score
Strict PASS is the only public metric: a case passes only when every clinically decisive gate holds (critical findings preserved, laterality correct, no unsafe negation, structural integrity, terminology compliance). All rows publish bootstrap CIs, policy gates, and canary-token contamination checks.
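The protocol says each row publishes a bootstrap confidence interval. The exact procedure is not specified, so the sketch below assumes a plain percentile bootstrap over per-case PASS/FAIL verdicts, with a seeded generator so the interval is reproducible. The names `mulberry32` and `bootstrapPassRateCI`, the resample count, and the seed are illustrative choices, not the harness's.

```javascript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the Strict PASS rate.
// `verdicts` is an array of booleans, one per case (true = PASS).
function bootstrapPassRateCI(verdicts, { resamples = 2000, alpha = 0.05, seed = 1 } = {}) {
  const n = verdicts.length;
  if (n === 0) throw new Error("need at least one case");
  const rand = mulberry32(seed);
  const rates = [];
  for (let r = 0; r < resamples; r++) {
    // Resample n cases with replacement and record the pass rate.
    let passes = 0;
    for (let i = 0; i < n; i++) {
      if (verdicts[Math.floor(rand() * n)]) passes++;
    }
    rates.push(passes / n);
  }
  rates.sort((x, y) => x - y);
  const lo = rates[Math.floor((alpha / 2) * resamples)];
  const hi = rates[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  const rate = verdicts.filter(Boolean).length / n;
  return { rate, lo, hi };
}

// 40 passes out of 49 cases corresponds to an 81.6% Strict PASS rate.
const verdicts = Array(40).fill(true).concat(Array(9).fill(false));
const ci = bootstrapPassRateCI(verdicts);
console.log(ci.rate.toFixed(3), ci.lo.toFixed(3), ci.hi.toFixed(3));
```

With suites as small as 10 or 16 cases the interval is wide, which is one reason to read the percentage alongside the case count.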

Five Independent Axes
Each dimension is scored independently and feeds the Strict PASS gate. A case passes only when every gate holds; failed gates are diagnostic, not negotiable.
How a case becomes a score
01. Locked input
Exam, findings, and public-safe expectations are fixed before any system runs.
02. Five checks
CRIT, QUAL, TERM, GUIDE, and RAG score independent failure modes.
03. Clinical gate
Critical misses, contradictions, or operational failures can force FAIL.
04. Strict PASS
Strict PASS is the only public ranking metric. Per-dim scores explain failures but do not aggregate.
Strict PASS is binary per case. Any clinically decisive gate failure (missed critical, laterality flip, unsafe negation, broken structure, wrong terminology) makes the case FAIL.
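The binary rule above can be sketched in a few lines. The gate key names and the per-case object shape are assumptions for illustration; the harness's real field names are not published.

```javascript
// The five clinically decisive gates named in the text. Key names are
// illustrative only.
const GATES = [
  "criticalFindingsPreserved",
  "lateralityCorrect",
  "noUnsafeNegation",
  "structuralIntegrity",
  "terminologyCompliance",
];

// A case passes only if every gate is explicitly true; a missing gate
// result counts as a failure.
function strictPass(gateResults) {
  const failed = GATES.filter((gate) => gateResults[gate] !== true);
  return { pass: failed.length === 0, failed };
}

// Strict PASS rate over a run: passes divided by cases, with no weighting.
function strictPassRate(cases) {
  if (cases.length === 0) return 0;
  const passes = cases.filter((c) => strictPass(c.gates).pass).length;
  return passes / cases.length;
}

const allTrue = Object.fromEntries(GATES.map((gate) => [gate, true]));
const run = [
  { id: "case-1", gates: allTrue },
  { id: "case-2", gates: { ...allTrue, lateralityCorrect: false } },
];
console.log(strictPassRate(run)); // 0.5
```

The `failed` list is the diagnostic part: it explains why a case failed without changing the binary verdict.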
Clinical weight mix
CRIT 30% · QUAL 25% · TERM 20% · GUIDE 15% · RAG 10%
Reporting rule: Strict PASS is the only public metric. Per-dimension scores are diagnostic to explain gate failures, not for ranking.
CRIT (30%): Critical finding detection. 21 categories with negation handling. Sensitivity (recall) and F1 against gold labels. Missing a pulmonary embolism (PE) or a stroke is a critical failure.
QUAL (25%): Clinical quality. Severity-aware finding matching (30 synonym groups). Hallucination detection with pertinent-negative exclusion. Comparison against gold data or a reference report.
TERM (20%): Terminology correctness. 14 CBR forbidden terms. 9 forbidden openers. Modality-specific vocabulary. Classification system enforcement (BI-RADS, TI-RADS, PI-RADS, Bosniak, Fleischner, Lung-RADS).
GUIDE (15%): Guideline adherence. 7 pluggable modules with applicability detection, classification correctness, valid-value-range enforcement, and recommendation checking.
RAG (10%): Retrieval fidelity. IR metrics (Precision@k, Recall@k, MRR, nDCG) for retrieval-enabled agents. Laterality swap detection. Measurement preservation.
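For readers unfamiliar with the IR metrics the RAG dimension lists, the sketch below gives their textbook definitions with binary relevance. How LAIBench defines relevance, chooses k, or combines these values is not stated, so the function names and the example ids are illustrative only.

```javascript
// `ranked` is the retrieved ids in rank order (assumed unique);
// `relevant` is a Set of gold ids.
function hitsAtK(ranked, relevant, k) {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length;
}

// Fraction of the top-k slots filled with relevant items.
function precisionAtK(ranked, relevant, k) {
  return k > 0 ? hitsAtK(ranked, relevant, k) / k : 0;
}

// Fraction of all relevant items that appear in the top k.
function recallAtK(ranked, relevant, k) {
  return relevant.size > 0 ? hitsAtK(ranked, relevant, k) / relevant.size : 0;
}

// Reciprocal rank of the first relevant item (0 if none was retrieved).
function reciprocalRank(ranked, relevant) {
  const index = ranked.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: mean reciprocal rank over a list of { ranked, relevant } queries.
function meanReciprocalRank(queries) {
  if (queries.length === 0) return 0;
  const total = queries.reduce((sum, q) => sum + reciprocalRank(q.ranked, q.relevant), 0);
  return total / queries.length;
}

// Binary-relevance nDCG@k: discounted gain divided by the ideal ordering's gain.
function ndcgAtK(ranked, relevant, k) {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

const ranked = ["d3", "d1", "d7", "d2"];
const relevant = new Set(["d1", "d2"]);
console.log(precisionAtK(ranked, relevant, 2), recallAtK(ranked, relevant, 2));
```

Precision@k and Recall@k ignore order inside the top k, while MRR and nDCG reward placing relevant passages earlier.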

Integrity Guarantees
Locked cases, recomputed scores, and provenance checks keep public leaderboard rows reproducible and tamper-resistant.
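One of the integrity checks the protocol names is a canary-token contamination check. The general idea is to embed a unique string in benchmark material and flag any system output that reproduces it, since that suggests the cases leaked into training data. The sketch below assumes that generic approach; the token value, the output shape `{ caseId, html }`, and the name `findCanaryLeaks` are hypothetical and do not describe LAIBench's actual check.

```javascript
// Return the ids of outputs that contain the canary string (case-insensitive).
function findCanaryLeaks(outputs, canary) {
  if (typeof canary !== "string" || canary.length === 0) {
    throw new Error("canary token must be a non-empty string");
  }
  const needle = canary.toLowerCase();
  return outputs
    .filter((output) => output.html.toLowerCase().includes(needle))
    .map((output) => output.caseId);
}

// Hypothetical token; a real one would be long and random enough that it
// cannot appear by chance.
const CANARY = "canary-example-7f3a9c";
const outputs = [
  { caseId: "case-1", html: "<p>No acute findings.</p>" },
  { caseId: "case-2", html: "<p>Report text CANARY-EXAMPLE-7F3A9C</p>" },
];
console.log(findCanaryLeaks(outputs, CANARY)); // [ 'case-2' ]
```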

Test Your Agent
Any company, lab, or independent team can run the public suite and get a comparable score without exposing internal implementation.
1. Clone the open harness
The scoring harness, schemas, and public synthetic suite are open. Your system stays yours.
# open harness + public suite
git clone https://github.com/laudos-ai/laibench-public
cd laibench-public
npm ci
2. Run your system — one command
Point the harness at your model, agent, or product API. It generates the reports and freezes them.
# any OpenAI-compatible model, agent, or product
npm run bench -- suite \
  --suite suites/lite-public.pt-BR.json \
  --provider command --cmd "node my-agent.mjs" \
  --run-name my-system --out runs/my-system.json
3. Score the submission
Score the generated reports against the locked public cases. Only the public label, validation counts, and scores need to be shared.
reference public run
suite: reference-public.pt-BR
label: public system name
output: score report
4. Publish comparable results
Submit the run artifact for leaderboard review. Proprietary implementation details stay private.
leaderboard artifact
suite hash
public system label
strict PASS rate (only metric)
per-dim diagnostics
run artifact
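Step 2 points the harness at a command such as `node my-agent.mjs`. As a rough illustration of what the core of such an agent might look like, the sketch below turns a case into an HTML report. The input shape `{ exam, findings, context }`, the section headings, and the names `escapeHtml` and `renderReport` are assumptions. How the harness actually hands a case to the command (stdin, file, or arguments) is not described here, so only the pure rendering function is shown, and a real agent would call a model instead of copying findings verbatim.

```javascript
// Escape text so findings cannot inject markup into the report.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assumed input: { exam: string, findings: string[], context?: { history?: string } }.
// Placeholder logic: every provided finding is echoed into a Findings list.
function renderReport(input) {
  const items = input.findings
    .map((finding) => `    <li>${escapeHtml(finding)}</li>`)
    .join("\n");
  const history = input.context?.history
    ? `  <h2>Clinical history</h2>\n  <p>${escapeHtml(input.context.history)}</p>\n`
    : "";
  return (
    `<article>\n` +
    `  <h1>${escapeHtml(input.exam)}</h1>\n` +
    history +
    `  <h2>Findings</h2>\n  <ul>\n${items}\n  </ul>\n` +
    `</article>`
  );
}

const html = renderReport({
  exam: "CT chest",
  findings: ["Nodule <5 mm in the right upper lobe"],
  context: { history: "Cough" },
});
console.log(html);
```

Echoing every finding is the simplest way to avoid dropping a critical one, but it says nothing about terminology or guideline compliance, which the other gates still check.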

Preprint
LAIasBench: An Agent-Centric Benchmark for Radiology Finding-to-Report Generation
Natan Paraiso Ribeiro, Petrus Paraiso Ribeiro, Francisco Akira, Stephanie Alba Herrera, Raquel Moreno — Laudos.AI, Sao Paulo, Brazil
LAIasBench evaluates the exact failure modes that make radiology report generation risky: omitted critical findings, invented abnormalities, title drift, broken section order, missing anatomy, terminology errors, lost measurements, and guideline-classification drift. The public paper frames the task as executable text-agent evaluation from provided findings to full report. The single public metric is Strict PASS: a case passes only when every clinically decisive gate holds. Per-dimension scores remain diagnostic to explain failures. Product agents and mini-agent baselines are tracked separately. Private daily regression now uses a deterministic 40-case split sampled from a synthetic 65,812-report corpus built from extractive seed reports, sentence-level finding links, and randomized compatible finding-set recombination; source corpus and implementation details are not exposed.
40-case private monitor: Small enough for daily regression, sampled from a synthetic 65,812-report corpus.
Strict-PASS reporting: Strict PASS is the single public metric. Per-dim scores explain failures but do not aggregate.
49-case baselines: Mini-agent baselines are reported separately from the full LAIas product-agent reference row.
Implementation privacy: Public artifacts show evaluation class, validation status, and score, not internal implementation or raw case lists.
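The preprint describes the private monitor as a deterministic 40-case split sampled from a 65,812-report corpus. The sampling method itself is not disclosed. One common way to get a reproducible split is a seeded partial Fisher-Yates shuffle, sketched below. The seed value and the names `mulberry32` and `deterministicSample` are illustrative assumptions.

```javascript
// Small seeded PRNG (mulberry32): the same seed always yields the same sequence.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Partial Fisher-Yates shuffle: draws k distinct indices from 0..n-1.
function deterministicSample(n, k, seed) {
  if (k > n) throw new Error("k cannot exceed n");
  const rand = mulberry32(seed);
  const indices = Array.from({ length: n }, (_, i) => i);
  for (let i = 0; i < k; i++) {
    // Swap position i with a random position in the untouched tail.
    const j = i + Math.floor(rand() * (n - i));
    const tmp = indices[i];
    indices[i] = indices[j];
    indices[j] = tmp;
  }
  return indices.slice(0, k);
}

// 40 report indices out of 65,812, fixed by the (hypothetical) seed 42.
const split = deterministicSample(65812, 40, 42);
console.log(split.length); // 40
```

Fixing the seed means the same 40 cases are drawn on every run, so day-to-day score changes reflect the system and not the sample.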