by Laudos.AI · v2.1

LAIBench

The only public benchmark that ranks AI radiology reporting by patient safety — not text similarity. A report fails if it drops a critical finding, flips a laterality, or negates an urgent one. Even frontier models don’t ace it.


685 Public cases

528 Variant-controlled

8 Adversarial classes

5 Dimensions

3 Locales (PT-BR/EN/ES)

224 Tests

Leaderboard

Public Reference Snapshot

Public rows show only evaluation class and score. Internal implementation details are intentionally withheld.

Evaluation class	Cases	Suite	System	Type	Strict PASS
deterministic safety gate v2	16	hard-frontier.pt-BR	LaudAI Agent Product	Reporting agent	100%
judge-primary reference	49	reference-public.pt-BR	LaudAI Agent Product	Reporting agent	81.6%
mini-agent	10	lite-public.pt-BR	DeepSeek V4 Pro Baseline	Raw model	40.0%
mini-agent	10	lite-public.pt-BR	DeepSeek V4 Flash Baseline	Raw model	20.0%
mini-agent	10	lite-public.pt-BR	inclusionAI Ring-2.6-1T Baseline	Raw model	10.0%
mini-agent	10	lite-public.pt-BR	GPT-5.4 Mini Baseline	Raw model	0.0%
public reference	-	-	Open submissions	Any agent	Submit

Strict PASS is the only metric on the public leaderboard. It counts the cases where the report passes every clinically decisive gate: critical findings preserved, laterality correct, no unsafe negation, structural integrity, and terminology compliance.

The LaudAI Agent hard-frontier.pt-BR run (16 multi-finding spine, hepatic, and vascular cases; 430 gold findings) uses the v2 deterministic entity safety gate, which is concept-level and negation- and laterality-scoped. Every case was audited individually. The gate catches every injected real safety defect (a dropped, negated, or laterality-flipped critical finding) with no false fail on faithful paraphrase or on clinically correct omissions; for example, LI-RADS is not applied to infectious hepatic lesions.

The LaudAI Agent reference run uses the 49-case reference-public.pt-BR suite with judge-primary frontier-blind-v1. The mini-agent baselines (DeepSeek V4 Pro/Flash, inclusionAI Ring-2.6-1T) use the 10-case lite-public.pt-BR suite, judged with anthropic/claude-opus-4.6.

How It Works

Evaluation Protocol

A locked text-generation protocol where Strict PASS is the only public metric: cases pass only when every clinically decisive gate holds.

Input

The evaluated system receives an exam descriptor, clinical findings, and optional context: gold findings, guideline expectations, and patient history. It must generate a complete HTML radiology report.

Evaluation

Five modular evaluators (CRIT, QUAL, TERM, GUIDE, RAG) run first, backed by an extraction layer, synonym matching (30 groups), and negation handling. In Phase 2, an adversarial LLM judge detects hallucinations and missing findings.

Score

Strict PASS is the only public metric: a case passes only when every clinically decisive gate holds (critical findings preserved, laterality correct, no unsafe negation, structural integrity, terminology compliance). All rows publish bootstrap CIs, policy gates, and canary-token contamination checks.
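To make the bootstrap CIs mentioned above concrete, here is a minimal sketch of a percentile bootstrap over per-case Strict PASS outcomes. The function name, the resample count, and the fixed seed are illustrative assumptions; the source does not describe the harness's actual resampling procedure.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the pass rate of binary per-case outcomes.

    Assumption: cases are resampled with replacement, independently.
    """
    if not outcomes:
        raise ValueError("need at least one case")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    rates.sort()
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lo, hi

# Illustrative input: 40 passes out of 49 cases rounds to 81.6%.
outcomes = [1] * 40 + [0] * 9
rate, lo, hi = bootstrap_ci(outcomes)
print(f"Strict PASS {rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With only 10 or 49 cases the interval is wide, which is why small score gaps between rows should be read cautiously.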

Dimensions

Five Independent Axes

Each dimension is scored independently and feeds the Strict PASS gate. A case passes only when every gate holds; failed gates are diagnostic, not negotiable.

How a case becomes a score

Locked input

Exam, findings, and public-safe expectations are fixed before any system runs.

Five checks

CRIT, QUAL, TERM, GUIDE, and RAG score independent failure modes.

Clinical gate

Critical misses, contradictions, or operational failures can force FAIL.

Strict PASS

Strict PASS is the only public ranking metric. Per-dim scores explain failures but do not aggregate.

Strict PASS is binary per case. Any clinically decisive gate failure (missed critical, laterality flip, unsafe negation, broken structure, wrong terminology) makes the case FAIL.
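The all-gates-must-hold rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the harness's code: the gate names paraphrase the five gates listed in the text, and `CaseResult` is a hypothetical container.

```python
from dataclasses import dataclass

# Gate names paraphrase the five clinically decisive gates (assumed identifiers).
GATES = (
    "critical_findings_preserved",
    "laterality_correct",
    "no_unsafe_negation",
    "structural_integrity",
    "terminology_compliance",
)

@dataclass
class CaseResult:
    case_id: str
    gates: dict  # gate name -> bool

def strict_pass(case):
    """A case passes only if every gate is present and True."""
    return all(case.gates.get(name) is True for name in GATES)

def failed_gates(case):
    """Diagnostic only: which gates caused the FAIL."""
    return [name for name in GATES if case.gates.get(name) is not True]

def strict_pass_rate(cases):
    """Fraction of cases that pass every gate."""
    return sum(strict_pass(c) for c in cases) / len(cases)

ok = CaseResult("case-001", {name: True for name in GATES})
flipped = CaseResult("case-002", {**ok.gates, "laterality_correct": False})
print(strict_pass(ok), strict_pass(flipped), failed_gates(flipped))
print(strict_pass_rate([ok, flipped]))
```

A missing gate result counts as a failure here, so an evaluator that crashes cannot silently produce a PASS.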

Clinical weight mix

CRIT 30% · QUAL 25% · TERM 20% · GUIDE 15% · RAG 10%

Reporting rule: Strict PASS is the only public metric. Per-dimension scores are diagnostic to explain gate failures, not for ranking.

CRIT

30%

Critical finding detection. 21 categories with negation handling. Sensitivity (recall) and F1 against gold labels. Missing a pulmonary embolism (PE) or stroke is a critical failure.
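As a rough illustration of critical-finding scoring with negation handling, the sketch below extracts asserted findings from report text and scores them against gold labels. The term map, the negation cues, and the function names are hypothetical; the benchmark's 21 categories and its extraction layer are not published in this page.

```python
import re

# Hypothetical term -> category map (the real category list is not in the source).
TERMS = {
    "pulmonary embolism": "PE",
    "ischemic stroke": "STROKE",
    "pneumothorax": "PNEUMOTHORAX",
}
# Assumed negation cues, matched as whole words.
NEGATION = re.compile(r"\b(no|without|negative for|absence of)\b")

def extract_positive(report):
    """Return categories that are mentioned and not negated.

    Naive scope: a cue anywhere earlier in the same sentence negates the term.
    """
    found = set()
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for term, category in TERMS.items():
            idx = sentence.find(term)
            if idx != -1 and not NEGATION.search(sentence[:idx]):
                found.add(category)
    return found

def prf(gold, predicted):
    """Precision, recall (sensitivity), and F1 over sets of categories."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

report = "Acute pulmonary embolism in the right lower lobe. No pneumothorax."
predicted = extract_positive(report)
print(predicted, prf({"PE"}, predicted))
```

The sentence-wide negation scope is deliberately crude; a sentence such as "No pneumothorax, but pulmonary embolism is present" would be misread, which is the kind of case a scoped negation handler has to get right.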

QUAL

25%

Clinical quality. Severity-aware finding matching (30 synonym groups). Hallucination detection with pertinent negative exclusion. Gold data or reference report comparison.

TERM

20%

Terminology correctness. 14 CBR forbidden terms. 9 forbidden openers. Modality-specific vocabulary. Classification system enforcement (BI-RADS, TI-RADS, PI-RADS, Bosniak, Fleischner, Lung-RADS).

GUIDE

15%

Guideline adherence. 7 pluggable modules with applicability detection, classification correctness, valid-value-range enforcement, and recommendation checking.
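One way to picture a pluggable guideline module is an object with an applicability test and a value check. The sketch below uses BI-RADS assessment categories (0 to 6, with 4A/4B/4C) as the valid range; the class layout, method names, and keyword-based applicability rule are assumptions for illustration only.

```python
import re

class BiradsModule:
    """Hypothetical guideline module: applicability + valid-value check."""
    name = "BI-RADS"
    valid = {"0", "1", "2", "3", "4", "4A", "4B", "4C", "5", "6"}
    pattern = re.compile(r"BI-?RADS\s*(?:category\s*)?(\d+[A-C]?)", re.IGNORECASE)

    def applies(self, exam):
        # Assumed rule: only breast imaging exams need a BI-RADS category.
        text = exam.lower()
        return "mammogra" in text or "breast" in text

    def check(self, report):
        values = [v.upper() for v in self.pattern.findall(report)]
        if not values:
            return {"status": "missing", "values": []}
        bad = [v for v in values if v not in self.valid]
        return {"status": "invalid" if bad else "ok", "values": values}

def run_guidelines(exam, report, modules):
    """Run only the modules that apply to this exam."""
    return {m.name: m.check(report) for m in modules if m.applies(exam)}

modules = [BiradsModule()]
print(run_guidelines("Bilateral mammography", "Impression: BI-RADS 4A.", modules))
print(run_guidelines("Chest CT", "No BI-RADS needed here.", modules))
```

Skipping modules that do not apply keeps a report from being penalized for omitting a classification that would be clinically wrong to use.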

RAG

10%

Retrieval fidelity. IR metrics (Precision@k, Recall@k, MRR, nDCG) for retrieval-enabled agents. Laterality swap detection. Measurement preservation.
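For readers less familiar with the IR metrics named above, here is a minimal sketch using their standard binary-relevance definitions. The function names and the example document IDs are illustrative; how the harness computes or averages these metrics is not stated in the source.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Share of the top-k retrieved items that are relevant (k >= 1)."""
    if k <= 0:
        raise ValueError("k must be >= 1")
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item; MRR is the mean over queries."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # hypothetical retrieval order
relevant = {"d1", "d2"}             # hypothetical gold set
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant), round(ndcg_at_k(ranked, relevant, 3), 4))
```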

Validation

Integrity Guarantees

Locked cases, recomputed scores, and provenance checks keep public leaderboard rows reproducible and tamper-resistant.

01
Deterministic checks are reproducible. Same input = same score. No variance, no subjectivity. Public reference artifacts expose enough method detail for audit without releasing proprietary implementation.
02
Score recomputation. Case overalls, suite summaries, verdict counts, dimension means, comparable keys, and local suite hashes are recomputed before leaderboard publication.
03
Grouped leaderboards. Runs only compete if they share the same suite, locale, track, scaffold, and judge. Impossible to compare incompatible runs.
04
Submission validation. Missing IDs, duplicates, empty outputs, or malformed JSONL make a submission ineligible; public artifacts show counts and sanitized reasons, not raw case lists.
05
Grouped comparisons. Product agents, custom agents, and open baselines never mix in one rank. Each row competes only against comparable runs.
06
Hash integrity. Each suite has a cryptographic hash of its cases. Any data alteration invalidates all prior runs.
07
Canary tokens. Hidden tokens embedded in suite cases detect benchmark contamination. If a submitted system reproduces canary text, its run is flagged as potentially data-contaminated.
08
Bootstrap confidence intervals. Scores include bootstrap CIs for statistical rigor. Leaderboard differences are only meaningful when confidence intervals do not overlap.
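Guarantees 04 and 07 above lend themselves to a short sketch: count the submission defects that make a run ineligible, and flag any report that reproduces a canary. The JSONL field names (`case_id`, `report`), the canary string, and both function names are hypothetical; the real schema ships with the harness.

```python
import json

def validate_submission(jsonl_text, expected_ids):
    """Return (eligible, counts) for a JSONL submission.

    Assumes one JSON object per line with string fields "case_id" and "report".
    IDs outside expected_ids are ignored in this sketch.
    """
    counts = {"malformed": 0, "duplicate": 0, "empty": 0, "missing": 0}
    seen = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
            case_id, report = row["case_id"], row["report"]
        except (json.JSONDecodeError, KeyError, TypeError):
            counts["malformed"] += 1
            continue
        if not isinstance(case_id, str) or not isinstance(report, str):
            counts["malformed"] += 1
            continue
        if case_id in seen:
            counts["duplicate"] += 1
            continue
        seen.add(case_id)
        if not report.strip():
            counts["empty"] += 1
    counts["missing"] = len(set(expected_ids) - seen)
    return not any(counts.values()), counts

def canary_hits(report, canaries):
    """Return the canary tokens reproduced verbatim in a report."""
    return [token for token in canaries if token in report]

submission = "\n".join([
    '{"case_id": "a", "report": "<html>ok</html>"}',
    '{"case_id": "a", "report": "<html>again</html>"}',
    "not json",
    '{"case_id": "b", "report": "   "}',
])
print(validate_submission(submission, {"a", "b", "c"}))
print(canary_hits("... EXAMPLE-CANARY-0000 ...", ["EXAMPLE-CANARY-0000"]))
```

Returning counts rather than the offending case IDs mirrors the stated policy of publishing counts and sanitized reasons, not raw case lists.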

Submit

Test Your Agent

Any company, lab, or independent team can run the public suite and get a comparable score without exposing internal implementation.

1. Clone the open harness

The scoring harness, schemas, and public synthetic suite are open. Your system stays yours.

# open harness + public suite
git clone https://github.com/laudos-ai/laibench-public
cd laibench-public
npm ci

2. Run your system — one command

Point the harness at your model, agent, or product API. It generates the reports and freezes them.

# any OpenAI-compatible model, agent, or product
npm run bench -- suite \
  --suite suites/lite-public.pt-BR.json \
  --provider command --cmd "node my-agent.mjs" \
  --run-name my-system --out runs/my-system.json

3. Score the submission

Score the generated reports against the locked public cases. Only the public label, validation counts, and scores need to be shared.

reference public run
suite: reference-public.pt-BR
label: public system name
output: score report

4. Publish comparable results

Submit the run artifact for leaderboard review. Proprietary implementation details stay private.

leaderboard artifact
suite hash
public system label
strict PASS rate (only metric)
per-dim diagnostics
run artifact
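The suite hash listed in the artifact can be thought of as a digest over a canonical encoding of the cases. The sketch below assumes SHA-256 over sorted, compact JSON and a hypothetical `id` field per case; the harness's actual canonicalization and hash algorithm are not specified in the source.

```python
import hashlib
import json

def suite_hash(cases):
    """SHA-256 over a canonical JSON encoding of the suite's cases.

    Cases are sorted by "id" and keys are sorted, so neither case order
    nor key order changes the digest; any content change does.
    """
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

cases = [
    {"id": "case-001", "findings": ["pneumothorax à direita"]},
    {"id": "case-002", "findings": ["sem alterações"]},
]
reordered = [dict(reversed(list(c.items()))) for c in reversed(cases)]
tampered = [{**cases[0], "findings": ["pneumothorax à esquerda"]}, cases[1]]

print(suite_hash(cases) == suite_hash(reordered))  # same content, same hash
print(suite_hash(cases) == suite_hash(tampered))   # altered case, new hash
```

Comparing the stored hash with a freshly computed one before scoring is what lets any alteration of the cases invalidate earlier runs.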

Paper

Preprint

LAIasBench: An Agent-Centric Benchmark for Radiology Finding-to-Report Generation

Natan Paraiso Ribeiro, Petrus Paraiso Ribeiro, Francisco Akira, Stephanie Alba Herrera, Raquel Moreno — Laudos.AI, Sao Paulo, Brazil

LAIasBench evaluates the exact failure modes that make radiology report generation risky: omitted critical findings, invented abnormalities, title drift, broken section order, missing anatomy, terminology errors, lost measurements, and guideline-classification drift. The public paper frames the task as executable text-agent evaluation from provided findings to full report. The single public metric is Strict PASS: a case passes only when every clinically decisive gate holds. Per-dimension scores remain diagnostic to explain failures. Product agents and mini-agent baselines are tracked separately. Private daily regression now uses a deterministic 40-case split sampled from a synthetic 65,812-report corpus built from extractive seed reports, sentence-level finding links, and randomized compatible finding-set recombination; source corpus and implementation details are not exposed.

40-case private monitor

Small enough for daily regression, sampled from a synthetic 65,812-report corpus.

Strict-PASS reporting

Strict PASS is the single public metric. Per-dim scores explain failures but do not aggregate.

49-case baselines

Mini-agent baselines are reported separately from the full LAIas product-agent reference row.

Implementation privacy

Public artifacts show evaluation class, validation status, and score, not internal implementation or raw case lists.

Download PDF · arXiv source package · Submission metadata · Proprietary license