by Laudos.AI · v2.1
LAIBench
The only public benchmark that ranks AI radiology reporting by patient safety — not text similarity. A report fails if it drops a critical finding, flips a laterality, or negates an urgent one. Even frontier models don’t ace it.
685 Public cases
528 Variant-controlled
8 Adversarial classes
5 Dimensions
3 Locales (PT-BR/EN/ES)
224 Tests
Leaderboard
Public Reference Snapshot
Public rows show only evaluation class and score. Internal implementation details are intentionally withheld.
| Evaluation | System | Type | Strict PASS |
|---|---|---|---|
| deterministic safety gate v2 · 16 cases · hard-frontier.pt-BR | LaudAI Agent (Product) | Reporting agent | 100% |
| judge-primary reference · 49 cases · reference-public.pt-BR | LaudAI Agent (Product) | Reporting agent | 81.6% |
| mini-agent · 10 cases · lite-public.pt-BR | DeepSeek V4 Pro (Baseline) | Raw model | 40.0% |
| mini-agent · 10 cases · lite-public.pt-BR | DeepSeek V4 Flash (Baseline) | Raw model | 20.0% |
| mini-agent · 10 cases · lite-public.pt-BR | inclusionAI Ring-2.6-1T (Baseline) | Raw model | 10.0% |
| mini-agent · 10 cases · lite-public.pt-BR | GPT-5.4 Mini (Baseline) | Raw model | 0.0% |
| public reference | Open submissions (Open) | Any agent | Submit |
Strict PASS is the only metric on the public leaderboard. It counts cases where the report passes every clinically decisive gate (critical findings preserved, laterality correct, no unsafe negation, structural integrity, terminology compliance).
The LaudAI Agent hard-frontier.pt-BR run (16 multi-finding spine/hepatic/vascular cases, 430 gold findings) uses the v2 deterministic entity safety gate (concept-level, negation- and laterality-scoped). Every case was audited individually; the gate catches every injected real safety defect (a dropped, negated, or laterality-flipped critical finding) with no false fail on faithful paraphrase or on clinically correct omissions (e.g. LI-RADS is not applied to infectious hepatic lesions).
LaudAI Agent reference run: 49-case reference-public.pt-BR with judge-primary frontier-blind-v1. Mini-agent baselines (DeepSeek V4 Pro/Flash, inclusionAI Ring-2.6-1T): 10-case lite-public.pt-BR, judged with anthropic/claude-opus-4.6.
How It Works
Evaluation Protocol
A locked text-generation protocol where Strict PASS is the only public metric: cases pass only when every clinically decisive gate holds.
01
Input
The evaluated system receives an exam descriptor, clinical findings, and optional context: gold findings, guideline expectations, and patient history. It must generate a complete HTML radiology report.
02
Evaluation
5 modular evaluators (CRIT, QUAL, TERM, GUIDE, RAG) with an extraction layer, synonym matching (30 groups), and negation handling. Phase 2: Adversarial LLM judge detects hallucinations and missing findings.
03
Score
Strict PASS is the only public metric: a case passes only when every clinically decisive gate holds (critical findings preserved, laterality correct, no unsafe negation, structural integrity, terminology compliance). All rows publish bootstrap CIs, policy gates, and canary-token contamination checks.
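The synonym matching and negation handling mentioned in step 02 can be pictured with a small sketch. Everything here is an illustrative assumption rather than LAIBench's actual rule set: the synonym group, the negation cues, the `finding_status` name, and the use of English text instead of PT-BR.

```python
import re

# Hypothetical synonym group: each surface form maps to the same concept.
SYNONYMS = {
    "pulmonary embolism": ["pulmonary embolism", "pulmonary thromboembolism", "pe"],
}
# Hypothetical negation cues; a cue negates a term that follows it in the same sentence.
NEGATION_CUES = ["no evidence of", "no signs of", "negative for", "without", "no"]


def finding_status(report_text: str, concept: str) -> str:
    """Return 'affirmed', 'negated', or 'absent' for one concept in a report."""
    status = "absent"
    for sentence in re.split(r"[.;\n]+", report_text.lower()):
        for term in SYNONYMS[concept]:
            match = re.search(rf"\b{re.escape(term)}\b", sentence)
            if match is None:
                continue
            prefix = sentence[: match.start()]
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", prefix) for cue in NEGATION_CUES
            )
            if not negated:
                return "affirmed"  # one affirmed mention is enough
            status = "negated"
    return status
```

Under these assumptions, a gold critical finding that comes back as `negated` or `absent` would be the kind of defect a critical-finding gate is meant to catch.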
Dimensions
Five Independent Axes
Each dimension is scored independently and feeds the Strict PASS gate. A case passes only when every gate holds; failed gates are diagnostic, not negotiable.
How a case becomes a score
01
Locked input
Exam, findings, and public-safe expectations are fixed before any system runs.
02
Five checks
CRIT, QUAL, TERM, GUIDE, and RAG score independent failure modes.
03
Clinical gate
Critical misses, contradictions, or operational failures can force FAIL.
04
Strict PASS
Strict PASS is the only public ranking metric. Per-dim scores explain failures but do not aggregate.
Strict PASS is binary per case. Any clinically decisive gate failure (missed critical, laterality flip, unsafe negation, broken structure, wrong terminology) makes the case FAIL.
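The all-gates-must-hold rule can be sketched as follows. The five fields mirror the gates named above, but the class, field, and function names are hypothetical and not taken from the harness.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateResult:
    # One boolean per clinically decisive gate (field names are assumptions).
    critical_findings_preserved: bool
    laterality_correct: bool
    no_unsafe_negation: bool
    structural_integrity: bool
    terminology_compliance: bool


def strict_pass(result: GateResult) -> bool:
    """A case passes only when every gate holds."""
    return all(getattr(result, f.name) for f in fields(result))


def failed_gates(result: GateResult) -> list[str]:
    """Diagnostic view: which gates caused the FAIL."""
    return [f.name for f in fields(result) if not getattr(result, f.name)]


def strict_pass_rate(results: list[GateResult]) -> float:
    """Percentage of cases that pass every gate."""
    if not results:
        raise ValueError("no cases to score")
    return 100.0 * sum(strict_pass(r) for r in results) / len(results)
```

As a worked example of the arithmetic, 40 passing cases out of 49 gives a rate of 81.6% after rounding to one decimal place.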
Clinical weight mix
Reporting rule: Strict PASS is the only public metric. Per-dimension scores are diagnostic to explain gate failures, not for ranking.
CRIT
30%
Critical finding detection. 21 categories with negation handling. Sensitivity (recall) and F1 against gold labels. Missing a pulmonary embolism (PE) or stroke is a critical failure.
QUAL
25%
Clinical quality. Severity-aware finding matching (30 synonym groups). Hallucination detection with pertinent negative exclusion. Gold data or reference report comparison.
TERM
20%
Terminology correctness. 14 CBR forbidden terms. 9 forbidden openers. Modality-specific vocabulary. Classification system enforcement (BI-RADS, TI-RADS, PI-RADS, Bosniak, Fleischner, Lung-RADS).
GUIDE
15%
Guideline adherence. 7 pluggable modules with applicability detection, classification correctness, valid-value-range enforcement, and recommendation checking.
RAG
10%
Retrieval fidelity. IR metrics (Precision@k, Recall@k, MRR, nDCG) for retrieval-enabled agents. Laterality swap detection. Measurement preservation.
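The IR metrics named under RAG have standard definitions. The sketch below uses the textbook binary-relevance forms. The function names and the choice of binary relevance are assumptions, and the harness may compute them differently.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item; MRR is the mean of this over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```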
Validation
Integrity Guarantees
Locked cases, recomputed scores, and provenance checks keep public leaderboard rows reproducible and tamper-resistant.
01
Deterministic checks are reproducible. Same input = same score. No variance, no subjectivity. Public reference artifacts expose enough method detail for audit without releasing proprietary implementation.
02
Score recomputation. Case overalls, suite summaries, verdict counts, dimension means, comparable keys, and local suite hashes are recomputed before leaderboard publication.
03
Grouped leaderboards. Runs only compete if they share the same suite, locale, track, scaffold, and judge. Impossible to compare incompatible runs.
04
Submission validation. Missing IDs, duplicates, empty outputs, or malformed JSONL make a submission ineligible; public artifacts show counts and sanitized reasons, not raw case lists.
05
Grouped comparisons. Product agents, custom agents, and open baselines never mix in one rank. Each row competes only against comparable runs.
06
Hash integrity. Each suite has a cryptographic hash of its cases. Any data alteration invalidates all prior runs.
07
Canary tokens. Hidden tokens embedded in suite cases detect benchmark contamination. If a submitted system reproduces canary text, its run is flagged as potentially data-contaminated.
08
Bootstrap confidence intervals. Scores include bootstrap CIs for statistical rigor. Leaderboard differences are only meaningful when confidence intervals do not overlap.
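A bootstrap confidence interval for a pass rate, as in item 08, can be sketched with a percentile bootstrap over per-case outcomes. The resample count, the 95% level, the fixed seed, and the function names are assumptions chosen for illustration, not the harness's published settings.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pass rate; outcomes is a list of 0/1 per case."""
    if not outcomes:
        raise ValueError("no outcomes to resample")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With only 10 cases per run, an interval computed this way is wide, which is one reason to compare intervals rather than point scores.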
Submit
Test Your Agent
Any company, lab, or independent team can run the public suite and get a comparable score without exposing internal implementation.
1. Clone the open harness
The scoring harness, schemas, and public synthetic suite are open. Your system stays yours.
# open harness + public suite
git clone https://github.com/laudos-ai/laibench-public
cd laibench-public
npm ci
2. Run your system — one command
Point the harness at your model, agent, or product API. It generates the reports and freezes them.
# any OpenAI-compatible model, agent, or product
npm run bench -- suite \
--suite suites/lite-public.pt-BR.json \
--provider command --cmd "node my-agent.mjs" \
--run-name my-system --out runs/my-system.json
3. Score the submission
Score the generated reports against the locked public cases. Only the public label, validation counts, and scores need to be shared.
reference public run
suite: reference-public.pt-BR
label: public system name
output: score report
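The validation counts shared with a score can be pictured as a pre-check over the submission file. This sketch assumes one JSON object per line with hypothetical `id` and `report` fields. The real schema is defined by the harness and may differ.

```python
import json
from collections import Counter


def validate_submission(jsonl_text: str, expected_ids: list[str]) -> dict:
    """Count malformed, duplicate, empty, and missing rows; return counts only."""
    counts = Counter()
    seen = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            counts["malformed"] += 1
            continue
        if not isinstance(row, dict) or not isinstance(row.get("id"), str):
            counts["malformed"] += 1
            continue
        if row["id"] in seen:
            counts["duplicate"] += 1
            continue
        seen.add(row["id"])
        if not str(row.get("report") or "").strip():
            counts["empty_output"] += 1
    counts["missing"] = len(set(expected_ids) - seen)
    problems = {reason: n for reason, n in counts.items() if n}
    # Only counts are returned, never the offending case IDs.
    return {"eligible": not problems, "counts": problems}
```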
4. Publish comparable results
Submit the run artifact for leaderboard review. Proprietary implementation details stay private.
leaderboard artifact
suite hash
public system label
strict PASS rate (only metric)
per-dim diagnostics
run artifact
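The suite hash in the artifact ties a score to an exact set of cases. A minimal sketch of such a hash, assuming SHA-256 over a canonical JSON encoding and a hypothetical `id` field per case (the harness may use a different algorithm or encoding):

```python
import hashlib
import json


def suite_hash(cases: list[dict]) -> str:
    """SHA-256 hex digest of the cases, independent of case order and key order."""
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(
        ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any edit to a case changes the digest, so a run scored against an altered suite no longer matches the published hash.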
Paper
Preprint
LAIasBench: An Agent-Centric Benchmark for Radiology Finding-to-Report Generation
LAIasBench evaluates the exact failure modes that make radiology report generation risky: omitted critical findings, invented abnormalities, title drift, broken section order, missing anatomy, terminology errors, lost measurements, and guideline-classification drift. The public paper frames the task as executable text-agent evaluation from provided findings to full report. The single public metric is Strict PASS: a case passes only when every clinically decisive gate holds. Per-dimension scores remain diagnostic to explain failures. Product agents and mini-agent baselines are tracked separately. Private daily regression now uses a deterministic 40-case split sampled from a synthetic 65,812-report corpus built from extractive seed reports, sentence-level finding links, and randomized compatible finding-set recombination; source corpus and implementation details are not exposed.
40-case private monitor: Small enough for daily regression, sampled from a synthetic 65,812-report corpus.
Strict-PASS reporting: Strict PASS is the single public metric. Per-dim scores explain failures but do not aggregate.
49-case baselines: Mini-agent baselines are reported separately from the full LAIas product-agent reference row.
Implementation privacy: Public artifacts show evaluation class, validation status, and score, not internal implementation or raw case lists.