Multi-evaluator benchmark · April 14, 2026
Three independent evaluations. Three different rubrics. One unanimous #1.
Average score across three independent evaluators (each used a different rubric).
What this benchmark proves
Three competing AI products. Three different rubrics. Same conclusion.
Measured against the average score the other evaluators gave it, ChatGPT inflated its own score by +10.0 points and Gemini by +4.2. Claude self-deflated by −9.7. None of them disclosed the conflict of interest.
Even with self-promoting bias, none of them could rank themselves above Hone.
What the competitors said about Hone
Claude
Near-lossless extraction. 35/39 detail checkpoints. Captured subcommittee size, accreditation self-study, session timings, formal takeaways, tool-agnostic positioning. Light on editorial interpretation.
ChatGPT
Highest recall and best grounding. Captured 15-person subcommittee, publicly accessible documents, accreditation self-study, president briefing, session flow timings, engagement design, tool-agnostic positioning. Very close to full extraction.
Gemini
Most thorough output. Pulled direct quotes. Only model to break down session flow by exact minute marks. Clean, highly structured headings mapping to proposal components.
Quotes verbatim from each evaluator's analysis. See full evaluator views below.
How this benchmark was run
"tell me about the educase submission"The full scorecard
| Evaluator ↓ / Model → | Hone | ChatGPT | Claude | Gemini |
|---|---|---|---|---|
| Claude (this eval) | 89.4 (A) | 79.9 (B+) | 75.3 (B, self) | 67.5 (B-) |
| ChatGPT (self-eval) | 95 (A) | 91 (A-, self) | 78 (B-) | 80 (B) |
| Gemini (self-eval) | 98 (A+) | 82 (B+) | 92 (A) | 78 (B, self) |

Scores marked "self" are an evaluator grading its own product.
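If you want to check the self-bias figures quoted above, here is a minimal sketch (not part of the benchmark tooling; the variable and function names are ours) that recomputes them from the numeric scores in the table: each product's self-assigned score minus the average score the other two evaluators gave it.

```python
# Minimal sketch: recompute the self-bias figures from the scorecard above.
# Uses only the numeric values from the table; letter grades are omitted.
scores = {
    # evaluator -> {product: score}
    "Claude":  {"Hone": 89.4, "ChatGPT": 79.9, "Claude": 75.3, "Gemini": 67.5},
    "ChatGPT": {"Hone": 95,   "ChatGPT": 91,   "Claude": 78,   "Gemini": 80},
    "Gemini":  {"Hone": 98,   "ChatGPT": 82,   "Claude": 92,   "Gemini": 78},
}

def self_bias(product: str) -> float:
    """Self-assigned score minus the average score from the other evaluators."""
    own = scores[product][product]
    others = [scores[e][product] for e in scores if e != product]
    return own - sum(others) / len(others)

for product in ("ChatGPT", "Gemini", "Claude"):
    print(f"{product}: {self_bias(product):+.1f}")
# Expected (rounded): ChatGPT +10.0, Gemini +4.2, Claude -9.7

# Cross-evaluator average for Hone, the number behind the headline chart
print(f"Hone average: {sum(e['Hone'] for e in scores.values()) / 3:.1f}")  # ~94.1
```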
Where Hone won — and where it lost
We're publishing the misses too because the wins only mean something if the losses are visible.
“The gap isn't about context windows or retrieval. It's about what the AI does with knowledge once it has it.”
— The thesis Hone is building toward. Three frontier models just confirmed it.
Find your vertical
Book a demo and we'll run the same kind of benchmark on a document of your choice.