Multi-evaluator benchmark · April 14, 2026
Three independent evaluations. Three different rubrics. One unanimous #1.
Average score across three independent evaluators (each used a different rubric).
What this benchmark proves
Three competing AI products. Three different rubrics. Same conclusion.
Measured against the average score the other evaluators gave it, ChatGPT inflated its own score by +10.0 points and Gemini by +4.2. Claude self-deflated by −9.7. None of them disclosed the conflict of interest.
Even with self-promoting bias, none of them could rank themselves above Hone.
What the competitors said about Hone
Claude
Near-lossless extraction. 35/39 detail checkpoints. Captured subcommittee size, accreditation self-study, session timings, formal takeaways, tool-agnostic positioning. Light on editorial interpretation.
ChatGPT
Highest recall and best grounding. Captured 15-person subcommittee, publicly accessible documents, accreditation self-study, president briefing, session flow timings, engagement design, tool-agnostic positioning. Very close to full extraction.
Gemini
Most thorough output. Pulled direct quotes. Only model to break down session flow by exact minute marks. Clean, highly structured headings mapping to proposal components.
Quotes verbatim from each evaluator's analysis. See full evaluator views below.
How this benchmark was run
"tell me about the educase submission"The full scorecard
| Evaluator ↓ / Model → | Hone | ChatGPT | Claude | Gemini |
|---|---|---|---|---|
| Claude (this eval) | 89.4 (A) | 79.9 (B+) | 75.3 (B, self) | 67.5 (B-) |
| ChatGPT (self-eval) | 95 (A) | 91 (A-, self) | 78 (B-) | 80 (B) |
| Gemini (self-eval) | 98 (A+) | 82 (B+) | 92 (A) | 78 (B, self) |

Scores marked "self" are an evaluator grading its own product.
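If you want to check the self-bias figures quoted above, here is a minimal sketch (not part of the benchmark tooling; the variable and function names are ours) that recomputes them from the numeric scores in the table: each product's self-assigned score minus the average score the other two evaluators gave it.

```python
# Minimal sketch: recompute the self-bias figures from the scorecard above.
# Uses only the numeric values from the table; letter grades are omitted.
scores = {
    # evaluator -> {product: score}
    "Claude":  {"Hone": 89.4, "ChatGPT": 79.9, "Claude": 75.3, "Gemini": 67.5},
    "ChatGPT": {"Hone": 95,   "ChatGPT": 91,   "Claude": 78,   "Gemini": 80},
    "Gemini":  {"Hone": 98,   "ChatGPT": 82,   "Claude": 92,   "Gemini": 78},
}

def self_bias(product: str) -> float:
    """Self-assigned score minus the average score from the other evaluators."""
    own = scores[product][product]
    others = [scores[e][product] for e in scores if e != product]
    return own - sum(others) / len(others)

for product in ("ChatGPT", "Gemini", "Claude"):
    print(f"{product}: {self_bias(product):+.1f}")
# Expected (rounded): ChatGPT +10.0, Gemini +4.2, Claude -9.7

# Cross-evaluator average for Hone, the number behind the headline chart
print(f"Hone average: {sum(e['Hone'] for e in scores.values()) / 3:.1f}")  # ~94.1
```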
Where Hone won — and where it lost
We're publishing the misses too because the wins only mean something if the losses are visible.
“The gap isn't about context windows or retrieval. It's about what the AI does with knowledge once it has it.”
— The thesis Hone is building toward. Three frontier models just confirmed it.
Find your vertical
Book a demo and we'll run the same kind of benchmark on a document of your choice.