Multi-evaluator benchmark · April 14, 2026

We asked Claude, ChatGPT, and Gemini to grade our work next to theirs.

Three independent evaluations. Three different rubrics. One unanimous #1.

  • Hone Studio: 94.1
  • ChatGPT (GPT-5.5): 84.3
  • Claude Opus 4.6: 81.8
  • Gemini 3 Pro: 75.2

Average score across three independent evaluators (each used a different rubric).

What this benchmark proves

Three findings that matter.

01

Unanimous #1.

Three competing AI products. Three different rubrics. Same conclusion.

02

They graded themselves.

ChatGPT inflated its own score by +10.0 points over the average the other two evaluators gave it. Gemini inflated by +4.2. Claude self-deflated by −9.7. None of them disclosed the conflict of interest.

03

The bias couldn't bridge the gap.

Even with self-scoring bias in play, none of the three ranked its own output above Hone's.

What the competitors said about Hone

Three frontier models reviewed our output. In their own words:

“Near-lossless extraction. 35/39 detail checkpoints. Captured subcommittee size, accreditation self-study, session timings, formal takeaways, tool-agnostic positioning. Light on editorial interpretation.”

Claude

“Highest recall and best grounding. Captured 15-person subcommittee, publicly accessible documents, accreditation self-study, president briefing, session flow timings, engagement design, tool-agnostic positioning. Very close to full extraction.”

ChatGPT

“Most thorough output. Pulled direct quotes. Only model to break down session flow by exact minute marks. Clean, highly structured headings mapping to proposal components.”

Gemini

Quotes verbatim from each evaluator's analysis. See full evaluator views below.

How this benchmark was run

Same prompt. Four products. Three of them as evaluators. None disclosed they were grading themselves.

  • Source document: A 5-page EDUCAUSE 2026 conference proposal submission. A real, neutral document, not chosen to favor any model.
  • Prompt: "tell me about the educase submission"
  • Products tested: Hone Studio · ChatGPT (GPT-5.5) · Claude Opus 4.6 · Gemini 3 Pro, all at the highest model setting available
  • Evaluators: Claude · ChatGPT · Gemini, each given all four outputs blind to who produced them
  • Rubrics: Each evaluator used its own rubric. Three different scoring approaches. Same conclusion.
  • Date: April 14, 2026
We are not publishing the source document or the full model outputs. They contain proprietary content from the institution that produced the submission. The data shown — scores, detail checklist, evaluator commentary — is the publishable layer.
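
For concreteness, here is a minimal Python sketch of the blinding step described above. It illustrates the idea only; it is not our pipeline, and the placeholder outputs and the "Output A-D" labels are assumptions made for the example.

    import random

    # Illustration only: shuffle the four model outputs and relabel them with
    # neutral names before any evaluator sees them. The "..." placeholders
    # stand in for the real outputs, which are not published.
    outputs = {
        "Hone Studio": "...",
        "ChatGPT (GPT-5.5)": "...",
        "Claude Opus 4.6": "...",
        "Gemini 3 Pro": "...",
    }

    def blind(outputs):
        """Return (blinded, key): anonymized outputs plus the label-to-product map."""
        items = list(outputs.items())
        random.shuffle(items)
        labels = ["Output A", "Output B", "Output C", "Output D"]
        blinded = {label: text for label, (_, text) in zip(labels, items)}
        key = {label: name for label, (name, _) in zip(labels, items)}
        return blinded, key

    blinded, key = blind(outputs)
    # Each evaluator grades `blinded` with its own rubric; `key` is consulted
    # only afterwards, when mapping scores back to products.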

The full scorecard

Dig into the data.

Grades are listed in evaluator order: Claude · ChatGPT · Gemini.

1. Hone Studio
Grades: A · A · A+
Average score: 94.1
Avg rank: 1.0 · Detail capture: 36/39 · Unanimous #1

2. ChatGPT (GPT-5.5)
Grades: B+ · A- · B+
Average score: 84.3
Avg rank: 2.3 · Detail capture: 19/39 · #2 (2 of 3 evals)

3. Claude Opus 4.6
Grades: B · B- · A
Average score: 81.8
Avg rank: 3.0 · Detail capture: 21/39 · #3 avg (most disputed)

4. Gemini 3 Pro
Grades: B- · B · B
Average score: 75.2
Avg rank: 3.7 · Detail capture: 19/39 · #4 (2 of 3 evals)

Score Matrix

Evaluator ↓ / Model →    Hone        ChatGPT      Claude       Gemini
Claude                   A (89.4)    B+ (79.9)    B (75.3)*    B- (67.5)
ChatGPT                  A (95)      A- (91)*     B- (78)      B (80)
Gemini                   A+ (98)     B+ (82)      A (92)       B (78)*

* self-evaluation: each evaluator also scored its own product.
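
If you want to check the arithmetic, the short Python sketch below recomputes the headline averages, the average ranks, and the self-scoring bias figures from nothing but the matrix above. The variable names are ours for the example; the numbers are exactly the published scores.

    # Scores exactly as published in the matrix above: evaluator -> model -> score.
    scores = {
        "Claude":  {"Hone": 89.4, "ChatGPT": 79.9, "Claude": 75.3, "Gemini": 67.5},
        "ChatGPT": {"Hone": 95.0, "ChatGPT": 91.0, "Claude": 78.0, "Gemini": 80.0},
        "Gemini":  {"Hone": 98.0, "ChatGPT": 82.0, "Claude": 92.0, "Gemini": 78.0},
    }
    models = ["Hone", "ChatGPT", "Claude", "Gemini"]

    # Average score per model across the three evaluators.
    averages = {m: round(sum(ev[m] for ev in scores.values()) / 3, 1) for m in models}
    # -> Hone 94.1, ChatGPT 84.3, Claude 81.8, Gemini 75.2

    # Average rank per model (rank 1 = highest score within each evaluation).
    ranks = {m: [] for m in models}
    for ev in scores.values():
        ordered = sorted(models, key=lambda m: ev[m], reverse=True)
        for m in models:
            ranks[m].append(ordered.index(m) + 1)
    avg_rank = {m: round(sum(r) / 3, 1) for m, r in ranks.items()}
    # -> Hone 1.0, ChatGPT 2.3, Claude 3.0, Gemini 3.7

    # Self-scoring bias: a model's self-assigned score minus the average score
    # the other two evaluators gave it.
    for evaluator in ["ChatGPT", "Claude", "Gemini"]:
        others = [scores[e][evaluator] for e in scores if e != evaluator]
        bias = scores[evaluator][evaluator] - sum(others) / len(others)
        print(f"{evaluator}: {bias:+.1f}")
    # -> ChatGPT +10.0, Claude -9.7, Gemini +4.2

The headline scores, ranks, and bias figures are all derivable from the published matrix; nothing else is needed to check them.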

Where Hone won — and where it lost

Showing both sides of the data.

12 things only Hone caught

  • Data gov / knowledge gov question framing
  • "It's everywhere but does nothing" context
  • CCRI 15-person subcommittee
  • Only publicly accessible documents used
  • Second app: accreditation self-study
  • Session timing breakdown (0:00-0:12 etc.)
  • Three formal takeaways listed
  • Evidence shown not just described
  • Materials confirmed as existing/ready
  • Mazin: Dept of Art, Art History and Design
  • Lawrence: MPA from URI / Rising Star Award
  • Todd: 100M+ mobile devices

3 things Hone missed that competitors caught

  • Keywords listed (Gemini 3 Pro only)
  • AI defaulted to peer institution language (ChatGPT (GPT-5.5) only)
  • EDUCAUSE Submission field interpretation (ChatGPT (GPT-5.5) only)

We're publishing the misses too because the wins only mean something if the losses are visible.

“The gap isn't about context windows or retrieval. It's about what the AI does with knowledge once it has it.”

— The thesis Hone is building toward. Three frontier models just confirmed it.

Find your vertical

See more for your kind of work

If you work in higher ed
If you work in boutique professional services

This is one document. We do this every day with your work.

Book a demo and we'll run the same kind of benchmark on a document of your choice.

Book a demo →