I built my own Claude eval

At AI week, Gian Segato from Anthropic said something offhand that I have not been able to put down. He mentioned that a lot of people inside Anthropic keep their own personal eval for Claude. Not the big public benchmarks. A small, private test, tuned to something they personally care about, that they trust more than any leaderboard to tell them whether a new model is actually better.

That one sentence reorganized how I think about model quality. So I went and built mine.

Why a personal eval beats vibes

Most of us judge models by feel. A new release drops, you throw a few of your favorite hard prompts at it, you get a vibe, you move on. The problem is that vibes do not scale and vibes drift. You cannot remember how the last model handled the exact same prompt three months ago, so "it feels smarter" is doing a lot of unearned work.

Public benchmarks have the opposite problem. They are rigorous, but they are not yours. They measure something generic, they leak into training sets, and they get gamed. A score going up on a public leaderboard tells you something, but rarely the thing you actually care about.

A personal eval splits the difference. It is small enough that you own every example. It is opinionated enough that the number means something to you specifically. And it is private enough that no model has been trained on it. When the score moves, you learn something real.

So I built mine, out of jazz

I decided to make mine out of the one domain where I trust my own judgment more than almost anyone's: jazz. I called it JazzBench, and the ground truth is Charlie Parker.

The task, in one sentence: given the first few chords of one of Parker's solos, plus the chord changes coming next, predict the actual notes Parker played over each of those next chords. Then score that prediction against what he really did on the record.

It is a strange thing to ask a language model to do, which is exactly why I like it.

Why jazz is a good test

Almost every eval out there tests verbal, mathematical, or coding reasoning. Those are hard-edged problems: there is a correct answer, and you mostly reach it by being careful and not making mistakes. Improvisation is a different kind of cognition. It is:

Bounded. The chord changes, the key, and the time are all fixed. You cannot just play anything.
Judgeable. We have Parker's actual solo as ground truth, plus formal music-theory methods to score how close a guess is.
Cognitively rich. It is constraint satisfaction, style, and creativity all at once, in real time, with no single right answer but plenty of obviously wrong ones.

That combination is rare. It is the kind of soft, multi-constraint judgment that human experts make intuitively and that almost no benchmark even tries to measure. If I want to know whether a model has taste under pressure, this is a far better probe than another word problem.

How it gets scored

Because there is no single correct answer, you cannot just check for an exact match. So each prediction is scored against what Parker played using five music-theoretic metrics (the complexity and dissonance deltas count as two):

PC Jaccard: the overlap between the notes the model predicted and the notes Parker actually played.
Interval-vector distance: how far apart those two note sets are in interval space, not just which exact notes they share.
Complexity and dissonance deltas: the error on Parker's own complexity and dissonance measures, so a guess can be "wrong notes, right texture".
Forte-class match: whether the predicted set has the same abstract shape as Parker's, regardless of transposition.
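
To make those definitions concrete, here is a minimal sketch of three of the metrics in plain Python. It is not the repo's scoring code. The function names are hypothetical, the L1 distance between interval vectors is my assumption about what "distance" means, and `transposition_class` is only a stand-in for Forte-class matching: it treats transpositions as equal, whereas Forte classes conventionally also fold in inversion.

```python
from itertools import combinations

def pc_set(notes):
    """Fold MIDI note numbers (or pitch classes) into a set of pitch classes 0-11."""
    return frozenset(n % 12 for n in notes)

def pc_jaccard(pred, truth):
    """Shared pitch classes divided by all pitch classes used by either side."""
    a, b = pc_set(pred), pc_set(truth)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(a & b) / len(a | b)

def interval_vector(notes):
    """Count interval classes 1..6 over every pair of pitch classes in the set."""
    vec = [0] * 6
    for x, y in combinations(sorted(pc_set(notes)), 2):
        d = (y - x) % 12
        vec[min(d, 12 - d) - 1] += 1
    return vec

def interval_vector_distance(pred, truth):
    """L1 distance between interval vectors (the choice of L1 is an assumption)."""
    return sum(abs(p - t) for p, t in zip(interval_vector(pred), interval_vector(truth)))

def transposition_class(notes):
    """Canonical form that is identical for all transpositions of the same set."""
    pcs = pc_set(notes)
    if not pcs:
        return ()
    return min(tuple(sorted((p - root) % 12 for p in pcs)) for root in pcs)

c_major, d_major, c_maj7 = [60, 64, 67], [62, 66, 69], [60, 64, 67, 71]
print(pc_jaccard(c_major, c_maj7))                 # 3 shared pitch classes out of 4
print(interval_vector_distance(c_major, d_major))  # same shape, no shared notes
print(transposition_class(c_major) == transposition_class(d_major))
```

The C major versus D major pair shows why overlap alone is not enough: the two triads share no pitch classes, so Jaccard is zero, yet their interval vectors are identical.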

And it ships with three baselines to beat: sampling randomly from the notes Parker tended to use over that chord, always playing the single most common set for that chord, and a first-order Markov model over the previous segment. The bar is simple. If a frontier model cannot beat "just play the most common thing", it is not really improvising, it is averaging.
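
The three baselines are simple enough to sketch in a few lines. This is an illustration under assumptions, not the shipped implementation: I assume the data is a list of (chord symbol, pitch-class set) segments in solo order, that the random baseline samples whole sets weighted by how often they occur, and that the Markov model predicts the most common successor of the previous segment's set. All names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_by_chord(segments):
    """Count how often each pitch-class set appears over each chord symbol."""
    counts = defaultdict(Counter)
    for chord, pcs in segments:
        counts[chord][frozenset(pcs)] += 1
    return counts

def most_frequent_baseline(counts, chord):
    """Always answer with the single most common set seen over this chord."""
    return counts[chord].most_common(1)[0][0]

def random_baseline(counts, chord, rng):
    """Sample one observed set for this chord, weighted by its frequency."""
    sets = list(counts[chord].keys())
    weights = list(counts[chord].values())
    return rng.choices(sets, weights=weights, k=1)[0]

def fit_markov(segments):
    """First-order transition counts from one segment's set to the next."""
    trans = defaultdict(Counter)
    for (_, prev), (_, nxt) in zip(segments, segments[1:]):
        trans[frozenset(prev)][frozenset(nxt)] += 1
    return trans

def markov_baseline(trans, prev_pcs, fallback):
    """Most common successor of the previous set, or the fallback if unseen."""
    options = trans.get(frozenset(prev_pcs))
    return options.most_common(1)[0][0] if options else fallback

# Toy data for illustration only, not taken from any transcription.
segments = [("C7", {0, 4, 10}), ("F7", {5, 9}), ("C7", {0, 4, 10}),
            ("F7", {3, 5, 9}), ("C7", {0, 4}), ("F7", {5, 9})]
counts = fit_by_chord(segments)
trans = fit_markov(segments)
modal = most_frequent_baseline(counts, "C7")
print(sorted(modal))
print(sorted(random_baseline(counts, "C7", random.Random(0))))
print(sorted(markov_baseline(trans, {5, 9}, fallback=modal)))
```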

I ran Haiku 4.5, Sonnet 4.6, and Opus 4.7 through it. Watching where they land relative to those baselines tells me more about a model than most of what I read on launch day.

What the first batch showed

The one-line version: every Claude tier (Haiku 0.370, Opus 0.400, Sonnet 0.402) beat every baseline on pitch-class overlap with Parker, but Sonnet and Opus are statistically tied, and none of them matched the simple modal-PC-set baseline on interval texture or dissonance proximity. Claude has learned Parker's note vocabulary, but not his characteristic harmonic restraint.

The headline numbers, on note overlap (Jaccard, higher is better): Sonnet 0.402, Opus 0.400, Haiku 0.370, against most-frequent 0.355, Markov 0.327, and random 0.319, with zero parse errors across 399 agent calls. But the interesting stuff is in the texture, and there are five findings I did not expect (the full writeup is in the paper, section 8):

Every Claude tier beats every baseline on note choice. Sonnet lands about 13% above "just play the most common set" and 26% above random, and the same ordering holds on complexity. The models are genuinely picking better notes, not bluffing.
The dumbest baseline is unbeatable on texture. "Always play the modal set for this chord" still wins on interval distance and dissonance, because by construction it sits at the dead center of Parker's interval distribution. Claude picks the right notes more often, but its wrong guesses fall slightly off Parker's textural center.
Sonnet and Opus are a tie. A 0.002 Jaccard gap is noise. Scaling past Sonnet buys nothing here, which points to either a ceiling around 0.40 under this setup or a bottleneck that is representational, not a matter of raw capacity.
Exact-match rate falls as models get bigger. Haiku 0.034, Sonnet 0.026, Opus 0.015. Bigger models share more notes with Parker on average but rarely reproduce his exact set, while smaller models fall back on high-frequency choices that occasionally nail it outright. That inversion between average overlap and exact match is a real story.
Claude plays more dissonant than Parker. Dissonance error sits at 2.09 to 2.43 across all tiers, versus 1.96 for the baseline. Parker's sparse, triadic preference is something none of the models have learned.
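
The percentages in the first finding are relative lifts over a baseline, and they follow directly from the Jaccard scores quoted above. A quick check, with a hypothetical helper name:

```python
# Jaccard scores as quoted in this post.
jaccard = {
    "sonnet": 0.402, "opus": 0.400, "haiku": 0.370,
    "most_frequent": 0.355, "markov": 0.327, "random": 0.319,
}

def relative_lift(score, baseline):
    """Fractional improvement of score over baseline."""
    return (score - baseline) / baseline

for name in ("most_frequent", "random"):
    lift = relative_lift(jaccard["sonnet"], jaccard[name])
    print(f"sonnet vs {name}: {lift:.1%}")

print(f"sonnet vs opus gap: {jaccard['sonnet'] - jaccard['opus']:.3f}")
```

This prints lifts of 13.2% and 26.0%, and a Sonnet versus Opus gap of 0.002.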

None of that is what a leaderboard would have told me. That is the whole point of owning the ruler.

How you run it

The repo gives you two paths. There is a reproducible path that hits the Anthropic API directly, which is the canonical way to reproduce the numbers. And there is a no-key path that runs as a Claude Code skill, using your existing session:

/jazzbench-run               # haiku, 10 tasks
/jazzbench-run sonnet 20
/jazzbench-run opus 5
/jazzbench-run all 10        # haiku + sonnet + opus, one after another

The skill spawns one workflow subagent per task, forces each one to answer in a strict prediction schema, and writes the results out so the same scoring scripts can grade them. The two paths even show up as separate rows in the comparison table, so you can check whether a model behaves differently through the raw API versus inside Claude Code.

Everything is here: github.com/code91/claude-impro-eval. I also wrote up the method, the metrics, and the results as a short paper: JazzBench v0. It all builds on a Charlie Parker time-series pipeline I put together earlier, which does the unglamorous work of turning transcribed solos into interval vectors and pitch classes per chord.

The actual point: build your own

The jazz is incidental. The real takeaway from Gian's comment is that you should not outsource your sense of whether a model is good to a leaderboard. The people closest to these models keep their own private rulers, and so should you.

It does not have to be elaborate. Take 30 minutes of a domain where you have genuine expertise, write down a handful of cases where you know what "good" looks like, and turn them into something you can score. Now every new model release runs into a wall built out of your own taste, instead of a vibe you will not remember next quarter.
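
If it helps to see how little machinery that takes, here is a bare-bones sketch of a personal eval loop. Everything in it is a placeholder: the cases, the substring scorer, and `stub_model`, which stands in for whatever call you use to get a model's answer.

```python
def run_eval(cases, model_fn, score_fn):
    """Average score_fn(model output, expected) over every case."""
    scores = [score_fn(model_fn(c["prompt"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)

def contains_expected(output, expected):
    """Crude scorer: 1.0 if the expected phrase appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0

# Placeholder cases; replace with prompts and answers from your own domain.
cases = [
    {"prompt": "case 1 prompt", "expected": "alpha"},
    {"prompt": "case 2 prompt", "expected": "beta"},
]

def stub_model(prompt):
    """Stand-in for a real model call; answers only the first case correctly."""
    canned = {"case 1 prompt": "The answer is Alpha.", "case 2 prompt": "Not sure."}
    return canned[prompt]

print(run_eval(cases, stub_model, contains_expected))
```

Swap in a scorer that reflects your own judgment and keep the cases fixed, so the number stays comparable from one release to the next.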

This connects to how I think about building FlatNine generally: the durable asset is never the dashboard or the score, it is the context and the judgment underneath it. An eval is just judgment, written down in a form a machine can keep checking for you.

The honest caveats

This is a v0, and I want to be clear about its edges. It only looks at pitch class, so rhythm, articulation, and register are not scored at all, and those are a huge part of what made Parker Parker. It is a single artist in a single era, so it says nothing yet about other improvisers. The models see only symbolic data, never audio. And the Markov baseline is first-order, which is the easy version. All of that is future work, and none of it changes the point: a small eval you actually trust beats a big one you do not.

That is the project. Go build yours.
