AI Interviewer · Model Benchmark

Which model — and thinking effort —
should run our AI interviewer?

13 models judged as the live interviewer on real transcripts, then weighed on speed and cost. Effort (thinking level) moves speed and cost massively, so every view is sliced model × low/medium/high.

What · Why · How

What

  • 13 models × effort levels
  • 2 personas (eng + Thumpn/Meghana)
  • 4 turns per interview
  • ~880 real replies scored

Why

  • Model + effort decides how real, how fast, what cost
  • Today it's a default, not a decision
  • Make it measured + repeatable

How

  • Same prompt + frozen conversation
  • Only model/effort varies
  • Quality = LLM-judge on transcripts
  • Speed/cost = 15 runs, cached

The numbers

Why no deterministic checks? They were 97–98% for every model — frontier models don't make gross errors, so they can't rank anything. The ranking is the LLM judges.

Performance
Speed
Cost
Thumpn (Meghana)
Eng interviews (Shreyas)
ModelEffortOverallVoicePushbackQ-fitn
gpt-5.4low92100751004
medium8310075754
high757575754
gpt-5.5high8375751004
kimi-k2.6medium83100100502
gemini-3.5-flashmedium7867100673
high7810067673
claude-opus-4-8medium7575501004
grok-4.3low757575758
high675075754
claude-opus-4-6medium677550754
deepseek-v4-flashlow7575501004
high585050754
claude-opus-4-7low587550504
medium677550754
gemini-3.1-flash-litemedium422525754
high6725751004
claude-sonnet-4-6medium502575504
deepseek-v4-prolow332525504
high6750501004
nemotron-3-120bmedium505025754

Pass-rate % on the 3 differentiating judges. Overall = their mean. Effort = thinking level. Small n per cell — directional.

Recommendation

Best model × effort considering all three. Live chat needs a fast first response, so anything slower than 6s is ruled out first; among the rest, quality vs cost. Quality shown is the robust eng set (Meghana noted where available).

Best quality
claude-opus-4-8 · low effort
68% quality · Meghana 75%2.5s response$0.852 / interview
Best value — quality per $
deepseek-v4-flash · low effort
52% quality · Meghana 67%1.0s response$0.007 / interview
Balanced
claude-sonnet-4-6 · low effort
67% quality · Meghana 50%2.3s response$0.372 / interview

Full eval detail

Every judge (incl. memory & groundedness) and every turn, for anyone who wants to dig.

Thumpn (Meghana) — all judges × turns
ModelVoicePushbackQ-fitMemoryGrounded2nd-msgmidbefore-endendn
gpt-5.492758310010010078896712
kimi-k2.610010050100100100672
gpt-5.5757510010010010067671004
gemini-3.5-flash83836710010010050836
claude-opus-4-875501001001006767100674
grok-4.36775751009210033896712
deepseek-v4-flash625088100100678383338
claude-opus-4-67550751001003367100674
claude-opus-4-77550621001001003350678
gemini-3.1-flash-lite255088100100835050338
deepseek-v4-pro383875100100676733338
nemotron-3-120b50257510010067670674
claude-sonnet-4-6257550100100673333674

All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.

Eng interviews (Shreyas) — all judges × turns
ModelVoicePushbackQ-fitMemoryGrounded2nd-msgmidbefore-endendn
claude-opus-4-88550701001007373804720
kimi-k2.67567581009233501003312
claude-sonnet-4-6754580100956080874020
gpt-5.56260701001006767705340
gpt-5.4676062981006464764760
claude-opus-4-6604575100905353934020
claude-opus-4-7484085100986050675340
gemini-3.1-flash-lite486560100985753734740
deepseek-v4-pro48556598984757774340
deepseek-v4-flash425065100955043675040
gemini-3.5-flash622864100975747673339
nemotron-3-120b114779100954042534719
grok-4.33437541001002943583859

All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.

Full per-checkpoint leaderboard (every persona · every turn · all 5 judges + cost + tokens): open the full leaderboard →

Appendix

Method

One prompt + one fixed candidate conversation; only model/effort varies. At 4 checkpoints each model produced the interviewer reply. Quality graded by a fixed LLM judge (claude-sonnet-4-6) on real transcripts; speed (TTFT) and cost are medians over 15 runs, warm/cached. Effort labels normalised onto low/medium/high (think/on/eff-high→high; no-think/off→low).

Judges
  • Voice — sounds like the real person. Pushback — challenges vague answers. Q-fit — questions specific to this candidate.
  • Memory & groundedness were ~100% for everyone (excluded from the headline).
Caveats
  • Thumpn/Meghana per-effort n is small (≈1–3) — directional; eng is the robust set.
  • 2 of ~880 replies were non-responses, excluded. nemotron speed/cost mapped from its sibling SKU.