AI Interviewer · Model Benchmark

Which model, and which thinking effort, should run our AI interviewer?

13 models were judged as the live interviewer on real transcripts, then weighed on speed and cost. Effort (thinking level) can shift speed and cost sharply (gpt-5.5's median time to first token goes from 2.4s at low to 12.8s at high), so every view is sliced by model × low/medium/high effort.

What · Why · How

What

13 models × effort levels
2 personas (eng + Thumpn/Meghana)
4 turns per interview
~880 real replies scored

Why

Model + effort decide how real it sounds, how fast it replies, what it costs
Today it's a default, not a decision
Make it measured + repeatable

How

Same prompt + frozen conversation
Only model/effort varies
Quality = LLM-judge on transcripts
Speed/cost = median of 15 runs, warm/cached
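
The setup above amounts to a fixed grid in which only the model and effort change. A minimal sketch of the speed side, assuming a hypothetical `measure_ttft(model, effort, checkpoint)` callable that replays the frozen conversation up to a checkpoint and returns seconds to first token. The function names and the stub are illustrative, not the harness actually used.

```python
import itertools
import statistics
from typing import Callable, Dict, List, Tuple

RUNS_PER_CELL = 15  # the report takes medians over 15 runs
CHECKPOINTS = ["2nd-msg", "mid", "before-end", "end"]  # the 4 turns scored

Cell = Tuple[str, str, str]  # (model, effort, checkpoint)


def benchmark_speed(
    measure_ttft: Callable[[str, str, str], float],
    models: List[str],
    efforts: List[str],
    runs: int = RUNS_PER_CELL,
) -> Dict[Cell, float]:
    """Median time to first token for every model x effort x checkpoint cell."""
    results: Dict[Cell, float] = {}
    for model, effort, checkpoint in itertools.product(models, efforts, CHECKPOINTS):
        # Prompt and conversation are frozen; only model and effort vary.
        samples = [measure_ttft(model, effort, checkpoint) for _ in range(runs)]
        results[(model, effort, checkpoint)] = statistics.median(samples)
    return results


def fake_measure_ttft(model: str, effort: str, checkpoint: str) -> float:
    """Hypothetical stand-in for a real API call; returns made-up latencies."""
    return {"low": 1.0, "medium": 2.0, "high": 4.0}[effort]


if __name__ == "__main__":
    table = benchmark_speed(fake_measure_ttft, ["example-model"], ["low", "high"])
    for cell, seconds in sorted(table.items()):
        print(cell, seconds)
```

Taking the median rather than the mean keeps a single slow run from dominating a cell, which matters when each cell has only 15 samples.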

The numbers

Why no deterministic checks? Every model passed them 97–98% of the time: frontier models don't make gross errors, so these checks can't separate one model from another. The ranking comes from the LLM judges.

Performance (Thumpn/Meghana, then Eng/Shreyas) · Speed · Cost

Performance · Thumpn (Meghana)

Model	Effort	Overall	Voice	Pushback	Q-fit	n
gpt-5.4	low	92	100	75	100	4
	medium	83	100	75	75	4
	high	75	75	75	75	4
gpt-5.5	high	83	75	75	100	4
kimi-k2.6	medium	83	100	100	50	2
gemini-3.5-flash	medium	78	67	100	67	3
	high	78	100	67	67	3
claude-opus-4-8	medium	75	75	50	100	4
grok-4.3	low	75	75	75	75	8
	high	67	50	75	75	4
claude-opus-4-6	medium	67	75	50	75	4
deepseek-v4-flash	low	75	75	50	100	4
	high	58	50	50	75	4
claude-opus-4-7	low	58	75	50	50	4
	medium	67	75	50	75	4
gemini-3.1-flash-lite	medium	42	25	25	75	4
	high	67	25	75	100	4
claude-sonnet-4-6	medium	50	25	75	50	4
deepseek-v4-pro	low	33	25	25	50	4
	high	67	50	50	100	4
nemotron-3-120b	medium	50	50	25	75	4

Pass-rate % on the 3 differentiating judges. Overall = their mean. Effort = thinking level. Small n per cell — directional.
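
As a worked check of "Overall = their mean", the sketch below recomputes the Overall column from the three judge pass rates. The function name is illustrative, and rounding to the nearest whole percent is an assumption. Because the inputs are already-rounded table values, a recomputed figure could differ from a published one by a point.

```python
from statistics import mean


def overall_score(voice: float, pushback: float, q_fit: float) -> int:
    """Mean of the three differentiating judges' pass rates, as a whole percent."""
    # Assumption: the report rounds to the nearest integer percent.
    return round(mean([voice, pushback, q_fit]))


if __name__ == "__main__":
    # Rows copied from the table above: (Voice, Pushback, Q-fit).
    print(overall_score(100, 75, 100))  # gpt-5.4 · low    -> 92
    print(overall_score(100, 100, 50))  # kimi-k2.6 · medium -> 83
```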

Performance · Eng interviews (Shreyas)

Model	Effort	Overall	Voice	Pushback	Q-fit	n
claude-opus-4-8	medium	68	85	50	70	20
claude-sonnet-4-6	medium	67	75	45	80	20
kimi-k2.6	medium	67	75	67	58	12
gpt-5.5	medium	65	65	60	70	20
	high	63	60	60	70	20
gpt-5.4	low	57	70	60	40	20
	medium	68	60	70	75	20
	high	63	70	50	70	20
claude-opus-4-6	medium	60	60	45	75	20
claude-opus-4-7	low	55	40	40	85	20
	medium	60	55	40	85	20
gemini-3.1-flash-lite	medium	57	45	65	60	20
	high	58	50	65	60	20
deepseek-v4-pro	low	55	45	55	65	20
	high	57	50	55	65	20
deepseek-v4-flash	low	53	40	50	70	20
	high	52	45	50	60	20
gemini-3.5-flash	medium	48	55	30	60	20
	high	54	68	26	68	19
nemotron-3-120b	medium	46	11	47	79	19
grok-4.3	low	41	32	35	55	40
	high	44	37	42	53	19

Speed · time to first token (TTFT), seconds

Model	Effort	Median	2nd-msg	mid	before-end	end
grok-4.3	low	0.8	0.8	0.9	0.8	0.8
	high	0.8	0.7	0.9	0.8	0.9
deepseek-v4-flash	low	1.0	1.0	1.1	1.1	1.0
	high	1.0	1.0	1.1	1.0	1.1
gemini-3.1-flash-lite	medium	1.1	2.2	1.2	1.0	0.9
	high	4.3	4.1	4.5	5.7	3.5
deepseek-v4-pro	low	1.2	1.2	1.1	1.2	1.2
	high	1.2	1.1	1.2	1.2	1.2
qwen3.7-max	low	1.2	1.1	1.2	1.2	1.2
glm-5.2	low	1.5	2.6	1.6	1.5	1.3
	high	1.2	0.9	1.1	1.5	1.4
qwen3.5-397b-a17b	low	2.0	2.0	2.0	2.2	2.1
claude-sonnet-4-6	low	2.3	1.9	4.1	2.7	1.9
	medium	5.5	2.4	7.3	5.7	5.2
gpt-5.5	low	2.4	1.7	3.2	2.7	2.2
	medium	7.3	2.7	7.9	10.3	6.7
	high	12.8	7.4	12.9	15.1	12.8
claude-opus-4-8	low	2.5	3.1	4.2	1.9	1.8
	medium	2.9	3.1	5.4	2.8	2.6
glm-4.7-flash	low	2.5	2.3	2.4	2.8	2.6
	high	2.5	2.4	2.4	2.6	2.6
nemotron-3-super	low	2.7	3.3	3.5	2.5	2.3
	medium	5.7	5.7	5.6	5.9	4.9
claude-opus-4-6	low	4.1	4.4	4.9	3.7	3.0
	medium	6.9	5.3	8.5	11.8	5.3
gemini-3.5-flash	low	4.1	4.0	5.1	2.4	4.1
	medium	6.3	6.4	8.0	4.8	6.2
	high	6.9	6.9	9.1	6.8	0.0
kimi-k2.6	low	24.2	19.2	26.6	23.0	36.8

Cost · USD

Model	Effort	Full interview	2nd-msg	mid	before-end	end	in/out tok
deepseek-v4-flash	low	$0.007	$0.0001	$0.0001	$0.0001	$0.0001	29120/71
	high	$0.008	$0.0002	$0.0002	$0.0002	$0.0002	29120/268
deepseek-v4-pro	low	$0.017	$0.0002	$0.0002	$0.0002	$0.0002	29122/67
	high	$0.023	$0.0004	$0.0004	$0.0005	$0.0003	29120/275
glm-4.7-flash	low	$0.042	$0.0016	$0.0017	$0.0018	$0.0018	28628/74
	high	$0.048	$0.0018	$0.0019	$0.0019	$0.0019	28628/537
gemini-3.1-flash-lite	low	$0.078	—	—	—	—	—/—
	medium	—	$0.0031	$0.0024	$0.0026	$0.0025	29947/226
	high	—	$0.0039	$0.0037	$0.0041	$0.0036	29930/978
grok-4.3	low	$0.169	$0.0055	$0.0062	$0.0060	$0.0061	28158/43
	high	$0.169	$0.0054	$0.0059	$0.0060	$0.0067	28206/42
kimi-k2.6	low	$0.181	$0.0067	$0.0059	$0.0065	$0.0067	28580/360
glm-5.2	low	$0.223	$0.0074	$0.0076	$0.0080	$0.0080	28612/68
	high	$0.260	$0.0083	$0.0095	$0.0113	$0.0085	28586/446
qwen3.5-397b-a17b	low	$0.285	$0.0112	$0.0118	$0.0124	$0.0124	29663/186
nemotron-3-120b	low	$0.363	—	—	—	—	—/—
	medium	$0.376	—	—	—	—	—/—
gemini-3.5-flash	low	$0.371	$0.0184	$0.0195	$0.0205	$0.0212	29932/212
	medium	$0.466	$0.0209	$0.0246	$0.0227	$0.0233	29945/704
	high	$0.546	$0.0224	$0.0257	$0.0253	$0.0284	29945/1024
claude-sonnet-4-6	low	$0.372	$0.0101	$0.0122	$0.0114	$0.0109	31663/98
	medium	—	$0.0108	$0.0127	$0.0124	$0.0117	31663/210
gpt-5.5	low	$0.578	$0.0176	$0.0257	$0.0217	$0.0191	28356/132
	medium	$0.640	$0.0189	$0.0245	$0.0288	$0.0240	28440/242
	high	$0.774	$0.0220	$0.0312	$0.0346	$0.0327	28440/443
claude-opus-4-6	low	$0.604	$0.0175	$0.0181	$0.0194	$0.0183	31710/100
	medium	$0.678	$0.0189	$0.0219	$0.0262	$0.0198	31663/189
claude-opus-4-7	low	$0.828	—	—	—	—	—/—
	medium	$0.839	—	—	—	—	—/—
claude-opus-4-8	low	$0.852	$0.0249	$0.0274	$0.0268	$0.0257	45596/122
	medium	$0.884	$0.0250	$0.0307	$0.0262	$0.0261	45596/148
qwen3.7-max	low	$0.856	$0.0322	$0.0356	$0.0376	$0.0380	29654/566
gpt-5.4	low	$0.937	—	—	—	—	—/—
	medium	$0.969	—	—	—	—	—/—
	high	$1.036	—	—	—	—	—/—
nemotron-3-super	low	—	$0.0142	$0.0148	$0.0155	$0.0156	29858/100
	medium	—	$0.0147	$0.0151	$0.0160	$0.0157	29858/446

Recommendation

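Best model × effort across all three axes: quality, speed, and cost. Live chat needs a fast first response, so any configuration slower than 6s is ruled out first; the rest are compared on quality versus cost. Quality shown is the model's pass rate on the robust eng set, pooled across the effort levels that were scored (Meghana noted where available); speed and cost are for the effort shown.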

Best quality

claude-opus-4-8 · low effort

68% quality · Meghana 75% · 2.5s response · $0.852 / interview

Best value — quality per $

deepseek-v4-flash · low effort

52% quality · Meghana 67% · 1.0s response · $0.007 / interview

Balanced

claude-sonnet-4-6 · low effort

67% quality · Meghana 50% · 2.3s response · $0.372 / interview
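
The rule described in this section (drop anything slower than 6s, then weigh quality against cost) can be sketched as below. The `Candidate` type and function names are illustrative, and the rows are copied from the cards and tables in this report. Reading "best value" as quality divided by cost follows the card label; the "Balanced" pick is a judgment call that this sketch does not try to reproduce.

```python
from dataclasses import dataclass
from typing import List

MAX_TTFT_SECONDS = 6.0  # cutoff stated in the recommendation


@dataclass(frozen=True)
class Candidate:
    model: str
    effort: str
    quality_pct: float   # eng-set quality
    ttft_seconds: float  # median time to first token
    cost_usd: float      # full-interview cost


CANDIDATES = [
    Candidate("claude-opus-4-8", "low", 68, 2.5, 0.852),
    Candidate("claude-sonnet-4-6", "low", 67, 2.3, 0.372),
    Candidate("deepseek-v4-flash", "low", 52, 1.0, 0.007),
    Candidate("gpt-5.5", "high", 63, 12.8, 0.774),
    Candidate("claude-opus-4-6", "medium", 60, 6.9, 0.678),
]


def fast_enough(candidates: List[Candidate]) -> List[Candidate]:
    """Step 1: rule out anything slower than the latency cutoff."""
    return [c for c in candidates if c.ttft_seconds <= MAX_TTFT_SECONDS]


def best_quality(candidates: List[Candidate]) -> Candidate:
    """Highest quality among configurations that pass the speed gate."""
    return max(fast_enough(candidates), key=lambda c: c.quality_pct)


def best_value(candidates: List[Candidate]) -> Candidate:
    """Highest quality per dollar among configurations that pass the speed gate."""
    return max(fast_enough(candidates), key=lambda c: c.quality_pct / c.cost_usd)


if __name__ == "__main__":
    print("best quality:", best_quality(CANDIDATES).model)
    print("best value:", best_value(CANDIDATES).model)
```

Applying the speed gate before ranking keeps a slow but high-scoring configuration from winning on quality alone.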

Full eval detail

Every judge (incl. memory & groundedness) and every turn, for anyone who wants to dig.

Thumpn (Meghana) — all judges × turns

Model	Voice	Pushback	Q-fit	Memory	Grounded	2nd-msg	mid	before-end	end	n
gpt-5.4	92	75	83	100	100	100	78	89	67	12
kimi-k2.6	100	100	50	100	100	100	67	—	—	2
gpt-5.5	75	75	100	100	100	100	67	67	100	4
gemini-3.5-flash	83	83	67	100	100	100	50	83	—	6
claude-opus-4-8	75	50	100	100	100	67	67	100	67	4
grok-4.3	67	75	75	100	92	100	33	89	67	12
deepseek-v4-flash	62	50	88	100	100	67	83	83	33	8
claude-opus-4-6	75	50	75	100	100	33	67	100	67	4
claude-opus-4-7	75	50	62	100	100	100	33	50	67	8
gemini-3.1-flash-lite	25	50	88	100	100	83	50	50	33	8
deepseek-v4-pro	38	38	75	100	100	67	67	33	33	8
nemotron-3-120b	50	25	75	100	100	67	67	0	67	4
claude-sonnet-4-6	25	75	50	100	100	67	33	33	67	4

All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.

Eng interviews (Shreyas) — all judges × turns

Model	Voice	Pushback	Q-fit	Memory	Grounded	2nd-msg	mid	before-end	end	n
claude-opus-4-8	85	50	70	100	100	73	73	80	47	20
kimi-k2.6	75	67	58	100	92	33	50	100	33	12
claude-sonnet-4-6	75	45	80	100	95	60	80	87	40	20
gpt-5.5	62	60	70	100	100	67	67	70	53	40
gpt-5.4	67	60	62	98	100	64	64	76	47	60
claude-opus-4-6	60	45	75	100	90	53	53	93	40	20
claude-opus-4-7	48	40	85	100	98	60	50	67	53	40
gemini-3.1-flash-lite	48	65	60	100	98	57	53	73	47	40
deepseek-v4-pro	48	55	65	98	98	47	57	77	43	40
deepseek-v4-flash	42	50	65	100	95	50	43	67	50	40
gemini-3.5-flash	62	28	64	100	97	57	47	67	33	39
nemotron-3-120b	11	47	79	100	95	40	42	53	47	19
grok-4.3	34	37	54	100	100	29	43	58	38	59

All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.

Full per-checkpoint leaderboard (every persona · every turn · all 5 judges + cost + tokens): see the full leaderboard.

Appendix

Method

One prompt + one fixed candidate conversation; only model/effort varies. At each of 4 checkpoints (2nd-msg, mid, before-end, end) each model produced the interviewer reply. Quality was graded by a fixed LLM judge (claude-sonnet-4-6) on real transcripts; speed (time to first token, TTFT) and cost are medians over 15 runs, warm/cached. Effort labels were normalised onto low/medium/high (think/on/eff-high→high; no-think/off→low).
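
The effort normalisation described at the end of this paragraph can be sketched as a lookup table. Only the five mappings named above come from the report; the function name, the case-insensitive matching, and raising on unknown labels are assumptions.

```python
CANONICAL = {"low", "medium", "high"}

# Provider-specific labels named in the report, mapped onto the shared scale.
ALIASES = {
    "think": "high",
    "on": "high",
    "eff-high": "high",
    "no-think": "low",
    "off": "low",
}


def normalise_effort(label: str) -> str:
    """Map a provider's thinking-effort label onto low/medium/high."""
    key = label.strip().lower()
    if key in CANONICAL:
        return key  # already on the shared scale
    try:
        return ALIASES[key]
    except KeyError:
        # Assumption: fail loudly rather than guess at an unlisted label.
        raise ValueError(f"unknown effort label: {label!r}") from None


if __name__ == "__main__":
    for raw in ["think", "off", "medium", "eff-high"]:
        print(raw, "->", normalise_effort(raw))
```

Failing on an unlisted label, rather than defaulting to medium, keeps a new provider setting from silently landing in the wrong column.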

Judges

Voice — sounds like the real person. Pushback — challenges vague answers. Q-fit — questions specific to this candidate.
Memory & groundedness were ~100% for everyone (excluded from the headline).

Caveats

Thumpn/Meghana per-effort n is small (≈1–3) — directional; eng is the robust set.
2 of ~880 replies were non-responses, excluded. nemotron speed/cost mapped from its sibling SKU.
