13 models were judged as the live interviewer on real transcripts, then weighed on speed and cost. Effort (thinking level) can shift speed several-fold and also raises cost, so every view is sliced by model × effort (low/medium/high).
Why no deterministic checks? Every model passed them at 97–98%: frontier models don't make gross errors, so these checks can't separate the models. The ranking comes from the LLM judges.
| Model | Effort | Overall | Voice | Pushback | Q-fit | n |
|---|---|---|---|---|---|---|
| gpt-5.4 | low | 92 | 100 | 75 | 100 | 4 |
| gpt-5.4 | medium | 83 | 100 | 75 | 75 | 4 |
| gpt-5.4 | high | 75 | 75 | 75 | 75 | 4 |
| gpt-5.5 | high | 83 | 75 | 75 | 100 | 4 |
| kimi-k2.6 | medium | 83 | 100 | 100 | 50 | 2 |
| gemini-3.5-flash | medium | 78 | 67 | 100 | 67 | 3 |
| gemini-3.5-flash | high | 78 | 100 | 67 | 67 | 3 |
| claude-opus-4-8 | medium | 75 | 75 | 50 | 100 | 4 |
| grok-4.3 | low | 75 | 75 | 75 | 75 | 8 |
| grok-4.3 | high | 67 | 50 | 75 | 75 | 4 |
| claude-opus-4-6 | medium | 67 | 75 | 50 | 75 | 4 |
| deepseek-v4-flash | low | 75 | 75 | 50 | 100 | 4 |
| deepseek-v4-flash | high | 58 | 50 | 50 | 75 | 4 |
| claude-opus-4-7 | low | 58 | 75 | 50 | 50 | 4 |
| claude-opus-4-7 | medium | 67 | 75 | 50 | 75 | 4 |
| gemini-3.1-flash-lite | medium | 42 | 25 | 25 | 75 | 4 |
| gemini-3.1-flash-lite | high | 67 | 25 | 75 | 100 | 4 |
| claude-sonnet-4-6 | medium | 50 | 25 | 75 | 50 | 4 |
| deepseek-v4-pro | low | 33 | 25 | 25 | 50 | 4 |
| deepseek-v4-pro | high | 67 | 50 | 50 | 100 | 4 |
| nemotron-3-120b | medium | 50 | 50 | 25 | 75 | 4 |
Pass-rate % on the 3 differentiating judges (Voice, Pushback, Q-fit). Overall = their mean. Effort = thinking level. n per cell is small, so treat these numbers as directional.
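To make "directional" concrete, here is a minimal sketch of how wide the uncertainty on a single pass-rate is at these sample sizes. It uses a Wilson score interval, which is an illustration chosen here and not a method the benchmark itself reports; it also assumes `n` is the number of graded replies behind each judge's pass-rate.

```python
import math


def wilson_interval(pass_rate_pct: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass-rate, in percent.

    Assumption: each of the n graded replies is an independent pass/fail trial.
    """
    p = pass_rate_pct / 100.0
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)


# A 75% pass-rate at n = 4 (3 of 4 replies passed) versus the same rate at n = 20.
lo4, hi4 = wilson_interval(75, 4)
lo20, hi20 = wilson_interval(75, 20)
print(f"n=4:  {lo4:.0f}% to {hi4:.0f}%")
print(f"n=20: {lo20:.0f}% to {hi20:.0f}%")
```

At n = 4 the interval spans most of the scale, so gaps of 10 to 20 points between rows in this table should not be read as a ranking.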
| Model | Effort | Overall | Voice | Pushback | Q-fit | n |
|---|---|---|---|---|---|---|
| claude-opus-4-8 | medium | 68 | 85 | 50 | 70 | 20 |
| claude-sonnet-4-6 | medium | 67 | 75 | 45 | 80 | 20 |
| kimi-k2.6 | medium | 67 | 75 | 67 | 58 | 12 |
| gpt-5.5 | medium | 65 | 65 | 60 | 70 | 20 |
| gpt-5.5 | high | 63 | 60 | 60 | 70 | 20 |
| gpt-5.4 | low | 57 | 70 | 60 | 40 | 20 |
| gpt-5.4 | medium | 68 | 60 | 70 | 75 | 20 |
| gpt-5.4 | high | 63 | 70 | 50 | 70 | 20 |
| claude-opus-4-6 | medium | 60 | 60 | 45 | 75 | 20 |
| claude-opus-4-7 | low | 55 | 40 | 40 | 85 | 20 |
| claude-opus-4-7 | medium | 60 | 55 | 40 | 85 | 20 |
| gemini-3.1-flash-lite | medium | 57 | 45 | 65 | 60 | 20 |
| gemini-3.1-flash-lite | high | 58 | 50 | 65 | 60 | 20 |
| deepseek-v4-pro | low | 55 | 45 | 55 | 65 | 20 |
| deepseek-v4-pro | high | 57 | 50 | 55 | 65 | 20 |
| deepseek-v4-flash | low | 53 | 40 | 50 | 70 | 20 |
| deepseek-v4-flash | high | 52 | 45 | 50 | 60 | 20 |
| gemini-3.5-flash | medium | 48 | 55 | 30 | 60 | 20 |
| gemini-3.5-flash | high | 54 | 68 | 26 | 68 | 19 |
| nemotron-3-120b | medium | 46 | 11 | 47 | 79 | 19 |
| grok-4.3 | low | 41 | 32 | 35 | 55 | 40 |
| grok-4.3 | high | 44 | 37 | 42 | 53 | 19 |
Pass-rate % on the 3 differentiating judges. Overall = their mean. Effort = thinking level.
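The "Overall = their mean" rule can be checked directly against the rows above. A minimal sketch, assuming the published Overall is the rounded mean of the three displayed percentages (the real pipeline may average unrounded rates); the function name `overall` is hypothetical:

```python
def overall(voice: int, pushback: int, q_fit: int) -> int:
    """Mean of the three differentiating judges, rounded to a whole percent."""
    return round((voice + pushback + q_fit) / 3)


# (Voice, Pushback, Q-fit) copied from rows of the table above.
rows = {
    ("claude-opus-4-8", "medium"): (85, 50, 70),
    ("gpt-5.4", "low"): (70, 60, 40),
    ("nemotron-3-120b", "medium"): (11, 47, 79),
}

for (model, effort), judges in rows.items():
    print(model, effort, overall(*judges))
```

Because Overall is an unweighted mean, a model can reach the same score with very different profiles: nemotron-3-120b gets most of its 46 from Q-fit while scoring 11 on Voice.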
| Model | Effort | Median | 2nd-msg | mid | before-end | end |
|---|---|---|---|---|---|---|
| grok-4.3 | low | 0.8 | 0.8 | 0.9 | 0.8 | 0.8 |
| grok-4.3 | high | 0.8 | 0.7 | 0.9 | 0.8 | 0.9 |
| deepseek-v4-flash | low | 1.0 | 1.0 | 1.1 | 1.1 | 1.0 |
| deepseek-v4-flash | high | 1.0 | 1.0 | 1.1 | 1.0 | 1.1 |
| gemini-3.1-flash-lite | medium | 1.1 | 2.2 | 1.2 | 1.0 | 0.9 |
| gemini-3.1-flash-lite | high | 4.3 | 4.1 | 4.5 | 5.7 | 3.5 |
| deepseek-v4-pro | low | 1.2 | 1.2 | 1.1 | 1.2 | 1.2 |
| deepseek-v4-pro | high | 1.2 | 1.1 | 1.2 | 1.2 | 1.2 |
| qwen3.7-max | low | 1.2 | 1.1 | 1.2 | 1.2 | 1.2 |
| glm-5.2 | low | 1.5 | 2.6 | 1.6 | 1.5 | 1.3 |
| glm-5.2 | high | 1.2 | 0.9 | 1.1 | 1.5 | 1.4 |
| qwen3.5-397b-a17b | low | 2.0 | 2.0 | 2.0 | 2.2 | 2.1 |
| claude-sonnet-4-6 | low | 2.3 | 1.9 | 4.1 | 2.7 | 1.9 |
| claude-sonnet-4-6 | medium | 5.5 | 2.4 | 7.3 | 5.7 | 5.2 |
| gpt-5.5 | low | 2.4 | 1.7 | 3.2 | 2.7 | 2.2 |
| gpt-5.5 | medium | 7.3 | 2.7 | 7.9 | 10.3 | 6.7 |
| gpt-5.5 | high | 12.8 | 7.4 | 12.9 | 15.1 | 12.8 |
| claude-opus-4-8 | low | 2.5 | 3.1 | 4.2 | 1.9 | 1.8 |
| claude-opus-4-8 | medium | 2.9 | 3.1 | 5.4 | 2.8 | 2.6 |
| glm-4.7-flash | low | 2.5 | 2.3 | 2.4 | 2.8 | 2.6 |
| glm-4.7-flash | high | 2.5 | 2.4 | 2.4 | 2.6 | 2.6 |
| nemotron-3-super | low | 2.7 | 3.3 | 3.5 | 2.5 | 2.3 |
| nemotron-3-super | medium | 5.7 | 5.7 | 5.6 | 5.9 | 4.9 |
| claude-opus-4-6 | low | 4.1 | 4.4 | 4.9 | 3.7 | 3.0 |
| claude-opus-4-6 | medium | 6.9 | 5.3 | 8.5 | 11.8 | 5.3 |
| gemini-3.5-flash | low | 4.1 | 4.0 | 5.1 | 2.4 | 4.1 |
| gemini-3.5-flash | medium | 6.3 | 6.4 | 8.0 | 4.8 | 6.2 |
| gemini-3.5-flash | high | 6.9 | 6.9 | 9.1 | 6.8 | 0.0 |
| kimi-k2.6 | low | 24.2 | 19.2 | 26.6 | 23.0 | 36.8 |
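Time-to-first-token in seconds (warm). Effort is the main lever for some models: the same model can swing several-fold (e.g. gpt-5.5 low 2.4s → high 12.8s), while others (grok-4.3, deepseek-v4-flash, deepseek-v4-pro) barely move.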
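The size of the effort swing is just the ratio of median TTFT at the highest tested effort to the lowest. A small sketch using values copied from the table above; the dictionary layout and the `swing` helper are illustrative names, not part of the benchmark code:

```python
# Median TTFT in seconds from the table above: (lowest tested effort, highest tested effort).
MEDIAN_TTFT_S = {
    "gpt-5.5": (2.4, 12.8),                # low -> high
    "gemini-3.1-flash-lite": (1.1, 4.3),   # medium -> high
    "claude-sonnet-4-6": (2.3, 5.5),       # low -> medium
    "deepseek-v4-flash": (1.0, 1.0),       # low -> high
    "grok-4.3": (0.8, 0.8),                # low -> high
}


def swing(lowest_s: float, highest_s: float) -> float:
    """How many times slower the first token arrives at the highest tested effort."""
    return highest_s / lowest_s


swings = {model: round(swing(lo, hi), 1) for model, (lo, hi) in MEDIAN_TTFT_S.items()}
for model, factor in sorted(swings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: x{factor}")
```

The spread in ratios is the reason speed has to be read per model × effort cell and not per model.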
| Model | Effort | Full interview | 2nd-msg | mid | before-end | end | in/out tok |
|---|---|---|---|---|---|---|---|
| deepseek-v4-flash | low | $0.007 | $0.0001 | $0.0001 | $0.0001 | $0.0001 | 29120/71 |
| deepseek-v4-flash | high | $0.008 | $0.0002 | $0.0002 | $0.0002 | $0.0002 | 29120/268 |
| deepseek-v4-pro | low | $0.017 | $0.0002 | $0.0002 | $0.0002 | $0.0002 | 29122/67 |
| deepseek-v4-pro | high | $0.023 | $0.0004 | $0.0004 | $0.0005 | $0.0003 | 29120/275 |
| glm-4.7-flash | low | $0.042 | $0.0016 | $0.0017 | $0.0018 | $0.0018 | 28628/74 |
| glm-4.7-flash | high | $0.048 | $0.0018 | $0.0019 | $0.0019 | $0.0019 | 28628/537 |
| gemini-3.1-flash-lite | low | $0.078 | — | — | — | — | —/— |
| gemini-3.1-flash-lite | medium | — | $0.0031 | $0.0024 | $0.0026 | $0.0025 | 29947/226 |
| gemini-3.1-flash-lite | high | — | $0.0039 | $0.0037 | $0.0041 | $0.0036 | 29930/978 |
| grok-4.3 | low | $0.169 | $0.0055 | $0.0062 | $0.0060 | $0.0061 | 28158/43 |
| grok-4.3 | high | $0.169 | $0.0054 | $0.0059 | $0.0060 | $0.0067 | 28206/42 |
| kimi-k2.6 | low | $0.181 | $0.0067 | $0.0059 | $0.0065 | $0.0067 | 28580/360 |
| glm-5.2 | low | $0.223 | $0.0074 | $0.0076 | $0.0080 | $0.0080 | 28612/68 |
| glm-5.2 | high | $0.260 | $0.0083 | $0.0095 | $0.0113 | $0.0085 | 28586/446 |
| qwen3.5-397b-a17b | low | $0.285 | $0.0112 | $0.0118 | $0.0124 | $0.0124 | 29663/186 |
| nemotron-3-120b | low | $0.363 | — | — | — | — | —/— |
| nemotron-3-120b | medium | $0.376 | — | — | — | — | —/— |
| gemini-3.5-flash | low | $0.371 | $0.0184 | $0.0195 | $0.0205 | $0.0212 | 29932/212 |
| gemini-3.5-flash | medium | $0.466 | $0.0209 | $0.0246 | $0.0227 | $0.0233 | 29945/704 |
| gemini-3.5-flash | high | $0.546 | $0.0224 | $0.0257 | $0.0253 | $0.0284 | 29945/1024 |
| claude-sonnet-4-6 | low | $0.372 | $0.0101 | $0.0122 | $0.0114 | $0.0109 | 31663/98 |
| claude-sonnet-4-6 | medium | — | $0.0108 | $0.0127 | $0.0124 | $0.0117 | 31663/210 |
| gpt-5.5 | low | $0.578 | $0.0176 | $0.0257 | $0.0217 | $0.0191 | 28356/132 |
| gpt-5.5 | medium | $0.640 | $0.0189 | $0.0245 | $0.0288 | $0.0240 | 28440/242 |
| gpt-5.5 | high | $0.774 | $0.0220 | $0.0312 | $0.0346 | $0.0327 | 28440/443 |
| claude-opus-4-6 | low | $0.604 | $0.0175 | $0.0181 | $0.0194 | $0.0183 | 31710/100 |
| claude-opus-4-6 | medium | $0.678 | $0.0189 | $0.0219 | $0.0262 | $0.0198 | 31663/189 |
| claude-opus-4-7 | low | $0.828 | — | — | — | — | —/— |
| claude-opus-4-7 | medium | $0.839 | — | — | — | — | —/— |
| claude-opus-4-8 | low | $0.852 | $0.0249 | $0.0274 | $0.0268 | $0.0257 | 45596/122 |
| claude-opus-4-8 | medium | $0.884 | $0.0250 | $0.0307 | $0.0262 | $0.0261 | 45596/148 |
| qwen3.7-max | low | $0.856 | $0.0322 | $0.0356 | $0.0376 | $0.0380 | 29654/566 |
| gpt-5.4 | low | $0.937 | — | — | — | — | —/— |
| gpt-5.4 | medium | $0.969 | — | — | — | — | —/— |
| gpt-5.4 | high | $1.036 | — | — | — | — | —/— |
| nemotron-3-super | low | — | $0.0142 | $0.0148 | $0.0155 | $0.0156 | 29858/100 |
| nemotron-3-super | medium | — | $0.0147 | $0.0151 | $0.0160 | $0.0157 | 29858/446 |
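Cost in USD, warm/cached. Higher effort → more output tokens → more cost. Full interview = the whole conversation; the checkpoint columns are per-reply costs. in/out tok = input/output tokens.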
Best model × effort considering all three (quality, speed, cost). Live chat needs a fast first response, so any cell with a TTFT slower than 6s is ruled out first; the remaining cells are compared on quality versus cost. Quality shown is from the robust eng set (Meghana noted where available).
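The selection rule above can be sketched as a filter followed by a sort. The 6s cutoff comes from the text; the ordering of the survivors (quality first, cost as tie-break) is an assumption made for illustration, since the source only says "quality vs cost". The `Candidate` class and `shortlist` function are hypothetical names, and the rows are copied from the robust-set quality, TTFT, and cost tables.

```python
from dataclasses import dataclass

MAX_TTFT_S = 6.0  # cutoff stated in the text: slower than 6s is ruled out


@dataclass(frozen=True)
class Candidate:
    model: str
    effort: str
    quality: int     # Overall pass-rate %, robust set
    ttft_s: float    # median time-to-first-token, seconds
    cost_usd: float  # full-interview cost, USD


CANDIDATES = [
    Candidate("claude-opus-4-8", "medium", 68, 2.9, 0.884),
    Candidate("gpt-5.5", "medium", 65, 7.3, 0.640),
    Candidate("gpt-5.5", "high", 63, 12.8, 0.774),
    Candidate("claude-opus-4-6", "medium", 60, 6.9, 0.678),
    Candidate("deepseek-v4-pro", "high", 57, 1.2, 0.023),
    Candidate("deepseek-v4-flash", "low", 53, 1.0, 0.007),
    Candidate("grok-4.3", "low", 41, 0.8, 0.169),
]


def shortlist(cands: list[Candidate], max_ttft_s: float = MAX_TTFT_S) -> list[Candidate]:
    """Drop cells that are too slow, then order by quality (desc) and cost (asc).

    Assumption: quality-then-cost ordering stands in for the unspecified trade-off.
    """
    fast = [c for c in cands if c.ttft_s <= max_ttft_s]
    return sorted(fast, key=lambda c: (-c.quality, c.cost_usd))


for c in shortlist(CANDIDATES):
    print(f"{c.model} ({c.effort}): {c.quality}% | {c.ttft_s}s | ${c.cost_usd}")
```

On these rows the speed gate removes both gpt-5.5 cells and claude-opus-4-6 medium before quality is considered at all, which is why the gate is applied first.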
Every judge (incl. memory & groundedness) and every turn, for anyone who wants to dig.
| Model | Voice | Pushback | Q-fit | Memory | Grounded | 2nd-msg | mid | before-end | end | n |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-5.4 | 92 | 75 | 83 | 100 | 100 | 100 | 78 | 89 | 67 | 12 |
| kimi-k2.6 | 100 | 100 | 50 | 100 | 100 | 100 | 67 | — | — | 2 |
| gpt-5.5 | 75 | 75 | 100 | 100 | 100 | 100 | 67 | 67 | 100 | 4 |
| gemini-3.5-flash | 83 | 83 | 67 | 100 | 100 | 100 | 50 | 83 | — | 6 |
| claude-opus-4-8 | 75 | 50 | 100 | 100 | 100 | 67 | 67 | 100 | 67 | 4 |
| grok-4.3 | 67 | 75 | 75 | 100 | 92 | 100 | 33 | 89 | 67 | 12 |
| deepseek-v4-flash | 62 | 50 | 88 | 100 | 100 | 67 | 83 | 83 | 33 | 8 |
| claude-opus-4-6 | 75 | 50 | 75 | 100 | 100 | 33 | 67 | 100 | 67 | 4 |
| claude-opus-4-7 | 75 | 50 | 62 | 100 | 100 | 100 | 33 | 50 | 67 | 8 |
| gemini-3.1-flash-lite | 25 | 50 | 88 | 100 | 100 | 83 | 50 | 50 | 33 | 8 |
| deepseek-v4-pro | 38 | 38 | 75 | 100 | 100 | 67 | 67 | 33 | 33 | 8 |
| nemotron-3-120b | 50 | 25 | 75 | 100 | 100 | 67 | 67 | 0 | 67 | 4 |
| claude-sonnet-4-6 | 25 | 75 | 50 | 100 | 100 | 67 | 33 | 33 | 67 | 4 |
All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.
| Model | Voice | Pushback | Q-fit | Memory | Grounded | 2nd-msg | mid | before-end | end | n |
|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4-8 | 85 | 50 | 70 | 100 | 100 | 73 | 73 | 80 | 47 | 20 |
| kimi-k2.6 | 75 | 67 | 58 | 100 | 92 | 33 | 50 | 100 | 33 | 12 |
| claude-sonnet-4-6 | 75 | 45 | 80 | 100 | 95 | 60 | 80 | 87 | 40 | 20 |
| gpt-5.5 | 62 | 60 | 70 | 100 | 100 | 67 | 67 | 70 | 53 | 40 |
| gpt-5.4 | 67 | 60 | 62 | 98 | 100 | 64 | 64 | 76 | 47 | 60 |
| claude-opus-4-6 | 60 | 45 | 75 | 100 | 90 | 53 | 53 | 93 | 40 | 20 |
| claude-opus-4-7 | 48 | 40 | 85 | 100 | 98 | 60 | 50 | 67 | 53 | 40 |
| gemini-3.1-flash-lite | 48 | 65 | 60 | 100 | 98 | 57 | 53 | 73 | 47 | 40 |
| deepseek-v4-pro | 48 | 55 | 65 | 98 | 98 | 47 | 57 | 77 | 43 | 40 |
| deepseek-v4-flash | 42 | 50 | 65 | 100 | 95 | 50 | 43 | 67 | 50 | 40 |
| gemini-3.5-flash | 62 | 28 | 64 | 100 | 97 | 57 | 47 | 67 | 33 | 39 |
| nemotron-3-120b | 11 | 47 | 79 | 100 | 95 | 40 | 42 | 53 | 47 | 19 |
| grok-4.3 | 34 | 37 | 54 | 100 | 100 | 29 | 43 | 58 | 38 | 59 |
All 5 judges (overall) + per-turn quality (3-judge mean). % = pass-rate.
Full per-checkpoint leaderboard: every persona · every turn · all 5 judges + cost + tokens.
One prompt + one fixed candidate conversation; only model and effort vary. At each of 4 checkpoints (2nd-msg, mid, before-end, end), every model produced the interviewer reply. Quality was graded by a fixed LLM judge (claude-sonnet-4-6) on real transcripts; speed (TTFT) and cost are medians over 15 runs, warm/cached. Effort labels were normalised onto low/medium/high (think/on/eff-high → high; no-think/off → low).
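The effort normalisation described above amounts to a small lookup. A minimal sketch, assuming labels are matched case-insensitively and that anything outside the listed aliases should fail loudly; the function name `normalise_effort` and the error behaviour are illustrative choices, not taken from the benchmark code:

```python
CANONICAL = {"low", "medium", "high"}

# Provider-specific labels listed in the text, mapped onto the shared scale.
ALIASES = {
    "think": "high",
    "on": "high",
    "eff-high": "high",
    "no-think": "low",
    "off": "low",
}


def normalise_effort(label: str) -> str:
    """Map a provider's thinking/effort label onto low/medium/high."""
    key = label.strip().lower()
    if key in CANONICAL:
        return key
    if key in ALIASES:
        return ALIASES[key]
    # Assumption: unknown labels are rejected rather than silently bucketed.
    raise ValueError(f"unrecognised effort label: {label!r}")


print(normalise_effort("think"))     # high
print(normalise_effort("no-think"))  # low
print(normalise_effort("Medium"))    # medium
```

Note that none of the listed aliases map to medium, so a model that only exposes an on/off thinking switch can appear only in the low and high rows.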