AI Interviewer · Model Benchmark

Real model outputs from the cold-output sheet, judged with full persona + history from Neon. 713 replies across 13 models.

Deterministic = pure-JS objective checks (free). The LLM-as-judge columns were graded by claude-sonnet-4-6. Speed and true cost are NOT in this dataset (the sheet captured outputs only); they require a fresh generation run.
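The page does not define how Judge% is aggregated or how rows are ordered. In the tables on this page, Judge% is consistent with the unweighted mean of the five LLM-as-judge columns (In-voice, Pushback, Q-fit, Mem-use, Grounded), rounded to a whole percent, and rows appear to be sorted by Judge% with Det% as the tie-breaker. The sketch below reproduces that arithmetic for three rows copied from the tables. The function and field names are hypothetical, and both rules are inferred from the published numbers rather than confirmed by the benchmark code.

```python
# Inferred aggregation, not the benchmark's actual code: Judge% looks like the
# unweighted mean of the five LLM-as-judge columns, rounded to a whole percent.
JUDGE_COLUMNS = ("in_voice", "pushback", "q_fit", "mem_use", "grounded")


def judge_pct(row):
    """Mean of the five judge columns, rounded to the nearest integer."""
    return round(sum(row[col] for col in JUDGE_COLUMNS) / len(JUDGE_COLUMNS))


def rank(rows):
    """Order rows by Judge% descending, then Det% descending (assumed tie-break)."""
    return sorted(rows, key=lambda r: (-judge_pct(r), -r["det"]))


# Three rows copied from the tables on this page (field names are illustrative).
gpt_54_thumpn_mid = {
    "model": "gpt-5.4", "det": 97,
    "in_voice": 100, "pushback": 100, "q_fit": 33, "mem_use": 100, "grounded": 100,
}
opus_48_shreyas_2nd = {
    "model": "claude-opus-4-8", "det": 98,
    "in_voice": 100, "pushback": 80, "q_fit": 40, "mem_use": 100, "grounded": 100,
}
grok_43_shreyas_2nd = {
    "model": "grok-4.3", "det": 98,
    "in_voice": 33, "pushback": 7, "q_fit": 47, "mem_use": 100, "grounded": 100,
}

print(judge_pct(gpt_54_thumpn_mid))    # 87, matches the table
print(judge_pct(opus_48_shreyas_2nd))  # 84, matches the table
print(judge_pct(grok_43_shreyas_2nd))  # 57, matches the table
```

Because the inputs here are already-rounded column percentages, a recomputed Judge% could differ by a point from a value computed on raw per-reply grades. Det% does not follow the same simple-mean rule across its eight component columns, so no formula is suggested for it.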

Persona note: these are engineering interviews (Fullstack / AI Researcher), not Thumpn/Meghana. The Shreyas interviews have a real persona (comm-style + 31 memory blocks), so voice/memory/grounded are meaningful. The Default Interviewer interviews have no persona, so only format/structure/anti-AI are meaningful there.

Thumpn · Meghana (real persona)

2nd-msg

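Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
gpt-5.4 🏆 | 99% | 99% | 91% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 223 | 3
kimi-k2.6 | 98% | 100% | 91% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 442 | 1
claude-opus-4-7 | 97% | 98% | 91% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 100% | 100% | 100% | 100% | 100% | 91 | 2
gpt-5.5 | 97% | 92% | 91% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 163 | 1
gemini-3.5-flash | 96% | 96% | 82% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 74 | 2
grok-4.3 | 99% | 100% | 91% | 100% | 100% | 100% | 100% | 97% | 100% | 100% | 100% | 100% | 100% | 67% | 93% | 125 | 3
gemini-3.1-flash-lite | 96% | 94% | 86% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 50% | 100% | 100% | 100% | 90% | 86 | 2
claude-sonnet-4-6 | 99% | 100% | 91% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 100% | 80% | 45 | 1
claude-opus-4-8 | 98% | 100% | 91% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 106 | 1
deepseek-v4-flash | 98% | 100% | 91% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 120 | 2
deepseek-v4-pro | 97% | 98% | 86% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 50% | 100% | 100% | 80% | 118 | 2
nemotron-3-120b | 97% | 96% | 91% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 155 | 1
claude-opus-4-6 | 98% | 100% | 91% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 0% | 100% | 100% | 60% | 43 | 1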

mid

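Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
deepseek-v4-flash 🏆 | 96% | 98% | 100% | 95% | 96% | 100% | 100% | 88% | 98% | 50% | 100% | 100% | 100% | 100% | 90% | 260 | 2
gpt-5.4 | 97% | 97% | 94% | 95% | 100% | 100% | 100% | 92% | 99% | 100% | 100% | 33% | 100% | 100% | 87% | 708 | 3
claude-opus-4-6 | 97% | 96% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 100% | 100% | 100% | 100% | 80% | 202 | 1
claude-opus-4-8 | 97% | 96% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 100% | 100% | 100% | 100% | 80% | 229 | 1
deepseek-v4-pro | 97% | 98% | 100% | 95% | 100% | 100% | 100% | 92% | 98% | 50% | 100% | 50% | 100% | 100% | 80% | 546 | 2
gpt-5.5 | 96% | 96% | 91% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 100% | 100% | 100% | 100% | 80% | 627 | 1
kimi-k2.6 | 96% | 96% | 100% | 95% | 100% | 100% | 100% | 92% | 96% | 100% | 100% | 0% | 100% | 100% | 80% | 1624 | 1
nemotron-3-120b | 96% | 96% | 100% | 95% | 100% | 100% | 100% | 92% | 96% | 0% | 100% | 100% | 100% | 100% | 80% | 438 | 1
gemini-3.5-flash | 96% | 96% | 91% | 95% | 100% | 100% | 100% | 92% | 100% | 50% | 100% | 0% | 100% | 100% | 70% | 96 | 2
gemini-3.1-flash-lite | 96% | 94% | 91% | 95% | 100% | 100% | 100% | 92% | 98% | 0% | 100% | 50% | 100% | 100% | 70% | 119 | 2
claude-sonnet-4-6 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 100% | 0% | 100% | 100% | 60% | 254 | 1
grok-4.3 | 97% | 100% | 100% | 95% | 97% | 100% | 100% | 90% | 97% | 0% | 100% | 0% | 100% | 100% | 60% | 133 | 3
claude-opus-4-7 | 97% | 98% | 100% | 95% | 100% | 100% | 100% | 92% | 96% | 0% | 100% | 0% | 100% | 100% | 60% | 156 | 2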

before-end

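Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-opus-4-8 🏆 | 96% | 96% | 100% | 89% | 100% | 100% | 100% | 92% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 267 | 1
claude-opus-4-6 | 95% | 96% | 91% | 89% | 100% | 100% | 100% | 92% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 216 | 1
grok-4.3 | 98% | 99% | 97% | 98% | 100% | 100% | 100% | 92% | 97% | 67% | 100% | 100% | 100% | 100% | 93% | 199 | 3
gpt-5.4 | 97% | 99% | 97% | 91% | 100% | 100% | 100% | 92% | 99% | 67% | 100% | 100% | 100% | 100% | 93% | 434 | 3
gemini-3.5-flash | 96% | 96% | 100% | 89% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 128 | 2
deepseek-v4-flash | 96% | 98% | 95% | 95% | 100% | 100% | 100% | 92% | 94% | 100% | 100% | 50% | 100% | 100% | 90% | 326 | 2
gpt-5.5 | 96% | 100% | 91% | 89% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 625 | 1
claude-opus-4-7 | 96% | 98% | 95% | 92% | 100% | 100% | 100% | 92% | 96% | 100% | 0% | 50% | 100% | 100% | 70% | 174 | 2
gemini-3.1-flash-lite | 96% | 96% | 91% | 92% | 100% | 100% | 100% | 92% | 98% | 0% | 50% | 100% | 100% | 100% | 70% | 150 | 2
claude-sonnet-4-6 | 97% | 100% | 91% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 100% | 0% | 100% | 100% | 60% | 534 | 1
deepseek-v4-pro | 97% | 98% | 95% | 92% | 100% | 100% | 100% | 92% | 100% | 0% | 0% | 100% | 100% | 100% | 60% | 355 | 2
nemotron-3-120b | 97% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 96% | 0% | 0% | 0% | 100% | 100% | 40% | 572 | 1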

end

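Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
gpt-5.5 🏆 | 94% | 88% | 100% | 95% | 100% | 100% | 100% | 92% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 507 | 1
nemotron-3-120b | 99% | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 1460 | 1
claude-opus-4-6 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 50 | 1
claude-opus-4-7 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 33 | 2
claude-opus-4-8 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 100 | 1
claude-sonnet-4-6 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 95 | 1
gpt-5.4 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 374 | 3
grok-4.3 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 75 | 3
deepseek-v4-flash | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 0% | 0% | 100% | 100% | 100% | 60% | 74 | 2
gemini-3.1-flash-lite | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 98% | 0% | 0% | 100% | 100% | 100% | 60% | 42 | 2
deepseek-v4-pro | 97% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 96% | 0% | 0% | 100% | 100% | 100% | 60% | 86 | 2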

Engineering · Shreyas (real persona)

2nd-msg

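Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-opus-4-8 🏆 | 98% | 99% | 98% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 80% | 40% | 100% | 100% | 84% | 222 | 5
gpt-5.5 | 98% | 96% | 97% | 100% | 100% | 100% | 100% | 94% | 100% | 50% | 70% | 80% | 100% | 100% | 80% | 294 | 10
gpt-5.4 | 98% | 96% | 95% | 100% | 99% | 100% | 100% | 93% | 100% | 60% | 73% | 60% | 100% | 100% | 79% | 376 | 15
claude-opus-4-7 | 99% | 99% | 99% | 100% | 100% | 100% | 100% | 92% | 100% | 70% | 30% | 80% | 100% | 90% | 74% | 123 | 10
gemini-3.1-flash-lite | 98% | 96% | 95% | 100% | 100% | 100% | 100% | 93% | 99% | 50% | 80% | 40% | 100% | 100% | 74% | 100 | 10
claude-sonnet-4-6 | 98% | 100% | 95% | 100% | 100% | 100% | 100% | 92% | 97% | 80% | 20% | 80% | 100% | 80% | 72% | 322 | 5
gemini-3.5-flash | 98% | 95% | 95% | 100% | 100% | 100% | 100% | 92% | 100% | 90% | 30% | 50% | 100% | 90% | 72% | 83 | 10
claude-opus-4-6 | 99% | 98% | 98% | 100% | 100% | 100% | 100% | 92% | 100% | 60% | 40% | 60% | 100% | 80% | 68% | 368 | 5
deepseek-v4-pro | 98% | 100% | 98% | 100% | 100% | 100% | 100% | 92% | 98% | 40% | 40% | 60% | 100% | 100% | 68% | 153 | 10
deepseek-v4-flash | 98% | 98% | 99% | 100% | 100% | 100% | 100% | 92% | 98% | 20% | 50% | 80% | 100% | 90% | 68% | 212 | 10
nemotron-3-120b | 97% | 98% | 91% | 100% | 100% | 100% | 100% | 91% | 96% | 20% | 20% | 80% | 100% | 100% | 64% | 382 | 5
kimi-k2.6 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 0% | 0% | 100% | 100% | 60% | 912 | 2
grok-4.3 | 98% | 98% | 96% | 100% | 100% | 100% | 100% | 94% | 98% | 33% | 7% | 47% | 100% | 100% | 57% | 268 | 15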

mid

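Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-sonnet-4-6 🏆 | 98% | 98% | 95% | 99% | 100% | 100% | 100% | 92% | 98% | 80% | 60% | 100% | 100% | 100% | 88% | 705 | 5
claude-opus-4-8 | 98% | 98% | 100% | 99% | 100% | 100% | 100% | 92% | 98% | 80% | 60% | 80% | 100% | 100% | 84% | 250 | 5
gpt-5.5 | 97% | 95% | 95% | 99% | 99% | 100% | 100% | 93% | 100% | 70% | 50% | 80% | 100% | 100% | 80% | 416 | 10
gpt-5.4 | 97% | 96% | 94% | 98% | 99% | 100% | 100% | 92% | 99% | 67% | 60% | 67% | 100% | 100% | 79% | 662 | 15
claude-opus-4-6 | 98% | 98% | 96% | 99% | 100% | 100% | 100% | 92% | 98% | 40% | 40% | 80% | 100% | 100% | 72% | 325 | 5
deepseek-v4-pro | 97% | 97% | 97% | 99% | 99% | 100% | 100% | 92% | 96% | 50% | 70% | 50% | 100% | 90% | 72% | 200 | 10
kimi-k2.6 | 98% | 99% | 100% | 99% | 100% | 100% | 100% | 92% | 99% | 50% | 75% | 25% | 100% | 100% | 70% | 799 | 4
claude-opus-4-7 | 98% | 99% | 100% | 99% | 98% | 100% | 100% | 91% | 98% | 20% | 50% | 80% | 100% | 100% | 70% | 128 | 10
gemini-3.1-flash-lite | 97% | 98% | 96% | 97% | 100% | 100% | 100% | 92% | 98% | 30% | 80% | 50% | 100% | 90% | 70% | 105 | 10
gemini-3.5-flash | 98% | 97% | 94% | 99% | 99% | 100% | 100% | 92% | 100% | 60% | 20% | 60% | 100% | 100% | 68% | 98 | 10
grok-4.3 | 98% | 99% | 99% | 99% | 99% | 100% | 100% | 93% | 98% | 36% | 64% | 29% | 100% | 100% | 66% | 189 | 14
deepseek-v4-flash | 97% | 97% | 98% | 99% | 100% | 100% | 100% | 92% | 96% | 70% | 40% | 20% | 100% | 100% | 66% | 168 | 10
nemotron-3-120b | 96% | 97% | 93% | 97% | 98% | 100% | 100% | 90% | 97% | 0% | 50% | 75% | 100% | 75% | 60% | 401 | 4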

before-end

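Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
kimi-k2.6 🏆 | 98% | 96% | 100% | 98% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 480 | 5
claude-opus-4-6 | 99% | 99% | 100% | 98% | 100% | 100% | 100% | 94% | 99% | 100% | 80% | 100% | 100% | 80% | 92% | 204 | 5
claude-sonnet-4-6 | 98% | 98% | 100% | 98% | 100% | 100% | 100% | 94% | 100% | 100% | 80% | 80% | 100% | 100% | 92% | 294 | 5
claude-opus-4-8 | 97% | 98% | 98% | 98% | 100% | 100% | 100% | 92% | 96% | 100% | 60% | 80% | 100% | 100% | 88% | 104 | 5
deepseek-v4-pro | 98% | 98% | 96% | 98% | 100% | 100% | 100% | 93% | 99% | 70% | 90% | 70% | 100% | 100% | 86% | 260 | 10
gpt-5.4 | 98% | 96% | 96% | 98% | 100% | 100% | 100% | 93% | 99% | 80% | 87% | 60% | 100% | 100% | 85% | 718 | 15
gemini-3.1-flash-lite | 97% | 96% | 97% | 96% | 100% | 100% | 100% | 93% | 98% | 70% | 80% | 70% | 100% | 100% | 84% | 136 | 10
gpt-5.5 | 97% | 96% | 96% | 98% | 98% | 100% | 100% | 92% | 99% | 70% | 90% | 50% | 100% | 100% | 82% | 432 | 10
claude-opus-4-7 | 99% | 100% | 100% | 99% | 100% | 100% | 100% | 94% | 99% | 50% | 60% | 90% | 100% | 100% | 80% | 104 | 10
gemini-3.5-flash | 97% | 96% | 96% | 97% | 99% | 100% | 100% | 94% | 98% | 70% | 50% | 80% | 100% | 100% | 80% | 105 | 10
deepseek-v4-flash | 97% | 97% | 98% | 98% | 100% | 100% | 100% | 92% | 98% | 50% | 90% | 60% | 100% | 90% | 78% | 230 | 10
grok-4.3 | 98% | 99% | 100% | 98% | 99% | 100% | 100% | 93% | 99% | 40% | 67% | 67% | 100% | 100% | 75% | 173 | 15
nemotron-3-120b | 96% | 95% | 96% | 98% | 98% | 100% | 100% | 91% | 96% | 0% | 100% | 60% | 100% | 100% | 72% | 510 | 5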

end

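Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-opus-4-7 🏆 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 95% | 99% | 50% | 20% | 90% | 100% | 100% | 72% | 54 | 10
gpt-5.5 | 98% | 98% | 97% | 97% | 99% | 100% | 100% | 94% | 100% | 60% | 30% | 70% | 100% | 100% | 72% | 361 | 10
deepseek-v4-flash | 98% | 98% | 98% | 97% | 100% | 100% | 100% | 92% | 98% | 30% | 20% | 100% | 100% | 100% | 70% | 186 | 10
claude-opus-4-8 | 98% | 100% | 98% | 96% | 100% | 100% | 100% | 92% | 99% | 60% | 0% | 80% | 100% | 100% | 68% | 84 | 5
gemini-3.1-flash-lite | 97% | 98% | 98% | 96% | 99% | 100% | 100% | 92% | 98% | 40% | 20% | 80% | 100% | 100% | 68% | 84 | 10
nemotron-3-120b | 97% | 98% | 96% | 97% | 98% | 100% | 100% | 92% | 97% | 20% | 20% | 100% | 100% | 100% | 68% | 493 | 5
gpt-5.4 | 98% | 99% | 95% | 96% | 100% | 100% | 100% | 93% | 100% | 60% | 20% | 60% | 93% | 100% | 67% | 453 | 15
claude-sonnet-4-6 | 98% | 99% | 98% | 97% | 100% | 100% | 100% | 92% | 98% | 40% | 20% | 60% | 100% | 100% | 64% | 288 | 5
claude-opus-4-6 | 98% | 98% | 96% | 97% | 100% | 100% | 100% | 92% | 98% | 40% | 20% | 60% | 100% | 100% | 64% | 280 | 5
deepseek-v4-pro | 98% | 99% | 98% | 96% | 99% | 100% | 100% | 92% | 98% | 30% | 20% | 80% | 90% | 100% | 64% | 184 | 10
grok-4.3 | 99% | 99% | 100% | 97% | 99% | 100% | 100% | 95% | 99% | 27% | 13% | 73% | 100% | 100% | 63% | 125 | 15
gemini-3.5-flash | 97% | 96% | 95% | 95% | 99% | 100% | 100% | 91% | 99% | 22% | 11% | 67% | 100% | 100% | 60% | 403 | 9
kimi-k2.6 | 95% | 92% | 91% | 95% | 100% | 100% | 100% | 92% | 96% | 0% | 0% | 100% | 100% | 0% | 40% | 1862 | 1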

Default Interviewer (no persona; read objective columns only)

2nd-msg

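Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-opus-4-6 🏆 | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 244 | 2
deepseek-v4-pro | 99% | 98% | 100% | 100% | 98% | 100% | 100% | 94% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 263 | 4
deepseek-v4-flash | 98% | 99% | 100% | 100% | 100% | 100% | 100% | 92% | 96% | 100% | 100% | 100% | 100% | 75% | 95% | 148 | 4
grok-4.3 | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 96% | 98% | 100% | 83% | 83% | 100% | 100% | 93% | 189 | 6
claude-opus-4-8 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 100% | 50% | 100% | 100% | 90% | 206 | 2
claude-opus-4-7 | 99% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 99% | 100% | 100% | 50% | 100% | 75% | 85% | 76 | 4
gpt-5.5 | 99% | 97% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 50% | 100% | 75% | 85% | 233 | 4
gemini-3.1-flash-lite | 98% | 97% | 100% | 100% | 100% | 100% | 100% | 92% | 99% | 100% | 100% | 75% | 75% | 75% | 85% | 80 | 4
claude-sonnet-4-6 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 50% | 50% | 100% | 100% | 80% | 252 | 2
gpt-5.4 | 98% | 96% | 98% | 100% | 100% | 100% | 100% | 95% | 100% | 100% | 50% | 50% | 100% | 100% | 80% | 327 | 6
gemini-3.5-flash | 98% | 95% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 75% | 50% | 75% | 75% | 75% | 61 | 4
nemotron-3-120b | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 0% | 100% | 100% | 100% | 50% | 70% | 575 | 2
kimi-k2.6 | 98% | 96% | 100% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 0% | 60% | 1615 | 1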

mid

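Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-sonnet-4-6 🏆 | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 410 | 2
claude-opus-4-6 | 98% | 96% | 100% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 100% | 100% | 100% | 100% | 100% | 334 | 2
gpt-5.5 | 98% | 97% | 98% | 100% | 100% | 100% | 100% | 92% | 100% | 100% | 75% | 100% | 100% | 100% | 95% | 545 | 4
claude-opus-4-8 | 98% | 98% | 100% | 100% | 100% | 100% | 100% | 92% | 96% | 100% | 50% | 100% | 100% | 100% | 90% | 310 | 2
deepseek-v4-flash | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 94% | 100% | 100% | 75% | 50% | 100% | 100% | 85% | 118 | 4
claude-opus-4-7 | 98% | 97% | 100% | 100% | 98% | 100% | 100% | 92% | 98% | 100% | 50% | 100% | 100% | 75% | 85% | 129 | 4
deepseek-v4-pro | 98% | 99% | 100% | 100% | 100% | 100% | 100% | 92% | 95% | 100% | 50% | 75% | 100% | 100% | 85% | 269 | 4
kimi-k2.6 | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 50% | 100% | 100% | 50% | 80% | 828 | 2
gpt-5.4 | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 94% | 99% | 100% | 83% | 50% | 100% | 67% | 80% | 675 | 6
grok-4.3 | 99% | 99% | 100% | 99% | 100% | 100% | 100% | 92% | 99% | 67% | 50% | 83% | 100% | 100% | 80% | 229 | 6
gemini-3.1-flash-lite | 98% | 95% | 100% | 100% | 100% | 100% | 100% | 92% | 98% | 100% | 50% | 100% | 75% | 75% | 80% | 86 | 4
nemotron-3-120b | 99% | 98% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 100% | 0% | 0% | 100% | 100% | 60% | 502 | 2
gemini-3.5-flash | 98% | 98% | 98% | 99% | 100% | 100% | 100% | 94% | 98% | 100% | 25% | 75% | 50% | 50% | 60% | 92 | 4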

before-end

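Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
claude-opus-4-8 🏆 | 98% | 98% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 109 | 2
claude-opus-4-6 | 97% | 98% | 100% | 97% | 100% | 100% | 100% | 92% | 96% | 100% | 50% | 100% | 100% | 100% | 90% | 496 | 2
gpt-5.5 | 97% | 94% | 98% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 75% | 75% | 100% | 100% | 90% | 641 | 4
gemini-3.1-flash-lite | 98% | 98% | 100% | 96% | 100% | 100% | 100% | 92% | 99% | 100% | 50% | 100% | 100% | 75% | 85% | 99 | 4
gemini-3.5-flash | 98% | 97% | 100% | 96% | 98% | 100% | 100% | 94% | 99% | 100% | 75% | 100% | 75% | 75% | 85% | 91 | 4
grok-4.3 | 98% | 99% | 100% | 96% | 100% | 100% | 100% | 92% | 99% | 67% | 67% | 83% | 100% | 100% | 83% | 186 | 6
claude-opus-4-7 | 98% | 98% | 100% | 99% | 100% | 100% | 100% | 92% | 99% | 75% | 50% | 75% | 100% | 100% | 80% | 99 | 4
deepseek-v4-flash | 98% | 96% | 100% | 97% | 100% | 100% | 100% | 92% | 99% | 100% | 50% | 75% | 100% | 75% | 80% | 162 | 4
gpt-5.4 | 97% | 94% | 100% | 99% | 100% | 100% | 100% | 92% | 98% | 83% | 50% | 67% | 100% | 67% | 73% | 653 | 6
nemotron-3-120b | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 96% | 100% | 50% | 50% | 50% | 100% | 100% | 70% | 164 | 2
claude-sonnet-4-6 | 98% | 96% | 100% | 95% | 100% | 100% | 100% | 96% | 100% | 50% | 50% | 50% | 100% | 100% | 70% | 96 | 2
deepseek-v4-pro | 97% | 96% | 98% | 97% | 100% | 100% | 100% | 92% | 98% | 50% | 50% | 50% | 100% | 75% | 65% | 266 | 4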

end

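Column groups: Deterministic (objective) = Det% through Anti-AI; LLM-as-judge = In-voice through Judge%; Size = Out tok, n.
Model | Det% | Tone | Mem-facts | Flow | Mechanics | Adapt | Eval-hyg | Format | Anti-AI | In-voice | Pushback | Q-fit | Mem-use | Grounded | Judge% | Out tok | n
nemotron-3-120b 🏆 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 96% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 1308 | 2
claude-opus-4-6 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 64 | 2
claude-opus-4-7 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 48 | 4
claude-sonnet-4-6 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 67 | 2
gpt-5.4 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 50% | 100% | 100% | 100% | 90% | 290 | 6
gemini-3.1-flash-lite | 97% | 97% | 98% | 97% | 100% | 100% | 100% | 92% | 98% | 100% | 50% | 100% | 100% | 100% | 90% | 90 | 4
deepseek-v4-pro | 98% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 99% | 75% | 50% | 100% | 100% | 100% | 85% | 110 | 4
claude-opus-4-8 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 128 | 2
gpt-5.5 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 100% | 50% | 50% | 100% | 100% | 100% | 80% | 340 | 4
deepseek-v4-flash | 98% | 100% | 100% | 97% | 100% | 100% | 100% | 92% | 99% | 75% | 25% | 100% | 100% | 100% | 80% | 112 | 4
kimi-k2.6 | 98% | 100% | 100% | 95% | 100% | 100% | 100% | 92% | 100% | 100% | 0% | 100% | 100% | 100% | 80% | 1315 | 1
grok-4.3 | 99% | 100% | 100% | 97% | 100% | 100% | 100% | 94% | 100% | 33% | 50% | 100% | 100% | 100% | 77% | 45 | 6
gemini-3.5-flash | 97% | 93% | 100% | 97% | 100% | 100% | 100% | 92% | 99% | 25% | 75% | 100% | 100% | 0% | 60% | 621 | 4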