Spyfall Game Analysis

Comprehensive analysis of LLM performance in Spyfall, evaluating spy deception, information extraction, and cover maintenance across diverse location scenarios.

Performance Statistics

Model	Games as Spy	Spy Wins	Win Rate	Avg Spy Leak ↓	Avg Probe Eff. ↑
grok-4	32	32	100%	0.381	1.33
gpt-5	72	70	97.2%	0.488	0.755
gemini-2.5-pro	72	68	94.4%	0.357	0.721
claude-sonnet-4.5	31	27	87.1%	0.323	0.622
qwen3-235b-a22b	72	62	86.1%	0.395	0.673
kimi-k2-0905	72	57	79.2%	0.408	0.673
glm-4.5	72	55	76.4%	0.330	0.661
llama-4-maverick	72	49	68.1%	0.360	0.636
claude-opus-4.1	72	48	66.7%	0.390	0.653

Metric Definitions

Spy Win Rate

Percentage of games where the spy successfully avoided detection or correctly guessed the location.
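
As a quick check, the Win Rate column can be reproduced from the Games as Spy and Spy Wins columns. The sketch below assumes the rate is simply wins divided by games, rounded to one decimal place; the names `RESULTS` and `spy_win_rate` are illustrative and not part of the original evaluation code.

```python
# (games_as_spy, spy_wins) copied from the Performance Statistics table.
RESULTS = {
    "grok-4": (32, 32),
    "gpt-5": (72, 70),
    "gemini-2.5-pro": (72, 68),
    "claude-sonnet-4.5": (31, 27),
    "qwen3-235b-a22b": (72, 62),
    "kimi-k2-0905": (72, 57),
    "glm-4.5": (72, 55),
    "llama-4-maverick": (72, 49),
    "claude-opus-4.1": (72, 48),
}


def spy_win_rate(games: int, wins: int) -> float:
    """Return the spy win rate as a percentage, rounded to one decimal.

    Assumption: win rate = spy wins / games played as spy.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    return round(100 * wins / games, 1)


if __name__ == "__main__":
    for model, (games, wins) in RESULTS.items():
        print(f"{model}: {spy_win_rate(games, wins)}%")
```

Under that assumption, the computed values match the Win Rate column for all nine models.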

Avg Spy Leak ↓ (Lower is Better)

Measures how much location-specific information the spy reveals through their responses. Lower values indicate better cover maintenance and more generic, plausible answers.

Avg Probe Effectiveness ↑ (Higher is Better)

Evaluates how effectively the spy's questions extract location information from other players. Higher values indicate more strategic questioning that reveals key location details.
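
Because the two metrics point in opposite directions, it can help to rank the models on each one explicitly. The following is a minimal sketch using the values from the Performance Statistics table; the names `METRICS`, `rank_by_leak`, and `rank_by_probe` are hypothetical, and how the underlying scores were computed is not specified in this analysis.

```python
# (avg_spy_leak, avg_probe_effectiveness) copied from the table above.
METRICS = {
    "grok-4": (0.381, 1.33),
    "gpt-5": (0.488, 0.755),
    "gemini-2.5-pro": (0.357, 0.721),
    "claude-sonnet-4.5": (0.323, 0.622),
    "qwen3-235b-a22b": (0.395, 0.673),
    "kimi-k2-0905": (0.408, 0.673),
    "glm-4.5": (0.330, 0.661),
    "llama-4-maverick": (0.360, 0.636),
    "claude-opus-4.1": (0.390, 0.653),
}


def rank_by_leak(metrics: dict) -> list:
    """Order models from best to worst leak (lower is better)."""
    return sorted(metrics, key=lambda m: metrics[m][0])


def rank_by_probe(metrics: dict) -> list:
    """Order models from best to worst probe effectiveness (higher is better)."""
    return sorted(metrics, key=lambda m: metrics[m][1], reverse=True)


if __name__ == "__main__":
    print("Leak (best first): ", rank_by_leak(METRICS))
    print("Probe (best first):", rank_by_probe(METRICS))
```

Sorting this way puts claude-sonnet-4.5 and glm-4.5 first on leak, and grok-4 and gpt-5 first on probe effectiveness.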

Model Strategy Profiles

xAI Grok-4

100% Spy Win Rate

"Procedural Deceiver"

Elite as Spy—paces perfectly with sensory probes ("smell," "hum," "shift") before dropping vivid maintenance or medical-staff stories that blend anywhere. But as Non-Spy, leaks too much—double-barreled questions and over-specific cues ("free drinks," "coin mechanisms") hand Spies the win. Procedural precision turns brilliant deception into self-sabotage when unchecked.

OpenAI GPT-5

97.2% Win Rate

"Overpowering extractor"

Second-highest win rate and probe effectiveness in the table, behind only Grok-4, and achieved over more than twice as many games; wins even when it "talks too specifically," as its table-high average leak of 0.488 shows. Example wins show confident, targeted prompts (e.g., Wizard Tower, University Campus) with high probe effectiveness. Its information extraction compensates for occasional over-specificity in responses.

Google Gemini-2.5-Pro

94.4% Win Rate

"Clinical & efficient"

Very high win-rate with excellent or good cover in many settings (e.g., Submarine/Mall lines: avg spy leak ~0.12–0.18, cover "excellent"). Often waits patiently for over-disclosure from others, then locks in the guess. Demonstrates exceptional patience and disciplined information gathering.

Anthropic Claude Sonnet 4.5

87.1% Win Rate • Best Leak Control

"Adaptive information gatherer"

Wins by probing broadly first and tightening only after others leak key terms. Keeps its own answers modest while re-targeting the leakiest non-spy to gather more signal, then commits quickly once 2–3 anchors emerge. Can slip when it over-aligns to table jargon late (raising its leak) or in slow, technical locations where waiting for extra confirmations stretches guess latency and gives non-spies time to adjust. Its spy leak is the lowest in the table, reflecting exceptional cover maintenance.

Alibaba Qwen3-235B-a22b

86.1% Win Rate

"Strong closer, mildly leaky"

Wins frequently by capitalizing on table leaks; probe strength is solid but sometimes speaks too specifically. Various wins across Roman Senate/Lunar Base/Leonardo's Studio show solid timing and adequate cover. Strong endgame execution when enough information has been gathered.

Moonshot Kimi-K2-0905

79.2% Win Rate

"Steady extractor"

Upper-mid win-rate with good probing across technical/fantasy locations. Can over-share in medieval/fantasy themes (leak spikes) but generally maintains "good" or "excellent" cover while actively listening and gathering information.

Z-AI GLM-4.5

76.4% Win Rate

"Best camouflage"

Second-lowest aggregate leak with solid win-rate. Examples show excellent/good cover and competent probes (Sports Stadium, Corporate Office). When caught, it's usually by adopting thematic details too eagerly (e.g., Medieval Castle scenarios). Excels at maintaining plausible cover stories.

Meta Llama-4 Maverick

68.1% Win Rate

"Capable, tendency to mismatch tone"

Second-lowest win rate in the table with modest probing; tone mismatches (e.g., polished language in gritty settings like Western Saloons) frequently expose its spy status. When tone is properly aligned (Roman Senate), probing is competent and cover acceptable. Performance is heavily context-dependent.

Anthropic Claude Opus 4.1

66.7% Win Rate

"Capable but beatable"

Probe effectiveness mid-table with generally good cover, but sometimes reveals too-specific domain references (Airport/Shaolin examples) leading to detection. Still maintains a two-thirds overall win rate through competent baseline performance.

Key Insights

Strategic Patterns

Information Extraction vs. Cover Maintenance: Top performers (GPT-5, Gemini) win through aggressive probing; Gemini pairs it with a relatively low leak, while GPT-5 wins despite the highest average leak in the table. Claude Sonnet takes the opposite approach: exceptional cover with adaptive probing.
Patience Pays Off: Models that wait for others to over-disclose (Gemini, Sonnet) achieve high win rates with lower leak metrics, suggesting that defensive play can be as effective as aggressive questioning.
Context Awareness Critical: Tone and detail appropriateness significantly impact detection rates. Models that fail to match setting-specific communication styles (formal vs. casual, technical vs. thematic) get caught more frequently.
Sample Size Consideration: Grok-4 (32 games) and Claude Sonnet 4.5 (31 games) were evaluated on fewer than half as many spy games as the other models (72 each), so their win rates, including Grok-4's 100%, carry more uncertainty and may shift with larger samples.
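
The sample-size caveat can be made concrete with a confidence interval on each win rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method from the original analysis; the function name `wilson_interval` and the 95% level (z = 1.96) are likewise assumptions.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple:
    """Return the (lower, upper) Wilson score interval for a win proportion.

    Assumption: games are independent trials with a fixed win probability.
    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # (spy_wins, games_as_spy) taken from the Performance Statistics table.
    samples = {
        "grok-4": (32, 32),
        "claude-sonnet-4.5": (27, 31),
        "qwen3-235b-a22b": (62, 72),
    }
    for model, (wins, games) in samples.items():
        lo, hi = wilson_interval(wins, games)
        print(f"{model}: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

With similar observed win rates, the interval for a model with about 30 games is wider than for one with 72, which is why the smaller-sample results should be compared with extra caution.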
