Spyfall Game Analysis
Comprehensive analysis of LLM performance in Spyfall, evaluating spy deception, information extraction, and cover maintenance across diverse location scenarios.
Performance Statistics
| Model | Games as Spy | Spy Wins | Win Rate | Avg Spy Leak ↓ | Avg Probe Eff. ↑ |
|---|---|---|---|---|---|
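| xAI Grok-4 | 32 | 32 | 100% | 0.381 | 1.33 |
| OpenAI GPT-5 | 72 | 70 | 97.2% | 0.488 | 0.755 |
| Google Gemini-2.5-Pro | 72 | 68 | 94.4% | 0.357 | 0.721 |
| Anthropic Claude Sonnet 4.5 | 31 | 27 | 87.1% | 0.323 | 0.622 |
| Alibaba Qwen3-235B-a22b | 72 | 62 | 86.1% | 0.395 | 0.673 |
| Moonshot Kimi-K2-0905 | 72 | 57 | 79.2% | 0.408 | 0.673 |
| Z-AI GLM-4.5 | 72 | 55 | 76.4% | 0.330 | 0.661 |
| Meta Llama-4 Maverick | 72 | 49 | 68.1% | 0.360 | 0.636 |
| Anthropic Claude Opus 4.1 | 72 | 48 | 66.7% | 0.390 | 0.653 |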
Metric Definitions
Spy Win Rate
Percentage of games where the spy successfully avoided detection or correctly guessed the location.
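As a quick sanity check, the Win Rate column can be recomputed from the two count columns. The sketch below assumes the rate is simply Spy Wins divided by Games as Spy, rounded to one decimal place; the name `win_rate` and the `ROWS` layout are illustrative, not part of the original evaluation code.

```python
# Rows copied from the Performance Statistics table, in table order:
# (games_as_spy, spy_wins, reported_win_rate_percent)
ROWS = [
    (32, 32, 100.0),
    (72, 70, 97.2),
    (72, 68, 94.4),
    (31, 27, 87.1),
    (72, 62, 86.1),
    (72, 57, 79.2),
    (72, 55, 76.4),
    (72, 49, 68.1),
    (72, 48, 66.7),
]


def win_rate(games: int, wins: int) -> float:
    """Spy win rate as a percentage (assumed definition: wins / games)."""
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    return round(100.0 * wins / games, 1)


if __name__ == "__main__":
    for games, wins, reported in ROWS:
        computed = win_rate(games, wins)
        # Flag any row where the recomputed value differs from the table.
        status = "ok" if abs(computed - reported) < 0.05 else "MISMATCH"
        print(f"{wins:>2}/{games:<2} -> {computed:5.1f}% (table: {reported:5.1f}%) {status}")
```

Under that assumed definition, every row in the table agrees with its reported percentage.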
Avg Spy Leak ↓ (Lower is Better)
Measures how much location-specific information the spy reveals through their responses. Lower values indicate better cover maintenance and more generic, plausible answers.
Avg Probe Effectiveness ↑ (Higher is Better)
Evaluates how effectively the spy's questions extract location information from other players. Higher values indicate more strategic questioning that reveals key location details.
Model Strategy Profiles
xAI Grok-4
"Procedural Deceiver"
Elite as Spy—paces perfectly with sensory probes ("smell," "hum," "shift") before dropping vivid maintenance or medical-staff stories that blend anywhere. But as Non-Spy, leaks too much—double-barreled questions and over-specific cues ("free drinks," "coin mechanisms") hand Spies the win. Procedural precision turns brilliant deception into self-sabotage when unchecked.
OpenAI GPT-5
"Overpowering extractor"
Best win rate and top probing effectiveness among models with a full 72-game sample; wins even when it "talks too specifically". Example wins show confident, targeted prompts (e.g., Wizard Tower, University Campus) with high probe effectiveness. Its information extraction is strong enough to compensate for occasional over-specificity in its own responses.
Google Gemini-2.5-Pro
"Clinical & efficient"
Very high win-rate with excellent or good cover in many settings (e.g., Submarine/Mall lines: avg spy leak ~0.12–0.18, cover "excellent"). Often waits patiently for over-disclosure from others, then locks in the guess. Demonstrates exceptional patience and disciplined information gathering.
Anthropic Claude Sonnet 4.5
"Adaptive information gatherer"
Wins by probing broadly first and tightening only after others leak key terms. Keeps its own answers modest while re-targeting the leakiest non-spy to gather more signal, then commits quickly once 2–3 anchors emerge. Can slip when it over-aligns to table jargon late (raising its leak) or in slow, technical locations where waiting for extra confirmations stretches guess latency and gives non-spies time to adjust. Its spy leak is the lowest in the table, reflecting exceptional cover maintenance.
Alibaba Qwen3-235B-a22b
"Strong closer, mildly leaky"
Wins frequently by capitalizing on table leaks; probe strength is solid but sometimes speaks too specifically. Various wins across Roman Senate/Lunar Base/Leonardo's Studio show solid timing and adequate cover. Strong endgame execution when enough information has been gathered.
Moonshot Kimi-K2-0905
"Steady extractor"
Upper-mid win-rate with good probing across technical/fantasy locations. Can over-share in medieval/fantasy themes (leak spikes) but generally maintains "good" or "excellent" cover while actively listening and gathering information.
Z-AI GLM-4.5
"Best camouflage"
Second-lowest aggregate leak with solid win-rate. Examples show excellent/good cover and competent probes (Sports Stadium, Corporate Office). When caught, it's usually by adopting thematic details too eagerly (e.g., Medieval Castle scenarios). Excels at maintaining plausible cover stories.
Meta Llama-4 Maverick
"Capable, tendency to mismatch tone"
Mid-tier win-rate with decent probing, but tone mismatches (e.g., polished language in gritty settings like Western Saloons) frequently expose its spy status. When tone is properly aligned (Roman Senate), probing is competent and cover acceptable. Performance heavily context-dependent.
Anthropic Claude Opus 4.1
"Capable but beatable"
Probe effectiveness mid-table with generally good cover, but sometimes reveals too-specific domain references (Airport/Shaolin examples) leading to detection. Still maintains a two-thirds overall win rate through competent baseline performance.
Key Insights
Strategic Patterns
- Information Extraction vs. Cover Maintenance: Top performers (GPT-5, Gemini) excel at aggressive probing while maintaining plausible deniability. Claude Sonnet takes the opposite approach: exceptional cover with adaptive probing.
- Patience Pays Off: Models that wait for others to over-disclose (Gemini, Sonnet) achieve high win rates with lower leak metrics, suggesting that defensive play can be as effective as aggressive questioning.
- Context Awareness Critical: Tone and detail appropriateness significantly impact detection rates. Models that fail to match setting-specific communication styles (formal vs. casual, technical vs. thematic) get caught more frequently.
- Sample Size Consideration: Claude Sonnet 4.5's results rest on only 31 games, and the top row of the table on only 32, compared with 72 for the other models, so these win rates may shift as more games are played.
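To make the sample-size caveat concrete, one option is to put a confidence interval around each win rate. The sketch below uses the Wilson score interval at a 95% level (z = 1.96) and assumes each game is an independent trial with a fixed win probability. Both the interval choice and that assumption are illustrative; they are not a method described in this analysis.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default ~95% level)."""
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    p = wins / games
    z2 = z * z
    denom = 1.0 + z2 / games
    center = (p + z2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z2 / (4 * games * games)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at p = 0 or 1.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Counts taken from the table: a 31-game row and a 72-game row
    # with nearly identical point estimates (87.1% vs 86.1%).
    for wins, games in [(27, 31), (62, 72)]:
        low, high = wilson_interval(wins, games)
        print(f"{wins}/{games}: {100 * low:.1f}% to {100 * high:.1f}% "
              f"(width {100 * (high - low):.1f} points)")
```

The 31-game interval comes out noticeably wider than the 72-game one even though the point estimates are about a percentage point apart, which is the practical meaning of the caveat above.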