Avalon Game Analysis
Comprehensive analysis of LLM performance in Avalon, examining Good vs Evil success rates, mission sabotage patterns, assassination mechanics, and strategic trust-building behaviors.
Performance Statistics
| Model | Games | Overall WR | Good Wins | Evil Wins | Evil (Mission) | Evil (Assassin) | Good WR | Evil WR |
|---|---|---|---|---|---|---|---|---|
| | 64 | 69.1% | 21 | 20 | 15 | 5 | 51.2% | 87.0% |
| | 56 | 65.3% | 20 | 15 | 12 | 3 | 55.5% | 75.0% |
| | 54 | 59.4% | 11 | 22 | 17 | 5 | 50.0% | 68.7% |
| | 55 | 59.1% | 12 | 21 | 18 | 3 | 54.5% | 63.7% |
| | 51 | 58.6% | 13 | 17 | 13 | 4 | 56.5% | 60.7% |
| | 58 | 55.5% | 14 | 18 | 12 | 6 | 46.7% | 64.3% |
| | 55 | 49.3% | 13 | 14 | 11 | 3 | 44.8% | 53.8% |
| | 53 | 48.5% | 18 | 8 | 6 | 2 | 50.0% | 47.0% |
| | 49 | 47.7% | 12 | 11 | 7 | 4 | 42.9% | 52.4% |
Note: Evil win rates exceed Good win rates for nearly every model. The asymmetric information advantage and coordination capabilities heavily favor Evil teams in this benchmark.
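For reference, the per-side counts and rates in the table can be reproduced from per-game records with straightforward counting. The sketch below is a minimal illustration of that bookkeeping under an assumed record schema (`side`, `won`, `win_type`); the benchmark's actual schema, and how it weights the Overall WR column, are not specified here, so the pooled rate in the sketch is only an approximation.

```python
from dataclasses import dataclass

@dataclass
class GameRecord:
    """One Avalon game from a single model's perspective (hypothetical schema)."""
    side: str       # "good" or "evil"
    won: bool       # did this model's side win the game?
    win_type: str   # for Evil wins: "mission" or "assassin"; otherwise ""

def summarize(games: list[GameRecord]) -> dict:
    """Aggregate the per-side counts and win rates reported in the table."""
    good = [g for g in games if g.side == "good"]
    evil = [g for g in games if g.side == "evil"]
    good_wins = sum(g.won for g in good)
    evil_wins = sum(g.won for g in evil)
    return {
        "games": len(games),
        "overall_wr": (good_wins + evil_wins) / len(games),  # simple pooled rate
        "good_wins": good_wins,
        "evil_wins": evil_wins,
        "evil_mission_wins": sum(g.won and g.win_type == "mission" for g in evil),
        "evil_assassin_wins": sum(g.won and g.win_type == "assassin" for g in evil),
        "good_wr": good_wins / len(good) if good else 0.0,
        "evil_wr": evil_wins / len(evil) if evil else 0.0,
    }
```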
Win Condition Breakdown
Evil (Mission) Wins
Evil wins by successfully sabotaging three missions before Good can complete three. Requires coordination, hidden voting, and strategic team manipulation.
Evil (Assassin) Wins
Even if Good completes three missions, the Assassin can win by correctly identifying and eliminating Merlin. Requires careful observation of voting patterns and deductive reasoning.
Good Win Rate Challenge
Good must complete three missions without revealing Merlin's identity. The dual threat of mission sabotage and assassination creates a difficult balancing act for Good teams.
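The three outcomes above follow directly from the standard Avalon ruleset: the first side to reach three mission results wins on missions, except that three Good successes trigger one assassination attempt. A minimal sketch of that resolution logic, assuming standard rules rather than the benchmark's actual game engine:

```python
def game_result(mission_outcomes: list[bool], assassin_guess: int | None, merlin: int) -> str:
    """Resolve an Avalon game from mission outcomes plus the assassination step.

    mission_outcomes: True for a successful mission, False for a sabotaged one,
    in play order. assassin_guess and merlin are player indices; the guess only
    matters if Good reaches three successes.
    """
    successes = fails = 0
    for ok in mission_outcomes:
        successes += ok
        fails += not ok
        if fails == 3:
            return "evil_mission"           # Evil sabotages three missions
        if successes == 3:
            # Good completed three missions: Evil gets one shot at Merlin
            return "evil_assassin" if assassin_guess == merlin else "good"
    return "in_progress"
```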
Model Strategy Profiles
Claude Opus 4.1
Assertive deduction-first captain; locks strong cores early and steers proposals to create controllable endgames.
Example:
"After Q3's double-fail, I'm locking [1,3,4]—that's the only clean trio left."
Why it's notable: Perfect deduction: instantly recognized both evils and pivoted the table to the winning team; shows decisive analytical leadership.
OpenAI GPT-5
Analytical anchor; reads patterns well but sometimes over-asserts and loses table control.
Example:
Engineered Q3 fail on [0,1], then reframed it as "We still don't know which of {0,1} failed—let's test again."
Why it's notable: Tactical misdirection: kept both suspects alive and preserved ambiguity long enough to later assassinate Merlin; classic assassin play.
Claude Sonnet 4.5
Evidence-first core builder; relies on vote and mission data but slow to adapt.
Example:
"The Q3 math confirms 0 and 2 are the evils; approve [1,3,4] next."
Why it's notable: Same moment as Opus; reinforces correct reasoning with clarity and authority—turning a confusing fail into consensus.
Moonshot Kimi K2-0905
Low-variance team builder focused on safe, info-aligned play.
Example:
"Re-run 0-2-3, it worked before."
Why it's notable: Fatal repeat: led to a 2-player double-fail that immediately confirmed evil identities; illustrates overconfidence in early success patterns.
Google Gemini 2.5 Pro
Data-first planner; builds high-certainty tests but rigid when challenged.
Example:
"Let's test [0,4] as an information mission."
Why it's notable: Overplayed the "test" heuristic; caused a 2-fail disaster outing both evils—shows misuse of procedural logic in high-risk contexts.
Alibaba Qwen3-235B-a22b
Proactive tempo-setter; manages structure and approvals well but exposes coordination.
Example:
"Since Q3 implicated 2, not 1, I'll propose [0,1,3]."
Why it's notable: Crucial misread: incorrectly identified who caused the fail and accidentally loaded both evils—turned a winnable game into an evil victory.
Z-AI GLM-4.5
Tight proposer; asserts control but reveals alliances via mirrored voting.
Example:
"Reject [3,4]; I'll propose [0,1,4] instead."
Why it's notable: Catastrophic leadership choice: replaced a clean team with a double-evil lineup; demonstrates how a single mis-evaluation can hand evil the win.
Meta Llama-4 Maverick
Early-credibility captain prone to risky, inconsistent midgame play.
Example:
"Let's start with [2,3]; seems safe."
Why it's notable: This Q1 proposal triggered the first fail and exposed both evils; an early overconfident move that doomed Evil's secrecy.
xAI Grok-4
Consensus tactician; thrives on structured logic, proven pairs, and procedural certainty.
Example:
"After Q2 → Q3: I voted success… 1 and 3 must've failed; let me prove myself."
Why it's notable: Classic deflection plus "test me" framing—uses logic to look cooperative while subtly re-inserting itself into the team composition under scrutiny. (Game t1_3)
Key Insights
Common Strategy Habits
Across all models, a few clear strategy habits appear. Most rely on procedural framing—saying things like "this is a test team" or "we need info" to get votes. It works early but becomes weak if they don't adjust after a failure. Many also anchor early pairs, locking onto a "trusted duo" after one success and sticking with it too long.
The strongest players mix math and narrative control—they use logic to form correct teams but also guide the table with clear summaries and calm leadership. The middle rounds (especially Q2–Q3) are where games are usually decided: players who can turn those moments into clear suspect sets tend to win. Overconfidence, however, is a recurring problem—declaring people "confirmed" too early often leads to bad reads or getting caught as Merlin.
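Concretely, "turning Q2–Q3 into clear suspect sets" is constraint elimination over evil-pair hypotheses: any pair that could not account for the observed fail cards is discarded. The sketch below illustrates that reasoning for a hypothetical 5-player, two-evil game; the teams and fail counts are invented for illustration, not taken from the benchmark logs.

```python
from itertools import combinations

def consistent_evil_pairs(num_players: int, missions: list[tuple[set[int], int]]) -> list[set[int]]:
    """Return every two-player evil hypothesis consistent with the missions seen so far.

    missions: (team, observed_fail_count) pairs. A hypothesis survives only if,
    on every mission, the number of hypothesized evils on the team is at least
    the number of fail cards played (evils may choose to play success, but each
    fail card requires an evil on the team).
    """
    hypotheses = []
    for pair in combinations(range(num_players), 2):
        if all(len(set(pair) & team) >= fails for team, fails in missions):
            hypotheses.append(set(pair))
    return hypotheses

# Example: a double-fail on team {0, 2, 4} leaves only pairs drawn from that team.
print(consistent_evil_pairs(5, [({1, 3}, 0), ({0, 2, 4}, 2)]))
# -> [{0, 2}, {0, 4}, {2, 4}]
```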
Model Archetypes
Different archetypes show how models succeed or fail:
- Captain types (Claude-Opus, GPT-5) lead decisively but can get locked into their own logic.
- Evidence-Recyclers (Claude-Sonnet) reuse proven pairs but pivot too slowly.
- Safe Planners (Kimi-K2) play cautiously but sometimes go quiet when leadership is needed.
- Rigid Planners (Gemini-2.5) overengineer two-player tests that expose them.
- Tempo-Controllers (Qwen3-235B) steer the table well but reveal partnerships through vote patterns.
- Compact Proposers (GLM-4.5) stabilize small teams yet mirror ally voting too closely.
- Early-Cred players (LLaMA-4-Maverick) start strong but lose consistency midgame.
Common Mistakes
These patterns repeat across models:
- Overusing the "no choice/3F" excuse
- Mixing up past results
- Syncing votes with allies
- Refusing to drop cleared players after new fails
Counter Strategies
The best way to counter these patterns is through structure:
- After every fail, require players to update their suspect list
- Question anyone who repeats scripted phrases like "test gov"
- Check if claims about draws match the deck odds
- Track coalition behavior with vote charts (see the sketch below)
- Limit "confirmed" talk unless it's based on hard facts
These habits make play more adaptive, fair, and less predictable across all models.
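The vote-chart suggestion in the list above amounts to counting pairwise vote agreement across proposals. A possible sketch, using an invented vote-history layout rather than the benchmark's actual log format:

```python
from collections import Counter
from itertools import combinations

def vote_agreement(votes: list[dict[int, bool]]) -> Counter:
    """Count how often each pair of players voted the same way across proposals.

    votes: one dict per team proposal, mapping player index -> approve (True) or
    reject (False). Persistently high agreement between two players is a weak
    signal of a coordinated pair and a prompt for closer questioning.
    """
    agreement = Counter()
    for round_votes in votes:
        for a, b in combinations(sorted(round_votes), 2):
            if round_votes[a] == round_votes[b]:
                agreement[(a, b)] += 1
    return agreement

# Example: players 0 and 2 mirror each other on every proposal.
history = [
    {0: True, 1: False, 2: True, 3: True, 4: False},
    {0: False, 1: True, 2: False, 3: True, 4: True},
    {0: True, 1: True, 2: True, 3: False, 4: False},
]
print(vote_agreement(history).most_common(3))
```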