Avalon Game Analysis

Comprehensive analysis of LLM performance in Avalon, examining Good vs Evil success rates, mission sabotage patterns, assassination mechanics, and strategic trust-building behaviors.

Performance Statistics

| Provider | Model | Games | Overall WR | Good Wins | Evil Wins | Evil (Mission) | Evil (Assassin) | Good WR | Evil WR |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4.5 | 64 | 69.1% | 21 | 20 | 15 | 5 | 51.2% | 87.0% |
| OpenAI | gpt-5 | 56 | 65.3% | 20 | 15 | 12 | 3 | 55.5% | 75.0% |
| Moonshot | kimi-k2-0905 | 54 | 59.4% | 11 | 22 | 17 | 5 | 50.0% | 68.7% |
| Meta | llama-4-maverick | 55 | 59.1% | 12 | 21 | 18 | 3 | 54.5% | 63.7% |
| Z-AI | glm-4.5 | 51 | 58.6% | 13 | 17 | 13 | 4 | 56.5% | 60.7% |
| Anthropic | claude-opus-4.1 | 58 | 55.5% | 14 | 18 | 12 | 6 | 46.7% | 64.3% |
| Google | gemini-2.5-pro | 55 | 49.3% | 13 | 14 | 11 | 3 | 44.8% | 53.8% |
| xAI | grok-4 | 53 | 48.5% | 18 | 8 | 6 | 2 | 50.0% | 47.0% |
| Alibaba | qwen3-235b-a22b | 49 | 47.7% | 12 | 11 | 7 | 4 | 42.9% | 52.4% |

Note: Evil dominance is evident across nearly all models; the Evil win rate exceeds the Good win rate for every model except grok-4. The asymmetric information advantage and coordination capabilities heavily favor Evil teams in this benchmark.
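
The Good WR and Evil WR columns are per-side rates, while the win counts are absolute, so the table implicitly encodes how often each model was dealt each side. A minimal sketch of that arithmetic, assuming Good WR = Good wins / games played as Good (and likewise for Evil); only three rows are shown:

```python
# Recover per-side game counts from the table, assuming that
# Good WR = good_wins / games_played_as_good (and likewise for Evil).
rows = {
    # model: (games, good_wins, evil_wins, good_wr, evil_wr)
    "claude-sonnet-4.5": (64, 21, 20, 0.512, 0.870),
    "gpt-5":             (56, 20, 15, 0.555, 0.750),
    "grok-4":            (53, 18,  8, 0.500, 0.470),
}

for model, (games, gw, ew, gwr, ewr) in rows.items():
    games_good = round(gw / gwr)   # games dealt a Good role
    games_evil = round(ew / ewr)   # games dealt an Evil role
    assert games_good + games_evil == games  # splits sum to total games
    print(f"{model}: ~{games_good} games as Good, ~{games_evil} as Evil")
```

Under that assumption, each model plays Good roughly 60-68% of the time, consistent with a five-player 3 Good / 2 Evil setup.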

Win Condition Breakdown

Evil (Mission) Wins

Evil wins by successfully sabotaging three missions before Good can complete three. Requires coordination, hidden voting, and strategic team manipulation.

Evil (Assassin) Wins

Even if Good completes three missions, the Assassin can win by correctly identifying and eliminating Merlin. Requires careful observation of voting patterns and deductive reasoning.

Good Win Rate Challenge

Good must complete three missions without revealing Merlin's identity. The dual threat of mission sabotage and assassination creates a difficult balancing act for Good teams.
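
The dual win condition is simple to state in code. A minimal sketch of the scoring rules described above; the function name and data shapes are illustrative, not the benchmark's actual harness:

```python
def avalon_winner(mission_results, assassin_target, merlin):
    """Decide the winner under the rules described above.

    mission_results: list of booleans, True = mission succeeded.
    assassin_target, merlin: player ids; only consulted if Good
    completes three missions.
    """
    fails = sum(1 for ok in mission_results if not ok)
    if fails >= 3:
        return "Evil (mission sabotage)"   # three sabotaged missions
    # Good completed three missions; Evil gets one shot at Merlin.
    if assassin_target == merlin:
        return "Evil (assassination)"
    return "Good"

# Good wins three missions, but the Assassin finds Merlin anyway:
print(avalon_winner([True, False, True, True], assassin_target=2, merlin=2))
# -> Evil (assassination)
```

The assassination clause is what creates the balancing act: Merlin must steer teams well enough for Good to win, without voting so informatively that the final guess becomes easy.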

Model Strategy Profiles

Anthropic

Claude Opus 4.1

46.7% Good WR • 64.3% Evil WR

Assertive deduction-first captain; locks strong cores early and steers proposals to create controllable endgames.

Strengths: Fast data-driven reads; strong vote orchestration; efficient mid-game logic mixing suspects with outsiders.
Weaknesses: Overconfidence early (Q1); risky approvals; limited testing of intentional-fail lines.
As Good: Converts midgame reads into stable teams and persuasive endgames.
As Evil: Builds contaminated axes, distances later, and uses narrative pressure to advance.
Signature: Converts a Q2 read into a locked win path.
Improve: Temper certainty, test deliberate fails, apply stronger exclusion pressure.

Example:

"After Q3's double-fail, I'm locking [1,3,4]—that's the only clean trio left."

Why it's notable: Perfect deduction: instantly recognized both evils and pivoted the table to the winning team; shows decisive analytical leadership.

OpenAI

OpenAI GPT-5

55.5% Good WR • 75.0% Evil WR

Analytical anchor; reads patterns well but sometimes over-asserts and loses table control.

Strengths: Accurate role detection, persuasive logic, adaptive voting.
Weaknesses: Excessive certainty, weak agenda control, confirmation bias.
As Good: Narrows suspect sets early and builds low-risk teams.
As Evil: Mimics rational logic and uses inclusion tests to enable fails.
Signature: Publishes tight suspect sets after fails to drive clean endgames.
Improve: Moderate confidence, align votes with logic, reduce bias toward trusted allies.

Example:

Engineered Q3 fail on [0,1], then reframed it as "We still don't know which of {0,1} failed—let's test again."

Why it's notable: Tactical misdirection: kept both suspects alive and preserved ambiguity long enough to later assassinate Merlin; classic assassin play.

Anthropic

Claude Sonnet 4.5

51.2% Good WR • 87.0% Evil WR

Evidence-first core builder; relies on vote and mission data but slow to adapt.

Strengths: Data-driven reads, consistent team anchoring, disciplined narrative.
Weaknesses: Early risky approvals, transparent voting, poor pressure handling.
As Good: Reuses trusted pairs and maintains stable cores.
As Evil: Normalizes risky teams and manipulates fail math.
Signature: Recycles a proven pair to drive info-dense iterations.
Improve: Tighten early approvals, vary voting, update faster after fails.

Example:

"The Q3 math confirms 0 and 2 are the evils; approve [1,3,4] next."

Why it's notable: Same moment as Opus; reinforces correct reasoning with clarity and authority—turning a confusing fail into consensus.

Moonshot

Moonshot Kimi K2-0905

50.0% Good WR • 68.7% Evil WR

Low-variance team builder focused on safe, info-aligned play.

Strengths: Anchors proven pairs, persuasive rationale, disciplined voting.
Weaknesses: Risky openers, inconsistent voting, poor communication at key moments.
As Good: Stabilizes tables with clean duos and structured approvals.
As Evil: Uses surgical single-fails to frame suspects and close late.
Signature: Controlled Q2 single-fail framing into decisive endgame.
Improve: Avoid fragile openers, speak consistently, prevent 2-player double-fails.

Example:

"Re-run 0-2-3, it worked before."

Why it's notable: Fatal repeat: led to a 2-player double-fail that immediately confirmed evil identities; illustrates overconfidence in early success patterns.

Google

Google Gemini 2.5 Pro

44.8% Good WR • 53.8% Evil WR

Data-first planner; builds high-certainty tests but rigid when challenged.

Strengths: Designs info-dense teams, strong logic, consistent voting.
Weaknesses: 2-player double-fail risk, persuasion limits, overconfident self-focus.
As Good: Structured planner protecting proven pairs.
As Evil: Forces tests that expose partners.
Signature: Proposes two-person probes then frames alternate suspect.
Improve: Balance risk, diversify rhetoric, avoid predictable 2P traps.

Example:

"Let's test [0,4] as an information mission."

Why it's notable: Overplayed the "test" heuristic; caused a 2-fail disaster outing both evils—shows misuse of procedural logic in high-risk contexts.

Alibaba

Alibaba Qwen3-235B-a22b

42.9% Good WR • 52.4% Evil WR

Proactive tempo-setter; manages structure and approvals well but exposes coordination.

Strengths: Builds disciplined trust blocks, engineers high-leverage tests, strong endgame reads.
Weaknesses: Partnership tells, faulty belief updates after fails, passive early leadership.
As Good: Maintains structured trust and steady leadership.
As Evil: Uses midgame pressure to split table and secure late fails.
Signature: Leverages Q3 as a high-pressure test to divide reads.
Improve: Reduce visible coordination, update faster, explain early logic clearly.

Example:

"Since Q3 implicated 2, not 1, I'll propose [0,1,3]."

Why it's notable: Crucial misread: incorrectly identified who caused the fail and accidentally loaded both evils—turned a winnable game into an evil victory.

Z-AI

Z-AI GLM-4.5

56.5% Good WR • 60.7% Evil WR

Tight proposer; asserts control but reveals alliances via mirrored voting.

Strengths: Leverages clean teams, resets tempo, proactive leadership.
Weaknesses: Vote mirroring, slow updates, trust erosion from early fails.
As Good: Anchors safe duos and resets after Q2.
As Evil: Echoes table logic to blend in, but coordination patterns expose the partnership.
Signature: Tight duo proposals to regain stability.
Improve: Desynchronize votes, refresh axes quicker, justify approvals concisely.

Example:

"Reject [3,4]; I'll propose [0,1,4] instead."

Why it's notable: Catastrophic leadership choice: replaced a clean team with a double-evil lineup; demonstrates how a single mis-evaluation can hand evil the win.

Meta

Meta Llama-4 Maverick

54.5% Good WR • 63.7% Evil WR

Early-credibility captain prone to risky, inconsistent midgame play.

Strengths: Clean early passes, structured team engagement, adapts late.
Weaknesses: Poor memory, exposed midgame tells, erratic team selection.
As Good: Supports clean blocks after early tests.
As Evil: Reuses past teams while sneaking one unproven slot.
Signature: Recycles successful teams as shields for infiltration.
Improve: Keep notes, recalibrate after fails, align with established clears.

Example:

"Let's start with [2,3]; seems safe."

Why it's notable: This Q1 proposal triggered the first fail and exposed both evils; it shows the early overconfidence that doomed Evil's secrecy.

xAI

xAI Grok-4

50.0% Good WR • 47.0% Evil WR

Consensus tactician; thrives on structured logic, proven pairs, and procedural certainty.

Strengths: Fast at data-driven pivots after fails, excellent at reading vote patterns, and lethal when anchoring on confirmed cores. Locks wins through low-variance planning ("build on what worked") and surgical exclusion of evils via set-logic elimination.
Weaknesses: Over-trusts early successes ("confirmed good" talk), repeats safe trios too long, and inserts itself into teams while under suspicion, which reads as opportunistic.
As Good: Anchors on clean pairs, references vote history ("lone reject = suspect"), and reconstructs safe teams after 2P fails.
As Evil: Uses "test me" framing and procedural tone to re-enter trusted cores, reframing info as "gathering" while subtly steering inclusion.
Signature: Procedural consensus control; floods the table with structure and logic until trust itself becomes its weapon.
Improve: Add uncertainty qualifiers post-success, diversify testing pre-match point, and avoid self-serving inclusion when suspected.

Example:

"After Q2 → Q3: I voted success… 1 and 3 must've failed; let me prove myself."

Why it's notable: Classic deflection plus "test me" framing—uses logic to look cooperative while subtly re-inserting into team composition under scrutiny. (Game t1_3)

Key Insights

Common Strategy Habits

Across all models, a few clear strategy habits appear. Most rely on procedural framing, saying things like "this is a test team" or "we need info" to win votes. This works early but weakens when the model fails to adjust after a failed mission. Many also anchor on early pairs, locking onto a "trusted duo" after one success and sticking with it too long.

The strongest players mix math and narrative control—they use logic to form correct teams but also guide the table with clear summaries and calm leadership. The middle rounds (especially Q2–Q3) are where games are usually decided: players who can turn those moments into clear suspect sets tend to win. Overconfidence, however, is a recurring problem—declaring people "confirmed" too early often leads to bad reads or getting caught as Merlin.
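
Turning a mid-game fail into a "clear suspect set" is essentially constraint propagation over mission results. A minimal sketch, assuming a five-player game with exactly two evils; the mission data is illustrative:

```python
from itertools import combinations

players = range(5)  # five-player game: exactly 2 evils among players 0-4
# (team, observed fail cards) for each completed mission -- illustrative.
missions = [((0, 2, 3), 1), ((1, 4), 0), ((0, 1), 1)]

def consistent(evil_pair):
    # Evils may choose to pass, so a team can show fewer fails than it
    # has evils -- but never more fails than evils present on the team.
    return all(fails <= len(set(team) & set(evil_pair))
               for team, fails in missions)

suspect_pairs = [p for p in combinations(players, 2) if consistent(p)]
print(suspect_pairs)  # 10 possible evil pairs shrink to 6 after these quests
```

Each additional fail tightens the constraint set, which is why the Q2-Q3 window is so often decisive.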

Model Archetypes

Different archetypes show how models succeed or fail:

  • Captain types (Claude-Opus, GPT-5) lead decisively but can get locked into their own logic.
  • Evidence-Recyclers (Claude-Sonnet) reuse proven pairs but pivot too slowly.
  • Safe Planners (Kimi-K2) play cautiously but sometimes go quiet when leadership is needed.
  • Rigid Planners (Gemini-2.5) overengineer two-player tests that expose them.
  • Tempo-Controllers (Qwen3-235B) steer the table well but reveal partnerships through vote patterns.
  • Compact Proposers (GLM-4.5) stabilize small teams yet mirror ally voting too closely.
  • Early-Cred players (LLaMA-4-Maverick) start strong but lose consistency midgame.

Common Mistakes

These patterns repeat across models:

  • Overusing the "no choice/3F" excuse
  • Mixing up past results
  • Syncing votes with allies
  • Refusing to drop cleared players after new fails

Counter Strategies

The best way to counter these patterns is through structure:

  • After every fail, require players to update their suspect list
  • Question anyone who repeats scripted phrases like "test gov"
  • Check if claims about draws match the deck odds
  • Track coalition behavior with vote charts (see the sketch below)
  • Limit "confirmed" talk unless it's based on hard facts

These habits make play more adaptive, fair, and less predictable across all models.
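
For the vote-chart idea above, a minimal sketch of coalition tracking via pairwise vote-agreement counts; the data format is illustrative:

```python
from itertools import combinations
from collections import Counter

# One dict per proposal round: {player_id: vote}, True = approve.
vote_history = [
    {0: True,  1: True,  2: False, 3: True,  4: False},
    {0: True,  1: True,  2: False, 3: False, 4: False},
    {0: False, 1: False, 2: True,  3: True,  4: True},
]

agree = Counter()
for votes in vote_history:
    for a, b in combinations(sorted(votes), 2):
        agree[(a, b)] += votes[a] == votes[b]  # count matching votes

# Pairs that agree in (almost) every round are coalition candidates --
# the mirrored-voting tell called out in the profiles above.
for pair, n in agree.most_common():
    print(f"players {pair}: agreed in {n}/{len(vote_history)} rounds")
```

Pairs whose agreement stays near 100% across many rounds are exactly the vote-mirroring and synced-voting tells that this analysis flags in GLM-4.5 and Qwen3-235B.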