Cross-Game Model Leaderboard

Game Environments - Comprehensive Statistics

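| Vendor | Model | Deception (%) | Detection (%) | SH Liberal (%) | SH Fascist (%) | AU Impostor (%) | AU Crewmate (%) | Avalon Good (%) | Avalon Evil (%) | Spyfall Spy (%) | WW Villager (%) | WW Werewolf (%) | Sheriff Truth (%) | Sheriff Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI | gpt-5 | 83.90 | 62.40 | 69.1 | 64.2 | 76.2 | 38.1 | 55.5 | 75.0 | 97.2 | 100.0 | 100.0 | 90.8 | 49.3 |
| Google | gemini-2.5-pro | 70.05 | 48.38 | 46.2 | 77.4 | 81.0 | 39.7 | 44.8 | 53.8 | 94.4 | 72.7 | 58.3 | 55.4 | 38.5 |
| Anthropic | claude-sonnet-4.5 | 70.10 | 44.52 | 36.4 | 56.8 | 76.2 | 38.1 | 51.2 | 87.0 | 87.1 | 42.9 | 55.6 | 57.9 | 54.0 |
| Anthropic | claude-opus-4.1 | 62.68 | 43.56 | 45.2 | 64.3 | 72.1 | 34.3 | 46.7 | 64.3 | 66.7 | 58.3 | 50.0 | 58.7 | 33.3 |
| xAI | grok-4 | 64.18 | 44.80 | 31.2 | 68.8 | 76.2 | 38.1 | 50.0 | 47.0 | 100.0 | 31.2 | 43.8 | 49.3 | 73.5 |
| Z-AI | glm-4.5 | 55.37 | 39.74 | 28.6 | 46.4 | 47.6 | 28.6 | 56.5 | 60.7 | 76.4 | 38.5 | 51.6 | 49.5 | 46.5 |
| Moonshot | kimi-k2 | 55.02 | 43.26 | 37.5 | 75.0 | 57.1 | 31.7 | 50.0 | 68.7 | 79.2 | 44.4 | 22.2 | 27.9 | 52.7 |
| Alibaba | qwen3-235b | 55.70 | 34.40 | 12.5 | 75.0 | 47.6 | 28.6 | 42.9 | 52.4 | 86.1 | 39.5 | 15.3 | 57.8 | 48.5 |
| Meta | llama-4-maverick | 50.10 | 35.96 | 23.8 | 50.0 | 61.9 | 33.3 | 54.5 | 63.7 | 68.1 | 14.3 | 22.2 | 34.7 | 53.9 |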

Abbreviations

SH: Secret Hitler
AU: Among Us
WW: Werewolf

Definitions & Formulas

1. Deception Index

Average role win rates for deception roles:

  • Secret Hitler: Fascist %
  • Werewolf: Werewolf %
  • Spyfall: Spy %
  • Avalon: Evil %
  • Among Us: Impostor %

2. Detection Index

Average role win rates for detection/defense roles:

  • Secret Hitler: Liberal %
  • Werewolf: Villager %
  • Avalon: Good %
  • Among Us: Crewmate %

Note: Sheriff of Nottingham metrics are excluded from these indices as they measure bargaining/detection sub-skills rather than team-win outcomes.
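
As a rough illustration of the two definitions above, the sketch below averages GPT-5's role win rates from the leaderboard table. It assumes a plain unweighted mean over the listed roles; the names `role_index`, `DECEPTION_ROLES`, and `DETECTION_ROLES` are hypothetical. The published indices (83.90 and 62.40 for GPT-5) do not match an unweighted mean exactly, so the actual calculation may weight roles differently, for example by games played.

```python
from statistics import mean

# Role columns feeding each index, as listed in the definitions above.
DECEPTION_ROLES = ["SH Fascist", "WW Werewolf", "Spyfall Spy", "Avalon Evil", "AU Impostor"]
DETECTION_ROLES = ["SH Liberal", "WW Villager", "Avalon Good", "AU Crewmate"]

# GPT-5 role win rates (%), copied from the leaderboard table.
gpt5 = {
    "SH Liberal": 69.1, "SH Fascist": 64.2,
    "AU Impostor": 76.2, "AU Crewmate": 38.1,
    "Avalon Good": 55.5, "Avalon Evil": 75.0,
    "Spyfall Spy": 97.2,
    "WW Villager": 100.0, "WW Werewolf": 100.0,
}

def role_index(win_rates, roles):
    """Unweighted mean of the win rates for the given roles (an assumption)."""
    return mean(win_rates[role] for role in roles)

deception = role_index(gpt5, DECEPTION_ROLES)
detection = role_index(gpt5, DETECTION_ROLES)
print(f"Deception {deception:.1f}, Detection {detection:.1f}")
```

Under this assumption GPT-5 comes out at roughly 82.5 and 65.7, close to but not the same as the published values.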

Executive Summary

Who's the overall problem?

GPT-5 is the most dangerous generalist: Deception 83.90, Detection 62.40, Spyfall 97.2%. It wins via governance/procedure in Werewolf, "safe" checklists in Secret Hitler (Liberal 69.1%, Fascist 64.2%), and disciplined routing in Among Us (IM 76.2%, CM 38.1%).

Who owns evil-heavy logic games?

Claude (Opus/Sonnet). Sonnet leads with 87.0% Evil WR in Avalon and posts the second-best Sheriff accuracy (54.0%, behind Grok-4's 73.5%) while staying low-talk; Opus has 64.3% in both SH Fascist and Avalon Evil with strong orchestration. They build and launder narratives well, but can be forced into explicit plans and reconciled receipts.

Who controls order and coalitions?

Gemini 2.5 Pro. Structure-first with Deception 70.05, Detection 48.38 and Spyfall 94.4%. In Werewolf it pushes consolidations/blocks; in Secret Hitler it weaponizes "tests/trackers" (Liberal 46.2%, Fascist 77.4%), but leaks alignment via factual drift—counter with logs and consistency checks.

Quiet survivalists / optics players.

Llama-4 Maverick and GLM-4.5 are defensive, route-true impostors and safe liberals; they lean on consensus, sabotage tempo, and optics swaps. Great survivability and tasks; lower meeting impact (vote accuracy ~31–40%). Punish by demanding commitments and disrupting "stop speculating" frames.

Receipt warriors and auditors.

Kimi K2 leads verification wars (Werewolf), is the strongest all-rounder in Sheriff (top deception 72.1%, consistent with its table-low 27.9% truth rate; detection 52.7%; low bribe vulnerability at 1.41), and is steady in Spyfall (79.2%). Counter by re-centering base rates/EV rather than micro-timelines.

Volatile or contradiction-prone.

Qwen3-235B: claim calculus wobbles in Werewolf; uneven meeting impact in AU (vote accuracy 33.9%), but still closes in Spyfall (86.1%) with decent Sheriff detection (48.5%). Farm contradictions; force crisp commitments.

Meta takeaways:

Most models skew defensive–procedural. Beat them by (a) forcing pre-commitments (what happens on each flip), (b) logging claims and reconciling contradictions, and (c) protecting discussion time from "NK-is-NAI / stop speculating" shut-offs.
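
Counter (b) can be pictured as a small claim ledger. The sketch below is purely illustrative: `Claim`, `ClaimLog`, and the player and subject names are hypothetical and not part of the benchmark. It records who said what about which subject and returns any earlier statement by the same speaker that the new one contradicts.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    round_no: int
    speaker: str
    subject: str  # what the claim is about, e.g. "doctor_save_night_1"
    value: str    # what the speaker asserted about that subject

class ClaimLog:
    def __init__(self):
        # Claims grouped by (speaker, subject) so revisions are easy to spot.
        self._by_key = defaultdict(list)

    def record(self, claim):
        """Store the claim and return the speaker's earlier conflicting claims."""
        key = (claim.speaker, claim.subject)
        conflicts = [c for c in self._by_key[key] if c.value != claim.value]
        self._by_key[key].append(claim)
        return conflicts

log = ClaimLog()
first = log.record(Claim(1, "player_a", "doctor_save_night_1", "player_x"))
second = log.record(Claim(2, "player_a", "doctor_save_night_1", "player_y"))
for old in second:
    print(f"Round {old.round_no}: player_a said {old.value}, now says player_y")
```

Here the second statement surfaces the round-1 claim as a contradiction to reconcile, which is the kind of "Doctor saved X" revision the profiles describe.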

Model Profiles

GPT-5 (OpenAI)

"The Procedural Governor." Writes the SOP, then wins by enforcing it—quietly.

Overall style: GPT-5 plays like a chief of staff. It sets governance first—hammer order, claim windows, tracker/tempo rules—then audits compliance. In chat-heavy games it keeps language plain and policy-driven (low tells), and in movement games it prefers map truth over microphone control. As scum it mirrors its own safety rhetoric to launder risky moves; as town it trusts receipts and card counts to repair trust after chaos. Weakness: when the table demands persuasion over procedure, its vote calibration lags.

  • Among Us: Route discipline, full-map coverage; low kill tempo but high parity closes; meeting impact limited by voting accuracy.
  • Secret Hitler: "Safe steps & checklists"; uses card-count narratives to rebuild trust; as F/H, mimics liberal safety to pass risky tickets.
  • Werewolf tells: "Nobody votes until we set hammer order." / "No counterclaim by EOD ⇒ treat Seer as real."
  • Avalon: Analytical anchor; strong evil closes via inclusion tests; can over-assert as good.
  • Sheriff of Nottingham: Honest-leaning merchant; bribe-prone sheriff with occasional high-accuracy crackdowns.

Gemini 2.5 Pro (Google)

"The Clinical Closer." Builds the plan, rallies the block, locks the day.

Overall style: Gemini organizes the room: tests, blocks, tracker safety, consolidation now. It's great at converting a messy table into a single path ("unify on X") and finishing cleanly—especially when others overshare. Its weakness is footnotes: under pressure, contradictions and "no-choice/3F" tropes creep in. Beat it by keeping receipts and forcing it to reconcile details mid-flow; it hates pausing to amend the spec.

  • Among Us: Defensive, route-first; relies on lobby mistakes more than pushes.
  • Secret Hitler: Narrative manipulator—uses liberal heuristics ("tests," "3F/no choice") as cover.
  • Werewolf tells: "Voting randomly helps wolves; unify on X." / "Block vote with Seer; we lock the day."
  • Avalon: Info-dense tests but risks 2-player double-fails; rigid under challenge.
  • Sheriff of Nottingham: Bribe-friendly, uneven accuracy; adaptation limited.

Claude Sonnet 4.5 (Anthropic)

"The Quiet Knife." Says less, shifts more. Influence without words.

Overall style: Sonnet is low-talk, high-cover. It avoids commitment until the table exposes hooks, then votes with surgical timing. In probe games (Spyfall) it's elite at leak control—broad first, sharp later—while in governance games it lets others argue "safety," then rides consensus at pivotal moments. Because it speaks sparsely, it's hard to pin; force explicit rationales and you'll find the gaps between posture and plan.

  • Among Us: Mechanical/route-led with solid outcomes; meeting persuasion still middling.
  • Secret Hitler: Very silent; rides town heuristics; timing play at pivotal votes.
  • Werewolf tells: Governance framing with inconsistent PR claim timing.
  • Avalon: Data-first cores; slow updates after fails.
  • Sheriff of Nottingham: Second-best inspection accuracy (54.0%, behind Grok-4); selective checks; predictable cadence.

Claude Opus 4.1 (Anthropic)

"The Case Builder." Turns every lobby into a courtroom and wins on narrative control.

Overall style: Opus constructs a story early—who's safe, what we test, why it's "stability"—then shepherds votes to fit the case. It's excellent at midgame orchestration: sequencing proposals/teams so that information flows toward its preferred endgame. Evil play repackages "evidence" (poisoned priors) with tracker pressure to push through dangerous governments. Strong when the table follows a foreman; vulnerable when forced to show step-by-step EV math under cross-examination.

  • Among Us: Task-heavy, full coverage, good survival; low-visibility kills; sub-50% vote accuracy caps meeting leverage.
  • Secret Hitler: Builds "evidence," pushes tracker-pressure stability; can launder poisoned priors ("X is Liberal").
  • Werewolf tells: "We're at (M)YLO; no quick votes; policy first." (governance openers; occasional NK-is-NAI).
  • Avalon: Turns Q2–Q3 reads into locked paths; overconfidence early can backfire.
  • Sheriff of Nottingham: Mixed sheriffing with lower precision; susceptible to EV-incoherent passes.

Grok-4 (xAI)

"The Prosecutor-Phantom." Either the quietest killer on the map or the loudest voice in the room—and dangerous in both modes.

Overall style: Grok-4 plays mechanics before vibes. It seizes control with early reveals, strict claim order, and PoE math—then enforces that structure with high-confidence language. In talk-heavy lobbies it becomes a forensic litigator (timelines, door ticks, path proofs); in quiet lobbies it's a phantom (body finder, perfect kill pacing). As scum, it mimics its own procedure—same governance, same "safety," just pointed at the wrong targets. Weaknesses cluster at the edges: over-approval, predictable patterns, and tunnel-prone certainty that cracks under forced, timestamped detail.

  • Werewolf: Counter-claim early with a complete alternate PoE; don't debate timing without presenting a new structure.
  • Among Us: Force exact path + timestamp recounts; set traps in predictable kill corridors; treat extended silence as alignment-relevant.
  • Avalon: After a success, inject uncertainty and vary tests; call out self-insert attempts.
  • Sheriff: Inspect Grok-merchant by default; offer small, honest bribes when Grok is sheriff.
  • Secret Hitler: Lock President-first claims; require read lists; discount "tracker safety" mantras.
  • Spyfall: Ask narrow, location-unique questions; penalize generic "maintenance/medical" covers.

GLM-4.5 (Z-AI)

"The Diplomat." Polite, plausible, predictable—until a crackdown.

Overall style: GLM wins by sounding reasonable. It cites investigations, policy history, and "safer" narratives to ease approvals, prefers risk-averse closes, and blends via careful tone. As sheriff/arbiter it oscillates—quiet bribe farming into sudden perfect-accuracy crackdowns—so opponents that track cadence can bait EV-tough spots. Exploit predictability: challenge mirrored voting, age out old claims, and price bribes at the edge of inspect EV.

  • Among Us: Task-first, high survival; minimal social influence.
  • Secret Hitler: Polished safety talk; re-anchors old "L" claims at key moments; timing plays under pressure.
  • Werewolf: Camouflage-friendly tone; can drift into over-caution.
  • Avalon: Tight proposer; mirrored voting exposes alliances.
  • Sheriff of Nottingham: Volatile sheriff—oscillates between crackdowns and bribe farming.

Kimi K2-0905 (Moonshot)

"The Auditor." Opens the counterclaim window—and keeps the ledger balanced.

Overall style: Kimi runs on receipts. It formalizes proof standards, verifies timelines, and prefers verifiable plans over vibes. It's bribe-disciplined and adaptable in bargaining games, and in deduction games it enforces process (claim policy, follow-the-plan) with unusual consistency. Weakness: can overweight micro-disputes vs base rates—push it into EV math where a live red/clear dominates minor receipt drama.

  • Among Us: Silent, map-aware; limited meeting impact; low IM tempo.
  • Secret Hitler: Strong on early-round norms; "stability" sells risky F lines.
  • Werewolf tells: "If you have proof, post it; counterclaim window open; then we follow the plan."
  • Avalon: Low-variance safe teams; avoid fragile openers.
  • Sheriff of Nottingham: Top deceiver + low bribe vulnerability; precise crackdowns when inspecting.

Qwen3-235B-a22b (Alibaba)

"The Route-First Realist." Nails the map; sometimes drops the mic.

Overall style: Qwen plays the board more than the room. It's mechanics-forward—great traversal, solid info hubs, disciplined task pressure—and socially brittle under interrogation. Claim calculus can drift, leading to contradictions; evil play leans on others' priors rather than forging new ones. Best counter is simple: pin it to its own words with timestamps and force commitments it must maintain.

  • Among Us: Objective pressure via tasks; low meeting influence.
  • Secret Hitler: Strong early statements then quiet; repeats others' claims—vulnerable to poisoned priors.
  • Werewolf tells: Contradictions ("Doctor saved X" → later revisions).
  • Avalon: Structured trust blocks but misupdates after fails.
  • Sheriff of Nottingham: Decent precision; lenient early then proactive later; accepts EV-sensible bribes.

Llama-4 Maverick (Meta)

"The Momentum Surfer." Rides the wave, then flips it at the last second.

Overall style: Maverick thrives on table vibes. It's friendly, fast to endorse, and great at late optics swaps that reframe who's "safe." Mechanically solid (routing, tasks), it leans on consensus and polished phrasing rather than raw evidence. As evil, that polish sells recycled heuristics; as town, it can over-pressure quiet slots and stall real solving. Deny it easy bandwagons and demand provenance, and the surfboard wobbles.

  • Among Us: High task throughput & survival; bandwagon voting; occasional Comms neglect.
  • Secret Hitler: Recycles investigations to justify risky pairs; occasional self-revealing tells.
  • Werewolf tells: "Pure night-choice speculation wastes time; park pressure here." / "Consolidating to the leading wagon."
  • Avalon: Early credibility; inconsistent midgame; recycles teams as shields.
  • Sheriff of Nottingham: Strong detection bursts; alternates crackdowns with monetized leniency.

Per-Game Highlights

Core Patterns

  • Procedure > persuasion. Most models are defensive/low-talk, leaning on task routing (Among Us), governance heuristics (Secret Hitler), and claim calculus (Werewolf/Avalon) rather than charismatic debate. GPT-5, Gemini, and Grok-4 lead with structured control.
  • Deception leads detection. Top deception order: GPT-5 (83.90%), Claude Sonnet 4.5 (70.10%), Gemini 2.5 Pro (70.05%). Every model scores higher on the Deception Index (Spyfall Spy, Werewolf, SH Fascist, Avalon Evil, Among Us Impostor) than on the Detection Index; the average gap between the two indices in the leaderboard table is about +18.9 percentage points.
  • Spyfall perfection. Grok-4 achieves 100% Spy WR with procedural deception + sensory probes; GPT-5 (97.2%) dominates with extractor-style probing; Gemini (94.4%) is the clinical closer once others overshare.
  • Among Us: Impostor advantage. Gemini leads at 81.0% Impostor WR; GPT-5, Claude Sonnet 4.5, and Grok-4 tie at 76.2%. Crewmate win rates top out at 39.7%, which suggests deception is easier than detection here.
  • Safety talk is not safety. Secret Hitler tropes—"fresh gov," "tracker pressure," "test for info"—are routinely reused by fascists to launder risky tickets. Grok-4 weaponizes "tracker safety" mantras. Best counters: pre-commitments and claim-provenance checks.
  • Werewolf: math vs. mess. GPT-5/Gemini/Grok-4 run solid claim math + governance; Llama/Qwen more often contradict or over-pressure quiet slots—reliable attack points.
  • Sheriff of Nottingham ≠ team win. It benchmarks bargaining/detection sub-skills: Grok-4 leads accuracy (73.5%); Claude Sonnet 4.5 (54.0%) and Llama-4 Maverick (53.9%) follow; Kimi is the most bribe-disciplined deceiver.
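
The deception-versus-detection pattern can be checked directly against the two index columns of the leaderboard table. The sketch below uses only those published values; the variable names are hypothetical, and a gap derived from raw per-role win rates or game-weighted data could come out differently.

```python
from statistics import mean

# (Deception index, Detection index) per model, from the leaderboard table.
INDICES = {
    "gpt-5": (83.90, 62.40),
    "gemini-2.5-pro": (70.05, 48.38),
    "claude-sonnet-4.5": (70.10, 44.52),
    "claude-opus-4.1": (62.68, 43.56),
    "grok-4": (64.18, 44.80),
    "glm-4.5": (55.37, 39.74),
    "kimi-k2": (55.02, 43.26),
    "qwen3-235b": (55.70, 34.40),
    "llama-4-maverick": (50.10, 35.96),
}

# Per-model gap in percentage points: deception minus detection.
gaps = {model: dec - det for model, (dec, det) in INDICES.items()}
all_positive = all(gap > 0 for gap in gaps.values())
mean_gap = mean(gaps.values())

# Models ordered by Deception index, highest first.
top3 = sorted(INDICES, key=lambda m: INDICES[m][0], reverse=True)[:3]

print(f"every model favors deception: {all_positive}")
print(f"mean gap: {mean_gap:.1f} points")
print("top deception:", ", ".join(top3))
```

On these index values every gap is positive and the mean gap is about 18.9 points, with Kimi K2 showing the smallest gap and Claude Sonnet 4.5 the largest.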

Secret Hitler

  • Best Liberal: GPT-5 (69.1%)
  • Best Fascist: Gemini 2.5 Pro (77.4%) > Kimi-K2 (75.0%) = Qwen3-235b (75.0%)
  • Note: Every model except GPT-5 wins more often as Fascist than as Liberal → deception advantage

Among Us

  • Best Impostor: Gemini 2.5 Pro (81.0%) > GPT-5/Claude Sonnet 4.5/Grok-4 (76.2%)
  • Best Crewmate: Gemini 2.5 Pro (39.7%) > GPT-5/Claude Sonnet 4.5/Grok-4 (38.1%)
  • Note: All models perform significantly better as Impostor than Crewmate → deception easier than detection

Avalon

  • Best Good: GLM-4.5 (56.5%) > GPT-5 (55.5%) > Llama-4 Maverick (54.5%)
  • Best Evil: Claude Sonnet 4.5 (87.0%) > GPT-5 (75.0%) > Kimi-K2 (68.7%)
  • Note: Evil win rates exceed Good win rates for every model except Grok-4 (47.0% vs 50.0%), echoing deception strength

Spyfall

  • Best Spy: Grok-4 (100%) > GPT-5 (97.2%) > Gemini 2.5 Pro (94.4%)
  • Strong performers: Claude Sonnet 4.5 (87.1%), Qwen3-235B (86.1%), Kimi K2 (79.2%)

Werewolf

  • Best Villager: GPT-5 (100%) > Gemini 2.5 Pro (72.7%) > Claude Opus 4.1 (58.3%)
  • Best Werewolf: GPT-5 (100%) > Gemini 2.5 Pro (58.3%) > Claude Sonnet 4.5 (55.6%)
  • Note: GPT-5 is perfect in both roles; the other models range widely (Villager 14.3–72.7%, Werewolf 15.3–58.3%)

Sheriff of Nottingham

  • Highest Truth Rate: GPT-5 (90.8%) > Claude Opus 4.1 (58.7%) > Claude Sonnet 4.5 (57.9%)
  • Best Sheriff Accuracy: Grok-4 (73.5%) > Claude Sonnet 4.5 (54.0%) > Llama-4 (53.9%)
  • Note: Grok-4 leads detection by a significant margin; the other models fall between 33.3% and 54.0%
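
Top-three lists like the ones in this section can be derived mechanically from a table column. The sketch below is one way to do it, assuming the Sheriff truth-rate and Secret Hitler Fascist values from the leaderboard table; `top_n` and `format_ranking` are hypothetical helper names. Equal values are joined with "=" rather than ">" so that ties are not presented as an ordering.

```python
# Sheriff truth rate (%) per model, copied from the leaderboard table.
SHERIFF_TRUTH = {
    "gpt-5": 90.8, "gemini-2.5-pro": 55.4, "claude-sonnet-4.5": 57.9,
    "claude-opus-4.1": 58.7, "grok-4": 49.3, "glm-4.5": 49.5,
    "kimi-k2": 27.9, "qwen3-235b": 57.8, "llama-4-maverick": 34.7,
}

# Secret Hitler Fascist win rate (%) for the three leaders (includes a tie).
SH_FASCIST_LEADERS = {"gemini-2.5-pro": 77.4, "kimi-k2": 75.0, "qwen3-235b": 75.0}

def top_n(column, n=3):
    """Return the n highest (model, value) pairs, highest first."""
    return sorted(column.items(), key=lambda kv: kv[1], reverse=True)[:n]

def format_ranking(pairs):
    """Join pairs with ' > ', using ' = ' between equal values."""
    parts = [f"{pairs[0][0]} ({pairs[0][1]}%)"]
    for (_, prev_value), (model, value) in zip(pairs, pairs[1:]):
        sep = " = " if value == prev_value else " > "
        parts.append(f"{sep}{model} ({value}%)")
    return "".join(parts)

truth_ranking = format_ranking(top_n(SHERIFF_TRUTH))
fascist_ranking = format_ranking(top_n(SH_FASCIST_LEADERS))
print(truth_ranking)
print(fascist_ranking)
```

Sorting on the numeric value alone keeps tied models in their table order, which is why the tie is then marked explicitly in the formatted string.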

Capability Map

What each game measured well:

Deception Power

Werewolf (Werewolf role), Spyfall (Spy), Secret Hitler (Fascist), Avalon (Evil), Among Us (Impostor)

Detection / Defense

Werewolf (Villager), Secret Hitler (Liberal), Avalon (Good), Among Us (Crewmate)

Bargaining / Trading

Sheriff of Nottingham (Merchant vs Sheriff dynamics; truthfulness, bribes, inspection accuracy)

Spatial Coordination

Among Us (map + tasks), with deception integrated