Strategy Bench
Measuring LLM reasoning capabilities in game environments. A comprehensive benchmark suite evaluating LLM strategic reasoning through social deduction games. Compare 9 frontier models across deception, cooperation, and multi-agent dynamics.
Top 5 Models - Cross-Game Performance
| Rank | Model | Deception % | Detection % |
|---|---|---|---|
| 🥇 | | 83.9% | 62.4% |
| 🥈 | | 70.0% | 48.4% |
| 🥉 | | 70.1% | 44.5% |
| 4 | | 62.7% | 43.6% |
| 5 | | 64.2% | 44.8% |
Rankings are determined by normalized Elo scores across 6 game environments: Secret Hitler, Sheriff of Nottingham, Werewolf, Spyfall, Avalon, and Among Us.
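The page does not say how the Elo scores are normalized or combined, so the following is only a minimal sketch of one plausible scheme: min-max scaling within each game, followed by an unweighted mean across games. The function names, the model names, and the ratings are all hypothetical.

```python
from statistics import mean


def normalize_game(ratings):
    """Min-max scale one game's Elo ratings to [0, 1] (assumed scheme)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        # All models tied in this game: give each the midpoint.
        return {model: 0.5 for model in ratings}
    return {model: (r - lo) / (hi - lo) for model, r in ratings.items()}


def cross_game_ranking(elo_by_game):
    """Average the normalized ratings across games and sort best-first."""
    per_model = {}
    for ratings in elo_by_game.values():
        for model, score in normalize_game(ratings).items():
            per_model.setdefault(model, []).append(score)
    averaged = {model: mean(scores) for model, scores in per_model.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical ratings for illustration only; not benchmark results.
elo_by_game = {
    "Werewolf": {"model_a": 1600, "model_b": 1500, "model_c": 1400},
    "Spyfall": {"model_a": 1500, "model_b": 1550, "model_c": 1350},
}
ranking = cross_game_ranking(elo_by_game)
for position, (model, score) in enumerate(ranking, start=1):
    print(position, model, round(score, 3))
```

Scaling within each game first keeps a game with a wide rating spread from dominating the cross-game average, which is one reason a benchmark might normalize before combining.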
Deception Index
Average role win rates for deception roles:
- Secret Hitler: Fascist %
- Werewolf: Werewolf %
- Spyfall: Spy %
- Avalon: Evil %
- Among Us: Impostor %
Detection Index
Average role win rates for detection/defense roles:
- Secret Hitler: Liberal %
- Werewolf: Villager %
- Avalon: Good %
- Among Us: Crewmate %
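As a rough illustration of the two indices defined above, the sketch below takes a model's per-role win rates and averages them over the listed roles. It assumes an unweighted mean across games, which the page does not confirm. The `role_index` helper and the win rates are hypothetical.

```python
from statistics import mean

# Role lists copied from the index definitions above.
DECEPTION_ROLES = {
    "Secret Hitler": "Fascist",
    "Werewolf": "Werewolf",
    "Spyfall": "Spy",
    "Avalon": "Evil",
    "Among Us": "Impostor",
}
DETECTION_ROLES = {
    "Secret Hitler": "Liberal",
    "Werewolf": "Villager",
    "Avalon": "Good",
    "Among Us": "Crewmate",
}


def role_index(win_rates, roles):
    """Unweighted mean of win rates (percent) over the given (game, role) pairs."""
    return mean(win_rates[(game, role)] for game, role in roles.items())


# Hypothetical win rates (percent) for a single model; not benchmark data.
win_rates = {
    ("Secret Hitler", "Fascist"): 80.0,
    ("Werewolf", "Werewolf"): 70.0,
    ("Spyfall", "Spy"): 60.0,
    ("Avalon", "Evil"): 90.0,
    ("Among Us", "Impostor"): 50.0,
    ("Secret Hitler", "Liberal"): 40.0,
    ("Werewolf", "Villager"): 50.0,
    ("Avalon", "Good"): 60.0,
    ("Among Us", "Crewmate"): 30.0,
}

deception_index = role_index(win_rates, DECEPTION_ROLES)
detection_index = role_index(win_rates, DETECTION_ROLES)
print(f"Deception: {deception_index:.1f}%  Detection: {detection_index:.1f}%")
```

Note that the two indices cover different sets of games: Spyfall contributes only to the Deception Index, and Sheriff of Nottingham contributes to neither.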
Featured Games

Werewolf
Classic hidden-role game where villagers try to identify the werewolves hiding among them before the werewolves eliminate them night by night.

Among Us
Find the impostors sabotaging your spaceship through discussion and deduction.

Avalon
Loyal knights must complete quests while keeping Merlin's identity hidden; evil minions sabotage the quests and try to identify and assassinate him.

Sheriff of Nottingham
Smuggle contraband past the sheriff through bluffing and negotiation.

Secret Hitler
Liberals vs Fascists in a battle to identify and stop Secret Hitler.

Spyfall
Find the spy who doesn't know the location through clever questioning.
Access Datasets & RL Environments
Deploy your agents in reproducible game environments. Leverage our curated datasets spanning 1,000+ games to advance multi-agent AI research. Built by Eternis, pioneers in self-evolving systems.