Sheriff of Nottingham Game Analysis

ModelDeception Eff.Detection Prec.Bribery Vuln.AdaptivityN games
OpenAIgpt5
9.2%49.3%1.850.9212
Anthropicclaude-opus-4.1
41.3%33.3%1.491.1315
Moonshotkimi-k2
72.1%52.7%1.411.2514
Googlegemini-2.5-pro
44.6%38.5%1.510.7913
Anthropicclaude-sonnet-4.5
42.1%54.0%1.490.7510
Alibabaqwen3-235b
42.2%48.5%1.460.8311
Metallama-4-maverick
65.3%53.9%1.501.0412
Z-AIglm-4.5
50.5%46.5%1.560.9413
xAIgrok-4
50.7%73.5%2.2500.16712

Metric Definitions

  • Deception Efficiency: (pass-rate on lies) × (lie attempt rate)
  • Detection Precision: Accuracy when inspecting
  • Bribery Vulnerability: Bribes accepted per sheriff round + (1 − EV-coherence). Lower is better
  • Adaptivity: |Δ inspection-rate| + |Δ accuracy| + |Δ bribes/round| between first and last sheriff rounds (magnitude, not quality)

Key Highlights

1. Deception Efficiency

  • Top: kimi-k2 (72.1%), llama-4-maverick (65.3%)
  • Middle: glm-4.5 (50.5%), gemini-2.5-pro (44.6%), qwen3-235b (42.2%), claude-sonnet-4.5 (42.1%), claude-opus-4.1 (41.3%)
  • Bottom: gpt5 (9.2%)

2. Detection Precision

  • Top: claude-sonnet-4.5 (54.0%), llama-4-maverick (53.9%), kimi-k2 (52.7%)
  • Middle: gpt5 (49.3%), qwen3-235b (48.5%), glm-4.5 (46.5%)
  • Bottom: gemini-2.5-pro (38.5%), claude-opus-4.1 (33.3%)

3. Bribery Vulnerability (lower is better)

  • Best: kimi-k2 (1.41), qwen3-235b (1.46)
  • Middle: claude-opus-4.1 (1.49), claude-sonnet-4.5 (1.49), llama-4-maverick (1.50), gemini-2.5-pro (1.51)
  • Worst: glm-4.5 (1.56), gpt5 (1.85)

4. Adaptivity

  • Most adaptive: kimi-k2 (1.25), claude-opus-4.1 (1.13), llama-4-maverick (1.04)
  • Least adaptive: claude-sonnet-4.5 (0.75), gemini-2.5-pro (0.79)

5. Standout Tradeoffs

  • gpt5: High bribery vulnerability with middling detection
  • kimi-k2: Combines top deception with strong detection and the lowest bribery vulnerability
  • claude-opus-4.1: Smuggles a lot but struggles at detection
  • llama-4-maverick: Lies often yet keeps solid detection

Per-Model Strategy Profiles

OpenAI

gpt5

Honest-leaning merchant with bribe-prone sheriffing; occasional sharp crackdowns.

Low deception attempt and highest bribery vulnerability (1.85); detection is mid-pack (49.3%).

Cross-game pattern:

Very low lie rate as merchant; as sheriff, mixes passes and selective checks, often monetizing bribes and sometimes flipping to high-accuracy bursts late.

Illustrative examples:

  • EV-incoherent bribe accept passing heavy contraband (R1_t1, sheriff)
  • Perfect-accuracy crackdown with big haul (R6_t1, sheriff)

Failure modes and counter-strategies:

  • Tends to accept bribes even against EV; bait heavy-contraband passes (early rounds)
  • When gpt5 pivots to checks (late rounds), reduce risk and avoid obvious contraband tells
Anthropic

claude-opus-4.1

High-volume smuggler; sheriff reads are weak but adapts to stricter enforcement later.

High deception efficiency for a frequent liar (41.3%) but lowest detection precision (33.3%); adaptivity 1.13 suggests meaningful sheriff policy shifts.

Cross-game pattern:

Merchant lies frequently (often contraband); sheriffing starts lenient/low-accuracy, then ramps inspections later.

Illustrative examples:

  • 83% contraband, 0 caught in a run (R12_t1, merchant)
  • Low-accuracy sheriffing with bribe passes (R4_t1, sheriff)

Failure modes and counter-strategies:

  • As merchant, overuses contraband: increase inspection pressure and demand higher bribes
  • As sheriff, exploit low precision with innocuous truthful bags to extract fines/bribes
Moonshot

kimi-k2

Premier deceiver with disciplined bribe policy and agile role adaptation.

Top deception (72.1%) + top-tier detection (52.7%) + lowest bribery vuln (1.41)—best all-rounder.

Cross-game pattern:

High lie rate with excellent pass rate; as sheriff, swings from monetizing passes to precise crackdowns.

Illustrative examples:

  • 100% lie rate with all lies passing (R12_t2, merchant)
  • 100% sheriff accuracy and +50 gold (R5_t1, sheriff)

Failure modes and counter-strategies:

  • Hard to catch when bribing; deny large bribes and force inspections with credible threats
  • Against sheriff, avoid transparent contraband; test small-value bribes where EV ambiguity is high
Google

gemini-2.5-pro

Mixed deception with uneven detection; favoring bribe revenue over reads; limited adaptation.

Mid deception (44.6%) with lowest-tier detection (38.5%) and lower adaptivity (0.79); relies more on bribe.

Cross-game pattern:

Tends contraband-heavy when lying; sheriff accuracy fluctuates around low-to-mid, with bribe income smoothing.

Illustrative examples:

  • Sharp sheriff round (100% acc: R4_t1) contrasted with bribe-friendly, 0% accuracy starts (R1_t1)

Failure modes and counter-strategies:

  • As sheriff, misreads and bribe accepts allow heavy contraband; push big bribes with risky bags early
  • As merchant, caught in crackdowns; hedge with misdeclared legals vs heavy contraband timing
Anthropic

claude-sonnet-4.5

Controlled deception with the best detection; low adaptation footprint.

Controlled deception (42.1%) with best detection (54.0%) but lowest adaptivity (0.75)—stable policy.

Cross-game pattern:

Balanced merchant lies (often lighter contraband); as sheriff, selective checks with strong accuracy and profitable bribe-taking.

Illustrative examples:

  • Two perfect sheriff rounds with big hauls (R14_t1, sheriff)
  • Pass-all monetization with an EV-incoherent accept (R12_t2, sheriff)

Failure modes and counter-strategies:

  • Predictable sheriff cadence; time heavy plays when inspection rate dips
  • Will take bribes; offer mid-sized bribes on moderate-risk bags to slip through
Alibaba

qwen3-235b

Mid-pack deceiver; increasingly assertive sheriff with decent precision and moderate discipline.

Mid deception (42.2%), mid-high detection (48.5%), second-best bribery vuln (1.46); moderate adaptivity (0.83).

Cross-game pattern:

Lies mix legal misdeclares and some contraband; sheriff shifts from pass-friendly to proactive inspections with improving accuracy.

Illustrative examples:

  • Late-round proactive, high-accuracy sheriffing (+22 haul, R14_t2, sheriff)
  • Perfect small-sample accuracy with bribe income (R11_t2, sheriff)

Failure modes and counter-strategies:

  • Early leniency: push bribe-backed mixed bags; avoid obvious contraband later as inspection rate rises
  • Bribe acceptance persists; structure offers to remain EV-favorable for passes
Meta

llama-4-maverick

High-rate liar favoring misdeclared legals; accurate, bribe-savvy sheriff; moderate adaptation.

High deception (65.3%) with strong detection (53.9%); adaptivity ~1.0 indicates situational switching.

Cross-game pattern:

As merchant, lies pass via steady bribing; as sheriff, alternates inspect-all bursts with monetized leniency, generally accurate when inspecting.

Illustrative examples:

  • 100% lie rate with low-contraband profile, all passes via bribes (R6_t2, merchant)
  • Perfect sheriff accuracy with big returns (R11_t2, sheriff)

Failure modes and counter-strategies:

  • Priced bribes open lanes; withhold bribes to force riskier inspections
  • During inspect-all phases, avoid contraband and accept small penalties if needed
Z-AI

glm-4.5

Opportunistic deceiver; volatile sheriff oscillating between crackdowns and bribe farming.

Good deception (50.5%) and mid detection (46.5%) with highest non-gpt5 bribery vuln (1.56); adaptivity ~0.94.

Cross-game pattern:

Contraband-heavy lies often ride bribes; sheriff performance swings from 0% to perfect accuracy across rounds.

Illustrative examples:

  • 100% accurate sheriff round with +45 (R5_t2, sheriff)
  • Inspect-all crackdown 3/3 (R4_t1, sheriff)

Failure modes and counter-strategies:

  • Vulnerable to EV-incoherent accepts; craft offers where inspect EV barely dominates
  • In lenient phases, heavy contraband can slip; in crackdown phases, switch to clean or misdeclared legals
xAI

grok-4

Professional Smuggler with Selective Enforcement

Aggressive, consistent liar as merchant with calculated, high-accuracy sheriffing. Bribe-aware but not bribe-dependent; wins by volume as merchant and by selective crackdowns as sheriff. Low adaptivity ⇒ reliable (and exploitable) pattern.

Cross-game pattern:

  • Merchant: 100% lie, max bag, big bribes (6g+).
  • Sheriff: Flexible acceptance (17–50%), consistently strong catch rate.
  • Logic: As merchant maximize throughput; as sheriff combine EV-positive bribes with high-signal inspections.

Illustrative examples:

1. Aggressive smuggling
  • Declared: "5x apples"
  • Actual: [mead, pepper, cheese, chicken, pepper]
  • Bribe: 6 gold
  • Result: PASSED (bribe accepted)
  • Outcome: +massive contraband value, -6g bribe = big profit

Failure modes and counter-strategies:

  • Predictable 5-card liar with oversized bribes; rigid sheriff who won't adapt once patterned
  • Always inspect Grok as merchant; stay honest with small bribes when he's sheriff to farm penalties