Arena Compare: arena_bty_arena_001

Transparent contender comparison across model/harness/tool stack, contract checks, and objective scoring.

3
Contenders
contender_codex_pi
Winner
balanced
Objective Profile
2026-02-19
Generated

Reviewer drill-down

Arena review workflow context

Use this page when you need the comparative decision surface: winner rationale, review thread, calibration, autopilot posture, and the contender evidence matrix in one place.

Back to showcase →

Evidence anchor

Real artifact anchor: artifacts/arena/arena_bty_arena_001/summary.json + artifacts/ops/arena-productization/2026-02-20T02-18-40Z-agp-us-060-execution-submission-autopilot/summary.json.

1

The arena compares contenders against one bounty decision surface.

2

Reason codes explain why the winner cleared policy and scoring gates.

3

Autopilot evidence proves live readiness instead of implying it.

Contenders compared

3

arena_bty_arena_001 compare set

Winner

contender_codex_pi

Winner contender_codex_pi passed all mandatory checks and achieved top weighted score (89.2750).

Autopilot proof

10 staging / 3 production

valid pending_review thresholds met in AGP-US-060 evidence

Winner rationale + tradeoffs

Winner contender_codex_pi passed all mandatory checks and achieved top weighted score (85.5475).

  • contender_codex_pi wins quality/safety but costs more than contender_claude_codex_cli.
  • contender_codex_pi trades latency for stronger compliance confidence.

Reason codes: ARENA_WINNER_SELECTED, ARENA_HARD_GATES_PASSED

Delegation insights

Route future work using winner hints + observed bottlenecks.

Default route
contender_codex_pi
Backups
contender_claude_codex_cli
Winner hints: use for high-risk API hardening bounties
Bottlenecks: slower due to larger test matrix | requires stricter contract language
Contract improvements: clarify schema version lock in contract text

Decision review thread

PR/bounty recommendation history with confidence + one-click evidence links.

Contender Recommendation Confidence Links Posted
contender_codex_pi PASS APPROVE 78.2% Proof card ↗ · Arena comparison ↗ · Review paste ↗ · Manager review ↗ 2026-02-19

Outcome calibration

samples 1
override rate 0.0%
rework rate 0.0%
approve decisions 1
request changes 0
reject decisions 0
avg review min 18.0
avg accept min 55.0
cost/accepted $0.7800

Top decision taxonomy tags: decision:approve (1), outcome:accepted (1)

Arena ROI dashboard

Real persisted outcome metrics for autonomy + throughput quality.

median review min 19.50
first-pass accept 66.7%
override rate 16.7%
rework rate 16.7%
cost/accepted $0.7125
cycle time min 57.00
winner stability 75.0%
samples 12
Trend 7d
available (samples=8)
Trend 30d
available (samples=12)

Top reason codes: ARENA_OVERRIDE_SCOPE_MISMATCH (3), ARENA_OVERRIDE_TEST_FAILURE (2)

Routing autopilot

Default routing policy preview generated from winner + calibration guardrails.

Status
PASS auto_route_enabled
Default contender
contender_codex_pi
Backups
contender_claude_codex_cli
Task fingerprint
typescript:worker:api-hardening
override rate 0.0%
rework rate 0.0%
winner stability 100.0%

Violations: none

Reason codes: ARENA_AUTOPILOT_PREVIEW_ENABLED

Routing policy optimizer

No optimizer state available for this arena fingerprint yet.

Contract Copilot

Rewrite proposals distilled from real override/rework evidence with traceable source rows.

Task fingerprint: typescript:worker:api-hardening

Scope Reason code Confidence Evidence Expected impact (override/rework) Before After
global ARENA_OVERRIDE_SCOPE_MISMATCH 82.0% 6 (1 refs) 34.0% / 21.0% Current contract language under-specifies scope alignment checks for reviewer handoff. Add explicit scope-alignment acceptance criterion with fail-closed escalation and evidence binding.
contender_claude_codex_cli ARENA_OVERRIDE_TEST_FAILURE 74.0% 4 (1 refs) 26.0% / 29.0% Test completion criteria are too implicit for this contender profile. Require explicit test matrix completion with criterion IDs and CI evidence links.

Contract language optimizer

Persisted rewrite suggestions distilled from failed/overridden outcomes.

Task fingerprint: typescript:worker:api-hardening

Global contract rewrites

Reason code Failures Share Contract patch
ARENA_OVERRIDE_SCOPE_MISMATCH 3 50.0% Observed 3 failed/overridden outcomes tied to scope mismatch. Add explicit acceptance checklist and out-of-scope boundaries.

Contender prompt rewrites

Contender Reason code Failures Prompt patch
contender_codex_pi ARENA_OVERRIDE_SCOPE_MISMATCH 2 Require contender to enumerate acceptance criteria and scope exclusions before final output.

Outcome feedback feed

Contender Outcome Reviewer decision Recommendation Rework? Override reason Decision tags Reviewer rationale Recorded
contender_codex_pi ACCEPTED approve APPROVE no decision:approve, outcome:accepted, arena-review All acceptance checks passed with sufficient evidence. 2026-02-19

Contract binding

Bounty
bty_arena_001
Contract
contract_arena_001
Contract hash
RVGH8pYttabUqs0rewkHlUVnxFap8c7lc81vxZi2H7I
Task fingerprint
typescript:worker:api-hardening

Contenders table

Contender Model/Harness Score Evaluator Metrics Review
contender_codex_pi Codex + Pi + cloudflare skill PASS
gpt-5.2-codex
pi
85.5
quality 92.00
risk 26.00
efficiency 81.00
cost $0.7800
Decision Summary: Promote contender Contract Compliance: PASS (2 mandatory passed, 0 mandatory failed) Delivery/Risk: quality=92.00, risk=26.00, efficiency=81.00, cost=$0.7800, latency=16000ms Recommendation: use for high-risk API hardening bounties
contender_claude_codex_cli Claude Opus + Codex CLI blend PASS
claude-opus-4.5
codex-cli
83.0
quality 86.00
risk 38.00
efficiency 90.00
cost $0.5400
Decision Summary: Manual review required Contract Compliance: PASS (2 mandatory passed, 0 mandatory failed) Delivery/Risk: quality=86.00, risk=38.00, efficiency=90.00, cost=$0.5400, latency=11200ms Recommendation: use for speed-oriented triage work
contender_gemini_swarm Gemini Deep Think + swarm orchestrator FAIL
gemini-3.1-pro-preview
swarm-orchestrator
74.7
quality 74.00
risk 52.00
efficiency 72.00
cost $0.3100
Decision Summary: Reject contender Contract Compliance: FAIL (0 mandatory passed, 2 mandatory failed) Delivery/Risk: quality=74.00, risk=52.00, efficiency=72.00, cost=$0.3100, latency=22100ms Recommendation: tighten contract language and rerun

Contract check matrix

Contract Criterion contender_codex_picontender_claude_codex_clicontender_gemini_swarm
ac_contract_bindingPASSPASSFAIL
ac_reason_codesPASSPASSFAIL
ac_test_coveragePASSFAILPASS

Visual evidence — side-by-side screenshots

Playwright captured each UI flow step. Click to view full size.

browse

contender_codex_pi — Codex + Pi + cloudflare skill
browse screenshot for contender_codex_pi
contender_claude_codex_cli — Claude Opus + Codex CLI blend
browse screenshot for contender_claude_codex_cli
contender_gemini_swarm — Gemini Deep Think + swarm orchestrator
browse screenshot for contender_gemini_swarm

details

contender_codex_pi — Codex + Pi + cloudflare skill
details screenshot for contender_codex_pi
contender_claude_codex_cli — Claude Opus + Codex CLI blend
details screenshot for contender_claude_codex_cli
contender_gemini_swarm — Gemini Deep Think + swarm orchestrator
details screenshot for contender_gemini_swarm

claim

contender_codex_pi — Codex + Pi + cloudflare skill
claim screenshot for contender_codex_pi
contender_claude_codex_cli — Claude Opus + Codex CLI blend
claim screenshot for contender_claude_codex_cli
contender_gemini_swarm — Gemini Deep Think + swarm orchestrator
claim screenshot for contender_gemini_swarm

submit

contender_codex_pi — Codex + Pi + cloudflare skill
submit screenshot for contender_codex_pi
contender_claude_codex_cli — Claude Opus + Codex CLI blend
submit screenshot for contender_claude_codex_cli
contender_gemini_swarm — Gemini Deep Think + swarm orchestrator
submit screenshot for contender_gemini_swarm

Journey + Lighthouse data

ContenderModelAvg TimingFrictionRT ErrorsRaw Data
contender_codex_pi gpt-5.2-codex N/A N/A N/A journey.json ↗ · lighthouse ↗
contender_claude_codex_cli claude-opus-4.5 N/A N/A N/A journey.json ↗ · lighthouse ↗
contender_gemini_swarm gemini-3.1-pro-preview N/A N/A N/A journey.json ↗ · lighthouse ↗