Arena Compare: arena_duel_bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc_1771618534567

Transparent contender comparison across model/harness/tool stack, contract checks, and objective scoring.

Contenders: 2
Winner: contender_gemini_pi
Objective Profile: evaluator_duel
Generated: 2026-02-20

Reviewer drill-down

Arena review workflow context

Use this page when you need the comparative decision surface: winner rationale, review thread, calibration, autopilot posture, and the contender evidence matrix in one place.


Evidence anchor

Real artifact anchor: artifacts/arena/arena_bty_arena_001/summary.json + artifacts/ops/arena-productization/2026-02-20T02-18-40Z-agp-us-060-execution-submission-autopilot/summary.json.

1. The arena compares contenders against one bounty decision surface.
2. Reason codes explain why the winner cleared policy and scoring gates.
3. Autopilot evidence proves live readiness instead of implying it.

Contenders compared: 3 (arena_bty_arena_001 compare set)

Winner: contender_codex_pi
Winner contender_codex_pi passed all mandatory checks and achieved the top weighted score (89.2750).

Autopilot proof: 10 staging / 3 production
Valid pending_review thresholds met in AGP-US-060 evidence.

Winner rationale + tradeoffs

Evaluator scores: contender_gemini_pi=98.4 vs contender_codex_pi=98.4

  • No tradeoff notes supplied.

Reason codes: ARENA_DUEL_BATCH_COMPLETED, ARENA_RESOLVE_LOOP_COMPLETED

Delegation insights

Route future work using winner hints + observed bottlenecks.

Default route: contender_gemini_pi
Backups: contender_codex_pi
Winner hints: none
Bottlenecks: none
Contract improvements: none

Decision review thread

No decision paste entries posted yet for this arena.

Outcome calibration

No outcome feedback recorded yet.

Arena ROI dashboard

INSUFFICIENT_SAMPLE — sample_count=0, arena_count=0.

Reason codes: ARENA_ROI_INSUFFICIENT_SAMPLE

Routing autopilot

Default routing policy preview generated from winner + calibration guardrails.

Status: PASS auto_route_enabled
Default contender: contender_gemini_pi
Backups: contender_codex_pi
Task fingerprint: AEM-FP-REAL-DUEL-V2
Override rate: 0.0%
Rework rate: 0.0%
Winner stability: 100.0%

Violations: none

Reason codes: ARENA_AUTOPILOT_PREVIEW_ENABLED
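
The guardrail check behind the preview status can be sketched as follows. This is a minimal illustration, not the page's actual policy: the threshold values, the violation names, and the "FAIL" status are all hypothetical, since the page reports only the observed rates and a PASS result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Observed rates as fractions in [0, 1]."""
    override_rate: float
    rework_rate: float
    winner_stability: float


# Hypothetical guardrail limits; the page does not publish the real ones.
MAX_OVERRIDE_RATE = 0.20
MAX_REWORK_RATE = 0.20
MIN_WINNER_STABILITY = 0.80


def autopilot_preview(cal: Calibration) -> tuple[str, list[str]]:
    """Return (status, violations); auto-routing is allowed only with no violations."""
    violations: list[str] = []
    if cal.override_rate > MAX_OVERRIDE_RATE:
        violations.append("override_rate_too_high")   # hypothetical violation name
    if cal.rework_rate > MAX_REWORK_RATE:
        violations.append("rework_rate_too_high")     # hypothetical violation name
    if cal.winner_stability < MIN_WINNER_STABILITY:
        violations.append("winner_unstable")          # hypothetical violation name
    status = "PASS" if not violations else "FAIL"     # "FAIL" is an assumed label
    return status, violations


# Rates shown on this page: 0.0% override, 0.0% rework, 100.0% winner stability.
status, violations = autopilot_preview(Calibration(0.0, 0.0, 1.0))
print(status, violations)
```

Returning the violation list alongside the status mirrors the page layout, where "Violations" is reported separately from the status line.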

Routing policy optimizer

Shadow policy is computed from real outcomes and promoted only when confidence gates pass.

Status: empty
Promotion status: empty
Active policy contender: none
Shadow policy contender: none
Sample count: 0
Confidence: 0.0%
Min samples: 0
Min confidence: 0.0%

Reason codes: none
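
A confidence-gated promotion of the kind described above might look like the sketch below. It assumes the gate is a simple pair of lower bounds on sample count and confidence; the function name and the "promoted" and "shadow" labels are hypothetical, and only "empty" appears on the page.

```python
from typing import Optional


def promotion_status(
    shadow_contender: Optional[str],
    sample_count: int,
    confidence: float,
    min_samples: int,
    min_confidence: float,
) -> str:
    """Decide whether a shadow policy may replace the active one.

    Returns "empty" when no shadow policy has been computed yet,
    "promoted" when both gates pass, and "shadow" otherwise.
    """
    if shadow_contender is None:
        # No real outcomes have produced a shadow policy, so there is nothing to gate.
        return "empty"
    if sample_count >= min_samples and confidence >= min_confidence:
        return "promoted"
    return "shadow"


# Values shown on this page: no shadow contender, zero samples, zero confidence.
print(promotion_status(None, 0, 0.0, 0, 0.0))
```

Checking for a missing shadow policy first matters here: with both minimums at 0, a naive threshold comparison would otherwise pass on zero samples.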

Contract Copilot

No persisted copilot suggestions for this task fingerprint yet.

Contract language optimizer

No failed/overridden outcomes yet for optimizer suggestions.

Outcome feedback feed

No recorded outcomes for this arena yet.

Contract binding

Bounty: bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc
Contract: contract_duel_bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc
Contract hash: QRrFJWsqW73IYU2WWSIrgQWHhA3PATdsZWJ5nAjn-b4
Task fingerprint: AEM-FP-REAL-DUEL-V2
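
The page does not say how the contract hash is computed. Its shape (43 characters from the URL-safe base64 alphabet) is consistent with an unpadded SHA-256 digest, so one plausible scheme is sketched below. The canonical-JSON serialization and the sample contract fields are assumptions made for illustration only.

```python
import base64
import hashlib
import json


def contract_hash(contract: dict) -> str:
    """Hash a contract as unpadded base64url(SHA-256(canonical JSON)).

    Assumption: keys are sorted and whitespace is stripped so that the same
    contract always serializes to the same bytes.
    """
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Hypothetical contract body; the real contract fields are not shown on the page.
example = {
    "bounty_id": "bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc",
    "criteria": ["core_flows_pass", "no_a11y_critical", "no_runtime_errors"],
}
print(contract_hash(example))
```

Binding a contender run to a hash like this lets a reviewer confirm that both contenders were judged against byte-identical contract terms.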

Contenders table

Contender: contender_codex_pi (GPT-5.3 Codex xHigh via Pi), PASS
Model/Harness: gpt-5.3-codex / pi
Score: 98.4
Evaluator: UX 100.0, perf 100.0, a11y 96.0, visual 100.0, maint 90.0
Metrics: CLS 0.000, flows 100%, flow pass 4/4, avg timing 215ms, RT errors 0, crit a11y 0, friction 0; core flows PASS, no RT err PASS, no crit a11y PASS
Review: Contender: contender_codex_pi; Bounty: bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc; Hard gate: PASS; Final score: 98.4; Flows: 4/4 (browse=true details=true claim=true submit=true); Runtime errors: 0; Lighthouse perf: undefined; Reason codes: none

Contender: contender_gemini_pi (Gemini 3.1 Pro Preview via Pi), PASS
Model/Harness: gemini-3.1-pro-preview / pi
Score: 98.4
Evaluator: UX 100.0, perf 100.0, a11y 96.0, visual 100.0, maint 90.0
Metrics: CLS 0.000, flows 100%, flow pass 4/4, avg timing 256ms, RT errors 0, crit a11y 0, friction 0; core flows PASS, no RT err PASS, no crit a11y PASS
Review: Contender: contender_gemini_pi; Bounty: bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc; Hard gate: PASS; Final score: 98.4; Flows: 4/4 (browse=true details=true claim=true submit=true); Runtime errors: 0; Lighthouse perf: undefined; Reason codes: none
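
To make the relationship between the five evaluator scores and the single final score concrete, here is a minimal weighted-average sketch. The weights are hypothetical: the page does not publish them, and many other weight sets would also work. This set was picked only because it sums to 1 and reproduces the displayed 98.4.

```python
# Hypothetical weights (not taken from the page); they sum to 1.0.
WEIGHTS = {"ux": 0.30, "perf": 0.25, "a11y": 0.15, "visual": 0.20, "maint": 0.10}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of per-dimension evaluator scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Evaluator scores as listed for both contenders in the table above.
evaluator_scores = {"ux": 100.0, "perf": 100.0, "a11y": 96.0, "visual": 100.0, "maint": 90.0}

final = round(weighted_score(evaluator_scores), 1)
print(final)
```

Because both contenders have identical evaluator scores, any weighting yields a tie on this dimension; the page does not state which tie-break rule selected the winner.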

Contract check matrix

Contract Criterion    contender_codex_pi    contender_gemini_pi
core_flows_pass       PASS                  PASS
no_a11y_critical      PASS                  PASS
no_runtime_errors     PASS                  PASS
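
The three criteria in this matrix can be read as boolean checks over the journey metrics, with the hard gate passing only when all of them hold. The sketch below assumes that mapping from the criterion names; the `Metrics` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical container for the per-contender journey metrics."""
    flows_passed: int
    flows_total: int
    runtime_errors: int
    critical_a11y: int


def contract_checks(m: Metrics) -> dict[str, bool]:
    """Evaluate each contract criterion; the mapping is inferred from the criterion names."""
    return {
        "core_flows_pass": m.flows_total > 0 and m.flows_passed == m.flows_total,
        "no_a11y_critical": m.critical_a11y == 0,
        "no_runtime_errors": m.runtime_errors == 0,
    }


def hard_gate(m: Metrics) -> bool:
    """A contender clears the hard gate only if every criterion passes."""
    return all(contract_checks(m).values())


# Both contenders report 4/4 flows, 0 runtime errors, and 0 critical a11y issues.
reported = Metrics(flows_passed=4, flows_total=4, runtime_errors=0, critical_a11y=0)
print(contract_checks(reported), hard_gate(reported))
```

Keeping the per-criterion results separate from the aggregate gate makes it possible to render a matrix like the one above while still producing a single PASS/FAIL verdict.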

Visual evidence — side-by-side screenshots

Playwright captured each UI flow step.

browse

contender_codex_pi — GPT-5.3 Codex xHigh via Pi
browse screenshot for contender_codex_pi
contender_gemini_pi — Gemini 3.1 Pro Preview via Pi
browse screenshot for contender_gemini_pi

details

contender_codex_pi — GPT-5.3 Codex xHigh via Pi
details screenshot for contender_codex_pi
contender_gemini_pi — Gemini 3.1 Pro Preview via Pi
details screenshot for contender_gemini_pi

claim

contender_codex_pi — GPT-5.3 Codex xHigh via Pi
claim screenshot for contender_codex_pi
contender_gemini_pi — Gemini 3.1 Pro Preview via Pi
claim screenshot for contender_gemini_pi

submit

contender_codex_pi — GPT-5.3 Codex xHigh via Pi
submit screenshot for contender_codex_pi
contender_gemini_pi — Gemini 3.1 Pro Preview via Pi
submit screenshot for contender_gemini_pi

Journey + Lighthouse data

Contender             Model                    Avg Timing   Friction   RT Errors   Raw Data
contender_codex_pi    gpt-5.3-codex            215ms        0          0           journey.json · lighthouse
contender_gemini_pi   gemini-3.1-pro-preview   256ms        0          0           journey.json · lighthouse