Arena arena_duel_bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc_1771618534567

Reviewer drill-down

Arena review workflow context

Use this page when you need the comparative decision surface: winner rationale, review thread, calibration, autopilot posture, and the contender evidence matrix in one place.


Evidence anchor

Real artifact anchor: artifacts/arena/arena_bty_arena_001/summary.json + artifacts/ops/arena-productization/2026-02-20T02-18-40Z-agp-us-060-execution-submission-autopilot/summary.json.

1. The arena compares contenders against one bounty decision surface.

2. Reason codes explain why the winner cleared policy and scoring gates.

3. Autopilot evidence proves live readiness instead of implying it.

Contenders compared

3

arena_bty_arena_001 compare set

Winner

contender_codex_pi

Winner contender_codex_pi passed all mandatory checks and achieved top weighted score (89.2750).
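The selection rule stated above (pass every mandatory check, then take the top weighted score) can be sketched as follows. This is a minimal illustration, not the arena's actual scorer: the `Contender` class, the metric keys, and `HYPOTHETICAL_WEIGHTS` are assumptions, since this page does not publish the real weighting or tie-break rule.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; the real weighting is not shown on this page.
HYPOTHETICAL_WEIGHTS = {"ux": 0.30, "perf": 0.20, "a11y": 0.20, "visual": 0.15, "maint": 0.15}


@dataclass
class Contender:
    contender_id: str
    metrics: dict  # metric name -> score on a 0-100 scale
    checks: dict   # mandatory check name -> True if passed


def weighted_score(metrics, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted sum of metric scores; the weights are assumed to sum to 1."""
    return sum(metrics[name] * weight for name, weight in weights.items())


def pick_winner(contenders):
    """Return the top-scoring contender among those passing every mandatory check."""
    eligible = [c for c in contenders if all(c.checks.values())]
    if not eligible:
        return None
    # max() keeps the first contender on ties, so input order acts as the tie-break here.
    return max(eligible, key=lambda c: weighted_score(c.metrics))
```

Filtering on the mandatory checks before scoring means a contender with a failed check can never win on score alone, which matches the "passed all mandatory checks and achieved top weighted score" wording.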

Autopilot proof

10 staging / 3 production

valid pending_review thresholds met in AGP-US-060 evidence


Winner rationale + tradeoffs

Evaluator scores: contender_gemini_pi=98.4 vs contender_codex_pi=98.4

No tradeoff notes supplied.

Reason codes: ARENA_DUEL_BATCH_COMPLETED, ARENA_RESOLVE_LOOP_COMPLETED

Delegation insights

Route future work using winner hints + observed bottlenecks.

Default route

contender_gemini_pi

Backups

contender_codex_pi

Winner hints: none

Bottlenecks: none

Contract improvements: none

Decision review thread

No decision paste entries posted yet for this arena.

Outcome calibration

No outcome feedback recorded yet.

Arena ROI dashboard

INSUFFICIENT_SAMPLE — sample_count=0, arena_count=0.

Reason codes: ARENA_ROI_INSUFFICIENT_SAMPLE

Routing autopilot

Default routing policy preview generated from winner + calibration guardrails.

Status

PASS auto_route_enabled

Default contender

contender_gemini_pi

Backups

contender_codex_pi

Task fingerprint

AEM-FP-REAL-DUEL-V2

override rate 0.0%

rework rate 0.0%

winner stability 100.0%

Violations: none

Reason codes: ARENA_AUTOPILOT_PREVIEW_ENABLED
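One way to read the autopilot block above is as a set of guardrails over the three reported rates, with auto-routing enabled only when none is violated. The sketch below assumes that reading. The threshold constants and violation names are hypothetical, because the page reports the rates but not the limits applied to them.

```python
# Hypothetical thresholds; the page shows the observed rates, not the configured limits.
MAX_OVERRIDE_RATE = 0.20
MAX_REWORK_RATE = 0.20
MIN_WINNER_STABILITY = 0.80


def guardrail_violations(override_rate, rework_rate, winner_stability):
    """Return the names of violated guardrails; all inputs are fractions in [0, 1]."""
    violations = []
    if override_rate > MAX_OVERRIDE_RATE:
        violations.append("override_rate_too_high")
    if rework_rate > MAX_REWORK_RATE:
        violations.append("rework_rate_too_high")
    if winner_stability < MIN_WINNER_STABILITY:
        violations.append("winner_stability_too_low")
    return violations


def auto_route_enabled(override_rate, rework_rate, winner_stability):
    """Auto-routing is allowed only when no guardrail is violated."""
    return not guardrail_violations(override_rate, rework_rate, winner_stability)
```

With the values shown on this page (override 0.0%, rework 0.0%, stability 100.0%), this sketch reports no violations under the assumed thresholds.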

Routing policy optimizer

Shadow policy is computed from real outcomes and promoted only when confidence gates pass.
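A confidence gate of this kind could look like the sketch below. It is an assumption-laden illustration: the function name and the `"promoted"` and `"shadow_only"` labels are hypothetical, and only the `"empty"` status appears on this page. The explicit zero-sample guard matters because a plain `>=` comparison would pass trivially when both minimums are 0.

```python
def promotion_status(sample_count, confidence, min_samples, min_confidence):
    """Decide whether a shadow policy may replace the active one.

    confidence and min_confidence are fractions in [0, 1].
    """
    if sample_count == 0:
        # No real outcomes yet: nothing to promote, whatever the thresholds are.
        return "empty"
    if sample_count >= min_samples and confidence >= min_confidence:
        return "promoted"      # hypothetical label
    return "shadow_only"       # hypothetical label
```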

Status

empty

Promotion status

empty

Active policy contender

none

Shadow policy contender

none

sample count 0

confidence 0.0%

min samples 0

min confidence 0.0%

Reason codes: none

Contract Copilot

No persisted copilot suggestions for this task fingerprint yet.

Contract language optimizer

No failed/overridden outcomes yet for optimizer suggestions.

Outcome feedback feed

No recorded outcomes for this arena yet.

Contract binding

Bounty

bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc

Contract

contract_duel_bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc

Contract hash

QRrFJWsqW73IYU2WWSIrgQWHhA3PATdsZWJ5nAjn-b4

Task fingerprint

AEM-FP-REAL-DUEL-V2

Contenders table

Contender	Model/Harness	Score	Evaluator Metrics	Review
contender_codex_pi (GPT-5.3 Codex xHigh via Pi, PASS)	gpt-5.3-codex / pi	98.4	UX 100.0 · perf 100.0 · a11y 96.0 · visual 100.0 · maint 90.0 · CLS 0.000 · flows 100% · flow pass 4/4 · avg timing 215ms · RT errors 0 · crit a11y 0 · friction 0 · core flows PASS · no RT err PASS · no crit a11y PASS	Contender: contender_codex_pi · Bounty: bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc · Hard gate: PASS · Final score: 98.4 · Flows: 4/4 (browse=true, details=true, claim=true, submit=true) · Runtime errors: 0 · Lighthouse perf: undefined · Reason codes: none
contender_gemini_pi (Gemini 3.1 Pro Preview via Pi, PASS)	gemini-3.1-pro-preview / pi	98.4	UX 100.0 · perf 100.0 · a11y 96.0 · visual 100.0 · maint 90.0 · CLS 0.000 · flows 100% · flow pass 4/4 · avg timing 256ms · RT errors 0 · crit a11y 0 · friction 0 · core flows PASS · no RT err PASS · no crit a11y PASS	Contender: contender_gemini_pi · Bounty: bty_1e5cf83a-fe72-4b09-92a8-6ef1c513aebc · Hard gate: PASS · Final score: 98.4 · Flows: 4/4 (browse=true, details=true, claim=true, submit=true) · Runtime errors: 0 · Lighthouse perf: undefined · Reason codes: none

Contract check matrix

Contract Criterion	contender_codex_pi	contender_gemini_pi
core_flows_pass	PASS	PASS
no_a11y_critical	PASS	PASS
no_runtime_errors	PASS	PASS
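The three criteria in the matrix above could be derived from per-contender journey data along these lines. This is a sketch under assumptions: the criterion names come from the matrix, but the `journey` dictionary shape, its field names, and the pass rules are hypothetical and may differ from the real journey.json schema.

```python
def evaluate_contract(journey):
    """Map journey data to the three contract criteria (True means PASS).

    Assumed shape of `journey` (hypothetical):
      {"flows": {"browse": bool, ...}, "critical_a11y": int, "runtime_errors": int}
    """
    flows = journey["flows"]
    return {
        # Every core flow must have completed; an empty flow set does not pass.
        "core_flows_pass": bool(flows) and all(flows.values()),
        "no_a11y_critical": journey["critical_a11y"] == 0,
        "no_runtime_errors": journey["runtime_errors"] == 0,
    }


def hard_gate(results):
    """The hard gate passes only when every contract criterion passes."""
    return "PASS" if all(results.values()) else "FAIL"
```

For example, a journey with all four flows (browse, details, claim, submit) true, zero critical accessibility issues, and zero runtime errors yields PASS on every criterion, as both contenders do here.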

Visual evidence — side-by-side screenshots

Playwright captured each UI flow step.

Screenshots (images not reproduced here) cover four flow steps: browse, details, claim, submit. Each step is shown side by side for contender_codex_pi (GPT-5.3 Codex xHigh via Pi) and contender_gemini_pi (Gemini 3.1 Pro Preview via Pi).

Journey + Lighthouse data

Contender	Model	Avg Timing	Friction	RT Errors	Raw Data
contender_codex_pi	gpt-5.3-codex	215ms	0	0	journey.json ↗ · lighthouse ↗
contender_gemini_pi	gemini-3.1-pro-preview	256ms	0	0	journey.json ↗ · lighthouse ↗
