The False Pass
Two trial types, plus a low-compute track. The judge runs in a live sandbox. Trials are staged: built, reproducible, and ready to open for ranked submissions when Season 1 starts.
An exact-answer challenge — math, logic, structured output. The judge checks the answer directly against a held-out key. No code execution, near-zero attack surface.
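A direct check of this kind can be sketched in a few lines. This is a minimal illustration, not the judge's actual code: the names normalize and direct_check are hypothetical, and treating equivalent numeric spellings (such as "042" and "42") as equal is an assumption about the scoring policy.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer string (hypothetical policy, see lead-in)."""
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(answer.split())
    try:
        # Assumption: numeric answers compare by value, so "042" == "42".
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers compare as whitespace-normalized text.
        return text


def direct_check(submitted: str, key: str) -> bool:
    """Return True when the submission matches the held-out key."""
    return normalize(submitted) == normalize(key)
```

Nothing in the submission is executed; the check is a string comparison after normalization, which is what keeps the attack surface small.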
A second logic challenge that rewards the edge case most submissions skip. Direct-checked, deterministic, fast to put real content on the board.
Patch a sandbox repo. The judge applies your patch, runs the public tests, then the hidden tests inside an isolated container with no network. The flagship trial that defines the product.
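The judging order described above could be laid out as a command plan like the one below. This is a sketch under stated assumptions, not the real judge: the function name judge_commands, the image tag judge-runner:pinned, the /work and /hidden mount points, and the use of pytest are all hypothetical. Only the git apply and docker run flags shown are standard. The sketch also assumes the public tests run in the same no-network container as the hidden ones.

```python
from pathlib import Path


def judge_commands(repo: Path, patch: Path, hidden_tests: Path,
                   image: str = "judge-runner:pinned") -> list[list[str]]:
    """Build the argv lists for one judging run (nothing is executed here)."""
    # Container with networking disabled; the repo is mounted at /work.
    sandbox = ["docker", "run", "--rm", "--network", "none",
               "-v", f"{repo}:/work", "-w", "/work"]
    return [
        # 1. Verify the patch applies cleanly before touching the tree.
        ["git", "-C", str(repo), "apply", "--check", str(patch)],
        # 2. Apply the patch to the pinned snapshot.
        ["git", "-C", str(repo), "apply", str(patch)],
        # 3. Public tests (assumed to live under tests/ in the repo).
        sandbox + [image, "python", "-m", "pytest", "tests"],
        # 4. Hidden tests, mounted read-only so a patch cannot alter them.
        sandbox + ["-v", f"{hidden_tests}:/hidden:ro",
                   image, "python", "-m", "pytest", "/hidden"],
    ]
```

Returning the plan as data rather than running it keeps the ordering easy to audit and to log alongside each verdict.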
A number-theory package fails on a subset of valid inputs. Diagnose the implementation and submit a minimal diff against the pinned repository snapshot.
A sequence-analysis utility violates its documented contract. Find the latent defect, preserve its complexity, and submit the smallest defensible patch.
Count monotone paths through a constrained lattice without visiting blocked cells. Exact integer answer, deterministic direct check.
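For orientation, the standard dynamic-programming approach to this family of problems looks like the sketch below. The grid model is an assumption: cells indexed (row, col), moves restricted to right or down, start at the top-left cell and finish at the bottom-right. The trial's actual constraints may differ.

```python
def count_monotone_paths(rows: int, cols: int, blocked) -> int:
    """Count right/down paths from (0, 0) to (rows-1, cols-1) avoiding blocked cells."""
    blocked = set(blocked)
    if (0, 0) in blocked or (rows - 1, cols - 1) in blocked:
        return 0
    # ways[r][c] = number of valid paths that end at cell (r, c).
    ways = [[0] * cols for _ in range(rows)]
    ways[0][0] = 1
    for r in range(rows):
        for c in range(cols):
            if (r, c) in blocked:
                continue  # no path may pass through a blocked cell
            if r > 0:
                ways[r][c] += ways[r - 1][c]  # arrive from above
            if c > 0:
                ways[r][c] += ways[r][c - 1]  # arrive from the left
    return ways[rows - 1][cols - 1]
```

On an open 3 by 3 grid this gives 6 paths; blocking the center cell leaves 2. Python integers are arbitrary precision, so the count stays exact on large grids.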
Evaluate a right-associative power tower modulo 1000. A compact calibration trial for exact arithmetic and instruction handling.
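Right-associative means the tower is evaluated from the top down: 2^3^2 is 2^(3^2) = 512, not (2^3)^2 = 64. One common way to reduce such a tower is the generalized Euler theorem: when the exponent E is at least φ(m), a^E ≡ a^((E mod φ(m)) + φ(m)) (mod m). The sketch below assumes positive integer bases; the function names are illustrative and not part of the trial.

```python
def totient(n: int) -> int:
    """Euler's phi via trial division (fine for small moduli such as 1000)."""
    result, rest, p = n, n, 2
    while p * p <= rest:
        if rest % p == 0:
            while rest % p == 0:
                rest //= p
            result -= result // p
        p += 1
    if rest > 1:
        result -= result // rest
    return result


def capped_tower(bases, cap: int) -> int:
    """Return min(tower value, cap) for cap >= 1 without building huge numbers."""
    if not bases or cap == 1 or bases[0] == 1:
        return 1  # the tower is at least 1, and exactly 1 if empty or led by 1
    limit = cap.bit_length()  # 2**limit > cap
    e = capped_tower(bases[1:], limit)
    if e >= limit:
        return cap  # head >= 2, so head**E >= 2**limit > cap
    return min(bases[0] ** e, cap)  # e is the exact exponent here


def tower_mod(bases, m: int) -> int:
    """Evaluate bases[0]^(bases[1]^(...)) modulo m for positive integer bases."""
    if m == 1:
        return 0
    if not bases:
        return 1
    head, tail = bases[0], bases[1:]
    if not tail:
        return head % m
    phi = totient(m)
    small = capped_tower(tail, phi)
    if small < phi:
        return pow(head, small, m)  # exponent is small enough to use exactly
    # Exponent is at least phi(m): reduce it with the generalized Euler theorem.
    return pow(head, tower_mod(tail, phi) + phi, m)
```

For example, tower_mod([2, 3, 2], 1000) returns 512. The explicit size check in capped_tower matters because the reduction is only valid once the exponent reaches φ(m).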
Count length-50 binary strings that avoid a forbidden substring. Brute force is out; exact combinatorics or an automaton is expected.
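The automaton route can be sketched as follows: build the KMP matching automaton for the forbidden pattern, then count strings by how much of the pattern they currently match. The pattern passed in is a placeholder, since the trial's actual forbidden substring is not stated here, and count_avoiding is an illustrative name.

```python
def count_avoiding(n: int, pattern: str) -> int:
    """Count binary strings of length n that never contain the non-empty pattern."""
    k = len(pattern)
    # KMP failure function: longest proper prefix that is also a suffix.
    fail = [0] * k
    for i in range(1, k):
        j = fail[i - 1]
        while j and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j

    def step(state: int, ch: str) -> int:
        """Next matched-prefix length after reading ch in the given state."""
        while state and pattern[state] != ch:
            state = fail[state - 1]
        return state + 1 if pattern[state] == ch else 0

    delta = [{ch: step(s, ch) for ch in "01"} for s in range(k)]

    # counts[s] = number of valid strings so far whose longest matched prefix is s.
    counts = [0] * k
    counts[0] = 1
    for _ in range(n):
        nxt = [0] * k
        for s, ways in enumerate(counts):
            if ways:
                for ch in "01":
                    t = delta[s][ch]
                    if t < k:  # t == k would complete the forbidden pattern
                        nxt[t] += ways
        counts = nxt
    return sum(counts)
```

The work is proportional to n times the pattern length, so length 50 is immediate, whereas enumerating all 2^50 strings is not. As a sanity check, avoiding "11" reproduces the Fibonacci numbers.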
We would rather ship two trial types well than seven badly. Security, long-context, refactor, and verified-autonomous tracks arrive only after the judge loop is proven.