paper surface
ARCTIC: Rethinking AGI benchmarking through Transfer & Induction Core
Abstract
Current frontier AGI benchmarks increasingly function as optimization drivers, encouraging the scaling of large language models toward AGI-like behavioral imitation. While effective in the short term, this paradigm incentivizes expensive data annotation, shifts focus away from research-driven AGI development, and lacks formal guarantees against manual inspection or leakage of private benchmark components during internal evaluation cycles. We introduce ARCTIC (Abstract and Reasoning Corpus through Transfer & Induction Core), a benchmark framework that shifts evaluation from unconstrained AGI systems to standardized small language models (SLMs) operating under a fixed Transfer & Induction Core. In this setting, transfer learning mechanisms emulate abstract AGI capabilities, while the SLM, defined by benchmark-specific constraints, serves as the evaluation target. We present ARCTIC-0, a demonstration benchmark containing 85 private ARC-AGI-style reasoning tasks.
live benchmarks
Public benchmark surface
Public leaderboard and breakdowns for the current benchmark.
Tag breakdown
Difficulty breakdown
paper preview
First page preview
Full article content from the Introduction onward is temporarily hidden until double-blind review is complete.