LIVE PDF access is temporarily disabled until double-blind review is complete.

paper surface

ARCTIC: Rethinking AGI benchmarking through Transfer & Induction Core

Anonymous Authors

ICML 2026 preliminary work | 2026-03-31

Abstract

Current frontier AGI benchmarks increasingly function as optimization drivers, encouraging the scaling of large language models toward AGI-like behavioral imitation. While effective in the short term, this paradigm incentivizes expensive data annotation, shifts focus away from research-driven AGI development, and lacks formal guarantees against manual inspection or leakage of private benchmark components during internal evaluation cycles. We introduce ARCTIC (Abstract and Reasoning Corpus through Transfer & Induction Core), a benchmark framework that shifts evaluation from unconstrained AGI systems to standardized small language models (SLMs) operating under a fixed Transfer & Induction Core. In this setting, transfer learning mechanisms emulate abstract AGI capabilities, while the SLM, defined by benchmark-specific constraints, serves as the evaluation target. We present ARCTIC-0, a demonstration benchmark containing 85 private ARC-AGI-style reasoning tasks.

Figure 1. Short Overview of ARCTIC workflow.

live benchmarks

Public benchmark surface

Public leaderboard and breakdowns for the current benchmark.

leaderboard

Rank    Model    Score    Finished

breakdowns

Tag breakdown

Difficulty breakdown

paper preview

First page preview


Full article content is temporarily hidden from the start of Introduction until double-blind review is complete.
