paper surface

ARCTIC: Rethinking OOD Generalization in AI Benchmarking through Transfer & Induction Core

Anonymous Authors

NeurIPS 2026 preliminary work | 2026-05-31

LIVE

Abstract

Current frontier AI benchmarks increasingly function as optimization drivers, encouraging the scaling of large language models toward general-purpose behavioral imitation. While effective in the short term, this paradigm incentivizes expensive data annotation, shifts focus away from research-driven AI development, and lacks formal guarantees against manual inspection or leakage of private benchmark components during internal evaluation cycles. We introduce ARCTIC (The Abstraction and Reasoning Corpus through Transfer & Induction Core), a benchmark framework that shifts evaluation from unconstrained systems to standardized small language models (SLMs) operating under a fixed Transfer & Induction Core. In this setting, transfer learning mechanisms emulate abstract general capabilities, while the SLM, defined by benchmark-specific constraints, serves as the evaluation target. We present ARCTIC-0, a demonstration benchmark containing 85 private ARC-AGI-style reasoning tasks.

Figure 1. Overview of the ARCTIC workflow.

live benchmarks

Public benchmark surface

Public leaderboard and breakdowns for the current benchmark.

leaderboard

Rank | Model | Score | Finished

breakdowns

Tag breakdown

Difficulty breakdown

paper preview

First page preview

Full article content is temporarily hidden from the start of Introduction until double-blind review is complete.
