MARC2

Metaphor Abstraction and Reasoning Corpus v2

May 11, 2026

About

MARC2 is a dataset of abstract reasoning puzzles drawn from ARC-AGI2 where figurative language—metaphors, analogies, and domain-grounded reframings—demonstrably helps AI models solve tasks they cannot crack from examples alone.

The dataset exploits the capability gap between models. Claude Opus 4.6, which solves 865 of the 1,000 ARC-AGI2 training tasks, distills its reasoning into language-complete descriptions. Smaller subject models are then tested under three controlled conditions (examples only, language only, and both together) to identify where figurative reframings unlock understanding that neither modality provides on its own.

Each verified puzzle ships with 15 figurative variants spanning domains from music theory to fluid dynamics, enabling fine-grained study of how conceptual framing interacts with model reasoning.
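As a way of making the layout concrete, here is a minimal sketch of how one per-puzzle record could be represented. The field names and types are illustrative assumptions, not the released schema:

    from dataclasses import dataclass, field

    @dataclass
    class FigurativeVariant:
        domain: str          # e.g. "music theory", "fluid dynamics" (illustrative)
        description: str     # figurative rewrite of the literal description
        marc_verified: bool  # True if this variant exhibits the MARC property

    @dataclass
    class Marc2Task:
        task_id: str               # ARC-AGI2 task identifier
        train_pairs: list          # (input grid, output grid) example pairs
        test_pairs: list
        literal_description: str   # validated LARC-style see / do / grid text
        variants: list[FigurativeVariant] = field(default_factory=list)  # 15 per task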

1,120  ARC-AGI2 Tasks
865    Solved by Claude
791    Validated Descriptions
350    MARC-Eligible Tasks
1,910  Figurative Clues
8      Subject Models

Pipeline

Upper Bound — Claude Opus 4.6

1. Solve ARC-AGI2 tasks via subagents, capturing reasoning traces (865 / 1,000 training)
2. Distill reasoning into LARC-style, language-complete descriptions (see / do / grid)
3. Validate descriptions: a fresh subagent must solve from the description alone (791 validated; sketched below)
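The validation in step 3 is a round trip: a fresh model sees only the description and the test input, never the training examples, and its predicted grid must match the ground truth exactly. A hedged sketch, where solve_from_description is a hypothetical stand-in for whatever model call the pipeline actually makes:

    def is_language_complete(description, test_input, expected_output,
                             solve_from_description):
        """A description is validated when a fresh solver, given only the
        description and the test input (no training examples), reproduces
        the expected output grid exactly."""
        predicted = solve_from_description(description, test_input)
        return predicted == expected_output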

Lower Bound — Subject Models

4. Baseline three-condition testing: examples only, language only, both (per model)
5. Classify tasks as examples-sufficient, language-sufficient, both-required, or unsolvable (4 subsets; see the sketch after this list)
6. Generate figurative descriptions from the validated literal ones (350 tasks)
7. Test figurative descriptions on subject models and verify the MARC property (per model)
8. Generate 15 domain-diverse alternative figurative clues per task (1,910 clues)
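Step 5's classification reduces to a small decision over the three baseline outcomes. A sketch, assuming each condition has already been scored as pass/fail; the precedence order here (examples first, then language) is an assumption, not a documented rule:

    def classify_task(examples_only: bool, language_only: bool, both: bool) -> str:
        """Assign a task to one of the four baseline subsets for a subject model.
        MARC candidates come from tasks where examples alone fail; step 7 then
        checks the full figurative property per clue."""
        if examples_only:
            return "examples-sufficient"
        if language_only:
            return "language-sufficient"
        if both:
            return "both-required"  # neither modality suffices on its own
        return "unsolvable"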

The MARC Property

A task has the MARC property for a given model when:

- the model fails with the training examples alone,
- the model fails with the figurative description alone, and
- the model succeeds when given the figurative description and the examples together.

This three-way contrast isolates the unique contribution of figurative language: it is neither a substitute for examples nor redundant with them.
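The property can be written directly as a predicate over the three trial outcomes. A minimal sketch, assuming pass/fail results per condition:

    def has_marc_property(examples_only_pass: bool,
                          figurative_only_pass: bool,
                          combined_pass: bool) -> bool:
        """A (task, clue, model) triple exhibits the MARC property when
        examples alone fail, the figurative clue alone fails, and the
        clue together with the examples succeeds."""
        return (not examples_only_pass) and (not figurative_only_pass) and combined_pass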

Subject Models

Model             Ex. Only   Lang. Only   Both    MARC Tasks   MARC Clues
qwen3.6-35b       15.3%      35.9%        31.7%   126          455
gemma-4-26b       25.8%      59.3%        54.6%   110          506
gpt-oss-120b      25.8%      58.2%        51.5%   104          848
qwen3.6-27b       12.8%      39.3%        29.3%   101          358
gemma-4-31b       37.8%      68.7%        68.0%   96           445
qwen3.5-122b      11.6%      40.8%        31.7%   77           322
gpt-oss-20b       11.0%      42.1%        27.1%   67           261
nemotron-3-super  9.6%       41.0%        35.5%   57           235

Baseline accuracy on 791 validated tasks under three conditions. "MARC Tasks" = unique tasks satisfying the full MARC property (examples fail, figurative alone fails, figurative + examples succeed). "MARC Clues" = total figurative descriptions exhibiting the property across all variants.

Interactive Views

Featured views use gpt-oss-120b as the primary subject model. Views for all eight subject models are available below.

Analysis Report
Pipeline funnel, baseline accuracy, MARC yield by domain, opacity analysis.
Puzzle Inspector
Browse MARC puzzles with grids, literal and figurative descriptions, and trial results.
Variant Comparison
Side-by-side comparison of 15 domain-diverse metaphors per puzzle.
Views for other subject models

Per-model report, inspector, and comparison views are available for: qwen3.6-35b, gemma-4-26b, qwen3.6-27b, gemma-4-31b, qwen3.5-122b, gpt-oss-20b, and nemotron-3-super.