MARC2

Metaphor Abstraction and Reasoning Corpus v2

May 11, 2026

About

MARC2 is a dataset of abstract reasoning puzzles drawn from ARC-AGI2 where figurative language—metaphors, analogies, and domain-grounded reframings—demonstrably helps AI models solve tasks they cannot crack from examples alone.

The dataset exploits the capability gap between models. Claude Opus 4.6, which solves 865 of the 1,000 ARC-AGI2 training tasks, distills its reasoning into language-complete descriptions. Smaller subject models are then tested under three controlled conditions (examples only, language only, and both together) to identify where figurative reframings unlock understanding that neither modality provides on its own.

Each verified puzzle ships with 15 figurative variants spanning domains from music theory to fluid dynamics, enabling fine-grained study of how conceptual framing interacts with model reasoning.
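As a way of making the layout concrete, here is a minimal sketch of how one per-puzzle record could be represented. The field names and types are illustrative assumptions, not the released schema:

    from dataclasses import dataclass, field

    @dataclass
    class FigurativeVariant:
        domain: str          # e.g. "music theory", "fluid dynamics" (illustrative)
        description: str     # figurative rewrite of the literal description
        marc_verified: bool  # True if this variant exhibits the MARC property

    @dataclass
    class Marc2Task:
        task_id: str               # ARC-AGI2 task identifier
        train_pairs: list          # (input grid, output grid) example pairs
        test_pairs: list
        literal_description: str   # validated LARC-style see / do / grid text
        variants: list[FigurativeVariant] = field(default_factory=list)  # 15 per task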

1,120  ARC-AGI2 Tasks
865    Solved by Claude
791    Validated Descriptions
350    MARC-Eligible Tasks
1,910  Figurative Clues
8      Subject Models

Pipeline

Upper Bound — Claude Opus 4.6

1. Solve ARC-AGI2 tasks via subagents, capturing reasoning traces (865 / 1,000 training)
2. Distill reasoning into LARC-style, language-complete descriptions (see / do / grid)
3. Validate descriptions: a fresh subagent must solve from the description alone (791 validated; sketched below)
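The validation in step 3 is a round trip: a fresh model sees only the description and the test input, never the training examples, and its predicted grid must match the ground truth exactly. A hedged sketch, where solve_from_description is a hypothetical stand-in for whatever model call the pipeline actually makes:

    def is_language_complete(description, test_input, expected_output,
                             solve_from_description):
        """A description is validated when a fresh solver, given only the
        description and the test input (no training examples), reproduces
        the expected output grid exactly."""
        predicted = solve_from_description(description, test_input)
        return predicted == expected_output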

Lower Bound — Subject Models

4. Baseline three-condition testing: examples only, language only, both (per model)
5. Classify tasks as examples-sufficient, language-sufficient, both-required, or unsolvable (4 subsets; see the sketch after this list)
6. Generate figurative descriptions from the validated literal ones (350 tasks)
7. Test figurative descriptions on subject models and verify the MARC property (per model)
8. Generate 15 domain-diverse alternative figurative clues per task (1,910 clues)
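Step 5's classification reduces to a small decision over the three baseline outcomes. A sketch, assuming each condition has already been scored as pass/fail; the precedence order here (examples first, then language) is an assumption, not a documented rule:

    def classify_task(examples_only: bool, language_only: bool, both: bool) -> str:
        """Assign a task to one of the four baseline subsets for a subject model.
        MARC candidates come from tasks where examples alone fail; step 7 then
        checks the full figurative property per clue."""
        if examples_only:
            return "examples-sufficient"
        if language_only:
            return "language-sufficient"
        if both:
            return "both-required"  # neither modality suffices on its own
        return "unsolvable"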

The MARC Property

A task has the MARC property for a given model when:

- the model fails with the training examples alone,
- the model fails with the figurative description alone, and
- the model succeeds when given the figurative description and the examples together.

This three-way contrast isolates the unique contribution of figurative language: it is neither a substitute for examples nor redundant with them.
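The property can be written directly as a predicate over the three trial outcomes. A minimal sketch, assuming pass/fail results per condition:

    def has_marc_property(examples_only_pass: bool,
                          figurative_only_pass: bool,
                          combined_pass: bool) -> bool:
        """A (task, clue, model) triple exhibits the MARC property when
        examples alone fail, the figurative clue alone fails, and the
        clue together with the examples succeeds."""
        return (not examples_only_pass) and (not figurative_only_pass) and combined_pass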

Subject Models

Model             Ex. Only   Lang. Only   Both    MARC Tasks   MARC Clues
qwen3.6-35b       15.3%      35.9%        31.7%   126          455
gemma-4-26b       25.8%      59.3%        54.6%   110          506
gpt-oss-120b      25.8%      58.2%        51.5%   104          848
qwen3.6-27b       12.8%      39.3%        29.3%   101          358
gemma-4-31b       37.8%      68.7%        68.0%   96           445
qwen3.5-122b      11.6%      40.8%        31.7%   77           322
gpt-oss-20b       11.0%      42.1%        27.1%   67           261
nemotron-3-super  9.6%       41.0%        35.5%   57           235

Baseline accuracy on 791 validated tasks under three conditions. "MARC Tasks" = unique tasks satisfying the full MARC property (examples fail, figurative alone fails, figurative + examples succeed). "MARC Clues" = total figurative descriptions exhibiting the property across all variants.

Interactive Views

Featured views use gpt-oss-120b as the primary subject model. Views for all eight subject models are available below.

Analysis Report
Pipeline funnel, baseline accuracy, MARC yield by domain, opacity analysis.
Puzzle Inspector
Browse MARC puzzles with grids, literal and figurative descriptions, and trial results.
Variant Comparison
Side-by-side comparison of 15 domain-diverse metaphors per puzzle.
Views for other subject models

Per-model report, inspector, and comparison views are available for: qwen3.6-35b, gemma-4-26b, qwen3.6-27b, gemma-4-31b, qwen3.5-122b, gpt-oss-20b, and nemotron-3-super.