Could a tiny model have predicted Pluto's demotion?

An interactive walkthrough of the Pluto-Toy experiment · May 7, 2026

In August 2006, the International Astronomical Union voted Pluto out of the planets. The case had been quietly building for years: astronomers had been finding objects out beyond Neptune since the 1990s, and in 2005 came Eris, an object apparently larger than Pluto with no place in the existing planet category. The category was fraying.

This page walks through a small experiment that asks a sharper version of that historical question: given only the kind of evidence Pluto-era astronomers had, can a tiny language model detect that the planet category is under strain? And, more interestingly: which way of asking the model gives a meaningful answer?

The walkthrough is interactive. Each section has controls you can play with — sliders, toggles, hover-tooltips on the plots. You don't need a machine-learning background; the goal is to make the structure of the experiment legible.

Words with a dotted underline have a hover-definition: rest your cursor (or your finger, on a touchscreen) on them to see a one-sentence explanation.

1. The toy world

To study category strain in something we can fully control, we don't use real astronomy. We invent a synthetic world. Each “celestial body” is described by exactly three discrete features: mass, diameter, and orbit, each taking one of a small set of values (P10's, for instance, are small, large, and distant).

The world has four canonical categories — planet, asteroid, comet, moon — each with a prototypical feature combination. P1…P9 are nine generic planets sampled near the planet prototype. P10 is our Pluto-analog: it has the planet label, but its features sit at the small-mass, distant-orbit edge of the category — (mass=small, diameter=large, orbit=distant), shown as the pink star in the plot below.

Then we add six new entities — E1…E6, the Eris-analogs — each labeled dwarf. The experimental knob is where in feature space we place those six dwarfs. There are only two places they can go: right on top of P10's feature combination (mass=small, diameter=large, orbit=distant), where they directly contest P10's planet label, or in a disjoint corner of feature space that no existing category occupies.

So the question is: of the six dwarfs, how many do we put in the disjoint corner? We call that knob the corner-mix. The buttons below run the experiment with four different values of it. Click each one to see where the six dwarfs end up in the 3D feature space.

Each point is one entity. Hover for its name and category. P10 is the pink star. The red diamonds are the six dwarf-labeled entities; their positions change with corner-mix. (We add a small visual jitter so that several entities at the same nominal position appear as a cloud rather than a single stacked marker — nothing is randomized in the actual experiment.)
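To make the knob concrete, here is a minimal sketch of how the six dwarf placements could be generated from a corner-mix value. P10's feature combination and the dwarf label come from the description above; the exact coordinates of the disjoint corner, and the generator's interface, are assumptions for illustration (the real generator lives in the repo).

```python
# Illustrative sketch of the corner-mix knob -- not the repo's actual generator.

# P10's feature combination: the "overlap" placement.
OVERLAP  = {"mass": "small", "diameter": "large", "orbit": "distant"}
# A feature combination used by no canonical category: the "disjoint corner".
# (These particular values are an assumption for illustration.)
DISJOINT = {"mass": "large", "diameter": "small", "orbit": "distant"}

def place_dwarfs(corner_mix: int, n_eris: int = 6):
    """corner_mix = how many of the six dwarfs go to the disjoint corner;
    the rest sit directly on P10's features."""
    placements = []
    for i in range(n_eris):
        features = DISJOINT if i < corner_mix else OVERLAP
        placements.append((f"E{i + 1}", dict(features), "dwarf"))
    return placements

# corner_mix = 0 -> all six dwarfs share P10's features (maximum conflict)
# corner_mix = 6 -> all six sit in the disjoint corner (the rote control)
```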

Reading the four conditions: 0/6 puts all six dwarfs in the overlap region around P10 (maximum feature conflict); 6/6 puts all six in the disjoint corner (the rote control); the two intermediate settings split the six between the regions.

Aside: what does “rote” mean here?

“Rote” is the everyday word: learning by memorizing flat pairings, without integrating them into anything else. When a child memorizes “8×7=56” as a free-standing fact rather than working it out from 8×8=64 minus 8, that's rote.

Applied to this experiment: under the rote-control condition (corner-mix = 6/6), the six dwarfs sit in a region of feature space that no other category occupies. So the model can only learn “dwarf” by rote — by memorizing “things-with-these-particular-features map to the word dwarf,” with no overlap or conflict against what it already knows about planets. There's no opportunity for category strain because the feature evidence and the existing label structure don't compete.

That makes the 6/6 condition the experimental control against tension. If the model produces both labels for P10 even at 6/6, it can't be because the features support both labels — the dwarf-labeled examples don't share P10's features at all. So 6/6 is the “flat memorization only” setup; 0/6 is the “features-actively-contested” setup; and the intermediate values let us watch how the two stories trade off.

Why does any of this matter? Because if the model picks up “dwarf” for P10, we want to know why. Did it pick that up because P10's features look just like a dwarf's (the overlap-driven story)? Or because the model has just generally learned to say “dwarf” after seeing dwarf-labeled examples in phase 2 (a frequency-driven story that has nothing to do with P10's features)? The corner-mix knob separates these two. At 0/6 they're confounded; at 6/6 only the frequency story can be at work.

2. Three ways of training

The model trains in two phases. Phase 1 is the canon: 28 entities (P1…P10, plus asteroids, comets, moons), each described and labeled. Phase 2 is the new evidence: 6 E* described and labeled as dwarfs.

How those two phases get combined matters — a lot. We compare three schedules:

Canon-only: just phase 1. The model never sees the E*. This is our “before-Eris” baseline.
Curriculum: phase 1 first, then phase 2 as a fine-tune. This mirrors how a student learns — textbook first, new findings after — and it is also how most real-world models get updated. The danger here is catastrophic forgetting: when a network is trained on new data without rehearsing the old, the new data can overwrite the old representation.
Mixed: shuffle phase 1 and phase 2 together and train on the union. The model sees canon and evidence interleaved throughout, so canon never gets a chance to be forgotten.
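In code, the three schedules differ only in which corpus each call to the training loop sees. A minimal sketch, with step counts taken from section 7 and train() standing in for a standard next-token loop (a sketch of that loop appears under "Training schedule" in section 7):

```python
import random

def train(model, corpus, steps):
    """Placeholder for a standard next-token training loop
    (see the sketch under "Training schedule" in section 7)."""
    ...

def canon_only(model, phase1):
    train(model, phase1, steps=1_500)     # the E* are never seen

def curriculum(model, phase1, phase2):
    train(model, phase1, steps=1_500)     # textbook first...
    train(model, phase2, steps=300)       # ...then the new evidence alone
                                          # (catastrophic-forgetting risk)

def mixed(model, phase1, phase2):
    joint = list(phase1) + list(phase2)
    random.shuffle(joint)                 # canon and evidence interleaved
    train(model, joint, steps=1_500)      # canon is never absent
```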

The toy uses a 50,000-parameter transformer (about four million times smaller than ChatGPT-class models). We also test a roughly 300,000-parameter version — still tiny — to see whether more capacity changes anything.

3. The dual-label probe

Once the model is trained, we ask: what does it think P10 is? The natural way is to give it the prompt

P10 has mass small diameter large orbit distant . P10 is a ___

and look at the probability the model assigns to each possible next word. (More precisely: the model produces a number for every word in its vocabulary, and we run those numbers through the softmax function to turn them into probabilities that sum to 1.) A forced-choice probe asks: did planet beat dwarf? The trouble is, this collapses two pieces of information into one. If the model is genuinely conflicted — if it has acquired both labels for P10 — the forced-choice probe cannot say so.
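Mechanically, reading any next-word probability off the model looks like the sketch below, assuming the usual PyTorch interface where the model returns next-token logits over the vocabulary and the tokenizer maps each word to one id (the actual probe is tension_probe.py in the repo):

```python
import torch
import torch.nn.functional as F

def next_word_probs(model, tokenizer, prompt, words=("planet", "dwarf")):
    """Return P(word | prompt) for each word, read off the next-token
    distribution. Assumes model(ids) -> logits of shape (batch, seq, vocab)."""
    ids = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = model(ids)[0, -1]        # scores for the next token
    probs = F.softmax(logits, dim=-1)     # scores -> probabilities summing to 1
    return {w: probs[tokenizer.encode(w)[0]].item() for w in words}

# next_word_probs(model, tok,
#     "P10 has mass small diameter large orbit distant . P10 is a")
```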

The dual-label probe reads P(planet) and P(dwarf) independently. Each P10 reading then lands in a 2×2 grid:

                  P(planet) high            P(planet) low
P(dwarf) high     TENSION                   swap
                  both labels in play       label flipped
P(dwarf) low      canon                     nothing
                  still a planet            some third token wins

“High” is operationalized with an absolute floor: a label is in play when it has at least 10% of the probability mass. So a P10 reading of (P_planet=0.45, P_dwarf=0.50) would land in TENSION; a reading of (0.01, 0.99) would land in swap; and (0.99, 0.01) in canon.
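The cell assignment is just two threshold checks against that 10% floor; a minimal sketch:

```python
FLOOR = 0.10   # a label is "in play" at or above 10% of the probability mass

def classify_reading(p_planet: float, p_dwarf: float) -> str:
    planet_in_play = p_planet >= FLOOR
    dwarf_in_play = p_dwarf >= FLOOR
    if planet_in_play and dwarf_in_play:
        return "tension"    # both labels in play
    if dwarf_in_play:
        return "swap"       # label flipped
    if planet_in_play:
        return "canon"      # still a planet
    return "nothing"        # some third token wins

# classify_reading(0.45, 0.50) -> "tension"
# classify_reading(0.01, 0.99) -> "swap"
# classify_reading(0.99, 0.01) -> "canon"
```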

The motivating distinction is that tension is structural conflict — the model has internalized two competing categorizations for the same features — while swap is just relabeling: the model has been pushed from one verdict to the other. The same forced-choice probe sees both as “dwarf wins,” but they are different states of the model, and only one of them is the phenomenon a category-strain story would point to.

4. What the model thinks P10 is

Below is the headline result. We ran 160 trainings of the model: every combination of schedule (curriculum vs mixed), the four corner-mix conditions from section 1, and model size (50K vs ~300K params), with 10 random seeds per cell to make sure the picture isn't a fluke of a single training run.

Each point is the mean of 10 random seeds; error bars are one standard error. Solid line is P(planet); dashed is P(dwarf). Toggle the controls to compare model sizes and schedules.

The mixed schedule is rock-solid canon

If you turn off curriculum and look at just mixed, you see a flat picture: the model says P10 is a planet with P(planet) ≈ 1.0 at every corner-mix, at both model sizes. Phase 1 is always present in the training mix, so the planet label for P10 is never under threat.

The curriculum schedule depends on size

Now turn off mixed and look at curriculum for the small model. The picture is messy — P(planet) hovers around 0.25–0.4 with very wide error bars. The 50K-param model is too small to respond to the corner-mix knob consistently: some seeds end up in canon, some in swap, a few in tension — it's a high-variance regime.

Switch to the large (~300K) model and the curriculum picture sharpens dramatically. At low corner-mix (at most 4 of the 6 E* in the disjoint corner, so at least 2 still sit on P10's features), P(planet) is near zero and P(dwarf) is near 0.9. The model has been pushed cleanly into a label swap. At full rote-control (6/6), P(planet) jumps to about 0.42 and the picture splits across seeds.

The larger model isn't more uncertain — it's more decisive. Capacity goes into learning the phase-2 pattern more thoroughly, which costs the planet category for P10 more cleanly when feature evidence supports the dwarf label.

5. Comparison with shallow classifiers

So far we've only looked at what the trained transformer says. But the same labeled feature data is available to any classifier. What does a shallow classifier — one with no sequence modeling, no schedule effects, no fine-tuning dynamics — predict for P10?

Below: four baselines, each fit on the same union of phase-1 + phase-2 entities (34 labeled examples per corpus), all asked the same question: given P10's three features, what's the probability of each category?

Solid lines: P(planet | P10's features). Dashed lines: P(dwarf | P10's features). The baselines are deterministic given the corpus; the transformer error bars are 1 standard error across 10 seeds.

The data alone produces a clean monotone

Every baseline shows the same pattern. At 0/6 (full overlap), the data implies P(dwarf | P10's features) is near 1 — six entities labeled dwarf sit in P10's feature neighborhood, and the only entity with the planet label there is P10 itself. At 6/6 (full rote-control), the dwarfs sit in a disjoint corner and contribute nothing to P10's neighborhood, so P(dwarf | P10's features) falls to zero.
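The simplest way to see that argument is to count labels in P10's exact feature cell. This is not one of the four fitted baselines (those live in baseline_classifier.py); it is just the same reasoning as a back-of-the-envelope classifier, with hypothetical variable names:

```python
from collections import Counter

def label_fractions_at(features, corpus):
    """Fraction of each label among entities whose features match exactly.
    corpus: list of (name, features_dict, label) triples for phase 1 + phase 2."""
    counts = Counter(label for _, f, label in corpus if f == features)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# At corner-mix 0/6, P10's cell holds P10 (planet) plus E1..E6 (dwarf):
#   label_fractions_at(p10_features, corpus) -> {"dwarf": 6/7, "planet": 1/7}
# At 6/6 the dwarfs sit elsewhere, so the cell holds only P10:
#   label_fractions_at(p10_features, corpus) -> {"planet": 1.0}
```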

Notice that none of the baselines lands in tension either. At every corner-mix, the data implies one decisive answer; the answer just shifts smoothly from dwarf to planet as overlap drops. The data does not contain category strain. What it contains is a shifting single-label majority.

Curriculum approximates the baseline; mixed overrides it

Turn the controls so only the large (~300K) curriculum curve and the four baselines are visible. The large transformer under curriculum has the same shape as the baselines — P(dwarf) falls as corner-mix rises — but is offset upward, especially at high corner-mix. The most natural reading is that curriculum imposes an extra phase-2 frequency bias on top of the baseline: every “X is a” prompt during phase 2 has been followed by dwarf, so the model carries a residual bias toward producing dwarf even when the feature evidence stops supporting it.

Now consider mixed. The mixed transformer (which you can verify in section 4) sits at P(planet) ≈ 1.0 across every corner-mix — including the overlap-heavy ones where shallow classifiers are emphatic that P10 should be a dwarf. The mixed-schedule transformer ignores feature evidence that the baselines treat as decisive. Phase 1 says “P10 is a planet,” and the model holds that label even when six dwarf-labeled entities sit on top of P10 in feature space.

6. What we learned

  1. Tension is rare. Across 160 trainings of the toy at two model sizes, no condition produces the “both labels in play simultaneously” cell as the modal outcome. Tension exists in a few seeds at a few corner-mix values, but it is a transient between two more decisive attractors (canon and swap).
  2. Curriculum is a data-conditional regime; mixed is a label-conditional one. Under curriculum, the model approximately tracks what shallow classifiers would predict from feature evidence. Under mixed, the model holds the explicit canon label regardless of what the features say.
  3. The data has the evidence; what the model does with it depends on schedule. A larger model amplifies whichever of these two tendencies the schedule selects. It does not introduce a third regime (such as a stable tension cell).

For an undergraduate reader: the lesson is methodological. When we ask whether a language model has “noticed” something in its data, the answer depends on (a) how we ask — the probe matters — and (b) how the data was sequenced into training. The same labeled corpus can produce a model that respects feature distributions or one that ignores them. Comparing the model's verdicts to shallow baselines on the same data is one of the cleanest ways to tell which is happening.

This is a miniature of a much bigger question that the parent project is asking at full scale: what would it have taken for a model trained only on pre-2006 astronomy texts to recognize that the planet category was already strained? The toy says: there's no single “recognized” state. There are several different things a model might do with that evidence, and which one happens depends on choices we make in how the model is built and trained.

7. Under the hood: training details

This section is for readers who want to know how the model was actually trained. Most of the page above can be read without it. The numbers here are the exact settings used to produce the results in sections 4 and 5.

Architecture

Both model sizes are minimal decoder-only transformers (the same family as GPT-2, just much smaller), implemented in <200 lines of PyTorch. Pre-LayerNorm residual blocks; multi-head causal self-attention; a 4× expansion MLP per block; tied input and output embeddings. No attention dropout, no fused kernels — every operation visible in the source.

                               small     large
n_embd (hidden width)          32        64
n_layer (transformer blocks)   4         6
n_head (attention heads)       4         4
head_dim                       8         16
block_size (context length)    64        64
parameter count                54,464    306,688
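For readers who want the shape of the network in code, here is a sketch of one block in the style described above (pre-LayerNorm, causal self-attention, 4× MLP). The real model is model.py in the repo; details such as the activation function, bias terms, and the attention implementation are assumptions here.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LayerNorm decoder block: causal self-attention, then a 4x MLP.
    The full model stacks n_layer of these between token + position embeddings
    and a tied output head."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )
        # Causal mask: True entries are positions a token may NOT attend to.
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), 1)
        self.register_buffer("mask", mask)

    def forward(self, x):                        # x: (batch, seq, n_embd)
        T = x.size(1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=self.mask[:T, :T])
        x = x + a                                # residual around attention
        x = x + self.mlp(self.ln2(x))            # residual around the MLP
        return x
```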

Corpus

The tokenizer is word-level (the corpus generator uses whitespace-separated tokens, including punctuation), fit on the union of phase 1 + phase 2. Vocabulary size is 64 tokens: a handful of category and feature words plus 28 + 6 entity names. Corpus sizes (for the May 7 sweeps with n_eris = 6):

                         lines   tokens
Phase 1 (canon)            480    3,520
Phase 2 (evidence)          72      504
Joint (mixed schedule)     552    4,024
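The tokenizer itself is only a few lines. A sketch of the word-level scheme described above (the repo's version may differ in details such as vocabulary ordering), with phase1_lines and phase2_lines as hypothetical names for the two corpora:

```python
def fit_vocab(lines):
    """One id per distinct whitespace-separated token across all lines."""
    words = sorted({w for line in lines for w in line.split()})
    return {w: i for i, w in enumerate(words)}

def encode(line, vocab):
    return [vocab[w] for w in line.split()]

# vocab = fit_vocab(phase1_lines + phase2_lines)   # 64 distinct tokens here
```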

Training schedule

Canon-only and mixed each train for 1,500 steps on their corpus; curriculum trains 1,500 steps on phase 1 and then 300 fine-tuning steps on phase 2. Every step serves 2,048 tokens.
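The repo's train.py holds the actual loop; below is a sketch of the kind of next-token loop these numbers imply. Batches of 32 windows of 64 tokens match the 2,048 tokens per step; the optimizer and learning rate are placeholders, not the repo's settings.

```python
import torch
import torch.nn.functional as F

def train(model, token_ids, steps, batch_size=32, block_size=64, lr=3e-4):
    """Next-token training loop sketch. batch_size * block_size = 2,048 tokens
    per step; optimizer and learning rate here are placeholders."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    data = torch.tensor(token_ids)
    for _ in range(steps):
        ix = torch.randint(len(data) - block_size - 1, (batch_size,))
        x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
        y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # targets
        logits = model(x)                            # (batch, block, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```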

Tokens per parameter, and a note on Chinchilla

Multiplying out: each schedule serves the model the following total quantities of training tokens:

schedule      steps         tokens served   tokens/param (small)   tokens/param (large)   epochs over its corpus
canon-only    1,500         3,072,000       56.4                   10.0                   873 (phase 1)
curriculum    1,500 + 300   3,686,400       67.7                   12.0                   873 (phase 1) · 1,219 (phase 2)
mixed         1,500         3,072,000       56.4                   10.0                   764 (joint)

The Chinchilla scaling result (Hoffmann et al., 2022) found that compute-optimal language model pretraining wants roughly 20 training tokens per parameter. By that yardstick, the small model in this toy is trained 3× past Chinchilla-optimal (heavy over-training; the model has time to memorize the canon many times over) while the large model is trained at about 0.6× Chinchilla (slightly under-trained relative to compute-optimal).
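Spelled out with the numbers from the table above (curriculum schedule; the other schedules are within rounding of the same picture):

```python
CHINCHILLA_TOKENS_PER_PARAM = 20        # Hoffmann et al., 2022 rule of thumb

for name, params in [("small", 54_464), ("large", 306_688)]:
    tokens = 3_686_400                   # tokens served under curriculum
    tpp = tokens / params
    print(f"{name}: {tpp:.1f} tokens/param = "
          f"{tpp / CHINCHILLA_TOKENS_PER_PARAM:.1f}x Chinchilla-optimal")

# small: 67.7 tokens/param = 3.4x Chinchilla-optimal  (heavily over-trained)
# large: 12.0 tokens/param = 0.6x Chinchilla-optimal  (under-trained by that yardstick)
# (canon-only and mixed give 56.4 -> 2.8x and 10.0 -> 0.5x respectively)
```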

To be fair, the Chinchilla framing doesn't translate directly to this experiment. Chinchilla scaling laws are derived for pretraining on web-scale text where the goal is held-out loss minimization at fixed compute. We're running a controlled interpretability study on a synthetic corpus, where the goal is to fully internalize the canonical category structure so the dual-label probe has something to read. The right yardstick here is “has phase 1 converged?” (it has — loss falls to ~0.1 well before step 1,500), not “is total compute Chinchilla-optimal?”

That said, the cross-size comparison is informative: the larger model sees far fewer tokens per parameter than the small one (about 12 versus 68 under curriculum), but both see the same ~764–873 epochs over their corpus, many more passes than either needs to fit the canon. The May 7 finding that the large model amplifies the corner-mix effect under curriculum is therefore not driven by Chinchilla-style under-training of either model; both are far past the convergence point on phase 1 alone.

Why so many epochs? Most modern LM pretraining sees each token exactly once or twice. Here we deliberately train for many hundreds of epochs because (a) the corpus is so small that fewer passes leave the canon under-learned, and (b) we want catastrophic forgetting under the curriculum schedule to be a clean phenomenon rather than a partial pass artifact. Numerals, the sibling toy project, runs in a similar regime for the same reasons.

Hardware and wall time

All training runs use a single Apple M-series GPU via PyTorch's MPS backend. A small-model training run takes about 30 seconds; a large-model run takes about 40 seconds. The 160-cell follow-up sweep (10 seeds × 4 corner-mix × 2 schedules × 2 sizes) ran in about 95 minutes wall time on one machine.

The full code is in github.com/bertybaums/pluto-toy. Every operation is visible in the source: model.py is the transformer, train.py is the training loop, tension_probe.py is the dual-label probe, and baseline_classifier.py is what produced the four shallow-classifier curves in section 5.