67 Parameters. Perfect Addition.

A single-layer transformer that learned to add any two 10-digit numbers with 100% accuracy.

67 trainable parameters — 10,010 / 10,010 correct
+
Carry Chain
02 — TOKEN EMBEDDINGS

3 Parameters, 10 Digits

All digits live on a circular arc. Only radius, start angle, and stride define the entire embedding table.

Interactive Circular Arc

Drag the sliders to reshape the embedding. Hover digits for coordinates.

Why Circles Matter

The y-coordinate increases monotonically with digit value.

Key insight: When the model averages two digit embeddings (via ~50/50 attention), the y-coordinate of the average encodes x+y. The FFN then detects whether x+y ≥ 10 to determine carries.
03 — POSITION ENCODING

A 3D Sinusoidal Spiral

10 digit positions on a frozen sinusoidal ring. The carry position is 4.6× farther from the origin.

Position Norms

The carry position z_hi has norm ~16.2, dwarfing digit positions (~3.5).

04 — ATTENTION

Position-Only Lookahead

A −29.4° phase rotation breaks symmetry, creating the +1 lookahead that routes each answer digit to the correct inputs.

Phase Rotation Demo

Drag the angle to see how attention shifts from self to lookahead.

Attention Flow

Curved arcs show where each output position gathers information.

05 — FFN & CARRY DETECTION

2 Hidden Units, 1 Binary Decision

The FFN's two GELU units form a push-pull pair that detects whether the digit sum exceeds 10.

FFN Activation Space

Each dot is one output position. Red = carry, green = no carry.

Residual Stream: Before & After FFN

The FFN primarily modifies token dimensions 0 and 1.

06 — TRAINING

How 67 Parameters Learn to Add

Grokking: the model memorizes first, then suddenly generalizes. Carry-mix curriculum makes this reliable.

Grokking Curve

Token accuracy rises slowly, then exact match jumps from 0% to 100% around step 46K.

Carry-Mix Curriculum

80% carry-heavy sampling ensures the model sees hard cases early.

Result: Carry-mix with step-based fade improved grokking rate from ~10% to 100% of seeds.
07 — PARAMETER BUDGET

Every Parameter Counts

67 parameters. Zero frozen pretrained weights. Everything learned from scratch.