A single-layer transformer that learned to add any two 10-digit numbers with 100% accuracy.
All digits live on a circular arc; three numbers define the entire embedding table: radius, start angle, and stride.
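A minimal sketch of such a parameterization, assuming digit d sits at angle start_angle + d * stride on a circle of the given radius (the default radius of 3.5 matches the digit-position norm quoted below; the other defaults are illustrative, not the trained values):

```python
import numpy as np

def circular_embedding(radius=3.5, start_angle=0.0, stride=2 * np.pi / 10):
    """Place the 10 digit tokens evenly on a circle.

    Hypothetical parameterization: digit d is embedded at angle
    start_angle + d * stride, scaled by radius. Three scalars thus
    generate the whole (10, 2) embedding table.
    """
    digits = np.arange(10)
    angles = start_angle + digits * stride
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles)], axis=1)

emb = circular_embedding()
print(emb.shape)  # (10, 2)
```

Dragging the sliders below amounts to re-evaluating this function with new radius, start-angle, and stride values.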
Drag the sliders to reshape the embedding. Hover digits for coordinates.
The y-coordinate increases monotonically with digit value.
10 digit positions on a frozen sinusoidal ring. The carry position is 4.6× farther from the origin.
The carry position z_hi has norm ~16.2, dwarfing digit positions (~3.5).
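The 4.6× figure follows directly from the two quoted norms:

```python
z_hi_norm = 16.2   # carry-position norm quoted above
digit_norm = 3.5   # typical digit-position norm quoted above

ratio = z_hi_norm / digit_norm
print(round(ratio, 1))  # 4.6
```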
A −29.4° phase rotation breaks symmetry, creating the +1 lookahead that routes each answer digit to the correct inputs.
Drag the angle to see how attention shifts from self to lookahead.
Curved arcs show where each output position gathers information.
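A sketch of how a phase rotation can produce the +1 lookahead, assuming positions sit at 36° intervals on a ring (10 positions) and using the 29.4° magnitude from the text; the sign convention and exact geometry of the trained model are assumptions here. Because 29.4° is closer to the 36° stride than to 0°, rotating every query shifts each position's strongest attention from itself to its successor:

```python
import numpy as np

stride = 2 * np.pi / 10            # assumed: 36° spacing between positions
delta = np.deg2rad(29.4)           # rotation magnitude from the text

theta = np.arange(10) * stride
keys = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Rotate every query by delta around the ring.
rot = np.array([[np.cos(delta), -np.sin(delta)],
                [np.sin(delta),  np.cos(delta)]])
queries = keys @ rot.T

# score[p, q] = cos(theta_p + delta - theta_q): maximized by the key
# whose angle is nearest the rotated query angle.
scores = queries @ keys.T
print(scores.argmax(axis=1))  # [1 2 3 4 5 6 7 8 9 0]
```

With delta = 0 the argmax is the identity (pure self-attention); the rotation is what breaks that symmetry.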
The FFN's two GELU units form a push-pull pair that detects whether the digit sum reaches 10.
Each dot is one output position. Red = carry, green = no carry.
The FFN primarily modifies token dimensions 0 and 1.
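One way such a push-pull pair can work, with made-up weights for the sketch (the trained values are not given in the text): the "push" unit turns on as the sum passes 9, and the "pull" unit cancels its linear growth past 11, leaving a soft step centered at 10.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def carry_signal(digit_sum, w=2.0, lo=9.0, hi=11.0):
    """Illustrative push-pull pair: ~0 for sums of 9 or less,
    saturating near 2*w for sums of 11 or more."""
    return gelu(w * (digit_sum - lo)) - gelu(w * (digit_sum - hi))

for s in range(7, 14):
    print(s, round(carry_signal(s), 3))
```

Thresholding this signal at its midpoint separates the red (carry) and green (no-carry) dots in the scatter above.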
Grokking: the model memorizes first, then suddenly generalizes. A carry-mix curriculum makes the transition reliable.
Token accuracy rises slowly, then exact match jumps from 0% to 100% around step 46K.
80% carry-heavy sampling ensures the model sees hard cases early.
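A sketch of one plausible sampler for that curriculum, using the 80% figure from the text; the exact construction of "carry-heavy" pairs in training is not specified, so the forced-carry branch below is an assumption:

```python
import random

def sample_pair(p_carry_heavy=0.8, n_digits=10, rng=random):
    """Carry-mix curriculum sketch: with probability p_carry_heavy,
    emit a pair whose digit columns all sum to 10 or more, so every
    column produces a carry. Otherwise sample digits uniformly."""
    if rng.random() < p_carry_heavy:
        a_digits = [rng.randint(5, 9) for _ in range(n_digits)]
        b_digits = [rng.randint(10 - d, 9) for d in a_digits]  # force carries
    else:
        a_digits = [rng.randint(0, 9) for _ in range(n_digits)]
        b_digits = [rng.randint(0, 9) for _ in range(n_digits)]
    return a_digits, b_digits
```

Biasing toward forced-carry columns front-loads the hard cases that the carry mechanism must eventually handle.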
67 parameters. Zero frozen pretrained weights. Everything learned from scratch.