A single-layer transformer that learned to add any two 10-digit numbers with 100% accuracy.
All digits live on a circular arc; three numbers define the entire embedding table: radius, start angle, and stride.
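A minimal sketch of such a parameterization, assuming digit d sits at angle start_angle + d * stride on a circle of the given radius (the default radius of 3.5 matches the digit-position norm quoted below; the other defaults are illustrative, not the trained values):

```python
import numpy as np

def circular_embedding(radius=3.5, start_angle=0.0, stride=2 * np.pi / 10):
    """Place the 10 digit tokens evenly on a circle.

    Hypothetical parameterization: digit d is embedded at angle
    start_angle + d * stride, scaled by radius. Three scalars thus
    generate the whole (10, 2) embedding table.
    """
    digits = np.arange(10)
    angles = start_angle + digits * stride
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles)], axis=1)

emb = circular_embedding()
print(emb.shape)  # (10, 2)
```

Dragging the sliders below amounts to re-evaluating this function with new radius, start-angle, and stride values.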
Drag the sliders to reshape the embedding. Hover digits for coordinates.
The y-coordinate increases monotonically with digit value.
10 digit positions on a frozen sinusoidal ring. The carry position is 4.6× farther from the origin.
The carry position z_hi has norm ~16.2, dwarfing digit positions (~3.5).
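The 4.6× figure follows directly from the two quoted norms:

```python
z_hi_norm = 16.2   # carry-position norm quoted above
digit_norm = 3.5   # typical digit-position norm quoted above

ratio = z_hi_norm / digit_norm
print(round(ratio, 1))  # 4.6
```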
A −29.4° phase rotation breaks symmetry, creating the +1 lookahead that routes each answer digit to the correct inputs.
Drag the angle to see how attention shifts from self to lookahead.
Curved arcs show where each output position gathers information.
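A sketch of how a phase rotation can produce the +1 lookahead, assuming positions sit at 36° intervals on a ring (10 positions) and using the 29.4° magnitude from the text; the sign convention and exact geometry of the trained model are assumptions here. Because 29.4° is closer to the 36° stride than to 0°, rotating every query shifts each position's strongest attention from itself to its successor:

```python
import numpy as np

stride = 2 * np.pi / 10            # assumed: 36° spacing between positions
delta = np.deg2rad(29.4)           # rotation magnitude from the text

theta = np.arange(10) * stride
keys = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Rotate every query by delta around the ring.
rot = np.array([[np.cos(delta), -np.sin(delta)],
                [np.sin(delta),  np.cos(delta)]])
queries = keys @ rot.T

# score[p, q] = cos(theta_p + delta - theta_q): maximized by the key
# whose angle is nearest the rotated query angle.
scores = queries @ keys.T
print(scores.argmax(axis=1))  # [1 2 3 4 5 6 7 8 9 0]
```

With delta = 0 the argmax is the identity (pure self-attention); the rotation is what breaks that symmetry.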
The FFN's two GELU units form a push-pull pair that detects whether the digit sum reaches 10.
Each dot is one output position. Red = carry, green = no carry.
The FFN primarily modifies token dimensions 0 and 1.
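One way such a push-pull pair can work, with made-up weights for the sketch (the trained values are not given in the text): the "push" unit turns on as the sum passes 9, and the "pull" unit cancels its linear growth past 11, leaving a soft step centered at 10.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def carry_signal(digit_sum, w=2.0, lo=9.0, hi=11.0):
    """Illustrative push-pull pair: ~0 for sums of 9 or less,
    saturating near 2*w for sums of 11 or more."""
    return gelu(w * (digit_sum - lo)) - gelu(w * (digit_sum - hi))

for s in range(7, 14):
    print(s, round(carry_signal(s), 3))
```

Thresholding this signal at its midpoint separates the red (carry) and green (no-carry) dots in the scatter above.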
Grokking: the model memorizes first, then suddenly generalizes. A carry-mix curriculum makes the transition reliable.
Token accuracy rises slowly, then exact match jumps from 0% to 100% around step 46K.
80% carry-heavy sampling ensures the model sees hard cases early.
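A sketch of one plausible sampler for that curriculum, using the 80% figure from the text; the exact construction of "carry-heavy" pairs in training is not specified, so the forced-carry branch below is an assumption:

```python
import random

def sample_pair(p_carry_heavy=0.8, n_digits=10, rng=random):
    """Carry-mix curriculum sketch: with probability p_carry_heavy,
    emit a pair whose digit columns all sum to 10 or more, so every
    column produces a carry. Otherwise sample digits uniformly."""
    if rng.random() < p_carry_heavy:
        a_digits = [rng.randint(5, 9) for _ in range(n_digits)]
        b_digits = [rng.randint(10 - d, 9) for d in a_digits]  # force carries
    else:
        a_digits = [rng.randint(0, 9) for _ in range(n_digits)]
        b_digits = [rng.randint(0, 9) for _ in range(n_digits)]
    return a_digits, b_digits
```

Biasing toward forced-carry columns front-loads the hard cases that the carry mechanism must eventually handle.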
67 parameters. Zero frozen pretrained weights. Everything learned from scratch.