Every layer operates natively on the Lorentz hyperboloid. 43.69 PPL, 10.7% better than Euclidean. The first ground-up hyperbolic transformer to outperform its Euclidean baseline on language modeling.
Gromov δ = 0.166, the most tree-like of all models tested. Compare: Euclidean GPT-2 (0.22), Poincaré Hyp-Output (0.32), co-occurrence graph (0.43).
Our Poincaré results showed that bolting hyperbolic geometry onto a Euclidean transformer creates an information bottleneck: the exp0 map compresses rich transformer representations through tanh saturation, destroying the very structure the transformer learned. The Poincaré Hyp-Output GPT-2 achieves only 114.5 PPL vs 48.9 Euclidean.
This is analogous to 1990s image recognition: flattening pixels into an MLP ignores spatial structure. CNNs succeeded by making every operation respect that structure. Similarly, we need a transformer where every layer operates natively in hyperbolic space.
We observed two numerical failure modes in the Poincaré ball: tanh saturation in exp0 (the bottleneck above) and arctanh blow-up near the boundary. The Lorentz hyperboloid avoids both by construction:
| Property | Poincaré Ball | Lorentz Hyperboloid |
|---|---|---|
| Boundary | Singular (λ → ∞) | No boundary |
| Core operation | Möbius addition (expensive) | Minkowski inner product: −x0y0 + Σxiyi |
| Distance | arctanh (boundary issues) | arccosh (stable) |
| Attention | Not natural | ⟨Q, K⟩L replaces dot product |
| Optimizer | RiemannianAdam required | Standard Adam works |
Every component of the standard transformer is replaced with a hyperboloid-native equivalent. Parameters are stored as tangent vectors at the origin and mapped to the hyperboloid on-the-fly via exp0, so standard Adam works throughout.
| Component | Euclidean GPT-2 | Lorentz GPT-2 |
|---|---|---|
| Embedding | nn.Embedding (ℝ384) | Tangent vectors → exp0 (ℍ385) |
| Position | + learned vectors | Tangent-space addition → exp0 |
| Normalization | LayerNorm | FréchetNorm: log0 → LN → scale by 1/√d → exp0 |
| Attention scores | QKᵀ/√d | −c ⋅ ⟨Q,K⟩L / √d |
| Value aggregation | Weighted sum | Einstein midpoint + projection |
| Residual | x + f(x) | exp0(log0(x) + log0(f(x))) |
| FFN | Linear → GELU → Linear | log0 → Linear → GELU → Linear → exp0 |
| Output head | nn.Linear (weight-tied) | LorentzMLR: −dL(x, pk)² + bk |
| Curvature | N/A | Learnable c (converges to 0.77) |
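As a sketch of how these pieces compose, assuming GPT-2's pre-LN sublayer ordering and the hyperboloid-native submodules defined in the sections below (the names here are ours, not a library API):

```python
import torch.nn as nn

class LorentzBlock(nn.Module):
    """One transformer block in which every sublayer maps hyperboloid
    points to hyperboloid points. FréchetNorm, Lorentz attention, the
    log0/exp0-wrapped FFN, and the tangent-space residual are each
    sketched in the sections that follow."""
    def __init__(self, norm1, norm2, attn, ffn, residual):
        super().__init__()
        self.norm1, self.norm2 = norm1, norm2
        self.attn, self.ffn = attn, ffn
        self.residual = residual  # residual(x, y) = exp0(log0(x) + log0(y))

    def forward(self, x):
        x = self.residual(x, self.attn(self.norm1(x)))  # attention sublayer
        x = self.residual(x, self.ffn(self.norm2(x)))   # FFN sublayer
        return x
```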
The Lorentz model represents n-dimensional hyperbolic space as the upper sheet of a hyperboloid in (n+1)-dimensional Minkowski space:
ℍn = { x ∈ ℝn+1 : ⟨x, x⟩L = −1/c, x0 > 0 }
where the Minkowski inner product is ⟨x, y⟩L = −x0y0 + x1y1 + … + xnyn, and c > 0 is the curvature parameter (sectional curvature = −c). The origin is o = (1/√c, 0, …, 0).
The distance between two points on the hyperboloid is:
d(x, y) = (1/√c) ⋅ arccosh(−c ⋅ ⟨x, y⟩L)
Unlike the Poincaré distance (which uses arctanh and blows up near the boundary), arccosh is numerically stable for all points on the hyperboloid. There is no boundary singularity.
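A minimal PyTorch sketch of these two primitives (function names are ours):

```python
import torch

def minkowski_inner(x, y):
    # <x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn: a dot product with the
    # sign flipped on the time coordinate.
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_distance(x, y, c):
    # d(x, y) = (1/sqrt(c)) * arccosh(-c * <x, y>_L). On the hyperboloid
    # -c * <x, y>_L >= 1 exactly; the clamp guards against floating-point
    # error pushing the argument below arccosh's domain.
    arg = (-c * minkowski_inner(x, y)).clamp_min(1.0 + 1e-7)
    return torch.acosh(arg) / c ** 0.5
```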
The exp map takes a tangent vector v = (0, v1, …, vn) at the origin and maps it to a point on the hyperboloid:
exp0(v) = cosh(√c ⋅ ‖v‖) ⋅ o + sinh(√c ⋅ ‖v‖) / (√c ⋅ ‖v‖) ⋅ v
This is used everywhere: embedding lookup, after every linear layer, after normalization. Key design choice: all learnable parameters are stored as tangent vectors at the origin. Standard Adam updates these freely in ℝn, and exp0 maps the result onto the hyperboloid. No Riemannian optimizer is needed.
The inverse operation, mapping a hyperboloid point back to a tangent vector:
log0(x) = α / sinh(α) ⋅ xspatial, where α = arccosh(√c ⋅ x0)
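Both maps in PyTorch, as a sketch; the tangent-norm clamp of 4.0 is the max_norm that the FréchetNorm and NaN discussions below refer to:

```python
import torch

MAX_TANGENT_NORM = 4.0  # hard clamp on tangent norms (max_norm below)

def exp_map0(v, c):
    """Spatial tangent vector v in R^n -> hyperboloid point in R^(n+1)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    scale = norm.clamp(max=MAX_TANGENT_NORM) / norm  # enforce the clamp
    v, norm = v * scale, norm * scale
    x0 = torch.cosh(sqrt_c * norm) / sqrt_c
    xs = torch.sinh(sqrt_c * norm) / (sqrt_c * norm) * v
    return torch.cat([x0, xs], dim=-1)

def log_map0(x, c):
    """Inverse: hyperboloid point in R^(n+1) -> spatial tangent vector."""
    alpha = torch.acosh((c ** 0.5 * x[..., :1]).clamp_min(1.0 + 1e-7))
    return alpha / torch.sinh(alpha).clamp_min(1e-7) * x[..., 1:]
```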
Standard attention computes scores via Euclidean dot product: score = qᵀk / √d. We replace this with the Minkowski inner product:
score(q, k) = −c ⋅ ⟨q, k⟩L / √dhead
For points on the hyperboloid, ⟨q, k⟩L ≤ −1/c (with equality when q = k). So −c ⋅ ⟨q, k⟩L ≥ 1, and closer points produce larger scores, which is what we want for attention. Each of the 6 heads operates on an independent 64-dimensional hyperboloid (385 = 1 + 6 × 64).
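A sketch of the score computation per head (the [−50, 50] clamp anticipates the overflow fix described in the NaN post-mortem below):

```python
import torch

def lorentz_attention_scores(q, k, c, d_head):
    """q, k: (..., seq, d_head + 1) points on the per-head hyperboloid.
    Returns (..., seq, seq) attention logits."""
    # <q, k>_L for all pairs: one matmul with the time coordinate negated.
    inner = -(q[..., :1] @ k[..., :1].transpose(-1, -2)) \
            + q[..., 1:] @ k[..., 1:].transpose(-1, -2)
    scores = -c * inner / d_head ** 0.5
    return scores.clamp(-50.0, 50.0)  # keep softmax finite
```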
Value aggregation uses the Einstein midpoint: weighted average in ambient space followed by projection back to the hyperboloid. This is a fast, differentiable approximation of the Fréchet mean.
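A sketch, assuming the projection is a rescale back onto the level set ⟨x, x⟩L = −1/c:

```python
import torch

def einstein_midpoint(weights, v, c):
    """weights: (..., seq_q, seq_k) attention weights (rows sum to 1);
    v: (..., seq_k, d+1) hyperboloid points. A convex combination of
    hyperboloid points stays inside the light cone, so rescaling it
    back to <m, m>_L = -1/c is well defined."""
    m = weights @ v  # weighted average in ambient Minkowski space
    inner = -m[..., :1] ** 2 + (m[..., 1:] ** 2).sum(-1, keepdim=True)
    return m / ((c ** 0.5) * (-inner).clamp_min(1e-7).sqrt())
```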
Standard LayerNorm destroys manifold structure: it centers and scales in ℝn, which has no meaning on a curved surface. Our replacement:
x ⟶ log0(x) ⟶ LayerNorm(vspatial) ⟶ scale by γ/√d + β/√d ⟶ exp0
The critical detail is the 1/√d scaling. After LayerNorm, each of the 384 spatial components has variance ≈ 1, so ‖v‖ ≈ √384 ≈ 19.6. Without rescaling, exp0 would clamp this to the max norm (4.0), destroying 80% of the dynamic range every layer. Dividing by √d keeps ‖v‖ ≈ 1.
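A sketch of the full pipeline, reusing the exp_map0/log_map0 helpers above and folding γ and β in pre-divided by √d, as the pipeline above describes:

```python
import torch
import torch.nn as nn

class FrechetNorm(nn.Module):
    """log0 -> LayerNorm -> scale by gamma/sqrt(d), shift by beta/sqrt(d)
    -> exp0, so tangent norms stay O(1) instead of O(sqrt(d))."""
    def __init__(self, d):
        super().__init__()
        self.ln = nn.LayerNorm(d, elementwise_affine=False)
        self.gamma = nn.Parameter(torch.ones(d))
        self.beta = nn.Parameter(torch.zeros(d))
        self.inv_sqrt_d = d ** -0.5

    def forward(self, x, c):
        v = log_map0(x, c)  # to the tangent space at the origin
        v = (self.ln(v) * self.gamma + self.beta) * self.inv_sqrt_d
        return exp_map0(v, c)  # back onto the hyperboloid
```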
In Euclidean transformers, residuals are simple vector addition: x' = x + f(x). On the hyperboloid, this has no meaning (the sum of two hyperboloid points is not on the hyperboloid). We use:
x' = exp0(log0(x) + log0(f(x)))
Map both to tangent space at origin, add (valid since tangent space is ℝn), map back. This is simpler than parallel-transport residuals and far more stable. The parallel-transport version amplified vectors exponentially when points drifted far from origin, causing NaN within 300 steps.
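In code, the whole residual is one line on top of the maps above:

```python
def tangent_residual(x, fx, c):
    # Add in the tangent space at the origin (ordinary vector addition
    # in R^n), then map the sum back onto the hyperboloid.
    return exp_map0(log_map0(x, c) + log_map0(fx, c), c)
```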
Instead of linear classification (logitk = wkᵀz + bk), we use distance-based classification:
logitk(z) = −dL(z, pk)² + bk
Each class k has a prototype pk on the hyperboloid (stored as a tangent vector at origin, mapped via exp0). The logit is the negated squared geodesic distance: closer points get higher logits. The bias bk handles class imbalance.
The distance computation expands to −arccosh(−c ⋅ ⟨z, pk⟩L)² / c, where the Minkowski inner product is a simple matrix multiply (with sign flip on the first coordinate). This is chunked over classes (4096 at a time) for memory efficiency with 16,384 classes.
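A sketch of the chunked logit computation, reusing exp_map0 from above (the function name is ours; 4096 and 16,384 are the chunk size and vocabulary from the text):

```python
import torch

def lorentz_mlr_logits(z, proto_tangent, bias, c, chunk=4096):
    """logit_k = -d_L(z, p_k)^2 + b_k. z: (batch, d+1) hyperboloid points;
    proto_tangent: (V, d) prototypes stored as tangent vectors at the
    origin; bias: (V,). Chunked over the 16,384 classes."""
    out = []
    for s in range(0, proto_tangent.size(0), chunk):
        p = exp_map0(proto_tangent[s:s + chunk], c)  # (C, d+1) on the fly
        # -c * <z, p>_L for the whole chunk: matmul with a sign flip.
        inner = -(z[:, :1] @ p[:, :1].t()) + z[:, 1:] @ p[:, 1:].t()
        alpha = torch.acosh((-c * inner).clamp_min(1.0 + 1e-7))
        out.append(-alpha ** 2 / c + bias[s:s + chunk])  # -d^2 + b
    return torch.cat(out, dim=-1)
```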
The initial Lorentz GPT-2 implementation diverged to NaN at ~300 training steps. Three interacting numerical issues and their solutions:
1. FréchetNorm output blowup. After LayerNorm, spatial norm ≈ √384 ≈ 19.6, which gets hard-clamped by exp0 (max_norm=4.0), destroying information across layers. Each layer loses 80% of its dynamic range. Fix: scale output by 1/√d so tangent norms stay O(1).
2. Geodesic residual amplification. Parallel transport from origin to point x uses the coefficient c ⋅ ⟨y,v⟩L / (1 − c⋅⟨o,y⟩L). For points far from origin, this coefficient grows, amplifying tangent vectors exponentially across layers. Fix: tangent-space residual (add in tangent space at origin, then exp0).
3. Attention score overflow. For two points with geodesic norms ≈ 4 on the hyperboloid, −c⋅⟨q,k⟩L can exceed 500, producing attention logits that cause softmax to return NaN. Fix: clamp logits to [−50, 50] before softmax.
| Model | Val PPL | Params | Δ vs Euclidean |
|---|---|---|---|
| Euclidean GPT-2 | 48.9 | 17.0M | baseline |
| Poincaré Hyp-Embed | 1,534 | 25M | +3,040% |
| Poincaré Hyp-Output | 114.5 | 25M | +134% |
| Lorentz GPT-2 | 43.69 | 23.4M | −10.7% |
The norm-frequency correlation reveals complementary encoding strategies between Euclidean and Lorentz models.
Sign reversal. Euclidean: ρ = +0.924 (frequent = large norm). Lorentz embeddings: ρ = −0.650 (frequent = small norm, near the flat origin). This is the geometrically natural arrangement: frequent tokens in the flat center where distinctions are coarse, rare tokens in the curved periphery where the metric provides finer resolution.
Centroid saturation at max_norm. The LorentzMLR centroids cluster at geodesic norm ≈ 4.0, which corresponds to the tangent norm clamp of 4.0 in our exp_map0. Unlike the Poincaré boundary saturation (which was a failure), here the centroids are being actively pushed to the boundary by the distance-based MLR objective: placing centroids far from the origin maximizes the dynamic range of −d² logits.
| Space | 2δ/davg | Interpretation |
|---|---|---|
| Token co-occurrence graph | 0.43 | Moderate hierarchy |
| Euclidean GPT-2 | 0.22 | Learned tree structure |
| Poincaré Hyp-Output | 0.32 | Some hierarchy |
| Lorentz GPT-2 | 0.166 | Most tree-like of all |
The Lorentz model achieves δ = 0.166, well below the 0.25 threshold for "strongly tree-like" structure. This is more tree-like than the Euclidean model (0.22), which itself was more tree-like than the input data (0.43). The ground-up hyperbolic architecture discovers and encodes hierarchical structure more effectively than any other variant.
The 10.7% headline PPL drop averages over the whole vocabulary. Stratifying by training-set frequency rank shows a much sharper picture: the gain is concentrated in the long tail, and on common tokens the two models are basically tied.
| Frequency bucket | Euc PPL | Lor PPL | Gap |
|---|---|---|---|
| Top 1k (most frequent) | 18.3 | 18.9 | +3.1% |
| Ranks 1k–4k | 276.5 | 214.7 | −22.4% |
| Ranks 4k–8k | 724.5 | 409.1 | −43.5% |
| Ranks 8k–16k (rarest) | 1655.6 | 616.8 | −62.7% |
On the rarest 8,000 tokens, Lorentz cuts perplexity by more than half. This is the specific prediction hyperbolic theory makes. Exponential volume growth gives more room to separate tokens with few observations, which is exactly the tail of a Zipfian vocabulary.
Common tokens have many nearby neighbours in any reasonable embedding, and a flat linear classifier separates them fine. The Lorentz output head uses distances from class centroids (logitk = −dL(z, pk)² + bk), which is slightly less well-conditioned for tightly packed frequent classes. In the tail, the picture changes. Rare tokens live where the manifold is highly curved, and the metric scales distances so that the model can resolve finely between classes that have almost no training signal. The LSTM Hyp-Output result in our earlier paper showed the same qualitative trade-off (−43% on very-rare tokens, +63% on common), but here the effect is larger and appears in a full transformer.
We retrained both models at matched compute (batch 64, 10k steps) across five embedding dimensions. The PPL curve is non-monotonic: Lorentz wins at very low n, loses in the middle, and wins again at full n.
| Spatial dim n | Euc PPL | Lor PPL | Δ | Winner |
|---|---|---|---|---|
| 12 | 337.2 | 268.5 | −20.4% | Lorentz |
| 32 | 162.9 | 148.5 | −8.8% | Lorentz |
| 64 | 94.8 | 103.1 | +8.8% | Euclidean |
| 128 | 58.3 | 70.5 | +20.9% | Euclidean |
| 192 | 45.5 | 58.5 | +28.5% | Euclidean |
| 384 (full) | 48.3 | 43.3 | −10.4% | Lorentz |
At n = 12, there simply is not enough Euclidean room to place 16k tokens, and hyperbolic volume growth wins by 20%. Between n = 64 and n = 192, the Euclidean transformer has enough capacity, and its tied input/output embedding (which Lorentz cannot use, because the Lorentz head is distance-based) saves 6.5M parameters at no cost in expressivity. In that regime, weight tying is a stronger inductive bias than curvature. At full n = 384, Lorentz attention and the rare-token allocation outweigh the loss from untied weights, and the sign flips again.
This is the honest story. Hyperbolic geometry is not universally better at the token level. It helps when the model is capacity-constrained (low n) or when the vocabulary tail starts to matter (full n). The simple "hyperbolic beats Euclidean" claim that appears in some prior work does not survive careful dimension sweeps at matched compute.
We retrained the full-dim comparison with three independent seeds (42, 137, 256). The gap is stable across seeds, well above seed-to-seed variation.
| Model | Seed 42 | Seed 137 | Seed 256 | Mean ± std |
|---|---|---|---|---|
| Euclidean (n=384) | 48.48 | 48.05 | 48.30 | 48.28 ± 0.18 |
| Lorentz (n=385) | 43.56 | 43.00 | 43.28* | 43.28 ± 0.30 |
*seed 256 still converging at submission; current checkpoint reported.
| Variant | Val PPL | Note |
|---|---|---|
| Lorentz (learnable c) | 43.69 | main result, c converges to 0.77 |
| Lorentz (fixed c = 1.0) | 44.42 | learnable c contributes 0.73 PPL |
| Lorentz (3 layers, half depth) | 46.78 | still 13.2% below Euclidean 3L |
| Euclidean (3 layers, half depth) | 53.91 | matched control |
At half depth, Lorentz beats Euclidean by a larger relative margin (13.2%) than at full depth (10.4%). Lorentz is more parameter-efficient per layer. Learnable curvature gives a small but real gain (0.73 PPL); fixing c = 1 does not break training, so the curvature parameter is a polish rather than a necessity.
[Interactive scatter of token embeddings: color = log(frequency); hovering shows token text, frequency, and geodesic norm.]
Explore the full 3D PCA projections of all models (Euclidean, Lorentz, Poincaré) on a dedicated page to avoid slowing down this page.
→ Open 3D Interactive Explorer (rotatable 3D PCA of Euclidean, Lorentz embeddings, Lorentz centroids, and Poincaré Hyp-Output centroids)
Curvature c controls the "strength" of hyperbolicity. In the limit c → 0, the space becomes flat (ℝn). As c increases, volume grows faster with radius and hierarchical structure is encoded more aggressively.
Curvature must stay positive. We store an unconstrained parameter log_c and recover curvature via:
c = clamp(exp(log_c), 0.1, 10.0)
log_c is initialized to ln(1.0) = 0. It receives gradients through the entire forward pass because every exp0, log0, attention score, and distance computation depends on c. The gradient ∂L/∂c tells the model whether to increase or decrease curvature to better fit the data.
Curvature uses a 100× lower learning rate (3×10⁻⁶ vs 3×10⁻⁴ for other parameters) to prevent oscillation: since c affects every operation in every layer, small changes have outsized effects.
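A sketch of the parameterization and the split learning rates (the exp follows the formula above; as noted next, our actual run appears to have used a softplus):

```python
import torch
import torch.nn as nn

class Curvature(nn.Module):
    """Unconstrained log_c mapped to a positive, bounded curvature."""
    def __init__(self):
        super().__init__()
        self.log_c = nn.Parameter(torch.zeros(()))  # ln(1.0) = 0

    def forward(self):
        return torch.exp(self.log_c).clamp(0.1, 10.0)

# Two parameter groups: curvature trains 100x slower than everything else.
curvature = Curvature()
model_params = [nn.Parameter(torch.randn(8, 8))]  # placeholder weights
optimizer = torch.optim.Adam([
    {"params": model_params, "lr": 3e-4},
    {"params": curvature.parameters(), "lr": 3e-6},
])
```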
We checkpointed every 5000 steps and read c directly from the stored log_c parameter. The trajectory starts near c ≈ 0.69 rather than 1.00 because our run mapped log_c through a softplus rather than the exp shown above (softplus(0) = ln 2 ≈ 0.69). From there the curvature drifts slowly upward and settles near 0.77 by step 100k:
| Step | 5k | 10k | 25k | 50k | 75k | 100k |
|---|---|---|---|---|---|---|
| c | 0.691 | 0.689 | 0.705 | 0.731 | 0.753 | 0.773 |
The trajectory is smooth and monotonic after a brief early dip. c does not blow up, does not collapse to the 0.1 floor, and does not hit the 10.0 ceiling. It picks a value below the c = 1 default. BPE token vocabularies encode real hierarchy (frequency, morphology), but that hierarchy is shallower than the semantic trees Nickel and Kiela used for WordNet (≈ 82k nodes, strict hypernym relations). A less aggressive curvature fits a shallower tree.
The fixed-curvature ablation (c = 1.0, no learning) reaches 44.42 PPL vs 43.69 for learnable c. Learning the curvature is worth 0.73 PPL, which is real but small. The reason it is small is now visible from the trajectory: the learned value is already close to the fixed value.
Hyperbolic geometry at the token level is a real but narrow win. The specific claims we think this work supports:
1. A ground-up Lorentz transformer can outperform a matched Euclidean GPT-2 on language modeling (43.69 vs 48.9 PPL, −10.7%), where bolt-on Poincaré hybrids fail badly.
2. The gain is concentrated in the vocabulary tail: −62.7% PPL on the rarest 8k tokens, near parity on the top 1k.
3. The Lorentz model learns the most tree-like representation we measured (Gromov δ = 0.166 vs 0.22 for Euclidean GPT-2).
4. The advantage is regime-dependent: Lorentz wins when capacity-constrained (n = 12–32) and at full width (n = 384), and loses to the weight-tied Euclidean model in between.
Things we do not claim: state-of-the-art on WikiText-103 (there are stronger Euclidean baselines at this parameter count), universal hyperbolic superiority (the dim sweep rules this out), or downstream task evaluation (we did not run any).
For the full paper draft, see paper_v2.tex. For the original output-layer study on LSTM, see the main site. For the 3D interactive PCA, see the 3D explorer.