A controlled placement study of hyperbolic geometry in subword token representations across LSTM and GPT-2 language models.
International Institute of Information Technology, Hyderabad · Introduction to NLP, Spring 2026
Abstract
Modern NLP pipelines rely on subword tokenization (BPE, WordPiece), whose token vocabularies
implicitly encode hierarchical structure: high-frequency tokens act as generic building blocks
while rare tokens encode increasingly specific, compositional information. Hyperbolic space,
whose volume grows exponentially with radius, is theoretically well-suited for such hierarchies.
We ask: does this geometric advantage carry over to token-level embeddings learned during language modeling?
Our answer is conditional on placement. We run a controlled ablation across two architectures
(a single-layer LSTM on WikiText-2 and a 25M-parameter GPT-2 on WikiText-103), comparing
embedding-layer vs. output-layer hyperbolic geometry. At both scales, hyperbolic geometry at
the embedding layer fails due to geometric degeneracy (boundary saturation in LSTM,
origin collapse in GPT-2). At the output layer it succeeds, reducing LSTM perplexity
by 2.2% and producing a rich centroid hierarchy for GPT-2. Strikingly, Euclidean GPT-2 embeddings
spontaneously develop tree-like structure (Gromov δ = 0.22, norm-frequency Spearman ρ = +0.924),
suggesting transformers implicitly learn hyperbolic-compatible representations without
explicit geometric constraints.
Research Questions
We design our study around four concrete questions, each probing a different aspect of geometry in language modeling.
1. Comparing identical models with only the geometry changed at the embedding or output layer.
2. Analyzing per-frequency-bin perplexity: does hyperbolic space improve very rare token prediction?
3. Our core placement ablation: which architectural position benefits from curved space?
4. Measuring Gromov δ-hyperbolicity and norm-frequency correlation in trained Euclidean embeddings.
Methodology
We keep the sequence model architecture identical across all variants. Only the geometry at the embedding or output layer changes. This allows direct attribution of any performance difference to geometric choice.
- Euclidean baseline: standard nn.Embedding + nn.Linear output, with weight tying; optimized with AdamW.
- Hyp-Embed: Poincaré ball embeddings via exp₀ projection; RiemannianAdam for the embedding, AdamW for the rest; logmap back to Euclidean space before the sequence model (see the first sketch after this list).
- Hyp-Output: Euclidean sequence model with HyperbolicMLR at the output: geodesic hyperplane classifiers with a per-class centroid pₖ and direction aₖ (see the second sketch after this list).
- LSTM setup: single-layer LSTM, hidden dim 256, 8k-token BPE vocabulary; trained for 50k steps with gradient clipping and dropout 0.3.
- GPT-2 setup: nanoGPT-style 25M-parameter model (6 layers, 6 heads, dim 384) with pre-norm, GELU FFN, and causal attention; 16k-token BPE vocabulary.
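A minimal sketch of the Hyp-Embed variant, assuming the geoopt library for the Poincaré ball, the exp₀/log₀ maps, and RiemannianAdam; the class name, init scale, and learning rate are illustrative, not the exact training configuration.

```python
import torch
import torch.nn as nn
import geoopt

class HypEmbedding(nn.Module):
    """Token embeddings stored on the Poincare ball, read out via logmap."""
    def __init__(self, vocab_size: int, dim: int, c: float = 1.0):
        super().__init__()
        self.ball = geoopt.PoincareBall(c=c)
        # exp0 projects a Euclidean init onto the ball; ManifoldParameter
        # tells RiemannianAdam to take Riemannian (not Euclidean) steps.
        init = self.ball.expmap0(torch.randn(vocab_size, dim) * 1e-2)
        self.weight = geoopt.ManifoldParameter(init, manifold=self.ball)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        points = self.weight[token_ids]   # points inside the unit ball
        # logmap to the tangent space at the origin: the Euclidean
        # sequence model (LSTM or GPT-2 block stack) consumes these.
        return self.ball.logmap0(points)

emb = HypEmbedding(vocab_size=8000, dim=256)
# Split optimization as described above: RiemannianAdam for the ball
# parameter, ordinary AdamW for the rest of the network.
riem_opt = geoopt.optim.RiemannianAdam([emb.weight], lr=1e-3)
```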
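And a sketch of the HyperbolicMLR output layer, following the geodesic-hyperplane logits of Ganea et al. (2018). Mapping the Euclidean hidden state onto the ball with exp₀ before classification is our assumption about the wiring; curvature c = 1 and the init scales are likewise illustrative.

```python
import torch
import torch.nn as nn
import geoopt

class HyperbolicMLR(nn.Module):
    """Softmax over geodesic hyperplanes: per-class centroid p_k, direction a_k."""
    def __init__(self, num_classes: int, dim: int, c: float = 1.0):
        super().__init__()
        self.c, self.ball = c, geoopt.PoincareBall(c=c)
        self.p = geoopt.ManifoldParameter(                        # centroids on the ball
            self.ball.expmap0(torch.randn(num_classes, dim) * 1e-2),
            manifold=self.ball)
        self.a = nn.Parameter(torch.randn(num_classes, dim) * 1e-2)  # tangent directions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = self.ball.expmap0(h)                                  # (B, D) onto the ball
        sqrt_c = self.c ** 0.5
        # Mobius "difference" -p_k (+) x for every class: (B, K, D)
        diff = self.ball.mobius_add(-self.p.unsqueeze(0), x.unsqueeze(1))
        a_norm = self.a.norm(dim=-1).clamp_min(1e-6)              # (K,)
        lam = 2.0 / (1.0 - self.c * self.p.pow(2).sum(-1)).clamp_min(1e-6)
        dot = (diff * self.a.unsqueeze(0)).sum(-1)                # (B, K)
        denom = (1.0 - self.c * diff.pow(2).sum(-1)).clamp_min(1e-6)
        # logit_k = (lam_k ||a_k|| / sqrt(c)) * asinh(2 sqrt(c) <diff, a_k> / (denom ||a_k||))
        return (lam * a_norm / sqrt_c) * torch.asinh(
            2.0 * sqrt_c * dot / (denom * a_norm))                # feed to CrossEntropyLoss
```

The centroids pₖ still live on the ball (here a ManifoldParameter, so RiemannianAdam would handle just that tensor); the sequence model itself stays fully Euclidean.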
Results: LSTM
At the LSTM scale, the placement verdict is clear: hyperbolic geometry at the embedding layer increases perplexity by 53%, while at the output layer it improves it by 2.2%, with the largest gains on the rarest tokens.
| Model | Val PPL | Test PPL | Δ PPL | PPL (very common) | PPL (rare) | PPL (very rare) |
|---|---|---|---|---|---|---|
| LSTM Euclidean | 120.1 | 113.4 | — | 88.3 | 310.2 | 748.5 |
| LSTM Hyp-Embed | 183.8 | 178.2 | +53% | 142.1 | 412.6 | 956.3 |
| LSTM Hyp-Output | 117.4 | 111.1 | −2.2% | 144.2 | 278.4 | 425.7 |
Figure 1. Poincaré disk visualization. Left: Hyp-Embed embeddings saturate near the boundary (mean ‖x‖ = 0.92), destroying radial frequency structure. Right: Hyp-Output centroids self-organize with frequent tokens near the origin and rare tokens near the boundary (Spearman ρ = +0.82).
Figure 2. Embedding norm vs. log token frequency rank. Hyp-Output centroids show a strong positive correlation (ρ = +0.82): frequent tokens near the origin, rare tokens near the boundary. Euclidean and Hyp-Embed show no such structure.
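The correlation behind Figure 2 takes only a few lines to compute; `weights` and `token_counts` are hypothetical stand-ins for a trained embedding (or centroid) matrix and corpus token counts.

```python
import numpy as np
from scipy.stats import spearmanr

def norm_frequency_rho(weights: np.ndarray, token_counts: np.ndarray) -> float:
    norms = np.linalg.norm(weights, axis=1)
    # Rank 0 = most frequent token, so rho > 0 means frequent tokens
    # sit near the origin and rare tokens drift toward the boundary.
    freq_rank = np.argsort(np.argsort(-token_counts))
    rho, _ = spearmanr(norms, freq_rank)
    return float(rho)
```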
Figure 3. Per-frequency-bin perplexity. Hyp-Output trades common-token accuracy (+63% PPL on very common) for rare-token gains (−43% PPL on very rare). The crossover occurs in the rare-token regime.
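Per-bin perplexity as in Figure 3 can be sketched as follows; `nll` (per-token negative log-likelihoods from evaluation), `targets`, and the bin edges are illustrative assumptions, not the exact boundaries used here.

```python
import numpy as np

def per_bin_ppl(nll: np.ndarray, targets: np.ndarray,
                token_counts: np.ndarray, edges=(10, 100, 1000, 10_000)):
    counts = token_counts[targets]     # corpus frequency of each target token
    bins = np.digitize(counts, edges)  # 0 = very rare ... len(edges) = very common
    # Perplexity per bin = exp(mean NLL of the tokens that fall in it).
    return {int(b): float(np.exp(nll[bins == b].mean())) for b in np.unique(bins)}
```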
Why does the embedding-layer placement fail? In dimension 256, volume concentration pushes all embeddings toward the ball boundary (mean norm 0.92). At ‖x‖ ≈ 1, hyperbolic distance depends almost entirely on angle; the space becomes effectively spherical and loses its hierarchical resolution.
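A quick numeric check of the saturation claim (for c = 1, exp₀(v) has norm tanh(‖v‖); the 0.1 init scale is an illustrative assumption). Once every norm pins near 0.92, distances are effectively a function of angle alone, and escape is slow:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=(10_000, 256))  # tangent vectors in dim 256
norms = np.linalg.norm(v, axis=1)              # concentrate near 0.1 * sqrt(256) = 1.6
print(np.tanh(norms).mean())                   # ~0.92: exp0 lands everything near the rim

# The conformal factor lambda = 2 / (1 - r^2) blows up near the rim
# (Riemannian gradient steps scale like 1 / lambda^2 there), so
# saturated points barely move back inward during training.
for r in (0.0, 0.5, 0.92, 0.99):
    print(r, 2 / (1 - r**2))                   # 2.0, ~2.7, ~13, ~101
```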
Results: GPT-2
At 384 dimensions, hyperbolic embeddings collapse to the origin (mean norm 0.0003) rather than saturating at the boundary. The output-layer placement still develops a rich centroid hierarchy. Most surprisingly, the Euclidean GPT-2 baseline spontaneously develops tree-like structure without any geometric constraint.
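The collapse is easy to see from the exp₀ formula: for c = 1, exp₀(v) = tanh(‖v‖) · v/‖v‖, and at ‖v‖ ≈ 3×10⁻⁴ the radial scaling is 1 to within ~10⁻⁸, so the "hyperbolic" embedding is numerically indistinguishable from a Euclidean one. A tiny check, with the dimension and norm taken from the observed collapse:

```python
import numpy as np

v = np.full(384, 3e-4 / np.sqrt(384))  # dim-384 vector with the observed norm ~0.0003
n = np.linalg.norm(v)
print(np.tanh(n) / n)                  # 0.99999997...: exp0 is effectively the identity
```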
Figure 4. GPT-2 embedding geometry. (a) Euclidean: dense PCA cloud, no radial frequency structure, but strong tree-like geometry measured externally (PPL = 48.9). (b) Hyp-Embed: origin collapse (mean norm 0.0003); exp₀ becomes approximately the identity and all geometric advantage is lost (PPL = 1534). (c) Hyp-Output centroids: radial hierarchy with frequent tokens near the center, rare near the boundary (PPL = 114.5). Color = log token frequency.
| Model | Val PPL | Norm (mean) | ρ (norm-freq) | Geometry verdict |
|---|---|---|---|---|
| GPT-2 Euclidean | 48.9 | — | +0.924 | Spontaneous tree-like structure (δ = 0.22) |
| GPT-2 Hyp-Embed | 1534 | 0.0003 | ≈ 0 | Origin collapse — geometry destroyed |
| GPT-2 Hyp-Output | 114.5 | — | ρ_dir = +0.96 | Rich centroid hierarchy develops |
Figure 5. GPT-2 norm vs. frequency rank. Hyp-Output direction vectors show ρ_dir = +0.96.
Figure 6. Hyp-Output centroid norms by frequency bin. Frequent tokens cluster near origin, rare tokens pushed outward.
Why does the transformer develop hyperbolic structure on its own? The Euclidean GPT-2 baseline achieves Gromov δ = 0.22, more tree-like than the input token co-occurrence graph (δ = 0.43). By composing hierarchical linguistic structure across many layers, the attention stack implicitly learns to place tokens in configurations compatible with hyperbolic geometry. This is consistent with Park et al. (ICLR 2025), who show that hierarchical concepts map to geometric structures in LLM representations.
Analysis
The Gromov δ-hyperbolicity (four-point condition) quantifies how closely a metric space resembles a tree. Lower δ means more tree-like. We measure this across all models and the raw token co-occurrence graph.
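A sample-based estimator for the normalized 2δ/d_avg reported in the table below; exact δ scans all quadruples, so we subsample, and for points on the Poincaré ball the Euclidean `pdist` should be swapped for hyperbolic distances. The sample count is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gromov_delta(points: np.ndarray, n_samples: int = 50_000, seed: int = 0) -> float:
    D = squareform(pdist(points))      # pairwise distances (use Poincare distances
    n = len(points)                    # instead when the points live on the ball)
    rng = np.random.default_rng(seed)
    w, x, y, z = (rng.integers(0, n, n_samples) for _ in range(4))
    # Gromov product (a . b)_base = (d(base,a) + d(base,b) - d(a,b)) / 2
    gp = lambda a, b, base: 0.5 * (D[base, a] + D[base, b] - D[a, b])
    # Four-point condition: (x . z)_w >= min((x . y)_w, (y . z)_w) - delta
    delta = np.maximum(
        np.minimum(gp(x, y, w), gp(y, z, w)) - gp(x, z, w), 0.0).max()
    return float(2 * delta / D[np.triu_indices(n, k=1)].mean())  # 2*delta / d_avg
```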
| Space | 2δ / d_avg | Interpretation |
|---|---|---|
| Token co-occurrence graph | 0.43 | Moderate hierarchy in raw data |
| LSTM Hyp-Embed | 0.07 | Most tree-like — but useless (saturation) |
| LSTM Euclidean | 0.38 | Near input-level hierarchy |
| LSTM Hyp-Output | 0.27 | Improved over Euclidean |
| GPT-2 Euclidean | 0.22 | More tree-like than the input data — spontaneous |
| GPT-2 Hyp-Output | 0.32 | Hierarchy present, not stronger than Euclidean |
Figure 7. Normalized Gromov δ across spaces. The LSTM Hyp-Embed achieves the lowest δ but worst perplexity. Tree-likeness is necessary but not sufficient for useful representations.
Key Findings
01
Geometric degeneracy at both scales destroys the radial hierarchy that hyperbolic space is supposed to provide. Boundary saturation at dim 256 (LSTM, mean norm 0.92) and origin collapse at dim 384 (GPT-2, mean norm 0.0003) both leave the exp₀ projection as an information bottleneck with no usable curvature signal.
02
HyperbolicMLR at the output layer lets geometry enter only at classification, where cross-entropy gradients flow directly to the centroids. No Riemannian optimizer is needed for the sequence model itself. LSTM perplexity improves by 2.2%; GPT-2 centroids develop strong radial hierarchy (ρ_dir = +0.96).
03
In the LSTM, Hyp-Output reduces very rare token PPL by 43% while increasing very common token PPL by 63%. The geodesic hyperplane classifier naturally allocates more representational capacity to low-frequency tokens, exactly where Euclidean space is most deficient.
04
Euclidean GPT-2 achieves Gromov δ = 0.22 and norm-frequency correlation ρ = +0.924, more tree-like than the input co-occurrence graph itself (δ = 0.43). The attention mechanism implicitly learns to place tokens in hyperbolic-compatible configurations without any geometric constraint.
→ Apply hyperbolic geometry at the output layer. Expect the largest gains on rare tokens. For transformers, measure Gromov δ first; the architecture may already capture the structure you seek.