A controlled placement study of hyperbolic geometry in subword token representations across LSTM and GPT-2 language models.
International Institute of Information Technology, Hyderabad · Introduction to NLP, Spring 2026
Abstract
Modern NLP pipelines rely on subword tokenization (BPE, WordPiece), whose token vocabularies
implicitly encode hierarchical structure: high-frequency tokens act as generic building blocks
while rare tokens encode increasingly specific, compositional information. Hyperbolic space,
whose volume grows exponentially with radius, is theoretically well-suited for such hierarchies.
We ask: does this geometric advantage carry over to token-level embeddings learned during language modeling?
Our answer is conditional on placement. We run a controlled ablation across two architectures
(a single-layer LSTM on WikiText-2 and a 25M-parameter GPT-2 on WikiText-103), comparing
embedding-layer vs. output-layer hyperbolic geometry. At both scales, hyperbolic geometry at
the embedding layer fails due to geometric degeneracy (boundary saturation in LSTM,
origin collapse in GPT-2). At the output layer it succeeds, reducing LSTM perplexity
by 2.2% and producing a rich centroid hierarchy for GPT-2. Strikingly, Euclidean GPT-2 embeddings
spontaneously develop tree-like structure (Gromov δ = 0.22, norm-frequency Spearman ρ = +0.924),
suggesting transformers implicitly learn hyperbolic-compatible representations without
explicit geometric constraints.
Research Questions
We design our study around four concrete questions, each probing a different aspect of geometry in language modeling.
1. Comparing identical models with only the geometry changed at the embedding or output layer.
2. Analyzing per-frequency-bin perplexity: does hyperbolic space improve very rare token prediction?
3. Our core placement ablation: which architectural position benefits from curved space?
4. Measuring Gromov δ-hyperbolicity and norm-frequency correlation in trained Euclidean embeddings.
Methodology
We keep the sequence model architecture identical across all variants. Only the geometry at the embedding or output layer changes. This allows direct attribution of any performance difference to geometric choice.
- Euclidean baseline: standard nn.Embedding + nn.Linear output, with weight tying; optimized with AdamW.
- Hyp-Embed: Poincaré ball embeddings via exp₀ projection; RiemannianAdam for the embedding, AdamW for the rest; logmap back to Euclidean space before the sequence model (see the first sketch after this list).
- Hyp-Output: Euclidean sequence model with HyperbolicMLR at the output: geodesic hyperplane classifiers with a per-class centroid pₖ and direction aₖ (see the second sketch after this list).
- LSTM setup: single-layer LSTM, hidden dim 256, 8k-token BPE vocabulary; trained for 50k steps with gradient clipping and dropout 0.3.
- GPT-2 setup: nanoGPT-style 25M-parameter model (6 layers, 6 heads, dim 384) with pre-norm, GELU FFN, and causal attention; 16k-token BPE vocabulary.
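A minimal sketch of the Hyp-Embed variant, assuming the geoopt library for the Poincaré ball, the exp₀/log₀ maps, and RiemannianAdam; the class name, init scale, and learning rate are illustrative, not the exact training configuration.

```python
import torch
import torch.nn as nn
import geoopt

class HypEmbedding(nn.Module):
    """Token embeddings stored on the Poincare ball, read out via logmap."""
    def __init__(self, vocab_size: int, dim: int, c: float = 1.0):
        super().__init__()
        self.ball = geoopt.PoincareBall(c=c)
        # exp0 projects a Euclidean init onto the ball; ManifoldParameter
        # tells RiemannianAdam to take Riemannian (not Euclidean) steps.
        init = self.ball.expmap0(torch.randn(vocab_size, dim) * 1e-2)
        self.weight = geoopt.ManifoldParameter(init, manifold=self.ball)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        points = self.weight[token_ids]   # points inside the unit ball
        # logmap to the tangent space at the origin: the Euclidean
        # sequence model (LSTM or GPT-2 block stack) consumes these.
        return self.ball.logmap0(points)

emb = HypEmbedding(vocab_size=8000, dim=256)
# Split optimization as described above: RiemannianAdam for the ball
# parameter, ordinary AdamW for the rest of the network.
riem_opt = geoopt.optim.RiemannianAdam([emb.weight], lr=1e-3)
```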
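And a sketch of the HyperbolicMLR output layer, following the geodesic-hyperplane logits of Ganea et al. (2018). Mapping the Euclidean hidden state onto the ball with exp₀ before classification is our assumption about the wiring; curvature c = 1 and the init scales are likewise illustrative.

```python
import torch
import torch.nn as nn
import geoopt

class HyperbolicMLR(nn.Module):
    """Softmax over geodesic hyperplanes: per-class centroid p_k, direction a_k."""
    def __init__(self, num_classes: int, dim: int, c: float = 1.0):
        super().__init__()
        self.c, self.ball = c, geoopt.PoincareBall(c=c)
        self.p = geoopt.ManifoldParameter(                        # centroids on the ball
            self.ball.expmap0(torch.randn(num_classes, dim) * 1e-2),
            manifold=self.ball)
        self.a = nn.Parameter(torch.randn(num_classes, dim) * 1e-2)  # tangent directions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = self.ball.expmap0(h)                                  # (B, D) onto the ball
        sqrt_c = self.c ** 0.5
        # Mobius "difference" -p_k (+) x for every class: (B, K, D)
        diff = self.ball.mobius_add(-self.p.unsqueeze(0), x.unsqueeze(1))
        a_norm = self.a.norm(dim=-1).clamp_min(1e-6)              # (K,)
        lam = 2.0 / (1.0 - self.c * self.p.pow(2).sum(-1)).clamp_min(1e-6)
        dot = (diff * self.a.unsqueeze(0)).sum(-1)                # (B, K)
        denom = (1.0 - self.c * diff.pow(2).sum(-1)).clamp_min(1e-6)
        # logit_k = (lam_k ||a_k|| / sqrt(c)) * asinh(2 sqrt(c) <diff, a_k> / (denom ||a_k||))
        return (lam * a_norm / sqrt_c) * torch.asinh(
            2.0 * sqrt_c * dot / (denom * a_norm))                # feed to CrossEntropyLoss
```

The centroids pₖ still live on the ball (here a ManifoldParameter, so RiemannianAdam would handle just that tensor); the sequence model itself stays fully Euclidean.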
Results: LSTM
At the LSTM scale, the placement verdict is clear: hyperbolic geometry at the embedding layer increases perplexity by 53%, while at the output layer it improves it by 2.2%, with the largest gains on the rarest tokens.
| Model | Val PPL | Test PPL | Δ PPL | PPL (very common) | PPL (rare) | PPL (very rare) |
|---|---|---|---|---|---|---|
| LSTM Euclidean | 120.1 | 113.4 | — | 88.3 | 310.2 | 748.5 |
| LSTM Hyp-Embed | 183.8 | 178.2 | +53% | 142.1 | 412.6 | 956.3 |
| LSTM Hyp-Output | 117.4 | 111.1 | −2.2% | 144.2 | 278.4 | 425.7 |
Figure 1. Poincaré disk visualization. Left: Hyp-Embed embeddings saturate near the boundary (mean ‖x‖ = 0.92), destroying radial frequency structure. Right: Hyp-Output centroids self-organize with frequent tokens near the origin and rare tokens near the boundary (Spearman ρ = +0.82).
Figure 2. Embedding norm vs. log token frequency rank. Hyp-Output centroids show a strong positive correlation (ρ = +0.82): frequent tokens near the origin, rare tokens near the boundary. Euclidean and Hyp-Embed show no such structure.
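The correlation behind Figure 2 takes only a few lines to compute; `weights` and `token_counts` are hypothetical stand-ins for a trained embedding (or centroid) matrix and corpus token counts.

```python
import numpy as np
from scipy.stats import spearmanr

def norm_frequency_rho(weights: np.ndarray, token_counts: np.ndarray) -> float:
    norms = np.linalg.norm(weights, axis=1)
    # Rank 0 = most frequent token, so rho > 0 means frequent tokens
    # sit near the origin and rare tokens drift toward the boundary.
    freq_rank = np.argsort(np.argsort(-token_counts))
    rho, _ = spearmanr(norms, freq_rank)
    return float(rho)
```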
Figure 3. Per-frequency-bin perplexity. Hyp-Output trades common-token accuracy (+63% PPL on very common) for rare-token gains (−43% PPL on very rare). The crossover occurs in the rare-token regime.
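Per-bin perplexity as in Figure 3 can be sketched as follows; `nll` (per-token negative log-likelihoods from evaluation), `targets`, and the bin edges are illustrative assumptions, not the exact boundaries used here.

```python
import numpy as np

def per_bin_ppl(nll: np.ndarray, targets: np.ndarray,
                token_counts: np.ndarray, edges=(10, 100, 1000, 10_000)):
    counts = token_counts[targets]     # corpus frequency of each target token
    bins = np.digitize(counts, edges)  # 0 = very rare ... len(edges) = very common
    # Perplexity per bin = exp(mean NLL of the tokens that fall in it).
    return {int(b): float(np.exp(nll[bins == b].mean())) for b in np.unique(bins)}
```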
Why does the embedding-layer placement fail? In dimension 256, volume concentration pushes all embeddings toward the ball boundary (mean norm 0.92). At ‖x‖ ≈ 1, hyperbolic distance depends almost entirely on angle; the space becomes effectively spherical and loses its hierarchical resolution.
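A quick numeric check of the saturation claim (for c = 1, exp₀(v) has norm tanh(‖v‖); the 0.1 init scale is an illustrative assumption). Once every norm pins near 0.92, distances are effectively a function of angle alone, and escape is slow:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(scale=0.1, size=(10_000, 256))  # tangent vectors in dim 256
norms = np.linalg.norm(v, axis=1)              # concentrate near 0.1 * sqrt(256) = 1.6
print(np.tanh(norms).mean())                   # ~0.92: exp0 lands everything near the rim

# The conformal factor lambda = 2 / (1 - r^2) blows up near the rim
# (Riemannian gradient steps scale like 1 / lambda^2 there), so
# saturated points barely move back inward during training.
for r in (0.0, 0.5, 0.92, 0.99):
    print(r, 2 / (1 - r**2))                   # 2.0, ~2.7, ~13, ~101
```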
Results: GPT-2
At 384 dimensions, hyperbolic embeddings collapse to the origin (mean norm 0.0003) rather than saturating at the boundary. The output-layer placement still develops a rich centroid hierarchy. Most surprisingly, the Euclidean GPT-2 baseline spontaneously develops tree-like structure without any geometric constraint.
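The collapse is easy to see from the exp₀ formula: for c = 1, exp₀(v) = tanh(‖v‖) · v/‖v‖, and at ‖v‖ ≈ 3×10⁻⁴ the radial scaling is 1 to within ~10⁻⁸, so the "hyperbolic" embedding is numerically indistinguishable from a Euclidean one. A tiny check, with the dimension and norm taken from the observed collapse:

```python
import numpy as np

v = np.full(384, 3e-4 / np.sqrt(384))  # dim-384 vector with the observed norm ~0.0003
n = np.linalg.norm(v)
print(np.tanh(n) / n)                  # 0.99999997...: exp0 is effectively the identity
```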
Figure 4. GPT-2 embedding geometry. (a) Euclidean: dense PCA cloud, no radial frequency structure, but strong tree-like geometry measured externally (PPL = 48.9). (b) Hyp-Embed: origin collapse (mean norm 0.0003); exp₀ becomes approximately the identity and all geometric advantage is lost (PPL = 1534). (c) Hyp-Output centroids: radial hierarchy with frequent tokens near the center, rare near the boundary (PPL = 114.5). Color = log token frequency.
| Model | Val PPL | Norm (mean) | ρ (norm-freq) | Geometry verdict |
|---|---|---|---|---|
| GPT-2 Euclidean | 48.9 | — | +0.924 | Spontaneous tree-like structure (δ = 0.22) |
| GPT-2 Hyp-Embed | 1534 | 0.0003 | ≈ 0 | Origin collapse — geometry destroyed |
| GPT-2 Hyp-Output | 114.5 | — | ρ_dir = +0.96 | Rich centroid hierarchy develops |
Figure 5. GPT-2 norm vs. frequency rank. Hyp-Output direction vectors show ρ_dir = +0.96.
Figure 6. Hyp-Output centroid norms by frequency bin. Frequent tokens cluster near origin, rare tokens pushed outward.
Why does the transformer develop hyperbolic structure on its own? The Euclidean GPT-2 baseline achieves Gromov δ = 0.22, more tree-like than the input token co-occurrence graph (δ = 0.43). By composing hierarchical linguistic structure across many layers, the attention stack implicitly learns to place tokens in configurations compatible with hyperbolic geometry. This is consistent with Park et al. (ICLR 2025), who show that hierarchical concepts map to geometric structures in LLM representations.
Analysis
The Gromov δ-hyperbolicity (four-point condition) quantifies how closely a metric space resembles a tree. Lower δ means more tree-like. We measure this across all models and the raw token co-occurrence graph.
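A sample-based estimator for the normalized 2δ/d_avg reported in the table below; exact δ scans all quadruples, so we subsample, and for points on the Poincaré ball the Euclidean `pdist` should be swapped for hyperbolic distances. The sample count is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gromov_delta(points: np.ndarray, n_samples: int = 50_000, seed: int = 0) -> float:
    D = squareform(pdist(points))      # pairwise distances (use Poincare distances
    n = len(points)                    # instead when the points live on the ball)
    rng = np.random.default_rng(seed)
    w, x, y, z = (rng.integers(0, n, n_samples) for _ in range(4))
    # Gromov product (a . b)_base = (d(base,a) + d(base,b) - d(a,b)) / 2
    gp = lambda a, b, base: 0.5 * (D[base, a] + D[base, b] - D[a, b])
    # Four-point condition: (x . z)_w >= min((x . y)_w, (y . z)_w) - delta
    delta = np.maximum(
        np.minimum(gp(x, y, w), gp(y, z, w)) - gp(x, z, w), 0.0).max()
    return float(2 * delta / D[np.triu_indices(n, k=1)].mean())  # 2*delta / d_avg
```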
| Space | 2δ / d_avg | Interpretation |
|---|---|---|
| Token co-occurrence graph | 0.43 | Moderate hierarchy in raw data |
| LSTM Hyp-Embed | 0.07 | Most tree-like — but useless (saturation) |
| LSTM Euclidean | 0.38 | Near input-level hierarchy |
| LSTM Hyp-Output | 0.27 | Improved over Euclidean |
| GPT-2 Euclidean | 0.22 | More tree-like than the input data — spontaneous |
| GPT-2 Hyp-Output | 0.32 | Hierarchy present, not stronger than Euclidean |
Figure 7. Normalized Gromov δ across spaces. The LSTM Hyp-Embed achieves the lowest δ but worst perplexity. Tree-likeness is necessary but not sufficient for useful representations.
Key Findings
01
Geometric degeneracy at both scales destroys the radial hierarchy that hyperbolic space is supposed to provide. Boundary saturation at dim 256 (LSTM, mean norm 0.92) and origin collapse at dim 384 (GPT-2, mean norm 0.0003) both leave the exp₀ projection as an information bottleneck with no usable curvature signal.
02
HyperbolicMLR at the output layer lets geometry enter only at classification, where cross-entropy gradients flow directly to the centroids. No Riemannian optimizer is needed for the sequence model itself. LSTM perplexity improves by 2.2%; GPT-2 centroids develop strong radial hierarchy (ρ_dir = +0.96).
03
In the LSTM, Hyp-Output reduces very rare token PPL by 43% while increasing very common token PPL by 63%. The geodesic hyperplane classifier naturally allocates more representational capacity to low-frequency tokens, exactly where Euclidean space is most deficient.
04
Euclidean GPT-2 achieves Gromov δ = 0.22 and norm-frequency correlation ρ = +0.924, more tree-like than the input co-occurrence graph itself (δ = 0.43). The attention mechanism implicitly learns to place tokens in hyperbolic-compatible configurations without any geometric constraint.
→ Apply hyperbolic geometry at the output layer. Expect the largest gains on rare tokens. For transformers, measure Gromov δ first; the architecture may already capture the structure you seek.