INLP Project · Semester 6 · IIIT Hyderabad

Does Geometry Matter at the Token Level?

A controlled placement study of hyperbolic geometry in subword token representations across LSTM and GPT-2 language models.

Krish Pandya · Harshith Seera · Ronit Jalihal

International Institute of Information Technology, Hyderabad · Introduction to NLP, Spring 2026

Hyperbolic geometry helps, but only in the right place

Modern NLP pipelines rely on subword tokenization (BPE, WordPiece), whose token vocabularies implicitly encode hierarchical structure: high-frequency tokens act as generic building blocks while rare tokens encode increasingly specific, compositional information. Hyperbolic space, whose volume grows exponentially with radius, is theoretically well-suited for such hierarchies. We ask: does this geometric advantage carry over to token-level embeddings learned during language modeling?

Our answer is conditional on placement. We run a controlled ablation across two architectures (a single-layer LSTM on WikiText-2 and a 25M-parameter GPT-2 on WikiText-103), comparing embedding-layer vs. output-layer hyperbolic geometry. At both scales, hyperbolic geometry at the embedding layer fails due to geometric degeneracy (boundary saturation in LSTM, origin collapse in GPT-2). At the output layer it succeeds: reducing LSTM perplexity by 2.2% and producing a rich centroid hierarchy for GPT-2. Strikingly, Euclidean GPT-2 embeddings spontaneously develop tree-like structure (Gromov δ = 0.22, ρ = +0.924), suggesting transformers implicitly learn hyperbolic-compatible representations without explicit geometric constraints.

Four questions, one controlled experiment

We design our study around four concrete questions, each probing a different aspect of geometry in language modeling.

RQ 1

Do hyperbolic token embeddings yield lower perplexity?

Comparing identical models with only geometry changed at embedding or output layers.

Conditional yes (output layer only)
RQ 2

Does geometry help more for rare tokens?

Analyzing per-frequency-bin perplexity: does hyperbolic space improve very rare token prediction?

Yes: −43% PPL on very rare tokens (LSTM)
RQ 3

Where to place hyperbolic geometry: embedding or output layer?

Our core placement ablation: which architectural position benefits from curved space?

Output layer, consistent at both scales
RQ 4

Do GPT-2 Euclidean embeddings spontaneously develop hyperbolic structure?

Measuring Gromov δ-hyperbolicity and norm-frequency correlation in trained Euclidean embeddings.

Yes: δ = 0.22, ρ = +0.924 without any constraint

Geometry as the only variable

We keep the sequence model architecture identical across all variants. Only the geometry at the embedding or output layer changes. This allows direct attribution of any performance difference to geometric choice.

Euclidean

Baseline

Standard nn.Embedding + nn.Linear output. Weight tying enabled. Optimized with AdamW.

Hyp-Embed

Hyperbolic Embedding Layer

Poincaré ball embeddings via exp₀ projection. RiemannianAdam for the embedding, AdamW for the rest. Logmap back to Euclidean space before the sequence model.
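The exp₀/log₀ maps used here have standard closed forms on the Poincaré ball. A minimal numpy sketch with curvature c (function names are ours; in practice a Riemannian library such as geoopt would supply these):

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of a curvature-c Poincare ball:
    maps a Euclidean tangent vector v to a point inside the ball."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v)
    if norm < 1e-12:                       # exp_0(0) = 0
        return np.zeros_like(v)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(x, c=1.0):
    """Inverse map: pulls a ball point x back to the tangent space at 0."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(x)
    if norm < 1e-12:
        return np.zeros_like(x)
    return np.arctanh(sqrt_c * norm) * x / (sqrt_c * norm)
```

The two maps are exact inverses, so the embedding table can be stored as ball points (optimized with RiemannianAdam) while the sequence model keeps consuming Euclidean vectors after log₀.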

Hyp-Output

Hyperbolic Output Layer

Euclidean sequence model, HyperbolicMLR at output. Geodesic hyperplane classifiers with per-class centroid p_k and direction a_k.


LSTM on WikiText-2

Single-layer LSTM, hidden dim 256, BPE vocabulary 8k tokens. Trained for 50k steps. Gradient clipping, dropout 0.3.

  • Dataset: WikiText-2 (~2M train tokens)
  • Vocab: BPE 8,192 tokens
  • Sequence length: 256
  • Embedding dim: 256

GPT-2 on WikiText-103

nanoGPT-style 25M parameter model: 6 layers, 6 heads, dim 384. Pre-norm, GELU FFN, causal attention. BPE vocabulary 16k.

  • Dataset: WikiText-103 (~103M train tokens)
  • Vocab: BPE 16,384 tokens
  • Sequence length: 256
  • Model dim: 384

Embedding layer fails; output layer wins

At the LSTM scale, the placement verdict is clear: hyperbolic geometry at the embedding layer increases perplexity by 53%, while at the output layer it improves it by 2.2%, with the largest gains on the rarest tokens.

  • LSTM Euclidean PPL: 120.1
  • Hyp-Embed PPL: 183.8 (+53%)
  • Hyp-Output PPL: 117.4 (−2.2%)
  • Very rare token gain: −43%
Model             Val PPL   Test PPL   Δ PPL    Very Common   Rare    Very Rare
LSTM Euclidean    120.1     113.4      —        88.3          310.2   748.5
LSTM Hyp-Embed    183.8     178.2      +53%     142.1         412.6   956.3
LSTM Hyp-Output   117.4     111.1      −2.2%    144.2         278.4   425.7

(The last three columns report per-frequency-bin PPL.)

Figure 1. Poincaré disk visualization. Left: Hyp-Embed embeddings saturate near the boundary (mean ‖x‖ = 0.92), destroying radial frequency structure. Right: Hyp-Output centroids self-organize with frequent tokens near the origin and rare tokens near the boundary (Spearman ρ = +0.82).


Figure 2. Embedding norm vs. log token frequency rank. Hyp-Output centroids show a strong positive correlation (ρ = +0.82): frequent tokens near the origin, rare tokens near the boundary. Euclidean and Hyp-Embed show no such structure.


Figure 3. Per-frequency-bin perplexity. Hyp-Output trades common-token accuracy (+63% PPL on very common) for rare-token gains (−43% PPL on very rare). The crossover occurs in the rare-token regime.

Why does the embedding-layer placement fail? In dimension 256, volume concentration pushes all embeddings toward the ball boundary (mean norm 0.92). At ‖x‖ ≈ 1, hyperbolic distance depends almost entirely on angle; the space becomes effectively spherical and loses its radial, hierarchical resolution.
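The saturation effect is easy to reproduce numerically: the same small angular offset corresponds to a far larger geodesic distance near the boundary than at mid-radius, so saturated embeddings all look uniformly far apart. A self-contained sketch (radii here are illustrative, not taken from the experiments):

```python
import numpy as np

def poincare_dist(x, y):
    """Geodesic distance on the unit-curvature Poincare ball."""
    num = 2 * np.linalg.norm(x - y) ** 2
    den = (1 - np.linalg.norm(x) ** 2) * (1 - np.linalg.norm(y) ** 2)
    return np.arccosh(1 + num / den)

def pt(r, theta):
    """Point at radius r and angle theta in the 2-D ball."""
    return r * np.array([np.cos(theta), np.sin(theta)])

# the same 0.1-radian angular offset, at mid-radius vs. near the boundary
d_mid      = poincare_dist(pt(0.50, 0.0), pt(0.50, 0.1))
d_boundary = poincare_dist(pt(0.999, 0.0), pt(0.999, 0.1))
print(d_mid, d_boundary)   # the boundary pair is roughly 70x farther apart
```

When every token sits at roughly the same (large) radius, these angle-dominated distances are all that remains, and the radial frequency hierarchy is unrecoverable.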

Scale changes the failure mode, not the verdict

At 384 dimensions, hyperbolic embeddings collapse to the origin (mean norm 0.0003) rather than saturating the boundary. The output-layer placement still develops a rich centroid hierarchy. Most surprisingly, the Euclidean GPT-2 baseline spontaneously develops tree-like structure without any geometric constraint.
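The origin-collapse failure mode is a one-liner to verify: at a mean norm of 3 × 10⁻⁴, tanh is numerically indistinguishable from the identity, so exp₀ contributes nothing and the "hyperbolic" layer degenerates to a Euclidean one. A quick check (dimension and norm taken from the text, everything else illustrative):

```python
import numpy as np

def expmap0(v):
    """Exponential map at the origin of the unit-curvature Poincare ball."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else v

# a 384-dim vector rescaled to the observed mean norm of 3e-4
rng = np.random.default_rng(0)
v = rng.standard_normal(384)
v *= 3e-4 / np.linalg.norm(v)

x = expmap0(v)
rel_err = np.linalg.norm(x - v) / np.linalg.norm(v)
# tanh(t) = t - t^3/3 + ..., so the relative error is about (3e-4)^2 / 3
print(rel_err)
```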

  • GPT-2 Euclidean PPL: 48.9
  • Hyp-Embed PPL: 1534 (collapse)
  • Hyp-Output PPL: 114.5 (undertrained)
  • Euclidean norm-frequency correlation: ρ = +0.924

Figure 4. GPT-2 embedding geometry. (a) Euclidean: dense PCA cloud, no radial frequency structure, but strong tree-like geometry measured externally (PPL = 48.9). (b) Hyp-Embed: origin collapse (mean norm 0.0003); exp₀ becomes approximately the identity and all geometric advantage is lost (PPL = 1534). (c) Hyp-Output centroids: radial hierarchy with frequent tokens near the center, rare tokens near the boundary (PPL = 114.5). Color = log token frequency.

Model              Val PPL   Mean norm   ρ (norm-freq)    Geometry verdict
GPT-2 Euclidean    48.9      —           +0.924           Spontaneous tree-like structure (δ = 0.22)
GPT-2 Hyp-Embed    1534      0.0003      ≈ 0              Origin collapse: geometry destroyed
GPT-2 Hyp-Output   114.5     —           ρdir = +0.96     Rich centroid hierarchy develops

Figure 5. GPT-2 norm vs. frequency rank. Hyp-Output direction vectors show ρdir = +0.96.


Figure 6. Hyp-Output centroid norms by frequency bin. Frequent tokens cluster near origin, rare tokens pushed outward.

Why does the transformer develop hyperbolic structure on its own? The Euclidean GPT-2 baseline achieves Gromov δ = 0.22, more tree-like than the input token co-occurrence graph (δ = 0.43). The attention mechanism, by composing hierarchical word structure through many layers, implicitly learns to place tokens in configurations that are compatible with hyperbolic geometry. This is consistent with Park et al. (ICLR 2025), who show hierarchical concepts map to geometric structures in LLM representations.

Measuring tree-likeness with Gromov δ

The Gromov δ-hyperbolicity (four-point condition) quantifies how closely a metric space resembles a tree. Lower δ means more tree-like. We measure this across all models and the raw token co-occurrence graph.
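The four-point condition can be computed directly from a distance matrix: for every quadruple, sort the three pairwise-sum combinations and take half the gap between the two largest. A brute-force reference sketch (function name ours; practical pipelines sample quadruples rather than enumerating all O(n⁴) of them):

```python
import numpy as np
from itertools import combinations

def gromov_delta(D):
    """Exact four-point Gromov delta of a finite metric space given
    as a symmetric distance matrix D (O(n^4) reference version)."""
    delta = 0.0
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]))
        # a tree metric has its two largest sums exactly equal
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta

# sanity check: the shortest-path metric of a path graph (a tree) is 0-hyperbolic
n = 7
D = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
delta = gromov_delta(D)
norm_delta = 2 * delta / D[D > 0].mean()   # normalized as in the table below
```

Dividing by the average pairwise distance makes δ comparable across spaces with different scales, which is the normalized quantity reported here.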

Space                       2δ / d_avg   Interpretation
Token co-occurrence graph   0.43         Moderate hierarchy in raw data
LSTM Hyp-Embed              0.07         Most tree-like, but useless (saturation)
LSTM Euclidean              0.38         Near input-level hierarchy
LSTM Hyp-Output             0.27         Improved over Euclidean
GPT-2 Euclidean             0.22         More tree-like than the input data (spontaneous)
GPT-2 Hyp-Output            0.32         Hierarchy present, not stronger than Euclidean

Figure 7. Normalized Gromov δ across spaces. The LSTM Hyp-Embed achieves the lowest δ but worst perplexity. Tree-likeness is necessary but not sufficient for useful representations.

What we learned

01

Embedding-layer hyperbolic geometry consistently fails

Geometric degeneracy at both scales destroys the radial hierarchy that hyperbolic space is supposed to provide. Boundary saturation at dim 256 (LSTM, mean norm 0.92) and origin collapse at dim 384 (GPT-2, mean norm 0.0003) both leave the exp₀ projection as an information bottleneck with no usable curvature signal.

02

Output-layer placement succeeds at both scales

HyperbolicMLR at the output layer lets geometry enter only at classification, where cross-entropy gradients flow directly to centroids. No Riemannian optimizer is needed for the sequence model itself. LSTM perplexity improves by 2.2%; GPT-2 centroids develop strong radial hierarchy (ρdir = +0.96).

03

Hyperbolic output trades common-token accuracy for rare-token gains

In the LSTM, Hyp-Output reduces very rare token PPL by 43% while increasing very common token PPL by 63%. The geodesic hyperplane classifier naturally allocates more representational capacity to low-frequency tokens, exactly where Euclidean space is most deficient.

04

Transformers spontaneously develop hyperbolic-compatible structure

Euclidean GPT-2 achieves Gromov δ = 0.22 and norm-frequency correlation ρ = +0.924, more tree-like than the input co-occurrence graph itself (δ = 0.43). The attention mechanism implicitly learns to place tokens in hyperbolic-compatible configurations without any geometric constraint.

Takeaway

Apply hyperbolic geometry at the output layer. Expect the largest gains on rare tokens. For transformers, measure Gromov δ first; the architecture may already capture the structure you seek.