Analysis Overview

This page collects all experimental results, figures, and structural diagnostics from our two-phase study. Phase 1 places embeddings directly in the Poincare ball; Phase 2 keeps embeddings Euclidean and applies geometry only at the output layer. The Poincare GloVe section provides the reference picture of what successful hyperbolic embeddings look like.

113.5
Euclidean best val PPL (Phase 1, 15k steps)
174.0
Hyperbolic best val PPL (Phase 1, 15k steps)
117.4
Hyperbolic best val PPL (Phase 2, 15k steps)
121.7
Spherical best val PPL (Phase 2, 15k steps)
Analysis overview: perplexity, norm distributions, norm-vs-frequency
Four-panel overview from Phase 1. Top-left: validation perplexity training curves -- Euclidean converges to 113.5, hyperbolic stalls near 220. Top-right: embedding norm distributions -- hyperbolic norms concentrate sharply at ~3.2 (boundary of the Poincare ball) while Euclidean spans a broader range. Bottom: norm vs log10(frequency) scatterplots. Euclidean shows a clear positive correlation (r=0.324); hyperbolic is nearly flat (r=0.066).

Phase 1 - Full Hyperbolic Embedding

Following Nickel & Kiela (2017): embeddings live in the Poincare ball, optimized via RiemannianAdam with a 500-step burn-in phase. Embeddings are mapped to tangent space before the LSTM via logmap0.
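A minimal sketch of this Phase 1 setup using geoopt (the class and variable names below are illustrative, not the project's actual code):

# Sketch only: Poincare-ball embedding table, logmap0 before the LSTM,
# Riemannian Adam as the optimizer.
import torch
import torch.nn as nn
import geoopt

ball = geoopt.PoincareBall(c=1.0)

class HyperbolicEmbedding(nn.Module):
    def __init__(self, vocab_size=8192, dim=256):
        super().__init__()
        # Embedding table stored as points on the Poincare ball.
        init = ball.expmap0(torch.randn(vocab_size, dim) * 1e-3)
        self.weight = geoopt.ManifoldParameter(init, manifold=ball)

    def forward(self, token_ids):
        # Map ball points to the tangent space at the origin (logmap0)
        # so the downstream LSTM sees ordinary Euclidean vectors.
        return ball.logmap0(self.weight[token_ids])

emb = HyperbolicEmbedding()
optimizer = geoopt.optim.RiemannianAdam(emb.parameters(), lr=1e-3)
# The 500-step burn-in described above would run at a reduced learning rate
# before switching to the full manifold learning rate.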

Finding: Hyperbolic embeddings perform strictly worse than Euclidean at every training duration tested. At 15k steps: Euclidean 113.5 vs Hyperbolic 174.0. The additional optimization complexity (manifold LR, burn-in, Riemannian Adam) provides no measurable benefit.

Perplexity by Token Frequency Bin

We stratify validation perplexity by token frequency to test the hypothesis that hyperbolic geometry helps with rare tokens. If the curved space encoded a frequency hierarchy, rare tokens near the ball boundary should have better representations.
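A sketch of how such a stratified evaluation can be computed, assuming per-token validation losses nll, validation token ids val_token_ids, and training-corpus counts token_counts (all names illustrative; the bin edges below are examples -- the report only pins down "Very Rare" as <0.01%):

# Sketch only: perplexity restricted to frequency bins.
import numpy as np

freq = token_counts / token_counts.sum()            # relative frequency per token id
bins = {"Very Rare": (0.0, 1e-4), "Rare": (1e-4, 1e-3),
        "Uncommon": (1e-3, 1e-2), "Common": (1e-2, 1.0)}

for name, (lo, hi) in bins.items():
    mask = (freq[val_token_ids] >= lo) & (freq[val_token_ids] < hi)
    ppl = np.exp(nll[mask].mean())                  # perplexity over tokens in this bin
    print(f"{name:10s} {ppl:8.1f}")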

Perplexity by token frequency bin: Euclidean vs Hyperbolic
Perplexity stratified by token frequency tier (Phase 1, 15k steps). Euclidean beats hyperbolic in every bin. The gap is largest for very rare tokens (<0.01% frequency), where hyperbolic should theoretically help most.

Raw Numbers

Frequency Bin   Euclidean PPL   Hyperbolic PPL
Very Rare       1811            3279
Rare            281             595
Uncommon        55              106
Common          7.3             9.8

(Original bar chart legend: blue = Euclidean, pink = Hyperbolic; bars scaled relative to the max value, 3279.)

Embedding Norm Diagnosis

The crucial diagnostic: do embedding norms correlate with token frequency? In Nickel & Kiela's framework, a well-trained hyperbolic embedding places general tokens near the origin (small norm) and specific rare tokens near the boundary (large norm).

Norm vs frequency rank: Euclidean r=0.335, Hyperbolic r=0.045
Embedding norm vs token frequency rank (1 = most frequent, 8192 = rarest). Euclidean (left, r_log=0.335): clear monotonic trend -- high-frequency tokens cluster at lower norms, rare tokens spread upward. The rolling average (window 50) shows a smooth inverse-U shape. Hyperbolic (right, r_log=0.045): embeddings cluster tightly at norm ~3.2 with almost no radial variation. The rolling average is nearly flat. This confirms boundary saturation: all embeddings have converged to the effective Poincare ball radius, destroying any hierarchical structure.
0.335
r (log-freq vs norm), Euclidean
0.045
r (log-freq vs norm), Hyperbolic
3.2
Hyperbolic mean norm (near boundary)
2.68
Euclidean mean norm (broader spread)
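A sketch of how this norm-frequency diagnostic is computed, assuming emb_weight is the (vocab, dim) embedding matrix and token_counts the per-token training frequencies (names illustrative):

# Sketch only: correlation between token frequency and embedding norm.
import numpy as np
from scipy.stats import pearsonr, spearmanr

norms = np.linalg.norm(emb_weight, axis=1)          # embedding norm per token
log_freq = np.log10(token_counts + 1)               # log10 frequency per token

r_log, _ = pearsonr(log_freq, norms)                # the r_log values reported above
rank = np.argsort(-token_counts).argsort() + 1      # 1 = most frequent
rho, _ = spearmanr(rank, norms)
print(f"Pearson r(log freq, norm) = {r_log:.3f}, Spearman r(rank, norm) = {rho:.3f}")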

Phase 2 - Output-Layer Geometry

Motivated by Moreira et al. (2023): embeddings and LSTM remain fully Euclidean; only the output classification layer uses geometry. This avoids boundary saturation while still testing whether hyperbolic distance structure aids classification. All three variants (Euclidean, Hyperbolic, Spherical) use identical Adam optimizers -- no manifold learning rates or burn-in phases.

Phase 2 Perplexity Results

Finding: Hyperbolic MLR at the output layer achieves best val PPL 117.4, outperforming the Euclidean baseline (120.1) and spherical (121.7) at 15k steps. This is a meaningful reversal from Phase 1 (where hyperbolic was 60+ PPL worse). The Phase 2 architecture removes the saturation failure mode.
Run ID               Type        Emb Dim  Hidden  Dropout  Steps   Best Val PPL  Train PPL
run-20260308_132126  euclidean   256      256     0.2      15 000  120.1         88.6
run-20260308_135230  hyperbolic  256      256     0.2      15 000  117.4         57.7
run-20260308_132127  spherical   256      256     0.2      15 000  121.7         76.6
Note on overfitting: Hyperbolic train PPL (57.7) is substantially lower than its val PPL (117.4), suggesting the hyperbolic MLR layer has more expressive power that may be overfitting. The spherical variant shows an intermediate profile. A more thorough sweep of dropout and regularization strengths is needed before drawing strong conclusions about which geometry generalizes better.

Phase 2 Architecture Detail

Component       Euclidean      Hyperbolic                  Spherical
Embedding       R^256          R^256                       R^256
LSTM (1 layer)  R^256          R^256                       R^256
Projection      Identity       expmap0 to Poincare ball    L2 normalize to sphere
Classifier      Linear (8192)  HyperbolicMLR (Ganea 2018)  Linear on sphere
Optimizer       Adam 1e-3      Adam 1e-3                   Adam 1e-3
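A sketch of the Projection row above (the HyperbolicMLR head itself is omitted; the function and its signature are illustrative, not the project's actual code):

# Sketch only: the three Phase 2 projections applied to the LSTM hidden state.
import torch.nn.functional as F
import geoopt

ball = geoopt.PoincareBall(c=1.0)

def project_hidden(h, geometry):
    # h: (batch, 256) Euclidean LSTM hidden state
    if geometry == "euclidean":
        return h                              # identity
    if geometry == "hyperbolic":
        return ball.expmap0(h)                # map into the Poincare ball
    if geometry == "spherical":
        return F.normalize(h, dim=-1)         # L2-normalize onto the unit sphere
    raise ValueError(geometry)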

Why Phase 1 Fails: Boundary Saturation

Moreira et al. (2023) give a precise theoretical explanation for the Phase 1 failure, and our diagnostics confirm it exactly.

The Theoretical Argument

In d-dimensional hyperbolic space, the ratio of ball volume to ball surface area is bounded by r/d; as d grows this ratio approaches 0, just as in Euclidean space, so essentially all volume concentrates near the boundary. Because the loss keeps decreasing (without attaining a minimum) as embeddings move toward the unit ball boundary, the optimizer follows the gradient all the way to the boundary.

Once all embeddings lie at radius r_eff = (1 - epsilon)/sqrt(c), where c = -k is the (positive) curvature magnitude, the Poincare distance between any two points u, v with ||u||=||v||=r_eff reduces to a function of the angle between them only. The space is then isometric to a Euclidean sphere -- the radial hierarchy-encoding property is lost entirely.
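A quick numerical check of the angle-only claim (a standalone sketch in the unit-ball c = 1 parameterization, not project code): for ||u|| = ||v|| = r we have ||u - v||^2 = 2 r^2 (1 - cos theta), so d(u, v) = arccosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))) collapses to arccosh(1 + 4 r^2 (1 - cos theta) / (1 - r^2)^2), independent of where on the radius-r sphere the pair sits.

# Sketch only: equal norms + equal angle => equal Poincare distance.
import numpy as np

def poincare_dist(u, v):
    num = 2 * np.sum((u - v) ** 2)
    den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + num / den)

r, theta = 0.95, 0.3
u1, v1 = r * np.array([1.0, 0.0]), r * np.array([np.cos(theta), np.sin(theta)])
u2, v2 = r * np.array([0.0, 1.0]), r * np.array([-np.sin(theta), np.cos(theta)])
# Same radius, same angle, different directions: identical Poincare distance.
print(poincare_dist(u1, v1), poincare_dist(u2, v2))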

# From our diagnostics (Phase 1, 15k steps, dim=256):
euclidean_mean_norm  = 2.68   # broad distribution
hyperbolic_mean_norm = 3.20   # saturated at boundary
hyperbolic_std_norm  = 0.05   # extremely tight clustering

# Poincare ball boundary = 1.0 (normalized), but with curvature k=1
# r_eff ~ (1-epsilon) / sqrt(k) ~ 1.0 in Poincare parameterization
# Our embedding norms show saturation in geoopt's non-unit-ball parameterization

# Norm-frequency correlation:
r_euclidean = 0.335   # clear hierarchical structure
r_hyperbolic = 0.045  # no structure
Key diagnostic: The near-zero norm-frequency correlation (r=0.045) is a direct signature of boundary saturation. If hyperbolic geometry were working, this correlation should be strongly negative (high-frequency tokens near origin = low norm). Instead it is essentially zero, confirming all embeddings are at the same radius regardless of token frequency.

Why Phase 2 Avoids This

Phase 2 never places the embedding table in the Poincare ball. The LSTM hidden state (a 256-dim Euclidean vector) is projected to the ball only at the final classification step, where it is used as a query point for hyperbolic softmax. There is no pressure for this single projection to saturate, since the hidden state itself is regularized by the language modeling task and dropout.

Reference: Poincare GloVe Analysis

From Tifrea et al. (2019). These plots show what successful hyperbolic embeddings look like on full-word vocabulary (190k words, 20D). This is the gold standard our BPE token embeddings should eventually approach.

Norm vs Frequency Rank - Full Words

Poincare GloVe: target and context vector norms vs word frequency rank
Target vector norm comparison (top): Dot-product GloVe (blue) vs Poincare GloVe (orange) across 190k word vocabulary sorted by frequency rank. The Poincare model shows a smooth, monotonic decrease in norm with rank -- high-frequency words have large norms (closer to boundary) while rare words have small norms. Note this is opposite to Nickel & Kiela: in GloVe's parameterization the most frequent words act as context, not roots, so the relationship inverts. Context vector norm (middle) also decreases with rank. Bias terms (bottom) follow a similar pattern.

Key statistics from the notebook: Spearman r(norm, frequency) = -0.69 for Poincare target vectors, -0.61 for standard GloVe -- a measurable but modest improvement from hyperbolic geometry. Hyperbolic mean norm 0.653, std 0.050 (compared to our token embeddings at 3.2 +/- 0.05 in a different Poincare ball parameterization).
-0.69
Spearman r (norm vs freq), Poincare GloVe
-0.61
Spearman r (norm vs freq), standard GloVe
+0.045
r (norm vs freq), our BPE hyperbolic (Phase 1)
+0.335
r (norm vs freq), our BPE Euclidean
The gap: Poincare GloVe achieves r=-0.69 on 190k-word vocabulary. Our BPE hyperbolic model achieves r=+0.045 on 8k tokens -- essentially zero. Notably, even our Euclidean model (r=+0.335) shows more frequency-structure than the failed hyperbolic model. The sign difference (+0.335 vs -0.69) reflects different parameterizations: in GloVe frequent words are contexts (high norm = general); in our LM frequent tokens likely develop smaller norms because they are well-specified.

Distance Structure - Poincare vs Euclidean

Distance distribution: Poincare GloVe spreads distances far more than vanilla GloVe
Distance distribution between a target word and all vocabulary words. Left: sorted distances (neighbor rank vs distance). Poincare GloVe (orange) produces exponentially larger distances to distant words, giving much sharper contrast between close and far neighbors. Vanilla GloVe cosine distances (blue) compress most words into a narrow [0,1] band. Right: histogram of all pairwise distances. Vanilla GloVe peaks sharply at ~0.8 (most words about equally distant -- the "curse of dimensionality" in high-d cosine space). Poincare GloVe peaks at ~7 with a long tail -- far better separation of near/far concepts.

Average Relative Contrast: Poincare 4.46 (top 100) / 2.03 (rest) vs GloVe 16.1 (top 100) / 2.32 (rest). Interestingly GloVe has higher ARC for top-100 frequent words in this metric -- hyperbolic geometry spreads the semantic space but at the cost of anchor stability for the most common words.
Hyperbolic norm distribution: sorted ranks and histogram
Euclidean norm distribution of Poincare GloVe vectors (190k vocab). Left: sorted norm rank -- norms span [0.20, 0.92] with the majority between 0.55-0.70. Right: histogram peaks tightly at ~0.63 -- all embeddings have bounded norm inside the Poincare ball (unit ball in this parameterization), with meaningful radial variation (std=0.050) encoding frequency hierarchy. Compare to our Phase 1 norms: the absolute std is the same (0.050), but our embeddings are saturated at ~3.2, so relative to the mean the radial variation is far smaller and carries no frequency signal.

Nearest Neighbor Quality (from notebook)

Word      Poincare GloVe Nearest Neighbors
dance     dancing, dances, music, singing, musical, performing, hip-hop, pop, folk, dancers
sixties   seventies, eighties, nineties, 60s, 70's, 60's, 1960's, 80's, 90's, 70s
daughter  son, wife, mother, sister, father, husband, brother, daughters, sons, grandmother

Nearest neighbors in Poincare distance are semantically coherent and capture analogical structure. This is what well-trained hyperbolic embeddings can produce. At the subword BPE level, analogous quality would mean tokens like -ing clustering near general morphological roots, with rare compositional tokens near the boundary.

Center vs Border of the Ball

Location                  Expected role       Sample Words (190k vocab Poincare GloVe)
Near center (small norm)  general, abstract   alola, arecoideae, gnetophytes, chennselaig, wesleys, duckwater -- very rare, specific named entities
Near border (large norm)  specific, frequent  singles, race, road, hockey, i, starring, player, income, rural, yards -- common topical words
Interpretation: In Poincare GloVe the most frequent words are near the boundary (large norm) because they appear in many co-occurrence contexts and are "pulled" toward the boundary by many training pairs. Extremely rare words have fewer co-occurrence constraints and settle near the center. This is the inverse of the pure hypernymy picture (Nickel & Kiela), reflecting the unsupervised co-occurrence objective. Our LM setting is more similar to GloVe than to hypernymy, so we should expect the same inversion if the geometry is working.

Interactive: Cherry-picked Semantic Hierarchies

From Tifrea et al. (2019). The Poincare GloVe model uses 20-dimensional embeddings decomposed into 10 separate 2D Poincare disks (one subplot per embedding slot). Each disk shows 6 semantic categories -- presidents, mathematics terms, numbers, chemistry, sports figures, countries -- each as a colored cluster. Hover over points to see word labels. Use the Plotly toolbar to zoom, pan, or export.

Context vectors for 180k-word vocabulary. All 10 disks share the same [-1, 1] x [-1, 1] range to keep the Poincare ball boundary visible.

Poincare GloVe MIX model (20D, 10 x 2D Poincare disks, dist-sq objective, 180k vocab). Each subplot is one 2D Poincare disk. The boundary at radius 1 is the Poincare ball limit.

What to look for: Each semantic category (color) forms a compact cluster. Categories that are conceptually related appear in similar angular sectors across multiple disks. The MIX structure lets each 2D plane specialize in a different aspect of meaning. Compare how sports figures cluster tightly in some disks but spread in others, while number words remain cohesive across all disks -- this is the geometry doing real work.

All Runs

Complete table of all wandb runs with metrics and configurations.

Run ID (timestamp)  Phase  Type        Emb Dim  Hidden  Dropout  Steps   Best Val PPL  Train PPL
20260307_133535     1      hyperbolic  256      512     0        400     698.3         656.9
20260307_133536     1      hyperbolic  256      512     0        200     984.3         656.6
20260307_134408     1      euclidean   256      512     0        240     593.4         549.0
20260307_135727     1      euclidean   256      512     0        5 000   121.8         71.7
20260307_135842     1      hyperbolic  256      512     0        5 000   186.7         114.4
20260307_235654     1      euclidean   256      512     0.2      5 000   124.5         62.2
20260307_235900     1      hyperbolic  256      512     0.2      5 000   203.3         138.8
20260308_010154     1      hyperbolic  256      512     0.2      2 360   354.2         282.2
20260308_021526     1      hyperbolic  256      512     0.2      15 000  174.0         76.2
20260308_021528     1      euclidean   256      512     0.2      15 000  113.5         36.3
20260308_132126     2      euclidean   256      256     0.2      15 000  120.1         88.6
20260308_135230     2      hyperbolic  256      256     0.2      15 000  117.4         57.7
20260308_132127     2      spherical   256      256     0.2      15 000  121.7         76.6

Open Questions and Future Work

1. Is the Token Co-occurrence Graph Hyperbolic?

Tifrea et al. measure the Gromov delta-hyperbolicity of word co-occurrence graphs and find 2*delta_avg/d_avg ratios as low as 0.0034, confirming that word co-occurrence data is genuinely tree-like. This measurement has not been performed for BPE subword tokens. BPE tokens are defined by frequency-based compression, not semantic content, and their co-occurrence structure may differ fundamentally from full-word graphs.

Planned: Compute delta-hyperbolicity of BPE token co-occurrence graph from WikiText-2 using the Tifrea et al. methodology. If 2*delta/d is low, the geometric premise holds. If high, no architectural variant will help -- which is itself an important negative result.
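A sketch of the planned measurement via the four-point condition, assuming G is a networkx graph over BPE token ids with edges between tokens that co-occur within a context window (the graph construction and names are illustrative, and this is not the Tifrea et al. code):

# Sketch only: sampled Gromov delta-hyperbolicity and the 2*delta_avg/d_avg ratio.
import random
import networkx as nx

def gromov_delta_ratio(G, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    nodes = list(G.nodes)
    dist = dict(nx.all_pairs_shortest_path_length(G))   # feasible for ~8k tokens
    deltas, dists = [], []
    while len(deltas) < n_samples:
        x, y, z, w = rng.sample(nodes, 4)
        pairs = [(x, y), (x, z), (x, w), (y, z), (y, w), (z, w)]
        try:
            d = {p: dist[p[0]][p[1]] for p in pairs}
        except KeyError:
            continue                                     # skip disconnected quadruples
        # Four-point condition: delta is half the gap between the two largest
        # of the three pairwise distance sums.
        s = sorted([d[(x, y)] + d[(z, w)], d[(x, z)] + d[(y, w)], d[(x, w)] + d[(y, z)]])
        deltas.append((s[2] - s[1]) / 2)
        dists.extend(d.values())
    delta_avg = sum(deltas) / len(deltas)
    d_avg = sum(dists) / len(dists)
    return 2 * delta_avg / d_avg                         # the ratio Tifrea et al. report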

2. Do Phase 2 Embeddings Show Radial Structure?

The Phase 2 models have not yet been subjected to the norm-vs-frequency diagnostic. Since embeddings remain Euclidean, the interesting question is whether the LSTM hidden state, when projected to the Poincare ball, develops radial structure tied to the frequency of the next token being predicted.

3. Overfitting in Phase 2 Hyperbolic

Hyperbolic MLR train PPL (57.7) vs val PPL (117.4) shows a larger train/val gap than Euclidean (88.6 / 120.1). The hyperbolic output layer may have more capacity due to the richer distance structure, but this capacity is not yet being regularized effectively. Investigating dropout rates and weight decay specifically for the hyperbolic output parameters is the immediate next experiment.
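One concrete starting point (a sketch; model.output_mlr is an assumed attribute name, not the project's actual code) is to give the hyperbolic output parameters their own regularization group:

# Sketch only: separate weight decay for the hyperbolic output head.
import torch

optimizer = torch.optim.Adam([
    {"params": model.embedding.parameters(),  "weight_decay": 0.0},
    {"params": model.lstm.parameters(),       "weight_decay": 0.0},
    {"params": model.output_mlr.parameters(), "weight_decay": 1e-4},  # regularize only the MLR head
], lr=1e-3)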

4. Low-Dimension Regime

Theory predicts hyperbolic space should show the clearest advantages at low embedding dimensions (e.g., dim=8 or 16), where Euclidean space genuinely struggles to represent hierarchical structure. Our current experiments all use dim=256. Running a dimension sweep (8, 16, 32, 64, 128, 256) with Phase 2 architecture is the most direct test of the theoretical prediction.