Hyperbolic vision–language embeddings and loss functions for multimodal meme classification
Date
2026
Authors
Warner, Ryan
Abstract
This thesis investigates whether hyperbolic geometry can improve multimodal Vision-Language Model (VLM) performance on the Facebook Hateful Memes dataset. The challenge of this benchmark stems partly from subtle semantic hierarchies in how images and text combine. For example, the phrase "you smell great" paired with a skunk conveys a fundamentally different meaning than the same text paired with a rose. Such shifts in meaning are often attributed to entailment relationships within and between the visual and textual components of a meme.
We hypothesize that hyperbolic geometry, with its natural capacity to represent hierarchical structure, may capture these entailment relationships more effectively than flat Euclidean space. To explore this idea, we introduce Hyperbolic Flamingo, to our knowledge the first Flamingo-style architecture implemented in hyperbolic geometry. The model combines frozen MERU and CLIP (Contrastive Language-Image Pre-training) encoders with hyperbolic gated cross-attention layers. We adopt Flamingo's frozen-encoder design because the benchmark requires rapid iteration, and lightweight adapters allow efficient experimentation across different geometric configurations.
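To illustrate the hyperbolic gated cross-attention layers, the sketch below shows one plausible realisation (an assumption, not the exact thesis implementation; the class, helper, and parameter names are hypothetical): Flamingo-style gated cross-attention computed in the tangent space at the origin of the unit-curvature Lorentz model, with the result mapped back onto the manifold.

    import torch
    import torch.nn as nn

    def log_map_origin(x):
        # Lorentz (hyperboloid) point -> tangent vector at the origin, unit curvature.
        d = torch.acosh(torch.clamp(x[..., :1], min=1.0 + 1e-7))  # geodesic distance to origin
        x_space = x[..., 1:]
        return d * x_space / x_space.norm(dim=-1, keepdim=True).clamp_min(1e-7)

    def exp_map_origin(v):
        # Tangent vector at the origin -> Lorentz point, unit curvature.
        n = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return torch.cat([torch.cosh(n), torch.sinh(n) * v / n], dim=-1)

    class HyperbolicGatedCrossAttention(nn.Module):
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # zero-initialised tanh gate, as in Flamingo

        def forward(self, text_hyp, vision_hyp):
            # Attend in the (Euclidean) tangent space, then return to the manifold.
            q = log_map_origin(text_hyp)     # (B, T, dim)
            kv = log_map_origin(vision_hyp)  # (B, V, dim)
            update, _ = self.attn(q, kv, kv)
            return exp_map_origin(q + torch.tanh(self.gate) * update)

    # Toy usage: 32 text tokens attending over 197 visual tokens.
    block = HyperbolicGatedCrossAttention(dim=256)
    text = exp_map_origin(0.1 * torch.randn(2, 32, 256))
    vision = exp_map_origin(0.1 * torch.randn(2, 197, 256))
    out = block(text, vision)  # (2, 32, 257): updated text tokens on the hyperboloid

In this sketch the zero-initialised gate mirrors Flamingo: at the start of training the block passes the frozen features through essentially unchanged, and the cross-modal update is blended in as the gate opens.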
Initial experiments revealed a core difficulty for hyperbolic VLMs: boundary collapse, where embeddings drift toward the edge of the Poincaré disk and gradients vanish. Under these conditions, angle-based losses saturate and performance collapses toward randomness. To address this, our main methodological contribution is a discriminative prototype loss (L_proto): instead of classifying via token likelihood, the model predicts labels by geodesic distance to learnable class prototypes. This shift from generative prediction to geometric separation prevents boundary collapse and enables stable hyperbolic training where previous approaches failed. Experiments with centroid-regularised prototypes (L_proto-reg) show mixed, dimension-dependent effects: regularisation helps at high dimensionalities (e.g., +1.15% at 256 dimensions) but reduces performance in lower-dimensional settings.
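To make the discriminative pivot concrete, the following PyTorch sketch classifies by geodesic distance to learnable class prototypes. It assumes the unit-curvature Lorentz model and a softmax over negative distances; the head, temperature, and helper names are illustrative rather than the thesis code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def lift_to_hyperboloid(x_space):
        # Attach the time coordinate so that <x, x>_L = -1 (unit curvature).
        x_time = torch.sqrt(1.0 + (x_space ** 2).sum(-1, keepdim=True))
        return torch.cat([x_time, x_space], dim=-1)

    def lorentz_distance(x, y):
        # Geodesic distance on the unit hyperboloid via the Lorentzian inner product.
        inner = -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)
        return torch.acosh(torch.clamp(-inner, min=1.0 + 1e-7))

    class PrototypeHead(nn.Module):
        def __init__(self, dim, num_classes, temperature=0.1):
            super().__init__()
            # Prototypes kept as spatial coordinates and lifted on the fly.
            self.proto_space = nn.Parameter(0.01 * torch.randn(num_classes, dim))
            self.temperature = temperature

        def forward(self, feats_space):
            x = lift_to_hyperboloid(feats_space)                # (B, dim+1)
            p = lift_to_hyperboloid(self.proto_space)           # (C, dim+1)
            d = lorentz_distance(x[:, None, :], p[None, :, :])  # (B, C)
            return -d / self.temperature                        # logits = scaled negative distances

    # L_proto: cross-entropy over negative geodesic distances (toy example).
    head = PrototypeHead(dim=256, num_classes=2)
    feats = 0.1 * torch.randn(8, 256)          # fused multimodal features (spatial coords)
    labels = torch.randint(0, 2, (8,))
    loss = F.cross_entropy(head(feats), labels)

Because the logits are negative geodesic distances, minimising the cross-entropy pulls each embedding toward its class prototype and away from the others, so the training signal depends on geometric separation rather than token likelihoods.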
Phase 3 experiments with a simplified architecture demonstrate parity: the hyperbolic prototype head reaches 63.44% ± 2.1% area under the receiver operating characteristic curve (AUROC), matching the Euclidean baseline (63.4% ± 0.7%), while hyperbolic cross-attention with the LM head (63.9% ± 1.8%) slightly exceeds it. An extended ablation (Phase 5) with the complete Flamingo architecture (Perceiver Resampler plus six interleaved cross-attention layers) reveals a stronger finding: the hyperbolic prototype head outperforms the LM head by roughly 3 percentage points (67.32% vs 64.37% AUROC), confirming the value of the discriminative pivot. The best configuration achieves 67.97% ± 0.35% AUROC, with a best single-seed result of 69.50%. We view the Lorentzian distance losses explored here as a starting point; more advanced formulations (such as Accept-the-Gap's exterior-angle losses or HyCoCLIP's compositional entailment cones) may better realise the theoretical benefits of hyperbolic space.
Overall, these findings establish Hyperbolic Flamingo as a practical and extensible foundation for further research on geometric inductive bias in vision-language models.
Code availability: Source code is available on Hugging Face (https://huggingface.co/rkwarnerwsslskunkworx/Hyperbolic_Flamingo) with access granted upon request.