Hyperbolic vision–language embeddings and loss functions for multimodal meme classification

Warner, Ryan

dc.contributor.author	Warner, Ryan
dc.contributor.supervisor	Thomo, Alex
dc.date.accessioned	2026-02-02T23:03:28Z
dc.date.available	2026-02-02T23:03:28Z
dc.date.issued	2026
dc.degree.department	Department of Computer Science
dc.degree.level	Master of Science (MSc)
dc.description.abstract	This thesis investigates whether hyperbolic geometry can improve multimodal Vision-Language Model (VLM) performance on the Facebook Hateful Memes dataset. The challenge of this benchmark stems partly from subtle semantic hierarchies in how images and text combine. For example, the phrase "you smell great" paired with a skunk conveys a fundamentally different meaning than the same text paired with a rose. Such shifts in meaning are often attributed to entailment relationships within and between the visual and textual components of a meme. We hypothesize that hyperbolic geometry, with its natural capacity to represent hierarchical structure, may capture these entailment relationships more effectively than flat Euclidean space. To explore this idea, we introduce Hyperbolic Flamingo, to our knowledge the first Flamingo-style architecture implemented in hyperbolic geometry. The model combines frozen MERU and CLIP (Contrastive Language-Image Pre-training) encoders with hyperbolic gated cross-attention layers. We adopt Flamingo's frozen-encoder design because the benchmark requires rapid iteration, and lightweight adapters allow efficient experimentation across different geometric configurations. Initial experiments revealed a core difficulty for hyperbolic VLMs: boundary collapse, where embeddings drift toward the edge of the Poincaré disk and gradients vanish. Under these conditions, angle-based losses saturate and performance degrades to near-chance levels. To address this, our main methodological contribution is a discriminative prototype loss (L_proto). Instead of classifying via token likelihood, the model predicts labels by geodesic distance to learnable class prototypes. This shift from generative prediction to geometric separation prevents boundary collapse and enables stable hyperbolic training where previous approaches failed.
Experiments with centroid-regularised prototypes (L_proto-reg) show mixed, dimension-dependent effects: regularisation helps at high dimensionalities (e.g., +1.15% at 256d) but reduces performance in lower-dimensional settings. Initial experiments (Phase 3) with a simplified architecture demonstrate parity: the hyperbolic prototype head (63.44% ± 2.1% Area Under the Receiver Operating Characteristic curve, or AUROC) matches the Euclidean baseline (63.4% ± 0.7%), with hyperbolic cross-attention and the LM head (63.9% ± 1.8%) slightly exceeding the baseline. Extended ablation (Phase 5) with the complete Flamingo architecture (Perceiver Resampler plus six interleaved cross-attention layers) reveals a stronger finding: the hyperbolic prototype head outperforms the LM head by about 3 percentage points (67.32% vs. 64.37% AUROC), confirming the value of the discriminative pivot. The best configuration achieves 67.97% ± 0.35% AUROC, with a best single-seed result of 69.50%. We view the Lorentzian distance losses explored here as a starting point; more advanced formulations (such as Accept-the-Gap's exterior-angle losses or HyCoCLIP's compositional entailment cones) may better realise the theoretical benefits of hyperbolic space. Overall, these findings establish Hyperbolic Flamingo as a practical and extensible foundation for further research on geometric inductive bias in vision-language models. Code availability: Source code is available on Hugging Face (https://huggingface.co/rkwarnerwsslskunkworx/Hyperbolic_Flamingo) with access granted upon request.
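The prototype-based classification described in the abstract (predicting a label from geodesic distance to class prototypes rather than from token likelihood) can be illustrated with a minimal sketch. This is not the thesis code: it assumes the Lorentz (hyperboloid) model with curvature -1, a softmax cross-entropy over negative distances, and a `temperature` parameter; all function names are hypothetical and chosen only for illustration.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector onto the unit hyperboloid (assumed curvature -1)."""
    time = math.sqrt(1.0 + sum(c * c for c in v))
    return [time] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time term plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def geodesic_distance(x, y):
    """Geodesic distance on the hyperboloid; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def prototype_loss(embedding, prototypes, label, temperature=1.0):
    """Cross-entropy where each class logit is the negative distance to its prototype."""
    logits = [-geodesic_distance(embedding, p) / temperature for p in prototypes]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def predict(embedding, prototypes):
    """Return the index of the nearest prototype."""
    dists = [geodesic_distance(embedding, p) for p in prototypes]
    return dists.index(min(dists))

# Two illustrative class prototypes (e.g. benign vs. hateful) and one embedding.
prototypes = [lift_to_hyperboloid([1.0, 0.0]), lift_to_hyperboloid([-1.0, 0.0])]
embedding = lift_to_hyperboloid([0.8, 0.1])
predicted_class = predict(embedding, prototypes)
```

In a trainable model the prototypes would be learnable parameters updated by gradient descent; here they are fixed lists so the distance and loss computations can be inspected in isolation.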
dc.description.scholarlevel	Graduate
dc.identifier.uri	https://hdl.handle.net/1828/23091
dc.language	English	eng
dc.language.iso	en
dc.rights	Available to the World Wide Web
dc.title	Hyperbolic vision–language embeddings and loss functions for multimodal meme classification
dc.type	Thesis

Files

Original bundle

Name: Warner_Ryan_MSc_2025.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed to upon submission

Collections

Electronic Theses and Dissertations (ETD)