Hyperbolic vision–language embeddings and loss functions for multimodal meme classification

dc.contributor.author: Warner, Ryan
dc.contributor.supervisor: Thomo, Alex
dc.date.accessioned: 2026-02-02T23:03:28Z
dc.date.available: 2026-02-02T23:03:28Z
dc.date.issued: 2026
dc.degree.department: Department of Computer Science
dc.degree.level: Master of Science MSc
dc.description.abstract: This thesis investigates whether hyperbolic geometry can improve multimodal Vision-Language Model (VLM) performance on the Facebook Hateful Memes dataset. The challenge of this benchmark stems partly from subtle semantic hierarchies in how images and text combine. For example, the phrase "you smell great" paired with a skunk conveys a fundamentally different meaning than the same text paired with a rose. Such shifts in meaning are often attributed to entailment relationships within and between the visual and textual components of a meme. We hypothesize that hyperbolic geometry, with its natural capacity to represent hierarchical structure, may capture these entailment relationships more effectively than flat Euclidean space. To explore this idea, we introduce Hyperbolic Flamingo, to our knowledge the first Flamingo-style architecture implemented in hyperbolic geometry. The model combines frozen MERU and CLIP (Contrastive Language-Image Pre-training) encoders with hyperbolic gated cross-attention layers. We adopt Flamingo's frozen-encoder design because the benchmark requires rapid iteration, and lightweight adapters allow efficient experimentation across different geometric configurations. Initial experiments revealed a core difficulty for hyperbolic VLMs: boundary collapse, where embeddings drift toward the edge of the Poincaré disk and gradients vanish. Under these conditions, angle-based losses saturate and performance degrades toward chance level. To address this, our main methodological contribution is a discriminative prototype loss (L_proto). Instead of classifying via token likelihood, the model predicts labels by geodesic distance to learnable class prototypes. This shift from generative prediction to geometric separation prevents boundary collapse and enables stable hyperbolic training where previous approaches failed.
Experiments with centroid-regularised prototypes (L_proto-reg) show mixed, dimension-dependent effects: regularisation helps at high dimensionalities (e.g., +1.15% at 256d) but reduces performance in lower-dimensional settings. Initial experiments (Phase 3) with a simplified architecture demonstrate parity with the Euclidean baseline: the hyperbolic prototype head (63.44% ± 2.1% Area Under the Receiver Operating Characteristic curve, or AUROC) matches the baseline (63.4% ± 0.7%), and the configuration with hyperbolic cross-attention and the language-model (LM) head (63.9% ± 1.8%) slightly exceeds it. Extended ablation (Phase 5) with the complete Flamingo architecture (Perceiver Resampler plus six interleaved cross-attention layers) reveals a stronger finding: the hyperbolic prototype head outperforms the LM head by roughly 3 percentage points (67.32% vs 64.37% AUROC), confirming the value of the discriminative pivot. The best configuration achieves 67.97% ± 0.35% AUROC, with a best single-seed result of 69.50%. We view the Lorentzian distance losses explored here as a starting point; more advanced formulations (such as Accept-the-Gap's exterior-angle losses or HyCoCLIP's compositional entailment cones) may better realise the theoretical benefits of hyperbolic space. Overall, these findings establish Hyperbolic Flamingo as a practical and extensible foundation for further research on geometric inductive bias in vision-language models. Code availability: Source code is available on Hugging Face (https://huggingface.co/rkwarnerwsslskunkworx/Hyperbolic_Flamingo) with access granted upon request.
dc.description.scholarlevel: Graduate
dc.identifier.uri: https://hdl.handle.net/1828/23091
dc.language: English (eng)
dc.language.iso: en
dc.rights: Available to the World Wide Web
dc.title: Hyperbolic vision–language embeddings and loss functions for multimodal meme classification
dc.type: Thesis
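
The abstract describes classification by geodesic distance to learnable class prototypes. The block below is a minimal sketch of that general idea, not the thesis's implementation, which this record does not reproduce. It assumes the Lorentz (hyperboloid) model with curvature -1, uses negative geodesic distances as softmax logits, and relies on NumPy only. The names `expmap0`, `lorentz_distance`, `prototype_loss`, and the `temperature` parameter are hypothetical and chosen for illustration.

```python
import numpy as np


def expmap0(v, eps=1e-8):
    """Lift Euclidean vectors v (..., d) onto the unit hyperboloid (..., d+1).

    Assumption: curvature -1, exponential map taken at the hyperboloid origin.
    """
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    time = np.cosh(norm)                # time-like coordinate x_0
    space = np.sinh(norm) * v / norm    # space-like coordinates x_1..x_d
    return np.concatenate([time, space], axis=-1)


def lorentz_distance(x, p):
    """Pairwise geodesic distances between x (n, d+1) and p (k, d+1)."""
    # Lorentzian inner product: -x_0 * p_0 + sum_i x_i * p_i
    inner = -np.outer(x[:, 0], p[:, 0]) + x[:, 1:] @ p[:, 1:].T
    # Clip to the domain of arccosh to guard against rounding error.
    return np.arccosh(np.clip(-inner, 1.0, None))


def prototype_loss(features, prototypes, labels, temperature=1.0):
    """Cross-entropy over negative distances to class prototypes.

    Assumption: features and prototypes are tangent vectors at the origin;
    in training, the prototypes would be learnable parameters.
    """
    x = expmap0(features)
    p = expmap0(prototypes)
    logits = -lorentz_distance(x, p) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    return loss, logits.argmax(axis=1)


# Toy usage: two classes (e.g. benign vs hateful) with made-up 2-d features.
protos = np.array([[1.0, 0.0], [-1.0, 0.0]])
feats = np.array([[0.9, 0.1], [-0.8, -0.2]])
labels = np.array([0, 1])
loss, preds = prototype_loss(feats, protos, labels)
```

In this toy setup each feature lies closest to the prototype of its own class, so `preds` agrees with `labels`. Because the logits are distances rather than angles, the prediction depends on where an embedding sits relative to the prototypes, not on how far it has drifted from the origin.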

Files

Original bundle
Name: Warner_Ryan_MSc_2025.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format