A performance evaluation of collective communication libraries

Srinivasan, Subiksha

Files

Subiksha_Srinivasan_MSc_2026.pdf (889.89 KB)

Date

2026

Authors

Srinivasan, Subiksha

Abstract

Collective communication operations such as AllGather and AlltoAll are fundamental to high-performance computing (HPC) and large-scale machine learning (ML) workloads. Their performance, however, is tightly constrained by network structure, link latency, and bandwidth availability across modern multi-GPU and multi-node systems. As systems scale and become increasingly heterogeneous, traditional collective scheduling approaches, which often assume unrealistic symmetry in latency and topology, become ineffective. This project investigates Traffic Engineering for Collective Communication (TE-CCL), an optimization-based framework that formulates collective scheduling as a Mixed-Integer Linear Programming (MILP) problem. TE-CCL explicitly incorporates link-level latency (α) into its scheduling formulation, enabling more realistic modelling of heterogeneous multi-fabric GPU clusters. This project examines how varying α across links affects routing decisions, epoch schedules, and solver behaviour. When heterogeneous α values are introduced, rather than a single fixed latency assumed for all links, the model adapts its schedules to prioritize low-latency paths, reduce hop count where beneficial, and capture the realistic communication delays found in cloud and datacenter clusters. This work provides an analysis of TE-CCL under latency variability, evaluating solver behaviour, schedule structures, and topology sensitivity across multiple cluster designs. The study highlights how α-aware scheduling reshapes the communication patterns selected by the solver and provides insights into when and why topology regularity influences optimization stability. Overall, this investigation clarifies the importance of latency modelling in collective communication and offers guidance for extending TE-CCL toward more robust, topology-adaptive scheduling strategies for next-generation HPC and ML systems.
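The effect of per-link α described in the abstract can be illustrated with a toy example. The sketch below is not TE-CCL and does not solve a MILP. It assumes a simple store-and-forward cost of α + size/bandwidth per link and uses a plain shortest-path search to show how a heterogeneous α can change which route is cheapest. The three-node topology, the α and bandwidth values, and the function names are all hypothetical and chosen only for illustration.

```python
import heapq


def transfer_time(alpha_s, bandwidth_bytes_per_s, size_bytes):
    """Alpha-beta cost of one link: fixed latency plus serialization time."""
    return alpha_s + size_bytes / bandwidth_bytes_per_s


def cheapest_path(links, src, dst, size_bytes):
    """Dijkstra over per-link alpha-beta costs (store-and-forward assumption).

    links maps a directed edge (u, v) to a tuple (alpha_s, bandwidth_bytes_per_s).
    Returns (total_time_s, path) or (inf, []) if dst is unreachable.
    """
    graph = {}
    for (u, v), (alpha, bw) in links.items():
        graph.setdefault(u, []).append((v, transfer_time(alpha, bw, size_bytes)))

    best = {src: 0.0}
    queue = [(0.0, src, [src])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical values: 10 GB/s links, 1 MB message.
BW = 1e10
SIZE = 1_000_000

# Uniform alpha on every link: the direct A -> C link wins.
uniform = {("A", "C"): (1e-6, BW), ("A", "B"): (1e-6, BW), ("B", "C"): (1e-6, BW)}

# Heterogeneous alpha: the direct link crosses a slow fabric (assumed 500 us).
hetero = {("A", "C"): (5e-4, BW), ("A", "B"): (1e-6, BW), ("B", "C"): (1e-6, BW)}

uniform_time, uniform_path = cheapest_path(uniform, "A", "C", SIZE)
hetero_time, hetero_path = cheapest_path(hetero, "A", "C", SIZE)
print(uniform_path)
print(hetero_path)
```

With uniform α the direct link is selected. Once the direct link is given a much larger α, the two-hop route through B becomes cheaper even though it uses more links. A solver that assumes one fixed α for every link cannot express this trade-off.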

URI

https://hdl.handle.net/1828/23563

Collections

Master's Projects
