A performance evaluation of collective communication libraries

Srinivasan, Subiksha

dc.contributor.author	Srinivasan, Subiksha
dc.contributor.supervisor	Wu, Kui
dc.contributor.supervisor	Prakash Champati, Jaya
dc.date.accessioned	2026-04-10T16:39:39Z
dc.date.available	2026-04-10T16:39:39Z
dc.date.issued	2026
dc.degree.department	Department of Computer Science
dc.degree.level	Master of Science MSc
dc.description.abstract	Collective communication operations such as AllGather and AlltoAll are fundamental to high-performance computing (HPC) and large-scale machine learning workloads. Their performance, however, is tightly constrained by network structure, link latency, and bandwidth availability across modern multi-GPU and multi-node systems. As systems scale and become increasingly heterogeneous, traditional collective scheduling approaches, which often assume unrealistic symmetry in latency and topology, become ineffective. This project investigates Traffic Engineering for Collective Communication (TE-CCL), an optimization-based framework that formulates collective scheduling as a Mixed-Integer Linear Programming (MILP) problem. TE-CCL explicitly incorporates link-level latency (α) into its scheduling formulation, enabling more realistic modelling of heterogeneous multi-fabric GPU clusters. This project examines how varying α across links affects routing decisions, epoch schedules, and solver behaviour. When heterogeneous α values are introduced, rather than a fixed latency assumed across all links, the model adapts its schedules to prioritize low-latency paths, reduce hop count where beneficial, and capture the realistic communication delays found in cloud and datacenter clusters. This work provides an analysis of TE-CCL under latency variability, evaluating solver behaviour, schedule structures, and topology sensitivity across multiple cluster designs. The study highlights how α-aware scheduling reshapes the communication patterns selected by the solver and provides insights into when and why topology regularity influences optimization stability. Overall, this investigation clarifies the importance of latency modelling in collective communication and offers guidance for extending TE-CCL toward more robust, topology-adaptive scheduling strategies for next-generation HPC and ML systems.
dc.description.scholarlevel	Graduate
dc.identifier.uri	https://hdl.handle.net/1828/23563
dc.language.iso	en
dc.rights	Available to the World Wide Web
dc.title	A performance evaluation of collective communication libraries
dc.type	project

Files

Original bundle

Name: Subiksha_Srinivasan_MSc_2026.pdf
Size: 889.89 KB
Format: Adobe Portable Document Format

Collections

Master's Projects