A performance evaluation of collective communication libraries
| dc.contributor.author | Srinivasan, Subiksha | |
| dc.contributor.supervisor | Wu, Kui | |
| dc.contributor.supervisor | Prakash Champati, Jaya | |
| dc.date.accessioned | 2026-04-10T16:39:39Z | |
| dc.date.available | 2026-04-10T16:39:39Z | |
| dc.date.issued | 2026 | |
| dc.degree.department | Department of Computer Science | |
| dc.degree.level | Master of Science MSc | |
| dc.description.abstract | Collective communication operations such as AllGather and AlltoAll are fundamental to high-performance computing (HPC) and large-scale machine learning workloads. Their performance, however, is tightly constrained by network structure, link latency, and bandwidth availability across modern multi-GPU and multi-node systems. As systems scale and become increasingly heterogeneous, traditional collective scheduling approaches, which often assume unrealistic symmetry in latency and topology, become ineffective. This project investigates Traffic Engineering for Collective Communication (TE-CCL), an optimization-based framework that formulates collective scheduling as a Mixed-Integer Linear Programming (MILP) problem. TE-CCL explicitly incorporates link-level latency (α) into its scheduling formulation, enabling more realistic modelling of heterogeneous multi-fabric GPU clusters. This project examines how varying α across links affects routing decisions, epoch schedules, and solver behaviour. By introducing heterogeneous α values, rather than assuming a fixed latency across all links, the model adapts its schedules to prioritize low-latency paths, reduce hop count where beneficial, and capture realistic communication delays found in cloud and datacenter clusters. This work provides an analysis of TE-CCL under latency variability, evaluating solver behaviour, schedule structures, and topology sensitivity across multiple cluster designs. The study highlights how α-aware scheduling reshapes the communication patterns selected by the solver and provides insights into when and why topology regularity influences optimization stability. Overall, this investigation clarifies the importance of latency modelling in collective communication and offers guidance for extending TE-CCL toward more robust, topology-adaptive scheduling strategies for next-generation HPC and ML systems. | |
| dc.description.scholarlevel | Graduate | |
| dc.identifier.uri | https://hdl.handle.net/1828/23563 | |
| dc.language.iso | en | |
| dc.rights | Available to the World Wide Web | |
| dc.title | A performance evaluation of collective communication libraries | |
| dc.type | project |