A performance evaluation of collective communication libraries
Date
2026
Authors
Srinivasan, Subiksha
Abstract
Collective communication operations such as AllGather and AlltoAll are fundamental to high-performance computing (HPC) and large-scale machine learning workloads. Their performance, however, is tightly constrained by network structure, link latency, and bandwidth availability across modern multi-GPU and multi-node systems. As systems scale and become increasingly heterogeneous, traditional collective scheduling approaches, which often assume unrealistic symmetry in latency and topology, become ineffective.
This project investigates Traffic Engineering for Collective Communication (TE-CCL), an optimization-based framework that formulates collective scheduling as a Mixed-Integer Linear Programming (MILP) problem. TE-CCL explicitly incorporates link-level latency (α) into its scheduling formulation, enabling more realistic modelling of heterogeneous multi-fabric GPU clusters. This project examines how varying α across links affects routing decisions, epoch schedules, and solver behaviour. When α values are heterogeneous rather than fixed at a single latency across all links, the model adapts its schedules to prioritize low-latency paths, reduce hop count where beneficial, and capture the realistic communication delays found in cloud and datacenter clusters.
This work provides an analysis of TE-CCL under latency variability, evaluating solver behaviour, schedule structures, and topology sensitivity across multiple cluster designs. The study highlights how α-aware scheduling reshapes the communication patterns selected by the solver and provides insights into when and why topology regularity influences optimization stability. Overall, this investigation clarifies the importance of latency modelling in collective communication and offers guidance for extending TE-CCL toward more robust, topology-adaptive scheduling strategies for next-generation HPC and ML systems.
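
The effect of heterogeneous α on path selection can be illustrated with a small sketch. This is not the TE-CCL MILP formulation. It is a simplified stand-in that assumes each link costs α + size / β for a single chunk forwarded hop by hop, with α in seconds and β as bandwidth in bytes per second, and picks the cheapest path with Dijkstra's algorithm. The three-node topology, the link values, and the function names are hypothetical and chosen only for illustration.

```python
import heapq


def transfer_time(alpha, beta, size):
    """Assumed alpha-beta link cost: fixed latency plus size over bandwidth."""
    return alpha + size / beta


def cheapest_path(links, src, dst, size):
    """Dijkstra over per-link alpha-beta costs for one store-and-forward chunk.

    links maps (u, v) -> (alpha_seconds, beta_bytes_per_second).
    Returns (total_time, path); (inf, []) if dst is unreachable.
    """
    graph = {}
    for (u, v), (alpha, beta) in links.items():
        graph.setdefault(u, []).append((v, transfer_time(alpha, beta, size)))

    best = {src: 0.0}
    heap = [(0.0, src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical topology: a direct link A->C and a two-hop route A->B->C.
uniform = {
    ("A", "C"): (1e-6, 1e9),
    ("A", "B"): (1e-6, 1e9),
    ("B", "C"): (1e-6, 1e9),
}

# Same topology, but the direct link crosses a slower fabric (10x higher alpha).
hetero = dict(uniform)
hetero[("A", "C")] = (10e-6, 1e9)

small, large = 1e3, 1e6  # chunk sizes in bytes

_, path_uniform_small = cheapest_path(uniform, "A", "C", small)
_, path_hetero_small = cheapest_path(hetero, "A", "C", small)
_, path_hetero_large = cheapest_path(hetero, "A", "C", large)

print(path_uniform_small, path_hetero_small, path_hetero_large)
```

Under these assumed values, a small chunk takes the direct link when α is uniform but detours through B once the direct link's α is raised, because latency dominates the cost. A large chunk returns to the direct link even with the higher α, since the bandwidth term paid on every hop outweighs the latency difference. This mirrors, in a much reduced form, the trade-off between low-latency paths and hop count described above.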