Inferring network topology for distributed machine learning model training

Date

2024

Authors

An, Renjun

Abstract

As distributed machine learning is adopted across industries, demand is growing for model training on cloud computing resources. However, many cloud service providers decline to give end-users information about the underlying network topology, for commercial and security reasons. Because of this opacity, it is difficult to place computation modules on different Virtual Machines (VMs) in a way that makes the best use of resources. To address this problem, we propose an algorithm called Flow Tracking (FT), which uses external measurements to infer the internal structure of a general graph. Compared with state-of-the-art topology inference algorithms, FT produces the most accurate topology as measured by four different metrics. Notably, FT reconstructs 100% of the underlying topology when the underlying network uses shortest-path routing. In experiments, resource allocation based on the inferred topology significantly improves model training efficiency compared with random allocation.
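To make the last point concrete, the following is a minimal sketch of how an inferred topology could guide VM selection. It is not the Flow Tracking algorithm, whose details are not given in this abstract. The example graph, the node names, and the cost function (sum of pairwise hop counts between the chosen VMs) are all illustrative assumptions, and the brute-force search is only practical for small candidate sets.

```python
from collections import deque
from itertools import combinations

# Hypothetical inferred topology (an assumption for illustration):
# two edge switches under one core switch, two VMs per edge switch.
EXAMPLE_TOPOLOGY = {
    "core": ["s1", "s2"],
    "s1": ["core", "vm1", "vm2"],
    "s2": ["core", "vm3", "vm4"],
    "vm1": ["s1"],
    "vm2": ["s1"],
    "vm3": ["s2"],
    "vm4": ["s2"],
}


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def placement_cost(graph, vms):
    """Sum of pairwise hop distances among the chosen VMs.

    Assumes the graph is connected, so every VM is reachable from every other.
    """
    return sum(hop_distances(graph, a)[b] for a, b in combinations(vms, 2))


def best_placement(graph, candidate_vms, k):
    """Exhaustively pick the k candidate VMs with the lowest placement cost."""
    return min(
        combinations(sorted(candidate_vms), k),
        key=lambda vms: placement_cost(graph, vms),
    )


if __name__ == "__main__":
    candidates = ["vm1", "vm2", "vm3", "vm4"]
    chosen = best_placement(EXAMPLE_TOPOLOGY, candidates, 2)
    print(chosen, placement_cost(EXAMPLE_TOPOLOGY, chosen))
```

In this toy graph, two VMs under the same edge switch are 2 hops apart, while VMs under different edge switches are 4 hops apart, so a topology-aware choice keeps the workers under one switch, whereas a random choice may not. Hop count is only a stand-in here; a real placement would more likely weigh measured bandwidth or latency.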

Keywords

Network tomography, Topology inference, Distributed machine learning
