Architectural enhancement for message passing interconnects




Khun Jush, Farshad

Journal Title

Journal ISSN

Volume Title



Research in high-performance architecture has been focusing on achieving more computing power to solve computationally-intensive problems. Advancements in the processor industry are not applicable in applications that need several hundred or thousand-fold improvement in performance. The parallel architecture approach promises to provide more computing power and scalability. Cluster computing, consisting of low-cost and high-performance processors, has been an alternative to proprietary and expensive supercomputer platforms. As in any other parallel system, communication overhead (including hardware, software, and network) adversely affects the computation performance in a cluster environment. Therefore, decreasing this overhead is the main concern in such environments. Communication overhead is the key obstacle to reaching hardware performance limits and is mostly associated with software overhead, a significant portion of which is attributed to message copying. Message copying is largely caused by a lack of knowledge of the next received message, which can be dealt with through speculation. To reduce this copying overhead and advance toward a finer granularity, architectural extensions comprised of a specialized network cache and instructions to manage the operations of these extensions were introduced. In order to investigate the effectiveness of the proposed architectural enhancement, a simulation environment was established by expanding an existing single-thread infrastructure to one that can run MPI applications. Then the proposed extensions were implemented, along with the MPI functions on top of the SimpleScalar infrastructure. Further, two techniques were proposed in order to achieve zero-copy data transfer in message passing environments, two policies that determine when a message is to be bound and sent to the data cache. These policies are called Direct to Cache Transfer DTCT and lazy DTCT. The simulations showed that by using the proposed network extension along with the DTCT techniques fewer data cache misses were encountered as compared to when the DTCT techniques were not used. This involved a study of the possible overhead and cache pollution introduced by the operating system and the communications stack, as exemplified by Linux, TCP/IP and M-VIA. Then these effects on the proposed extensions were explored. Ultimately, this enabled a comparison of the performance achieved by applications running on a system incorporating the proposed extension with the performance of the same applications running on a standard system. The results showed that the proposed approach could improve the performance of MPI applications by 15 to 20%. Moreover, data transfer mechanisms and the associated components in the CELL BE processor were studied. For this, two general data transfer methods were explored involving the PUT and GET functions, demonstrating that the SPE-initiated DMA data transfers are faster than the corresponding PPE-initiated DMAs. The main components of each data transfer were also investigated. In the SPE-initiated GET function, the main component is data delivery. However, the PPE-initiated GET function shows a long DMA issue time as well as a lengthy gap in receiving successive messages. It was demonstrated that the main components of the SPE-initiated PUT function are data delivery and latency (that is, the time to receive the first byte), and the main components in the PPE-initiated PUT function are the DMA issue time and latency. Further, an investigation revealed that memory-management overhead is comparable to the data transfer time; therefore, this calls for techniques to hide the unavoidable overhead in order to reach high-throughput communication in MPI implementation in the Cell BE processor.



Network Cache, MPI, Memory Hierarchy System, DTCT Techniques