On the Optimality of TeraSort in MapReduce

Date

2016-08-26

Authors

Xia, Fei

Abstract

MapReduce is a scalable, reliable, and easy-to-program parallel computation framework for massive data processing. The key to an efficient MapReduce algorithm is balancing the workload across the participating machines. Building on the notion of minimal MapReduce algorithms, this project report discusses the sampling and partitioning techniques used in TeraSort. For one of them, we improve the bound on partition sizes to one that is asymptotically optimal as the number of partitions increases. Because this partitioning technique is widely applicable, our result potentially strengthens the worst-case performance guarantees of other algorithms. As an example, we show its application to the top-k and k-selection problems.
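
For readers unfamiliar with sample-based partitioning, the general idea can be sketched as follows. This is a minimal single-process illustration, not the report's algorithm or its analysis: the function names (`pick_boundaries`, `partition`, `sample_sort`) and the `rate` and `seed` parameters are hypothetical, and independent sampling of each key with a fixed probability is an assumption made here for simplicity.

```python
import bisect
import random


def pick_boundaries(sample, t):
    """Choose up to t - 1 boundary keys that split the sorted sample into t near-equal runs."""
    ordered = sorted(sample)
    if not ordered:
        return []  # no sample: everything falls into a single partition
    step = len(ordered) / t
    return [ordered[min(len(ordered) - 1, int(i * step))] for i in range(1, t)]


def partition(keys, boundaries):
    """Route each key to the partition whose key range (b[i-1], b[i]] contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        parts[bisect.bisect_left(boundaries, key)].append(key)
    return parts


def sample_sort(keys, t, rate, seed=0):
    """Simulate sample, partition, and local sort for t partitions (assumed setup)."""
    rng = random.Random(seed)
    # Assumption: each key is sampled independently with probability `rate`.
    sample = [k for k in keys if rng.random() < rate]
    boundaries = pick_boundaries(sample, t)
    # Each partition is sorted locally; the partitions themselves are ordered by key range.
    return [sorted(p) for p in partition(keys, boundaries)]
```

Concatenating the returned partitions in order yields the fully sorted input. How evenly the keys spread across partitions depends on the sample, which is the quantity a bound on partition sizes controls.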

Keywords

TeraSort, MapReduce, optimality, sample