Abstract:
MapReduce is a scalable, reliable, and easy-to-program parallel computation framework
for massive data processing. The efficiency of a MapReduce algorithm hinges on
balancing the workload across the participating machines. Building on the notion
of minimal MapReduce algorithms, this project report discusses the sampling and
partitioning techniques used in TeraSort. For one of these techniques, we improve the
bound on partition sizes to one that is asymptotically optimal as the number of
partitions increases. Because this partitioning technique is widely applicable, our
result potentially strengthens the worst-case performance guarantees of other
algorithms that rely on it.
As an example, we apply the result to the top-k and k-selection problems.