A distributed approach to Frequent Itemset Mining at low support levels




Clark, Neal




Frequent Itemset Mining, the process of finding frequently co-occurring sets of items in a dataset, has been at the core of the field of data mining for the past 25 years. During this time, datasets have grown much faster than the algorithms' capacity to process them. Great progress has been made in optimizing this task on a single computer; however, despite years of research, very little progress has been made in parallelizing it. FPGrowth-based algorithms have proven notoriously difficult to parallelize, and Apriori has largely fallen out of favor with the research community. In this thesis we introduce a parallel, Apriori-based Frequent Itemset Mining algorithm capable of distributing computation across large commodity clusters. Our case study demonstrates that the algorithm scales efficiently to hundreds of cores on a standard Hadoop MapReduce cluster and can improve execution times by at least an order of magnitude at the lowest support levels.



Apriori, MapReduce, Frequent Itemset Mining, FPGrowth, Distributed, Machine Learning, Hadoop