A distributed approach to Frequent Itemset Mining at low support levels

Clark, Neal

Files

Clark_Neal_MSc_2014.pdf (458.37 KB)

Date

2014-12-22

Authors

Clark, Neal

Abstract

Frequent Itemset Mining, the process of finding frequently co-occurring sets of items in a dataset, has been at the core of the field of data mining for the past 25 years. During this time, datasets have grown much faster than the algorithms' capacity to process them. Great progress was made at optimizing this task on a single computer; however, despite years of research, very little progress has been made on parallelizing it. FPGrowth-based algorithms have proven notoriously difficult to parallelize, and Apriori has largely fallen out of favor with the research community. In this thesis we introduce a parallel, Apriori-based Frequent Itemset Mining algorithm capable of distributing computation across large commodity clusters. Our case study demonstrates that our algorithm can efficiently scale to hundreds of cores on a standard Hadoop MapReduce cluster and can improve execution times by at least an order of magnitude at the lowest support levels.

Keywords

Apriori, MapReduce, Frequent Itemset Mining, FPGrowth, Distributed, Machine Learning, Hadoop

URI

http://hdl.handle.net/1828/5803

Collections

ETD (Electronic Theses and Dissertations)
Theses (Computer Science)
