Mining frequent highly-correlated item-pairs at very low support levels

dc.contributor.author: Sandler, Ian
dc.contributor.supervisor: Thomo, Alex
dc.date.accessioned: 2011-12-20T23:13:25Z
dc.date.available: 2011-12-20T23:13:25Z
dc.date.copyright: 2011 (en_US)
dc.date.issued: 2011-12-20
dc.degree.department: Dept. of Computer Science (en_US)
dc.degree.level: Master of Science M.Sc. (en_US)
dc.description.abstract: The ability to extract frequent pairs from a set of transactions is one of the fundamental building blocks of data mining. When the number of items in a given transaction is relatively small, the problem is trivial. Even when dealing with millions of transactions, it is still trivial if the number of unique items in the transaction set is small. The problem becomes much more challenging when we deal with millions of transactions, each containing hundreds of items drawn from a set of millions of potential items, especially when we are looking for highly correlated results at extremely low support levels. For 15 years the Direct Hashing and Pruning Park Chen Yu (PCY) algorithm has been the principal technique used when there are billions of potential pairs that need to be counted. In this paper we propose a new approach that takes full advantage of both multi-core and multi-CPU availability and that works in cases where PCY fails, with excellent performance scaling that continues even when the number of processors, unique items and items per transaction are at their highest. We believe that our approach has much broader applicability in the field of co-occurrence counting, and can be used to generate much more interesting results when mining very large data sets. (en_US)
dc.description.scholarlevel: Graduate (en_US)
dc.identifier.bibliographicCitation: Sandler I., A. Thomo. (2010) Mining Frequent Highly-Correlated Item-Pairs at Very Low Support Levels. Proc. of the SDM'10 Workshop on High Performance Analytics - Algorithms, Implementations, and Applications (PHPA'10). (en_US)
dc.identifier.uri: http://hdl.handle.net/1828/3756
dc.language: English (eng)
dc.language.iso: en (en_US)
dc.rights.temp: Available to the World Wide Web (en_US)
dc.subject: data mining (en_US)
dc.subject: park chen yu algorithm (en_US)
dc.subject: map reduce (en_US)
dc.subject: mining frequent datasets (en_US)
dc.title: Mining frequent highly-correlated item-pairs at very low support levels (en_US)
dc.type: Thesis (en_US)
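
For readers unfamiliar with the PCY baseline named in the abstract, the following is a minimal, single-machine sketch of two-pass PCY-style pair counting. It is not the approach proposed in the thesis, and it does not compute correlation; it only illustrates how hashing pairs into buckets on the first pass prunes the candidate pairs counted on the second. The function name `pcy_frequent_pairs`, the `num_buckets` parameter, and the use of Python's built-in `hash` as the bucket hash are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(transactions, min_support, num_buckets=1009):
    """Illustrative two-pass PCY-style frequent pair counting.

    `transactions` must be re-iterable (e.g. a list of lists), because
    the data is scanned twice. All names here are hypothetical.
    """
    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    # Between passes: keep frequent items and a bitmap of frequent buckets.
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    frequent_bucket = [c >= min_support for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent and the pair
    # hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if frequent_bucket[hash(pair) % num_buckets]:
                pair_counts[pair] += 1

    # Exact counts make the final filter independent of hash collisions.
    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

A bucket's count is always at least the count of any pair hashed into it, so the bucket filter can admit extra candidates but never discards a truly frequent pair. At very low support levels nearly every bucket becomes frequent, so the filter prunes very little, which is one plausible reading of the regime the abstract describes as the case where PCY fails.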

Files

Original bundle
Name: Sandler_Ian_MSc_2011.pdf
Size: 260.34 KB
Format: Adobe Portable Document Format