Suffix trees for very large inputs

dc.contributor.author: Barsky, Marina
dc.contributor.supervisor: Thomo, Alex
dc.contributor.supervisor: Stege, Ulrike
dc.contributor.supervisor: Upton, Christopher
dc.date.accessioned: 2010-07-16T21:15:08Z
dc.date.available: 2010-07-16T21:15:08Z
dc.date.copyright: 2010 [en]
dc.date.issued: 2010-07-16T21:15:08Z
dc.degree.department: Department of Computer Science
dc.degree.level: Doctor of Philosophy Ph.D. [en]
dc.description.abstract: A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale to large inputs. As suffix trees are larger than their input sequences and quickly outgrow the main memory, the first half of this work focuses on designing a practical algorithm that avoids massive random access to the trees being built. This effort resulted in a new algorithm, DiGeST, which improves significantly over previous work by reducing random access to the suffix tree and performing only two passes over disk data. As a result, this algorithm scales to larger genomic data than managed before. All the existing practical algorithms perform random access to the input string, thus requiring in essence that the input be small enough to be kept in main memory. The ever-increasing amount of genomic data, however, requires the ability to build suffix trees for much larger strings. In the second half of this work we present another suffix tree construction algorithm, BBST, which is able to construct suffix trees for input sequences significantly larger than the size of the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to both of them. As a proof of concept, we show that BBST makes it possible to build a suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the Human Genome. The construction of suffix trees for inputs of such magnitude had not been reported before. Finally, we show that, after the off-line suffix tree construction is complete, search queries on entire sequenced genomes can be performed very efficiently. This high query performance is achieved due to a special disk layout of the suffix trees produced by our algorithms. [en]
dc.identifier.bibliographicCitation: M. Barsky, U. Stege, A. Thomo, and C. Upton, A new method for indexing genomes using on-disk suffix trees, CIKM, 2008, pp. 649–658. [en]
dc.identifier.bibliographicCitation: M. Barsky, U. Stege, A. Thomo, and C. Upton, Suffix trees for very large genomic sequences, CIKM, 2009, pp. 1417–1420. [en]
dc.identifier.bibliographicCitation: M. Barsky, U. Stege, and A. Thomo, A survey of practical algorithms for suffix tree construction in external memory. (Accepted for publication, Software: Practice and Experience) Pre-published on-line: http://authorservices.wiley.com/bauthor/WISproxy.asp?doi=10.1002/spe.960&ArticleID=658860 [en]
dc.identifier.uri: http://hdl.handle.net/1828/2901
dc.language: English [eng]
dc.language.iso: en [en]
dc.rights: Available to the World Wide Web [en]
dc.subject: full-text index [en]
dc.subject: genomics [en]
dc.subject.lcsh: UVic Subject Index::Sciences and Engineering::Applied Sciences::Computer science [en]
dc.title: Suffix trees for very large inputs [en]
dc.type: Thesis [en]

Files

Original bundle
Name: ThesisBarsky16july.pdf
Size: 2.07 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.83 KB
Format: Item-specific license agreed upon to submission