Suffix trees for very large inputs

Barsky, Marina

dc.contributor.author	Barsky, Marina
dc.contributor.supervisor	Thomo, Alex
dc.contributor.supervisor	Stege, Ulrike
dc.contributor.supervisor	Upton, Christopher
dc.date.accessioned	2010-07-16T21:15:08Z
dc.date.available	2010-07-16T21:15:08Z
dc.date.copyright	2010	en
dc.date.issued	2010-07-16T21:15:08Z
dc.degree.department	Department of Computer Science
dc.degree.level	Doctor of Philosophy Ph.D.	en
dc.description.abstract	A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, current methods for constructing suffix trees do not scale to the large inputs found in real-life applications. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first half of this work focuses on designing a practical algorithm that avoids massive random access to the trees being built. This effort resulted in a new algorithm, DiGeST, which improves significantly over previous work by reducing random access to the suffix tree and performing only two passes over disk data. As a result, this algorithm scales to larger genomic data than was managed before. All existing practical algorithms perform random access to the input string, which in essence requires that the input be small enough to be kept in main memory. The ever-increasing amount of genomic data, however, requires the ability to build suffix trees for much larger strings. In the second half of this work we present another suffix tree construction algorithm, BBST, which is able to construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to both of them. As a proof of concept, we show that BBST can build a suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the human genome. The construction of suffix trees for inputs of such magnitude had never been reported before. Finally, we show that, once the off-line suffix tree construction is complete, search queries on entire sequenced genomes can be performed very efficiently. This high query performance is due to a special disk layout of the suffix trees produced by our algorithms.	en
dc.identifier.bibliographicCitation	M. Barsky, U. Stege, A. Thomo, and C. Upton, A new method for indexing genomes using on-disk suffix trees, CIKM, 2008, pp. 649–658.	en
dc.identifier.bibliographicCitation	M. Barsky, U. Stege, A. Thomo, and C. Upton, Suffix trees for very large genomic sequences, CIKM, 2009, pp. 1417–1420.	en
dc.identifier.bibliographicCitation	M. Barsky, U. Stege, and A. Thomo, A survey of practical algorithms for suffix tree construction in external memory. (Accepted for publication, Software: Practice and Experience.) Pre-published online: http://authorservices.wiley.com/bauthor/WISproxy.asp?doi=10.1002/spe.960&ArticleID=658860	en
dc.identifier.uri	http://hdl.handle.net/1828/2901
dc.language	English	eng
dc.language.iso	en	en
dc.rights	Available to the World Wide Web	en
dc.subject	full-text index	en
dc.subject	genomics	en
dc.subject.lcsh	UVic Subject Index::Sciences and Engineering::Applied Sciences::Computer science	en
dc.title	Suffix trees for very large inputs	en
dc.type	Thesis	en

Files

Original bundle

Name: ThesisBarsky16july.pdf
Size: 2.07 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.83 KB
Format: Item-specific license agreed upon to submission

Collections

Electronic Theses and Dissertations (ETD)