Suffix tree searcher: exploration of common substrings in large DNA sequence sets

Minkley, David; Whitney, Michael J; Lin, Song-Han; Barsky, Marina G; Kelly, Chris; Upton, Chris

dc.contributor.author	Minkley, David
dc.contributor.author	Whitney, Michael J
dc.contributor.author	Lin, Song-Han
dc.contributor.author	Barsky, Marina G
dc.contributor.author	Kelly, Chris
dc.contributor.author	Upton, Chris
dc.date.accessioned	2015-05-19T19:43:01Z
dc.date.available	2015-05-19T19:43:01Z
dc.date.copyright	2014	en_US
dc.date.issued	2014-07-23
dc.description	BioMed Central	en_US
dc.description.abstract	Background: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow. Results: Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent “building blocks” to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset. Conclusions: Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.	en_US
dc.description.reviewstatus	Reviewed	en_US
dc.description.scholarlevel	Faculty	en_US
dc.description.sponsorship	This work was funded by a Canadian NSERC Discovery grant to CU and by NIAID grant HHSN266200400036C.	en_US
dc.identifier.citation	Minkley et al.: Suffix tree searcher: exploration of common substrings in large DNA sequence sets. BMC Research Notes 2014 7:466	en_US
dc.identifier.uri	http://www.biomedcentral.com/1756-0500/7/466
dc.identifier.uri	http://dx.doi.org/10.1186/1756-0500-7-466
dc.identifier.uri	http://hdl.handle.net/1828/6180
dc.language.iso	en	en_US
dc.publisher	BMC Research Notes	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 2.5 Canada	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/2.5/ca/	*
dc.subject	Suffix tree
dc.subject	Genome
dc.subject	Substring
dc.subject	DNA sequence
dc.subject	STS
dc.subject.department	Department of Biochemistry and Microbiology
dc.title	Suffix tree searcher: exploration of common substrings in large DNA sequence sets	en_US
dc.type	Article	en_US

Files

Original bundle

Name: Minkley_David_BMCResearchNotes_2014.pdf
Size: 807.87 KB
Format: Adobe Portable Document Format

Collections

Faculty and Staff Publications