Optimizing Index Structures to Support Semantic Queries in Relational Databases




Computers cannot natively understand text. When text data is stored in a database, it is therefore represented as “strings” of characters encoded using a standard such as ASCII or UTF-8. In this research, we explore current methods for managing string-based keys in relational databases, as well as methods for generating vector representations of strings that encode semantic meaning based on the entropy in a collection of training text. The goal is to enable semantic queries over string-based keys with little additional overhead. We consider the top-k query, in which the k highest-ranking results are retrieved. Several candidate algorithms and their associated spatial index data structures are proposed to accelerate top-k queries that compare dimensionally reduced word embeddings by cosine similarity. We introduce two spatial partitioning-based algorithms that improve on naive and optimized scan-based methods, and we implement and test these algorithms to evaluate their relative performance.



Software Prototypes, Asymptotic Analysis, Word Embeddings, Index Structures, Natural Language Processing, Databases