Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics |
Preview |
PDF
Fischer_Johannes.pdf 1MB |
Abstract
This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-word problem instances. In most cases our algorithms outperform previous approaches by all means.
Item Type: | Theses (Dissertation, LMU Munich) |
---|---|
Keywords: | range minimum queries, lowest or nearest common ancestors, text indexing, word-based indexing, data mining on strings |
Subjects: | 500 Natural sciences and mathematics > 510 Mathematics 500 Natural sciences and mathematics |
Faculties: | Faculty of Mathematics, Computer Science and Statistics |
Language: | English |
Date of oral examination: | 8. October 2007 |
1. Referee: | Heun, Volker |
MD5 Checksum of the PDF-file: | aef7794761837b78234686435dadfcf4 |
Signature of the printed copy: | 0001/UMC 16544 |
ID Code: | 7505 |
Deposited On: | 15. Oct 2007 |
Last Modified: | 24. Oct 2020 08:07 |