Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF
Fischer_Johannes.pdf 1MB 
Abstract
This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (socalled RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and socalled Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the twodimensional variant of the RMQproblem we give a preprocessing scheme with quasioptimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQproblem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and timeconsumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix and LCParray 2n+o(n) bits of additional space (coming from our RMQscheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for nonconstant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequencyrelated string mining tasks in optimal time. In a final chapter we propose a space and timeoptimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all wordaligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on realword problem instances. In most cases our algorithms outperform previous approaches by all means.
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  range minimum queries, lowest or nearest common ancestors, text indexing, wordbased indexing, data mining on strings 
Subjects:  600 Natural sciences and mathematics > 510 Mathematics 600 Natural sciences and mathematics 
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date Accepted:  8. October 2007 
1. Referee:  Heun, Volker 
Persistent Identifier (URN):  urn:nbn:de:bvb:1975053 
MD5 Checksum of the PDFfile:  aef7794761837b78234686435dadfcf4 
Signature of the printed copy:  0001/UMC 16544 
ID Code:  7505 
Deposited On:  15. Oct 2007 
Last Modified:  16. Oct 2012 08:10 