Home  |  Browse  |  Advanced Search  |  Help
Login | Create Account

Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics

[img]
Preview
PDF - Klicken Sie auf das PDF-Icon, um die Dissertation im Volltext herunterzuladen.
1374Kb

Abstract

This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-word problem instances. In most cases our algorithms outperform previous approaches by all means.

Item Type:Thesis (Dissertation, LMU Munich)
Keywords:range minimum queries, lowest or nearest common ancestors, text indexing, word-based indexing, data mining on strings
Dewey Decimal Classification:600 Natural sciences and mathematics > 510 Mathematics
600 Natural sciences and mathematics
Faculties:Faculty of Mathematics, Computer Science and Statistics
Language:English
Date Accepted:08. October 2007
1. Referee:Heun, Volker
Persistent Identifier (URN):urn:nbn:de:bvb:19-75053
MD5 Checksum of the PDF-file:aef7794761837b78234686435dadfcf4
Signature of the printed copy:0001/UMC 16544
ID Code:7505
Deposited By:Johannes Fischer
Deposited On:15. Oct 2007
Last Modified:22. Oct 2008 15:54

Repository Staff Only: item control page

Digitale Hochschulschriften is powered by EPrints 3 which is developed by the School of Electronics and Computer Science at the University of Southampton. More information and software creditsAbout