Engelhardt, Alexander (2017): Efficient estimation algorithms for large and complex data sets. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF
Engelhardt_Alexander.pdf 8MB 
Abstract
The recent worldwide surge in available data allows the investigation of many new and sophisticated questions that were inconceivable just a few years ago. However, two types of data sets often complicate the subsequent analysis: Data that is simple in structure but large in size, and data that is small in size but complex in structure. These two kinds of problems also apply to biological data. For example, data sets acquired from family studies, where the data can be visualized as pedigrees, are small in size but, because of the dependencies within families, they are complex in structure. By comparison, nextgeneration sequencing data, such as data from chromatin immunoprecipitation followed by deep sequencing (ChIPSeq), is simple in structure but large in size. Even though the available computational power is increasing steadily, it often cannot keep up with the massive amounts of new data that are being acquired. In these situations, ordinary methods are no longer applicable or scale badly with increasing sample size. The challenge in today’s environment is then to adapt common algorithms for modern data sets. This dissertation considers the challenge of performing inference on modern data sets, and approaches the problem in two parts: first using a problem in the field of genetics, and then using one from molecular biology. In the first part, we focus on data of a complex nature. Specifically, we analyze data from a family study on colorectal cancer (CRC). To model familial clusters of increased cancer risk, we assume inheritable but latent variables for a risk factor that increases the hazard rate for the occurrence of CRC. During parameter estimation, the inheritability of this latent variable necessitates a marginalization of the likelihood that is costly in time for large families. We first approached this problem by implementing computational accelerations that reduced the time for an optimization by the NelderMead method to about 10% of a naive implementation. In a next step, we developed an expectationmaximization (EM) algorithm that works on data obtained from pedigrees. To achieve this, we used factor graphs to factorize the likelihood into a product of “local” functions, which enabled us to apply the sumproduct algorithm in the Estep, reducing the computational complexity from exponential to linear. Our algorithm thus enables parameter estimation for family studies in a feasible amount of time. In the second part, we turn to ChIPSeq data. Previously, practitioners were required to assemble a set of tools based on different statistical assumptions and dedicated to specific applications such as calling protein occupancy peaks or testing for differential occupancies between experimental conditions. In order to remove these restrictions and create a unified framework for ChIPSeq analysis, we developed GenoGAM (Genomewide Generalized Additive Model), which extends generalized additive models to efficiently work on data spread over a long x axis by reducing the scaling from cubic to linear and by employing a data parallelism strategy. Our software makes the wellestablished and flexible GAM framework available for a number of genomic applications. Furthermore, the statistical framework allows for significance testing for differential occupancy. In conclusion, I show how developing algorithms of lower complexity can open the door for analyses that were previously intractable. On this basis, it is recommended to focus subsequent research efforts on lowering the complexity of existing algorithms and design new, lowercomplexity algorithms.
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  Complex Data, Estimation Algorithms, EM Algorithm, NextGeneration Sequencing 
Subjects:  500 Natural sciences and mathematics 500 Natural sciences and mathematics > 510 Mathematics 
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date of oral examination:  28. July 2017 
1. Referee:  Mansmann, Ulrich 
MD5 Checksum of the PDFfile:  4f3aea2574049ecbc7cced746b754c03 
Signature of the printed copy:  0001/UMC 24841 
ID Code:  21020 
Deposited On:  04. Aug 2017 08:20 
Last Modified:  04. Aug 2017 08:20 