Efficient estimation algorithms for large and complex data sets
Keywords: Complex Data, Estimation Algorithms, EM Algorithm, Next-Generation Sequencing
Author: Engelhardt, Alexander
Year: 2017
Language: English
Publisher: Universitätsbibliothek der Ludwig-Maximilians-Universität München
Citation: Engelhardt, Alexander (2017): Efficient estimation algorithms for large and complex data sets. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics.
Full text: Engelhardt_Alexander.pdf (PDF, 8MB)
Abstract

The recent world-wide surge in available data allows the investigation of many new and sophisticated questions that were inconceivable just a few years ago. However, two types of data sets often complicate the subsequent analysis: data that is simple in structure but large in size, and data that is small in size but complex in structure. Both kinds of problems arise in biological data. For example, data sets acquired from family studies, where the data can be visualized as pedigrees, are small in size but, because of the dependencies within families, complex in structure. By comparison, next-generation sequencing data, such as data from chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq), is simple in structure but large in size. Even though the available computational power is increasing steadily, it often cannot keep up with the massive amounts of new data being acquired. In these situations, ordinary methods are no longer applicable or scale badly with increasing sample size. The challenge in today's environment is therefore to adapt common algorithms to modern data sets. This dissertation considers the challenge of performing inference on modern data sets and approaches the problem in two parts: first with a problem from genetics, then with one from molecular biology.

In the first part, we focus on data of a complex nature. Specifically, we analyze data from a family study on colorectal cancer (CRC). To model familial clusters of increased cancer risk, we assume a heritable but latent risk variable that increases the hazard rate for the occurrence of CRC. During parameter estimation, the heritability of this latent variable necessitates a marginalization of the likelihood that is costly in time for large families. We first approached this problem by implementing computational accelerations that reduced the runtime of a Nelder-Mead optimization to about 10% of that of a naive implementation. In a second step, we developed an expectation-maximization (EM) algorithm that works on data obtained from pedigrees. To achieve this, we used factor graphs to factorize the likelihood into a product of "local" functions, which enabled us to apply the sum-product algorithm in the E-step and thereby reduce the computational complexity from exponential to linear. Our algorithm thus enables parameter estimation for family studies in a feasible amount of time.

In the second part, we turn to ChIP-Seq data. Previously, practitioners had to assemble a set of tools, each based on different statistical assumptions and dedicated to a specific application such as calling protein occupancy peaks or testing for differential occupancy between experimental conditions. To remove these restrictions and create a unified framework for ChIP-Seq analysis, we developed GenoGAM (Genome-wide Generalized Additive Model), which extends generalized additive models to work efficiently on data spread over a long x-axis by reducing the scaling from cubic to linear and by employing a data-parallelism strategy. Our software makes the well-established and flexible GAM framework available for a number of genomic applications. Furthermore, the statistical framework allows for significance testing of differential occupancy.

In conclusion, I show how developing algorithms of lower complexity can open the door to analyses that were previously intractable. On this basis, it is recommended to focus subsequent research efforts on lowering the complexity of existing algorithms and on designing new, lower-complexity algorithms.
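To illustrate the exponential-to-linear reduction described in the first part, the following Python sketch contrasts brute-force marginalization over a pedigree's latent risk variables with an upward sum-product (message-passing) pass on a tree-structured factor graph. This is a toy example under strong simplifying assumptions, not the dissertation's implementation: it uses a single binary latent variable per person, a tree-shaped pedigree in which each child has one recorded parent, and made-up transmission and emission probabilities; all names and values are hypothetical.

# Minimal sketch: why sum-product turns an exponential marginalization into a
# linear one on a tree-shaped toy pedigree. Hypothetical simplification of the
# model described in the abstract, not the dissertation's code.
from itertools import product

import numpy as np

# Toy pedigree: node 0 is the founder; values are lists of children.
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
parent = {1: 0, 2: 0, 3: 1, 4: 1}
nodes = sorted(children)

prior = np.array([0.9, 0.1])           # P(z_founder) for latent risk z in {0, 1}
T = np.array([[0.95, 0.05],            # P(z_child | z_parent), toy transmission
              [0.30, 0.70]])
rng = np.random.default_rng(0)
emission = rng.uniform(0.1, 0.9, size=(len(nodes), 2))   # P(x_i | z_i), toy values

def brute_force_likelihood():
    """Sum over all 2^n latent configurations: exponential in family size."""
    total = 0.0
    for z in product([0, 1], repeat=len(nodes)):
        p = prior[z[0]]
        for i in nodes:
            p *= emission[i, z[i]]
            if i in parent:
                p *= T[z[parent[i]], z[i]]
        total += p
    return total

def sum_product_likelihood():
    """Upward message pass: one 2-vector message per person (linear work)."""
    def message_to_parent(i):
        # Combine person i's evidence with messages from i's children, ...
        belief = emission[i].copy()
        for c in children[i]:
            belief = belief * message_to_parent(c)
        # ... then marginalize z_i against the transmission factor T(z_parent, z_i).
        return T @ belief

    root_belief = emission[0] * np.prod(
        [message_to_parent(c) for c in children[0]], axis=0)
    return float(prior @ root_belief)

print(brute_force_likelihood())   # both print the same marginal likelihood,
print(sum_product_likelihood())   # but with O(2^n) vs. O(n) work

Both functions return the same marginal likelihood; the brute-force sum visits all 2^n latent configurations, whereas the message pass does a constant amount of work per family member, which is the kind of complexity reduction the sum-product algorithm provides in the E-step on factorized pedigree likelihoods.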