Parallel Computing for Biological Data

In the 1990s, a number of technological innovations revolutionized biology, and bioinformatics emerged as a new scientific discipline. Microarrays can measure the abundance of tens of thousands of mRNA species, the complete genomic sequences of many organisms are available, and other technologies make it possible to study various processes at the molecular level. In bioinformatics and biostatistics, current research and computations are limited by the available computer hardware, a problem that high-performance computing resources can address. There are several reasons for the increased focus on high-performance computing: larger data sets, greater computational requirements stemming from more sophisticated methodologies, and recent developments in computer chip production. The open-source programming language R was developed to provide a powerful and extensible environment for statistical and graphical techniques. There are many good reasons for preferring R to other software or programming languages for scientific computing in statistics and biology. However, R was not designed for parallel or high-performance computing. Nonetheless, during the last decade, a great deal of research has been conducted on using parallel computing techniques with R. This PhD thesis demonstrates the usefulness of the R language and parallel computing for biological research. It introduces parallel computing with R, and reviews and evaluates existing techniques and R packages for parallel computing on computer clusters, on multi-core systems, and in grid computing. From a computer science perspective, the packages were examined for their reusability in biological applications, and some upgrades were proposed. Furthermore, parallel applications for next-generation sequencing data and for the preprocessing of microarray data were developed.
Microarray data are characterized by high levels of noise and bias. Because these perturbations have to be removed, preprocessing of raw data has been a high-priority research topic over the past few years. A new Bioconductor package called affyPara for parallelized preprocessing of high-density oligonucleotide microarray data was developed and published. The data can be partitioned by array using a block-cyclic partition, which makes parallelization of the algorithms directly possible. Existing statistical algorithms and data structures had to be adjusted and reformulated for use in parallel computing. With the new parallel infrastructure, normalization methods can be enhanced and new methods become available. Partitioning the data and distributing it to several nodes or processors solves the main-memory problem and accelerates the methods by up to a factor of fifteen for 300 arrays or more. The final part of the thesis presents a large cancer study that analyses more than 7000 microarrays from a publicly available database and estimates gene interaction networks. For this purpose, a new R package for microarray data management was developed, and various challenges in analysing this amount of data are discussed. The comparison of gene networks for different pathways and different cancer entities in this large data set partly confirms already established forms of gene interaction.
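
The block-cyclic partition of arrays mentioned above can be illustrated with a minimal sketch. It is written in Python rather than R, it is not the affyPara API, and the names `block_cyclic_partition` and `combined_mean` as well as the example sizes are hypothetical. It only shows the general idea: array indices are dealt out to workers in round-robin blocks, each worker computes a partial result on its own share, and the partial results are combined afterwards.

```python
from typing import List, Sequence


def block_cyclic_partition(n_items: int, n_workers: int, block_size: int = 1) -> List[List[int]]:
    """Assign item indices 0..n_items-1 to workers in round-robin blocks."""
    if n_workers < 1 or block_size < 1:
        raise ValueError("n_workers and block_size must be positive")
    parts: List[List[int]] = [[] for _ in range(n_workers)]
    for start in range(0, n_items, block_size):
        # Block number modulo worker count gives the round-robin owner.
        worker = (start // block_size) % n_workers
        parts[worker].extend(range(start, min(start + block_size, n_items)))
    return parts


def combined_mean(values: Sequence[float], parts: List[List[int]]) -> float:
    """Combine per-worker partial sums and counts into a global mean."""
    # Each tuple stands in for what one worker would compute on its own share.
    partials = [(sum(values[i] for i in idx), len(idx)) for idx in parts]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical example: 10 arrays, 3 workers, blocks of 2 arrays each.
parts = block_cyclic_partition(10, 3, block_size=2)
print(parts)
```

In this sketch the workers are simulated sequentially; in a real cluster setting each index list would be sent to a separate node, so that no single machine has to hold all arrays in main memory.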

R, Parallel Computing, Microarrays, Next-Generation Sequencing

Schmidberger, Markus

18. Nov. 2009

2009

English

Universitätsbibliothek der Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-104921

Schmidberger, Markus (2009): Parallel Computing for Biological Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik

PDF: schmidberger_markus.pdf (9 MB)
ZIP: Schmidberger_container.zip (807 MB)

DOI: 10.5282/edoc.10492

URN: urn:nbn:de:bvb:19-104921

Document type:	Dissertation (Dissertation, LMU München)
Keywords:	R, Parallel Computing, Microarrays, Next-Generation Sequencing
Subject areas:	500 Natural sciences and mathematics > 510 Mathematics; 500 Natural sciences and mathematics
Faculty:	Fakultät für Mathematik, Informatik und Statistik
Language of the thesis:	English
Date of oral examination:	18 November 2009
First referee:	Mansmann, Ulrich
MD5 checksum of the PDF file:	bb25bcfe1c00dbe3c6565d7c97750791
MD5 checksum of the ZIP file:	8f1f5abcf075eae98d6a157bbdaf5cd4
Shelf mark of the printed edition:	0001/UMC 18178
ID code:	10492
Deposited on:	27 Nov 2009 08:40
Last modified:	24 Oct 2020 05:54
