Statistical methods for the analysis of copy number alterations in the genome

Rueda Palacio, Oscar Manuel

Supervised by:

Cristina Rueda Sabater (thesis supervisor)
Ramón Díaz Uriarte (thesis supervisor)

University of defense: Universidad de Valladolid

Date of defense: 19 December 2008

Examination committee:

Bonifacio Salvador González (chair)
Eustasio del Barrio (secretary)
Virgilio Gomez Ruiz (member)
Juan Francisco Poyatos (member)
Ana María Rojas Mendoza (member)

Departments:

Estadística e Investigación Operativa

Type: Doctoral thesis

Teseo: 174061

Abstract

Genomic DNA copy number alterations (CNAs) are associated with complex diseases, including cancer: CNAs are related to tumor grade, metastasis, and patient survival. CNAs discovered from array-based Comparative Genomic Hybridization (aCGH) data have been instrumental in identifying disease-related genes and potential therapeutic targets. To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question "What is the probability that this gene/region has CNAs?". Recent studies have shown that copy number changes are also common in the population, leading to the term "copy number variation". Thus a second problem is to distinguish between individual copy number variation and copy number changes related to disease. We have developed a statistical model and algorithms based on biological principles to approach these problems. The model is a non-homogeneous Hidden Markov Model with an unknown number of hidden states, fitted via Reversible Jump Markov Chain Monte Carlo. With this formulation we can explicitly incorporate the distance between genes/probes and employ Bayesian Model Averaging, thus accounting for model uncertainty rather than conditioning our inferences on the selection of a particular model. The model can be extended to include random effects to incorporate heterogeneity among different individuals. We also present two algorithms to find common regions of alteration. One detects regions common to a set of samples with an overall probability of copy number alteration at least as high as a given threshold; the other identifies subsets of individuals that share regions with a probability of alteration at least as high as a given threshold.
We show, using simulated and real data sets, that our method outperforms alternative ones, and compare the results of our algorithms to others found in the literature on well-known data sets with very satisfactory results.
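
The idea of a common region of alteration can be illustrated with a minimal sketch. It is not the algorithm developed in the thesis: it assumes that per-probe, per-sample probabilities of alteration are already available, and it reads "overall probability" as the mean across samples, which is an assumption made only for this example. The function name `common_regions` and the input layout (one row per probe, one column per sample) are hypothetical.

```python
from typing import List, Tuple


def common_regions(probs: List[List[float]], threshold: float) -> List[Tuple[int, int]]:
    """Return maximal runs of consecutive probes, as (start, end) inclusive
    index pairs, whose mean alteration probability across samples is at
    least `threshold`.

    probs[i][j] is the (assumed, precomputed) probability that probe i is
    altered in sample j. Averaging across samples is an illustrative choice.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # index where the current qualifying run began, if any
    for i, probe_probs in enumerate(probs):
        mean_p = sum(probe_probs) / len(probe_probs)
        if mean_p >= threshold:
            if start is None:
                start = i  # open a new run
        elif start is not None:
            regions.append((start, i - 1))  # close the run at the previous probe
            start = None
    if start is not None:
        regions.append((start, len(probs) - 1))  # run extends to the last probe
    return regions
```

With five probes and two samples, `common_regions([[0.1, 0.2], [0.9, 0.95], [0.85, 0.9], [0.3, 0.4], [0.95, 0.9]], 0.8)` yields the probe ranges `[(1, 2), (4, 4)]`. A sketch of the second task in the abstract, finding subsets of individuals that share a region, would additionally need to search over sample subsets and is not attempted here.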