"Machine Learning Techniques for the Analysis of High Throughput DNA and RNA Sequencing Data"
The identification of copy number variations in high-throughput DNA sequencing data and the detection of differential expression in RNA sequencing data are central topics in genetics and molecular biology. In these fields, new analysis methods should either enable researchers to investigate the data in a novel way that provides biologically relevant information, or perform better than previous methods, e.g., by yielding a lower false discovery rate and a lower false negative rate. This thesis describes two new methods, "cn.MOPS" and "DEXUS", for the detection of copy number variations in DNA sequencing data and the identification of differentially expressed genes in RNA sequencing data, respectively. cn.MOPS outperformed all other methods with respect to false discovery rate and recall and is becoming a standard analysis tool for both genome and exome sequencing data. DEXUS enabled researchers, for the first time, to analyze RNA sequencing data even if the sample conditions are unknown, which is the case for many study designs. For study designs in which sample conditions are known, DEXUS outperformed all other methods in almost all settings with respect to the area under the ROC curve. Both methods are based on a probabilistic latent variable model. Model selection is done by maximizing the posterior with an expectation-maximization (EM) algorithm. The EM algorithm makes model selection computationally efficient, so the methods are fast enough to analyze huge amounts of data, which is an important criterion for bioinformatics methods. cn.MOPS and DEXUS were tested on a large number of benchmarking data sets and on many data sets addressing highly relevant biological research questions; on these, both algorithms provide excellent results.
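To make the idea of posterior maximization with EM concrete, the following is a minimal sketch, not the implementation of cn.MOPS or DEXUS. It assumes, purely for illustration, a mixture of Poisson distributions over read counts with a Dirichlet prior on the mixture weights; the abstract does not specify the model. The function name `map_em_poisson_mixture`, its parameters, and the flat prior on the rates are hypothetical choices.

```python
import math


def poisson_logpmf(x, lam):
    """Log of the Poisson probability mass function."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def map_em_poisson_mixture(counts, rates, prior, n_iter=100):
    """MAP estimation by EM for a Poisson mixture (illustrative assumption).

    counts: observed read counts (non-negative integers)
    rates:  initial Poisson rate per component (all > 0)
    prior:  Dirichlet parameters for the mixture weights (all > 1)
    Returns the weights, the rates, and the per-count responsibilities.
    """
    k = len(rates)
    n = len(counts)
    rates = list(rates)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count,
        # computed in log space to avoid underflow.
        resp = []
        for x in counts:
            logp = [math.log(weights[j]) + poisson_logpmf(x, rates[j])
                    for j in range(k)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])

        # M-step for the weights: mode of the Dirichlet posterior.
        nk = [sum(r[j] for r in resp) for j in range(k)]
        denom = n + sum(prior) - k
        weights = [(nk[j] + prior[j] - 1.0) / denom for j in range(k)]

        # M-step for the rates: responsibility-weighted mean count
        # (flat prior on the rates, so this is the ML update).
        for j in range(k):
            if nk[j] > 0.0:
                weighted_sum = sum(r[j] * x for r, x in zip(resp, counts))
                rates[j] = max(weighted_sum / nk[j], 1e-9)
    return weights, rates, resp


if __name__ == "__main__":
    # Hypothetical toy counts: most samples near 10 reads, a few near 20.
    toy_counts = [9, 10, 11, 10, 9, 11, 10, 10, 21, 19, 20]
    w, lam, _ = map_em_poisson_mixture(toy_counts, [8.0, 25.0], [2.0, 2.0])
    print("weights:", w)
    print("rates:", lam)
```

Each iteration costs time linear in the number of counts and components, which is the kind of efficiency the abstract attributes to the EM approach. The Dirichlet prior acts as a regularizer on the mixture weights; how the actual methods choose and use their priors is not stated in the abstract.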