Protein target prediction with Logistic regression in a large scale setting
ISMB 2013 Proceedings
In drug discovery, research should focus on compounds for which the desired effects are maximized and unwanted off-target effects are avoided. Compound prioritization is therefore one of the initial steps. High-throughput screens support this process by measuring the bioactivities of chemical structures and thus help to identify a lead compound.
Here we present a method that uses the data obtained from these bioassay measurements to predict the activities of new, unmeasured potential drug candidates against different targets based on their chemical structures. Such a method may serve as an additional tool in the drug discovery process. The main challenge is the amount of data to be processed: ChEMBL, for example, contains more than one million compounds and about ten million bioactivity measurements in total.
Since complex molecule kernels can require considerable computational effort for a large number of targets and compounds, we suggest describing the structure of molecules with fingerprint methods, combined with an efficient implementation of logistic regression that makes explicit use of the sparseness of this molecule representation. The advantages of logistic regression are the convexity of its optimization problem and its probabilistic output. Many fingerprint features may be extremely weak in discriminating active from inactive compounds. Therefore, to reduce the dimensionality of the fingerprint features, we use an additional test for filtering.
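The combination of sparse fingerprints, feature filtering, and logistic regression can be illustrated with a minimal sketch. It is not the implementation described here: the abstract does not name the filtering test, so a chi-square test on a 2x2 contingency table is assumed, and the stochastic gradient descent trainer, the function names, the hyperparameters, and the toy data are all hypothetical. Each fingerprint is stored as the set of its on-bit indices, so every prediction and update touches only the bits that are set.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def chi2_score(samples, labels, bit):
    # Chi-square statistic of the 2x2 table (bit on/off) x (active/inactive).
    # ASSUMPTION: the source only mentions "an additional test".
    a = b = c = d = 0
    for on_bits, y in zip(samples, labels):
        if bit in on_bits:
            if y == 1:
                a += 1
            else:
                b += 1
        elif y == 1:
            c += 1
        else:
            d += 1
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def select_bits(samples, labels, threshold=3.84):
    # 3.84 is roughly the 5% critical value of chi-square with 1 degree of freedom.
    all_bits = set().union(*samples)
    return {j for j in all_bits if chi2_score(samples, labels, j) >= threshold}


def predict_proba(weights, bias, on_bits):
    # Only set bits contribute, so the cost is proportional to len(on_bits).
    return sigmoid(bias + sum(weights.get(j, 0.0) for j in on_bits))


def train_sparse_logreg(samples, labels, epochs=200, lr=0.5, l2=1e-3, seed=0):
    # Plain SGD on the logistic loss. The L2 penalty is applied only to the
    # weights of bits present in the current sample, a simplification that
    # keeps each update sparse.
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict_proba(weights, bias, samples[i]) - labels[i]
            bias -= lr * err
            for j in samples[i]:
                w = weights.get(j, 0.0)
                weights[j] = w - lr * (err + l2 * w)
    return weights, bias


# Hypothetical toy data: bit 3 marks actives, bit 7 marks inactives,
# and the remaining bits carry no class information.
samples = [{1, 3, 5}, {3, 8}, {2, 3}, {1, 7}, {7, 8}, {2, 5, 7}]
labels = [1, 1, 1, 0, 0, 0]

selected = select_bits(samples, labels)
filtered = [s & selected for s in samples]
weights, bias = train_sparse_logreg(filtered, labels)

# Probability that an unseen compound with on-bits {3, 9} is active.
p_new = predict_proba(weights, bias, {3, 9} & selected)
```

In this toy example the filter keeps only the two informative bits, and the model then needs one weight per retained bit. At ChEMBL scale a batch solver and a sparse matrix format would likely be preferable to per-sample Python loops.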
To assess the performance of this approach, we apply our method to targets in ChEMBL in a leave-one-cluster-out setting and compare it with some previously suggested methods.
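A leave-one-cluster-out evaluation holds out all compounds of one cluster at a time, so that test compounds are never structurally grouped with training compounds. The sketch below shows only the splitting step. How the clusters are formed is not stated here, so the cluster labels are assumed to be given, and the function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def leave_one_cluster_out(cluster_ids):
    # Group compound indices by their (precomputed) cluster label.
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)
    # Hold out each cluster once; train on all remaining clusters.
    for held_out in sorted(groups):
        test_idx = groups[held_out]
        train_idx = sorted(
            i
            for cid, members in groups.items()
            if cid != held_out
            for i in members
        )
        yield held_out, train_idx, test_idx


# Hypothetical cluster labels for six compounds.
cluster_ids = ["A", "A", "B", "C", "B", "C"]
splits = list(leave_one_cluster_out(cluster_ids))
```

Each element of `splits` is a tuple of the held-out cluster label, the training indices, and the test indices. A model would be trained and scored once per tuple.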