Analoging in large databases with structural fingerprint features
Non-Clinical Statistics Conference 2012
Analogs share a similar bioactivity with a given lead compound and are vital for drug design, helping to improve the final product in terms of efficacy, toxicity, side effects, bacterial resistance and other limitations. The Structure-Activity Relationship (SAR) principle states that structurally similar molecules have similar activities. Here we propose a method that exploits gene expression data to derive a subset of structural fingerprint features indicative of a gene of interest. A Support Vector Machine is trained on these fingerprint features and then used to identify analogs in a large database. Two points are crucial for obtaining reasonable results: the selected fingerprint features must be indicative of the bioactivity, and the method has to be fast enough to scale to large databases such as ChEMBL. Both requirements are fulfilled by the Potential Support Vector Machine (P-SVM). To avoid selecting features that stem from possible compound outliers, we have defined a robust feature selection protocol based on leave-one-out cross-validation and feature ranking. We briefly introduce the P-SVM, focusing on its feature selection capabilities and characteristics, and present the robust feature selection protocol.
Using an example from a gene expression study with 62 compounds and 3200 structural fingerprint features, we show the results of a ChEMBL analog search.
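
The idea behind a robust feature selection based on leave-one-out cross-validation and feature ranking can be illustrated with a minimal sketch. It is not the protocol presented in the talk: a plain absolute Pearson correlation stands in for the P-SVM feature ranking, and the function names, the parameters `k` and `min_fraction`, and the toy data are all hypothetical. A feature is kept only if it ranks among the top `k` in a sufficient fraction of the leave-one-out folds, so a feature carried by a single outlier compound drops out as soon as that compound is held out.

```python
# Illustrative sketch only: a correlation score stands in for the P-SVM
# feature ranking, and all names and data below are hypothetical.
from collections import Counter
from math import sqrt


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def top_k_features(fingerprints, expression, k):
    """Score every fingerprint feature and return the indices of the k best."""
    n_features = len(fingerprints[0])
    scores = [
        abs_correlation([row[j] for row in fingerprints], expression)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def robust_feature_selection(fingerprints, expression, k, min_fraction=1.0):
    """Keep features ranked in the top k in at least min_fraction of all folds."""
    n = len(fingerprints)
    counts = Counter()
    for left_out in range(n):
        # Leave one compound out and rank the features on the remaining ones.
        train_x = fingerprints[:left_out] + fingerprints[left_out + 1:]
        train_y = expression[:left_out] + expression[left_out + 1:]
        counts.update(top_k_features(train_x, train_y, k))
    return sorted(j for j, c in counts.items() if c >= min_fraction * n)


# Toy data: 8 compounds, 3 binary fingerprint features.
# Feature 0 marks the active compounds, feature 1 occurs only in the
# last compound (an outlier), feature 2 is unrelated to activity.
fingerprints = [
    [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 0],
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 1],
]
expression = [0.1, 0.2, 0.0, 0.1, 1.0, 1.1, 0.9, 5.0]

strict = robust_feature_selection(fingerprints, expression, k=2)
lenient = robust_feature_selection(fingerprints, expression, k=2, min_fraction=0.8)
print("selected in every fold:", strict)
print("selected in at least 80% of folds:", lenient)
```

Requiring selection in every fold (`min_fraction=1.0`) is the strictest choice; lowering the threshold keeps more features but can let outlier-driven features back in, as the second call shows for feature 1.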