Automated Design and Discovery of Novel Pharmaceuticals using Semi-Supervised Learning in Large Molecular Databases

 

 

Mark J. Embrechts (embrem@rpi.edu)                  

Department of Decision Sciences and Engineering Systems

 

Curt Breneman (brenec@rpi.edu)                    

Department of Chemistry

 

Kristin P. Bennett (bennek@rpi.edu)

Department of Mathematics

Rensselaer Polytechnic Institute, Troy NY 12180

Contact Information                                                     WWW PAGE

Mark J. Embrechts                                                        http://www.drugmining.com
DSES, CII5217
Rensselaer Polytechnic Institute
Troy, NY 12180
Phone: (518) 276-4009   Fax : (518) 276-8227
Email: (embrem@rpi.edu)    URL: www.drugmining.com

 

List of Supported Students and Staff

Graduate Students - Mathew Sundling, Muhsin Ozdemir, Fabio Arciniegas, Bo Jiang, Qiong Luo, Mighu Song, Wei Deng, Larry Lockwood, Dechuan Zhuang, Jinbo Bi, Michinari Momma, Abigail Michels, Gretchen Koch, Robert Bress, and Jasmine Zhang.

 

Undergraduate Students - Pieter De Temmerman, Andres de la Guardia, Bill Katt

 

Postdoctoral Researcher - N. Sukumar

Project Award Information

Virtual design of pharmaceuticals using data mining. This is a three-year NSF-funded project running from 09/01/1999 to 08/31/2002.  This is the third year of the project. The award number is IIS-9979860.

Keywords

Drug Discovery, Virtual Design, Novel Pharmaceuticals, Molecular Databases, Support Vector Machines, Machine Learning, QSAR, QSPR, Descriptor Generation, Wavelet Descriptors, Shape-Specific Property Descriptors, Descriptor selection, Transferable Atom Equivalent (TAE) Descriptors, RECON, Neural Networks, GAPLS, Sensitivity Analysis, Partial Least Squares, Evolutionary Computing, Data Mining

Project Summary

This research results in a new framework for the virtual discovery of new pharmaceuticals. The basic idea is to use large existing pharmaceutical databases as input for a new type of structure/activity correlation methodology. A large set of new and traditional descriptors is used to create improved Quantitative Structure-Activity Relationship (QSAR) models that characterize and predict important biological responses. Once the descriptors have been determined and a predictive model has been built, thousands of potential new molecules, chemically similar to those of the benchmark data set, are scanned from large databases and evaluated for their chemical properties using the predictive model. The aim is to target a few novel molecules with potentially attractive pharmaceutical properties that can then be tested further in the laboratory in the traditional way.  Computationally intelligent data mining techniques are vital for extracting the information needed to select these novel molecules. This research develops and applies novel machine learning paradigms for solving inference problems in high dimensions with few data points.  These algorithms predict desired biological responses and generate QSAR models using molecules with both known (labeled) and unknown (unlabeled) biological responses. This project involves the development of an infrastructure of computationally intelligent computer codes that allow for the virtual design of novel pharmaceuticals or the improvement of existing pharmaceuticals. The proposed methodology is applicable to most pharmaceuticals for which a database of bioactivities is available. The ultimate pay-off of this methodology is the rapid invention of new drugs for new or known society-threatening diseases where a very fast response is warranted.
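The screening loop described above (build a predictive model on molecules with known responses, score a large pool of unlabeled candidates, keep a few top-ranked ones for laboratory testing) can be illustrated with a minimal sketch. It is not the project's code: a k-nearest-neighbor regressor stands in for the PLS, K-PLS, and SVM models named in this report, the descriptor vectors are random numbers, and `toy_activity`, `knn_predict`, and `screen` are hypothetical names introduced only for this illustration.

```python
import math
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict the response for descriptor vector x as the mean response
    of its k nearest labeled neighbors (Euclidean distance)."""
    nearest = sorted((math.dist(row, x), y) for row, y in zip(train_X, train_y))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def screen(train_X, train_y, candidates, top_n=5, k=3):
    """Score every unlabeled candidate and return the top_n
    (candidate index, predicted response) pairs, best first."""
    scored = [(i, knn_predict(train_X, train_y, x, k)) for i, x in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_activity(x):
    """Assumed synthetic response; real work would use measured bioactivities."""
    return 2.0 * x[0] - x[1]


rng = random.Random(0)
# Labeled benchmark set: 200 "molecules" with 4 descriptors each (synthetic).
train_X = [[rng.random() for _ in range(4)] for _ in range(200)]
train_y = [toy_activity(x) for x in train_X]
# Unlabeled pool to be screened (synthetic).
candidates = [[rng.random() for _ in range(4)] for _ in range(1000)]

hits = screen(train_X, train_y, candidates, top_n=5)
for index, predicted in hits:
    print(f"candidate {index}: predicted response {predicted:.3f}")
```

Only the short list in `hits` would move on to experimental testing. In practice the quality of that list depends on the descriptors and on how well the model is validated.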

Publications

1.      K. P. Bennett and C. Campbell, “Support Vector Machines: Hype or Hallelujah?” SIGKDD Explorations, 3:1, pp. 1-13 (2001).

2.      G. Raetsch, A. Demiriz, and K. Bennett, “Sparse Regression Ensembles in Infinite and Finite Hypothesis Spaces,” Machine Learning, 48, 1-3, pp 193-221, (2002).

3.      A. Demiriz, K. P. Bennett, and M. Embrechts, “A Genetic Algorithm Approach for Semi-Supervised Clustering,” Smart Engineering Systems Design, 4, pp. 35 - 44 (2002).

4.      M. Momma and K. Bennett,  “A Pattern Search Method for Model Selection of Support Vector Regression,” to appear in SIAM Conference on Data Mining (2002)

5.      J. Bi and K. P. Bennett, “Duality, Geometry, and Support Vector Regression,” to appear in Advances in Neural Information Processing Systems, 14 (2002).

6.      A. Demiriz, K. Bennett, C. Breneman, and M. Embrechts,  “Support Vector Machine Regression in Chemometrics,” to appear in Computer Science and Statistics (2002).

7.      C. Campbell and K. P. Bennett, “Linear Programming Techniques for Novelty Detection,” Advances in Neural Information Processing Systems, 13 (2001).

8.      M. Momma and K. Bennett,  “A Pattern Search Method for Model Selection of Support Vector Regression,” to appear in proceedings of 2nd SIAM International Conference on Data Mining, Arlington Virginia (2002).

9.      C.M. Breneman, Mark J. Embrechts, Muhsin Ozdemir, Larry Lockwood, Kristin Bennett, and Dirk DeVogelaere, “Feature Selection Methods Based on Genetic Algorithms for In Silico Drug Design,” in Evolutionary Computing in Drug Design, David Corne and Larry Fogel, Eds., Springer Verlag (to appear, 2002).

10.  C.M. Breneman, K.P. Bennett, M. Embrechts, S. Cramer, M. Song and J. Bi, “Descriptor Generation, Selection and Model Building in Quantitative Structure-Property Analysis,” in Experimental Design for Combinatorial High-Throughput Materials Development, J. Cawse, Ed., Wiley (to appear, 2002).

11.  N. Tugcu, C. Mazza, C. Breneman, Y. Sanghvi, J. Moore and S. M. Cramer, “High Throughput Screening and Quantitative Structure-Efficacy Relationship Models for Designing Displacers for Anti-sense Oligonucleotide Purification in Anion-Exchange Systems,” Separation Science and Technology, 37(7), pp. 1-15 (2002).

12.  Mark J. Embrechts, Fabio Arciniegas, Muhsin Ozdemir, Curt M. Breneman, and Kristin P. Bennett, “Data Mining Using 2-D Neural Network Sensitivity Analysis for Molecules,” in Intelligent Engineering Systems through Artificial Neural Networks: Smart Engineering System Design: Vol. 11, C. H. Dagli et al., Eds., pp. 345 – 350, ASME Press (2001).

13.  Mark J. Embrechts and Robert A. Bress, “Local Sequence Alignment with Genetic Algorithms,” in Intelligent Engineering Systems through Artificial Neural Networks: Smart Engineering System Design: Vol. 11, C. H. Dagli et al., Eds., pp. 153 – 157, ASME Press  (2001).

14.  Cihan H. Dagli, Anna L. Buczak, Joydeep Ghosh, Mark J. Embrechts, Okan Ersoy, and Stephen Kercel, Eds. Intelligent Engineering Systems through Artificial Neural Networks: Smart Engineering System Design: Vol. 11: Neural Networks, Fuzzy Logic, Evolutionary Programming, Data Mining and Complex Systems, ASME Press, New York (November 2001).

15.  Mark J. Embrechts, Hugh F. VanLandingham, and Seppo V. Ovaska, Eds. Proceedings of the IEEE Mountain Workshop, SMCia01, IEEE Press, IEEE Catalog Number 01EX504, ISBN 0-7803-7154-2 (June 2001).

16.  Mark J. Embrechts, Fabio Arciniegas, Muhsin Ozdemir, Michinari Momma, Curt M. Breneman, Larry Lockwood, Kristin P. Bennett and Robert H. Kewley, “StripMining for Molecules,” Proceedings IEEE International Joint Conference on Neural Networks, IJCNN’02, Honolulu, Hawaii, May 12-15 (2002).

17.  Jianguo Xin and Mark J. Embrechts, “Supervised Learning With Spiking Neuron Networks,” Proceedings IEEE International Joint Conference on Neural Networks, IJCNN’01, Washington D.C., July 15-19 (2001).

18.  Fabio Arciniegas and Mark J. Embrechts, “Bagging Neural Network Sensitivity Analysis for Feature Reduction in QSAR Problems,” Proceedings IEEE International Joint Conference on Neural Networks, IJCNN’01, pp. 2478-2484, Washington D.C., July 15-19 (2001).

19.  Mark J. Embrechts, Fabio Arciniegas, Muhsin Ozdemir, and M. Momma, “Scientific Data Mining with StripMiner™,” Proc. of the IEEE Mountain Workshop, pp. 13 – 18, SMCia01, June 25-27, Blacksburg, VA (2001).

20.  Muhsin Ozdemir, Mark J. Embrechts, Fabio Arciniegas, Curt M. Breneman, Larry Lockwood, and Kristin P. Bennett, “Feature Selection for In-Silico Drug Design using Genetic Algorithms and Neural Networks,” Proc. of the IEEE Mountain Workshop, pp. 53 – 58, SMCia01, Blacksburg, VA, June 25-27 (2001).

Project Impact

The techniques developed for this project lead to new, powerful data mining tools for the virtual design and discovery of pharmaceuticals. The use of machine intelligence in QSAR and molecular design will change the way new drugs are invented by minimizing the lengthy procedures for testing on humans and animals and by allowing the near real-time virtual invention of drugs for society-threatening diseases.

Goals, Objectives, and Targeted Activities

1.      DATASETS – Dataset selection and provision from both industrial and published sources on the basis of intrinsic difficulty or lack of previous success, as well as biological and medical relevance.  Development and analysis of large QSAR datasets with ~ 50,000-100,000 molecules.

2.      DESCRIPTORS – Development of rapidly calculable Wavelet Coefficient Descriptors (WCDs) that capture important features of molecular electron density distributions from either Transferable Atom Equivalent (TAE) reconstruction or from DFT or ab-initio wavefunctions.  Development of shape-specific electronic property (PEST) descriptors. Promoting and supporting the RECON code for TAE and wavelet descriptor generation.

3.      DATAMINING TOOLS – Benchmarking, documenting and promoting the StripMiner code for feature reduction with genetic algorithms and sensitivity analysis. StripMiner incorporates bootstrapping and bagging for drug design predictions based on partial least squares (PLS), kernel partial least squares (K-PLS), support vector machines (SVM), local learning, and neural networks. Feature selection methods are based on sensitivity analysis, genetic algorithms, and sparse support vector machines.

4.      CHEMISTRY-IN/CHEMISTRY-OUT – The emphasis of the third phase of the project was to explore a prototype system that feeds useful information from the selected ensemble of features back to the drug designer to enhance his or her domain knowledge.
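The combination of bagging and sensitivity analysis mentioned under DATAMINING TOOLS can be sketched as follows. This is an illustrative stand-in and not the StripMiner implementation: a linear model fitted by gradient descent replaces the neural network or K-PLS learners, the data are synthetic, and the function names (`fit_linear`, `sensitivities`, `bagged_sensitivities`) are hypothetical. The sensitivity measure used here, the mean absolute change in the prediction when one descriptor is nudged by a small amount, is one common choice and is assumed for the example.

```python
import random


def fit_linear(X, y, epochs=300, lr=0.1):
    """Fit y ~ w.x + b by batch gradient descent and return a predict function.
    Stand-in for the neural network or K-PLS learners named in the text."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, row)) + b - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return lambda row: sum(wi * xi for wi, xi in zip(w, row)) + b


def sensitivities(model, X, delta=0.1):
    """Mean absolute change in the prediction when each descriptor is nudged by delta."""
    result = []
    for j in range(len(X[0])):
        total = 0.0
        for row in X:
            bumped = list(row)
            bumped[j] += delta
            total += abs(model(bumped) - model(row))
        result.append(total / len(X))
    return result


def bagged_sensitivities(X, y, n_bags=5, seed=0):
    """Average the sensitivities over models trained on bootstrap resamples."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    avg = [0.0] * d
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit_linear([X[i] for i in idx], [y[i] for i in idx])
        for j, s in enumerate(sensitivities(model, X)):
            avg[j] += s / n_bags
    return avg


# Synthetic data: only descriptors 0 and 2 drive the toy response.
rng = random.Random(1)
X = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(80)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0.0, 0.05) for row in X]

scores = bagged_sensitivities(X, y)
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("descriptors ranked by bagged sensitivity:", ranked)
```

Descriptors at the bottom of the ranking are candidates for removal before the next modeling round. Because the perturbation step only calls the model's predict function, the same loop would work unchanged with a nonlinear learner.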

Area Background

This project develops an infrastructure for rapid drug design using chemometric information from large molecular datasets. In a first phase, new descriptors were developed that are potentially related to biological activities.  In a second phase, machine learning models were developed to predict these biological activities.  In the third phase, the descriptors and modeling procedures were validated and enhanced and codes were tuned for the analysis of large molecular data sets.

The project has two major components: development of molecular descriptors and creation of “strip mining” tools to predict bio-responses based on selected descriptors. We have developed two new types of electron density-based descriptors as alternatives to traditional 2D and 3D property descriptors.  The new descriptor types include a set of wavelet coefficient descriptors (WCDs) and a new type of shape-specific electronic property (PEST-SHAPE) descriptors. The performance of the WCDs has been benchmarked against TAE descriptors and all other modern QSAR descriptors available in the open literature. We have created a suite of inference and validation tools for bio-response prediction.  A full-cycle PLS, K-PLS, and SVM QSAR methodology has been developed, including feature selection, automated model selection, and robust ensemble predictions. Several benchmark datasets (CCK, binding to human serum albumin, NCI Developmental Therapeutics anti-cancer data, HIV reverse-transcriptase inhibitor data, and tyrosine kinase data) were analyzed in addition to the Lombardo ADME data.  Standard formats for web dissemination of our datasets and results are being developed.
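To make the idea of a wavelet coefficient descriptor concrete, the sketch below compresses a property distribution into a few wavelet coefficients. It rests on assumptions that this report does not state: the input is taken to be a fixed-length histogram of some molecular surface property, the Haar wavelet is used for simplicity, and the descriptor keeps only the coarsest coefficients. The RECON code may differ on all of these points, and the function names are hypothetical.

```python
import math


def haar_transform(signal):
    """Full orthonormal Haar wavelet decomposition of a sequence whose length
    is a power of two. Output order: overall approximation coefficient first,
    then detail coefficients from coarsest to finest scale."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details  # coarser levels are placed in front
    return approx + details


def wavelet_descriptor(histogram, n_keep=4):
    """Keep the n_keep coarsest coefficients as a compact descriptor vector."""
    return haar_transform(histogram)[:n_keep]


# Hypothetical 8-bin histogram of a surface property for one molecule.
hist = [0, 1, 4, 9, 6, 3, 1, 0]
print(wavelet_descriptor(hist))
```

Keeping only the coarse coefficients trades fine detail in the distribution for a short, fixed-length vector that can be fed directly to the PLS, K-PLS, or SVM models.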

Area References

The DDASSL project activities and products are fully documented at www.drugmining.com.

Potential Related Projects

1) Development of screening and virtual library generation for rapid responses to biological threats to humans, plants and animals. 2) Protein-stationary phase interaction modeling for bio-separations technology. 3) Displacement chromatography modeling and displacer design. 4) Molecular design techniques applied to molecules of non-biological interest, such as “Materials by Design” or the specialization and optimization of industrial intermediates.