IDM 2000 REPORT

Automated Design and Discovery of Novel Pharmaceuticals using Semi-Supervised Learning in Large Molecular Databases

 

Kristin P. Bennett (bennek@rpi.edu)
Department of Mathematics
Curt Breneman (brenec@rpi.edu)
Department of Chemistry
Mark J. Embrechts (embrem@rpi.edu)
Department of Decision Sciences and Engineering Systems
Rensselaer Polytechnic Institute, Troy NY 12180

Contact Information

Machine Learning/Transduction/Support Vector Machines
Kristin P. Bennett (bennek@rpi.edu)
Department of Mathematics (Ph: 518 276-6899, Fax: 518 276-4824)
QSAR/Drug Design & Discovery/Virtual HTS/Virtual Combinatorial Libraries/TAE Descriptors
Curt Breneman (brenec@rpi.edu)
Department of Chemistry (Ph: 518 276-2678, Fax: 518 276-4887)
Genetic Algorithms / Neural Networks / Data Stripminer
Mark J. Embrechts (embrem@rpi.edu)
DSES, CII5217 (Ph: 518 276-4009, Fax: 518 276-8227)
Rensselaer Polytechnic Institute, Troy NY 12180

WWW PAGE

http://www.rpi.edu/locker/82/001182/public_html/files/

List of Supported Students and Staff

Graduate Students
Mathew Sundling, Muhsin Ozdemir, Fabio Arcienegas, Robert Kewley, Jr., Neil Eklund, Ayhan Demiriz, Meenatchi Ramalingam, Dirk De Vogelaere, Bo Jiang, Quiong Luo, Mighu Song, Wei Deng, Larry Lockwood

Undergraduate Students
Pieter De Temmerman, Andres de la Guardia, Bill Kott

Project Award Information

Virtual design of pharmaceuticals. This is a three-year NSF-funded project running from 09/01/1999 to 08/31/2002.
This is the first year of the project. The award number is IIS-9979860.

Keywords

Drug Discovery, Virtual Design, Novel Pharmaceuticals, Semi-Supervised Learning, Capacity Control, Molecular Databases, Support Vector Machines, QSAR, QSPR, Descriptor Generation, Descriptor Selection, Transferable Atom Equivalent (TAE), RECON, Neural Networks, Genetic Algorithms, Data Mining

Project Summary

This research develops a new framework for the virtual discovery of new pharmaceuticals. The basic idea is to use large existing pharmaceutical databases as input to a new type of structure/activity correlation methodology: a large set of new and traditional descriptors is calculated and used to build improved Quantitative Structure-Activity Relationship (QSAR) models that characterize and predict important biological responses. Once the descriptors have been determined and a predictive model has been built, thousands of candidate molecules, chemically similar to those in the benchmark data set, are drawn from large databases and evaluated for their chemical properties with the predictive model. The aim is to identify a few novel molecules with potentially attractive pharmaceutical properties that can then be tested further in the laboratory in the traditional way. Computationally intelligent data mining techniques are vital for extracting the information needed to select these novel molecules. This research applies novel machine learning paradigms such as semi-supervised learning with capacity control. These algorithms predict desired biological responses and generate QSAR models using molecules with both known (labeled) and unknown (unlabeled) biological responses. The project involves the development of an infrastructure of computationally intelligent computer codes that allow for the virtual design of novel pharmaceuticals or the improvement of existing ones. The proposed methodology is applicable to most pharmaceuticals for which a database of responses is available. The ultimate pay-off is expected to be the rapid invention of new drugs for new or known society-threatening diseases where a very fast response is warranted.
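The screening workflow described in this summary (build a predictive model from molecules with known responses, then score and rank candidates from a larger library) can be illustrated with a minimal sketch. It is not the project code: a k-nearest-neighbor predictor stands in for the neural network and support vector machine models, and the descriptor vectors, activities, and candidate names are hypothetical values chosen only for illustration.

```python
import math

def knn_predict(train, query, k=3):
    """Predict activity as the mean activity of the k nearest labeled molecules.

    train: list of (descriptor_vector, measured_activity) pairs.
    query: descriptor vector of the molecule to score.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

def screen(train, library, top_n=2, k=3):
    """Score every candidate in the library and return the top_n names."""
    scored = [(knn_predict(train, desc, k), name) for name, desc in library.items()]
    scored.sort(reverse=True)  # highest predicted activity first
    return [name for _, name in scored[:top_n]]

# Hypothetical benchmark set: (descriptor vector, measured activity).
train = [
    ((0.1, 0.2), 1.0),
    ((0.2, 0.1), 1.2),
    ((0.9, 0.8), 5.0),
    ((0.8, 0.9), 5.5),
    ((0.5, 0.5), 3.0),
]

# Hypothetical candidate library with descriptors but no measured activity.
library = {
    "cand_A": (0.85, 0.85),
    "cand_B": (0.15, 0.15),
    "cand_C": (0.55, 0.50),
}

top_hits = screen(train, library, top_n=2, k=2)
print(top_hits)
```

Only the short list returned by screen would be passed on for laboratory testing; in practice the predictor would be replaced by whichever validated QSAR model performs best on the benchmark data.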

Publications

1. A. Demiriz and K. P. Bennett, "Optimization Approaches to Semi-Supervised Learning," Applications and Algorithms of Complementarity, M. C. Ferris, O. L. Mangasarian and J.-S. Pang, editors, Kluwer Academic Publishers, 2000. (Refereed compilation).
2. Ayhan Demiriz, Kristin P. Bennett, and Mark J. Embrechts, "Semi-Supervised Clustering Using Genetic Algorithms," in Intelligent Engineering Systems through Artificial Neural Networks, Vol. 9, Cihan Dagli et al., eds., pp. 809 - 814, ASME Press (1999).
3. Mark J. Embrechts, Ayhan Demiriz, and Kristin P. Bennett, "Supervised Scaled Regression Clustering with Genetic Algorithms," in Intelligent Engineering Systems through Artificial Neural Networks, Vol. 9, Cihan Dagli et al., eds., pp. 457 - 462, ASME Press (1999).
4. Harold W. Lewis, III, and Mark J. Embrechts, "Fuzzy Expert Systems," accepted for publication by The Shogaku Ronshu: Journal of Commerce, Economics, and Economic History (Submitted October 1999).
5. Robert Kewley, Mark J. Embrechts and Curt Breneman, "Neural Network Sensitivity Analysis and Cross-Validation for Data Strip Mining Problems," accepted for publication by IEEE Transactions on Neural Networks.
6. Dirk DeVogelaere, P. Van Bael, M. Rijckaert, and Mark J. Embrechts, "A Water Pollution Problem Solved: Comparison of GadC versus Other Methods," Proceedings Modelling, Identification and Control - MIC 2000, Innsbruck, Austria, February 14 - 17, 2000.

Talks given related to project by Kristin Bennett:

1. "Support Vector Machines: Hype or Hallelujah?" Plenary Speech, ANNIE'99, Artificial Neural Networks in Engineering Conference, St. Louis, November 1999.
2. "Semi-supervised Clustering using Genetic Algorithms," ANNIE'99 Artificial Neural Networks in Engineering Conference, St. Louis, November 1999.
3. "Geometry in Data Mining," Thompson Science Series, University of Puget Sound, Tacoma, WA, February, 2000.
4. "Soft Margin Boosting using Column Generation," Support Vector Machine Workshop, Large Margin Classifiers Workshop, Neural Information Processing Systems Conference, Denver, CO, December 1999.
5. "Soft Margin Boosting using Column Generation," West Coast Optimization Meeting organized by Terry Rockafellar, University of Washington, Seattle, September 1999.

Talks given related to project by Curt M. Breneman:

1. "Drug-Design through Semi-Supervised Learning," Bioinformatics Workshop, RPI, Nov. 1999.
2. "New Methods of Surface Descriptor Representation," Eastman Kodak, Jan. 2000.
3. "Optimization of Molecular Properties using TAE Descriptors," GE Corporate R&D, Jan. 2000.

Talks given related to project by Mark J. Embrechts:

1. "Virtual Design of Pharmaceuticals with Semi-Supervised Learning," Invited Colloquium, Space and Naval Warfare Systems Center (SPAWAR), February 18, 2000.
2. "Supervised Scaled Regression Clustering with Genetic Algorithms," ANNIE'99 Artificial Neural Networks in Engineering Conference, St. Louis, November 1999.

Project Impact

The techniques developed in this project lead to powerful new data mining tools for the virtual design and discovery of pharmaceuticals. The use of machine intelligence in QSAR and molecular design will change the way new drugs are invented, avoiding lengthy procedures for testing on humans and animals and allowing the real-time virtual invention of drugs for society-threatening diseases.

Goals, Objectives, and Targeted Activities

Dataset selection from both industrial and published sources on the basis of intrinsic difficulty or lack of previous success, as well as biological and medical relevance.
Development of rapidly calculable Wavelet Coefficient Descriptors (WCDs) that capture important features of molecular electron density distributions from either Transferable Atom Equivalent (TAE) reconstruction or from DFT or ab initio wavefunctions.
Development of genetic clustering algorithms and neural network models for drug design and benchmarking their performance with support vector machines with semi-supervised learning and boosting strategies.
First phase of Data Strip Miner code development.
Proof of principle of semi-supervised learning using datasets with many unlabeled entries.
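To illustrate the wavelet descriptor goal listed above, the following sketch computes Haar wavelet coefficients of a binned property distribution and keeps the coarsest coefficients as a compact descriptor. This is only an assumption about how such descriptors could be formed: the report does not state the wavelet family, the input representation, or the number of coefficients retained, and the histogram values here are hypothetical.

```python
import math

def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a sequence of length 2**n.

    Returns [overall approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    s = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / s for a, b in pairs]
        approx = [(a + b) / s for a, b in pairs]
        details = detail + details  # coarser levels are placed in front
    return approx + details

# Hypothetical 8-bin histogram of a surface property (e.g. a TAE-derived
# electronic property sampled on the molecular surface).
histogram = [0.0, 1.0, 4.0, 9.0, 7.0, 3.0, 1.0, 0.0]

# Keep the 4 coarsest coefficients as a compact descriptor vector.
wcd = haar_dwt(histogram)[:4]
print(wcd)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared histogram values, so truncating to the coarsest coefficients discards only fine-scale detail.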

Area Background

The idea of the project is to develop an infrastructure for on-the-fly drug design from large molecular datasets. In the first phase, new descriptors are developed that are potentially related to biological activities. In the second phase, machine learning models are developed to predict these biological activities.
At this stage, we have developed a new set of wavelet coefficient descriptors (WCDs) as an alternative to traditional 2D and 3D property descriptors. The performance of the new WCDs is benchmarked against TAE descriptors and all other modern QSAR descriptors available in the open literature. The modules generated for StripMiner to date include a neural network, GA-driven clustering, and GA clustering with semi-supervised learning. Several benchmark datasets were analyzed, including the Merck CCK dataset, the NCI Developmental Therapeutics anti-cancer dataset, several HIV reverse-transcriptase inhibitor datasets, and a tyrosine kinase dataset. Standard formats for web dissemination of our datasets and results are being developed.
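As an illustration of GA clustering with semi-supervised learning, the sketch below shows one plausible fitness function that a genetic algorithm could minimize over candidate cluster centers: cluster dispersion computed on all molecules plus a class-impurity penalty computed only on the labeled ones. The exact objective used in StripMiner is not given in this report, so the Gini impurity term, the weight beta, and the toy data are assumptions.

```python
import math
from collections import Counter

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def fitness(points, labels, centers, beta=1.0):
    """Lower is better. labels[i] is None for unlabeled molecules.

    dispersion: mean squared distance of all points to their assigned center.
    impurity:   Gini impurity of the known labels in each cluster, weighted
                by that cluster's share of the labeled points.
    """
    members = assign(points, centers)
    dispersion = sum(
        math.dist(p, centers[c]) ** 2 for p, c in zip(points, members)
    ) / len(points)
    n_labeled = sum(label is not None for label in labels)
    impurity = 0.0
    for c in range(len(centers)):
        counts = Counter(
            label for label, m in zip(labels, members)
            if m == c and label is not None
        )
        size = sum(counts.values())
        if size:
            gini = 1.0 - sum((v / size) ** 2 for v in counts.values())
            impurity += (size / n_labeled) * gini
    return dispersion + beta * impurity

# Hypothetical one-dimensional descriptors; only two molecules are labeled.
points = [(0.0,), (1.0,), (4.0,), (5.0,)]
labels = ["active", None, None, "inactive"]

good = fitness(points, labels, centers=[(0.5,), (4.5,)])
bad = fitness(points, labels, centers=[(2.5,), (10.0,)])
print(good, bad)
```

In a GA setting each chromosome would encode a set of centers, and selection would favor chromosomes with lower fitness values; the unlabeled molecules influence the result through the dispersion term, while the labeled ones steer clusters toward class purity.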

Area References

[Embr98a] M. J. Embrechts, R. Kewley, Jr. and C. M. Breneman, "Computationally Intelligent Data Mining for the Automated Discovery of Novel Pharmaceuticals," in Intelligent Engineering Systems through Artificial Neural Networks, Vol. 8, C. Dagli et al., eds., pp. 397 - 403, ASME Press (1998).

[Brene97] C. M. Breneman and M. Rhem, "A QSPR Analysis of HPLC Column Capacity Factors for a set of High-Energy Materials Using Electronic Van der Waals Surface Property Descriptors Computed by the Transferable Atom Equivalent Method," J. Comp. Chem., Vol. 18:2, pp. 182-197 (1997).

[Brene95] C. M. Breneman, T. R. Thompson, M. Rhem and M. Dung, "Electron Density Modeling of Large Systems Using the Transferable Atom Equivalent Method," Computers & Chemistry, Vol. 19:3, pp. 161 (1995).

[Hansch95] C. Hansch and A. Leo, "Exploring QSAR, Fundamentals and Applications in Chemistry and Biology," ACS National Meeting, Washington, DC (1995).

[Cho95] S. J. Cho and A. Tropsha, "Cross-Validated R-Squared Guided Region Selection for Comparative Molecular Field Analysis: A Simple Method to Achieve Consistent Results," Journal of Medicinal Chemistry, Vol. 38, pp. 1060 (1995).

[Rogers94] D. Rogers and A. J. Hopfinger, "Application of Genetic Function Approximation to Quantitative Structure Activity Relationships and Quantitative Structure Property Relationships," J. Chem. Inf. Comp. Sci., Vol. 34, pp. 854-866 (1994).

[Rogers96a] D. Rogers, "Genetic Function Approximation: A Genetic Approach to Building Quantitative Structure-Activity Relationship Models," in QSAR and Molecular Modelling: Concepts, Computational Tools and Biological Applications, F. Sanz, J. Giraldo, and F. Manaut, eds., Prous Science Publishers, Barcelona, Spain, pp. 420-426 (1996).

[Rogers96b] D. Rogers, "Some Theory and Examples of Genetic Function Approximation with Comparison to Evolutionary Techniques," in Genetic Algorithms in Molecular Modeling, J. Devillers, ed., Academic Press, London, England, pp. 87-107 (1996).

[Rogers96c] W. J. Dunn and D. Rogers, "Genetic Partial Least Squares in QSAR," in Genetic Algorithms in Molecular Modeling, J. Devillers, ed., Academic Press, London, England, pp. 109-130 (1996).

[Jurs93] P. C. Jurs, "Applications of Computational Neural Networks in Chemistry," CICSJ Bulletin 11, pp. 2-10 (1993).

Potential Related Projects

Development of screening and virtual library generation for rapid responses to biological threats to humans, plants and animals. Molecular design techniques as applied to molecules of non-biological interest, such as "Materials by Design" or the specialization and optimization of industrial intermediates.