Chapter 19

Title: Chemoinformatics: Structure- and Property-activity Relationship Development


Chapter 19 describes a second Quantitative Structure-Activity Relationship (QSAR) use case relevant to chemistry and the pharmaceutical industry: the identification of novel functional inhibitors of acid sphingomyelinase (ASM). The use case builds on the previous chapter, so you should read Chapter 18 first. In the data preprocessing step, the PaDEL (Pharmaceutical Data Exploration Laboratory) extension for RapidMiner described in the previous chapter is again used to compute molecular properties from given 2-D or 3-D molecular structures. These properties are then used to predict ASM inhibition. Automated feature selection with backward elimination reduces the properties to a set relevant for the prediction task, and a classification learner, namely Random Forests, generates the predictive model that captures the structure- and property-activity relationships.
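The backward elimination wrapper mentioned above can be illustrated outside RapidMiner with a minimal Python sketch. It is not the chapter's RapidMiner process. The data are synthetic, a nearest-centroid classifier stands in for Random Forests so that the example needs no external libraries, and the names centroid_accuracy and backward_elimination are hypothetical. The stopping rule (stop as soon as every possible removal lowers the cross-validated accuracy) is an assumption chosen for simplicity.

```python
import random

def centroid_accuracy(rows, labels, features, folds=5):
    """Cross-validated accuracy of a nearest-centroid classifier
    (a simple stand-in for Random Forests) on the given attribute indices."""
    n = len(rows)
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        # Class centroids are computed from the training rows only.
        sums, counts = {}, {}
        for i in range(n):
            if i in test_idx:
                continue
            acc = sums.setdefault(labels[i], [0.0] * len(features))
            for j, f in enumerate(features):
                acc[j] += rows[i][f]
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        for i in test_idx:
            def dist(c):
                return sum((rows[i][f] - centroids[c][j]) ** 2
                           for j, f in enumerate(features))
            correct += min(centroids, key=dist) == labels[i]
    return correct / n

def backward_elimination(rows, labels, score):
    """Start from all attributes and repeatedly drop the one whose removal
    gives the best score, as long as the score does not decrease."""
    selected = list(range(len(rows[0])))
    best = score(rows, labels, selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one attribute.
        trials = [(score(rows, labels, [f for f in selected if f != drop]), drop)
                  for drop in selected]
        trial_score, drop = max(trials)
        if trial_score < best:
            break  # every removal hurts: keep the current subset
        best = trial_score
        selected.remove(drop)
    return selected, best

# Synthetic example set (an assumption, not the ASM data):
# attribute 0 carries the class signal, attributes 1-4 are pure noise.
rng = random.Random(0)
labels = [i % 2 for i in range(60)]
rows = [[rng.gauss(6.0 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
        for y in labels]

selected, best = backward_elimination(rows, labels, centroid_accuracy)
print("kept attributes:", selected, "cross-validated accuracy:", round(best, 3))
```

Because the wrapper retrains and re-validates the learner for every candidate subset, its cost grows quickly with the number of attributes, which is why a cheaper pre-filtering step is usually applied first.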

The process of drug design, from the biological target to the drug candidate and subsequently the approved drug, has become increasingly expensive. Strategies and tools that reduce costs have therefore been investigated to improve the efficiency of drug design. The most time-consuming and cost-intensive steps of this process are the selection, synthesis, and experimental testing of drug candidates, so numerous attempts have been made to reduce the number of candidates that must be tested experimentally. Several methods that rank compounds by their likelihood of acting as an active drug have been developed and applied with variable success. In silico methods that support the drug design process by narrowing the pool of candidates to the most promising ones are collectively known as virtual screening methods. Their common goal is to reduce the number of drug candidates subjected to biological testing and thereby to increase the efficiency of the drug design process.

This chapter demonstrates an in silico method for predicting biological activity based on RapidMiner data mining workflows. It builds on the type of chemoinformatic predictions described in the previous chapter, which use chemoinformatic descriptors computed by PaDEL. Random Forests serve as the predictive model for the activity of a molecule of a given structure. PaDEL computes the molecular structural properties, which are then reduced in two stages. First, automated attribute weighting under several weighting criteria keeps only the attributes with the highest weights. Second, automated attribute selection with a Backward Elimination wrapper reduces this set further. Starting from a large number of properties for the example set, the weight-based feature selection thus vastly reduces the number of attributes before the systematic backward elimination search finds the most predictive attribute subset for model generation. Finally, a validation is performed to guard against over-fitting, and the benefits of Y-randomization are shown.
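Y-randomization, mentioned at the end of the paragraph above, can be sketched in a few lines of Python. The idea is to shuffle the activity labels, repeat model building and validation, and check that the performance drops to roughly chance level. If it does not, the original result is likely a product of chance correlation or over-fitting. The sketch is an illustration under stated assumptions, not the chapter's RapidMiner process: the data are synthetic, a 1-nearest-neighbour classifier with leave-one-out validation stands in for Random Forests and the chapter's validation scheme, and the names loo_accuracy and y_randomization are hypothetical.

```python
import random

def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    (a simple stand-in for the Random Forest model)."""
    hits = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)

def y_randomization(rows, labels, score, rounds=20, seed=0):
    """Return the score on the true labels and the scores obtained after
    randomly permuting the labels `rounds` times."""
    rng = random.Random(seed)
    true_score = score(rows, labels)
    shuffled_scores = []
    for _ in range(rounds):
        permuted = labels[:]
        rng.shuffle(permuted)  # breaks the descriptor-activity link
        shuffled_scores.append(score(rows, permuted))
    return true_score, shuffled_scores

# Synthetic example set (an assumption, not the ASM data):
# descriptor 0 separates the two classes, descriptor 1 is noise.
rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [[rng.gauss(6.0 * y, 1.0), rng.gauss(0.0, 1.0)] for y in labels]

true_score, shuffled_scores = y_randomization(rows, labels, loo_accuracy)
mean_shuffled = sum(shuffled_scores) / len(shuffled_scores)
print("true labels:", round(true_score, 3),
      "shuffled labels (mean):", round(mean_shuffled, 3))
```

In a full workflow, the feature selection would also be repeated for each shuffled label vector. Otherwise the test does not account for chance correlations introduced by the selection step itself.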

Table of Contents

19.1 Introduction
19.2 Example Workflow
19.3 Importing the Example Set
19.4 Preprocessing of the Data
19.5 Feature Selection
19.6 Model Generation
19.7 Validation
19.8 Y-Randomization
19.9 Results
19.10 Conclusion/Summary
Acknowledgment
Bibliography

Dataset & Processes: Click here to download