
Advantages of Relative versus Absolute Data for the Development of QSAR Classification Models Irene Luque Ruiz, and Miguel Ángel Gómez-Nieto J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00492 • Publication Date (Web): 26 Oct 2017 Downloaded from http://pubs.acs.org on October 28, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

Advantages of Relative versus Absolute Data for the Development of QSAR Classification Models Irene Luque Ruiz1, Miguel Ángel Gómez Nieto. University of Córdoba. Department of Computing and Numerical Analysis. Campus de Rabanales. Albert Einstein building. E-14071, Córdoba, Spain. 1

[email protected]

ABSTRACT: The usability and reproducibility of a QSAR classification model are determined by the appropriate selection of the chemical space represented by the data set, the selection of its chemical data representation, and the development of a correct modeling process that uses a robust and reproducible algorithm and performs exhaustive training and external validation. In this paper, we show that the use of relative versus absolute data in the representation of the data sets produces better classification models when the other processes are not modified. Relative data consider a reference frame to measure the chemical characteristics involved in the classification model, refining the data set representation and smoothing the lack of chemical information. Three data sets with different characteristics have been used in this study, and classification models have been built applying a support vector machine algorithm. For randomly selected training and test sets, values of accuracy and area under the Receiver Operating Characteristic curve close

ACS Paragon Plus Environment


to 100% have been obtained for the generation of the models and external validations in all cases.

1. Introduction

Given a set of molecules and a known property or activity, a QSAR classification model is a statistical model that assigns each molecule of the set to a class of molecules sharing the same characteristics according to an established range for that property or activity1, 2. Every classification process consists of four non-independent phases3: a) selecting a representation model for the data set molecules (chemical space), considering the molecular characteristics that determine the property under study; b) transforming (if needed) the representation model into a data model that allows computational processing and manipulation; c) selecting a statistical model and processing that data model, thus generating a data set classification model; and d) validating the classification model using an external data set and the same data model.

The construction of a QSAR classification model involves a series of initial decisions that determine the goodness of the model4. Firstly, the representation model of the data set molecules must contain information related to the property being studied, that is, the structural and/or physicochemical characteristics that determine the value of this property. Secondly, the data model, in case a transformation of the representation model is required, must represent these characteristics efficiently for algorithmic processing, and must describe with the


highest possible accuracy the relationships between these characteristics and the property being studied.

Traditionally, several types of representation models of data sets are used in QSAR classification, mainly fingerprints, descriptors and graph-based representations. Chemical fingerprints are binary arrays of a given length obtained through a hash encoding algorithm. Depending on the type of fingerprint, the algorithm encodes different structural characteristics of the molecules, assigning a value of 1 to an element of the binary array if that characteristic (atoms, bonds, paths, fragments or substructures, etc.) is present in the molecule and 0 otherwise. Representations based on descriptors are rectangular matrixes in which each row represents a molecule of the data set and each column represents a variable related to the property being studied. These variables can be measurements of structural, topological, geometric or physicochemical properties, either known or calculated from the molecules. Graph representations are based on the structural fragments present in the molecules, generating rectangular matrixes in which the rows represent the molecules of the data set, the columns represent the different structural fragments, and the elements of the matrix contain information about the presence of the fragment in the molecule, or any other information (the value of a descriptor or a metric extracted or calculated for this molecular fragment in the structure of the molecule).

In several proposals of QSAR classification models, graphs are used directly as data models. In other cases, measurements of similarity or distance are obtained from these representations, and square matrixes are built and used afterwards as data models.
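As a purely illustrative sketch of the hash encoding idea described above, the following Python fragment folds a list of structural features into a fixed-length binary array. It is not the algorithm of any specific fingerprint type: the fragment labels, the 64-bit length and the use of CRC32 as hash function are assumptions made here for the example.

```python
import zlib


def hashed_fingerprint(features, n_bits=64):
    """Fold a collection of feature strings into a fixed-length binary array."""
    bits = [0] * n_bits
    for feature in features:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1  # the bit is 1 when the feature is present
    return bits


# Hypothetical fragment labels for two molecules (illustrative only)
ethanol = ["C", "O", "C-C", "C-O"]
ethylamine = ["C", "N", "C-C", "C-N"]

fp_a = hashed_fingerprint(ethanol)
fp_b = hashed_fingerprint(ethylamine)
```

Features shared by both molecules set the same positions in both arrays, while two different features may collide on one position, which is one reason why fingerprint length and density matter.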


As can be observed, the classification model receives absolute measurements of the characteristics of the data set molecules as input data. In those matrixes, the columns of each row store values extracted from the characteristics of the molecule corresponding to that row. Duca and Hopfinger5 distinguished between absolute and relative similarity measurements. Absolute measurements depend only on the molecules' characteristics, whereas relative measurements depend on a reference (internal or external). This consideration is the pillar of 3D/4D fingerprints or QTM (Quantum molecular similarity)6, for which a reference structure is used as a common core that provides a specific alignment of each molecule of the data set.

Generally, the data models used in QSAR classification store, in different types of data structures, absolute measurements of the data set molecules. Data models based on fingerprint matrixes have been widely used with quite disparate results, since the characteristics of the data set can result in fingerprints with very low or very high density, a high correlation among the bits, or even the absence of relevant information related to the property being studied. Thus, Xu et al.7 establish that the models obtained for the prediction of chemical mutagenicity from a data set of 4252 mutagens and 3365 non-mutagens depend considerably on the type of fingerprint used.

On the other hand, models based on descriptor/property matrixes require a complex selection of variables, because a high number of descriptors can result in over-fitting problems, high correlation among the variables, models that are hard to interpret, or relevant information about the property under study not being captured by the selected descriptors. Helma et al.8 compare representations based on descriptors generated for molecular fragments with physicochemical properties of the molecules (LogP, hydrophilic and lipophilic surface areas). In those comparisons, the development of classification models for a data set of 684 mutagenicity,


determined that representations based on fragments generate models 20% better than representations based on properties.

Proposals based on graphs or molecular fragments also generally use absolute measurements. The molecules of the data set are fragmented, and matrixes of size M (number of molecules) x N (number of different fragments obtained) are generated. From these matrixes, measurements of similarity or distance are obtained.

In the last few years, the fusion of data models has been proposed, with the aim of providing the classification model with input data containing the highest number of molecular characteristics that determine the property/activity under study. These proposals combine or fuse different types of fingerprints, descriptor matrixes and/or similarity matrixes, and even algorithms, in just one data model that is used to generate the QSAR classification model9-12.

Data models that store absolute measurements of the molecular characteristics are independent of the property/activity being studied. These measurements are obtained from the molecular characteristics independently of the set of molecules of the data set and of the frame of reference related to that property/activity. In contrast, relative measurements consider the data set characteristics or the frame of reference. 4D fingerprints are a clarifying example: for their generation, a structure or core is aligned over each molecule of the data set. In this process, absolute measures of the molecular characteristics are considered, but referenced to a core and an alignment; thus, different cores or alignments generate different 4D fingerprints.

The use of relative measurements has been widely proposed for models based on fingerprints. Arif et al.13 weight the Tanimoto similarity calculation through a metric based on the inverse frequency of the molecular fragments represented by the bits of the fingerprints. Chen


et al.14 obtain the frequency of substructures in MACCs fingerprints, thus building a representation of each molecule based on the eigenvalues related to the frequencies of every selected pattern. Vogt and Bajorath15 improve similarity searching by considering the significance of molecules according to a selection of characteristic or significant bits of the fingerprints. In that work, the selection of potential or determinant bits means that only these bits, and not all the bits of the fingerprints, are considered for the calculation of the similarity measurements. The selection process does not involve a modification of the structure of the fingerprints16.

Urbano et al.17 have obtained the relative distances between molecules of the data set for the calculation of approximate similarity measurements according to the non-isomorphic fragments obtained from the extraction of the Maximum Common Subgraph (MCS) of the data set, and combine similarity and approximate similarity measurements to generate data models that take into account, using relative measurements, the non-isomorphic fragments that determine the variability in the property being studied11. McLellan et al.18 have studied the rank order entropy for the validation of QSAR models, varying the training and test data sets and observing the data strength. Other papers are based on obtaining the entropy from the fingerprints, weighting the bits of each fingerprint through the probability of their presence in the data set19.

Palacios Bejarano et al.20 have developed an algorithm for the extraction of patterns in the fingerprints representing the data set. The data set is portrayed through a graph in which the nodes represent fingerprint patterns common to subsets of molecules from the data set, and the root node stores a fingerprint or pattern common to every molecule of the data set. This graph


is used for the extraction of similarity, dissimilarity, descriptor and relative distance measurements, and the authors show the efficiency of this representation for QSAR applications in classification and property prediction.

The use of relative measurements has also been widely considered in classification models based on graphs and descriptors. Biniashvili et al.21 have studied the influence of shaded substructures, comparing them with non-shaded substructures (not included in other models of greater size). Eriksson et al.22 encoded R-Groups in fingerprints and obtained the distribution of each R-Group, which is used to weight its contribution to the classification model. Zhang et al.23 have used structure-activity relationship matrix (SARM) models. The SARM data structure allows automatic and exhaustive extraction of SAR patterns from data sets and their organization into a chemically intuitive scaffold/functional-group format. In a typical SARM, each cell represents an individual compound, with rows and columns indicating compounds that share the same key (core) and value (R-Group).

Cerruela et al.24 used the Weighted Maximum Subgraph Tree (WMCST) for the representation of data sets. This representation is based on a hierarchical structure in which the root node stores the fragment containing the Maximum Common Subgraph (MCS) of the data set molecules, the leaf nodes store the molecules, and the remaining nodes store the maximum common substructures of their child nodes. In this representation, the arcs store the non-isomorphic fragments, that is, the structural difference between the parent and child nodes. This allows the authors to calculate relative distances and approximate similarity measurements, based on the structure and descriptors, between the different nodes of the tree, which results in a quite efficient classification model.


In this paper, we show the advantages of using data models based on relative measurements of distance/similarity versus those based on absolute measurements. Relative measurements are obtained by considering a reference structure of the data set. This real or virtual structure is considered representative of the data set and is the origin of the representation space of the data set. The relative distances between each pair of molecules in the plane formed by the structure of reference (origin) and the relative distances between each pair of molecules in the N-dimensional space (N being the number of variables) are used to construct a data model. This data model consists of square non-symmetrical matrixes that store relative distances; therefore, the classification process consists of finding an N-dimensional space that solves the representation of the molecules of the data set in this space. Applying a linear Support Vector Machine (SVM), it is possible to generate strong, reproducible and easy-to-interpret classification models, which considerably improve on other proposals based on absolute measurements.

This paper is organized as follows. The introduction states the background and aims of the research. Section 2 analyzes the data sets selected for validating our proposal and the foundations of the proposal itself, describing the data model based on relative distances and its construction process. Section 3 presents the experimental results: the parameters involved in the development of the classification model are analyzed, and the results of the built models are presented and compared with those built using fingerprint and similarity matrixes. Section 4 describes the results obtained for the external validation of the classification models built for the studied data sets, and presents the flexibility of our proposal to be applied to other representation models, such as those based on descriptors. Finally, the last section discusses the research results.


2. Materials and Method

2.1. Data sets description (chemical space)

Three data sets with different characteristics have been selected for the analysis and validation of the proposal presented in this paper, with the aim of proving its applicability to different chemical spaces.

The Monoamine Oxidase inhibitor (MAO) data set25 is composed of 68 molecules divided into two classes: 38 inactive and 30 active molecules that inhibit monoamine oxidase (antidepressant drugs). This data set has been used by Gaüzère et al.26, who proposed a new graph kernel for the generation of classification models of this data set using Support Vector Machine (SVM) and a leave-one-out procedure; their results were compared with the ones obtained by other authors with the same data set but different kernels. For the different kernels used, the obtained accuracy varies between 80% and 96%27,28, corresponding to 55 and 65 correctly classified molecules out of 68, respectively. The best results were obtained for the kernel based on distances between the graphs (relative distances) that represent the molecules of the data set; the authors' proposal, based on the Graph Laplacian Kernel, reached an accuracy of 90%, correctly classifying 61 out of 68 molecules.

The Plasmodium falciparum growth inhibitor (PLA) data set is composed of 201 active and 349 non-active molecules. Hammann et al.29 applied a binary ant colony optimization (ACO) algorithm to a fingerprint-based representation of the data set for its classification. The authors tested different types of fingerprints (MACCs, Standard and Extended) and finally selected MACCs fingerprints for their study: since this fingerprint results in a data set representation with greater darkness, the computational cost of the process is reduced. Several classification algorithms were used in this study: decision tree induction algorithms (J4.8 and C4.5) and


random forest (for which the input is the fingerprint matrix representing the data set), and SVM and artificial neural networks (ANN) (for which 27 1D and 2D descriptors were also calculated). With their proposal, accuracy values between 0.84 and 0.87 and an AUC (area under the ROC curve) of up to 0.91 were obtained for the training phase (the authors did not carry out an external validation of these models).

The Fontaine Factor Xa (FON) data set (http://cheminformatics.org/datasets/) is composed of 435 molecules that present high or low affinity as factor Xa inhibitors (156 have high affinity, whereas 279 have low affinity). This data set has been tested by Fontaine et al.30, who used the anchor-GRIND (GRid-INdependent Descriptors) method and 3D descriptors derived from molecular interaction fields. Their method, based on relative measurements, resulted in classification models that are easy to interpret, and showed that the consideration of anchor points in a scaffold common to every molecule of the data set produces good statistical values in the classification. Two models were tested by the authors (anchor-MIF and MIF-MIF block), obtaining training accuracy values of up to 88% and 84%, respectively.

2.2. Data sets representation models

The data sets used in this study are represented through a binary matrix F of size M x L, where M is the number of molecules of the data set, L is the size of the molecular fingerprint, and each element of the matrix takes the value 0 or 1. The PaDEL-Descriptor31 software, which relies on the Chemistry Development Kit (CDK)32 and can generate 12 types of fingerprints, has been used to obtain the fingerprints. However, in this study, and with the aim of analyzing the influence of the size and density of the fingerprints, only 4 types of fingerprints have been used: MACCs, Extended, PubChem and 2DAtomPairs.
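The binary matrix F and the notion of fingerprint density can be illustrated with a small Python sketch. The text does not give a formula for density, so the definition used here (fraction of bits equal to 1) is an assumption, and the toy matrix is made up for the example.

```python
def fingerprint_density(row):
    """Fraction of bits set to 1 in one fingerprint (assumed definition)."""
    return sum(row) / len(row)


def average_density(F):
    """Mean density over the M fingerprints of the binary matrix F (M x L)."""
    return sum(fingerprint_density(row) for row in F) / len(F)


# Toy matrix with M = 3 molecules and L = 8 bits (made-up values)
F = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

densities = [fingerprint_density(row) for row in F]  # one value per molecule
mean_density = average_density(F)
```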


2.3. Data sets data models

With the aim of comparing the goodness of using absolute distances versus relative distances, the following data models have been built for each data set:

• Fingerprint matrix (F): each data set is represented by four F matrixes (one for every type of fingerprint previously described).

• Similarity matrix (S): starting from each F matrix, similarity matrixes have been built using the Tanimoto index. These matrixes are square and symmetrical, with a size equal to the number of molecules in each data set.

• Relative distances matrix (D): square and non-symmetrical matrixes that store relative distances between the molecules of the data set and between the representative molecule and the others. They are generated as described in Figure 1, which shows a flow diagram of the construction of the relative distance matrixes.
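The similarity matrix S in the list above can be sketched in a few lines of Python. This is a minimal illustration using the standard Tanimoto index for binary vectors (bits set in both fingerprints divided by bits set in at least one); the toy matrix and the value returned for two empty fingerprints are choices made here, not taken from the paper.

```python
def tanimoto(a, b):
    """Tanimoto index of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention chosen here: two all-zero fingerprints are treated as identical
    return both / either if either else 1.0


def similarity_matrix(F):
    """Square, symmetrical matrix of pairwise Tanimoto indices."""
    return [[tanimoto(a, b) for b in F] for a in F]


# Toy fingerprint matrix (made-up values)
F = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
S = similarity_matrix(F)
```

Because the index only depends on the two fingerprints being compared, S is an absolute measurement in the sense used in this paper: adding or removing other molecules does not change S(i, j).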

For the generation of the D matrixes, each fingerprint matrix (F) that stores an encoded representation (MACCs, 2DAtomPairs, PubChem, Extended) is processed in the following way:

1. The frequency of every bit in the fingerprint (every column of matrix F) is calculated, and a vector fP of size L is obtained. This vector is normalized by the size of the data set, so that it stores values between 0 and 1.

2. All those elements fP(i) [...] $D(i, j) = d(e_i, e_j)$, d being a distance measurement between nodes i and j. Thus, the value of D(i, j) in the lower triangle represents the cost of reaching node j from node i directly.
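Step 1 above (the normalized bit-frequency vector fP) can be sketched as follows. This is only an illustration of that single step with a made-up matrix; the function name is hypothetical and the subsequent steps are not reproduced here.

```python
def bit_frequencies(F):
    """Frequency of each bit (column) of F, normalized by the number of molecules."""
    n_molecules = len(F)
    n_bits = len(F[0])
    return [sum(row[k] for row in F) / n_molecules for k in range(n_bits)]


# Toy fingerprint matrix with 4 molecules and 4 bits (made-up values)
F = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
fP = bit_frequencies(F)
```

In this toy case the first bit is present in every molecule (frequency 1) and the last bit in none (frequency 0), so both carry no information for discriminating between molecules of this data set.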

As observed, D is not a symmetrical matrix. The upper triangle stores the cost of converting a leaf node i into the root node and this root node into a new leaf node j (w(ei)-w(ej)), and the lower triangle stores the Euclidean distance between the nodes i and j (d(ei,ej)). From a


geometrical point of view, this means obtaining the cost by going through the graph along its arcs (upper triangle) and obtaining the cost without going through the graph (lower triangle). This kind of proposal has also been used in other papers34 that use relative measurements and allow the generation of data representations that include a refinement that favors the development of other QSAR models.
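The layout of the non-symmetrical matrix D described above can be sketched in Python. This is a minimal illustration under stated assumptions: the node vectors X and the node weights w are taken as given inputs with made-up values (how they are computed is not reproduced here), the upper triangle uses the expression w(ei)-w(ej) exactly as written in the text, and the function name is hypothetical.

```python
import math


def relative_distance_matrix(X, w):
    """Assemble a square, non-symmetrical matrix D.

    Upper triangle (i < j): w[i] - w[j], the cost through the root node.
    Lower triangle (i > j): Euclidean distance between X[i] and X[j].
    """
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < j:
                D[i][j] = w[i] - w[j]
            elif i > j:
                D[i][j] = math.dist(X[i], X[j])
    return D


# Made-up node vectors and weights for three molecules
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
w = [1.0, 0.5, 0.25]
D = relative_distance_matrix(X, w)
```

Storing both quantities in one square matrix keeps the input size equal to that of a similarity matrix while giving the classifier two different views of each pair of molecules.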

2.4. Data sets classification model

In our study, the Statistics and Machine Learning Toolbox of Matlab 2017a35 and a linear Support Vector Machine (SVM) as classification algorithm have been used to build the classification models. The models were trained with this algorithm using the default parameters and 10-fold cross-validation, in such a way that every molecule from the training sets participated at least once in the modeling and test phases. For each distance matrix of the three studied data sets, models were built for the whole data set. In addition, 5 partitions of the data sets were randomly generated and split into training and test groups. For each generated classification model, an external validation was carried out; this process will be described later.

In order to evaluate the results, the following measurements were obtained: sensitivity (prediction accuracy for active compounds), $Se = \frac{T_A}{T_A + F_I}$; specificity (prediction accuracy for inactive compounds), $Sp = \frac{T_I}{T_I + F_A}$; accuracy (overall predictive accuracy), $Acc = \frac{T_A + T_I}{T_A + T_I + F_A + F_I}$; and AUC (area under the Receiver Operating Characteristic curve), NA and NB being the number of active and inactive molecules in the data set used, TA and TI the active and inactive elements correctly


classified in the set, and FA and FI the active and inactive elements incorrectly classified in the set.
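The evaluation measurements can be checked with a short Python sketch. It follows the verbal definitions given above (sensitivity and specificity as the prediction accuracy for active and inactive compounds). The example counts are those reported in Table 1 for the MAO data set with the MACCs fingerprint matrix; reading the four count columns of that table as per-class counts is an assumption made here, based on every row adding up to 30 active and 38 inactive molecules. The function and argument names are hypothetical.

```python
def class_metrics(correct_active, wrong_active, correct_inactive, wrong_inactive):
    """Return (sensitivity, specificity, accuracy) from per-class counts."""
    n_active = correct_active + wrong_active
    n_inactive = correct_inactive + wrong_inactive
    sensitivity = correct_active / n_active        # accuracy on active compounds
    specificity = correct_inactive / n_inactive    # accuracy on inactive compounds
    accuracy = (correct_active + correct_inactive) / (n_active + n_inactive)
    return sensitivity, specificity, accuracy


# MAO data set, MACCs fingerprint matrix (counts from Table 1): 29, 1, 33, 5
se, sp, acc = class_metrics(29, 1, 33, 5)
```

With these counts the accuracy is 62/68, which rounds to the 0.91 reported in Table 1.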

Figure 2. Color maps for MAO (top), PLA (center) and FON (bottom) corresponding to the fingerprint, similarity and relative distance matrixes built using MACCs fingerprints (panels: fingerprint matrix, similarity matrix, relative distances matrix). The color scale goes from 1 (light yellow) to zero (dark blue).

3. Experimental and Results

The different characteristics of the selected data sets used in this study, and the different properties of the different representation matrixes, can be observed in Fig. 2. The MAO data set contains a set of molecules with unequal similarity; although the fingerprints share a high percentage of common bits equal to 1, the fingerprints of the molecules of the data set show low density and a value equal to zero for many bits. The PLA data set shows quite dense fingerprints,


which are very similar among themselves; this can be appreciated in the color map of their similarity matrix (center), which shows little color difference in comparison with the MAO and FON data sets. The FON data set shows an intermediate behavior, with a higher fingerprint density than the MAO data set, but with greater similarity between the elements of the data set. This data set contains molecules that share large structural fragments, and small fragments at the anchor point are determinant for its activity. This characteristic can be seen in the building process of the classification model, as will be described later.

These characteristics can be clearly appreciated in the color maps of the relative distance matrixes in Fig. 2. In order to show color maps with the same color interval as the fingerprint and similarity color maps, the relative distance matrixes have been normalized to the interval [0,1]. In these maps, the upper triangle represents the distances to the pattern fingerprint, and the lower triangle shows the distances between the molecules of the data set.

Regarding the MAO data set, there is a high variability in the distances to the pattern fingerprint, even higher than the variability of the distances between molecules. As we will describe later, this is due to the size of the pattern fingerprint and the high diversity of fingerprints existing in the data set. The PLA data set shows very similar fingerprints, with a low similarity among the molecules; that is why the color map of relative distances is quite homogeneous and dark in both triangles. Finally, regarding the FON data set, its intermediate behavior between the MAO and PLA data sets can be clearly appreciated, with a medium-high level of similarity among molecules and a similar variability between the fingerprints of the molecules and the pattern fingerprint.
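The normalization of a relative distance matrix to the interval [0,1] mentioned above could be done, for instance, by min-max scaling over all entries. The text does not state which scheme was used, so the choice of min-max scaling, the function name and the toy matrix below are assumptions for illustration only.

```python
def normalize_unit_interval(D):
    """Rescale all entries of a matrix to [0, 1] by min-max scaling."""
    values = [v for row in D for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:  # constant matrix: avoid division by zero
        return [[0.0 for _ in row] for row in D]
    return [[(v - lo) / (hi - lo) for v in row] for row in D]


# Toy non-symmetrical matrix (made-up values)
D = [
    [0.0, 2.0, 4.0],
    [5.0, 0.0, 1.0],
    [10.0, 5.0, 0.0],
]
D_norm = normalize_unit_interval(D)
```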


3.1. Influence of the encoding algorithm

The different types of fingerprints are generated considering different structural characteristics of the molecules and different encoding algorithms, which influence the representation of the data sets.

Figure 3. Characteristics of MAO data set for the four types of fingerprints studied

Figure 3 shows the variability of the fingerprint density depending on the type of fingerprint used for the MAO data set. The highest densities are obtained for the Extended and MACCs fingerprints, whereas the 2DAtomPairs algorithm generates very low density fingerprints.


Furthermore, the pattern fingerprints also show different characteristics. While the pattern density for the MACCs and Extended algorithms is about 50% of the average density of the data set fingerprints, for the PubChem and 2DAtomPairs algorithms, despite their low density, it is about 80%. Given that the bits of the molecule and pattern fingerprints are weighted by their frequency in the fingerprint itself, the relationship between these factors results in a refinement of the representation of the molecules. It adds information about the calculated relative distances, as well as significant information about the relative contribution of each bit of the fingerprint for each molecule.

For each of the fingerprint matrixes, its corresponding similarity matrix has been obtained using the Tanimoto index, and the classification capacity of the fingerprint, similarity and relative distance data matrixes has been analyzed. As can be seen in Table 1, the classification power relies on two factors: a) the type of fingerprint, and b) the data matrix used in the process. When fingerprint matrixes are used, the accuracy is greater and the AUC is equal or superior to that obtained when the similarity matrix is used, for the four types of fingerprints tested.

Even though the size of the fingerprint does not directly affect the classification capacity, the fingerprint density does. The worst behavior is seen for 2DAtomPairs, due to its low density. MACCs, PubChem and Extended fingerprints, despite their different sizes, show similar density, and therefore their behavior is similar. Low density fingerprints result in a pattern fingerprint or root node with a very low density (0.02 for 2DAtomPairs against 0.13 for the others). However, if the pattern density is analyzed versus the average density of the data sets, ratios of 1.9, 1.5, 1.3 and 2.0 for MACCs,


2DAtomPairs, PubChem and Extended fingerprints, respectively, can be found, showing that PubChem generates fingerprints with a higher number of bits common to the data set. Table 1 also shows the results obtained using our proposal for each of the fingerprints used. As can be seen, in every case accuracy values close to 100% are obtained, and the same happens for the AUC values. It is important to remark that our proposal generates data matrixes which are independent of the type, size or density of the fingerprint and, thus, independent of the pattern or root node density. The values obtained for 2DAtomPairs, despite its low density, are similar to the ones obtained for the other types of fingerprints.

Table 1. Classification results for the MAO data set using fingerprint, similarity and relative distance matrixes

                                           Active class                   Inactive class
Fingerprint  Data matrix         Accuracy  True positive  False positive  True negative  False negative  AUC
MACCs        Fingerprint         0.91      29             1               33             5               0.92
MACCs        Similarity          0.87      29             1               30             8               0.96
MACCs        Relative distances  0.99      29             1               38             0               1.00
2DAtomPairs  Fingerprint         0.79      30             0               24             14              0.87
2DAtomPairs  Similarity          0.74      25             5               25             13              0.82
2DAtomPairs  Relative distances  0.97      29             1               37             1               0.99
PubChem      Fingerprint         0.82      26             4               30             8               0.92
PubChem      Similarity          0.74      22             8               28             10              0.86
PubChem      Relative distances  0.99      29             1               38             0               0.99
Extended     Fingerprint         0.82      28             2               28             10              0.93
Extended     Similarity          0.82      28             2               28             10              0.89
Extended     Relative distances  0.97      28             2               38             0               1.00
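As a consistency check on Table 1, the accuracy column can be recomputed from the four counts of each row. The following is a minimal sketch, assuming that accuracy is the fraction of correctly classified molecules over the 68 molecules counted in each row; the dictionary layout and the function name are illustrative and not part of the original work.

```python
# Each entry: (reported accuracy, (active correct, active wrong,
#                                  inactive correct, inactive wrong)),
# transcribed from Table 1 (MAO data set).
TABLE_1 = {
    ("MACCs", "Fingerprint"): (0.91, (29, 1, 33, 5)),
    ("MACCs", "Similarity"): (0.87, (29, 1, 30, 8)),
    ("MACCs", "Relative distances"): (0.99, (29, 1, 38, 0)),
    ("2DAtomPairs", "Fingerprint"): (0.79, (30, 0, 24, 14)),
    ("2DAtomPairs", "Similarity"): (0.74, (25, 5, 25, 13)),
    ("2DAtomPairs", "Relative distances"): (0.97, (29, 1, 37, 1)),
    ("PubChem", "Fingerprint"): (0.82, (26, 4, 30, 8)),
    ("PubChem", "Similarity"): (0.74, (22, 8, 28, 10)),
    ("PubChem", "Relative distances"): (0.99, (29, 1, 38, 0)),
    ("Extended", "Fingerprint"): (0.82, (28, 2, 28, 10)),
    ("Extended", "Similarity"): (0.82, (28, 2, 28, 10)),
    ("Extended", "Relative distances"): (0.97, (28, 2, 38, 0)),
}


def accuracy(counts):
    """Fraction of correctly classified molecules (assumed definition)."""
    active_ok, active_wrong, inactive_ok, inactive_wrong = counts
    total = active_ok + active_wrong + inactive_ok + inactive_wrong
    return (active_ok + inactive_ok) / total


for (fingerprint, matrix), (reported, counts) in TABLE_1.items():
    computed = round(accuracy(counts), 2)
    print(f"{fingerprint:12s} {matrix:19s} reported={reported:.2f} computed={computed:.2f}")
```

Under this assumption the recomputed values agree with the reported accuracies to two decimals for every row.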

The low influence of the type, size and density of the fingerprint can be clearly seen in Table 1. To simplify this process, the different fingerprint matrixes have been pre-processed by removing all those bits which are zero for every element of the data set. This pre-processing results in the MACCs fingerprint passing from 166 to 55 bits, the 2DAtomPairs fingerprint from


780 to 32 bits, the PubChem fingerprint from 881 to 182 bits and the Extended fingerprint from 1024 to 433 bits. In our proposal, the input data used by the classification algorithms are distance matrixes. The values of these symmetrical matrixes of size MxM (M being the number of molecules of the data set) are not affected when columns (bits) whose value is zero for every row are eliminated from the fingerprint matrix, so from now on this pre-processing will be used during the experimental process.

A similar analysis has been carried out for the PLA and FON data sets (see Supporting Information). In the case of the PLA data set, the pre-processing stage generates fingerprints of 152, 331, 612 and 1011 bits for MACCs, 2DAtomPairs, PubChem and Extended fingerprints, respectively. For the FON data set, the sizes of the fingerprints after the pre-processing stage are 138, 258, 375 and 1007 bits for MACCs, 2DAtomPairs, PubChem and Extended fingerprints, respectively.

For the PLA data set, the classification accuracy and AUC values using relative distance matrixes are equal to 100% regardless of the type of fingerprint considered. For this data set, relative distance matrixes greatly improve on the classification results of the fingerprint and similarity matrixes (see Supporting Information). However, for the FON data set, although relative distance matrixes slightly improve on the classification results of the fingerprint and similarity matrixes, the values of accuracy and AUC are very similar for all the types of input data (see Supporting Information). For this data set, fingerprint and similarity matrixes already generate very good classification statistics (values of accuracy and AUC close to 1); thus, the improvement obtained from the use of relative distance matrixes is small. The characteristics of this data set, with a lot of similar molecules and high diversity, as described below, make it difficult to improve the classification power.
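The invariance used to justify the pre-processing can be illustrated with a small sketch. It assumes plain binary fingerprints (the frequency weighting described earlier is not modeled) and a made-up 4 x 8 toy matrix; the helper names are hypothetical. Removing the columns that are zero for every row leaves both the Tanimoto similarity matrix and the Euclidean distance matrix unchanged.

```python
import math


def tanimoto(a, b):
    """Tanimoto index of two binary fingerprints given as lists of 0/1."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    # Convention assumed here: two all-zero fingerprints are fully similar.
    return both / either if either else 1.0


def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(rows, func):
    """Symmetrical M x M matrix of func applied to every pair of rows."""
    return [[func(r, s) for s in rows] for r in rows]


def drop_all_zero_columns(rows):
    """Remove the bits (columns) that are zero for every molecule (row)."""
    keep = [j for j in range(len(rows[0])) if any(r[j] for r in rows)]
    return [[r[j] for j in keep] for r in rows]


# Toy fingerprint matrix (illustrative values only).
# Columns 2, 5 and 7 are zero for every row.
fingerprints = [
    [1, 0, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 0],
]
reduced = drop_all_zero_columns(fingerprints)

print(len(fingerprints[0]), "bits before,", len(reduced[0]), "bits after")
print(pairwise(fingerprints, tanimoto) == pairwise(reduced, tanimoto))
print(pairwise(fingerprints, euclidean) == pairwise(reduced, euclidean))
```

An all-zero column contributes nothing to the intersection, the union or the squared differences of any pair of rows, which is why the M x M matrixes are unaffected.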


3.2. Influence of the distance metric

The influence of the metric used to calculate the relative distances between the different fingerprints and the pattern has been studied. For this study, nine different metrics have been tested while keeping the rest of the variables constant (the results shown are for the MAO data set and the MACCs fingerprint), resulting in two different groups of results, as shown in Table 2.

Table 2. Study of the distance metrics for the calculation of relative distances

                     Active class                Inactive class
Metrics  % Accuracy  % TPR  % FNR  % PPV  % FDR  % TPR  % FNR  % PPV  % FDR  AUC
Group A  98.5        97     3      100    0      100    0      97     3      0.99
Group B  97.1        93     7      100    0      100    0      95     5      0.98

TPR: true positive rate; FNR: false negative rate; PPV: positive predicted values; FDR: false discovery rate.
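The per-class rates of Table 2 follow from the counts of a two-class confusion matrix. The sketch below is a worked example under one explicit assumption: the Group A row is taken to correspond to the counts reported in Table 1 for the MACCs relative distance matrix (29 of 30 active and 38 of 38 inactive molecules correctly classified). The function and variable names are illustrative.

```python
def class_rates(correct, missed, wrongly_assigned):
    """Per-class rates in percent.

    correct:          members of the class predicted as that class
    missed:           members of the class predicted as the other class
    wrongly_assigned: members of the other class predicted as this class
    """
    tpr = 100.0 * correct / (correct + missed)
    ppv = 100.0 * correct / (correct + wrongly_assigned)
    return {"TPR": tpr, "FNR": 100.0 - tpr, "PPV": ppv, "FDR": 100.0 - ppv}


# Assumed counts (MACCs, relative distances row of Table 1).
active_ok, active_missed = 29, 1
inactive_ok, inactive_missed = 38, 0

active = class_rates(active_ok, active_missed, inactive_missed)
inactive = class_rates(inactive_ok, inactive_missed, active_missed)
total = active_ok + active_missed + inactive_ok + inactive_missed
accuracy = 100.0 * (active_ok + inactive_ok) / total

print("accuracy:", round(accuracy, 1))
print("active:  ", {name: round(value) for name, value in active.items()})
print("inactive:", {name: round(value) for name, value in inactive.items()})
```

With these counts the rounded values reproduce the Group A row: a single misclassified active molecule lowers the true positive rate of the active class and, at the same time, the positive predicted value of the inactive class.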

The distance metrics Correlation, Cosine, Euclidean, Jaccard and Minkowski (Group A) are the ones that show the best results, whereas the metrics City block, Chebychev, Hamming and Squared Euclidean (Group B) give slightly worse results, with more false negatives in the active class, even though the AUC values are similar. Similar behavior has been observed for the PLA and FON data sets (see Supporting Information). For the PLA data set, the application of different metrics does not affect the classification statistics: in all cases, values of 100% of classification power have been obtained. On the other hand, the behavior of the metrics is slightly different for the FON data set than for the MAO data set. As observed in the Supporting Information tables, the best behavior is obtained when Euclidean or Minkowski distances are used, but for this data set the Correlation and Cosine metrics generate worse results.


In any case, the observed results show that the proposed model has low sensitivity to the distance metric used, given that the data matrixes are built from relative measurements, which are not much affected by the way in which they are calculated.
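For reference, the nine metrics named in this section can be written compactly as follows. This is a sketch using the definitions commonly found in numerical libraries, which may differ in detail from the implementation used in the study; the two example vectors are made up, and the way the resulting distances are turned into relative distance matrixes is not reproduced here.

```python
import math


def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)  # Minkowski distance with p = 2


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    # One minus the cosine of the angle; assumes non-zero vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms


def correlation(x, y):
    # Cosine distance of the mean-centered vectors.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def hamming(x, y):
    # Fraction of positions that differ.
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    # Fraction of differing positions among those where either value is non-zero.
    nonzero = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
    if not nonzero:
        return 0.0
    return sum(a != b for a, b in nonzero) / len(nonzero)


# Illustrative molecule fingerprint and pattern fingerprint (made-up bits).
molecule = [1, 0, 1, 1, 0, 1]
pattern = [1, 0, 0, 1, 0, 0]

METRICS = {
    "Correlation": correlation,
    "Cosine": cosine,
    "Euclidean": euclidean,
    "Jaccard": jaccard,
    "Minkowski (p=3)": lambda x, y: minkowski(x, y, 3),
    "City block": city_block,
    "Chebychev": chebychev,
    "Hamming": hamming,
    "Squared Euclidean": squared_euclidean,
}
for name, metric in METRICS.items():
    print(f"{name:18s} {metric(molecule, pattern):.4f}")
```

On binary vectors several of these metrics are monotonic functions of the number of differing bits (City block and Squared Euclidean coincide, and Hamming is the same count divided by the fingerprint length), so they rank fingerprints identically even though the absolute values differ.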

3.3. Influence of the data set characteristics

The Plasmodium falciparum growth inhibitor (PLA) data set shows different characteristics when compared to the MAO data set. For three of the types of fingerprints used in the study, the corresponding fingerprint matrixes show no columns in which all 560 molecules have a value equal to 1. This results in a pattern fingerprint that is initially an array of 0s and, therefore, during the pre-processing previously described, the size of the fingerprints must be increased by one bit. The Extended fingerprint has two bits common to every molecule, so in this case the pattern fingerprint has not been altered. Thus, Fig. 4 shows that the densities of the different types of fingerprints do not differ as they do for the MAO and FON data sets. This means that the structural characteristics of the molecules in this data set are well captured by the encoding algorithms of every type of fingerprint. Even though there are fingerprints with high density, there are also fingerprints with high darkness, which indicates a high diversity in the molecules of the set. In the PLA data set there are two elements (PubChem codes 433294 and 24867509) that correspond to Li atoms and generate fingerprints with no bits set to 1, or with only one bit set to 1, for the different types of fingerprints. These elements have been removed from the data set.

On the other hand, the Fontaine Factor Xa (FON) data set, which has a similar size to the PLA data set, has different characteristics. For every type of fingerprint, a pattern fingerprint can be obtained,


although it has a very low density, as can be seen in Fig. 4. For the different types of fingerprints, this data set has a higher average density than the MAO or PLA data sets. A high molecular diversity is also present in this data set. However, there is a considerable number of elements in the set which generate exactly the same fingerprint.
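The data set characteristics discussed here (average fingerprint density, pattern density and repeated fingerprints) can be computed with a few lines. The sketch below assumes binary fingerprints and, as a labeled assumption, takes the pattern fingerprint to be the set of bits common to every molecule; the toy matrix and the function names are illustrative only.

```python
from collections import defaultdict


def density(fingerprint):
    """Fraction of bits set to 1."""
    return sum(fingerprint) / len(fingerprint)


def common_bit_pattern(rows):
    """Assumed pattern fingerprint: bits set to 1 in every molecule."""
    return [int(all(r[j] for r in rows)) for j in range(len(rows[0]))]


def duplicate_groups(rows):
    """Groups of row indices that share exactly the same fingerprint."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Toy data set (made-up bits); molecules 0 and 2 share one fingerprint.
fingerprints = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
]

pattern = common_bit_pattern(fingerprints)
average_density = sum(density(fp) for fp in fingerprints) / len(fingerprints)

print("pattern:", pattern)
print("pattern density:", round(density(pattern), 2))
print("average density:", round(average_density, 2))
print("duplicated fingerprints:", duplicate_groups(fingerprints))
```

Molecules that fall in the same duplicate group produce identical rows and columns in any matrix derived from the fingerprints, whatever their structural differences.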

Figure 4. Characteristics of PLA and FON data sets for the four types of fingerprints studied (panels: Plasmodium falciparum; Fontaine Xa inhibitor)

The existence of identical fingerprints for different molecular structures tells us that the types of fingerprints used are not capable of capturing decisive information (lack of


information in the representation model). This results in structurally different molecules being represented by the same fingerprint and, therefore, by the same row and column in the similarity and relative distance matrixes, which means that some deviations will appear in the classification model built, as will be described later. Despite these characteristics of the PLA and FON data sets, the behavior observed when generating a classification model with the different types of fingerprints is similar to the one described for the MAO data set.

Table 3. Classification models for PLA and FON data sets using MACCs fingerprint

                                        Active class                   Inactive class
Data set  Data matrix         Accuracy  True positive  False positive  True positive  False positive  AUC
PLA       Fingerprint         0.87      159            42              319            28              0.92
PLA       Similarity          0.88      162            39              319            28              0.92
PLA       Relative distances  1.00      201            0               347            0               1.00
FON       Fingerprint         0.92      258            21              143            13              0.97
FON       Similarity          0.92      260            19              139            17              0.97
FON       Relative distances  0.97      272            7               148            8               0.98

The results obtained using the linear-SVM algorithm are similar, independently of the type of fingerprint used and of the existence or absence of a pattern fingerprint. The results obtained with this algorithm were slightly better than the ones for MACCs and PubChem fingerprints. Table 3 includes the results for the MACCs fingerprint. As can be seen, for the PLA data set, good classification models were obtained when using fingerprint and similarity matrixes thanks to the characteristics of the data set, previously described. Thanks to these characteristics, perfect classification models are obtained (accuracy and AUC equal to 100%) when relative distance measurements are used.


In the case of the FON data set, as can be seen in Table 3, good classification models are also obtained with fingerprint and similarity matrixes, despite the characteristics already described for this data set. In any case, the use of relative distance matrixes noticeably improves the model built, yielding values of accuracy and AUC close to 100%.

4. External validation

Generally, the external validation is carried out by selecting a set of molecules from each data set, which will be used for the training phase, and another set of molecules that will be used in the test phase. These two sets are created before the training phase, resulting in a strong model being built in the training phase that allows good results in the test phase. When data sets are represented by MxN matrixes, where M is the number of molecules and N is the set of variables that characterizes the molecules, the external validation is a simple process, since the model in the training phase has been built with the same set of N variables that will be used to represent the test set. However, when the data set is represented by MxM matrixes, the process becomes more complex. If the training set is composed of TR