
Article pubs.acs.org/JCTC

Cite This: J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Computing the Free Energy without Collective Variables

Alex Rodriguez,† Maria d'Errico,† Elena Facco,† and Alessandro Laio*,†,‡

†SISSA, Scuola Internazionale Superiore di Studi Avanzati, via Bonomea 265, I-34136 Trieste, Italy
‡ICTP, International Centre for Theoretical Physics, Strada Costiera 11, I-34100 Trieste, Italy

Supporting Information

ABSTRACT: We introduce an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules. The approach exploits the presence of correlations between the coordinates, induced, in molecular dynamics, by the chemical nature of the molecules. Due to these correlations, the data points lie on a manifold that can be highly curved and twisted, but whose dimension is normally small. We estimate the free energies by finding, with a statistical test, the largest neighborhood in which the free energy in the embedding manifold can be considered constant. Importantly, this procedure does not require defining the manifold explicitly, and it provides an estimate of the error that is approximately unbiased up to large dimensions. We test this approach on artificial and real data sets, demonstrating that the free energy estimates are reliable for data sets on manifolds of dimension up to ∼10, embedded in an arbitrarily large space. In practical applications our method permits the estimation of the free energy in a space of reduced dimensionality without specifying the collective variables defining this space.

1. INTRODUCTION
Molecular dynamics and Monte Carlo simulations provide insight into the finite temperature properties of molecular systems by sampling the probability distribution of the coordinates of each atom. This probability is defined in an extremely high dimensional space. Therefore, it is common practice to project it onto a few collective variables (CVs). The CVs can be any explicit function of the coordinates, such as an angle, a distance, or a coordination number. The free energy is then defined as the negative of the logarithm of the probability distribution of the CVs in kBT units, so that a maximum in the reduced probability distribution corresponds to a minimum in the free energy. In molecular systems characterized by the presence of two or more metastable states, the free energy as a function of a suitably chosen CV has two or more local minima. The height of the barrier separating those minima is an indication of how likely the transition between the metastable states is. Estimating the free energy is at the root of computational physics and chemistry, and several approaches are available to perform this task.1−3 Choosing the appropriate CVs can be cumbersome, and it normally requires a rather detailed understanding of the process under study. However, nowadays many methods are able to provide a statistically converged sampling of the probability distribution of complex systems without specifying a priori the collective variables4,5 or using a very large set of CVs.6 This motivates the search for methods of analysis capable of finding the most appropriate CVs by analyzing a converged sampling of the full probability distribution. For example, one can attempt to reduce the dimensionality of the data while preserving as much information as possible7,8 and compute the probability

distribution by means of histograms. This approach is followed in principal component analysis9 and, within more advanced formulations, in diffusion maps,10 locally linear embedding,11 tree-preserving embedding,12 and sketch-map.13 However, reducing the dimensionality can wash out essential kinetic or even thermodynamic information.14 For instance, a two-dimensional potential energy landscape including four minima connected in a loop by barriers of comparable height cannot be faithfully represented using a single CV. Even more importantly, estimating a probability distribution or a free energy with histograms in more than three to four dimensions is computationally difficult due to the so-called curse of dimensionality.15 Any procedure fixing the number of dimensions to two or three for visualization purposes or to enhance the convergence of the estimate can lead to an important information loss if the intrinsic dimensionality of the system is larger. We here introduce a technique that provides estimates of the free energy in high-dimensional spaces without performing any explicit dimensional reduction. Our approach requires only defining a metric, namely, a distance between pairs of configurations, such as the RMSD or the dihedral distance. The free energy is then estimated in the coordinate space associated with the metric (for example, the Cα coordinates or all the dihedral angles). The method is rooted in the idea of density estimation, a popular field of research in data science. The available approaches for estimating the density16−19 can be broadly divided into parametric and nonparametric.

Received: August 29, 2017
Published: February 5, 2018

DOI: 10.1021/acs.jctc.7b00916 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX


In parametric methods it is assumed that the underlying density function has a specific functional form, for example a mixture of Gaussians.20 In nonparametric methods no strong assumption is made on the functional form of the density function.15,17,21 A popular nonparametric method is the kernel estimator,22,23 where the density is estimated by a sum of kernel functions centered at each point. Another popular nonparametric estimator is the k-nearest neighbor estimator (k-NN),24 where the local density at a point is measured as the ratio of the number of nearest neighbors k to the volume they occupy. A common trait of nonparametric methods is the dependence on a cutoff (usually called the smoothing parameter), which determines the length scale of influence of a single point.15,18,25 This cutoff is represented by the width of the kernel function in kernel estimators and by the value of k in k-NN. It is well-known that using a global smoothing parameter in highly inhomogeneous data sets induces systematic errors (see refs 17 and 26 and references therein). In order to address this problem, some authors propose to use a smoothing parameter that varies between data points.15,27 A point-dependent smoothing parameter can be determined, for example, by optimizing a global measure of the discrepancy of the density estimate from the true density.17,28 However, this global measure is generally a nonlinear function of all the data points,29 and its optimization can be computationally demanding for large data sets. A different approach for determining the smoothing parameter is point-adaptive estimation, first proposed by Lepskii30 and further developed in works by Spokoiny, Polzehl, and others.31−36 A related approach is introduced in Gach et al.,34 where the density estimator is chosen among a family of Haar wavelets with variable resolution levels. This method is based on the hypothesis that the density is locally constant, but it is limited to univariate density functions.

Our approach is based on the ideas introduced in these works, and it is tailored to the challenging case of molecular simulations. In this case, the data points (namely, the configurations) are embedded in a high-dimensional space, but the probability distribution is extremely anisotropic due to the presence of steric restraints determined by the chemical nature of the molecules. These restraints prevent the system from moving significantly in most of the possible directions, reducing in practice the dimensionality of the space. We introduce a procedure based on finding, for each point, the size of the neighborhood in which the free energy is approximately constant. This test is automatically performed only along the "soft" directions, excluding those that are practically forbidden due to the presence of the restraints. In order to benchmark the accuracy of our free energy estimate, we considered data sets drawn from inhomogeneous probability distributions defined on curved manifolds in spaces with dimension ranging between 2 and 16. These manifolds are further embedded in a space with a much higher dimensionality, mimicking the conditions observed in molecular systems. The new estimator successfully predicts the free energy values, performing better than other methods. Importantly, the approach also provides an estimate of the uncertainty of the free energy. An accurate estimate of the uncertainty can be useful in advanced analysis tools based on density estimates.37−40 Finally, we analyze a molecular dynamics trajectory of an RNA trinucleotide and a folding trajectory of a small protein. We show that our approach allows an estimate of the free energy in the space of the coordinates of these molecules, without performing any explicit dimensional reduction.
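To make the role of the global smoothing parameter concrete, the fixed-width kernel baseline mentioned above can be sketched as follows. This is an illustrative sketch, not the implementation used in the paper; the bandwidth dc, the function names, and the toy data are arbitrary choices.

```python
import numpy as np

def gaussian_kde_fixed(points, query, dc):
    """Fixed-bandwidth Gaussian kernel density estimate in d dimensions.

    Every sample contributes a Gaussian of width dc centered on itself;
    a single global dc plays the role of the smoothing parameter
    discussed in the text.
    """
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    n, d = points.shape
    norm = (2.0 * np.pi * dc**2) ** (d / 2.0)
    sq_dists = ((query[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / dc**2).sum(axis=1) / (n * norm)

# toy check: 2000 points from a standard 2D Gaussian
rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))
rho = gaussian_kde_fixed(sample, np.array([[0.0, 0.0]]), dc=0.3)
# the true density at the origin is 1/(2*pi), about 0.159
```

A single dc must trade off bias (too wide) against noise (too narrow) everywhere at once, which is exactly what fails on inhomogeneous data sets.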

2. LIKELIHOOD ESTIMATE OF THE FREE ENERGY
Let {X1, ..., XN} be a set of N independent and identically distributed random vectors with values in $\mathbb{R}^D$. We assume that the Xi lie on a manifold of dimension d ≤ D, constant across the data set. We aim to estimate the local density around each point Xi, defined with respect to the Lebesgue measure41 on the hyperplane of dimension d locally tangent to the manifold at that point. We will discuss later how d can be estimated in practical cases. We avoid projecting the points explicitly onto the manifold, since in a sufficiently small neighborhood of each point the distance on the manifold is well approximated by the Euclidean distance in $\mathbb{R}^D$. Let $\{r_{i,l}\}_{l \le k}$ denote the sequence of ordered distances from point i to its first k nearest neighbors: $r_{i,1}$ is the distance between i and its nearest neighbor, $r_{i,2}$ the distance to its second nearest neighbor, and so on. The volume on the tangent hyperplane of the hyperspherical shell enclosed between neighbors l − 1 and l is given by

$v_{i,l} = \omega_d \left( r_{i,l}^{\,d} - r_{i,l-1}^{\,d} \right)$    (1)

where the proportionality constant $\omega_d$ is the volume of the d-dimensional sphere of unit radius. We conventionally set $r_{i,0} = 0$. It can be proved that if the density is constant, all the $v_{i,l}$ are independently drawn from an exponential distribution with rate equal to the density ρ (see ref 42 for a derivation). Therefore, the log-likelihood function of the parameter ρ given the observation of the k-nearest-neighbor distances from point i is

$\mathcal{L}(\rho \mid \{v_{i,l}\}_{l \le k}) \doteq \mathcal{L}_{i,k}(\rho) = k \log\rho - \rho \sum_{l=1}^{k} v_{i,l} = k \log\rho - \rho V_{i,k}$    (2)

where $V_{i,k} = \sum_{l=1}^{k} v_{i,l}$ is the volume of the hypersphere centered at i containing k data points. By maximizing $\mathcal{L}$ with respect to ρ, we find $\rho = k/V_{i,k}$, the standard k-NN estimator of the local density.17 The estimated error on ρ is given by the asymptotic standard deviation of the parameter estimate:

$\varepsilon_\rho = \frac{\rho}{\sqrt{k}} = \frac{\sqrt{k}}{V_{i,k}}$
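In code, the maximizer $\rho = k/V_{i,k}$ of eq 2 and its asymptotic error can be sketched as follows. This is a brute-force illustration; the function name, the choice of k, and the uniform test data are our own, not taken from the paper.

```python
import numpy as np
from math import gamma, pi

def knn_density(X, k, d):
    """Standard k-NN density estimate rho = k / V_{i,k} (maximizer of eq 2).

    X: (n, d) sample.  V_{i,k} is the volume of the d-ball around point i
    reaching its kth neighbor.  Returns rho and eps = rho / sqrt(k).
    """
    X = np.asarray(X, dtype=float)
    omega_d = pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)   # unit-ball volume
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                      # exclude self-distance
    r_k = np.sort(dist, axis=1)[:, k - 1]               # distance to kth neighbor
    rho = k / (omega_d * r_k ** d)
    return rho, rho / np.sqrt(k)

# sanity check: 1000 uniform points on the unit square have number density ~1000
rng = np.random.default_rng(5)
rho, eps = knn_density(rng.uniform(size=(1000, 2)), k=32, d=2)
```

Points near the boundary of the support are biased low (their neighbors lie only on one side), one of the inhomogeneity effects that motivates the adaptive choice of k below.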

Parameter d in eq 1 is the dimension of the manifold containing the data points. It can be estimated by one of the approaches for computing the intrinsic dimension of data sets.43 We here estimate it by the recently published TWO-NN method.42 In short, in this approach d is estimated from the slope of the line fitting $-\log\left(1 - F^{\mathrm{emp}}(r_{i,2}/r_{i,1})\right)$ as a function of $\log(r_{i,2}/r_{i,1})$, where $F^{\mathrm{emp}}$ denotes the empirical cumulative distribution function (see the original work for the motivation of this formula). Since d is estimated from the statistical properties of the distances to the first and second neighbors of each data point, the approach is robust to density variations, provided the density can be considered constant within the distance to the second neighbor. Moreover, the approach allows estimating d as a function of the scale, distinguishing the relevant dimensions when the manifold is perturbed by a high-dimensional noise, a situation often encountered in molecular dynamics.

2.1. Finding the Optimal Number of Neighbors. The error of the standard k-NN density estimator gets smaller as k increases. However, when k increases, the density in the neighborhood within a distance $r_{i,k}$ of data point i can become nonconstant. This may induce systematic errors in the k-NN estimate. In our approach, we choose as k the largest possible value for which the condition of constant density holds within a given level of confidence. In order to find the optimal k for each point i we compare, for increasing k, the maximum likelihood of two models:

(1) Model 1 (M1), in which the densities at point i and at its (k + 1)th nearest neighbor are different. We denote by j the index of the (k + 1)th nearest neighbor of point i. The likelihood of M1 is obtained by maximizing $\mathcal{L}_{i,k}(\rho) + \mathcal{L}_{j,k}(\rho')$ with respect to two independent parameters ρ and ρ′. This gives

$\mathcal{L}_{M1} = \max_{\rho,\rho'} \left[ \mathcal{L}_{i,k}(\rho) + \mathcal{L}_{j,k}(\rho') \right] = k \log\left(\frac{k^2}{V_{i,k} V_{j,k}}\right) - 2k$

Figure 1. Schematic representation of the density estimation for a two-dimensional example. (A and B) Sample of 2000 points extracted from a uniform distribution, and the same sample with 2000 additional points extracted from a Gaussian distribution. (C and D) $D_k$, given in eq 3, as a function of k, estimated for the two points highlighted in orange in panels A and B. The green line corresponds to the threshold above which the probability that the densities in the neighborhoods of points i and j can be described with a single parameter is smaller than 10^−6.

(2) Model 2 (M2), in which the density at point i is assumed to be equal to the density at its (k + 1)th nearest neighbor of index j. The likelihood of M2 is obtained by maximizing $\mathcal{L}_{i,k}(\rho) + \mathcal{L}_{j,k}(\rho)$ with respect to a single parameter ρ. This gives

$\mathcal{L}_{M2} = \max_{\rho} \left[ \mathcal{L}_{i,k}(\rho) + \mathcal{L}_{j,k}(\rho) \right] = 2k \log\left(\frac{2k}{V_{i,k} + V_{j,k}}\right) - 2k$

The two models are compared by a likelihood ratio test,44 which involves estimating the difference

$D_k = -2(\mathcal{L}_{M2} - \mathcal{L}_{M1}) = -2k\left[\log V_{i,k} + \log V_{j,k} - 2\log(V_{i,k} + V_{j,k}) + \log 4\right]$    (3)

We consider the two models distinguishable with a high statistical confidence if $D_k$ is larger than a threshold value $D_{\mathrm{thr}}$. Since the number of parameters of the two models differs by one, $D_k$ has an approximate χ² distribution with one degree of freedom.45 We take $D_{\mathrm{thr}}$ = 23.928, which corresponds to a p-value of 10^−6. For each point i we choose the optimal value of k, hereafter denoted by $\hat{k}_i$, according to the condition

$\hat{k}_i : \left(D_k < D_{\mathrm{thr}} \ \forall\, k \le \hat{k}_i\right) \wedge \left(D_{\hat{k}_i + 1} \ge D_{\mathrm{thr}}\right)$

This implies that for point i models M1 and M2 are consistent within a p-value of 10^−6 for all the choices of k smaller than or equal to $\hat{k}_i$. To illustrate this procedure, we consider a sample of 2000 points extracted from a uniform distribution (Figure 1A) and the same sample with 2000 additional points extracted from a Gaussian distribution (Figure 1B). In Figure 1C,D, we show the value of $D_k$ as a function of k, estimated respectively for a point in the uniform data set and for a point near a region of high curvature in the Gaussian data set (highlighted in orange in panels A and B). In the case of the Gaussian distribution, as a consequence of the nonuniform density, there exists a value of k where $D_k$ is greater than the threshold value, implying that the density around point i is significantly different from the density around its neighbor j.

2.2. Computing the Free Energy. Once the optimal k for point i is found, it is possible to compute the free energy of point i from the density $\rho_i$ as $F_i = -\log\rho_i = -\log\left(\hat{k}_i / V_{i,\hat{k}_i}\right)$ (to simplify the notation, we neglect the constant term log N and set kBT equal to one). However, we verified that this estimator is affected by small systematic errors, which become important if the data points are embedded in a high-dimensional manifold. Indeed, in the procedure described in the previous section, at the exit value $\hat{k}_i$ the log-likelihoods of the two models M1 and M2 are distinguishable with a very high statistical confidence (p-value = 10^−6). It is therefore likely that the free energy close to the $\hat{k}_i$th neighbor is already substantially different from the free energy close to data point i. In order to correct this trend, we use a likelihood model depending explicitly on the free energy and including an extra variational parameter, hereafter denoted by a, aimed at describing the linear trend of the free energy as one moves further and further from the central data point. This leads to the following log-likelihood:

$\mathcal{L}(F, a \mid \{v_{i,l}\}_{l \le \hat{k}_i}) = \sum_{l=1}^{\hat{k}_i} \log\left(e^{-F+al}\, e^{-e^{-F+al} v_{i,l}}\right) = -F \hat{k}_i + a\,\frac{\hat{k}_i(\hat{k}_i+1)}{2} - \sum_{l=1}^{\hat{k}_i} v_{i,l}\, e^{-F+al}$    (4)
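The two steps described so far (scanning k with the statistic of eq 3, then maximizing eq 4 by Newton−Raphson) can be sketched as below. This is a schematic illustration, not the authors' implementation: the neighbor bookkeeping (finding j and the volumes $V_{j,k}$ around it) is left out, the function and variable names are our own, and the error formula of eq 5 (given in the text that follows) is evaluated at the end.

```python
import numpy as np
from math import gamma, pi

D_THR = 23.928          # chi^2 threshold, 1 dof, p-value = 1e-6

def shell_volumes(r, d):
    """v_{i,l} of eq 1 from the ordered neighbor distances r (r_{i,0} = 0)."""
    omega_d = pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)
    return omega_d * np.diff(np.concatenate(([0.0], np.asarray(r, float) ** d)))

def Dk(k, Vi, Vj):
    """Likelihood-ratio statistic of eq 3 for V_{i,k} and V_{j,k}."""
    return -2.0 * k * (np.log(Vi) + np.log(Vj)
                       - 2.0 * np.log(Vi + Vj) + np.log(4.0))

def optimal_k(V_i, V_j, k_max):
    """k_hat: largest k with D_k below threshold for all smaller k.

    V_i[k-1] = V_{i,k}; V_j[k-1] = V_{j,k} around the (k+1)th neighbor
    j of point i (the neighbor bookkeeping is left schematic here).
    """
    k_hat = 1
    for k in range(1, k_max):
        if Dk(k, V_i[k - 1], V_j[k - 1]) >= D_THR:
            break
        k_hat = k
    return k_hat

def pak_estimate(v, n_iter=100):
    """Maximize eq 4 over (F, a) by Newton-Raphson.

    v[l-1] = v_{i,l} are the k_hat shell volumes of eq 1; returns the
    free energy F_i, the slope a, and the error of eq 5.
    """
    v = np.asarray(v, dtype=float)
    k = len(v)
    l = np.arange(1, k + 1)
    F, a = -np.log(k / v.sum()), 0.0        # start from the plain k-NN value
    for _ in range(n_iter):
        w = v * np.exp(-F + a * l)          # v_{i,l} * rho_l
        grad = np.array([-k + w.sum(),
                         k * (k + 1) / 2.0 - (l * w).sum()])
        hess = np.array([[-w.sum(), (l * w).sum()],
                         [(l * w).sum(), -(l * l * w).sum()]])
        F, a = np.array([F, a]) - np.linalg.solve(hess, grad)
    eps = np.sqrt((4.0 * k + 2.0) / ((k - 1.0) * k))   # eq 5
    return F, a, eps
```

As a sanity check, equal volumes give $D_k = 0$ (the single-parameter model suffices), and shells of unit volume yield F = 0 and a = 0, the constant-density solution.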


Figure 2. (Top and middle rows) Performance of the density estimator on two artificial data sets: (left) negative of the free energy of the two data sets; (center) average L1 error ϵ for different values of dc (top x-axis) and different values of k (bottom x-axis); for the case in the top row, the error is shown separately for each of the nine Gaussians that compose the data set; we compare three density estimators: fixed Gaussian kernel (red dots), standard k-NN (blue triangles), and PAk (blue dashed line); (right) distribution of the pull for one realization of the whole data set, with the same color code; the normal Gaussian distribution is also shown as a reference (black dashed line). (Bottom row) Performance of the density estimator on artificial data sets of different dimensions d. The data sets are composed of four Gaussians of different heights and variances in a hypercube at different dimensions d (see the Supporting Information for a detailed description). The color codes are the same as those for the top and middle panels.
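The two quality measures used in these benchmarks, the average L1 error and the pull of eq 6, can be sketched as a self-contained toy check. The synthetic free energies, the sample size, and the error level below are arbitrary choices for illustration only.

```python
import numpy as np

def l1_error(F_true, F_est):
    """Average L1 error between true and estimated free energies."""
    return np.abs(np.asarray(F_true, float) - np.asarray(F_est, float)).mean()

def pull(F_true, F_est, eps):
    """Pull of eq 6: (F_true - F_est) / eps, one value per point.

    For an unbiased estimator with correctly predicted errors, the pull
    is normally distributed with zero mean and unit variance.
    """
    return ((np.asarray(F_true, float) - np.asarray(F_est, float))
            / np.asarray(eps, float))

# toy check: estimates scattered around the truth with known error 0.3
rng = np.random.default_rng(3)
F_true = rng.uniform(0.0, 5.0, size=20000)
eps = np.full(F_true.size, 0.3)
F_est = F_true + 0.3 * rng.normal(size=F_true.size)
d = pull(F_true, F_est, eps)
# mean(d) ~ 0, std(d) ~ 1, l1_error ~ 0.3 * sqrt(2/pi)
```

A nonzero mean of the pull signals bias; a standard deviation away from one signals mis-estimated errors, which is how the figure discriminates the three estimators.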

Upon setting a = 0, this likelihood reduces to the one defined in eq 2 with F = −log ρ. We maximize $\mathcal{L}$ in eq 4 with respect to F and a by the Newton−Raphson approach. The asymptotic standard deviation of the estimate is obtained by computing the corresponding element of the inverse of the Fisher information matrix (see for instance ref 46):

$\varepsilon_i = \sqrt{\frac{4\hat{k}_i + 2}{(\hat{k}_i - 1)\hat{k}_i}}$    (5)

The pointwise-adaptive k-NN free energy estimator, in short PAk, is given by the $F_i$ obtained by maximizing eq 4. The uncertainty of this estimate is given by eq 5.

3. RESULTS
3.1. Validation of the PAk Estimator. The difficulties that hinder the estimate of the free energy in real systems are many: the existence of minima with very different free energies, the anisotropy of the shape of the free energy minima, a possibly large dimension of the manifold in which the data lie, and a large curvature of this manifold. In this section we benchmark the performance of PAk in numerical experiments where these problems are present. Moreover, we compare the results with those obtained using other well-established density estimators: (i) the standard k-NN method and (ii) the Gaussian kernel method, in which the density is estimated as a sum of Gaussians of fixed width dc centered on each data point of the sample. To compare the accuracy of the three methods, we measure the


deviation of the free energy estimate from the corresponding true value by the average L1 error $\epsilon = \frac{1}{N}\sum_{i=1}^{N} |F_i^{\mathrm{true}} - F_i|$, where $F_i^{\mathrm{true}}$ is the free energy value of point i and $F_i$ is the estimated free energy. In addition, the approach introduced in this work provides an estimate of the uncertainty $\varepsilon_i$ of the free energy $F_i$ (eq 5). In order to assess the performance of this estimate, we computed the distribution of the pull,47 defined for point i as

$\Delta_i = \frac{F_i^{\mathrm{true}} - F_i}{\varepsilon_i}$    (6)

A shift of the average Δ from zero implies that the free energy estimate is biased. Moreover, if the estimated error predicts correctly the difference between the true and the estimated free energy, the standard deviation of Δ should be close to one and the probability distribution of Δ should be Gaussian.

We first considered a data set of 12000 points extracted from a two-dimensional probability distribution with nine Gaussians. The ground truth free energy is shown in the left panel of Figure 2. The standard deviation σ is equal to 0.1 for the three Gaussians A, B, and C, to 1 for D, E, and F, and to 3 for G, H, and I. From each of the three Gaussians with the same σ we extract 300, 1000, and 3000 points. This leads to differences of 8 logarithmic units between the free energy minima. To examine the performance of the three methods, we averaged ϵ over 100 statistically independent realizations. We investigated how each Gaussian contributes to the performance by computing the error restricted to the points belonging to each of the nine square boxes on which the Gaussians are centered. The results are shown in the central panel. In the case of the fixed Gaussian kernel method (red dots), the dc parameter was varied in the interval 0.01−1.0. In the case of standard k-NN (blue triangles), k was varied between 8 and 200. PAk (blue dashed line) is parameter independent, so its ϵ is a single value. The values of dc and k that correspond to a small relative error depend on the characteristics of the density function, and the optimal value is different for different Gaussians. Thus, a value of dc, or k, that is optimal for all the Gaussians together cannot be found. Remarkably, the error of PAk is comparable with that of a standard k-NN method in which the value of k is chosen independently for each Gaussian.

We also computed Δ, defined in eq 6, for one realization of the whole data set (right panel). For the k-NN and the Gaussian kernel methods, the values of dc and k are chosen by minimizing the estimated error. For the Gaussian kernel method the error is evaluated by bootstrap.15 For the k-NN method the error is given by $1/\sqrt{k}$. It can be seen that the pull distribution obtained with PAk resembles almost perfectly a Gaussian with zero mean and standard deviation equal to one, as expected from an unbiased estimator that properly accounts for the error. The error is overestimated by the Gaussian kernel method, while it is underestimated by the k-NN method (notice the fat tail of the distribution at the right). It should be emphasized that the results for the Gaussian kernel and k-NN are obtained with the values that minimize the error, a procedure that can be carried out only on artificial data sets, where the ground truth value of the free energy is known. The performance of these approaches on real-world data sets is therefore likely to be even worse.

The second case we considered is a nonisotropic free energy surface, shown in the left panel of Figure 2. In this landscape40 the free energy varies at significantly different rates in different directions. Again, the performance of the three methods is shown in the center panel. Also in this case, PAk achieves an accuracy similar to that of the nonadaptive estimators with their optimal parameters, and the pull distribution obtained strongly resembles a Gaussian with zero mean and variance equal to one (right panel). On the same data set, the Gaussian kernel estimator and the k-NN estimator with optimal parameters (dcopt = 0.022 and kopt = 20) underestimate the error.

We also benchmarked the approach on data sets of high intrinsic dimension harvested from a probability distribution including four Gaussians (see the Supporting Information for a detailed description of these distributions). The average L1 error of PAk ranges from ∼0.1 for d = 2 to ∼0.7 for d = 16 (see Table 1). The performances are slightly worse than those of a standard k-NN estimator in which the value of k is optimized by minimizing the error, while considerably better than the best performances achieved with the Gaussian kernel.

Table 1. Average L1 Error for the Data Sets in Different Intrinsic Dimensions with Different Methods

             PAk     k-NN            Gaussian kernel
dimension    ϵ       kopt     ϵ      dcopt     ϵ
2            0.11    141      0.10   0.015     0.20
4            0.22    69       0.16   0.065     0.36
8            0.40    33       0.30   0.155     0.76
16           0.68    23       0.59   0.245     1.21

Once again, we recall that in practical cases it is not possible to optimize the value of k and the value of dc. Therefore, the error values shown in Table 1 for the k-NN and the Gaussian kernel methods should be considered lower bounds. The distributions of $\Delta_i$ for the data sets with d > 2 shown in Figure 2 display an approximately Gaussian trend. However, for d ≥ 8 and, even more significantly, for d = 16, the distribution of the Δ is no longer centered at zero and its standard deviation is larger than one, indicating that the density estimate starts becoming biased. At these high dimensions the linear correction in eq 4 can no longer capture the trend of the variation of the density as a function of the distance from each point. In fact, due to the curse of dimensionality, the first neighbors are already far away on average, and the density varies quickly with k. The loss of performance is even more evident for the other two density estimators we considered. Even when the values of k and dc used in these approaches are optimized to boost their accuracy, the error prediction of the PAk estimator remains the most accurate. Importantly, the error of PAk remains limited even at the highest d we considered. In the next sections we will illustrate that PAk is suitable for practical applications even for data sets with an intrinsic dimension of 10−15.

3.2. Free Energy Surface Reconstruction on Curved and High-Dimensional Manifolds. With the aim of testing our approach in more realistic scenarios, we then considered free energy surfaces of molecular systems obtained with enhanced sampling methods. We considered three systems, each studied with a different number of collective variables: the nucleation of the C-terminal of amyloid-β, projected on two collective variables,48 the folding of the third IgG-binding domain of protein G from


Figure 3. Results on resampled free energy surfaces. On the left, a cartoon representation of the resampling and transformation process. The initial FES is resampled with a probability proportional to exp(−F). The first two coordinates are then transformed into a Swiss roll, and the whole data set is embedded in a 20-dimensional space by rotating around a random vector. In the middle column we show the correlation plot between the real free energy and the one estimated by maximizing eq 4, with the error bars obtained from eq 5. The panels at the right show the distribution of the pull. The systems under study are (A) the 2D projection of the nucleation of the C-terminal of amyloid-β,48 (B) the 4D projection of the folding of the third IgG-binding domain of protein G from streptococcal bacteria (GB3),49 and (C) the 7D projection of the conformational space of the intrinsically disordered protein human islet amyloid polypeptide (hIAPP).50

streptococcal bacteria (GB3), projected on four collective variables,49 and the conformational space of the intrinsically disordered protein human islet amyloid polypeptide (hIAPP), projected on seven collective variables.50 All the systems were treated in the same way, summarized in the cartoon of Figure 3. Initially, the FES was resampled in the space of the collective variables with a probability proportional to the exponential of the negative of the free energy value. Then, the data points were twisted into a Swiss roll by splitting the first of the coordinates in two by means of the transformation x1 = x cos(x) and x2 = x sin(x). Finally, a rotation around a random vector in D = 20 was performed. In this manner each point sampled from the original distribution is embedded in a 20-dimensional space.

The free energies and their associated uncertainties for the transformed set of points are computed according to eq 4 and eq 5 and compared with the original free energies. The parameter d in eq 1 was estimated using the approach in ref 42. We remark that estimating the free energy using the value of the intrinsic dimension d rather than the dimension of the space in which the data are embedded is essential in order to obtain meaningful results. The results are summarized in the central and right columns of Figure 3. It can be seen that the method, especially at low intrinsic dimension, reproduces the ground truth free energy value even after applying a nonlinear transformation of the coordinates which induces a curvature in the manifold hosting the data points. As in the artificial test cases, the performance is worse at high intrinsic dimension, but


Figure 4. Analysis of a MD trajectory of the trinucleotide AAA.51 (A) Configurations projected on the two main TIC and colored according to their free energy computed in the RMSD space. The coordinates of the representative structures for the four Markov states found in ref 51 are highlighted with green crosses, and their structures are shown nearby. (B) Correlation plot between the free energies computed with PAk using as metric the RMSD and the Euclidean distance between the TIC coordinates. (C) Correlation plot between the free energy computed by applying PAk to all the TIC coordinates and by applying it to the first two TIC coordinates. (D) Pearson correlation coefficient between the free energies estimated with TIC coordinates and those estimated using the RMSD, as a function of the number of TIC components.

even in these cases a fairly good correlation is retained. The pull distributions resemble the trend shown in the tests of Figure 2. In case A, with intrinsic dimension equal to two, the distribution resembles a Gaussian with zero mean and standard deviation equal to one. In the case of GB3, the distribution of the pull is also almost correct. Even in the case with the highest intrinsic dimension (d = 7) the pull distribution still resembles a Gaussian.

3.3. Free Energy of a Molecular System in the Space of the Solute Coordinates. We finally analyzed a molecular dynamics trajectory of the RNA trinucleotide AAA from ref 51. This trajectory explores several metastable configurations, characterized by important structural differences, and was analyzed in the original reference using time-lagged independent component (TIC)52 analysis, a powerful dimensional reduction technique. On the same trajectory we computed the free energy by PAk using as a metric the root-mean-square deviation (RMSD) between all the atoms in the trinucleotide. Since the number of atoms is 98, and the dynamics is performed by constraining the length of the 106 bonds, the dimension of the embedding space is D = 182. We first computed the intrinsic dimension of the trajectory by the approach in ref 42, obtaining a value of 10. We then computed the free energy of each configuration using the PAk estimator. The results are shown in Figure 4A. The x- and y-axes correspond to the two main TIC.51 The color code represents the free energy of each structure obtained by PAk. As can be seen, the results are qualitatively compatible with those of the original analysis, with all the Markov states found in ref 51 (marked with crosses) corresponding to low free energy regions.

In order to assess more quantitatively the reliability and the meaning of the free energies computed by the PAk estimator, we then analyzed the same trajectory mapped on a space of reduced dimensionality using time-lagged independent component analysis. We first kept the 17 main components. This corresponds to a very mild dimensional reduction, in which the relevant kinetic information is likely to be retained. The intrinsic dimension42 of the trajectory mapped on these coordinates is 9, a value consistent with the one estimated using the original Cartesian coordinates. The free energies estimated by PAk in the TIC space and in the Cartesian coordinate space are compared in Figure 4B. Remarkably, the values estimated in the two spaces are well correlated (with a Pearson correlation coefficient of r = 0.738). The free energies estimated in the two spaces are almost consistent within their error bars. This shows that the free energies estimated with PAk are relatively robust with respect to metric transformations, and they are very similar when they are computed in the space of the original coordinates or in a space of collective variables that are compliant with the kinetics (the TIC components, in the case of this example). As a case of "extreme" dimensional reduction, we then computed the free energy using only the first two TIC components, namely, in the coordinates used in panel A. The correlation plot between these free energies and the ones computed in the space of all the solute coordinates (i.e., using RMSD as metric) is shown in panel C. Clearly, the dimensional reduction leads to an important information loss, since the free energies estimated in the two spaces are much less correlated. We should remark that the first two TIC components are possibly the best choice for performing a dimensional reduction to two dimensions, since they are maximally compliant with the kinetics by

DOI: 10.1021/acs.jctc.7b00916 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 5. Two-dimensional projection of the configuration of the WW domain from ref 53 obtained by multidimensional scaling54 of the dihedral distances. The configurations are represented with points colored according to their cluster. The clusters are obtained by density peak clustering40 estimating the density with PAk. The structures in cartoon representation are the configurations corresponding to the cluster centers. Black dots are configurations assigned to the halo.40

decision graph is shown in the Supporting Information and allows recognizing 10 clusters. The most populated includes almost 60% of the configurations and corresponds to the folded state. The other nine clusters correspond to metastable configurations with low population. Most of the unfolded snapshots are assigned to the halo of the clusters.40 This means that these snapshots cannot be assigned unambiguously to any free energy basin, and therefore should be considered transition structures. The results of this analysis are summarized in Figure 5. For visualization purposes the distance matrix has been projected in two dimensions using multidimensional scaling with metric stress.54 We also verified that for this system the free energies are robust with respect to the choice of the metric used for measuring the distance between the configurations. Indeed, the free energies computed using the dihedral distance and the RMSD between the Cα atoms are highly correlated, with a Pearson coefficient of 0.984.

construction. Indeed, it can be seen from panel D that the Pearson correlation coefficients between the free energies estimated with the TIC and those estimated with the RMSD distances increase with the number of TIC components considered. In summary, these results show that PAk free energies in the space of all the solute coordinates are in agreement with those computed in a sufficiently large space of smartly chosen collective variables. However, the agreement is rapidly lost when the dimensional reduction becomes too extreme. 3.4. Free Energy of a WW Domain Mini-protein. We finally used PAk to compute the free energy from a 200 μs MD simulation of the FiP35 WW domain. The simulation was performed in explicit solvent at 395 K.53 During this simulation this mini-protein undergoes several transitions between the folded and unfolded states. We used for the analysis a configuration every 20 ns. We first estimated the distance between these configurations using the dihedral distance for the angles formed by the Cα of residues 4−32. The intrinsic dimension in the space associated with this distance is 14.42 Then, we estimated the free energies by PAk and clustered the configurations by the density peaks algorithm,40 using the negative of the free energies as efficacious densities. The
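As a rough illustration of the clustering step used in this analysis (density estimation followed by density peaks clustering), the sketch below implements a bare-bones version of the algorithm of ref 40 in Python. It is a hedged simplification: the decision-graph selection of centers is replaced by a crude ranking of the product rho*delta, the halo assignment is omitted, and all function names are illustrative, not part of any published code.

```python
import numpy as np

def density_peaks(D, rho, n_centers):
    """Bare-bones density-peaks clustering: delta is the distance to the
    nearest point of higher density; centers are the n_centers points with
    the largest rho*delta (a crude stand-in for the decision graph); every
    other point inherits the label of its nearest higher-density neighbor,
    visited in order of decreasing density.  Halo assignment is omitted."""
    rho = np.asarray(rho, dtype=float)
    N = len(rho)
    order = np.argsort(-rho, kind="stable")    # indices by decreasing density
    delta = np.empty(N)
    nn_higher = np.full(N, -1)
    delta[order[0]] = D[order[0]].max()        # convention for the densest point
    for pos in range(1, N):
        i = order[pos]
        higher = order[:pos]                   # all points denser than i
        j = higher[np.argmin(D[i, higher])]
        delta[i], nn_higher[i] = D[i, j], j
    centers = np.argsort(-(rho * delta), kind="stable")[:n_centers]
    labels = np.full(N, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                            # parents are always labeled first
        if labels[i] < 0:
            labels[i] = labels[nn_higher[i]] if nn_higher[i] >= 0 else 0
    return labels, delta
```

When the input densities come from free energies, rho can be taken, for instance, as exp(−F) or as −F shifted to positive values; either choice preserves the density ranking that the algorithm relies on.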

4. CONCLUSION AND PERSPECTIVE

The approach described in this work allows estimating the free energy in high-dimensional spaces, such as the space of the configurations sampled in a molecular dynamics simulation of a biomolecule. The core of the method is a procedure that identifies, for each data point, the maximum number of nearest neighbors that can be employed in the free energy estimate without introducing systematic errors. Small biases induced by this procedure are mitigated by introducing a linear correction term for the free energy in the maximum likelihood equation. The estimator performs well in predicting the free energy and its uncertainty in all the data sets that have been tested, with intrinsic dimensions up to 16. We call this method the point-adaptive k-NN estimator (PAk). The approach can be exploited not only for analyzing molecular simulations but, in general, in any context in which it is necessary to estimate a probability density, even in a high-dimensional space.

As in the standard k-NN estimator and in several related methods, in the PAk approach the probability density estimate relies on the volume occupied by the k nearest neighbors. However, to the best of our knowledge, we are the first to explicitly propose that this volume should be measured on the manifold in which the data lie, and not in the configuration space. This apparently trivial assumption makes the likelihood comparison in eq 3 numerically meaningful and allows one to reliably find the optimal number of neighbors. By focusing on the embedding manifold, the approach performs an implicit dimensional reduction, but without defining the collective variables explicitly. Pictorially, the variables in which the free energy is defined can be imagined as forming a basis in the tangent plane of the manifold at each point. This basis can vary in different regions of the data set; however, in order to compute the free energy it is not necessary to define those variables explicitly. Only their number, the intrinsic dimension of the manifold, needs to be known before applying our approach. The value of this parameter is here estimated following ref 42.
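The neighborhood-selection idea can be illustrated with a simplified sketch. If the density is locally constant, the volumes of the spherical shells between consecutive neighbors are independent exponential variables, so the neighborhood can be grown until a likelihood-ratio test rejects the single-rate hypothesis. The code below is a hedged stand-in for the published PAk test (which compares the densities of neighboring points and applies the linear bias correction); the function names, the k_step schedule, and the growth strategy are illustrative assumptions.

```python
import numpy as np

def shell_volumes(r, d):
    """Volumes (up to the constant prefactor omega_d, which cancels in the
    test) of the spherical shells between consecutive neighbor distances."""
    rd = np.asarray(r, dtype=float) ** d
    return np.diff(np.concatenate(([0.0], rd)))

def lrt_two_rates(v1, v2):
    """-2 log(likelihood ratio) for 'one exponential rate' vs 'two rates',
    asymptotically chi-square with one degree of freedom under the null."""
    n1, n2 = len(v1), len(v2)
    s1, s2 = v1.sum(), v2.sum()
    ll_sep = n1 * np.log(n1 / s1) + n2 * np.log(n2 / s2) - n1 - n2
    ll_com = (n1 + n2) * np.log((n1 + n2) / (s1 + s2)) - (n1 + n2)
    return 2.0 * (ll_sep - ll_com)

def adaptive_k(r, d, thr=23.93, k_min=8, k_step=4):
    """Grow the neighborhood of one point while the shell volumes remain
    compatible with a constant density.  r: sorted distances from the point
    to its neighbors; d: intrinsic dimension; thr: threshold on the test
    statistic (23.93 corresponds to p ~ 1e-6 for chi-square with 1 dof)."""
    v = shell_volumes(r, d)
    k = k_min
    while k + k_step <= len(v):
        # compare the first k shells with the next k_step shells
        if lrt_two_rates(v[:k], v[k:k + k_step]) > thr:
            break
        k += k_step
    return k
```

For a point sitting in a region of constant density, adaptive_k keeps growing k; as soon as the shells start sampling a different density (e.g., leaving a free energy basin), the statistic exceeds the threshold and the growth stops.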
It is well-known that in a molecular simulation the atomic coordinates are embedded in a very high dimensional space, while the process of interest happens in a space of much lower dimensionality. For example, in the molecular dynamics trajectory of the AAA trinucleotide, the dynamics takes place on a manifold whose dimensionality is approximately 10. This allows computing the free energy of each configuration using the RMSD metric, namely, in the space of the coordinates of the solute atoms. The dimension of this space is in principle extremely large but, as we have seen, the PAk estimator provides free energy values with relatively small error bars. We have shown that these free energy values are in fair agreement with those computed in a space of collective variables specifically built to describe the kinetics of the system in an optimal manner.

Like the ordinary k-NN estimator, the method is affected by the curse of dimensionality. When the intrinsic dimension of the embedding manifold becomes very large, the pull distribution shows the presence of a bias and the error is underestimated. It is likely that at high dimensions the linear correction model in eq 4 is not accurate enough to properly compensate for the variation of the free energy. More elaborate models capable of correcting this trend will be the object of our future investigation.
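For reference, the plain (non-adaptive) k-NN estimator that PAk generalizes, together with the standard 1/√k error bar on the log-density that enters the pull, can be written in a few lines. This is a minimal sketch under simplifying assumptions: a Euclidean metric and an intrinsic dimension equal to the number of coordinates; the function name is illustrative.

```python
import math
import numpy as np

def knn_free_energy(X, k):
    """Plain k-NN estimate of the free energy F = -log(density) at every
    sample point, with the standard 1/sqrt(k) error bar on log(density).
    Density estimate: rho = k / (N * V_d * r_k^d), with r_k the distance
    to the k-th neighbor and V_d the volume of the unit d-ball."""
    N, d = X.shape
    # brute-force pairwise squared distances (fine for small N)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    r_k = np.sqrt(np.sort(D2, axis=1)[:, k])   # k-th neighbor; column 0 is self
    log_vd = (d / 2.0) * math.log(math.pi) - math.lgamma(d / 2.0 + 1.0)
    log_rho = math.log(k) - math.log(N) - log_vd - d * np.log(r_k)
    return -log_rho, np.full(N, 1.0 / math.sqrt(k))
```

On a sample from a distribution with known density (e.g., a standard Gaussian, where F = (d/2) log 2π + |x|²/2), the pull (F_est − F_true)/σ of this estimator can be checked directly; at low intrinsic dimension it is close to a standard Gaussian, consistent with the behavior discussed above.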





ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.7b00916.

Setup procedure for the test data sets with different intrinsic dimensions and (Figure S1) decision graph for the density peaks clustering of the WW domain trajectory (PDF)

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Alessandro Laio: 0000-0001-9164-7907

Present Address
∥Janssen R BE, Antwerpseweg 15, 2340 Beerse, Belgium.

Author Contributions
§A.R. and M.d’E. contributed equally to this work.

Funding
We acknowledge financial support from the grant Associazione Italiana per la Ricerca sul Cancro 5 per mille, Riferimento 12214.

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS
We thank Marco Borelli for stimulating discussions.

REFERENCES
(1) Kumar, S.; Payne, P. W.; Vásquez, M. Method for free-energy calculations using iterative techniques. J. Comput. Chem. 1996, 17, 1269−1275.
(2) Carter, E. A.; Ciccotti, G.; Hynes, J. T.; Kapral, R. Constrained reaction coordinate dynamics for the simulation of rare events. Chem. Phys. Lett. 1989, 156, 472−477.
(3) Kumar, S.; Rosenberg, J. M.; Bouzida, D.; Swendsen, R. H.; Kollman, P. A. Multidimensional free-energy calculations using the weighted histogram analysis method. J. Comput. Chem. 1995, 16, 1339−1350.
(4) Shaw, D. E.; Grossman, J. P.; Bank, J. A.; Batson, B.; Butts, J. A.; Chao, J. C.; Deneroff, M. M.; Dror, R. O.; Even, A.; Fenton, C. H.; Forte, A.; Gagliardo, J.; Gill, G.; Greskamp, B.; Ho, C. R.; Ierardi, D. J.; Iserovich, L.; Kuskin, J. S.; Larson, R. H.; Layman, T.; Lee, L. S.; Lerer, A. K.; Li, C.; Killebrew, D.; Mackenzie, K. M.; Mok, S. Y. H.; Moraes, M. A.; Mueller, R.; Nociolo, L. J.; Peticolas, J. L.; Quan, T.; Ramot, D.; Salmon, J. K.; Scarpazza, D. P.; Schafer, U. B.; Siddique, N.; Snyder, C. W.; Spengler, J.; Tang, P. T. P.; Theobald, M.; Toma, H.; Towles, B.; Vitale, B.; Wang, S. C.; Young, C. Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer. SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; IEEE: New York, 2014; pp 41−53, DOI: 10.1109/SC.2014.9.
(5) Sugita, Y.; Okamoto, Y. Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 1999, 314, 141−151.
(6) Piana, S.; Laio, A. A bias-exchange approach to protein folding. J. Phys. Chem. B 2007, 111, 4553−4559.
(7) van der Maaten, L.; Postma, E.; van den Herik, J. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 2009, 10, 66−71.
(8) Piana, S.; Laio, A. Advillin Folding Takes Place on a Hypersurface of Small Dimensionality. Phys. Rev. Lett. 2008, 101, 208101.
(9) Ringnér, M. What is principal component analysis? Nat. Biotechnol. 2008, 26, 303−304.
(10) Coifman, R. R.; Lafon, S.; Lee, A. B.; Maggioni, M.; Nadler, B.; Warner, F.; Zucker, S. W. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 7426−7431.
(11) Roweis, S. T.; Saul, L. K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323−2326.

(12) Shieh, A. D.; Hashimoto, T. B.; Airoldi, E. M. Tree preserving embedding. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 16916−16921.
(13) Ceriotti, M.; Tribello, G. A.; Parrinello, M. Simplifying the representation of complex free-energy landscapes using sketch-map. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 13023−13028.
(14) Dellago, C.; Bolhuis, P.; Geissler, P. Transition Path Sampling. Adv. Chem. Phys. 2003, 123, 1−78.
(15) Scott, D. W. Multivariate density estimation: theory, practice, and visualization; John Wiley & Sons: Hoboken, NJ, USA, 2015.
(16) Sheather, S. J. Density Estimation. Stat. Sci. 2004, 19, 588−597.
(17) Silverman, B. W. Density estimation for statistics and data analysis; Chapman and Hall: New York, 1986.
(18) Izenman, A. J. Review Papers: Recent Developments in Nonparametric Density Estimation. J. Am. Stat. Assoc. 1991, 86, 205−224.
(19) Fix, E.; Hodges, J. L., Jr. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review/Revue Internationale de Statistique 1989, 57, 238−247.
(20) Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. B 1977, 39, 1−38.
(21) Wand, M. P.; Jones, M. C. Kernel smoothing; CRC Press, 1994.
(22) Nadaraya, É. A. On Non-Parametric Estimates of Density Functions and Regression Curves. Theory Probab. Its Appl. 1965, 10, 186−190.
(23) Watson, G. S.; Leadbetter, M. R. On the Estimation of the Probability Density, I. Ann. Math. Stat. 1963, 34, 480−491.
(24) Mack, Y.; Rosenblatt, M. Multivariate k-nearest neighbor density estimates. J. Multivar. Anal. 1979, 9, 1−15.
(25) Orava, J. K-nearest neighbour kernel density estimation, the choice of optimal k. Tatra Mt. Math. Publ. 2011, 50, 39−50.
(26) Heidenreich, N.-B.; Schindler, A.; Sperlich, S. Bandwidth selection for kernel density estimation: a review of fully automatic selectors. AStA Adv. Stat. Anal. 2013, 97, 403−433.
(27) Jones, M. C.; Marron, J. S.; Sheather, S. J. A Brief Survey of Bandwidth Selection for Density Estimation. J. Am. Stat. Assoc. 1996, 91, 401−407.
(28) Simonoff, J. S. Smoothing methods in statistics; Springer Science & Business Media: New York, 2012.
(29) Jones, M. C.; Marron, J. S.; Sheather, S. Progress in data-based bandwidth selection for kernel density estimation. Comput. Stat. 1996, 11, 337−381.
(30) Lepskii, O. V. On a Problem of Adaptive Estimation in Gaussian White Noise. Theory Probab. Its Appl. 1991, 35, 454−466.
(31) Lepski, O. V.; Mammen, E.; Spokoiny, V. G. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Ann. Statist. 1997, 25, 929−947.
(32) Polzehl, J.; Spokoiny, V. Image denoising: Pointwise adaptive approach. Ann. Statist. 2003, 31, 30−57.
(33) Polzehl, J.; Spokoiny, V. Propagation-Separation Approach for Local Likelihood Estimation. Probab. Theory Relat. Fields 2006, 135, 335−362.
(34) Gach, F.; Nickl, R.; Spokoiny, V. Spatially adaptive density estimation by localised Haar projections. Ann. Inst. H. Poincaré Probab. Statist. 2013, 49, 900−914.
(35) Rebelles, G. Pointwise adaptive estimation of a multivariate density under independence hypothesis. Bernoulli 2015, 21, 1984−2023.
(36) Becker, S. M. A.; Mathé, P. A different perspective on the Propagation-Separation Approach. Electron. J. Statist. 2013, 7, 2702−2736.
(37) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Second International Conference on Knowledge Discovery and Data Mining; Association for the Advancement of Artificial Intelligence (AAAI): Palo Alto, CA, USA, 1996; pp 226−231.
(38) Ankerst, M.; Breunig, M. M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. SIGMOD Rec. 1999, 28, 49−60.
(39) Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 603−619.
(40) Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492−1496.
(41) Rosenthal, J. S. A first look at rigorous probability theory; World Scientific: Singapore, 2006.
(42) Facco, E.; d’Errico, M.; Rodriguez, A.; Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci. Rep. 2017, 7, 12140.
(43) Campadelli, P.; Casiraghi, E.; Ceruti, C.; Rozza, A. Intrinsic dimension estimation: relevant techniques and a benchmark framework. Math. Probl. Eng. 2015, 2015, 759567.
(44) Neyman, J.; Pearson, E. S. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philos. Trans. R. Soc., A 1933, 231, 289−337.
(45) Wilks, S. S. The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Ann. Math. Stat. 1938, 9, 60−62.
(46) Greene, W. H. Econometric analysis, 4th ed., International edition; Prentice Hall: Upper Saddle River, NJ, USA, 2000.
(47) Demortier, L.; Lyons, L. Everything you always wanted to know about pulls, CDF Note 5776, version 2.10. Available at http://inspirehep.net/record/1354911/files/ (Accessed Feb 13, 2018).
(48) Baftizadeh, F.; Pietrucci, F.; Biarnés, X.; Laio, A. Nucleation Process of a Fibril Precursor in the C-Terminal Segment of Amyloid-β. Phys. Rev. Lett. 2013, 110, 168103.
(49) Granata, D.; Camilloni, C.; Vendruscolo, M.; Laio, A. Characterization of the free-energy landscapes of proteins by NMR-guided metadynamics. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 6817−6822.
(50) Zerze, G. H.; Miller, C. M.; Granata, D.; Mittal, J. Free Energy Surface of an Intrinsically Disordered Protein: Comparison between Temperature Replica Exchange Molecular Dynamics and Bias-Exchange Metadynamics. J. Chem. Theory Comput. 2015, 11, 2776−2782. PMID: 26575570.
(51) Pinamonti, G.; Zhao, J.; Condon, D. E.; Paul, F.; Noé, F.; Turner, D. H.; Bussi, G. Predicting the Kinetics of RNA Oligonucleotides Using Markov State Models. J. Chem. Theory Comput. 2017, 13, 926−934. PMID: 28001394.
(52) Molgedey, L.; Schuster, H. G. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 1994, 72, 3634−3637.
(53) Shaw, D. E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y.; Wriggers, W. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330, 341−346.
(54) Green, P. E. Marketing Applications of MDS: Assessment and Outlook. J. Mark. 1975, 39, 24−31.


DOI: 10.1021/acs.jctc.7b00916 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX