Biomolecular Systems

ProPOSE: Direct exhaustive protein-protein docking with side chain flexibility Hervé Hogues, Francis Gaudreault, Christopher R. Corbeil, Christophe Deprez, Traian Sulea, and Enrico O. Purisima J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.8b00225 • Publication Date (Web): 14 Aug 2018 Downloaded from http://pubs.acs.org on August 17, 2018

Journal of Chemical Theory and Computation is published by the American Chemical Society, 1155 Sixteenth Street N.W., Washington, DC 20036. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Chemical Theory and Computation

ProPOSE: Direct exhaustive protein-protein docking with side chain flexibility Hervé Hogues, Francis Gaudreault, Christopher R. Corbeil, Christophe Deprez, Traian Sulea, and Enrico O. Purisima* Human Health Therapeutics, National Research Council Canada, 6100 Royalmount Avenue, Montreal, Quebec, Canada, H4P 2R2

* Corresponding author: Tel: +1 (514) 496-6343 Fax: +1 (514) 496-5343 Email: [email protected]

ACS Paragon Plus Environment

Abstract Despite decades of development, protein-protein docking remains a largely unsolved problem. The main difficulties are the immense space spanned by the translational and rotational degrees of freedom and the prediction of the conformational changes of proteins upon binding. FFT is generally the preferred method to exhaustively explore the translation-rotation space at a fine grid resolution, albeit with the tradeoff of approximating force fields with correlation functions. This work presents a direct search alternative that samples the states in Cartesian space at the same resolution and computational cost as standard FFT methods. Operating in real space allows the use of standard force field functional forms used in typical non-FFT methods as well as the implementation of strategies for focused exploration of conformational flexibility. Currently, a few misplaced side chains can cause docking programs to fail. This work specifically addresses the problem of side chain rearrangements upon complex formation. Based on the observation that most side chains retain their unbound conformation upon binding, each rigidly docked pose is initially scored ignoring up to a limited number of side chain overlaps which are resolved in subsequent repacking and minimization steps. On test systems where side chains are altered and backbones held in their bound state, this implementation provides significantly better native pose recovery and higher quality (lower RMSD) predictions when compared with five of the most popular docking programs. The method is implemented in the software program: ProPOSE (Protein Pose Optimization by Systematic Enumeration).

Introduction Structural knowledge of protein-protein complexes can help elucidate important biochemical mechanisms and enable the design of novel biologics, accelerating the drug discovery process. However, obtaining crystals of multi-protein complexes is technically challenging, resulting in the relatively small number of protein-protein complexes available in the PDB. Hence, high-quality in-silico predictions of inter-protein interactions could bridge the gap between the need for structural information and the paucity of experimental data. Despite years of efforts and multiple independent software implementations, reliable protein-protein docking remains largely intractable.1 The main challenges have been: (i) sampling the vast space spanned by the rigid-body translational and rotational degrees of freedom, and (ii) predicting the conformational changes of proteins upon complex formation from backbone and/or side chain rearrangements.2-4 Since their introduction 25 years ago, Fast Fourier Transform (FFT) methods have been the preferred choice for exhaustive sampling of discrete translational/rotational states at high resolution, e.g., at a grid resolution of 1Å and rotation increments of 6°.5-8 This is because FFT provides the ability to systematically evaluate all the translation states by casting the scoring of poses as discrete Fourier transforms on a grid.8 Non-FFT approaches are more heuristic and not as exhaustive in terms of a systematic global sampling of translational/rotational states and often require auxiliary steps or additional information to reduce their search space to a more tractable size. 
For example, the practitioners of Rosetta-based docking suggest using programs such as PIPER9 and ZDOCK10 (both FFT methods) for initial global docking of antibody-antigen complexes in the absence of information about the epitope prior to refinement by SnugDock.11 HADDOCK, another popular non-FFT method, is designed to use sparse experimental data such as limited NMR constraints to guide a stochastic docking algorithm.12 ATTRACT does attempt global docking in real space by having starting configurations of the ligand in different orientations spread around the receptor prior to energy refinement.13 However, the granularity of the starting configurations is coarser than that achievable with FFT-based methods. On the other hand, ATTRACT and other non-FFT methods do operate in continuous space and arguably can explore selected regions of space more finely than discrete FFT methods. Each approach has its strengths and weaknesses and it is not the intent of this paper to favor either FFT or non-FFT
methods. More comprehensive assessments of the current state of protein-protein docking can be found in recent reviews.1,5,14 This work presents a method for direct exhaustive sampling at the same resolution as FFT. Direct sampling with such fine granularity has previously been regarded as too expensive to carry out.7 However, exhaustive rigid docking programs such as FRED and WILMA, which directly sample all the ligand poses, have been successfully used in protein-ligand docking of small molecules.15,16 This work extends the WILMA small-molecule protein-ligand docking paradigm to protein-protein docking. The new program, ProPOSE (Protein Pose Optimization by Systematic Enumeration), samples the discrete translation states directly in Cartesian space in computational time comparable to that of standard FFT methods. This is achieved by the rapid elimination of all poses that cause overlaps or that do not interact sufficiently with the receptor, leaving a large but tractable number of states for further processing. These candidate poses are first scored using an approximate but efficient scoring scheme that focuses only on the interface region rather than the extensive 3-dimensional convolution integrals implicitly computed by FFT methods.6,8 The combination of rapid pose filtering and interface-only scoring allows the program to sample the space as efficiently as standard FFT methods. Moreover, direct scoring provides several advantages over the indirect scoring scheme used by FFT methods.7 First, pairwise terms found in common molecular force fields and used in various forms in current non-FFT docking methods can be readily computed, whereas the correlation functional form imposed by the FFT formalism limits the accuracy with which terms for van der Waals or H-bond interactions can be estimated.9 This may explain why FFT methods frequently implement geometric descriptors such as shape complementarity instead of actual energy terms.8,10,17

Second, the direct method adds the ability to count and group binding events as it computes the score, making it possible, for example, to keep track of side chain overlaps. This provides opportunities to address flexibility in ways that would be impossible to implement with standard FFT methods. Conformational changes of proteins upon binding represent the main challenge of protein docking. While predicting the backbone changes remains the most difficult task,18 the ability of programs to reliably overcome side chain conflicts is imperative if a backbone sampling strategy is to be successful. As will be shown below, current docking programs are sensitive to the rotameric state of side chains even when the backbone is in the bound conformation, limiting their ability to discriminate favorable backbone conformations. This work
specifically addresses the sub-problem of side chain rearrangements upon complex formation given the bound form of the backbone. Only once this can be achieved reliably will improvements in backbone sampling translate into better global docking outcomes. The particular attention given to side chains in this study is based on the fact that side chains mediate most inter-protein interactions, as demonstrated by the impact critical mutations have on measured binding affinities.19 They also represent most of the conformational changes observed upon binding.2,20 Most docking algorithms resolve side chain conflicts late in their refinement stage.4 In their initial sampling stage, they overcome side chain overlaps either by softening their scoring functions to be less sensitive to atomic details or by removing or replacing all side chains by pseudo-atoms.7 By altering all side chains, these approaches inevitably discard or dilute important structural information since over 60% of interface side chains have been shown to retain their unbound conformation upon binding.21,22 The indiscriminate elimination of side chain interactions by docking programs generally compromises their initial set of preferred poses, often to the point that no near-native pose can be rescued later on with more detailed scoring functions or sampling. In contrast, this implementation identifies for each pose the conflicting side chains that interfere with docking. If the number of conflicting side chains exceeds a limit, typically 3 on either protein, then the pose is discarded immediately. Otherwise, the pose is processed further, initially ignoring the contribution of the interfering side chains and deferring the possibility of resolving the conflicts to the subsequent repacking stage. In this way, the contribution to binding from the other nonconflicting side chains in the initial structure is not degraded unnecessarily.
Our motivation for implementing a new docking program came from the need for better docking accuracy in order to enable antibody engineering applications such as affinity maturation and de novo design.23,24 For example, the ADAPT affinity maturation platform25 currently requires the crystal structure of the complex as input, a limitation that would be eliminated if antibody-antigen docking could be achieved with sufficient accuracy. Antibody-antigen docking is a highly complex problem that in the most general case can involve loop modeling, optimization of VH-VL chain orientations and even homology modeling of the antibody and/or the antigen. Although there are reports of successful antibody-antigen docking,26,27 a general robust solution for global antibody-antigen docking remains elusive, especially if one contemplates docking hundreds or thousands of antibody variants in reasonable
time in a de novo design context. As a first step, this work addresses one aspect of that multifaceted problem, namely the issue of side chain flexibility. To train and calibrate the software, a set of 241 high-quality antibody-antigen complexes was assembled from SAbDab.28 From these complexes a simulated set of antibody-antigen pairs was created by separating the binding partners and repacking the side chains of each individual protein using SCWRL4.29 SCWRL4 was shown to successfully recover the native rotameric state for over 80% of buried and 60% of solvent-exposed side chains, a rate that mimics the empirical percentage of surface side chains that remain unchanged upon binding.21 This strategy has previously been used to create datasets that simulate pseudo-unbound structures from crystal structures of complexes.30 Thus, unlike most docking programs that train on actual crystal structures of unbound protein pairs,9,10,13 this software was trained strictly on a simulated repacked dataset, decoupling the problem of backbone conformational change from side chain conformational change. The program is thus optimized to solve the specific sub-problem of predicting the binding mode when the backbones are already in their bound state with only side chain rearrangements expected to occur upon binding. For the more general docking problem requiring both side chain and backbone motion, the program should then be combined with a good backbone sampling procedure, which remains a challenging research problem4,18 and is outside the scope of this study.
To guard against overtraining and to validate its transferability to other protein systems, the program was tested on 150 non-antibody complexes derived from a curated version of the Protein-Protein Docking Benchmark Version 5.0.31 For comparison with state-of-the-art methods in the field of protein docking, four of the best ranking programs in a recent comparative study1 and one popular commercial program were also run on the same dataset. Of these programs, ZDOCK10, PIPER9 and MOE Protein-Protein Docker (Chemical Computing Group, Montreal) are standard FFT methods. ClusPro17 is a post-processing method that re-ranks the poses of PIPER and ATTRACT13,32 is a non-FFT atom-based heuristic method. It is common practice to report docking success when a near-native pose is found within the top ten or even top hundred best predictions.1 This study emphasizes success rates based on the single top-ranked prediction, which on large datasets involving hundreds of systems provides a more stringent assessment of the predictive power of docking programs. Returning a near-native pose as the top prediction should be the ultimate goal of all docking programs.
Nevertheless, for comparison with previous studies, the success rates within the top ten poses are also provided. In the following sections, the program workflow is briefly described followed by the training results on the Antibody dataset. Then the validation and comparative study on the Benchmark dataset is reported, followed by an analysis demonstrating the dependence of docking accuracy on buried interfacial surface area. Finally, results on true unbound docking are presented.

Methods Docking algorithm. The goal of the docking program is to find the best scoring pose from the ensemble of all discrete poses of the ligand protein in reasonable computation time. Since scoring a pose accurately involves slow steps such as side chain repacking, energy minimization and molecular surfacing, a triage approach is used to select the most promising poses using faster but less accurate scoring schemes and passing only these poses to more elaborate scoring operations. This implementation operates in 3 stages (Fig. 1). In Stage1, about 10¹¹ distinct rotational and translational states of the ligand protein are enumerated and tested for collisions and contacts against the receptor using precomputed bitmasks on a grid. As atomic collisions are tested, the number of clashing side chains in each protein is counted. If the number of conflicting side chains does not exceed a predefined limit (typically 3 side chains on either protein) and if a sufficient number of ligand atoms come in close contact with the receptor (typically 5 atoms), then the pose is scored using a rapid table lookup method (fast score). The contribution to the fast score of the colliding side chains is initially ignored. When prior knowledge on the binding site is available, a specific contact region on the ligand and/or the receptor can be defined, and contacts involving atoms from these regions are then required.
This feature is used for antibody docking where binding is expected to involve the complementarity-determining region (CDR). Otherwise, contact atoms span the entire protein surface. Stage1 is critical since any discarded pose is definitely lost and cannot be salvaged by more accurate scoring methods at a later stage. It represents the main enrichment step keeping less than one out of a million poses.
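The Stage1 acceptance rule described above can be sketched in a few lines. This is only an illustrative sketch, not the ProPOSE implementation: plain Python sets and dictionaries stand in for the precomputed bitmasks, and the names `LigandAtom`, `stage1_filter`, `backbone_cells`, `side_chain_cells` and `contact_cells` are hypothetical. The thresholds (3 clashing side chains per protein, 5 contact atoms) are the typical values quoted in the text. How a side chain to side chain overlap is attributed is not stated in the text; the sketch assumes it counts against both proteins.

```python
from dataclasses import dataclass

MAX_CLASHING_SIDE_CHAINS = 3  # per protein; "typically 3" in the text
MIN_CONTACT_ATOMS = 5         # "typically 5 atoms" in the text


@dataclass(frozen=True)
class LigandAtom:
    cell: tuple          # (i, j, k) grid cell occupied by the atom in this pose
    residue: int         # ligand residue index
    is_side_chain: bool


def stage1_filter(ligand_atoms, backbone_cells, side_chain_cells, contact_cells):
    """Return (accepted, ligand_clashes, receptor_clashes) for one pose.

    backbone_cells   -- set of cells occupied by receptor backbone atoms
    side_chain_cells -- dict: cell -> receptor residue index (side chain atoms)
    contact_cells    -- set of cells within contact range of the receptor
    """
    ligand_clashes, receptor_clashes = set(), set()
    contacts = 0
    for atom in ligand_atoms:
        if atom.cell in backbone_cells:
            if not atom.is_side_chain:
                # Backbone-backbone overlap cannot be repacked away: reject.
                return False, ligand_clashes, receptor_clashes
            ligand_clashes.add(atom.residue)
        elif atom.cell in side_chain_cells:
            receptor_clashes.add(side_chain_cells[atom.cell])
            if atom.is_side_chain:
                # Assumption: a side chain / side chain overlap counts on both sides.
                ligand_clashes.add(atom.residue)
        elif atom.cell in contact_cells:
            contacts += 1
        if (len(ligand_clashes) > MAX_CLASHING_SIDE_CHAINS
                or len(receptor_clashes) > MAX_CLASHING_SIDE_CHAINS):
            return False, ligand_clashes, receptor_clashes
    return contacts >= MIN_CONTACT_ATOMS, ligand_clashes, receptor_clashes
```

The returned clash sets correspond to the side chains whose contribution would be ignored in the fast score and revisited during repacking.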

[Figure 1 flowchart: Ligand rotation and translation (~10¹¹ poses) → Stage 1: overlap and contact filtering, side chain clash count, fast score → 50,000 poses → Stage 2: side chain repacking, rigid-body minimization, full score → 500 poses → Stage 3: hierarchical clustering, all-atom minimization, final score → top pose.]

Figure 1. The 3 stages of the ProPOSE docking procedure.
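The funnel in Figure 1 amounts to repeated top-k selection with progressively more expensive scores. The sketch below illustrates only that triage pattern under stated assumptions: lower scores are taken to be better, the three scoring callables are hypothetical placeholders, and the repacking, pose rejection and clustering steps of Stage2 and Stage3 are omitted.

```python
import heapq


def triage(poses, fast_score, full_score, final_score,
           keep_stage1=50_000, keep_stage2=500):
    """Return the best pose after three rounds of increasingly costly scoring.

    Assumes lower scores are better. The default cutoffs are the pose counts
    quoted in the text for Stage1 and Stage2.
    """
    # Stage 1: cheap score on every candidate, keep the best keep_stage1.
    stage1 = heapq.nsmallest(keep_stage1, poses, key=fast_score)
    # Stage 2: more accurate score on the survivors, keep the best keep_stage2.
    stage2 = heapq.nsmallest(keep_stage2, stage1, key=full_score)
    # Stage 3: most elaborate score, return the single top pose (or None).
    return min(stage2, key=final_score, default=None)
```

With the quoted numbers, keeping 50,000 of roughly 10¹¹ enumerated poses corresponds to a fraction of 5 × 10⁻⁷, consistent with the statement that fewer than one pose in a million survives Stage1.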

In Stage2 the 50,000 best fast-scoring poses from Stage1 are repacked using two versions of a rotamer library (see repacking section). The clashing side chains identified in Stage1 are sampled with a high-resolution version of the rotamer library while the other non-clashing interface side chains use a lower resolution version. If a clashing side chain cannot be resolved, the pose is rejected; otherwise, after a constrained rigid-body minimization, the repacked pose is rescored using a more elaborate and accurate scoring scheme (full score). In the final stage (Stage3) the top 500 poses of Stage2 are clustered based on the interface backbone RMSD. The best representative poses of at most 32 clusters in complex with the receptor undergo H-bond network optimization followed by an all-atom restrained minimization using the AMBER FF99SB force field with a 4R dielectric model.33,34 The minimized structures
are then rescored using the last and most elaborate scoring scheme (final score) and the top-scoring pose is returned.

Scoring functions. At any stage, the scoring function is a linear combination of basic terms (Table 1). Since the selection of terms and the accuracy of their implementation vary, each stage uses a unique set of weights obtained during training by optimizing the early enrichment of near-native poses (see parameter calibration below). During training, a pose was considered a success if the interface backbone RMSD was ≤ 5 Å relative to the bound state. For each stage, optimal coefficients were obtained by maximizing the BEDROC35 score over 1000 independent bootstrap (sampling with replacement) cycles. The BEDROC α coefficients were adjusted depending on the desired level of early enrichment to 5, 10 and 30 for stages 1, 2 and 3, respectively. There are 10 distinct scoring terms representing basic interaction energy between the interacting proteins. The pairwise terms include a smoothed van der Waals 6-12 Lennard-Jones term using Amber FF99SB parameters (Evdw),33,34 an electrostatic energy term (Eele) using a distance-dependent dielectric (4R), an anisotropic H-bond term (Ehbond) and a penalty term (Ehflaw) for unsatisfied polar groups.15 Other non-pairwise terms include the polar and non-polar buried surface areas (Epbsa, Enpbsa), which for efficiency are approximated in Stage1 using mapped atomic properties but, in the following stages, are calculated using a slower but more accurate molecular surfacing routine.36,37 Since most of the proteins are considered rigid, the only intramolecular energy terms involve the mobile side chains with the rotamer internal energy (Erself) expressed as the log of the rotamer probability29 and the intramolecular non-bonded interaction energy of the rotamer (Ernonb).
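To make the training objective concrete, the sketch below shows a weighted linear score and a BEDROC calculation for a ranked list of poses. It is a minimal illustration, not the calibration code used for ProPOSE: the function names, the example term names used as dictionary keys and the weights are hypothetical. BEDROC is computed here in its normalized form, (RIE − RIEmin)/(RIEmax − RIEmin), where RIE is the robust initial enhancement; the bootstrap resampling and the weight optimizer are not shown.

```python
import math


def linear_score(terms, weights):
    """Weighted sum of scoring terms, e.g. {"Evdw": ..., "Eele": ...}."""
    return sum(weights[name] * value for name, value in terms.items())


def bedroc(is_near_native, alpha):
    """BEDROC of a ranking; is_near_native is ordered best score first.

    Larger alpha puts more weight on the very top of the ranked list.
    """
    n_total = len(is_near_native)
    n_hits = sum(1 for hit in is_near_native if hit)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("need at least one hit and one non-hit")
    ratio = n_hits / n_total
    # Exponentially weighted sum over the ranks (1-based) of the hits.
    observed = sum(math.exp(-alpha * rank / n_total)
                   for rank, hit in enumerate(is_near_native, start=1) if hit)
    # Expected value of the same sum for uniformly random ranks.
    expected = ratio * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    rie = observed / expected
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)
```

In a calibration loop one would rank the poses of each training system by `linear_score`, flag those within the 5 Å interface backbone RMSD criterion, and adjust the weights to increase `bedroc` with α set to 5, 10 or 30 depending on the stage.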
In the final stage, once the top poses are clustered, an estimate of the configurational entropy (Ecsiz) is included as the log of each cluster size.38 After the final minimization, the electrostatics reaction field energy (Eerf) is computed using the boundary element solution of the Poisson equation in a continuum dielectric.39 To accelerate scoring, all Stage1 energy terms are precomputed and mapped onto a grid. In Stage2, only far-field energy contributions are mapped on a grid and near interactions (distance […]

[…] 820 Å²), decoy poses rarely challenge the native state with comparable BSA values. For those systems, scoring using a descriptor that correlates with BSA such as shape complementarity or just a van der Waals term should in principle easily discriminate near-native poses. As the BSA decreases, other descriptors, less dependent on surface area, are needed to discriminate the true binding events. The H-bonding, H-flaw-penalty and electrostatics terms used in all stages of its docking procedure are mandatory for ProPOSE to recover some of the complexes with smaller BSA. Interestingly, the BSA distribution of decoy poses is very similar across systems and depends mainly on the size of the proteins and on the level of protein flexibility tolerated in the initial docking stage (Fig. S6). Allowing more side chain conflicts in Stage1, which amounts to increasing flexibility, offers more opportunities for decoy states to bury surface and thus improve their Stage1 score. While many of these decoy poses will be penalized or rejected during the repacking step in Stage2, their increased presence in the top 50,000 candidates of Stage1 reduces the enrichment of near-native poses and affects overall performance. If instead, to keep decoys at bay, flexibility is too restricted, native poses may also be discarded, as their side chain conflicts exceed the tolerated limit.
The level of flexibility used, namely tolerating up to 3 conflicting side chains on both the ligand and the receptor, was selected as the optimal value on the Antibody RR training set (Fig. S7).

Unbound docking. This paper focused on overcoming side chain conflicts in protein-protein docking. Hence, the training and validation described thus far were done on simulated datasets in which backbones were kept in their bound conformations. However, real-life problems involve
docking true unbound protein pairs in which both backbone and side chain movement occur. It is informative to examine the performance of the program in its current level of development when challenged with such cases. Moreover, most current docking programs have been optimized to handle the unbound docking problem and it may be considered a fairer comparison of ProPOSE versus those methods to use unbound protein datasets rather than the simulated sets used above. The Protein-Protein Docking Benchmark Version 5.0 set contains the unbound structures of the ligand and receptor of each complex.31 Analogous to the RB, BR and RR scenarios in the simulated sets, three scenarios were tested: unbound-bound (UB), bound-unbound (BU) and unbound-unbound (UU). The performance of all methods tested on the UU set was dismal using the Top 1 pose, with at best about a 15% success rate and few high-quality poses (Fig. 4A). Using the Top 10 poses, one gets somewhat better results, with success rates rising to about 40% for ClusPro and about 30% for ProPOSE, ZDOCK and PIPER (Fig. 4B). Again, there were few high-quality poses found even when using the Top 10 poses. Results obtained by ZDOCK, PIPER, ClusPro and ATTRACT in the unbound (UU) scenario recapitulated published results on the same dataset.1 Despite the presence of systems with significant backbone displacement between their bound and unbound conformations,31 the overall performance of ProPOSE was comparable to the other methods for the UU scenario. This is surprising since ProPOSE does not tolerate any backbone conflicts, backbones being considered rigid with only a limited number of side chain overlaps tolerated. While ProPOSE offers no clear advantage over the other methods for the UU scenario, it consistently predicted more high-quality poses in the UB and BU scenarios, illustrating the strength of ProPOSE when presented with backbones in their bound conformation.
When even just one of the docking partners has the correct backbone conformation, there is a significant boost in docking accuracy for ProPOSE, especially in producing high-quality poses, which is not observed to the same extent for the other methods. This suggests that ProPOSE is more likely to benefit than the other programs from future improvements in backbone sampling and prediction.
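The Top 1 and Top 10 success rates discussed here can be computed from per-system ranked pose lists as sketched below. This is an illustrative sketch only: the function name and the input layout are hypothetical, and a single 5 Å RMSD cutoff (borrowed from the training criterion stated in the Methods) is assumed in place of the CAPRI quality classes used in Figure 4.

```python
def top_n_success_rate(ranked_rmsds, n, cutoff=5.0):
    """Fraction of systems with a near-native pose among the top n predictions.

    ranked_rmsds -- dict: system name -> list of RMSD values (in angstroms),
                    ordered from best-scoring to worst-scoring pose
    cutoff       -- RMSD threshold for a near-native pose (an assumption here)
    """
    if not ranked_rmsds:
        raise ValueError("no systems provided")
    hits = sum(1 for rmsds in ranked_rmsds.values()
               if any(r <= cutoff for r in rmsds[:n]))
    return hits / len(ranked_rmsds)


# Toy data (made up for illustration, not taken from the paper).
toy = {
    "A": [2.0, 10.0, 12.0],   # near-native pose ranked first
    "B": [12.0, 3.0, 15.0],   # near-native pose ranked second
    "C": [20.0, 15.0],        # no near-native pose
    "D": [8.0, 9.0, 4.9],     # near-native pose ranked third
}
print(top_n_success_rate(toy, 1))   # 0.25
print(top_n_success_rate(toy, 10))  # 0.75
```

The Top 1 rate is the more stringent of the two because a system counts only when its best-scoring pose is itself near-native.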

Figure 4. Absolute docking success rate for the Top 1 and Top 10 pose predictions under four scenarios: (BB) Bound ligand with bound receptor; (UB) Unbound ligand with bound receptor; (BU) Bound ligand with unbound receptor; (UU) Unbound ligand with unbound receptor. Colors indicate the quality of the predictions according to the CAPRI metric as High (green), Medium (yellow) or Acceptable (coral).

Conclusion We have implemented a non-FFT direct exhaustive Cartesian search to completely sample the translational and rotational space for protein-protein docking. This software implementation demonstrates that by quickly eliminating overlapping or non-interacting states and focusing the initial scoring on the interface region, direct search can sample the space as efficiently as FFT-based methods, providing better scoring accuracy and control of side chain flexibility. Allowing excessive or indiscriminate flexibility during the initial sampling stage introduces too many decoy states that reduce early enrichment. By tolerating only a limited number of side chain conflicts, the program efficiently reduces the search space and improves the recovery rate.

This docking program had a focused objective, which was to predict the binding mode given the correct protein backbone structures. The decision was made to intentionally decouple the problem of side chain flexibility from other confounding factors in the protein-protein docking problem in order to better define the sources of difficulties and to more cleanly solve this important sub-problem. To this end the software was trained strictly on overcoming side chain conflicts using antibody-antigen simulated systems where plausible side chain perturbations were introduced and subsequently validated on non-antibody test systems with backbones held in the bound form. On the validation set, the program achieved a success rate approaching 50% when side chains were most altered (RR) and 90% when side chains were in their bound state (BB), outperforming five widely used and commercially available docking programs. It may seem odd to devote so much effort to achieving good docking performance on datasets with backbone conformations in the bound state since these will rarely be encountered in real-world scenarios. An even stronger case might be made against the relevance of docking performance with side chains in the BB state. However, the ability to dock under these more favorable conditions can be seen as an advantage and a minimum requirement in order to address the more difficult unbound docking problems where backbone sampling will need to be introduced. Of course, this advantage can only be fully realized if and when a robust backbone sampling method that visits the bound state is achieved, which remains a very challenging task. For many current algorithms, the impaired ability to dock these idealized cases by softening their scoring functions is presented as an acceptable tradeoff for achieving some success in the true unbound cases. 
However, in our opinion, such a compromise greatly limits their ability to take advantage of any future improvement in backbone sampling, as demonstrated by their modest docking success when presented with the native backbone conformation. Moreover, this tradeoff seems unnecessary since on the test set of real unbound docking problems (UU), for which most tested programs were tuned, ProPOSE achieved comparable results and even provided higher quality predictions than the other programs while performing significantly better when provided with native backbone conformations. This suggests that one does not have to sacrifice accuracy in scoring functions in order to cope with conformational flexibility. The encouraging message then is that any future progress in sampling of near-native backbone conformations should have a
direct and significant impact on enhancing the performance of ProPOSE on unbound docking problems.

Appendix A. Supplementary Information

Details of the antibody-antigen and protein-protein data sets, as well as additional figures cited in the text, can be found in the Supplementary Information. The ProPOSE software is available upon request from the authors.






