Computational approaches to de novo protein tertiary structure prediction, including those based on the preeminent “fragment-assembly” technique, have failed to scale up fully to larger proteins (on the order of 100 residues and above). A number of limiting factors are thought to contribute to the scaling problem over and above the simple combinatorial explosion, but the key ones relate to the lack of exploration of properly diverse protein folds, and to an acute form of “deception” in the energy function, whereby low-energy conformations do not reliably equate with native structures. In this article, solutions to both of these problems are investigated through a multistage memetic algorithm incorporating the successful Rosetta method as a local search routine. We found that specialised genetic operators significantly add to structural diversity and that this translates well to reaching low energies. The use of a generalised stochastic ranking procedure for selection enables the memetic algorithm to handle and traverse deep energy wells that can be considered deceptive, which further adds to the ability of the algorithm to obtain a much-improved diversity of folds. The results should translate to a tangible improvement in the performance of protein structure prediction algorithms in blind experiments such as CASP, and potentially to a further step towards the more challenging problem of predicting the three-dimensional shape of large proteins.
Proteins are at the heart of cellular function, making possible most of the key processes associated with life. The three-dimensional structure of any given protein is known to be one of the major determinants of its distinctive functional properties. Thus, protein structure determination is a fundamental step towards the understanding of the function of these important building blocks of life. Gaining insight into the structure-function relationship in proteins can assist, for example, in the design of proteins with novel specific functionalities, in the design of drugs and vaccines, and in the understanding of pathologies characterised by protein misfolding (e.g., Alzheimer’s and Parkinson’s diseases). Given the limitations of experimental methods, however, computational approaches have become the cornerstone of protein structure analysis.
Predicting the three-dimensional structure of a protein molecule, starting only from its amino acid sequence, remains a formidable challenge in computational biology. In recent years, different computational methods have been proposed with a view to tackling this problem, each leveraging observed relationships between the amino acid sequences and three-dimensional structures of proteins. For example, the comparative or homology modelling approach (Martí-Renom et al., 2000) is rooted in the idea that proteins that are closely related in evolutionary terms (homologous proteins) are more likely to have very similar global sequences, and therefore very similar structures. However, these methods require the existence of a homologous protein with a known structure, so they usually cannot be used to infer the structure of an entirely novel sequence. Other approaches, termed de novo or template-free modelling methods, seek to overcome this limitation by instead focusing on local sequence-structure relationships. These techniques are exemplified by the popular fragment-assembly class of approaches (Simons et al., 1997). Currently, fragment-assembly methods represent one of the most promising approaches to protein structure prediction, having shown a remarkable performance in the biennial critical assessment of protein structure prediction (CASP) experiments (Moult et al., 2014; Kryshtafovych et al., 2014).
Like many methods for protein structure prediction, fragment-assembly approaches use a simplified representation of a protein’s tertiary structure, which includes information about backbone torsion angles only. However, the strength of fragment-assembly methods lies in their ability to leverage existing structural information from protein structure databases to reduce the search space. Specifically, local sequence-structure correlations are used to identify a finite set of candidate fragments for every window of residues in the protein chain. In this way, the continuous representation typically used in protein structure prediction is further compressed through the definition of a finite set of possible choices for each residue in the target protein. Fragment assembly thus remodels protein structure prediction as a combinatorial optimisation problem (Papadimitriou and Steiglitz, 1982; Cook et al., 1998), which involves the selection of one of the available fragment choices for each position in the protein chain (see Handl et al., 2012, for a Markov chain analysis of fragment assembly).
Despite the strengths of fragment assembly, there is a general consensus that limitations in fragment quality, inaccuracies of (especially low-resolution) energy functions, and the size of the search space make de novo prediction a very difficult problem (R. Das, 2011; Kim et al., 2009). In particular, the prediction accuracy of fragment-assembly methods has been observed to decrease for larger proteins, and particularly those with high contact order (Kryshtafovych et al., 2014). Recent work aimed at improving the quality of search has taken its inspiration from method developments in the optimisation literature, and attempts at improving search performance have included state-of-the-art techniques such as evolutionary algorithms, replica-exchange methods, and response surface methodologies (Bowman and Pande, 2009; Brunette and Brock, 2008; Simoncini et al., 2012). While such approaches have been reasonably successful in generating conformations with energy values that are lower than those obtained using, for example, the well-established Rosetta protocol (Rohl et al., 2004), the corresponding improvements in prediction performance (measured commonly as the structural similarity to the native structure) have been small and often inconsistent, with deteriorations in accuracy observed for a subset of proteins. As a consequence, state-of-the-art approaches to fragment assembly continue to rely on thousands of restarts rather than a smaller number of runs of a more sophisticated search technique. This situation seems counterintuitive, in the sense that random restarts are blind to the results of previous executions, which results in redundancies and does not provide an efficient approach to sampling.
Yet, recent analysis of the search trajectories of longer Rosetta runs and an advanced sampling protocol indicates that they fail to capitalise on the power of the underlying search heuristics, and are unable to explore a significant number of different folds within a single search trajectory (Kandathil et al., in press). Such limited exploration of the search space is particularly problematic given the known ruggedness of the energy landscape and its deceptive features, including known inaccuracies in the energy function that make it difficult to correctly differentiate between the quality of different local optima. In view of this, it seems that a successful search technique for fragment assembly will have to incorporate improved mechanisms to generate and retain low-energy structures that correspond to distinctly different folds.
This report describes a new search heuristic for fragment assembly designed to address these distinctive challenges of the problem. First, given the lack of exploration observed in existing methods, and in view of the ruggedness that characterises the energy landscape, an improved sampling protocol is proposed. The new method aims to attain an appropriate balance between the exploration and the exploitation of the conformational search space. The ratio between exploration and exploitation is known to play a critical role in determining the performance of search algorithms (Črepinšek et al., 2013). In essence, the proposed method is a memetic algorithm (MA), embedding the successful Rosetta protocol as its local search heuristic. An important characteristic component of the proposed MA is the use of specialised genetic operators, which exploit problem-specific knowledge to explicitly encourage the exploration of more diverse protein folds. Second, it is the authors’ view that, in light of the well-known inaccuracies of low-resolution energy functions (Bowman and Pande, 2009), the generation and preservation of a diverse range of candidate structures is a crucial prerequisite for achieving a robust and competitive performance. Drawing inspiration from the evolutionary multimodal optimisation literature (S. Das et al., 2011), a stochastic ranking-based procedure is implemented within the proposed MA as a mechanism to strike a trade-off between the optimisation of energy and the identification (and retention) of diverse conformations. The results of this study illustrate conclusively that enhanced optimisation of energy alone results in problematic sensitivity to the accuracy of the energy function, but significant improvements in robustness can be obtained through the explicit consideration of conformational diversity during the search.
The remainder of this article is structured as follows. Section 2 provides an introduction to protein structure prediction, setting the necessary background for this work. Section 3 provides all details of the proposed memetic algorithm, including the specialised genetic operators implemented. The results for this algorithm demonstrate its ability to identify deep local optima within the search space, but also highlight the issues of optimising energy alone. Section 4 takes these results forward into the description and evaluation of an improved version of the algorithm that uses the stochastic ranking-based procedure to induce a robust balance between exploitation and exploration. Finally, Section 5 discusses the main findings of this study and concludes.
Notwithstanding the use of computer-intensive molecular dynamics simulations to model the dynamical folding process of proteins, protein (tertiary) structure prediction has been treated, fundamentally, since the seminal report of Anfinsen (1973), as a problem of (static) energy minimisation. Progress in this approach has been gradual and has been achieved through a combination of staged improvements to energy functions, the availability of greater and greater computational power, and a growing knowledge base of common structural forms. The latter includes the use of fragments of structure in what is termed “fragment assembly,” the preeminent approach to tackling “new folds”—that is, proteins that fall outside a threshold of similarity to other known proteins. These gradual improvements have enabled larger and larger proteins to be “solved” more accurately through computational prediction, but routine and reliable structure prediction of large proteins remains elusive.
Proteins are fundamental elements of living cells, performing a range of biological functions. They are involved, for example, in transport, structural, enzymatic, hormonal, regulatory, and defensive processes. Amino acids, the building blocks of proteins, are small molecules that follow the general structure presented in Figure 1 (left side). Proteins are linear chains of amino acids, held together by peptide bonds, as illustrated on the right side of Figure 1. Hence, protein chains are also referred to as polypeptides.
Proteins display complex structures commonly described in terms of three main levels of organisation. The linear sequence of amino acid residues constitutes the primary structure of a protein. The secondary structure describes the arrangement of amino acids within short stretches of a polypeptide chain into motifs such as α-helices and β-sheets, connected to each other by chain regions called loops. The tertiary structure defines the overall folding of the protein chain in three dimensions, where secondary structure elements are packed into compact domains. Tertiary structure is characterised by the formation of long-range (high-order) contacts; this describes the fact that regions of the polypeptide chain that are a considerable distance apart in the sequence can be close together in three-dimensional space. The tertiary, three-dimensional conformation typically constitutes the functional state of the protein molecule.1
2.2 Protein Structure Prediction
Since the function of a given protein is determined to a large extent by its three-dimensional structure, direct experimental approaches have been used to determine protein structures in the laboratory. Prominent examples of such techniques are X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. However, these experimental methods are often time-consuming and expensive, and in many cases require the development of experimental protocols specific to the protein of interest. The relative ease of obtaining protein sequence information has led to a large gap between the availability of sequence and structure information for proteins. As of 28 August 2015, the RCSB Protein Data Bank (PDB) had 111,558 structure entries (Berman et al., 2000; RCSB PDB, 2015), while the UniProtKB/TrEMBL protein sequence database contained 50,011,027 sequence entries (The UniProt Consortium, 2015a; 2015b). The challenge to bridge such an ever-increasing gap has generated considerable interest in exploring computational approaches for protein structure prediction.
It is generally accepted that the amino acid sequence encodes all the information related to the three-dimensional structure of a protein in a given environment. In other words, the specific configuration of amino acid residues in a protein is what determines how it folds into a unique and compact three-dimensional conformation, often referred to as the native state. Among all the possible conformations that a protein can adopt, it is believed that its native state corresponds to the one with the lowest overall free energy (Anfinsen, 1973). Hence, the process of inferring the functional, energy-minimising conformation for a protein molecule from its linear sequence of amino acids can be posed as an optimisation problem. This problem is commonly referred to as the protein structure prediction (PSP) problem, and represents one of the most active and challenging research areas in the field of computational biology. Let C be the set of all potential conformations for a given protein sequence—that is, the search (or conformational) space—and let E denote a fixed energy model that maps each possible conformation c ∈ C to an energy value E(c). PSP can be more formally stated as the problem of finding the conformation c* ∈ C such that E(c*) ≤ E(c) for all c ∈ C.
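In a discrete setting such as the fragment-assembly formulation discussed below, this minimisation can be stated very directly. The following sketch is purely illustrative: the conformations (torsion-angle pairs) and the quadratic toy energy are invented stand-ins, not a real energy model.

```python
def predict_structure(conformations, energy):
    """Return the conformation with the lowest energy value, i.e. the
    (approximate) native state under the given energy model."""
    return min(conformations, key=energy)

# Toy illustration: conformations as (phi, psi) torsion-angle pairs and a
# made-up quadratic energy whose minimum lies near (-57, -47).
toy_conformations = [(-60.0, -45.0), (120.0, 130.0), (-57.0, -47.0)]

def toy_energy(c):
    return (c[0] + 57.0) ** 2 + (c[1] + 47.0) ** 2
```

The real difficulty of PSP lies, of course, in the fact that C is far too large to enumerate, which is what motivates the heuristic search methods discussed in the remainder of this article.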
Methods originally aimed at solving PSP problems have found applications in other biological problem areas. For example, PSP methods have been used in the design of a highly specific endonuclease (Chevalier et al., 2002) and an enzyme that catalyses a Diels-Alder reaction (Siegel et al., 2010), a reaction not catalysed by any known enzyme in nature. The experimentally synthesised enzyme possesses the predicted structure and performs the predicted reaction. The structure prediction method Rosetta has been used in the design of a novel fold (Kuhlman et al., 2003) and in the prediction and engineering of sites and affinity for DNA (Havranek et al., 2004) and ligand binding (Meiler and Baker, 2006; Wang et al., 2010). This method also has important applications in drug and vaccine design (Whitehead et al., 2012; Azoitei et al., 2011) and medicine, especially in understanding the pathology of diseases such as Alzheimer’s, Parkinson’s, and prion diseases (Chiti and Dobson, 2006). These diseases are characterised by the incorrect folding of certain proteins.
2.3 Fragment-Assembly Methods for Protein Structure Prediction
Methods employing fragment assembly have proven to be the most promising approaches to PSP, starting from only sequence information. They are consistently assessed as providing the best performance in the template-free modelling category of CASP experiments (Tai et al., 2014; Moult et al., 2014). Fragment assembly is based on the principle that, in proteins, the sequence of a short stretch of amino acids (e.g., nine or ten residues) has a strong influence on the local structure of that region (Simons et al., 1997; Han and Baker, 1996). Thus, it is possible to construct complete structures for a target protein by assembling short fragments of local structure information. These fragments are typically drawn from proteins of known structure, based on similarity in amino acid sequence (Rohl et al., 2004; Gront et al., 2011). Each window of residues in the target has a set of fragments associated with it. Once a library of fragments has been generated for a given target, the fragments are typically assembled using a Metropolis Monte Carlo optimisation scheme (Rohl et al., 2004). The key advantage of this approach is that it does not require the availability of structures for proteins with very similar global sequences (and therefore structures); this is sometimes referred to as template-free prediction. This has enabled fragment-assembly methods to correctly predict the structure of proteins with hitherto unobserved overall topologies (Kuhlman et al., 2003; Jones et al., 2005). Several fragment-based methods for PSP have been proposed over the years (e.g., Simons et al., 1997; Xu and Zhang, 2012; Jones, 2001; Lee et al., 2004).
Fragment-assembly methods typically employ a two-phase process. In the first phase, a low-resolution representation of the protein is used to rapidly explore the space of possible conformations. The purpose of this phase of prediction is to sample a wide range of plausible overall topologies (folds) for a given target sequence. New conformations are generated by replacing the structural information (configuration of backbone torsion angles) for a stretch of residues in the incumbent structure with that information in a fragment chosen from the fragment library. This process is termed fragment insertion and is illustrated in Figure 2. Following each such insertion, a relatively inexpensive knowledge-based scoring function is used to assess the quality of the new candidate structure, whose acceptance is commonly determined using the Metropolis criterion (Metropolis et al., 1953). The fragment insertion operation provides a means of exploring various possible conformations for a target. In other words, fragment insertion forms the search operator for the optimisation process. Moreover, fragment libraries describe what choices of structural information are available at any given residue. The set of fragments for a target protein defines, therefore, the search space of the optimisation problem, transforming the otherwise continuous space of structural parameters into a set of discrete configurations. Protein structure prediction, at this low-resolution fragment-assembly phase, is thus a combinatorial optimisation problem that can be stated simply as that of finding the best-performing (i.e., energy-minimising) combination of the available fragment choices for each residue.
The low-resolution phase is designed to enable rapid conformational variation in the structure. It is in this phase that the overall fold of the structure is determined. The purpose of the second, all-atom phase of prediction is to convert the low-resolution structures into complete models of protein structure, including all side-chain atoms. In this phase, more computationally expensive procedures are used to refine the structure using small perturbations in order to arrive at compact, low-energy structures. This phase of the prediction process usually does not involve fragment assembly and only performs small refinements to an already compact structure. Since this work is primarily concerned with methods for realising improved exploration of different overall folds, it focuses on the low-resolution fragment-assembly prediction step.
Section 2.4 details the low-resolution fragment-assembly phase of the standard Rosetta protocol, as this forms the basis for the methods proposed in this study. For more detailed descriptions of the Rosetta protocol, we refer the reader to Rohl et al. (2004), Gront et al. (2011), Misura and Baker (2005), Simons et al. (1997), and Simons et al. (1999).
The low-resolution phase of Rosetta starts from a completely extended conformation of the protein chain. As mentioned above, a low-resolution representation of the protein structure is used. This is illustrated in Figure 3. In this representation, the atoms of the side chain of each amino acid are abstracted as single pseudoatoms placed at the centroid of the side chain. Bond lengths and bond angles are set to idealised values (these are derived by statistical analysis of experimentally derived structures; Engh and Huber, 1991), and the backbone torsion angles (φ, ψ, and ω in Fig. 3) are the sole parameters that are varied during the optimisation process. The process of fragment insertion in Rosetta thus entails replacing the backbone torsion angles for a set of residues in the target protein chain with those drawn from a fragment. The structure generated as a result of each fragment insertion (or move) is evaluated using a scoring function (see below), and each move is either accepted or rejected on the basis of the Metropolis criterion, using a fixed value of the temperature parameter, T. This process is repeated by selecting insertion windows in the target sequence uniformly at random, choosing a fragment to insert at that window, and evaluating the move at each step.
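The accept/reject logic just described can be sketched as follows. This is a toy illustration, not Rosetta's implementation: the representation is a flat list of torsion-angle values, and the fragment library is simplified to one candidate fragment per window.

```python
import math
import random

def metropolis_accept(e_new, e_old, kT, rng):
    """Metropolis criterion: downhill moves are always accepted; uphill
    moves are accepted with probability exp(-(e_new - e_old) / kT)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)

def fragment_assembly(angles, fragments, energy, n_moves, kT=2.0, seed=1):
    """Toy fragment-assembly loop: repeatedly pick an insertion window at
    random, splice that fragment's torsion angles into the chain, and
    accept or reject the move via the Metropolis criterion at fixed kT."""
    rng = random.Random(seed)
    current = list(angles)
    e_current = energy(current)
    for _ in range(n_moves):
        window, frag = rng.choice(sorted(fragments.items()))
        trial = list(current)
        trial[window:window + len(frag)] = frag   # fragment insertion
        e_trial = energy(trial)
        if metropolis_accept(e_trial, e_current, kT, rng):
            current, e_current = trial, e_trial
    return current, e_current
```

The essential point is that the search operator (fragment insertion) is discrete, while the acceptance rule allows occasional uphill moves so that the search can escape shallow local optima.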
The low-resolution phase of Rosetta is composed of four stages. Stages 1 to 3 employ nine-residue fragments (9-mers), whereas stage 4 employs 3-mer fragments. Thus, two separate fragment libraries have to be composed prior to prediction, one for each fragment length considered. Each low-resolution stage also employs a different scoring function, which is a linear weighted sum of ten energy terms (Rohl et al., 2004). The scoring terms are mostly derived on the basis of Bayesian statistical analysis of known protein structures (Simons et al., 1997, 1999) and are designed to capture local and global structural properties of real protein structures. These terms capture effects such as solvation (interactions with the protein’s environment) and interactions between residues and elements of secondary structure. The weights of each of the scoring terms are gradually increased as Rosetta proceeds through the low-resolution stages, until the weights reach their final values in stage 4. Each low-resolution stage is allocated a certain budget of scoring function evaluations, and the default values can be increased or decreased using a user-supplied multiplier (the increase_cycles input parameter).
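Schematically, such a score is simply a weighted sum of individual terms, with the weights ramped up from stage to stage. The term names and weight values below are illustrative placeholders, not Rosetta's published terms or weights.

```python
def low_res_score(terms, weights):
    """Low-resolution score: a linear weighted sum of individual energy
    terms (e.g. solvation, residue-pair interactions). Term names and
    weight values in this sketch are invented, not Rosetta's."""
    return sum(weights[name] * terms[name] for name in weights)

# Weights increase across the low-resolution stages until they reach
# their final values in stage 4 (invented values for illustration):
stage_weights = {
    2: {"env": 0.25, "pair": 0.5, "vdw": 1.0},
    4: {"env": 1.0, "pair": 1.0, "vdw": 1.0},
}
```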
2.5 Recent Approaches to Improving Conformational Search
In Section 1, it was pointed out that most state-of-the-art structure prediction pipelines make use of a large number of independent restarts of the search heuristic, and that this constitutes an inefficient approach to exploration. In recent years, different methods have been proposed, seeking to leverage the strengths of existing approaches, such as Rosetta, while attempting to integrate more advanced optimisation schemes. For example, probabilistic search frameworks have been proposed, whereby the choice of a local region of conformational space to search is informed by the energy or score values obtained by the conformations accessed in initial rounds of sampling (Simoncini et al., 2012; Simoncini and Zhang, 2013). An evolutionary algorithm employing crossover and mutation operators in a fragment-assembly context has also been described (Olson et al., 2013). Saleh et al. (2013) also described the use of evolutionary and memetic algorithms involving fragment assembly, and showed that the MA can reduce the likelihood of deep descent into local optima on the energy landscape, which is known to be detrimental to predictive accuracy (Molloy et al., 2013) and has been an issue for some of the above approaches. Owing to the inaccuracies in typical low-resolution scoring functions, a number of methods have now started to consider structural diversity at strategic points during the search—for instance, by using clustering or a grid structure (Shehu and Olson, 2010; Molloy et al., 2013). This is the work most closely related to this study.
Note, however, that our approach differs significantly. As will be described in detail in Sections 3 and 4, the main innovative elements in the proposed method are: (i) the use of the successful Rosetta method as a local search operator, thus drawing heavily upon the fragment libraries, energy functions, and sampling mechanisms already used in Rosetta, which facilitates a direct comparison; (ii) the use of specialised genetic operators that focus on loop regions of the protein chain, explicitly encouraging conformational space exploration (Kandathil et al., in press); (iii) the incorporation of stochastic ranking as a mechanism to explicitly control the balance between exploration and exploitation during the search; and (iv) the use of novel conformational diversity measures, found during our ongoing work to effectively differentiate between compact structures with different folds (Garza-Fabre et al., 2015).
3 Rosetta-Based Memetic Algorithm
The term memetic algorithm was originally coined by Moscato (1989). This term has been widely adopted in the literature to denote a broad class of metaheuristics that extend population-based methods, such as evolutionary algorithms, by incorporating problem-specific knowledge, usually in the form of a local search strategy or through the use of specialised search operators (Moscato and Cotta, 2003; Hart et al., 2005). An important number of successful applications of MAs have been reported, covering a wide range of optimisation problem classes and application domains.2
In this section, an MA is proposed in the context of the fragment-assembly approach to protein structure prediction. The proposed MA is first introduced in detail in Section 3.1. Then, Section 3.2 describes the experiments performed and presents the results of the evaluation of this method and its comparison with respect to Rosetta, a well-established option and one of the most successful sampling protocols in this area.
3.1 Method Design
Our Rosetta-based memetic algorithm (RMA) is built upon the framework of genetic algorithms (Goldberg, 1989). Our incorporation of problem-specific knowledge in the RMA relates to the use of Rosetta as a local improvement search strategy. In addition, the implemented genetic operators exploit information from secondary structure predictions as a means of boosting the exploration of the space of protein folds.
Algorithm 1 outlines the proposed RMA, which consists of four consecutive stages, each of which is based on the corresponding stage of the standard Rosetta protocol. In stage 1, lines 1 to 3 in Algorithm 1, an initial parent population is created by generating N fully extended conformations. Each parent individual is then independently processed using stage 1 of Rosetta. Given that van der Waals forces comprise the only information used at this stage to guide the search process, such an initialisation step is aimed at generating a potentially diversified set of valid protein structures. Stages 2, 3, and 4 of the RMA are based on the iterative procedure depicted in lines 5 to 14 of Algorithm 1. These stages differ in the implementation of distinct Rosetta stages as the local improvement strategy, which implies the use of different energy functions and fragment-insertion lengths (refer to Section 2.4 for details). The processing of each of these stages involves a total of Gmax generations. During the first generation, a new population is obtained by improving the existing parent individuals through local search (based on the respective Rosetta stage). In contrast, subsequent generations produce an offspring population based on mating selection and the application of genetic operators (recombination and mutation). Offspring individuals are then subjected to local improvement. The resulting improved offspring compete against the parent individuals in order to survive from one generation to the next (survival selection).
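The control flow of stages 2 to 4 can be sketched generically as follows. This mirrors Algorithm 1 only schematically: all problem-specific components (the Rosetta local search, the loop-based genetic operators, and the energy function) are supplied as callables, and the toy operators used in the accompanying test are stand-ins for the real ones.

```python
import random

def memetic_stage(parents, local_search, recombine, mutate, energy,
                  G_max, rng):
    """One RMA-style stage: generation 1 improves the parents by local
    search; each later generation pairs individuals at random (panmictic,
    without replacement), applies recombination and mutation, improves
    the offspring by local search, and retains the N best of parents
    plus offspring (survival selection on energy)."""
    N = len(parents)
    population = [local_search(p) for p in parents]      # generation 1
    for _ in range(G_max - 1):
        order = rng.sample(range(N), N)                  # random pairing
        offspring = []
        for i in range(0, N - 1, 2):
            a, b = population[order[i]], population[order[i + 1]]
            for child in recombine(a, b, rng):
                offspring.append(local_search(mutate(child, rng)))
        # Plus selection: parents and offspring compete on energy.
        population = sorted(population + offspring, key=energy)[:N]
    return population
```

In the actual RMA, each stage reuses this loop with the corresponding Rosetta stage as `local_search`, which also changes the energy function and fragment-insertion length in play.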
3.1.1 Mating Selection
In order to moderate selection pressure and to foster a good mixing of genetic material that favours exploration, the proposed RMA implements a random and panmictic mating selection strategy. That is, parents are selected in random order, without replacement, to form the pairs of individuals that are subjected to recombination. This allows each parent individual to be selected and considered as a source of genetic material exactly once. Offspring produced by recombination are then processed by the mutation operator and the local improvement strategy.
3.1.2 Genetic Operators
The genetic operators of the RMA, both recombination and mutation, exploit available information from secondary structure predictions in order to intensify exploration with regard to those regions of the protein chain that have been predicted to be loops.3 The design of these operators is premised upon our belief that the increased exploration of the space of possible loop configurations can contribute significantly to the discovery and investigation of different protein folds during the search process. Thus, genetic operators have been implemented in such a way that their application affects only the residues located at loop regions. This has the additional advantage of preserving the gains that have been achieved in terms of the optimisation of secondary structure elements.4 Details on the recombination and mutation operators are provided next.
The loop-based recombination operator takes two parent individuals as input. It produces as output two new offspring individuals by cloning the given parents and, based on a given probability pc, interchanging the configuration (torsion angle values) of a randomly selected loop region between them. This operation is equivalent to the application of the well-known two-point crossover operator where the crossover points are set to the start and end residues of a loop region. The functioning of this operator is illustrated in Figure 4.
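A minimal sketch of this operator follows, with conformations simplified to flat lists of per-residue values and `loops` given as a list of (start, end) residue index pairs; the real operator acts on torsion-angle configurations.

```python
import random

def loop_recombination(parent_a, parent_b, loops, pc, rng):
    """Loop-based recombination: with probability pc, swap the
    configuration of one randomly chosen loop region between two cloned
    parents. Equivalent to two-point crossover with the crossover points
    fixed at the start and end residues of the chosen loop."""
    child_a, child_b = list(parent_a), list(parent_b)
    if loops and rng.random() < pc:
        start, end = rng.choice(loops)
        child_a[start:end], child_b[start:end] = (
            parent_b[start:end], parent_a[start:end])
    return child_a, child_b
```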
Mutation is based on fragment insertions, the same perturbations enforced by the standard Rosetta protocol (see Sections 2.3 and 2.4). During stages 2 and 3 of the RMA, nine-residue insertions are considered, and three-residue insertions are used during stage 4. All offspring generated by recombination are subjected to mutation. This operator attempts a random fragment insertion at each insertion window, based on a given probability pm. There exist a total of n − f + 1 windows at which a fragment can be inserted, where n denotes the length of the protein sequence and f is the fragment length (i.e., f ∈ {3, 9}). Hence, about pm(n − f + 1) fragment insertions can be expected to occur from the mutation of each individual. In order to constrain the application of this operator to loop regions, only the subset of insertion windows involving one or more loop residues is to be considered. Moreover, if mutation occurs at an insertion window spanning both loop and non-loop residues, the original configuration of all non-loop residues is preserved.
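The loop-restricted mutation described above can be sketched as follows. For brevity, the hypothetical `fragments` input maps a window index to a single candidate fragment, whereas a real fragment library would offer several candidates per window.

```python
import random

def loop_mutation(angles, fragments, loop_residues, pm, rng):
    """Fragment-insertion mutation restricted to loop regions: attempt
    an insertion with probability pm at each window that covers at least
    one predicted loop residue, and preserve the original values at any
    non-loop residues inside the chosen window. Illustrative sketch."""
    f = len(next(iter(fragments.values())))       # fragment length
    n = len(angles)
    result = list(angles)
    for w in range(n - f + 1):                    # n - f + 1 windows
        covers_loop = any(r in loop_residues for r in range(w, w + f))
        if covers_loop and w in fragments and rng.random() < pm:
            frag = fragments[w]
            for i, r in enumerate(range(w, w + f)):
                if r in loop_residues:            # keep non-loop residues
                    result[r] = frag[i]
    return result
```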
3.1.3 Survival Selection
At the end of all (but the first) generations in stages 2 to 4 of the RMA, all parent individuals and all produced offspring are considered to be candidates to survive and to form the next generation’s population. This survival selection scheme is usually referred to as plus (+) selection in the context of evolution strategies (Beyer and Schwefel, 2002). In order to select survivors, the set of individuals is ranked, and the N top-ranked solutions are chosen. Here, the energy of the protein conformations encoded by candidate solutions is used as the only ranking criterion.
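Stated on its own, and under the assumption that candidate solutions are ranked by energy alone, this survival step reduces to the following one-liner.

```python
def plus_selection(parents, offspring, energy, N):
    """Plus (+) survival selection: parents and offspring compete
    jointly, and the N lowest-energy individuals survive."""
    return sorted(parents + offspring, key=energy)[:N]
```

Section 4 replaces this purely energy-based ranking with a stochastic ranking procedure that also rewards conformational diversity.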
3.2 Experiments and Results
This section presents the experiments performed and the results obtained during the evaluation of the proposed RMA. First, Section 3.2.1 details the protein targets considered, the performance measures used, and the experimental conditions adopted. Then, Section 3.2.2 reports the results regarding the comparison of the RMA with respect to the standard Rosetta protocol. Finally, Section 3.2.3 discusses the relevance of exploiting secondary structure information within the design of the RMA’s operators.
3.2.1 Experimental Setup
Protein Targets. Table 1 lists the set of protein targets used during the experiments conducted. These targets varied in length and secondary structure classification, as detailed in the table. Throughout this study, the different targets will be referred to by their corresponding identifiers from the PDB (Berman et al., 2000).
Performance Measures. As a performance measure, the energy of the protein conformations will be used, as computed by the low-resolution score function that Rosetta employs during stage 4 (see Section 2.4). In the experiments reported, energy values have been normalised to the range [0, 1] based on the minimum and maximum values reached for each particular protein target, considering all the different methods compared. Energy values are to be minimised in all cases. Furthermore, a structural measure is used to evaluate the quality of the candidate conformations for a protein. Structural quality is usually assessed in the literature with respect to a reference structure, generally the native structure as determined by experimental methods. Perhaps the best-known structural quality measure is the root-mean-square deviation, or RMSD for short. RMSD is defined as the square root of the mean of the squared deviations (distances) between corresponding atoms in the two structures compared. Calculation of this measure requires the structures to be aligned so that the RMSD is minimised, and this minimum RMSD value is then reported. Such an alignment between the two structures is achieved by the Kabsch algorithm, which provides the optimal rigid-body rotation in all cases (Kabsch, 1976, 1978). In this study, we calculated RMSD using Cα atoms. Lower values for this measure, expressed in ångströms (Å),5 indicate better correspondence to the native structure.
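The minimum-RMSD calculation described above can be sketched with NumPy; this is an illustrative implementation of the Kabsch superposition, not the code used in the study:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Minimum RMSD between two (n, 3) arrays of corresponding C-alpha
    coordinates, after optimal superposition by the Kabsch algorithm."""
    P = P - P.mean(axis=0)                   # centre both structures
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                              # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                       # optimal rotation matrix
    P_rot = P @ R.T                          # superpose P onto Q
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))
```

For an exact rigid-body copy of a structure (rotated and translated), this returns an RMSD of zero up to floating-point error.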
Settings for the Approaches Compared. During the experiments performed, both Rosetta and the RMA were run to generate a set of 1,000 candidate conformations for each protein target. Therefore, a total of 1,000 individual Rosetta trajectories were considered, each of which produced a single solution. The recommended settings for Rosetta were used in all cases.6 With regard to the RMA, a population size of N = 100 was always used. A single execution of the RMA produces a total of N solutions. Hence, ten independent executions of the RMA were carried out, each using the equivalent computational effort of N trajectories of the standard Rosetta protocol. The remaining control parameters of the RMA, including the recombination probability pc and the mutation probability pm, were kept fixed across all experiments.
Conformational Space Discretisation. The analysis presented in Section 3.2.3 inquires into the ability of the RMA to explore the conformational search space. A small protein, 1enh, and a new measure of conformational diversity are used for this purpose; see Figure 5.7 As is illustrated in this figure, a given protein conformation can be coarsely described by a distance vector composed of a set of distances that account for the relative positions of its secondary structure elements with respect to each other. A total of E(E − 1)/2 distance values define this vector for a protein with E secondary structure elements. Thus, three-dimensional vectors are used here to represent the folded states of protein 1enh. From this, it follows that the maximum value that can be reached in each dimension can be derived by computing this measure from a fully extended conformation. Preliminary testing on this particular target indicated that the bulk of conformations explored during the search process produce values within 50% of this upper bound. Therefore, we focus on the range from 0% to 50% of the maximum possible values and, in all cases, this range has been split into a number of equally sized sub-ranges in order to achieve a discretisation of the space of all possible distance vectors. Despite the conceptual simplicity of this approach, the distances between secondary structure elements have been found in the authors’ ongoing work to be important descriptors of the folded state of a protein molecule (Garza-Fabre et al., 2015).
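Under the assumption that each secondary structure element is summarised by a single centroid, and with an illustrative choice of ten sub-ranges per dimension (the actual count used in the study is not reproduced here), the distance-vector discretisation might look like:

```python
import numpy as np
from itertools import combinations

def delta_vector(centroids):
    """Pairwise distances between the centroids of the E secondary
    structure elements: E*(E-1)/2 values coarsely describing the fold."""
    return np.array([np.linalg.norm(a - b)
                     for a, b in combinations(centroids, 2)])

def discretise(vec, max_vals, bins=10, fraction=0.5):
    """Map a distance vector to a cell of the discretised space. Each
    dimension covers 0 to `fraction` of its maximum value (taken from a
    fully extended conformation) and is split into `bins` sub-ranges;
    both parameter values here are illustrative assumptions."""
    upper = fraction * np.asarray(max_vals)
    idx = np.floor(vec / upper * bins).astype(int)
    return tuple(np.clip(idx, 0, bins - 1))   # clamp out-of-range values
```

Counting how often the search visits each cell then gives the exploration frequencies reported in Figure 7.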
Table 1: Protein targets used in the experiments (PDB identifier, secondary structure class, and sequence length for each target).
3.2.2 Comparison with the Standard Rosetta Protocol
In order to investigate the suitability of the RMA, it is imperative to evaluate this method with respect to the standard Rosetta protocol. This is not only due to the fact that Rosetta is widely accepted as being representative of the state of the art, having shown a remarkable performance among existing fragment-assembly methods for protein structure prediction, but also because the RMA is based to a large extent on Rosetta, as is detailed in Section 3.1. Therefore, a comparison to Rosetta is essential for highlighting the relevance of all other components of the RMA to which Rosetta has been coupled. The results of this comparison are shown in Figure 6, which contrasts the energy and RMSD values scored by Rosetta and the RMA on a set of protein targets (for table-format results, please see Section B of the supplementary material).
From Figure 6, it is possible to observe that the RMA exhibits a better performance than Rosetta with regard to energy, a behaviour that remains consistent across all 30 protein targets considered. The RMA was not only able to produce lower energies than Rosetta, but also to ‘narrow’ the distribution of energy values scored in all cases. Given that energy is the only optimisation criterion used to guide the search process, this performance in terms of energy is indicative of the effectiveness of the proposed search strategy. By focusing now on the results for the RMSD measure, the figure shows that for a number of proteins (namely, 1elwA, 1fna, 1hz6A, 1iibA, 1npsA, 1ten, 1tig, 1tit, 256bA, 2chf, and 2ci2I; proteins are referred to throughout by their PDB identifiers), most (or all) of the solutions generated by the RMA potentially correspond to native-like conformations (as is suggested by the low RMSD values, below 5 Å). For all these targets, note that Rosetta was also able to produce conformations providing low RMSD values during the executions performed. The RMA, however, has clearly increased the likelihood of sampling and better populating those low-RMSD regions of the conformational space. For some other targets, such as 1kpeA and 1vcc, in spite of a better average performance and overall tendency shown by the RMA (i.e., a narrow distribution of points at low energy and low RMSD regions), slightly lower RMSD values have been reached by a few of the Rosetta trajectories. An interesting characteristic common to all of the targets discussed so far is that the energy function seems to be “well-aligned” to the RMSD measure; that is, for all these targets, lower energies tend to be associated with lower RMSD values, as can be appreciated from the plots. Therefore, in such a favourable scenario, a search strategy that is effective at optimising the energy function can certainly be as effective at locating native-like conformations.
However, the results also reveal that this important alignment between the energy function and the RMSD measure is not as evident when considering some of the remaining protein targets. A clear example of this is protein 1opd, whose corresponding plot in Figure 6 exhibits an almost opposite correlation between the two criteria. As a consequence, an outstanding performance in terms of energy does not necessarily translate into an acceptable performance in terms of RMSD. In the presence of well-known inaccuracies of energy functions (Bowman and Pande, 2009), even the most successful methods for searching the huge conformation space might fail at identifying, preserving, and exploiting native-like conformations. Nevertheless, the results also underline that Rosetta’s low-resolution energy functions are sufficiently informative to allow the search process to identify different compact (protein-like) folds. For several targets (1acf, 1c9oA, 1ctf, 1eyvA, 1gvp, 1wit, and 2vik), some structures fall within 5 to 6 Å RMSD of the native structure, and for some of these, the RMA seemed to be able to exploit a small energy gradient, thus improving the relative frequency of such structures compared to the results of Rosetta. In contrast, Rosetta was more effective at producing lower RMSD values for a series of targets, including 1bgf, 1cg5B, 1lis, and 1opd. This appears to be because the energy gradient for these targets is misleading, and Rosetta’s local convergence lends it robustness in this setting.
Finally, it is worth noting from the plots for several targets—namely, 1acf, 1bkrA, 1c9oA, 1ctf, 1dhn, 1eyvA, 1gvp, 1wit, and 2vik—that the distribution of RMSD values obtained by the RMA covers a wide range (with deviations of more than 10 Å in most cases), despite the low variation with regard to energy. Such a dispersion in the distribution of RMSD values could be explained, to a certain extent, by the multimodal nature of the energy landscape. The multiple solution clusters shown in these plots can potentially represent different attraction basins. The plots for proteins 1ctf and 1dhn, for instance, suggest the existence of at least two well-defined local basins where the efforts of the RMA were concentrated. Rosetta’s results for these targets (as well as for 1hz6A, 1iibA, 1ten, and 1tig) provide additional evidence of this. In the presence of multiple, equally fit energy basins, or when the basins where native-like conformations reside are not rewarded over non-native basins, there is no mechanism that can be exploited to assist search algorithms in identifying the most promising basins.8 This motivates the analysis presented in Section 4, where an alternative selection strategy is incorporated into the RMA as a means of encouraging diversification and the retention of a population of solutions that can span the multiple basins reached throughout the search process.
3.2.3 Genetic Operators and Secondary Structure Information
The purpose of this section is to discuss the role of the implemented genetic operators, and how the incorporation of secondary structure prediction information within these operators contributes to the exploration behaviour of the proposed RMA.
Four variants of the RMA were analysed and compared: (i) RMA without genetic operators—that is, setting pc = 0 and pm = 0;9 (ii) RMA with standard genetic operators; (iii) RMA with loop-based operators, as described in Section 3.1.2; and (iv) RMA with loop-based operators but using incorrect secondary structure information. For the second variant of the RMA, using standard operators, the two-point crossover operator was utilised.10 The standard mutation was similar to the loop-based mutation introduced in Section 3.1.2, but it took into account all W possible insertion windows and was also allowed to affect the configuration of non-loop residues. Therefore, the standard and loop-based operators were functionally similar, differing only in whether they operated on all residues or on loop regions alone. The fourth RMA variant was considered in this analysis to illustrate the crucial role that the quality of the secondary structure predictions plays in the effectiveness of the proposed loop-based operators. In the incorrect secondary structure information provided to this RMA variant, the labels of the real secondary structure information (derived from the native structure) were inverted: loop regions were identified as secondary structure elements, and secondary structure elements were identified as loops, representing the worst possible prediction scenario (in terms of loop region identification).
The experiment conducted investigated the extent to which different regions of the conformational space were reached and exploited throughout the search process. This analysis focused on a small protein target, 1enh, and was based on the discretisation of the conformational space described at the end of Section 3.2.1. As the result of this analysis, Figure 7 reports the frequency with which each region of this discrete space was considered while using the four above-mentioned variants of the RMA.11
As is shown in Figure 7, the first variant of the RMA tends to overemphasise the exploitation of very compact regions of the conformational space. This is due to the high selection pressure that results from the use of an elitist and extinctive selection scheme, and the lack of genetic operators, which are essential mechanisms for promoting exploration. It is evident from the figure that the implementation of standard operators (second RMA variant) has contributed significantly to achieving increased exploration. Note, however, that the incorporation of problem-specific knowledge into these operators has further improved the exploration capabilities of the algorithm. This can be seen from the results of the third variant of the RMA, which used specialised, loop-based operators. This confirms the underlying hypothesis that guided the design of these operators: that the configuration of loop regions is a major determinant of the arrangement and packing of secondary structure elements (in a fragment-based prediction context), so that the increased exploration in terms of loop configurations should contribute to intensifying the exploration of the space of potential conformations.
On the other hand, the results of the fourth RMA variant confirm that the utilisation of erroneous information can be detrimental and may produce even worse results than the use of no information at all. The fourth (misinformed) RMA variant showed decreased exploration performance when compared to the second variant, which used standard (uninformed) operators. This is particularly evident when analysing certain dimension pairs of the discretised conformational space. An additional experiment was carried out to investigate whether the results that the RMA exhibited in Section 3.2.2 could be improved by providing this method with the real (rather than the predicted) secondary structure information. No noticeable differences in performance were found. Although the secondary structure predictions used are not completely accurate, these findings suggest that the location and elongation of loop regions, the only information exploited by the RMA, is accurate enough overall.12 The results of this experiment, as well as detailed information about the accuracy of the secondary structure predictions used, can be found in Section C of the supplementary material.
4 Improving Robustness to Deal with Inaccuracies of Score Functions
In Section 3.2.2, we found that the original implementation of the RMA, employing energy as the only selection criterion, was effective at producing lower energy and RMSD values than the standard Rosetta protocol for a significant number of the protein targets considered. We also found, however, that existing inaccuracies in the scoring functions prevented the RMA from consistently generating native-like (low-RMSD) conformations. It was therefore necessary to devise strategies that can increase the robustness of the proposed method in order to cope with this issue.
Given the lack of correlation between the energy and RMSD criteria, and recognising also the multimodal nature of the PSP problem, this section explores an alternative selection mechanism that aims to provide an effective means of regulating selection pressure and enhancing the diversity preservation capabilities of the RMA. This mechanism is motivated by our belief that, in the presence of inaccurate energy functions, the generation of a diverse set of candidate conformations, which can potentially span multiple global and local energy minima, can constitute a more robust approach (i.e., with better chances of achieving native-like structures) than simply succeeding in producing a single global minimum (the general tendency of evolutionary optimisation methods; Shir et al., 2010; Das et al., 2011). This follows an approach of explicit diversity maintenance similar to that adopted in some works (Sastry et al., 2005; Goldberg et al., 1992) on dealing with deceptive “trap” functions (Goldberg, 1987, 1992), or other functions that tend to lead an optimiser away from the best configurations (Watson et al., 1998).
As a population-based approach, the RMA possesses the natural advantage that multiple solutions can be reached during a single execution, in contrast to single-solution-based methods (e.g., Rosetta) that rely on performing multiple individual executions (or restarts) in the hope that each of them can discover a different solution. Nevertheless, the RMA needs to be equipped with effective mechanisms to foster the generation and preservation of a diversity of protein conformations in its population.
This section proceeds as follows. In Section 4.1, the alternative selection technique is introduced, and its implementation details in the context of the RMA are described. The suitability of this strategy is then evaluated in Section 4.2, with respect to the basic, energy-based RMA and with respect to the standard Rosetta protocol.
4.1 Stochastic Ranking-Based Survival Selection
Runarsson and Yao (2000) introduced a rank-based selection mechanism as a means of dealing with constrained optimisation problems. This mechanism, which they called stochastic ranking, employs a bubble-sort-like procedure to rank a population of candidate individuals for selection purposes. The characteristic feature of this strategy is the incorporation of a user-defined parameter, Pf, that represents the probability of using one or the other of two criteria, namely an objective function and a penalty function (sum of constraint violations), each time a pair of solutions is compared (in order to determine which one is fitter) during the ranking process. Such a parameter, therefore, removes the dependence on hard-to-tune penalty factors, allowing the user to control the balance between the two criteria in order to prevent over- and under-penalisation scenarios. The promising behaviour exhibited by stochastic ranking motivated further research regarding this proposal, and a number of works based on this constraint-handling technique have been reported (Mezura-Montes and Coello Coello, 2011).
It is possible, nevertheless, to use a similar procedure to achieve the desired balance between two arbitrary selection criteria.13 Algorithm 2 presents such a generalised version of stochastic ranking. It ranks a list of solutions based on a given pair of criteria c1 and c2 that, without loss of generality, are assumed to be minimised. The core of the ranking process is detailed in lines 3 to 22 of Algorithm 2, the application of which can be referred to as a “sweep.” During a sweep, ranking occurs by iteratively comparing adjacent individuals. First, lines 5 to 10 cover the cases where the competing solutions are indifferent14 either with respect to a single criterion, so that the remaining criterion is always used to discriminate between them, or with respect to both criteria, in which case a random decision is made. The general case is described in lines 12 to 20, where the bias parameter Pf denotes the probability of adopting either c1 or c2 as the underlying discrimination criterion. The ranking process is completed by applying (at most) I sweeps. Note, however, that this iterative process can stop earlier if no change in the rank ordering of solutions occurs within a complete sweep; this was also done in the original implementation of stochastic ranking (Runarsson and Yao, 2000).
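Algorithm 2 is not reproduced here, but its bubble-sort-like logic can be sketched as follows; the tie-handling and early-stopping behaviour follow the description above, and the line numbers in the text do not correspond to this sketch:

```python
import random

def stochastic_ranking(solutions, c1, c2, pf, sweeps):
    """Generalised stochastic ranking (after Runarsson and Yao, 2000):
    rank solutions under two minimisation criteria c1 and c2, where `pf`
    is the probability of comparing by c1 in the general case."""
    ranked = list(solutions)
    for _ in range(sweeps):
        swapped = False
        for i in range(len(ranked) - 1):
            a, b = ranked[i], ranked[i + 1]
            if c1(a) == c1(b) and c2(a) == c2(b):
                swap = random.random() < 0.5   # indifferent on both: random
            elif c1(a) == c1(b):
                swap = c2(a) > c2(b)           # tie on c1: decide by c2
            elif c2(a) == c2(b):
                swap = c1(a) > c1(b)           # tie on c2: decide by c1
            elif random.random() < pf:
                swap = c1(a) > c1(b)           # with probability pf use c1
            else:
                swap = c2(a) > c2(b)           # otherwise use c2
            if swap:
                ranked[i], ranked[i + 1] = b, a
                swapped = True
        if not swapped:                        # early stop: order is stable
            break
    return ranked
```

Setting `pf` to 1 or 0 degenerates to a plain sort by c1 or c2, respectively; intermediate values interleave the two criteria stochastically.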
As was pointed out by Runarsson and Yao (2000), when the maximum number of sweeps I approaches infinity, the ranking will be determined solely by the criterion favoured by the setting of parameter Pf. Therefore, the desired selection bias can be achieved equivalently either by adjusting Pf or by increasing I for a given Pf value.15 If we fix I, parameter Pf thus becomes solely responsible for regulating the strength of the bias. In this study, a value of I = M is considered, where M is the number of solutions to be ranked. Figure 8 illustrates the effectiveness of this method for introducing a bias in the selection process through the use of different values for parameter Pf. As can be seen from this figure, an intermediate value of Pf provides the best trade-off between the criteria. It can also be seen that even subtle deviations from this value are capable of producing significant biasing effects that emphasise one criterion over the other.
4.1.1 Selection Criteria and Implementation within the RMA
The first selection criterion focuses on the quality of individuals, which is measured as the energy of the encoded conformations. It is worthwhile to remember that different energy functions are used by the RMA at different stages of the search.
Note that whereas the energy values used as the first selection criterion are to be minimised, the second, diversity-based criterion is to be maximised. In this way, the aim of incorporating the stochastic ranking-based procedure into the RMA is to achieve a suitable balance between quality and structural diversity in order to drive selection.
An elitist step was introduced in order to prevent the loss of promising solutions regardless of the value chosen for Pf. Initially, the lowest-energy solution from the pool of candidate individuals was selected. All other individuals were then ranked on the basis of stochastic ranking in order to determine the remaining survivors.
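Combining the elitist step with the ranked selection might look like the following sketch, where `rank` stands for any ranking routine of the kind described above and `diversity` is a placeholder for the structural diversity criterion (both hypothetical interfaces):

```python
def sr_survival_selection(parents, offspring, N, energy, diversity, rank):
    """Sketch of the elitist survival step: the lowest-energy individual
    survives unconditionally, and the remaining N - 1 survivors are taken
    from the top of a ranking over energy (minimised) and diversity
    (maximised, so it is negated to fit a minimisation comparison)."""
    pool = parents + offspring
    elite = min(pool, key=energy)   # elitist step: keep the best energy
    rest = list(pool)
    rest.remove(elite)              # rank everyone else
    ranked = rank(rest, energy, lambda s: -diversity(s))
    return [elite] + ranked[:N - 1]
```

Any ranking procedure with the signature `rank(solutions, c1, c2)` can be plugged in, including a stochastic ranking implementation.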
4.2 Experimental Results
This section investigates the advantages of replacing the original, energy-based selection mechanism of the RMA with the stochastic ranking procedure introduced in the previous subsection. The two versions of the RMA are compared with respect to each other, and with respect to Rosetta, on a set of 30 protein targets; see Table 1. The results are evaluated in terms of the energy of the candidate conformations and the RMSD measure. Details of these criteria, and all settings and experimental conditions adopted for Rosetta and the RMA, have already been defined in Section 3.2.1. Three different values of the stochastic ranking parameter Pf were analysed. The results for the energy and RMSD criteria are respectively presented in Figures 10 and 11 (for table-format results, refer to Section B of the supplementary material).
As expected, the use of the stochastic ranking strategy reduced selection pressure with respect to energy. Figure 10 exhibits a clear and consistent tendency in the energy values obtained by the four variants of the RMA. Higher energies tend to be produced as parameter Pf is decreased in order to moderate the emphasis given to the energy criterion (over the diversity criterion) during selection.16 Note, however, that the energies scored by most configurations of the RMA are significantly better than those obtained by the standard Rosetta protocol. The only exception is the configuration using the lowest Pf value considered, which achieved results comparable to or worse than those of Rosetta in most cases.
The results for the RMSD measure highlight the suitability of the stochastic ranking selection for enhancing the robustness of the RMA. As can be seen from Figure 11, the consideration of structural diversity as an additional criterion to guide selection has increased the likelihood of the RMA reaching and preserving more native-like (low-RMSD) conformations for a number of protein targets (in comparison to the use of energy only). The use of an intermediate Pf value, which provides a balanced trade-off between the two selection criteria, seems to produce the most competitive performance in most cases. The stochastic ranking selection allowed the RMA to improve upon the lowest RMSD values achieved by Rosetta for several targets—for example, 1cg5B, 1ctf, 1kpeA, 1who, and 2vik. Conversely, although the RMA improved the central tendencies in most cases, no configuration of the RMA was able to reach the minimum RMSD values produced by Rosetta for some other targets, such as 1bkrA, 1eyvA, 1lis, 1opd, and 1tul. Such minimum RMSD structures reached by Rosetta tend to be associated with higher energies, as we also observed in Figure 6, and are therefore difficult for the RMA to retain. A further point of interest relates to the performance that the stochastic ranking-based RMA exhibits when dealing with targets for which the energy function seems to be well-correlated with the RMSD measure (1elwA, 1fna, 1hz6A, 1iibA, 1npsA, 1ten, 1tig, 1tit, 256bA, 2chf, and 2ci2I; see Figure 6). As we found in Section 3.2.2, a selection based solely on energy is able to produce satisfactory results under such a scenario. The competitive (and to a certain extent comparable) performance shown by the stochastic ranking-based RMA illustrates the ability of this strategy to maintain an acceptable degree of success on such moderate-difficulty targets, while being more robust under more challenging conditions.
Finally, a relevant subject of analysis concerns the availability of native-like fragments. Fragment-assembly methods rely on the existence of native-like configurations in the conformational space defined by the fragment libraries employed. For some of the targets—for instance, 1tul and 1dhn—no native-like structures were sampled during our experiments, regardless of the search method used. This may suggest that native-like configurations are not covered, or are only scarcely represented, in the fragment libraries adopted for this study, and is an issue that deserves further investigation.
4.2.1 Diversity Generation and Preservation
As in Section 3.2.3, we proceed to analyse and discuss the roles that both the genetic operators and the survival selection strategy play in terms of diversity generation and preservation. These roles are illustrated in Figure 12, which contrasts the distributions of solutions obtained by four variants of the RMA: (i) RMA using energy-based selection, without genetic operators; (ii) RMA using energy-based selection, with loop-based recombination and mutation; (iii) RMA using stochastic ranking-based selection, without genetic operators; and (iv) RMA using stochastic ranking-based selection, with loop-based recombination and mutation.
As is shown in Figure 12, without the use of the genetic operators, the energy-driven RMA tends to produce compact, well-defined solution clusters. Each cluster is the result of one of the ten independent RMA executions performed, each of which produced a total of N very similar conformations. The lack of appropriate mechanisms to boost exploration, and the high selection pressure that arises from the use of an elitist and extinctive discrimination strategy based solely on energy, can lead to premature convergence. The inclusion of genetic operators in the energy-based RMA allows this method to discover and to exploit more promising attraction basins of the energy landscape, as the results for the second RMA variant suggest. The use of these specialised search operators was found previously in Section 3.2.3 to increase conformational space exploration. Similar effects were achieved with the third variant of the RMA. This variant does not employ genetic operators, but replaces the energy-based selection with the stochastic ranking approach. The diversity preservation capabilities of this alternative selection scheme reduce selection pressure with respect to the energy criterion, encouraging a better sampling of the conformational space. Although the individual use of these mechanisms has clearly contributed to RMA’s performance, Figure 12 suggests that their combined use (fourth variant) allowed the RMA to further improve in both generating and retaining a diverse set of solutions throughout the search process.
In order to investigate this further, Figure 13 contrasts the behaviour of the RMA when using the energy- and stochastic ranking-based selection strategies. The behaviour of the RMA throughout the search process is evaluated in terms of the offspring survival rate, convergence, and population diversity observed as the result of each application of the survival selection process. Results are provided for protein 1cg5B, but similar results were observed for additional targets (see Section D of the supplementary material). The use of the stochastic ranking selection prompted a noticeable drop in the number of surviving offspring. As we discussed before, the stochastic ranking strategy reduces selection pressure with respect to energy (by preventing overemphasising this criterion during selection). This slows the convergence speed, as is evident from the convergence curves in Figure 13. However, note that, in spite of this reduced selection pressure in terms of energy, the overall selection pressure of the algorithm rises as a consequence of incorporating an additional discrimination criterion. That is, in order to be selected, candidate individuals need to stand out in terms of the two implemented criteria (as was suggested by the analysis of Section 4.1; see Fig. 8), which leads to a decrease in the offspring survival rates. On the other hand, Figure 13 shows that whereas the energy-based selection tends to lose diversity and produce a set of very similar (or duplicate) individuals at the end of the search process, the stochastic ranking approach is clearly more effective at maintaining the population’s diversity. Diversity preservation is important as a means of producing a more robust solution set consisting of potentially different protein folds discovered during the search process. Moreover, the consequent diversity of genetic material within the population is beneficial and further increases exploration, as it is exploited through recombination.
Among all the possible conformations that a protein can adopt, it is believed that its native state, in which it performs its biological functions, corresponds to the one with the lowest overall free energy (Anfinsen, 1973). From this hypothesis, it follows that predicting structure from sequence is a matter of optimising an energy function with respect to the space of possible tertiary structure configurations. This approach, termed (de novo) protein structure prediction (PSP), has been pursued for several decades, and considerable progress has been made in inferring structures close to the native form, as determined by experimental methods such as X-ray or NMR techniques. This study focused on a combinatorial optimisation form of the PSP problem. The fragment-assembly class of methods, which has been the most successful approach to de novo PSP to date, works with a finite set of tertiary structure fragments, rather than a continuous space of bond angles. While the approach works very well on some smaller proteins, it is still the case that larger proteins (say of 100 residues and above) generally present a serious challenge. The Rosetta method, a leading example of fragment assembly (which we closely followed here), uses many independent optimisation restarts to obtain enough different candidates to be able to make a prediction of structure, and even with this approach it has been found to be far from reliable across different protein targets.
In this article we have proposed a new sampling protocol for fragment assembly, the Rosetta-based memetic algorithm. The RMA seeks to overcome the limitations of existing sampling protocols by implementing mechanisms that ensure an appropriate exploration of different protein folds. First, problem-specific knowledge is incorporated into a set of genetic operators that are designed to act on the loop regions of candidate structures. This is based on the understanding that the configuration of loop regions is correlated with the three-dimensional arrangement and packing of secondary structure elements in a fragment-based prediction context, and that a focused exploration of the space of possible loop configurations will translate to an extensive exploration of the space of protein folds. Second, basin-hopping (and appropriate descent into local optima) is further facilitated by using the framework of a memetic algorithm that uses the well-established Rosetta protocol as a local search routine. The experiments performed confirm that the new protocol achieved highly competitive results in terms of optimisation performance (i.e., the minimisation of energy), when evaluated with respect to the standard Rosetta protocol on a large set of protein targets, although the results did not always translate into improvements in prediction performance.
This last finding is not unexpected. In addition to the challenges arising from the size and multimodality of the search space, protein structure prediction is known to be sensitive to the energy functions used. Whereas state-of-the-art energy functions are often useful in identifying protein-like structures, they are known to have only limited power in pinpointing the most accurate (native-like) folds, a limitation that has not been addressed despite significant research effort focused on the development of more accurate functions. This poses a problem for optimisation protocols that cannot overly rely on the relative rankings between different local optima, which (dependent on the protein) may be more or less “deceptive” (Goldberg, 1987, 1992). As explicit diversity preservation (niching) has been recognised to be essential in similar scenarios (Sastry et al., 2005; Goldberg et al., 1992; Watson et al., 1998), an alternative selection scheme, based on stochastic ranking (Runarsson and Yao, 2000), was integrated into the proposed RMA, with the purpose of regulating selection pressure and enabling diversity maintenance. The results obtained indicate that this modification allows the RMA to display a more robust performance and to improve upon Rosetta’s performance in terms of the optimisation of both energy and correspondence to the native structure.
In summary, we posit that, due to the enduring inaccuracies of state-of-the-art energy functions, the design of search protocols that explicitly encourage the generation and preservation of diverse folds is a valuable research direction in protein structure prediction. Here, corroborating evidence of this was presented: we described a memetic algorithm that incorporates explicit mechanisms to foster conformational diversity, and we illustrated that this approach can lead to powerful and robust sampling protocols that can offset the problematic bias that is introduced by inaccurate scoring functions. Specifically, the first part of this report illustrates conclusively that improved optimisation of energy alone, which was successfully achieved, results at times in a problematic sensitivity to the accuracy of the energy function. In the second part of the report, significant improvements were demonstrated in robustness through the explicit consideration of diversity during the search (via the stochastic ranking method). Overall, the consideration of structural diversity as an additional criterion to guide offspring generation and selection increased the likelihood of reaching and preserving more native-like (low-RMSD) conformations for the majority of the targets studied in this work. We intend to exploit this finding in our future work, and will investigate possibilities for further improvement of the proposed approach—for instance, by exploring different measures of conformational diversity, alternative mechanisms of diversity maintenance, and further tuning and adjustments to the overall search protocol.
We finish by highlighting the key contributions of this work to the wider research community. From the perspective of protein structure prediction, we present a memetic algorithm for fragment assembly that shows significant promise in comparison to the state-of-the-art Rosetta technique. In line with the core spirit of memetic algorithms, our method uses an established search technique (Rosetta) as a local search strategy, and we designed specialised genetic operators and selection schemes to encourage the exploration and retention of diverse conformations. It is our view that exploration performance in general, and memetic algorithms in particular, have been paid insufficient attention in the context of fragment assembly, and our results help to illustrate the significant improvements that can be achieved by emphasising exploration and the preservation of conformational diversity. Our article also highlights the possibility of employing stochastic ranking as a general mechanism for diversity preservation. In particular, we used such ranking here to account for the multimodal nature of the optimisation problem, as well as the lack of accuracy in the objective function considered. We expect that our approach will be equally useful in other problem domains, including application areas of multimodal optimisation and problem domains that involve noisy objective functions or other types of uncertainties.
Some proteins are composed of multiple polypeptide chains called subunits, the spatial arrangement of which is described by the protein’s quaternary structure.
Neri and Cotta (2012) have presented a review of MAs to address problems in discrete, continuous, large-scale, constrained, and multiobjective optimisation, as well as in optimisation in the presence of uncertainties.
The secondary structure predictions are based on PSIPRED 3.3 (Jones, 1999). Dependence on this information implies no additional computational effort, as it is already derived during the construction of the fragment libraries.
Here, the term secondary structure element is used to refer to non-loop regions of the protein chain.
1 Å = 10⁻¹⁰ m.
As recommended, the settings used for Rosetta were: ; ; ; ; ; and .
Protein 1enh has only three secondary structure elements (all α-helices) separated by two loop regions. This reduced number of secondary structure elements makes this protein a suitable candidate for the analysis conducted, in contrast to all targets listed in Table 1, which involve four or more secondary structure elements.
In a blind prediction scenario, information from native states, and therefore from RMSD, is unavailable.
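For reference, RMSD here denotes the root-mean-square deviation between corresponding atom positions of a model and the native structure. A minimal sketch of the underlying formula is given below; it omits the optimal superposition (e.g., via the Kabsch algorithm) that is applied in practice before the deviation is computed.

```python
import math

def rmsd(coords_a, coords_b):
    """Plain coordinate RMSD between two equal-length lists of
    (x, y, z) atom positions. In practice the two structures are
    first optimally superimposed before this quantity is measured."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```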
The exploration behaviour of the first RMA variant is equivalent to that of Rosetta; the perturbations enforced by the Rosetta-based local search are the only mechanisms used for accessing new candidate conformations.
As we stated in Section 3.1.2, loop-based crossover is equivalent to two-point crossover using the first and last residues of a loop as the crossover points. To enable a more reliable comparison, the minimum and maximum separation between the crossover points in the standard operator were set, respectively, to the minimum and maximum lengths of a loop region for the protein considered. This prevents any observed differences in search behaviour from being attributed to the use of different magnitudes for the applied perturbations.
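A minimal sketch of this loop-based crossover, assuming candidate structures are represented as per-residue lists and that the loop boundaries are available from the secondary structure assignment, might look like:

```python
def loop_crossover(parent_a, parent_b, loop_start, loop_end):
    """Two-point crossover with the first and last residues of a loop
    as the crossover points (a sketch; the per-residue representation
    and the source of the loop boundaries are assumptions)."""
    child = (parent_a[:loop_start]
             + parent_b[loop_start:loop_end + 1]  # swapped loop region
             + parent_a[loop_end + 1:])
    return child
```

Because only the loop segment is exchanged, the secondary structure elements of each parent are inherited intact while their relative spatial arrangement can change.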
A single execution of each variant of the RMA was performed using a population size of . The reported results cover all candidate solutions accepted during the search process; solutions rejected during the Rosetta-based local improvement step were discarded in all cases.
The average secondary structure prediction accuracies for helix, strand, and loop residues, and the overall three-state accuracy for the targets considered, were , , , and , respectively (Rost and Sander, 1993). Further details can be found in Section C of the supplementary material.
Note that this approach could potentially be extended to consider also more than two criteria.
In this study, two solutions are said to be indifferent if and only if they present exactly the same value with respect to the evaluation criterion under consideration.
Increasing the number of sweeps I, however, would raise the computational effort required for ranking.
Note that the use of the conventional energy-based selection in the RMA is equivalent to the use of the stochastic ranking-based selection with the corresponding parameter set to its maximum possible value.