## Abstract

We investigate the utility of a mathematical framework based on discrete geometry to model biological and synthetic self-assembly. Our primary biological example is the self-assembly of icosahedral viruses; our synthetic example is surface-tension-driven self-folding polyhedra. In both instances, the process of self-assembly is modeled by decomposing the polyhedron into a set of partially formed intermediate states. The set of all intermediates is called the configuration space, pathways of assembly are modeled as paths in the configuration space, and the kinetics and yield of assembly are modeled by rate equations, Markov chains, or cost functions on the configuration space. We review an interesting interplay between biological function and mathematical structure in viruses in light of this framework. We discuss in particular: (i) tiling theory as a coarse-grained description of all-atom models; (ii) the *building game*—a growth model for the formation of polyhedra; and (iii) the application of these models to the self-assembly of the bacteriophage MS2. We then use a similar framework to model self-folding polyhedra. We use a discrete folding algorithm to compute a configuration space that idealizes surface-tension-driven self-folding and analyze pathways of assembly and dominant intermediates. These computations are then compared with experimental observations of a self-folding dodecahedron with side 300 μm. In both models, despite a combinatorial explosion in the size of the configuration space, a few pathways and intermediates dominate self-assembly. For self-folding polyhedra, the dominant intermediates have fewer degrees of freedom than comparable intermediates, and are thus more rigid. The concentration of assembly pathways on a few intermediates with distinguished geometric properties is biologically and physically important, and suggests deeper mathematical structure.

## 1 Introduction

Since the mid-1990s, an important goal in engineering has been to understand how to manufacture materials and devices by mimicking physical and biological mechanisms of self-assembly. Crystallization is a familiar physical mechanism of self-assembly. More broadly, self-assembly in physics includes the emergence of order in various geometric models of condensed matter driven by the minimization of free energy. While all living systems self-assemble while dissipating energy, the term self-assembly in biology usually refers to the formation of biomolecules such as microtubules, and small organisms such as viruses. It is a fundamental challenge in materials chemistry and nanotechnology to create experimental paradigms inspired by these natural examples. Such a “bottom-up” approach to manufacturing eliminates the cost of manipulating individual components at small scales. Moreover, it is conceivable that larger components could be functionalized with a wide range of properties (chemical, geometric, magnetic, photonic), offering the possibility of designer materials [26, 74].

Our primary purpose in this article is to explain how an important biological example—the self-assembly of icosahedral viruses—can inspire and guide the development of self-assembly in technology. To this end, we introduce the reader to the elegant natural design of viruses and a set of mathematical models for the structure and self-assembly of viruses. The viral capsid has inspired the development of several techniques for synthesizing polyhedra on scales ranging from 10 nm to 1 mm. The particular technology studied here is surface-tension-driven self-folding, described in greater detail below. While these two examples of self-assembly have completely different physics, we show that they can be modeled by a common mathematical framework rooted in discrete geometry. In both instances, the process of self-assembly is modeled by decomposing the polyhedron into a set of partially formed intermediates. The set of all intermediates is called the configuration space, pathways of assembly are modeled as paths in the configuration space, and the kinetics and yield of assembly are modeled by rate equations, Markov chains, or cost functions on the configuration space. The particular decomposition that is chosen is distinct for viruses and self-folding polyhedra. Informally, each decomposition may be thought of as a discrete geometric idealization of experimental observations.

Our emphasis lies in understanding the *pathways of self-assembly*. Such a study involves an interplay between natural and synthetic self-assembly. While we are inspired by nature, we hope that an analysis of synthetic self-folding polyhedra can reveal new insights about the self-assembly of viruses. In our work, the synthetic polyhedra have side 300 μm and the self-folding process may be observed under an optical microscope. By contrast, it is very challenging to describe the *equilibrium* crystalline structure of complete viruses at high resolution, and there is no understanding of the pathways at a comparable level of detail. Pathways cannot be observed by microscopy, and must be inferred from other experimental data. The interpretation of such data requires a model, and a wide range of theoretical models have been used to model virus assembly, as reviewed in Sections 45–6. Our work allows us to test the utility of some of these models by direct observation of assembly pathways in a distinct, but related, context.

In what follows, we first introduce the reader to self-assembly and describe our experiments on self-folding polyhedra. We then discuss discrete geometric aspects of the self-assembly of viruses, highlighting recent work by several groups. We focus on viral tiling theory, the building game, and the role of RNA folding in the assembly of the bacteriophage MS2. We then discuss how these ideas can be adapted to our experiments, present our results, and review some common lessons from these studies. As will become clear in our discussion, developing a consistent mathematical theory of self-assembly (natural or synthetic) is challenging, and requires an interplay between careful modeling of the experiments, continuous math (applied probability, differential equations), discrete math (geometric combinatorics, rigidity theory), and scientific computation. A secondary purpose of our article is to introduce the reader to many unanswered questions in the area, all of which are easily illustrated in the context of self-assembly of polyhedra.

## 2 Assembly versus Self-Assembly

Let us first distinguish between assembly and self-assembly. Classroom models of polyhedra are usually made from a material such as cardboard by folding a *net*, gluing faces as they meet at edges [78, p. 12]. A net is a simply connected, non-overlapping polygon made up of faces of the polyhedron attached at edges. Informally, a net may be obtained from a polyhedron by cutting along the edges of a spanning tree, flattening out the faces as they become free to rotate. The reader who unfolds a cube with this procedure will soon discover the 11 nets for the cube listed in Figure 1.^{1} Polyhedra fascinated medieval polymaths such as Dürer, Kepler, and Leonardo, and this approach is at least as old as Dürer and Strauss' classic work, *The Painter's**Manual* [17].

Note that this method, and art forms such as kirigami and origami, require an external creator actively cutting, folding, and gluing the edges in a prescribed sequence. The goal of self-assembly is to decompose the polyhedron into suitable subunits, and to use physical principles to cause the subunits to assemble without active manipulation. The dominant physical principles governing self-assembly depend strongly on the length scale. This is best explained with examples. On the nano scale, some examples to keep in mind are the following: (i) polyhedral molecules such as C_{60} and closo boranes whose subunits are single atoms interacting through covalent bonds [44]; (ii) synthetic supramolecular cages whose subunits are small molecules interacting through weak (non-covalent) bonds [47, 53, 71]; (iii) polyhedral cages formed from DNA bricks and tiles [12, 15, 40]; (iv) viral capsids whose subunits are protein molecules held together by electrostatic bonds and hydrophobic interactions between protein residues. In these examples, the dominant physical terms are bond energies and thermal fluctuations. Subunits form bonds as they collide while undergoing stochastically driven motion. Thermal fluctuations also cause bonds to break and re-form, allowing the polyhedra to self-correct during assembly. While these heuristics are broadly accurate, the precise kinetics of self-assembly is far more subtle, as we discuss in detail below.

Our main interest is to mimic similar self-assembly on much larger scales (100 nm to 1 mm). We illustrate this idea with a historical curiosity. In his delightful book, *Mathematical Snapshots* [69], Steinhaus cuts a net of a polyhedron into two “half-nets” that pop up under the effect of elastic energy as illustrated in Figure 2. Steinhaus is silent on whether he discovered this technique, but there is no debating its ingenuity. We view it as an early example that shows us how to use physics to self-assemble polyhedra from a net with just a little help from a creator. In our experiments, the polyhedra are much smaller (100 nm to 1 mm). Capillarity dominates gravity on these length scales and may be used to drive self-assembly from a net as explained below. Although our work in this article is restricted to the use of capillary forces, the self-folding of polyhedra from nets could also conceivably be achieved using magnetic forces, residual stress, or pneumatics [48].

## 3 Surface-Tension-Driven Self-Folding

Photolithography is an accurate and inexpensive microfabrication process used to pattern thin films. But its reliance on projective optics inherently restricts patterning to flat sheets. Folding allows us to transform these flat sheets into three-dimensional shapes. However, while sheets can be folded by hand or small probes at supra-millimeter length scales, it is necessary to use self-assembly at sub-millimeter length scales. The mathematical challenge is to code the folding pathways and outcome through the information implicit in the geometry and interactions of units on the patterned sheet.

Since polygonal faces can be patterned with high accuracy, the first challenge is to build complex polyhedra with high yield. In our experiments, a small net (side length 300 μm) made of nickel-coated polygonal panels joined with solder hinges was patterned on a substrate and then released in a high-boiling-point solvent. When the hinge melts, the molten solder cannot spread laterally, because solder does not wet nickel. Instead, the high surface tension of solder causes the panels to rotate. Snapshots of surface-tension-driven self-folding of a dodecahedron are shown in Figure 3 (see also Figures 13,^{14}–15). The reader is invited to view movies of polyhedra self-folded by this process ([60, Supplementary Information]; http://www.youtube.com/watch?v=GL0im9b6GgU). This experimental procedure is discussed in detail in [48]. A simple model for the physics of the surface-tension-driven hinge is analyzed in Section 7.1.

We stress that there is no external control in these experiments except the temperature of the solvent and the choice of initial net. Folding usually begins with the outer panels, and the order of folding is principally determined by the relative position of panels, though there are significant fluctuations in the sequence in which panels fold. The main sources of randomness are fluctuations in the amount of solder deposited at a hinge, thermal gradients within the solvent, and mechanical (fluidic) agitation of the solvent. The temperature distribution within the solvent and variations in the amount of solder determine the order in which hinges liquefy. For small polyhedra, the thermal gradients on the length scale of a single net are minimal. Our experiments also randomize the placement of polyhedra in the assembly dish. Fluidic agitation has the beneficial effect of correcting for small misalignment errors without breaking apart panel seams. Despite these efforts, self-folding sometimes goes awry and faces can be misaligned. This immediately suggests some natural questions:

*Feasibility:*Is it possible to build polyhedra of arbitrary complexity by self-folding?*Rational design:*How must an initial net be chosen to maximize yield?*Assembly pathways:*What determines the pathways of self-assembly? Can these be directed through specific intermediates?

These questions have only been investigated for a few polyhedra to date, and our prior work addressing rational design is reported in [3, 60]. One of our main conclusions in [60] is that in order to fold more complex polyhedra, it is necessary to tackle a combinatorial explosion in the space of nets and folding pathways. Viruses face (and rapidly solve) a similar assembly problem despite a combinatorial explosion. How could insights from the assembly of viruses be brought to bear on self-folding? In order to explain the similar conceptual issues in both fields, we now discuss the self-assembly of viruses, paying particular attention to discrete geometric models.

## 4 The Structure of Icosahedral Viruses

### 4.1 Virus Genomes and Structure

Viruses are the most populous and genetically diverse organisms on earth. Several estimates place the number of viruses at 10^{31} and the number of distinct genotypes as high as 10^{6}–10^{8} [6]. (Note, though, that most viruses are marine bacteriophages [70], and there are several challenges in estimating and classifying viral population and diversity.) At present, the NIH-NCBI public database contains complete genetic sequences for approximately 4000 viruses.

Viruses lack the biosynthetic machinery for independent existence. The simplest virus particles consist of a genome (RNA or DNA) encapsulated in a protein shell (the capsid). The function of the capsid is to protect the genome outside a host, to disassemble when the virus attacks a host, and to assemble rapidly when the genome uses the cellular machinery of the host to produce nucleic acid and protein. The natural design of a virus reflects two deep forms of elegance:

*1. Genetic economy*: This principle was formulated by Crick and Watson, as discussed later [13]. It is best illustrated by glancing through the NCBI data on the simplest viruses with a single-stranded RNA genome. Among the 1017 known ss-RNA genomes, most viruses typically have genomes that consist of about 5000 nucleotides, which code for very few proteins (1–10). A central example is the bacteriophage MS2, which infects the bacterium*E. coli*. The MS2 genome was the first to be completely sequenced [24]. The genome is a single strand of RNA that is 3569 nucleotides long and has four genes that code for four proteins: coat protein, maturation protein, and lysis and replicase enzymes. Of these, the coat protein and maturation protein determine the structure of the capsid as discussed below. The lysis enzyme degrades the cell wall of the host bacteria, and the replicase catalyzes the production of RNA.*2. Structural symmetry*: Early crystallographic studies of viruses in the 1940s and 1950s found that many viruses possess either helical or icosahedral symmetry. As an example, we illustrate the structure of MS2 in Figure 4. The MS2 capsid consists of 180 copies of the coat protein and one unit of the maturation protein, which serves as a terminal point. The units of the coat protein are arranged in a manner that has rotational icosahedral symmetry. The symmetry of the capsid influences the organization of the genomic RNA in proximity to the capsid (see [73, Figure 4] for cryo-EM pictures, and Figure 5 for a coarse-grained description). In addition to crystalline symmetry, it is now also known that genetically unrelated viruses can share very similar virion architecture, down to the persistence of specific protein folds [45]. We discuss a ubiquitous trapezoidal motif in greater detail below.

### 4.2 Goldberg Polyhedra and the Caspar-Klug Construction

Crick and Watson hypothesized that the crystallographic symmetry of viruses was due to the repeated use of a few protein subunits in the construction of the virus [13]. Such a construction is clearly conducive to genetic economy: If the subunits consist of a small number of proteins (for example, just one coat protein for MS2), the genome can be very small. Caspar and Klug laid the foundations for structural virology of *icosahedral viruses* a few years later [10]. (By icosahedral virus, we mean a virus whose capsid has icosahedral symmetry.) They assumed that the viral capsid is built out of subunits that assembled into a shape with icosahedral symmetry, and deduced that the viral capsid must consist of 60*T* protein subunits, with . Let us explain this restriction.

The class of Caspar-Klug viruses are modeled on the *Goldberg polyhedra*. While studying the isoperimetric problem for polyhedra, Goldberg discovered a family of polyhedra with the following properties: (i) The faces of the polyhedron are pentagons or hexagons; (ii) each vertex is trivalent; (iii) the polyhedron has icosahedral symmetry. Each such polyhedron (written *G*(*h*, *k*)) is characterized by its index [27]. The combinatorial topology of *G*(*h*, *k*) may be described by folding a net whose vertices lie on a hexagonal lattice as follows.

Assume given a lattice *L* of regular hexagons of unit length. Let *L*′ be the dual lattice formed by the centers of the hexagons, and let **e** and **f** be unit basis vectors for *L*′. Given an index (*h*, *k*), consider a walk in the dual lattice *L*′ from a point *O* to the point *P* = *O* + *h***e** + *k***f**. Continue this walk, tracing out the net of an icosahedron with vertices in the lattice *L* and edges of length |*OP*|. When the net is folded into an icosahedron *I*, the polyhedron formed by the restriction of the lattice *L*′ to *I* gives a triangulation *S*(*h*, *k*) of *I*. The polyhedron formed by the restriction of the hexagonal lattice *L* to *I* is topologically equivalent to the Goldberg polyhedron *G*(*h*, *k*). (The folding construction does not give *G*(*h*, *k*) in general, since gluing along edges such as *OP* gives “folded” hexagons and “curved” pentagons, whereas *G*(*h*, *k*) has flat hexagons and pentagons). Examples of *G*(*h*, *k*) are shown in Figure 6. An interesting survey of Goldberg polyhedra in art and science may be found in [33].

The topology of *G*(*h*, *k*) can be obtained by duality from *S*(*h*, *k*). We compute the area of each face of the icosahedron *I* to see that it contains *T* = *h*^{2} + *hk* + *k*^{2} faces of *S*(*h*, *k*). Thus, *S*(*h*, *k*) has 20*T* faces. Each face has three edges, each of which is shared by exactly two faces. Thus *S*(*h*, *k*) has 30*T* edges. By an application of Euler's formula, the triangulation *S*(*h*, *k*) has 10*T* + 2 vertices. By duality, *G*(*h*, *k*) has 10*T* + 2 faces, of which 12 are pentagons. Thus, there are 10(*T* − 1) hexagons, and 60*T* subunits in all.

Goldberg's construction was rediscovered by Caspar and Klug. When applied to viruses, the faces in the Goldberg polyhedron correspond to clusters of protein subunits called *capsomeres*. In the simplest setting, there are two kinds of capsomeres, hexamers and pentamers, corresponding to the hexagons and pentagons respectively. As a consequence, the capsid has 12 × 5 + 10(*T* − 1) × 6 = 60*T* protein subunits. (While it is simplest to imagine that each subunit is a single protein molecule, it is also possible for the subunit to consist of a cluster of protein molecules.) The number of subunits in a capsid is amenable to direct measurement, and the Caspar-Klug “quantization” of capsid size provides a fundamental classification of viruses.

### 4.3 Recent Developments: Viral Tiling Theory

The Caspar-Klug theory coarse-grains the all-atom structure of the virus into a Goldberg polyhedron. It allows the possibility that other polyhedra with icosahedral symmetry may provide a more refined description of the capsid. We present two recent examples that demonstrate such an interplay between polyhedral models and experimental data. In both instances, high-resolution crystallography was the motivation for new tilings of the sphere that resolve puzzles in structural virology.

We first discuss Twarock's resolution of the structure of certain “forbidden” viruses. The Caspar-Klug theory does not allow certain *T*-numbers, for example *T* = 2 and *T* = 6. However, icosahedral viruses with these *T*-numbers are known to exist. Some yeast bacteriophages have *T* = 2 [11]. Moreover, certain tumor-causing polyomaviruses, such as SV40, have a capsid structure modeled on, but distinct from, the *T* = 7 viruses. Roughly, these capsids have *pentamers* at the centers of the *hexagons* of *G*(2, 1) polyhedra [52, 63]. Thus, they have 360 subunits (*T* = 6), rather than 420 (*T* = 7). Twarock made the beautiful discovery that the structure of these anomalous viruses corresponds to tilings of the sphere by darts and kites [75]. There is an interesting parallel here with aperiodic tilings and quasicrystals with “forbidden” symmetry [68]. The Penrose tiling with darts and kites is the simplest aperiodic tiling of the plane that possesses fivefold rotational symmetry. It is remarkable that the same tiles may be used to construct tilings of the sphere with icosahedral symmetry, resolving the structure of anomalous viruses.

Mannige and Brooks have further refined the Caspar-Klug theory to explain a structural similarity in virus architecture across different genetic strains [54, 55]. An important finding from the first high-resolution images of viral capsids was that the shape of the individual protein subunit in some genetically distinct viruses is roughly trapezoidal [1, 32]. This motif has now been seen in several virus strains. In order to explain the emergence of the trapezoidal motif, Mannige and Brooks tile the hexavalent lattice *L*′ with a single polygonal prototile with σ sides such that: (i) there is no overlap between prototiles; (ii) there is no gap between prototiles (i.e., each edge in the tiling is an edge between precisely two prototiles).^{2} As in the Caspar-Klug construction, the only admissible tilings of *L*′ are those that can be folded into a tiling of the icosahedron. This restriction is surprisingly rigid: The prototile must have σ = 5. Moreover, by comparing all known tilings of the hexavalent lattice *L*′ by pentagonal prototiles, one sees that the only possibile tiling is the one shown in Figure 7. Observe that the pentagonal prototiles can be flattened into trapezoidal prototiles. Thus, this construction provides a natural discrete geometric model for the trapezoidal motif discovered in structural virology. The tiling by trapezoidal prototiles is *chiral*, a property of critical importance for biomolecules.

Mannige and Brooks compared their discrete model with all-atom structures for all 63 viruses whose resolved structure is stored in the VIPERdb viral capsid data bank. This comparison requires the introduction of a metric that measures a suitable distance between an all-atom model and the trapezoidal tiling. Roughly speaking, the method in [54] involves projecting the three-dimensional all-atom model radially outwards onto a sphere, and analyzing the “overlap” with a pure tiling. With this metric, they find that their monohedral tilings fit experimental data in the majority of examples, showing that the individual protein subunits in genetically distinct viruses from eight families exhibit the same trapezoidal shape.

## 5 The Self-Assembly of Icosahedral Viruses

Our description so far has been restricted to the *equilibrium structure* of the viral capsid. We now turn to the use of discrete geometry to understand the *pathways of self-assembly*.

### 5.1 Crytallization and Directed Self-Assembly

In a seminal experiment in 1955, Fraenkel-Conrat and Williams showed that fully functional tobacco mosaic virus (TMV) could be created from a solution of purified RNA and capsid protein [25]. Their experiment suggested that the self-assembly of viruses was a form of crystallization. Here is the early dogma, as stated by Caspar and Klug in 1962 [10]:

Self assembly (of a virus) is a process akin to crystallization and is governed by the laws of statistical mechanics. The protein subunits and the nucleic acid chain spontaneously come together to form a simple virus particle because this is their lowest (free) energy state.

*directed*through specific intermediates by sequential nucleation [51]:

We feel that protein folding is speeded and guided by the rapid formation of local interactions which then determine the further folding of the peptide. This suggests local amino acid sequences which form stable interactions and serve as nucleation points in the folding process.

In summary, we should note that the Levinthal paradox does not preclude the possibility that stochastically forced energy minimization may lead a protein or virus to its minimum-energy state. But it does emphatically point to the fact that such models are valid only if they predict realistic time scales and pathways through a few dominant intermediates. In fact, energy minimization is widely used in protein folding, and virus self-assembly is compatible with a more nuanced view of pathways [14, 66, 77]. We now introduce a class of discrete geometric models that allows us to quantitatively analyze the energy landscape and distinguished pathways for self-assembly of polyhedra. When applied to ss-RNA viruses such as MS2, this class of models has recently been used to identify sequence-specific nucleation sites and stable intermediates as in Levinthal's original vision (see [21] and Section 6).

### 5.2 Zlotnick's Model, the Building Game, and Energy Landscapes

We now discuss an approach to virus assembly that was pioneered by Zlotnick [81]. A similar description in synthetic self-assembly was adopted by Hosokawa et al., though the planar geometry in their model is rather simple [35]. The underlying physical caricature is a process of capsid assembly driven solely by the attachment and detachment of single subunits. The geometry can be described by the *building game*—a growth process on a fixed polyhedron that serves as a simple model for the formation of fullerenes and viral capsids [76]. We first introduce the model and then discuss its utility and flaws.

We formulate the building game as a coloring problem. We begin with a connected polyhedron with all faces colored white. We call this state □. At the first step, an arbitrary face is colored black. At each step that follows, a white face that shares an edge with the contiguous, connected cluster of black faces is colored black. The procedure terminates when all faces in the polyhedron are colored black. We denote the terminal state by ▪. The configuration space, , consists of the set of all distinct (i.e., non-congruent) black clusters formed with this algorithm, ranging from the empty configuration □ to the complete polyhedron ▪. Each such arrangement of black clusters is called an *intermediate*. Two intermediates are neighbors if they differ by the attachment or detachment of a single black face. Thus, the configuration space is a graph, or more properly a multigraph, with the intermediates as vertices and edges linking two neighbors. The number of black faces in an intermediate provides a natural ordering of as a partially ordered set. An assembly pathway is an increasing path in that begins with the empty cluster □, increases by one black face at each step, and terminates at ▪.

*a*

_{jk}≥ 0. Then the evolution of the population of states is given by the

*master equation*where (as is conventional)

*A*denotes the matrix with entries

*a*

_{jk},

*a*

_{jj}= −∑

_{k|j→k}

*a*

_{jk}, and

*c*= (

*c*

_{1},…,) denotes the connection of each intermediate in a configuration space , with size denoted by . The matrix

*A*is the generator of a continuous-time Markov process on , and the master equation is the forward equation associated to this Markov process.

It is important to note at the outset that both energetic and kinetic descriptions are simple to formulate. The complexity of self-assembly is a consequence of the complexity of the configuration space . The energy functions *E* and rate constants *A* used in applications are usually quite simple. For example, a natural energy function *E* assigns a unit energy to each edge (bond) between black faces. This *E* depends only on the topology of the intermediates. It favors intermediates that are less spread out (more “compact” in the jargon of chemists), since they form multiple bonds while growing, and have lower energy than intermediates that grow with the formation of only one bond at each attachment step. Similarly, a simple choice for rates is to assume that *a*_{jk} is proportional to the number of edges joining intermediates *j* and *k*. It is also common to introduce an inverse temperature β > 0 and Gibbs weights *e*^{−βE} on , and to choose a matrix *A* that has the Gibbs distribution as a unique equilibrium distribution. (That is, *A* has a unique left eigenvector with positive entries π_{j} proportional to *e ^{−βEj} ,j* = 1,…, .)

### 5.3 Dominant Intermediates and Pathways

This model has been used in applications as follows [81]. First, the entire configuration space has been enumerated for a few solids (see Table 1). Next, the energy functions and solutions to the rate equations are computed numerically and the solutions compared with experiments. Some care is needed in these comparisons, since the number of distinct intermediates in far outweighs the number of distinct intermediates that can be observed in experiments. (For a moderately complex Goldberg polyhedron, we expect the number of intermediates to be greater than 10^{23}.) Moreover, even for simple polyhedra such as the dodecahedron there has been no systematic attempt to determine the rates *a*_{jk} empirically. Rather, experimental comparisons are between solutions to Equation 1 with assumed rate constants, and experimental observations of the population of a few intermediate species of the capsid. This seems like a limited comparison at first sight, but it is justified by an interesting empirical observation in the biochemistry literature: For several chemically natural *E* and *A*, such as the examples above, the *same* intermediates dominate self-assembly [23].

Polyhedron . | No. of faces . | No. of intermediates . | No. of edges in . | No. of assembly pathways from □ to ▪ . |
---|---|---|---|---|

Tetrahedron | 4 | 5 (5) | 4 (4) | 1 (1) |

Cube | 6 | 9 (8) | 10 (8) | 3 (2) |

Octahedron | 8 | 15 (12) | 22 (14) | 14 (4) |

Dodecahedron | 12 | 74 (53) | 264 (156) | 17,696 (2,166) |

Icosahedron | 20 | 2650 (468) | 17242 (1984) | 57,396,146,640 (10,599,738) |

Truncated tetrahedron | 8 | 29 (22) | 65 (42) | 402 (171) |

Cuboctahedron | 14 | 341 (137) | 1636 (470) | 10,170,968 (6,258) |

Truncated cube | 14 | 500 (248) | 2731 (1002) | 101,443,338 (5,232,294) |

Truncated octahedron | 14 | 556 (343) | 3071 (1466) | 68,106,377 (5,704,138) |

Polyhedron . | No. of faces . | No. of intermediates . | No. of edges in . | No. of assembly pathways from □ to ▪ . |
---|---|---|---|---|

Tetrahedron | 4 | 5 (5) | 4 (4) | 1 (1) |

Cube | 6 | 9 (8) | 10 (8) | 3 (2) |

Octahedron | 8 | 15 (12) | 22 (14) | 14 (4) |

Dodecahedron | 12 | 74 (53) | 264 (156) | 17,696 (2,166) |

Icosahedron | 20 | 2650 (468) | 17242 (1984) | 57,396,146,640 (10,599,738) |

Truncated tetrahedron | 8 | 29 (22) | 65 (42) | 402 (171) |

Cuboctahedron | 14 | 341 (137) | 1636 (470) | 10,170,968 (6,258) |

Truncated cube | 14 | 500 (248) | 2731 (1002) | 101,443,338 (5,232,294) |

Truncated octahedron | 14 | 556 (343) | 3071 (1466) | 68,106,377 (5,704,138) |

Such a collapse of the complexity of onto a few dominant intermediates is suggestive of deeper mathematical structure. At present, this collapse is based on computational experiments in which both *E* and *A* are used to define dominant intermediates and pathways. Every energy function *E* may be used to rank states (*j* ≤ *k* if *E*_{j} ≤ *E*_{k}), and to define the path of least energy between □ and ▪ (see Figure 8). From a kinetic standpoint, given *A*, the states *j* may be ranked by considering the equilibrium distribution π_{j} associated to *A*. For several chemically natural choices of *E* and *A*, the top few intermediates are the same. As we show in Section 9, a similar collapse onto a few dominant intermediates is also seen in self-folding polyhedra. A broader empirical study of , *E*, and *A* will be presented in the forthcoming article [36].

### 5.4 Some Flaws in the Building-Game Model

There are many distinct models for virus self-assembly. We have emphasized the building-game framework because it is best suited to our work on self-folding. Let us briefly discuss some shortcomings of this framework in the context of virus assembly.

The building game assumes that the final polyhedron is “known” even if assembly proceeds by random attachment, and it cannot account for the formation of distinct or malformed polyhedra from the same subunits. In the first experiments with fullerenes, it was noted that when one fullerene was found, several others were found too (e.g., C_{60} and C_{70} were found together) [44]. It is also known that capsid assembly often leads to malformed and partially formed shells [57, 58]. Another, more serious flaw in this model is that it does not account for the maturation of the capsid protein. True biological assembly of a number of plant viruses includes not just the formation of a proto-shell, but a second stage, *maturation*, that consists of growth and buckling of the capsid into a shape with icosahedral symmetry that is critical for infectivity in some viruses [7, 37].

It is possible to partly address the formation of different polyhedra with a “semi-discrete” model. Berger et al. [4] idealized each protein subunit as a point in space labeled with a set of finite conformations. A finite set of local rules then describes the geometry of the connections between these labeled points. This model allows the formation of both complete and malformed capsids. Finally, extensive molecular dynamics (MD) simulations with varying degrees of approximation have been used to model capsid assembly (see for instance [31, 57, 62]).

## 6 Self-Assembly of MS2—the Role of RNA Folding

### 6.1 RNA Folding and Hamiltonian Paths

The NCBI database lists the primary structure (i.e., complete A,C,G,U sequence) of about 1000 ss-RNA viruses. The tertiary (i.e., folded, packed) structure of these genomes within the capsid in vivo is of fundamental interest, and it is an important computational challenge to predict the tertiary structure based on primary structure.

The icosahedral symmetry of the capsid need not be matched by a symmetric arrangement of the genome within the capsid, but there are beautiful examples where this is indeed the case. The icosahedrally averaged distribution of the genomic RNA of MS2 is shown in Figure 5. Another example of genomic symmetry is provided by *nodaviridae* such as Flock House virus (FHV) [72]. We summarize the discovery of Hamiltonian paths in the context of FHV below [65]. We then discuss how Hamiltonian paths were recently exploited to obtain sequence-specific information about the assembly of MS2 [18, 20, 21].

The FHV genome consists of two single strands of RNA that lie roughly on a dodecahedral cage within the capsid. The RNA strands have an orientation, and RNA1 and RNA2 each cover the dodecahedron, traversing every edge in opposite directions. The crossings of RNA1 and RNA2 at vertices can be classified intro three junctions (A,B,C), with A having least energy and C having the most energy. The A junctions are chiral and may be further divided into two types, A1 and A2 [65, Figure 3]. In the first approximation, the RNA folding problem is an energy minimization problem on the set of all *decorations* of the dodecahedron such that: (i) all edges of the dodecahedron are occupied by rigid, duplex RNA segments; (ii) all vertices of the dodecahedron are occupied by an A, B, or C junction; (iii) the decoration has icosahedral symmetry [65]. The set of decorations is also subject to topological constraints. First, each RNA strand must remain connected. Second, the genome must contain no knots, since a knotted RNA molecule cannot be translated by the cellular machinery of the host.

The condition that the RNA strands remain connected forbids a naive minimizer that places 20 identical A junctions at each vertex of the dodecahedron, because this decoration actually separates each RNA strand into 12 separate rings. A more sophisticated possibility is to consider a decoration that visits each vertex of the dodecahedron twice, in opposite directions, with a choice of A1 and A2 junctions at each vertex. However, all such paths are knotted.^{3} Restricting attention to paths with A and B vertices consisting of a path on the dodecahedron with side branches, Bruinsma and Rudnick were finally led to consider the RNA molecule as a *Hamiltonian path* on the dodecahedron. While this restriction was initially posed to find a topologically admissible least-energy configuration, it also has interesting implications for assembly. The building game for the dodecahedron can be augmented to include an RNA decoration. The intermediates now consist of connected black clusters along with a Hamiltonian path that links the centers of these clusters (see [20, Figure 1]). This is a fully discrete configuration space, and as before, one can seek the lowest-energy pathway. Surprisingly, the states of the capsid in the first nine steps of the least-energy Hamiltonian cycle assembly pathway coincide with the assembly pathway shown in Figure 8.

### 6.2 Hamiltonian Paths and Packaging Signals in MS2

In the examples so far, virus self-assembly was modeled with ideas from statistical physics and discrete geometry. None of these models incorporates information on virus self-assembly that is specific to a particular genetic sequence. In very interesting recent work, Twarock, Stockley, and their coworkers have combined the previous theoretical models with bioinformatics and experiments to elucidate the assembly pathways of MS2 and the evolutionarily related phage GA, with careful attention to the sequence specificity of the assembly process. In order to explain their work, it is necessary to appreciate the *structural* role that RNA plays in the assembly process for MS2.

We note at the outset that conformational changes in the capsid protein(s) help balance genetic economy and structural symmetry. For example, if the genome of an icosahedral virus codes for just one coat protein, and the coarse-grained capsid is a Goldberg polyhedron, conformation changes in the protein allow it to flip between pentamers and hexamers. This conformation change is accelerated by RNA folding. As the RNA folds, exposed stem loops serve as nucleation sites for the conformation change in the capsid protein [19, 22]. These loops are known as *packaging signals*. The packaging signals associated to a particular genome are sequence-specific. They are not all of the same strength, and typically only the packaging signals with high affinity are well understood. We now explain how Hamiltonian paths were recently used to discover a hierarchy of packaging signals for the MS2 genome [18, 20].

As we have noted above, the MS2 genome codes for four proteins. Of these, only two—the maturation protein and the coat protein (labeled A)—are used in the capsid. The capsid contains 1 copy of the maturation protein and 180 copies of the coat protein. A glance at Figure 4 reveals, however, that the coat protein exists in three different conformations, labeled A, B, and C. In solution, the coat proteins bind in pairs to coexist as asymmetric A/B dimers and symmetric C/C dimers. RNA stem loops trigger conformation changes in the coat protein from symmetric C/C dimers to asymmetric A/B dimers.

The discovery of the crystalline structure of the RNA genome, in combination with the crystalline structure of the capsid, suggests that the process of self-assembly in some viruses should be viewed as a *co-assembly* of RNA and capsid. The RNA is packaged so that it traces the shape of the polyhedron shown in Figure 5 as the capsid assembles into the polyhedron shown in Figure 4. The first conformation change occurs at the highest-affinity packaging signal on the RNA strand. This packaging signal is a well-understood 19-nt sequence of the MS2 genome, called the TR site. The coat protein first binds to the RNA at the TR site, flipping conformation in the process in order to facilitate the formation of fivefold clusters, avoiding electrostatic repulsion of the F-G loops around these axes. As the RNA binds, the remainder of the RNA is rearranged (some stem loops are already present; some others might form in the process), allowing for further binding events to promote capsid formation. This heuristic picture can be modeled accurately with discrete geometry, by a suitable extension of the building game. Intermediates consist of partially formed capsid polyhedra that are decorated by segments on the RNA polyhedron, with each vertex on the RNA polyhedron corresponding to a packaging signal (see [20, Figure 1]). Figures 4 and 5 now suggest that there must be 60 packaging signals on the RNA genome that serve as nucleation sites for conformational flips as co-assembly proceeds. However, the TR site is atypical—very few packaging signals are known, certainly not 60 of them. How could such packaging signals be discovered? We now explain how Hamiltonian paths elucidate a *sequence-specific* question: Where do packaging signals lie on the RNA genome?

A solely bionformatic search for packaging signals involves the statistical analysis of primary structure along with computations of secondary structure. Roughly speaking, the packaging signal is a short motif within the secondary structure, and these motifs are sought among a large number of computer-generated possible secondary structures. Such an analysis is possible for short motifs in short genomes (for example, for the ss-RNA of the virus STMV); however, it is computationally intractable for MS2. In order to search for packaging signals in MS2, Twarock and coworkers assume that the RNA is folded along a Hamiltonian path on the RNA polyhedron shown in Figure 5. There are 40,678 Hamiltonian paths with icosahedral symmetry linking the 60 vertices. A further restriction on this set is provided by the single unit of maturation protein—it binds at two sites near the 5′ and 3′ ends on the genome, effectively converting it (topologically) to a circle. Of the 40,678 Hamiltonian paths, only 66 are consistent with binding the maturation protein. It is computationally tractable to compute the co-assembly intermediates along each of these 66 paths, and to compare the icosahedrally averaged density of the RNA with experimental data from mass spectrometry and cryo-EM. From these computations, they discover that only 3 of the 66 paths are admissible [20, 18]. The geometric arrangement of the RNA can then be combined with bionformatic analysis of sequence data to identify packaging signals. Roughly, the search in sequence space for packaging signal motifs becomes tractable once there are only a few folding pathways that need to be considered. In this manner, a family of 60 packaging signals and a precise co-assembly sequence have been discovered for MS2.

Remarkably, the same (discrete-geometric) assembly pathway is also used by the evolutionarily related ss-RNA virus GA. Both MS2 and GA are *leviviridae* that infect enterobacteria. Their genomes are roughly the same size and code for four proteins. While a comparitive study of four related enterobacteriaphages (GA, MS2, fr, and Qbeta) revealed that they are *T* = 3 viruses with similar crystalline structure [39], the recent studies based on Hamiltonian paths show that these two viruses share the same assembly pathway. In summary, these recent studies demonstrate a convergence of several themes: A basic assembly model rooted in discrete geometry and statistical physics may be combined with the analysis of genetic sequences and experimental data to provide a concrete and elegant instance of an assembly pathway from primary sequence to quaternary structure that is conserved by evolution.

## 7 Physical Modeling of Surface-Tension-Driven Self-Folding

We now return to surface-tension-driven self-folding and the basic questions that drive our work: Why do some nets fold better than others? Are there dominant pathways for succesful self-folding? What is common to natural and synthetic self-assembly?

We will use a discrete geometric framework to analyze these questions. As in the models of viral capsid assembly, we first define and compute an ideal configuration space , define an energy landscape and a master equation on , and use these to define dominant pathways and intermediates. In order to motivate a discrete geometric model, we first review the physics of the surface tension hinge and some prior work. Our work culminates in a comparison between theory and experiment for self-folding polyhedra that is described in Section 9.

### 7.1 The Surface-Tension Hinge

We first contrast the relative magnitudes of capillarity and thermal fluctuations. In the experiments reported here, the hinges are made of solder. The side of each panel is 300 μm, and the sides of hinges are in the range 30–60 μm, so the area of the solder is on the order of 10^{−8} m^{2}. A typical value of the surface tension of solder is 0.5 N/m. Thus, an approximate scale of the capillarity is 10^{−8} J. The temperature of the solvent is at most 200°C = 473 K, which gives typical thermal energy *kT* ≈ 6.5 × 10^{−21} J. Thus, capillarity strongly outweighs thermal fluctuations.

*L*denote the length of the segments

*AP*and

*BP*. We then find with some basic trigonometry that the area of the droplet,

*A*(ϕ), is given byIn this regime, the area strictly decreases with increasing ϕ and can be inverted to yield ϕ(β,

*A*). Thus, the amount of solder deposited controls the angle of rotation of the hinge. When the droplet wets the panel (β ∈ (0, π/2)), the area is not a monotonically decreasing function of

*A*and there can be two positions of static equilibrium for a given area. To summarize, it is necessary that the panels should not be wetted by the solder. Equation (2) then serves as a first approximation in the laboratory.

### 7.2 Locking Hinges and Compact Nets

In order to realize the precise dihedral angle between faces (for example, 116.6° for the dodecahedron), it is also necessary to incorporate additional solder hinges at the external edges. We term these *locking* or *self-aligning**hinges* in contrast with the *folding hinges* that lie within the net (see Figure 10). While the amount of solder deposited provides control over the dihedral angle of folding, cooperativity between panels also plays an important role. For instance, the truncated octahedron has two distinct dihedral angles (109.47° and 125.26°), and it may be succesfully self-folded with an equal amount of solder on each folding hinge [60].

Our discrete model of self-folding relies on two experimental observations: (i) Folding begins at the outer boundary of the net; (ii) locking hinges impose the dihedral angle. We explain (i) as follows. While inertial forces are much weaker than capillarity, their effect is strong enough to cause the outer faces to fold first. As for (ii), while the amount of solder deposited at the folding hinge provides some control over the angle of folding, it does not ensure the exact dihedral angle. Instead, when the two faces meet, their locking hinges meet and self-align to correct the dihedral angle. The role of the locking hinge is especially strong for corners that are rigid (e.g., when three squares or regular pentagons meet at a corner). As a consequence, nets that have more corners where locking hinges meet self-fold with higher yield.

*compactness determines the success of self-folding*. The use of the term compactness was motivated by its use in protein folding. Since we do not wish to confuse our use of compactness with its common meaning in mathematics, let us describe precisely what it means here. The nets for a given polyhedron differ in their geometry and combinatorial topology. A natural geometric quantification of the intuitive idea that a compact net is “not too spread out” is to compute the radius of gyration

*R*

_{g}of a planar net, defined byHere is the planar net, and . Nets also differ in their combinatorial topology. Since an important role in folding is played by the corners at which locking hinges lock panels in place, it is natural to expect that the more such corners a net possesses, the higher the likelihood of succesful folding. In [3, 60] a vertex on the boundary of the net was called a

*vertex connection*if it is shared by two faces that do

*not*share a common edge (see Figure 10). This motivates the use of the total number of vertex connections of a net (denoted

*V*

_{c}) as a purely combinatorial measure of compactness (again, compactness is meant in the intuitive sense of “not too spread out”). The use of

*R*

_{g}and

*V*

_{c}as predictors of yield was tested systematically for the dodecahedron, icosahedrons, and truncated octahedron in [60].

## 8 Discrete Models of Self-Folding Pathways

### 8.1 Gluing at Vertex Connections and the Configuration Space

We idealize the experimental observations of self-folding in a discrete folding algorithm termed *gluing at vertex connections* [60]. A self-folding pathway is treated as a discrete sequence of states = (*S*_{0}, *S*_{1}, …, *S*_{n}) between a net *N* = *S*_{0} and the convex polyhedron *P* = *S*_{n}. While each state *S*_{k} is a collection of edges, faces, and vertices, we identify it with the closure of its embedded image in Euclidean space for simplicity of description. We generalize the notion of folding and locking hinges and vertex connections from the planar net *S*_{0} to each state *S*_{k}: We call the edges on the boundary of *S*_{k}*locking hinges*, and we call all other edges *folding hinges*. A vertex on the boundary of *S*_{k} is called a *vertex connection* if (i) it lies on the boundary of two faces of *S*_{k} that do not share a folding hinge, and (ii) the exterior angle at the vertex in the plane of these faces is less than π.^{4} The states *S*_{k} are generated in sequence from the state *S*_{0} as follows. For each *k* ≥ 0 a vertex connection *v*_{k} of *S*_{k} is chosen, and all locking hinges that meet at *v*_{k} are glued in pairs to form a state *S*_{k+1}. When gluing faces at locking hinges, the faces are only allowed to rotate rigidly through the dihedral angle about folding hinges that meet at *v*_{k}. The procedure terminates when the polyhedron is formed.

As in the building game, the set of distinct (i.e., non-congruent) states obtained by enumerating all pathways is called the *configuration space*. (Two states are congruent if there is a map *L*(*x*) = *a* + *Qx* with and a rotation matrix *Q* ∈ *SO*(3) such that .) States that can be linked by gluing vertex connections are *neighbors*. Thus, is a multigraph with an edge for each move that folds or unfolds a state into a neighbor. We find it more convenient to treat as a graph and to keep track of the number of ways a state *S*_{j} can fold into a state *S*_{k}. We call this integer the *degeneracy* and denote it by *D*_{jk}. The space is also a poset, because the number of edges glued provides a partial order on states. We use this ordering in the visual representations of for the cube [60, Figure 7] and dodecahedron (Figure 16).

Some shortcomings of this model must be noted at the outset. First, when the folding algorithm gives rise to flexible corners, it is necessary to augment the configuration space with additional states in order to obtain a reasonable description of experiments. Such an extended configuration space for the octahedron is described in [61]. Second, the apparent generality of the formulation should not disguise the fact that it has only been used to compute configuration spaces for a few simple experimentally feasible polyhedra. We have not shown that the algorithm is well posed in general. That is, given an arbitrary convex polyhedron and a net for this polyhedron, we have not shown that the algorithm above will fold the net into the polyhedron. Within the restricted setting of a few polyhedra, we show that the notion of a discrete configuration space provides a useful description for the analysis and design of experiments and suggests interesting mathematical questions.

### 8.2 The Distance Function

Since the states in are embedded in , we assigned a cost *d*(*S*_{k}, *S*_{k+1}) to each edge between neighboring states based on the distance traveled by faces in as the polyhedral linkage *S*_{k} is transformed into *S*_{k+1} by gluing at a vertex connection [60]. It is convenient to informally refer to the cost *d*(*S*_{k}, *S*_{k+1}) as the distance between *S*_{k} and *S*_{k+1}. The underlying heuristic idea here is that the energy dissipated when moving a panel scales linearly with Euclidean distance in an overdamped flow. Thus, the distance between the net and polyhedron serves as an easily computed proxy for the viscous energy dissipated along an assembly pathway, which incorporates the kinematic constraints of folding by rigid rotations about folding hinges. We now recall the definition of *d* from [60].

*S*that consists of a collection of polygons, an edge

*e*that separates

*S*into two polygonal surfaces Ω

_{1}and Ω

_{2}with common boundary

*e*, and an angle of rotation θ. We define a new surface by rotating Ω

_{1}and Ω

_{2}rigidly about

*e*through θ. In this process, each point on the surface Ω

_{i}at a distance

*r*from the edge

*e*moves a distance

*r*θ

_{i}, where θ

_{1}+ θ

_{2}= θ. The total squared distance traveled by Ω

_{i}is then ∫

_{Ωi}(

*r*θ

_{i})

^{2}

*dA*=

*I*

_{i}θ

_{i}

^{2}, where

*I*

_{i}denotes the first moment of area of Ω

_{i}about the axis

*e*. We then define the cost of this move, , by the formula

*S*

_{k}into a state

*S*

_{k+1}by gluing all faces at the vertex connection

*v*

_{k}consists of a number of substeps each of which involves a single rotation. The distance between

*S*

_{k}and

*S*

_{k+1}is defined by minimizing the distance over the set of possible surfaces between

*S*

_{k}and

*S*

_{k+1}that differ by the rotation of a single face. Let us make this precise. Assume faces

*f*

_{1},…,

*f*

_{l}are glued at

*v*

_{k}by rotating each face about the edge

*e*

_{1},…,

*e*

_{l}. Let σ denote a permutation on

*l*symbols, and assume the faces are rotated in the order

*f*

_{σ1},

*f*

_{σ2},…,

*f*

_{σl}. We then form the surfaces =

*S*

_{k+1}each of which differs from its preceding state by the rotation of

*f*

_{σj}about

*e*

_{j}through the dihedral angle. Since

*T*

_{j}

^{σ}and

*T*

_{j+1}

^{σ}, 0 ≤

*j*≤

*l*− 1, differ through a rotation, the distance between them is defined by Equation 4. (Note that the surfaces

*T*

_{j}

^{σ}do not lie in except at the endpoints

*j*= 0 and

*j*=

*l*.) We then set

The above definition of *d* has the following important intuitive consequence. For fixed θ, is smaller if *I*_{1} ≪*I*_{2} than if *I*_{1} ≈ *I*_{2}. Thus, it costs less to fold a face on an extremity of *S* than to rotate two equally balanced domains Ω_{1} and Ω_{2}. Hence, when computing shortest paths from a net to a polyhedron we usually find that the faces on the boundary of *S* fold first, until no such moves are possible.

### 8.3 Computational Determination of Dominant Intermediates

We use two approaches to determine dominant intermediates and pathways. The first is based on cost functions on , and the second on the kinetics of assembly.

The method of cost functions takes the following form: We assign a weight to each edge in . The cost of a path from a net *S*_{0} to the polyhedron *P* is the sum of the costs over each edge in the path. We then look for common structure in the optimal paths linking *S*_{0} and *P*. For example, are these paths spread out on a large set, or do they funnel through a few intermediates? We have mainly used the distance (5) as the cost function. Since the configuration space is discrete, we distinguish between *geodesic paths*, which are obtained by minimizing the distance over all paths in the configuration space that connect *S*_{0} and *P*, and *greedy paths*, obtained by choosing *S*_{k+1} to be the nearest neighbor “below” *S*_{k} at each step.

The kinetics of assembly can be modeled by the master equation (1) if we assume that a large number of nets fold simultaneously. In contrast with virus self-assembly, we always assume that folding is irreversible. As a consequence, *A* is lower triangular in a basis where the states are ordered in levels according to the number of locking hinges glued (as shown in [60, Figure 7] and Figure 16). The matrix *A* is mainly composed of lower triangular blocks, where the *k*th block links level *k* with level *k* + 1. However, there can be exceptional nonzero entries because of folding moves that jump more than one level. Since *A* is triangular, its diagonal entries are its eigenvalues and the spectral decomposition is easily computed. For configuration spaces such as the cube and dodecahedron that include a single terminal state, all diagonal entries except the last are strictly negative and the unique normalized equilibrium distribution is concentrated on the terminal state, i.e., *c* = (0,…, 0, 1).

*c*(

*t*). We choose a time scale in Equation 1 so that the spectral gap is 1 and consider the averageat

*m*points

*t*

_{1},…,

*t*

_{m}. The initial condition

*c*

_{j}(0) is 1 if

*j*corresponds to a net, and 0 otherwise. We rank the intermediates according to their relative magnitude in

*p*(

*t*

_{j}) and consider the top

*M*states at times

*t*

_{1},…,

*t*

_{m}. We then score the intermediates by a weight

*w*

_{j}for

*j*= 1,…,

*m*, and rank the intermediates that appear in the top-

*M*list according to their cumulative score. Of course, if

*t*

_{m}≪1, then all the nets appear, and if

*t*

_{1}≫1 only the complete polyhedron appears. In the examples, we have chosen

*m*= 4,

*t*

_{1}= 2,

*t*

_{2}= 4,

*t*

_{3}= 6,

*t*

_{4}= 8,

*w*

_{j}= 1 for all

*j*,

*M*= 10 for the dodecahedron, and

*M*= 5 for the octahedron. We call the resulting intermediates

*prevalent*.

In our experiments, we have considered two main choices for *A*. The first of these is the degeneracy, and we set *a*_{jk} = *D*_{jk}. We have also considered the kinetics under rates of the form *a*_{jk} = *f*(*d*(*S*_{j}, *S*_{k})) where *f* is a positive decreasing function (we used *f*(*x*) = 1/*x*, *f*(*x*) = *e*^{−x}, and *f*(*x*) = 1/(log(1+*x*)), *x* > 0). Note that the degeneracy *D*_{jk} relies only on the combinatorial topology of and is independent of the distance *d*(*S*_{j}, *S*_{k}); thus these two measures sense quite different properties of .

These methods of extracting dominant intermediates are admittedly ad hoc, as is our emphasis on the distance (5). For example, it is clear from Equation 6 that prevalence is determined by the spectral decomposition of A. We found the above scheme convenient; other choices are certainly possible. We have not (yet) shown that the geodesic distance between the net and polyhedron is correlated with the energy dissipated in the Stokes flow. Nor have we fully justified the use of the distance function to define rates *A*. At this time, the main justification for these choices is that they yield dominant intermediates that agree with experiments in a manner that is insensitive to the choice of cost functions and kinetics. We view these results as a reflection of the robust nature of the energy landscape of self-folding for these polyhedra, and expect them to be reinforced by detailed studies with more realistic cost functions and rates. The motivation for these choices in connection with other heuristics about self-assembly are discussed again in Section 10.2.

## 9 Self-Folding the Dodecahedron

### 9.1 Description of the Experiments

Since the dodecahedron has 43,380 nets, we restrict ourselves to a much smaller set of nets to obtain a manageable comparison with experiment. An exhaustive search through all 43,380 nets reveals that only 21 of them have the maximal number of vertex connections (*V*_{c} = 10 in this case; see Figure 11). We compute all intermediates that originate in these nets to find a poset with 2799 vertices that we call the restricted configuration space . This provides a piece of that is large enough to show interesting structure, but small enough to be experimentally tractable. A brief description of the computational scheme used to compute is presented at the end of this section.

In order to formulate questions that allow meaningful contact with the experiments, it is necessary to first describe the scale of the experiments. All 21 nets were patterned on a single mask in random order in order to minimize the role of manufacturing defects. 50 samples of each net were self-folded in the experiment. Nets were released from the substrate, folded in batches of 60 each (so approximately 3 from each net), then visually sorted under an optical microscope into three different grades. All nets and high-quality SEM images of dodecahedra folded from these nets are shown in Figure 11. The yield from the experiments is shown in Figure 12. This set of experiments allows a statistically meaningful formulation of the yield problem: How can the yield of self-folded polyhedra from a particular net be estimated a priori?

In addition to the above experiments, movies of the folding process were made for all 21 nets. Each of these movies was based on the folding of a single net in isolation. In order to compare the observed folding process with pathways computed from a discrete model, we visually extract snapshots of the folding process (as shown in Figures 13,^{14}–15). Since we follow the folding of a single net in isolation in each movie, a direct comparison between theoretically computed and experimentally observed pathways is statistically meaningful only when we focus on the emergence of a few dominant intermediates (since we then have a sample size of 21). These experimental comparisons may appear restricted at first sight, but it should be recognized that the experiments are challenging and the visual sorting of states is time-consuming. The results we report are the first systematic examination of pathways of synthetic self-assembly on this length scale.

### 9.2 Results

The configuration space is too large to be represented effectively in a figure. In order to help the reader understand the nature of , we portray a subgraph of in Figure 16 and Figure 17. In comparison with the simpler configuration spaces of the cube (see [60, Figure 7]) and octahedron [61], we note that the dodecahedron involves many more levels of folding intermediates. In the early stages, faces fold independently at the boundary of the dodecahedron net. Thus, the comparison with experimental pathways is more meaningful in the later stages of folding.

There are typically between 10^{6} to 10^{7} folding pathways from a net to the dodecahedron in . We considered three main computations on these folding pathways: (i) predicting yield; (ii) computation of shortest paths; (iii) determining prevalent intermediates. We discuss these in turn.

*S*

_{0}to the polyhedron

*S*

_{m}. In order to assign a probability to each pathway, we first assign a weight

*w*(

*S*

_{j},

*S*

_{j+1}) ≥ 0 to each edge and then define the weight of each pathway to be . A yield function on a pathway is defined in a similar manner. First, each edge is assigned a cost , and the cost for a pathway is the sum of costs along the edges in the path. Both these notions are used to introduce a weighted cost function

*C*(

*N*) and expected yield

*Y*(

*N*) for each net

*N*as follows:Here denotes the set of pathways that originate at net

*N*. The expected yield

*Y*(

*N*) is normalized by averaging over all initial nets.

The above procedure formalizes the problem of predicting yield. However, it involves both a choice of weight on paths *w* and a cost function *c*. We explored various choices and compared the numerical computations with the experimental yield. The weight was chosen to be *w*(*S*_{j}, *S*_{j+1}) = exp (−β*d*(*S*_{j}, *S*_{j+1})), where β > 0 is an ad hoc parameter. Various models of the cost function *c* were investigated, including the distance between two states, the degeneracy, and the number of non-rigid vertices in the state *S*. While it was possible to vary the parameter β and find the best match between *Y*(*N*) and the observed yield, these comparisons did not convey much insight. Moreover, the comparison of predicted *Y*(*N*) with experiment was difficult, since the error bars on experimental yield measurements are sufficiently large that several nets are clustered together (since only 50 samples are folded for each net). In presenting this procedure, our main hope is that it provides guidelines for statistical estimation and prediction at a future time when a much larger number of samples can be tested simultaneously and sorted automatically.

On the other hand, the comparison of computed pathways and experiments does convey genuine insight, and these results are reported in detail. We computed both geodesics and greedy paths by evaluating the distance function on these paths. In order not to overwhelm the reader with data, we classify the observed pathways according to the penultimate intermediate. These results are shown in Figures 18 and 19 along with a few experimental pathways for ease of comparison. While the space contains 23 states on level 9, an important feature of these computations is that the greedy paths and global geodesics funnel through only four of these 23 states. Moreover, all the greedy paths funnel through states that consist of two half dodecahedra linked by a hinge (states 2788, 2791, and 2797). Of these three, 2797 is the most important, since 16 of the 21 nets funnel through it. These three states, and 2797 in particular, are especially conducive to self-folding with no misfolding: Each half dodecahedron is rigid, and the intermediate has a single rotational degree of freedom about the hinge. When the two halves mate, several edges lock into place simultaneously.

When we compare with experiments, we find that the greedy paths replicate the observed pathways better. These computed pathways are a rough approximation of the true continuous pathway, and the experiments do not permit a statistically reliable resolution of fine distinctions such as whether intermediate 2788, 2791, or 2797 best approximates the observed half dodecahedra. Nevertheless, the trend towards such pathways is clear, as is the fact that the greedy paths are much better approximations to the observed pathways than the global geodesics.

The kinetically dominant intermediates were determined by the method outlined in Section 8.3. The prevalent intermediates are shown in Figure 20. It is important to note that *all* rate functions yield exactly the same set of prevalent intermediates and that these are in accordance with the dominant intermediates determined from the funnel of greedy paths.

### 9.3 Computational Scheme

A Mathematica package was written to compute . The first step was to find the 21 nets of the dodecahedron with maximum *V*_{c}. To do so, we enumerated all spanning trees on the 1-skeleton of the dodecahedron, used these to obtain all 43,380 nets for the dodecahedron, and computed *V*_{c} for each net. The spanning trees were enumerated by algorithm S in Knuth's book [43, §7.2.1.6]. Algorithm S takes time exponentional in the number of vertices, but it is computationally feasible for the dodecahedron. We then chose the 21 nets with maximum *V*_{c} and computed by performing each possible folding move on each net, testing intermediates for congruence as needed. For each pair of intermediates *S*_{j} and *S*_{j+1} on a folding pathway, we kept track of the degeneracy of the edge linking these states (the number of times that intermediate *i* folded into *j*), as well the distance between intermediate states. It was easy to calculate greedy paths: We started at an initial vertex and repeatedly chose the vertex at the end of the cheapest outgoing edge, keeping track of the edges traversed. Since is a poset, its geodesics were found with dynamic programming in time proportional to the size of .

## 10 Outlook

We present our conclusions in two parts. First, we give a set of findings specific to polyhedra. Second, we speculate more broadly on the role of mathematical theory in the field of self-assembly. Here we discuss some recent investigations that are similar in spirit to our work, and suggest areas for common inquiry.

### 10.1 Synthetic Self-Assembly of Polyhedra

Viruses are marvels of natural engineering and have inspired several synthetic emulations. While we have focused on self-folding, other important synthetic examples arise in supramolecular chemistry and DNA nanotechnology. In the first category, chemists have explored the assembly of polyhedra from molecular tiles held together by non-covalent bonds. Important recent examples of such molecular design are the work of Fujita's and Ward's groups [53, 71]. DNA nanotechnology emerged from Seeman's fundamental insight that the stiffness of DNA molecules could conceivably be used to design and build molecular crystals independent of any biological context. This is a booming area of enquiry that we cannot adequately survey (see [64, 67, 79] for examples of breakthrough experiments, and [5, 12, 16] for examples of polyhedra built with synthetic DNA). The conceptual framework outlined in this article is well suited to the design and analysis of these self-assembling systems. While the modular subunits vary from experiment to experiment—they may be edges [5, 71], faces [53], or junctions [34]—a common feature of the assembly pathways is that they are described by combinatorial decompositions of polyhedra. Thus, in each instance, we may construct a configuration space for the analysis of assembly.

However, few properties of these configuration spaces are understood. To date, the shapes built by synthetic self-assembly are relatively simple polyhedra—the Platonic solids, simple prisms, and simpler Archimedean solids. In order to build more complex shapes, even moderately complex polyhedra (20–30 faces), it is necessary to understand how the physics of a particular technique of self-assembly can be coupled with geometric combinatorics to provide sufficient control of the assembly pathway. Our work constitutes mainly an exploration of the geometry of folding, with the use of optimal paths serving as a crude approximation of the physics. Let us stress two key findings.

First, depite a combinatorial explosion in the configuration space, we found that pathways focus through a few dominant intermediates. We used two measures to quantify “dominance”: (i) *metric:* greedy paths on or were computed and seen to concentrate on a small subset of possible intermediates; (ii) *kinetic:* prevalent intermediates were computed using the kinetic equation (1) for several choices of rates. Both methods give essentially the same set of intermediates. Moreover, computed pathways compare favorably with experiment.

Second, we discovered that the dominant intermediates have good rigidity properties. More precisely, intermediates such as 2797 for the dodecahedron, and similar intermediates for the cube and truncated octahedron, have a single rotational degree of freedom about a central hinge. Passage through these intermediates is particularly well suited to self-folding, since several edges mate simultaneously, reducing the chance of misfolding. Note that our cost functions have no a priori connection to rigidity theory. Thus, the rigidity of the dominant intermediates is surprising. Recent experiments in self-folding show that modifications to precursor nets such as removal of even a single panel can have a strong influence on self-assembly and direct pathways through more rigid intermediates. This strategy was used to enrich one octahedral isomer over another [61].

These lessons are not restricted to self-folding. Dominant compact intermediates and pathways have been discovered in several model systems. Various measures of dominance seem to lead to the same intermediates, and further these intermediates typically have favorable properties such as compactness, energetic stability, or fewer degrees of freedom. The principal challenge in engineering self-assembling systems is to understand how to guide a system through these intermediates. Viruses such as MS2 use an exquisite interplay between RNA folding and capsid assembly to direct succesful assembly. At present, such precise direction lies outside the scope of any synthetic technique (though current experiments in DNA nanotechnology are exploring the design of DNA sequences that break assembly into a hierarchy of steps). The decoding of the intricate sequence-specific mathematical structure of the assembly process for MS2 suggests an inspiring inverse challenge: Can we design a (synthetic) sequence of nucleotides that code for protein shells that assemble as gracefully as MS2?

### 10.2 Discrete Geometry and Self-Assembly

Polyhedra are mathematical objects of antiquity. However, studies of self-assembling polyhedra give rise to new questions that are of practical importance and intrinsic interest. The examples surveyed in the article show that essential insights into the biological, chemical, or physical mechanism of self-assembly can be captured with a discrete algorithm for building a polyhedron. Similar ideas are also used to model protein folding [46, 66] and directed assembly of sphere packings [2, 56]. The common structure of these models involves a geometrically defined discrete configuration space, , and a Markov process on with transition probabilities defined using biochemical or physical energy functions. As the strength of thermal fluctuations decreases, it is also natural to consider gradient descent of the energy. In biophysical models, the main goal is to understand how natural energy functions rapidly drive the system to a preferred minimum energy. In synthetic examples, the goal is to understand how to direct assembly through preferred pathways.

As this rough description suggests, there are two main mathematical aspects to these studies. The first involves geometric combinatorics—to enumerate and classify the configuration space . The second is to understand how the geometry of affects an energy landscape on . In the models we consider, the geometry of intermediates is modeled faithfully, while using a naive approximation of the physics. While the agreement with experiment is surprising at first sight, our cautious view is that the intrinsic geometry of the configuration space underlies robust self-assembly. In addition to our work on self-folding, this heuristic idea plays an important role in the kinetics of capsid formation, both in a crude model using the building game [81], and in a careful study of directed assembly that takes into account the role of RNA folding [18, 20, 21]. In the protein folding literature, computations have shown that even randomly chosen energies can lead specific HP strings into a compact state [66]. In this view, the energy landscape of protein folding is a *folding funnel* attracting all initial states into the narrow neck of the funnel, followed by a rapid passage down the neck [8, 14, 49].

A precise understanding of the geometry of the configuration space for models of self-assembly promises to be of broad interest. At present, our understanding of the structure of the configuration spaces relies mainly on computational explorations for simple polyhedra. These computations suggest interesting problems in geometric combinatorics that are yet to be investigated systematically. For example, how big is the configuration space for the building game? How do growth algorithms for polyhedra suggested by experiments (the building game, self-folding, local rules) compare with well-studied combinatorial methods such as shellability [80]? Do the above configuration spaces have favorable CAT(0) geometry that underlies robust assembly? (Compare with the speculative ideas of Gromov in [29].)

The size of the configuration space provides a simple measure of the complexity of self-assembly. By this measure, for the same polyhedron, self-folding is a more complex growth process than the building game. On the other hand, the configuration spaces we have explored are far smaller than the configuration space for simple protein-folding models like the 27-mer [66]. Thus, self-assembly of polyhedra provides problems of intermediate complexity, amenable to experimental and theoretical investigation. Our hope is that heuristics common to several models of self-assembly, such as the formulation of a folding funnel and directed pathways, can be better understood in the investigation of polyhedra. We also hope that the immediacy of the technological challenge of self-assembly will stimulate further interdisciplinary work in the area.

## Acknowledgments

RK, JK, and GM acknowledge support from NSF grants DMS 07-48482 and EFRI 10-22638 (BECS). SP and DHG acknowledge support from NSF grant CMMI 12-00241 and EFRI 10-22730 (BECS).

## Notes

The reader is also cautioned not to try this with the dodecahedron or icosahedron—both these polyhedra have 43,380 nets! These exact numbers for the Platonic solids are exceptions: Enumerating nets is a subtle problem, and it is not yet known whether every convex polyhedron can be unfolded along its edges to a (non-overlapping) net [59].

This is an example of a *strongly balanced* tiling in the sense of Grünbaum and Shephard [30].

This is a computational observation in [65], and it is not clear if a rigorous proof exists.

Note that condition (ii) was not explicitly stated in [60], though it is implicit in the nets studied there.

## References

## Author notes

Author contributions: DHG, GM: Designed research, analyzed data, wrote the paper; RK, JK, SP: Performed research with equal contributions.

Contact author.

Work performed at Brown University. Current address: Dropbox LLC, 185 Berry Street, San Francisco, CA 94107. E-mail: ryankap@gmail.com

Division of Applied Mathematics, Brown University, 182 George St., Providence, RI 02912. E-mail: joe_klobusicky@brown.edu (J.K.); govind_menon@brown.edu (G.M.)

Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD 21218. E-mail: shivendra@jhu.edu

Department of Chemical and Biomolecular Engineering and Department of Chemistry, The Johns Hopkins University, Baltimore, MD 21218. E-mail: dgracias@jhu.edu