DNA partitions into triplets under tension in the presence of organic cations, with sequence evolutionary age predicting the stability of the triplet phase

Abstract Using atomistic simulations, we show the formation of stable triplet structure when particular GC-rich DNA duplexes are extended in solution over a timescale of hundreds of nanoseconds, in the presence of organic salt. We present planar-stacked triplet disproportionated DNA (Σ DNA) as a possible solution phase of the double helix under tension, subject to sequence and the presence of stabilising co-factors. Considering the partitioning of the duplexes into triplets of base pairs as the first step of operation of recombinase enzymes like RecA, we emphasise the structure–function relationship in Σ DNA. We supplement atomistic calculations with thermodynamic arguments to show that codons for ‘phase 1’ amino acids (those appearing early in evolution) are more likely than a lower entropy GC-rich sequence to form triplets under tension. We further observe that the four amino acids supposed (in the ‘GADV world’ hypothesis) to constitute the minimal set to produce functional globular proteins have the strongest triplet-forming propensity within the phase 1 set, showing a series of decreasing triplet propensity with evolutionary newness. The weak form of our observation provides a physical mechanism to minimise read frame and recombination alignment errors in the early evolution of the genetic code.


Introduction
Under tension in aqueous solution with small or monatomic counterions, the DNA duplex stretches, unwinding if not topologically constrained, and eventually denatures. The extension against force shows a jump by a factor of ∼1·5−1·7 (depending on sequence, pulling geometry and solution) at ∼65-70 pN (Bosaeus et al. 2012;Liu et al. 2010;Vlassakis et al. 2008;Williams et al. 2002). Several models have been proposed to explain the sudden increase in length, which is widely agreed to be the signal of a collective structural transition. The formation of regions of single-stranded DNA (ssDNA) (Williams et al. 2001) or of ladder-like stretched and untwisted double-stranded DNA (dsDNA) have been suggested (Bosaeus et al. 2017;Cluzel et al. 1996;Konrad & Bolonick, 1996;Lebrun & Lavery, 1996;Smith et al. 1996). Although the structural details and degree of order are not clear or universally agreed in the literature, in general at modest extensions of long sequences not dominated by AT base pairs, we expect to see a clear two-state transition to a partly untwisted ladderlike structure, in which the base pairs remain intact but the average rise per base pair is equilibrated to a new value of ∼5·8 Å, compared to the rise in unstretched B-DNA of 3·4 Å.
The stretched phase or phases are known by the umbrella label of 'S-DNA'. For GC-rich structures having strong hydrogen bonding, the base pairing is preserved in the putative S-DNA structure, and the base stacking may also be somewhat preserved by tilting and sliding of the base pairs or by opening of 'bubbles' between base pairs (Konrad & Bolonick, 1996). Reorientation of the base pairs may increase the solvent-exposed area by changing the groove width while permitting them to remain in contact such that a complete water gap does not open between them (Roe & Chaka, 2009). The detailed S-DNA structure, particularly the inclination, depends on the pulling scheme. Experimental results of Albrecht et al. show that the differences in behaviour of DNA when using different pulling schemes can depend on the loading rate, with stretching forces propagating first at high pulling rates, but shearing becoming dominant at slower pulling rates (Albrecht et al. 2008). Others have also shown that there are structural differences when using different pulling mechanisms (Bag et al. 2016;Danilowicz et al. 2009;Roe & Chaka, 2009).
The most readily available information on DNA under tension is the empirically measured force-extension curve (Rief et al. 1999;Smith et al. 1992), which provides the clear signal of some transition, but no atomistic-level information. This is supplemented by fluorescence and polarised-light studies (van Mameren et al. 2009;Nordén et al. 1992;King et al. 2013), and by atomistic simulations which are able to provide explicit descriptions of the DNA but which are limited in the accessible timescales and system sizes (Bag et al. 2016;Konrad & Bolonick., 1996;Lebrun & Lavery, 1996;Li & Gisler, 2009;Roe & Chaka, 2009). DNA is often subjected to tension in its biological context, for purposes including transport, transcription and tertiary structure manipulation (Nicklas, 1988). A striking example of this is the crystal structure (pdb: 3cmt) of DNA bound to the RecA protein (Chen et al. 2008), a snapshot of homologous DNA recombination in action. Homologous recombination using RecA is employed for DNA repair (Kuzminov, 1999), and via other recombinases is also the fundamental process of sexual reproduction, in which copies of similar genes are exchanged between DNA strands from either parent. In the RecA structure, the extended protein-bound DNA duplex does not adopt a recognised S-like configuration, but rather disproportionates into groups of three bases, with orderly planar base stacking retained within each triplet. This triplet disproportionation has been observed in solution when bound to RecA (Nordén et al. 1992), in crystallographic structures of the RecA-DNA complex (Chen et al. 2008) and also has been suggested as a stable phase even without co-factors (Bosaeus et al. 2017). Simple energy minimisation (without thermal fluctuations or any explicit solvent) of d [GCG] 4 DNA under constraints of a 1·5 relative extension and also of imposed helical symmetry with unit cell 3 bp yields a structure partitioned into triplets of three stacked base pairs separated by a stack-breaking gap of approximately one base-pair thickness (Bertucat et al. 1998). The early minimisation calculation shows that a period 3 or 'triplet' structure is physically plausible when considered purely in static chemical terms, although it has not since been experimentally or computationally observed in solution. We note that the term 'triplet' here refers to a duplex of length three bases (perhaps more properly a 'triad'), and not to a triplex of three strands.
Orderly triplet formation when complexed is in contrast to current general understanding of the structural behaviour when extended in solution, which leads us to examine whether the triplet phase can be stabilised in solution and if it could in this case be considered a canonical biologically active structure of DNA on the same footing as the A, B and Z forms. Using molecular dynamics simulations of duplex DNA with an applied force, we do not observe stable triplet structure in an aqueous solution of monatomic counterions, but do find that it is stable without specific complex to a structured enzyme, forming triplets in a solution either of terminus capped monomeric arginine peptides (Ac-Arg + -NHMe Cl − ) or (more weakly) of the well-known intercalant ethidium bromide (Et + Br − ).
The presence of intercalators has been observed to drive significant alterations in the quasiequilibrium force versus extension curve, with an effect at high concentrations of smoothing over the B to S transition and possibly of modifying the structure of the S phase (Vladescu et al. 2007). By carrying out simulated stretching experiments in the presence of DNA-binding cofactors, we intend to reduce the barrier and collectivity associated with the B-S transition, thus increasing the likelihood that the sub-microsecond simulation timescale can describe the typical laboratoryscale (milliseconds) process, and also to investigate the structural role of biologically relevant moieties (Arg) in relation to DNA under tension.
We discuss planar-stacked triplet disproportionated DNA as a solution phase of the double helix under tension, and refer to it as 'Σ DNA', with the three right-facing points of the Σ character serving as a mnemonic for the three grouped base pairs. In the same way as for unstretched Watson-Crick base-paired DNA structures, we remark that the structure of the Σ phase is linked to function: the partitioning of bases into codons of three base pairs each is the first phase of operation of recombinase enzymes such as RecA, facilitating alignment of homologous or near-homologous sequences. By showing that this process does not require any very sophisticated manipulation of the DNA, we position it as potentially appearing as an early step in the development of life, and correlate the postulated sequence of incorporation of amino acids (phase 0 or 'the GADV world' (Ikehara et al. 2002), phase 1 or 'GADVESPLIT' (Koonin & Novozhilov, 2009;Wong, 1975Wong, , 2005, then the remaining amino acids, CMFYHRKNQ), into molecular biology with the ease of Σ-formation for sequences including the associated codons. We also note that the machinery of nucleotide to peptide translation occurs necessarily with reference to triplets of bases, so that further investigation into the Σ phase of single and double strands of RNA and DNA might be a valuable source of insight into the origins not only of recombination, but also of gene expression.

Triplet propensity calculation
To calculate the triplet propensity, we proceed from datasets giving base-pair step formation energies via a brute force Monte Carlo approach. Initially, a random sequence of 60 amino acids is drawn with equal probability for each of the 'phase 1' amino acids. A DNA duplex is then defined by randomly drawing a codon for each amino acid (based on those codons available in the modern 'universal' genetic code). In order to have equal sequence content in both strands, the initial sequence is then concatenated with its complement, generating a palindromic test DNA sequence of 360 bp. Defects are then inserted at random into the sequence such that ⅓ of base steps are broken, and allowed to equilibrate their positions for 150 MC steps per defect present, where an MC step is a defect repositioning attempt via Metropolis Monte Carlo at 300 K. This long sequence then serves as a 'reservoir' of stack breaks: in order to find the triplet propensity for a given test codon, it and its complement are inserted into the reservoir duplex and re-equilibrated for a further 30 steps/defect before collecting statistics for the distribution of stack breaks. The triplet propensity ΔG τ is stated simply as a negative log probability for breaks to form at each end of the test codon but not inside it. The calculation was repeated (with re-drawing of the reservoir sequences) until convergence, which was validated by comparing redundant codons: the ΔG τ for all pairs such as GGA·TCC and TCC·GGA was verified to differ at worst in the third significant figure. The short script used for the calculation is included as Supplementary software.

Atomistic simulation protocols
Molecular structures were prepared using the Nucleic Acid Builder (NAB) (Macke & Case, 1998). Salt was represented using the Joung-Cheatham parameters for ions (Joung & Cheatham, 2008) and TIP3P model of water was used for solvation (Jorgensen et al. 1983). The ethidium molecule was represented using the GAFF (Wang et al. 2004) with partial charges and bond parameters assigned via the ANTECHAMBER tool (Wang et al. 2006). Simulations were run using the GPU-accelerated implementation of pmemd (Salomon-Ferrer et al. 2013) in the AMBER16 package (Case et al. 2016) using the AMBER 99SB+bsc0 force field parameters for the biomolecules (Hornak et al. 2006;Pérez et al. 2007). Generated DNA fragments were simulated in a rectangular periodic box with 10 Å distance from the DNA to the box boundaries. Sodium and chloride ions were added so as to give an electrically neutral system with approximately physiological Cl − concentration (order 0·1 M) ( Table 1). The SHAKE algorithm (Ryckaert et al. 1977) was used to constrain all bonds involving hydrogen atoms. To calculate electrostatic interactions, the particle mesh Ewald sum was employed with a 2 fs time step (Darden et al. 1993). The direct part of the Lennard-Jones interactions was cut off at 8 Å.
EtBr and arginine intercalators were initialised in the bulk solvent rather than being manually inserted between base pairs: this approach is likely to underestimate the amount of intercalator bound at a given extension, but has the advantage that no bias is introduced with respect to the binding site or pose. The simulation trajectories were collected at a rate of one frame per 2 ps.
For each calculation, 16 independent instances were prepared and equilibrated in the B conformation for 10 ns. Pulling of the DNA then took place using steered molecular dynamics, for 16 instances of six different systems, over a time period of 150 ns (giving a pulling rate of 0·068 nm ns −1 ). Due to the slow kinetics of intercalator dissociation in unstretched DNA, which for typical mono-intercalators is of the order of one per second (Biebricher et al. 2015), binding of intercalators was essentially irreversible in silico except via extensionally driven conformational change. The choice to make multiple 150 ns simulations rather than fewer multi-microsecond runs was a decision to pursue good stochastic sampling of an explicitly non-equilibrium process, rather than attempting the computationally very challenging goal of equilibrium-like sampling, which is not achieved even on the laboratory timescale of <1 s per cycle of extension and relaxation.
DNA was pulled using force applied to the centres of geometry of the top and bottom base pairs, such that no topological restraint was applied and force was distributed equally between 3′ and 5′ strand ends. The choice to spread the applied force was made based on an assumption that the distribution of applied force in an early biological context was likely to be relatively smooth (e.g. via hydrodynamic combing or weakly specific binding of cofactors) rather than limited to two specific atoms.
Averaging of angles (for study of DNA structural parameters) was carried out by taking the mean cosine and sine, then the arctangent of the mean values. Structures were analysed using CURVES+ (Lavery et al. 2009).

Base-pair step classification
Base-pair steps were classified according to the following rules (see Fig. 5, e.g structures): • β: ('Base-paired and stacked'): one or more Watson-Crick hydrogen bonds were preserved for each pair in the step, with rise <5·6 Å. • ζ: ('Zipper'): one or more hydrogen bonds were present between each base and the backbone of the opposite strand. • σ : ('Space'): one or more WC hydrogen bonds were preserved for each pair in the step, rise was ⩾5·6 Å, and at least one of the two vertical pairs of bases was completely separated, with no contact ⩽3·5 Å. • τ: ('Tilted'): one or more WC hydrogen bonds were preserved for each pair in the step, rise was ⩾5·6 Å, but one or more atomic contacts remained for each vertical pair of bases. • μ: ('Melted or mismatched'): WC hydrogen bonding to the opposite base was disrupted, and not replaced with zipper hydrogen bonding.
Hydrogen bonds were defined for triangles donor-Hacceptor such that the angle at H was >135°and the distance donor-acceptor was <3 Å (cpptraj defaults) (Roe & Cheatham, 2013

Sequence dependence of disproportionation
It is not clear what form the original genetic code had, as it is likely to have co-evolved to some extent with the associated enzymes of transcription and translation. We can make guesses about the history of the genetic code by considering the metabolic networks leading to the different amino acids: it is hypothesised that a list of so-called 'phase 1' amino acids were present earlier in evolution than the remaining 'phase 2' amino acids, based on the complexity of the cellular machinery used in current organisms to synthesise, for example, methionine (M) from threonine (T) (Koonin & Novozhilov, 2009;Wong & Bronskill, 1979). If the genetic code in the epoch of a much simplified amino-acid alphabet already had the current structure of three base pairs per codon ('triplet code') it was therefore highly redundant at this time.
In the current triplet code, the 'phase 1' amino acids supposed to have been incorporated earliest into biology (the list of GADVESPLIT) are coded by triplets which have a specific physical tendency: the energetic cost to break base stacking at the triplet boundary is low, relative to the complete modern genetic code. In this paper, we first motivate the statistical observation of preferential triplet disproportionation in the phase 1 genetic code. We then analyse atomistic simulation data to show that disproportionation into coding triplets occurs spontaneously under tension for appropriate sequences and solution conditions.
The weak form of our observation provides a physical mechanism to minimise read frame and recombination alignment errors in the early evolution of the genetic code. We further motivate this to make the stronger claim of a possible route for the origin of the triplet genetic code, with the three basepair structure arising from simple physical conditions in the absence of the sophisticated enzymatic machinery which later evolved to maintain the triplet code in modern organisms with a full alphabet of amino acids.

Estimation of sequence-dependent triplet-forming propensity
The free energy of base stacking in duplex DNA was long ago calculated combinatorially for the various pairs of bases, by Friedman and Honig (Friedman & Honig, 1995). Stacking energies are directional (crucially, GC·GC is stronger than CG·CG), here when writing a sequence the direction 5′ → 3′ is assumed.
From the tabulated data (Friedman & Honig, 1995), if we approximate the stacking energy for complementary duplex DNA as the sum of the stacking energies for the two base steps, the weakest step is CG·CG (−10·14 kcal mol −1 ). We therefore hypothesise that a series of codons of the form GNC (where N means 'anything') should have the special property of partitioning naturally at the codon boundary C to G when under tension. We make a statistical analysis based on the stack-breaking free energies, to produce a measure of the propensity for a given codon, when embedded in a wider sequence, to partition extension to the codon boundaries.
To calculate the triplet propensity, we take a brute-force Monte Carlo approach, selecting a long sequence and allowing stack breaks to equilibrate according to a Boltzmann distribution, before sampling the probability for each codon to partition stack-break defects such that it has a stack break at each end but not inside it.
Tabulating this information for the 20 canonical amino acids (showing the triplet propensity for the most tripletprone codon available to each amino acid), we can see a clear pattern of reduced triplet disproportionation energy for the primordial 'stage 1' amino acids ( Table 2).
The dramatic pattern evident in the tabulated partitioning energies is that stage 1 amino acids overwhelmingly have relatively favourable free energies to partition into triplets aligned to their codons. The exception to this pattern is interesting: arginine (R) is not listed in Wong & Bronskill's (1979) tabulation of stage 1 amino acids (Wong & Bronskill, 1979), possibly due to the large energetic cost needed to synthesise it from citrulline in modern organisms (Ratner & Petrack, 1953).
It has been advanced the CGN and AGN codons which yield arginine in the modern genetic code previously coded for the chemically similar non-canonical amino acid ornithine (Jukes, 1973), and that the function of this codon was usurped in a presumably dramatic evolutionary event when selection advantage was found in having access to the more strongly basic arginine molecule. The original phase 1 list GADVESPLIT contains no basic amino acids at all making the addition of ornithine seem valuable in order to form a good range of folded proteins, and the replacement of ornithine with arginine a beneficial evolutionary step in giving access to a stronger base.
Arginine stands out for a second reason: the DNA-binding recombinase RecA achieves triplet disproportionation by cradling the negatively charged DNA in a large number of positively charged R side chains (and some K). Thus, we should perhaps not be surprised if a phase of biochemical evolution in which control of triplet disproportionation is important should have some means to produce either arginine or similar moderately bulky basic residues.
The weaker statement of this work, that the stage 1 part of the genetic code is structured so as to support a minimisation of read-frame errors by physically favouring the partition into aligned triplets under tension, is related to a known subtle and remarkable property of the genetic code. This property is that its redundancy is structured almost optimally so as to support overlapping codes orthogonal to the primary code specifying amino acids (Itzkovitz & Alon, 2007), allowing the evolution of sequence changes altering DNA structure and interactions even within protein-coding regions, without changing the coded protein.
The overall flexibility of the genetic code in allowing arbitrary steganographic codes is not however sufficient to explain the strong pattern which we observe: Table 3 shows that codons for phase 1 amino acids are significantly more able to encode this partitioning than those in phase 2. We further observe that the residues advanced by Ikehara et al. (Ikehara et al. 2002) as forming the minimal set for a functional proteome (marked ☥ in Table 3) are also those which partition most naturally into triplets.
We should note that the calculation of stacking energies by Friedman and Honig is a very elegant paper but is not the most recent attempt to measure this quantity. On the experimental side, Protozanova et al. have measured free energy of stacking for base-pair steps in a nicked DNA duplex (Protozanova et al. 2004). It is attractive to generalise these stacking-free energies for nicked DNA to the intact DNA double helix; however, NMR analyses have shown that the broken phosphodiester bond pushes the conformation away from canonical B-form (Pieters et al. 1989;Roll et al. 1998; Snowden-Ifft & Wemmer, 1990) which leaves Table 2. Phase one amino acids (*,☥) tend to have (at least one) codon associated with them that partitions favourably at its boundaries

Amino acid
Codon-anticodon The most favourable partitioning is for the phase 0 (DAGV) amino acids (☥). a question mark over the applicability of the data for our purposes. Protozanova et al. report values of −0·91 and −2·17 kcal mol −1 for CG·CG and GC·GC stacks, while further experimental results for association of completely free duplex ends by Kilchherr et al. report −2·06 and −3·42 kcal mol −1 (Kilchherr et al. 2016). Calculations by Lemkul and MacKerell (Lemkul & MacKerell, 2017) make a more complex treatment of the base stacking interactions, arriving at values for the key CG·CG and GC·GC comparison of −7·59 versus −11·69 kcal mol −1 . Overall then, while there is significant disagreement on the magnitude (and sometimes the ranking) of the 10 stacking energies for complementary base-pair steps, the key feature of a weak CG stack which points to the existence of a GNC triplet code is preserved. We include as Supplementary data repeated calculations of triplet formation-free energies with the Protozanova et al. and Lemkul and MacKerell datasets (Fig. S1). These Supplementary calculations show the same privileged status for the GADV and phase 1 amino acids as those based on the Honig dataset, although many details are different, and also in the Lemkul and MacKerrell calculation, the residue glycine loses its privileged triplet status almost entirely due to a very weak reported value for the GG·CC stack. We also include as a Supplementary software the python script used to carry out the triplet propensity calculation.

Spontaneous triplet disproportionation under tension, amplified in the presence of organic cations
Beyond the pairwise hydrophobic and electrostatic interactions of base stacking (covered by the classic calculation used to generate Tables 2 and 3) the potential importance of complex entropic, structural and solvent effects makes it necessary to carry out a full atomistic molecular dynamics investigation of DNA under tension. Given the expected importance of sequence effects, simulations were run both with a low-entropy sequence of d[G 12 C 12 ] (encoding four glycines and four prolines) and a sequence chosen to show strong triplet disproportionation based on  (Ikehara et al. 2002)). DNA duplexes were stretched by an additional 100 Å from their relaxed lengths, over a time period of 150 ns, giving a stretching rate of 0·029 Å ns −1 bp −1 . Because of the apparent importance of arginine, based on Table 2 and on the RecA structure (Chen et al. 2008), simulations were run both in NaCl and in a solution of Ac-Arg-NHMe Cl.
We find that for the GC-rich sequence encoding phase 1 amino acids, the triplet-disproportionated Σ conformation of DNA is observed, with the strongest triplet formation taking place in the presence of the terminus-capped arginine residues (Ac-Arg-NHMe). Figure 1 shows a regular pattern of vertical stripes (preferred base-pair step breaks) with period 3 bp, over a large range of extensions. The lowentropy sequence in the presence of arginine shows some weak structure at high extensions, due to exclusion effects which disfavour binding of cations to adjacent sites. In the high-entropy sequence, some structure of period 3 is seen, even in the absence of arginine; however, this is relatively weak (as suggested by the order 1 k B T free energies of disproportionation in Table 2).
The triplet-disproportionated structures show the essential features of Σ-DNA (Fig. 2) as seen in the RecA-bound crystal structure: preserved Watson-Crick base pairing, approximately planar orientation of the bases (Fig. 3), and a large cavity every third base pair.
Extending beyond approximately 3 Å bp −1 leads to breakup of the Σ phase and also to loss of Watson-Crick hydrogen bonding, as the bases interdigitate with each other and hydrogen bond to the backbone.
The question of planarity of the base stacking is important to monitor as a structure of stretched DNA in which formation of stack breaks is avoided by collective tilting of the stack has been associated with stretching in the 3′-3′ geometry (Roe & Chaka, 2009). Without intercalators, the average base-pair inclination in the presence and absence of intercalators up to the extension point of 1·5 was flat (Figs 3b, 3d Table 3. All codons, coloured by G τ with a light colour indicating higher probability to disproportionate into triplets The pattern purine-x-pyrimidine gives greatest triplet disproportionation, also having a (bulky) G or C base in the middle of the codon gives more favourable Σ formation. Because energies were found in the duplex form, complementary codons (e.g. GGC, GCC) necessarily have the same G τ . ☥, * indicates phase 1 amino acids, ☥ indicates phase 0. X indicates a stop codon. and 3f), which indicates no collective tilting. For the highentropy sequence, in the presence of intercalators, this trend continued after an extension of 1·5 but without intercalators showed a sudden drop at around 1·7. For the lowentropy sequence (G 12 C 12 ·G 12 C 12 ), the change of average inclination up to an extension of 1·5 is the same as for the high-entropy sequence. The bare sequence and the one in the presence of arginine reach a maximum inclination at a relative extension of 1·6-1·7 and drop afterwards (Figs 3a and 3c), but in the presence of EtBr, a continuous increase is observed after extension 1·5, followed by a second flat region after 1·6 (Fig. 3e). That average inclination tends to be small during DNA extension for the high-entropy sequence with or without organic cations is consistent with the results put forward from experiment (Bosaeus et al. 2017) as suggesting Σ formation even in free solution.  were classified according to the physical interactions which were either broken or preserved, with Watson-Crick hydrogen bonds and planar base stacking giving way over medium extensions of 1·3-1·5 to a complex regime in which regions of melted base pairs (no or mismatched hydrogen bonding) compete with regions in which the Σ phase dominates with WC hydrogen bonding preserved, but with stacking periodically either broken or reduced by non-collective tilting. At extensions beyond 1·7, in the absence of intercalator molecules, the zipper phase emerges.

Base-pair step classification
In the presence of intercalators, the zipper phase is destabilised and WC base-paired conformations are stabilised. Although the count of conformations consistent with the Σ phase increases again at high extension with intercalators, the threefold periodicity is weakened. The presence of the Σ phase (in the full sense of two β-steps alternating with one σ-step) reaches a maximum average proportion of about 30%, with the remainder at this point made up mostly of melted or indeterminate configurations, or of boundary regions between melted and structured phases (Figs 4h  and 4l).

Discussion
The mechanics of DNA under tension are affected by various factors including the counterions (Vlassakis et al. 2008), sequence (Rief et al. 1999) and temperature (Fu et al. 2010). Molecular combing results show that DNA in its stretched form is not denatured, with a double-helix structure which is characterised by a diameter of 1·2 nm (Maaloum et al. 2011). X-ray diffraction of stretched cross-linked films of a mixed sequence of DNA show gaps of ∼8 Å (André et al. 2008), nearly the same size as seen in the DNA-RecA complex. These experimental results suggest the existence of a stretched form of solution DNA with preserved base-pair stacking.
In order to relate the physics of DNA stretching with its function in storing and copying information, we have estimated the sequence dependence of the free energy cost in water at moderate ionic strength to unstack a given codon from its neighbours at the codon boundaries, leaving the codon itself intact as a triplet of stacked base pairs. This initial calculation reveals that although the aqueous solution environment does not strongly drive triplet partitioning, that a distinct hierarchy of triplet formation energies exists with respect to sequence features. The triplet formation energy estimates show that sequences coding for 'stage 1' amino acids hypothesised to have appeared early in evolution (plus arginine) are more likely to partition into triplets at the codon boundaries than to partition in any other pattern when under tension. In order to investigate this phenomenon, we carried out pulling simulations of DNA duplexes encoding stage 1 amino acids, in the presence of Fig. 2. The 'primordial' sequence partitions under tension predominantly at the CG steps, forming triplets (a), with Watson-Crick hydrogen bonding and planar base stacking preserved subject to some thermal disorder (a, b). Triplets are stabilised by one or two arginines (cyan) intercalating the stretched base steps (b, c) with non-specific binding that tends to place the charged end of the side chain close to the phosphate, and partially or entirely excludes water from between the bases; (c) is a zoomed and rotated view of the highlighted cavity in (b).
arginine and also of ethidium bromide, as well as control simulations using low-entropy sequences, and also tested aqueous conditions with monatomic salt only.
In order to observe substantial triplet disproportionation, both a bulky organic cation (i.e. arginine or ethidium) and a sequence selected from codons yielding phase 1 amino acids was required, with the combination of these two factors operating in a non-additive way to produce a solution structure of stacked base-pair triplets. Overstretching the Σ-duplex led to formation of interdigitated zipper DNA, while stretching without cofactors or appropriate sequence led to disordered but not fully denatured structure consistent with experiments (Balaeff et al. 2011;Bosaeus et al. 2017) and simulations (Konrad & Bolonick, 1996).
Ikehara et al have shown that codons matching the pattern GNC probably constituted the original, minimal functional For the low-entropy sequence G 12 C 12 ·C 12 G 12 , average inclination remains flat up to extension of 1·5, but it experiences a sudden change after the extension passes 1·5 (a, c). In the presence of EtBr inclination increases smoothly after the extension point of 1·5 and reaches the second flat region of extension beyond 1·6 (e). genetic code. These authors argue based on multiple strands of reasoning that the amino acids GADV, translated from these codons, constitute the unique minimal adequate set to generate functional globular proteins (Ikehara et al. 2002). We argue that it is no coincidence that the same GNC codons are those which drive maximal triplet disproportionation, allowing recombination and possibly protein synthesis to operate without the sophisticated enzymatic machinery which exists today. We hypothesise that a bootstrap process took place, with crude triplet disproportionation facilitating recombination, driving-accelerated evolution and leading to more sophisticated protein-DNA interactions, in turn allowing expansion in further stages of the genetic code.
We attempted to reductively study the factors driving Σ formation of DNA, which we deem to be force (applied via filament formation in the case of RecA, but capable of being applied in many ways), sequence content and the presence of cationic amino acids (as part of a structured protein today, possibly not so in early biology). We have found that these factors operate collectively to drive Σ formation, in competition with tilting or melting of the DNA base pairs. We do not rule out Σ DNA as a long-lasting stable collective structure of DNA while under tension in solution; however, we are also not able to strongly confirm this as the microsecond timescale accessed is considerably shorter than the longest (1 s) equilibration timescale of the system.

Speculative box
We speculate that ornithine (instead of arginine) may play the role of triplet promoter in some organisms, or have done so earlier in the evolutionary process. The usurpation of ornithine codons to instead signify arginine may have led to a jump in the efficiency of recombination and an evolutionary explosion. Fig. 4. Base-pair steps (four bases) were classified by local conformation as β: base-paired and stacked, μ: melted, ζ: zipper, as planar with broken stacking (σ) or as τ : tilted. The left panels (a, c-k) show the three major states of the DNA, with a melting transition over extensions 1·2-1·6, followed (in the absence of intercalator) by a hyper-stretched zipper conformation. The right panels (b, d-l) show the incidence of states (σ, τ) in which the rise exceeds 5·6 Å, with preserved Watson-Crick hydrogen bonding. In systems with intercalator and a triplet coding sequence (h, l); steps at the codon boundary (p3) have an enhanced proportion of σ states, peaking in the extension range 1·4-1·5.

Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033583517000130. ANDRÉ  Note that the initials β, τ, σ, ζ, μ do not refer to phases (collective structures) but local conformations. A step labelled as 'β' would, for example, be consistent with the A, B, C, Z, or Σ phases, all of which include base stacking and Watson-Crick hydrogen bonding.