
Chasing collective variables using temporal data-driven strategies

Published online by Cambridge University Press:  06 January 2023

Haochuan Chen
Affiliation:
Laboratoire International Associé Centre National de la Recherche Scientifique et University of Illinois at Urbana-Champaign, Unité Mixte de Recherche n°7019, Université de Lorraine, 54506 Vandœuvre-lès-Nancy, France
Christophe Chipot*
Affiliation:
Laboratoire International Associé Centre National de la Recherche Scientifique et University of Illinois at Urbana-Champaign, Unité Mixte de Recherche n°7019, Université de Lorraine, 54506 Vandœuvre-lès-Nancy, France Theoretical and Computational Biophysics Group, Beckman Institute, and Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA
*
*Author for correspondence: Christophe Chipot, E-mail: chipot@illinois.edu

Abstract

The convergence of free-energy calculations based on importance sampling depends heavily on the choice of collective variables (CVs), which, in principle, should include the slow degrees of freedom of the biological processes to be investigated. Autoencoders (AEs), as emerging data-driven dimensionality reduction tools, have been utilised for discovering CVs. AEs, however, are often treated as black boxes, and what AEs actually encode during training, and whether the latent variables from the encoder are suitable as CVs for further free-energy calculations, remain open questions. In this contribution, we review AEs and their time-series-based variants, including time-lagged AEs (TAEs) and modified TAEs, as well as the closely related variational approach for Markov processes networks (VAMPnets). We then show through numerical examples that AEs learn the high-variance modes instead of the slow modes. In stark contrast, time-series-based models are able to capture the slow modes. Moreover, both modified TAEs with extensions from slow feature analysis (SFA) and state-free reversible VAMPnets (SRVs) can yield orthogonal multidimensional CVs. As an illustration, we employ SRVs to discover the CVs of the isomerizations of N-acetyl-N′-methylalanylamide (NANMA) and trialanine by iterative learning with trajectories from biased simulations. Last, through numerical experiments with anisotropic diffusion, we investigate the potential relationship between time-series-based models and committor probabilities.
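The operational difference between an AE and its time-lagged variant can be made concrete with a minimal sketch (illustrative only, not the authors' code): an AE is trained to reconstruct the current frame x(t), whereas a TAE is trained to predict the time-lagged frame x(t + τ) from x(t), which biases the learned latent variable toward slowly decorrelating modes rather than merely high-variance ones. The helper below, with hypothetical names, shows how the time-lagged training pairs are assembled from a trajectory.

```python
# Illustrative sketch of TAE data preparation (hypothetical helper, not the
# authors' code). A standard AE trains on pairs (x_t, x_t); a time-lagged AE
# trains on pairs (x_t, x_{t+tau}), so the bottleneck must retain whatever
# information survives the lag time tau, i.e., the slow modes.

def time_lagged_pairs(trajectory, tau):
    """Return (input, target) pairs (x_t, x_{t+tau}) from a trajectory."""
    if tau < 1 or tau >= len(trajectory):
        raise ValueError("lag must satisfy 1 <= tau < len(trajectory)")
    return [(trajectory[t], trajectory[t + tau])
            for t in range(len(trajectory) - tau)]

# Toy two-dimensional trajectory of 10 frames.
traj = [[float(t), float(t % 3)] for t in range(10)]
pairs = time_lagged_pairs(traj, tau=2)
assert len(pairs) == 8
assert pairs[0] == ([0.0, 0.0], [2.0, 2.0])
```

Setting tau=0 would recover the ordinary AE reconstruction target, which is why the lag time is the single hyperparameter that separates the two training objectives.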

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Fig. 1. (a) Schematic representation of a neural network used in an autoencoder (AE) or in a time-lagged autoencoder (TAE). (b) Schematic representation of a Siamese neural network used in modified TAEs, in state-free reversible VAMPnets (SRVs), and in slow feature analysis (SFA). (c) Calculation of the reweighting factor $ \Delta {t}_m^{\prime } $ in Eq. (11). (d) Workflow of data-driven collective-variable (CV) discovery from biased molecular dynamics (MD) simulations employed in this work.


Fig. 2. Potential energy surfaces $ V(X,Y) $ with (a) α = 1.0 and (e) α = 10.0. Time evolution of X and Y when (b) α = 1.0 and (f) α = 10.0. Projections of the encoded variable $ \xi $ on X and Y from AEs trained on trajectories with (c) α = 1.0 and (g) α = 10.0. Projections of the encoded variable $ \xi $ on X and Y from TAEs trained on trajectories with (d) α = 1.0 and (h) α = 10.0.


Fig. 3. Projections on X and Y of the one-dimensional CV $ \xi $ learned from unbiased trajectories with $ \alpha =10.0 $ by (a) the TAE, (b) the modified TAE, (c) SFA and (d) SRVs. Projections on X and Y of the two-dimensional CVs ($ {\xi}_1 $, $ {\xi}_2 $) learned from unbiased trajectories with $ \alpha =10.0 $ by (e,i) the TAE, (f,j) the modified TAE, (g,k) SFA and (h,l) SRVs. Projections of the two-dimensional CVs ($ {\xi}_1 $, $ {\xi}_2 $) learned by SRVs from an unbiased trajectory with $ \alpha =10.0 $ (m,n), from an egABF-biased trajectory reweighted by Eq. (11) (o,p), and from an egABF-biased trajectory reweighted by Eq. (12) (q,r).


Fig. 4. (a) Structure of NANMA and the two dihedral angles $ \phi $ and $ \psi $ as candidate CVs. (b) Reference free-energy landscape of the NANMA isomerization along $ \phi $ and $ \psi $. The grey dots show the minimum free-energy pathway (MFEP) from C7eq to C7ax via C5. The dominant free-energy barrier on the MFEP is marked by the red circle. (c) The PMF along the learned CV $ {\xi}_1 $. The three basins correspond to C7eq, C5 and C7ax, respectively. (d) The learned CV $ {\xi}_1 $ projected on $ \phi $ and $ \psi $. (e) Structure of trialanine, the candidate CVs ($ {\phi}_1 $, $ {\psi}_1 $, $ {\phi}_2 $, $ {\psi}_2 $, $ {\phi}_3 $, $ {\psi}_3 $), and the eight basins (A, B, M1, M2, M3, M4, M5, M6) found in the free-energy landscapes along the reference CVs in (f) and the learned CVs in (g). The basins A, B, M1, M2, M3, M4, M5 and M6 are marked in blue, red, yellow, green, orange, white, cyan and pink, respectively. (f) Reference 3D free-energy landscape along ($ {\phi}_1 $, $ {\phi}_2 $, $ {\phi}_3 $) and the corresponding MFEP (black dots). (g) 3D free-energy landscape along the learned CVs ($ {\xi}_1 $, $ {\xi}_2 $, $ {\xi}_3 $) and the corresponding MFEP (black dots). (h) MFEPs found from the reference free-energy landscape along ($ {\phi}_1 $, $ {\phi}_2 $, $ {\phi}_3 $) (red), the free-energy landscape along the learned CVs ($ {\xi}_1 $, $ {\xi}_2 $, $ {\xi}_3 $) (blue), and the free-energy landscape reweighted from ($ {\xi}_1 $, $ {\xi}_2 $, $ {\xi}_3 $) to ($ {\phi}_1 $, $ {\phi}_2 $, $ {\phi}_3 $) (green).


Fig. 5. (a) Berezhkovskii–Szabo potential energy surface (Berezhkovskii and Szabo, 2005). The latent variables projected onto (X, Y) learned by AEs, TAEs and modified TAEs under three diffusivity conditions: (b–d) $ {D}_x/{D}_y=0.1 $, (e–g) $ {D}_x/{D}_y=1.0 $ and (h–j) $ {D}_x/{D}_y=10.0 $. The AEs and TAEs are trained with a 2-10-1-10-2 neural network architecture with linear activation functions in all layers. The modified TAEs are trained with a 2-20-20-1 neural network, with the hyperbolic tangent as the activation function for the two hidden layers of 20 computational units. The time lag for the TAEs and modified TAEs is 10 steps.
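Because all activations in the 2-10-1-10-2 architecture above are linear, the AE's one-dimensional latent variable spans the same direction as the first principal component of the input data, a classical property of linear autoencoders with a reconstruction loss. The pure-Python sketch below (illustrative data and names only, not the authors' code) shows why such a model locks onto the high-variance coordinate regardless of the diffusivity ratio, which controls which mode is actually slow.

```python
import math
import random

def leading_pc(data, iters=200):
    """Leading principal component of 2-D samples via power iteration.
    A linear AE with a one-dimensional bottleneck converges to this
    direction, since minimizing the reconstruction error is equivalent
    to projecting onto the principal subspace."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Entries of the 2x2 sample covariance matrix.
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

random.seed(0)
# Anisotropic toy data: X fluctuates 10x more than Y, so the linear AE's
# latent variable aligns with X whether or not X is the slow coordinate.
data = [(random.gauss(0.0, 10.0), random.gauss(0.0, 1.0))
        for _ in range(2000)]
v = leading_pc(data)
assert abs(v[0]) > 0.99  # latent direction is essentially pure X
```

The time-lagged objectives avoid this failure mode because they rank directions by autocorrelation at the lag time rather than by variance.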

Supplementary material: PDF

Chen and Chipot supplementary material (PDF, 495.4 KB)

Review: Chasing collective variables using temporal data-driven strategies — R0/PR1

Conflict of interest statement

The reviewer has collaborated with the authors on other projects.

Comments

Comments to Author: The manuscript by Chen and Chipot compares a number of existing machine-learning methods that aim at discovering collective variables for free-energy calculations. Specifically, this work focuses on autoencoders (AEs) as well as time-series-based variants, including time-lagged AEs (TAEs) and modified TAEs. The manuscript is well written, although too technical, and provides a short and clear description of the methods being compared in a few toy models. I find the manuscript of interest to the community owing to the fair comparison it provides between AE, TAE, and modified TAE. However, there are still some points that need to be addressed:

1) This work only deals with very simple toy models. I think the authors could at least discuss or speculate on the performance of these methods when larger systems, such as proteins undergoing large-scale conformational changes, are studied. For instance, is there a difference in the complexity of these algorithms as the number of DOFs increases?

2) The last part of the Results section (Potential connections between TAE, modified TAE, and committor) is somewhat too short and unclear. It is an interesting section, but the authors need to make the connection to the rest of the manuscript clearer and potentially expand on it. The committor function is suggested as “the reaction coordinate” in the transition-path-theory literature. The modified TAE (like the other algorithms discussed) tries to identify the most relevant collective variable as well. There is also a similarity between Relations (4) and (15). However, the manuscript does not attempt to go beyond noticing the similarity and jumps directly to a numerical example. I think some theoretical work, or at least a deeper discussion, is missing here before this jump.

More minor points:

1) Fig. 5: Why is the color flipped in H as opposed to B and E?

2) Page 2: “the variables that can maximize the explained variances do not always necessarily coincide with the DOFs of the process of interest” (I feel adding “important” before DOFs or replacing “of” with “relevant to” makes this clearer)

3) Page 2: “observationraises” (typo)

4) Eq. 1: Is ∆t the timestep? Is it the same timestep used in the MD simulations? If not, please use a different term.

5) Page 6: the line after Eq. 5 states “C00, C01 and C11 are defined …” (Are these supposed to be “C(t,t), C(t,t+τ), and C(t+τ,t+τ)“?)

6) Page 10, Relation (13): There are some inconsistencies with the units. Since the potential is expressed in kcal/mol, this should somehow come out of Eq. (13), but it does not.

7) Page 10: “start contrast” must be “stark contrast”

Review: Chasing collective variables using temporal data-driven strategies — R0/PR2

Conflict of interest statement

NA.

Comments

Comments to Author: In this work, the authors review and examine deep-learning-based methods for collective-variable (CV) discovery, including AE, TAE, modified TAE, SRV, and SFA. Experiments unveil that the AE learns high-variance modes instead of slow modes, and that the TAE learns a mixture of the two. The modified TAE, SFA, and SRV appropriately learn the slow modes. Further experiments on NANMA and trialanine show that the reweighting schemes enable deep-learning models to learn CVs from biased trajectories. Overall, this work is well motivated and easy to follow. It also includes convincing experiments to evaluate the different deep-learning-based methods for CV discovery. I only have the following few mild comments before the paper is published.

1. How are the architectures of the deep neural networks determined? For instance, NANMA uses 4-12-10-8-6-4-2, which is quite a few layers considering the small number of neurons per layer. Also, why is tanh used as the activation instead of the more widely used ReLU-like functions?

2. On page 15, the authors mention “The deviation between the blue and red curves may stem from discretization issues and difficulty to enhance sampling in the three-dimensional space.” Could the authors elaborate on what the discretization issues are?

3. The authors include convincing experiments on a triple-well potential, NANMA, and trialanine. However, these are still low-dimensional problems compared with molecular simulations in practice. Can the authors comment on the generalization of the reviewed methods to larger systems?

4. How will the deep-learning models (e.g., TAE, modified TAE, etc.) perform if the dimension of the latent space is smaller than the actual number of CVs? Can the models learn the most dominant variables automatically?

Recommendation: Chasing collective variables using temporal data-driven strategies — R0/PR3

Comments


Recommendation: Chasing collective variables using temporal data-driven strategies — R1/PR4

Comments

No accompanying comment.