Principal Investigators:
Prof. Dr. Cecilia Clementi
Project Manager:
Andrea Guljas
Additional Affiliation:
International Max Planck Research School (Biology and Computation)
HPC Platform used:
NHR@ZIB: Lise GPU cluster
Project ID:
bep00118
Date published:
Researchers:
Dr. Nicholas Charron, Dr. Félix Musil, Klara Bonneau, Aldo S. Pasos-Trejo, Yaoyi Chen, Jacopo Venturin
Introduction:
Computational tools such as molecular dynamics (MD) have revolutionized the way we study biomolecules; however, they are severely limited by the computational cost of running simulations on biological time and length scales. Various coarse-grained (CG) models have been developed that rely on simpler representations of molecular systems than atomistic MD. While these models are difficult to configure using physical intuition, we have shown that state-of-the-art machine learning methods make it possible to design accurate and efficient CG models that correctly reproduce protein dynamics. By enhancing both our training dataset and network architecture, we hope to produce a “universal” CG model to study biological systems.
Body:

For most of the latter half of the 20th century, the “structure-function paradigm” was the foundational concept of structural biology. This paradigm postulates that a protein’s biochemical role is a direct consequence of its three-dimensional structure [1]; however, we now understand that protein dynamics and interactions, rather than static structure alone, are in large part responsible for nearly all biological functions [2]. Despite these conceptual advances, standard scientific tools fall short of accurately characterizing protein motions on biologically relevant time and length scales. While experimental methods can often capture snapshots of a protein’s structure in its most stable states, they are unable to resolve the essential transitions between these states, which provide insight into the protein’s functional mechanisms. Similarly, despite their breakthrough success in protein structure prediction, AI models such as AlphaFold [3] do not provide information about protein dynamics and interactions.
 

To bridge this gap, molecular dynamics (MD) techniques have become standard tools in the study of biomolecules. At atomistic resolution, MD methods can accurately characterize protein motions, large-scale conformational changes, and interactions with other proteins. They do so by computing the interactions between atoms and determining the forces acting on each atom [4]. Yet, while modern MD simulations yield accurate results, they are extremely computationally expensive and therefore cannot be used to model large protein systems on long timescales. Often, though, not every atom is essential in determining the long-timescale properties of proteins [5,6]. Based on this idea, many attempts have been made over the past few decades to design coarse-grained (CG) models. In such models, groups of atoms are collectively represented by CG “beads”; because only a small fraction of the original degrees of freedom is retained, these models are often several orders of magnitude faster in simulation. Unlike atomistic interactions, however, the interactions between CG beads are difficult to model using physical intuition. Thus, while various CG models have been developed to study the dynamics of specific proteins [7,8], a reliable, general-purpose CG model for the efficient simulation of large biomolecules is still missing.
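To make the idea of CG “beads” concrete, the following is a minimal sketch of one possible mapping from atoms to beads. It assumes a centre-of-mass mapping, which is only one of several common choices and is not stated to be the mapping used in this project; the function name `coarse_grain`, the toy coordinates, the masses, and the grouping are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def coarse_grain(
    positions: Sequence[Vec3],
    masses: Sequence[float],
    groups: Sequence[Sequence[int]],
) -> List[Vec3]:
    """Map atomistic coordinates to CG bead coordinates.

    Assumption: each bead sits at the centre of mass of the atoms
    assigned to it. Other mappings (e.g. keeping one atom per group)
    are equally valid choices.
    """
    beads: List[Vec3] = []
    for group in groups:
        total_mass = sum(masses[i] for i in group)
        bead = tuple(
            sum(masses[i] * positions[i][axis] for i in group) / total_mass
            for axis in range(3)
        )
        beads.append(bead)
    return beads


# Hypothetical toy system: four atoms grouped into two beads.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 4.0, 0.0)]
masses = [12.0, 12.0, 1.0, 1.0]
groups = [[0, 1], [2, 3]]

beads = coarse_grain(positions, masses, groups)
# beads == [(0.5, 0.0, 0.0), (0.0, 3.0, 0.0)]
```

In this toy case the simulation would track two particles instead of four. The speed-up described above comes from the same reduction applied to systems with many thousands of atoms.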
 

Considering these limitations, we are developing accurate and transferable CG models to simulate large molecular systems on long timescales. We use state-of-the-art machine-learning tools such as graph neural networks, together with a large set of atomistic simulations, to train models that learn the energetics from atomic interactions and thus reproduce the correct dynamics of the biomolecules of interest. We have shown as a proof of concept that expressing the CG energy with a suitable neural network enables the design of fast and accurate CG models [9-11]. We have also shown that a “universal” protein model can be learned from MD simulations of many proteins, and demonstrated that such a model can correctly simulate proteins outside of the training dataset [12]. We are now expanding this project on multiple fronts, by enhancing our training datasets and neural network architecture, with the expectation that these approaches will produce transferable CG models with practical biomedical applications.
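The text does not spell out the training objective. One widely used option for learning a CG energy from atomistic data is force matching, in which the parameters of the CG energy are chosen so that the forces it predicts agree, in a least-squares sense, with reference forces from the atomistic simulations. The sketch below illustrates that idea under strong simplifying assumptions: a single harmonic term U(r) = 0.5 * k * (r - r0)^2 stands in for the neural network, the reference data are synthetic, and all function names are hypothetical. Because the harmonic force is linear in its parameters, the least-squares minimum can be found in closed form here; a neural network would instead be trained by gradient-based optimization.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (bead-bead distance r, reference force f)


def cg_force(r: float, k: float, r0: float) -> float:
    """Force from the toy CG energy U(r) = 0.5 * k * (r - r0)**2."""
    return -k * (r - r0)


def force_matching_loss(samples: Sequence[Sample], k: float, r0: float) -> float:
    """Mean squared difference between CG forces and reference forces."""
    return sum((cg_force(r, k, r0) - f) ** 2 for r, f in samples) / len(samples)


def fit_force_matching(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Minimise the force-matching loss for the harmonic toy model.

    The model force is F(r) = -k * r + k * r0, i.e. a straight line with
    slope -k and intercept k * r0, so ordinary least squares gives the
    exact minimiser.
    """
    n = len(samples)
    r_mean = sum(r for r, _ in samples) / n
    f_mean = sum(f for _, f in samples) / n
    cov = sum((r - r_mean) * (f - f_mean) for r, f in samples)
    var = sum((r - r_mean) ** 2 for r, _ in samples)
    slope = cov / var
    intercept = f_mean - slope * r_mean
    k = -slope
    r0 = intercept / k
    return k, r0


# Synthetic reference data generated with k = 2.0 and r0 = 1.5 (hypothetical).
reference = [(1.0, 1.0), (1.25, 0.5), (1.5, 0.0), (1.75, -0.5), (2.0, -1.0)]

k_fit, r0_fit = fit_force_matching(reference)
loss = force_matching_loss(reference, k_fit, r0_fit)
```

With noise-free synthetic data the fit recovers the generating parameters and the loss vanishes. With forces from real atomistic simulations the loss has a non-zero floor, because many atomistic configurations map onto the same CG configuration.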
 

[1] C. B. Anfinsen. Science 181 (1973), pp. 223–230.
[2] H. Frauenfelder et al. Science 254 (1991), pp. 1598–1603.
[3] J. Jumper et al. Nature 596.7873 (2021), pp. 583–589.
[4] R. B. Best. Biomolecular Simulations. Springer New York, 2019, pp. 3–19.
[5] J. N. Onuchic et al. Annu. Rev. Phys. Chem. 48.1 (1997), pp. 545–600.
[6] F. Noé and C. Clementi. Curr. Opin. Struct. Biol. 43 (2017), pp. 141–147.
[7] C. Clementi. Curr. Opin. Struct. Biol. 18.1 (2008), pp. 10–15.
[8] W. G. Noid. J. Chem. Phys. 139.9 (2013).
[9] J. Wang et al. ACS Cent. Sci. 5.5 (2019), pp. 755–767.
[10] F. Noé et al. Annu. Rev. Phys. Chem. 71.1 (2020), pp. 361–390.
[11] B. E. Husic et al. J. Chem. Phys. 153.19 (2020), p. 194101.
[12] N. E. Charron et al. arXiv:2310.18278 (2023).
[13] M. Majewski et al. Nat. Commun. 14 (2023), p. 5739. https://doi.org/10.1038/
[14] DFG CRC1114: Scaling Cascades in Complex Systems, https://www.mi.fu-berlin.de/en/sfb1114/index.html
 

Institute / Institutes:
Freie Universität Berlin
Affiliation:
Freie Universität Berlin
Image:
Figure 1