
Computational methods for binding site prediction on macromolecules

Published online by Cambridge University Press:  12 March 2025

Igor Kozlovskii
Affiliation:
Constructor Knowledge Labs, Bremen, Germany; School of Science, Constructor University Bremen gGmbH, Bremen, Germany; Tetra D AG, Schaffhausen, Switzerland
Petr Popov*
Affiliation:
Constructor Knowledge Labs, Bremen, Germany; School of Science, Constructor University Bremen gGmbH, Bremen, Germany; Tetra D AG, Schaffhausen, Switzerland
*Corresponding author: Petr Popov; Email: ppopov@constructor.university

Abstract

Binding sites are key components of biomolecular structures, such as proteins and RNAs, serving as hubs for interactions with other molecules. Identification of the binding sites in macromolecules is essential for structure-based molecular and drug design. However, experimental methods for binding site identification are resource-intensive and time-consuming. In contrast, computational methods enable large-scale binding site identification, structure flexibility analysis, and assessment of intermolecular interactions within the binding sites. In this review, we describe recent advances in binding site identification using machine learning methods; we classify the approaches by how they encode information about the macromolecule: its sequence, structure, template knowledge, geometry, and energetic characteristics. Importantly, we categorize the methods based on the type of the interacting molecule, namely, small molecules, peptides, and ions. Finally, we describe perspectives, limitations, and challenges of the state-of-the-art methods with an emphasis on deep learning-based approaches. These computational approaches aim to advance drug discovery by expanding the druggable genome through the identification of novel binding sites in pharmacological targets and by facilitating structure-based hit identification and lead optimization.

Information

Type
Review
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. List of methods for prediction of protein–small molecule binding sites

Figure 1. Schematic presentation of the sequence-based methods. The top part demonstrates the pipeline for a template-based approach: the target sequence is aligned against a database of template sequences with known binding residues, and the output binding residues are defined by the consensus score from the alignment. The bottom part demonstrates the pipeline for ML or DL methods. First, the feature vectors (e.g., sequence or physicochemical properties) or the embeddings (e.g., from language models) are calculated. Then, a method either moves a window across the sequence and feeds the feature vectors for each position into an ML or DL model that outputs a binding score for that position, or uses a larger DL model to obtain the binding scores for all positions simultaneously.
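The moving-window step in the bottom part of Figure 1 can be sketched as follows. This is a minimal illustration, not the pipeline of any reviewed method: the single per-residue feature (a Kyte-Doolittle-style hydropathy value), the zero padding at the sequence ends, and the names `window_features` and `score_positions` are all assumptions made for the example, and `model` stands in for any trained ML or DL scorer.

```python
from typing import Callable, List

# Assumed per-residue feature: Kyte-Doolittle-style hydropathy values.
# Real methods typically use richer features or language-model embeddings.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_features(sequence: str, half_width: int, pad: float = 0.0) -> List[List[float]]:
    """Return one feature window of length 2*half_width+1 per residue."""
    feats = [HYDROPATHY.get(aa, pad) for aa in sequence]
    # Pad both ends so that terminal residues also get a full window.
    padded = [pad] * half_width + feats + [pad] * half_width
    return [padded[i:i + 2 * half_width + 1] for i in range(len(sequence))]


def score_positions(sequence: str, half_width: int,
                    model: Callable[[List[float]], float]) -> List[float]:
    """Apply a per-window scoring model (hypothetical) to every position."""
    return [model(window) for window in window_features(sequence, half_width)]


def mean_model(window: List[float]) -> float:
    """Placeholder scorer: the window mean. A trained classifier would go here."""
    return sum(window) / len(window)
```

Calling `score_positions("AIG", 1, mean_model)` returns one score per residue; swapping `mean_model` for a trained classifier gives the per-position binding scores described in the caption.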

Figure 2. Schematic presentation of the structure template-based methods. In the first stage, the target is screened against a database of template structures with known binding sites. In the second stage, the output prediction is obtained based on the most similar template structures with respect to the target.

Figure 3. Schematic overview of geometric methods for binding site detection. (a) Generation of an occupancy grid and calculation, for each empty grid point, of the fraction of directions enclosed by the target macromolecule (used, for example, in POCKET (Levitt and Banaszak, 1992), LIGSITE (Hendlich et al., 1997), PocketPicker (Weisel et al., 2007), SiteMap (Halgren, 2009), and CAVIAR (Marchand et al., 2021)). (b) Rolling of spheres with two different radii around the target macromolecule. The spheres with the larger radius remove the smaller ones. The remaining small spheres are clustered to get the final predictions (used, for example, in the method of Masuya and Doi (1995), APROPOS (Peters et al., 1996), PHECOM (Kawabata and Go, 2007), GHECOM (Kawabata, 2010), and POCASA (Yu et al., 2010)). (c) The addition-removal algorithm used in Delaney (1992), Kleywegt and Jones (1994), and Brady and Stouten (2000). Each step consists of adding and removing surface-exposed points until convergence. The target macromolecule is represented with a lilac surface, and grid points and probe spheres are shown with circles.
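The enclosure calculation in panel (a) can be illustrated with a simplified sketch. It is not the algorithm of POCKET, LIGSITE, or any other cited tool: the integer occupancy grid, the 26 ray directions, the step limit, and the names `buriedness` and `pocket_candidates` are assumptions chosen for brevity.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int, int]

# Assumed direction set: the 26 axis and diagonal neighbours of a grid cell.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def buriedness(occupied: Set[Cell], point: Cell, max_steps: int) -> float:
    """Fraction of directions along which a ray from `point` hits an occupied cell."""
    hits = 0
    for dx, dy, dz in DIRECTIONS:
        x, y, z = point
        for _ in range(max_steps):
            x, y, z = x + dx, y + dy, z + dz
            if (x, y, z) in occupied:
                hits += 1
                break
    return hits / len(DIRECTIONS)


def pocket_candidates(occupied: Set[Cell], points: Iterable[Cell],
                      max_steps: int, min_fraction: float) -> List[Cell]:
    """Keep empty grid points whose enclosed fraction reaches `min_fraction`."""
    return [p for p in points
            if p not in occupied and buriedness(occupied, p, max_steps) >= min_fraction]
```

A point inside a closed shell gets a fraction of 1.0, a point just above a flat surface gets 9/26 with this direction set, and a point far from the structure gets 0.0, so a threshold on the fraction separates cavity-like points from exposed ones.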

Figure 4. Schematic presentation of the energy probe-based methods. (a) Different probes (shown as red, blue, and green circles) are placed on a 3D grid around the target macromolecule (shown as a lilac surface) and their interaction energies with the target’s atoms are calculated. (b) The probes corresponding to the high-energy values are filtered out. (c) The remaining probes are clustered. (d) The filtering procedure is applied to remove non-relevant clusters.
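Steps (a) to (d) of Figure 4 can be sketched in a few functions. This is a toy version under stated assumptions: a single probe type, a 12-6 Lennard-Jones term with illustrative `epsilon` and `sigma` values in arbitrary units, single-linkage clustering with a distance cutoff, and cluster size as the only relevance filter. The reviewed methods use several probe types and more elaborate energy functions and filters; the names `lj_energy`, `cluster`, and `find_sites` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def lj_energy(probe: Point, atoms: Sequence[Point],
              epsilon: float = 0.2, sigma: float = 3.5) -> float:
    """(a) Sum of 12-6 Lennard-Jones terms between one probe and all atoms."""
    total = 0.0
    for atom in atoms:
        r = math.dist(probe, atom)
        if r < 1e-6:
            return math.inf  # probe sits on an atom: treat as a clash
        sr6 = (sigma / r) ** 6
        total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total


def cluster(points: Sequence[Point], link: float) -> List[List[Point]]:
    """(c) Single-linkage clustering: points closer than `link` share a cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= link]
            unvisited.difference_update(near)
            members.extend(near)
            frontier.extend(near)
        clusters.append([points[i] for i in sorted(members)])
    return clusters


def find_sites(grid_points: Sequence[Point], atoms: Sequence[Point],
               energy_cutoff: float, link: float, min_size: int) -> List[List[Point]]:
    """Run steps (a)-(d): score, drop high-energy probes, cluster, drop small clusters."""
    favourable = [p for p in grid_points if lj_energy(p, atoms) <= energy_cutoff]  # (b)
    return [c for c in cluster(favourable, link) if len(c) >= min_size]  # (d)
```

The cutoffs are free parameters of the sketch; in practice they would be tuned per probe type and force field.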

Figure 5. Schematic presentation of the machine learning-based methods. On the top, the target structure is represented as a surface, and feature vectors are calculated for the surface points. On the bottom, feature vectors are calculated for the target’s residues or atoms. Then, an ML classifier predicts the binding scores for the points, residues, or atoms, based on the input feature vectors. Finally, the output predictions are filtered by a score threshold and clustered.

Figure 6. Schematic presentation of the DL-based methods. Most of the methods utilize graph-based or voxel grid representations of the target macromolecular structure. Then, they sample either sub-graphs or sub-grids around the structure and classify their centers as belonging to the binding site or not. Alternatively, they use segmentation models to operate with the full graph or grid.
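The sub-grid sampling described in Figure 6 can be sketched for the voxel representation. The example assumes a single occupancy channel (atom counts per voxel), a cubic sub-grid, and a fixed stride; the names `voxelize`, `subgrid`, and `sample_subgrids` are hypothetical, and the classifier that would label each sub-grid center is left out. Actual DL methods use multi-channel grids or graphs and trained networks.

```python
import math
from typing import Dict, Iterator, List, Sequence, Tuple

Voxel = Tuple[int, int, int]
Cube = List[List[List[int]]]


def voxelize(atoms: Sequence[Tuple[float, float, float]],
             spacing: float = 1.0) -> Dict[Voxel, int]:
    """Count atoms per voxel; a single occupancy channel (assumption)."""
    grid: Dict[Voxel, int] = {}
    for x, y, z in atoms:
        key = (math.floor(x / spacing), math.floor(y / spacing), math.floor(z / spacing))
        grid[key] = grid.get(key, 0) + 1
    return grid


def subgrid(grid: Dict[Voxel, int], center: Voxel, half: int) -> Cube:
    """Extract a (2*half+1)^3 cube around `center`; empty voxels read as 0."""
    cx, cy, cz = center
    offsets = range(-half, half + 1)
    return [[[grid.get((cx + i, cy + j, cz + k), 0) for k in offsets]
             for j in offsets] for i in offsets]


def sample_subgrids(grid: Dict[Voxel, int], half: int,
                    stride: int = 1) -> Iterator[Tuple[Voxel, Cube]]:
    """Slide the cube over the bounding box of the structure."""
    if not grid:
        return
    xs, ys, zs = zip(*grid)
    for cx in range(min(xs), max(xs) + 1, stride):
        for cy in range(min(ys), max(ys) + 1, stride):
            for cz in range(min(zs), max(zs) + 1, stride):
                yield (cx, cy, cz), subgrid(grid, (cx, cy, cz), half)
```

Each yielded pair is a candidate center together with its local environment; a classification model would score the center, whereas a segmentation model would instead take the whole grid at once.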

Table 2. List of methods for prediction of protein–peptide binding sites

Table 3. Performance of protein–peptide binding site detection methods on test benchmarks retrieved from Kozlovskii and Popov (2021a), Abdin et al. (2022), and Fang et al. (2023)

Table 4. List of methods for prediction of nucleic acid–small molecule binding sites

Table 5. List of methods for prediction of protein–ion binding sites

Supplementary material: File

Kozlovskii and Popov supplementary material

File 128.9 KB