Erasmus Mundus Joint Master - ChEMoinformatics+

PyMOL: A Powerful Tool for Molecular Visualization and Structural Analysis

ChEMoinformatics+ — 2025-07-06T19:31:53Z

by Michele Brignoli, Track « Chemoinformatics and Physical Chemistry », Milan-Strasbourg, 2025

Molecular visualization is essential for gaining a deep understanding of molecular interactions and properties. Among the various tools available, PyMOL [1] stands out as one of the most versatile and powerful platforms for visualizing molecular structures. It is available in both a free, open-source version and a more feature-rich paid version, which includes access to advanced visualization capabilities. Numerous online tutorials and resources support new users in exploring both basic and advanced functionalities, from generating high-resolution molecular images to creating animations of dynamic molecular interactions.

The software is especially valuable for visualizing binding sites and analyzing biomolecular interactions (Figure 1), making it a crucial asset in rational drug design [2] by highlighting hydrogen bonds, hydrophobic contacts, and other key residues. PyMOL also integrates external tools and plugins, enhancing its functionality for advanced tasks like conformational analysis and molecular dynamics visualization.

With Python scripting, users can streamline workflows, execute complex manipulations, and automate routine tasks. This scripting capability empowers users to create complex visualizations, develop custom plugins, and tailor the software to meet specific research needs. Additionally, an active community of users contributes a wealth of scripts, plugins, and tutorials, offering robust support for both beginners and advanced users alike [3].
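As a minimal sketch of this kind of automation, the Python function below assembles a PyMOL command script (.pml) for a binding-site view similar to Figure 1. The function name, the chain selections, and the output file name are assumptions made for illustration; only standard PyMOL commands (fetch, select, show, distance, set, orient, png) are used.

```python
def build_pml_script(pdb_id, ligand_sel, receptor_sel, image_path):
    """Return the text of a PyMOL command script (.pml) for a binding-site view."""
    commands = [
        f"fetch {pdb_id}",                                  # download the entry from the PDB
        "hide everything",
        "show cartoon",
        f"select ligand, {ligand_sel}",
        # receptor residues with any atom within 4 Å of the ligand
        f"select pocket, byres (({receptor_sel}) within 4 of ligand)",
        "show sticks, ligand or pocket",
        "distance polar_contacts, ligand, pocket, mode=2",  # mode=2 restricts to polar contacts
        "set dash_color, red, polar_contacts",
        "orient ligand",
        f"png {image_path}, dpi=300, ray=1",                # ray-traced image
    ]
    return "\n".join(commands) + "\n"


# Chain identifiers below are placeholders, not taken from the 8VS6 entry.
script = build_pml_script("8VS6", "chain C", "chain A or chain B", "binding_site.png")
print(script)
```

The resulting text can be saved to a file and run in batch mode, for example with `pymol -cq script.pml`, after adapting the selections to the actual chains of the structure.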

The latest versions of PyMOL include enhanced features such as real-time collaboration and integration with machine learning models. With its ongoing development and strong community support, PyMOL is positioned as one of the foundational tools for future innovations in molecular visualization and structural analysis. Its accessibility, precision, and versatility make it an invaluable resource for a wide range of users, from students to professional researchers in structural biology and computational chemistry. Therefore, I think that PyMOL is becoming indispensable in scientific research across bioinformatics and chemoinformatics disciplines.

Figure 1.

A fragment of TGF-β3 within the active site of integrin αVβ8 (PDB: 8VS6). On the left, polar contacts are highlighted with a red dashed line. On the right, the surface view shows polar residues in red and nonpolar residues in green.

Michele Brignoli

References:
[1] The PyMOL Molecular Graphics System, Version 3.0, Schrödinger, LLC.
[2] Shuguang Yuan et al. "Using PyMOL as a platform for computational drug design." Wiley Interdisciplinary Reviews: Computational Molecular Science, 7 (2017). https://doi.org/10.1002/wcms.1298.
[3] Magnus Kjaergaard et al. "A Semester-Long Learning Path Teaching Computational Skills via Molecular Graphics in PyMOL." The Biophysicist (2022). https://doi.org/10.35459/tbp.2022.000219.

Advances in Autonomous Chemical Research

ChEMoinformatics+ — 2025-07-06T19:21:44Z

by: Trung Le, Track « Chemoinformatics and Materials Informatics », Bar Ilan - Strasbourg, 2025

With the rise of artificial intelligence (AI) models such as ChatGPT, DeepSeek, Mistral AI, or DALL-E, AI is gaining importance in many applications, including chemistry. In chemistry, artificial intelligence is developed as a subject of chemoinformatics. Chemoinformatics and automation have already been associated with running complex chemical experiments for screening, synthesis, and other tasks. The paper by Boiko et al. [1] envisions a "CoScientist," an intelligent agent based on multiple large language models (LLMs), to support chemists in designing and running chemical experiments.

The CoScientist browses the internet and relevant documentation, and uses application programming interfaces (APIs) to control robotic devices. The prototype uses a modular architecture (Figure 1). The main module, the “Planner”, orchestrates the actions of software processes (workers) able to search the web (GOOGLE), browse web pages (BROWSE), prototype scripts for controllers (PYTHON) with the help of relevant documentation (DOCUMENTATION), and finally, run the experiment on the hardware (EXPERIMENT).
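The modular layout described above can be illustrated with a small dispatcher. This is only a sketch under strong assumptions: the workers are stubs that return text, and the plan is a fixed list of (command, argument) pairs, whereas in the system described by Boiko et al. the Planner is an LLM that chooses the next command from previous observations. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


# Stub workers: each takes a text argument and returns a text observation.
def google(query: str) -> str:
    return f"search results for '{query}'"


def browse(url: str) -> str:
    return f"text of page {url}"


def documentation(topic: str) -> str:
    return f"documentation excerpt about {topic}"


def python(code: str) -> str:
    return f"checked script ({len(code)} characters)"


def experiment(protocol: str) -> str:
    return f"hardware ran protocol '{protocol}'"


WORKERS: Dict[str, Callable[[str], str]] = {
    "GOOGLE": google,
    "BROWSE": browse,
    "DOCUMENTATION": documentation,
    "PYTHON": python,
    "EXPERIMENT": experiment,
}


def run_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Dispatch each (command, argument) pair to its worker and log the result."""
    log = []
    for command, argument in plan:
        worker = WORKERS.get(command)
        if worker is None:
            raise ValueError(f"unknown command: {command}")
        log.append(f"{command}: {worker(argument)}")
    return log


plan = [
    ("GOOGLE", "Suzuki coupling conditions"),
    ("DOCUMENTATION", "liquid handler API"),
    ("PYTHON", "transfer(volume=10)"),
    ("EXPERIMENT", "suzuki_plate_1"),
]
for entry in run_plan(plan):
    print(entry)
```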

Figure 1.

(a) The Planner agent orchestrates the actions of workers to search the internet, design an experiment, generate the controller scripts, and perform the experiment. (b) A list of tasks successfully achieved with the help of the CoScientist. (c) An illustration of the CoScientist hardware.

Nature 624, 570–578 (2023). https://doi.org/10.1038/s41586-023-06792-0

The prototype has been assembled around a liquid handler and a heater-shaker. It acts autonomously, using data from the internet, performing the necessary calculations, and ultimately writing and running the controller code for the hardware. The system demonstrated "reasoning" capabilities, as it was able to identify and search for missing information and to solve multi-step problems.

Overall, the result presents a promising proof of concept for the future of autonomous experiments. Echoing the words of Derek Lowe, “It's not that machines are going to replace chemists. It's that the chemists who use machines will replace those that don't” [2].

References:
[1] Boiko, D.A., MacKnight, R., Kline, B. et al. Autonomous chemical research with large language models. Nature 624, 570–578 (2023). https://doi.org/10.1038/s41586-023-06792-0
[2] Muratov, E. N., Bajorath, J., Sheridan, R. P., Tetko, I. V., Filimonov, D., Poroikov, V., Oprea, T. I., Baskin, I. I., Varnek, A., Roitberg, A., Isayev, O., Curtarolo, S., Fourches, D., Cohen, Y., Aspuru-Guzik, A., Winkler, D. A., Agrafiotis, D., Cherkasov, A., & Tropsha, A. (2020). QSAR without borders. Chemical Society Reviews, 49(11), 3525–3564. https://doi.org/10.1039/d0cs00098a

Revolutionizing oncology with gene therapy: the role of computational methods

ChEMoinformatics+ — 2025-07-06T19:08:46Z

by: Toma Legrand, Track « Chemoinformatics for Organic Chemistry », Lisbon-Strasbourg, 2025

Cancer's complexity and adaptability make it one of the most challenging diseases to treat. Conventional therapies like radiation and chemotherapy often fail to distinguish between cancerous and healthy cells, resulting in many unwanted side effects.

Several gene therapy approaches have been approved recently as cures for cancers. Behind the scenes, a number of computational approaches, involving chemoinformatics and bioinformatics, have made these successes possible. More accurate, personalized, and successful gene therapies are likely to revolutionize oncology.

A spectacular step toward personalized medicine for cancer treatment is monitoring gene expression in tumors relative to healthy tissues. In my opinion, one such game-changing computational approach is mapping out gene networks (Figure 1). Indeed, atypical gene expression processes are frequently associated with cancers. Gene networks rationalize the interactions between genes as observed in cells. Mapping such networks is a powerful way to identify potential weak gene therapy targets as they appear as anomalies in cancer cells compared to healthy ones.

Figure 1

Example of gene-networks workflow

Nat Protoc 18, 1745–1759 (2023). https://doi.org/10.1038/s41596-022-00797-1

Identification of a gene can lead to the identification of relevant protein targets, such as the proteins encoded by the identified gene. It is then possible to design small molecules targeting these proteins. Their 3D structures are invaluable for this task when known experimentally. If not, they can be predicted by homology modeling or by artificial intelligence models such as AlphaFold [2].

Analogous to gene interactions, protein-protein interaction networks have become an extremely valuable strategy. This is illustrated, for instance, by the MaSIF software [3]: it builds a network connecting two proteins if their shape allows for molecular recognition. This complements the potential to develop personalized anti-cancer drugs disrupting presumably pathogenic protein-protein recognition or protein sequences suitable for gene therapy.

Developing a one-size-fits-all treatment is extremely difficult because tumors within the same person might differ greatly from one another. Computational techniques, on the other hand, are becoming increasingly efficient and enable the examination of large quantities of genetic data from various malignancies, facilitating personalized gene therapy. However, there exists a huge gap between numerical models of a tumor and reality: how can we ensure gene treatments are effectively delivered to the appropriate cells? Are these treatments safe? What are the ethical pitfalls of such therapies?

References:
[1] Rosenthal, S.B., Wright, S.N., Liu, S. et al. “Mapping the common gene networks that underlie related diseases.” Nat Protoc 18, 1745–1759 (2023). https://doi.org/10.1038/s41596-022-00797-1
[2] Jumper, J., Evans, R., Pritzel, A. et al. “Highly accurate protein structure prediction with AlphaFold.” Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
[3] Gainza, P., Sverrisson, F., Monti, F. et al. “Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning.” Nat Methods 17, 184–192 (2020). https://doi.org/10.1038/s41592-019-0666-6

Role of chemoinformatics in the research field of flavor compounds

Laura Belvisi — 2025-01-21T10:36:58Z

by: Pierre-Alexandre Ho, Track «In Silico Drug Design», Strasbourg-Milan-Paris, 2023

The aim of chemoinformatics is to use chemical information to predict the behavior of compounds. Largely used in pharmaceutical research, its scope extends to other domains, including the food industry. For example, flavor ingredients, which are used for many applications (e.g. enhancing taste), can sometimes be toxic, while others have health benefits. Naturally, this raises the question of how to predict the properties of these molecules. The purpose of this blog article is to provide examples of chemoinformatic applications in the food industry.

In food materials, there is a category called “GRAS” (Generally Recognized As Safe) for compounds considered to pose no risk to humans. Many tests are performed to obtain this qualification, but the literature suggests replacing or combining some of them with QSAR (Quantitative Structure–Activity Relationship) techniques, which infer biological activities from the chemical structure of a molecule. Alternatively, the biological profile of flavor compounds can be assessed by comparing datasets of GRAS flavors, natural products, and drugs (1). Chemical space analysis has been used to identify, for instance, similarities between GRAS flavors and approved antidepressant drugs.

The biomolecular basis of flavor perception is also explored using molecular dynamics simulations. For example, this method was used to analyze the interaction between peptides and taste receptors, enabling the discovery of new flavor compounds. It was also used to explore the behavior of flavor compounds in interaction with plastic packaging and in the strongly alcoholic environment of spirits such as Scotch whisky (2).

Currently, artificial intelligence is used to characterize and identify flavor compounds. These techniques are coupled with high-resolution analytical chemistry techniques. The aim is to supplement, de-risk, and make more objective the work of human panelists in odor identification (3). Such tools are being developed for the flavor engineering industry to design new flavors (4).

In a nutshell, chemoinformatics has emerged as a versatile toolkit (Figure 1) for characterizing, identifying and predicting future flavor compounds.

Figure 1: Use of chemoinformatics for the discovery of flavor compounds (4).

References
1. Medina-Franco JL, Martínez-Mayorga K, Peppard TL, Del Rio A. Chemoinformatic Analysis of GRAS (Generally Recognized as Safe) Flavor Chemicals and Natural Products. Taylor P, editor. PLoS ONE. 2012;7(11):e50798. https://doi.org/10.1371/journal.pone.0050798
2. Shuttleworth EE, Apóstolo RFG, Camp PJ, Conner JM, Harrison B, Jack F, et al. Molecular dynamics simulations of flavour molecules in Scotch whisky. J Mol Liq. 2023, 383:122152. https://doi.org/10.1016/j.molliq.2023.122152
3. Shang L, Liu C, Tang F, Chen B, Liu L, Hayashi K. Artificial intelligence-based gas chromatography-olfactometry for sensory evaluation of key compounds in food ingredients. 2022 Apr 22 [cited 2023 Oct 23]; Available from: http://biorxiv.org/lookup/doi/10.1101/2022.04.20.488977
4. Kou X, Shi P, Gao C, Ma P, Xing H, Ke Q, et al. Data-Driven Elucidation of Flavor Chemistry. J Agric Food Chem. 2023;71(18):6789–6802. https://doi.org/10.1021/acs.jafc.3c00909

Chemoinformatics: Past, Present, Future

Laura Belvisi — 2025-01-21T10:12:26Z

by: Eliya Davidov, Track «Chemoinformatics and Materials Informatics», Bar Ilan-Strasbourg, 2023

In 1998, Frank Brown coined the term Chemoinformatics, defining it as "all the information resources that a scientist needs to optimize the properties of a ligand to become a drug." Yet, the foundations of chemoinformatics trace back much earlier, to the 1950s and 60s, when computational chemistry first began taking shape [1].

In 1957, Ray and Kirsch published the first algorithm for substructure searching. Their groundbreaking paper described "a collection of machines... capable of performing a complete data processing task involving data storage facilities." This laid the groundwork for structure, similarity, and substructure searching in databases — core concepts that would later become vital in chemoinformatics. By 1963, Vleduts proposed the concept of "skeleton reaction schemes" and reaction centers, suggesting the possibility of machine-aided synthesis: "the possibility of a machine solution... the selection of ways synthesizing a given compound" [2]. Another pivotal moment came in 1962 when Hansch introduced QSAR (Quantitative Structure–Activity Relationships), which link a biological activity to chemical structure using factors (molecular descriptors) such as steric effects, electronic properties, and hydrophobicity.

In recent decades, with the rise of artificial intelligence, chemoinformatics has evolved. Its scope now extends beyond ligand optimization to encompass "the application of informatics methods to solve chemical problems" [3]. Without being exhaustive, this includes predictive modeling for biological activity, drug discovery, ligand-based design, 3D molecular docking, protein-ligand interactions, virtual screening, simulations, and molecular dynamics (Figure 1). Although much of the field focuses on biology, chemoinformatics also plays a role in materials science, aiding in the design of batteries, energetic materials, and other physical systems.

What lies ahead for chemoinformatics? With AI, increasing computational power, and the surge of big data, the future promises new breakthroughs. AI is expected to push chemoinformatics into uncharted territories, such as drug discovery for rare diseases. Quantum computing will certainly be a major game changer in the realm of simulations and modeling, allowing for new algorithmic approaches to solve, for instance, complex graph isomorphism problems.

Figure 1. Chemoinformatics emerged as a field from the solutions found to data-related problems shared by many other scientific domains. Medicinal chemistry and drug discovery are still strong driving forces in chemoinformatics today.

References
1. P. Willett, Chemoinformatics: a history. WIREs Comput. Mol. Sci., 2011, 1, 46-56. https://doi.org/10.1002/wcms.1
2. G.E. Vleduts, Concerning one system of classification and codification of organic reactions. Inf. Stor. Ret. 1963, 1, 117–146. https://doi.org/10.1016/0020-0271(63)90013-5
3. J. Gasteiger, The central role of chemoinformatics. Chemometr. Intell. Lab. Syst. 2006, 82, 200–209. https://doi.org/10.1016/j.chemolab.2005.06.022

ZINC20 - Enabling Drug Discovery Through Comprehensive Chemical Search

ChEMoinformatics+ — 2024-06-26T08:55:30Z

by: Xinyue Gao, Track « In Silico Drug Design », Strasbourg-Milan-Paris, 2022

Chemical databases have become indispensable resources empowering research across pharmaceutical, biotech, and materials science domains. By aggregating vast collections of compounds and associated data, these databases allow scientists to efficiently explore chemical space to identify new drug candidates, optimize materials properties, and understand fundamental molecular interactions.

Among available chemical repositories, ZINC20 stands out for its commitment to comprehensiveness, advanced search capabilities, and accessibility. As discovering new biologically active small molecules relies on screening diverse compound libraries, ZINC20's extensive collection of over 1.4 billion compounds gives researchers an unprecedented starting point.

However, sheer size poses steep computational challenges. Traditional fingerprint-based similarity searches scale linearly with database size, while database sizes are currently growing by orders of magnitude. Approximate feature-tree methods enable fast exploration of huge tangible chemical spaces, but at the cost of a possibly less expressive molecular representation.
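To make the linear-scaling point concrete, here is a minimal sketch of a brute-force fingerprint search. Fingerprints are represented as sets of "on" bit indices, and the toy database and threshold are invented for illustration; every query has to visit every entry, so the cost grows in proportion to the database size.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def linear_scan(query, database, threshold):
    """Compare the query with every entry: one Tanimoto evaluation per compound."""
    hits = []
    for name, fingerprint in database.items():
        score = tanimoto(query, fingerprint)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (hypothetical bit indices, not computed from real molecules).
toy_db = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 7}),
    "mol_c": frozenset({8, 9}),
}
hits = linear_scan(frozenset({1, 2, 3}), toy_db, threshold=0.5)
print(hits)  # [('mol_a', 0.75), ('mol_b', 0.5)]
```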

ZINC20 balances these trade-offs via SmallWorld – an algorithm that indexes explicit molecular graphs for rapid similarity calculations. By precomputing synthetically accessible organic molecule graphs, SmallWorld can look up a query graph and quickly traverse the map to identify nearest neighbors in graph edit distance space. This retains full structure details while allowing sub-second searches across databases of >100 billion compounds.
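The idea of traversing a precomputed map can be sketched as a breadth-first search. This is not the SmallWorld implementation: the sketch assumes a hypothetical dictionary in which each key is a graph label and each edge corresponds to a single graph edit, so that the number of hops from the query approximates the edit distance.

```python
from collections import deque


def neighbors_within(graph_map, query, max_edits):
    """Breadth-first search: return {graph: number of edits} up to max_edits."""
    distances = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if distances[node] == max_edits:
            continue  # do not expand beyond the allowed number of edits
        for neighbor in graph_map.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    distances.pop(query)
    return distances


# Toy precomputed map (labels are placeholders for indexed molecular graphs).
toy_map = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
print(neighbors_within(toy_map, "A", max_edits=2))  # {'B': 1, 'C': 1, 'D': 2}
```

Because the map is built ahead of time, a query only touches its local neighborhood instead of the whole database, which is the property the paragraph above relies on.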

Complementing SmallWorld, ZINC20 also incorporates Arthor – a custom toolkit for ultrafast substructure and pattern matching. Arthor represents molecules in a compact binary format optimized for regex-style queries. By distributing across a compute cluster, Arthor can search for complex molecular patterns in just seconds.

These innovations allow ZINC20 to make chemical space exploration a truly interactive experience. Researchers can quickly retrieve analogues in response to biological data, interactively explore structure-activity hypotheses, and easily purchase compounds for testing. Virtual screening workflows also become nimbler and more comprehensive.

However, users should be aware that ZINC20 focuses on commercially available content. So, while coverage spans billions of novel compounds, ZINC20 may lack some public or poorly documented molecules. Integrating additional databases like ChEMBL, DrugBank, and PubChem can help fill the gaps.

ZINC20 makes cross-database integration straightforward via a suite of flexible web APIs. Users can access compounds, substructures, similarity calculations & more programmatically. And the full database is downloadable to allow creating custom workflows locally.

ZINC20 also sets best practices with rigorous attention to data quality and standardization. Compounds are regularly updated and annotated with curated purchasability data to simplify acquisition. Structures and calculated physicochemical properties are checked against reference datasets. And community feedback helps continue improving ZINC20.

This communal spirit epitomizes the promise of open chemical databases – democratizing access to fuel more inclusive research. By providing robust tools freely to all scientists without restrictions, resources like ZINC20 empower investigations that would otherwise be infeasible. And support for virtual screening of huge make-on-demand catalogs allows pursuing risky but high-reward hypotheses relying on novel chemistry.

Ultimately, ZINC20's technical innovations and commitment to accessibility usher in a new era for computational drug discovery. As datasets continue growing in the big data regime, advanced machine learning approaches are becoming essential. Resources like ZINC20 that lower barriers for exploring billions of compounds will only increase in strategic value. As computational power catches up with data volumes, comprehensive high-quality open databases are expected to enable a new wave of therapeutics to enhance health and longevity worldwide.

Figure 1 SmallWorld indexes the topological space of organic molecules into anonymous graphs

J. Chem. Inf. Model. 2020, 60, 12, 6065–6073

References:
Irwin, John J., Khanh G. Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R. Wong, Munkhzul Khurelbaatar, Yurii S. Moroz, John Mayfield, and Roger A. Sayle. “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery.” Journal of Chemical Information and Modeling 60, no. 12 (December 2020): 6065–73. https://doi.org/10.1021/acs.jcim.0c00675.
Nicola, George, Tiqing Liu, and Michael K. Gilson. “Public Domain Databases for Medicinal Chemistry.” Journal of Medicinal Chemistry 55, no. 16 (August 23, 2012): 6987–7002. https://doi.org/10.1021/jm300501t.

ProtGPT2: Designing Novel Proteins with Deep Learning

Laura Belvisi — 2024-01-20T16:11:12Z

by: Thalita Cirino do Nascimento, Track « Chemoinformatics and Physical Chemistry », Milan-Strasbourg, 2022

My interest in chemistry began in high school, specifically in understanding the origin and evolution of life. While reading about related research, I came across an interview with Noelia Ferruz from the Institute of Molecular Biology of Barcelona (IBMB). What intrigued me about Noelia's work was her innovative approach, inspired by how Nature has evolutionarily ‘designed’ a variety of proteins with different functions and topologies.

Indeed, peptide and protein structures have evolved through mutations and recombination that accumulated over more than 3 billion years. This allows nature to explore a huge protein sequence space to find new biological functions. She combined this understanding of molecular evolution with natural language processing (NLP) systems, such as GPT-3, which can generate human-like text after ‘reading’ millions of web pages and books. Since a protein is represented by a sequence of letters, a similar model could be trained on a massive database of over 50 million natural protein sequences: ProtGPT2 [1]. It implicitly integrates patterns and rules about how amino acids are strung together and, unlike previously designed de novo structures, ProtGPT2's proteins resemble the complexity of natural proteins, with the folding patterns and longer loops necessary for interacting with other molecules and for functionalization (Figure 1). However, database searches revealed that the artificially generated proteins are only distantly related to natural ones, more like third-degree cousins than siblings. This suggests that ProtGPT2 is not simply copying existing proteins but combines amino acid building blocks in new ways.

Therefore, ProtGPT2 shows potential as a generative model capable of rapidly exploring new areas of the protein sequence space. Through numerous computational predictions, Noelia Ferruz's team provides encouraging evidence that a large proportion of these sequences may fold into stable and functional structures resembling those found in Nature. Experimental confirmation, which was beyond the scope of the 2022 study, is still needed to draw conclusions about the folding and activities of the proteins generated by ProtGPT2.

Figure 1. An overview of the protein sequence space. Each node represents a sequence. Two nodes are linked when they are sufficiently homologous. Colors depict the different structural domains, and examples of AlphaFold-predicted structures of ProtGPT2-generated sequences are given with their respective number: all-β structures (751), α/β (4266 and 1068), membrane protein (4307), α+β (486), and all-α (785). ProtGPT2-generated sequences are represented by white nodes. The PDB code of the most homologous natural structure is given, with the corresponding identity percentage. The AlphaFold confidence score (pLDDT) is also provided.

1. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7

Chematica (Synthia) – a new software tool for synthesis planning, that combines artificial and expert intelligence to outperform both

Laura Belvisi — 2024-01-09T12:33:05Z

by: Iryna Boiko, Track « In Silico Drug Design », Strasbourg-Milan-Paris, 2023

Last year, I had an amazing opportunity to attend a lecture by Bartosz A. Grzybowski, a Polish scientist, who introduced his group's ground-breaking work on an advanced synthesis planning tool, Chematica, now commercialized by Merck KGaA as Synthia.

Chematica's synthetic pathway designs have reached a level where they are indistinguishable from those created by humans, and sometimes even surpass them in terms of efficiency and elegance. Several complex natural product syntheses proposed by the algorithm have been successfully realized in the lab.[1]

The success of Chematica can be attributed to the combination of machine learning techniques with an expert-based approach. Over nearly a decade, the authors manually identified approximately 100,000 reaction types. The implementation of the software took almost 20 years from its initial conception. For popular reactions, where large amounts of data are available, machine learning algorithms were utilized. Different cross-reactivities and conflicting groups were also encoded into each reaction rule. Quantum chemistry and molecular mechanics calculations were occasionally incorporated. This hybrid model demonstrated superior performance compared to purely expert-based or purely ML-based software.[2,3]

A distinguishing feature of Chematica is its scoring functions, which help navigate through vast networks of synthetic possibilities. At each step, the software must select the most feasible retrosynthetic pathway to prevent combinatorial explosion (Figure 1).

The scoring functions evaluate both the reactions and the sets of generated substrates. The chemicals' scoring function (CSF) accounts for variables such as the number of stereocenters, the number of rings, and the length of the SMILES of each substrate, in order to penalize larger, more complex synthons. The reaction scoring function (RSF) approximates the difficulty of a particular operation based on conflicting or fragile functional groups, possibilities for non-selectivity, and the need for protective groups. Each variable thus raises the score of less favorable pathways, weighted by user-defined coefficients. The RSF and CSF are summed and the pathway with the lowest score is selected.[2]
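The selection rule described above can be sketched as follows. The linear forms, the default coefficients, and the class and function names are assumptions made for illustration; the actual scoring functions in Chematica are more elaborate than this weighted sum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Substrate:
    smiles: str
    stereocenters: int
    rings: int


@dataclass
class Step:
    name: str
    substrates: List[Substrate]
    conflicting_groups: int
    selectivity_issues: int
    protections_needed: int


def chemical_score(substrates, w_stereo=1.0, w_ring=1.0, w_len=0.1):
    """CSF sketch: larger, more complex substrates get a higher (worse) score."""
    return sum(
        w_stereo * s.stereocenters + w_ring * s.rings + w_len * len(s.smiles)
        for s in substrates
    )


def reaction_score(step, w_conflict=2.0, w_select=2.0, w_protect=1.0):
    """RSF sketch: penalize conflicts, non-selectivity and protection steps."""
    return (
        w_conflict * step.conflicting_groups
        + w_select * step.selectivity_issues
        + w_protect * step.protections_needed
    )


def total_score(step):
    return chemical_score(step.substrates) + reaction_score(step)


def best_step(steps):
    """Select the candidate step with the lowest combined score."""
    return min(steps, key=total_score)


steps = [
    Step("A", [Substrate("CCO", 0, 0), Substrate("CC(=O)O", 0, 0)], 0, 0, 0),
    Step("B", [Substrate("C1CCCCC1O", 1, 1)], 1, 0, 1),
]
print(best_step(steps).name)  # A
```

Changing the coefficients shifts the balance between simple substrates and easy reactions, which is the role the user-defined weights play in the description above.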

However, as focusing on one step at a time may lead to a dead end later on, Chematica simultaneously explores ‘wide’ and ‘deep’. It also considers tandem reactions and ‘tactical combinations’, two-step sequences that initially increase structural complexity but enable simplification later. Notably, Chematica is not biased towards reactions commonly reported in the literature, allowing it to assign high ranks to newly developed or specific reactions, leading to more elegant solutions compared to purely ML-driven software.[2,3]

Grzybowski claims that they have effectively taught the computer the rules of Chemistry. Does it mean that organic chemists will lose their jobs soon? We shall see, but one thing is certain—chemoinformaticians will be in high demand.

Figure 1. Synthetic options during iterative retron-to-synthon expansion around the scabrolide A target. Only a few initial expansions are shown.[3]

[1] B. Mikulak-Klucznik, P. Gołębiowska, A. A. Bayly, O. Popik, T. Klucznik, S. Szymkuć, E. P. Gajewska, P. Dittwald, O. Staszewska-Krajewska, W. Beker, T. Badowski, K. A. Scheidt, K. Molga, J. Mlynarski, M. Mrksich, B. A. Grzybowski, Nature 2020, 588, 83–88 (https://doi.org/10.1038/s41586-020-2855-y).
[2] B. A. Grzybowski, T. Badowski, K. Molga, S. Szymkuć, WIREs Comput. Mol. Sci. 2023, 13:e1630 (https://doi.org/10.1002/wcms.1630).
[3] K. Molga, S. Szymkuć, B. A. Grzybowski, Acc. Chem. Res. 2021, 54, 1094–1106 (https://doi.org/10.1021/acs.accounts.0c00714).

Halogen bond: definition, modelling and applications

Laura Belvisi — 2023-07-27T10:49:24Z

by: Leonardo Raso, Track « Chemoinformatics and Physical Chemistry », Milan-Strasbourg, 2022

Halogens are often considered apolar substituents in many organic compounds. Nevertheless, studies in recent years have shown that they can interact significantly with Lewis bases through the so-called halogen bond. [1] This interaction is due to a region of positive electrostatic potential close to the halogen atom. More specifically, this region lies on the extension of the axis connecting the halogen to its neighboring atom and is called the σ-hole (Figure 1). The σ-hole can be characterized by three parameters:

magnitude, which is the maximum positive value of the electrostatic potential on a chosen electron isodensity surface;
size, which is the area of the positive region of the electrostatic potential on a given isodensity surface;
extension, which is the distance at which the potential changes from positive to zero.

In quantum chemistry, the σ-hole can be studied using ab initio methods, but these are practical only for small molecules. Nevertheless, it can be interesting to study the role of these interactions in large systems, such as protein-ligand complexes. For these purposes, it is more convenient to use molecular mechanics.

Unlike in ab initio methods, where it emerges from the electron density, in molecular mechanics the σ-hole must be explicitly integrated into the force field. There are two approaches to do this. In the simpler one, an extra positive charge is added in the proximity of the σ-hole, but this only models the electrostatic properties of the system. The other, more accurate approach reproduces the anisotropy of the electron density of the halogen. It uses angle-dependent parameters in the van der Waals and Coulomb terms of the force field.
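For the simpler of the two approaches, the geometric part can be sketched in a few lines: the extra point is placed on the extension of the bond between the halogen and its neighboring atom. The function name, the coordinates, and the offset value are assumptions chosen for illustration and are not parameters of any particular force field.

```python
import math


def extra_point_position(neighbor_xyz, halogen_xyz, offset):
    """Return a point on the R-X axis, `offset` Å beyond the halogen X, away from R."""
    bond = [h - n for n, h in zip(neighbor_xyz, halogen_xyz)]
    length = math.sqrt(sum(c * c for c in bond))
    if length == 0.0:
        raise ValueError("neighbor and halogen positions coincide")
    unit = [c / length for c in bond]
    return tuple(h + offset * u for h, u in zip(halogen_xyz, unit))


# Hypothetical C-Br fragment aligned with the x axis (distances in Å).
carbon = (0.0, 0.0, 0.0)
bromine = (1.94, 0.0, 0.0)
site = extra_point_position(carbon, bromine, offset=1.5)
print(site)  # a point on the x axis, 1.5 Å beyond the bromine
```

A positive partial charge assigned to this site would then mimic the σ-hole electrostatically; its value has to come from the chosen parameterization.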

In protein-ligand interactions, halogen bonds can make significant contributions, since each amino acid of a protein has at least one Lewis-basic site, namely the oxygen of the carbonyl group. Indeed, the halogen bond has found a place in medicinal chemistry in recent years.

For instance, Zhou and Wong investigated the role of the halogen bond in Haspin kinase using theoretical methods.[2] The authors studied the interaction between the kinase and four halogenated tubercidin ligands and found that the binding energy increases for ligands with heavier halogen substituents. This trend supports the idea that the halogen bond can be modulated across a series of halogenated inhibitors. This is an interesting aspect of the halogen bond: its ability to be tuned. The σ-hole parameters depend strongly on the nature of the halogen: the heavier the halogen, the larger the magnitude and the size of the σ-hole. The scaffold of the ligand also plays a role in its properties. This makes the halogen bond a versatile tool in medicinal chemistry and drug discovery.

Figure 1: Electrostatic potential projected on a surface of 0.001 au electron density of methane (A), fluoromethane (B), bromomethane (C) and iodomethane (D). Image taken from reference [1].

[1] M. H. Kolář and P. Hobza, “Computer Modeling of Halogen Bonds and Other σ-Hole Interactions,” Chemical Reviews, vol. 116, no. 9. 2016. doi: 10.1021/acs.chemrev.5b00560.
[2] Y. Zhou and M. W. Wong, “Halogen Bonding in Haspin-Halogenated Tubercidin Complexes: Molecular Dynamics and Quantum Chemical Calculations,” Molecules, vol. 27, no. 3, p. 706, Jan. 2022, doi: 10.3390/molecules27030706.

Deep Docking: a brief introduction

Laura Belvisi — 2023-06-16T13:33:53Z

by: Dina Khasanova, Track «Chemoinformatics and Materials Informatics», Bar Ilan-Strasbourg, 2022

Drug discovery is an extensive and rigorous process. It takes a long time to bring a molecule “from bench to bedside”. Virtual screening can significantly enhance drug discovery, but conventional docking methods become computationally expensive as the size of available chemical libraries grows exponentially. Several approaches have been developed to address this challenge. One of them is Deep Docking, which, according to its authors, is suited to docking billions of molecular structures without significant loss of potential drug candidates [1].

The protocol includes eight steps: (1) molecular descriptor calculation, (2) receptor preparation, (3) random sampling of the chemical library, (4) ligand preparation in 3D, (5) molecular docking, (6) statistical model training, (7) model inference and (8) repetition from step (3), with the sampling biased toward molecules more likely to be active. The procedure can be completely automated on high-performance computing centers.

In more detail (Figure 1):
1. For each molecule in a chemical library, the molecular descriptors are computed.
2. The raw PDB structures are converted into fully parameterized all-atom models, and the docking calculations are initialized.
3. A dataset is randomly sampled from the chemical library.
4. Each chemical structure from this dataset is prepared in 3D with a physical model.
5. Prepared ligands are docked into the protein target using a conventional docking protocol. The best-scoring molecules are labeled "actives" and the others "inactives".
6. A QSAR model is optimized, trained and validated on this dataset to discriminate between "actives" and "inactives" based on the molecular descriptors.
7. The resulting deep QSAR model is used to categorize all the molecules in the chemical library as "actives" or "inactives".
8. The algorithm repeats from step (3), with the dataset augmented with compounds that are more likely to be "active" and a slightly more stringent definition of the categories "actives" and "inactives".


Figure 1: Workflow of the DD pipeline adapted from ref. [2]

Gentile, F., Agrawal, V., Hsing, M. et al. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Central Science 6, 939-949 (2020). https://doi.org/10.1021/acscentsci.0c00229

These steps are repeated up to a maximum number of iterations. At the end, compounds categorized as "actives" are the hits of the virtual screening [2]. The dataset used to build the QSAR model evolves at each iteration, which reinforces the performance of the QSAR model: at each iteration it becomes more predictive, as suggested by the enrichment values measured on the test datasets.
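The iterative loop can be illustrated with a toy example. Everything here is a stand-in: each molecule is reduced to a single numeric descriptor, `dock` is a made-up scoring function instead of a docking program, and the "QSAR model" is just the descriptor range spanned by the best-scoring molecules, where the published protocol trains a deep neural network. The sketch only shows the sample, dock, train, filter cycle with a cutoff that tightens at each iteration.

```python
import random


def dock(descriptor):
    """Stand-in for a docking run; lower is better (hypothetical scoring)."""
    return abs(descriptor - 0.8)


def train_model(docked, fraction):
    """Toy 'QSAR model': descriptor range spanned by the best-scoring fraction."""
    ranked = sorted(docked, key=docked.get)
    n_actives = max(1, int(len(ranked) * fraction))
    actives = ranked[:n_actives]
    return min(actives), max(actives)


def deep_docking(library, sample_size, n_iterations, seed=0):
    rng = random.Random(seed)
    candidates = list(library)      # molecules still predicted "active"
    docked = {}                     # training set, grows at each iteration
    for iteration in range(n_iterations):
        # Steps 3-5: sample from the current candidates and dock the sample.
        sample = rng.sample(candidates, min(sample_size, len(candidates)))
        for descriptor in sample:
            docked[descriptor] = dock(descriptor)
        # Step 6: retrain with a more stringent definition of "active".
        fraction = 0.5 / (iteration + 1)
        low, high = train_model(docked, fraction)
        # Step 7: keep only the molecules the model predicts as "active".
        candidates = [d for d in candidates if low <= d <= high]
    return candidates


library = [i / 100 for i in range(100)]
hits = deep_docking(library, sample_size=20, n_iterations=3)
print(len(hits), "of", len(library), "molecules kept as predicted actives")
```

Only the sampled molecules are ever docked; the rest of the library is filtered by the inexpensive model, which is where the saving in docking time comes from.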

This open-source project is available on GitHub and comes with a graphical user interface, DD-GUI, that simplifies access to the tool. It can be installed on Linux, Mac and Windows platforms [3].

References
[1] Gentile, F., Yaacoub, J.C., Gleave, J. et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nat Protoc 17, 672–697 (2022). https://doi.org/10.1038/s41596-021-00659-2
[2] Gentile, F., Agrawal, V., Hsing, M. et al. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Central Science 6, 939-949 (2020). https://doi.org/10.1021/acscentsci.0c00229
[3] Yaacoub, J.C., Gleave, J., Gentile, F. et al. DD-GUI: A graphical user interface for deep learning-accelerated virtual screening of large chemical libraries (Deep Docking). Bioinformatics 38, 1146-1148 (2022). https://doi.org/10.1093/bioinformatics/btab771