Structure Representations for Molecular Similarity Analysis
The different facets of the chemical structures relevant for particular applications are reflected in the structure representations used to define various application-focused similarity measures. The most commonly used representations are considered below.
Perhaps the simplest and most natural representation is directly based on the common (matching) and different (mismatching) parts or fragments of the structures. In chemoinformatics, they correspond to various subgraphs (in particular, the maximum common subgraph) of their molecular graphs.5 This approach works best when the differences between the structures are relatively small and localized, such as the variations in only one (or very few) atoms, substituents or functional groups attached to or present within a common structural scaffold,13 and will not be useful in more general cases.
The standard approach to the estimation of topological similarity for more dissimilar and diverse structures (where the differences cannot be easily isolated) is based on the consideration of large numbers of very simple and often overlapping substructural fragments and/or other structural features. Most commonly only the presence of each feature in a given structure is taken into account (thus, from a mathematical point of view, structures are represented by sets of features that are encoded as bit vectors often called molecular fingerprints or molecular keys),14 although the fragment/feature occurrence counts can also be used (giving rise to fragment multisets).5 The term ‘molecular fingerprint’ reflects the fact that these entities provide unique representations of each molecule, but individual bit values in them are not characteristic and may be difficult to interpret in isolation. From a practical viewpoint, the fingerprint bit vectors should have a manageable size and a reasonable population density (proportion of the ‘on’ bits), since comparing vectors that are too sparse or too dense (where all or almost all positions are zeroes or ones, respectively) is not very informative. In order to achieve this and still cover the most representative set of fragments, hashing algorithms may be used to map each fragment to specific bit vector positions that are shared by several different fragments,15 making the fingerprint/hologram nature of the representation even more pronounced. Nevertheless, an important property of all types of fingerprints is that the bit positions associated with a certain substructure will also be set in the fingerprints of larger structures containing it.15
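The hashing idea described above can be illustrated with a minimal sketch. It is not the algorithm of any particular fingerprint scheme: the fragment lists are hand-written and hypothetical (no real fragment perception is performed), and the function names, the 64-bit vector length, the use of SHA-256, and the choice of two bits per fragment are all assumptions made for illustration only.

```python
import hashlib

def fragment_bits(fragment, n_bits=64, bits_per_fragment=2):
    """Map one fragment label to a few bit positions using a stable hash.

    Different fragments may collide on the same position, which is why
    individual bits are hard to interpret in isolation.
    """
    positions = set()
    for k in range(bits_per_fragment):
        digest = hashlib.sha256(f"{k}:{fragment}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:8], "big") % n_bits)
    return positions

def fingerprint(fragments, n_bits=64, bits_per_fragment=2):
    """Return the set of 'on' bit positions (presence only, no counts)."""
    on_bits = set()
    for fragment in set(fragments):
        on_bits |= fragment_bits(fragment, n_bits, bits_per_fragment)
    return on_bits

def density(on_bits, n_bits=64):
    """Proportion of 'on' bits in the vector."""
    return len(on_bits) / n_bits

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

# Hypothetical, hand-written fragment lists used only as an illustration.
ethanol = ["C", "O", "C-C", "C-O", "C-C-O"]
propanol = ["C", "O", "C-C", "C-O", "C-C-O", "C-C-C", "C-C-C-O"]

fp_ethanol = fingerprint(ethanol)
fp_propanol = fingerprint(propanol)

print("density (propanol):", density(fp_propanol))
print("Tanimoto:", tanimoto(fp_ethanol, fp_propanol))
print("substructure bits preserved:", fp_ethanol <= fp_propanol)
```

Because every fragment listed for the smaller structure also appears in the list for the larger one, all bits of the first fingerprint are necessarily set in the second, which mirrors the substructure property mentioned in the text. Increasing the vector length in this sketch lowers the population density and the chance of collisions.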
Among the most commonly used molecular fingerprint schemes one should mention the MACCS keys,16 CACTVS/PubChem keys,17 Extended Connectivity Fingerprints (ECFP),18 Daylight fingerprints,15 and proprietary procedures such as the ChemAxon19 fingerprints. The major differences between them are related to the sets or classes of fragments and other structural features included in the fingerprint generation as well as to the hashing algorithms. In addition, some of the schemes are not sufficiently documented, and some discrepancies between different implementations may exist. All these differences lead to uncertainties that complicate the use of the resulting similarity measures (in particular, the values obtained from different schemes are not directly comparable). Even more importantly, due to the universal ‘common denominator’ nature of the fingerprint-based similarity measures, they may mask or distort the importance of the fine structural differences relevant for a particular problem. Numerous attempts have been made to optimize the fingerprint schemes and/or select the best scheme and parameter set for a specific task (e.g. similarity-based search for active compounds, see Section 220.127.116.11).20-22 Nevertheless, we believe that the fingerprint-based similarity measures can probably be used most safely to recognize very similar or very dissimilar structures, while small changes in similarity (especially outside of the high or low similarity ranges) should be interpreted with great caution.
The physical, physico-chemical or topological characteristics of the compound that are expected to be relevant for a particular problem can be represented by simple vectors of molecular descriptor values.5 Such descriptors may be bulk or macroscopic properties (e.g. solubility, lipophilicity, acidity, density, or heat of vaporization) measured experimentally or predicted by means of quantitative structure-property models, as well as molecular graph invariants (from molecular weight, number of rotatable bonds, or number of hydrogen bond donors and acceptors to topological indices reflecting various structure properties) or quantum chemical parameters (e.g. molecular orbital or ionization energies). In order to simplify the processing, eliminate the effect of different parameter scales, and enable visual representation of the similarity relationships, dimensionality reduction techniques such as principal component analysis may be used.23,24
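As a rough illustration of descriptor-based similarity, the sketch below scales a small descriptor table to zero mean and unit variance, compares compounds by Euclidean distance, and extracts the first principal component by power iteration. The compound labels and descriptor values (molecular weight, lipophilicity, hydrogen bond donor count) are invented for the example, and the explicit autoscaling step and all function names are assumptions rather than a procedure prescribed by the text.

```python
import math

def autoscale(rows):
    """Scale each descriptor column to zero mean and unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) or 1.0
            for c, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def first_principal_component(scaled, n_iter=200):
    """Leading eigenvector of the covariance matrix, found by power iteration.

    The input is assumed to be centered already (e.g. by autoscale).
    """
    n, d = len(scaled), len(scaled[0])
    cov = [[sum(r[i] * r[j] for r in scaled) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Invented values: molecular weight, lipophilicity (log P), H-bond donor count.
names = ["cpd_A", "cpd_B", "cpd_C", "cpd_D"]
raw = [
    [180.2, 1.2, 1],
    [194.2, 1.5, 1],
    [310.4, 3.8, 0],
    [151.2, 0.5, 2],
]

scaled = autoscale(raw)
pc1 = first_principal_component(scaled)
# Projection of each compound onto the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in scaled]

for name, score in zip(names, scores):
    print(name, round(score, 3))
print("d(A, B) =", round(euclidean(scaled[0], scaled[1]), 3))
print("d(A, C) =", round(euclidean(scaled[0], scaled[2]), 3))
```

Without the scaling step the molecular weight column, which has by far the largest numerical range, would dominate both the distances and the principal component.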
A similar approach can be used to estimate the molecular similarity based on the chemical reactivity or biological activity profiles represented by suitable parameters such as the equilibrium or rate constants, effective concentrations, or induced changes in the ‘omics’ fingerprints.
The analysis of similarities in the patterns of potential intermolecular interactions (in particular, ligand interactions with a biological target) requires some representation of the molecular interaction fields. In 3D QSAR, the sampling of these fields at the nodes of a rectangular grid is traditionally used.25,26 However, the representations based on a set of Gaussian functions positioned on atoms and/or interaction centers27-29 enable the analysis of similarity both in molecular shape and in specific interactions. They also have important advantages in terms of computational performance as well as stability with respect to grid changes, imperfect alignment, and minor conformational variations.
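A minimal sketch of a Gaussian-based shape comparison is given below. Each atom is modeled as a unit-amplitude spherical Gaussian exp(-alpha*|r - R|^2), the overlap of two molecules is approximated by the sum of pairwise overlap integrals (higher-order corrections are ignored), and the result is normalized as a Tanimoto-like index. The coordinates, the single exponent value of 0.8, and the function names are assumptions chosen for illustration; the molecules are taken to be already aligned.

```python
import math

def gaussian_overlap(center_a, center_b, alpha_a, alpha_b):
    """Analytic overlap integral of two unit-amplitude spherical Gaussians."""
    d2 = sum((p - q) ** 2 for p, q in zip(center_a, center_b))
    s = alpha_a + alpha_b
    return (math.pi / s) ** 1.5 * math.exp(-alpha_a * alpha_b * d2 / s)

def total_overlap(mol_a, mol_b):
    """First-order overlap: sum over all atom pairs of the two molecules."""
    return sum(gaussian_overlap(ca, cb, aa, ab)
               for ca, aa in mol_a for cb, ab in mol_b)

def shape_tanimoto(mol_a, mol_b):
    """Tanimoto-like shape similarity built from the overlap integrals."""
    o_ab = total_overlap(mol_a, mol_b)
    return o_ab / (total_overlap(mol_a, mol_a) + total_overlap(mol_b, mol_b) - o_ab)

def shifted(mol, dy):
    """Copy of a molecule translated by dy along the y axis."""
    return [((x, y + dy, z), alpha) for (x, y, z), alpha in mol]

ALPHA = 0.8  # arbitrary Gaussian exponent, an assumption for this sketch

# Hypothetical three-atom chain: list of (center, exponent) pairs.
mol = [((0.0, 0.0, 0.0), ALPHA), ((1.5, 0.0, 0.0), ALPHA), ((3.0, 0.0, 0.0), ALPHA)]

sim_small_shift = shape_tanimoto(mol, shifted(mol, 0.2))
sim_large_shift = shape_tanimoto(mol, shifted(mol, 2.0))

print("small misalignment:", round(sim_small_shift, 3))
print("large misalignment:", round(sim_large_shift, 3))
```

Because the overlap is a smooth analytic function of the atomic positions, a small displacement changes the similarity only slightly, which is the kind of stability with respect to imperfect alignment noted above; no grid is involved at any point.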
Another approach considers the potential interactions of small molecules on a topological (2D) level of chemical structure defined by their structural formulas. In order to represent them in a common frame of reference, the Molecular Field Topology Analysis (MFTA)30,31 builds a molecular supergraph, that is, a kind of superstructure such that all structures belonging to a series under study can be superimposed onto the supergraph and characterized in a uniform way. As molecular descriptors, MFTA employs local physico-chemical parameters (atom and bond properties) that can be quickly evaluated from a structural formula. In particular, electrostatic (e.g. effective atomic charge), steric (van der Waals radii of atoms and groups), lipophilicity, and hydrogen-bonding descriptors may be considered. Although this method was developed primarily for the QSAR analysis (it and its applications will be discussed in more detail in Section 6.3.1 below), such supergraph-based uniform representations of potential atom-centered interactions (or other relevant local parameters) can also be used to estimate the similarity between compounds or to match their structures in the most meaningful way (i.e. most similar with respect to specific parameters).
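The idea of a uniform, supergraph-based description can be sketched as follows. This is not the MFTA implementation: the superposition of each structure onto the supergraph is assumed to be given in advance as a mapping from supergraph vertex numbers to local atomic parameters, unoccupied positions are filled with neutral placeholder values (an assumption of this sketch), and the charges, vertex numbering, and function names are invented for illustration.

```python
import math

# Assumed neutral values for supergraph positions a molecule does not occupy.
DEFAULTS = {"charge": 0.0, "radius": 0.0}

def uniform_vector(mapping, n_vertices, properties=("charge", "radius")):
    """Flatten local atomic parameters into one vector ordered by supergraph vertex."""
    vector = []
    for vertex in range(n_vertices):
        atom = mapping.get(vertex, DEFAULTS)
        vector.extend(atom[p] for p in properties)
    return vector

def distance(u, v):
    """Euclidean distance between two uniform descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

N_VERTICES = 4  # size of the hypothetical supergraph

# Hypothetical superpositions: supergraph vertex -> local atomic parameters.
mol_a = {0: {"charge": -0.1, "radius": 1.70},
         1: {"charge": 0.2, "radius": 1.70},
         2: {"charge": -0.4, "radius": 1.52}}
mol_b = {0: {"charge": -0.1, "radius": 1.70},
         1: {"charge": 0.2, "radius": 1.70},
         2: {"charge": -0.3, "radius": 1.55}}
mol_c = {0: {"charge": -0.1, "radius": 1.70},
         1: {"charge": 0.2, "radius": 1.70},
         3: {"charge": -0.4, "radius": 1.52}}  # substituent at a different position

vec_a = uniform_vector(mol_a, N_VERTICES)
vec_b = uniform_vector(mol_b, N_VERTICES)
vec_c = uniform_vector(mol_c, N_VERTICES)

print("d(a, b) =", round(distance(vec_a, vec_b), 3))
print("d(a, c) =", round(distance(vec_a, vec_c), 3))
```

Since every structure is described by a vector of the same length and with the same meaning for each position, the vectors can be compared directly. In practice the parameters would need to be brought to comparable scales before a distance is computed, and finding the superposition itself is the hard part that this sketch leaves out.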