 # Molecular Similarity Functions

A molecular similarity function maps a pair of molecules Mi and Mj (described by some suitable representation that is relevant to the intended application) to a real-valued similarity measure S(Mi,Mj) reflecting a ‘degree’ or ‘amount’ of similarity between them. In most cases, such functions are symmetric, i.e. S(Mi,Mj) = S(Mj,M), although asymmetric similarity measures have been proposed (see later) and may be more efficient in some applications where there is a natural asymmetry in the roles of the two molecules (e.g., a query/tem- plate molecule vs. library molecules, or a parent structure vs. its derivatives).

The similarity functions are usually constructed as normalized ratios, with the numerator expressing some measure of size or count of the common elements and the denominator expressing a measure of total size or element count of the molecules (or other objects). Thus, their values lie in the interval from 0 to 1. The unity value is taken when a molecule is compared to itself (S(Mi,M) = 1) or any other molecule that is identical in terms of a representation used for the similarity calculation. The zero value means that the molecules are ‘completely’ different (although such situations are quite rare in practice).

The distance or dissimilarity between molecules is commonly defined as a similarity complement to unity32 according to eqn (6.1). This distance is also limited to the [0,1] interval, which may seem counterintuitive (after all, the chemical space is unlimited). Moreover, for many similarity measures the corresponding distance is not a metric in the mathematical sense. In particular, the identity of indiscernibles axiom is usually violated (as noted above, different molecules may have the same representation and thus S = 1 and D = 0). The triangle inequality is also often violated (although it may be satisfied in certain special cases).32 Thus, it is best to view this parameter as a ‘distance coefficient’ or ‘dissimilarity index’, which is rather informal but adequate in most applications. Although some nonlinear transformations of the similarity values may be constructed in order to overcome the ‘limited distance’ problem, in most cases such complexity is not necessary.

Numerous similarity and distance measures have been proposed in the literature.153233 Sometimes the same or closely related parameters in different or even the same fields were developed by different authors independently. In this chapter, we consider only a few similarity functions that have found a widespread use in chemoinformatics (see Table 6.1). It is remarkable that the functions intended for different types of molecular representations have a substantial similarity that allows us to formulate them in a uniform way. For two molecules A and B, they are defined in terms of the generalized ‘size’ values for their intersection (common part) |AnB |, their union |AuB |, as well as individual molecules | A| and | B| (in some cases, the union size is more conveniently expressed as a sum of individual molecule sizes corrected by the intersection size according to eqn (6.2)). Various quantities used in these calculations are illustrated in Figure 6.1. Naturally, the way they are defined depends on the molecular representation selected for the study. For the set-based representations (in particular, molecular fingerprints) of the molecules A and B, we can denote their individual components by Ak and Bk, respectively. They can represent the presence (0/1) or occurrence count of a particular fragment or a structural feature, or simply the value in the k-th position of the fingerprint vector. Then the sizes for the similarity calculation are defined by eqn (6.3)-(6.6). In the most common case where A and B are bit vectors, the equations can be simplified by introducing the bit counts a, b and c representing respectively the number of bits that are set only in A, only in B and in both vectors (see Figure 6.1). In practical implementations, optimized bitwise operations are often used.  Figure 6.1 Size quantities used in the calculation of similarity functions for two molecules. Molecule A is represented by the red circle, molecule B is represented by the blue circle. Their intersection (common part) AnB is shown in magenta, their union AuB is indicated by the pink contour. Lowercase letters denote the sizes of individual areas: the red crescent a is the difference AB (part of A not present in B), the blue crescent b is the difference BA (part of B not present in A), the magenta lens c is the intersection AnB, and the white area d represents elements of the representation that are not present in either A or B (this quantity is not used in the similarity functions considered here).

For the graph-based representations of the molecules A and B, the size is usually defined as a number of edges (bonds) in the corresponding molecular graphs (cardinality of the edge set), denoted by |?(A)| and |?(B)|. The intersection between the molecules is then defined as their maximum common edge substructure (MCES), i.e. a common subgraph containing maximum number of edges (for more rigorous graph-theoretical terminology and overview of algorithmic approaches see ref. 34 and 35). the focus on the matching bonds generally provides a more ‘chemically meaningful’ picture of structural relations.536 The size quantities for the similarity calculation are defined by eqn (6.7)-(6.10). In the vector-based representations the molecules A and B are characterized by the descriptor vectors A and B with individual components Ak and Bk, respectively. The descriptors may also be derived by sampling the continuous 3D molecular fields at the nodes of a rectangular grid. The individual molecule size is then defined as a squared norm of the corresponding vector (denoted by All2 and IIBII2), while the intersection size is defined as a dot (inner) product of the vectors (A,B). The size quantities for the similarity calculation are defined by eqn (6.11)-(6.14). For the representations based on a continuous 3D molecular field, the molecules A and B are characterized by point functions A(x,y,z) and B(x,y,z), respectively. The size quantities for the similarity calculation are defined in basically the same way as for the vector-based representations. However, the dot products of vectors are replaced by the inner products of functions in a function space (i.e. their overlap integrals over volume) according to eqn (6.15)-(6.18). As mentioned above, the field functions can be represented as combinations of Gaussians to enhance the performance and stability.27-29 Using the appropriate size quantities for the selected molecular representation, various similarity functions defined in Table 6.1 are calculated. For the most part, the application of these measures is quite straightforward. However, some points deserve additional discussion.

• (1) the combination of the Tanimoto similarity function with some kind of the molecular fingerprint (bit vector) representation works rather well in most applications and seems to be the most popular approach to the structural similarity analysis. In fact, it may be regarded as the ‘default’ similarity measure that is implied in the absence of additional qualifications. Nevertheless, we have shown that many other approaches are possible in this field and may be more suitable to each particular problem (see also the discussion of the applicability domains of molecular similarity measures in Section 6.2.1.2).
• (2) For the representations involving bit vectors (i.e. vectors having 0 or 1 as component values), the same size quantities and thus the same similarity measures are obtained when using the equations defined for the set- and vector-based representations (except the potentially diminished performance in the latter case). However, for the multisets (i.e. vectors with non-negative integer components) the results from the set- and vector-oriented equations will be different.
• (3) if a representation involves only non-negative components (binary, integer, real, or functional), it can be shown that a specific ordering of various similarity measures is observed (eqn (6.19)).s,33,3y however, care should be taken if negative component values are possible (e.g. for the lipophilicity, atomic charge, or molecular electrostatic potential). in this case, the intersection size |АпБ| may also become negative, leading to more complicated relations (even the [0,1] bounds on the similarity values can be violated).2^37 (4) the pearson correlation coefficient is closely related to the Cosine/ Carbo similarity (it is calculated in the same way after centering the vectors by subtracting their mean values) and can also be used as a

 Function Definition Notes Tanimoto (Jaccard) Traditionally called Tanimoto similarity in chemoinformatics context, although the same measure was proposed much earlier by Jaccard Dice Hodgkin-Richards Named after Dice for representations based on sets and graphs, after Hodgkin and Richards for representations based on vectors and functions. Denominator is the arithmetic mean of individual sizes Cosine Carbo Equals to cosine of the angle between vectors. Named after Carbo for representations based on vectors and functions. Denominator is the geometric mean of individual sizes Min Denominator is the maximum of individual sizes Max Denominator is the minimum of individual sizes Tversky Asymmetric similarity function proposed by Tversky (depends on parameters а,в > 0)

“Calculation of the similarity functions for two molecules A and B is based on the following size quantities: |A|, size of molecule A; |B|, size of molecule B;

|AOB|, size of their intersection (common part); |A^B|, size of their union. For the bit vector representations, the bit counts a, b and c represent the number of bits that are set only for molecule A, only for molecule B and for both molecules, respectively.

similarity measure, especially when the representation vector components are similar in nature and scale.

• (5) The Tanimoto similarity and most of the other symmetric similarity functions (with the possible exception of SMax) exhibit the so-called size-dependent behavior.53839 When the two molecules differ significantly in their size and/or complexity, the denominator of the similarity ratio becomes dominated by the contribution from the larger molecule. The resulting similarity values are not only lower than intuitively expected but also less sensitive to changes in the smaller molecule.
• (6) The asymmetric similarity function proposed by Tversky40 (eqn (6.20)) allows control of the issues arising from the asymmetry between the molecules (which may be caused by differences in their sizes or roles). In fact, it is not a single function but a family of similarity functions dependent on two non-negative parameters (a and в) that specify the relative weights of the molecules A and B in the comparison. If a = в, the resulting similarity function is symmetric. For example, the Tanimoto function is obtained by letting a = в = 1 and the Dice function by letting a = в = 1/2. Often, a simplified form of the Tversky function is used that is obtained by letting в = 1 - a, and thus involves only one asymmetry parameter (eqn (6.21)).41 In the similarity-based virtual screening context (see Section 6.2.3.1), the computational experiments indicate that better results (i.e. higher hit rates) are obtained when the weight for the query molecule is close but not equal to unity.41 One common option is using a = 0.95.29 Recently a new ‘second-order’ Hausdorff-like similarity measure was proposed,42 which is applicable to complex chemical systems (e.g. mixtures or ionic liquids), to individual compounds represented by the sets of their metabolites or substructures, or to arbitrary sets of structures. For two sets A = {a,} and B = {b} it is defined as a normalized sum of pairwise similarities between component structures according to eqn (6.22). This approach takes into account all components of the sets, providing more informative and intuitively desirable comparison. Concluding this section devoted to the definitions of molecular similarity measures, we will present an illustrative example. Figure 6.2 demonstrates the calculation of some measures discussed above for two compounds, cytidine 1 and lamivudine 2 (see the structures in Scheme 6.1). In particular, the Tanimoto similarity based on the molecular fingerprint representation is shown in Figure 6.2a and the tanimoto similarity based on the molecular graph representation is shown in Figure 6.2b. In Figure 6.2c, the Tanimoto and Carbo similarity functions are calculated for the vector representation of atomic charges on the topological (2D) structure level based on the MFTA supergraph. Finally, in Figure 6.2d, the Tanimoto and Carbo similarity functions are calculated for the vector representation involving a set of calculated 1D (global) molecular descriptors and predicted properties. It should be noted that most of the similarity values are moderate, although lamivudine would normally be perceived by a medicinal chemist as a quite close analog of cytidine (and was specifically developed as such). In addition, different values are obtained when using different similarity functions, representations and underlying parameters. In practical applications of the molecular similarity analysis (see Section 6.2.3) it is important to calibrate and/or validate a similarity measure before using it to make any decisions. The apparent very high similarity based on the 1D descriptors may be regarded as a kind of ‘artifact’ caused by using a small set of descriptors dominated by only two parameters with large absolute values (molecular weight and topological polar surface area). In practice, some prescaling or dimensionality reduction procedure based on a representative training set should be applied before using such descriptors in similarity calculations (see Section 6.2.1.3). 