Applications of Molecular Similarity Analysis
In this section, we consider several areas of computational pharmacology and toxicology where approaches based on molecular similarity analysis have been used successfully. Most of them are explicitly or implicitly based on the similarity property principle discussed in Section 6.2.2, so it is advisable to bear in mind all the limitations and caveats mentioned there. It should also be noted that similarity-based approaches, in contrast to QSAR analysis and related techniques, do not construct specialized prediction models (e.g. using machine learning methods). Instead, their ‘model’ is simply a set of suitably represented molecules, sometimes with associated property/activity information.
Similarity-Based Virtual Screening
Virtual screening (often also called in silico screening) can be defined as the application of chemoinformatics techniques to analyze large libraries of chemical compounds in order to identify potentially promising structures during drug discovery and development. Virtual screening workflows may involve a variety of approaches based on the available target and ligand information, as well as filters that take into account predicted ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties, drug-likeness estimates, and other data.83 Importantly, virtual screening is not expected to guarantee that the selected compounds will indeed be active or have other desired properties, or that no active compounds will be missed. Instead, it ranks the library (database) compounds according to some estimated preference, in such a way that the compounds closer to the top of the list should have a higher chance of being active. In other words, the top (priority) list is enriched in active compounds. In addition to the correctly predicted active compounds (true positives) and inactive compounds (true negatives), two kinds of errors are possible. False positives are inactive compounds erroneously ranked as active, and false negatives are active compounds erroneously predicted to be inactive. An important step in the design of a virtual screening procedure is its validation, i.e. estimation of the recognition accuracy for active and inactive compounds and of the enrichment factors for top-ranking hits.
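The validation quantities mentioned above can be illustrated with a minimal Python sketch. It assumes a library that has already been ranked (best score first) and for which the experimental activity labels are known. The function names `confusion_counts` and `enrichment_factor` and the ten-compound example are hypothetical, and the enrichment factor is taken in its common form: the hit rate among the selected top compounds divided by the hit rate in the whole library.

```python
def confusion_counts(ranked_is_active, n_selected):
    """Return (TP, FP, FN, TN) when the top n_selected compounds are called active."""
    selected = ranked_is_active[:n_selected]
    rejected = ranked_is_active[n_selected:]
    tp = sum(selected)            # actives in the priority list
    fp = len(selected) - tp       # inactives erroneously ranked as active
    fn = sum(rejected)            # actives erroneously left out
    tn = len(rejected) - fn       # inactives correctly left out
    return tp, fp, fn, tn


def enrichment_factor(ranked_is_active, n_selected):
    """Hit rate in the top n_selected divided by the hit rate of the whole library."""
    n_total = len(ranked_is_active)
    total_actives = sum(ranked_is_active)
    if not 0 < n_selected <= n_total:
        raise ValueError("n_selected must be between 1 and the library size")
    if total_actives == 0:
        raise ValueError("the library contains no known actives")
    top_hit_rate = sum(ranked_is_active[:n_selected]) / n_selected
    library_hit_rate = total_actives / n_total
    return top_hit_rate / library_hit_rate


# Hypothetical ranked library of 10 compounds (True = experimentally active).
ranked = [True, True, False, True, False, False, True, False, True, False]

print(confusion_counts(ranked, 4))   # (3, 1, 2, 4)
print(enrichment_factor(ranked, 4))  # 1.5
```

An enrichment factor above 1 means that the priority list contains proportionally more actives than a random selection of the same size would be expected to contain.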
In similarity-based virtual screening, the ranking of the library compounds is based on their calculated similarity to some known active compound (the reference or probe molecule). This process was traditionally called similarity search. It is assumed that the similarity property principle is obeyed, so the compounds sufficiently close to the reference in chemical space should also possess some activity.84,85 Historically, this was one of the first and most prominent applications of molecular similarity analysis in chemistry, and most of the research on the development and optimization of similarity measures (see Section 6.2.1) was aimed at improving its performance and efficiency. Of course, compared to other, more sophisticated virtual screening methods, similarity search is a rather rough technique that is likely to produce many false positives and false negatives (caused by irregular activity landscapes and/or by insufficiently relevant similarity measures). Nevertheless, it can be quite useful at the beginning of a drug discovery project, when no information on the target and ligands is available beyond the structures of one or a handful of hit compounds.85 In such situations, any guidance in choosing the next research steps to map the structure-activity relationship is valuable. In addition, similarity search is often computationally less demanding than the more sophisticated methods.
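As a minimal sketch of such a similarity search, the Python fragment below ranks a small library by Tanimoto similarity to a reference molecule. It assumes that each molecule is already represented by a fingerprint stored as a set of integer feature identifiers; the compound names, the feature sets, and the function `similarity_search` are hypothetical. Returning zero for two empty fingerprints is a convention chosen here, not a rule taken from the text.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of feature IDs."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(fp_a & fp_b) / union


def similarity_search(reference_fp, library, top_n=None):
    """Rank library compounds by decreasing similarity to the reference."""
    scored = [(tanimoto(reference_fp, fp), name) for name, fp in library.items()]
    # Sort by similarity (highest first); break ties by name for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


# Hypothetical fingerprints: sets of identifiers of the features that are present.
reference = {1, 2, 3, 4}
library = {
    "cpd_A": {1, 2, 3, 4, 5},  # shares 4 of 5 distinct features
    "cpd_B": {1, 2, 7, 8},     # shares 2 of 6 distinct features
    "cpd_C": {9, 10},          # shares nothing with the reference
}

for score, name in similarity_search(reference, library):
    print(f"{name}: {score:.2f}")
```

In practice the fingerprints would be produced by a chemoinformatics toolkit, and the choice of the fingerprint type strongly affects the resulting ranking.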
The parameters of the similarity search (in particular, the molecular representation, the similarity function, and the desired similarity level) should be chosen in such a way as to minimize the number of errors and to avoid the trivial situation where only the reference molecule itself is found. The optimization and calibration of these parameters has been a topic of extensive research.5,20-22 A common rule of thumb based on an early seminal study86 suggests that a Tanimoto similarity value S_Tan > 0.85 reflects a high probability that two compounds have the same activity. However, subsequent research has shown that the problem of activity-relevant similarity values is much more complicated (and has even led to calling this rule a ‘0.85 myth’).6 As discussed in Section 6.2.1, different molecular representations and different fingerprint schemes result in substantially different similarity values for the same molecules.22 In addition, the distribution of the similarity values (diversity) and the sensitivity of activity to structural changes depend significantly on the particular activity class and data set.22 A more thorough statistical analysis has shown that values of S_Tan > 0.8 (for MACCS fingerprints) or S_Tan > 0.3 (for ECFP4 fingerprints) are much more likely for compounds sharing the same activity than for two randomly selected compounds or for one active and one randomly selected compound.
Nevertheless, these values cannot be reliably applied as distinct thresholds for the similarity search, as the number of false positives and false negatives would be high and hard to predict.22 An interesting approach has been proposed that aims to improve virtual screening performance by augmenting the usual activity-independent similarity measures with background knowledge.87 The extension component in such ‘hybrid’ measures is calculated as a fingerprint-based similarity using structural fragments that are believed to be required or beneficial for activity.
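The idea of a hybrid measure can be sketched in Python as follows. The text does not state how the two components are combined, so the weighted average used here is purely an illustrative assumption and should not be read as the formula of the cited work. The fingerprints are again hypothetical sets of feature identifiers, and `key_fragments` stands for the features believed to matter for activity.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of feature IDs."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def hybrid_similarity(ref_fp, cand_fp, key_fragments, weight=0.5):
    """Blend an activity-independent similarity with a knowledge-based component.

    The weighted average is an assumption made for illustration only.
    """
    base = tanimoto(ref_fp, cand_fp)
    # Extension component: similarity restricted to the activity-relevant fragments.
    extension = tanimoto(ref_fp & key_fragments, cand_fp & key_fragments)
    return (1.0 - weight) * base + weight * extension


# Hypothetical data: features 3 and 4 are believed to be required for activity.
reference = {1, 2, 3, 4}
key_fragments = {3, 4}
cand_x = {1, 2, 5, 6}  # shares only activity-irrelevant features
cand_y = {3, 4, 5, 6}  # shares the key fragments

for name, fp in (("cand_x", cand_x), ("cand_y", cand_y)):
    plain = tanimoto(reference, fp)
    hybrid = hybrid_similarity(reference, fp, key_fragments)
    print(f"{name}: plain={plain:.2f} hybrid={hybrid:.2f}")
```

Both candidates have the same plain Tanimoto similarity to the reference, but the hybrid score ranks the candidate that retains the key fragments higher.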