# Material and Methods

## Material

### Data set

The data set was retrieved from the literature. Among all of the binding data available for the specific targets, only measurements obtained with the human A2B adenosine receptor subtype cloned in HEK-293 cells, with [3H]DPCPX as the radiolabeled ligand, were considered. The structural data were rigorously curated and questionable data points were eliminated. Curation included removal of duplicates, detection of valence violations, ring aromatization, and standardization of tautomeric forms. Finally, to obtain a reliable data set, incomplete or unclear data were deleted. The final data set contains 413 xanthines and deazaxanthines.

### Molecular Descriptors

The 2D molecular descriptors available in the DRAGON (version 6) and MOE (version 2008.10) software were used in the present work. They include, for instance, purely topological descriptors, walk and path counts, connectivity indices, information indices, and 2D autocorrelations. Taking into account the structural diversity of the compounds, an initial subset of descriptors was computed for each molecule from its SMILES (Simplified Molecular Input Line Entry System) representation. After discarding descriptors with constant or near-constant values within each class, two final subsets of 403 and 146 molecular descriptors were obtained from DRAGON and MOE, respectively.
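The removal of constant or near-constant descriptors can be illustrated with a minimal Python sketch. The variance cut-off `min_variance`, the descriptor names, and the toy values are assumptions made for illustration only; the actual threshold used with DRAGON and MOE is not stated here, and in practice the filter would be applied within each class.

```python
from statistics import pvariance


def drop_near_constant(table, names, min_variance=1e-4):
    """Keep only descriptor columns whose variance exceeds min_variance.

    table: list of rows (one per molecule), each a list of descriptor values.
    names: descriptor names, one per column.
    min_variance: hypothetical cut-off, not taken from the source.
    """
    kept = [
        j for j in range(len(names))
        if pvariance([row[j] for row in table]) > min_variance
    ]
    filtered = [[row[j] for j in kept] for row in table]
    return filtered, [names[j] for j in kept]


# Toy example: three molecules, three made-up descriptors.
names = ["desc_a", "desc_b", "desc_c"]
table = [
    [1.0, 0.50, 3.2],
    [1.0, 0.50, 4.1],
    [1.0, 0.51, 2.7],
]
filtered, kept_names = drop_near_constant(table, names)
print(kept_names)
```

In this toy table, `desc_a` is constant and `desc_b` varies only marginally, so only `desc_c` survives the filter.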

## Methods

As mentioned in the introduction, the project consists of two main parts:

1. Sample interval selection

2. Classification and estimation of the activity range of each ligand


### Sample Interval Selection

To achieve the best results, two different types of interval were used.

*Determined interval.*

In this approach, the intervals are predetermined. The procedure is explained in the next section.

*Undetermined interval.*

A clustering method was used to group the samples. A principal application of clustering is the classification of compound databases into groups of similar compounds. Here, the clustering is based on the log activity calculated for all ligands. The k-means method was used, which partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean [12, 13].

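K-means clustering partitions the *n* observations into *k* (≤ *n*) clusters so as to minimize the within-cluster sum of squares. The objective function is:

$$J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$$

where $S_j$ is the set of observations assigned to cluster *j* and $\mu_j$ is the mean of the observations in $S_j$.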

The k-means algorithm proceeds as follows:

1. Choose k, the number of clusters.

2. Initialize the k means.

3. Assign each point to the nearest mean.

4. Update all means.

5. Return to step 3 and repeat until the assignments no longer change.
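The steps listed above can be sketched in Python for one-dimensional data such as log activities. This is a minimal illustration, not the implementation used in this work: the function names, the deterministic initialization from evenly spaced sorted values, and the toy activity values are all assumptions.

```python
def kmeans_1d(values, k, max_iter=100):
    """Plain k-means on scalar values (for example, log activities)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initialize the means with evenly spaced sorted values
    # (an assumed, deterministic choice; random starts are also common).
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: (v - means[j]) ** 2) for v in values
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update each mean from its current members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, labels


def within_cluster_ss(values, means, labels):
    """Within-cluster sum of squares, the quantity k-means minimizes."""
    return sum((v - means[lab]) ** 2 for v, lab in zip(values, labels))


# Made-up log activities forming three visible groups.
log_activity = [5.1, 5.3, 5.2, 6.8, 7.0, 6.9, 8.4, 8.6, 8.5]
means, labels = kmeans_1d(log_activity, k=3)
wcss = within_cluster_ss(log_activity, means, labels)
print(labels, means, wcss)
```

Because k-means only finds a local minimum of the objective, the outcome depends on the initialization; the evenly spaced start used here is one simple way to make the result reproducible.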

### Classification and Estimation of the Activity Range of Each Ligand

After the activity intervals were chosen, the ligands were classified into them. The one-against-all (OAA) method was used for this part.

*OAA*

The OAA model uses a system of M classifiers, where M is the number of classes. Each classifier is trained to define a discriminative boundary between one particular class of samples and all remaining ones. Every classifier is trained on the same data set but with different class labels, and all classifiers are trained independently. In each step, GA-LDA (a genetic algorithm combined with linear discriminant analysis) was used for classification [14, 15].
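The OAA scheme can be sketched as follows. This is an illustration under stated assumptions: a simple centroid-based linear scorer stands in for the GA-LDA classifier, and the function names, class labels, and descriptor values are hypothetical. What the sketch shows is the relabeling step, in which each of the M models sees the same samples split into "this class" and "all others".

```python
def centroid(rows):
    """Component-wise mean of a list of descriptor vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_oaa(X, y):
    """Train one binary model per class: that class against all others."""
    models = {}
    for c in sorted(set(y)):
        inside = [x for x, lab in zip(X, y) if lab == c]
        outside = [x for x, lab in zip(X, y) if lab != c]
        # Stand-in binary model: centroids of the class and of the rest.
        models[c] = (centroid(inside), centroid(outside))
    return models


def predict_oaa(models, x):
    """Return the class whose one-against-all model scores x highest."""
    def score(c):
        pos, neg = models[c]
        # Positive when x lies closer to the class than to the rest.
        return sq_dist(x, neg) - sq_dist(x, pos)
    return max(models, key=score)


# Toy data: two made-up descriptors, three activity classes.
X = [[0.1, 0.2], [0.2, 0.1], [1.0, 1.1], [1.1, 0.9], [2.0, 2.1], [2.1, 1.9]]
y = ["low", "low", "medium", "medium", "high", "high"]
models = train_oaa(X, y)
print(predict_oaa(models, [0.15, 0.15]))
print(predict_oaa(models, [1.0, 1.0]))
print(predict_oaa(models, [2.0, 2.0]))
```

Taking the highest score across the M models resolves cases where several models, or none of them, claim a sample.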

A genetic algorithm (GA) is an adaptive heuristic search algorithm based on the evolutionary ideas of natural selection and genetics. It mimics some of the processes observed in natural evolution. The idea behind GA is to use this power of evolution to solve optimization problems.
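A minimal sketch of a genetic algorithm for selecting a descriptor subset is given below. It is not the GA-LDA procedure of this work: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), all parameter values, and the toy fitness function are assumptions. In a GA-LDA setting the fitness would instead be the classification performance of LDA on the selected descriptors.

```python
import random


def tournament(pop, fitness, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def genetic_search(fitness, n_genes, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve binary masks (1 = descriptor selected) to maximize fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            cut = rng.randrange(1, n_genes)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]              # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical toy fitness: reward three "informative" descriptors,
# penalize every other selected descriptor.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(1 for i, g in enumerate(mask) if g and i in INFORMATIVE)
    extras = sum(1 for i, g in enumerate(mask) if g and i not in INFORMATIVE)
    return hits - 0.5 * extras


best, history = genetic_search(toy_fitness, n_genes=8)
print(best, history[-1])
```

Because the best mask of each generation is copied into the next one, the best fitness recorded in `history` never decreases.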

Linear discriminant analysis (LDA) is a method used in machine learning to find a linear combination of features that characterizes or separates two or more classes of objects. LDA is closely related to the Bayesian classifier. The distance to each class is usually calculated using the Euclidean distance as follows: