# NAIVE BAYES AND DATA INTEGRATION

One of the most powerful consequences of assuming that dimensions are independent is that we can develop classification models that don't require all the dimensions to follow the same probability model: we can use Poisson distributions for some dimensions, discrete distributions for others, Gaussians, Bernoulli, ... you get the idea. If we tried to model the correlation in two-dimensional data where one of the dimensions takes values 0, 1, 2, ..., and the other is any real number between −∞ and +∞, we would have a very difficult time finding an appropriate joint distribution. Since naive Bayes simply ignores these correlations, it gives us a very simple and powerful way to combine data of different types into a single prediction.
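To make this concrete, here is a minimal sketch of how independence lets us mix distributions: under naive Bayes, the joint log-likelihood of a mixed-type observation is just the sum of per-dimension log-likelihoods, each computed under its own model. The parameter values below are invented for illustration.

```python
import math

def log_gaussian(x, mu, sigma):
    # log N(x | mu, sigma^2) for a real-valued dimension
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_poisson(k, lam):
    # log Pois(k | lambda) for a count-valued dimension (k = 0, 1, 2, ...)
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_bernoulli(b, p):
    # log Bern(b | p) for a binary dimension
    return math.log(p if b else 1 - p)

def log_joint(x_real, x_count, x_bin, params):
    # Independence means the joint log-likelihood is a simple sum of
    # per-dimension terms, even though the dimensions follow different models.
    return (log_gaussian(x_real, *params["gauss"])
            + log_poisson(x_count, params["pois"])
            + log_bernoulli(x_bin, params["bern"]))

# Made-up parameters: one Gaussian, one Poisson, one Bernoulli dimension.
params = {"gauss": (0.0, 1.0), "pois": 2.0, "bern": 0.7}
score = log_joint(0.5, 3, True, params)
```

No joint distribution over the three types ever has to be written down; each dimension contributes its own term.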

A good example where naive Bayes classification has been used to integrate multiple types of data to make predictions is protein–protein interactions (Jansen et al. 2003). Here, the task is to figure out which pairs of proteins physically interact in the cell based on several types of data, such as gene expression patterns, presence of sequence motifs, subcellular localization, or functional annotation. Since gene expression patterns can be systematically measured, this data is typically in the form of a real number. On the other hand, protein localizations and functional annotations are usually discrete categories. Sequence motifs might occur 0, 1, 2, ... times, so they are natural numbers. If all these data are assumed to be independent, the parameters for each data type can be estimated independently, and all the features can be combined using a simple generative model (illustrated in Figure 10.7).

FIGURE 10.7 Graphical models representation of data integration. This simple generative model can be used to integrate different types of data. A naive Bayes classifier can then be used to predict protein interactions based on multiple genomic measurements such as correlation of gene expression patterns (Expr.), shared subcellular localization (Loc.), and the presence of compatible motifs in the primary amino acid sequence (Motifs). The interaction for the new pair (n + 1) is assumed to be unobserved, but the data of each type is observed.
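A sketch of this kind of generative model in code: each data type (real-valued expression correlation, binary shared localization, count of compatible motifs) gets its own class-conditional distribution, and Bayes' rule combines them into a posterior over "interact" versus "no interact". All parameter and prior values here are invented for illustration, not estimates from real genomic data.

```python
import math

def log_gauss(x, mu, sigma):
    # Real-valued feature, e.g. correlation of expression patterns
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_pois(k, lam):
    # Count-valued feature, e.g. number of compatible sequence motifs
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_bern(b, p):
    # Binary feature, e.g. whether the pair shares a subcellular localization
    return math.log(p if b else 1 - p)

# Class-conditional parameters, estimated independently for each data type
# (values are hypothetical).
CLASSES = {
    "interact":    {"prior": 0.1, "expr": (0.6, 0.2), "loc": 0.8, "motifs": 1.5},
    "no_interact": {"prior": 0.9, "expr": (0.0, 0.3), "loc": 0.3, "motifs": 0.2},
}

def posterior(expr_corr, same_loc, n_motifs):
    # Bayes' rule with the naive-independence likelihood: the class score is
    # the log prior plus one log-likelihood term per data type.
    logp = {}
    for c, p in CLASSES.items():
        logp[c] = (math.log(p["prior"])
                   + log_gauss(expr_corr, *p["expr"])
                   + log_bern(same_loc, p["loc"])
                   + log_pois(n_motifs, p["motifs"]))
    # Normalize in log space for numerical stability
    z = max(logp.values())
    total = sum(math.exp(v - z) for v in logp.values())
    return {c: math.exp(v - z) / total for c, v in logp.items()}

# A pair with highly correlated expression, shared localization, and two
# compatible motifs: all three data types point toward interaction.
post = posterior(expr_corr=0.7, same_loc=True, n_motifs=2)
```

Because each data type contributes an additive log term, a new genomic feature can be folded in simply by estimating its own class-conditional distribution and adding one more term to the sum.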