# Bayesian Network Classifiers: Problem Formulation

Although Bayesian network classifiers have been shown to give accurate results in a transportation context (Janssens et al., 2004; Torres & Huber, 2003), their Achilles' heel is the set of decision rules that can be derived from the network. First, as mentioned above, Bayesian networks link more variables in sometimes complex, direct and indirect ways, which makes interpretation more problematic. Second, every decision rule used to predict a particular dependent variable within a network contains the same number of conditions, which can result in sub-optimal decision-making.

To illustrate this, the procedure for transforming a Bayesian network into a decision table (i.e. a rule-based form) is shown in Figure 7.2. We chose to make this transition because decision rules and the corresponding decision table formalism have some advantageous properties. First, a decision table is exclusive, consistent and complete. Traditional production systems do not guarantee these properties, which gives decision tables a clear advantage for any modelling purpose. Second, the decision table provides a suitable formalism for representing various types of interactions between variables, such as conditional relevance and conceptual interaction. For an extensive review, we refer to Wets (1998).

The left part of Figure 7.2 is an example of a pruned network, that is, a network reduced in size such that the loss of accuracy on the dependent variable for the unseen test data is limited. The variables in the network are represented as boxes, and each state is shown with its belief level (probability), expressed both as a percentage and as a bar chart. In the middle part of the figure, evidence is entered for every independent variable (see Section 3.3), which results in a probability distribution of the target variable. This process is repeated for every possible combination of states of the independent variables. A variable for which evidence has been entered is shown in the figure as a shaded box with a 100% belief in the observed state. As already introduced in Section 3.4, the direction of the arcs is preferably interpreted as an association rather than as a causal relationship. This means that not only child nodes but also parent nodes can influence the probability distribution of the dependent variable. For this reason, evidence needs to be entered for every independent variable, regardless of whether the variable is a child or a parent node. The one exception is d-separation. We will not elaborate on this concept in this chapter, but when an independent variable is d-separated from the dependent variable given the evidence already entered, entering evidence for that variable has no effect on the dependent variable. More information can be found in Pearl (1988) and in Geiger and Pearl (1988).
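The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the network of Figure 7.2: the three-node structure `A -> T -> B`, the state names and all probabilities are hypothetical. It shows evidence being entered for every combination of states of the independent variables, with both a parent (`A`) and a child (`B`) of the dependent variable `T` influencing the resulting distribution.

```python
from itertools import product

# Hypothetical network A -> T -> B, where T is the dependent variable.
# All states and probabilities below are invented for illustration only.
STATES = {"A": ["a0", "a1"], "B": ["b0", "b1"], "T": ["t0", "t1"]}
P_T_GIVEN_A = {"a0": {"t0": 0.8, "t1": 0.2}, "a1": {"t0": 0.3, "t1": 0.7}}
P_B_GIVEN_T = {"t0": {"b0": 0.9, "b1": 0.1}, "t1": {"b0": 0.4, "b1": 0.6}}


def posterior(a, b):
    """Return P(T | A=a, B=b), which is proportional to P(T | a) * P(b | T).

    P(A=a) is a constant once A is observed, so it cancels on normalisation.
    """
    unnormalised = {t: P_T_GIVEN_A[a][t] * P_B_GIVEN_T[t][b] for t in STATES["T"]}
    total = sum(unnormalised.values())
    return {t: p / total for t, p in unnormalised.items()}


def decision_table():
    """Enter evidence for every combination of states and record one rule each."""
    rows = []
    for a, b in product(STATES["A"], STATES["B"]):
        dist = posterior(a, b)
        # The action of the rule is the most probable state of T.
        rows.append({"A": a, "B": b, "dist": dist, "action": max(dist, key=dist.get)})
    return rows


if __name__ == "__main__":
    for row in decision_table():
        print(row)
```

Each row of the returned list corresponds to one column of the decision table: the evidence entered, the resulting probability distribution and the predicted state.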

Figure 7.2: Calculating probability distributions and entering them in a decision table.

As can be seen from this figure, every rule contains the same number of condition variables for this particular network; in the example shown here, this number is 4. Moreover, the number of rules derived from the network is fixed and can be determined in advance for a particular network (i.e. per dependent variable): it equals the number of possible combinations of states (values of the condition variables). The total number of rules that has to be derived from the network shown in Figure 7.2 is therefore 5 × 7 × 2 × 4 = 280, assuming that the with-whom attribute is taken as the class attribute. This number is likely to become extremely large, especially when more nodes are incorporated. While this need not be a problem as such, a number of these decision rules will be redundant because they will never be 'fired'. This flaw has no influence on the total accuracy of each Bayesian network classifier (see Janssens et al., 2004), but it is clearly a sub-optimal solution, not only because some of the rules will never be used, but also because the large number of conditions does not favour interpretation. Decision trees do not suffer from this problem. In a decision tree, the 'depth' of the tree determines only the maximum number of conditions used in a decision rule; this is an upper bound, not a fixed number. Section 7.2 elaborates on this in more detail.
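The rule count can be checked with a short sketch. The state counts 5, 7, 2 and 4 are taken from the example above; the variable names (`cond_1` to `cond_4`), the extra six-state variable and the helper functions are assumptions introduced purely for illustration.

```python
from itertools import product
from math import prod

# State counts of the four condition variables in the example (5, 7, 2, 4).
# The names are placeholders; only the counts come from the text.
CARDINALITIES = {"cond_1": 5, "cond_2": 7, "cond_3": 2, "cond_4": 4}


def rule_count(cardinalities):
    """Number of rules = product of the state counts of the condition variables."""
    return prod(cardinalities.values())


def all_rules(cardinalities):
    """Enumerate every combination of state indices, one tuple per rule."""
    return list(product(*(range(k) for k in cardinalities.values())))


def never_fired(cardinalities, observations):
    """Rules whose condition combination does not occur in the observations."""
    return set(all_rules(cardinalities)) - set(observations)


n_rules = rule_count(CARDINALITIES)  # 5 * 7 * 2 * 4 = 280
# Adding one hypothetical six-state variable multiplies the count by six.
n_with_extra = rule_count({**CARDINALITIES, "cond_5": 6})  # 1680
```

Passing a list of observed state combinations to `never_fired` returns the rules that no observation would trigger, which is one way to make the redundancy discussed above concrete.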

These considerations led to the idea of integrating both techniques into a new classifier. The reasons mentioned in this section point to the possibility of combining the advantage of Bayesian networks (taking the interdependencies among variables into account) with the advantage of decision trees (deriving easily understandable and flexible, i.e. non-fixed, decision rules). The reason mentioned before, namely dealing with the variable masking problem in decision trees, adds a further motivation.