Constructing Attributes from Observed Characteristics

The standard correction for fitting a non-linear relationship between two variables is to break down continuous variables, such as the number of debit days or the age of accounts, into several discretized variables. An example of such a discretization is shown in Table 41.1. The intervals are selected empirically according to the gap between the frequencies of defaulters and non-defaulters.

Table 41.1 shows that accounts with more than 22 debit days over a 6-month period represent 11% of all accounts and 61% of all defaults, with an average default frequency of 6.36%. Conversely, 63% of all accounts have zero debit days over the same reference period and account for 19% of all defaults, with a much lower default frequency of 0.36%.

TABLE 41.1 Discrimination of continuous variables

The optimum discretization is determined empirically; in this case, it is the one shown in Table 41.1. It consists of dividing the number of debit days into four binary variables taking the values 0 or 1 according to whether the number of debit days is within the above intervals. In other words, each of these four binary variables is defined as an indicator function:

The indicator function means that the first variable is equal to 1 when n = 0 and zero otherwise, where n is the number of debit days. The three other intervals define the remaining variables, using the indicator function (noted 1):

With such variables, we can now construct the Logit model:

The Logit model becomes p(y) = 1/[1 + exp(-F)], where F is the linear function of discretized variables as above.
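The discretization and the Logit transform can be illustrated with a minimal Python sketch. Only the "zero debit days" and "above 22 days" intervals come from the text; the cut at 10 days, the function names, and the coefficients are assumptions chosen for illustration, not fitted values.

```python
import math

def debit_day_indicators(n):
    """Return four 0/1 indicators for the number of debit days n.

    The first and last intervals follow the text; the boundary at
    10 days is a hypothetical cut, not a value from Table 41.1.
    """
    return [
        1 if n == 0 else 0,
        1 if 1 <= n <= 10 else 0,
        1 if 11 <= n <= 22 else 0,
        1 if n > 22 else 0,
    ]

def logit_probability(indicators, coefficients):
    """p = 1 / (1 + exp(-F)), with F a linear function of the indicators."""
    f = sum(b * x for b, x in zip(coefficients, indicators))
    return 1.0 / (1.0 + math.exp(-f))

# Made-up coefficients, one per interval (assumption, not estimates).
EXAMPLE_COEFFICIENTS = [-5.6, -4.5, -3.5, -2.7]

for n in (0, 5, 15, 30):
    x = debit_day_indicators(n)
    print(n, x, round(logit_probability(x, EXAMPLE_COEFFICIENTS), 4))
```

Because the intervals are mutually exclusive, exactly one indicator equals 1 for any account, so F reduces to the coefficient of the interval the account falls into.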

Fitting and Back Testing Scoring Models

The model should be fitted to several samples of the population. A first sample provides a first fit. Then an out-of-sample fit, using another sample of the population, checks that the same model provides similar results. In-sample and out-of-sample fits are conducted over the same reference period.
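As a rough illustration of comparing two samples, the sketch below computes the default frequency per interval in each sample and reports the largest gap. It is a simple stability check under assumed inputs, not a full Logit fit; the interval labels, the toy data, and the function names are hypothetical.

```python
def bin_default_frequencies(sample):
    """sample: list of (interval_label, defaulted) pairs.

    Returns a dict mapping each interval label to its default frequency.
    """
    counts, defaults = {}, {}
    for label, defaulted in sample:
        counts[label] = counts.get(label, 0) + 1
        defaults[label] = defaults.get(label, 0) + (1 if defaulted else 0)
    return {label: defaults[label] / counts[label] for label in counts}

def max_gap(in_sample, out_sample):
    """Largest absolute difference in default frequency over shared intervals."""
    fit_in = bin_default_frequencies(in_sample)
    fit_out = bin_default_frequencies(out_sample)
    shared = set(fit_in) & set(fit_out)
    return max(abs(fit_in[b] - fit_out[b]) for b in shared)

# Toy data (made up): two samples observed over the same reference period.
in_sample = [("zero", False)] * 9 + [("zero", True)] + [("high", True)] * 2 + [("high", False)] * 2
out_sample = [("zero", False)] * 8 + [("zero", True)] * 2 + [("high", True)] * 2 + [("high", False)] * 2

print(bin_default_frequencies(in_sample))
print(bin_default_frequencies(out_sample))
print(max_gap(in_sample, out_sample))
```

A small gap suggests that the two samples tell the same story; what counts as "small" is a judgment call that the sketch does not settle.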

In-sample and out-of-sample fits differ from back tests, which compare what the model predicts with what has happened in reality, using historical data on defaults or a "before and after" comparison. Any scoring model predicts p(y). When performing a back test, we should compare the predictions of the model for sub-groups of the population with historical default frequencies. In theory, there should be a good match, in terms of ranks, between the score and such historical frequencies of defaults. In practice, this is not always the case.
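The rank comparison in a back test can be sketched as follows. The function checks whether sub-groups ordered by predicted probability are also ordered by historical default frequency; the function name and the example figures are assumptions made for illustration.

```python
def rank_order_matches(predicted, observed):
    """Return True if ordering sub-groups by predicted probability
    also orders their observed default frequencies non-decreasingly.

    predicted, observed: sequences of equal length, one entry per sub-group.
    """
    pairs = sorted(zip(predicted, observed))
    ordered_observed = [obs for _, obs in pairs]
    return all(a <= b for a, b in zip(ordered_observed, ordered_observed[1:]))

# Made-up sub-group figures: model output versus historical frequencies.
predicted = [0.01, 0.05, 0.20]
consistent_history = [0.004, 0.03, 0.15]
inconsistent_history = [0.03, 0.004, 0.15]

print(rank_order_matches(predicted, consistent_history))
print(rank_order_matches(predicted, inconsistent_history))
```

The check only looks at ranks, in line with the text; it says nothing about how close the predicted levels are to the observed ones.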

The process for mapping scores to default probabilities involves several steps. First, the population is divided into sub-groups according to ranges of values of p(y) ranked in ascending order. Then, for each group, a historical default frequency is calculated. For calculating a default frequency, a starting date is defined. At this date, we have only sound individuals. The default frequency DF over a period, t, is the ratio of the count of defaults to the initial number of individuals as of the starting date 0, or DF(0, t). To obtain a 1-year default frequency, t should be set to one year.

DF(0, t) = number of defaults within [0, t] / initial number of individuals
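The grouping step and the ratio above can be sketched in Python. The score boundaries, the toy population, and the function name are hypothetical; each individual is assumed to be sound at date 0 and flagged according to whether a default occurred within [0, t].

```python
import bisect

def default_frequencies_by_group(individuals, upper_bounds):
    """Compute DF(0, t) for each score range.

    individuals: list of (score, defaulted_within_horizon) pairs,
                 all sound at the starting date.
    upper_bounds: ascending upper bounds of the score ranges;
                  the last group is open-ended.
    Returns one default frequency per group (None for an empty group).
    """
    n_groups = len(upper_bounds) + 1
    counts = [0] * n_groups
    defaults = [0] * n_groups
    for score, defaulted in individuals:
        group = bisect.bisect_left(upper_bounds, score)
        counts[group] += 1
        if defaulted:
            defaults[group] += 1
    return [d / c if c else None for d, c in zip(defaults, counts)]

# Toy population (made up): score p(y) and default flag over the horizon.
population = [(0.005, False), (0.008, False), (0.03, True),
              (0.04, False), (0.20, True), (0.30, True)]
print(default_frequencies_by_group(population, [0.01, 0.05]))
```

The denominator is the head count at the starting date, so individuals who default during the period remain in the count of their group.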

Assigning default frequencies to scores requires a "mapping" of scores and default frequencies to a master scale valid across segments. This master scale is simply a scale of ranges of default probabilities. The mapping process should ensure that the default frequency derived from scores increases monotonically when moving along the scale from low-risk obligors to high-risk obligors. The mapping issue is discussed in the last section of this chapter.
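A mapping to a master scale might look like the sketch below. The grades and their probability bounds are invented for illustration and do not come from the chapter; the two default frequencies used in the example (0.36% and 6.36%) are the ones quoted earlier for debit days.

```python
import bisect

# Hypothetical master scale: (grade, upper bound of default probability).
MASTER_SCALE = [("A", 0.001), ("B", 0.005), ("C", 0.02), ("D", 0.08), ("E", 1.0)]

def grade_for(default_frequency):
    """Return the master-scale grade whose range contains the frequency."""
    if not 0.0 <= default_frequency <= 1.0:
        raise ValueError("default frequency must lie in [0, 1]")
    bounds = [upper for _, upper in MASTER_SCALE]
    index = bisect.bisect_left(bounds, default_frequency)
    return MASTER_SCALE[index][0]

def is_monotonic(frequencies):
    """True if frequencies never decrease from low-risk to high-risk groups."""
    return all(a <= b for a, b in zip(frequencies, frequencies[1:]))

group_frequencies = [0.0036, 0.01, 0.0636]  # ordered from low to high score
print([grade_for(df) for df in group_frequencies])
print(is_monotonic(group_frequencies))
```

If the monotonicity check fails, the score ranges would need to be regrouped before assigning grades; how to do that is left open here.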
