Welcome to our deep dive into Weight-of-Evidence, variable selection, credit modelling and more.

In our previous post we defined Weight-of-Evidence (WoE), described how it is used as a feature transform, and explored a simple example.

In this post we describe the why. When we understand the why, we’ll be able to apply WoE with more understanding, hopefully leading to better models and more informed decisions.

In the previous post we defined the WoE as:

$$\mbox{WoE} = \ln \left( \frac{\mbox{good distribution}}{\mbox{bad distribution}}\right).$$
This makes sense for credit risk modelling because, with this definition, positive values of WoE indicate a ‘good’ loan.

Since in this post we need to be more precise and more general, it might be useful to unpack this definition. What happens here is that we take the $$\ln$$ of the ratio of the distributions of two classes, namely ‘good’ and ‘bad’. In credit modelling ‘bad’ is usually indicated by the random variable $$Y=1$$ because this is the quantity that is predicted by the model. In general, this is referred to as the positive class. The equivalent, mathematical formulation of the WoE for the single variable $$x$$, is therefore,
$$\mbox{WoE} = \ln\left( \frac{P(x|Y=0)}{P(x|Y=1)}\right).$$

In this post we take a more general approach and discuss WoE and logistic regression as a general method for developing explainable, binary classification models. In this more general situation it is more convenient to define the WoE in terms of the positive class as,
$$\mbox{WoE} = \ln\left( \frac{P(x|Y=1)}{P(x|Y=0)}\right).$$
This is the definition that will be used henceforth.
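To make the definition concrete, here is a minimal sketch that computes the WoE of each bin from class counts. The bin counts and the function name `woe_per_bin` are made up for illustration, and the sketch assumes that no bin has a zero count in either class.

```python
import math

def woe_per_bin(pos_counts, neg_counts):
    """WoE for each bin: ln( P(bin | Y=1) / P(bin | Y=0) )."""
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    return [
        math.log((p / total_pos) / (n / total_neg))
        for p, n in zip(pos_counts, neg_counts)
    ]

# Hypothetical counts for three bins of one variable (illustrative only).
bad_counts = [30, 50, 20]      # Y = 1, the positive class
good_counts = [200, 500, 300]  # Y = 0

woe = woe_per_bin(bad_counts, good_counts)
print([round(w, 3) for w in woe])
```

Swapping the two arguments gives the credit-risk convention of the previous post (‘good’ over ‘bad’), which simply flips the sign of every WoE value.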

Let’s dive a little deeper into WoE, using a principled approach.

In general we want to calculate the probability, $$P(Y=1|\mathbf{X})$$, given the values of the variables $$\mathbf{X}$$, with $$\mathbf{X} = \left[ x_1, \ldots, x_p\right]$$.

The log-odds (lo) is the log of the ratio of the probabilities of the two classes $$Y=1$$ and $$Y=0$$, given by,

$$\mbox{lo} = \ln\left( \frac{P(Y=1|\mathbf{X})}{P(Y=0|\mathbf{X})}\right).$$
This is a general definition and only involves the ratio of the probabilities of the two classes. Using Bayes’ rule, we can reformulate it as,

\begin{aligned} \mbox{lo} &= \ln\left(\frac{f(\mathbf{X}|Y=1) P(Y=1) / f(\mathbf{X})} {f(\mathbf{X}|Y=0) P(Y=0)/f(\mathbf{X})} \right) \\ &= \ln\left(\frac{P(Y=1)}{P(Y=0)}\right) + \ln\left(\frac{f(\mathbf{X}|Y=1) }{f(\mathbf{X}|Y=0) } \right). \end{aligned}

In the expression above, the first term is the prior log-odds, i.e. the log of the ratio of the probabilities of the two classes before any observation is made. The second term is the log of the ratio of the two class probability distributions.

Compare this with the WoE definition, which is also the log of the ratio of the class probability distributions, but for a single variable.

If we make the Naive Bayes assumption that all the variables are statistically independent within each class, then the joint class probability distributions factorise and the log-odds is reduced to,

\begin{aligned} \mbox{lo} &= \ln\left(\frac{P(Y=1)}{P(Y=0)}\right) + \sum_{j=1}^p \ln\left(\frac{f(x_j|Y=1) }{f(x_j|Y=0) } \right) \\ &= \ln\left(\frac{P(Y=1)}{P(Y=0)}\right) + \sum_{j=1}^p w_j. \end{aligned}
Note that $$w_j$$ is the WoE of the j-th variable defined as before,
$$w_j = \ln\left(\frac{f(x_j|Y=1) }{f(x_j|Y=0) } \right).$$
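The factorisation can be checked numerically. The sketch below uses a made-up class prior and made-up per-class bin probabilities for two binned variables, builds the joint distribution under the independence assumption, and compares the log-odds from Bayes’ rule with the prior log-odds plus the sum of the WoE values. All numbers and function names are illustrative assumptions.

```python
import math

# Hypothetical class prior and per-class bin probabilities for two variables.
prior = {1: 0.1, 0: 0.9}
f1 = {1: [0.7, 0.3], 0: [0.4, 0.6]}   # f(x_1 = bin | Y)
f2 = {1: [0.2, 0.8], 0: [0.5, 0.5]}   # f(x_2 = bin | Y)

def exact_log_odds(b1, b2):
    """Log-odds from Bayes' rule, with the joint built under independence within each class."""
    joint = {y: prior[y] * f1[y][b1] * f2[y][b2] for y in (0, 1)}
    return math.log(joint[1] / joint[0])

def woe_log_odds(b1, b2):
    """Prior log-odds plus the sum of the per-variable WoE values."""
    w1 = math.log(f1[1][b1] / f1[0][b1])
    w2 = math.log(f2[1][b2] / f2[0][b2])
    return math.log(prior[1] / prior[0]) + w1 + w2

for b1 in (0, 1):
    for b2 in (0, 1):
        print(b1, b2, round(exact_log_odds(b1, b2), 6), round(woe_log_odds(b1, b2), 6))
```

The two columns agree for every combination of bins, which is exactly what the expression above says.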

Summary

The key take-away of this section is that the Naive Bayes assumption leads to a particularly simple and interpretable expression for the log-odds. In the next section we’ll derive a second expression for the log-odds that will neatly tie WoE to logistic regression.

Logistic Regression

Logistic regression is, for many, the classifier of choice for binary classification. Even if a deep neural network is used for binary classification, the final layer may well be the sigmoid function. Conceptually, all that the neural layers below the classifier do is perform a sequence of nonlinear transformations on the input data in order to maximally separate the two classes.

Next, we’ll explore the intimate relationship between WoE and logistic regression.

Logistic regression calculates the probability (in our credit modelling example, this will be the PD, or Probability of Default) of a sample, using the sigmoid function,
$$P(Y=1|\mathbf{W}) = \frac{e^{\beta_0 + \beta_1 w_1 + \cdots + \beta_p w_p} }{ 1 + e^{\beta_0 + \beta_1 w_1 + \cdots + \beta_p w_p }},$$
where the calculated WoE values are passed to the sigmoid function as input features.

Things to note:

1. The PD is calculated using the WoEs of the 𝑝 variables.
2. The parameters, $$\beta_0, \beta_1, \ldots, \beta_p$$, are calculated by maximising the log-likelihood. This can be done using your favourite machine learning library.

We can again calculate the log-odds after applying the sigmoid function to get:
$$\ln \left( \frac{P(Y=1|\mathbf{W})}{P(Y=0|\mathbf{W})} \right) = \beta_0 + \beta_1 w_1 + \cdots + \beta_p w_p.$$
If we compare this with the Naive Bayes expression for the log-odds that we found above, you’ll notice something interesting. The Naive Bayes assumption (reminder: assuming that all the variables are statistically independent) implies the following choice for the logistic regression coefficients,

\begin{align} \beta_0 &= \ln \left( \frac{P(Y=1)}{P(Y=0)} \right) \quad \mbox{(the prior log-odds)}\\ \beta_1 &= \beta_2 = \cdots = \beta_p = 1. \end{align}

Therefore, if the Naive Bayes assumption holds, we don’t even need to train a logistic regression classifier. We already know what the coefficients should be!

Of course, Naive Bayes is a strong assumption and will rarely, if ever, hold in practice. So we’ll almost always train a logistic regression classifier anyway, so that we can take variable dependencies into account.
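As a small sketch of the model described above, the function below evaluates the sigmoid on WoE inputs. By default it uses the Naive Bayes choice of unit coefficients; passing `betas` mimics what a trained classifier would do. The prior, the WoE values and the "fitted" coefficients are invented for illustration and do not come from any real model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(woe_values, beta_0, betas=None):
    """P(Y=1 | W) for one sample; betas default to 1 (the Naive Bayes choice)."""
    if betas is None:
        betas = [1.0] * len(woe_values)
    z = beta_0 + sum(b * w for b, w in zip(betas, woe_values))
    return sigmoid(z)

prior_pos = 0.05                                   # hypothetical share of the positive class
beta_0 = math.log(prior_pos / (1.0 - prior_pos))   # prior log-odds
woe_values = [0.5, -0.2, 0.9]                      # hypothetical WoE values of one sample

p_nb = predict_proba(woe_values, beta_0)
p_fit = predict_proba(woe_values, beta_0, betas=[0.7, 1.1, 0.6])  # made-up fitted coefficients
print(round(p_nb, 4), round(p_fit, 4))
```

Taking the log-odds of either output recovers the linear expression in the WoE values, so each coefficient can be read as the weight the model gives to that variable’s evidence.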

These equations also tell us something important. In credit modelling, the prior log-odds is the log of the  ratio of the bad loans to the good loans in a portfolio. Any financial institution prefers this ratio to be small, i.e. the prior log-odds to be large negative (you wouldn’t be very successful if you had many bad loans on your book!). This serves as a strong Bayes prior. If no evidence is provided in the form of the WoE data, the prior will assign a ‘good’ predictive label. As a result, the evidence (in terms of WoE values) will need to overcome this large prior before a ‘bad’ predictive label is assigned.
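To get a feel for the size of this prior, the sketch below computes the prior log-odds for a few hypothetical portfolio bad rates. With unit coefficients, the predicted probability only passes 0.5 once the summed WoE exceeds the negative of the prior log-odds. The bad rates are assumptions chosen for illustration.

```python
import math

def prior_log_odds(bad_rate):
    return math.log(bad_rate / (1.0 - bad_rate))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for bad_rate in (0.5, 0.1, 0.02):   # hypothetical portfolio bad rates
    b0 = prior_log_odds(bad_rate)
    # Columns: bad rate, prior log-odds, PD with no evidence, summed WoE needed to reach PD = 0.5.
    print(bad_rate, round(b0, 3), round(sigmoid(b0), 3), round(-b0, 3))
```

With no evidence at all the predicted PD is just the portfolio bad rate, and the lower that rate, the more positive WoE is needed before a ‘bad’ label is assigned.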

Let’s for a moment assume that we have done a really good job in selecting independent variables so that the logistic regression coefficients are all 1. This means that every positive WoE value nudges our belief away from the prior, i.e. pushing the prediction away from a ‘good’ predictive label towards a ‘bad’ predictive label. In terms of a credit score, where higher means more credit-worthy, positive WoE values lower the score.

Summary

Although the Naive Bayes assumption is a strong assumption, it is sometimes unavoidable in order to make the problem computationally tractable. This assumption leads to  particularly simple expressions for the coefficients and intercept of the logistic regression model. It is illuminating that these expressions are the same for all the variables and do not depend on the details of the binning.

In order to relax the Naive Bayes assumption somewhat, and still retain explainability, the coefficients of logistic regression are re-estimated from the training data in an attempt to capture some of the dependencies between the variables. Note that the expression for the intercept is not affected by the Naive Bayes assumption.

Thus, if care is taken in the selection of the variables, then the model is a linear combination of the WoE values. This is interpretable since it is possible to explain in detail how the different variables contribute towards the final prediction.

But the story doesn’t end here.

Variable selection

It should be clear by now that any modeller will spend much time on selecting suitable variables. But what does suitable even mean? There are two main things to consider:

1. The individual variables should be able to distinguish between the two classes, e.g.  ‘good’ and ‘bad’ loans. We can also think in terms of the predictive power of an individual variable. We’ll investigate this idea in more detail later.
2. The selected variables will be combined to develop the logistic regression model. This requires that they work well together as a predictive “team”. As mentioned earlier, ideally they would be statistically independent. Regardless, the WoE values of each of the selected variables should individually nudge the credit score in the right direction. The stronger the prior, the more nudging will be required.

In credit modelling the variable selection procedure is roughly the following:

1. Start from hundreds, sometimes thousands, of variables. The most promising variables can be selected using specific measures (we’ll explore one of these, called Information Value (IV), a little bit later). Since the number of starting variables is so high, some kind of automatic variable selection procedure is essential.
2. Once the number of variables has been reduced to a manageable number, let’s say around 50, these remaining variables are carefully investigated to reduce the number of variables further. This involves adjusting the bins (read our first post on WoE if this sounds unfamiliar) and determining the predictive power of the individual variables.
3. Once the number of variables has been reduced further, let’s assume we now have around 15, we then investigate these variables as a predictive, cooperative “team”, i.e. how well they combine to create an accurate predictive model. At this step, the variables are often selected according to their statistical independence. One method of achieving this is with clustering, where we can use an algorithm to cluster variables into correlated groups. We can also calculate additional measures, like the Variance Inflation Factor (VIF), which gives a quantitative measure of how well a variable can be explained by the other variables in the selection, so that a high VIF flags a largely redundant variable (we’ll likely cover these in detail in a future post).
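As a rough illustration of the last step, here is a deliberately simplified stand-in for the clustering and VIF approaches mentioned above: a greedy filter that walks through the variables in descending IV order and skips any variable that is too strongly correlated with one already kept. This is not the procedure described in the steps, just a sketch of the idea; the column values, the IVs and the 0.7 threshold are all made up.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def greedy_select(columns, iv, max_abs_corr=0.7):
    """Keep variables in descending IV order, skipping any that is too
    correlated with a variable that has already been kept."""
    kept = []
    for name in sorted(columns, key=lambda k: iv[k], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept

# Hypothetical WoE-transformed columns (one value per sample) and their IVs.
columns = {
    "age":     [0.4, -0.2, 0.1, -0.5, 0.3, 0.0],
    "age_alt": [0.5, -0.1, 0.1, -0.6, 0.2, 0.1],   # nearly a copy of "age"
    "income":  [-0.3, 0.2, 0.4, -0.1, -0.2, 0.3],
}
iv = {"age": 0.25, "age_alt": 0.20, "income": 0.15}

print(greedy_select(columns, iv))
```

Here the near-duplicate of ‘age’ is dropped while ‘income’ survives, because it carries different information.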

Investigating the WoE trend

Let’s return to the ‘age’ data we examined previously and look in more detail at its WoE values. The figure below shows the WoE for each bin as a blue line.

The screenshot is from our very own Praexia 🚀, which is a cloud-based binary classification modelling tool that uses Weight-of-Evidence.

Note: Since this is a credit risk modelling example, in the figure below the WoE is again defined as the log of the ‘good’ distribution over the ‘bad’ distribution. This means that positive WoE values indicate a ‘good’ loan and negative values a ‘bad’ loan.

From the figure above, we can make a few observations. The WoE data for this population indicates that younger individuals, up to about 28 years old, as well as those over 49, tend to be ‘bad’ (as indicated by their negative WoE values). There is also some fluctuation in the WoE trend, which is something we want to avoid. More severe fluctuations should either be smoothed out by changing the binning or, if that is not successful, the variable should be discarded entirely from the selection.

Why do we try to avoid fluctuations? These fluctuations in the WoE values tend to destabilise the model. Said differently, the model predictions can become very sensitive to small changes in the data, as you “hop” out of one bin and into another.
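One simple way to put a number on such fluctuations is to count how often the WoE trend changes direction from one bin to the next. The helper below is a hypothetical diagnostic written for this post, not a feature of any particular tool, and both WoE sequences are invented.

```python
def count_trend_reversals(woe_values):
    """Number of times the WoE trend changes direction across consecutive bins."""
    # Differences between neighbouring bins, ignoring flat steps.
    diffs = [b - a for a, b in zip(woe_values, woe_values[1:]) if b != a]
    # A reversal is a sign change between two successive differences.
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)

smooth = [-0.6, -0.2, 0.1, 0.3, 0.1, -0.4]   # rises, then falls
jagged = [-0.6, 0.2, -0.1, 0.3, -0.2, 0.1]   # zig-zags from bin to bin

print(count_trend_reversals(smooth), count_trend_reversals(jagged))
```

A single reversal, as in the rise-then-fall shape, can be perfectly reasonable; many reversals suggest that the bins should be merged or the variable reconsidered.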

Information Value (IV)

Time to talk about Information Value. In the title of the figure above there is a quantity called IV. This is the Information Value associated with the variable. In general the IV is a quantitative measure of the difference between two probability distributions, say $$\mathbf{p} = [p_1, \ldots, p_n]$$ and $$\mathbf{q} = [q_1, \ldots, q_n]$$. As a reminder, the components of the two distributions are always positive and add up to 1. The IV is then defined as,
$$\mbox{IV} = \sum_{j=1}^n \left( p_j - q_j \right) \ln \frac{p_j}{q_j}.$$
This is a symmetric version of the Kullback-Leibler divergence. It is easy to see that the IV is always nonnegative, since $$p_j - q_j$$ and $$\ln (p_j/q_j)$$ always have the same sign. Moreover, it is zero if and only if the two distributions are identical. In our credit modelling example the two distributions are the ‘good’ and ‘bad’ distributions.

The IV of a variable is a measure of how well that variable distinguishes between the two classes, in the sense of how much the two class probability distributions differ.
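The properties of the IV are easy to verify on a small example. The sketch below computes the IV of two made-up three-bin distributions and compares it with the sum of the two Kullback-Leibler divergences; the function names are our own.

```python
import math

def information_value(p, q):
    """IV = sum_j (p_j - q_j) * ln(p_j / q_j); assumes all entries are positive."""
    return sum((pj - qj) * math.log(pj / qj) for pj, qj in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

# Hypothetical 'bad' and 'good' distributions over three bins.
p = [0.3, 0.5, 0.2]
q = [0.2, 0.5, 0.3]

iv = information_value(p, q)
print(round(iv, 4))
print(round(kl_divergence(p, q) + kl_divergence(q, p), 4))
```

Both lines print the same number, swapping `p` and `q` leaves the IV unchanged, and the IV of a distribution with itself is zero.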

In credit modelling, the IV is used as an indicator of the ability of the variable to separate the ‘good’ examples from the ‘bad’. There is a widely accepted rule-of-thumb for how the value of a variable is reflected by its IV, see for example Naeem Siddiqi. This is shown in the table below:

The Gini value

The other quantity shown in the title of the ‘age’ example above, is the gini value. The gini value gives an indication of the predictive power of that variable. This means how well the variable, all by itself, is able to distinguish between the two classes. In order to calculate the gini value for a variable, a logistic regression  classifier is built for that variable. For a single variable, the logistic classifier is given by:

$$P(Y=1|x) = \frac{e^{\beta_0 + \beta_1 w} }{ 1 + e^{\beta_0 + \beta_1 w }},$$
where $$w$$ is the WoE value of the variable. No need to estimate the two parameters, we already know them:

\begin{align} \beta_0 &= \ln \left( \frac{P(Y=1)}{P(Y=0)} \right) \quad \mbox{(the prior log-odds)}\\
\beta_1 &= 1.
\end{align}

(Reminder again that $$\beta_0$$ is the prior log-odds, and that, since there are no other variables for a single variable to depend on, its coefficient is 1, just like before!)

This is interesting since it tells us quite a bit about the predictive power of a single variable. To wit, the decision boundary, i.e. where the predicted probability equals 0.5, is given by
$$\beta_0+ w = 0,$$
where the intercept is the prior log-odds,
$$\beta_0 = \ln \left( \frac{P(Y=1)}{P(Y=0)} \right).$$

ROC Curves

We now use this model to classify the samples in the dataset, conveniently split into training, validation, and test sets. The results are summarised in the following figure that shows the Receiver Operating Characteristic (ROC) curves. As a historical aside, the name stems from WWII: it arose during the investigation into why the receiver operators did not detect the incoming aircraft in the attack on Pearl Harbour.

The ROC curves plot the True Positive Rate (TPR) against the False Positive Rate (FPR) where,
\begin{align} \mbox{TPR} &= \frac{\mbox{True Positives}}{\mbox{Total Actual Positives}}\\ \\ \mbox{FPR} &= \frac{\mbox{False Positives}}{\mbox{Total Actual Negatives}} \end{align}

These curves are actually parametric curves where the parameter is the probability threshold. For example, if the probability threshold is set at 0, all samples with a probability higher than zero will be classified as Positive, i.e. every sample will be classified as Positive, including all the True Positive samples. Therefore, the TPR will be 1. At the same time all the Negative samples will also be classified as Positive and the FPR is also 1. This takes us to the upper right-hand corner of the graph. On the other hand, if the probability threshold is set at 1, then no sample will be classified as Positive; consequently both the TPR and FPR will be zero. This takes us to the lower left-hand corner of the graph.
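The two corner cases can be reproduced with a few lines of code. The sketch below computes one point of the ROC curve for a given threshold; the scores and labels are invented, and a sample is classified as Positive when its score is strictly above the threshold.

```python
def tpr_fpr(scores, labels, threshold):
    """Classify as Positive when score > threshold; return (TPR, FPR)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    pos = sum(labels)            # total actual positives
    neg = len(labels) - pos      # total actual negatives
    return tp / pos, fp / neg

# Hypothetical predicted probabilities and true labels (1 = Positive).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.0, 0.5, 1.0):
    print(threshold, tpr_fpr(scores, labels, threshold))
```

Sweeping the threshold from 0 to 1 traces the curve from the upper right-hand corner to the lower left-hand corner.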

If we have a model that is purely chance, i.e. it randomly assigns samples to the two classes, then we slide down the diagonal, from the upper right-hand corner to the lower left-hand corner, as the probability threshold is changed from 0 to 1.

There’s also this video, which is a gentler introduction to the interpretation of the ROC curve, if you’d like a more thorough explanation.

If we take a look at our ROC curve below, we’ll see that our model using the ‘age’ variable is just slightly better than chance. Its predictive value is low.

In order to calculate a quantitative value for the predictive power of a model, we can use the area under the ROC curve (AUC). For a perfect model AUC = 1, for a random model this is a half.

The gini value is a scaled version of the AUC and given by,
$$\mbox{gini} = 2 \cdot \mbox{AUC} - 1.$$
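Here is a minimal sketch of both quantities. Instead of integrating the curve, it uses the equivalent pairwise formulation of the AUC: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as a half. The scores and labels are made up, and the quadratic loop is only meant for small illustrations.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    return 2.0 * auc(scores, labels) - 1.0

# Hypothetical predicted probabilities and true labels (1 = Positive).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(round(auc(scores, labels), 4), round(gini(scores, labels), 4))
```

A model that ranks every positive above every negative has a gini of 1, while a model that gives every sample the same score has a gini of 0.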

Some other interesting points:

1. All variables have exactly the same $$\beta_0$$, no matter how the binning is done. All that is different between different variables is the value of the WoE.
2. If the prior $$\beta_0$$ has a large negative value, as in consumer risk modelling, it will generate a small PD (probability of default) value, or a high(ish) credit score. Therefore, unless the WoE value is large enough to cross the decision boundary, the PD will always be small and the credit score will always be high. Said differently, the WoE value has to be large for a single variable to overcome the prior log-odds and predict a large PD. In that sense, the predictive power of a single variable is limited, as illustrated by the ROC curves in the figure above.

It is possible to gain more understanding of exactly how the WoE determines the predictive power of a variable.

The WoE value is given by,
\begin{align} w &= \ln \left( \frac{ \mbox{count bad/total count bad}}{\mbox{count good/total count good}} \right)\\ &= \ln \left( \frac{ \mbox{count bad}}{\mbox{count good}} \right) - \beta_0, \end{align}
where ‘count good’ and ‘count bad’ are the counts for the specific bin associated with the WoE and
$$\beta_0 = \ln \left( \frac{\mbox{total count bad}}{\mbox{total count good}} \right).$$
$$\beta_0$$ therefore cancels in the expression for the decision boundary so that it is given by,

$$\ln \left( \frac{\mbox{count bad}}{\mbox{count good}}\right) = 0,$$
where the counts are for the specific bin WoE value. For positive values of the left-hand-side, a ‘bad’ label will be assigned and this can only happen if the actual ‘bad’ bin-count is higher than the actual ‘good’ bin-count. For highly imbalanced classes, such as those expected in credit modelling, the total ‘bad’ count is much lower than the total ‘good’ count. This will generally also be reflected in the individual bin counts. Only rarely will a single variable assign a PD that will result in a ‘bad’ classification.
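The cancellation is easy to see numerically. Using made-up bin counts for a single variable, the sketch below shows that the single-variable model with unit coefficient predicts exactly the observed bad rate of each bin, so a ‘bad’ label is only assigned when the ‘bad’ count in a bin exceeds the ‘good’ count.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical bin counts for one variable (illustrative only).
bad_counts = [30, 50, 20]
good_counts = [200, 500, 300]

total_bad, total_good = sum(bad_counts), sum(good_counts)
beta_0 = math.log(total_bad / total_good)   # prior log-odds

rows = []
for bad, good in zip(bad_counts, good_counts):
    w = math.log((bad / total_bad) / (good / total_good))   # WoE of the bin
    pd_model = sigmoid(beta_0 + w)                          # single-variable model
    pd_bin = bad / (bad + good)                             # observed bad rate of the bin
    rows.append((w, pd_model, pd_bin))
    print(round(w, 3), round(pd_model, 4), round(pd_bin, 4))
```

In this example every bin has far more ‘good’ than ‘bad’ samples, so the predicted PD stays well below 0.5 in all bins.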

The variables need to combine as a team in order to overcome the large prior log-odds.

This does not mean that the gini is totally worthless as an indicator for the predictive value of a variable. It will often be low, but a positive WoE value (remembering that in our definition we use the log of ‘bad’ over ‘good’) will nudge the PD higher, towards the ‘bad’ classification and lower the credit score, even if it is by a small amount.

Summary

The point is that the predictive power of a single variable is limited. This should not be all that surprising! A closer look at the logistic regression model for a single variable, based on WoE, tells us exactly what the extent of this limitation is.

Balancing the classes

As we have mentioned a number of times, the credit modeller will always have to deal with highly imbalanced ‘good’ and ‘bad’ classes. The question is, should you balance the classes before training? The answer, for the credit modeller,  is no. For this application the class imbalance does not indicate any structural deficiency in the data, but reflects reality.  The imbalance carries information that should not be discarded.

Let’s revisit the one-variable situation where the decision boundary is given by:
$$\beta_0 + w = 0,$$
and suppose we have perfectly balanced our classes. For example we might have taken a random sample from the ‘good’ class to equal the number of ‘bad’ samples. This means that:
$$\beta_0 = 0$$
and that the decision boundary, for all variables, is given by
$$w=0.$$
This means that the decision boundary is given by:
$$\ln \left( \frac{\mbox{bad count}}{\mbox{good count}} \right) = 0,$$
where, as before, the counts are for the specific bin with that particular WoE value. Keeping in mind that the WoE values are fixed for a specific model, it is safe to say that the model will assign far too many ‘bad’ labels.

The situation does not improve significantly by adding more variables. For balanced classes the prior log-odds will still be zero and the WoEs again the log of the ratio of the ‘bad’ count over the ‘good’ count.
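The effect can be illustrated with a small sketch. It assumes an idealised downsampling that keeps the same fraction of ‘good’ samples in every bin, so that the WoE values stay the same while the prior log-odds drops to zero. The counts and the helper `fit_single_variable` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_single_variable(bad, good):
    """Prior log-odds and per-bin WoE from bin counts."""
    total_bad, total_good = sum(bad), sum(good)
    beta_0 = math.log(total_bad / total_good)
    woe = [math.log((b / total_bad) / (g / total_good)) for b, g in zip(bad, good)]
    return beta_0, woe

# Hypothetical bin counts for one variable (illustrative only).
bad_counts = [30, 50, 20]
good_counts = [200, 500, 300]

# Idealised downsampling: keep the same fraction of 'good' in every bin
# so that the two classes end up the same size.
keep = sum(bad_counts) / sum(good_counts)
good_balanced = [g * keep for g in good_counts]

b0_full, woe_full = fit_single_variable(bad_counts, good_counts)
b0_bal, woe_bal = fit_single_variable(bad_counts, good_balanced)

pd_full = [sigmoid(b0_full + w) for w in woe_full]
pd_bal = [sigmoid(b0_bal + w) for w in woe_bal]

print([round(p, 3) for p in pd_full])
print([round(p, 3) for p in pd_bal])
```

The WoE values are identical in both fits, but without the prior the predicted PDs are several times larger than the bad rates actually observed in the portfolio, and the first bin is now pushed over the decision boundary.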

Summary

The analysis of this section should provide the modeller with a deeper understanding of what the effect might be if classes are balanced. This should help with better informed decisions about when class balancing is appropriate.

Conclusion

In this post we explored the relationship between WoE and logistic regression. It turns out that there is a particularly good fit. (We have a strong suspicion that historically, WoE was introduced exactly for this reason. Perhaps there is a reader who can point us to the original source.) Investigating this relationship in more detail helps us understand the effect of the many design choices that the modeller can make. We believe that a better understanding of the processes involved will lead to more effective, robust models.