A whirlwind introduction to Weight-of-Evidence (WoE)
Today, we’re going to talk about Weight-of-Evidence (WoE) and how it is used in the field of credit modelling.
If you’re someone who understands the intricacies of Weight of Evidence (WoE), give yourself a pat on the back. If, on the other hand, you’re unsure what WoE is all about, we’re here to help.
Let’s start at the beginning. Combined with logistic regression, WoE is a great method for binary classification. If you’re a credit modeller in particular (which is the example we’ll use), then there is even more reason to consider it.
Before diving deeper, let’s first discuss why WoE remains a common credit modelling tool 50 years after its initial introduction. In credit modelling, prediction explainability is essential: being able to explain exactly why a particular prediction was made is a crucial property of any credit model. Intuitively this makes sense: a prediction can have a major impact on the individual who, for example, is applying for a loan. A financial institution should be able to explain precisely why a loan was denied and what is required to qualify for one. As a result, many (if not most) credit models are built using logistic regression, a linear classifier that makes explainability relatively straightforward.
Two things to keep in mind, before we continue:
Logistic regression is a linear classifier, but the WoE transformation is nonlinear.
At its core, the WoE transformation allows us to turn continuous variables into categorical variables. This has a number of advantages, which may not be immediately obvious.
The WoE transformation allows you to combine numerical and categorical values. For example, missing values for a numeric variable can be given a special label such as ‘missing’ or ‘-1’ instead of being dropped or ignored. In credit modelling it is also common to assign special categories to otherwise numeric variables; these categories are then used alongside the variable’s numerical values.
WoE is also robust against outliers, which we’ll cover a bit later in this post.
Sparsely populated categories can be combined so that the WoE expresses the information of the combined category. The (univariate) effect of each category can be compared across categories and across variables, since WoE is a standardised (dimensionless) value. And because WoE is on a logarithmic scale, it combines naturally with logistic regression. All of these properties make WoE a very natural and useful transformation in the domain of binary classification.
Looking to use Weight-of-Evidence to create your own binary classification models? Us too! That’s why we built Praexia 🚀, which is a cloud-based credit modelling tool.
What is WoE and how to calculate it?
As an example, let’s consider a specific variable, age. This is a continuous variable with values ranging from 0 to above 100 (in exceptional cases). In our day-to-day lives, we already assign this continuous variable to equal-sized bins of one year each. For example, we say a person is 68 years old, not 68.321 years old. Using WoE, we’ll use more powerful bins than single years.
So then, the first step in the WoE transformation is to bin a variable. This is performed for both continuous and categorical variables. Our age variable can be seen as a continuous variable, binned into 1 year bins. But we don’t need to use bins of equal, one-year width. We can, for example, bin the age variable as follows:
Note again how, after binning, our age variable becomes categorical. For example, all ages between 29 and 38 will be assigned to the same bin.
As you can imagine, it is critical to get the choice of bins right. A credit modeller (or anyone using WoE) will spend a lot of time experimenting with different bin choices in order to optimise a number of metrics (we’ll touch on these metrics in a second!).
Every sample (or “instance”) from the age variable is associated with a label: 0 for “good” and 1 for “bad”. We then execute the following procedure:
Assign each sample to its respective bin
For each bin, count the number of “good” and the number of “bad” samples. This gives us the “Good count” and “Bad count” for each bin.
Calculate the “Good” and “Bad” distributions by dividing each bin’s counts by the total number of good / bad samples in the entire set.
For each bin, calculate the WoE as the natural logarithm of the ratio of the two distributions: WoE = ln(Good distribution / Bad distribution).
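The whole procedure can be sketched in a few lines of pandas. The ages, labels, bin edges, and the small ϵ smoothing term below are all illustrative, and ln(good% / bad%) is one common sign convention:

```python
import numpy as np
import pandas as pd

# Toy data: ages with binary labels (0 = "good", 1 = "bad")
df = pd.DataFrame({
    "age":   [22, 25, 31, 34, 42, 47, 51, 58, 63, 70],
    "label": [1,  1,  0,  1,  0,  0,  1,  0,  0,  0],
})

# Step 1: assign each sample to a bin; the open right edge guards
# against outliers (a 110-year-old simply lands in the last bin)
bins = [0, 29, 38, 48, 60, np.inf]
df["bin"] = pd.cut(df["age"], bins=bins)

# Step 2: good and bad counts per bin
counts = df.groupby("bin", observed=False)["label"].agg(
    good=lambda s: (s == 0).sum(),
    bad=lambda s: (s == 1).sum(),
)

# Step 3: distributions over the whole set, with a small epsilon so
# that empty bins don't produce log(0) or a division by zero
eps = 1e-5
good_dist = (counts["good"] + eps) / counts["good"].sum()
bad_dist = (counts["bad"] + eps) / counts["bad"].sum()

# Step 4: WoE per bin
counts["woe"] = np.log(good_dist / bad_dist)
print(counts)
```

A bin dominated by “good” samples gets a positive WoE, a bin dominated by “bad” samples a negative one.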
All clients with ages, say, between 29 and 38 fall in the same bin and are assigned the same WoE value of 0.113034.
The original values, e.g. the different ages in the dataset, are first transformed to WoE values using a lookup table. It is the WoE values that are used as input features for training the credit model.
Naturally, the model does not distinguish between clients with values that fall in the same bin.
Two clients with ages, say, 49 and 50 are placed into different bins and assigned WoE values of 0.170319 and -0.030100 respectively. These two clients will likely be treated differently by the trained model.
Note how the binning protects against outliers: A very old person will fall in the same bin as the elderly.
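In code, the lookup step is a simple map from bin to WoE value. The sketch below reuses the WoE values quoted above for the 29–38, 38–49, and 49–60 age bins; the remaining bin edges and WoE values are made up for illustration:

```python
import pandas as pd

clients = pd.DataFrame({"age": [31, 35, 49, 50, 104]})

# Hypothetical lookup table: age bin -> WoE value
woe_lookup = {
    pd.Interval(18.0, 29.0):          -0.25,      # made up
    pd.Interval(29.0, 38.0):           0.113034,
    pd.Interval(38.0, 49.0):           0.170319,
    pd.Interval(49.0, 60.0):          -0.030100,
    pd.Interval(60.0, float("inf")):   0.41,      # made up
}

# Bin each client, then replace the age with the bin's WoE value;
# the 104-year-old outlier simply falls into the last bin
clients["bin"] = pd.cut(clients["age"], bins=[18, 29, 38, 49, 60, float("inf")])
clients["woe"] = clients["bin"].map(woe_lookup)
print(clients)
```

Note how the clients aged 31 and 35 receive the same WoE value, while 49 and 50 straddle a bin edge and receive different ones.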
A Detailed Example
Let’s consider a slightly more detailed example. The dataset below consists of two variables, A and B, and for each client the values of the two variables as well as labels are provided:
The question is how to bin the two variables. In general there are a few strategies. Two simple strategies are:
Choose the number of bins and divide the range of values into bins of equal width.
Choose the bins so that every bin contains the same number of samples.
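Both strategies have one-line counterparts in pandas (pd.cut for equal-width bins, pd.qcut for equal-frequency bins); the values here are arbitrary:

```python
import pandas as pd

values = pd.Series([18, 21, 24, 30, 33, 39, 45, 52, 60, 75])

# Strategy 1: divide the value range into four bins of equal width
equal_width = pd.cut(values, bins=4)

# Strategy 2: four bins, each holding (roughly) the same number of samples
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```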
For this example, let’s arbitrarily choose the bins as follows (note that it is a pure coincidence that the two variables have the same number of bins):
With this information we can calculate the WoE tables in the same way as we did for the age variable discussed above:
Again, some things to note:
Some of the bins have a zero count and for those bins the WoE cannot be calculated.
This is an artefact of the small sample set we chose. In general, a binning with zero good or zero bad counts in any of the bins is unacceptable. In that case, either the binning should be adjusted or the variable discarded.
For this example we can deal with the zero counts by adding ϵ = 0.00001 to all the counts.
Using a lookup table (the table above!) we replace the client data with the WoE values:
Let’s summarise everything in the table below:
There is a beautiful and important relationship between WoE and logistic regression that we’ll explore in detail in a forthcoming post. For now, we simply train the logistic classifier using scikit-learn. The trained classifier is used to classify all the samples in the region \(0 < A < 4.5\) and \(0 < B < 7.5\). The decision boundary is shown in the image below:
Things to observe:
The training dataset is indicated by the discrete points, blue=”good” (label=0), red=”bad” (label=1).
The decision boundary divides the region into “good” (blue) and “bad” (orange) subregions.
The decision boundary is clearly nonlinear. This is due to the nonlinear transformation from the original values to WoE values.
Note how the choice of bins directly affects the position of the decision boundary. By choosing the bins carefully, the good and bad samples can be completely separated.
The choice of bins allows a perfect classification of the training data. Perhaps counter-intuitively, this is actually an undesirable outcome, since it implies overfitting. However, for our example it serves to illustrate how powerful a good binning choice can be.
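The full pipeline (bin, compute WoE per bin, map the values, train the classifier) can be sketched as follows. The grid of samples, the labelling rule, and the bin edges are invented for illustration; only the region \(0 < A < 4.5\), \(0 < B < 7.5\) matches the example above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A deterministic toy grid over 0 < A < 4.5 and 0 < B < 7.5, with the
# "bad" clients (label=1) clustered in the low-A, low-B corner
A, B = np.meshgrid(np.arange(0.25, 4.5, 0.5), np.arange(0.25, 7.5, 0.5))
df = pd.DataFrame({"A": A.ravel(), "B": B.ravel()})
df["label"] = ((df["A"] < 1.5) & (df["B"] < 2.5)).astype(int)

def woe_transform(x, y, bins, eps=1e-5):
    """Map each value of x to the WoE of its bin, using ln(good% / bad%)."""
    binned = pd.cut(x, bins=bins)
    good = (binned[y == 0].value_counts() + eps).sort_index()
    bad = (binned[y == 1].value_counts() + eps).sort_index()
    woe = np.log((good / good.sum()) / (bad / bad.sum()))
    return binned.map(woe.to_dict())

# Replace the raw values with WoE values, then fit the linear classifier
X = pd.DataFrame({
    "A_woe": woe_transform(df["A"], df["label"], [0, 1.5, 3.0, 4.5]),
    "B_woe": woe_transform(df["B"], df["label"], [0, 2.5, 5.0, 7.5]),
})
clf = LogisticRegression().fit(X, df["label"])
print("training accuracy:", clf.score(X, df["label"]))
```

Although the classifier is linear in the WoE features, its boundary in the original \((A, B)\) plane follows the bin edges, which is exactly the nonlinearity discussed above.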
Choosing the bins
Since the bin choice is clearly important, how do you choose them? Unfortunately, there are no hard and fast rules, but a few strategies and guidelines do exist:
Each bin should contain at least 5% of the observations, and no bin should have zero counts for either good or bad loans.
The WoE should be distinct for each bin. Similar bins should be combined since bins with similar WoE values have the same predictive power.
The WoE should be monotonic, i.e., either increasing or decreasing across the bins: since the bins follow the values of the independent variable, the WoE values should consistently rise or fall as we move along the bins from left to right (or vice versa). Fluctuations in the WoE values can destabilise the model. For example, you don’t want a sudden jump in your credit score just because your next birthday moves you into the next bin.
Categorical values should be binned separately from the numerical bins, but otherwise the same rules apply to categorical as to numerical values.
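The monotonicity guideline is easy to check programmatically. A small helper, applied here to hypothetical WoE values ordered by bin:

```python
def woe_is_monotonic(woe_values):
    """True if the per-bin WoE values only ever increase or only ever decrease."""
    diffs = [b - a for a, b in zip(woe_values, woe_values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

print(woe_is_monotonic([-0.8, -0.2, 0.1, 0.5]))  # steadily increasing: fine
print(woe_is_monotonic([-0.8, 0.4, -0.1, 0.5]))  # fluctuates: a warning sign
```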
Today we’ve taken a brief look at what WoE is and how it is calculated, and we’ve shown how the nonlinear transformation from original values to WoE values leads to a nonlinear decision boundary for the logistic regression classifier. A nonlinear decision boundary is of course much more flexible than a linear one in dividing two classes.
But this isn’t the end. We have also left a number of questions unanswered, which we’ll address in a future post:
How can one reduce the number of variables using WoE?
What is the predictive power of a single variable as measured by its GINI index? The answer to this question may surprise you.
In credit modelling one inevitably works with imbalanced training sets. Should one try to balance the training set before training?