ABSTRACT

Over the

last decade, several banks have developed models to quantify credit risk. In

addition to the monitoring of the credit portfolio, these models also help to

decide on the acceptance of new contracts, assess customers’ profitability and

define pricing strategy. The objective of this paper is to improve the approach

in credit risk modeling, namely in scoring models for predicting default

events. To this end, we propose the development of a two-stage Ensemble Model

that combines the interpretability of the Scorecard with the predictive

power of the Artificial Neural Network. The results show that the AUC improves

by 2.4% relative to the Scorecard and by 3.2% relative to the Artificial Neural

Network.

1. INTRODUCTION

Over the

last decade, several banks have developed models to quantify credit risk (Basel

Committee on Banking Supervision, 1999). The objective of credit risk modeling

is to estimate the expected loss (EL) associated with a credit portfolio. To do

so, it is necessary to estimate the Probability of Default (PD), the Loss Given

Default (LGD) and the Exposure At the time of Default (EAD). The portfolio’s

expected loss is given by the product of these three components (Basel

Committee on Banking Supervision, 2004).
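
The product described above can be written compactly. The notation below is a sketch that uses only the abbreviations already defined in the text:

```latex
\[
EL = PD \times LGD \times EAD
\]
```

As a purely hypothetical illustration (the figures are not taken from any dataset), an exposure with $PD = 2\%$, $LGD = 45\%$ and $EAD = 10{,}000$ would carry $EL = 0.02 \times 0.45 \times 10{,}000 = 90$.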

However, this work focuses only on

PD models, typically based on scoring models. Credit scoring models are built

using historical information from several actual customers. For each customer, some

attributes are recorded, along with whether the customer has failed to pay (defaulted).

Specifically, the objective of credit scoring is to assign credit applicants to either

good customers (non-default) or bad customers (default); it therefore lies in

the domain of classification problems (Anderson, 1978).

Currently,

credit scoring models are used by about 97% of banks that approve credit card

applications (Brill, 1998). Using scoring models increases revenue by increasing

volume, reducing the cost of credit analysis, enabling faster decisions, and

monitoring credit risk over time (Brill, 1998). Accordingly, credit risk

measurement has become increasingly important in the Basel II capital accord

(Basel Committee on Banking Supervision, 2003; Gestel et al., 2005).

In the banking industry, credit scorecard development has been based

mostly on logistic regression, because it reconciles

predictive and interpretive power. Recall that regulators require that banks

be able to explain credit application decisions; transparency is therefore fundamental

to these models (Dong, Lai, & Yen, 2010; Hand & Henley, 1997). In this

paper, we propose a two-stage ensemble model to reinforce the predictive

capacity of a scorecard without compromising its transparency and

interpretability.

2. LITERATURE SURVEY

In recent

years, several attempts have been made to improve the accuracy of Logistic

Regression (Lessmann, Baesens, Seow, & Thomas, 2015). Louzada et al. (2016)

reviewed 187 (credit scoring) papers and concluded that the most common goal of

researchers is the proposition of new methods in credit scoring (51.3%), mainly

by using hybrid approaches (almost 20%), combined methods (almost 15%) and

support vector machine along with neural networks (around 13%). The second most

popular objective is the comparison of new methods with the traditional

techniques, where the most used techniques are Logistic Regression (23%) and

neural networks (21%). One of these studies was conducted by West (2000), who

compared five neural network models with traditional techniques. The results indicate

that neural networks may improve accuracy by 0.5% to 3%. Additionally, logistic

regression was found to be an alternative to neural networks. In turn,

Gonçalves and Gouvêa (2007) obtained very similar results using Logistic

Regression and neural network models. However, the proposed new methods tend to

require complex computing schemes and limit the interpretation of the results,

which makes them difficult to implement (Liberati, Camillo, & Saporta,

2017).

Lessmann et al. (2015) state that

the accuracy differences between traditional methods and machine learning

result from the fully-automatic modeling approach. Consequently, some advanced

classifiers do not require human intervention to predict significantly more

accurately than simpler alternatives. Abdou and Pointon (2011) carried out a

comprehensive review of 214 papers that involve credit scoring applications to

conclude that there is no overall best statistical technique used in building

scoring models, thus the best technique for all circumstances does not yet

exist. This result is aligned with the Supervised Learning No-Free-Lunch (NFL)

theorems (Wolpert, 2002).

Marqués et

al. (2012) evaluated the performance of seven individual prediction techniques

when used as members of five different ensemble methods and concluded that the C4.5

decision tree constitutes the best solution for most ensemble methods, closely

followed by the Multilayer Perceptron neural network and Logistic Regression,

whereas the nearest neighbor and the naive Bayes classifiers appear to be

significantly the worst. Gestel et al. (2005) suggested the application of a

gradual approach in which one starts with a simple Logistic Regression and

improves it, using Support Vector Machines to combine good model readability

with improved performance.

3. THEORETICAL FRAMEWORK

3.1. DATASET

To ensure that our results are

replicable and comparable, we use the German Credit Data Set from

the University of California at Irvine (UCI) Machine Learning Repository. The

dataset can be found at http://archive.ics.uci.edu/ml/datasets.html. According

to Louzada et al. (2016), almost 45% of all reviewed papers (in their survey)

consider either the Australian or the German credit dataset. The dataset contains 1000

in-force credits, of which 700 are identified as non-defaulted and 300 as

defaulted. The 20 input variables prepared by Prof. Hofmann are presented in

Table 1.

The target

variable is “status” and contains the classification of the loan in terms of

default (Lichman, 2013).

The dataset

comes with a recommended cost matrix, which makes failing to predict a default

five times worse than failing to predict a non-default. However, given this

paper’s objectives, we chose not to use any cost matrix. Thus, failing to

predict a default and failing to predict a non-default have the same cost.

3.2. TWO-STAGE ENSEMBLE MODEL

In this

paper, we aim to improve the approach used in credit scoring models. To this

end, we propose a Two-Stage Ensemble Model (2SEM) to reinforce the predictive

capacity of a Scorecard without compromising its transparency and

interpretability.

The concept behind the ensemble is

to use several algorithms together to obtain a better performance than the one

obtained by each of the algorithms individually (Rokach, 2010). In our paper,

we first estimate a Scorecard (SC) model and then estimate an Artificial Neural

Network (ANN) on the SC residual. Then, we combine the two models

using a logistic regression. In this way, we intend the ANN to account for the

nonlinearity that the SC is unable to capture. The proposed architecture for the

Ensemble Model is presented in Figure 1:

Where is the set of inputs, the target variable, and

are the target and residual estimates, respectively. The box operator

stands for a specific algorithm (in this case, SC, ANN, and LR) and the circle

for a sum operator (where the upper sign corresponds to the upper variable, and the

other to the lower variable). The components in Figure 1 are described in more detail

in Table 2.
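
The architecture described above can be sketched in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: a plain logistic regression stands in for the Scorecard (no binning or WOE), a small one-hidden-layer tanh network stands in for the ANN, the data are synthetic, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_logistic(w, xi):
    # The last element of w is the bias.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_logistic(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def hidden_layer(W1, xi):
    return [math.tanh(sum(w * x for w, x in zip(row, xi)) + row[-1]) for row in W1]

def predict_tanh_net(net, xi):
    W1, W2 = net
    h = hidden_layer(W1, xi)
    return sum(w * hk for w, hk in zip(W2, h)) + W2[-1]

def fit_tanh_net(X, r, hidden=3, lr=0.05, epochs=200, seed=0):
    """One-hidden-layer tanh network fitted to the residuals r (SGD, squared error)."""
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for xi, ri in zip(X, r):
            h = hidden_layer(W1, xi)
            err = sum(w * hk for w, hk in zip(W2, h)) + W2[-1] - ri
            for k in range(hidden):
                back = err * W2[k] * (1.0 - h[k] ** 2)  # gradient through tanh
                W2[k] -= lr * err * h[k]
                for j in range(d):
                    W1[k][j] -= lr * back * xi[j]
                W1[k][-1] -= lr * back
            W2[-1] -= lr * err
    return W1, W2

def fit_two_stage(X, y):
    stage1 = fit_logistic(X, y)                     # stand-in for the Scorecard
    p1 = [predict_logistic(stage1, xi) for xi in X]
    resid = [yi - pi for yi, pi in zip(y, p1)]      # first-stage residual
    stage2 = fit_tanh_net(X, resid)                 # network fitted on the residual
    r_hat = [predict_tanh_net(stage2, xi) for xi in X]
    combiner = fit_logistic([[p, r] for p, r in zip(p1, r_hat)], y)
    return stage1, stage2, combiner

def predict_two_stage(model, xi):
    stage1, stage2, combiner = model
    z = [predict_logistic(stage1, xi), predict_tanh_net(stage2, xi)]
    return predict_logistic(combiner, z)

# Synthetic data (hypothetical): the event depends non-linearly on x[0].
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if xi[0] ** 2 + 0.3 * xi[1] > 0.4 else 0 for xi in X]
model = fit_two_stage(X, y)
preds = [predict_two_stage(model, xi) for xi in X]
```

The final logistic regression sees only two inputs, the first-stage estimate and the estimated residual, which is what keeps the first-stage contribution readable.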

Lastly, to avoid overfitting, the

dataset was split randomly into a training set (65%), a validation set

(15%) and a test set (20%). In this process, we used stratified sampling on the

target variable to ensure the event proportion is similar in all sets.
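
A stratified 65/15/20 split can be sketched as follows. This is an illustrative sketch using only the standard library; the function name `stratified_split` and the seed are assumptions, and the label vector merely mimics the 700/300 class counts of the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.65, 0.15, 0.20), seed=0):
    """Return index lists (train, valid, test), split separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]   # remainder goes to the test set
    return train, valid, test

# 700 non-defaults (0) and 300 defaults (1), as in the dataset description.
labels = [0] * 700 + [1] * 300
train, valid, test = stratified_split(labels)
event_rates = [sum(labels[i] for i in part) / len(part) for part in (train, valid, test)]
```

Because each class is split separately, the event rate is the same in the three sets up to rounding.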

3.3. PERFORMANCE METRICS

Following the performance assessment approach of Hamdy & Hussein (2016),

we rely on the confusion matrix and the Area

Under the ROC Curve (AUC) to compare the predictive quality of the 2SEM, SC and

ANN.

Confusion Matrix

The confusion matrix is a

very widespread concept, and it allows a more detailed analysis of the right

and wrong predictions. As may be seen in Figure 2.4, there are two possible

predicted classes and two actual classes. The combination of these

classes gives rise to four possible outcomes: True Positive (TP), False Negative

(FN), False Positive (FP) and True Negative (TN).

These

classifications have the following meaning:

• True Positive: observations that we predict as default and are actually default;

• False Positive: observations that we predict as default but are actually non-default (Type I error);

• True Negative: observations that we predict as non-default and are actually non-default;

• False Negative: observations that we predict as non-default but are actually default (Type II error).

To ease

the interpretation of the matrix, the following measures may be computed:
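
A sketch of the standard definitions of such measures, written in terms of the four outcomes above. The selection and the names are the conventional ones and are an assumption here; they need not coincide exactly with the authors' list:

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Misclassification Rate} &= \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Precision} &= \frac{TP}{TP + FP}
\end{align*}
```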

Among these

measures, accuracy takes a central place. However, this metric must be used

carefully, especially on unbalanced datasets (such as the one we are using). For

example, in a dataset with a 5% event rate, a constant prediction of non-event

would have an accuracy of 95%, higher than that of a model that is correct

90% of the time on a dataset with a 50% event rate. Clearly, this metric

is not robust for comparisons between models applied to datasets with different

event rates. However, we may use it to compare models on the same dataset, which

is precisely what we want to do. Moreover, we will use the complementary metric, the

Misclassification Rate.
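
The 5% example can be checked numerically. The sketch below uses hypothetical data (1,000 observations, 50 of them events) and illustrative function names; it shows that a constant non-event prediction reaches 95% accuracy while detecting no default at all.

```python
def accuracy(actual, predicted):
    # Share of observations whose predicted class equals the actual class.
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def sensitivity(actual, predicted):
    # Share of actual events (1) that are predicted as events.
    events = [(a, p) for a, p in zip(actual, predicted) if a == 1]
    return sum(1 for _, p in events if p == 1) / len(events)

# Hypothetical dataset with a 5% event rate.
actual = [1] * 50 + [0] * 950
always_non_event = [0] * len(actual)

acc = accuracy(actual, always_non_event)
misclassification_rate = 1 - acc
sens = sensitivity(actual, always_non_event)
```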

AUC

Another measure for assessing

predictive power is the Area Under the Receiver Operating

Characteristic (ROC) Curve (AUC). The curve is created by plotting the true positive rate

against the false positive rate at various cutoff points. The true positive

rate is the probability of identifying a default, while the false positive rate

is the probability of a false alarm. An AUC of 0.5 (random predictor) is used as a

baseline to see whether the model is useful or not (Provost & Fawcett,

2013).
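
The AUC can also be computed without drawing the curve, since it equals the probability that a randomly chosen default receives a higher score than a randomly chosen non-default, with ties counted as one half. The sketch below implements this pairwise formulation on hypothetical scores; the function name `auc` is illustrative and the quadratic loop is meant for clarity, not speed.

```python
def auc(actual, scores):
    """Pairwise AUC: share of (event, non-event) pairs ranked correctly."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: 1 = default, scores = predicted probability of default.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
model_auc = auc(actual, scores)

# A constant score carries no ranking information and yields the 0.5 baseline.
baseline_auc = auc(actual, [0.5] * len(actual))
```

In this toy example 8 of the 9 (default, non-default) pairs are ranked correctly, so the AUC is 8/9.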

Compared to

the confusion matrix, this method has the advantage of not requiring the

cut-off definition (value from which the probability of default is high enough

to consider that the customer is a bad one). Besides, it is also suited for

unbalanced datasets (Hamdy & Hussein, 2016). However, the use of the ROC curve

as the sole misclassification criterion has decreased significantly in the

literature over the years. More recently, metrics based on the confusion matrix

have become the most common (Louzada et al., 2016).

4. RESULTS AND DISCUSSION

In this section, we first present

the estimation results for both the 2SEM and the baselines (SC and ANN). Then,

the results obtained are analyzed and compared to select the most appropriate

model.

4.1. SCORECARD

Prior to scorecard estimation, some

input variables had to be binned. This process consisted of grouping the values of each input

variable that had similar event behavior (target variable). The cutoffs

used maximized the Weight of Evidence (WOE), the measure underlying a variable’s Information

Value (IV) (Zeng, 2014). The binning outcome consisted of 20 new categorical

input variables, which were then used in a stepwise selection algorithm. As a result,

the following seven input variables were included in the scorecard: Age in

years, Credit amount, Credit history, Duration in month, Purpose, Savings

account/bonds and Status of existing CA. The estimates can be seen in Table

4.
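
For reference, WOE and IV for a binned variable can be computed as in the sketch below. It assumes the common convention WOE = ln(share of non-defaults / share of defaults), so that a higher WOE means lower risk; the bin counts are hypothetical and the function name `woe_iv` is illustrative. The binning search itself (choosing the cutoffs) is not shown.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) per bin. Returns (list of WOE values, IV).

    Assumes every bin contains at least one good and one bad observation.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dist_good = g / total_good   # share of all non-defaults in this bin
        dist_bad = b / total_bad     # share of all defaults in this bin
        woe = math.log(dist_good / dist_bad)
        woes.append(woe)
        iv += (dist_good - dist_bad) * woe
    return woes, iv

# Hypothetical counts for a three-bin variable (700 good and 300 bad in total).
bins = [(100, 100), (250, 120), (350, 80)]
woes, iv = woe_iv(bins)
```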

The score

points in this scorecard increase as the event rate decreases. The estimation

parametrization ensures that a score of 200 represents odds of 50 to 1 (that is

P(Non-default)/P(Default)=50). The neutral score in a variable is 16 and an

increase of 20 score points doubles the odds. The link between

score points and the probability of default is pictured in Figure 2.
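
The scaling stated above (200 points at odds of 50 to 1, 20 points to double the odds) can be sketched as follows. The sketch assumes the usual log-odds scaling, score = offset + factor × ln(odds); the constant and function names are illustrative, and the authors' software may parametrize the scaling differently.

```python
import math

PDO = 20.0          # points that double the odds (from the text)
BASE_SCORE = 200.0  # reference score (from the text)
BASE_ODDS = 50.0    # P(non-default) / P(default) at the reference score

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

def score_from_odds(odds):
    return OFFSET + FACTOR * math.log(odds)

def pd_from_score(score):
    # Invert the scaling to recover the odds, then convert odds to P(default).
    odds = math.exp((score - OFFSET) / FACTOR)
    return 1.0 / (1.0 + odds)

score_at_50 = score_from_odds(50.0)
score_at_100 = score_from_odds(100.0)
pd_at_200 = pd_from_score(200.0)
```

Under this assumption a score of 200 corresponds to a probability of default of 1/51, about 1.96%, and a score of 220 to odds of 100 to 1.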

4.2. ARTIFICIAL NEURAL NETWORKS

The neural

network was designed with five layers: an input layer, three hidden layers, and an output layer.

The input layer has 20 variables, while each hidden layer includes three neurons

with a Tanh activation function. In total, we included 9 hidden neurons and estimated

208 weights. Figure 3 presents the Artificial Neural Network architecture.
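
The number of weights depends on how many input units the network actually receives. The sketch below counts weights under assumptions that are not stated in the text: fully connected layers, one bias per neuron and a single output neuron. With 20 numeric input units this gives 91 weights, whereas 208 is what the same formula gives for 59 input units, which would be consistent with the categorical variables being expanded into indicator units before entering the network. That encoding is an assumption, not something the paper reports.

```python
def count_weights(layer_sizes):
    """Weights (including biases) of a fully connected feed-forward network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Three hidden layers of three tanh neurons and one output neuron.
weights_20_inputs = count_weights([20, 3, 3, 3, 1])
weights_59_inputs = count_weights([59, 3, 3, 3, 1])  # hypothetical input width
```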

The optimization process ended on

the 10th iteration, achieving an average validation error of 0.496, as

presented in Figure 3.

4.3. TWO-STAGE ENSEMBLE MODEL

The 2SEM consists of a logistic regression using the PD estimate from the SC

(P_Scard) and the SC residual estimate from the ANN (P_ANN) as inputs. We expect

P_Scard to account for the majority of the 2SEM predictive power, while P_ANN is

supposed to correct P_Scard deviations (prediction failures). The coefficient

estimates are presented in Table 5.

As may be

seen, P_Scard is the main contributor to the 2SEM (the standardized estimate of P_Scard is

twice that of P_ANN), and both are statistically significant.

4.4. DISCUSSION

In this section, we compare the

Scorecard, Artificial Neural Network and the Two-Stage Ensemble Model according

to confusion matrix metrics and AUC. Before that, Figure 5 presents the default rate

distribution across scoring deciles. To obtain these distributions, the test

dataset was sorted in ascending order by target prediction (for each model) and divided

into 10 equally populated bins. Then the averages of Status (DefRate) and Status Prediction

(AvgProb) were computed. Analyzing these plots, we find that none of the

distributions is monotonic (which is usually a requirement in a probability of

default model); however, there is a progression in the right direction from SC to

2SEM.
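
The decile procedure described above can be sketched as follows. The data are synthetic (200 observations, matching the size of a 20% test set) and the function names are illustrative; DefRate and AvgProb follow the definitions in the text.

```python
import random

def decile_table(actual, predicted, n_bins=10):
    """Sort by prediction, cut into equally populated bins, average per bin."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    rows = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]
        def_rate = sum(actual[i] for i in idx) / len(idx)      # DefRate
        avg_prob = sum(predicted[i] for i in idx) / len(idx)   # AvgProb
        rows.append((def_rate, avg_prob))
    return rows

def is_monotonic(values):
    # Non-decreasing from the lowest-risk bin to the highest-risk bin.
    return all(a <= b for a, b in zip(values, values[1:]))

# Synthetic predictions and outcomes (hypothetical, for illustration only).
rng = random.Random(1)
predicted = [rng.random() for _ in range(200)]
actual = [1 if rng.random() < p else 0 for p in predicted]

rows = decile_table(actual, predicted)
def_rate_monotonic = is_monotonic([r[0] for r in rows])
avg_prob_monotonic = is_monotonic([r[1] for r in rows])
```

AvgProb is non-decreasing by construction, because the bins are formed from sorted predictions; the informative check is whether DefRate follows the same ordering.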

We turn now

to the fit statistics, presented in Table 6. The results indicate that the 2SEM

has a better fit to the data according to all these statistics. Namely, the AUC

improves by 2.4% (0.019 AUC points) relative to the Scorecard and by 3.2% (0.025 AUC points) relative

to the Artificial Neural Network.

This result

is reinforced by the ROC curve representation. Figure 6 presents the

ROC curves for the training, validation and test datasets.

5. CONCLUSION

Credit

scoring models attempt to measure the risk of a customer failing to pay back a

loan based on the customer’s characteristics. In the banking industry, the most popular

model is the scorecard because it reconciles predictive and interpretive

power. Recall that regulators require that banks

be able to explain credit application decisions; transparency is therefore fundamental

to these models. In this paper, we propose a new ensemble framework for

credit scoring models to reinforce the predictive capacity of a scorecard

without compromising its transparency and interpretability.

The two-stage ensemble model consists of a logistic regression using the PD

estimate from the Scorecard and the Scorecard residual estimate (obtained through the

Artificial Neural Network) as inputs. Thus, the Scorecard estimate (PD)

accounts for the majority of the 2SEM predictive power, while the Artificial Neural

Network helps to correct the Scorecard deviations (prediction failures).

This ensemble framework may be seen as an estimation by layers, where modeling

is done using more and more powerful methods from layer to layer. The advantage

of this approach relates to the use of residuals as the target in the next

layer. As the largest share of the fit is obtained in the first layers, the majority of the

model components are produced by the simplest algorithms, preserving the

interpretability of most of the prediction.

Results

indicate that the default rate distribution produced by the Scorecard is not

monotonic (which is usually a requirement in probability of default models);

however, there is a progression in the right direction when considering the 2SEM.

Furthermore, the AUC improves by 2.4% (0.019 AUC points) relative to the Scorecard and by 3.2%

(0.025 AUC points) relative to the Artificial Neural Network.

Finally,

several improvements remain to be made. Firstly, other algorithms and

parametrizations may be tested to check whether the second-stage contribution can be

improved. There is no hard evidence that the Artificial Neural Network used is

the best fit. Secondly, a generalization of the ensemble architecture should be

developed, turning the algorithm into an n-stage ensemble model. Thirdly, the

results should also be obtained for other datasets, to ensure that they are not

due to chance.
