

Over the last decade, several banks have developed models to quantify credit
risk. In addition to monitoring the credit portfolio, these models also help to
decide on the acceptance of new contracts, assess customers’ profitability and
define pricing strategy. The objective of this paper is to improve the approach
to credit risk modeling, namely in scoring models for predicting default
events. To this end, we propose a two-stage Ensemble Model that combines the
interpretability of the Scorecard with the predictive power of the Artificial
Neural Network. The results show that the AUC improves by 2.4% relative to the
Scorecard and by 3.2% relative to the Artificial Neural Network.



1. INTRODUCTION

Over the
last decade, several banks have developed models to quantify credit risk (Basel
Committee on Banking Supervision, 1999). The objective of credit risk modeling
is to estimate the expected loss (EL) associated with a credit portfolio. To do
so, it is necessary to estimate the Probability of Default (PD), the Loss Given
Default (LGD) and the Exposure At the time of Default (EAD). The portfolio’s
expected loss is given by the product of these three components (Basel
Committee on Banking Supervision, 2004).
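
As a minimal illustration of the product described above, the sketch below computes EL = PD × LGD × EAD for a single exposure. The function name and the loan figures are hypothetical and are not taken from the paper.

```python
def expected_loss(pd: float, lgd: float, ead: float) -> float:
    """Expected loss as the product PD x LGD x EAD."""
    if not (0.0 <= pd <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("pd and lgd must lie in [0, 1]")
    if ead < 0:
        raise ValueError("ead must be non-negative")
    return pd * lgd * ead


# Hypothetical loan: 2% default probability, 45% loss given default,
# exposure at default of 10,000 currency units.
example_el = expected_loss(0.02, 0.45, 10_000)
```

For a portfolio, the same product would be summed over all exposures.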

However, this work focuses only on PD models, which are typically based on
scoring models. Credit scoring models are built using historical information
from several actual customers. For each customer, some attributes are recorded,
along with whether the customer has failed to pay (defaulted). Specifically,
the objective of credit scoring is to assign credit applicants to either good
customers (non-default) or bad customers (default); it therefore lies in the
domain of classification problems (Anderson, 1978).

Credit scoring models are used by about 97% of banks that approve credit card
applications (Brill, 1998). Using scoring models increases revenue by increasing
volume, reducing the cost of credit analysis, enabling faster decisions, and
monitoring credit risk over time (Brill, 1998). As a result, credit risk
measurement has become increasingly important in the Basel II capital accord
(Basel Committee on Banking Supervision, 2003; Gestel et al., 2005).

In the banking industry, credit scorecard development has been based
mostly on logistic regression, because it balances predictive and
interpretative power. Recall that regulators require that banks be able to
explain their credit application decisions, so transparency is fundamental
to these models (Dong, Lai, & Yen, 2010; Hand & Henley, 1997). In this
paper, we propose a two-stage ensemble model to reinforce the predictive
capacity of a scorecard without compromising its transparency and
interpretability.



In recent
years, several attempts have been made to improve the accuracy of Logistic
Regression (Lessmann, Baesens, Seow, & Thomas, 2015). Louzada et al. (2016)
reviewed 187 (credit scoring) papers and concluded that the most common goal of
researchers is the proposition of new methods in credit scoring (51.3%), mainly
by using hybrid approaches (almost 20%), combined methods (almost 15%) and
support vector machine along with neural networks (around 13%). The second most
popular objective is the comparison of new methods with the traditional
techniques, where the most used techniques are Logistic Regression (23%) and
neural networks (21%). One of these studies was done by West (2000), who
compared five neural network models with traditional techniques. The results
indicate that neural networks may improve accuracy by 0.5% to 3%. Additionally,
logistic regression was found to be an alternative to neural networks. In turn,
Gonçalves and Gouvêa (2007) obtained very similar results using Logistic
Regression and neural network models. However, the proposed new methods tend to
require complex computing schemes and limit the interpretation of the results,
which makes them difficult to implement (Liberati, Camillo, & Saporta,

Lessmann et al. (2015) state that
the accuracy differences between traditional methods and machine learning
result from the fully-automatic modeling approach. Consequently, some advanced
classifiers do not require human intervention to predict significantly more
accurately than simpler alternatives. Abdou and Pointon (2011) carried out a
comprehensive review of 214 papers that involve credit scoring applications to
conclude that there is no overall best statistical technique for building
scoring models; a technique that is best in all circumstances does not yet
exist. This result is aligned with the Supervised Learning No-Free-Lunch (NFL)
theorems (Wolpert, 2002).

Marqués et
al. (2012) evaluated the performance of seven individual prediction techniques
when used as members of five different ensemble methods and concluded that C4.5
decision tree constitutes the best solution for most ensemble methods, closely
followed by the Multilayer Perceptron neural network and Logistic Regression,
whereas the nearest neighbor and the naive Bayes classifiers appear to be
significantly the worst. Gestel et al. (2005) suggested the application of a
gradual approach in which one starts with a simple Logistic Regression and
improves it, using Support Vector Machines to combine good model readability
with improved performance.



3.1. DATASET

To ensure that our results are replicable and comparable, we decided to use the
German Credit Data Set from the University of California at Irvine (UCI)
Machine Learning Repository. The dataset is available from that repository.
According to Louzada et al. (2016), almost 45% of all papers reviewed in their
survey consider either the Australian or the German credit dataset. The dataset
contains 1,000 in-force credits, of which 700 are identified as non-defaulted
and 300 as defaulted. The 20 input variables prepared by Prof. Hofmann are
presented in Table 1.

The target
variable is “status” and contains the classification of the loan in terms of
default (Lichman, 2013).

The dataset comes with a recommended cost matrix, which makes failing to
predict a default five times worse than failing to predict a non-default.
However, given this paper’s objectives, we chose not to use any cost matrix.
Thus, failing to predict a default and failing to predict a non-default have
the same cost.



In this paper, we aim to improve the approach used in credit scoring models. To
this end, we propose a Two-Stage Ensemble Model (2SEM) to reinforce the
predictive capacity of a Scorecard without compromising its transparency and
interpretability.
The concept behind the ensemble is to use several algorithms together to obtain
better performance than that obtained by each of the algorithms individually
(Rokach, 2010). In our paper, we first estimate a Scorecard (SC) model, and
then an Artificial Neural Network (ANN) is estimated on the SC residual. Then,
we ensemble the two models using a logistic regression (LR). In this way, we
intend the ANN to cover the nonlinearity that the SC is unable to capture. The
proposed architecture for the Ensemble Model is presented in Figure 1:

Where   is the set of inputs,   the target variable,   and  
are the target and residual estimates, respectively. The box operator
stands for a specific algorithm (in this case, SC, ANN, and LR) and the circle
a sum operator (where the above sign corresponds to the above variable, and the
other to the below variable). The components in Figure 1 are better described
in Table 2.
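
The two-stage flow can be sketched in plain Python as follows. This is only an illustration under several assumptions that are not taken from the paper: a plain logistic regression stands in for the scorecard, the residual is taken as the target minus the first-stage probability, the second stage is a single-hidden-layer tanh network (the paper uses three hidden layers), both learners are trained with simple gradient descent, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clip the argument so math.exp never overflows.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def fit_logistic(X, y, epochs=300, lr=0.5):
    """Batch gradient descent for logistic regression; returns [bias, w1, ...]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_logistic(w, X):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def hidden_outputs(W1, xi):
    # Each row of W1 holds the input weights followed by the bias (last entry).
    return [math.tanh(row[-1] + sum(a * b for a, b in zip(row, xi))) for row in W1]


def fit_tanh_net(X, r, hidden=3, epochs=200, lr=0.05, seed=0):
    """Stochastic gradient descent on squared error for a one-hidden-layer net."""
    rng = random.Random(seed)
    k = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(k + 1)] for _ in range(hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for xi, ri in zip(X, r):
            h = hidden_outputs(W1, xi)
            d_out = W2[-1] + sum(a * b for a, b in zip(W2, h)) - ri
            for j in range(hidden):
                d_h = d_out * W2[j] * (1.0 - h[j] ** 2)  # uses W2[j] before update
                W2[j] -= lr * d_out * h[j]
                for m in range(k):
                    W1[j][m] -= lr * d_h * xi[m]
                W1[j][-1] -= lr * d_h
            W2[-1] -= lr * d_out
    return W1, W2


def predict_tanh_net(net, X):
    W1, W2 = net
    return [W2[-1] + sum(a * b for a, b in zip(W2, hidden_outputs(W1, xi))) for xi in X]


def fit_two_stage(X, y):
    sc = fit_logistic(X, y)                           # stage 1: scorecard stand-in
    p_sc = predict_logistic(sc, X)
    residual = [yi - pi for yi, pi in zip(y, p_sc)]   # assumed residual definition
    net = fit_tanh_net(X, residual)                   # stage 2: net on the residual
    p_ann = predict_tanh_net(net, X)
    combiner = fit_logistic([[a, b] for a, b in zip(p_sc, p_ann)], y)
    return sc, net, combiner


def predict_two_stage(model, X):
    sc, net, combiner = model
    pairs = zip(predict_logistic(sc, X), predict_tanh_net(net, X))
    return predict_logistic(combiner, [[a, b] for a, b in pairs])


# Synthetic, XOR-like data that a linear first stage cannot separate.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if x1 * x2 > 0 else 0 for x1, x2 in X]
model = fit_two_stage(X, y)
probs = predict_two_stage(model, X)
```

The combining logistic regression receives only two inputs, so its coefficients show directly how much weight falls on the first-stage estimate and how much on the residual correction.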

Lastly, to avoid overfitting, the dataset was split randomly into a training
set (65%), a validation set (15%) and a test set (20%). In this process, we
used stratified sampling on the target variable to ensure that the event
proportion is similar in all sets.
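
A stratified 65/15/20 split of this kind can be sketched as follows. The function name, the seed and the rounding rule are assumptions made for illustration; the label vector simply mimics the 700/300 class balance of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.65, 0.15, 0.20), seed=0):
    """Return index lists (train, validation, test), stratified on the label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# 700 non-defaults (0) and 300 defaults (1), as in the German credit data.
labels = [0] * 700 + [1] * 300
train, valid, test = stratified_split(labels)
```

Because each class is split separately, the 30% event rate is preserved in all three sets.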



Following the performance assessment approach of Hamdy & Hussein (2016), we
rely on the confusion matrix and the Area Under the ROC Curve (AUC) to compare
the predictive quality of the 2SEM, SC and ANN.



The confusion matrix is a very widespread concept, and it allows a more
detailed analysis of the right and wrong predictions. As may be seen in Figure
2.4, there are two possible predicted classes and two actual classes as well.
The combination of these classes gives rise to four possible outcomes: True
Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN).


These classifications have the following meaning:

• True Positive: observations that we predict as default and that are actually
default;

• False Positive: observations that we predict as default but that are actually
non-default (type I error);

• True Negative: observations that we predict as non-default and that are
actually non-default;

• False Negative: observations that we predict as non-default but that are
actually default (type II error).


To ease the interpretation of the matrix, the following measures may be
computed:

Of these measures, accuracy takes a central place. However, this metric must be
used carefully, especially on unbalanced datasets (such as the one we are
using). For example, in a dataset with a 5% event rate, a constant prediction
of non-event would have an accuracy of 95%, better than a stochastic model that
is correct 90% of the time in a dataset with a 50% event rate. Clearly, this
metric is not robust for comparisons between models applied to datasets with
different event rates. However, we may use it to compare models on the same
dataset, which is precisely what we want to do. Moreover, we will use its
complement, the Misclassification Rate.
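
The sketch below reproduces the 5% event rate example with a few standard confusion matrix measures. The selection of measures (accuracy, misclassification rate, sensitivity, specificity) and the function names are assumptions for illustration, not necessarily the exact list used in the paper.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN), with 1 = default (event) and 0 = non-default."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def summary(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# 5% event rate and a constant prediction of non-event for every observation.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
metrics = summary(actual, predicted)
```

The constant predictor reaches 95% accuracy while its sensitivity is zero, which is why accuracy alone is not informative across datasets with different event rates.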



Another measure for assessing predictive power is the Area Under the Receiver
Operating Characteristic (ROC) Curve (AUC). The curve is created by plotting
the true-positive rate against the false-positive rate at various cutoff
points. The true-positive rate is the probability of identifying a default,
while the false-positive rate is the probability of a false alarm. An AUC of
0.5 (random predictor) is used as a baseline to see whether the model is useful
or not (Provost & Fawcett, 2013).
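
As a sketch of this construction, the code below builds the ROC points by lowering the cutoff through every distinct score and integrates them with the trapezoidal rule. The function names and the six-observation example are hypothetical, and both classes are assumed to be present in the data.

```python
def roc_points(actual, scores):
    """(FPR, TPR) pairs obtained by lowering the cutoff through each distinct score."""
    pos = sum(actual)
    neg = len(actual) - pos
    pairs = sorted(zip(scores, actual), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Observations with tied scores cross the cutoff together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_points(actual, scores))    # 8 of 9 pairs ranked correctly

constant_auc = auc(roc_points(actual, [0.5] * 6))  # uninformative scores
```

In the example, eight of the nine (default, non-default) pairs are ranked correctly, giving an AUC of 8/9, while constant scores give the 0.5 baseline.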

Compared to the confusion matrix, this method has the advantage of not
requiring the definition of a cutoff (the value above which the probability of
default is high enough to consider the customer a bad one). Besides, it is also
suited to unbalanced datasets (Hamdy & Hussein, 2016). However, the use of the
ROC curve as the sole misclassification criterion has decreased significantly
in the literature over the years. More recently, metrics based on the confusion
matrix have become the most common (Louzada et al., 2016).



In this section, we first present the estimation results for both the 2SEM and
the baselines (SC and ANN). The results obtained are then analyzed and compared
to select the most appropriate model.


4.1. SCORECARD

Prior to scorecard estimation, some input variables had to be binned. This
process consisted of grouping the values of an input variable that had similar
event behavior (with respect to the target variable). The cutoffs used
maximized the Weight of Evidence (WOE), a metric for a variable's Information
Value (IV) (Zeng, 2014). The binning outcome consisted of 20 new categorical
input variables, which were then used in a stepwise selection algorithm. As a
result, the following seven input variables were included in the scorecard: Age
in years, Credit amount, Credit history, Duration in month, Purpose, Savings
account/bonds and Status of existing CA. The estimates can be seen in Table 4.
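
For reference, the sketch below computes WOE and IV for one binned variable using the common definitions WOE = ln(share of goods / share of bads) and IV = Σ (share of goods − share of bads) × WOE. The sign convention, the function name and the bin counts are assumptions for illustration; the paper does not state its exact formulas, and the counts are not taken from the dataset.

```python
import math


def woe_iv(bins):
    """bins maps a bin label to (n_good, n_bad); returns (WOE per bin, IV).

    Every bin is assumed to contain at least one good and one bad customer,
    otherwise the logarithm is undefined.
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    woe = {}
    iv = 0.0
    for label, (n_good, n_bad) in bins.items():
        dist_good = n_good / total_good
        dist_bad = n_bad / total_bad
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv


# Hypothetical binning of one variable over 700 good and 300 bad customers.
example_bins = {"low": (100, 100), "mid": (250, 120), "high": (350, 80)}
example_woe, example_iv = woe_iv(example_bins)
```

Under this convention, a positive WOE marks a bin with relatively more good customers than the portfolio average, and a negative WOE marks a riskier bin.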

The score points in this scorecard increase as the event rate decreases. The
estimation parametrization ensures that a score of 200 represents odds of 50 to
1 (that is, P(Non-default)/P(Default) = 50). The neutral score in a variable is
16, and an increase of 20 score points corresponds to a doubling of the odds.
The link between score points and the probability of default is pictured in
Figure 2.
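
The scaling just described (200 points at odds of 50 to 1, 20 points to double the odds) corresponds to the usual log-odds scaling score = offset + factor × ln(odds). The sketch below assumes this standard form; the function and constant names are hypothetical.

```python
import math

BASE_SCORE = 200.0      # score at the reference odds
BASE_ODDS = 50.0        # P(Non-default) / P(Default) at the reference score
POINTS_TO_DOUBLE = 20.0

FACTOR = POINTS_TO_DOUBLE / math.log(2.0)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_odds(odds):
    """Score points for given good-to-bad odds."""
    return OFFSET + FACTOR * math.log(odds)


def pd_from_score(score):
    """Probability of default implied by a score under this scaling."""
    odds = math.exp((score - OFFSET) / FACTOR)
    return 1.0 / (1.0 + odds)


pd_at_200 = pd_from_score(200.0)   # odds 50:1, so PD = 1/51
pd_at_220 = pd_from_score(220.0)   # odds 100:1, so PD = 1/101
```

Under this scaling, a score of 200 implies a PD of 1/51 (about 1.96%) and a score of 220 implies a PD of 1/101 (about 0.99%).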



The neural network was designed with five layers: an input layer, three hidden
layers, and an output layer. The input layer has 20 variables, while each
hidden layer includes three neurons with the Tanh activation function. In
total, we included 9 hidden neurons and estimated 208 weights. Figure 3
presents the Artificial Neural Network architecture.
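
The weight count can be checked with a short calculation, assuming fully connected layers with one bias per neuron. Under that assumption the hidden and output layers account for 28 weights, so 208 weights would be consistent with 59 input units (for example, after dummy coding of the categorical variables). The figure of 59 is inferred from the arithmetic only and is not stated in the paper.

```python
def count_weights(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hidden layer 1 -> hidden layer 2 -> hidden layer 3 -> output.
after_first_hidden = count_weights([3, 3, 3, 1])

# Assumed 59 input units (inferred, not stated in the paper).
total_weights = count_weights([59, 3, 3, 3, 1])
```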

The optimization process ended on the 10th iteration, achieving an average
validation error of 0.496, as presented in Figure 3.



The 2SEM consists of a logistic regression using the PD estimate from the SC
(P_Scard) and the SC residual estimate from the ANN (P_ANN) as inputs. We
expect P_Scard to account for the majority of the 2SEM predictive power, while
P_ANN is supposed to correct P_Scard deviations (prediction failures). The
coefficient estimates are presented in Table 5.

As may be seen, P_Scard is the main contributor to the 2SEM (the P_Scard
standardized estimate is twice that of P_ANN), and both are statistically
significant.


4.4. DISCUSSION

In this section, we compare the Scorecard, the Artificial Neural Network and
the Two-Stage Ensemble Model according to confusion matrix metrics and AUC.
First, however, Figure 5 presents the default rate distribution across scoring
deciles. To obtain these distributions, the test dataset was sorted in
ascending order by target prediction (for each model) and divided into 10
equally populated bins. Then the averages of Status (DefRate) and of the Status
prediction (AvgProb) were computed. Analyzing these plots, we find that none of
the distributions is monotonic (which is usually a requirement in a probability
of default model); however, there is an evolution in the right direction from
the SC to the 2SEM.
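
The decile analysis can be sketched as follows: observations are sorted by predicted probability, cut into ten equally populated bins, and the observed default rate (DefRate) and mean prediction (AvgProb) are averaged per bin. The function names and the synthetic data are assumptions for illustration; the data are constructed so that the default rate rises by exactly 0.1 per bin.

```python
def decile_table(actual, predicted, n_bins=10):
    """Return one (DefRate, AvgProb) pair per bin, bins sorted by prediction.

    Assumes there are at least n_bins observations, so no bin is empty.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        def_rate = sum(actual[i] for i in idx) / len(idx)
        avg_prob = sum(predicted[i] for i in idx) / len(idx)
        rows.append((def_rate, avg_prob))
    return rows


def is_monotonic(values):
    """True when the sequence never decreases from one bin to the next."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Synthetic scores and outcomes for 200 observations.
predicted = [i / 200 for i in range(200)]
actual = [1 if i % 10 < i // 20 else 0 for i in range(200)]
rows = decile_table(actual, predicted)
def_rates = [def_rate for def_rate, _ in rows]
```

A well-behaved PD model should pass the `is_monotonic` check on the default rates, in addition to showing DefRate close to AvgProb in each bin.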

We turn now to the fit statistics, presented in Table 6. The results indicate
that the 2SEM has a better fit to the data according to all these statistics.
In particular, the AUC improves by 2.4% (0.019 in absolute terms) relative to
the Scorecard and by 3.2% (0.025 in absolute terms) relative to the Artificial
Neural Network.

This result is reinforced by the ROC curve representation. Figure 6 presents
the ROC curves for the training, validation and test datasets.


5. CONCLUSION

Credit scoring models attempt to measure the risk of a customer failing to pay
back a loan based on his or her characteristics. In the banking industry, the
most popular model is the scorecard, because it balances predictive and
interpretative power. Recall that regulators require that banks be able to
explain their credit application decisions, so transparency is fundamental
to these models. In this paper, we propose a new ensemble framework for the
credit scoring model to reinforce the predictive capacity of a scorecard
without compromising its transparency and interpretability.

The two-stage ensemble model consists of a logistic regression using the PD
estimate from the Scorecard and the Scorecard residual estimate (obtained
through the Artificial Neural Network) as inputs. Thus, the Scorecard estimate
(PD) accounts for the majority of the 2SEM predictive power, while the
Artificial Neural Network helps to correct the Scorecard deviations (prediction
failures). This ensemble framework may be seen as an estimation by layers,
where modeling is done using more and more powerful methods from layer to
layer. The advantage of this approach relates to the use of residuals as the
target in the next layer. As most of the fit is obtained in the first layers,
the majority of the model components are produced by the simplest algorithms,
preserving the interpretability of most of the prediction.

The results indicate that the default rate distribution produced by the
Scorecard is not monotonic (which is usually a requirement in probability of
default models); however, there is an evolution in the right direction when
considering the 2SEM. Furthermore, the AUC improves by 2.4% (0.019 in absolute
terms) relative to the Scorecard and by 3.2% (0.025 in absolute terms) relative
to the Artificial Neural Network.

Several improvements are still to be made. First, other algorithms and
parametrizations may be tested to check whether the second-stage contribution
can be improved. There is no hard evidence that the Artificial Neural Network
used is the best fit. Second, a generalization of the ensemble architecture
should be developed, turning the algorithm into an n-stage ensemble model.
Finally, the results should also be obtained for other datasets, to ensure that
they are not due to chance.
