Low-for-long interest rates, sluggish economic growth and increased regulatory requirements affect both banks’ revenues and the structure of their balance sheets in ways that depend, inter alia, on their business model. Analyzing a bank’s business model can thus help to identify the risks to which it is exposed and, by extension, to estimate the effects that it would experience in the event of, for example, an economic shock or an increase in regulatory requirements. According to the European Banking Authority (EBA), this approach makes it possible to assess the viability of banks’ business models and the sustainability of their strategies. Hence, business model analysis is one of the four pillars1 of the Supervisory Review and Evaluation Process (SREP2), the results of which contribute to the setting of individual regulatory requirements under Pillar 2 of the European Directive CRD IV3.
Business model analysis involves identifying each bank’s business model and assigning the bank to a single, relatively homogeneous cluster. Taken as a whole, however, banks carry out a wide range of activities in varying proportions. As a result, there is no harmonized and generally accepted definition of the different business models of banks (Cernov and Urbano, 2018)4. The classification of banks by business model can thus involve a significant amount of so-called expert judgment. This approach, which is based on the authors’ personal assessment, has the advantage of being easy to apply, but its more or less arbitrary nature leaves it open to challenge. The literature therefore proposes various methods for objectively identifying the business model of banks.
The approach favored in the recent literature is hierarchical cluster analysis (HCA), and more particularly the agglomerative clustering method. Agglomerative clustering is a data-driven, iterative process that successively aggregates banks according to their common characteristics. At the end of the aggregation, each bank is assigned to a homogeneous cluster, clearly distinguished from the others. Because the classification of banks by business model is produced by an algorithm, this quantitative approach is more objective and more robust than an approach based solely on expert judgment. It also has the advantage of avoiding assumptions about the optimal number of clusters, which is determined ex post.
In the literature, the clustering of banks by business model is based on balance sheet variables (total assets, deposits as a share of total assets, leverage ratio, etc.). However, many items are not included in the balance sheet, and some activities that generate a substantial proportion of banking income do not involve significant asset holdings.
Such an approach implicitly assumes that a bank’s business model is mainly reflected in the structure of its balance sheet, which itself adequately represents its various sources of income. However, in their financial reports, banks often present a breakdown of their sources of income to illustrate their business model.
We therefore propose a classification of European banks5 according to their business model that gives, in addition to the traditional balance sheet variables, greater importance to the different sources of income that make up their net banking income. We also add assets under management to reflect, to some extent, the significance of off-balance sheet activities, which are often ignored in the literature. In addition, we introduce two methodological innovations. First, we retain three principal components, as opposed to the usual two, which allows us to retain more information and improves the quality of our classification. Second, rather than an agglomerative clustering method, we use a divisive clustering method which, according to a statistical test, produces better results.
Finally, we estimate the optimal number of business models for European banks to be five. We name them according to the average characteristics of each cluster and with reference to the literature: pure retail banks, retail-oriented commercial banks, commercial banks, universal banks, and investment banks and assimilated. The variables related to net banking income and the assets under management variable prove to be particularly relevant for identifying bank business models. Our results remain consistent with those obtained in the literature.
Data preparation and model selection
The purpose of our study is to establish an objective classification of European banks so that new European banks, resulting for example from the merger of several institutions, can later be classified within the framework established. Our method could also be transposed to other geographical areas, countries or banking systems for the purpose of international comparisons, taking into account differences in accounting standards, which can affect, among other things, the items recorded on the balance sheet. It would also be possible to observe the evolution of a particular bank’s business model by comparing its classification over several periods. Finally, the sensitivity of each business model to changes in interest rates or prudential regulation could also be analyzed. Wherever possible, we apply a protocol in which each of our choices is guided by best practices.
[Table: Retained variables]
[Table: Choice of the number of principal components to retain]
Ensuring transposability of results with a large sample
Correct sample selection is essential in clustering because banks are classified relative to each other, so sampling bias can affect the results. In addition, the sample should be large enough to cover as many variants of banking business models as possible; otherwise, it may not be representative enough for the results to be transposed. An overly general approach may also produce insufficiently precise or even aberrant results. In such a case, some banks deemed to be universal could be misclassified as investment banks simply because the other institutions in the sample are not engaged in market activities, which would then act as a strong differentiation criterion.
Within the limits of the exploitable data in SNL, 2946 banks were initially selected. Highly specialized banks (car loans only, credit cards, pawnbroking, etc.) are excluded from the sample for the reasons given above. Moreover, their business model is already clearly identified, and their exclusion should make it possible to distinguish more effectively the less obvious business models of the banks in the sample. The selected data cover all banking groups in the European Economic Area6 at their highest level of consolidation, since this is generally the level at which regulatory requirements apply. The exclusion of subsidiaries helps to avoid redundancy of information, which could lead to over-representation of particular business models. Finally, the segmentation of activities by subsidiary is potentially a component of a group’s strategy.
The clustering algorithm is sensitive to missing values and outliers. Missing values are therefore checked and corrected as far as possible; where this is not possible, the bank is eliminated from the sample, which reduces the initial sample by 759 banks. The same applies to extreme values7: a dedicated algorithm8 identifies 62 additional banks as having extreme values, and these are also eliminated. The final sample thus comprises 2125 banks (2946 less 759 less 62).
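The footnote attached to the outlier-detection algorithm cites the local outlier factor (LOF) of Breunig et al. (2000). As a rough illustration only, the following standard-library Python sketch computes LOF scores on a toy data set. The function name, the neighbourhood size `k`, the cut-off `THRESHOLD` and the data are all hypothetical and are not taken from the study; the sketch also assumes that no two observations are identical.

```python
import math

def local_outlier_factors(points, k=2):
    """Return the LOF score of each point (a score well above 1 flags an outlier)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbours = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        # neighbourhood of i, ties at the k-distance included
        neighbours.append([j for j in order if dist[i][j] <= kd])

    def density(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    lrd = [density(i) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density
    return [sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]

# Hypothetical standardized observations: four similar banks and one atypical bank.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factors(sample, k=2)
THRESHOLD = 2.0  # hypothetical cut-off, not taken from the study
flagged = [i for i, s in enumerate(scores) if s > THRESHOLD]
```

In this toy example only the last observation is flagged; in practice both `k` and the cut-off would need to be chosen with care.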
The data are normalized to make them more easily comparable. Beforehand, they are averaged over three years (2016, 2017 and 2018) in order to smooth out cyclical fluctuations, since over-interpreting one-off developments could lead to a bank being misclassified. A significantly longer period could mask the evolution of a business model. Data for 2019, which are too often missing, are not included: otherwise, the sample would shrink by more than half and would consist mainly of the largest banks, while small German and Italian banks, in particular, would be eliminated.
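The two preparation steps described above (three-year averaging, then normalization) can be sketched as follows. The text does not specify the normalization used; the z-score below is an assumption, and the bank names and figures are invented for illustration.

```python
from statistics import mean, pstdev

def three_year_average(series_by_bank):
    """Average each bank's yearly observations (here 2016, 2017 and 2018)."""
    return {bank: mean(values) for bank, values in series_by_bank.items()}

def standardize(value_by_bank):
    """Z-score normalization (assumed here): zero mean, unit standard deviation."""
    values = list(value_by_bank.values())
    mu, sigma = mean(values), pstdev(values)
    return {bank: (v - mu) / sigma for bank, v in value_by_bank.items()}

# Hypothetical ratios of net loans to total assets for three fictitious banks.
loan_share = {
    "bank_a": [0.80, 0.82, 0.84],
    "bank_b": [0.60, 0.58, 0.62],
    "bank_c": [0.30, 0.35, 0.25],
}
averaged = three_year_average(loan_share)
normalized = standardize(averaged)
```

Averaging before normalizing means that a single unusual year moves a bank’s standardized value by only a third of the shock.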
The dimensionality reduction allows more information to be retained
On the basis of the variables traditionally used in the literature to identify banks’ business models, and of a step-by-step selection based on an analysis of the correlations between variables, thirteen variables are finally retained (see Table 1):
- eight balance sheet variables traditionally found in the literature,
- four variables covering the main lines of the net banking income according to the nature of the income: net interest income, net fee and commission income, net trading income9 and other net income, and
- a variable for assets under management.
However, according to Han et al. (2012), cluster analysis becomes less efficient as the number of retained variables increases. The authors suggest that one solution is to reduce the dimensionality of the variables by means of a principal component analysis. This is the approach followed, for example, by Farnè and Vouldis (2017)10. The thirteen variables that we initially select are thus linearly combined into several principal components using a procedure that is robust to both small and large samples and whose results are not affected by extreme values11. Contrary to the literature, we have chosen not to apply the Kaiser criterion12 when determining the number of principal components to retain, because it no longer appears well suited to the possibilities of current research13. Applying this heuristic criterion would have led us to retain only the two principal components whose variance (or eigenvalue) is greater than 1 (see Table 2), as is generally the case in the literature. Instead, we retain three principal components in order to preserve 79.21% of the information contained in the initial data (more precisely, of the multivariate variance).
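To make the trade-off concrete, the sketch below contrasts the Kaiser criterion with the share of variance preserved by a given number of components. The eigenvalues are purely hypothetical (chosen only so that they sum to 13, as they would for thirteen standardized variables); they are not the values reported in Table 2 and do not reproduce the 79.21% figure.

```python
def kaiser_count(eigenvalues):
    """Number of components kept by the Kaiser criterion (eigenvalue > 1)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def cumulative_share(eigenvalues, n_components):
    """Share of total variance carried by the n largest components."""
    ranked = sorted(eigenvalues, reverse=True)
    return sum(ranked[:n_components]) / sum(ranked)

# Hypothetical eigenvalues for thirteen standardized variables (NOT Table 2).
eigenvalues = [5.2, 3.9, 0.95, 0.8, 0.6, 0.5, 0.4, 0.25, 0.15, 0.1, 0.08, 0.05, 0.02]

kept_by_kaiser = kaiser_count(eigenvalues)       # third eigenvalue is just below 1
share_two = cumulative_share(eigenvalues, 2)     # variance kept under Kaiser
share_three = cumulative_share(eigenvalues, 3)   # variance kept with one more component
```

With these invented values, a third component whose eigenvalue falls just short of 1 is discarded by the Kaiser rule even though it adds several percentage points of explained variance, which is the kind of situation the paragraph describes.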
DIANA method and AGNES method 
Traditionally, the literature uses the Agglomerative nesting (AGNES) clustering method. This bottom-up method is based on an algorithm that classifies banks by successive aggregations according to the proximity of their characteristics. At each stage of this iterative process, the two banks and/or clusters of banks separated by the shortest distance, measured by a combination of the numerical values taken by the variables characterizing them, are aggregated into a new cluster. Initially, each bank is considered to constitute its own cluster, a singleton, and then the total sample is gradually reconstituted by successive aggregations (see Chart 1).
The hierarchical cluster analysis method that we use, because of the better results that it produces, is known as Divisive analysis clustering (DIANA). This top-down method, which is also iterative, initially considers the sample as a single cluster, which it then divides in two. At each of the n-1 steps (n being the number of banks in the sample), the most heterogeneous cluster (the one with the highest variance) is split in two by maximizing the distance14 between the two new groups created (the «splinter group» and the «old party»). At the end of the process, each bank constitutes its own cluster, a singleton15.
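A single DIANA division can be sketched as follows, using the Euclidean distance mentioned in the footnote. This is a simplified illustration of the usual «splinter group» procedure, not the implementation used in the study; the function name and the toy data are hypothetical.

```python
import math

def diana_split(points):
    """Split one cluster into (splinter_group, old_party), returned as index lists."""
    def avg_dist(i, group):
        # average Euclidean distance from point i to the other members of group
        others = [j for j in group if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    old_party = list(range(len(points)))
    # the most isolated observation starts the splinter group
    seed = max(old_party, key=lambda i: avg_dist(i, old_party))
    splinter = [seed]
    old_party.remove(seed)
    while len(old_party) > 1:
        # a positive gain means the point is, on average, closer to the splinter group
        gains = {i: avg_dist(i, old_party) - avg_dist(i, splinter) for i in old_party}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        splinter.append(best)
        old_party.remove(best)
    return sorted(splinter), sorted(old_party)

# Hypothetical principal-component scores: three similar banks and two others.
toy_banks = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
splinter_group, remaining = diana_split(toy_banks)
```

Repeating this division on the most heterogeneous remaining cluster until every observation stands alone yields the full top-down hierarchy.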
The DIANA method classifies European banks into five business models
Our classification produces statistically satisfactory results. These results tend to validate both the addition of the variables related to net banking income and of the assets under management variable, and the use of three principal components. We identify an optimal number of five business models, which we name in line with the literature.
[Chart: DIANA and AGNES methods]
[Chart: Dendrogram of European banks with the DIANA method]
[Chart: Three-dimensional representation of European banks’ classification using the DIANA method]
The optimal number of business models is five
At the end of the hierarchical cluster analysis (agglomerative or divisive), the optimal number of clusters (a term that does not imply any hierarchy between banks) can be identified objectively thanks to a dedicated algorithm that tests more than thirty different indices16, including the Calinski and Harabasz17 index, which is the most common in the literature. This is one of the main advantages of hierarchical clustering methods: they do not require an ex ante assumption about the appropriate number of clusters into which to classify banks. In this case, the European banks in our sample are classified according to their business model into five different clusters.
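As an illustration of one of these indices, the sketch below computes the Calinski and Harabasz index, i.e. the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. It is a minimal stand-alone version under simple assumptions (Euclidean geometry, at least two clusters, fewer clusters than observations); the data and labels are invented.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: higher values indicate better-separated clusters."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # between-cluster dispersion, weighted by cluster size
    between = sum(len(pts) * sq_dist(centroid(pts), overall) for pts in clusters.values())
    # within-cluster dispersion around each cluster's own centroid
    within = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical data: two visibly distinct groups of three observations.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_partition = calinski_harabasz(toy_points, [0, 0, 0, 1, 1, 1])
poor_partition = calinski_harabasz(toy_points, [0, 1, 0, 1, 0, 1])
```

Computing the index for each candidate number of clusters and retaining the value that maximizes it is one way, among the many indices tested, of choosing the number of clusters after the hierarchy has been built.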
Dendrogram and 3D representation
The result of the successive divisions (or aggregations) can be represented by a classification tree or dendrogram18 (see Chart 2). The branch height (or cophenetic distance) indicates the distance between two banks and/or clusters of banks: the longer the branch, the more different they are. A cophenetic correlation coefficient can also be calculated to estimate the quality of the classification: the closer the coefficient is to 1, the better the classification. It is notably this criterion that leads us to prefer the DIANA method to the AGNES method, their coefficients being 0.72 and 0.55 respectively19. Furthermore, Kassambara (2017)20 considers that the DIANA method is more suitable than the AGNES method for the classification of large samples. Finally, Roux (2018)21 demonstrates that top-down algorithms are more efficient than their bottom-up equivalents.
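For reference, the cophenetic correlation coefficient is usually defined as the linear correlation between the original pairwise distances and the cophenetic distances read from the dendrogram. The notation below is introduced here for illustration and does not appear in the original text: d_ij is the Euclidean distance between banks i and j, t_ij is the height at which they are first joined in the dendrogram, and the bars denote averages over all pairs.

```latex
c = \frac{\sum_{i<j} \left(d_{ij} - \bar{d}\right)\left(t_{ij} - \bar{t}\right)}
         {\sqrt{\sum_{i<j} \left(d_{ij} - \bar{d}\right)^{2} \; \sum_{i<j} \left(t_{ij} - \bar{t}\right)^{2}}}
```

A value close to 1 means that the tree preserves the original distances between banks well.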
The results of the classification can also be represented in three dimensions, with each of the three axes representing a principal component (see Charts 3 to 6). This provides another view of the proximity between individual banks on the one hand and between clusters of banks on the other hand. It is thus clearer that banks belonging to cluster 2 have very similar characteristics, whereas the characteristics of banks in clusters 4 and 5 are more heterogeneous.
From pure retail banks to investment banks (and assimilated)
We designate the five banking business models identified on the basis of the average of the variables observed for each cluster (see Charts 7 to 9) and by using the names commonly used in the literature:
- The pure retail banking model comprises the 310 banks in cluster 1, for which, on average22, net loans to customers account for 83% of total assets, customer deposits for 74% of total assets and net interest income for 83% of net banking income,
- The retail-oriented commercial banking model encompasses the 1491 banks in cluster 2. Net loans to customers constitute, on average, 60% of total assets of the banks belonging to this category; total securities, 22%; net interest income and net fee and commission income, 68% and 24% of net banking income respectively,
- The commercial banking model is that of the 148 banks in cluster 3. Net loans to customers constitute, on average, 72% of total assets, total wholesale debt, 26% of total assets while assets under management represent 11% of total assets23. The distribution of income by source is comparable to that of retail-oriented commercial banks,
- The investment banking and assimilated model combines the 94 banks in cluster 4. Net loans to customers constitute, on average, 31% of total assets, customer deposits 67% of total assets and net fee and commission income 64% of net banking income,
- The universal banking model brings together the 82 banks in cluster 5. Net loans to customers represent, on average, 39% of total assets, assets under management 29% of total assets, customer deposits 41% of total assets, net interest income and net fee and commission income represent 38% and 31% of net banking income respectively.
[Chart: Breakdown of banking income sources by business model - DIANA method]
[Chart: Balance sheet and off-balance sheet items by business model - DIANA method]
Pure retail banks are easily identifiable both by the structure of their balance sheet, which is to a large extent oriented towards the collection of customer deposits, and by the nature of their income, which consists mainly of interest income. Investment banks and assimilated also differ markedly from other business models by the preponderance of net fee and commission income in their net banking income. Universal banks are characterized by the balance between their sources of income, compared with banks in other clusters for which one type of income predominates. Moreover, the structure of the resources of universal banks is very different from that of investment banks and assimilated. Clear differences can also be observed in the structure of the resources of the two categories of commercial banks. Thus, the literature sometimes distinguishes some commercial banks by describing them as «wholesale funded».
[Chart: Breakdown of funding sources for banking activities by business model - DIANA method]
Like the literature, our results illustrate the importance of banks’ balance sheet variables in identifying their business model. The breakdown of net banking income and assets under management are also relevant. Finally, a selection of fifty banking groups amongst the largest in Europe in terms of Common Equity Tier 1 seems to be correctly classified according to their business model, as far as our expert judgment can be applied (see Table 3). The over-representation of universal banks in this sub-sample highlights the correlation between the size of an institution and the diversification of its activities.
Alternative classification with the AGNES method 
In the context of our classification of European banks according to their business model, the DIANA method appears, as noted above, to perform better than the AGNES method. Moreover, beyond the statistical criteria, the results obtained with the DIANA method seem better to us: the different business models are more clearly distinguishable, particularly in the case of commercial banks. However, agglomerative clustering is often preferred to divisive clustering in the literature24. We therefore also apply the agglomerative method to our sample for comparison purposes.
The AGNES method requires an additional assumption to be made
Compared with the DIANA method, the AGNES method requires an additional assumption. Although the calculation of the distance between individual banks is common to both approaches, the AGNES method requires choosing among several options for calculating the distance between two clusters from the previously calculated distances between each pair of banks in these two clusters. The most frequently used aggregation measure is the so-called «Ward’s minimum variance measure». It takes into account the relative weight of each cluster and uses its center of gravity as a reference for the calculation of the distance25. Ward’s linkage method minimizes the total variance (distance) between banks in the same cluster: at each step, it aggregates the banks or clusters of banks whose merger yields the lowest variance (distance). The banks are thus aggregated until they form homogeneous clusters (minimization of the within-cluster distance) that are as distinct as possible from each other (maximization of the between-cluster distance). As with the DIANA method, the results obtained with the AGNES method can be represented by a dendrogram (see Chart 10) as well as in a chart using the three principal components as axes (see Charts 11 to 14). The optimal number of clusters is, as with the DIANA method, five; it is determined using the same set of indices.
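Ward’s measure can be illustrated with a short sketch. In its usual formulation, the distance between two clusters is the increase in the within-cluster sum of squares that their merger would cause, which depends only on the cluster sizes (their relative weight) and on the distance between their centers of gravity. The function names and the toy clusters below are hypothetical.

```python
def centroid(cluster):
    """Center of gravity of a cluster of points."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def sum_of_squares(cluster):
    """Within-cluster sum of squared distances to the center of gravity."""
    c = centroid(cluster)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in cluster)

def ward_distance(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    weight = len(cluster_a) * len(cluster_b) / (len(cluster_a) + len(cluster_b))
    return weight * gap

# Hypothetical clusters on a single axis, written as 2-D points.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(4.0, 0.0)]
increase = ward_distance(cluster_a, cluster_b)
# Cross-check against the direct computation of the merged sum of squares.
direct = (sum_of_squares(cluster_a + cluster_b)
          - sum_of_squares(cluster_a) - sum_of_squares(cluster_b))
```

At each step, the agglomerative algorithm would merge the pair of clusters for which this increase is smallest.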
[Chart: Dendrogram of European banks using the AGNES method]
The AGNES method makes the naming of the business models more delicate
With the AGNES method, we apply the same procedure as with the DIANA method to name the five business models identified. The averages of the variables in each group differ substantially from one method to the other. As a result, the results are only imperfectly comparable, which sometimes leads us to name a given cluster differently:
- The pure retail banking model comprises the 212 banks in cluster 1, of which, on average, net loans to customers account for 84% of total assets, customer deposits 77% of total assets and net interest income 86% of net banking income,
- The commercial banking model encompasses the 517 banks in cluster 2. Net loans to customers constitute, on average, 50% of total assets of the banks belonging to this category; total securities, 34%; net interest income and net fee and commission income, 69% and 23% of net banking income respectively,
- The retail-oriented commercial banking model is that of the 821 banks in cluster 3. Net loans to customers constitute, on average, 66% of total assets, total wholesale debt, 1% of total assets while assets under management represent 0% of total assets. The distribution of income by source is almost identical to that of commercial banks,
- The wholesale funded commercial banking model combines the 380 banks in cluster 4. Net loans to customers constitute, on average, 72% of total assets, customer deposits 59% of total assets and net fee and commission income 22% of net banking income,
- The universal banking model brings together the 195 banks in cluster 5. Net loans to customers represent, on average, 33% of total assets, assets under management 13% of total assets, customer deposits 57% of total assets, net interest income and net fee and commission income represent 33% and 45% of net banking income respectively.
Naming the business model of the banks in cluster 1 is relatively easy. Moreover, the average characteristics of the banks in this cluster are relatively similar regardless of the hierarchical clustering method used (AGNES or DIANA). The banks in cluster 5 are again identified as universal banks but, compared with the classification obtained with the DIANA method, the cluster of universal banks within the meaning of the AGNES method also includes the investment banks and assimilated within the meaning of the DIANA method. With an optimal number of five clusters, the AGNES method therefore fails to identify investment banks and assimilated. Finding representative names for the business models of the banks in clusters 2, 3 and 4 is more difficult with the AGNES method than with the DIANA method, as the average values of the variables that characterize them are close together (see Charts 14 to 16). In particular, the distribution of the different sources of income is extremely similar across clusters 2, 3 and 4. This may help to explain why the literature that uses the agglomerative clustering method makes only moderate use of the different sources of net banking income. Moreover, the relative size of the clusters is more homogeneous with the AGNES method than with the DIANA method.
[Chart: Three-dimensional representation of European banks’ classification using the AGNES method]
This seems rather counter-intuitive in view of the natural over-representation in the sample of German Sparkassen or small Italian banks, whose business models are likely to be relatively similar. In this respect, the DIANA method appears, once again, to be more suitable for our sample of European banks than the AGNES method. Finally, as the groupings obtained with the two hierarchical clustering methods are not perfectly comparable, the classification of an individual bank only makes sense in comparison with the classification of other banks using the same method.
[Table: Classification of a selection of the largest European banks according to the business model]
***
Identifying banks’ business models presents challenges for managers, investors, regulators, supervisors, monetary authorities, etc. The sensitivity of a bank’s income to cyclical and financial developments, its maximum losses in a given context or, in another respect, its ability to transmit monetary policy and to finance the economy during an economic downturn depend to a large extent on its business model. However, there is no harmonized definition of this term, and recourse to so-called expert judgment is frequent despite its relatively arbitrary nature.
[Chart: Breakdown of banking income sources by business model - AGNES method]
We therefore propose to classify European banks objectively by applying, as far as possible, the most appropriate method according to a set of statistical criteria. A bank’s business model is reflected in its balance sheet composition as well as in its income structure, which, contrary to common practice in the literature, we also take into account in our analysis, in combination with the balance sheet data. We thus identify five banking business models, ranging from pure retail banks to investment banks and assimilated, which cover all the activities carried out by European banks, with the exception of highly specialized banks. The statistical indicators lead us to prefer a divisive (top-down) hierarchical classification to the agglomerative (bottom-up) method most commonly used in the literature. Authors in this literature generally retain two principal components, whereas our approach is based on three principal components in order to preserve more information. We also emphasize the importance of the distribution of the different sources of banking revenues in identifying the business model of a bank, in addition to traditional balance sheet variables.
[Chart: Balance sheet and off-balance sheet items by business model - AGNES method]
Finally, our study paves the way for many future applications. One is the classification of new banks within our framework. In addition, it is possible to follow the classification of a bank or group of banks over time in order to observe the strategies and possible transformations at work. Replicating the analysis in other geographical areas would, for example, help to explain differences in performance at the aggregate level, taking into account the differences in accounting standards. Finally, it is also possible to estimate the sensitivity of a business model or a banking system to monetary policy.
[Chart: Breakdown of funding sources for banking activity by business model - AGNES method]
Thomas Humblot
 
1 Along with the assessment of internal governance and institution-wide control arrangements, the assessment of risks to capital and of the adequacy of capital to cover these risks, and the assessment of risks to liquidity and of the adequacy of liquidity resources to cover these risks.
2 European Banking Authority, 2018, Guidelines on common procedures and methodologies for the supervisory review and evaluation process (SREP) and supervisory stress testing – Consolidated version
3 Directive 2013/36/EU of the European Parliament and of the Council of 26 June 2013
4 For a literature review, see amongst others Cernov and Urbano, 2018, Identification of EU bank business models - A novel approach to classifying banks in the EU regulatory framework, EBA Staff Paper Series, No 2 - June
5 After cleaning the database, 2125 consolidated banking groups from the 28 member countries of the European Union, plus Norway and Switzerland. Icelandic and Liechtenstein groups are not included in the final sample due to insufficient data.
6 Less Icelandic and Liechtenstein banking groups due to lack of data.
7 Han, J., Kamber, M. & Pei, J., 2012, Data mining: concepts and techniques – 3rd ed., Morgan Kaufmann publications
8 Breunig, M., Kriegel, H., Ng, R., & Sander, J., 2000, LOF: identifying density-based local outliers. In ACM International Conference on Management of Data, pp. 93-104
9 Since the implementation of IFRS 9 in the European Union on 1 January 2018, banks are required to classify their financial assets into three categories: assets measured at amortized cost, assets measured at fair value through profit or loss and assets measured at fair value through other comprehensive income (through equity). Previously, financial assets were classified under IAS 39 into four categories: financial assets at fair value through profit or loss, held-to-maturity investments, loans and receivables and available-for-sale financial assets.
10 Farnè, M. & Vouldis, A., 2017, Business models of the banks in the euro area, Working Paper Series, No 2070, European Central Bank
11 Hubert, M., Rousseeuw, P. & Vanden Branden, K., 2005, ROBPCA: A new approach to robust principal component analysis, Technometrics, Vol. 47, No. 1, pp.64-79
12 Kaiser, H. F., 1960, The application of electronic computers to factor analysis, Educational and Psychological Measurement, 20(1), pp. 141–151
13 See Preacher, K. & MacCallum, R., 2003, Repairing Tom Swift’s electric factor analysis machine, Understanding Statistics, 2 (1), pp. 13 – 43
14 Specifically, the Euclidean distance
15 For a mathematical presentation, see Struyf, A., Hubert, M. & Rousseeuw, P., 1997, Clustering in an object-oriented environment, Journal of Statistical Software, 1(4), pp.1 – 30
16 Charrad, M., Ghazzali, N., Boiteau, V. & Niknafs, A., 2014, NbClust: An R package for determining the relevant number of clusters in a data set, Journal of Statistical Software, 61(6), pp.1-36
17 Calinski, T. & Harabasz, J., 1974, A dendrite method for cluster analysis, Communications in Statistics, 3, pp.1-27
18 Etymologically: Drawing in the shape of a tree.
19 In an analysis with only two principal components and compliance with the Kaiser criterion, the cophenetic correlation coefficient is 0.69 for the DIANA method and 0.52 for the AGNES method.
20 Kassambara, A., 2017, Practical guide to cluster analysis in R – Unsupervised machine learning, STHDA
21 Roux, M., 2018, A comparative study of divisive and agglomerative hierarchical clustering algorithms, Journal of Classification, 35(2), pp.345-366
22 The median values are naturally of the same order of magnitude.
23 Assets under management, which of course do not appear on the balance sheet, are nevertheless reported in relation to total assets of banks to facilitate comparisons.
24 See in particular Nakache, J.-P. & Confais, J., 2004, Approche pragmatique de la classification - Arbres hiérarchiques, Partitionnements, Technip, pp. 246
25 Other aggregation methods generally use the minimum or maximum distance between two elements of a class.