10. Characteristics of banking systems around the world

Working in Python

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in Python.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in Python

Visit the ‘Getting started in Python’ page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a notebook by downloading the relevant file from this repository and running it on your own computer. Alternatively, you can run pages online in your browser over at Binder.

Preliminary settings

Let’s import the packages we’ll need and also configure the settings we want:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *

LetsPlot.setup_html(no_js=True)

Part 10.1 Summarizing the data

Learning objectives for this part

  • Compare characteristics of banking systems around the world and across time.
  • Use box and whisker plots to summarize distributions and identify outliers.
  • Calculate weighted averages and explain the differences between weighted and simple averages.

We will be using the World Bank’s Global Financial Development Database.

First, download the data and documentation:

  • Go to the Global Financial Development Database. At the bottom of the page, click the ‘June 2017 Version’ to download the Excel file. (You can download a later version of the data, though your results will be slightly different from those shown here, and you will need to adjust parts of the code, such as file names.)
  • The paper ‘Benchmarking financial systems around the world’ gives an overview of the data. You may find it helpful to read Section 3 (pages 7–9) for a summary of the framework used to measure financial systems.

The World Bank’s Global Financial Development Database contains information about four categories:

  • financial depth: the size of financial institutions and markets
  • access: the degree to which individuals are able to use financial services
  • stability of financial institutions and markets
  • efficiency of financial intermediaries and markets in facilitating financial transactions.

We will be looking at the first three categories, focusing particularly on measures of stability before and after the 2008 global financial crisis. Each category is measured by a number of indicators. Figure 10.1 shows the indicators we will be using in this project. (Note that in other versions of the dataset, the indicators may be in lowercase instead of uppercase.)

Category   Indicator name                                       Indicator code
Depth      Private credit by deposit money banks to GDP (%)     GFDD.DI.01
Depth      Deposit money banks’ assets to GDP (%)               GFDD.DI.02
Access     Bank accounts per 1,000 adults                       GFDD.AI.01
Access     Bank branches per 100,000 adults                     GFDD.AI.02
Access     Firms with a bank loan or line of credit (%)         GFDD.AI.03
Access     Small firms with a bank loan or line of credit (%)   GFDD.AI.04
Stability  Bank Z-score                                         GFDD.SI.01
Stability  Bank regulatory capital to risk-weighted assets (%)  GFDD.SI.05
Figure 10.1 Indicators used in this project.

  1. The ‘Definition and Sources’ tab in the Excel spreadsheet contains a description of all indicators in the Database. Use the information provided in the ‘Short Description’ column to explain briefly why each of the indicators listed in Figure 10.1 may be a good measure of that category, or may give misleading information about that category. (You may find it helpful to conduct some research on these measures, especially if the explanation contains technical terms).

The ‘Data – June 2016’ tab contains the values of each indicator over time (1960–2014) for various countries around the world, though data may be missing for some countries and years.

Python walk-through 10.1 Importing an Excel spreadsheet into Python

Before loading an Excel spreadsheet into Python, it can be helpful to open it in Excel to understand the structure of the spreadsheet and the data it contains. In this case, we can see that detailed descriptions of all variables are in the first tab (‘Definition and Sources’). Make sure to read the definitions for the indicators listed in Figure 10.1.

The spreadsheet contains a number of other worksheets, but the data that we need is in the tab called ‘Data – June 2016’. You can see that the variable names are all in the first row and missing values are simply empty cells. We can therefore proceed to import the data into Python using the pd.read_excel function without any additional options.

We’re going to assume here that you’ve downloaded the Excel file as ‘GlobalFinancialDevelopmentDatabaseJune2017.xlsx’, and saved it in a subfolder of your working directory called ‘data’. Remember to change the filename and folder names if you’ve downloaded a later version of the data and saved it in another location.

gfdd = pd.read_excel(
    Path("data/GlobalFinancialDevelopmentDatabaseJune2017.xlsx"),
    sheet_name="Data - June 2016",
)
gfdd.head()
ISO3 ISO2 IMF code Country Region Income group Year GFDD.AI.01 GFDD.AI.02 GFDD.AI.03 GFDD.OI.18 GFDD.OI.19 GFDD.OM.01 GFDD.OM.02 GFDD.OE.01 GFDD.OE.02 NY.GDP.MKTP.CD NY.GDP.PCAP.KD NY.GNP.MKTP.CD SP.POP.TOTL
0 AFG AF 512.0 Afghanistan South Asia Low income 1960 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 540000000.0 NaN 550000000.0 9000000.0
1 AFG AF 512.0 Afghanistan South Asia Low income 1961 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 550000000.0 NaN 560000000.0 9200000.0
2 AFG AF 512.0 Afghanistan South Asia Low income 1962 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 550000000.0 NaN 560000000.0 9300000.0
3 AFG AF 512.0 Afghanistan South Asia Low income 1963 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 750000000.0 NaN 770000000.0 9500000.0
4 AFG AF 512.0 Afghanistan South Asia Low income 1964 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 800000000.0 NaN 820000000.0 9700000.0
box and whisker plot
A graphic display of the range and quartiles of a distribution, where the first and third quartile form the ‘box’ and the maximum and minimum values form the ‘whiskers’.

To get an idea of what the distribution of values for each variable looks like, we will use box and whisker plots. Box and whisker plots are useful for looking at the distribution of a single variable and checking if there are many extreme values (either very large or very small, relative to the rest of the values).
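The five numbers behind a box and whisker plot can also be computed directly. Here is a minimal sketch with invented values; the 1.5 × IQR rule used to flag outliers is a common plotting convention, not something specific to this dataset:

```python
import numpy as np

# Invented indicator values for a handful of countries
values = np.array([4.6, 12.1, 33.0, 45.2, 58.9, 61.3, 150.7])

# The five-number summary that a box and whisker plot draws
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, median, maximum)  # 4.6 45.2 150.7

# A common convention flags points beyond 1.5 * IQR from the box as outliers
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [150.7]
```

A single very large value, like 150.7 here, shows up as an outlier point beyond the whiskers rather than stretching the box itself.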

  2. Make a separate box and whisker plot for each indicator, with the outliers displayed (see Empirical Project 6 for help on how to do this). Comment on the overall distribution, the number of outliers, and suggest why there may be many outliers. (In Question 5, we will look at one way to handle extreme values if there is a concern that one or a few very extreme values will heavily affect the average.)

Python walk-through 10.2 Making box and whisker plots

Box and whisker plots were introduced in Empirical Project 6. We can use the same process here, after ensuring that the data is in the correct format.

Some plotting libraries expect the data to be in ‘long’ (also known as ‘tidy’) format, where each row holds the value of a single variable for a single country and year, whereas our data is in ‘wide’ format, where each row is a country–year combination with a separate column for each variable. We transform the data from wide to long format using the melt function, and drop missing values with the .dropna method.

So for the financial depth indicators:

# For convenience, create a list of the indicators we're interested in:
indicators = ["private_credit", "bank_assets"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(
    columns={"GFDD.DI.01": indicators[0], "GFDD.DI.02": indicators[1]}
)

# Create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()
gfdd_long
Country Year indicator value
0 Afghanistan 1960 private_credit 4.603305
1 Afghanistan 1961 private_credit 4.572875
2 Afghanistan 1962 private_credit 4.658536
3 Afghanistan 1963 private_credit 3.417160
4 Afghanistan 1964 private_credit 3.480556
...
22646 Zimbabwe 2001 bank_assets 19.494340
22647 Zimbabwe 2002 bank_assets 18.411180
22648 Zimbabwe 2003 bank_assets 26.838360
22649 Zimbabwe 2004 bank_assets 24.202580
22650 Zimbabwe 2005 bank_assets 17.669740

Now we can plot this data as a box and whisker plot, which we’ll do using the lets_plot package and geom_boxplot() option:

(
    ggplot(gfdd_long, aes(x="indicator", y="value"))
    + geom_boxplot()
    + labs(x="Indicator", y="Value")
)

Figure 10.2 Box and whisker plot for ‘Private credit by deposit money banks to GDP (%)’ (private_credit) and ‘Deposit money banks’ assets to GDP (%)’ (bank_assets).

We could repeat the process for each topic and plot all indicators together. However, the range for the GFDD.AI.01 variable (Bank accounts per 1,000 adults) is far larger than the other variables in this group, so it makes sense to plot this variable separately. We use the same process as before, as shown below.

# For convenience, create a list of the indicators we're interested in:
indicators_big = ["bank_accounts"]

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns={"GFDD.AI.01": indicators_big[0]})

# Create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators_big, var_name="indicator"
).dropna()

(
    ggplot(gfdd_long, aes(x="indicator", y="value"))
    + geom_boxplot()
    + labs(x="Indicator", y="Value")
)

Figure 10.3 Box and whisker plot for ‘Bank accounts per 1,000 adults’ (bank_accounts).

Now we’d like to do this task for several more cases. A key principle of coding is ‘DRY: Don’t Repeat Yourself’. We shouldn’t have to type this code out multiple times, and it’s more likely that something could go wrong if we do that. Instead, we’re going to list all of the indicators in one go, and that’s what the code below is going to do.

However, we do need a method for this task. To give the variables sensible names, we will use a built-in type of object in Python called a dictionary. Dictionaries provide a map from one set of values to another. A simple dictionary might look like this:

fruit_dict = {
    "Jazz": "Apple",
    "Owari": "Satsuma",
    "Seto": "Satsuma",
    "Pink Lady": "Apple",
}

This dictionary maps specific varieties of fruit into general types of fruit. A dictionary is very helpful here because we can use it to map the old column names to the new ones in a statement like gfdd_new_names = gfdd.rename(columns=dict_of_new_names). When we’re creating our dictionary below, we could do it as in the fruit example above, and there’s nothing wrong with that. But for convenience, because we might use them later, we’re going to instead create two lists (one for the new names and one for the old) and then bring those lists together to create our dictionary. We’ve already seen list comprehensions and the zip function; below we bring these ideas together to form a dictionary comprehension using this syntax:

dict_of_new_names = {k: w for k, w in zip(old_names, indicators)}
# For convenience, create a list of the indicators we're interested in:
indicators = [
    "private_credit",
    "bank_assets",
    "bank_accounts",
    "bank_branches",
    "firms_credit",
    "small_firms_credit",
    "risk_weighted_assets",
]

# Dictionary mapping old names to new names
old_names = [
    "GFDD.DI.01",
    "GFDD.DI.02",
    "GFDD.AI.01",
    "GFDD.AI.02",
    "GFDD.AI.03",
    "GFDD.AI.04",
    "GFDD.SI.05",
]
dict_of_new_names = {k: w for k, w in zip(old_names, indicators)}

# Rename the variables we'll be plotting
gfdd_new_names = gfdd.rename(columns=dict_of_new_names)

# Create a long or "tidy" version of the data & drop invalid values
gfdd_long = gfdd_new_names.melt(
    id_vars=["Country", "Year"], value_vars=indicators, var_name="indicator"
).dropna()
# Put all of our indicators of interest into a box plot
(
    ggplot(gfdd_long, aes(x="indicator", y="value"))
    + geom_boxplot()
    + ylim(0, 1e3)
    + labs(x="Indicator", y="Value")
)

Figure 10.4 Box and whisker plot for our indicators of interest.

You can repeat the process for the indicators on bank stability by copying the above code and adding indicator variable names accordingly.
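For instance, the renaming and reshaping for the two stability indicators could look like the sketch below. The dataframe here is a tiny invented stand-in for the full dataset (the codes GFDD.SI.01 and GFDD.SI.05 are the ones listed in Figure 10.1, but the numbers are made up), so only the pattern carries over:

```python
import pandas as pd

# Invented stand-in for the full dataset: two stability indicators, two countries
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2013, 2014, 2013, 2014],
    "GFDD.SI.01": [12.3, 13.1, 8.7, None],   # Bank Z-score
    "GFDD.SI.05": [15.0, 14.2, 18.9, 19.4],  # Regulatory capital to risk-weighted assets
})

stability = ["bank_zscore", "risk_weighted_assets"]

# Rename, reshape to long format, and drop missing values, as before
toy_long = (
    toy.rename(columns={"GFDD.SI.01": stability[0], "GFDD.SI.05": stability[1]})
    .melt(id_vars=["Country", "Year"], value_vars=stability, var_name="indicator")
    .dropna()
)
print(toy_long)
```

The resulting long frame can be passed to ggplot(...) + geom_boxplot() exactly as in the earlier examples.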

We will now create summary tables of some indicators and look at how they have changed over time. Each country belongs to a particular region and income group.

  3. Choose one indicator in Depth and one indicator in Access:
  (a) Make tables showing the average of those indicators and number of observations (count), with Region or Income Group as the column variable(s) and Year (2000–2014 only) as the row variable. (Make a separate table for region and income group.)
  (b) For each indicator chosen, make a line chart with the average indicator value (either for region or income group) as the vertical axis variable, and year as the horizontal axis variable. Comment on any patterns you see across regions/income groups over time.

Python walk-through 10.3 Making summary tables and line charts

In this walk-through, we will use the indicators for ‘Deposit money banks’ assets to GDP (%)’ and ‘Bank accounts per 1,000 adults’ as examples (bank_assets and bank_accounts, respectively).

Obtaining the average indicator value for each year and region is straightforward using the .groupby and .agg functions, but again we have to select the relevant years (using .query) and remove any observations that have a missing value for the indicator being analysed (using .dropna). We save the final output as deposit_region.

deposit_region = (
    gfdd.rename(columns=dict_of_new_names)
    .query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Region"])["bank_assets"]
    .agg(["mean", "count"])
)
deposit_region
Year Region mean count
2000 East Asia & Pacific 67.557537 25
Europe & Central Asia 58.608195 45
Latin America & Caribbean 46.928321 33
Middle East & North Africa 60.920046 17
North America 68.464945 2
...
2014 Latin America & Caribbean 57.331511 31
Middle East & North Africa 75.129953 18
North America 60.282820 1
South Asia 45.498802 8
Sub-Saharan Africa 29.011847 38

At this stage, the summary data is stored in long format. This format is useful for plotting the data, but to produce the required table (with Region as the column variable and Year as the row variable), we need to reshape the data into wide format. While we previously used melt to move from wide to long, we can use the pivot function to achieve the opposite and transform the data from long to wide.

There is a shortcut to pivot, though: if we wish to move just one variable from the row index into the columns, we can use the .unstack method instead:

deposit_region.unstack()
mean count
Region East Asia & Pacific Europe & Central Asia Latin America & Caribbean Middle East & North Africa North America South Asia Sub-Saharan Africa East Asia & Pacific Europe & Central Asia Latin America & Caribbean Middle East & North Africa North America South Asia Sub-Saharan Africa
Year
2000 67.557537 58.608195 46.928321 60.920046 68.464945 28.117836 20.643030 25 45 33 17 2 7 43
2001 67.245847 58.096271 48.799121 62.508313 83.097380 29.391711 20.530323 25 46 33 18 2 7 44
2002 62.317274 58.667192 49.103766 61.924345 94.569720 32.060113 20.500727 26 47 33 19 2 7 45
2003 61.236189 60.465442 48.253112 59.386665 92.078475 33.148744 21.425730 26 47 33 19 2 7 45
2004 60.621324 62.510551 46.690875 57.018716 91.043525 36.022256 20.986100 26 47 33 20 2 7 45
2005 62.691213 67.640136 47.277459 56.144002 94.113530 38.773546 21.480451 25 46 33 20 2 7 46
2006 63.319616 73.501307 47.173601 56.313520 100.841875 34.776328 22.123007 25 47 33 20 2 8 44
2007 64.103156 79.041729 48.883928 55.711586 101.979305 37.704743 22.778664 25 46 33 21 2 8 44
2008 68.479393 85.709957 51.875900 57.938661 104.458085 41.531408 24.382722 25 46 33 21 2 8 44
2009 74.951694 92.899973 55.291540 65.274277 66.734920 43.880175 26.318474 25 44 33 21 1 8 42
2010 75.563454 91.179560 54.295021 64.119124 60.497590 45.363690 26.154682 25 46 32 21 1 8 43
2011 75.972785 90.080754 54.386018 65.473409 59.338070 45.629445 26.944692 23 46 32 20 1 8 43
2012 75.109313 89.882284 55.865148 65.359516 58.287330 45.792968 27.447431 24 46 32 20 1 8 41
2013 79.249328 88.833299 56.437788 66.467436 58.079990 46.435363 28.392471 24 46 32 20 1 8 40
2014 86.657043 87.061403 57.331511 75.129953 60.282820 45.498802 29.011847 23 46 31 18 1 8 38

Note how we get two sub-tables: one for mean, and one for count.
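To see why .unstack works as a shortcut, here is a toy example with invented numbers. After a groupby, the group labels live in the row index, and .unstack moves one of those index levels into the columns, matching what pivot would produce from the un-grouped frame:

```python
import pandas as pd

# Invented data with one value per Year-Region pair
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "Region": ["North", "South", "North", "South"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# After the groupby, (Year, Region) forms the row index
grouped = toy.groupby(["Year", "Region"])["value"].mean()

# .unstack moves 'Region' from the row index into the columns...
wide_a = grouped.unstack()

# ...which matches what pivot produces from the un-grouped frame
wide_b = toy.pivot(index="Year", columns="Region", values="value")
print(wide_a.equals(wide_b))  # True
```

In the walk-through above, Year and Region already sit in the index after .agg, which is why deposit_region.unstack() alone produces the wide table.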

At this point, you could just print or view the data, but using one of the many pandas export functions will produce output that is easier to read and can be copied and pasted into your analysis document. Here are some examples with just the first few rows and columns:

# To markdown, the popular text format
print(deposit_region.iloc[:5, :3].to_markdown())
|                                      |    mean |   count |
|:-------------------------------------|--------:|--------:|
| (2000, 'East Asia & Pacific')        | 67.5575 |      25 |
| (2000, 'Europe & Central Asia')      | 58.6082 |      45 |
| (2000, 'Latin America & Caribbean')  | 46.9283 |      33 |
| (2000, 'Middle East & North Africa') | 60.92   |      17 |
| (2000, 'North America')              | 68.4649 |       2 |
# To LaTeX, for writing academic papers
print(deposit_region.iloc[:5, :3].style.to_latex())
\begin{tabular}{llrr}
 &  & mean & count \\
Year & Region &  &  \\
\multirow[c]{5}{*}{2000} & East Asia & Pacific & 67.557537 & 25 \\
 & Europe & Central Asia & 58.608195 & 45 \\
 & Latin America & Caribbean & 46.928321 & 33 \\
 & Middle East & North Africa & 60.920046 & 17 \\
 & North America & 68.464945 & 2 \\
\end{tabular}
# To html, for the web
print(deposit_region.iloc[:5, :3].to_html())
<table>
  <thead>
    <tr>
      <th></th>
      <th></th>
      <th>mean</th>
      <th>count</th>
    </tr>
    <tr>
      <th>Year</th>
      <th>Region</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="5" valign="top">2000</th>
      <th>East Asia &amp; Pacific</th>
      <td>67.557537</td>
      <td>25</td>
    </tr>
    <tr>
      <th>Europe &amp; Central Asia</th>
      <td>58.608195</td>
      <td>45</td>
    </tr>
    <tr>
      <th>Latin America &amp; Caribbean</th>
      <td>46.928321</td>
      <td>33</td>
    </tr>
    <tr>
      <th>Middle East &amp; North Africa</th>
      <td>60.920046</td>
      <td>17</td>
    </tr>
    <tr>
      <th>North America</th>
      <td>68.464945</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

We can use lets_plot to plot a line chart using the long format data (deposit_region), with year on the horizontal axis. We specify color = "Region" so each region has a different line colour.

(
    ggplot(deposit_region.reset_index(), aes(x="Year", y="mean", color="Region"))
    + geom_line(size=2)
    + labs(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%)",
        subtitle="2000–2014, by region",
    )
)

Figure 10.5 Line chart of ‘Deposit money banks’ assets to GDP (%)’ by region (2000–2014).

The process can be repeated for income group rather than region.

deposit_income = (
    gfdd_new_names.query("Year > 1999 & Year < 2015")
    .dropna(subset=["bank_assets"])
    .groupby(["Year", "Income group"])["bank_assets"]
    .agg(["mean", "count"])
)
(
    ggplot(deposit_income.reset_index(), aes(x="Year", y="mean", color="Income group"))
    + geom_line(size=2)
    + labs(
        y="Mean deposit, % of GDP",
        title="Deposit money banks' assets to GDP (%)",
        subtitle="2000–2014, by income group",
    )
)

Figure 10.6 Line chart of ‘Deposit money banks’ assets to GDP (%)’ by income group (2000–2014).

You can repeat the process for the indicator ‘Bank accounts per 1,000 adults’ by replacing the variable name bank_assets with bank_accounts in the above code, again by region and then by income group.

So far, we have been looking at simple averages, where each observation is given the same weight (importance), so we simply add up all the numbers and divide by the number of observations. However, when we take averages across regions or income groups, we might want to account for the fact that countries differ in size (population or GDP). For example, if one country is much larger than another, we might want that country to have a larger influence on the average. See the box below for more about weighted averages.

weighted average
A type of average that assigns greater importance (weight) to some components than to others, in contrast with a simple average, which weights each component equally. Components with a larger weight can have a larger influence on the average.

Weighted averages

An example of weighted averages that you have probably experienced is in calculating course grades. Usually, course grades are not calculated by simply summing up the scores in all components and dividing by the number of components. Instead, certain components such as the final exam are given more importance (influence over the overall grade) than the midterm exam or course assignments.

To calculate the weighted average, we first determine the weight of each component (these are fractions or proportions that sum to 1). Then we multiply each component by its respective weight, and sum over all components. Using the course grade as an example, suppose the final exam is worth 70% of the final grade and the midterm exam is worth 30%, with both being scored out of 100. Then the weighted average would be:

weighted average = (0.70 × final exam score) + (0.30 × midterm exam score)

In comparison, the simple average would give both components equal weight:

simple average = (0.50 × final exam score) + (0.50 × midterm exam score)
To develop your intuition for this concept, you can experiment by choosing values for the final exam score and midterm exam score and seeing how a change in one of the scores affects the weighted and simple averages.
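That experiment takes only a few lines; the scores below are invented for illustration:

```python
# Invented scores, both out of 100
final_exam = 80
midterm = 60

# Weights (0.70 and 0.30) are proportions that sum to 1
weighted_average = round(0.70 * final_exam + 0.30 * midterm, 1)
simple_average = (final_exam + midterm) / 2

print(weighted_average)  # 74.0: the final exam pulls the average towards 80
print(simple_average)    # 70.0: both components count equally
```

Raising the final exam score by one point moves the weighted average by 0.7 of a point, but the simple average by only 0.5.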

The indicator ‘Bank regulatory capital to risk-weighted assets (%)’ in the Database also uses weights to account for the fact that some assets are riskier than others, and should therefore not be considered equally.

We will practise calculating weighted averages for the indicator ‘Bank accounts per 1,000 adults’, weighting according to total population in each region (so countries with a larger population will have a larger influence on the average). Since data is missing for some countries, we will calculate the total population in each region as the total population for countries with non-missing data.

  4. For each region and the years 2004–2014:
  (a) Create a new variable for the weight that is the ratio of SP.POP.TOTL and the sum of SP.POP.TOTL for each country within the relevant region and year. Only use observations that have non-missing values for the GFDD.AI.01 (Bank accounts per 1,000 adults) indicator variable.
  (b) Check that your calculations are correct by filtering the data for a particular region and year and verifying that the weights sum to 1.
  (c) Now multiply the ‘Bank accounts per 1,000 adults’ indicator by the weights in Question 4(a) and sum up the resulting values according to region to get the weighted average.
  (d) Compare your answers to Question 4(c) with the corresponding simple averages in Question 3(a) and comment on any similarities or differences.

Python walk-through 10.4 Creating weighted averages

As we only require the weighted averages for the years 2004–2014, we will create a new dataframe (called weighted_gfdd) to save our results in. The weights are required for each country within each region for each year, but only if there is a value for the GFDD.AI.01 (bank_accounts) indicator. So we do the following:

  • Filter results by years of interest using .query.
  • Select only columns of interest using .loc.
  • Drop any invalid entries for bank_accounts using .dropna.

With our new dataframe, we group by year and then region (using .groupby) and then generate the weight for each country by dividing the population of each country by the sum of populations of all countries within a region (and year). To return the results in the same shape (index) as the data we began with, we use the .transform method. Remember:

  • Use .agg with a groupby when you want the groups to become the new index (here, that would give a Year and Region index).
  • Use .transform with a groupby when you want to retain your original index (here, the numbered entries).
  • Use .apply with a groupby when the operation leaves neither the original index nor an index of groups.
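A toy contrast between .agg and .transform, with invented numbers:

```python
import pandas as pd

# Invented populations for two regions of two countries each
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "SP.POP.TOTL": [10.0, 30.0, 5.0, 5.0],
})

# .agg collapses each group to a single row: the result is indexed by Region
totals = toy.groupby("Region")["SP.POP.TOTL"].agg("sum")
print(totals.tolist())  # [40.0, 10.0]

# .transform returns one value per original row, aligned with toy's index
toy["weight"] = toy.groupby("Region")["SP.POP.TOTL"].transform(lambda x: x / x.sum())
print(toy["weight"].tolist())  # [0.25, 0.75, 0.5, 0.5]
```

With that distinction in mind, the walk-through continues with the real data:
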
gfdd_weighted = (
    gfdd_new_names.query("Year > 2003 & Year < 2015")
    .loc[:, ["Year", "Country", "Region", "bank_accounts", "SP.POP.TOTL"]]
    .dropna(subset=["bank_accounts"])
)

gfdd_weighted["weight"] = gfdd_weighted.groupby(["Year", "Region"])[
    "SP.POP.TOTL"
].transform(lambda x: x / x.sum())

gfdd_weighted
Year Country Region bank_accounts SP.POP.TOTL weight
48 2008 Afghanistan South Asia 38.021832 27000000.0 0.017563
49 2009 Afghanistan South Asia 90.909562 27000000.0 0.077729
50 2010 Afghanistan South Asia 109.722130 28000000.0 0.074594
51 2011 Afghanistan South Asia 145.330790 29000000.0 0.083005
52 2012 Afghanistan South Asia 174.633520 30000000.0 0.080997
...
11325 2010 Zimbabwe Sub-Saharan Africa 161.508630 14000000.0 0.020834
11326 2011 Zimbabwe Sub-Saharan Africa 79.569343 14000000.0 0.020393
11327 2012 Zimbabwe Sub-Saharan Africa 83.730240 15000000.0 0.022773
11328 2013 Zimbabwe Sub-Saharan Africa 80.160849 15000000.0 0.026333
11329 2014 Zimbabwe Sub-Saharan Africa 87.989885 15000000.0 0.033305

To check that the weights we have calculated are correct, summing over each year–region combination should produce a value of unity (1). Let’s check:

gfdd_weighted.groupby(["Year", "Region"])["weight"].sum()
Year  Region                    
2004  East Asia & Pacific           1.0
      Europe & Central Asia         1.0
      Latin America & Caribbean     1.0
      Middle East & North Africa    1.0
      South Asia                    1.0
...
2014  Europe & Central Asia         1.0
      Latin America & Caribbean     1.0
      Middle East & North Africa    1.0
      South Asia                    1.0
      Sub-Saharan Africa            1.0
Name: weight, Length: 66, dtype: float64

The weights we have calculated are correct, so we can proceed to calculate the required weighted indicator values by year and region. We start by creating a new variable with the weighted indicator value (bank_accounts_weighted), and then sum up the weighted indicator values by year and region. Recall that when calculating the weighted average, we sum all of the weighted observations rather than taking the mean (which would calculate the simple average instead).

(
    gfdd_weighted.assign(
        bank_accounts_weighted=lambda x: x["bank_accounts"] * x["weight"]
    )
    .groupby(["Year", "Region"])["bank_accounts_weighted"]
    .sum()
    .round(2)
)
Year  Region                    
2004  East Asia & Pacific            253.86
      Europe & Central Asia          637.40
      Latin America & Caribbean      504.48
      Middle East & North Africa     301.03
      South Asia                     529.82
...
2014  Europe & Central Asia         1059.82
      Latin America & Caribbean      780.49
      Middle East & North Africa     524.85
      South Asia                     411.35
      Sub-Saharan Africa             350.80
Name: bank_accounts_weighted, Length: 66, dtype: float64

You can change this table by unstacking it, or just export it in this format.

Extension Using Winsorization to handle extreme values

If we are interested in combining indicators into a single index (as in Empirical Project 4), we may be concerned about extreme values, but still want to include these countries in the index (rather than excluding them from the calculations). When calculating summary statistics, we can deal with these extreme values by using the median instead of the mean.

On page 19 of the paper ‘Benchmarking financial systems around the world’, the authors discuss Winsorization (replacing extreme values with either the 95th or the 5th percentile value) as one way to handle these extreme values. Sometimes the extreme values are due to peculiar features of a single country, so we might want to adjust the data to make the values ‘less extreme’.

  5. For an indicator you have used in Questions 3 and 4 and for the year 2010:
  (a) Calculate the 95th and 5th percentile value of that indicator, across all countries.
  (b) Replace any value larger than the 95th percentile value with the 95th percentile value, and replace any value smaller than the 5th percentile value with the 5th percentile value.
  (c) Use your ‘Winsorized’ values from Question 5(b) to calculate the average values of the indicator, by region and income group (separately). Compare these values to the simple averages from Question 3(a).

Extension Python walk-through 10.5 Dealing with extreme values

In this example, we use ‘Bank accounts per 1,000 adults’ (bank_accounts). The 5th and 95th percentiles can be obtained using the quantile function. We save the output into a dataframe (q_5_95) so we can refer to these values in later calculations.

q_5_95 = (
    gfdd_new_names.query("Year == 2010")
    .dropna(subset=["bank_accounts"])["bank_accounts"]
    .quantile([0.05, 0.95])
)
q_5_95
0.05      27.593772
0.95    1604.688000
Name: bank_accounts, dtype: float64

We can compare the value of the indicator with these upper and lower bounds using the np.where function from the numerical library NumPy. np.where has the following syntax:

np.where(condition, value if condition is true, value if condition is false)

One convenient feature of np.where is that instead of passing single values, we can pass vectors to any of its three arguments, producing a vector-valued result.
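A quick illustration with invented numbers:

```python
import numpy as np

values = np.array([1.0, 50.0, 120.0])
floor = np.array([10.0, 10.0, 10.0])

# All three arguments can be vectors; the comparison happens element-wise
result = np.where(values < floor, floor, values)
print(result.tolist())  # [10.0, 50.0, 120.0]
```

Only the first element falls below its floor value, so only it is replaced.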

In the code below, we make use of np.where first to replace all values below the 5th percentile with the value for the 5th percentile, and then to replace all values above the 95th percentile with the value for the 95th percentile.

Two points to be aware of before running the code: first, the index values for retrieving data from q_5_95 are numbers (0.05 and 0.95) rather than strings; second, because we refer to bank_accounts multiple times, we need to define the filtered dataframe in a separate step first (or, alternatively, use lambda expressions).

gfdd_2010 = gfdd_new_names.query("Year == 2010").dropna(subset=["bank_accounts"])

bank_2010 = gfdd_2010.assign(
    bank_accounts=np.where(
        gfdd_2010["bank_accounts"] < q_5_95[0.05],
        q_5_95[0.05],
        np.where(
            gfdd_2010["bank_accounts"] > q_5_95[0.95],
            q_5_95[0.95],
            gfdd_2010["bank_accounts"],
        ),
    )
)

Next, we can obtain our summary statistics and print out the ‘Winsorized’ averages (use gfdd_2010 to see the original averages).

bank_2010.groupby("Income group").agg(avg_2010=("bank_accounts", "mean")).round(2)
                      avg_2010
Income Group
High income: OECD      1356.47
High income: nonOECD    916.90
Low income              123.06
Lower middle income     422.65
Upper middle income     635.71

Part 10.2 Comparing financial stability before and after the 2008 global financial crisis

Learning objectives for this part

  • Use confidence intervals to assess changes in the stability of financial institutions before and after the 2008 global financial crisis.

Now we will assess whether financial stability (measured by the two indicators in Figure 10.1) has changed since the 2008 global financial crisis.

  1. For both indicators of stability in Figure 10.1, explain what effect the post-global financial crisis banking regulations are likely to have on the value of the indicator (for example, would the value increase or decrease?), and why. You may find it helpful to research the regulations that were implemented as a result of the 2008 global financial crisis.
  2. For the years 2007 and 2014, using the ttest function, create tables with Region or Income Group as the row variable(s), and the difference in the average of those indicators between the two years, the 95% confidence interval lower and upper values, and the CI width, as column variables.

Python Walk-Through 10.6 Calculating confidence intervals

In Python walk-throughs 3.6 and 8.10 we used the ttest function from the pingouin package to obtain differences in means and confidence intervals (CIs) for two groups of data. Here we need to obtain these statistics for the GFDD.SI.05 indicator (renamed as risk_weighted_assets) between 2007 and 2014 for each region.

As we need to find the confidence intervals for a number of regions, we can use a vectorised operation to perform the same calculation for each region in turn. For this operation to work, we need to reshape our data. We want to format the data so that the 2007 and 2014 values of risk weighted assets appear as columns while regions (and countries) are rows. To do this, we’re going to use the .loc command to select both the values and variables we’re interested in. We’ll then use pivot to re-order the data into the shape we want.
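To see what this reshape does, here is a toy version with made-up numbers (two regions, one country each). In recent versions of pandas, pivot accepts a list of columns for the index:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Year": [2007, 2014, 2007, 2014],
    "Region": ["A", "A", "B", "B"],
    "Country": ["X", "X", "Y", "Y"],
    "risk_weighted_assets": [10.0, 12.0, 8.0, 13.0],
})
# One row per (Region, Country) pair; one column per Year
wide = long_df.pivot(
    index=["Region", "Country"], columns="Year", values="risk_weighted_assets"
)
```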

rwa_07_14 = gfdd_new_names.loc[
    gfdd_new_names["Year"].isin([2007, 2014]),
    ["Year", "Region", "Country", "risk_weighted_assets"],
].pivot(index=["Region", "Country"], columns=["Year"], values=["risk_weighted_assets"])

rwa_07_14
                                       risk_weighted_assets
Year                                           2007     2014
Region              Country
East Asia & Pacific Australia                  10.1     12.4
                    Brunei Darussalam           NaN     21.0
                    Cambodia                    NaN      NaN
                    China                       8.4     13.2
                    Fiji                        NaN     14.7
...                 ...                         ...      ...
Sub-Saharan Africa  Tanzania                    NaN     17.0
                    Togo                        NaN      NaN
                    Uganda                     19.3     22.2
                    Zambia                      NaN      NaN
                    Zimbabwe                    NaN      NaN

Now we have the correct shape, but we’ve also created quite a complex structure, especially in our columns. To access columns in a hierarchical or multi-level index, use a ‘tuple’ (tuples are written with round brackets). You can see the precise names of the columns by running .columns like this:

rwa_07_14.columns
MultiIndex([('risk_weighted_assets', 2007),
            ('risk_weighted_assets', 2014)],
           names=[None, 'Year'])
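For example, with a small frame built to have the same two-level column structure (the values are made up), a single column is selected by passing the full tuple:

```python
import pandas as pd

df = pd.DataFrame(
    [[10.1, 12.4], [8.4, 13.2]],
    columns=pd.MultiIndex.from_tuples(
        [("risk_weighted_assets", 2007), ("risk_weighted_assets", 2014)]
    ),
)
# The tuple names one column of the two-level column index
col_2007 = df[("risk_weighted_assets", 2007)]
```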

Now we want to conduct a t-test for each region. We’ll use the pingouin package for the t-test, group the data by Region, and use the apply function, which allows us to run functions that combine different columns.

rwa_07_14_ttest = rwa_07_14.groupby("Region").apply(
    # `group` is the subset of rows belonging to one region
    lambda group: pg.ttest(
        group[("risk_weighted_assets", 2007)], group[("risk_weighted_assets", 2014)]
    )
)
rwa_07_14_ttest
T dof alternative p-val CI95% cohen-d BF10 power
Region
East Asia & Pacific T-test −0.996595 21.376682 two-sided 0.330109 [−5.75, 2.02] 0.394260 0.519 0.165007
Europe & Central Asia T-test −2.697072 76.344458 two-sided 0.008604 [−4.75, −0.71] 0.574108 5.049 0.763601
Latin America & Caribbean T-test −0.481507 32.000000 two-sided 0.633433 [−1.85, 1.14] 0.165155 0.36 0.075360
Middle East & North Africa T-test −0.040157 15.165713 two-sided 0.968492 [−3.06, 2.95] 0.017512 0.373 0.050194
North America T-test −0.497519 2.000000 two-sided 0.668138 [−4.82, 3.82] 0.497519 0.662 0.061395
South Asia T-test −1.918656 4.000000 two-sided 0.127469 [−7.49, 1.37] 0.959328 1.247 0.155595
Sub-Saharan Africa T-test −0.683895 27.522410 two-sided 0.499760 [−5.17, 2.58] 0.240930 0.396 0.103905

Let’s expand the confidence interval into two separate columns (currently it is one column whose entries are two-element lists). We’ll also add two further statistics based on the confidence intervals: the mean (the midpoint of the interval) and the width. The mean is just the .mean across each row (axis=1); the width is obtained by taking the difference with .diff, keeping the one valid column with dropna, and dividing by 2 to get the half-width of the interval.

To assemble all of this information, we’re going to concatenate multiple dataframes using pd.concat. The syntax of this command is pd.concat([list of dataframes], axis=<axis to join the dataframes along>). In the code below, the two dataframes in our list are the original rwa_07_14_ttest and a new, second dataframe that we create, which has the confidence intervals as its values and two columns, ‘lower’ and ‘upper’.

Further below, when we wish to add in means and widths, we can do this by just declaring a new column (for example, by writing rwa_07_14_ttest["mean"] = ) and then applying a function to the two columns of interest.
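A minimal illustration of this pattern with toy dataframes (the statistics here are invented):

```python
import pandas as pd

stats = pd.DataFrame({"p-val": [0.33, 0.01]}, index=["A", "B"])
ci = pd.DataFrame(
    [[-5.75, 2.02], [-4.75, -0.71]], columns=["lower", "upper"], index=["A", "B"]
)
# axis=1 joins column-wise, matching rows on the shared index
combined = pd.concat([stats, ci], axis=1)
# declaring a new column and applying a function to the columns of interest
combined["mean"] = combined[["lower", "upper"]].mean(axis=1)
```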

rwa_07_14_ttest = pd.concat(
    [
        rwa_07_14_ttest,
        pd.DataFrame(
            rwa_07_14_ttest["CI95%"].tolist(),
            columns=["lower", "upper"],
            index=rwa_07_14_ttest.index,
        ),
    ],
    axis=1,
)
rwa_07_14_ttest["mean"] = rwa_07_14_ttest[["lower", "upper"]].mean(axis=1)
rwa_07_14_ttest["width"] = (
    rwa_07_14_ttest[["lower", "upper"]].diff(axis=1).dropna(axis=1) / 2
)
rwa_07_14_ttest
T dof alternative p-val CI95% cohen-d BF10 power lower upper mean width
Region
East Asia & Pacific T-test −0.996595 21.376682 two-sided 0.330109 [−5.75, 2.02] 0.394260 0.519 0.165007 −5.75 2.02 −1.865 3.885
Europe & Central Asia T-test −2.697072 76.344458 two-sided 0.008604 [−4.75, −0.71] 0.574108 5.049 0.763601 −4.75 −0.71 −2.730 2.020
Latin America & Caribbean T-test −0.481507 32.000000 two-sided 0.633433 [−1.85, 1.14] 0.165155 0.36 0.075360 −1.85 1.14 −0.355 1.495
Middle East & North Africa T-test −0.040157 15.165713 two-sided 0.968492 [−3.06, 2.95] 0.017512 0.373 0.050194 −3.06 2.95 −0.055 3.005
North America T-test −0.497519 2.000000 two-sided 0.668138 [−4.82, 3.82] 0.497519 0.662 0.061395 −4.82 3.82 −0.500 4.320
South Asia T-test −1.918656 4.000000 two-sided 0.127469 [−7.49, 1.37] 0.959328 1.247 0.155595 −7.49 1.37 −3.060 4.430
Sub-Saharan Africa T-test −0.683895 27.522410 two-sided 0.499760 [−5.17, 2.58] 0.240930 0.396 0.103905 −5.17 2.58 −1.295 3.875

The same process can be repeated for income groups and for the indicator GFDD.SI.01 (Bank Z-score).

  3. For each indicator:
  • Plot column charts (one for regions and one for income groups) showing the differences on the vertical axis and the region or income group on the horizontal axis. Add the confidence intervals from Question 2 to your charts.
  • Interpret your findings. Is there evidence that stability has increased since the 2008 global financial crisis? (Note that ‘Bank regulatory capital to risk-weighted assets’ is inversely related to the leverage ratio. High leverage ratios were common in the lead-up to the 2008 crisis, and contributed to financial instability.)

leverage ratio (for banks or households)
The value of assets divided by the equity stake (capital contributed by owners and shareholders) in those assets.

Python Walk-Through 10.7 Plotting column charts with error bars

Again, we use the GFDD.SI.05 indicator (risk_weighted_assets) for Region as an example. You can repeat the following steps for income groups, and for the Bank Z-score indicator, by changing the variable name(s) in Python walk-through 10.6 accordingly, then running the code below.

rwa_07_14_ttest.reset_index()
Region level_1 T dof alternative p-val CI95% cohen-d BF10 power lower upper mean width
0 East Asia & Pacific T-test −0.996595 21.376682 two-sided 0.330109 [−5.75, 2.02] 0.394260 0.519 0.165007 −5.75 2.02 −1.865 3.885
1 Europe & Central Asia T-test −2.697072 76.344458 two-sided 0.008604 [−4.75, −0.71] 0.574108 5.049 0.763601 −4.75 −0.71 −2.730 2.020
2 Latin America & Caribbean T-test −0.481507 32.000000 two-sided 0.633433 [−1.85, 1.14] 0.165155 0.36 0.075360 −1.85 1.14 −0.355 1.495
3 Middle East & North Africa T-test −0.040157 15.165713 two-sided 0.968492 [−3.06, 2.95] 0.017512 0.373 0.050194 −3.06 2.95 −0.055 3.005
4 North America T-test −0.497519 2.000000 two-sided 0.668138 [−4.82, 3.82] 0.497519 0.662 0.061395 −4.82 3.82 −0.500 4.320
5 South Asia T-test −1.918656 4.000000 two-sided 0.127469 [−7.49, 1.37] 0.959328 1.247 0.155595 −7.49 1.37 −3.060 4.430
6 Sub-Saharan Africa T-test −0.683895 27.522410 two-sided 0.499760 [−5.17, 2.58] 0.240930 0.396 0.103905 −5.17 2.58 −1.295 3.875
(
    ggplot(rwa_07_14_ttest.reset_index(), aes(x="Region", y="mean", fill="Region"))
    + geom_bar(stat="identity", color="black", alpha=0.6, show_legend=False)
    + geom_errorbar(
        aes(ymin="lower", ymax="upper", color="Region"),
        size=1,
        show_legend=False,
        alpha=0.8,
    )
    + coord_flip()
    + labs(
        title="Differences in risk weighted assets between 2007 and 2014",
        x="Region", y="Difference in means (2007 − 2014)"
    )
    + ggsize(700, 600)
)

Figure 10.7 Column chart with error bars for ‘Bank regulatory capital to risk-weighted assets (%)’ (risk_weighted_assets).