Empirical Project 10 Working in R

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in RStudio.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in R

For this project you will need the following packages:

If you need to install any of these packages, run the following code:

install.packages(c("tidyverse", "readxl", "knitr"))

You can import the libraries now, or when they are used in the R walk-throughs below.


Part 10.1 Summarizing the data

Learning objectives for this part

  • compare characteristics of banking systems around the world and across time
  • use box and whisker plots to summarize distributions and identify outliers
  • calculate weighted averages and explain the differences between weighted and simple averages.

We will be using the World Bank’s Global Financial Development Database.

First, download the data and documentation:

The World Bank’s Global Financial Development Database contains information about four categories:

We will be looking at the first three categories, focusing particularly on measures of stability before and after the 2008 global financial crisis. Each category is measured by a number of indicators. Figure 10.1 shows the indicators we will be using in this project. (Note that in other versions of the dataset, the indicators may be in lowercase instead of uppercase.)

Category Indicator name Indicator code
Depth Private credit by deposit money banks to GDP (%) GFDD.DI.01
  Deposit money banks’ assets to GDP (%) GFDD.DI.02
Access Bank accounts per 1,000 adults GFDD.AI.01
  Bank branches per 100,000 adults GFDD.AI.02
  Firms with a bank loan or line of credit (%) GFDD.AI.03
  Small firms with a bank loan or line of credit (%) GFDD.AI.04
Stability Bank Z-score GFDD.SI.01
  Bank regulatory capital to risk-weighted assets (%) GFDD.SI.05

Figure 10.1 Indicators used in this project.

  1. The ‘Definition and Sources’ tab in the Excel spreadsheet contains a description of all indicators in the Database. Use the information provided in the ‘Short Description’ column to explain briefly why each of the indicators listed in Figure 10.1 may be a good measure of that category, or may give misleading information about that category. (You may find it helpful to conduct some research on these measures, especially if the explanation contains technical terms).

The ‘Data – June 2016’ tab contains the values of each indicator over time (1960–2014) for various countries around the world, though data may be missing for some countries and years.

R walk-through 10.1 Importing an Excel spreadsheet into R

Before loading an Excel spreadsheet into R, open it in Excel to understand the structure of the spreadsheet and the data it contains. In this case we can see that detailed descriptions of all variables are in the first tab (‘Definitions and Sources’). Make sure to read the definitions for the indicators listed in Figure 10.1.

The spreadsheet contains a number of other worksheets, but the data that we need is in the tab called ‘Data – June 2016’. You can see that the variable names are all in the first row and missing values are simply empty cells. We can therefore proceed to import the data into R using the read_excel function without any additional options.


# Set your working directory to the correct folder.
# Insert your file path for 'YOURFILEPATH'.

GFDD <- read_excel(
  sheet = "Data - June 2016")
box and whisker plot
A graphic display of the range and quartiles of a distribution, where the first and third quartile form the ‘box’ and the maximum and minimum values form the ‘whiskers’.

To get an idea of what the distribution of values for each variable looks like, we will use box and whisker plots. Box and whisker plots are useful for looking at the distribution of a single variable and checking if there are many extreme values (either very large or very small, relative to the rest of the values).

  1. Make a separate box and whisker plot for each indicator, with the outliers displayed (see Empirical Project 6 for help on how to do this). Comment on the overall distribution, the number of outliers, and suggest why there may be many outliers. (In Question 5, we will look at one way to handle extreme values if there is a concern that one or a few very extreme values will heavily affect the average.)

R walk-through 10.2 Making box and whisker plots

Box and whisker plots were introduced in Empirical Project 6. We can use the same process here, after ensuring that the data is in the correct format. ggplot expects the data to be in ‘long’ format (each row is a value for a single variable and year), whereas our data is in ‘wide’ format (each row contains a single variable but multiple years). We transform the data from wide to long format using the gather function.

Further details on the difference between long and wide format data can be found in ‘The Wide and Long Data Format for Repeated Measures Data’.

We also have to remove any missing values using the na.omit function, otherwise the ggplot function will not plot the chart. So before creating our box plots, we will select the relevant variables (saving them in a list called indicators), reshape the data (gather) and remove missing values (na.omit), and finally save it as a temporary dataframe called df.new. We can run this sequence of commands in one go using the piping operator (%>%) from the tidyverse package. (For a more detailed introduction to piping, see the University of Manchester’s Econometric Computing Learning Resource.)

So for the Depth indicators:

# Rename and specify the variables to plot
names(GFDD)[names(GFDD) == "GFDD.DI.01"] <- 
names(GFDD)[names(GFDD) == "GFDD.DI.02"] <- 

# Select the correct variables and remove missing values
indicators <- c("private.credit", "bank.assets")

df.new <- GFDD[, indicators] %>%
  # Reshape wide to long format
  gather(., Indicator, value) %>%
  # Remove missing values

# Plot the data
# The Indicator variable determines the plot's 'color'.
ggplot(df.new) +
  geom_boxplot(aes(x = Indicator, y = value, 
    color = Indicator), lwd = 1) +
    ylab("Value") +

Figure 10.2 Box and whisker plot for ‘Private credit by deposit money banks to GDP (%)’ (private.credit) and ‘Deposit money banks’ assets to GDP (%)’ (bank.assets).

We could repeat the process for each topic and plot all indicators together. However, the range for the GFDD.AI.01 variable (Bank accounts per 1,000 adults) is far larger than the other variables in this group, so it makes sense to plot this separately. We use the same process as before, as shown below.

names(GFDD)[names(GFDD) == "GFDD.AI.01"] <- "bank.accounts"

indicators <- c("bank.accounts")

df.new <- GFDD[, indicators] %>%
  gather(., Indicator, value) %>%

ggplot(df.new) +
  geom_boxplot(aes(x = Indicator, y = value, 
    color = Indicator), lwd = 1) +
    ylab("Value") +

Figure 10.3 Box and whisker plot for ‘Bank accounts per 1,000 adults’ (bank.accounts).

names(GFDD)[names(GFDD) == "GFDD.AI.02"] <- "bank.branches"
names(GFDD)[names(GFDD) == "GFDD.AI.03"] <- "firms.credit"
names(GFDD)[names(GFDD) == "GFDD.AI.04"] <- "small.firms.credit"

indicators <- c("bank.branches", "firms.credit", "small.firms.credit")

df.new <- GFDD[, indicators] %>%
  gather(., Indicator, value) %>%

ggplot(df.new) +
  geom_boxplot(aes(x = Indicator, y = value, 
    color = Indicator), lwd = 1) +
    ylab("Value") +

Figure 10.4 Box and whisker plot for ‘Bank branches per 100,000 adults’ (bank.branches), ‘Firms with a bank loan or line of credit (%)’ (firms.credit), and ‘Small firms with a bank loan or line of credit (%)’ (small.firms.credit).

You can repeat the process for the indicators on bank stability by copying the above code and replacing the indicator variable names accordingly.

We will now create summary tables of some indicators and look at how they have changed over time. Each country belongs to a particular region and income group.

  1. Choose one indicator in Depth and one indicator in Access:

In this walk-through we will use the indicators for ‘Deposit money banks’ assets to GDP (%)’ and ‘Bank accounts per 1,000 adults’ as examples (bank.assets and bank.accounts respectively).

Obtaining the average indicator value for each year and region is straightforward using the group_by and summarize functions, but again we have to select the relevant years (using filter) and remove any observations that have a missing value for the indicator being analysed (using is.na). We save the final output as deposit_region.

# Save into new dataframe
deposit_region <- GFDD %>%
  # We only want observations from 2000 to 2014.
  subset(Year > 1999 & Year < 2015) %>%
  # We need to average by year and region.
  group_by(Year, Region) %>%
  # Remove observations with missing values
  filter(!is.na(bank.assets)) %>%
  # Get the mean (rounded to 2 decimal places)
  # and number of observations
  summarize(mean = round(mean(bank.assets), 2), n = n())

At this stage the summary data is stored in long format.

## # A tibble: 6 x 4
## # Groups:   Year [1]
##    Year Region                      mean     n
##   <dbl> <chr>                      <dbl> <int>
## 1  2000 East Asia & Pacific         67.6    25
## 2  2000 Europe & Central Asia       58.6    45
## 3  2000 Latin America & Caribbean   46.9    33
## 4  2000 Middle East & North Africa  60.9    17
## 5  2000 North America               68.5     2
## 6  2000 South Asia                  28.1     7

This format is useful for plotting the data, but to produce the required table (with Region as the column variable and Year as the row variable), we need to reshape the data into wide format. While we previously used gather to move from wide to long, we can use the spread function to achieve the opposite and transform the data from long to wide.

Trying to produce complicated tables with multiple values (average and counts) for each year and region can be cumbersome. In this example, we can combine the mean and count into a single entry (with the count in brackets) using the paste function. This will help with the formatting of the table.

deposit_region_table <- deposit_region %>%
  # Create a new variable for formatting purposes
  mutate(new = paste(mean, "(", n, ")")) %>%
  # Drop the old average and count variables
  select(-n, -mean) %>%
  # Reshape the data from long to wide
  spread(Region, new)

At this point you could just print or view the data, however using the kable function from the knitr package produces output that is visually easier to read and can be copied and pasted into your analysis document.

kable(deposit_region_table, format="markdown", 
  align = "r", digits = 2)

We can use ggplot to plot a line chart using the long format data (deposit_region), with year on the horizontal axis (formatted as a factor variable so the data will be plotted in chronological order). We specify color = Region and group = Region to group the data by region.

# Plot a line chart
  aes(x = factor(Year), y = mean, color = Region)) +   
  geom_line(aes(group = Region), size = 1) + 
  xlab("Year") +
  ggtitle("Deposit money banks' assets to GDP (%), 
    2000-2014, by region") +
  ylab("Mean deposit, % of GDP") +

Figure 10.5 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by region.

The process can be repeated for income group rather than region. However, be careful when specifying the variable Income Group, which needs backtick quotes (i.e. `Income Group`) because it contains a space in the variable name.

deposit_income <- GFDD %>%
  subset(Year > 1999) %>%
  # Note variable name in backticks
  group_by(Year, `Income Group`) %>%
  filter(!is.na(bank.assets)) %>%
  summarize(mean = round(mean(bank.assets), 2), n = n())

deposit_income %>%
  # Create a new variable for formatting purposes
  mutate(new = paste(mean, "(", n, ")")) %>%
  select(-n, -mean) %>%
  spread(`Income Group`, new) %>% 
  kable(., format = "markdown", align = "r", digits = 2)
  aes(x = factor(Year), y = mean, 
    color = `Income Group`)) +
  geom_line(aes(group = `Income Group`), size = 1) + 
  xlab("Year") +
  ggtitle("Deposit money banks' assets to GDP (%), 
    2000-2014, by income group") +
  ylab("Mean deposit, % of GDP") +

Figure 10.6 Line chart of ‘Deposit money banks’ assets to GDP (%)’, 2000–2014, by income group.

We can repeat the process for the indicator ‘Bank accounts per 1,000 adults’ by replacing the variable name bank.assets with bank.accounts in the above code, again by region and then by income group.

So far, we have been looking at simple averages, where each observation is given the same weight (importance), so we simply add up all the numbers and divide by the number of observations. However, when we take averages across regions or income groups, we might want to account for the fact that countries differ in size (population or GDP). For example, if one country is much larger than another, we might want that country to have a larger influence on the average. See the box below for more about weighted averages.

weighted average
A type of average that assigns greater importance (weight) to some components than to others, in contrast with a simple average, which weights each component equally. Components with a larger weight can have a larger influence on the average.

Weighted averages

An example of weighted averages that you have probably experienced is in calculating course grades. Usually, course grades are not calculated by simply summing up the scores in all components and dividing by the number of components. Instead, certain components such as the final exam are given more importance (influence over the overall grade) than the midterm exam or course assignments.

To calculate the weighted average, we first determine the weight of each component (these are fractions or proportions that sum to 1). Then we multiply each component by its respective weight, and then sum over all components. Using the course grade as an example, suppose the final exam is worth 70% of the final grade and the midterm exam is worth 30%, with both being scored out of 100. Then the weighted average would be:

In comparison, the simple average would give both components equal weight:

To develop your intuition for this concept, you can experiment by choosing values for the final exam score and midterm exam score and seeing how a change in one of the scores affects the weighted and simple averages.

The indicator ‘Bank regulatory capital to risk-weighted assets (%)’ in the Database also uses weights to account for the fact that some assets are riskier than others, and should therefore not be considered equally.

We will practice calculating weighted averages for the indicator ‘Bank accounts per 1,000 adults’, weighting according to total population in each region (so countries with a larger population will have a larger influence on the average). Since data is missing for some countries, we will calculate the total population in each region as the total population for countries with non-missing data.

  1. For each region and the years 2004–2014:

R walk-through 10.4 Creating weighted averages

As we only require the weighted averages for the years 2004–2014, we will create a new dataframe (called weighted_GFDD) to save our results in. The weights are required for each country within each region for each year, but only if there is a value for the GFDD.AI.01 (bank.accounts) indicator, so we group by year and then region (using group_by). We can then generate the weight for each country by dividing the population of each country by the sum of populations of all countries within a region (and year).

weighted_GFDD <- GFDD %>%
  # Select years of interest
  subset(Year > 2003 & Year < 2015) %>%
  # Select variables of interest
  select("Year", "Country", "Region", 
    "bank.accounts", "SP.POP.TOTL") %>%
  # Only keep observations with non-missing data
  filter(!is.na(bank.accounts)) %>%
  # Group by year and then country
  group_by(Year, Region) %>%
  # Generate country weights
  mutate(weight = SP.POP.TOTL / sum(SP.POP.TOTL))

After transforming and manipulating data, stop and check that you have obtained the desired result. In this case, the sum of the weights for a given region in a particular year should be 1, so let’s check whether that is the case.

weighted_GFDD %>%
  group_by(Year, Region) %>%
  summarize(total = sum(weight)) %>%
## # A tibble: 6 x 3
## # Groups:   Year [1]
##    Year Region                     total
##   <dbl> <chr>                      <dbl>
## 1  2004 East Asia & Pacific            1
## 2  2004 Europe & Central Asia          1
## 3  2004 Latin America & Caribbean      1
## 4  2004 Middle East & North Africa     1
## 5  2004 South Asia                     1
## 6  2004 Sub-Saharan Africa             1

This is correct, so we can proceed to calculate the required weighted indicator values by year and region. We start by creating a new variable with the weighted indicator value (bank.accounts.weighted), and then sum up the weighted indicator values by year and region. Recall that when calculating the weighted average, you sum all of the weighted observations rather than taking the mean (which would calculate the simple average instead).

Again, our data will be in long format so it needs to be converted to wide format before printing out.

weighted_GFDD %>%
  # Apply the weighting
  mutate(bank.accounts.weighted = bank.accounts * weight) %>%
  group_by(Year, Region) %>%
  # Round the summed weights to 1 decimal place
  summarize(weighted_mean = 
    round(sum(bank.accounts.weighted), 2)) %>%
  # Reshape the data
  spread(Region, weighted_mean) %>%
  kable(., format = "markdown", align = "r", digits = 2)

As usual, there are several ways to achieve the same thing in R. You may want to work out how to get the same results with the weighted.mean function in R.

Extension Using Winsorization to handle extreme values

If we are interested in combining indicators into a single index (as in Empirical Project 4), we may be concerned about extreme values, but still want to include these countries in the index (rather than excluding them from the calculations). When calculating summary statistics, we can deal with these extreme values by using the median instead of the mean.

On page 19 of the paper ‘Benchmarking financial systems around the world’, the authors discuss Winsorization (replacing extreme values with either the 95th or the 5th percentile value) as one way to handle these extreme values. Sometimes the extreme values are due to peculiar features of a single country, so we might want to adjust the data to make the values ‘less extreme’.

  1. For an indicator you have used in Questions 3 and 4 and for the year 2010:
  • Calculate the 95th and 5th percentile value of that indicator, across all countries.
  • Replace any value larger than the 95th percentile value with the 95th percentile value, and replace any value smaller than the 5th percentile value with the 5th percentile value.
  • Use your ‘Winsorized’ values from Question 5(b) to calculate the average values of the indicator, by region and income group (separately). Compare these values to the simple averages from Question 3(a).

Extension R walk-through 10.5 Dealing with extreme values

In this example we use ‘Bank accounts per 1,000 adults’ (bank.accounts). The 95th and 5th percentiles can be obtained using the quantiles function. We save the output into a tibble (similar to a dataframe) so we can refer to the values in later calculations.

q_5_95 <- GFDD %>%
  subset(Year == 2010) %>%
  filter(!is.na(bank.accounts)) %>%
    lower = quantile(bank.accounts, probs = c(0.05)), 
    upper = quantile(bank.accounts, probs = c(0.95))) %>%
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  27.6 1605.

We can compare the value of the indicator with these upper and lower bounds using the ifelse function. An example of using this function to nest two conditions was used in R walk-through 8.8. Here, we use a similar syntax, first replacing all values below the 5th percentile with the value for the 5th percentile, and then replacing all values above the 95th percentile with the value for the 95th percentile.

bank_2010 <- GFDD %>%
  subset(Year == 2010) %>%
  mutate(bank.accounts = 
    ifelse(bank.accounts < q_5_95$lower, q_5_95$lower,
      ifelse(bank.accounts > q_5_95$upper,
        q_5_95$upper, bank.accounts)))

Next we can obtain our summary statistics and print out the ‘Winsorized’ averages.

bank_2010 %>% 
  subset(Year == 2010) %>%
  group_by(`Income Group`) %>%
  filter(!is.na(bank.accounts)) %>%
  summarize(`2010 average` = mean(bank.accounts)) %>%
  kable(., format = "markdown", align = "r", digits = 2)
|         Income Group| 2010 average|
| High income: nonOECD|       916.90|
|    High income: OECD|      1356.47|
|           Low income|       123.06|
|  Lower middle income|       422.65|
|  Upper middle income|       635.71|

Part 10.2 Comparing financial stability before and after the 2008 global financial crisis

Learning objectives for this part

  • use confidence intervals to assess changes in the stability of financial institutions before and after the 2008 global financial crisis.

Now we will assess whether financial stability (measured by the two indicators in Figure 10.1) has changed since the 2008 global financial crisis.

  1. For both indicators of stability in Figure 10.1, explain what effect the post-global financial crisis banking regulations are likely to have on the value of the indicator (for example, would the value increase or decrease?), and why. You may find it helpful to research the regulations that were implemented as a result of the 2008 global financial crisis.
  1. For the years 2007 and 2014, using the t.test function, create tables with Region or Income Group as the row variables(s), and the difference in the average of those indicators between the two years, the 95% confidence interval lower and upper values, and the CI width, as column variables.

R Walk-Through 10.6 Calculating confidence intervals

In R walk-throughs 3.6 and 8.10 we used the t.test function to obtain differences in means and confidence intervals (CIs) for two groups of data. Here we need to obtain these statistics for the GFDD.SI.05 indicator (renamed as risk.weighted.assets) between 2007 and 2014 for each region.

As we need to find the confidence intervals for a number of regions, we can use a for loop to perform the same calculation for each region in turn. We get the full list of regions using the levels function. We start by defining an empty dataframe (df.ttest), and for each region, we will apply the t.test function. We can then extract the confidence interval upper and lower limits, and compute the difference in the means (as the midpoint) and the CI width.

  names(GFDD) == "GFDD.SI.05"] <- "risk.weighted.assets"

df.ttest <- data.frame(region = factor(), mean = double(),
  lower = double(), upper = double(), width = double())

for (i in levels(as.factor(GFDD$`Region`))) {
  t <- t.test(
      GFDD$`Region` == i & GFDD$Year == 2014],
      GFDD$`Region` == i & GFDD$Year == 2007])
  df.ttest <- rbind(df.ttest, data.frame(
    region = i,
    mean = (t$conf.int[1] + t$conf.int[2]) / 2,
    lower = t$conf.int[1],
    upper = t$conf.int[2],
    width =(t$conf.int[2] - t$conf.int[1]) / 2))

##                       region       mean       lower     upper     width
## 1        East Asia & Pacific 1.86333333  -2.0207518  5.747418  3.884085
## 2      Europe & Central Asia 2.73065657   0.7143307  4.746982  2.016326
## 3  Latin America & Caribbean 0.35294118  -1.1443531  1.850235  1.497294
## 4 Middle East & North Africa 0.05666667  -2.9481918  3.061525  3.004858
## 5              North America 0.50000000 -11.6928842 12.692884 12.192884
## 6                 South Asia 3.06000000  -1.3680600  7.488060  4.428060
## 7         Sub-Saharan Africa 1.29263158  -2.5820985  5.167362  3.874730

The same process can be repeated for income groups and for the indicator GFDD.SI.01 (Bank Z-score).

  1. For each indicator:
leverage ratio (for banks or households)
The value of assets divided by the equity stake (capital contributed by owners and shareholders) in those assets.

R Walk-Through 10.7 Plotting column charts with error bars

Again we use the GFDD.SI.05 indicator (risk.weighted.assets) for Region as an example. You can repeat the following steps by region and for the risk.weighted.assets variable by changing the variable name(s) in R walk-through 10.6 accordingly, then running the code below.

The data required to plot a column chart of the difference in means for each income group (using ggplot and geom_bar) was obtained in Question 2. To add confidence intervals to the chart we use the geom_errorbar option (see R walk-through 6.4 for more details on how you can use this option).

As the labels for each income group are quite long (wider than the columns) they will overlap on the horizontal axis. To prevent this, we rotate the labels using the themes option. (As with many problems in R, we found the solution by searching the internet for ‘ggplot rotate axis labels’ and adapting the code from the first result.)

ggplot(df.ttest, aes(y = mean, x = region)) +
  geom_bar(stat = "identity", position = "identity", 
    fill = "grey") + 
  geom_errorbar(aes(ymin = lower, ymax = upper), 
    width = 0.2, lwd = 1) +
  theme_bw() +
  theme(axis.text.x = element_text(
    angle = 30, hjust = 1)) +
    ylab("Difference") +

Figure 10.7 Column chart with error bars for ‘Bank regulatory capital to risk-weighted assets (%)’ (risk.weighted.assets).