Empirical Project 10 Working in R

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save link as’. You’ll need to save the code download to your working directory, and open it in RStudio.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in R

For this project you will need the following packages:

If you need to install any of these packages, run the following code:

install.packages(c("tidyverse", "readxl", "knitr"))

You can import the libraries now, or when they are used in the R walk-throughs below.


Part 10.1 Summarizing the data

We will be using the World Bank’s Global Financial Development Database.

First, download the data and documentation:

The World Bank’s Database contains information about four categories:

We will be looking at the first three categories, focusing particularly on measures of stability before and after the 2008 global financial crisis. Each category is measured by a number of indicators. Figure 10.1 shows the indicators we will be using in this project.

Category Indicator name Indicator code
Depth Private credit by deposit money banks to GDP (%) GFDD.DI.01
  Deposit money banks’ assets to GDP (%) GFDD.DI.02
Access Bank accounts per 1,000 adults GFDD.AI.01
  Bank branches per 100,000 adults GFDD.AI.02
  Firms with a bank loan or line of credit (%) GFDD.AI.03
  Small firms with a bank loan or line of credit (%) GFDD.AI.04
Stability Bank Z-score GFDD.SI.01
  Bank regulatory capital to risk-weighted assets (%) GFDD.SI.05

Indicators used in this project.

Figure 10.1 Indicators used in this project.

  1. The ‘Definition and Sources’ tab in the Excel spreadsheet contains a description of all indicators in the Database. Use the information provided to explain briefly why each of the indicators listed in Figure 10.1 may be a good measure of that category, or may give misleading information about that category.

The ‘Data – June 2016’ tab contains the values of each indicator over time (1960–2014) for various countries around the world, though data may be missing for some countries and years.

  1. For each category, calculate and interpret the correlation coefficient between the indicators. Remember that the correlation between A and B is the same as the correlation between B and A, so you only need to calculate the correlation for each pair of items once. (Hint: For Access, you may find it helpful to display the pairwise correlations in a table like Figure 10.2.)
GFDD.AI.01 1
GFDD.AI.02   1
GFDD.AI.03     1
GFDD.AI.04       1

Correlation table for indicators in Question 2.

Figure 10.2 Correlation table for indicators in Question 2.

R walk-through 10.1 Calculating correlation coefficients

Before loading an Excel spreadsheet into R, open it in Excel to understand the structure of the spreadsheet and the data it contains. In this case we can see that detailed descriptions of all variables are in the first worksheet. Be sure to read the definitions for the indicators listed in Figure 10.2.

The spreadsheet contains a number of other worksheets, but the data that we need is in the tab called ‘Data – June 2016’. The data is structured such that the variable name is in the first row and missing values are simply empty cells. We can therefore proceed to import the data into R using the read_excel function without any additional options.

The first task is to calculate the correlation coefficient between the indicators within each topic. So let us first consider the indicators on Depth and calculate the correlation between ‘Financial resources provided to the private sector by domestic money banks’ and ‘Total assets held by deposit money banks’. We use the cor function to compute the correlation. Since this function requires variables to be columns of a matrix, we use the cbind function to join the two variables of interest together.

##      [,1] [,2]
## [1,]    1   NA
## [2,]   NA    1

The output is a correlation matrix with the corresponding correlations given in the off-diagonal entries. In the output above, the correlation is reported as ‘NA’, because both of the variables of interest have many missing values. (If you explore the data (type view(GFDD) in the Console) then you will see this.)

Throughout this project we have to be mindful of missing values and how we handle them. It may be tempting to ‘clean’ the data at the start and remove any observations that have any missing data. But this would remove a lot of useful data, for example, for a particular question we may only be interested in the indicators on Depth, so it does not make sense to remove observations with missing Access indicator values. Throughout this project we will deal with the issue of missing data as it arises.

To find out how cor can deal with missing values, read the relevant help file (type ?cor). In this case we can use the use="pairwise.complete.obs" option to include only observations that have values for both GFDD.DI.01 and GFDD.DI.01.

cor(cbind(GFDD$GFDD.DI.01,GFDD$GFDD.DI.02), use="pairwise.complete.obs")
##           [,1]      [,2]
## [1,] 1.0000000 0.9612094
## [2,] 0.9612094 1.0000000

From this output we can see that the correlation matrix is symmetric, and the correlation coefficient is 0.96.

Similarly, we can obtain the correlation coefficients for the Access and Stability indicators.

##           [,1]      [,2]      [,3]      [,4]
## [1,] 1.0000000 0.3582013 0.4196770 0.4733813
## [2,] 0.3582013 1.0000000 0.4730903 0.5049081
## [3,] 0.4196770 0.4730903 1.0000000 0.9673203
## [4,] 0.4733813 0.5049081 0.9673203 1.0000000
##             [,1]        [,2]
## [1,] 1.000000000 0.002266157
## [2,] 0.002266157 1.000000000

Box and whisker plots are useful for looking at a single variable and checking if there are many extreme values (either very large or very small, relative to the rest of the values).

  1. Make a separate box and whisker plot for each indicator, with the outliers displayed (see Empirical Project 6 for help on how to do this). From looking at your plots, do extreme values appear to be an issue? (In Question 6, we will look at one way to handle extreme values if there is a concern that one or a few very extreme values will heavily affect the average.)

R walk-through 10.2 Making box and whisker plots

Box and whisker plots were introduced in Empirical Project 6. We can use the same process here, after ensuring that the data is in the correct format. ggplot expects the data to be in ‘long’ format (each row is a value for a single variable and year), whereas our data is in ‘wide’ format (each row contains a single variable but multiple years). We transform the data from wide to long format using the gather function.

Further details on the difference between long and wide format data can be found in ‘The Wide and Long Data Format for Repeated Measures Data’.

We also have to remove any missing values using the na.omit function; therefore before creating our box plots we will select the relevant variables, clean and reshape the data, and then save it as a temporary dataframe.

So for the Depth indicators:

# Select the correct variables and clean up missing values
indicators <- c("GFDD.DI.01","GFDD.DI.02")      # Specify the variables to plot
df.new <- GFDD[,indicators] %>%                 # Select the correct variables and save the results 
  gather(.,Indicator,value) %>%                 # Reshape wide to long format
  na.omit()                                     # Remove missing values

# Plot the data
# The grouping of the data is selected by specifying the 'color' of the plot based on the Indicator variable.
ggplot(df.new) +
  geom_boxplot(aes(x=Indicator, y=value, color=Indicator), lwd=1) +

Box and whisker plot for indicators GFDD.DI.01 and GFDD.DI.02.

Figure 10.3 Box and whisker plot for indicators GFDD.DI.01 and GFDD.DI.02.

We could repeat the process for each topic and plot all indicators together, however, the range for the GFDD.AI.01 variable is far larger than the other variables in this group, so it makes sense to plot this separately. Use the same process as before, as done below.

indicators <- c("GFDD.AI.01")
df.new <- GFDD[,indicators] %>%
  gather(.,Indicator,value) %>%
ggplot(df.new) +
  geom_boxplot(aes(x=Indicator, y=value, color=Indicator), lwd=1) +

Box and whisker plot for indicator GFDD.AI.01.

Figure 10.4 Box and whisker plot for indicator GFDD.AI.01.

indicators <- c("GFDD.AI.02","GFDD.AI.03","GFDD.AI.04")
df.new <- GFDD[,indicators] %>%
  gather(.,Indicator,value) %>%
ggplot(df.new) +
  geom_boxplot(aes(x=Indicator, y=value, color=Indicator), lwd=1) +

Box and whisker plot for indicators GFDD.AI.02,GFDD.AI.03 and GFDD.AI.04.

Figure 10.5 Box and whisker plot for indicators GFDD.AI.02,GFDD.AI.03 and GFDD.AI.04.

You can repeat the process for the indicators on bank stability by copying the above code and replacing the indicator variable names accordingly.

We will now create summary tables of some indicators and look at how they have changed over time. Each country belongs to a particular region and income group.

  1. Choose one indicator in Depth and one indicator in Access:

In this section we will use the indicators for ‘Deposit money banks’ assets to GDP (%)’ and ‘Bank accounts per 1,000 adults’ as examples (GFDD.DI.02 and GFDD.AI.01 respectively).

Obtaining the average indicator value for each year and region is straightforward using the group_by and summarize functions, but again we have to select the relevant years and remove any observations that have a missing value for the indicator being analysed.

deposit_region <- GFDD %>%                      # Save into new dataframe
  subset(Year > 1999 & Year < 2015) %>%         # We only want observations from 2000 to 2014.
  group_by(Year,Region) %>%                     # We need to average by year and region.
  filter(!is.na(GFDD.DI.02)) %>%                # Remove observations with missing values
  summarize(mean = round(mean(GFDD.DI.02),2), n=n())  # Obtain the mean (rounded to 2 decimal places) and number of observations

At this stage the summary data is stored in long format.

## # A tibble: 6 x 4
## # Groups:   Year [1]
##    Year Region                      mean     n
##   <dbl> <chr>                      <dbl> <int>
## 1  2000 East Asia & Pacific         67.6    25
## 2  2000 Europe & Central Asia       58.6    45
## 3  2000 Latin America & Caribbean   46.9    33
## 4  2000 Middle East & North Africa  60.9    17
## 5  2000 North America               68.5     2
## 6  2000 South Asia                  28.1     7

This format is useful for plotting the data, but to produce the required table (with Region as the column variable and Year as the row variable), we need to reshape the data into wide format. While we used gather to move from wide to long, we can use the spread function to achieve the opposite and transform from long to wide.

Trying to produce complicated tables with multiple values (average and counts) for each year and region can be cumbersome. In this example, we can combine the mean and count into a single entry (with the count in brackets) using the paste function. This will help with the formatting of the table.

deposit_region_table <- deposit_region %>%
  mutate(new = paste(mean,"(",n,")")) %>%     # Create a new variable for formatting purposes
  select(-n,-mean) %>%                        # Drop the old average and count variables
  spread(Region,new)                          # Reshape the data from long to wide

At this point you could just print or view the data, however using the kable function from the knitr package produces output that is visually easier to read and can be copied and pasted into your analysis document.

kable(deposit_region_table, format="markdown", align = "r", digits = 2)

We can plot a line chart using the long format data (deposit_region) as follows:

# Plot a line chart
# Set the horizontal axis to be the year, but treat it as a factor for the correct labelling
# The grouping by region is done by specifying 'color=Region'.
# As we have treated Year as a factor, we have to specify the grouping as 'Region' in geom_line.
ggplot(deposit_region,aes(x=factor(Year), y=mean, color=Region)) +   
  geom_line(aes(group=Region), size=1) + xlab("Year") +
  ggtitle("Deposit money banks' assets to GDP (%), 2000-2014, by region") +
  ylab("Mean deposit, % of GDP") +

Line chart of Deposit money banks’ assets to GDP (%), 2000–2014, by region.

Figure 10.6 Line chart of Deposit money banks’ assets to GDP (%), 2000–2014, by region.

The process can be repeated for income group rather than region. However, be careful when specifying the variable Income Group, which needs backtick quotes (i.e. `Income Group`) because it contains a space in the variable name.

deposit_income <- GFDD %>%
  subset(Year > 1999) %>%
  group_by(Year,`Income Group`) %>%           # Note variable name in backticks
  filter(!is.na(GFDD.DI.02)) %>%
  summarize(mean = round(mean(GFDD.DI.02),2), n=n())

deposit_income %>%
  mutate(new = paste(mean,"(",n,")")) %>%     # Create a new variable for formatting purposes
  select(-n,-mean) %>%
  spread(`Income Group`,new) %>% 
  kable(. , format="markdown", align = "r", digits =2)
ggplot(deposit_income,aes(x=factor(Year), y=mean, color=`Income Group`)) +
  geom_line(aes(group=`Income Group`),size=1) + xlab("Year") +
  ggtitle("Deposit money banks' assets to GDP (%), 2000-2014, by income group") +
  ylab("Mean deposit, % of GDP") +

Line chart of Deposit money banks’ assets to GDP (%), 2000–2014, by income group.

Figure 10.7 Line chart of Deposit money banks’ assets to GDP (%), 2000–2014, by income group.

We can repeat the process for the indicator Bank accounts per 1,000 adults by replacing the variable name GFDD.DI.02 with GFDD.AI.01 in the above code, again by region and then by income group.

So far, we have been looking at simple averages, where each observation is given the same weight (importance), so we simply add up all the numbers and divide by the number of observations. However, when we take averages across regions or income groups, we might want to account for the fact that countries differ in size (population or GDP). For example, if one country is much larger than another, we might want that country to have a larger influence on the average. See the box below for more about weighted averages.

weighted average
A type of average that assigns greater importance (weight) to some components than to others, in contrast with a simple average, which weights each component equally. Components with a larger weight can have a larger influence on the average.

Weighted averages

An example of weighted averages that you have probably experienced is in calculating course grades. Usually, course grades are not calculated by simply summing up the scores in all components and dividing by the number of components. Instead, certain components such as the final exam are given more importance (influence over the overall grade) than the midterm exam or course assignments.

To calculate the weighted average, we first determine the weight of each component (these are fractions or proportions that sum to 1). Then we multiply each component by its respective weight, and then sum over all components. Using the course grade as an example, suppose the final exam is worth 70% of the final grade and the midterm exam is worth 30%, with both being scored out of 100. Then the weighted average would be:

In comparison, the simple average would give both components equal weight:

To develop your intuition for this concept, you can experiment by choosing values for the final exam score and midterm exam score and seeing how a change in one of the scores affects the weighted and simple averages.

The indicator ‘Bank regulatory capital to risk-weighted assets (%)’ in the Database also uses weights to account for the fact that some assets are riskier than others, and should therefore not be considered equally.

We will practice calculating weighted averages for the indicator ‘Bank accounts per 1,000 adults’, weighting according to total population in each region (so countries with a larger population will have a larger influence on the average). Since data is missing for some countries, we will calculate the total population in each region as the total population for countries with non-missing data.

  1. For each region and the years 2004–2014:

R walk-through 10.4 Creating weighted averages

As we only require the weighted averages for the years 2004–2014, we will create a new dataframe (called weighted_GFDD) to save our results in. The weights are required for each country within each region for each year, but only if there is a value for the GFDD.AI.01 indicator, so we group by year and then region. We can then generate the weight for each country by dividing the population of each country by the sum of populations of all countries within a region (and year).

weighted_GFDD <- GFDD %>%
  subset(Year > 2003 & Year <2015) %>%                                # Select years of interest
         "SP.POP.TOTL") %>%    # Select variables of interest
  filter(!is.na(GFDD.AI.01)) %>%                                  # Only keep observations with non-missing data
  group_by(Year,Region) %>%                                        # Group by year and then country
  mutate(weight = SP.POP.TOTL / sum(SP.POP.TOTL))                # Generate country weights

After transforming and manipulating data, stop and check that you have obtained the desired result. In this case, the sum of the weights for a given region in a particular year should be 1, so let’s check whether that is the case.

weighted_GFDD %>%
  group_by(Year,Region) %>%
  summarize(total = sum(weight)) %>%
## # A tibble: 6 x 3
## # Groups:   Year [1]
##    Year Region                     total
##   <dbl> <chr>                      <dbl>
## 1  2004 East Asia & Pacific            1
## 2  2004 Europe & Central Asia          1
## 3  2004 Latin America & Caribbean      1
## 4  2004 Middle East & North Africa     1
## 5  2004 South Asia                     1
## 6  2004 Sub-Saharan Africa             1

This is correct, so we can proceed to calculate the required weighted indicator values by year and region. We start by creating a new variable with the weighted indicator value (GFDD.AI.01.weighted), and then sum up the weighted indicator values by year and region. Recall that when calculating the weighted average, you sum all of the weighted observations rather than taking the mean.

Again, our data will be in long format so it needs to be converted to wide format before printing out.

weighted_GFDD %>%
  mutate(GFDD.AI.01.weighted = GFDD.AI.01 * weight) %>%             # Apply the weighting
  group_by(Year,Region) %>%
  summarize(weighted_mean = round(sum(GFDD.AI.01.weighted),2)) %>%   # Round the summed weights to 1 decimal place
  spread(Region,weighted_mean) %>%                                  # Reshape the data
  kable(., format="markdown", align = "r", digits=2)

As usual there are several ways to achieve the same thing in R. You may want to work out how to replicate these results with the weighted.mean function in R.

Extension Using Winsorization to handle extreme values

If we are interested in combining indicators into a single index (as in Empirical Project 4), we may be concerned about extreme values, but still want to include these countries in the index (rather than excluding them from the calculations). On page 19 of the paper ‘Benchmarking financial systems around the world’, the authors discuss Winsorization (replacing extreme values with either the 95th or the 5th percentile value) as one way to handle these extreme values. Sometimes the extreme values are due to peculiar features of a single country, so we might want to adjust the data to make the values ‘less extreme’.

  1. For an indicator you have used in Questions 4 and 5 and for the year 2010:
  • Calculate the 95th and 5th percentile value of that indicator, across all countries.
  • Replace any value larger than the 95th percentile value with the 95th percentile value, and replace any value smaller than the 5th percentile value with the 5th percentile value.
  • Use your ‘Winsorized’ values from Question 6(b) to calculate the average values of the indicator, by region and income group (separately). Compare these values to the simple averages from Question 4(a).

Extension R walk-through 10.5 Dealing with extreme values

In this example we use GFDD.AI.01. The 95th and 5th percentiles can be obtained using the quantiles function. We save the output into a tibble so we can refer to the values in later calculations.

q_5_95 <- GFDD %>%
  subset(Year == 2010) %>%
  filter(!is.na(GFDD.AI.01)) %>%
  summarize(lower = quantile(GFDD.AI.01,probs = c(0.05)), upper = quantile(GFDD.AI.01,probs = c(0.95))) %>%
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  27.6 1605.

We can now replace the extreme values with the corresponding 5th and 95th percentile values. As we want to keep the original values for later calculations, we will output our transformed values into a new dataframe (bank_2010).

We can compare the value of the indicator with these upper and lower bounds using the ifelse function. An example of using this function to nest two conditions was used in R walk-through 8.8, and we repeat the process here.

bank_2010 <- GFDD %>%
  subset(Year == 2010) %>%
  mutate(GFDD.AI.01 = 

Next we can obtain our summary statistics and print out the ‘Winsorized’ averages.

bank_2010 %>% 
  subset(Year == 2010) %>%
  group_by(`Income Group`) %>%
  filter(!is.na(GFDD.AI.01)) %>%
  summarize(`2010 average` = mean(GFDD.AI.01)) %>%
  kable(., format="markdown", align = "r", digits=2)
|         Income Group| 2010 average|
| High income: nonOECD|       916.90|
|    High income: OECD|      1356.47|
|           Low income|       123.06|
|  Lower middle income|       422.65|
|  Upper middle income|       635.71|

Part 10.2 Comparing financial stability before and after the 2008 global financial crisis

Now we will assess whether financial stability (measured by the two indicators in Figure 10.1) has changed since the 2008 global financial crisis.

  1. For both indicators of stability in Figure 10.1, explain what effect the post-global financial crisis banking regulations are likely to have on the value of the indicator (for example, would the value increase or decrease?), and why. You may find it helpful to research the regulations that were implemented as a result of the 2008 global financial crisis.
  1. For the years 2007 and 2014, using the t.test function, create tables with Region or Income Group as the row variables(s), and the difference in the average of those indicators between the two years, the 95% confidence interval lower and upper values, and the CI width, as column variables.

R Walk-Through 10.6 Calculating confidence intervals

In R walk-throughs 3.6 and 8.10 we used the t.test function to obtain differences in means and confidence intervals (CIs) for two groups of data. Here we need to obtain these statistics for the GFDD.SI.05 indicator between 2007 and 2014 for each region.

As we need to find the confidence intervals for a number of regions, we can use a loop to perform the two sample hypothesis tests. We start by defining an empty dataframe (say df.ttest), and on each iteration of the for loop we will perform the hypothesis test using the t.test function. We can then extract the confidence interval upper and lower limits, and compute the difference in the mean (as the midpoint) and the CI width.

df.ttest <- data.frame(region = factor(),
                       width = double())
for (i in levels(as.factor(GFDD$`Region`))) {
  t <- t.test(GFDD$GFDD.SI.05[GFDD$`Region`==i & GFDD$Year ==2014],
              GFDD$GFDD.SI.05[GFDD$`Region`==i & GFDD$Year ==2007])
  df.ttest <- rbind(df.ttest,data.frame(region = i,
                                        mean =   (t$conf.int[1]+ t$conf.int[2])/2,
                                        lower = t$conf.int[1],
                                        upper = t$conf.int[2],
                                        width = (t$conf.int[2] - t$conf.int[1])/2))

##                       region       mean       lower     upper     width
## 1        East Asia & Pacific 1.86333333  -2.0207518  5.747418  3.884085
## 2      Europe & Central Asia 2.73065657   0.7143307  4.746982  2.016326
## 3  Latin America & Caribbean 0.35294118  -1.1443531  1.850235  1.497294
## 4 Middle East & North Africa 0.05666667  -2.9481918  3.061525  3.004858
## 5              North America 0.50000000 -11.6928842 12.692884 12.192884
## 6                 South Asia 3.06000000  -1.3680600  7.488060  4.428060
## 7         Sub-Saharan Africa 1.29263158  -2.5820985  5.167362  3.874730

The same process can be repeated for income groups and for the indicator ‘Bank Z-score’.

  1. For each indicator:
leverage ratio (for banks or households)
The value of assets divided by the equity stake (capital contributed by owners and shareholders) in those assets.

R Walk-Through 10.7 Plotting column charts with error bars

Again we use the GFDD.SI.05 indicator for Region as an example; the following can be repeated by region and for the GFDD.SI.05 variable.

The data required to plot a column chart of the difference in means for each income group (using ggplot and geom_bar) was obtained in Question 2. To add confidence intervals to the chart we use the geom_errorbar option (see R walk-through 6.4 for an example).

As the labels for each income group are quite long (wider than the columns) they will overlap on the horizontal axis. To prevent this, we rotate the labels using the themes option. (We obtained this result by searching the internet for ‘ggplot rotate axis labels’ and adapting the code from the first result.)

ggplot(df.ttest, aes(y=mean ,x=region)) +
  geom_bar(stat="identity", position = "identity", fill="grey") + 
  geom_errorbar(aes(ymin=lower, ymax=upper), width=0.2, lwd=1) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) + ylab("Difference")

Column chart with error bars for indicator GFDD.SI.05.

Figure 10.8 Column chart with error bars for indicator GFDD.SI.05.