Empirical Project 9 Working in R

Getting started in R

For this project you will need the following packages:

If you need to install these packages, run the following code:

install.packages(c("readxl","tidyverse", "knitr", "mosaic"))

You can import the libraries now, or when they are used in the R walk-throughs below.

library(readxl)  
library(tidyverse)
library(knitr)
library(mosaic)

Part 9.1 Households that did not get a loan

The Ethiopian Socioeconomic Survey (ESS) data was collected in 2013–14 from a nationally representative sample of households. Households were asked about topics such as their housing conditions, assets, and access to credit.

Download the ESS data and survey questionnaire:

R walk-through 9.1 Importing data into R

Before importing the data, open it in Excel to look at its structure. You can see there are three worksheets: ‘Data dictionary’, ‘All households’, and ‘Got loan’. Also note that there are a lot of empty cells where there is missing data. In the read_excel function we therefore use the na = "" option so that R recognizes empty cells as missing data.

library(tidyverse)
library(readxl)
allHH <- read_excel("Project 9 datafile.xlsx", sheet = "All households", na = "NA")
gotL <- read_excel("Project 9 datafile.xlsx", sheet = "Got loan", na = "NA")
DataDict <- read_excel("Project 9 datafile.xlsx", sheet = "Data dictionary", na = "NA")

Now let’s look at the variable types.

str(allHH)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5262 obs. of  19 variables:
##  $ household_id2     : num  1.01e+16 1.01e+16 1.01e+16 1.01e+16 1.01e+16 ...
##  $ got_loan          : chr  "No" "No" "No" "Yes" ...
##  $ rural             : chr  "Rural" "Rural" "Rural" "Rural" ...
##  $ hhsize            : num  8 8 1 3 4 4 5 5 6 7 ...
##  $ region            : chr  "Tigray" "Tigray" "Tigray" "Tigray" ...
##  $ gender            : chr  "Female" "Male" "Female" "Female" ...
##  $ age               : num  78 37 78 30 71 28 37 32 51 43 ...
##  $ young_children    : num  4 5 0 1 1 3 2 3 3 2 ...
##  $ working_age_adults: num  3 3 0 2 1 1 3 2 3 6 ...
##  $ max_education     : num  3 0 0 4 8 4 9 6 4 5 ...
##  $ number_assets     : num  32 19 3 6 34 5 45 35 20 64 ...
##  $ loan_rejected     : chr  "No" "No" "No" "No" ...
##  $ rejection_source1 : chr  NA NA NA NA ...
##  $ rejection_source2 : chr  NA NA NA NA ...
##  $ loan_purpose      : chr  NA NA NA NA ...
##  $ loan_purpose_other: chr  NA NA NA NA ...
##  $ did_not_apply     : chr  "Did not apply" "Did not apply" "Did not apply" "Applied" ...
##  $ reason_not_apply1 : chr  "Too Expensive" "Believe Would Be Refused" "Too Expensive" NA ...
##  $ reason_not_apply2 : chr  "Fear Not Be Able To Pay" NA "Fear Not Be Able To Pay" NA ...
str(gotL)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1480 obs. of  21 variables:
##  $ household_id2      : num  1.01e+16 1.01e+16 1.01e+16 1.01e+16 1.01e+16 ...
##  $ got_loan           : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ rural              : chr  "Rural" "Rural" "Rural" "Rural" ...
##  $ hhsize             : num  3 4 5 9 7 7 8 7 10 8 ...
##  $ region             : chr  "Tigray" "Tigray" "Tigray" "Tigray" ...
##  $ gender             : chr  "Female" "Male" "Male" "Male" ...
##  $ age                : num  30 71 53 52 56 38 44 47 52 37 ...
##  $ young_children     : num  1 1 3 4 3 4 4 3 2 6 ...
##  $ working_age_adults : num  2 1 2 6 3 3 4 4 8 2 ...
##  $ max_education      : num  4 8 5 8 7 7 8 6 10 1 ...
##  $ number_assets      : num  6 34 14 16 9 14 24 15 22 10 ...
##  $ borrowed_from      : chr  "Relative" "Relative" "Relative" "Other (specify)" ...
##  $ borrowed_from_other: chr  NA NA NA "Cooperatives" ...
##  $ loan_purpose       : chr  "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" ...
##  $ loan_startmonth    : chr  "March" "November" "November" "June" ...
##  $ loan_startyear     : num  2004 2006 2006 2005 2005 ...
##  $ loan_repaid        : chr  "Yes" "Yes" "No" "No" ...
##  $ loan_endmonth      : chr  NA NA "June" "March" ...
##  $ loan_endyear       : num  NA NA 2006 2006 2006 ...
##  $ loan_amount        : num  1000 800 2500 2653 1600 ...
##  $ loan_interest      : num  300 0 1750 325 300 1750 290 300 300 268 ...

It is important to ensure that all variables we expect to be numerical (num) variables are coded as num, and in this case, they are. You can see that there are many variables that are coded as character (chr) variables (for example gender or region), but it may be useful to change them into categorical (factor) variables for later use. Let’s do this now and then the data will be ready to use.

Instead of converting each variable to a factor individually, say allHH$gender <- as.factor(allHH$gender), we use the tidyverse package to do this step in one go.

allHH <- allHH %>% mutate_if(is.character, as.factor)
gotL <- gotL %>% mutate_if(is.character, as.factor)

We take allHH and send it to the mutate_if function, which applies the as.factor function to character variables only. Then we do the same for gotL.

Typing str(allHH) and str(gotL) will confirm that all character variables are now factors.

We also imported the ‘Data dictionary’ so that you do not have to return to the Excel spreadsheet. Use view(DataDict) to see the variable definitions.

  1. The data is already in a relatively clean format, so we will begin by summarizing the information in the ‘All households’ tab, starting with region and household characteristics.

R walk-through 9.2 Creating summary tables

In order to get the proportions of households living in large towns, small towns, or rural areas (encoded in the variable rural), we use the table function.

options(digits=3) # Control how many digits are printed
stab1 <- table(allHH$region, allHH$rural)
prop.table(stab1,1)
##                  
##                   Large town (urban)  Rural Small town (urban)
##   Addis Ababa                 1.0000 0.0000             0.0000
##   Afar                        0.0956 0.7574             0.1471
##   Amhara                      0.2176 0.6692             0.1132
##   Benshagul Gumuz             0.0000 0.9040             0.0960
##   Diredwa                     0.4685 0.5315             0.0000
##   Gambelia                    0.1154 0.8000             0.0846
##   Harari                      0.2727 0.7273             0.0000
##   Oromia                      0.2831 0.6070             0.1098
##   SNNP                        0.1826 0.7270             0.0905
##   Somalie                     0.1552 0.7552             0.0897
##   Tigray                      0.3670 0.5628             0.0701

The counts (number) of households in the respective regions and area types are contained in stab1. Running this table through the prop.table() function changes the values from counts to proportions. The optional input ( ,1) makes the proportions for each row add to one. (You could see what happens when you leave this option out, or if you choose the value 2.) To obtain detailed information, check ?prop.table.

Let’s use a similar approach to calculate the percentage of households with female heads.

stab2  <- table(allHH$gender)
prop.table(stab2)
## 
## Female   Male 
##  0.304  0.696

As shown, 30.4% of households have a female head.

We need to provide summary statistics for a range of variables. Most of these variables are numeric variables, but one, gender, is a factor variable. For the latter we use the summary function. For the numeric variables we use the favstats function, which is part of the mosaic package. One could also use the summary function, but it does not provide standard deviations.

library(mosaic) # Load the mosaic package
summary(allHH$gender)
## Female   Male   NA's 
##   1599   3662      1
# Create a list of the numeric variable names
sel_q <- c('hhsize', 'age', 'young_children', 'working_age_adults', 'max_education', 'number_assets')
lapply(subset(allHH,select = sel_q),favstats)
## $hhsize
##  min Q1 median Q3 max mean  sd    n missing
##    1  3      4  6  16 4.58 2.4 5260       2
## 
## $age
##  min Q1 median Q3 max mean   sd    n missing
##    3 32     42 55  99 44.2 15.6 5253       9
## 
## $young_children
##  min Q1 median Q3 max mean   sd    n missing
##    0  0      2  3  10 1.89 1.71 5262       0
## 
## $working_age_adults
##  min Q1 median Q3 max mean   sd    n missing
##    0  2      2  3  10 2.58 1.52 5262       0
## 
## $max_education
##  min Q1 median Q3 max mean   sd    n missing
##    0  2      6 10  30 7.53 7.28 5262       0
## 
## $number_assets
##  min Q1 median Q3 max mean   sd    n missing
##    0  5      9 18 203 14.9 17.2 5262       0

The summary statistics for the numeric variables are generated using the lapply (list apply) function. Inside the function we specify the subset of variables we are interested in, subset(allHH,select = sel_q), and specify the function we want to apply to these variables, namely the favstats function.

Now that we have an idea of what our data looks like, we will move on to identifying households that are potentially excluded from the credit market or are credit constrained. The former are households that find it impossible to borrow, and the latter are households that can only borrow on unfavourable terms (see Section 9.8 of Economy, Society, and Public Policy).

The variables in our dataset that are related to this issue are did_not_apply and loan_rejected. Later we will also look at the responses given in the variables reason_not_apply1 and reason_not_apply2.

  1. Using the ‘All households’ dataset:

R walk-through 9.3 Making frequency tables for loan applications and outcomes

The easiest way to make a frequency table is to use the table function.

stab3 <- addmargins(table(allHH$did_not_apply,allHH$loan_rejected, useNA = "always", dnn = c("Applied?","Rejected?")))
stab3
##                Rejected?
## Applied?          No  Yes <NA>  Sum
##   Applied       1363  201    1 1565
##   Did not apply 3632   24    2 3658
##   <NA>            37    2    0   39
##   Sum           5032  227    3 5262

Note that we nested the table () function in the addmargins function to obtain row and column totals. By default, the table function excludes ‘NA’ values. Consulting the help function (type ?table in the command window) will show that the option useNA = "always" includes them.

Here, we decide to exclude the 24 households that indicated that they did not apply for a loan, but also indicated that they were refused a loan. This is a contestable decision as it results in excluding more than 10% of households that indicated that they were refused a loan. We shall also remove all observations that have missing data for any of these two questions.

Now we create the dataset with the non-missing data only.

allHHc <- subset(allHH, !is.na(allHH$did_not_apply)) # Remove NAs in did_not_apply
allHHc <- subset(allHHc, !is.na(allHHc$loan_rejected)) 

nrow(allHHc) # Show the number of observations
## [1] 5220

At this stage we have dropped 42 observations. Now we delete the 24 observations that gave a nonsensical answer.

allHHc <- subset(allHHc, !(allHHc$loan_rejected == "Yes" & allHHc$did_not_apply == "Did not apply")) 

nrow(allHHc) # Show the number of observations
## [1] 5196

We are left with 5,196 observations. Let’s recreate the frequency table, but this time calculating proportions. (To obtain percentages, multiply the proportions by 100.)

stab4 <- addmargins(prop.table(table(allHHc$did_not_apply,allHHc$loan_rejected, dnn = c("Applied?","Rejected?"))))
stab4
##                Rejected?
## Applied?            No    Yes    Sum
##   Applied       0.2623 0.0387 0.3010
##   Did not apply 0.6990 0.0000 0.6990
##   Sum           0.9613 0.0387 1.0000

To create operational categories to use throughout this project, we will label households as either:

You should note that the ‘denied’ households are most likely only a subset of the credit-excluded households, as there will be households that are credit excluded and do not even apply for a loan. One could, for instance, reason that households who answered ‘Inadequate Collateral’ or ‘Do Not Know Any Lender’ are also likely to be credit-excluded.

  1. Using the subset of data from Question 2(b):

Note that arguably other answers are also indicative of being credit constrained, so the criteria we use is definitely only a subset of all households that are credit constrained. For example, one could include households that have been denied a loan, and it is also likely that some households that have been granted a loan are in fact credit constrained.

R walk-through 9.4 Creating variables to classify households

Let’s first create the HH_status variable.

allHHc$HH_status <- "not applied" # This is the default category.

allHHc$HH_status[allHHc$did_not_apply == "Applied" & allHHc$loan_rejected == "No"] <- "successful"
allHHc$HH_status[allHHc$did_not_apply == "Applied" & allHHc$loan_rejected == "Yes"] <- "denied"

allHHc$HH_status <- factor(allHHc$HH_status) # Change from character to factor variable

Typing summary(allHHc$HH_status) should give you numbers that correspond to the frequency table from Question 2.

Let’s continue by making the discouraged_borrower variable.

allHHc$discouraged_borrower <- "No" # This is the default category.

allHHc$discouraged_borrower[allHHc$reason_not_apply1 == "Believe Would Be Refused"] <- "Yes"
allHHc$discouraged_borrower[allHHc$reason_not_apply2 == "Believe Would Be Refused"] <- "Yes"

allHHc$discouraged_borrower <- factor(allHHc$discouraged_borrower) # Change from character to factor variable
summary(allHHc$discouraged_borrower)
##   No  Yes 
## 4608  588

To make the credit_constrained variable, we use levels to check the possible answers to the reason_not_apply questions

sel_ans <- levels(allHHc$reason_not_apply1)
sel_ans
##  [1] "Believe Would Be Refused"  "Do Not Know Any Lender"   
##  [3] "Do Not Like To Be In Debt" "Fear Not Be Able To Pay"  
##  [5] "Have Adequate Farm"        "Inadequate Collateral"    
##  [7] "No Farm or Business"       "Other (Specify)"          
##  [9] "Too Expensive"             "Too Much Trouble"

Of these reasons, only reasons [5] and [8] do not lead to a conclusion that a household is credit constrained.

sel_ans <- sel_ans[-c(5,8)] # Remove reasons 5 and 8

allHHc$credit_constrained <- "No" # This is the default category, as households that did not provide any reasons are classified as not credit constrained.

allHHc$credit_constrained[allHHc$reason_not_apply1 %in% sel_ans] <- "Yes"
allHHc$credit_constrained[allHHc$reason_not_apply2 %in% sel_ans] <- "Yes"

allHHc$credit_constrained <- factor(allHHc$credit_constrained) # Change from character to factor variable
summary(allHHc$credit_constrained)
##   No  Yes 
## 2184 3012

The selection criterion allHHc$reason_not_apply1 %in% sel_ans is a very useful programming technique. sel_ans contains all the answers that we associate with a credit-constrained household. allHHc$reason_not_apply1 %in% sel_ans gives an outcome of TRUE if the answer to reason_not_apply1 is one of the answers in sel_ans, then sets the value of credit_constrained to ‘Yes’ for those observations.

stab4  <- addmargins(prop.table(table(allHHc$credit_constrained,allHHc$discouraged_borrower, useNA = "ifany", dnn = c("Constrained?","Discouraged?"))))
stab4
##             Discouraged?
## Constrained?    No   Yes   Sum
##          No  0.420 0.000 0.420
##          Yes 0.467 0.113 0.580
##          Sum 0.887 0.113 1.000
library(knitr) # Required for the use of the kable function

print("Reasons not to apply 1")
## [1] "Reasons not to apply 1"
table5 <- prop.table(table(allHHc$reason_not_apply1))
kable(table5[rev(order(table5))]) # The kable function is optional but formats tables more neatly.
Var1                          Freq
--------------------------  ------
Do Not Like To Be In Debt    0.191
Have Adequate Farm           0.185
Fear Not Be Able To Pay      0.170
Believe Would Be Refused     0.116
No Farm or Business          0.102
Do Not Know Any Lender       0.068
Too Expensive                0.049
Inadequate Collateral        0.045
Other (Specify)              0.041
Too Much Trouble             0.033
print("Reasons not to apply 2")
## [1] "Reasons not to apply 2"
table6 <- prop.table(table(allHHc$reason_not_apply2))
kable(table6[rev(order(table6))])
Var1                          Freq
--------------------------  ------
Fear Not Be Able To Pay      0.277
Do Not Like To Be In Debt    0.243
Inadequate Collateral        0.087
Believe Would Be Refused     0.087
Too Expensive                0.068
Do Not Know Any Lender       0.064
Too Much Trouble             0.062
Have Adequate Farm           0.050
No Farm or Business          0.040
Other (Specify)              0.024

We will now analyse the stated reasons for wanting a loan, comparing those households that were successful (HH_status equal to ‘successful’) with those that were not successful (HH_status equal to ‘denied’).

  1. For both groups, create one table showing the proportion of households for each loan purpose. You will realize that in the ‘All households’ dataset, the reason for all ‘successful’ loans is ‘Other’. For that reason, you should use the ‘Got loan’ dataset to retrieve the reasons for loan information for successful loans. Was the purpose of loans for denied and successful borrowers similar? (Hint: It may help to think about the broad categories of spending on consumption and investment.)

R walk-through 9.5 Making frequency tables to compare proportions

Some of the data is in the ‘allHH’ dataset, while the rest is in the ‘gotL’ dataset, both of which we imported in R walk-through 9.1. We will combine that information into one new dataset called ‘loan_data’, which we then use to produce the table.

sel_allHHc <- subset(allHHc, subset = (allHHc$HH_status %in% c("successful","denied")))
sel_allHHc <- droplevels(sel_allHHc) # Remove the unused 'did not apply' level
prop.table(table(sel_allHHc$loan_purpose,sel_allHHc$HH_status,dnn = c("Loan Purpose","Loan")),2)
##                                             Loan
## Loan Purpose                                 denied successful
##   Business Start-up Capital                  0.2615     0.0000
##   Expanding Business                         0.1385     1.0000
##   Other (Specify)                            0.2718     0.0000
##   Purchase Agricultural Inputs for Food Crop 0.2103     0.0000
##   Purchase House/Lease Land                  0.0308     0.0000
##   Purchase Inputs for other Crops            0.0615     0.0000
##   Purchase Non-farm Inputs                   0.0256     0.0000

This reveals a particular feature of the data, namely that for successful borrowers, the ‘allHHc’ dataset does not contain all the useful information, as every successful household has ‘Other (Specify)’ in the loan_purpose variable. There is more useful information on loan purpose in the ‘gotL’ data, so we will extract the loan_purpose for unsuccessful households from the ‘allHHc’ dataset, and the equivalent information for successful loaners from the ‘gotL’ dataset.

# Select unsuccessful households from allHHc
loan_no <- subset(allHHc,allHHc$HH_status == "denied", select = c("loan_purpose","HH_status"))

# Select loan purpose for successful households from gotL
loan_yes <- subset(gotL, gotL$got_loan == "Yes",select = "loan_purpose")
loan_yes$HH_status <- "successful"

loan_data <- rbind(loan_no,loan_yes) # Combine into one dataset

loan_data <- droplevels(loan_data) # Remove the unused 'did not apply' level
kable(prop.table(table(loan_data$loan_purpose,loan_data$HH_status,dnn = c("Loan Purpose","Loan")),2))
                                              denied   successful
-------------------------------------------  -------  -----------
Business Start-up Capital                      0.262        0.154
Expanding Business                             0.138        0.081
Other (Specify)                                0.272        0.027
Purchase Agricultural Inputs for Food Crop     0.210        0.300
Purchase House/Lease Land                      0.031        0.023
Purchase Inputs for other Crops                0.062        0.098
Purchase Non-farm Inputs                       0.026        0.115
For consumption and personal expenses          0.000        0.201
  1. Using the information in the ‘All households’ and ‘Got loan’ tab, for ‘successful’ and ‘denied’ households:
Household characteristic Successful borrowers Credit-excluded
Age of household head    
Highest education in household    
Number of assets    
Household size    
Number of young children    
Number of working-age adults    

Characteristics of credit-excluded and successful borrowers.

Figure 9.1 Characteristics of credit-excluded and successful borrowers.

R walk-through 9.6 Calculating differences in household characteristics

Here we show how to get average characteristics conditional on HH_status using the mean function. With the mosaic package loaded (as we have done in R walk-through 9.2), we can use an enhanced capability of the mean function to split our data easily according to HH_status.

summary(allHHc$HH_status) # Show the number of observations in each category
##      denied not applied  successful 
##         201        3632        1363
# Mean household size conditional on credit status
mean(~hhsize|HH_status,data = allHHc, na.rm = TRUE)
##      denied not applied  successful 
##        4.82        4.46        4.87
# Mean max_education of household head conditional on credit status
mean(~max_education|HH_status,data = allHHc, na.rm = TRUE)
##      denied not applied  successful 
##        8.00        7.62        7.26

Repeat the mean command above for all variables to complete the table.

If we want to repeat this analysis by also splitting the data according to rural or region, we can use the group option (which is only available when mosaic has been loaded).

table(allHHc$rural,allHHc$HH_status) # Show the number of observations in each category
##                     
##                      denied not applied successful
##   Large town (urban)     56        1092        332
##   Rural                 128        2236        903
##   Small town (urban)     17         304        128
# Mean HH size conditional on rural and credit status variables
mean(~hhsize|HH_status,group = rural, data = allHHc, na.rm = TRUE)
##      denied.Large town (urban) not applied.Large town (urban) 
##                           3.57                           3.44 
##  successful.Large town (urban)                   denied.Rural 
##                           3.84                           5.47 
##              not applied.Rural               successful.Rural 
##                           4.98                           5.35 
##      denied.Small town (urban) not applied.Small town (urban) 
##                           4.06                           4.28 
##  successful.Small town (urban)             Large town (urban) 
##                           4.12                           3.54 
##                          Rural             Small town (urban) 
##                           5.10                           4.23
# Mean max_education of HH head conditional on rural and credit status variables
mean(~working_age_adults |HH_status,group = rural, data = allHHc, na.rm = TRUE)
##      denied.Large town (urban) not applied.Large town (urban) 
##                           2.36                           2.30 
##  successful.Large town (urban)                   denied.Rural 
##                           2.42                           2.88 
##              not applied.Rural               successful.Rural 
##                           2.58                           2.91 
##      denied.Small town (urban) not applied.Small town (urban) 
##                           3.18                           2.75 
##  successful.Small town (urban)             Large town (urban) 
##                           2.54                           2.33 
##                          Rural             Small town (urban) 
##                           2.68                           2.71

The same result can be also obtained using the piping technology of the tidyverse package.

stats4 <- allHHc %>% 
              group_by(HH_status,region) %>% 
              summarize(avg_hhsize = mean(hhsize,na.rm = TRUE)) %>% 
              spread(HH_status, avg_hhsize) %>% print()
## # A tibble: 11 x 4
##    region          denied `not applied` successful
##    <fct>            <dbl>         <dbl>      <dbl>
##  1 Addis Ababa       2.50          4.00       3.84
##  2 Afar              7.00          4.97       5.69
##  3 Amhara            4.42          3.86       4.50
##  4 Benshagul Gumuz   5.00          4.70       5.41
##  5 Diredwa           4.57          4.04       4.28
##  6 Gambelia          5.22          5.04       5.50
##  7 Harari            3.75          5.02       3.74
##  8 Oromia            4.91          4.53       5.21
##  9 SNNP              5.20          4.88       5.09
## 10 Somalie           4.33          5.15       5.26
## 11 Tigray            4.62          4.01       4.78

To understand what the spread command does, run the above code without it and see the difference.

  1. Using Figure 9.1, without conditioning on rural or region:

R walk-through 9.7 Calculating confidence intervals and adding them to a chart

To repeat some calculations for a list of variables, we will write a loop. First we create a list of these variables.

sel_var = c("age","max_education","number_assets","hhsize","young_children","working_age_adults")

Now we use the age variable as an example.

stats5 <- allHHc %>%
              filter(HH_status %in% c("denied","successful")) %>%   # Filters out the 'did not apply' cases
              group_by(HH_status) %>%
              summarize(avg_ = mean(age,na.rm = TRUE), sd_ = sd(age,na.rm = TRUE), n_ = sum(!is.na(age))) %>%
              print()
## # A tibble: 2 x 4
##   HH_status   avg_   sd_    n_
##   <fct>      <dbl> <dbl> <int>
## 1 denied      41.2  12.9   201
## 2 successful  43.4  14.3  1361
stats5 <- allHHc %>%
              filter(HH_status %in% c("denied","successful")) %>%   # Filters out the 'did not apply' cases
              group_by(HH_status) %>%
              summarize(avg_ = mean(age,na.rm = TRUE), sd_ = sd(age,na.rm = TRUE), n_ = sum(!is.na(age))) %>%
              print()
## # A tibble: 2 x 4
##   HH_status   avg_   sd_    n_
##   <fct>      <dbl> <dbl> <int>
## 1 denied      41.2  12.9   201
## 2 successful  43.4  14.3  1361

Now we use the t.test function to calculate the difference between the groups.

# Select the age variable (sel_var[1]) for successful and denied borrowers
sel_success <- unlist(allHHc[allHHc$HH_status == "successful",sel_var[1]])
sel_denied <- unlist(allHHc[allHHc$HH_status == "denied",sel_var[1]])
# The unlist function is necessary to get a vector of data instead of a dataframe/tibble.

temp <- t.test(sel_success,sel_denied, conf.level = 0.95)

The output of this test provides us with the details required. Note that the conf.level = 0.95 option is actually not necessary here, as 0.95 is the default level.

We will now do this for all required variables and save the difference in means and the confidence interval values in a dataframe so we can plot this information.

# Create the dataframe in which we save the data used for the chart
temp_plot <- data.frame(name = sel_var, dmean = NA, yhigh = NA, ylow = NA)

for (i in sel_var){
  sel_success <- unlist(allHHc[allHHc$HH_status == "successful",i])
  sel_denied <- unlist(allHHc[allHHc$HH_status == "denied",i])
  temp <- t.test(sel_success,sel_denied, conf.level = 0.95)
  temp_plot$dmean[temp_plot$name == i] <- temp$estimate[1]-temp$estimate[2] # Mean difference
  temp_plot$ylow[temp_plot$name == i] <- temp$conf.int[1] # Lower limit of the confidence interval
  temp_plot$yhigh[temp_plot$name == i] <- temp$conf.int[2] # Upper limit of the confidence interval
}

ggplot(temp_plot, aes(x=name, y=dmean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin=ylow, ymax=yhigh), width=.2) +
  ylab("Difference in means") + xlab("variable") +
  theme_bw +
  ggtitle("Difference in HH characteristics (successful and denied borrowers)")

Column chart showing difference in HH characteristics for successful and denied borrowers.

Figure 9.2 Column chart showing difference in HH characteristics for successful and denied borrowers.

  1. Using the information in the ‘All households’ dataset:

R walk-through 9.8 Calculating conditional means

We are interested in the means of a range of variables for different subgroups. Two subgroups are mutually exclusive (HH_status == "successful" and HH_status == "denied"), while the others (credit_constrained == "yes" and dicouraged_borrower == "yes") are partially overlapping subgroups of the data. Our strategy is to create a temporary dataframe (sel_allHHc) that only contains the relevant observations and the relevant variables.

sel_var = c("age","max_education","number_assets","hhsize","young_children","working_age_adults") # List variables we are interested in

sel_allHHc <- allHHc[allHHc$HH_status == "successful",sel_var]
paste("successful (n =",nrow(sel_allHHc), ")")
## [1] "successful (n = 1363 )"
colMeans(sel_allHHc,na.rm = TRUE)
##                age      max_education      number_assets 
##              43.37               7.26              15.88 
##             hhsize     young_children working_age_adults 
##               4.87               2.09               2.75
sel_allHHc <- allHHc[allHHc$HH_status == "denied",sel_var]
paste("denied (n =",nrow(sel_allHHc), ")")
## [1] "denied (n = 201 )"
colMeans(sel_allHHc,na.rm = TRUE)
##                age      max_education      number_assets 
##              41.21               8.00              14.46 
##             hhsize     young_children working_age_adults 
##               4.82               2.22               2.76
sel_allHHc <- allHHc[allHHc$discouraged_borrower == "Yes",sel_var]
paste("discouraged (n =",nrow(sel_allHHc), ")")
## [1] "discouraged (n = 588 )"
colMeans(sel_allHHc,na.rm = TRUE)
##                age      max_education      number_assets 
##              43.28               6.50              10.16 
##             hhsize     young_children working_age_adults 
##               4.65               2.03               2.49
sel_allHHc <- allHHc[allHHc$credit_constrained == "Yes",sel_var]
paste("constrained (n =",nrow(sel_allHHc), ")")
## [1] "constrained (n = 3012 )"
colMeans(sel_allHHc,na.rm = TRUE)
##                age      max_education      number_assets 
##              44.84               7.14              13.54 
##             hhsize     young_children working_age_adults 
##               4.44               1.81               2.49
selection bias
An issue that occurs when the sample or data observed is not representative of the population of interest. For example, individuals with certain characteristics may be more likely to be part of the sample observed (such as students being more likely than CEOs to participate in computer lab experiments).

A study on access to loans in Ethiopia looked at the relationship between loan amount and household characteristics. When doing so, they needed to account for selection bias, because we only observe positive loan amounts for successful borrowers. If we only had data for successful borrowers, then our sample would not be representative of the population of interest (all households), so we would have to interpret our results with caution. In our case, we have information about all households, so we can compare observable characteristics to see whether successful borrowers are similar to other households.

  1. Think of another example where there might be selection bias, in other words, where the data we observe is not representative of the population of interest.

Part 9.2 Households that got a loan

For households that successfully got a loan, we will look at:

We will also see if there are any relationships between these loan characteristics and household characteristics.

Now we will use the variables relating to the loan start and end dates to calculate the duration of the loan. Before using these variables, we need to check that the variable entries make sense. Some of this information could be recorded incorrectly (for example, the year is missing a digit, or the month is a number rather than a word).

  1. Using the ‘Got loan’ dataset:

R walk-through 9.9 Data cleaning and summarizing loan characteristics

We start by cleaning up the loan dates. We have information on start month and year as well as end month and year. Let’s look at these in turn. The structure of the dataframe (str(gotL)) indicates that the start and end year are numeric variables, but the months are factor variables with month names (for example ‘April’).

Let’s first look at the years by creating a scatterplot.

ggplot(gotL, aes(x=loan_endyear, y=loan_startyear)) +
  geom_point(size=2, shape=23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan start and end year")

Scatterplot showing loan start and end year.

Figure 9.3 Scatterplot showing loan start and end year.

We can see that there are three observations that have very low (< 500) start or end year values, which does not make sense. We will replace these with ‘NA’, but leave the original data untouched and create a new dataset called ‘gotLc’, where the ‘c’ indicates cleaned data.

gotLc <- gotL
gotLc$loan_startyear[gotLc$loan_startyear < 500] <- NA
gotLc$loan_endyear[gotLc$loan_endyear < 500] <- NA

ggplot(gotLc, aes(x=loan_endyear, y=loan_startyear)) +
  geom_point(size=2, shape=23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan start and end year")

Revised scatterplot showing loan start and end year without outliers.

Figure 9.4 Revised scatterplot showing loan start and end year without outliers.

In the top left corner, there is a loan with the start year (2006) after the end year (2003). Clearly this is incorrect, so we should remove this observation when analysing loan periods. However, we wait until we have combined the years with the months as there may be more observations with this issue.

Also, we can only see a small number of points because there are many identical observations (for example startyear of 2006 and endyear of 2006). To see these points you could replacegeom_point with geom_jitter in the command above ggplot command . Use ?geom_jitter to understand what this option does.

So let’s look at the months.

summary(gotLc$loan_startmonth)
##     April    August  December  February   January      July      June 
##        95        63       133       145       176       115       156 
##     March       May  November   October September      NA's 
##        85       141       115       106       146         4

There is no particular issue with the start months. What about the end months?

summary(gotLc$loan_endmonth)
##     April    August  December  February   January      July      June 
##       136        53        49       170        52        34        94 
##     March       May  November   October    Pagume September      NA's 
##       155        89        33        39         5        37       534

Two things are noteworthy here: there are now many ‘NA’ entries, and there is an entry called ‘Pagume’. As described in the task, ‘Pagume’ can be approximated by September. Let’s recode that.

gotLc$loan_endmonth[gotLc$loan_endmonth == "Pagume"] <- "September"

Another call of summary(gotLc$loan_endmonth) would confirm that there are no observations with ‘Pagume’ left.

Now we want to calculate the length of the loan, in other words, the number of days between start and end day. As we only have months and not days, this will be an approximation. We will create a new variable combining months and years using the paste function.

# We assume that all loan start and end dates are on the 1st of the month.
gotLc$loan_startdate <- paste("1",gotLc$loan_startmonth,gotLc$loan_startyear)
gotLc$loan_enddate <- paste("1",gotLc$loan_endmonth,gotLc$loan_endyear)

For observations with an unknown end date (recall we had more than 500 of these), we will code as missing. Currently these are recorded as ‘1 NA NA’. First we need to find which observations have missing data elements for either the loan_startdate or loan_enddate variable. R has very powerful tools to identify text patterns such as this, which you can learn about by searching the Internet for help (you could search for ‘R test whether character contains string’). For example, a useful short introduction to using such tools contains one solution to our problem.

See the first 20 observations of loan_enddate.

gotLc$loan_enddate[1:20]
##  [1] "1 NA NA"          "1 NA NA"          "1 June 2006"     
##  [4] "1 March 2006"     "1 February 2006"  "1 September 2007"
##  [7] "1 March 2006"     "1 March 2006"     "1 March 2006"    
## [10] "1 March 2006"     "1 February 2006"  "1 NA NA"         
## [13] "1 August 2007"    "1 August 2007"    "1 February 2006" 
## [16] "1 June 2007"      "1 November 2007"  "1 March 2006"    
## [19] "1 April 2006"     "1 NA NA"

You can find NAs in observations 1, 2, 12, and 20. The command grep will identify these automatically.

selNA <- grep('NA', gotLc$loan_enddate) # Identify the position of the observations that contain NAs
selNA[1:5]
## [1]  1  2 12 20 26

As you can see this identifies the first four instances correctly, and we can see that the next missing end date is in observation 26. Now we will replace all these observations as missing values and then convert the non-missing observations to dates.

gotLc$loan_enddate[selNA] <- NA
gotLc$loan_enddate <- as.Date(gotLc$loan_enddate, format = "%d %B %Y")

The option format = "%d %B %Y" specifies the format that dates are recorded as (for example ‘1 June 2006’), where %B stands for full months (view this page for examples of date formatting options). Now we repeat the same steps for the start date.

selNA <- grep('NA', gotLc$loan_startdate) # Identify the position of the observations that contain NAs
gotLc$loan_startdate[selNA] <- NA
gotLc$loan_startdate <- as.Date(gotLc$loan_startdate, format = "%d %B %Y")

Let’s find out the percentage of observations with missing values.

# Missing start dates
paste(sum(is.na(gotLc$loan_startdate)),"(",100*round(mean(is.na(gotLc$loan_startdate)),4),"% )")
## [1] "6 ( 0.41 % )"
# Missing end dates
paste(sum(is.na(gotLc$loan_enddate)),"(",100*round(mean(is.na(gotLc$loan_enddate)),4),"% )")
## [1] "535 ( 36.15 % )"

Now we will add a new variable indicating the length of the loan period. As R knows that loan_startdate and loan_enddate are dates, it recognizes automatically that the difference between two dates should be expressed as the number of days.

gotLc$loan_length <- gotLc$loan_enddate-gotLc$loan_startdate

Let’s look at the first few observations.

gotLc[1:5,c("loan_startdate","loan_enddate","loan_length")]
## # A tibble: 5 x 3
##   loan_startdate loan_enddate loan_length
##   <date>         <date>       <time>     
## 1 2004-03-01     NA           <NA>       
## 2 2006-11-01     NA           <NA>       
## 3 2006-11-01     2006-06-01   -153       
## 4 2005-06-01     2006-03-01   273        
## 5 2005-06-01     2006-02-01   245

Notice the following:

  • Where any of the two dates is missing, the length is missing as well.
  • Some loan lengths are negative (for example observation 3), because the recorded end date is before the start date. It could be that the two dates were switched when the data was entered into the system.

This is unfortunate, but is a common feature of real-life data, and you will have to be on the lookout for such occasions.

As required in Question 1, we will create two variants of the loan_length variable: one where we assign missing values to all observations that have negative loan_length, and one where we assume that the problem was the switching of start and end date, so we transform all loan lengths to positive values.

gotLc$loan_length_NA <- gotLc$loan_length
gotLc$loan_length_NA[gotLc$loan_length_NA<0] <- NA # Assign NA to negative loan lengths
gotLc$loan_length_abs <- abs(gotLc$loan_length)

Now we can create the long_term variable and look at the number of long-term loans.

gotLc$long_term <- (gotLc$loan_length_abs>365)
summary(gotLc$long_term)
##    Mode   FALSE    TRUE    NA's 
## logical     728     215     537

We therefore have about 23% long-term loans (only looking at loans for which we do have date information).

  1. Using the variables loan_amount and loan_interest:

R walk-through 9.10 Making summary tables and calculating correlations

To make summary tables, we use the favstats function from the mosaic package.

# loan_amount
favstats(~loan_amount, data = gotLc)
##  min  Q1 median   Q3   max  mean     sd    n missing
##    1 400   1200 3490 3e+07 26896 783587 1479       1
# loan_amount + loan_interest
favstats(~(loan_amount+loan_interest), data = gotLc)
##  min  Q1 median   Q3      max  mean     sd    n missing
##   20 500   1400 3780 31260000 29223 827144 1445      35

It is best to look at loan amounts and interest rate graphically, for example in a scatterplot.

ggplot(gotLc, aes(x=loan_amount, y=loan_interest)) +
  geom_point(size=2, shape=23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan amounts and interest payments")

Scatterplot showing loan amounts and interest payments.

Figure 9.5 Scatterplot showing loan amounts and interest payments.

One large loan (top right corner) dominates this graph. Let’s exclude observations with a loan amount larger than 200,000 from the graph.

ggplot(gotLc, aes(x=loan_amount, y=loan_interest)) +
  geom_point(size=2, shape=23, fill = "blue") +
  xlim(0, 200000) + # Set horizontal axis limits
  ylim(0, 30000) + # Set vertical axis limits
  theme_bw() +
  ggtitle("Loan amounts and interest payments")

Revised scatterplot showing loan amounts and interest payments without outliers.

Figure 9.6 Revised scatterplot showing loan amounts and interest payments without outliers.

Interestingly we can see many zero-interest loans. Now we will calculate the interest rate as loan_interest/loan_amount.

gotLc$interest_rate <- gotLc$loan_interest/gotLc$loan_amount
favstats(~interest_rate,data = gotLc)
##  min Q1 median    Q3 max  mean   sd    n missing
##    0  0      0 0.167 200 0.257 5.26 1445      35

The maximum interest rate is 200 (in other words 20,000%), which does not make sense and could be due to a data entry error. Making another scatterplot can also identify extreme values for loan amounts

ggplot(gotLc, aes(x=loan_amount, y=interest_rate)) +
  geom_point(size=2, shape=23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan amounts and interest rates")

Scatterplot identifying extreme values for loan amounts.

Figure 9.7 Scatterplot identifying extreme values for loan amounts.

Let’s make another scatterplot, excluding the observation with the extremely high interest rate and only looking at small loan amounts (< 1,000).

ggplot(subset(gotLc, interest_rate < 50), aes(x=loan_amount, y=interest_rate)) +
  geom_point(size=2, shape=23, fill = "blue") +
  xlim(0, 1000) + # Set horizontal axis limits
  ylim(0, 5) + # Set vertical axis limits
  theme_bw() +
  ggtitle("Loan amount and interest rates")

Scatterplot excluding extremely high interest rate and including only small loan amounts.

Figure 9.8 Scatterplot excluding extremely high interest rate and including only small loan amounts.

Again we can see that there are many zero-interest loans. From the summary statistics above, we can see that the median interest rate is 0, which implies that at least 50% of loans have a zero interest rate. The following code calculates that percentage precisely.

temp_all <- gotLc$interest_rate[!is.na(gotLc$interest_rate)]
temp_0 <- temp_all[temp_all==0]
paste("Percentage of zero interest rate loans: ", round(100*length(temp_0)/length(temp_all),2),"%")
## [1] "Percentage of zero interest rate loans: 50.52 %"

Now let’s calculate statistics conditional on whether a loan is long-term or not. Before we do this we will remove the observation with the very extreme interest rate (20,000%) from our ‘gotLc’ dataset (but not from the original ‘gotL’ dataset). That observation has a loan amount of 1 and an interest payment of 200, which is probably a data entry mistake. There is another extreme observation (with a loan amount of 30,000,000), but there is no indication that this observation is misrecorded as there is a significant interest payment for this loan.

gotLc <- subset(gotLc,interest_rate<200)
favstats(~interest_rate|long_term,data = gotLc)
##   long_term min Q1 median    Q3  max  mean    sd   n missing
## 1     FALSE   0  0  0.050 0.171 1.12 0.110 0.173 717       0
## 2      TRUE   0  0  0.141 0.245 2.24 0.189 0.268 211       0

Both the mean and median interest rate are higher for long-term loans. You can adapt the code above to calculate statistics for the loan_amount variable.

We now calculate correlations between interest rates and household characteristics. Below we use piping operations to select the relevant data (as in Project 8). The correlation matrix is then sent to a matrix we call M.

gotLc %>%
  subset(!is.na(interest_rate)) %>% # Only select observations with interest rate information
  select( age, max_education, number_assets, hhsize, young_children, working_age_adults, interest_rate) %>%
  cor(.,use="pairwise.complete.obs") -> M

M[,c("interest_rate")]
##                age      max_education      number_assets 
##             0.0258            -0.0841            -0.0474 
##             hhsize     young_children working_age_adults 
##             0.1050             0.1022             0.0466 
##      interest_rate 
##             1.0000
  1. Now we will look at sources of finance and how they are related to loan characteristics.

R walk-through 9.11 Creating summary tables of means

First we create the table with financing sources using borrowed_from.

stab10 <- table(gotLc$borrowed_from,gotLc$rural)
addmargins(prop.table(stab10,2),1)
##                           
##                            Large town (urban)   Rural Small town (urban)
##   Bank (commercial)                   0.01881 0.00393            0.00000
##   Employer                            0.04075 0.00196            0.01031
##   Grocery/Local Merchant              0.08150 0.04711            0.10309
##   Microfinance Institution            0.19122 0.27969            0.26804
##   Money Lender (Katapila)             0.00313 0.04809            0.02062
##   NGO                                 0.01254 0.04711            0.05155
##   Neighbour                           0.10658 0.11580            0.07216
##   Other (specify)                     0.05643 0.12071            0.04124
##   Relative                            0.48589 0.31501            0.43299
##   Religious Institution               0.00313 0.02061            0.00000
##   Sum                                 1.00000 1.00000            1.00000

Note that in all settings, most loans come from relatives. To create the table with borrowed_from_other, substitute this variable name in the above command.

When creating a table with categorical (factor) variables in the rows and columns, but with the cells reporting a statistic based on a third variable such as the average duration of a loan (rather than counts or proportions), we use piping operations.

tab10  <- gotLc %>%
        group_by(borrowed_from,rural) %>%
        summarize(mean_duration = round(mean(loan_length_abs,na.rm = TRUE),0)) %>%
        spread(rural,mean_duration) %>%
        print()
## # A tibble: 11 x 4
## # Groups:   borrowed_from [11]
##    borrowed_from            `Large town (urban)` Rural  `Small town (urba~
##    <fct>                    <time>               <time> <time>            
##  1 Bank (commercial)        1814                 619    <NA>              
##  2 Employer                 602                  290    NaN               
##  3 Grocery/Local Merchant   166                  259    176               
##  4 Microfinance Institution 712                  411    510               
##  5 Money Lender (Katapila)  365                  332    365               
##  6 Neighbour                125                  187    296               
##  7 NGO                      372                  395    236               
##  8 Other (specify)          274                  372    806               
##  9 Relative                 237                  217    393               
## 10 Religious Institution    1461                 343    <NA>              
## 11 <NA>                     1096                 289    151

To get the tables for the loan amount and interest rate, change the variable name in the mean() calculation above.

Extension

Investigating sources of finance associated with zero-interest loans

We previously saw that a large percentage of loans have a zero interest rate. Here we investigate whether particular sources of finance are responsible for these large numbers. The code we use is very similar to the code above, but instead of calculating the mean of a variable, we calculate the mean of a boolean (true/false) variable ((interest_rate==0)). This will deliver the proportion of ‘true’ observations, in other words, loans where the interest rate was equal to zero.

tab10 <- gotLc %>%
       group_by(borrowed_from,rural) %>%
       summarize(prop_0_interest = mean((interest_rate==0),na.rm = TRUE)) %>%
       spread(rural,prop_0_interest) %>%
       print()
## # A tibble: 11 x 4
## # Groups:   borrowed_from [11]
##    borrowed_from            `Large town (urban)`  Rural `Small town (urba~
##    <fct>                                   <dbl>  <dbl>              <dbl>
##  1 Bank (commercial)                      0.     0.                NA     
##  2 Employer                               0.692  0.500              1.00  
##  3 Grocery/Local Merchant                 1.00   0.812              1.00  
##  4 Microfinance Institution               0.0820 0.0351             0.0385
##  5 Money Lender (Katapila)                0.     0.0204             0.500 
##  6 NGO                                    0.250  0.0833             0.400 
##  7 Neighbour                              1.00   0.763              1.00  
##  8 Other (specify)                        0.222  0.171              0.    
##  9 Relative                               0.961  0.819              0.976 
## 10 Religious Institution                  1.00   0.190             NA     
## 11 <NA>                                   0.     0.571              1.00

We will use exactly the same technique to determine the proportion of loans that go to households with female heads.

tab11 <- gotLc %>%
       group_by(borrowed_from,rural) %>%
       summarize(prop_female = mean((gender=="Female"),na.rm = TRUE)) %>%
       spread(rural,prop_female) %>%
       print()
## # A tibble: 11 x 4
## # Groups:   borrowed_from [11]
##    borrowed_from            `Large town (urban)` Rural `Small town (urban~
##    <fct>                                   <dbl> <dbl>               <dbl>
##  1 Bank (commercial)                       0.    0.500              NA    
##  2 Employer                                0.308 0.                  0.   
##  3 Grocery/Local Merchant                  0.385 0.188               0.500
##  4 Microfinance Institution                0.410 0.130               0.308
##  5 Money Lender (Katapila)                 0.    0.265               0.500
##  6 NGO                                     0.250 0.312               0.200
##  7 Neighbour                               0.353 0.254               0.429
##  8 Other (specify)                         0.389 0.187               0.250
##  9 Relative                                0.400 0.206               0.286
## 10 Religious Institution                   1.00  0.238              NA    
## 11 <NA>                                    1.00  0.429               1.00
  1. In this project we have looked at patterns in borrowing and access to credit, but we are not able to make any causal statements such as ‘changes in X will cause households to be credit constrained’ or ‘characteristic Y causes improved access to credit’. Outline a policy intervention that could help improve households’ access to loans, and how to design the implementation so you can assess the causal effects of this policy.