Empirical Project 9 Working in Excel

Excel-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to create and format time variables.

Part 9.1 Households that did not get a loan

The Ethiopian Socioeconomic Survey (ESS) data was collected in 2013–14 from a nationally representative sample of households. Households were asked about topics such as their housing conditions, assets, and access to credit.

• Download the ESS data. Read the Data dictionary tab and make sure you know what each variable represents. (Later we will discuss exactly how some of these variables were constructed.)
• Under the heading ‘Related Materials’ in the middle of the page, download the PDF file called ‘2013–2014 Ethiopian Socioeconomic Survey, Household Questionnaire’. You may find it helpful to refer to Section 14 for the exact questions asked about credit and saving.
1. The data is already in a relatively clean format, so we will begin by summarizing the information in the ‘All households’ tab, starting with region and household characteristics.
• Create a PivotTable showing the proportion of households that lived in each region and area type, with ‘region’ as the row variable and ‘rural’ as the column variable.
dummy variable (indicator variable)
A variable that takes the value 1 if a certain condition is met, and 0 otherwise.
• Use Excel’s IF function to create a variable that equals 1 if the household head was female, and 0 if the household head was male. Variables that are coded in this way are known as dummy variables (or indicator variables). What percentage of household heads were female?
• Create an appropriate summary table for the variables ‘hhsize’, ‘gender’, ‘age’, ‘young_children’, ‘working_age_adults’, ‘max_education’, and ‘number_assets’. (You may find it helpful to refer to Figure 2.9 in Empirical Project 2 for one possible format to use.)
• Write a short paragraph describing the information in your tables for 1(c).

Now that we have an idea of what our data looks like, we will move on to identify households that are potentially excluded from the credit market or are credit-constrained. The former are households who find it impossible to borrow and the latter are households who can only borrow on unfavourable terms (see Section 9.8 of Economy, Society, and Public Policy).

The variables in our dataset that are related to this issue are ‘did_not_apply’ and ‘loan_rejected’. We will soon also look at the responses given in the variables ‘reason_not_apply1’ and ‘reason_not_apply2’.

1. Using the ‘All households’ tab:
• Create a frequency table with ‘did_not_apply’ as the row variable and ‘loan_rejected’ as the column variable. Include the blanks as a separate row.
• Looking at these two variables, explain why some observations should be excluded and remove them from the dataset. Also remove all households with missing information for one or more of these variables. Of the non-excluded observations, what percentage of households applied for a loan over the past 12 months? Of those households, what percentage were successful?
• For the resulting categories in the frequency table, explain whether the households in that category can be described as credit-constrained, credit-excluded, or both.

To create operational categories to use throughout this project, we will label households as either:

• ‘successful’: households that applied for a loan and were given the loan
• ‘denied’: households that applied but were not given the loan
• ‘did not apply’: households that did not apply for a loan.

You should note that the ‘denied’ households are most likely only a subset of the credit-excluded households as there will be households that are credit excluded and do not even apply for a loan. One could, for instance, reason that households who answered ‘Inadequate Collateral’ or ‘Do Not Know Any Lender’ are also likely to be credit-excluded.

1. Using the subset of data from Question 2(b):
• Create a new variable called ‘HH_status’ with the above categories.
• Create a new variable ‘discouraged_borrower’ that takes the value 1 if the household did not apply for a loan because it believed that it would not receive a loan (answered ‘Believe Would Be Refused’ in ‘reason_not_apply1’ or ‘reason_not_apply2’). How many households (and what percentage) are discouraged borrowers? (Note: this is a fairly narrow definition of ‘discouraged’ and one could easily argue that other criteria should also be considered under this label.)

Note that arguably other answers are also indicative of being credit constrained, and so the criteria we use is definitely only a subset of all households that are credit constrained. For example, one could include households that have been denied a loan, and it is also indeed likely that some households that have been granted a loan are in fact credit constrained.

• Create a new variable ‘credit_constrained’ that takes the value 1 (or yes) for households that gave a reason for not applying other than ‘NA’, ‘Other’, or ‘Have Adequate Farm’ in either of the two questions ‘reason_not_apply1’ or ‘reason_not_apply2’, and 0 otherwise. For example, a household that answers ‘Have Adequate Farm’ in ‘reason_not_apply1’ and ‘Do Not Know Any Lender’ would not be classified as credit constrained. How many households (and what percentage) are credit constrained?
• Create a frequency table showing the most important reason for not applying for a loan, and another showing the second most important reason for not applying. What were the most common reasons for not applying?

We will now analyse the stated reasons for wanting a loan, comparing those households that were successful (‘HH_status’ equal to ‘successful’) with those that were not successful (‘HH_status’ equal to ‘denied’).

1. For both groups, create one table showing the proportion of households for each loan purpose. (You will realize that in the ‘All households’ tab, the reason for all successful loans is ‘Other’. For that reason, you should use the ‘Got loan’ tab to retrieve the reasons for loan information for successful loans.) Was the purpose of loans for denied and successful borrowers similar? (Hint: It may help to think about the broad categories of spending on consumption and investment).
1. Using the information in the ‘All households’ tab and ‘Got loan’ tab, for each group of households:
• Create a table as shown in Figure 9.1 to compare the averages of the specified household characteristics.
Household characteristic Successful borrowers Credit-excluded
Highest education in household
Number of assets
Household size
Number of young children

Characteristics of credit-excluded and successful borrowers.

Figure 9.1 Characteristics of credit-excluded and successful borrowers.

• For each characteristic, explain how it may affect a household’s ability to get a loan (ceteris paribus).
• Discuss whether you see this pattern in the data.
• Now try conditioning on the variable ‘rural’ or ‘region’ and discuss how (if at all) your results change.
1. Using Figure 9.1, without conditioning on ‘rural’ or ‘region’, carry out the following:
• Calculate the difference in means (‘successful’ borrowers minus ‘denied’ borrowers).
• Establish whether the differences between the two groups are statistically significant at the 5% level, by calculating the 95% confidence interval for the difference in means between the two subgroups (‘successful’ minus ‘denied’). (Hint: Use Excel’s CONFIDENCE.T function and see Part 8.3 of Empirical Project 8 for help on how to do this.)
• Plot a column chart showing the differences on the vertical axis (sorted from smallest to largest), and household characteristics on the horizontal axis. Add the confidence intervals from Question 6(b) to the chart.
1. Using the information in the ‘All households’ tab:
• Create a table similar to Figure 9.1, but with additional columns for discouraged borrowers and credit-constrained households.
• Compare the means across the four groups (you do not need to assess statistical significance formally).
selection bias
An issue that occurs when the sample or data observed is not representative of the population of interest. For example, individuals with certain characteristics may be more likely to be part of the sample observed (such as students being more likely than CEOs to participate in computer lab experiments).

A study on access to loans in Ethiopia looked at the relationship between loan amount and household characteristics. When doing so, they needed to account for selection bias, because we only observe positive loan amounts for successful borrowers. If we only had data for successful borrowers, then our sample would not be representative of the population of interest (all households), so we would have to interpret our results with caution. In our case, we have information about all households, so we can compare observable characteristics to see whether successful borrowers are similar to other households.

1. Think of another example where there might be selection bias, in other words, where the data we observe is not representative of the population of interest.

Part 9.2 Households that got a loan

For households that successfully got a loan, we will look at:

• purpose of the loan
• duration of the loan(s)
• loan amount and interest rate charged
• who the household borrowed from.

We will also see if there are any relationships between these loan characteristics and household characteristics.

Now we will use the variables relating to the loan start and end dates to calculate the duration of the loan. Before using these variables, we need to check that the variable entries make sense. Some of this information could be recorded incorrectly (for example, the year is missing a digit, or the month is a number rather than a word).

1. Using the ‘Got loan’ tab, do the following:
• Check the variables ‘loan_startmonth’, ‘loan_startyear’, ‘loan_endmonth’, and ‘loan_endyear’ and replace the entries that are recorded incorrectly with either the correct entry (if possible), or as blank (if not possible to infer the correct entry). (Note: Some entries are recorded as ‘Pagume’, which corresponds to early September in the Ethiopian calendar.)
• To calculate loan duration, we need to combine the month and year variables into one date variable. Use Excel’s CONCATENATE function to create new variables for the start and end date, and format them as date variables. (See Excel walk-through 9.1 for help on how to do this.)
• Some of the dates (months or years) are missing. Calculate the percentage of the data that is missing and explain whether you think missing data is a serious problem.
• Create a new variable containing the loan duration (end date minus start date), which will be measured in days.
• You will notice that some dates were recorded incorrectly, with the start date later than the end date. We could either treat these entries as missing or swap the start and end dates. Create two new variables for loan duration, one with all negative entries recorded as blank, and one with negative entries replaced as positive numbers. (Hint: Excel’s ABS function converts any number to a positive number.)
• For this project we will define a long-term loan as lasting more than a year (365 days), which we will use in later questions. For this definition use the loan length variable that converts negative loan lengths to positive ones. Create an indicator variable called ‘long_term’ that equals 1 if the loan was long term, and 0 otherwise. What percentage of loans were long term?

Excel walk-through 9.1 Creating and formatting time variables

Follow the walk-through in the CORE video, or in Figure 9.2, to find out how to create and format time variables in Excel.

How to create and format time variables using CONCATENATE.

Figure 9.2 How to create and format time variables using CONCATENATE.

The data

In this example, we will use the month and year variables in Columns O, P, R, and S to make new date variables in Columns V and W. Before doing this, we need to make sure all the months and years are coded correctly, otherwise Excel may not recognize these values as dates. Also, some cells are blank, so we need to make our cell formulas work regardless of missing data.

Figure 9.2a In this example, we will use the month and year variables in Columns O, P, R, and S to make new date variables in Columns V and W. Before doing this, we need to make sure all the months and years are coded correctly, otherwise Excel may not recognize these values as dates. Also, some cells are blank, so we need to make our cell formulas work regardless of missing data.

Combine the month and year variables into one variable

Excel’s CONCATENATE function combines text in cells in the specified order. You can add punctuation and spaces by specifying them in quotation marks (“”). Here, we use the IF function so that Excel only fills in cells in rows where both start and end dates were non-missing (the AND function states these conditions).

Figure 9.2b Excel’s CONCATENATE function combines text in cells in the specified order. You can add punctuation and spaces by specifying them in quotation marks (“”). Here, we use the IF function so that Excel only fills in cells in rows where both start and end dates were non-missing (the AND function states these conditions).

Reformat the cells as date variables

After step 4, you may not notice any visible changes to the text in cells, but Excel now recognizes them as dates and you can use them to make calculations, such as counting the number of days between two dates.

Figure 9.2c After step 4, you may not notice any visible changes to the text in cells, but Excel now recognizes them as dates and you can use them to make calculations, such as counting the number of days between two dates.

Use date variables to calculate duration

With the proper formatting, Excel can calculate the number of days between two dates. In this example, some of the start dates are later than the end dates, resulting in negative numbers. We will correct these values in the next step.

Figure 9.2d With the proper formatting, Excel can calculate the number of days between two dates. In this example, some of the start dates are later than the end dates, resulting in negative numbers. We will correct these values in the next step.

Recode incorrect durations

Excel’s ABS function converts any value to its positive counterpart. After Step 8, the negative values in Column X are now recorded as positive numbers in Column Y. Again, we used the AND function so that Excel only did this for non-blank cells (in this case, the condition AND(X2<>“”) would give the same results).

Figure 9.2e Excel’s ABS function converts any value to its positive counterpart. After Step 8, the negative values in Column X are now recorded as positive numbers in Column Y. Again, we used the AND function so that Excel only did this for non-blank cells (in this case, the condition AND(X2<>“”) would give the same results).

1. Using the variables ‘loan_amount’ and ‘loan_interest’:
• Create summary tables to summarize the distribution of loan amount (mean, standard deviation, maximum, and minimum), one using the loan amount, the other using the total amount to repay (loan amount + interest). Make sure to exclude the one observation previously identified as having an extremely high interest rate. Remember to give your tables meaningful titles. Describe any features of the data that you find interesting.
• As mentioned earlier, the interest rate is a borrowing condition that can vary widely across households. Here we will take the interest rate to be the interest paid as a percentage of the loan amount. Calculate the interest rate for each loan in the data. (Exclude observations where the interest paid is not recorded.)
• Check for extreme values (interest rates that are either very large or zero). You may also want to create a scatterplot (with interest rate on the vertical axis and loan amount on the horizontal axis) to help you identify extreme observations. Exclude the observation with the most extreme interest rate from further calculations. What percentage of the loans are zero-interest?
• Make summary tables of the mean, maximum, minimum, and quartiles of the loan amount and interest rate, calculating these measures separately for long-term and short-term loans. Do the distributions of interest rates for short-term and long-term loans look different?
• Create a table showing the correlation between the interest rate and household characteristics (you may want to refer to Figure 8.6 in Empirical Project 8 for an example). Interpreting the interest rate charged as a measure of default risk (inability to repay), explain whether the relationships implied by the coefficients are what you expected (for example, would you expect interest rates to be higher for households with less assets, more dependents etc.).
1. Now we will look at sources of finance and how they are related to loan characteristics.
• Create a table showing the proportion of loans (in terms of the column variable) with source of finance (‘borrowed_from’) as the row variable and ‘rural’ as the column variable. Make a similar table but with ‘borrowed_from_other’ as the row variable instead. Does it look like rural households use different sources of finance from urban households? (Hint: It may help to think about sources of finance in terms of formal, informal, and other institutions such as microfinancers or NGOs.)
• For each of the variables below, create a table showing the average of that variable, with ‘borrowed_from’ as the row variable and ‘rural’ as the column variable. Comment on any similarities or differences between rows and columns that you find interesting, and suggest explanations for what you observe.

• duration of loan (using the variable in which negative durations were transformed to positive durations)
• loan amount
• interest rate
• Create a table showing the proportion of ‘gender’ (in terms of the row variable) with ‘borrowed_from’ as the row variable and ‘rural’ as the column variable. Does there appear to be a relationship between gender of household head and the types of finance used?
• What other variables are currently not in our dataset but could also be important for our analysis in Questions 2 and 3?
1. In this project we have looked at patterns in borrowing and access to credit, but we are not able to make any causal statements such as ‘changes in X will cause households to be credit-constrained’ or ‘characteristic Y causes improved access to credit’. Outline a policy intervention that could help improve households’ access to loans, and how to design the implementation so you can assess the causal effects of this policy.