Empirical Project 8 Working in Excel

Part 8.1 Cleaning and summarizing the data

So far we have been working with data that was formatted correctly. However, sometimes the datasets will be messier and we will need to ‘clean’ the data before starting any analysis. In the data we will use, the European Values Study (EVS) data has been converted to Excel from another program, so there are some entries that were not converted correctly and some variables that need to be recoded (for example, replacing words with numbers, or replacing one number with another).

• Click on the ‘Data and Documents’ button in the middle of the page, then click the ‘Other Documents’ button and download the PDF file called ‘ZA4804_EVS_VariableCorrespondence.pdf’.
1. The data spreadsheet contains an incomplete Data dictionary and four tabs with the data collected from each wave of the survey. The variable names currently do not tell us what the variable is, so we need to relabel them to avoid confusion.
• Use the PDF you have downloaded to fill in Column C (Variable description) of the Data dictionary tab.
• Fill in Column B with an appropriate new name for each variable and rename the variables in the other four tabs accordingly.

Throughout this project we will refer to the variables using their original names, so the Data dictionary tab will come in handy.

1. Now we will take a closer look at how some of the variables were measured.
• Variable A170 is reported life satisfaction on a scale of 1 (dissatisfied) to 10 (satisfied). Respondents answered the question ‘All things considered, how satisfied are you with your life as a whole these days?’ Discuss the assumptions needed to use this measure in interpersonal and cross-country comparisons, and whether you think they are plausible. (You may find it helpful to refer to Box 2.1 of the ‘OECD guidelines on measuring subjective well-being’.)
• An individual’s employment status (variable X028) was self-reported. Explain whether misreporting of employment status is likely to be an issue, and give some factors that may affect the likelihood of misreporting in this context.
• Variables C036 to C041 ask about an individual’s attitudes towards work. With self-reports, we may also be concerned that individuals are using a heuristic (rule of thumb) to answer the questions. Table 2.1 of the ‘OECD Guidelines on measuring subjective well-being’ lists some response biases and heuristics that individuals could use. Pick three that you think particularly apply to questions about life satisfaction or work ethic and describe how we might check whether this issue may be present in our data.
1. We will now check and clean the dataset so it is ready to use. Make the following changes to all relevant data tabs:
• Currently, missing values are recorded as ‘.a’, but we would like them to be blank cells. Use Excel’s Find and Replace tool to change the ‘.a’ to blank cells for variables A009 to X047D. (See Excel walk-through 8.1 for help on how to do this).
• Variable A170 (life satisfaction) is currently a mixture of numbers (2 to 9) and words (‘Satisfied’ and ‘Dissatisfied’), but we would like it to be all numbers. Replace the word ‘Dissatisfied’ with the number 1, and the word ‘Satisfied’ with the number 10.
• Similarly, variable X011_01 (number of children) has recorded no children as a word rather than a number. Replace ‘No children’ with the number 0.
• The variables C036 to C041 should be replaced with numbers ranging from 1 (‘Strongly disagree’) to 5 (‘Strongly agree’) so we can take averages of them later. Similarly, variable A009 should be recoded as 1 = ‘Very Poor’, 2 = ‘Poor’, 3 = ‘Fair’, 4 = ‘Good’, 5 = ‘Very Good’.
• We would like to split X025A into two variables, one for the number before the colon, and the other containing the words after the colon. Use Excel’s LEFT and/or RIGHT functions to create two new variables accordingly. (See Excel walk-through 8.1 for help on how to do this).

Excel walk-through 8.1 Cleaning data and splitting variables

Follow the walk-through in the CORE video, or in Figure 8.1, to find out how to clean data and split variables in Excel.

How to clean data using find and replace, and split variables using Excel’s LEFT and RIGHT functions.

Figure 8.1 How to clean data using find and replace, and split variables using Excel’s LEFT and RIGHT functions.

The data

This is what the data looks like. Currently some cells have entries ‘.a’ where the data is missing. We will find and replace the variables A009 to X047D (not pictured here).

Figure 8.1a This is what the data looks like. Currently some cells have entries ‘.a’ where the data is missing. We will find and replace the variables A009 to X047D (not pictured here).

Find and replace specific cell entries

‘Find and Replace’ in Excel is similar to this feature in Microsoft Word, except that it searches for text in individual cells rather than in lines of text. You can adapt steps 1-3 to change specific values of cells in a column, for example replacing a particular word with a particular number.

Figure 8.1b ‘Find and Replace’ in Excel is similar to this feature in Microsoft Word, except that it searches for text in individual cells rather than in lines of text. You can adapt steps 1-3 to change specific values of cells in a column, for example replacing a particular word with a particular number.

Extract a specific part of a string of text

Excel’s LEFT and RIGHT functions can help you extract parts of a long string of text, which is useful if you want to shorten variables or only keep particular information in a cell. Here, we are going to save variable X025A as two variables: ‘Education_1’ will contain the numbers before the colon, and ‘Education_2’ will contain the words after the colon.

Figure 8.1c Excel’s LEFT and RIGHT functions can help you extract parts of a long string of text, which is useful if you want to shorten variables or only keep particular information in a cell. Here, we are going to save variable X025A as two variables: ‘Education_1’ will contain the numbers before the colon, and ‘Education_2’ will contain the words after the colon.

Extract a specific part of a string of text

The RIGHT function counts the specified number of characters starting from the right end of the string of text. Since words have different lengths, we use LEN() to calculate the length of the string, then subtract the colon’s position in the string (using the FIND function). For example, if the string is 10 characters long, and the colon is the third character, we want Excel to extract characters 4 to 10 (the first 6 characters from the right).

Figure 8.1d The RIGHT function counts the specified number of characters starting from the right end of the string of text. Since words have different lengths, we use LEN() to calculate the length of the string, then subtract the colon’s position in the string (using the FIND function). For example, if the string is 10 characters long, and the colon is the third character, we want Excel to extract characters 4 to 10 (the first 6 characters from the right).

Although the paper we are following only considered individuals aged 25–80 who were not students, we will retain these observations in our analysis. However, we need to remove any observations that have missing values for A170, X028, or any of the other variables except X047D.

1. Remove all rows with missing values from your spreadsheet. Excel walk-through 8.2 gives guidance on how to do this. (Remember that not all questions were asked in each wave, so refer to the data documentation PDF to check whether there should be missing values in an entire column.) (Note: Excel may take some time to process the commands.)

Excel walk-through 8.2 Dropping observations that satisfy particular conditions

How to drop observations that satisfy particular conditions using ‘filter’, ‘select’ and ‘delete’.

Figure 8.2 How to drop observations that satisfy particular conditions using ‘filter’, ‘select’ and ‘delete’.

The data.

This is what the data looks like. In this example, we will remove any rows with the variable A170 recorded as missing (blank). First, we will filter the data so that only rows with blank cells in this column are showing.

Figure 8.2a This is what the data looks like. In this example, we will remove any rows with the variable A170 recorded as missing (blank). First, we will filter the data so that only rows with blank cells in this column are showing.

Filter the rows of interest

After step 3, you will see all the data that needs to be deleted.

Figure 8.2b After step 3, you will see all the data that needs to be deleted.

Delete these rows from the dataset

Excel will remove all the selected rows from the dataset. If there are many rows to delete, this command may take some time to process.

Figure 8.2c Excel will remove all the selected rows from the dataset. If there are many rows to delete, this command may take some time to process.

Clear the filter to see the rest of the data

After removing the specified rows, clear the filter to see the rest of your data. You will see that the number of rows has decreased.

Figure 8.2d After removing the specified rows, clear the filter to see the rest of your data. You will see that the number of rows has decreased.

The modified data

This is what the data looks like after removing rows with missing entries in A170. You can adapt steps 1–7 to remove rows that satisfy other conditions, such as containing a particular value or range of values.

Figure 8.2e This is what the data looks like after removing rows with missing entries in A170. You can adapt steps 1–7 to remove rows that satisfy other conditions, such as containing a particular value or range of values.

1. We will now create two variables, work ethic and relative income, which we will use in our comparison of wellbeing.
• Work ethic is measured as the average of C036 to C041. Create a new variable showing this work ethic measure.
• The study calculated an individual’s relative income as a deviation from the average income in that individual’s country. Explain the issues with using this method if the income distribution is skewed (for example, a long right tail).
• Instead of using average income, we will define relative income as the individual’s percentile in the income distribution. Create a new variable showing this information. (See Excel walk-through 8.3 for one way to do this).

Excel walk-through 8.3 Calculating percentiles from actual values

How to calculate percentiles from actual values.

Figure 8.3 How to calculate percentiles from actual values.

The data

This is what the data looks like. In this example, we will calculate the percentile of X047D (log income), according to the country (S003). You will need to use two new columns (R and S in this case).

Figure 8.3a This is what the data looks like. In this example, we will calculate the percentile of X047D (log income), according to the country (S003). You will need to use two new columns (R and S in this case).

Calculate the rank of each observation

The intuition behind our rank calculation is that an observation’s rank is the number of observations larger than it, plus 1. (For example, if there were only 2 numbers, the smaller number should be ranked 2 and the larger number ranked 1). By default, Excel will rank observations from largest to smallest (higher numbers will have lower ranks). We need this setting to calculate percentiles. In this example, we need to rank observations only within the country it belongs to, not the entire dataset, so we use this as a condition in COUNTIFS.

Figure 8.3b The intuition behind our rank calculation is that an observation’s rank is the number of observations larger than it, plus 1. (For example, if there were only 2 numbers, the smaller number should be ranked 2 and the larger number ranked 1). By default, Excel will rank observations from largest to smallest (higher numbers will have lower ranks). We need this setting to calculate percentiles. In this example, we need to rank observations only within the country it belongs to, not the entire dataset, so we use this as a condition in COUNTIFS.

Use the calculated rank to calculate the percentile.

The intuition behind the percentile calculation is that an observation’s percentile is the percentage of observations smaller than it (the +1 in the formula is needed to exclude the observation itself). The 1- at the start of the formula is because we ranked observations from largest to smallest (so we need to ‘reverse’ all the numbers). Here we have multiplied by 100 so the numbers correspond to actual percentiles (e.g. 92 percentile). You can see that higher incomes have lower ranks and higher percentiles, indicating that we have done this correctly. This example was based on observations with non-missing income only (not the fully-cleaned dataset), so the numbers you get will be slightly different.

Figure 8.3c The intuition behind the percentile calculation is that an observation’s percentile is the percentage of observations smaller than it (the +1 in the formula is needed to exclude the observation itself). The 1- at the start of the formula is because we ranked observations from largest to smallest (so we need to ‘reverse’ all the numbers). Here we have multiplied by 100 so the numbers correspond to actual percentiles (e.g. 92 percentile). You can see that higher incomes have lower ranks and higher percentiles, indicating that we have done this correctly. This example was based on observations with non-missing income only (not the fully-cleaned dataset), so the numbers you get will be slightly different.

1. Now we have all the variables we need in the format we need. We will make some tables to summarize the data. Using the data for Wave 4:
• Create a table showing the breakdown of each country’s population according to employment status, with country (S003) as the row variable, and employment status (X028) as the column variable. Express the values as percentages of the row total rather than frequencies. Discuss any differences or similarities between countries that you find interesting.
• Create a new table as shown in Figure 8.4 (similar to Table 1 in the study ‘Employment status and subjective well-being’) and fill in the missing values. Summary tables such as these give a useful overview of what each variable looks like.
Male Female
Mean Standard deviation Mean Standard deviation
Wellbeing
Self-reported health
Work ethic
Age
Education
Number of children

Summary statistics by gender, European Values Study.

Figure 8.4 Summary statistics by gender, European Values Study.

Part 8.2 Visualizing the data

Note

You will need to have done Questions 1–5 in Part 8.1 before doing this part.

We will now create some summary charts of the self-reported measures (work ethic and wellbeing), starting with column charts to show the distributions of values. Along with employment status, these are the main variables of interest, so it is important to look at them carefully before doing further data analysis.

The distribution of work ethic and wellbeing may vary across countries but may also change over time within a country, especially since the surveys are conducted around once a decade. To compare distributions for a particular country over time, we have to use the same horizontal axis, so we will first need to make a frequency table for each distribution of interest. Also, since the number of people surveyed in each wave may differ, we will use percentages instead of frequencies as the vertical axis variable.

1. Using the data from Wave 3 and Wave 4 only, for three countries of your choice:
• Using the work ethic measure, create a separate frequency table for Wave 3 and for Wave 4, similar to Figure 8.5. The values in the first column should range from 1 to 5, in intervals of 0.2. (Hint: To count observations only for a particular country, you will need to use the IF function and FREQUENCY function together. See Excel walk-through 6.1 for help on how to do this).
Range of work ethic score Frequency Percentage of individuals (%)
1.00
1.20

4.80
5.00

Frequency table for self-reported work ethic.

Figure 8.5 Frequency table for self-reported work ethic.

• Plot a separate column chart for each country to show the distribution of work ethic scores in Wave 3, with the percentage of individuals on the vertical axis and the range of work ethic score on the horizontal axis. On each chart, plot the distribution of scores in Wave 4 on top of the Wave 3 distribution. (See Excel walk-through 6.2 in Empirical Project 6 for guidance on what your charts should look like).
• Based on your charts from 1(b), does it look like attitudes towards work for each country have changed over time?
1. We will use line charts to make a similar comparison for wellbeing over time. Using data for countries that are present in Waves 1 to 4:
• Create a table showing the average wellbeing, by wave (column variable) and by country (row variable). (Hint: You may find it easier to make four separate PivotTables and copy-paste the data into a new table).
• Plot a line chart with wave number (1 to 4) on the horizontal axis and average wellbeing on the vertical axis. Make sure to include a legend.
correlation coefficient
A numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positively correlated), to −1, indicating that the variables take dissimilar variables (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.
• From your results in 2(a) and (b), does it look like wellbeing has changed over time? What other information about the distribution of wellbeing could we use to supplement these results?

After describing patterns in our main variables over time, we will use correlation coefficients and scatterplots to look at the relationship between these variables and the other variables in our dataset.

1. Using the Wave 4 data:
• Create a table as shown in Figure 8.6 and calculate the required correlation coefficients (for employment status and gender, you will need to create new variables).
Variable Wellbeing Work ethic
Age
Education
Employment status (= 1 if full-time employed, = 0 if unemployed)
Gender (= 0 if male, = 1 if female)
Self-reported health
Income
Number of children
Relative income
Wellbeing 1
Work ethic 1

Correlation between wellbeing, work ethic and other variables, Wave 4.

Figure 8.6 Correlation between wellbeing, work ethic and other variables, Wave 4.

• Interpret the coefficients, paying close attention to how the variables are coded. Explain whether the relationships implied by the coefficients are what you expected (for example, would you expect wellbeing to increase or decrease with health, income, etc.)

Next, we will look at the relationship between employment status and wellbeing and investigate the paper’s hypothesis that this relationship varies with the average work ethic in a country.

1. Using the data from Wave 4, carry out the following:
• Create a table showing the average wellbeing according to employment status (showing the full-time employed, retired, and unemployed categories only), with country (S003) as the row variable, and employment status (X028) as the column variable. Comment on any differences in average wellbeing between these three groups, and whether social norms is a plausible explanation for these differences.
• Use the table from 4(a) to calculate the difference in average wellbeing (full-time employed minus unemployed, and full-time employed minus retired).
• Plot column charts showing both of these differences in wellbeing, with country on the horizontal axis (sorted from lowest to highest average work ethic).
• Does the gap in wellbeing between employed and unemployed vary according to the country’s average work ethic?

So far we have described the data using tables and charts, but have not made any statements about whether what we observe is likely to be due to chance or not. In the next part, we will assess the statistical significance of the relationship between employment status and wellbeing.

Part 8.3 Confidence intervals for difference in the mean

Note

You will need to have done Questions 1–5 in Part 8.1 before doing this part.

The aim of this project was to look at the empirical relationship between employment and wellbeing. In other words, is the mean wellbeing of employed people (statistically) significantly different from the mean wellbeing of unemployed people (as we might expect from economic theory)?

When we calculate differences between groups, we might also want to know if these differences are statistically significant. In Part 6.2 of Empirical Project 6, we constructed confidence intervals for means, and used a rule of thumb to assess statistical significance from confidence intervals displayed on a chart. Now we will learn how to construct confidence intervals for the difference in two means, which allows us to assess statistical significance directly instead of using the rule of thumb.

Remember that the width of a confidence interval depends on the standard deviation and number of observations. (Read Part 6.2 of Empirical Project 6 to understand why.) When making a confidence interval for a sample mean (such as the mean wellbeing of the unemployed), we use the standard deviation and number of observations in that sample (unemployed people).

When we look at the difference in means (such as wellbeing of employed minus unemployed), we are using data from two groups (the unemployed and the employed) to make one confidence interval, so the number of observations is the sum of observations across both groups. To calculate the standard deviation for the difference in means, we use the standard deviations (SD) of each group:

This formula requires the two groups of data to be independent, meaning that the data from one group is not related, paired, or matched with data from the other group. This assumption is reasonable for the wellbeing data we are using. However, if the two groups of data are not independent, for example if the same people generated both groups of data (as in the public goods experiments of Empirical Project 2), then we cannot use this formula.

Once we have the new standard deviation and number of observations, we can calculate the width of the confidence interval (distance from the mean to one end of the interval) as before. The difference in means is statistically significant if the confidence interval for the difference in means does not contain 0.

1. We will apply this method to make confidence intervals for differences in wellbeing. Choose three countries: one with an average work ethic in the top third of scores, one in the middle third, and one in the lower third. (See Excel walk-through 6.4 for help on calculating confidence intervals and adding them to a chart.)
• Create a PivotTable for these countries, showing the average life satisfaction score, standard deviation (StdDev) of life satisfaction, and number of observations, with country (S003) as the row variable, and employment status (full-time employed, retired, and unemployed only) as the column variable.
• Use your PivotTable from 1(a) to calculate the difference in means (full-time employed minus unemployed, and full-time employed minus retired), the standard deviation of these differences, and the number of observations.
• Use Excel’s CONFIDENCE.T function and the calculated values in 1(b) to determine the confidence interval width (distance between the mean and one end of the interval) of the difference in means (using a 5% significance level).
• Plot a column chart for your chosen countries showing the difference in wellbeing (employed vs unemployed and employed vs retired) on the vertical axis, and country on the horizontal axis (sorted according to low, medium, and high work ethic). Add the confidence intervals from 1(c) to your chart.
• Discuss the statistical significance of your findings.
1. The method we used to compare wellbeing relied on making comparisons between people with different employment statuses, but a person’s employment status is not entirely random. We cannot therefore make causal statements such as ‘being unemployed causes wellbeing to decrease’. Describe how we could use the following methods to assess better the effect of being unemployed on wellbeing, and make some statements about causality:
• a natural experiment
• panel data (data on the same individuals, taken at different points in time).