# Empirical Project 8 Working in R

### Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save link as’. You’ll need to save the code download to your working directory, and open it in RStudio.

Don’t forget to also download the data into your working directory by following the steps in this project.

## Getting started in R

For this project you will need the following packages:

• tidyverse, to help with data manipulation
• readxl, to import an Excel spreadsheet
• knitr, to format tables.

If you need to install any of these packages, run the following code:

install.packages(c("tidyverse", "readxl", "knitr"))

You can import the libraries now, or when they are used in the R walk-throughs below.

library(tidyverse)
library(readxl)
library(knitr)

## Part 8.1 Cleaning and summarizing the data

So far we have been working with data that was formatted correctly. However, sometimes the datasets will be messier and we will need to clean the data before starting any analysis. In the data we will use, the European Values Study (EVS) data has been converted to an Excel spreadsheet from another program, so there are some entries that were not converted correctly and some variables that need to be recoded (for example, replacing words with numbers, or replacing one number with another).

Download the EVS data and documentation:

• Download the EVS data.
• For the documentation, go to the data download site.
• Click on the ‘Data and Documents’ button in the middle of the page, then click the ‘Other Documents’ button and download the PDF file called ‘ZA4804_EVS_VariableCorrespondence.pdf’.

Although we will be performing our analysis in R, the data has been provided as an Excel spreadsheet, so look at the spreadsheet in Excel first to understand its structure.

1. The Excel spreadsheet contains multiple worksheets, one for each wave, that need to be joined together to create a single ‘tibble’ (like a spreadsheet for R). The variable names currently do not tell us what the variable is, so we need to provide labels and short descriptions to avoid confusion.
• Using the rbind function, load each worksheet containing data (‘Wave 1’ through to ‘Wave 4’) and combine the worksheets into a single dataset.
• Use the PDF you have downloaded to create a label and a short description (attribute) for each variable using the attr function. (See R walk-through 8.1 for details.)

### R walk-through 8.1 Importing the data into R

As we are dealing with an Excel file, we use the read_excel function from the readxl package. The file is called Project-8-datafile.xlsx and contains four worksheets named ‘Wave 1’ through to ‘Wave 4’ that contain the data. We will load the worksheets one by one and add them to the previous worksheets using the rbind function.

As the variable names provided in the spreadsheet are not very specific, we can assign a label and a short description to each variable for reference later, using the attr function (note that the order in each list is the same as the order of the variable names).

attr(wellbeing_data,"labels") <-
c("EVS-wave","Country/region",
"Respondent number",
"Health","Life satisfaction",
"Work Q1","Work Q2","Work Q3",
"Work Q4","Work Q5",
"Sex","Age","Marital status",
"Number of children",
"Education","Employment",
"Monthly household income")

attr(wellbeing_data,"shortDescription") <-
c( "EVS-wave",
"Country/region",
"Original respondent number",
"State of health (subjective)",
"Satisfaction with your life",
"To develop talents you need to have a job",
"Humiliating to receive money without having to work for it",
"People who don't work become lazy",
"Work is a duty towards society",
"Work should come first even if it means less spare time",
"Sex",
"Age",
"Marital status",
"How many children you have-deceased children not included",
"Educational level respondent: ISCED-code one digit",
"Employment status",
"Monthly household income (× 1,000), corrected for ppp in euros")

Throughout this project we will refer to the variables using their original names, but you can use the following code to look up the attributes of each variable:

attr(wellbeing_data,
"labels")[attr(wellbeing_data,
"names")=="X028"]
## [1] "Employment"
attr(wellbeing_data,
"shortDescription")[attr(wellbeing_data,
"names")=="X028"]
## [1] "Employment status"
1. Now we will take a closer look at how some of the variables were measured.
• Variable A170 is reported life satisfaction on a scale of 1 (dissatisfied) to 10 (satisfied). Respondents answered the question ‘All things considered, how satisfied are you with your life as a whole these days?’ Discuss the assumptions needed to use this measure in interpersonal and cross-country comparisons, and whether you think they are plausible. (You may find it helpful to refer to Box 2.1 of the OECD Guidelines on Measuring Subjective Well-being.)
• An individual’s employment status (variable X028) was self-reported. Explain whether misreporting of employment status is likely to be an issue, and give some factors that may affect the likelihood of misreporting in this context.
• Variables C036 to C041 ask about an individual’s attitudes towards work. With self-reports, we may also be concerned that individuals are using a heuristic (rule of thumb) to answer the questions. Table 2.1 of the OECD Guidelines on Measuring Subjective Well-being lists some response biases and heuristics that individuals could use. Pick three that you think particularly apply to questions about life satisfaction or work ethic and describe how we might check whether this issue may be present in our data.
1. We will now check and clean the dataset so it is ready to use.
• Inspect the data and understand what type of data each variable is stored as. Currently, missing values are recorded as ‘.a’, but we would like them to be recorded as ‘NA’ (not available).
• Variable A170 (life satisfaction) is currently a mixture of numbers (2 to 9) and words (‘Satisfied’ and ‘Dissatisfied’), but we would like it to be all numbers. Replace the word ‘Dissatisfied’ with the number 1, and the word ‘Satisfied’ with the number 10.
• Similarly, variable X011_01 (number of children) has recorded no children as a word rather than a number. Replace ‘No children’ with the number 0.
• The variables C036 to C041 should be replaced with numbers ranging from 1 (‘Strongly disagree’) to 5 (‘Strongly agree’) so we can take averages of them later. Similarly, variable A009 should be recoded as 1 = ‘Very Poor’, 2 = ‘Poor’, 3 = ‘Fair’, 4 = ‘Good’, 5 = ‘Very Good’.
• Split X025A into two variables, one for the number before the colon, and the other containing the words after the colon.

### R walk-through 8.2 Cleaning data and splitting variables

#### Inspect the data and recode missing values

R stores variables as different types depending on the kind of information the variable represents. For categorical data, such as country or occupation, the variables can be stored as text (chr). Numerical data should be stored as a number (num), but sometimes when R imports data it will store numerical values as text, which can cause problems later when we use these variables in calculations. The str function shows what type each variable is being stored as.

str(wellbeing_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    164997 obs. of  17 variables:
##  $S002EVS: chr "1981-1984" "1981-1984" "1981-1984" "1981-1984" ... ##$ S003   : chr  "Belgium" "Belgium" "Belgium" "Belgium" ...
##  $S006 : num 1001 1002 1003 1004 1005 ... ##$ A009   : chr  "Fair" "Very good" "Poor" "Very good" ...
##  $A170 : chr "9" "9" "3" "9" ... ##$ C036   : chr  ".a" ".a" ".a" ".a" ...
##  $C037 : chr ".a" ".a" ".a" ".a" ... ##$ C038   : chr  ".a" ".a" ".a" ".a" ...
##  $C039 : chr ".a" ".a" ".a" ".a" ... ##$ C041   : chr  ".a" ".a" ".a" ".a" ...
##  $X001 : chr "Male" "Male" "Male" "Female" ... ##$ X003   : num  53 30 61 60 60 19 38 39 44 76 ...
##  $X007 : chr "Single/Never married" "Married" "Separated" "Married" ... ##$ X011_01: chr  ".a" ".a" ".a" ".a" ...
##  $X025A : chr ".a" ".a" ".a" ".a" ... ##$ X028   : chr  "Full time" "Full time" "Unemployed" "Housewife" ...
##  $X047D : chr ".a" ".a" ".a" ".a" ... ## - attr(*, "labels")= chr "EVS-wave" "Country/region" "Respondent number" "Health" ... ## - attr(*, "shortDescription")= chr "EVS-wave" "Country/region" "Original respondent number" "State of health (subjective)" ... We start by recoding all missing values as NA. wellbeing_data[wellbeing_data ==".a"] <-NA #### Recode the life satisfaction variable To recode the life satisfaction variable, we can use logical indexing to select all observations where the variable A170 is equal to either ‘Dissatisfied’ or ‘Satisfied’ and assign the value 1 or 10 respectively. When this variable was imported, some of the entries included text, so it is a chr variable. After changing the text into numerical values we can convert the variable into a numerical type using the as.numeric function. wellbeing_data$A170[wellbeing_data$A170 =="Dissatisfied"] <- 1 wellbeing_data$A170[wellbeing_data$A170 =="Satisfied"] <- 10 wellbeing_data$A170 <- as.numeric(wellbeing_data$A170) #### Recode the variable for number of children We repeat this process for the variable indicating the number of children. wellbeing_data$X011_01[wellbeing_data$X011_01=="No children"] <- 0 wellbeing_data$X011_01 <- as.numeric(wellbeing_data$X011_01) #### Replace text with numbers for multiple variables When we have to recode multiple variables with the same mapping of text to numerical value, we can use the tidyverse library and the piping (%>%) operator instead of repeating the process used above for each variable. (At this stage, don’t worry if you can’t understand exactly what this code does, as long as you can adapt it for similar situations.) wellbeing_data <- wellbeing_data %>% mutate_at(c("C036","C037","C038", "C039","C041"), funs(recode(.,"Strongly disagree" = "1", "Disagree" = "2" , "Neither agree nor disagree" = "3", "Agree" = "4", "Strongly agree" = "5"))) %>% mutate_at(c("A009"),funs(recode(.,"Very poor" = "1", "Poor"="2", "Fair"="3", "Good"="4", "Very good"="5"))) %>% mutate_at(c("A009","C036","C037", "C038","C039","C041","X047D"), funs(as.numeric)) #### Split a variable containing numbers and text To split the variable X025A into two new columns we use the separate function, which creates two new variables called Education_1 and Education_2 containing the numeric value and the text description respectively. wellbeing_data <- wellbeing_data %>% separate(X025A,c("Education_1","Education_2")," : ",remove = FALSE) %>% mutate_at("Education_1",funs(as.numeric)) Although the paper we are following only considered individuals aged 25–80 who were not students, we will retain these observations in our analysis. However, we need to remove any observations that have missing values for A170, X028, or any of the other variables except X047D. 1. Remove all observations with missing values from your data. R walk-through 8.3 gives guidance on how to do this. (Remember that not all questions were asked in each wave, so refer to the data documentation PDF to check whether there should be missing values for an entire variable in a particular wave.) ### R walk-through 8.3 Dropping specific observations As not all questions were asked in all waves, we have to be careful when dropping observations with missing values for certain questions, to avoid accidentally dropping an entire wave of data. For example, information on self-reported health (A009) was not recorded in Wave 3, and questions on work attitudes (C036 to C041) and information on household income are only asked in Waves 3 and 4. Furthermore, information on the number of children (X011_01) and education (X025A) are only collected in the final wave. We will first use the complete.cases function to filter variables present in all waves. This function creates an index of which observations to keep, which we can then use to reference the required rows in the dataframe. include <- names(wellbeing_data) %in% c("X003","A170", "X028","X007", "X001") wellbeing_data <- wellbeing_data[ complete.cases(wellbeing_data[,include]),] Next we will filter based on whether or not a question was asked in a particular wave. Again we use the complete.cases function, but combine it with the logical OR operator | to include all observations for waves in which a question does not feature, along with the complete cases for that question in the other waves. include_wave_1_2_4 <- names(wellbeing_data) %in% c("A009") # A009 is not in Wave 3. include_wave_3_4 <- names(wellbeing_data) %in% c("C036","C037","C038","C039","C041","X047D") # Work attitudes and income are in Waves 3 and 4. include_wave_4 <- names(wellbeing_data) %in% c("X011_01","X025A") # Number of children and education are in Wave 4. wellbeing_data <- wellbeing_data %>% filter(S002EVS == "1981-1984" | S002EVS == "1990-1993" | complete.cases( wellbeing_data[,include_wave_3_4])) wellbeing_data <- wellbeing_data %>% filter(S002EVS != "2008-2010" | complete.cases( wellbeing_data[,include_wave_4])) wellbeing_data <- wellbeing_data %>% filter(S002EVS == "1999-2001" | complete.cases( wellbeing_data[,include_wave_1_2_4])) 1. We will now create two variables, work ethic and relative income, which we will use in our comparison of wellbeing. • Work ethic is measured as the average of C036 to C041. Create a new variable showing this work ethic measure. • The study calculated an individual’s relative income as a deviation from the average income in that individual’s country. Explain the issues with using this method if the income distribution is skewed (for example, if it has a long right tail). • Instead of using average income, we will define relative income as the individual’s percentile in the income distribution. Create a new variable showing this information. (See R walk-through 8.4 for one way to do this). ### R walk-through 8.4 Calculating averages and percentiles #### Calculate average work ethic score We use the rowMeans function to calculate the average work ethic score for each observation based on the five survey questions, C036 to C041, related to working attitudes. wellbeing_data$work_ethic <-
rowMeans(wellbeing_data[,c("C036","C037",
"C038","C039",
"C041")])

#### Calculate income percentiles for each individual

R provides a handy function (ecdf) to obtain an individual’s relative income as a percentile. As we want a separate income distribution for each wave, we use the group_by function. Finally, we tidy up the results by converting each value into a percentage and rounding to a single decimal place.

The ecdf function will not work if there is missing data. Since we don’t have income data for Waves 1 and 2, we will have to split the data when working out the percentiles. First we select observations from Waves 3 and 4 that have income values, calculate the percentile values, and save them as a variable called percentile (which will be ‘NA’ for Waves 1 and 2). Then we recombine this data with the original observations from Waves 1 and 2.

# Generate the percentile variable for Waves 3 and 4, and store in temporary dataframe
df.new <- wellbeing_data %>%
subset(S002EVS %in% c("1999-2001","2008-2010")) %>%   # Select Waves 3 and 4
filter(!is.na(X047D)) %>%                             # Drop observations with missing income data
group_by(S002EVS) %>%                                 # Calculate the percentiles per wave
mutate(., percentile = ecdf(X047D)(X047D)) %>%        # Calculate the relative income as the percentiles of income distribution
mutate(., percentile = round(percentile*100,1))       # Express the variable as a percentage

# Recombine Waves 3 and 4 with the Waves 1 and 2
wellbeing_data <- wellbeing_data %>%
subset(S002EVS %in% c("1981-1984","1990-1993")) %>%   # Select Waves 1 and 2
mutate(., percentile = NA) %>%                        # Create an empty variable so old and new dataframes are same size
bind_rows(., df.new)                                  # Recombine the data
1. Now we have all the variables we need in the format we need. We will make some tables to summarize the data. Using the data for Wave 4:
• Create a table showing the breakdown of each country’s population according to employment status, with country (S003) as the row variable, and employment status (X028) as the column variable. Express the values as percentages of the row total rather than as frequencies. Discuss any differences or similarities between countries that you find interesting.
• Create a new table as shown in Figure 8.1 (similar to Table 1 in the study ‘Employment status and subjective well-being’) and fill in the missing values. Summary tables such as these give a useful overview of what each variable looks like.
Male Female
Mean Standard deviation Mean Standard deviation
Wellbeing
Self-reported health
Work ethic
Age
Education
Number of children

Summary statistics by gender, European Values Study.

Figure 8.1 Summary statistics by gender, European Values Study.

### R walk-through 8.5 Calculating summary statistics

#### Create table showing employment status, by country

One of the most useful R functions to obtain summary information is summarize, which produces many different summary statistics for a set of data. First we select the data for Wave 4 and group it by country (S003) and employment type (X028), then we use the summarize function to obtain the number of observations for each country and employment type pair. With the data effectively grouped by country only (a single observation contains the number of observations for each level of X028), we can use the sum function to obtain a total count per country and calculate relative frequencies (saved as freq).

wellbeing_data %>%
subset(S002EVS == "2008-2010") %>%          # Select Wave 4 only
group_by(S003,X028) %>%                     # Group by country and employment
summarize(n = n()) %>%                      # Count the number of observations by group
mutate(freq = round(n / sum(n),4)*100) %>%  # Calculate the relative frequency (percentages)
select(-n) %>%                              # Drop the count variable
spread(X028,freq) %>%                       # Reshape the data into a neat table
kable(., format="markdown")

#### Calculate summary statistics by gender

We can also obtain summary statistics on a number of variables at the same time using the summarize_at function. To obtain the mean value for each of the required variables, grouped by the gender variable, we can do the following:

wellbeing_data %>%    # Obtain mean value
subset(S002EVS == "2008-2010") %>%    # We only require Wave 4.
group_by(X001) %>%                    # Group by gender
summarize_at(c("A009","A170","work_ethic",
"X003","Education_1","X011_01"),
mean)
## # A tibble: 2 x 7
##   X001    A009  A170 work_ethic  X003 Education_1 X011_01
##   <chr>  <dbl> <dbl>      <dbl> <dbl>       <dbl>   <dbl>
## 1 Female  3.60  6.93       3.64  47.3        3.05    1.69
## 2 Male    3.77  7.03       3.72  46.9        3.14    1.55

We can repeat this to obtain the standard deviation, again grouped by male and female.

wellbeing_data %>%    # Obtain standard deviation
subset(S002EVS == "2008-2010") %>%
group_by(X001) %>%
summarize_at(c("A009","A170","work_ethic",
"X003","Education_1","X011_01"),
sd)
## # A tibble: 2 x 7
##   X001    A009  A170 work_ethic  X003 Education_1 X011_01
##   <chr>  <dbl> <dbl>      <dbl> <dbl>       <dbl>   <dbl>
## 1 Female 0.969  2.32      0.765  17.5        1.40    1.39
## 2 Male   0.929  2.28      0.757  17.4        1.31    1.43

We can combine these two operations into a single code block and ask summarize_at to perform multiple functions on multiple variables at the same time.

wellbeing_data %>%    # Obtain mean and standard deviations
subset(S002EVS == "2008-2010") %>%
group_by(X001) %>%
summarise_at(c("A009","A170","work_ethic",
"X003","Education_1","X011_01"),
funs(mean,sd))
## # A tibble: 2 x 13
##   X001   A009_mean A170_mean work_ethic_mean X003_mean Education_1_mean
##   <chr>      <dbl>     <dbl>           <dbl>     <dbl>            <dbl>
## 1 Female      3.60      6.93            3.64      47.3             3.05
## 2 Male        3.77      7.03            3.72      46.9             3.14
## # ... with 7 more variables: X011_01_mean <dbl>, A009_sd <dbl>,
## #   A170_sd <dbl>, work_ethic_sd <dbl>, X003_sd <dbl>,
## #   Education_1_sd <dbl>, X011_01_sd <dbl>

In this example, R automatically appends the function name (in this case, _mean or _sd) to the end of the variables. If you need these variables in subsequent calculations, you need to reformat the data and variable names.

## Part 8.2 Visualizing the data

### Note

You will need to have done Questions 1–5 in Part 8.1 before doing this part.

We will now create some summary charts of the self-reported measures (work ethic and wellbeing), starting with column charts to show the distributions of values. Along with employment status, these are the main variables of interest, so it is important to look at them carefully before doing further data analysis.

The distribution of work ethic and wellbeing may vary across countries but may also change over time within a country, especially since the surveys are conducted around once a decade. To compare distributions for a particular country over time, we have to use the same horizontal axis, so we will first need to make a frequency table for each distribution of interest. Also, since the number of people surveyed in each wave may differ, we will use percentages instead of frequencies as the vertical axis variable.

1. Use the data from Wave 3 and Wave 4 only, for three countries of your choice:
• For each of your chosen countries create a frequency table that contains the frequency of each unique value of the work ethic scores. Also include the percentage of individuals at each value grouped by Wave 3 and Wave 4 separately.
• Plot a separate column chart for each country to show the distribution of work ethic scores in Wave 3, with the percentage of individuals on the vertical axis and the range of work ethic scores on the horizontal axis. On each chart, plot the distribution of scores in Wave 4 on top of the Wave 3 distribution.
• Based on your charts from 1(b), does it look like the attitudes towards work for each country have changed over time?

### R walk-through 8.6 Calculating frequencies and percentages

First we need to create a frequency table of the work_ethic variable for each wave. This variable only takes discrete values from 1.0 to 5.0 in increments of 0.2 (since it is an average of five integer variables), so we can group by each discrete value and count the number of observations in each group using the summarize function. Once we have a count of each value (for each wave), we compute the percentages by dividing the count of the work_ethic values by the total number of observations (for each wave).

Once we have the percentages and frequency data, we use ggplot to plot a column chart. To overlay the two waves and make sure that the plot for each wave is visible, we use the alpha option in the geom_bar function to set the transparency level.

waves  <- c("1999-2001","2008-2010")
df.new <- wellbeing_data %>%
subset(., S002EVS %in% waves) %>%       # Select Waves 3 and 4
subset(., S003 == "Germany") %>%        # We use Germany in this example.
group_by(S002EVS, work_ethic) %>%       # Group by each wave and the discrete values of work_ethic
summarize (freq = n()) %>%
mutate(proportion = freq / sum(freq))

The frequencies and percentages are saved in a new dataframe called df.new. If you want to look at it, type view(df.new) into the console window. These numbers were generated to produce a plot of the distribution of scores for each wave as follows:

p<-ggplot(df.new, aes(work_ethic,proportion, fill=S002EVS, color=S002EVS)) +
# The colour and fill of each wave is set by the wave variable in the count/frequency dataframe.
geom_bar(stat="identity", position="identity", alpha=0.2) +
theme_bw() +
ggtitle(paste("Distribution of work ethic for Germany"))
print(p)

Distribution of work ethic scores for Germany.

Figure 8.2 Distribution of work ethic scores for Germany.

1. We will use line charts to make a similar comparison for wellbeing over time. Using data for countries that are present in Waves 1 to 4:
• Create a table showing the average wellbeing, by wave (column variable) and by country (row variable).
• Plot a line chart with wave number (1 to 4) on the horizontal axis and the average wellbeing on the vertical axis.
• From your results in 2(a) and 2(b), does it look like wellbeing has changed over time? What other information about the distribution of wellbeing could we use to supplement these results?

### R walk-through 8.7 Plotting multiple lines on a chart

#### Calculate average wellbeing, by wave and country

Before we can plot the line charts we have to calculate the average wellbeing for each country in each wave.

In R walk-through 8.5 we produced summary tables, grouped by country and employment status. We will copy this process, but now we only require mean values. Countries that do not report the wellbeing variable for all waves will have an average wellbeing of ‘NA’. Since each country is represented by a row in the summary table, we use the rowwise and na.omit functions to drop any countries that do not have a value for the average wellbeing for all four waves.

df.new <- wellbeing_data %>%
group_by(S002EVS,S003) %>%
summarize(avg_wellbeing=round(mean(A170),2)) %>%    # Calculate average wellbeing per wave and country
spread(S002EVS,avg_wellbeing) %>%                   # Reshape the data to one row per country
rowwise() %>%                                       # Set subsequent operations to be performed on each row
na.omit() %>%                                       # Drop any rows with missing observations
print()
## # A tibble: 11 x 5
##    S003             1981-1984 1990-1993 1999-2001 2008-2010
##    <chr>                  <dbl>       <dbl>       <dbl>       <dbl>
##  1 Belgium                 7.37        7.6         7.42        7.63
##  2 Denmark                 8.21        8.17        8.31        8.41
##  3 France                  6.71        6.77        6.98        7.05
##  4 Germany                 7.22        7.03        7.43        6.77
##  5 Iceland                 8.05        8.01        8.08        8.07
##  6 Ireland                 7.82        7.88        8.21        7.82
##  7 Italy                   6.65        7.3         7.18        7.4
##  8 Netherlands             7.75        7.77        7.83        7.99
##  9 Northern Ireland        7.66        7.88        8.07        7.82
## 10 Spain                   6.6         7.15        6.97        7.29
## 11 Sweden                  8.03        7.99        7.62        7.68

#### Create a line chart for average wellbeing

To draw the lines for all countries on a single plot we need to gather the data into the correct format to use in ggplot.

df.new %>%
gather(Wave,mean,1981-1984:2008-2010) %>%
ggplot(., aes(x=Wave, y=mean, color=S003)) + geom_line(aes(group=S003,linetype=S003), size=1)  + # Group the data by country
theme_bw()

Line chart of average wellbeing across countries and survey waves.

Figure 8.3 Line chart of average wellbeing across countries and survey waves.

correlation coefficient
A numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positively correlated), to −1, indicating that the variables take dissimilar variables (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.

After describing patterns in our main variables over time, we will use correlation coefficients and scatter plots to look at the relationship between these variables and the other variables in our dataset.

1. Using the Wave 4 data:
• Create a table as shown in Figure 8.4 and calculate the required correlation coefficients. (For employment status and gender, you may need to create new variables.)
Variable Wellbeing Work ethic
Age
Education
Employment status (= 1 if full-time employed, = 0 if unemployed)
Gender (= 0 if male, = 1 if female)
Self-reported health
Income
Number of children
Relative income
Wellbeing 1
Work ethic 1

Correlation between wellbeing, work ethic and other variables, Wave 4.

Figure 8.4 Correlation between wellbeing, work ethic and other variables, Wave 4.

• Interpret the coefficients, paying close attention to how the variables are coded. Explain whether the relationships implied by the coefficients are what you expected (for example, would you expect wellbeing to increase or decrease with health, income, etc.)

### R walk-through 8.8 Creating dummy variables and calculating correlation coefficients

To obtain the correlation coefficients between variables, we have to make sure that all variables are numeric. However, the data on sex and employment status are coded using text, so we need to create two new variables. We can use the ifelse function to make the value of the variable conditional on whether a logical statement is satisfied or not. As shown below, we can nest ifelse statements to create more complex conditions.

Although we had previously removed observations with missing values (‘NA’), we have reintroduced ‘NA’ values in the previous step when creating the employment variable. This would cause problems when obtaining the correlation coefficients using the cor function, so we include the option use="pairwise.complete.obs" to exclude observations containing ‘NA’ from the calculations.

##                    A170  work_ethic
## X003        -0.08220504  0.13333312
## Education_1  0.09363900 -0.14535643
## employment   0.18387286 -0.02710835
## gender      -0.02105535 -0.04781557
## A009         0.37560287 -0.07320865
## X047D        0.23542497 -0.15226999
## X011_01     -0.01729372  0.08925452
## percentile   0.29592968 -0.18679869
## A170         1.00000000 -0.03352464
## work_ethic  -0.03352464  1.00000000

Next we will look at the relationship between employment status and wellbeing and investigate the paper’s hypothesis that this relationship varies with a country’s average work ethic.

1. Using the data from Wave 4, carry out the following:
• Create a table showing the average wellbeing according to employment status (showing the full-time employed, retired, and unemployed categories only) with country (S003) as the row variable, and employment status (X028) as the column variable. Comment on any differences in average wellbeing between these three groups, and whether social norms is a plausible explanation for these differences.
• Use the table from Question 4(a) to calculate the difference in average wellbeing (full-time employed minus unemployed, and full-time employed minus retired).
• Plot column charts showing both of these differences in wellbeing, with country on the horizontal axis (sorted lowest to highest average work ethic).
• Does the gap in wellbeing between employed and unemployed vary according to the country’s average work ethic?

### R walk-through 8.9 Calculating group means

#### Calculate average wellbeing and differences in average wellbeing

We can achieve the tasks in Question 4(a) and (b) in one go using an approach similar to that used in R walk-through 8.5, although now we are interested in calculating the average wellbeing by country and employment type. Once we have tabulated these means we can compute the difference in the average values. We will create two new variables: D1 for the difference between the average wellbeing for full-time employed and unemployed, and D2 for the difference in average wellbeing for full-time employed and retired individuals.

employment_list = c("Full time","Retired","Unemployed")   # Set the employment types that we are interested in reporting
df.employment <- wellbeing_data %>%
subset(S002EVS == "2008-2010") %>%                      # Select Wave 4
subset(X028 %in% employment_list) %>%                   # Select only observations with the specified employment types
group_by(S003,X028) %>%                                 # Group by country and then employment type
summarize(mean = mean(A170)) %>%                        # Calculate the mean by country/employment type group
spread(X028,mean) %>%                                   # Reshape to one row per country
mutate(D1 = Full time - Unemployed, D2 = Full time - Retired)  # Create the difference in means
kable(df.employment, format="markdown")

#### Plot a column chart, sorted according to work ethic

In order to plot the differences ordered by the average work ethic, we first need to summarize the work_ethic variable by country and store the results in a temporary dataframe.

df.work_ethic <- wellbeing_data %>%
subset(S002EVS == "2008-2010") %>%
select(S003,work_ethic) %>%
group_by(S003) %>%
summarize(mean_work = mean(work_ethic))

We can now combine the average work_ethic data with the table containing the difference in means and plot a column chart, with the countries ordered by average work ethic. (The ordering is obtained using the x=reorder() option in ggplot.)

df.employment %>%
inner_join(.,df.work_ethic, by = "S003") %>%                # Combine with the average work ethic data
ggplot(.,aes(y=D1, x=reorder(factor(S003),mean_work)))  +
geom_bar(stat="identity") +
xlab("Country") + ylab("Difference") +
ggtitle("Difference in wellbeing between the full-time employed and the unemployed \n
(sorted from lowest to highest average work ethic).") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(hjust = 0.5))  # Rotate the country names

Difference in wellbeing between the full-time employed and the unemployed (sorted from lowest to highest average work ethic).

Figure 8.5 Difference in wellbeing between the full-time employed and the unemployed (sorted from lowest to highest average work ethic).

This process can be repeated for the difference in means between full-time employed and retired individuals by changing y=D1 to y=D2 in the ggplot function.

## Part 8.3 Confidence intervals for difference in the mean

### Note

You will need to have done Questions 1–5 in Part 8.1 before doing this part.

The aim of this project was to look at the empirical relationship between employment and wellbeing. In other words, is the mean wellbeing of employed people (statistically) significantly different from the mean wellbeing of unemployed people (as we might expect from economic theory)?

When we calculate differences between groups, we might also want to know if these differences are statistically significant. In Part 6.2 of Empirical Project 6, we constructed confidence intervals for means, and used a rule of thumb to assess statistical significance from confidence intervals displayed on a chart. Now we will learn how to construct confidence intervals for the difference in two means, which allows us to assess statistical significance directly instead of using the rule of thumb.

Remember that the width of a confidence interval depends on the standard deviation and number of observations. (Read Part 6.2 of Empirical Project 6 to understand why.) When making a confidence interval for a sample mean (such as the mean wellbeing of the unemployed), we use the sample standard deviation and number of observations in that sample (unemployed people) to obtain the standard error of the sample mean.

When we look at the difference in means (such as wellbeing of employed minus unemployed), we are using data from two groups (the unemployed and the employed) to make one confidence interval. To calculate the standard error for the difference in means, we use the standard errors (SE) of each group:

This formula requires the two groups of data to be independent, meaning that the data from one group is not related, paired, or matched with data from the other group. This assumption is reasonable for the wellbeing data we are using. However, if the two groups of data are not independent, for example if the same people generated both groups of data (as in the public goods experiments of Empirical Project 2), then we cannot use this formula.

Once we have the new standard error and number of observations, we can calculate the width of the confidence interval (distance from the mean to one end of the interval) as before. The difference in means is statistically significant if the confidence interval for the difference in means does not contain 0.

1. We will apply this method to make confidence intervals for differences in wellbeing. Choose three countries: one with an average work ethic in the top third of scores, one in the middle third, and one in the lower third. (See R walk-through 6.3 for help on calculating confidence intervals and adding them to a chart.)
• In Question 4 of Part 8.2 you computed the average life satisfaction score by country and employment type (using Wave 4 data). Repeat this procedure to obtain the standard error for the means for each of your chosen countries.

Create a table for these countries, showing the average life satisfaction score, standard deviation (StdDev) of life satisfaction, and number of observations, with country (S003) as the row variable, and employment status (full-time employed, retired, and unemployed only) as the column variable.

• Use your table from Question 1(a) to calculate the difference in means (full-time employed minus unemployed, and full-time employed minus retired), the standard deviation of these differences, the number of observations, and the confidence interval.
• Use R’s t.test function to obtain the confidence interval width of the difference in means and compare your results with Question 1(b) (using a 5% significance level).
• Plot a column chart for your chosen countries showing the difference in wellbeing (employed vs unemployed and employed vs retired) on the vertical axis, and country on the horizontal axis (sorted according to low, medium, and high work ethic). Add the confidence intervals from Question 1(c) to your chart.
• Discuss the statistical significance of your findings.

### R walk-through 8.10 Calculating confidence intervals and adding error bars

We will use Turkey, Spain, and Great Britain as example countries in the top, middle, and bottom third of work ethic scores respectively.

In the tasks in Questions 1(a) and (b) we will obtain the means, standard errors, and confidence intervals step-by-step, then for Question 1(c) we show how to use a shortcut to obtain confidence intervals from a single function.

#### Calculate confidence intervals manually

We obtained the difference in means in R walk-through 8.9, so now we can calculate the standard error of the means for each country of interest.

country_list <- c("Turkey","Spain","Great Britain")     # List chosen countries
df.employment.se <- wellbeing_data %>%
subset(S002EVS == "2008-2010") %>%                    # Select Wave 4
subset(X028 %in% employment_list) %>%                 # Select the employment types we are interested in
subset(S003 %in% country_list) %>%                    # Select countries
group_by(S003,X028) %>%                               # Group by country and employment type
summarize(se = sd(A170) /sqrt(n())) %>%               # Calculate the standard error of each group mean
spread(X028,se) %>%
mutate(D1_SE = sqrt(Full time^2 + Unemployed^2), D2_SE = sqrt(Full time^2 + Retired^2))   # Calculate the SE of difference

We can now combine the standard errors with the difference in means, and compute the confidence interval width.

df.employment <- df.employment %>%
subset(S003 %in% country_list) %>%                # Select chosen countries
select(-Full time, -Retired, -Unemployed) %>%   # We only need the differences.
inner_join(.,df.employment.se, by="S003") %>%     # Join the means with the respective SEs
select(-Full time, -Retired, -Unemployed) %>%
mutate(CI_1 = 1.96*D1_SE, CI_2 = 1.96*D2_SE) %>%  # Compute the confidence interval width for both differences
print()
## # A tibble: 3 x 7
## # Groups:   S003 [3]
##   S003             D1     D2 D1_SE D2_SE  CI_1  CI_2
##   <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Great Britain 1.51  -0.398 0.267 0.154 0.524 0.302
## 2 Spain         0.145  0.127 0.265 0.168 0.520 0.330
## 3 Turkey        0.737 -0.107 0.229 0.231 0.449 0.452

We now have a table containing the difference in means, the standard error of the difference in means, and the confidence intervals for each of the two differences. (Recall that D1 is the difference between the average wellbeing for full-time employed and unemployed, and D2 is the difference in average wellbeing for full-time employed and retired individuals.)

#### Calculate confidence intervals using the t.test function

We could obtain the confidence intervals directly by using the t.test function. First we need to prepare the data in two groups. In the following example we go through the difference in average wellbeing for full-time employed and unemployed individuals in Turkey, but the process can be repeated for the difference between full-time employed and retired individuals (also for your other two chosen countries).

Start by selecting the data and storing it in a temporary matrix to pass to the t.test function.

turkey_full <- wellbeing_data %>%
subset(S002EVS == "2008-2010") %>%    # Select Wave 4 only
subset(S003 == "Turkey") %>%          # Select country
subset(X028 == "Full time") %>%       # Select employment type
select(A170) %>%                      # We only need the wellbeing data.
as.matrix()                           # We have to set it as a matrix for t.test.

turkey_unemployed <- wellbeing_data %>% # Repeat for second group
subset(S002EVS == "2008-2010") %>%
subset(S003 == "Turkey") %>%
subset(X028 == "Unemployed") %>%
select(A170) %>%
as.matrix()

We can now use the t.test function, passing the two newly created matrices as the data. We also set the confidence level to 95%. The t.test function produces quite a bit of output, but we are only interested in the confidence interval, which we can obtain by using $conf.int. turkey_ci <- t.test(turkey_full,turkey_unemployed, conf.level = 0.95)$conf.int
turkey_ci
## [1] 0.2873459 1.1875704
## attr(,"conf.level")
## [1] 0.95

We can then calculate the difference in means by finding the midpoint of the interval ((turkey_ci[2] + turkey_ci[1])/2 is 0.7374582), which should be the same as the figures obtained in Question 1(b) (df.employment[3,2] is 0.7374582).

#### Add error bars to the column charts

We can now use these confidence intervals (and widths) to add error bars to our column charts. To do so we use the geom_errorbar option, and specify the lower and upper levels of the confidence interval for the ymin and ymax options respectively. In this case it is easier to use the results from Questions 1(a) and (b), as we already have the values for the difference in means and the CI width stored as variables.

ggplot(df.employment, aes(x=S003, y=D1)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin=D1-CI_1, ymax=D1+CI_1), width=.2) +
ylab("Difference in means") + xlab("Country") +
theme_bw() +
ggtitle("Difference in wellbeing (full-time and unemployed)")

Difference in wellbeing between full-time employed and unemployed.

Figure 8.6 Difference in wellbeing between full-time employed and unemployed.

Again, this can be repeated for the difference in wellbeing between full-time employed and retired. Remember to change y=D1 to y=D2, and the upper and lower limits for the error bars.

1. The method we used to compare wellbeing relied on making comparisons between people with different employment statuses, but a person’s employment status is not entirely random. We cannot therefore make causal statements such as ‘being unemployed causes wellbeing to decrease’. Describe how we could use the following methods to assess better the effect of being unemployed on wellbeing, and make some statements about causality:
• a natural experiment
• panel data (data on the same individuals, taken at different points in time).