---
title: "Empirical Project 2: Working in R code"
---
# **Empirical Project 2** Working in R
These code downloads have been constructed as supplements to the full Doing Economics projects (https://core-econ.org/doing-economics/). You'll need to download the data before running the code that follows.
## Getting started in R
For this project you will need the following packages:
- `tidyverse`, to help with data manipulation
- `readxl`, to import an Excel spreadsheet.
If you need to install either of these packages, run the following code:
```{r}
install.packages(c("readxl","tidyverse"))
```
You can import the libraries now, or when they are used in the R walk-throughs below.
```{r}
library(readxl)
library(tidyverse)
```
## Part 2.1 Collecting data by playing a public goods game
### **R walk-through 2.1** Plotting a line chart with multiple variables
Use the data from your own experiment to answer Question 1. As an example, we will use the data for the first three cities of the dataset that will be introduced in Part 2.2.
```{r}
Period <- seq(1,10)
Copenhagen <- c(14.1,14.1,13.7,12.9,12.3,11.7,10.8,10.6,9.8,5.3)
Dniprop <- c(11.0,12.6,12.1,11.2,11.3,10.5,9.5,10.3,9.0,8.7)
Minsk <- c(12.8,12.3,12.6,12.3,11.8,9.9,9.9,8.4,8.3,6.9)
# Put the data into a data frame
data_ex <- data.frame(Period,Copenhagen,Dniprop,Minsk)
plot(data_ex$Period,data_ex$Copenhagen, type = "l", lwd=2, col="blue", xlab = "Round",ylim = c(4,16), ylab="Average contribution")
lines(data_ex$Dniprop,col="red", lwd=2) # Select colour and line width
lines(data_ex$Minsk,col="green", lwd=2)
title("Average contribution to public goods game: without punishment")
legend("bottomleft", legend=c("Copenhagen", "Dniprop", "Minsk"),
col=c("blue", "red", "green"), lwd = 2, lty = 1, cex=1.2)
```
[End of walk-through]
## Part 2.2 Describing the data
### **R walk-through 2.2** Importing the datafile into R
Notice that the one Excel worksheet contains both the tables you need. Note down the cell ranges of each table, in this case `A2:Q12` for the without punishment data and `A16:Q26` for the punishment data. We will now use this range information to import the data into the respective dataframes.
```{r}
library(tidyverse) # This package provides useful functionality later.
library(readxl)
data_N <- read_excel("Public-goods-experimental-data.xlsx",
range = "A2:Q12")
data_P <- read_excel("Public-goods-experimental-data.xlsx",
range = "A16:Q26")
```
Look at the data either by opening the dataframes from the Environment window or by typing `data_N` or `data_P` into the Console.
You can see that in each period (row), the average contribution varies across countries, in other words, there is a distribution of average contributions in each period.
[End of walk-through]
### **R walk-through 2.3** Calculating the mean using a loop or the `apply` function
#### Calculate mean contribution
We calculate the mean using two different methods, to illustrate that there are usually many ways of achieving the same thing. First we use a loop that calculates the average over all but the first column (`data_P[row,2:17]` or `data_P[row,-1]`). Then we use the `apply` function (for `data_P`).
```{r}
# Use a loop for data_N
data_N$meanC <- 0
for (row in 1:nrow(data_N)) {
data_N$meanC[row] <- rowMeans(data_N[row,2:17])
}
# Use the apply function for data_P
data_P$meanC <- apply(data_P[,2:17], 1,mean)
```
The `apply` function takes another function (the `mean` function in this case) and applies it to all rows in `data_P[,2:17]`. The second input, `1` applies the specified function to all rows. An entry of `2` would have calculated column means. Type `?apply` in your console for more details, or see R walk-through 2.5 for further practice.
#### Plot mean contribution
Now we will produce the line charts for the mean contributions.
```{r}
plot(data_N$Period,data_N$meanC, type = "l",col="blue", lwd=2, xlab = "Round",ylim = c(4,14), ylab="Average contribution")
lines(data_P$meanC, col="red", lwd=2)
title("Average contribution to public goods game")
legend("bottomleft", legend=c("Without punishment", "With punishment"),
col=c("blue", "red"), lty = 1, cex=1.2, lwd=2)
```
The difference is stark as the contributions increase and then stabilize at around $13 when there is punishment, but decrease consistently from around $11 to $4 across the rounds when there is no punishment.
[End of walk-through]
### **R walk-through 2.4** Drawing a column chart to compare two groups
To make a column chart, we will use the `barplot` function. We first extract the four data points we need (Period 1 and 10 for with and without punishment) and place them into a matrix, which we then input into the `barplot` function.
```{r}
temp_d <- c(data_N$meanC[1],data_N$meanC[10],
data_P$meanC[1],data_P$meanC[10])
temp <- matrix(temp_d, nrow = 2, ncol = 2, byrow = TRUE)
temp
```
```{r}
barplot(temp, main="Mean contributions in a public goods game", ylab = "Contribution",
beside=TRUE, col=c("Blue","Red"), names.arg = c("Round 1","Round 10"))
legend("bottomleft", c("Without punishment"," With punishment"), pch = 1, col=c("Blue","Red"))
```
### Tip
Experimenting with these charts will help you to learn how to use R. The details of how to specify the column chart can be complicated, but you can see from Figure 2.3 what the options `main`, `ylab`, `col` and `names.arg` do. To figure out what the `beside` option does, try switching the option from `TRUE` to `FALSE` and see what happens.
[End of walk-through]
### **R walk-through 2.5** Calculating and understanding the standard deviation
In order to calculate these standard deviations and variances, we will use the `apply` function, which we introduced in R walk-through 2.3. As we saw, `apply` is a command asking R to apply a loop to a dataframe, and the basic structure is as follows: `apply(dataframe, dimension (rows or columns),which function to apply)`. So to calculate the variances, use the following command:
```{r}
data_N$varC <- apply(data_N[,2:17],1,var)
```
Here we take `data_N[,2:17]` and apply the `var` function to each row (recall that the entry `1` does this; `2` would indicate columns). Note that we exclude the first column from the calculation, as that indicates the period and not one of the locations given in columns `2:17`. The result is saved to a new variable called `varC`.
The same principle is now extended to the standard deviation calculation and the `data_P` dataframe.
```{r}
data_N$sdC <- apply(data_N[,2:17],1,sd)
data_P$varC <- apply(data_P[,2:17],1,var)
data_P$sdC <- apply(data_P[,2:17],1,sd)
```
To determine whether 95% of the observations fall within two standard deviations of the mean, we can use a line chart. As we have 16 countries in every period, we would expect about one observation (0.05 × 16 = 0.8) to fall outside this interval.
```{r}
citylist <- names(data_N[2:17])
plot(data_N$Period,data_N$meanC, type = "l",col="blue", lwd=2, xlab = "Round",ylim = c(0,20),
ylab="Average contribution")
lines(data_N$meanC+2*data_N$sdC,col="red",lwd=2) # mean + 2 sd
lines(data_N$meanC-2*data_N$sdC,col="red",lwd=2) # mean – 2 sd
for(i in citylist) {
points(data_N[[1]],data_N[[i]])
}
title("Contribution to public goods game without punishment")
legend("bottomleft", legend=c("Mean", "+/- 2 sd"),
col=c("blue", "red"), lwd=2, lty = 1,cex=1.2)
```
None of the observations falls outside the mean ± two standard deviations interval for the without punishment public goods game. Let's see the equivalent picture for the version with punishment.
```{r}
citylist <- names(data_N[2:17])
plot(data_P$Period,data_P$meanC, type = "l",col="blue", xlab = "Round",ylim = c(0,22), ylab="Average contribution")
lines(data_P$meanC+2*data_P$sdC,col="red") # mean + 2 sd
lines(data_P$meanC-2*data_P$sdC,col="red") # mean – 2 sd
for(i in citylist) {
points(data_P[[1]],data_P[[i]])
}
title("Contribution to public goods game with punishment")
legend("bottomleft", legend=c("Mean", "+/- 2 sd"),
col=c("blue", "red"), lty = 1,cex=1.2)
```
Here it looks as if we only have one observation outside the interval (in Period 8). In that aspect the two experiments look similar. However, from comparing these two plots, we see that the game with punishment displays a greater variation of responses than the game without punishment. In other words, there is a larger standard deviation and variance for the observations coming from the game with punishment.
[End of walk-through]
### **R Walk-through 2.6** Finding the minimum, maximum, and range of a variable
To calculate the range for both experiments and for all periods, we will use the `apply` function again.
```{r}
data_P$rangeC <- apply(data_P[,2:17],1,range)
```
Unfortunately, when you execute this command you are likely to get an error message that reads something like this: `Error in $<-.data.frame(*tmp*, "rangeC", value = c(5.81818199157715, : replacement has 2 rows, data has 10`.
Let's investigate why this is the case by picking one data row and calculating the range.
```{r}
range(data_N[1,2:17])
```
You can see that we get two values, the maximum and the minimum. However, we need the range (maximum–minimum). So we need a short intermediary step, where we calculate the maximum and minimum for our whole dataset (using the `apply` function) and save this in a new variable called `temp`.
```{r}
temp <- apply(data_N[,2:17],1,range)
temp
```
Now we use the difference between the respective maximums and minimums to define our new range variable `rangeC`.
```{r}
data_N$rangeC <- temp[2,] - temp[1,]
```
And now we do the same for `data_P`:
```{r}
temp <- apply(data_P[,2:17],1,range)
data_P$rangeC <- temp[2,] - temp[1,]
```
Let's create a chart of the ranges for both experiments for all periods in order to compare them.
```{r}
plot(data_N$Period,data_N$rangeC, type = "l",col="blue", lwd=2, xlab = "Round",ylim = c(4,14),
ylab="Range of contributions")
lines(data_P$rangeC,col="red",lwd=2)
title("Range of contributions to public goods game")
legend("bottomright", legend=c("Without punishment", "With punishment"),
col=c("blue", "red"), lwd=2,lty = 1,cex=1.2)
```
This chart confirms what we found in R walk-through 2.5, which is that there is a greater spread (variation) of contributions in the game with punishment.
[End of walk-through]
### **R walk-through 2.7** Creating a table of summary statistics
We have already done most of the work for creating this summary table in R walk-through 2.6. Since we also want to display the minimum and maximum values, we should add these to the dataframes.
```{r}
data_N$minC <- apply(data_N[,2:17],1,min)
data_N$maxC <- apply(data_N[,2:17],1,max)
data_P$minC <- apply(data_P[,2:17],1,min)
data_P$maxC <- apply(data_P[,2:17],1,max)
```
Now we display the statistics. Here we enclose our command in the `round` function, which reduces the number of digits displayed after the decimal point and makes the table a little easier to read.
```{r}
print("Public goods game without punishment")
```
```{r}
round(data_N[c(1,10),c(1,18:23)], digits=2) # We want to see Rows 1 and 10 and Column 1 (which contains the period number) as well as Columns 18 to 23 (which contain the statistics).
```
Let's repeat this command for the version with punishment.
```{r}
options(signif=2) # Show two digits
print("Public goods game with punishment")
```
```{r}
round(data_P[c(1,10),c(1,18:23)], digits=2)
```
[End of walk-through]
## Part 2.3 Did changing the rules of the game have a significant effect on behaviour?
### **R walk-through 2.8** Calculating a t-test for means
We need to extract the observations in Period 1 for the data for with and without punishment, and then feed the observations into the `t.test` function. The `t.test` function is extremely flexible: if you feed in a `y` and a `x` variable as shown below, it will automatically test the null hypothesis that the means of both series are equal (in other words, the difference in means is 0).
```{r}
p1_N <- data_N[1,2:17]
p1_P <- data_P[1,2:17]
t.test(x=p1_N,y=p1_P)
```
Unfortunately, if you run this code, you are likely to get an error message like this:
`Error: Unsupported use of matrix or array for column indexing.`
The reason for this is that `p1_N` and `p1_P` are still tibbles (dataframes) with one observation and 16 variables. The `t.test` function requires one variable (with many observations), but we are giving it one observation for 16 variables, so we need to change this. One way to do this is to transpose (the same idea as a vector transpose) from one observation and many variables to many observations and one variable, which is done using the `t` function.
```{r}
p1_N <- t(data_N[1,2:17])
p1_P <- t(data_P[1,2:17])
t.test(x=p1_N,y=p1_P)
```
This result delivers a p-value of 0.9496. This means it is very likely to receive data like that we observed if H₀ was true, so we cannot reject the null hypothesis.
```{r}
p1_N <- t(data_N[1,2:17])
p1_P <- t(data_P[1,2:17])
t.test(x=p1_N,y=p1_P,paired = TRUE)
```
As you can see, the p-value does change. It becomes smaller as we can attribute more of the differences to the ‘with punishment’ treatment, but the p-value is still very large (0.8828), so we still conclude that there is no evidence that the contributions in Period 1 differ.
[End of walk-through]