2. Collecting and analysing data from experiments Working in Python

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in Python.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in Python

Read the ‘Getting Started in Python’ page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a notebook by downloading the relevant file from this repository and running it on your own computer. Alternatively, you can run pages online in your browser over at Binder.

Preliminary settings

Let’s import the packages we’ll need and also configure the settings we want:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *


### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:


Part 2.1 Collecting data by playing a public goods game

Learning objectives for this part

  • Collect data from an experiment and enter it into Python.
  • Use summary measures, for example, mean and standard deviation, and line charts to describe and compare data.


You can still do Parts 2.2 and 2.3 without completing this part of the project.

Before taking a closer look at the experimental data, you will play a public goods game like the one in the introduction with your classmates to learn how experimental data can be collected. If your instructor has not set up a game, follow the instructions below to set up your own game.

Instructions How to set up the public goods game

Form a group of at least four people. (You may also want to set a maximum of 8 or 10 players to make the game easier to play). Choose one person to be the game administrator. The administrator will monitor the game, while the other people play the game.

Administrator


  1. Create the game: Go to the ‘Economics Games’ website, scroll down to the bottom of the page, and click ‘Create a Multiplayer Game and Get Logins’. Then click ‘Externalities and public goods’. Under the heading ‘Voluntary contribution to a public good’, click ‘Choose this Game’. Enter in the number of people playing the game, and select ‘1’ for the number of universes. Then click ‘Get Logins’. A pop-up will appear, showing the login IDs and passwords for the players and for the administrator.
  2. Start the game: Give each player a different login ID. The game should be played anonymously, so make sure that players do not know the login IDs of other players. You are now ready to start the first round of the game. There are ten rounds in total.
  3. Confirm that all the rounds are complete: In the top right corner of the webpage, click ‘Login’, enter your login ID and password, and then click the green ‘Login’ button. You will be taken to the game administration page, which will show the average contribution in each round, and the results of the round just played. Wait until all the players have finished playing ten rounds before refreshing this page.
  4. Collect the game results: Once the players have finished playing ten rounds, refresh this page. The table at the top of the page will now show the average contribution (in euros) for each of the ten rounds played. Select the whole table, then copy and paste it into a new worksheet in Excel.

Players


  1. Login: Once the administrator has created the game, go to the ‘Economics Games’ website. In the top right corner, click ‘Login’, enter the login ID and password that your administrator has given you, then click the green ‘Login’ button. You will be taken to the public goods game that your administrator has set up.
  2. Play the first round of the game: Read the instructions at the top of the page carefully before starting the game. In each round, you must decide how much to contribute to the public good. Enter your choice for each universe (group of players) that you are a part of (if the same players are in two universes, then make the same contribution in both), then click ‘Validate’.
  3. View the results of the first round: You will then be shown the results of the first round, including how much each player (including yourself) contributed, the payoffs, and the profits. Click ‘Next’ to start the next round.
  4. Complete all the rounds of the game: Repeat steps 2 and 3 until you have played ten rounds in total, then collect the results of the game from your administrator.

Use the results of the game you have played to answer the following questions.

  1. Make a line chart with average contribution as the vertical axis variable, and period (from 1 to 10) on the horizontal axis. Describe how average contributions have changed over the course of the game.

Python walk-through 2.1 Plotting a line chart with multiple variables

Use the data from your own experiment to answer Question 1. As an example, we will use the data for the first three cities of the dataset that will be introduced in Part 2.2.

# Create a dictionary with the data in
data = {
    "Copenhagen": [14.1, 14.1, 13.7, 12.9, 12.3, 11.7, 10.8, 10.6, 9.8, 5.3],
    "Dniprop": [11.0, 12.6, 12.1, 11.2, 11.3, 10.5, 9.5, 10.3, 9.0, 8.7],
    "Minsk": [12.8, 12.3, 12.6, 12.3, 11.8, 9.9, 9.9, 8.4, 8.3, 6.9],
}

# Turn the dictionary into a dataframe and show the first five rows
df = pd.DataFrame.from_dict(data)
df.head()
   Copenhagen  Dniprop  Minsk
0        14.1     11.0   12.8
1        14.1     12.6   12.3
2        13.7     12.1   12.6
3        12.9     11.2   12.3
4        12.3     11.3   11.8

Now we need to plot the data. Note that, with data in ‘wide’ format (one column per city) and with an index, simply calling .plot on a pandas dataframe will create a matplotlib line chart. We could also use the lets_plot package to make this kind of chart, but it expects data in ‘tidy’ or ‘long’ format—and for that, we would have to reshape the data so that the city names were values in a single column called ‘city’ or similar. Let’s just use matplotlib for now.

# Plot the data
fig, ax = plt.subplots()
# Draw one line per column (city) on the axis we just created
df.plot(ax=ax)
ax.set_title("Average contributions to the public goods game: Without punishment")
ax.set_ylabel("Average contribution")

Figure 2.1 Average contribution to the public goods game: without punishment.
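
For reference, the reshaping from ‘wide’ to ‘long’ format mentioned above could look something like the sketch below. It uses a cut-down version of the example data, and the column names round, city, and contribution are our own choices for illustration rather than anything required by pandas or lets_plot.

```python
import pandas as pd

# Wide format: one column per city, one row per round (a cut-down version of the example data)
wide = pd.DataFrame(
    {
        "Copenhagen": [14.1, 14.1, 13.7],
        "Minsk": [12.8, 12.3, 12.6],
    }
)

# Turn the index into an explicit column (here it counts from 0),
# then stack the city columns on top of each other
long = (
    wide.rename_axis("round")
    .reset_index()
    .melt(id_vars="round", var_name="city", value_name="contribution")
)
print(long)
```

Each row of the long dataframe now holds a single observation: one round, one city, and one contribution.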

Tip: When using pandas, there are several different types of brackets for accessing data values. Let’s list them so that you know the differences. Here are the different ways to get the first column of a dataframe (when that first column is called column and the dataframe is df):

  • df.column
  • df["column"]
  • df.loc[:, "column"]
  • df.iloc[:, 0]

Note that : means ‘give me everything’! The ways to access rows are similar (here assuming the first row is called row):

  • df.loc["row", :]
  • df.iloc[0, :]

And to access the first value (that is, the value in first row, first column):

  • df.column[0]
  • df["column"][0]
  • df.iloc[0, 0]
  • df.loc["row", "column"]

In the above examples, square brackets are instructions to Python about where to grab information from the dataframe. They are like an address system for values within a dataframe. However, square brackets also denote lists, so if you want to select multiple columns or rows, you might see syntax like this:

df.loc[["row0", "row1"], ["column0", "column2"]]

This code picks out two rows and two columns via the lists ["row0", "row1"] and ["column0", "column2"]. Because there are lists as well as the usual system of selecting values, there are two sets of square brackets.
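
To try these selectors out, here is a minimal sketch on a small made-up dataframe. The row and column labels (row0, column0, and so on) are invented for illustration.

```python
import pandas as pd

# A small dataframe with labelled rows and columns (labels are made up for illustration)
df = pd.DataFrame(
    {"column0": [1, 2, 3], "column1": [4, 5, 6], "column2": [7, 8, 9]},
    index=["row0", "row1", "row2"],
)

first_column = df.loc[:, "column0"]  # select a column by label
same_column = df.iloc[:, 0]  # select the same column by position
first_value = df.loc["row0", "column0"]  # select a single value by label
same_value = df.iloc[0, 0]  # select the same value by position

# Two rows and two columns, each picked out with a list of labels
subset = df.loc[["row0", "row1"], ["column0", "column2"]]
print(subset)
```

Using .loc for labels and .iloc for positions makes it explicit which of the two you mean, which helps when the row labels are themselves numbers.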

  2. Compare your line chart with Figure 3 of Herrmann et al. (2008).1 Comment on any similarities or differences between the results (for example, the amount contributed at the start and end, or the change in average contributions over the course of the game).
  3. Can you think of any reasons why your results are similar to (or different from) those in Figure 3? You may find it helpful to read the ‘Experiments’ section of the Herrmann et al. (2008) study for a more detailed description of how the experiments were conducted.

Part 2.2 Describing the data

Learning objectives for this part

  • Use summary measures, for example, mean and standard deviation, and column charts to describe and compare data.


You can still do Parts 2.2 and 2.3 without completing Part 2.1.

We will now use the data used in Figures 2A and 3 of Herrmann et al. (2008), and evaluate the effect of the punishment option on average contributions. Rather than compare two charts showing all of the data from each experiment, as the authors of the study did, we will use summary measures to compare the data, and show the data from both experiments (with and without punishment) on the same chart.

First, download and save the data. The spreadsheet contains two tables:

  • The first table shows average contributions in a public goods game without punishment (Figure 3).
  • The second table shows average contributions in a public goods game with punishment (Figure 2A).

You can see that in each period (row), the average contribution varies across countries; in other words, there is a distribution of average contributions in each period.

Python walk-through 2.2 Importing the datafile into Python

Both the tables you need are in a single Excel worksheet. Note down the cell ranges of each table, in this case A2:Q12 for the without punishment data and A16:Q26 for the punishment data. We will use this range information to import the data into two dataframes (data_n and data_p, respectively).

In the code below, we’ll use the .copy method, which we’ll explain more about in a moment.

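data_np = pd.read_excel(
    Path("path/to/datafile.xlsx"),  # placeholder: use the path where you saved the Excel datafile
    usecols="A:Q",  # both tables sit in columns A to Q
    header=1,  # the column names are in the second row of the worksheet
    index_col=0,  # use the first column (the period) as the index
)
# The first ten rows hold the data without punishment
data_n = data_np.iloc[:10, :].copy()
# The rows for the second table hold the data with punishment
data_p = data_np.iloc[14:24, :].copy()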

When loading the data from Excel, you may see a warning about an ‘unknown extension’. An Excel file is actually a bundle of files tied up to look like one file, and in this particular bundle there is one file that pandas doesn’t recognise. Despite this issue, we can still import the data we need from the worksheet.

In the code above, we used .copy and you may be wondering what it does. When a new object (say, data_p) is created from an existing object (here data_np), programming languages have a few different options for how to do it. In this case, Python has two options: it could allocate some entirely new memory to store the new variable, data_p, or it could just create a link to the existing bit of memory where some of data_np is stored.

The two different approaches behave differently. Under the former, changes to data_p won’t affect data_np because data_p gets its own bit of memory and is entirely independent of the existing variable. But in the latter case, any changes to data_p will also be applied to data_np! This is because, underneath it all, they’re both ‘pointing’ to the same bit of computer memory. Indeed, that is why variables that do this are sometimes called pointers. They’re common to most programming languages and pandas tends to use them by default because they save on memory. This case is just an example of a situation where we don’t want to change data_np by changing data_p, so we use the .copy method to allocate new memory and avoid creating a pointer.

Let’s see a simple example of how this .copy method works:

test_data = {
    "City A": [14.1, 14.1, 13.7],
    "City B": [11.0, 12.6, 12.1],
}

# Original dataframe
test_df = pd.DataFrame.from_dict(test_data)
# A copy of the dataframe
test_copy = test_df.copy()
# A pointer to the dataframe
test_pointer = test_df

test_pointer.iloc[1, 1] = 99

Now, even though we only modified test_pointer, we can look at both the original data frame and the copy that we took earlier:

print(test_df)
   City A  City B
0    14.1    11.0
1    14.1    99.0
2    13.7    12.1

print(test_copy)
   City A  City B
0    14.1    11.0
1    14.1    12.6
2    13.7    12.1

We see that test_df has changed because test_pointer pointed to it, but our pure copy, test_copy, hasn’t changed.

As well as importing the correct data, we’re going to ensure it is of the correct datatype. Common datatypes include ‘double’ and ‘integer’ (for numbers), ‘string’ (for words), and ‘category’ (for variables that take on a fixed number of categories, like ethnicity or educational attainment). We can check the datatypes of the data we just read in using data_n.info() (you can do the same for data_p).

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 1 to 10
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Copenhagen       10 non-null     object
 1   Dnipropetrovs’k  10 non-null     object
 2   Minsk            10 non-null     object
 3   St. Gallen       10 non-null     object
 4   Muscat           10 non-null     object
 5   Samara           10 non-null     object
 6   Zurich           10 non-null     object
 7   Boston           10 non-null     object
 8   Bonn             10 non-null     object
 9   Chengdu          10 non-null     object
 10  Seoul            10 non-null     object
 11  Riyadh           10 non-null     object
 12  Nottingham       10 non-null     object
 13  Athens           10 non-null     object
 14  Istanbul         10 non-null     object
 15  Melbourne        10 non-null     object
dtypes: object(16)
memory usage: 1.3+ KB

All of the columns are of the ‘object’ type, which is the catch-all type that pandas falls back on when it’s not clear which datatype to use.

We have continuous real numbers in the columns of data_n and data_p here, so we’ll set the datatypes to be double, which is a datatype used for continuous real numbers.

data_n = data_n.astype("double")
data_p = data_p.astype("double")
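
If you would like to see the effect of astype in isolation, the sketch below applies it to a tiny made-up dataframe whose numbers are deliberately stored with the ‘object’ type. The dataframe and its column name are invented for illustration.

```python
import pandas as pd

# Numbers stored as generic Python objects, as can happen after reading a spreadsheet
raw = pd.DataFrame({"City A": [14.1, 14.1, 13.7]}, dtype="object")
print(raw.dtypes)

# Convert every column to double-precision floating point numbers
converted = raw.astype("double")
print(converted.dtypes)
```

pandas reports the converted datatype as float64, which is its name for a double.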

You can look at the data either by opening the dataframes from the Environment window or by typing data_n or data_p into the interactive Python window.

You can see that in each row, the average contribution varies across countries; in other words, there is a distribution of average contributions in each period.

mean
A summary statistic for a set of observations, calculated by adding all values in the set and dividing by the number of observations.
variance
A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 221,334 (large spread).

The mean and variance are two ways to summarize distributions. We will now use these measures, along with other measures (range and standard deviation) to summarize and compare the distribution of contributions in both experiments.
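
As a quick check of the variance example above, the following sketch computes the variance of 1, 1, 999 by hand. One thing to be aware of is that the pandas .var() and .std() methods divide by the number of observations minus one by default (the ‘sample’ versions), so their results are a little larger than the ‘mean of squared deviations’ definition unless you pass ddof=0.

```python
import pandas as pd

values = pd.Series([1, 1, 999])

mean = values.mean()
# Variance as the mean of squared deviations from the mean (divide by n)
var_population = ((values - mean) ** 2).mean()
# The pandas default divides by n - 1 instead (the sample variance)
var_sample = values.var()
# Passing ddof=0 makes pandas divide by n, matching the calculation by hand
var_ddof0 = values.var(ddof=0)

print(var_population, var_sample, var_ddof0)
```

With many observations the two versions are almost identical, but with only three numbers the difference is noticeable.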

  1. Using the data for Figures 2A and 3 of Herrmann et al. (2008):
  • Calculate the mean contribution in each period (row) separately for both experiments.
  • Plot a line chart of mean contribution on the vertical axis and time period (from 1 to 10) on the horizontal axis (with a separate line for each experiment). Make sure the lines in the legend are clearly labelled according to the experiment (with punishment or without punishment).
  • Describe any differences and similarities you see in the mean contribution over time in both experiments.

Python walk-through 2.3 Calculating the mean using the .mean() or the agg functions

We calculate the mean using two different methods to illustrate that there are usually many ways of achieving the same thing. We apply the first method on data_n, which uses the built-in .mean() method to calculate the average across cities for each period (row); the period numbers are in the index, so they are not included in the average. We use the second method (the agg function) on data_p.

mean_n_c = data_n.mean(axis=1)
mean_p_c = data_p.agg(np.mean, axis=1)

As the name suggests, the agg function applies an aggregation function (the mean function in this case) to all rows or columns in a dataframe. The second input, axis=1, applies the specified function to all rows in data_p, so we are taking the average over cities for each period.

Typing axis=0 would have calculated column means instead, that is, it would have averaged over periods to produce one value per city (run this code to see for yourself). Type help(pd.DataFrame.agg) in your interactive Python window for more details, or see Python walk-through 2.5 for further practice.
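
The difference between the two axis settings is easiest to see on a small example. The sketch below uses a made-up dataframe with two periods and two cities (the values are invented for illustration).

```python
import pandas as pd

# Rows are periods, columns are cities (values are made up for illustration)
toy = pd.DataFrame(
    {"City A": [10.0, 8.0], "City B": [14.0, 12.0]},
    index=[1, 2],
)

row_means = toy.mean(axis=1)  # one value per period: the average over cities
column_means = toy.mean(axis=0)  # one value per city: the average over periods
print(row_means)
print(column_means)
```

The result of axis=1 is indexed by period, while the result of axis=0 is indexed by city name.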

Plot the mean contribution

Now we will produce a line chart showing the mean contributions.

fig, ax = plt.subplots()
mean_n_c.plot(ax=ax, label="Without punishment")
mean_p_c.plot(ax=ax, label="With punishment")
ax.set_title("Average contributions to the public goods game")
ax.set_ylabel("Average contribution")
# Show the labels so that each line can be matched to its experiment
ax.legend()

Figure 2.2 Average contributions to the public goods game, with and without punishment.

The difference between experiments is stark, as the contributions increase and then stabilise at around 13 in the case where there is punishment, but decrease consistently from around 11 to 4 across the rounds when there is no punishment.

  2. Instead of looking at all periods, we can focus on contributions in the first and last period. Plot a column chart showing the mean contribution in the first and last period for both experiments. Your chart should look like Figure 2.3.

Python walk-through 2.4 Drawing a column chart to compare two groups

To do this next part, we’re going to use something called a ‘list comprehension’, which is a special kind of loop. Loops are very useful in programming when you have the same task that you want to execute for a sequence of values. You could use a loop to find the squares of the first 10 numbers, for example.

A list comprehension is a way of writing a loop that creates a Python list. The loops it creates tend to be quick to run, too.

As a specific example, let’s say we wanted to add the first name ‘John’ to a list of names. Using a list comprehension, the code would be:

partial_names_list = ["F. Kennedy", "Lennon", "Maynard Keynes", "Wayne"]
["John " + name for name in partial_names_list]
['John F. Kennedy', 'John Lennon', 'John Maynard Keynes', 'John Wayne']

The second line shows the syntax: an opening square bracket (which usually signifies a list), then an operation (here "John " + name), and then for name_of_thing in name_of_list, followed by a closing square bracket (replace name_of_thing with the name you want to give each item, and name_of_list with the name of your list).
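
To connect this back to ordinary loops, here is a short sketch of the squares example mentioned above, written first as a for loop and then as a list comprehension. The variable names are our own.

```python
# A for loop that builds the list step by step
squares_loop = []
for number in range(1, 11):
    squares_loop.append(number**2)

# The same result as a list comprehension
squares_comprehension = [number**2 for number in range(1, 11)]

print(squares_comprehension)
```

Both versions produce the same list; the comprehension just says it in one line.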

To make a column chart, we will use the .plot.bar() function. We first extract the four data points we need (Periods 1 and 10, with and without punishment) and place them into another dataframe (called compare_grps).

# Create new dataframe with bars in
compare_grps = pd.DataFrame(
    [mean_n_c.loc[[1, 10]], mean_p_c.loc[[1, 10]]],
    index=["Without punishment", "With punishment"],
)
# Rename columns to have 'round' in them
compare_grps.columns = ["Round " + str(i) for i in compare_grps.columns]
# Swap the column and index variables around with the transpose function, ready for plotting (.T is transpose)
compare_grps = compare_grps.T
# Make a bar chart
compare_grps.plot.bar(rot=0)

Figure 2.3 Mean contributions in a public goods game.

Tip: Experimenting with these charts will help you to learn how to use Python and its packages. Try using .plot.bar(stacked=True) or using rot=45 as keyword arguments, or using .plot.barh() instead.

The mean is one useful measure of the ‘middle’ of a distribution, but is not a complete description of what our data looks like. We also need to know how ‘spread out’ the data is in order to get a clearer picture and make comparisons between distributions. The variance is one way to measure spread: the higher the variance, the more spread out the data is.

standard deviation
A measure of dispersion in a frequency distribution, equal to the square root of the variance. The standard deviation has a similar interpretation to the variance. A larger standard deviation means that the data is more spread out. Example: The set of numbers 1, 1, 1 has a standard deviation of zero (no variation or spread), while the set of numbers 1, 1, 999 has a standard deviation of about 470.5 (large spread).

A similar measure is standard deviation, which is the square root of the variance and is commonly used because there is a handy rule of thumb for large datasets, which is that most of the data (95%, if there are many observations) will be less than two standard deviations away from the mean.
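
One way to get a feel for this rule of thumb is to try it on simulated data. The sketch below draws 10,000 numbers from a bell-shaped (normal) distribution and measures the share that lies within two standard deviations of the mean. The distribution, its parameters, and the seed are arbitrary choices made for illustration, and the rule works less well for data that is far from bell-shaped.

```python
import numpy as np

# Simulated data: 10,000 draws from a normal distribution (parameters chosen arbitrarily)
rng = np.random.default_rng(seed=0)
draws = rng.normal(loc=10, scale=2, size=10_000)

mean = draws.mean()
std = draws.std(ddof=1)
# Share of draws that are less than two standard deviations away from the mean
share_within = np.mean(np.abs(draws - mean) < 2 * std)
print(f"Share within two standard deviations: {share_within:.3f}")
```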

  3. Using the data for Figures 2A and 3 of Herrmann et al. (2008):
  • Calculate the standard deviation for Periods 1 and 10 separately, for both experiments. Does the rule of thumb apply? (In other words, are most values within two standard deviations of the mean?)
  • As shown in Figure 2.3, the mean contribution for both experiments was 10.6 in Period 1. With reference to your standard deviation calculations, explain whether this means that the two sets of data are the same.

Python walk-through 2.5 Calculating and understanding standard deviation

In order to calculate these standard deviations and variances, we will use the agg function, which we introduced in Python walk-through 2.3. As we saw, agg is a command that asks pandas to aggregate a set of rows or columns of the dataframe using a particular aggregation function. The basic structure is as follows: dataframe_name.agg([function1, function2, ...], rows/columns). So to calculate the variances and more, we use the following command:

n_c = data_n.agg(["std", "var", "mean"], 1)
         std       var       mean
1   2.020724  4.083325  10.578313
2   2.238129  5.009220  10.628398
3   2.329569  5.426891  10.407079
4   2.068213  4.277504   9.813033
5   2.108329  4.445049   9.305433
6   2.240881  5.021549   8.454844
7   2.136614  4.565117   7.837568
8   2.349442  5.519880   7.376388
9   2.413845  5.826645   6.392985
10  2.187126  4.783520   4.383769

Here we take data_n and apply the "std", "var", and "mean" functions to each row (recall that the second input 1 does this; 0 would indicate columns). Note that the index column, which contains the period numbers, is automatically excluded from the calculation. The result is saved as a new variable called n_c.

We then apply the same principle to the data_p dataframe.

p_c = data_p.agg(["std", "var", "mean"], 1)

Aside: In the next chart, we will use another kind of loop. The syntax for this one is for 'thing' in list of things, then a colon (:), then an indented operation that uses thing.

To determine whether 95% of the observations fall within two standard deviations of the mean, we can use a line chart. As we have 16 countries in every period, we would expect about one observation (0.05 × 16 = 0.8) to fall outside this interval.

fig, ax = plt.subplots()
n_c["mean"].plot(ax=ax, label="mean")
# mean + 2 standard deviations
(n_c["mean"] + 2 * n_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="±2 s.d.")
# mean - 2 standard deviations
(n_c["mean"] - 2 * n_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="")
for i in range(len(data_n.columns)):
    ax.scatter(x=data_n.index, y=data_n.iloc[:, i], color="k", alpha=0.3)
ax.set_ylabel("Average contribution")
ax.set_title("Contribution to public goods game without punishment")
# Show the labels for the mean and the ±2 s.d. lines
ax.legend()

Figure 2.4 Contribution to public goods game without punishment.

None of the observations fall outside the mean ± two standard deviations interval for the public goods game without punishment. Let’s plot the equivalent chart for the version with punishment.

fig, ax = plt.subplots()
p_c["mean"].plot(ax=ax, label="mean")
# mean + 2 sd
(p_c["mean"] + 2 * p_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="±2 s.d.")
# mean - 2 sd
(p_c["mean"] - 2 * p_c["std"]).plot(ax=ax, ylim=(0, None), color="red", label="")
for i in range(len(data_p.columns)):
    ax.scatter(x=data_p.index, y=data_p.iloc[:, i], color="k", alpha=0.3)
ax.set_ylabel("Average contribution")
ax.set_title("Contribution to public goods game with punishment")
# Show the labels for the mean and the ±2 s.d. lines
ax.legend()

Figure 2.5 Contribution to public goods game with punishment.

Here, we only have one observation outside the interval (in Period 8). In that aspect the two experiments look similar. However, from comparing these two charts, the game with punishment displays a greater variation of responses than the game without punishment. In other words, there is a larger standard deviation and variance for the observations coming from the game with punishment.
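
Reading points off a chart can be fiddly, so you may prefer to count the observations outside the interval directly. The sketch below shows one possible way of doing this on a small made-up dataframe (rows are periods, columns are cities); the same steps could be applied to data_n or data_p. All of the names and values here are invented for illustration.

```python
import pandas as pd

# Toy data: rows are periods, columns are cities (values are made up for illustration)
toy = pd.DataFrame(
    [
        [10, 10, 10, 10, 10, 10, 10, 10, 10, 100],
        [8, 9, 10, 11, 12, 8, 9, 10, 11, 12],
    ],
    index=[1, 2],
    dtype="float",
)

row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1)
lower = row_mean - 2 * row_std
upper = row_mean + 2 * row_std

# Compare every value with the bounds for its own row, then count the exceptions per period
outside = toy.lt(lower, axis=0) | toy.gt(upper, axis=0)
n_outside = outside.sum(axis=1)
print(n_outside)
```

In this toy example, only the value of 100 in the first period lies outside its interval.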

range
The interval formed by the smallest (minimum) and the largest (maximum) value of a particular variable. The range shows the two most extreme values in the distribution, and can be used to check whether there are any outliers in the data. (Outliers are a few observations in the data that are very different from the rest of the observations.)

Another measure of spread is the range, which is the interval formed by the smallest (minimum) and the largest (maximum) values of a particular variable. For example, we might say that the number of periods in the public goods experiment ranges from 1 to 10. Once we know the most extreme values in our dataset, we have a better picture of what our data looks like.

  4. Calculate the maximum and minimum value for Periods 1 and 10 separately, for both experiments.

Python walk-through 2.6 Finding the minimum, maximum, and range of a variable

We’re now going to see one of our first functions. A function takes inputs, does some operations on them, and returns outputs.

You can imagine functions as vending machines: for them to work you need some inputs (money, and a choice of snack or drink), then an operation happens (your drink or snack is dropped into the tray), and finally there is an output (your drink or snack as you grab it).

Functions are incredibly useful in programming because they are separate units that can be tested in isolation, re-used, and given helpful ‘dressing’ (such as information on how they work) that make code more readable.
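
As a small illustration of these ideas, here is a sketch of a named function with some ‘dressing’ in the form of a docstring, which is a short description that Python shows when you ask for help on the function. The function name value_range and the example numbers are our own.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


# Input: a list of numbers. Output: a single number.
contributions = [14, 12, 5]
print(value_range(contributions))

# The docstring is the 'dressing' that explains how the function works
print(value_range.__doc__)
```

Because the function is a separate unit, we can test it on simple inputs like these before using it on real data.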

To calculate the range for both experiments and for all periods, we will use an apply method in combination with the max and min methods that apply to a column or row. We’ll also use a lambda function to bring these all together. In our case, it’s going to look like this:

data_p.apply(lambda x: x.max() - x.min(), axis=1)
1     10.199675
2     12.185065
3     12.689935
4     12.625000
5     12.140375
6     12.827541
7     13.098931
8     13.482621
9     13.496754
10    11.307360
dtype: float64

This lambda function tells Python to take the difference between the maximum and minimum of each row.

A lambda function is an idea in programming (and mathematics) that has a long and interesting history. You don’t need to know all that, but it is instructive to look at a more general example of a lambda function:

# A lambda function accepting three inputs, a, b, and c, and calculating the sum of the squares
test_function = lambda a, b, c: a**2 + b**2 + c**2

# Now we apply the function by handing over (in parentheses) the following inputs: a=3, b=4 and c=5
test_function(3, 4, 5)
50

Above, we defined a lambda function that looked like lambda x: x.max() - x.min(). It accepts one input, x (which could be a row or column), and returns the range of x. Because making code reusable is good programming practice, we will define this function and give it a name using a separate line of code like this:

range_function = lambda x: x.max() - x.min()

When we call data_p.apply(range_function, axis=1), the following will happen: data_p contains the experimental data (with punishment). We will apply the range_function to that data. As data_p has two dimensions, we also need to let Python know over which dimension it should calculate the minimum and maximum. The axis=1 option in the apply function tells the apply function that it should apply the range_function over rows rather than columns (to get columns, it would be axis=0, which is also the default if you don’t specify the axis keyword argument).

range_function = lambda x: x.max() - x.min()
range_p = data_p.apply(range_function, axis=1)
range_n = data_n.apply(range_function, axis=1)

Let’s create a chart of the ranges for both experiments for all periods in order to compare them.

fig, ax = plt.subplots()
range_p.plot(ax=ax, label="With punishment")
range_n.plot(ax=ax, label="Without punishment")
ax.set_ylim(0, None)
ax.set_title("Range of contributions to the public goods game")
# Show the labels so that each line can be matched to its experiment
ax.legend()

Figure 2.6 Range of contributions to the public goods game.

This chart confirms what we found in Python walk-through 2.5, which is that there is a greater spread (variation) of contributions in the game with punishment.

  5. A concise way to describe the data is in a summary table. With just four numbers (mean, standard deviation, minimum value, maximum value), we can get a general idea of what the data looks like.
  • Create a table of summary statistics that displays mean, variance, standard deviation, minimum, maximum and range for Periods 1 and 10 and for both experiments.
  • Comment on any similarities and differences in the distributions, both across time and across experiments.

Python walk-through 2.7 Creating a table of summary statistics

We have already done most of the work for creating this summary table in Python walk-through 2.6. Since we also want to display the minimum and maximum values, we should create these too. And it’s convenient to add in std and mean using the same syntax (even though we created a separate mean earlier), so we have all the information in one place. We’ll call our new summary statistics summ_p and summ_n.

funcs_to_apply = [range_function, "max", "min", "std", "mean"]
summ_p = data_p.apply(funcs_to_apply, axis=1).rename(columns={"<lambda>": "range"})
summ_n = data_n.apply(funcs_to_apply, axis=1).rename(columns={"<lambda>": "range"})

Note that as well as applying all of the functions in the list funcs_to_apply, we also renamed the first function using the rename method. Because the range isn’t a built-in aggregation function and we defined it, it is automatically given a column name—and because the range function we supplied is a lambda function, the name it gets is "<lambda>". Using rename(columns=, we change this name to "range" using a dictionary object ({ : }) that maps the old name to the new name.
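
If the renaming step feels awkward, an alternative is to build the summary table one column at a time, so that each column gets its name directly. The sketch below does this for a small made-up dataframe; it is just one possible approach, and the names toy and summary and the values are invented for illustration.

```python
import pandas as pd

# Toy data: rows are periods, columns are cities (values are made up for illustration)
toy = pd.DataFrame(
    {"City A": [10.0, 8.0], "City B": [14.0, 13.0]},
    index=[1, 2],
)

# Each dictionary key becomes a column name, so no renaming is needed
summary = pd.DataFrame(
    {
        "range": toy.max(axis=1) - toy.min(axis=1),
        "max": toy.max(axis=1),
        "min": toy.min(axis=1),
        "std": toy.std(axis=1),
        "mean": toy.mean(axis=1),
    }
)
print(summary.round(2))
```

This is more typing than passing a list of functions, but the column names and calculations are all visible in one place.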

Now we display the summary statistics in a table. We use the round method, which reduces the number of digits displayed after the decimal point (2 in our case) and makes the table easier to read. We’re only interested in periods 1 and 10, so we pass a list, [1, 10], to the .loc selector in the first position (which corresponds to rows and the index). We want all columns, so we pass : to the second position of the .loc selector.

summ_n.loc[[1, 10], :].round(2)
    range    max   min   std   mean
1    6.14  14.10  7.96  2.02  10.58
10   7.38   8.68  1.30  2.19   4.38

Now we do the same for the version with punishment.

summ_p.loc[[1, 10], :].round(2)
    range    max   min   std   mean
1   10.20  16.02  5.82  3.21  10.64
10  11.31  17.51  6.20  3.90  12.87

Part 2.3 How did changing the rules of the game affect behaviour?

Learning objectives for this part

  • Calculate and interpret the p-value.
  • Evaluate the usefulness of experiments for determining causality, and the limitations of these experiments.

The punishment option was introduced into the public goods game in order to see whether it could help sustain contributions, compared to the game without a punishment option. We will now use a calculation called a p-value to compare the results from both experiments more formally.

By comparing the results in Period 10 of both experiments, we can see that the mean contribution in the experiment with punishment is 8.5 units higher than in the experiment without punishment (see Figure 2.3 in Part 2.2). Is it more likely that this behaviour is due to chance, or is it more likely to be due to the difference in experimental conditions?

  1. You can conduct another experiment to understand why we might see differences in behaviour that are due to chance.
     (a) First, flip a coin six times, using one hand only, and record the results (for example, Heads, Heads, Tails, etc.). Then, using the same hand, flip a coin six times and record the results again.
     (b) Compare the outcomes from Question 1(a). Did you get the same number of heads in both cases? Even if you did, was the sequence of the outcomes (for example, Heads, Tails, Tails …) the same in both cases?

The important point to note is that even when we conduct experiments under the same controlled conditions, due to an element of randomness, we may not observe the exact same behaviour each time we do the experiment.
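
If you don’t have a coin to hand, the same idea can be sketched in Python with the built-in random module. This is only an illustration: the function name flip_coins and the seed value are our own choices, and a simulated coin is assumed to be fair.

```python
import random


def flip_coins(n_flips, rng):
    """Return a list of n_flips outcomes, each 'Heads' or 'Tails'."""
    return [rng.choice(["Heads", "Tails"]) for _ in range(n_flips)]


rng = random.Random(2)  # arbitrary seed, used only to make the run repeatable

# Two runs of six flips under identical conditions
first_run = flip_coins(6, rng)
second_run = flip_coins(6, rng)
print(first_run, first_run.count("Heads"))
print(second_run, second_run.count("Heads"))

# Repeat the six-flip experiment many times to see how often we get exactly 3 heads
heads_counts = [flip_coins(6, rng).count("Heads") for _ in range(1000)]
share_exactly_three = heads_counts.count(3) / len(heads_counts)
print(share_exactly_three)
```

For a fair coin, the probability of exactly three heads in six flips is 20/64, or about 0.31, so most runs of the ‘same’ experiment will not produce an even split.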

Randomness arises because the statistical analysis is conducted on a sample of data (for example, a small group of people from the entire population), and the sample we observe is only one of many possible samples. Whatever differences we calculate between two samples would almost certainly change if we had observed another pair of samples. Importantly, economists aren’t really interested in whether two samples are actually different, but rather whether the underlying populations, from which the samples were drawn, differ in the characteristics we are interested in (for example, age, income, contributions to the public good). And this is the challenge faced by the empirical economist.

We are interested in whether a treatment works: in this case, whether having the punishment option makes a difference to how much people contribute to the public good. So we want a way to check whether any observed differences could just be due to sample variation.

The size of the difference alone cannot tell us whether it might just be due to chance. Even if the observed difference seems large, it could be small relative to how much the data vary. Figures 2.7 and 2.8 show the mean exam score of two groups of high school students and the size of house in which they live (represented by the height of the columns, and reported in the boxes above the columns), with the dots representing the underlying data. Figure 2.7 shows a relatively large difference in means that could have arisen by chance because the data is widely spread out (the standard deviation is large), while Figure 2.8 shows a relatively small difference that looks unlikely to be due to chance because the data is tightly clustered together (the standard deviation is very small). Note that we are looking at two distinct questions here: first, is there a large or small difference in exam score associated with the size of house of the student and second, is that difference likely to have arisen by chance. A social scientist is interested in the answer to both questions. If the difference is large but could easily have occurred by chance or if the difference is very small and unlikely to have occurred by chance, then the results are not suggestive of an important relationship between size of house and exam grade.

Figure 2.7 An example of a large difference in means that is likely to have happened by chance.

Figure 2.8 An example of a small difference in means that is unlikely to have happened by chance.

p-value
The probability of observing data at least as extreme as the data collected if a particular hypothesis about the population is true. The p-value ranges from 0 to 1: the lower the probability (the lower the p-value), the less likely it is to observe the given data, and therefore the less compatible the data are with the hypothesis.

To help us decide, we consider the hypothesis that the difference occurred by chance – in other words, we start by hypothesizing that house size does not matter for exam scores. Then we ask how likely it is that we would observe differences at least as extreme as those we actually observe in our sample groups, assuming that our hypothesis is true. The answer to this question is called a p-value. The smaller the p-value, the less likely that we would observe differences at least as extreme as those we did, given our hypothesis. So the smaller this p-value, the smaller our confidence will be in the hypothesis that in the population house size does not matter for exam grades.

Notice that the p-value is not the probability that the hypothesis is correct – the data cannot tell us that probability. It is the probability that we would find a difference as big as the one we have observed if the hypothesis were correct.
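
One way to make this definition concrete is a permutation test, sketched below. This is not the method used in the rest of this project (which relies on a t-test); it is a simpler technique that follows the definition directly. If house size does not matter, the group labels are arbitrary, so we can shuffle them many times and count how often the shuffled difference in means is at least as large as the one observed. The exam scores, the function name permutation_p_value, and the number of shuffles are all hypothetical choices for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign the scores to the two groups at random
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Made-up exam scores: a large difference in means, but widely spread data
small_house_wide = [40, 75, 55, 90, 35, 65, 80, 50]
large_house_wide = [60, 95, 45, 85, 70, 100, 55, 80]

# Made-up exam scores: a small difference in means, but tightly clustered data
small_house_tight = [60, 61, 62, 61, 60, 62, 61, 61]
large_house_tight = [63, 64, 63, 65, 64, 63, 64, 64]

p_wide = permutation_p_value(small_house_wide, large_house_wide)
p_tight = permutation_p_value(small_house_tight, large_house_tight)
print(p_wide, p_tight)
```

The widely spread data have the larger difference in means but also the larger estimated p-value, which mirrors the contrast between Figures 2.7 and 2.8.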

We can estimate the p-value from the data, using the sample means and sample standard deviations. It is calculated by comparing the difference in the means with the amount of variation in the data as measured by the standard deviations. This is a well-established method, although some other statistical assumptions, which we do not discuss, are required to ensure that it gives a good estimate.
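
The comparison of the difference in means with the variation in the data can be sketched by hand as a t-statistic: the difference in means divided by its standard error. The sketch below uses the rounded means and standard deviations from the summary tables at the end of Part 2.2. It assumes 16 cities in each experiment (a number inferred from the degrees of freedom reported in Python walk-through 2.8) and treats the two experiments as unpaired samples. The function name t_statistic is our own and is not part of any library.

```python
from math import sqrt


def t_statistic(mean_x, std_x, n_x, mean_y, std_y, n_y):
    """Difference in means divided by its standard error (unpaired samples)."""
    standard_error = sqrt(std_x**2 / n_x + std_y**2 / n_y)
    return (mean_x - mean_y) / standard_error


n_cities = 16  # assumption: inferred from the degrees of freedom in walk-through 2.8

# Period 1: without punishment (mean 10.58, std 2.02) vs with punishment (mean 10.64, std 3.21)
t_period_1 = t_statistic(10.58, 2.02, n_cities, 10.64, 3.21, n_cities)

# Period 10: without punishment (mean 4.38, std 2.19) vs with punishment (mean 12.87, std 3.90)
t_period_10 = t_statistic(4.38, 2.19, n_cities, 12.87, 3.90, n_cities)

print(round(t_period_1, 3), round(t_period_10, 3))
```

A t-statistic close to zero means the difference in means is small relative to the variation in the data, while one far from zero means it is large relative to that variation. Turning a t-statistic into a p-value requires the t-distribution, which is the step a statistics package carries out for us.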

When we look at the data in Figure 2.7, we cannot be absolutely certain that there really is a link between house size and exam scores. But if the p-value for the difference in means is very small (for example, 0.02) then we know that there would only be a 2% probability of seeing differences at least as extreme as those we did observe in the sample, given our hypothesis that in the population there was no relationship between house size and exam scores.

hypothesis test
A test in which a null (default) and an alternative hypothesis are posed about some characteristic of the population. Sample data is then used to test how likely it is that these sample data would be seen if the null hypothesis was true.

Find out more Hypothesis testing and p-values

The process of formulating a hypothesis about the data, calculating the p-value, and using it to assess whether what we observe is consistent with the hypothesis, is known as a hypothesis test. When we conduct a hypothesis test, we consider two hypotheses: either there is no difference between the populations, in which case the differences we observe must have happened by chance (known as the ‘null hypothesis’); or the populations really are different (known as the ‘alternative hypothesis’). The smaller the p-value, the lower the probability that the differences we observe could have happened simply by chance, in other words, if the null hypothesis were true. The smaller the p-value, the stronger the evidence in favour of the alternative hypothesis.

It is a common, but highly debatable, practice to pick a cutoff level for the p-value, and reject the null hypothesis if the p-value is below this cutoff. This approach has been criticized recently by statisticians and social scientists because the cutoff level is quite arbitrary.

Instead of using a cutoff, we prefer to calculate p-values and use them to assess the strength of the evidence. Whether the statistical evidence is strong enough for us to draw a firm conclusion about the data will always be a matter of judgement.

In particular, you want to make sure that you understand the consequences of concluding that the null hypothesis is not true, and hence that the alternative is true. You may quite easily be prepared to conclude that house sizes and exam scores are related, but much more cautious about deciding that a new medication is more effective than an existing one if you know that this new medication has severe side effects. In the case of the medication, you might want to see stronger evidence against the null hypothesis before deciding that doctors should be advised to prescribe the new medication.

We will calculate the p-value and use it to assess how likely it is that the differences we observe are due to chance.

  2. Using the data for Figures 2A and 3:
     (a) Use the ttest function to calculate the p-value for the difference in means in Period 1 (with and without punishment).
     (b) What does this p-value tell us about the difference in means in Period 1?

Python walk-through 2.8 Calculating the p-value for the difference in means

We need to extract the observations in Period 1 for the data for with and without punishment, and then feed the observations into a function that performs a t-test. We’ll use the statistics package pingouin for this, which you will need to install on the command line using pip install pingouin. Once installed, import it using import pingouin as pg, just like we did at the start of the project.

Tip: there are several ways to open the command line, also known as the terminal or command prompt, in order to install packages. If you’re working within Visual Studio Code, use the Control + ` keyboard shortcut (Mac) or Ctrl + ` (Windows and Linux), or click ‘View > Terminal’. If you want to open up the command line independently of Visual Studio Code, search for ‘Terminal’ on Mac and Linux, and ‘Anaconda Prompt’ on Windows.

pingouin’s t-test function is called ttest. The ttest function is extremely flexible: if you input two variables (x and y) as shown below, it will automatically test whether the difference in means is likely to be due to chance or not (formally speaking, it tests the null hypothesis that the means of both variables are equal).

Note that each argument of the ttest function (x and y) will only accept one series of data, not multiple data series. By subsetting with iloc[0, :], we are passing in the 0th row (the first period) for all columns (cities).

pg.ttest(x=data_n.iloc[0, :], y=data_p.iloc[0, :])
               T  dof  alternative     p-val         CI95%  cohen-d   BF10     power
ttest  −0.063782   30    two-sided  0.949567  [−2.0, 1.87]  0.02255  0.337  0.050437

Note that as well as the t-statistic (T), the p-value (p-val), the degrees of freedom (dof), the alternative hypothesis (two-sided) and the confidence interval (CI95%), we get some other variables that help us put the main results into context.

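This result delivers a p-value of 0.9496. This means that, if there were no difference between the populations, we would be very likely to see a difference in sample means at least as large as the one we observe. The data therefore give us no reason to doubt the assumption that the populations do not differ (formally speaking, we cannot reject the null hypothesis).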

The ttest function automatically assumes that both variables were generated by different groups of people. When calculating the p-value, it assumes that the observed differences are partly due to some variation in characteristics between these two groups, and not just the differences in experimental conditions. However, in this case, the same groups of people did both experiments, so there will not be any variation in characteristics between the groups. When calculating the p-value, we account for this fact with the paired=True option.

pg.ttest(x=data_n.iloc[0, :], y=data_p.iloc[0, :], paired=True)
               T  dof  alternative     p-val         CI95%  cohen-d   BF10    power
ttest  −0.149959   15    two-sided  0.882795  [−0.92, 0.8]  0.02255  0.258  0.05082

The p-value becomes smaller as we can attribute more of the differences to the ‘with punishment’ treatment, but the p-value is still very large (0.8828), so we still conclude that the differences in Period 1 could easily be due to chance.
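
To see why pairing matters, the sketch below computes the t-statistic both ways by hand for a small made-up dataset. The contribution values and variable names are hypothetical, chosen so that the groups differ a lot from each other but each group changes by a similar amount between the two experiments. The paired version works with the within-group differences, so the variation between groups drops out.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up average contributions for six groups that each played both games
without_punishment = [4.0, 8.0, 12.0, 6.0, 10.0, 14.0]
with_punishment = [5.0, 8.5, 13.5, 7.0, 10.5, 15.5]
n_groups = len(without_punishment)

# Unpaired: treat the two lists as samples from different groups of people
unpaired_se = sqrt(
    stdev(without_punishment) ** 2 / n_groups
    + stdev(with_punishment) ** 2 / n_groups
)
unpaired_t = (mean(without_punishment) - mean(with_punishment)) / unpaired_se

# Paired: take the difference within each group first, then test that its mean is zero
differences = [x - y for x, y in zip(without_punishment, with_punishment)]
paired_se = stdev(differences) / sqrt(n_groups)
paired_t = mean(differences) / paired_se

print(round(unpaired_t, 3), round(paired_t, 3))
```

In this example the paired t-statistic is much further from zero than the unpaired one, because the large differences between groups no longer count as noise. How much pairing changes the result in real data depends on how similar each group’s behaviour is across the two experiments.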

  3. Using the data for Period 10:
     (a) Use the ttest function to calculate the p-value for the difference in means in Period 10 (with and without punishment).
     (b) What does this p-value tell us about the relationship between punishment, and behaviour in the public goods game?
     (c) With reference to Figure 2.7 and Figure 2.8, explain why we cannot use the size of the difference to directly conclude whether the difference could be due to chance.

spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.

An important point to note is that calculating p-values may not tell us anything about causation. The example of house size and exam scores shown in Figure 2.8 gives us evidence that some kind of relationship between house size and exam scores is very likely. However, we would not conclude that building an extra room automatically makes someone smarter. P-values cannot help us detect these spurious correlations.

However, calculating p-values for experimental evidence can help us determine whether there is a causal link between two variables. If we conduct an experiment and find a difference in outcomes with a low p-value, then we may conclude that the change in experimental conditions is likely to have caused the difference.

  4. Refer to the results from the public goods games.
     (a) Which characteristics of the experimental setting make it likely that the ‘with punishment’ option was the cause of the change in behaviour?
     (b) Using Figure 2.6, explain why we need to compare the two groups in Period 1 in order to conclude that there is a causal link between the ‘with punishment’ option and behaviour in the game.

Experiments can be useful for identifying causal links. However, if people’s behaviour in experimental conditions were different from their behaviour in the real world, our results would not be applicable anywhere outside the experiment.

  5. Discuss some limitations of experiments, and suggest some ways to address (or partially address) them. (You may find pages 158–171 of the paper ‘What do laboratory experiments measuring social preferences reveal about the real world?’ helpful, as well as the discussion on free riding and altruism in Section 2.6 of Economy, Society, and Public Policy.)

  1. Benedikt Herrmann, Christian Thöni, and Simon Gächter. 2008. Figure 3 in ‘Antisocial punishment across societies’. Science Magazine 319 (5868): p. 1365.