Empirical Project 1 Working in Excel

Part 1.1 The behaviour of average surface temperature over time

In the questions below, we look at data from NASA about land-ocean temperature anomalies in the northern hemisphere from 1880 to 2016. Figure 1.1 is constructed using this data and shows temperatures in the northern hemisphere over the period 1880–2016, expressed as differences from the average temperature from 1951 to 1980. We start by creating charts similar to Figure 1.1, in order to visualize the data and spot patterns more easily.

Figure 1.1 Northern hemisphere temperatures (1880–2016).

Before plotting any charts, download the data and make sure you understand how temperature is measured:

  1. In this dataset, temperature is measured as ‘anomalies’ rather than absolute temperature. Using NASA’s Frequently Asked Questions section as a reference, explain in your own words what temperature ‘anomalies’ means. Why have researchers chosen this particular measure over other measures (such as absolute temperature)?

Now create some line charts using monthly, seasonal, and annual data, which help us look for general patterns over time.

  1. Choose one month and plot a line chart with average temperature anomaly on the vertical axis and time (1880–2016) on the horizontal axis. Label each axis appropriately and give your chart a suitable title (Refer to Figure 1.1 as an example.)

Excel walk-through 1.1 Drawing a line chart of temperature and time

Figure 1.2 How to draw a line chart of temperature and time.

The data

This is what the temperature data looks like. Column A has time (in years), Column B has temperature deviations, and Column C contains the average northern hemisphere temperature. We will be using Columns A and B to make the line chart.

Figure 1.2a This is what the temperature data looks like. Column A has time (in years), Column B has temperature deviations, and Column C contains the average northern hemisphere temperature. We will be using Columns A and B to make the line chart.

Draw a line chart

After completing step 4, your line chart will look similar to this. Temperature deviation is on the vertical axis and time is on the horizontal axis. Notice that the numbers for time are not correct (they should be years).

Figure 1.2b After completing step 4, your line chart will look similar to this. Temperature deviation is on the vertical axis and time is on the horizontal axis. Notice that the numbers for time are not correct (they should be years).

Change the horizontal axis variable to years

To change the horizontal axis labels to years, we need to add the values in Column A to the line chart.

Figure 1.2c To change the horizontal axis labels to years, we need to add the values in Column A to the line chart.

Change the horizontal axis variable to years

The current horizontal axis labels are the numbers 1, 2, 3, and so on. To change these labels to years, we need to edit the labels.

Figure 1.2d The current horizontal axis labels are the numbers 1, 2, 3, and so on. To change these labels to years, we need to edit the labels.

Change the horizontal axis variable to years

After completing step 10, the horizontal axis will be in time (years).

Figure 1.2e After completing step 10, the horizontal axis will be in time (years).

Reposition the horizontal axis

After completing step 12, the horizontal axis will be at the bottom of the chart.

Figure 1.2f After completing step 12, the horizontal axis will be at the bottom of the chart.

Add titles

Give the axes and the chart appropriate titles.

Figure 1.2g Give the axes and the chart appropriate titles.

  1. The columns labelled ‘DJF’, ‘MAM’, ‘JJA’, and ‘SON’ contain seasonal averages (means). For example, the ‘MAM’ column contains the average of the March, April, and May columns for each year. Plot a separate line chart for each season, using average temperature anomaly for that season on the vertical axis and time (1880–2016) on the horizontal axis.
  1. The column labelled ‘J–D’ contains the average temperature anomaly for each year.

Excel walk-through 1.2 Plotting a line chart and adding a horizontal line

Figure 1.3 How to plot a line chart and add a horizontal line.

The data

This is what the data looks like. Column A contains years (1880–Present), Columns B to M contain monthly temperature anomalies, and Columns N to S contain the mean temperature anomaly for the specified months (e.g. DJF is the mean over Dec, Jan, and Feb). We are going to draw a line chart for the January to December mean (Column N).

Figure 1.3a This is what the data looks like. Column A contains years (1880–Present), Columns B to M contain monthly temperature anomalies, and Columns N to S contain the mean temperature anomaly for the specified months (e.g. DJF is the mean over Dec, Jan, and Feb). We are going to draw a line chart for the January to December mean (Column N).

Draw a line chart of mean temperature anomaly over time

After completing step 3, your line chart will look similar to this. Temperature deviation is on the vertical axis and time is on the horizontal axis. Notice that the numbers for time are not correct (they should be years).

Figure 1.3b After completing step 3, your line chart will look similar to this. Temperature deviation is on the vertical axis and time is on the horizontal axis. Notice that the numbers for time are not correct (they should be years).

Add a horizontal line showing the 1951–1980 average

To add a horizontal line, we will create a new variable (called ‘1951–1980 average’, shown in Column T) and add it to our line chart. This variable will have the value 0 so it will show up as a horizontal line on the chart.

Figure 1.3c To add a horizontal line, we will create a new variable (called ‘1951–1980 average’, shown in Column T) and add it to our line chart. This variable will have the value 0 so it will show up as a horizontal line on the chart.

Add a horizontal line showing the 1951–1980 average

Now, we have to add the values in Column T as a new series in the line chart.

Figure 1.3d Now, we have to add the values in Column T as a new series in the line chart.

Add a horizontal line showing the 1951–1980 average

The section on the left lists all the data series that are currently plotted on the chart as vertical axis values. We need to add the values in Column T to this list, so that a horizontal line with vertical axis values of 0 will appear.

Figure 1.3e The section on the left lists all the data series that are currently plotted on the chart as vertical axis values. We need to add the values in Column T to this list, so that a horizontal line with vertical axis values of 0 will appear.

Add a horizontal line showing the 1951–1980 average

The following box will pop up. You need to fill the two spaces with information about the new series. The first space is for the series name, and the second space is for the cells containing the vertical axis values.

Figure 1.3f The following box will pop up. You need to fill the two spaces with information about the new series. The first space is for the series name, and the second space is for the cells containing the vertical axis values.

Add a horizontal line showing the 1951–1980 average

The cells in Column T contain the vertical axis values we need to add to the chart, so we need to select them all.

Figure 1.3g The cells in Column T contain the vertical axis values we need to add to the chart, so we need to select them all.

Add a horizontal line showing the 1951–1980 average

After step 11, if both spaces in the box have been filled with the correct information, a horizontal line will appear on the chart.

Figure 1.3h After step 11, if both spaces in the box have been filled with the correct information, a horizontal line will appear on the chart.

Change the horizontal axis variable to time (in years)

There are two remaining things to change in the chart. First, the temperature series (Column N) needs to be given an appropriate name. (This name will show in the chart legend). Second, the horizontal axis labels should be changed to the years 1880–2016.

Figure 1.3i There are two remaining things to change in the chart. First, the temperature series (Column N) needs to be given an appropriate name. (This name will show in the chart legend). Second, the horizontal axis labels should be changed to the years 1880–2016.

Change the horizontal axis variable to time (in years)

After step 15, if the correct cells have been selected, the horizontal axis labels will change to the years 1880–2016.

Figure 1.3j After step 15, if the correct cells have been selected, the horizontal axis labels will change to the years 1880–2016.

Change the horizontal axis variable to time (in years)

After step 15, the box should look like this, with appropriate names for each data series (shown on the left), and correct horizontal axis values (shown on the right).

Figure 1.3k After step 15, the box should look like this, with appropriate names for each data series (shown on the left), and correct horizontal axis values (shown on the right).

Label the different lines on the line chart

After step 18, there will be a legend containing the names of each data series, in your chosen place on the chart.

Figure 1.3l After step 18, there will be a legend containing the names of each data series, in your chosen place on the chart.

Add titles and reposition the horizontal axis

By default, the horizontal axis is positioned at the vertical axis value of 0. To move it to the bottom of the chart (as in Figure 1.1), we have to change the axis position. After step 23, your chart will look similar to Figure 1.1.

Figure 1.3m By default, the horizontal axis is positioned at the vertical axis value of 0. To move it to the bottom of the chart (as in Figure 1.1), we have to change the axis position. After step 23, your chart will look similar to Figure 1.1.

  1. You now have charts for three different time intervals: month, season, and year. For each time interval, discuss what we can learn about patterns in temperature over time that we might not be able to learn from the charts of other time intervals.
  1. Compare your chart from Question 4 to Figure 1.4 which also shows the behaviour of temperature over time using data taken from the National Academy of Sciences.

Figure 1.4 Northern hemisphere temperatures over the long run (1000–2006).

Part 1.2 Variation in temperature over time

Aside from changes in the mean temperature, the government is also worried that climate change will result in more frequent extreme weather events. The island has experienced a few major storms and severe heat waves in the past, both of which caused serious damage and disruption to economic activity.

Will weather become more extreme and vary more as a result of climate change? A New York Times article uses the same temperature dataset you have been using to investigate the distribution of temperatures and temperature variability over time. Read through the article, paying close attention to the descriptions of the temperature distributions.

We can use the mean and median to describe distributions, and we can use deciles to describe parts of distributions. To visualize distributions, we can use column charts in Excel. (For some practice on using these concepts and creating column charts in Excel, see Section 1.1 of Economy, Society, and Public Policy). We are now going to create similar charts of temperature distributions to the ones in the New York Times article, and look at different ways of summarizing distributions.

frequency table
A record of how many observations in a dataset have a particular value, range of values, or belong to a particular category.

In order to create a column chart using the temperature data we have, we first need to summarize the data using a frequency table. Instead of using deciles to group the data, we use intervals of 0.05, so that temperature anomalies with a value from −0.3 to −0.025 will be in one group, a value greater than −0.025 up until 0.02 in another group, and so on. The frequency table shows us how many values belong to a particular group.

  1. Using the monthly data for June, July, and August (columns G to I in your spreadsheet), create two frequency tables similar to Figure 1.5 below for the years 1951–1980 and 1981–2010, respectively. The values in the first column should range from −0.3 to 1.05, in intervals of 0.05.
Range of temperature anomaly (T) Frequency
−0.30
−0.25
1.00
1.05

Figure 1.5 A frequency table.

Excel walk-through 1.3 Creating a frequency table

Follow the walk-through in the CORE video, or in Figure 1.6, to find out how to make a frequency table in Excel.

Create a table

Figure 1.6 How to create a frequency table in Excel.

Create a table

In this example, we will make a frequency table for the years 1951–1980 in Columns A and B. It’s a good idea to put all the tables in a separate place from the data.

Figure 1.6a In this example, we will make a frequency table for the years 1951–1980 in Columns A and B. It’s a good idea to put all the tables in a separate place from the data.

Create a table

After step 2, your table will look like Figure 1.5.

Figure 1.6b After step 2, your table will look like Figure 1.5.

Filter the data

It is easier to make a frequency table if you have filtered the data to show only the values you need for the table (the years 1951–1980 in this case).

Figure 1.6c It is easier to make a frequency table if you have filtered the data to show only the values you need for the table (the years 1951–1980 in this case).

Use the FREQUENCY function to fill in the rest of the table

After completing step 6, only the data for the selected years is shown in the spreadsheet. Data for the other years is still there, but it’s hidden.

Figure 1.6d After completing step 6, only the data for the selected years is shown in the spreadsheet. Data for the other years is still there, but it’s hidden.

Use the FREQUENCY function to fill in the rest of the table

Now that the data is filtered, we will use Excel’s FREQUENCY function to fill in Column B. First, select the cells that need to be filled in.

Figure 1.6e Now that the data is filtered, we will use Excel’s FREQUENCY function to fill in Column B. First, select the cells that need to be filled in.

Use the FREQUENCY function to fill in the rest of the table

Excel will fill in the frequency table based on the values in the cells selected.

Figure 1.6f Excel will fill in the frequency table based on the values in the cells selected.

Use the FREQUENCY function to fill in the rest of the table

Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

Figure 1.6g Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

  1. Using the frequency tables from Question 1:
variance
A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 2178 (large spread).

Now we will use our data to look at different aspects of distributions. First, we will learn how to use deciles to determine which observations are ‘normal’ and ‘abnormal’, and then learn how to use variance to describe the shape of a distribution.

  1. The New York Times article considers the bottom third (the lowest or coldest one-third) of temperature anomalies in 1951–1980 as ‘cold’ and the top third (the highest or hottest one-third) of anomalies as ‘hot’. In decile terms, temperatures in the 1st to 3rd decile are ‘cold’ and temperatures in the 7th to 10th decile or above are ‘hot’ (rounded to the nearest decile). Use Excel’s PERCENTILE.INC function to determine what values correspond to the 3rd and 7th decile, across all months in 1951–1980.

Excel walk-through 1.4 Calculating percentiles

Follow the walk-through in the CORE video, or in Figure 1.7, to find out how to calculate percentiles in Excel.

Figure 1.7 How to use Excel’s PERCENTILE.INC function.

The data

We will be using the same data as in Excel walk-through 1.3.

Figure 1.7a We will be using the same data as in Excel walk-through 1.3.

Use PERCENTILE.INC to get the value for the 3rd decile

The PERCENTILE.INC function will find the value corresponding to the chosen percentile in the cells you selected. The value 0.3 refers to the 30th percentile, also known as the 3rd decile.

Figure 1.7b The PERCENTILE.INC function will find the value corresponding to the chosen percentile in the cells you selected. The value 0.3 refers to the 30th percentile, also known as the 3rd decile.

Use PERCENTILE.INC to get the value for the 7th decile

Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

Figure 1.7c Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

  1. Based on the values you found in Question 3, count the number of anomalies that are considered ‘hot’ in 1981–2010, and express this as a percentage of all the temperature observations in that period. Does your answer suggest that we are experiencing hotter weather more frequently in 1981–2010? (Remember that each decile represents 10% of observations, so 30% of temperatures were considered ‘hot’ in 1951–1980.)

Excel walk-through 1.5 Using Excel’s COUNTIF function

Figure 1.8 How to use Excel’s COUNTIF function.

Filter the data, keeping the years 1981–2010

We will be using the years 1981–2010 only. To make the data easier to view, we will filter the data so that only the years 1981–2010 are visible.

Figure 1.8a We will be using the years 1981–2010 only. To make the data easier to view, we will filter the data so that only the years 1981–2010 are visible.

Filter the data, keeping the years 1981–2010

Now only the data for your selected years are shown in the spreadsheet. Data for the other years is still there, but hidden.

Figure 1.8b Now only the data for your selected years are shown in the spreadsheet. Data for the other years is still there, but hidden.

Use COUNTIF to get the number of cells with a value less than the 3rd decile of 1951–1980

The COUNTIF function counts the number of cells you selected that satisfy a given condition (in this case, having a value less than or equal to the value of the 3rd decile in 1951–1980).

Figure 1.8c The COUNTIF function counts the number of cells you selected that satisfy a given condition (in this case, having a value less than or equal to the value of the 3rd decile in 1951–1980).

Use COUNTIF to get the number of cells with a value greater than the 7th decile of 1951–1980

Now, the condition is that values should be greater than or equal to the value of the 7th decile in 1951–1980.

Figure 1.8d Now, the condition is that values should be greater than or equal to the value of the 7th decile in 1951–1980.

Use the numbers obtained to calculate percentages

COUNTIF gives us numbers, but to convert these into percentages we need to divide the numbers from COUNTIF by the total number of observations.

Figure 1.8e COUNTIF gives us numbers, but to convert these into percentages we need to divide the numbers from COUNTIF by the total number of observations.

  1. The New York Times article discusses whether temperatures have become more variable over time. One way to measure temperature variability is by calculating the variance of the temperature distribution for each season (‘DJF’, ‘MAM’, ‘JJA’, and ‘SON’).

Excel walk-through 1.6 Calculating and understanding the variance

Figure 1.9 How to calculate variance.

The temperature data

This data is temperature anomalies for 1981–2010, showing the months March to May only. The column chart shows what the data looks like. Each column shows how many values fall within the ranges shown on the horizontal axis. For example, this column tells us that 8 values are greater than 0.5 and less than or equal to 0.7. Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

Figure 1.9a This data is temperature anomalies for 1981–2010, showing the months March to May only. The column chart shows what the data looks like. Each column shows how many values fall within the ranges shown on the horizontal axis. For example, this column tells us that 8 values are greater than 0.5 and less than or equal to 0.7. Note: The values you get will be slightly different to those shown here, because this station temperature data is slightly different.

Made up data that is less spread out

This data on the right is made up. From this chart, you can see that the values are all quite close together, with the smallest value being 0.6 and the largest being 0.8. Comparing the charts, the real temperature data looks more spread out than the made up data.

Figure 1.9b This data on the right is made up. From this chart, you can see that the values are all quite close together, with the smallest value being 0.6 and the largest being 0.8. Comparing the charts, the real temperature data looks more spread out than the made up data.

Calculating and interpreting the variance

The variance is a measure of how spread out the data is. Just looking at the charts, we would expect the real temperature data to have a higher variance than the made up data. Note: There is a similar function in Excel called VAR.S, which is used to calculate the variance for other types of data. For this temperature data though, we will use VAR.P.

Figure 1.9c The variance is a measure of how spread out the data is. Just looking at the charts, we would expect the real temperature data to have a higher variance than the made up data. Note: There is a similar function in Excel called VAR.S, which is used to calculate the variance for other types of data. For this temperature data though, we will use VAR.P.

  1. Using the findings of the New York Times article and your answers to Questions 1 to 5, discuss whether temperature appears to be more variable over time. Would you advise the government to spend more money on mitigating the effects of extreme weather events?

Part 1.3 Carbon emissions and the environment

correlation
A measure of how closely related two variables are. Two variables are correlated if knowing the value of one variable provides information on the likely value of the other, for example high values of one variable being commonly observed along with high values of the other variable. Correlation can be positive or negative. It is negative when high values of one variable are observed with low values of the other. Correlation does not mean that there is a causal relationship between the variables. Example: When the weather is hotter, purchases of ice cream are higher. Temperature and ice cream sales are positively correlated. On the other hand, if purchases of hot beverages decrease when the weather is hotter, we say that temperature and hot beverage sales are negatively correlated.

The government has heard that carbon emissions could be responsible for climate change, and has asked you to investigate whether this is the case. To do so, we are now going to look at carbon emissions over time, and use another type of chart (scatter charts) to show their relationship with temperature anomalies. One way to measure the relationship between two variables is correlation. Excel walk-through 1.7 explains what correlation is and how to calculate it in Excel.

In the questions below, we will make charts using the CO2 data from the US National Oceanic and Atmospheric Administration. Download the Excel spreadsheet containing this data.

  1. The CO2 data was recorded from one Observatory in Mauna Loa. Using an Earth System Research Laboratory article as a reference, explain whether or not you think this data is a reliable representation of the global atmosphere.
  1. The variables ‘trend’ and ‘interpolated’ are similar, but not identical. In your own words, explain the difference between these two measures of CO2 levels. Why might there be seasonal variation in CO2 levels?

Now we will use a line chart to look for general patterns over time.

  1. Plot a line chart with interpolated and trend CO2 levels on the vertical axis and time (starting from January 1960) on the horizontal axis. Label the axes and the chart legend, and give your chart an appropriate title. What does this chart suggest about the relationship between CO2 and time?
correlation coefficient
A numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positively correlated), to −1, indicating that the variables take dissimilar variables (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.

We will now combine the CO2 data with the temperature data from Part 1.1, and then examine the relationship between these two variables visually, using scatterplots, and statistically, using the correlation coefficient.

  1. Choose one month and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.

Excel walk-through 1.7 Calculating correlation and drawing a scatterplot

Follow the walk-through in the CORE video, or in Figure 1.10, to find out how to calculate correlation and draw a scatterplot in Excel.

Figure 1.10 How to create scatterplots and calculate the correlation coefficient.

The CO2 data

This is what the monthly CO2 data looks like. Column A has the years (1958–present), Column B has the months (1–12), and Columns C to E have different measures of CO2 emissions. We will use the Trend data (Column E) to make a scatterplot of CO2 emissions and temperature anomalies for January. You will need to have the temperature anomaly spreadsheet open because we will be using it later.

Figure 1.10a This is what the monthly CO2 data looks like. Column A has the years (1958–present), Column B has the months (1–12), and Columns C to E have different measures of CO2 emissions. We will use the Trend data (Column E) to make a scatterplot of CO2 emissions and temperature anomalies for January. You will need to have the temperature anomaly spreadsheet open because we will be using it later.

Filter the CO2 data according to your chosen month

After completing step 3, only the data for the selected months is shown in the spreadsheet. Data for the other months is still there, but it’s hidden.

Figure 1.10b After completing step 3, only the data for the selected months is shown in the spreadsheet. Data for the other months is still there, but it’s hidden.

Filter the CO2 data according to your chosen month

We need to copy and paste the data in Column E into the temperature anomalies spreadsheet. Filtering the data first means that only these values will be copied (rather than all values in Column E).

Figure 1.10c We need to copy and paste the data in Column E into the temperature anomalies spreadsheet. Filtering the data first means that only these values will be copied (rather than all values in Column E).

Filter the temperature data to correspond to the years present in the CO2 data

The values in Column A (years) need to match with the years available in the CO2 data. We will filter the data so that temperature data is only visible for the years that CO2 data is available.

Figure 1.10d The values in Column A (years) need to match with the years available in the CO2 data. We will filter the data so that temperature data is only visible for the years that CO2 data is available.

Filter the temperature data to correspond to the years present in the CO2 data

Pay particular attention to the year in the first space. For March to December, the first year in the CO2 data is 1958. For January or February, the first year in the CO2 data is 1959.

Figure 1.10e Pay particular attention to the year in the first space. For March to December, the first year in the CO2 data is 1958. For January or February, the first year in the CO2 data is 1959.

Paste the CO2 data into the temperature data spreadsheet

If the data has been filtered correctly, after step 9 the CO2 data will be matched with the corresponding year.

Figure 1.10f If the data has been filtered correctly, after step 9 the CO2 data will be matched with the corresponding year.

Draw a scatterplot for temperature anomaly and CO2 level

After completing step 13, your scatterplot will look similar to this chart, with CO2 level on the vertical axis and temperature anomaly on the horizontal axis.

Figure 1.10g After completing step 13, your scatterplot will look similar to this chart, with CO2 level on the vertical axis and temperature anomaly on the horizontal axis.

Change the range of the vertical axis

Currently the points are clustered in the top half of the graph. To see them more clearly, we need to change the range of the vertical axis.

Figure 1.10h Currently the points are clustered in the top half of the graph. To see them more clearly, we need to change the range of the vertical axis.

Change the range of the vertical axis

After completing step 15, you will see the individual data points more clearly.

Figure 1.10i After completing step 15, you will see the individual data points more clearly.

Add titles

Give the axes and the chart appropirate titles.

Figure 1.10j Give the axes and the chart appropirate titles.

Calculate and interpret the correlation coefficient

The correlation coefficient tells us how close the data is to resembling a straight line on a scatterplot. The correlation coefficient ranges from −1 to −1. A coefficient of 1 or −1 means that the data is a straight line (there is a perfect linear relationship between the two variables), while a coefficient of 0 means that the data do not resemble a straight line (no linear relationship between the two variables). Note: The coefficient you get will be slightly different because this temperature data is the station data, not the combined air-surface temperature data.

Figure 1.10k The correlation coefficient tells us how close the data is to resembling a straight line on a scatterplot. The correlation coefficient ranges from −1 to −1. A coefficient of 1 or −1 means that the data is a straight line (there is a perfect linear relationship between the two variables), while a coefficient of 0 means that the data do not resemble a straight line (no linear relationship between the two variables). Note: The coefficient you get will be slightly different because this temperature data is the station data, not the combined air-surface temperature data.

Calculate and interpret the correlation coefficient

Here are some more examples of correlation coefficients and how to interpret them. Note: There is no strict definition for what a ‘strong’ or ‘weak’ correlation is. Generally, a coefficient of 0.7 to 1 (or −0.7 to −1) is considered ‘strong’ and a coefficient of 0 to 0.3 (0 to −0.3) is considered ‘weak’. Any coefficient outside this range is generally considered ‘moderate’.

Figure 1.10l Here are some more examples of correlation coefficients and how to interpret them. Note: There is no strict definition for what a ‘strong’ or ‘weak’ correlation is. Generally, a coefficient of 0.7 to 1 (or −0.7 to −1) is considered ‘strong’ and a coefficient of 0 to 0.3 (0 to −0.3) is considered ‘weak’. Any coefficient outside this range is generally considered ‘moderate’.

Limitations of the correlation coefficient

One limitation of this correlation measure is that it only tells us about the strength of the linear relationship between two variables. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (e.g. a wavy line). For example, suppose your electricity bill is high in winter but low in summer (the U-shaped pattern shown in the chart). This correlation measure cannot detect the U-shaped pattern, understating the actual association between the two variables. Note: There are other measures of correlation that can deal with this issue, but we will only look at this correlation measure (called Pearson correlation) for now.

Figure 1.10m One limitation of this correlation measure is that it only tells us about the strength of the linear relationship between two variables. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (e.g. a wavy line). For example, suppose your electricity bill is high in winter but low in summer (the U-shaped pattern shown in the chart). This correlation measure cannot detect the U-shaped pattern, understating the actual association between the two variables. Note: There are other measures of correlation that can deal with this issue, but we will only look at this correlation measure (called Pearson correlation) for now.

Limitations of the correlation coefficient

Remember that correlation is a measure of association and does not imply that one variable causes the other. Spurious correlation is when two variables are strongly correlated but there is no direct relationship between them. In this example, the number of insurance scams in your city is strongly correlated with your monthly electricity bill (almost a perfectly straight line), but it is difficult to argue that your electricity consumption caused the scams, or vice versa. Similarly, we cannot use correlation or scatterplots to predict the value of either variable outside the range shown. For example, from this chart, we cannot conclude that when there are 45 insurance scams, your electricity bill will be higher than $37. We can only use correlation and scatterplots to summarize and look at the data we have.

Figure 1.10n Remember that correlation is a measure of association and does not imply that one variable causes the other. Spurious correlation is when two variables are strongly correlated but there is no direct relationship between them. In this example, the number of insurance scams in your city is strongly correlated with your monthly electricity bill (almost a perfectly straight line), but it is difficult to argue that your electricity consumption caused the scams, or vice versa. Similarly, we cannot use correlation or scatterplots to predict the value of either variable outside the range shown. For example, from this chart, we cannot conclude that when there are 45 insurance scams, your electricity bill will be higher than $37. We can only use correlation and scatterplots to summarize and look at the data we have.

  1. Choose two months and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.
causation
A direction from cause to effect, establishing that a change in one variable produces a change in another. While a correlation gives an indication of whether two variables move together (either in the same or opposite directions), causation means that there is a mechanism that explains this association. Example: We know that higher levels of CO2 in the atmosphere lead to a greenhouse effect, which warms the Earth’s surface. Therefore we can say that higher CO2 levels are the cause of higher surface temperatures.
spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.

Even though two variables are strongly correlated with each other, it is not necessarily the case that one variable’s behaviour is the result of the other (a characteristic known as causation). The two variables could be spuriously correlated. The following example illustrates spurious correlation:

A child’s academic performance may be positively correlated with the number of rooms in their house or house size, but could we conclude that building an extra room would make a child smarter, or doing well at school would make your house bigger? It is more plausible that income or wealth, which determines the size of home that a family can afford and the resources available for studying, is the ‘unseen factor’ in this relationship. We could also determine whether income is the reason for this spurious correlation by comparing exam scores for children whose parents have similar income but different house sizes. If there is no correlation between exam scores and house size, then we can deduce that house size was not ‘causing’ exam scores (or vice versa).

  1. Consider the example of spurious correlation above.

Find out more What makes some correlations spurious?

In the spurious correlations website given in Question 6(c), most of the examples you will see involve data series that are trending (meaning that they tend to increase or decrease over time). If you calculate a correlation between two series that are trending, you are bound to find a large positive or negative correlation coefficient, even if there is no plausible explanation for a relationship between the two series. For example, ‘per capita cheese consumption’ (which increases over time due to increased disposable incomes or greater availability of cheese) has a correlation coefficient of 0.95 with the ‘number of people who die from becoming tangled in their bedsheets’ (which also increases over time due to a growing population and a growing availability of bedsheets).

The case for our example (the relationship between temperature and CO2 emissions) is slightly different. There is a well-known chemical link between the two. So we understand how CO2 emissions could potentially cause changes in temperature. But in general, do not be tempted to conclude that there is a causal link just because a high correlation coefficient can be seen. Be very cautious when attaching too much meaning to high correlation coefficients when data displays trending behaviour.

This part shows that summary statistics, such as the correlation coefficient, can help identify possible patterns or relationships between variables, but we cannot make conclusions about causation from them alone. It is also important to think about other explanations for what we see in the data, and whether we would expect there to be a relationship between the two variables.

natural experiment
An empirical study exploiting naturally occurring statistical controls in which researchers do not have the ability to assign participants to treatment and control groups, as is the case in conventional experiments. Instead, differences in law, policy, weather, or other events can offer the opportunity to analyse populations as if they had been part of an experiment. The validity of such studies depends on the premise that the assignment of subjects to the naturally occurring treatment and control groups can be plausibly argued to be random.

However, there are ways to determine whether there is a causal relationship between two variables, for example, by looking at the scientific processes that connect the variables (as with CO2 and temperature anomalies), or by using a natural experiment. To read more about how natural experiments help evaluate whether one variable causes another, see Section 1.5 of Economy, Society, and Public Policy. In Empirical Project 3, we will take a closer look at natural experiments and how we can use them to identify causal links between variables.