Empirical Project 1 Working in R

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save link as’. You’ll need to save the code download to your working directory, and open it in RStudio.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in R

If you have worked with R and have the software installed on your machine, you can begin this project. If you have not, watch these videos for guidance on installing and using RStudio.

Part 1.1 The behaviour of average surface temperature over time

In the questions below, we look at data from NASA about land–ocean temperature anomalies in the northern hemisphere from 1880 to 2016. Figure 1.1 is constructed using this data, and shows temperatures in the northern hemisphere over the period 1880–2016, expressed as differences from the average temperature from 1951 to 1980. We start by creating charts similar to Figure 1.1, in order to visualize the data and spot patterns more easily.

Northern hemisphere temperatures (1880–2016).

Figure 1.1 Northern hemisphere temperatures (1880–2016).

Before plotting any charts, download the data and make sure you understand how temperature is measured:

  1. In this dataset, temperature is measured as ‘anomalies’ rather than as absolute temperature. Using NASA’s Frequently Asked Questions section as a reference, explain in your own words what temperature ‘anomalies’ means. Why have researchers chosen this particular measure over other measures (such as absolute temperature)?

First we have to import the data into R.

R walk-through 1.1 Importing the datafile into R

We want to import the data from ‘NH.Ts+dSST.csv’ into R.

We start by setting our working directory using the setwd command. Replace ‘YOURFILEPATH’ with the full path which points to the folder in which you have saved the datafile. If you don’t know how to find the path to your working folder, see the ‘Technical Reference section’.


Since our data is in csv format, we use the read.csv function to import the data into R. We will call our file ‘temperature data’, or ‘tempdata’ for short.

tempdata <- read.csv("NH.Ts+dSST.csv",skip=1,na.strings = "***") 

When using this function, we added two options. The skip=1 option is there as the real data table only starts in Row 2, so we need to skip one row. The na.strings = "***" option informs R how missing observations in the spreadsheet are coded. When looking at the spreadsheet, you can see that missing data is coded as "***". It is best to specify this here, as otherwise some of the data is not recognized as numeric data.

Understanding the data you are using is key to all empirical work. You can view the first six rows of the dataset, and confirm that they correspond to the columns in the csv file.

##   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1 1880 -0.57 -0.41 -0.28 -0.38 -0.12 -0.24 -0.25 -0.26 -0.29 -0.34 -0.39
## 2 1881 -0.21 -0.27 -0.01 -0.04 -0.07 -0.38 -0.07 -0.04 -0.25 -0.42 -0.44
## 3 1882  0.21  0.20 -0.01 -0.38 -0.33 -0.39 -0.38 -0.15 -0.18 -0.54 -0.34
## 4 1883 -0.61 -0.70 -0.17 -0.29 -0.34 -0.27 -0.10 -0.27 -0.35 -0.23 -0.42
## 5 1884 -0.24 -0.13 -0.67 -0.64 -0.43 -0.53 -0.49 -0.52 -0.46 -0.42 -0.49
## 6 1885 -1.01 -0.37 -0.21 -0.54 -0.57 -0.49 -0.38 -0.46 -0.33 -0.31 -0.29
##     Dec   J.D   D.N   DJF   MAM   JJA   SON
## 1 -0.51 -0.34    NA    NA -0.26 -0.25 -0.34
## 2 -0.30 -0.21 -0.23 -0.33 -0.04 -0.16 -0.37
## 3 -0.43 -0.23 -0.22  0.04 -0.24 -0.31 -0.35
## 4 -0.27 -0.34 -0.35 -0.58 -0.27 -0.21 -0.33
## 5 -0.40 -0.45 -0.44 -0.22 -0.58 -0.51 -0.45
## 6  0.00 -0.41 -0.45 -0.60 -0.44 -0.44 -0.31

Before we go on, it is also important to understand the data formats.

## 'data.frame':  138 obs. of 19 variables:
##  $ Year: int  1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 ...
##  $ Jan : num  -0.57 -0.21 0.21 -0.61 -0.24 -1.01 -0.69 -1.08 -0.54 -0.33 ...
##  $ Feb : num  -0.41 -0.27 0.2 -0.7 -0.13 -0.37 -0.69 -0.6 -0.61 0.32 ...
##  $ Mar : num  -0.28 -0.01 -0.01 -0.17 -0.67 -0.21 -0.58 -0.37 -0.59 0.05 ...
##  $ Apr : num  -0.38 -0.04 -0.38 -0.29 -0.64 -0.54 -0.35 -0.43 -0.26 0.12 ...
##  $ May : num  -0.12 -0.07 -0.33 -0.34 -0.43 -0.57 -0.36 -0.28 -0.18 -0.07 ...
##  $ Jun : num  -0.24 -0.38 -0.39 -0.27 -0.53 -0.49 -0.44 -0.21 -0.06 -0.15 ...
##  $ Jul : num  -0.25 -0.07 -0.38 -0.1 -0.49 -0.38 -0.21 -0.23 0.01 -0.12 ...
##  $ Aug : num  -0.26 -0.04 -0.15 -0.27 -0.52 -0.46 -0.48 -0.53 -0.21 -0.19 ...
##  $ Sep : num  -0.29 -0.25 -0.18 -0.35 -0.46 -0.33 -0.35 -0.18 -0.15 -0.28 ...
##  $ Oct : num  -0.34 -0.42 -0.54 -0.23 -0.42 -0.31 -0.33 -0.41 0.01 -0.37 ...
##  $ Nov : num  -0.39 -0.44 -0.34 -0.42 -0.49 -0.29 -0.46 -0.21 -0.04 -0.63 ...
##  $ Dec : num  -0.51 -0.3 -0.43 -0.27 -0.4 0 -0.18 -0.45 -0.28 -0.58 ...
##  $ J.D : num  -0.34 -0.21 -0.23 -0.34 -0.45 -0.41 -0.43 -0.42 -0.24 -0.19 ...
##  $ D.N : num  NA -0.23 -0.22 -0.35 -0.44 -0.45 -0.41 -0.39 -0.26 -0.16 ...
##  $ DJF : num  NA -0.33 0.04 -0.58 -0.22 -0.6 -0.46 -0.62 -0.53 -0.1 ...
##  $ MAM : num  -0.26 -0.04 -0.24 -0.27 -0.58 -0.44 -0.43 -0.36 -0.34 0.03 ...
##  $ JJA : num  -0.25 -0.16 -0.31 -0.21 -0.51 -0.44 -0.38 -0.32 -0.09 -0.15 ...
##  $ SON : num  -0.34 -0.37 -0.35 -0.33 -0.45 -0.31 -0.38 -0.27 -0.06 -0.43 ...

You can see that all variables are formatted as numerical data, which is helpful. If you don’t declare na.strings = "***" as you load the data, the variables with missing information would have been loaded as factor (categorical) variables.

Now create some line charts using monthly, seasonal, and annual data, which help us look for general patterns over time.

  1. Choose one month and plot a line chart with average temperature anomaly on the vertical axis and time (1880–2016) on the horizontal axis. Label each axis appropriately and give your chart a suitable title (Refer to Figure 1.1 as an example.)

R walk-through 1.2 Drawing a line chart of temperature and time

The data is formatted as num data. This is acceptable, but implies that R doesn’t recognize the data as time-series data. In R you can specify that a series is a time-series, in other words, that there are observations for the same variable but for different time periods. Letting R know that the data is a time-series will help later (with graphing in particular). Redefining series as time-series can be done as shown below. Make sure to amend the code below so that the end year (end=c()) corresponds to the latest year in your dataset (our example uses 2017).

tempdata$Jan <- ts(tempdata$Jan, start=c(1880), end=c(2017), frequency=1) 
tempdata$DJF <- ts(tempdata$DJF, start=c(1880), end=c(2017), frequency=1) 
tempdata$MAM <- ts(tempdata$MAM, start=c(1880), end=c(2017), frequency=1) 
tempdata$JJA <- ts(tempdata$JJA, start=c(1880), end=c(2017), frequency=1) 
tempdata$SON <- ts(tempdata$SON, start=c(1880), end=c(2017), frequency=1) 
tempdata$J.D <- ts(tempdata$J.D, start=c(1880), end=c(2017), frequency=1) 

Note that we placed each of these quarterly series in the relevant middle month. You could do the same for the remaining series, but we will only use the series above in this R walk-through.

To draw a line chart for the years 1880–2016, we use R’s plot function. The title on the next line adds a title to the chart, and the axis draws horizontal and vertical axes according to our specifications. The following code uses data for January as an example.

plot(tempdata$Jan,type="l",col="blue", lwd = 2, # Set line width and colour
     ylab="Annual temperature anomalies", xlab="Year")
title("Average temperature anomaly in January \n in the northern hemisphere (1880-2016)")  # \n creates a line break
abline(h=0,col="darkorange2",lwd = 2) # Add a horizontal line (at y=0)
text(2000,-0.1,"1951-1980 average") # Add a label to the horizontal line

Northern hemisphere January temperatures (1880–2016).

Figure 1.2 Northern hemisphere January temperatures (1880–2016).

Try different values for type and col in the plot command to figure out what these options do (some online research could help). xlab and ylab define the respective axis titles.

It is important to remember that all axis and chart titles should be enclosed in quotation marks (""), as well as any words that are not options (for example, colour names or filenames).

  1. The columns labelled DJF, MAM, JJA, and SON contain seasonal averages (means). For example, the MAM column contains the average of the March, April, and May columns for each year. Plot a separate line chart for each season, using average temperature anomaly for that season on the vertical axis and time (1880–2016) on the horizontal axis.
  1. The column labelled JD contains the average temperature anomaly for each year.

R walk-through 1.3 Producing a line chart for the annual temperature anomalies

This is where the power of programming languages comes into play. Take the code used in R walk-through 1.2 and replace the variable name Jan with the name for the annual variable (J.D).

plot(tempdata$J.D,type="l",col="blue",lwd = 2,   # Set line width and colour
     ylab="Annual temperature anomalies", xlab="Year")
title("Average annual temperature anomaly \n in the northern hemisphere (1880-2016)")  # \n creates a line break
abline(h=0,col="darkorange2",lwd = 2) # Add a horizontal line (at y=0)
text(2000,-0.1,"1951-1980 average") # Add a label to the horizontal line

Northern hemisphere annual temperatures (1880–2016).

Figure 1.3 Northern hemisphere annual temperatures (1880–2016).

  1. You now have charts for three different time intervals: month, season, and year. For each time interval, discuss what we can learn about patterns in temperature over time that we might not be able to learn from the charts of other time intervals.
  1. Compare your chart from Question 4 to Figure 1.4, which also shows the behaviour of temperature over time using data taken from the National Academy of Sciences.

Northern hemisphere temperatures over the long run (1000–2006).

Figure 1.4 Northern hemisphere temperatures over the long run (1000–2006).

Part 1.2 Variation in temperature over time

Aside from changes in the mean temperature, the government is also worried that climate change will result in more frequent extreme weather events. The island has experienced a few major storms and severe heat waves in the past, both of which caused serious damage and disruption to economic activity.

Will weather become more extreme and vary more as a result of climate change? A New York Times article uses the same temperature dataset you have been using to investigate the distribution of temperatures and temperature variability over time. Read through the article, paying close attention to the descriptions of the temperature distributions.

We can use the mean and median to describe distributions, and we can use deciles to describe parts of distributions. To visualize distributions, we can use column charts (sometimes referred to as frequency histograms). We are now going to create similar charts of temperature distributions to the ones in the New York Times article, and look at different ways of summarizing distributions.

frequency table
A record of how many observations in a dataset have a particular value, range of values, or belong to a particular category.

In order to create a column chart using the temperature data we have, we first need to summarize the data using a frequency table. Instead of using deciles to group the data, we use intervals of 0.05, so that temperature anomalies with a value from −0.3 to −0.025 will be in one group, a value greater than −0.025 and up to 0.02 in another group, and so on. The frequency table shows us how many values belong to a particular group.

  1. Using the monthly data for June, July, and August, create two frequency tables similar to Figure 1.5 below for the years 1951–1980 and 1981–2010 respectively. The values in the first column should range from −0.3 to 1.05, in intervals of 0.05. See Walk-through 1.3 for how to do this.
Range of temperature anomaly (T) Frequency

Figure 1.5 A frequency table.

  1. Using the frequency tables from Question 1:

R walk-through 1.4 Creating frequency tables and histograms

If you know that you will be dividing data up into subperiods, it is often best to create a categorical variable that indicates the subperiod for each observation. In R these are called factor variables. When you create factor variables it is useful to think about all the possible values the variables could take, and then predefine these values.

tempdata$Period <- factor(NA, levels=c("1921-1950","1951-1980","1981-2010"), 

This defines the new variable Period initially as a variable with value ‘NA’ (not available) for all observations, and predefines the possible values, called levels. Now we continue by changing the values for those observations that fall into the periods of interest.

tempdata$Period[(tempdata$Year > 1920) & (tempdata$Year < 1951)] <- "1921-1950"
tempdata$Period[(tempdata$Year > 1950) & (tempdata$Year < 1981)] <- "1951-1980"
tempdata$Period[(tempdata$Year > 1980) & (tempdata$Year < 2011)] <- "1981-2010"

As we are asked to use all monthly anomalies from June, July, and August, we need to create one new variable that contains all these values. We do this by taking the variables for the three months and attaching them to each other using the combine (c) function.

temp_summer <- c(tempdata$Jun, tempdata$Jul, tempdata$Aug) # Combine the temperature data for June, July, and August

There are always several ways to achieve the same result. One alternative is to use the unlist function and apply it to Columns 7 to 9 (containing the data for June to August) of tempdata.

temp_summer <- unlist(tempdata[,7:9],use.names = FALSE)

This is now one long dataseries (temp_summer), with the monthly temperature anomalies for the three months attached to each other. But remember that we want to separate the calculations that follow by the variable Period. We therefore repeat the same for the Period variable.

temp_Period <- c(tempdata$Period,tempdata$Period,tempdata$Period) # Mirror the Period information for temp_sum

temp_Period <- factor(temp_Period, levels=1:nlevels(tempdata$Period), labels=levels(tempdata$Period)) # Repopulate the factor information 

The second line here is somewhat non-intuitive. It is needed as concatenating tempdata$Period results in the new variable not being a factor (categorical) variable any longer. This is why we have to reimpose the factor characteristics.

Now we will obtain the information required for a frequency table by applying the hist function to temp_summer[(temp_Period == "1951-1980")], which are the monthly temperature anomalies from the period ‘1951–1980’. The option plot=FALSE switches off the plot. (See what happens if you set it to TRUE.)

hist(temp_summer[(temp_Period == "1951-1980")],plot=FALSE)
## $breaks
##  [1] -0.30 -0.25 -0.20 -0.15 -0.10 -0.05  0.00  0.05  0.10  0.15  0.20
## [12]  0.25
## $counts
##  [1]  2  7  6  7 14  6 14 12 10  9  3
## $density
##  [1] 0.4444444 1.5555556 1.3333333 1.5555556 3.1111111 1.3333333 3.1111111
##  [8] 2.6666667 2.2222222 2.0000000 0.6666667
## $mids
##  [1] -0.275 -0.225 -0.175 -0.125 -0.075 -0.025  0.025  0.075  0.125  0.175
## [11]  0.225
## $xname
## [1] "temp_summer[(temp_Period == \"1951-1980\")]"
## $equidist
## [1] TRUE
## attr(,"class")
## [1] "histogram"

From the output you can see that we can get the class breaks and the counts, which is all we need to create a frequency table. However, in our case the frequency table is merely a temporary input required to produce a histogram.

We will now use the histogram function from the mosaic package, which offers a straightforward way to separate the histogram calculation by Period.

The function below includes multiple commands:

  • | temp_Period splits the data according to the temp_Period value.
  • type = "count" indicates that we want to display the counts in each category
  • breaks = seq(-0.5,1.3,0.1) gives a sequence of numbers -0.5, -0.4, …, 1.3, which are boundaries for the categories.
  • main="Histogram of temperature anomalies" gives Figure 1.6 its title.
library(mosaic) # This is the library we use for the following command.

histogram(~ temp_summer | temp_Period, type = "count", breaks = seq(-0.5,1.3,0.10), main="Histogram of Temperature anomalies", xlab = "Summer temperature distribution")

Summer temperature distributions for different time periods.

Figure 1.6 Summer temperature distributions for different time periods.

To explain what a histogram displays, we refer to the histogram for the period from 1921–1950. Notice the vertical bars that are centred at values such as –0.35, –0.25, –0.15, –0.05, and so forth. We will look first at the highest of the bars, which is centred at –0.15. This bar represents values of the temperature anomalies that fall in the interval from –0.2 to –0.1. The height of this bar is a representation of how many values fall into this interval, and in this case there are 23 such observations. As it is the highest bar, this indicates that this is the interval in which the largest proportion of temperature anomalies fell for the period from 1921 to 1950. As you can see, there are virtually no temperature anomalies larger than 0.3. The height of these bars gives a useful first glimpse of the distribution of the temperature anomalies.

Now consider how this distribution changes as we move through the three distinct time periods. The distribution is clearly moving to the right for the 1981–2010 period. At this stage we only take note of this. This is an indication that the temperature is increasing, in other words an indication of global warming.

A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 2178 (large spread).

Now we will use our data to look at different aspects of distributions. Firstly, we will learn how to use deciles to determine which observations are ‘normal’ and ‘abnormal’, and then learn how to use variance to describe the shape of a distribution.

  1. The New York Times article considers the bottom third (the lowest or coldest one-third) of temperature anomalies in 1951–1980 as ‘cold’ and the top third (the highest or hottest one-third) of anomalies as ‘hot’. In decile terms, temperatures in the 1st to 3rd decile are ‘cold’ and temperatures in the 7th to 10th decile or above are ‘hot’ (rounded to the nearest decile). Use R’s quantile function to determine what values correspond to the 3rd and 7th decile across all months in 1951–1980.

R walk-through 1.5 Using the quantile function

The percentiles can be found with R’s quantile function. But first we need to create a vector that contains all monthly anomalies. As in the previous example, we want to continue to separate out the three periods.

temp_all_months <- subset(tempdata, (Year>=1951 & Year <=1980))  # Select years 1951 to 1980
temp_51to80 <- unlist(temp_all_months[,2:13])  # Columns 2 to 13 contain months Jan to Dec.
perc <- quantile(temp_51to80, c(0.3, 0.7))   # c(0.3, 0.7) indicates the chosen percentiles.

p30 <- perc[1] # The cold threshold
##  30% 
## -0.1
p70 <- perc[2] # The hot threshold
##  70% 
## 0.11
  1. Based on the values you found in Question 3, count the number of anomalies that are considered ‘hot’ in 1981–2010, and express this as a percentage of all the temperature observations in that period. Does your answer suggest that we are experiencing hotter weather more frequently in 1981–2010? (Remember that each decile represents 10% of observations, so 30% of temperatures were considered ‘hot’ in 1951–1980.)

R walk-through 1.6 Using the mean function

As before, you will have to retrieve the data for all months for the relevant period. You can just copy and paste from the earlier code and change the years.

temp_all_months <- subset(tempdata, (Year>=1981 & Year <=2010))  # Select years 1951 to 1980
temp_81to10 <- unlist(temp_all_months[,2:13])  # Columns 2 to 13 contain months Jan to Dec.

Now that we have all the monthly data for 1981–2010, we want to count the proportion of observations that are smaller than –0.1 and larger than 0.11. This is easily achieved with the following line:

paste("Proportion smaller than p30")
## [1] "Proportion smaller than p30"
temp <- temp_81to10 < p30
## [1] 0.01944444

You need to understand what temp <- temp_81to10 < p30 does, to understand what happended above. This command creates a variable with values that are either 1 (TRUE) or 0 (FALSE), depending on whether the particular monthly temperature anomaly in temp_81to10 was smaller than p30 (true) or larger than p30 (false). When you have a vector of 0s and 1s, and then calculate the mean of that vector, you end up with the proportion of 1s. Here we find that 0.019 (= 1.9%) of observations are smaller than p30. That means that while between 1951 and 1980, 30% of observations for the temperature anomaly were smaller than –0.10, between 1981 and 2010 only about two per cent of months are considered cold. That is a significant change.

Let’s check whether we get the same result at the upper tail.

paste("Proportion larger than p70")
## [1] "Proportion larger than p70"
mean(temp_81to10 > p70)
## [1] 0.8444444
  1. The New York Times article discusses whether temperatures have become more variable over time. One way to measure temperature variability is by calculating the variance of the temperature distribution for each season (DJF, MAM, JJA, and SON).

R walk-through 1.7 Calculating and understanding mean and variance

The best way to calculate mean and variance is to use the mosaic package introduced in R walk-through 1.4. (Remember to first install the mosaic package and name it library(mosaic).)

library(mosaic) # Only run if you haven't loaded mosaic yet
paste("Mean of DJF temperature anomalies across periods")
## [1] "Mean of DJF temperature anomalies across periods"
mean(~DJF|Period,data = tempdata)
##     1921-1950     1951-1980     1981-2010 
## -0.0573333333 -0.0006666667  0.5206666667
paste("Variance of DJF temperature anomalies across periods")
## [1] "Variance of DJF temperature anomalies across periods"
var(~DJF|Period,data = tempdata)
##  1921-1950  1951-1980  1981-2010 
## 0.05907540 0.05361333 0.07149609

The syntax here works as follows: Use the data in tempdata (data = tempdata) and calculate the mean of variable ~DJF conditioned on (|) the value for Period. Do the same for the var function. The conditionioning is a functionality added to the mean and var functions by the mosaic package. If mosaic is not loaded, you will get the error message: Error in mean(~DJF \| Period, data = tempdata) : unused argument (data = tempdata).

Looking at the results, it appears that it is not only the mean (December, January, and February) temperature anomaly that increases through 1981–2010, but also the variance.

Let’s calculate the variances through the periods for the other quarterly series.

paste("Variance of MAM temperature anomalies across periods")
## [1] "Variance of MAM temperature anomalies across periods"
var(~MAM|Period,data = tempdata)
##  1921-1950  1951-1980  1981-2010 
## 0.03029069 0.02661333 0.07535126
paste("Variance of JJA temperature anomalies across periods")
## [1] "Variance of JJA temperature anomalies across periods"
var(~JJA|Period,data = tempdata)
##  1921-1950  1951-1980  1981-2010 
## 0.01726713 0.01459264 0.06588690
paste("Variance of SON temperature anomalies across periods")
## [1] "Variance of SON temperature anomalies across periods"
var(~SON|Period,data = tempdata)
##  1921-1950  1951-1980  1981-2010 
## 0.02421437 0.02587920 0.10431506

We recognize that the variances seem to remain fairly constant across the first two periods, but they do increase markedly for the 1981–2010 period.

It is instructive to see these changes in graphical form as well. We will first plot a time-series plot.

     ylab="Annual temperature anomalies", xlab="Year")
title("Average temperature anomaly in DJF and JJA \n in the northern hemisphere (1880-2016)")  # \n creates a line break
abline(h=0,col="darkorange2",lwd=2) # Add a horizontal line (at y=0)
text(1895,0.1,"1951-1980 average") # Add a label to the horizontal line
legend(1880, 1.5, legend=c("DJF", "JJA"),col=c("blue", "darkgreen"), lty=1, cex=0.8,lwd=2)

Northern hemisphere winter and summer quarter temperatures (1958–2016).

Figure 1.7 Northern hemisphere winter and summer quarter temperatures (1958–2016).

  1. Using the findings of the New York Times article and your answers to Questions 1 to 5, discuss whether temperature appears to be more variable over time. Would you advise the government to spend more money on mitigating the effects of extreme weather events?
A measure of how closely related two variables are. Two variables are correlated if knowing the value of one variable provides information on the likely value of the other, for example high values of one variable being commonly observed along with high values of the other variable. Correlation can be positive or negative. It is negative when high values of one variable are observed with low values of the other. Correlation does not mean that there is a causal relationship between the variables. Example: When the weather is hotter, purchases of ice cream are higher. Temperature and ice cream sales are positively correlated. On the other hand, if purchases of hot beverages decrease when the weather is hotter, we say that temperature and hot beverage sales are negatively correlated.

Part 1.3 Carbon emissions and the environment

The government has heard that carbon emissions could be responsible for climate change, and has asked you to investigate whether this is the case. To do so, we are now going to look at carbon emissions over time, and use another type of chart, the scatterplot, to show their relationship to temperature anomalies. One way to measure the relationship between two variables is correlation. R walk-through 1.8 explains what correlation is and how to calculate it in R.

In the questions below, we will make charts using the CO2 data from the US National Oceanic and Atmospheric Administration. Download the Excel spreadsheet containing this data. Save the data as a csv file and import it into R.

  1. The CO2 data was recorded from one observatory in Mauna Loa. Using an Earth System Research Laboratory article as a reference, explain whether or not you think this data is a reliable representation of the global atmosphere.
  1. The variables trend and interpolated are similar, but not identical. In your own words, explain the difference between these two measures of CO2 levels. Why might there be seasonal variation in CO2 levels?

Now we will use a line chart to look for general patterns over time.

  1. Plot a line chart with interpolated and trend CO2 levels on the vertical axis and time (starting from January 1960) on the horizontal axis. Label the axes and the chart legend, and give your chart an appropriate title. What does this chart suggest about the relationship between CO2 and time?
correlation coefficient
A numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positively correlated), to −1, indicating that the variables take dissimilar variables (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.

We will now combine the CO2 data with the temperature data from Part 1.1, and then examine the relationship between these two variables visually using scatterplots, and statistically using the correlation coefficient.

  1. Choose one month and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.

R walk-through 1.8 Scatterplots and the correlation coefficient

First we will import the CO2 data into R using the read.csv function, and call it ‘CO2data’.

CO2data <- read.csv("1_CO2 data.csv")

This file has monthly data, but in contrast to the data in tempdata, the data is all in one column (this is more conventional than the column per month format). To make this task easier, we will pick the June data from the CO2 emissions and add them as an additional variable to the tempdata dataset.

R has a convenient function called merge to do this. First we create a new dataset that contains only the June emissions data (‘CO2data_june’).

CO2data_june <- CO2data[CO2data$Month == 6,]

Then we use this data in the merge function. The merge function takes the original ‘tempdata’ and the ‘CO2data’ and merges them together. As the two have a common variable, Year, R automatically matches the data by year.

(Extension: Look up ?merge or Google ‘How to use the R merge function’ to figure out what all.x does, and to see other intricacies this function allows.)

tempCO2data <- merge(tempdata,CO2data_june)

Let us have a look at the data and how it aligns:

##   Year   Jun  Trend
## 1 1958  0.04 314.85
## 2 1959  0.14 315.92
## 3 1960  0.18 317.36
## 4 1961  0.19 317.48
## 5 1962 -0.10 318.27
## 6 1963 -0.02 319.16

To make a scatterplot, we use plot. Remember that R’s default chart for plot is a scatterplot, so we do not need to specify the chart type. One new option that applies to scatterplots is pch=, which determines the style of the data points. The number 16 corresponds to filled-in circles, but you can experiment with other numbers (0 to 25) to see what the data points look like.

plot(tempCO2data$Jun, tempCO2data$Trend, xlab="Temperature anomaly (degrees Celsius)", ylab="CO2 levels (trend, mole fraction)", pch=16, col="blue")
title("Scatterplot for CO2 emissions and temperature anomalies")

CO2 emissions and northern hemisphere June temperatures (1958–2016).

Figure 1.8 CO2 emissions and northern hemisphere June temperatures (1958–2016).

The cor function calculates the correlation coefficient.

cor(tempCO2data$Jun, tempCO2data$Trend)
## [1] 0.9157744

In this case, the correlation coefficient tells us that the data is quite close to resembling a straight line (as seen on the scatterplot). There is a strong positive association between the two variables (higher temperature anomalies are associated with higher CO2 levels).

Note: There is no strict definition for what a ‘strong’ or ‘weak’ correlation is. Generally, a coefficient of 0.7 to 1 (or between –0.7 and –1) is considered ‘strong’, and a coefficient of 0 to 0.3 (0 to –0.3) is considered ‘weak’. Any coefficient outside this range is generally considered ‘moderate’.

One limitation of this correlation measure is that it only tells us about the strength of the linear relationship between two variables, in other words how closely the scatterplot aligns along a straight line. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (such as that represented by a wavy line).

If you need more insight into correlation coefficients, you may find it helpful to watch online turorials such as ‘Correlation coefficient intuition’ from the Khan Academy.

As we are dealing with time-series data, it is often more instructive to look at a line plot, as a scatterplot cannot convey how the observations relate to each other in the time dimension. If you were to check the variable types (use str(tempCO2data)), you would see that the data is not yet in time-series format. We could continue with the format as it is, but for plotting purposes it is useful to let R know that we are dealing with time-series data. We therefore apply the ts function as we did in Part 1.1.

tempCO2data$Jun <- ts(tempCO2data$Jun, start=c(1958), end=c(2017), frequency=1) 
tempCO2data$Trend <- ts(tempCO2data$Trend, start=c(1958), end=c(2017), frequency=1) 

Let’s start by plotting the June temperature anomalies.

     ylab="June temperature anomalies", xlab="Year")
title("June temperature anomalies and CO2 emissions")  

Northern hemisphere June temperatures (1958–2016).

Figure 1.9 Northern hemisphere June temperatures (1958–2016).

Typically, when using the plot function we would now only need to add the line for the second variable using the lines command. The issue, however, is that the CO2 emissions variable (Trend) is on a different scale, and the automatic vertical axis scale (from –0.2 to about 1.2) would not allow for the display of Trend. To resolve this issue you can introduce a second vertical axis using the commands below. (Tip: You are unlikely to remember the exact commands required, however you can Google ‘R plot 2 vertical axes’ or similar, and then adjust the code you find.)

par(mar = c(5,5,2,5))  # Create extra margins used for the second axis
     ylab="June temperature anomalies", xlab="Year")
title("June temperature anomalies and CO2 emissions")  
par(new = T)  # This ensures that the next plot goes into the same picture.
plot(tempCO2data$Trend, pch=16, lwd=2, axes=FALSE, xlab=NA, ylab=NA, cex=1.2) # No axis, no labels
axis(side = 4)
mtext(side = 4, line = 3, 'CO2 emissions')
       legend=c("June temp anom", "CO2 emis"),
       lty=c(1,1), col=c("blue", "black"),lwd=2)

CO2 emissions and northern hemisphere June temperatures (1958–2016).

Figure 1.10 CO2 emissions and northern hemisphere June temperatures (1958–2016).

This line graph not only shows how the two series move together in general, but also clearly demonstrates that both series display a clear upward trend over the sample period. This is an important feature of many (not all) time series, and is important for the interpretation (see below).

  1. Choose two months and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.
A direction from cause to effect, establishing that a change in one variable produces a change in another. While a correlation gives an indication of whether two variables move together (either in the same or opposite directions), causation means that there is a mechanism that explains this association. Example: We know that higher levels of CO2 in the atmosphere lead to a greenhouse effect, which warms the Earth’s surface. Therefore we can say that higher CO2 levels are the cause of higher surface temperatures.
spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.

Even though two variables are strongly correlated with each other, this is not necessarily because one variable’s behaviour is the result of the other (a characteristic known as causation). The two variables could be spuriously correlated. The following example illustrates spurious correlation:

A child’s academic performance may be positively correlated with the size of the child’s house or the number of rooms in the house, but could we conclude that building an extra room would make a child smarter, or doing well at school would make someone’s house bigger? It is more plausible that income or wealth, which determines the size of home that a family can afford and the resources available for studying, is the ‘unseen factor’ in this relationship. We could also determine whether income is the reason for this spurious correlation by comparing exam scores for children whose parents have similar incomes but different house sizes. If there is no correlation between exam scores and house size, then we can deduce that house size was not ‘causing’ exam scores (or vice versa).

  1. Consider the example of spurious correlation above.

Find out more What makes some correlations spurious?

In the spurious correlations website given in Question 6(c), most of the examples you will see involve data series that are trending (meaning that they tend to increase or decrease over time). If you calculate a correlation between two series that are trending, you are bound to find a large positive or negative correlation coefficient, even if there is no plausible explanation for a relationship between the two series. For example, ‘per capita cheese consumption’ (which increases over time due to increased disposable incomes or greater availability of cheese) has a correlation coefficient of 0.95 with the ‘number of people who die from becoming tangled in their bedsheets’ (which also increases over time due to a growing population and a growing availability of bedsheets).

The case for our example (the relationship between temperature and CO2 emissions) is slightly different. There is a well-known chemical link between the two. So we understand how CO2 emissions could potentially cause changes in temperature. But in general, do not be tempted to conclude that there is a causal link just because a high correlation coefficient can be seen. Be very cautious when attaching too much meaning to high correlation coefficients when data displays trending behaviour.

This part shows that summary statistics, such as the correlation coefficient, can help identify possible patterns or relationships between variables, but we cannot draw conclusions about causation from them alone. It is also important to think about other explanations for what we see in the data, and whether we would expect there to be a relationship between the two variables.

natural experiment
An empirical study exploiting naturally occurring statistical controls in which researchers do not have the ability to assign participants to treatment and control groups, as is the case in conventional experiments. Instead, differences in law, policy, weather, or other events can offer the opportunity to analyse populations as if they had been part of an experiment. The validity of such studies depends on the premise that the assignment of subjects to the naturally occurring treatment and control groups can be plausibly argued to be random.

However, there are ways to determine whether there is a causal relationship between two variables, for example, by looking at the scientific processes that connect the variables (as with CO2 and temperature anomalies), or by using a natural experiment. To read more about how natural experiments help evaluate whether one variable causes another, see Section 1.5 of Economy, Society, and Public Policy. In Empirical Project 3, we will take a closer look at natural experiments and how we can use them to identify causal links between variables.