1. Measuring climate change: Working in Python

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in Python.

Don’t forget to also download the data into your working directory by following the steps in this project.

Getting started in Python

Head to the ‘Getting started in Python’ page for help and advice on setting up a Python session to work with. Remember, you can run any project as a notebook by downloading the relevant file from this repository and running it on your own computer. Alternatively, you can run pages online in your browser using Binder.

Preliminary settings

At the start of every Python session, you first need to import the extensions (also called packages or libraries) that you’ll need. These packages extend the base language in ways that are helpful. There are literally hundreds of thousands of libraries but you’ll find yourself reaching for the same ones again and again because a core few have most of what you’ll need. Among these core few are pandas for data analysis, numpy for numerical operations, matplotlib and lets_plot for plotting, and pingouin for running statistical tests. Although not in as wide use as the other packages, in some of these projects we’ll use skimpy for creating summary tables of statistics, too.

Let’s import the packages we’ll need and also configure the settings we want:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *

LetsPlot.setup_html(no_js=True)

### You don't need to use these settings yourself,
### they are just here to make the charts look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)

Part 1.1 The behaviour of average surface temperature over time

Learning objectives for this part

  • Use line charts to describe the behaviour of real-world variables over time.

In the questions below, we look at data from NASA about land–ocean temperature anomalies in the northern hemisphere. Figure 1.1 is constructed using this data, and shows temperatures in the northern hemisphere over the period 1880–2016, expressed as differences from the average temperature from 1951 to 1980. We start by creating charts similar to Figure 1.1, in order to visualize the data and spot patterns more easily.

Figure 1.1 Northern hemisphere temperatures (1880–2016).

Before plotting any charts, download the data and make sure you understand how temperature is measured:

  • Go to NASA’s Goddard Institute for Space Studies website.
  • Under the subheading ‘Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies’, select the CSV version of ‘Northern Hemisphere-mean monthly, seasonal, and annual means’ (right-click and select ‘Save Link As…’).
  • The default name of this file is NH.Ts+dSST.csv. Give it a suitable name and save it in an easily accessible location, such as a folder on your Desktop or in your personal folder.
  1. In this dataset, temperature is measured as ‘anomalies’ rather than as absolute temperature. Using NASA’s Frequently Asked Questions section as a reference, explain in your own words what temperature ‘anomalies’ are. Why have researchers chosen this particular measure over other measures (such as absolute temperature)?

Let’s now download the data directly from NASA and load it in Python.

Python walk-through 1.1 Importing the data file into Python

We want to import the data file called ‘NH.Ts+dSST.csv’ from NASA’s website into Python using Visual Studio Code.

We start by opening Visual Studio Code in the folder we’ll be working in. Use File -> Open Folder to do this. We also need to ensure that the interactive Python window that we’ll be using to run Python code opens in this folder. In Visual Studio Code, you can ensure that the interactive window starts in the folder that you have open by setting ‘Jupyter: Notebook File Root’ to ${workspaceFolder} in the Settings menu.

Now create a new file (File -> New File in the menu) and name it exercise_1.py. In the new and empty file, right-click and select ‘Run file in interactive window’. This will launch the interactive Python window.

You will need to install the following packages: pandas, numpy, lets_plot, and matplotlib. pandas, numpy, and matplotlib provide extra functionality for data analysis, numbers, and plotting respectively. lets_plot provides some additional plotting functionality.

We will download the data directly from the internet into our Python session using the pandas library. We imported pandas as pd at the start of the script, so any command from that package is prefixed with pd. followed by the function name.

We’re going to use the pd.read_csv function to read in some data from a URL.

df = pd.read_csv(
    "https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
    skiprows=1,
    na_values="***",
)

When using the read_csv function, we added two keyword arguments. If you open the spreadsheet in Excel, you will see that the real data only starts in Row 2, so we use the skiprows=1 option to skip the first row when importing the data. Looking at the spreadsheet, you can also see that missing temperature data is coded as ***. To ensure that the temperature columns are read in as numbers rather than text, we pass na_values="***", which imports those entries as NaN (not a number), pandas' marker for missing values.

Note that we ended each line with a comma. Arguments that get passed to functions like read_csv need to be separated by commas. You don’t have to include the comma after the last argument, but some people prefer this as a matter of style.

You can also use pd.read_csv to open files that are stored locally on your computer. Instead of a URL, enter a file path to wherever you saved your data (enclosed in quotation marks).
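For example, if you saved the file in your working directory under its default name (adjust the path or filename below to wherever you actually saved your copy), a local read might look like this:

df_local = pd.read_csv(
    "NH.Ts+dSST.csv",  # illustrative: use the path and name you chose when saving the file
    skiprows=1,
    na_values="***",
)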

To check that the data has been imported correctly, you can use the .head() function to view the first five rows of the dataset, and confirm that they correspond to the columns in the CSV file.

df.head()
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON
0 1880 −0.37 −0.53 −0.25 −0.32 −0.08 −0.18 −0.20 −0.28 −0.25 −0.34 −0.45 −0.42 −0.31 NaN NaN −0.22 −0.22 −0.35
1 1881 −0.32 −0.24 −0.06 −0.02 0.01 −0.35 0.06 −0.06 −0.28 −0.46 −0.39 −0.25 −0.20 −0.21 −0.33 −0.02 −0.12 −0.38
2 1882 0.24 0.20 0.00 −0.33 −0.26 −0.32 −0.30 −0.17 −0.26 −0.54 −0.35 −0.69 −0.23 −0.20 0.06 −0.20 −0.26 −0.38
3 1883 −0.59 −0.69 −0.17 −0.31 −0.26 −0.15 −0.05 −0.24 −0.34 −0.17 −0.43 −0.16 −0.30 −0.34 −0.66 −0.25 −0.15 −0.31
4 1884 −0.18 −0.11 −0.64 −0.61 −0.38 −0.44 −0.41 −0.52 −0.46 −0.46 −0.59 −0.49 −0.44 −0.42 −0.15 −0.54 −0.46 −0.51

Before working with this data, we use the .info() function to check that the data was read in as numbers (either real numbers, float64, or integers, int) rather than strings.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    144 non-null    int64  
 1   Jan     144 non-null    float64
 2   Feb     144 non-null    float64
 3   Mar     144 non-null    float64
 4   Apr     144 non-null    float64
 5   May     144 non-null    float64
 6   Jun     144 non-null    float64
 7   Jul     144 non-null    float64
 8   Aug     143 non-null    float64
 9   Sep     143 non-null    float64
 10  Oct     143 non-null    float64
 11  Nov     143 non-null    float64
 12  Dec     143 non-null    float64
 13  J-D     143 non-null    float64
 14  D-N     142 non-null    float64
 15  DJF     143 non-null    float64
 16  MAM     144 non-null    float64
 17  JJA     143 non-null    float64
 18  SON     143 non-null    float64
dtypes: float64(18), int64(1)
memory usage: 21.5 KB

You can see that all variables are formatted as either float64 or int, so Python correctly recognizes that these data items are numbers.

Try importing the data again without using the keyword argument option na_values="***" at all and see what difference it makes.
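If you want to see the effect directly, a quick sketch is below: without na_values, any column containing *** is read in as text (dtype object) rather than as numbers.

# Sketch: re-import without na_values and compare the column types
df_raw = pd.read_csv(
    "https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
    skiprows=1,
)
print(df_raw.dtypes)  # columns containing "***" now appear as 'object' instead of 'float64'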

Now create some line charts using monthly, seasonal, and annual data, which help us look for general patterns over time.

  2. Choose one month and plot a line chart with average temperature anomaly on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis. Label each axis appropriately and give your chart a suitable title (refer to Figure 1.1 as an example).

Python walk-through 1.2 Drawing a line chart of temperature and time

We will now set the year as the index of the dataset. (This will make plotting the time series of temperature easier because the index can automatically take on the role of the horizontal axis.)

df = df.set_index("Year")
df.head()
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON
Year
1880 −0.37 −0.53 −0.25 −0.32 −0.08 −0.18 −0.20 −0.28 −0.25 −0.34 −0.45 −0.42 −0.31 NaN NaN −0.22 −0.22 −0.35
1881 −0.32 −0.24 −0.06 −0.02 0.01 −0.35 0.06 −0.06 −0.28 −0.46 −0.39 −0.25 −0.20 −0.21 −0.33 −0.02 −0.12 −0.38
1882 0.24 0.20 0.00 −0.33 −0.26 −0.32 −0.30 −0.17 −0.26 −0.54 −0.35 −0.69 −0.23 −0.20 0.06 −0.20 −0.26 −0.38
1883 −0.59 −0.69 −0.17 −0.31 −0.26 −0.15 −0.05 −0.24 −0.34 −0.17 −0.43 −0.16 −0.30 −0.34 −0.66 −0.25 −0.15 −0.31
1884 −0.18 −0.11 −0.64 −0.61 −0.38 −0.44 −0.41 −0.52 −0.46 −0.46 −0.59 −0.49 −0.44 −0.42 −0.15 −0.54 −0.46 −0.51
df.tail()
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec J-D D-N DJF MAM JJA SON
Year
2019 1.19 1.11 1.54 1.24 0.98 1.18 1.03 1.08 1.20 1.30 1.19 1.38 1.20 1.18 1.13 1.25 1.10 1.23
2020 1.58 1.70 1.65 1.39 1.27 1.12 1.09 1.12 1.19 1.21 1.60 1.22 1.35 1.36 1.55 1.44 1.11 1.33
2021 1.25 0.95 1.19 1.12 1.03 1.20 1.06 1.02 1.05 1.31 1.30 1.13 1.13 1.14 1.14 1.11 1.09 1.22
2022 1.24 1.16 1.42 1.08 1.00 1.12 1.05 1.16 1.15 1.31 1.09 1.06 1.15 1.16 1.18 1.17 1.11 1.18
2023 1.28 1.31 1.59 1.01 1.10 1.18 1.43 NaN NaN NaN NaN NaN NaN NaN 1.22 1.24 NaN NaN

The way we’ve set up our dataframe, the year variable is special because it forms the index. This has a lot of consequences for the operations we can do on the dataframe; for example, if we do whole-dataframe operations like df+20, 20 will be added to every value except the values (here years) in the index.
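Here is a tiny illustration of that behaviour, using a couple of made-up rows rather than the project data:

# Made-up example: arithmetic applies to the values, not to the index
demo = pd.DataFrame({"value": [1.0, 2.0]}, index=pd.Index([1880, 1881], name="Year"))
print(demo + 20)          # the values become 21.0 and 22.0
print((demo + 20).index)  # the years in the index are unchanged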

Next we’ll make our chart. We will draw a line chart using data for January, df["Jan"] for the years 1880–2016.

We’ll use the matplotlib package for this (which we imported at the start as plt). A typical matplotlib line chart has the following elements:

# create a figure (a bit like a canvas) and an axis (ax)
# onto which to put chart elements
fig, ax = plt.subplots()
# select the column to use 'plot' on, and pass the 'ax' object
# note that the horizontal axis is given by the index of the dataframe
df["column_name"].plot(ax=ax)
# set the labels and title
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
# show the plot
plt.show()

Let’s see an example of this with the data for January.

fig, ax = plt.subplots()
df["Jan"].plot(ax=ax)
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
plt.show()
Figure 1.2 Example chart of northern hemisphere temperature anomalies for January.

You’ll see that we didn’t have to say to use years as the horizontal axis. That’s because the years are in the index, and, when we selected a single column of values representing January, pandas inferred that we wanted to use the index as the horizontal axis. Try out plotting:

fig, ax = plt.subplots()
ax.plot(df.index, df["Jan"])
ax.set_ylabel("y label")
ax.set_xlabel("x label")
ax.set_title("title")
plt.show()

You should find that you get exactly the same chart! But this time, we made it explicit that we wanted the horizontal axis to be drawn from the index.

Note that a lot of what we can do on the chart can be achieved by calling ax.[something] and that we finish by showing the chart with plt.show(). (You can also save charts to file with plt.savefig("nameofchart.pdf").)

Now, we’ll do a nicer version of the chart. Rather than ‘hard-code’ the month in, we’ll abstract the specific month into a ‘month’ variable. That way, the code can easily be re-used for any month.

We'll add a horizontal line at zero to make the chart easier to read, using ax.axhline. Because zero represents the 1951–1980 average (the baseline for the anomalies), we'll add some text annotation with ax.annotate. We specify this by passing in x and y values for where the text appears. For the x position, it's convenient to place the text two-thirds of the way across the figure regardless of how zoomed-in the chart is, so we pass 'figure fraction' for the units of the x-coordinate; for the y-coordinate of the text we just use data units.

The next line plots January’s data—but note we don’t specify the x-axis for the data because that’s in our dataframe’s index.

Then we’re onto labelling and titling the chart. We can make use of something here called an f-string. This allows us to pass Python variables into a string. In this case, we made month a variable but we’d like to put it in the title. The simplest f-string that does this would be f"Average temperature anomaly in {month}..." which would result in a title of ‘Average temperature anomaly in Jan…’ on the chart. You can pass as many variables into the string as you like, and we can use this to automatically retrieve the most recent year in the data via df.index.max()—this way, if we update our data, the chart title is automatically updated too!
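If f-strings are new to you, here is a quick standalone illustration (the values below are just for demonstration):

month = "Jan"
latest_year = 2016  # illustrative; in the walk-through below this comes from df.index.max()
print(f"Average temperature anomaly in {month} up to {latest_year}")
# prints: Average temperature anomaly in Jan up to 2016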

month = "Jan"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.66, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average temperature anomaly in {month} \n in the northern hemisphere (1880—{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");
Figure 1.3 Average temperature anomaly in January in the northern hemisphere (1880–2023).

Try different values for color, xy, and the first argument of ax.axhline to figure out what these options do. (Note that xycoords sets how the xy coordinates are interpreted.)

It is important to remember that all axis and chart titles should be enclosed in quotation marks (""), as well as any words that are not options (for example, colour names or filenames).

  3. Extra practice: The columns labelled DJF, MAM, JJA, and SON contain seasonal averages (means). For example, the MAM column contains the average of the March, April, and May columns for each year. Plot a separate line chart for each season, using average temperature anomaly for that season on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis.
  4. The column labelled J-D contains the average temperature anomaly for each year.
  • Plot a line chart with annual average temperature anomaly on the vertical axis and time (from 1880 to the latest year available) on the horizontal axis. Your chart should look like Figure 1.1.
    Extension: Add a horizontal line that intersects the vertical axis at 0, and label it ‘1951–1980 average’.
  • What do your charts from Questions 2 to 4(a) suggest about the relationship between temperature and time?

Python walk-through 1.3 Producing a line chart for the annual temperature anomalies

This is where the power of programming languages becomes evident: to produce the same line chart for a different variable, we simply take the code used in Python walk-through 1.2 and replace the "Jan" with the name for the annual variable ("J-D"). We don’t need to change the rest of the chart because we created it to be flexible. (We selected the month column but we changed the value of month.)

month = "J-D"
fig, ax = plt.subplots()
ax.axhline(0, color="orange")
ax.annotate("1951—1980 average", xy=(0.68, -0.2), xycoords=("figure fraction", "data"))
df[month].plot(ax=ax)
ax.set_title(
    f"Average annual temperature anomaly in \n in the northern hemisphere (1880—{df.index.max()})"
)
ax.set_ylabel("Annual temperature anomalies");
Figure 1.4 Average annual temperature anomaly in the northern hemisphere (1880–2023).

  5. You now have charts for three different time intervals: month (Question 2), season (Question 3), and year (Question 4). For each time interval, discuss what we can learn about patterns in temperature over time that we might not be able to learn from the charts of other time intervals.
  6. Compare your chart from Question 4 to Figure 1.5, which also shows the behaviour of temperature over time using data taken from the National Academy of Sciences.
  • Discuss the similarities and differences between the charts. (For example, are the horizontal and vertical axes variables the same, or do the lines have the same shape?)
  • Looking at the behaviour of temperature over time from 1000 to 1900 in Figure 1.5, are the observed patterns in your chart unusual?
  • Based on your answers to Questions 4 and 5, do you think the government should be concerned about climate change?
Figure 1.5 Northern hemisphere temperatures over the long run (1000–2006).

Part 1.2 Variation in temperature over time

Learning objectives for this part

  • Summarize data in a frequency table, and visualize distributions with column charts.
  • Describe a distribution using mean and variance.

Aside from changes in the mean temperature, the government is also worried that climate change will result in more frequent extreme weather events. The island has experienced a few major storms and severe heat waves in the past, both of which caused serious damage and disruption to economic activity.

Will weather become more extreme and vary more as a result of climate change? A New York Times article uses the same temperature dataset you have been using to investigate the distribution of temperatures and temperature variability over time. Read through the article, paying close attention to the descriptions of the temperature distributions.

We can use the mean and median to describe distributions, and we can use deciles to describe parts of distributions. To visualize distributions, we can use column charts (sometimes referred to as frequency histograms). We are now going to create similar charts of temperature distributions to the ones in the New York Times article, and look at different ways of summarizing distributions.

frequency table
A record of how many observations in a dataset have a particular value, range of values, or belong to a particular category.

In order to create a column chart using the temperature data we have, we first need to summarize the data using a frequency table. Instead of using deciles to group the data, we use intervals of 0.05, so that temperature anomalies with a value from −0.3 to −0.25 will be in one group, a value greater than −0.25 and up to −0.2 in another group, and so on. The frequency table shows us how many values belong to a particular group.
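One way to build such a frequency table in pandas is sketched below (this assumes df is indexed by Year, as set up in Python walk-through 1.2; Python walk-through 1.4 shows the approach used for the charts in this project):

# Sketch: count summer (Jun-Aug) anomalies in 0.05-wide bins for 1951-1980
summer = df.loc[(df.index >= 1951) & (df.index <= 1980), ["Jun", "Jul", "Aug"]].stack()
bin_edges = np.round(np.arange(-0.30, 1.06, 0.05), 2)  # bin edges -0.30, -0.25, ..., 1.05
frequency_table = pd.cut(summer, bins=bin_edges).value_counts().sort_index()
print(frequency_table)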

  1. Using the monthly data for June, July, and August, create two frequency tables similar to Figure 1.6 for the years 1951–1980 and 1981–2010, respectively. The values in the first column should range from −0.3 to 1.05, in intervals of 0.05. See Python walk-through 1.4 for how to do this.
Range of temperature anomaly (T)   Frequency
−0.30
−0.25
…
 1.00
 1.05

Figure 1.6 A frequency table.

  2. Using the frequency tables from Question 1:
  • Plot two separate column charts (frequency histograms) for 1951–1980 and 1981–2010 to show the distribution of temperatures, with frequency on the vertical axis and the range of temperature anomaly on the horizontal axis. Your charts should look similar to those in the New York Times article.
  • Using your charts, describe the similarities and differences (if any) between the distributions of temperature anomalies in 1951–1980 and 1981–2010.

Python walk-through 1.4 Creating frequency tables and histograms

Since we will be looking at data from different sub-periods (year intervals) separately, we will create a categorical variable (a variable that has two or more categories) that indicates the sub-period for each observation (row). In Python, this type of variable is called a ‘category’ or ‘categorical variable’. When we create a categorical column, we need to define the categories that this variable can take.

We’ll achieve this using the pd.cut function, which arranges input data into a series of bins that can have labels. We’ll give the data labels that reflect what period they correspond to here, and we’ll also specify that there is an order for these categories.

df["Period"] = pd.cut(
    df.index,
    bins=[1921, 1950, 1980, 2010],
    labels=["1921—1950", "1951—1980", "1981—2010"],
    ordered=True,
)

We created a new variable called "Period" and defined the possible categories using the labels= keyword argument. Since we will not be using data for some years (before 1921 and after 2010), we want "Period" to take the value NaN (not a number) for these observations (rows), and the appropriate category for all the other observations. The pd.cut function does this automatically.
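To see this on a tiny made-up example (the years below are chosen just for illustration):

# Years outside the bin edges come out as NaN; the rest get the matching label
example_years = pd.Series([1900, 1935, 1965, 1995, 2015])
print(
    pd.cut(
        example_years,
        bins=[1921, 1950, 1980, 2010],
        labels=["1921—1950", "1951—1980", "1981—2010"],
        ordered=True,
    )
)
# 1900 and 2015 fall outside the bins, so they appear as NaN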

Let’s take a look at the last 20 entries of the new column of data using .tail:

df["Period"].tail(20)
Year
2004    1981—2010
2005    1981—2010
2006    1981—2010
2007    1981—2010
2008    1981—2010
2009    1981—2010
2010    1981—2010
2011          NaN
2012          NaN
2013          NaN
2014          NaN
2015          NaN
2016          NaN
2017          NaN
2018          NaN
2019          NaN
2020          NaN
2021          NaN
2022          NaN
2023          NaN
Name: Period, dtype: category
Categories (3, object): ['1921—1950' < '1951—1980' < '1981—2010']

We’d really like to combine the data from the three summer months. This is easy to do using the .stack function. Let’s look at the first few rows of the data once stacked using .head():

list_of_months = ["Jun", "Jul", "Aug"]
df[list_of_months].stack().head()
Year     
1880  Jun   -0.18
      Jul   -0.20
      Aug   -0.28
1881  Jun   -0.35
      Jul    0.06
dtype: float64

Now we need to think about how we can plot the three different periods. matplotlib has plenty of ways to do this; one of the easiest is to ask for more than one axis object to put plots on. So, in the code below, we ask for ncols=3, and this returns a set of axes (axes) instead of just one axis called ax. axes behaves like a list, so we can pick out each individual plot from it. To iterate over the axes and the periods together, we use the zip function, which works exactly like a zipper: it brings together one entry from each sequence in turn, in this case one axis and one period. We use this to plot the histogram data one axis at a time in the zipped for loop. Within the loop, the data is filtered to just the period we want, using == period, and to the summer months, using list_of_months.

Finally we set an overall title and a single horizontal axis label (on the middle chart).

fig, axes = plt.subplots(ncols=3, figsize=(9, 4), sharex=True, sharey=True)
for ax, period in zip(axes, df["Period"].dropna().unique()):
    df.loc[df["Period"] == period, list_of_months].stack().hist(ax=ax)
    ax.set_title(period)
plt.suptitle("Histogram of temperature anomalies")
axes[1].set_xlabel("Summer temperature distribution")
plt.tight_layout();
Figure 1.7 Histogram of temperature anomalies.

To explain what a histogram displays, let’s take a look at the histogram for the period from 1921 to 1950. On the horizontal axis, we have a number of bins, for example 0 to 0.1, 0.1 to 0.2, and so on. The height of the bar over each interval represents the count of the number of anomalies that fall in the interval. The bar with the greatest height indicates the most frequently encountered temperature interval.

As you can see, for the 1921–1950 period there are virtually no temperature anomalies larger than 0.3. The heights of the bars give a useful overview of the distribution of the temperature anomalies.

Now consider how this distribution changes as we move through the three distinct time periods. The distribution is clearly moving to the right for the period 1981–2010, which is an indication that the temperature is increasing; in other words, an indication of global warming.

variance
A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 221,334 (large spread).

Now we will use our data to look at different aspects of distributions. Firstly, we will learn how to use deciles to determine which observations are ‘normal’ and ‘abnormal’, and then learn how to use variance to describe the shape of a distribution.

  3. The New York Times article considers the bottom third (the lowest or coldest one-third) of temperature anomalies in 1951–1980 as ‘cold’ and the top third (the highest or hottest one-third) of anomalies as ‘hot’. In decile terms, temperatures in the 1st to 3rd deciles are ‘cold’ and temperatures in the 7th to 10th deciles are ‘hot’ (rounded to the nearest decile). Use Python and numpy’s np.quantile function to determine which values correspond to the 3rd and 7th deciles across all months in 1951–1980. (See Python walk-through 1.5 for an example.)

Python walk-through 1.5 Using the np.quantile function

First, we need to create a variable that contains all monthly anomalies in the years 1951–1980. Then, we’ll use numpy’s np.quantile function to find the required percentiles (0.3 and 0.7 refer to the 3rd and 7th deciles, respectively).

To help us create a variable that encodes this information, we’re going to filter our pandas dataframe. We want to filter it both by the index and by only the single-month columns. We can do both of these at once using the .loc method, which works like this: df.loc[[rows you would like], [columns you would like]]. For rows, we’re going to ask for everything in between two year limits. For columns, we can do something called ‘slicing’, which selects every column within a given range in the format "first column you want":"last column you want".

Note: You may get slightly different values to those shown here if you are using the latest data.

# Create a variable that has years 1951 to 1980, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1951) & (df.index <= 1980), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at this data:
temp_all_months
Year month values
0 1951 Jan −0.36
1 1951 Feb −0.52
2 1951 Mar −0.18
3 1951 Apr 0.06
4 1951 May 0.17
... ... ... ...
355 1980 Aug 0.09
356 1980 Sep 0.10
357 1980 Oct 0.12
358 1980 Nov 0.20
359 1980 Dec 0.09
quantiles = [0.3, 0.7]
list_of_percentiles = np.quantile(temp_all_months["values"], q=quantiles)

print(f"The cold threshold of {quantiles[0]*100}% is {list_of_percentiles[0]}")
print(f"The hot threshold of {quantiles[1]*100}% is {list_of_percentiles[1]}")
The cold threshold of 30.0% is –0.1
The hot threshold of 70.0% is 0.1

After we have filled it, the variable list_of_percentiles (which is a list) contains two numbers, the 30th percentile (first value) and the 70th percentile (second value). When we print out these values using f-strings, we want to access the first and second values for the cold and hot thresholds, respectively. Python indexes lists and arrays from zero (not from 1!), so, to access the first entry, we use list_of_percentiles[0].

It may seem odd to index from zero, but many programming languages do this. Some programmers think it’s better, while some think it’s worse.
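A tiny illustration of zero-based indexing:

# Zero-based indexing: the first element is at position 0
letters = ["a", "b", "c"]
print(letters[0])  # 'a', the first entry
print(letters[2])  # 'c', the third (and last) entry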

  4. Based on the values you found in Question 3, count the number of anomalies that are considered ‘hot’ in the years 1981–2010, and express this as a percentage of all the temperature observations in that period. Does your answer suggest that we are experiencing hotter weather more frequently in this timeframe? (Remember that each decile represents 10% of observations, so 30% of temperatures were considered ‘hot’ in the years 1951–1980.)

Python walk-through 1.6 Computing the proportion of anomalies at a given quantile using the .mean() function

Note: You may get slightly different values to those shown here if you are using the latest data.

We repeat the steps used in Python walk-through 1.5, now looking at monthly anomalies in the years 1981–2010. We can simply change the year values in the code from Python walk-through 1.5.

# Create a variable that has years 1981 to 2010, and months Jan to Dec (inclusive)
temp_all_months = df.loc[(df.index >= 1981) & (df.index <= 2010), "Jan":"Dec"]
# Put all the data in stacked format and give the new columns sensible names
temp_all_months = (
    temp_all_months.stack()
    .reset_index()
    .rename(columns={"level_1": "month", 0: "values"})
)
# Take a look at the start of this data:
temp_all_months.head()
Year month values
0 1981 Jan 0.79
1 1981 Feb 0.63
2 1981 Mar 0.67
3 1981 Apr 0.39
4 1981 May 0.18
dummy variable (indicator variable)
A variable that takes the value 1 if a certain condition is met, and 0 otherwise.

Now that we have all the monthly data for the years 1981–2010, we want to count the proportion of observations that are smaller than –0.1. We’ll first create an indicator variable (i.e., it’s either True or False) that says, for each row (observation) in temp_all_months, whether the number is lower than the 0.3 quantile or not (given by list_of_percentiles[0]). Then we’ll take the mean of this list of True and False values; when you take the mean of indicator variables, each True evaluates to 1 and each False to 0, so the mean gives us the proportion of entries in temp_all_months that are lower than the 0.3 quantile:

entries_less_than_q30 = temp_all_months["values"] < list_of_percentiles[0]
proportion_under_q30 = entries_less_than_q30.mean()
print(
    f"The proportion under {list_of_percentiles[0]} is {proportion_under_q30*100:.2f}%"
)
The proportion under –0.1 is 1.94%

When we printed out the answer, we used some number formatting. This is written as :.2f within the curly brackets part of an f-string—this tells Python to display the number with two decimal places. You should also note that, as well as the mean given by .mean(), there are various other built-in functions like .std() for the standard deviation and .var() for the variance.
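For example, a short sketch applying these to the 1981–2010 monthly anomalies we have just built (the exact numbers depend on the data vintage, so they are not shown here):

# Built-in summary methods on a pandas column
print(temp_all_months["values"].mean())  # average anomaly, 1981-2010
print(temp_all_months["values"].std())   # standard deviation
print(temp_all_months["values"].var())   # variance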

Now we can see that between 1951 and 1980, 30% of observations for the temperature anomaly were smaller than −0.10, but between 1981 and 2010 only about two per cent of the months are considered cold. That is a large change.

Let’s check whether we get a similar result for the number of observations that are larger than 0.1.

proportion_over_q70 = (temp_all_months["values"] > list_of_percentiles[1]).mean()
print(f"The proportion over {list_of_percentiles[1]} is {proportion_over_q70*100:.2f}%")
The proportion over 0.1 is 84.72%

  5. The New York Times article discusses whether temperatures have become more variable over time. One way to measure temperature variability is by calculating the variance of the temperature distribution. For each season (DJF, MAM, JJA, and SON):
  • Calculate the mean (average) and variance separately for the following time periods: 1921–1950, 1951–1980, and 1981–2010.
  • For each season, compare the variances in different periods, and explain whether or not temperature appears to be more variable in later periods.

Python walk-through 1.7 Calculating and understanding mean and variance

The process of computing the mean and variance separately for each period and season would be quite tedious. We would prefer a way to cover all of them at once. Let’s re-stack the data into a form where season is one of the columns and can take the values DJF, MAM, JJA, or SON. Let’s also check the structure of the data while we’re at it:

temp_all_months = (
    df.loc[:, "DJF":"SON"]
    .stack()
    .reset_index()
    .rename(columns={"level_1": "Season", 0: "Values"})
)
temp_all_months["Period"] = pd.cut(
    temp_all_months["Year"],
    bins=[1921, 1950, 1980, 2010],
    labels=["1921—1950", "1951—1980", "1981—2010"],
    ordered=True,
)
# Take a look at a cut of the data using `.iloc`, which provides position
temp_all_months.iloc[-135:-125]
Year Season Values Period
438 1989 SON 0.20 1981–2010
439 1990 DJF 0.45 1981–2010
440 1990 MAM 0.79 1981–2010
441 1990 JJA 0.40 1981–2010
442 1990 SON 0.45 1981–2010
443 1991 DJF 0.51 1981–2010
444 1991 MAM 0.45 1981–2010
445 1991 JJA 0.42 1981–2010
446 1991 SON 0.32 1981–2010
447 1992 DJF 0.44 1981–2010

Now we’ll take the mean and variance at once.

The following line of code does a lot of things at the same time. First we allocate each observation (each combination of Year, Season, Period, and temperature value) to one of 12 groups defined by the different possible combinations of season and period. Think of these twelve groups as each having a label like ‘MAM; 1981–2010’ or ‘JJA; 1921–1950’. This is grouping our data, and we do it with a groupby operation, to which we pass a list of the variables we’d like to group by; here that will be "Season" and "Period".

The variable we’d like to apply the grouping to is "Values" so we then filter down to just the "Values" column with ["Values"].

Then, once we have allocated each observation in the "Values" column to one of these groups, we ask pandas to calculate the mean and variance of the observations of each of the groups. That is what the agg([np.mean, np.var]) function does.

grp_mean_var = temp_all_months.groupby(["Season", "Period"])["Values"].agg(
    [np.mean, np.var]
)
grp_mean_var
Season Period mean var
DJF 1921–1950 −0.030000 0.057650
1951–1980 −0.002333 0.050494
1981–2010 0.524333 0.079198
JJA 1921–1950 −0.057931 0.021846
1951–1980 0.000333 0.014527
1981–2010 0.399667 0.067686
MAM 1921–1950 −0.046207 0.032196
1951–1980 0.000333 0.025217
1981–2010 0.508667 0.075433
SON 1921–1950 0.077931 0.027881
1951–1980 −0.001000 0.026133
1981–2010 0.428333 0.110883

The variances remain fairly constant across the first two periods, but they increase markedly for the 1981–2010 period.

We can plot a line chart to see these changes graphically—to do that in the most convenient way, we’re going to make use of a different plotting library that works really well with data that is arranged in this format where each row is an observation, each column is a variable, and each entry is a value. This is called ‘tidy’ data.

There are broadly two categories of approach to using code to create data visualisations: imperative, where you build what you want, and declarative, where you say what you want. Choosing which to use involves a trade-off: imperative libraries (like matplotlib) offer you flexibility, but at the cost of some verbosity; declarative libraries offer you a quick way to plot your data, but only if it’s in the right format to begin with (the tidy format!), and customization to special chart types is more difficult. Python has many excellent plotting packages, but for these projects we recommend a tidy data-friendly declarative library called lets_plot for the vast majority of charts you might want to make and then matplotlib when you need something much more customized.

The lets_plot plotting library follows what is called a grammar of graphics approach. You don’t need to worry too much about what that means for now, only that it is a coherent declarative system for describing and building graphs. There is some background information that you might find useful in getting to grips with lets_plot. All plots are composed of the data, the information you want to visualize, and a mapping: the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components:

  • A layer is a collection of geometric elements and statistical transformations. Geometric elements, or geoms for short, represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations, or stats for short, summarize the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
  • Scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape, or size. Scales also draw the legend and axes.
  • A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
  • A facet specifies how to break up and display subsets of data as small multiples.
  • A theme controls the finer points of display, like the font size and background colour. While the defaults have been chosen with care, you may need to consult other references to create an attractive plot.

At the start of this project, we imported lets_plot and set it up using LetsPlot.setup_html(no_js=True), so we should be ready to use it!

min_year = 1880
(
    ggplot(temp_all_months, aes(x="Year", y="Values", color="Season"))
    + geom_abline(slope=0, color="black", size=1)
    + geom_line(size=1)
    + labs(
        title=f"Average annual temperature anomaly in \n in the northern hemisphere ({min_year}—{temp_all_months['Year'].max()})",
        y="Annual temperature anomalies",
    )
    + scale_x_continuous(format="d")
    + geom_text(
        x=min_year, y=0.1, label="1951—1980 average", hjust="left", color="black"
    )
)
Figure 1.8 Average annual temperature anomaly in the northern hemisphere, by season.

Let’s talk through what each part of the code did here:

  • ggplot took the data and a second argument, the aes (short for aesthetic mappings) function.
  • In aes, we passed the mappings we wanted: the year along the horizontal axis, the values column on the vertical axis, and colour to distinguish between different seasons (via color="Season").
  • geom_abline adds a line from a to b, in this case just along the horizontal axis at y = 0.
  • geom_text adds the text annotation.
  • geom_line adds a line for each season. Because we already said color="Season" earlier, we actually get as many lines as there are seasons in our data.
  • labs sets the labels.
  • scale_x_continuous(format="d") tells lets_plot that the horizontal axis is a continuous (rather than discrete) scale, and that the format should just be a number with no decimal points and no commas (which is what we want for displaying years).
  6. Using the findings of the New York Times article and your answers to Questions 1 to 5, discuss whether temperature appears to be more variable over time. Would you advise the government to spend more money on mitigating the effects of extreme weather events?

Part 1.3 Carbon emissions and the environment

Learning objectives for this part

  • Use scatterplots and the correlation coefficient to assess the degree of association between two variables.
  • Explain what correlation measures and the limitations of correlation.
correlation
A measure of how closely related two variables are. Two variables are correlated if knowing the value of one variable provides information on the likely value of the other, for example high values of one variable being commonly observed along with high values of the other variable. Correlation can be positive or negative. It is negative when high values of one variable are observed with low values of the other. Correlation does not mean that there is a causal relationship between the variables. Example: When the weather is hotter, purchases of ice cream are higher. Temperature and ice cream sales are positively correlated. On the other hand, if purchases of hot beverages decrease when the weather is hotter, we say that temperature and hot beverage sales are negatively correlated.
correlation coefficient
A numerical measure, ranging between 1 and −1, of how closely associated two variables are—whether they tend to rise and fall together, or move in opposite directions. A positive coefficient indicates that when one variable takes a high (low) value, the other tends to be high (low) too, and a negative coefficient indicates that when one variable is high the other is likely to be low. A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.

The government has heard that carbon emissions could be responsible for climate change, and has asked you to investigate whether this is the case. To do so, we are now going to look at carbon emissions over time, and use another type of chart, the scatterplot, to show their relationship to temperature anomalies. One way to measure the relationship between two variables is correlation. Python walk-through 1.8 explains what correlation is and how to calculate it in Python.

In the questions below, we will make charts using the CO2 data from the US National Oceanic and Atmospheric Administration. Download the Excel spreadsheet containing this data. Save the data as a CSV file in a sub-directory named ‘data’ of the folder you have open in Visual Studio Code. Import the CSV file that’s now at ‘data/1_CO2-data.csv’ into Python using the pd.read_csv function; the code might look like df_co2 = pd.read_csv("data/1_CO2-data.csv").

  1. The CO2 data was recorded from one observatory in Mauna Loa. Using an Earth System Research Laboratory article as a reference, explain whether or not you think this data is a reliable representation of the global atmosphere.
  2. The variables trend and interpolated are similar, but not identical. In your own words, explain the difference between these two measures of CO2 levels. Why might there be seasonal variation in CO2 levels?

Now we will use a line chart to look for general patterns over time.

  3. Plot a line chart with interpolated and trend CO2 levels on the vertical axis and time (starting from January 1960) on the horizontal axis. Label the axes and the chart legend, and give your chart an appropriate title. What does this chart suggest about the relationship between CO2 and time?

We will now combine the CO2 data with the temperature data from Part 1.1, and then examine the relationship between these two variables visually using scatterplots, and statistically using the correlation coefficient. If you have not yet downloaded the temperature data, follow the instructions in Part 1.1.

  4. Choose one month and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year.
  • Make a scatterplot of CO2 level on the vertical axis and temperature anomaly on the horizontal axis.
  • Calculate and interpret the (Pearson) correlation coefficient between these two variables.
  • Discuss the shortcomings of using this coefficient to summarize the relationship between variables.

Python walk-through 1.8 Scatterplots and the correlation coefficient

First we will use the pd.read_csv function to import the CO2 data file into Python, and call it df_co2.

df_co2 = pd.read_csv("data/1_CO2-data.csv")
df_co2.head()
Year Month Monthly average Interpolated Trend
0 1958 3 315.71 315.71 314.62
1 1958 4 317.45 317.45 315.29
2 1958 5 317.50 317.50 314.71
3 1958 6 −99.99 317.10 314.85
4 1958 7 315.86 315.86 314.98

This file has monthly data, but in contrast to the data in df from earlier, the data is all in so-called ‘tidy’ format (one observation per row, one column per variable). To make this task easier, we will pick only the June data from the CO2 emissions and add them as an additional variable to the df dataset.

Python’s pandas package has a convenient function called merge to do this. First we create a new dataset that contains only the June emissions data (df_co2_june).

df_co2_june = df_co2.loc[df_co2["Month"] == 6]
df_co2_june.head()
Year Month Monthly average Interpolated Trend
3 1958 6 −99.99 317.10 314.85
15 1959 6 318.15 318.15 315.92
27 1960 6 319.59 319.59 317.36
39 1961 6 319.77 319.77 317.48
51 1962 6 320.55 320.55 318.27

Then we use this data in the pd.merge function. The merge function takes the original df and the df_co2_june and merges (combines) them together. To merge, we need some commonality between the columns (or indexes) in the dataframe. In this case, the common variable is "Year", so we will use that to merge the data.

(Extension: Hover your cursor over pd.merge in Visual Studio Code, type help(pd.merge) into the interactive window, or Google ‘pandas merge’ to see the many other options that pd.merge allows.)
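As a sketch of one of those options, the how= keyword argument controls what happens to years that appear in only one of the two dataframes (the default, 'inner', keeps only years present in both):

# 'left' keeps every row of df_co2_june even if a year has no temperature data;
# 'inner' (the default) keeps only years present in both dataframes
df_left = pd.merge(df_co2_june, df, on="Year", how="left")
df_inner = pd.merge(df_co2_june, df, on="Year", how="inner")
print(len(df_left), len(df_inner))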

df_temp_co2 = pd.merge(df_co2_june, df, on="Year")
df_temp_co2[["Year", "Jun", "Trend"]].head()
Year Jun Trend
0 1958 0.05 314.85
1 1959 0.14 315.92
2 1960 0.18 317.36
3 1961 0.18 317.48
4 1962 −0.13 318.27

It looks like it worked! We now have some extra columns from the CO2 data in addition to the temperature anomaly data from before.

To make a scatterplot, we use lets_plot again:

(
    ggplot(df_temp_co2, aes(x="Jun", y="Trend"))
    + geom_point(color="black", size=3)
    + labs(
        title="Scatterplot of temperature anomalies vs carbon dioxide emissions",
        y="Carbon dioxide levels (trend, mole fraction)",
        x="Temperature anomaly (degrees Celsius)",
    )
)
Figure 1.9 Scatterplot of temperature anomalies vs carbon dioxide emissions.

To calculate the correlation coefficient, we can use the .corr() function. Note: you may get slightly different results if you are using the latest data.

df_temp_co2[["Jun", "Trend"]].corr(method="pearson")
Jun Trend
Jun 1.000000 0.915272
Trend 0.915272 1.000000

In this case, the correlation coefficient tells us that an upward-sloping straight line is quite a good fit to the data (as seen on the scatterplot). There is a strong positive association between the two variables (higher temperature anomalies are associated with higher CO2 levels).

One limitation of this correlation measure is that it only tells us about the strength of the upward- or downward-sloping linear relationship between two variables; in other words, how closely the scatterplot aligns along an upward- or downward-sloping straight line. The correlation coefficient cannot tell us if the two variables have a different kind of relationship (such as that represented by a wavy line).
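Here is a small made-up illustration of that limitation: a perfect U-shaped (quadratic) relationship gives a Pearson correlation close to zero, even though the two variables are clearly related.

# Made-up example: a strong but non-linear relationship
x = pd.Series(np.arange(-10, 11))
y = x**2  # perfect quadratic (U-shaped) relationship
print(x.corr(y))  # approximately 0: the Pearson coefficient misses this pattern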

Note: The word ‘strong’ is often used for coefficients that are close to 1 or −1, and ‘weak’ is often used for coefficients that are close to 0, though there is no precise range of values that are considered ‘strong’ or ‘weak’.

If you need more insight into correlation coefficients, you may find it helpful to watch online tutorials such as ‘Correlation coefficient intuition’ from the Khan Academy.

As we are dealing with time-series data, it is often more instructive to look at a line plot, as a scatterplot cannot convey how the observations relate to each other in the time dimension.

Let’s start by plotting the June temperature anomalies.

(
    ggplot(df_temp_co2, aes(x="Year", y="Jun"))
    + geom_line(size=1)
    + labs(
        title="June temperature anomalies",
    )
    + scale_x_continuous(format="d")
)
Figure 1.10 June temperature anomalies.

Typically, we would now only need to add a line for the second variable to the same chart. The issue, however, is that the CO2 emissions variable (Trend) is on a different scale, and the automatic vertical axis scale (from about –0.2 to 1.2) would not allow Trend to be displayed. To resolve this issue, we can introduce a second panel and put the two charts side by side. You can think of the new plotting space as being like a table, with an overall title and two columns.

We’ll do this using lets_plot again. We’re going to create a base plot and save it to a variable called base_plot, then we’ll create two panels of an overall figure by adding different elements (one for June, one for the trend) onto this base plot. Finally, we’ll bring them both together using the gggrid function.

base_plot = ggplot(df_temp_co2) + scale_x_continuous(format="d")
plot_p = (
    base_plot
    + geom_line(aes(x="Year", y="Jun"), size=1)
    + labs(title="June temperature anomalies")
)
plot_q = (
    base_plot
    + geom_line(aes(x="Year", y="Trend"), size=1)
    + labs(title="Carbon dioxide emissions")
)
gggrid([plot_p, plot_q], ncol=2)
Figure 1.11 June temperature anomalies and carbon dioxide emissions.

  5. Extra practice: Choose two months and add the CO2 trend data to the temperature dataset from Part 1.1, making sure that the data corresponds to the correct year. Create a separate chart for each month. What do your charts and the correlation coefficients suggest about the relationship between CO2 levels and temperature anomalies?
causation
A direction from cause to effect, establishing that a change in one variable produces a change in another. While a correlation gives an indication of whether two variables move together (either in the same or opposite directions), causation means that there is a mechanism that explains this association. Example: We know that higher levels of CO2 in the atmosphere lead to a greenhouse effect, which warms the Earth’s surface. Therefore we can say that higher CO2 levels are the cause of higher surface temperatures.
spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.

Even though two variables are strongly correlated with each other, this is not necessarily because one variable’s behaviour is the result of the other (a characteristic known as causation). The two variables could be spuriously correlated. The following example illustrates spurious correlation:

A child’s academic performance may be positively correlated with the size of the child’s house or the number of rooms in the house, but could we conclude that building an extra room would make a child smarter, or doing well at school would make someone’s house bigger? It is more plausible that income or wealth, which determines the size of home that a family can afford and the resources available for studying, is the ‘unseen factor’ in this relationship. We could also determine whether income is the reason for this spurious correlation by comparing exam scores for children whose parents have similar incomes but different house sizes. If there is no correlation between exam scores and house size, then we can deduce that house size was not ‘causing’ exam scores (or vice versa).

See this TEDx Talk for more examples of the dangers of confusing correlation with causation.

  6. Consider the example of spurious correlation described above.
  • In your own words, explain spurious correlation and the difference between correlation and causation.
  • Give an example of spurious correlation, similar to the one above, for either CO2 levels or temperature anomalies.
  • Choose an example of spurious correlation from Tyler Vigen’s website. Explain whether you think it is a coincidence, or whether this correlation could be due to one or more other variables.

Find out more: What makes some correlations spurious?

In the spurious correlations website given in Question 6(c), most of the examples you will see involve data series (variables) that are trending (meaning that they tend to increase or decrease over time). If you calculate a correlation between two variables that are trending, you are bound to find a large positive or negative correlation coefficient, even if there is no plausible explanation for a relationship between the two variables. For example, ‘per capita cheese consumption’ (which increases over time due to increased disposable incomes or greater availability of cheese) has a correlation coefficient of 0.95 with the ‘number of people who die from becoming tangled in their bedsheets’ (which also increases over time due to a growing population and a growing availability of bedsheets).
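To convince yourself of this, here is a short made-up sketch: two series that are unrelated apart from both trending upwards over time will usually show a very high correlation coefficient.

# Two unrelated but upward-trending made-up series
rng = np.random.default_rng(0)
years = np.arange(1960, 2021)
series_a = 0.5 * (years - 1960) + rng.normal(0, 2, size=len(years))   # trends upward
series_b = 3.0 * (years - 1960) + rng.normal(0, 20, size=len(years))  # also trends upward
print(np.corrcoef(series_a, series_b)[0, 1])  # typically close to 1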

The case for our example (the relationship between temperature and CO2 emissions) is slightly different. There is a well-known physical link between the two, so we understand how CO2 emissions could potentially cause changes in temperature. In general, though, do not be tempted to conclude that there is a causal link just because you see a high correlation coefficient, and be very cautious about attaching too much meaning to high correlation coefficients when the data displays trending behaviour.

This part shows that summary statistics, such as the correlation coefficient, can help identify possible patterns or relationships between variables, but we cannot draw conclusions about causation from them alone. It is also important to think about other explanations for what we see in the data, and whether we would expect there to be a relationship between the two variables.

natural experiment
An empirical study exploiting naturally occurring statistical controls in which researchers do not have the ability to assign participants to treatment and control groups, as is the case in conventional experiments. Instead, differences in law, policy, weather, or other events can offer the opportunity to analyse populations as if they had been part of an experiment. The validity of such studies depends on the premise that the assignment of subjects to the naturally occurring treatment and control groups can be plausibly argued to be random.

However, there are ways to determine whether there is a causal relationship between two variables, for example, by looking at the scientific processes that connect the variables (as with CO2 and temperature anomalies), or by using a natural experiment. To read more about how natural experiments help evaluate whether one variable causes another, see Section 1.8 of Economy, Society, and Public Policy. In Empirical Project 3, we will take a closer look at natural experiments and how we can use them to identify causal links between variables.