4. Measuring wellbeing Working in Python

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in Python.

Don’t forget to also download the data into your working directory by following the steps in this project.

Python-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to convert (‘reshape’) data from wide to long format, and vice versa.

Getting started in Python

Head to the ‘Getting Started in Python’ page for help and advice on setting up a Python session to work with. Remember, you can run any page from this project as a notebook by downloading the relevant file from this repository and running it on your own computer. Alternatively, you can run pages online in your browser over at Binder.

Preliminary settings

Let’s import the packages we’ll need and also configure the settings we want:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from skimpy import skim
from lets_plot import *

LetsPlot.setup_html(no_js=True)

Part 4.1 GDP and its components as a measure of material wellbeing

Learning objectives for this part

  • Check datasets for missing data.
  • Sort data and assign ranks based on values.
  • Distinguish between time series and cross-sectional data, and plot appropriate charts for each type of data.

The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries over the period 1970 to present. We will look at how GDP and its components have changed over time, and investigate the usefulness of GDP per capita as a measure of wellbeing.

To answer the questions below, download the data and make sure you understand how the measure of total GDP is constructed.

  • Go to the United Nations’ National Accounts Main Aggregates Database website. On the right-hand side of the page, under ‘Data Availability’, click ‘Downloads’.
  • Under the subheading ‘GDP and its breakdown at constant 2015 prices in US Dollars’, select the Excel file ‘All countries for all years – sorted alphabetically’.
  • Save it in a subfolder of the directory you are coding in such that the relative path is data/Download-GDPconstant-USD-countries.xlsx.

Python walk-through 4.1 Importing the Excel file (.xlsx or .xls format) into Python

First, make sure you move the saved data to a folder called data that is a subfolder of your working directory. The working directory is the folder that your code ‘starts’ in, and the one that you open when you start Visual Studio Code. Let’s say you called it core, then the file and folder structure of your working directory would look like this:

📁 core
│──📁data
   └──Download-GDPconstant-USD-all.xlsx
│──empirical_project_4.py

This is similar to what you should see in Visual Studio Code under the explorer tab (although the working directory, core, won’t appear). You can check your working directory by running the following lines of code in Visual Studio Code:

import os
os.getcwd()

Before importing the file into Python, open the file in Excel, OpenOffice, LibreOffice, or Numbers to see how the data is organized in the spreadsheet, and note the following:

  • There is a heading that we don’t need, followed by a blank row.
  • The data we need starts on row three.

Using this knowledge, we can import the data using the Path module to create the path to the data:

df = pd.read_excel("data/Download-GDPconstant-USD-countries.xlsx", skiprows=2)
df.head()
Area/CountryID Area/Country IndicatorName 1970 1971 1972 1973 1974 1975 1976 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 1 World Final consumption expenditure 1.407658e+13 1.471021e+13 1.547790e+13 1.624870e+13 1.658903e+13 1.717994e+13 1.791836e+13 ... 4.875022e+13 5.005980e+13 5.126395e+13 5.255100e+13 5.387176e+13 5.538710e+13 5.690357e+13 5.859288e+13 6.035363e+13 6.185429e+13
1 1 World Household consumption expenditure (including N... 1.035693e+13 1.084386e+13 1.147363e+13 1.210090e+13 1.227802e+13 1.264545e+13 1.323398e+13 ... 3.738964e+13 3.848842e+13 3.948548e+13 4.055752e+13 4.163991e+13 4.286196e+13 4.407794e+13 4.546361e+13 4.682931e+13 4.793928e+13
2 1 World General government final consumption expenditure 3.716902e+12 3.860622e+12 3.996625e+12 4.138637e+12 4.305327e+12 4.531862e+12 4.682747e+12 ... 1.135533e+13 1.156536e+13 1.176947e+13 1.198689e+13 1.222256e+13 1.251001e+13 1.280957e+13 1.310784e+13 1.350042e+13 1.389068e+13
3 1 World Gross capital formation 4.498515e+12 4.622966e+12 4.840791e+12 5.303193e+12 5.406454e+12 5.143859e+12 5.640857e+12 ... 1.588746e+13 1.685123e+13 1.742526e+13 1.803677e+13 1.874605e+13 1.933678e+13 1.974630e+13 2.073245e+13 2.157186e+13 2.203811e+13
4 1 World Gross fixed capital formation (including Acqui... 4.155661e+12 4.364629e+12 4.617120e+12 4.984883e+12 4.942678e+12 4.957705e+12 5.280600e+12 ... 1.523050e+13 1.607289e+13 1.683424e+13 1.744210e+13 1.810611e+13 1.874677e+13 1.928192e+13 2.008485e+13 2.095530e+13 2.154916e+13
  1. You can see from the tab ‘Download-GDPconstant-USD-countr’ that some countries have missing data for some of the years. Data may be missing due to political reasons (for example, countries formed after 1970) or data availability issues.
  • Make and fill a frequency table similar to Figure 4.1, showing the number of years that data is available for each country in the category ‘Final consumption expenditure’.
  • How many countries have data for the entire period (1970 to the latest year available)? Do you think that missing data is a serious issue in this case?
Country Number of years of GDP data
   
   
   
   

Figure 4.1 Number of years of GDP data available for each country.

 

Python walk-through 4.2 Making a frequency table

We want to create a table showing how many years of Final consumption expenditure data are available for each country.

Looking at the dataset’s current format, you can see that countries and indicators (for example, Afghanistan and Final consumption expenditure) are row variables, while year is the column variable. This data is organized in ‘wide’ format (each individual’s information is in a single row).

For many data operations and making charts it is more convenient to have indicators as column variables, so we would like Final consumption expenditure to be a column variable, and year to be the row variable. Each observation would represent the value of an indicator for a particular country and year. This data is organized in ‘long’ format (each individual’s information is in multiple rows). This is also called ‘tidy’ data and it can be recognised by having one variable per column and one observation per row. Many data scientists consider it good practice to keep data in tidy format.

To change data from wide to long format, we use the pd.melt method. The melt method is very powerful and useful, as you will find many large datasets are in wide format. In this case, pd.melt takes the data from all columns not specified as being id_vars (via a list of column names), and uses them to create two new columns: one contains the name of the row variable created from the former column names, which is the year here; we can set that new column’s name with var_name="year". The second new column contains the values that were in the columns we unpivoted and is automatically given the name value. (We could have set a new name for this column by passing value_name=, too.)

Compare df_long to the wider df to understand how the melt command works. To learn more about organizing data in Python, read the Working with Data section of ‘Coding for Economists’.

df_long = pd.melt(
    df, id_vars=["CountryID", "Country", "IndicatorName"], var_name="year"
)
df_long.head()
Area/CountryID Area/Country IndicatorName year value
0 1 World Final consumption expenditure 1970 1.407658e+13
1 1 World Household consumption expenditure (including N... 1970 1.035693e+13
2 1 World General government final consumption expenditure 1970 3.716902e+12
3 1 World Gross capital formation 1970 4.498515e+12
4 1 World Gross fixed capital formation (including Acqui... 1970 4.155661e+12

To create the required table, we only need the Final consumption expenditure of each country, which we extract using the .loc function. We’d like all columns, so we pass the condition in the first position of .loc and leave the second entry as : to indicate that we want all columns.

cons = df_long.loc[df_long["IndicatorName"] == "Final consumption expenditure", :]

Now let’s use count to create our table.

year_count = cons.groupby("Country").count()
year_count
CountryID IndicatorName year value
Country
Afghanistan 52 52 52 52
Albania 52 52 52 52
Algeria 52 52 52 52
Andorra 52 52 52 52
Angola 52 52 52 52
... ... ... ... ...
Yemen Democratic (Former) 52 52 52 21
Yugoslavia (Former) 52 52 52 21
Zambia 52 52 52 52
Zanzibar 52 52 52 32
Zimbabwe 52 52 52 52

This code created a new column called value that counts the number of available years of GDP data for each country.

Now we can establish how many of the 220 countries and areas in the dataset have complete information. A dataset is complete if it has the maximum number of available observations (given by year_count["available_years"].max()).

sum(year_count["value"] == year_count["value"].max())
179

In this case, the full set of data is available for 179 out of 220 countries and areas.

If you add up the data on the right-hand side of this equation, you may find that it does not add up to the reported GDP value. The UN notes this discrepancy in Section J, item 17 of the ‘Methodology for the national accounts’: ‘The sums of com­ponents in the tables may not necessarily add up to totals shown because of rounding’.

There are three different ways in which countries calculate GDP for their national accounts, but we will focus on the expenditure approach, which calculates gross domestic product (GDP) as:

Gross capital formation refers to the creation of fixed assets in the economy (such as the construction of buildings, roads, and new machinery) and changes in inventories (stocks of goods held by firms).

  1. Rather than looking at exports and imports separately, we usually look at the difference between them (exports minus imports), also known as net exports. Choose three countries that have GDP data over the entire period (1970 to the latest year available). For each country, create a variable that shows the values of net exports in each year.

Python walk-through 4.3 Creating new variables

We will use Brazil, the US, and China as examples.

Before we select these three countries, we will calculate the net exports (exports minus imports) for all countries, as we need that information in Python walk-through 4.4. We will also shorten the names of the variables we need, to make the code easier to read. We will use a dictionary to map names into shorter formats. A dictionary is a built-in object type in Python and always has the structure {key1: value1, key2: value2, ...} where the keys and values could have any type (e.g. string, integer, dataframe). In our case, both keys and values will be strings. We will use a convention for our naming that is known as ‘snake case’. This means we use all lowercase words with spaces replaced by underscores (it looks a bit like a snake!). There are packages that will auto-rename long variables for you, but let’s see how to do it manually here.

short_names_dict = {
    "Final consumption expenditure": "final_expenditure",
    "Household consumption expenditure (including Non-profit institutions serving households)": "hh_expenditure",
    "General government final consumption expenditure": "gov_expenditure",
    "Gross capital formation": "capital",
    "Imports of goods and services": "imports",
    "Exports of goods and services": "exports",
}
# rename these values
df_long["IndicatorName"] = df_long["IndicatorName"].replace(short_names_dict)

df_long still has several rows for a particular country and year (one for each indicator). We will reshape this data using the .pivot method to ensure that we have only one row per country/area and per year. Note that pivot preserves the list of columns we pass as the index= and pivots (transposes/reshapes) the columns we pass to columns= so that they are wide.

df_table = df_long.pivot(
    index=["CountryID", "Country", "year"], columns=["IndicatorName"]
)
df_table.head()
value
IndicatorName Agriculture, hunting, forestry, fishing (ISIC A-B) Changes in inventories Construction (ISIC F) Gross Domestic Product (GDP) Gross fixed capital formation (including Acquisitions less disposals of valuables) Manufacturing (ISIC D) Mining, Manufacturing, Utilities (ISIC C-E) Other Activities (ISIC J-P) Total Value Added Transport, storage and communication (ISIC I) Wholesale, retail trade, restaurants and hotels (ISIC G-H) capital exports final_expenditure gov_expenditure hh_expenditure imports
CountryID Country year
4 Afghanistan 1970 7.797721e+09 NaN 4.829657e+07 1.046759e+10 1.653901e+09 1.027537e+09 1.029146e+09 4.318674e+08 9.969872e+09 2.634881e+08 4.298280e+08 1.654637e+09 1.546737e+08 3.068715e+09 3.288787e+08 2.734161e+09 1.622465e+08
1971 7.499719e+09 NaN 4.829657e+07 1.043873e+10 1.764161e+09 1.077057e+09 1.077777e+09 3.923603e+08 9.620515e+09 2.393842e+08 3.905074e+08 1.764946e+09 1.997868e+08 2.957075e+09 3.375334e+08 2.614420e+09 2.367204e+08
1972 7.191561e+09 NaN 5.027363e+07 9.101517e+09 1.543641e+09 1.127088e+09 1.127731e+09 3.806628e+08 9.336689e+09 2.322475e+08 3.788652e+08 1.615795e+09 2.420059e+08 2.886788e+09 3.530078e+08 2.529175e+09 2.478413e+08
1973 7.530719e+09 NaN 5.676967e+07 8.917725e+09 1.543641e+09 1.177119e+09 1.179560e+09 4.367777e+08 9.878990e+09 2.664839e+08 4.347151e+08 1.566079e+09 3.011881e+08 3.100416e+09 3.704164e+08 2.724908e+09 2.592801e+08
1974 7.854370e+09 NaN 5.846427e+07 9.354451e+09 1.984681e+09 1.281825e+09 1.285003e+09 4.726411e+08 1.043700e+10 3.220254e+08 4.717595e+08 1.914097e+09 4.078667e+08 3.328662e+09 3.810549e+08 2.941860e+09 3.587346e+08

Note that df_long previously had information in five columns: CountryID, Country, IndicatorName, year, and value. CountryID, Country, and year are still columns in df_table, as we declared them as the index of the new dataframe. The unique values of IndicatorName have now become their own variables (columns), filled in with data from the value column. That last column was not mentioned in the previous line of code (it was clear to pandas what had been left out) but it is the values of that column that now appear in the body of the df_table dataframe. See what happens if you leave two of the original columns unmentioned, for instance take year out of the index list. You should find that you get an error message because it is then ambiguous as to how to create df_table from df_long.

Now we create a net_exports column based on the existing columns (exports - imports), and we know that this will be a unique country–year combination for each row. First we need to drop the top level of the column index, which is currently called value: we don’t need this any more. Doing so will allow us to access the exports and imports columns directly. We’ll also reset the index to row numbers rather than those three columns we used in the pivot. We’ll also remove the name of the columns as we won’t need that any longer.

df_table.columns = df_table.columns.droplevel()
df_table = df_table.reset_index()
df_table.columns.name = ""
df_table["net_exports"] = df_table["exports"] - df_table["imports"]

Let us select our three chosen countries to check that we calculated net exports correctly.

sel_countries = ["Brazil", "United States", "China"]
cols_to_keep = ["Country", "year", "exports", "imports", "net_exports"]


df_sel_un = df_table.loc[df_table["Country"].isin(sel_countries), cols_to_keep]
df_sel_un.head()
Country year exports imports net_exports
1092 Brazil 1970 1.081500e+10 1.961365e+10 −8.798644e+09
1093 Brazil 1971 1.141087e+10 2.347561e+10 −1.206475e+10
1094 Brazil 1972 1.416813e+10 2.820028e+10 −1.403215e+10
1095 Brazil 1973 1.618776e+10 3.395635e+10 −1.776859e+10
1096 Brazil 1974 1.656551e+10 4.354633e+10 −2.698082e+10

Now we will create charts to show the GDP components in order to look for general patterns over time and make comparisons between countries.

  1. Evaluate the components over time, for two countries of your choice.
  • Create a new row for each of the four components of GDP (Household consumption expenditure, General government final consumption expenditure, Gross capital formation, Net exports). To make the charts easier to read, convert the values into billions (for example, 4.38 billion instead of 4,378,772,008). Round your values to two decimal places.
  • Plot a separate line chart for each country, showing the value of the four components of GDP on the vertical axis and time (the years 1970–Present) on the horizontal. (Use more than one line chart per country if necessary, to show the data more clearly). Name each component in the chart legend appropriately.
  • Which of the components would you expect to move together (increasing or decreasing together) or move in opposite directions, and why? Using your charts from Question 3(a), describe any patterns you find in the relationship between the components. Does the data support your hypothesis about the behaviour of the components?
  • For each country, describe any patterns you find in the movement of components over time. Which factors could explain the patterns that you find within countries, and any differences between countries (for example, economic or political events)? You may find it helpful to research the history of the countries you have chosen.
  • Extension: For one country, add data labels to your chart to indicate the relevant events that happened in that year.

Python walk-through 4.4 Plotting and annotating time series data

Extract the relevant data

We will work with the df_long dataset, as the long format is well suited to produce charts with the lets_plot package. In this example, we use the US and China (which we will now save as the dataframe comp).

# take a copy of df_long that just has data for the US and China
comp = df_long.loc[df_long["Country"].isin(["United States", "China"]), :].copy()
# Convert value to billion USD
comp["value"] = comp["value"] / 1.0e9
# Filter down to certain cols and values
comp = comp.loc[
    comp["IndicatorName"].isin(
        ["gov_expenditure", "hh_expenditure", "capital", "imports", "exports"]
    ),
    ["Country", "year", "IndicatorName", "value"],
]
comp.head()
Country year IndicatorName value
675 China 1970 hh_expenditure 136.272429
676 China 1970 gov_expenditure 28.925787
677 China 1970 capital 76.821414
680 China 1970 exports 5.240282
681 China 1970 imports 5.603186

Plot a line chart

We can now plot this data using the lets_plot data visualization library. We’ll plot the US data first.

(
    ggplot(
        comp.loc[comp["Country"] == "United States", :],
        aes(
            x="year",
            y="value",
            color="IndicatorName",
            linetype="IndicatorName",
        ),
    )
    + geom_line(size=1)
)
The US’s GDP components (expenditure approach).
Fullscreen

Figure 4.2 The US’s GDP components (expenditure approach).

There are plenty of problems with this chart:

  • The vertical axis label is uninformative.
  • There is no chart title.
  • The vertical axis dips below zero (even though the data can’t take on negative values).
  • The legend is uninformative.

To improve this chart, we add features to the figure by creating the axis, ax, explicitly. We’ll also use a trick where we invert the dictionary from earlier and use this to supply full names to the legend via a new ‘indicator’ column.

In Python walk-through 4.3, we had a dictionary, short_names_dict, that mapped the long names, such as Final consumption expenditure, into short names, such as final_expenditure. To make our plot, we’d like to reverse this so that the long names are used in the legend instead of the short names. The first line of code reverses the dictionary—you don’t need to understand the details of the code, just know that it’s an inversion so that final_expenditure maps into Final consumption expenditure and so on.

# reverse the dictionary from earlier
rev_name_dict = {v: k.split("(")[0] for k, v in short_names_dict.items()}
# create a new col with the original names
comp["Indicator"] = comp["IndicatorName"].replace(rev_name_dict)


(
    ggplot(
        comp.loc[comp["Country"] == "United States", :],
        aes(x="year", y="value", color="Indicator", linetype="Indicator"),
    )
    + geom_line(size=2)
    + labs(
        y="USD ($Billions)",
        x="Year",
        title="GDP components over time",
        color="Indicator",
    )
    + ggsize(800, 500)
    + scale_x_continuous(format="d")
)
The US’s GDP components (expenditure approach), amended chart.
Fullscreen

Figure 4.3 The US’s GDP components (expenditure approach), amended chart.

We can make a chart for more than one country simultaneously using comp rather than just a subset of comp.

(
    ggplot(comp, aes(x="year", y="value", color="Indicator", linetype="Indicator"))
    + geom_line(size=2)
    + labs(
        y="USD ($Billions)",
        x="Year",
        title="GDP components over time",
        color="Indicator",
    )
    + ggsize(1000, 500)
    + scale_x_continuous(format="d")
    + facet_wrap(facets="Country", ncol=2)
)
GDP components over time (China and the US).
Fullscreen

Figure 4.4 GDP components over time (China and the US).

  1. Another way to visualize the GDP data is to look at each component as a proportion of total GDP. Use the same countries that you chose for Question 3.
  • For each country used in Question 3, create a new column showing the proportion of each component of total GDP (in other words, as a value ranging from 0 to 1). (Hint: to calculate the proportion of a component, divide the value of that component by the sum of all four components.)
  • Plot a separate line chart for each country, showing the proportion of the component of GDP on the vertical axis and time (the years 1970 to the latest year available) on the horizontal axis.
  • Describe any patterns in the proportion of spending over time for each country, and compare these patterns across countries.
  • Compared to the charts in Question 3, what are some advantages of this method for making comparisons over time and between countries?

Python walk-through 4.5 Calculating new variables and plotting time series data

Calculate proportion of total GDP

We will use the comp dataset created in Python walk-through 4.4. First we will calculate net exports, as that contributes to GDP. As the data is currently in long format, we will reshape the data into wide format so that the variables we need are in separate columns instead of separate rows (using the pivot method, as in Python walk-through 4.3), calculate net exports, then transform the data back into long format using the melt method.

We’ll end up dropping the indicator names, and dropping the top level value:

comp_wide = comp.drop("Indicator", axis=1).pivot(
    index=["Country", "year"], columns="IndicatorName"
)
comp_wide.columns = comp_wide.columns.droplevel()
comp_wide = comp_wide.reset_index()
comp_wide.head()
IndicatorName Country year capital exports gov_expenditure hh_expenditure imports
0 China 1970 76.821414 5.240282 28.925787 136.272429 5.603186
1 China 1971 83.888985 6.274381 33.929948 141.980944 5.609614
2 China 1972 80.281758 7.587016 35.626445 149.603204 6.824620
3 China 1973 91.842331 10.690313 36.481480 160.172298 10.578897
4 China 1974 94.781286 12.711351 39.181109 163.443553 14.701894

To figure out what .droplevel() and .reset_index() do, you should comment out each of the two lines in turn and re-run the block of code to see what difference these lines make.

Add a new column called net_exports (exports minus imports):

comp_wide["net_exports"] = comp_wide["exports"] - comp_wide["imports"]
comp_wide.head()
IndicatorName Country year capital exports gov_expenditure hh_expenditure imports net_exports
0 China 1970 76.821414 5.240282 28.925787 136.272429 5.603186 -0.362903
1 China 1971 83.888985 6.274381 33.929948 141.980944 5.609614 0.664767
2 China 1972 80.281758 7.587016 35.626445 149.603204 6.824620 0.762396
3 China 1973 91.842331 10.690313 36.481480 160.172298 10.578897 0.111416
4 China 1974 94.781286 12.711351 39.181109 163.443553 14.701894 -1.990543

We create another dataframe called comp2 which returns comp_wide to long format with the household expenditure, capital, and net export variables:

comp2 = pd.melt(
    comp_wide.drop(["exports", "imports"], axis=1),
    id_vars=["year", "Country"],
    var_name="indicator",
    value_name="2015_bn_usd",
)
comp2.head()
year Country indicator 2015_bn_usd
0 1970 China capital 76.821414
1 1971 China capital 83.888985
2 1972 China capital 80.281758
3 1973 China capital 91.842331
4 1974 China capital 94.781286

Now we will create a new dataframe (props) that also contains the proportions for each GDP component (proportion), using method chaining to link functions together.

props = comp2.assign(
    proportion=comp2.groupby(["Country", "year"])["2015_bn_usd"].transform(
        lambda x: x / x.sum()
    )
)

In words, we did the following: Take the comp2 dataframe and add in a new column called proportion (.assign(proportion=) that, within area and year groups (.groupby(["Country", "year"])) takes the value (["2015_bn_usd"]) and divides it by the total value for that group (.transform(lambda x: x/x.sum())). For example, the first row gives capital as a proportion of GDP for China in 1970.

The result is then saved in props. Look at the props dataframe to confirm that the above command has achieved the desired result. (You can check the answer with props.groupby(["Country", "year"])["proportion"].sum().)

Plot a line chart

Now we redo the line chart from Python walk-through 4.4 using the variable props.

# Update dictionary for net exports, which is new
rev_name_dict.update({"net_exports": "Net exports"})
props["Component"] = props["indicator"].map(rev_name_dict)
(
    ggplot(
        props, aes(x="year", y="proportion", color="Component", linetype="Component")
    )
    + labs(
        y="Proportion",
        x="Year",
    )
    + geom_line(size=2)
    + facet_wrap(facets="Country", ncol=2)
)
GDP component proportions over time (China and the US).
Fullscreen

Figure 4.5 GDP component proportions over time (China and the US).

time series data
A time series is a set of time-ordered observations of a variable taken at successive, in most cases regular, periods or points of time. Example: The population of a particular country in the years 1990, 1991, 1992, … , 2015 is time series data.
cross-sectional data
Data that is collected from participants at one point in time or within a relatively short time frame. In contrast, time series data refers to data collected by following an individual (or firm, country, etc.) over a course of time. Example: Data on degree courses taken by all the students in a particular university in 2016 is considered cross-sectional data. In contrast, data on degree courses taken by all students in a particular university from 1990 to 2016 is considered time series data.

So far, we have done comparisons of time series data, which is a collection of values for the same variables and subjects, taken at different points in time (for example, GDP of a particular country, measured each year). We will now make some charts using cross-sectional data, which is a collection of values for the same variables for different subjects, usually taken at the same time.

  1. Choose three developed countries, three countries in economic transition, and three developing countries (for a list of these countries, see Tables A–C in the UN country classification document).
  • For each country, calculate each component as a proportion of GDP for the year 2015 only.
  • Now create a stacked bar chart that shows the composition of GDP in 2015 on the horizontal axis, and country on the vertical axis. Arrange the columns so that the countries in a particular category are grouped together. (See the walk-through in Figure 3.8 of Economy, Society, and Public Policy for an example of what your chart should look like.)
  • Describe the differences (if any) between the spending patterns of developed, economic transition, and developing countries.

Python walk-through 4.6 Creating stacked bar charts

Calculate proportion of total GDP

This walk-through uses the following countries (selected for purposes of illustration):

  • developed countries: Germany, Japan, United States
  • transition countries: Albania, Russian Federation, Ukraine
  • developing countries: Brazil, China, India.

The relevant data are still in the df dataframe. Before we select these countries, we first calculate the required proportions for all countries for capital, final expenditure, and net exports (out of those columns).

columns_to_track = ["capital", "final_expenditure", "net_exports"]
countries_to_use = [
    "Germany",
    "Japan",
    "United States",
    "Albania",
    "Russian Federation",
    "Ukraine",
    "Brazil",
    "China",
    "India",
]

# Find the proportions for these columns and create new columns called "prop_" + original col name
for col in columns_to_track:
    df_table["prop_" + col] = df_table[col].divide(
        df_table[columns_to_track].sum(axis=1)
    )


# filter this down to 2015 for the countries and cols we want
cols_to_keep = ["prop_" + col for col in columns_to_track] + ["Country", "year"]
df_2015 = df_table.loc[
    (df_table["Country"].isin(countries_to_use)) & (df_table["year"] == 2015),
    cols_to_keep,
]
df_2015.head()
prop_capital prop_final_expenditure prop_net_exports Country year
97 0.257269 0.914777 −0.172046 Albania 2015
1137 0.174116 0.837416 −0.011532 Brazil 2015
2073 0.430328 0.537384 0.032288 China 2015
3737 0.197429 0.726618 0.075953 Germany 2015
4465 0.323575 0.699562 −0.023138 India 2015

Plot a stacked bar chart

Now let’s create the bar chart. First we need to melt the data into a format where each row is an observation, each column a variable.

df_2015_melt = pd.melt(
    df_2015,
    id_vars=["Country", "year"],
    value_name="Proportion",
    var_name=["Component"],
)
df_2015_melt["Per cent"] = df_2015_melt["Proportion"] * 100


# Create the bar chart
(
    ggplot(df_2015_melt, aes(x="Country", y="Per cent", fill="Component"))
    + geom_bar(stat="identity")
    + coord_flip()
    + labs(fill="Components of GDP")
)
GDP component proportions in 2015.
Fullscreen

Figure 4.6 GDP component proportions in 2015.

Note that even when a country has a trade deficit (proportion of net exports less than 0), the proportions will add up to 1, but the proportions of final expenditure and capital will add up to more than 1.

  1. GDP per capita is often used to indicate material wellbeing instead of GDP, because it accounts for differences in population across countries. Refer to the following articles to help you to answer the questions:

  • Discuss the usefulness and limitations of GDP per capita as a measure of material wellbeing.
  • Based on the arguments in the articles, do you think GDP per capita is an appropriate measure of both material wellbeing and overall wellbeing? Why or why not?

Part 4.2 The HDI as a measure of wellbeing

Learning objectives for this part

  • Sort data and assign ranks based on values.
  • Distinguish between time series and cross-sectional data, and plot appropriate charts for each type of data.
  • Calculate the geometric mean and explain how it differs from the arithmetic mean.
  • Construct indices using the geometric mean, and use index values to rank observations.
  • Explain the difference between two measures of wellbeing (GDP per capita and the Human Development Index).

In Part 4.1, we looked at GDP per capita as a measure of material wellbeing. While income has a major influence on wellbeing because it allows us to buy the goods and services we need or enjoy, it is not the only determinant of wellbeing. Many aspects of our wellbeing cannot be bought, for example, good health or having more time to spend with friends and family.

We are now going to look at the Human Development Index (HDI), a measure of wellbeing that includes non-material aspects, and make comparisons with GDP per capita (a measure of material wellbeing). GDP per capita is a simple index calculated as the sum of its elements, whereas the HDI is more complex. Instead of using different types of expenditure or output to measure wellbeing or living standards, the HDI consists of three dimensions associated with wellbeing:

  • a long and healthy life (health)
  • knowledge (education)
  • a decent standard of living (income).

We will first learn about how the HDI is constructed, and then use this method to construct indices of wellbeing according to criteria of our choice.

The HDI data we will look at is from the Human Development Report by the United Nations Development Programme (UNDP). To answer the questions below, download the data and technical notes from the report:

  • Go to the UNDP’s website.
  • Click ‘Table 1: Human Development Index and its components’ to download the HDI data as an Excel file.
  • Save the file in an easily accessible location, and make sure to give it a suitable name.
  • The ‘Technical notes’ give a diagrammatic presentation of how the HDI is constructed from four indicators.
  1. Refer to the technical notes and the spreadsheet you have downloaded. For each indicator, explain whether you think it is a good measure of the dimension, and suggest alternative indicators, if any. (For example, is GNI per capita a good measure of the dimension ‘a decent standard of living’?)
  1. Figure 4.7 shows the minimum and maximum values for each indicator. Discuss whether you think these are reasonable. (You can read the justification for these values in the technical notes.)
Dimension Indicator Minimum Maximum
Health Life expectancy (years) 20 85
Education Expected years of schooling (years) 0 18
Mean years of schooling (years) 0 15
Standard of living Gross national income per capita (2011 PPP $) 100 75,000

Figure 4.7 Maximum and minimum values for each indicator in the HDI.

United Nations Development Programme. 2022. ‘Technical notes’ in Human Development Report 2021/22: p. 2.

We are now going to apply the method for constructing the HDI, by recalculating the HDI from its indicators. We will use the formula below, and the minimum and maximum values in the table in Figure 4.7. These are taken from page 2 of the technical notes, which you can refer to for additional details.

The HDI indicators are measured in different units and have different ranges, so in order to put them together into a meaningful index, we need to normalize the indicators using the following formula:

Doing so will give a value in between 0 and 1 (inclusive), which will allow comparison between different indicators.

  1. Refer to Figure 4.7 and calculate the dimension index for each of the dimensions:
  • Using the HDI indicator data in Column E of the spreadsheet, calculate the dimension index for a long and healthy life (health).
  • Using the HDI indicator data in Columns G and I of the spreadsheet, calculate the dimension index for knowledge (education). Note that the knowledge dimension index is the average of the dimension index for expected years of schooling for those entering school, and mean years of schooling for adults aged 25 or older. When calculating the index for expected years of schooling you need to restrict the values to take a maximum value of 1.
  • Using the HDI indicator data in Column K of the spreadsheet, calculate the dimension index for a decent standard of living (income). Note (from the technical notes) that you should calculate the GNI index using the natural log of the values. (See the ‘Find out more’ box below for an explanation of the natural log and how to calculate it in Python.)

Find out more The natural log: What it means, and how to calculate it in Python

The natural log turns a linear variable into a concave variable, as shown in Figure 4.8. For any value of income on the horizontal axis, the natural log of that value on the vertical axis is smaller. At first, the difference between income and log income is not that big (for example, an income of 2 corresponds to a natural log of 0.7), but the difference becomes bigger as we move rightwards along the horizontal axis (for example, when income is 100,000, the natural log is only 11.5).

In this chart, the horizontal axis shows income, ranging from 1 to 7, and the vertical axis shows the natural log of income, ranging from 0 to 2.5. Coordinates are (income, natural log of income) The natural log of income is an upward-sloping concave curve that passes through the points (0, 0) and (2, 0.7).
Fullscreen

Figure 4.8 Comparing income with the natural logarithm of income.

The reason why natural logs are useful in economics is because they can represent variables that have diminishing marginal returns: an additional unit of input results in a smaller increase in the total output than did the previous unit. (If you have studied production functions, then the shape of the natural log function might look familiar.)

When applied to the concept of wellbeing, the ‘input’ is income, and the ‘output’ is material wellbeing. It makes intuitive sense that a $100 increase in per capita income will have a much greater effect on wellbeing for a poor country compared to a rich country. Using the natural log of income incorporates this notion into the index we create. Conversely, the notion of diminishing marginal returns (the larger the value of the input, the smaller the contribution of an additional unit of input) is not captured by GDP per capita, which uses actual income and not its natural log. Doing so makes the assumption that a $100 increase in per capita income has the same effect on wellbeing for rich and poor countries.

The numpy log function in Python calculates the natural log of a value for you. To calculate the natural log of a value, x, type np.log(x). If you have a scientific calculator, you can check that the calculation is correct by using the ln or log key.

Now that you know about the natural log, you might want to go back to Question 3(c) in Part 4.1, and create a new chart using the natural log scale. The natural log is used in economics because it approximates percentage changes; that is, log(x) – log(y) is a close approximation to the percentage change between x and y. So, using the natural log scale, you will be able to ‘read off’ the relative growth rates from the slopes of the different series you have plotted. For example, a 0.01 change in the vertical axis value corresponds to a 1% change in that variable. This will allow you to compare the growth rates of the different components of GDP.

geometric mean
A summary measure calculated by multiplying N numbers together and then taking the Nth root of this product. The geometric mean is useful when the items being averaged have different scoring indices or scales, because it is not sensitive to these differences, unlike the arithmetic mean. For example, if education ranged from 0 to 20 years and life expectancy ranged from 0 to 85 years, life expectancy would have a bigger influence on the HDI than education if we used the arithmetic mean rather than the geometric mean. Conversely, the geometric mean treats each criteria equally. Example: Suppose we use life expectancy and mean years of schooling to construct an index of wellbeing. Country A has life expectancy of 40 years and a mean of 6 years of schooling. If we used the arithmetic mean to make an index, we would get (40 + 6)/2 = 23. If we used the geometric mean, we would get (40 × 6)1/2 = 15.5. Now suppose life expectancy doubled to 80 years. The arithmetic mean would be (80 + 6)/2 = 43, and the geometric mean would be (80 × 6)1/2 = 21.9. If, instead, mean years of schooling doubled to 12 years, the arithmetic mean would be (40 + 12)/2 = 26, and the geometric mean would be (40 × 12)1/2 = 21.9. This example shows that the arithmetic mean can be ‘unfair’ because proportional changes in one variable (life expectancy) have a larger influence over the index than changes in the other variable (years of schooling). The geometric mean gives each variable the same influence over the value of the index, so doubling the value of one variable would have the same effect on the index as doubling the value of another variable.

Now, we can combine these dimensional indices to give the HDI. The HDI is the geometric mean of the three dimension indices (IHealth = Life expectancy index, IEducation = Education index, and IIncome = GNI index):

  1. Use the formula above and the data in the spreadsheet to calculate the HDI for all the countries excluding those in the ‘Other countries or territories’ category. You should get the same values as those in Column C, rounded to three decimal places.

Python walk-through 4.7 Calculating the HDI

We will use pd.read_excel to import the data file, which we saved using the easy-to-understand name of ‘HDI_data.xlsx’ in the data folder within our working directory. Before importing, open the Excel file so that you understand its structure and how it corresponds to the code options used below. It’s a long way from being a neat and tidy dataset! We will save the imported data as the df_hdr dataframe. Having taken a look at the Excel file, we can see we should skip the first few rows and take row 1 as the header (which becomes our column names).

df_hdr = pd.read_excel(
    Path("data/HDI_data.xlsx"), skiprows=3, header=1
)
df_hdr.head()
Unnamed: 0 Unnamed: 1 Human Development Index (HDI) Unnamed: 3 Life expectancy at birth Unnamed: 5 Expected years of schooling Unnamed: 7 Mean years of schooling Unnamed: 9 Gross national income (GNI) per capita Unnamed: 11 GNI per capita rank minus HDI rank Unnamed: 13 HDI rank
0 HDI rank Country Value NaN (years) NaN (years) NaN (years) NaN (2017 PPP $) NaN NaN NaN NaN
1 NaN NaN 2021 NaN 2021 NaN 2021 a 2021 a 2021 NaN 2021 b 2020
2 NaN VERY HIGH HUMAN DEVELOPMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 Switzerland 0.962 NaN 83.9872 NaN 16.500299 NaN 13.85966 NaN 66933.00454 NaN 5 NaN 3
4 2 Norway 0.961 NaN 83.2339 NaN 18.1852 c 13.00363 NaN 64660.10622 NaN 6 NaN 1

Looking at the df_hdr dataframe, we see that there are rows that have information that isn’t data (for example, all the rows with an ‘NaN’ in them), as well as variables and columns that do not contain data, or are a mix of ‘NaN’ and data.

Cleaning up the dataframe can be easier to do in Excel by deleting irrelevant rows and columns, but one advantage of doing it in Python is replicability. Suppose in a year’s time you carried out the analysis again with an updated spreadsheet containing new information. If you had done the cleaning in Excel, you would have to redo it from scratch, but if you had done it in Python, you can simply rerun the code below. (This works for new data that is in the same format, too.)

Let’s do some data cleaning.

First, we rename the last column by picking up the year entry below it (.iloc[1, -1]) so we can distinguish between different years of HDI rank. Next, we use a list comprehension to replace any columns that are called Unnamed with entries from the first row of observations. Then we eliminate rows that do not have any numbers in the "HDI rank" column. (To learn more about the list comprehension syntax and how to use it, read this tutorial by W3Schools.)

df_hdr = df_hdr.rename(columns={"HDI rank": "HDI_rank_" + str(df_hdr.iloc[1, -1])})
df_hdr.columns = [
    df_hdr.columns[i] if "Unnamed" not in df_hdr.columns[i] else df_hdr.iloc[0, i]
    for i in range(len(df_hdr.columns))
]
df_hdr = df_hdr.loc[~pd.isna(df_hdr["HDI rank"]), :]
df_hdr.head()
HDI rank Country Human Development Index (HDI) NaN Life expectancy at birth NaN Expected years of schooling NaN Mean years of schooling NaN Gross national income (GNI) per capita NaN GNI per capita rank minus HDI rank NaN HDI_rank_2020
0 HDI rank Country Value NaN (years) NaN (years) NaN (years) NaN (2017 PPP $) NaN NaN NaN NaN
3 1 Switzerland 0.962 NaN 83.9872 NaN 16.500299 NaN 13.85966 NaN 66933.00454 NaN 5 NaN 3
4 2 Norway 0.961 NaN 83.2339 NaN 18.1852 c 13.00363 NaN 64660.10622 NaN 6 NaN 1
5 3 Iceland 0.959 NaN 82.6782 NaN 19.163059 c 13.76717 NaN 55782.04981 NaN 11 NaN 2
6 4 Hong Kong, China (SAR) 0.952 NaN 85.4734 d 17.27817 NaN 12.22621 NaN 62606.8454 NaN 6 NaN 4

Now we can eliminate rows in HDI rank that do not have numbers in them (~pd.isna) and, following that, eliminate columns that contain NaNs (.dropna). (You may need to replace 2020 with a later year, depending on the version of HDI data you’re using.)

df_hdr = df_hdr.loc[~pd.isna(df_hdr["HDI_rank_2020"]), :]
df_hdr = df_hdr.dropna(axis=1, how="any")
df_hdr.head()
HDI rank Country Human Development Index (HDI) Life expectancy at birth Expected years of schooling Mean years of schooling Gross national income (GNI) per capita GNI per capita rank minus HDI rank HDI_rank_2020
3 1 Switzerland 0.962 83.9872 16.500299 13.85966 66933.00454 5 3
4 2 Norway 0.961 83.2339 18.1852 13.00363 64660.10622 6 1
5 3 Iceland 0.959 82.6782 19.163059 13.76717 55782.04981 11 2
6 4 Hong Kong, China (SAR) 0.952 85.4734 17.27817 12.22621 62606.8454 6 4
7 5 Australia 0.951 84.5265 21.05459 12.72682 49238.43335 18 5

Now let’s switch to shorter column names and check what datatypes we have in our data:

new_column_names = [
    "hdi_rank",
    "country",
    "hdi",
    "life_exp",
    "exp_yrs_school",
    "mean_yrs_school",
    "gni_capita",
    "gni_hdi_rank",
    "hdi_rank_2020",
]
df_hdr.columns = new_column_names
df_hdr.info()
<class 'pandas.core.frame.DataFrame'>
Index: 191 entries, 3 to 196
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   hdi_rank         191 non-null    object
 1   country          191 non-null    object
 2   hdi              191 non-null    object
 3   life_exp         191 non-null    object
 4   exp_yrs_school   191 non-null    object
 5   mean_yrs_school  191 non-null    object
 6   gni_capita       191 non-null    object
 7   gni_hdi_rank     191 non-null    object
 8   hdi_rank_2020    191 non-null    object
dtypes: object(9)
memory usage: 14.9+ KB

Looking at the structure of the data, we see that pandas thinks that all the data consists of objects because the original datafile contained non-numerical entries (these rows have now been deleted). Apart from the "country" variable, which we want to be a categorical variable, all variables should be ‘double’ or ‘int’ (numbers).

Now, we could make this change for each variable individually, by running:

df_hdr = df_hdr.astype({"hdi_rank": "int32"})

and similar for each name and its corresponding type. (Note that the structure { ... : ...} is a dictionary, which has keys and values.)

However, it would be quite tedious to have to retype the same line over and over again. Instead we can pass a dictionary with key-value pairs for all mappings of names to data types we’d like to have in our dataframe.

To do this, we’ll use a trick where we ‘zip’ two variables (the column names and datatypes) together into a dictionary that maps the column name into the datatype we’d like it to have. The zip function brings an element of two lists together in turn, a bit like a zipper on your clothes brings two interlocking plastic teeth together.

new_column_datatypes = [
    "int",
    "category",
    "double",
    "double",
    "double",
    "double",
    "double",
    "int",
    "int",
]
dict_of_names_to_types = {k: v for k, v in zip(new_column_names, new_column_datatypes)}
dict_of_names_to_types
{'hdi_rank': 'int',
 'country': 'category',
 'hdi': 'double',
 'life_exp': 'double',
 'exp_yrs_school': 'double',
 'mean_yrs_school': 'double',
 'gni_capita': 'double',
 'gni_hdi_rank': 'int',
 'hdi_rank_2020': 'int'}

We now have a dictionary of all names and their associated types that we can use:

df_hdr = df_hdr.astype(dict_of_names_to_types)
df_hdr.info()
<class 'pandas.core.frame.DataFrame'>
Index: 191 entries, 3 to 196
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   hdi_rank         191 non-null    int64   
 1   country          191 non-null    category
 2   hdi              191 non-null    float64 
 3   life_exp         191 non-null    float64 
 4   exp_yrs_school   191 non-null    float64 
 5   mean_yrs_school  191 non-null    float64 
 6   gni_capita       191 non-null    float64 
 7   gni_hdi_rank     191 non-null    int64   
 8   hdi_rank_2020    191 non-null    int64   
dtypes: category(1), float64(5), int64(3)
memory usage: 19.4 KB

Now we have a nice clean dataset that we can work with.

We start by calculating the three indices, using the information given. For the education index, we calculate the index for expected and mean schooling separately, then take the arithmetic mean to get i_education. As some mean schooling observations exceed the specified ‘maximum’ value of 18, the calculated index values would be larger than 1. To avoid this, we first replace these observations with 18 to obtain an index value of 1.

df_hdr.loc[df_hdr["exp_yrs_school"] > 18, "exp_yrs_school"] = 18


# Now create the indices
df_hdr["i_health"] = (df_hdr["life_exp"] - 20) / (85 - 20)
df_hdr["i_education"] = (
    ((df_hdr["exp_yrs_school"] - 0) / (18 - 0))
    + (df_hdr["mean_yrs_school"] - 0) / (15 - 0)
) / 2
df_hdr["i_income"] = (np.log(df_hdr["gni_capita"]) - np.log(100)) / (
    np.log(75000) - np.log(100)
)
df_hdr["hdi_calc"] = np.power(
    df_hdr["i_health"] * df_hdr["i_education"] * df_hdr["i_income"], 1 / 3
)

Now we can compare the HDI given in the table and our calculated HDI (they should be the same when rounded to three decimal places).

df_hdr[["hdi", "hdi_calc"]]
hdi hdi_calc
3 0.962 0.962050
4 0.961 0.961086
5 0.959 0.959482
6 0.952 0.954481
7 0.951 0.950664
... ... ...
192 0.426 0.426349
193 0.404 0.403530
194 0.400 0.400358
195 0.394 0.393726
196 0.385 0.385143

The HDI is one way to measure wellbeing, but you may think that it does not use the most appropriate measures for the non-material aspects of wellbeing (health and education).

Now we will use the same method to create our own index of non-material wellbeing (an ‘alternative HDI’), using different indicators. You can find alternative indicators to measure health and education from the World Bank’s World Development Indicators. Use the interactive menu on the left to select and download alternative indicators of health and education:

  • In the ‘Country’ section, click the tick box to select all countries.
  • In the ‘Time’ section, click the tick box to select all available years.
  • In the ‘Series’ section, find and select two or three indicators to measure health, and two to three indicators to measure education (check Python walk-through 4.8 for some ideas). Also select the series ‘GNI per capita (constant 2015 US$)’, which we’ll use for the income indicator.
  • Minimise all sections in the left-side menu. Then click ‘Apply Changes’ to preview the data.
  • Click ‘Download options’ (near the top of the page) and select ‘CSV’ to download the data as a .csv file.
  • The data will download as a .zip file containing two .csv files. The first file, with ‘Data.csv’ in the filename, contains the indicators we need. Save this file in your ‘data’ subfolder and give it an easy-to-remember name (like ‘world_bank_data.csv’).
  1. Create an alternative index of wellbeing. In particular, propose alternative Education and Health indices in (a) and (b), then combine these with the alternative Income index in (c) to calculate an alternative HDI. Examine whether the changes caused substantial changes in country rankings in (d).
  • Choose two to three indicators to measure health, and two to three indicators to measure education. Explain why you have chosen these indicators.
  • Choose a reasonable maximum and minimum value for each indicator and justify your choices.
  • Calculate your alternative versions of the education and health dimension indices. Since you have chosen more than one indicator for this dimension, make sure to average the dimension indices as done in Question 3(b). Also ensure that higher indicator values always represent better outcomes. Recalculate the income dimension index using the GNI per capita data. Now calculate the alternative HDI as done in Questions 3 and 4.
  • Create a new variable showing each country’s rank according to your alternative HDI, where 1 is assigned to the country with the highest value. Compare your ranking to the HDI rank. Are the rankings generally similar, or very different? (See Python walk-through 4.8 on how to do this.)

Python walk-through 4.8 Creating your own HDI

Merge data and calculate alternative indices

This example uses the following indicators:

  • Education: Literacy rate, adult total (% ages 15 and above); School enrollment, tertiary (% gross); Trained teachers in primary education (% of total teachers).
  • Health: Prevalence of stunting, height for age (% under age 5); Mortality rate, adult, female (per 1,000 female adults); Mortality rate, adult, male (per 1,000 male adults).
  • Income: GNI per capita (in constant 2015 USD).

If you’re using different indicators, remember to adjust the code below accordingly.

First, we import the data, which we saved as world_bank_data.csv, and check that it has been imported correctly. You can see that each row represents a different country, and each column represents a different year–indicator combination. Note that the data comes with several empty rows at the end before some metadata that has been inserted (information about where the data came from and the download date). To remove the metadata, we drop rows that have NaN in the Country Code column.

all_hdr = pd.read_csv(Path("data/world_bank_data.csv")).dropna(
    subset=["Country Code"], axis=0
)
all_hdr
Country Name Country Code Series Name Series Code 1960 [YR1960] 1961 [YR1961] 1962 [YR1962] 1963 [YR1963] 1964 [YR1964] 1965 [YR1965] ... 2013 [YR2013] 2014 [YR2014] 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019] 2020 [YR2020] 2021 [YR2021] 2022 [YR2022]
0 Afghanistan AFG Literacy rate, adult total (% of people ages 1... SE.ADT.LITR.ZS .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. 37.266040802002 ..
1 Afghanistan AFG School enrollment, tertiary (% gross) SE.TER.ENRR .. .. .. .. .. .. ... .. 8.31087017059326 .. .. .. 9.96378993988037 .. 10.8584403991699 .. ..
2 Afghanistan AFG Trained teachers in primary education (% of to... SE.PRM.TCAQ.ZS .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. ..
3 Afghanistan AFG Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS .. .. .. .. .. .. ... 40.4 .. .. .. .. 38.2 .. .. .. ..
4 Afghanistan AFG Mortality rate, adult, female (per 1,000 femal... SP.DYN.AMRT.FE 550.189 543.6 537.703 531.856 526.179 520.698 ... 214.871 212.215 209.573 204.096 196.522 193.284 190.261 210.053 214.241 ..
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1857 World WLD Trained teachers in primary education (% of to... SE.PRM.TCAQ.ZS .. .. .. .. .. .. ... 86.6119384765625 86.3298873901367 85.9246826171875 84.8016510009766 85.599006652832 86.6845779418945 86.2046585083008 86.3962631225586 86.606086730957 85.6526412963867
1858 World WLD Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. ..
1859 World WLD Mortality rate, adult, female (per 1,000 femal... SP.DYN.AMRT.FE 302.23839310137 284.48339478412 264.599246966128 261.515942913558 256.345687314532 256.825165517128 ... 117.805291746937 116.934167334985 116.220022447097 115.161161407209 113.867167759306 112.363558656741 111.111154392665 119.443111464671 138.465381887828 ..
1860 World WLD Mortality rate, adult, male (per 1,000 male ad... SP.DYN.AMRT.MA 386.345709343829 365.917299719234 341.855302636778 338.074201833933 332.029766235342 337.657487077999 ... 186.161600673567 182.332619127125 177.465118028284 175.751796102991 174.57957030685 171.986914064584 170.283676576264 182.403699690544 206.345016669394 ..
1861 World WLD GNI per capita (constant 2015 US$) NY.GNP.PCAP.KD .. .. .. .. .. .. ... 9806.34774226996 10001.5565898911 10177.0396729044 10336.6271001545 10579.8337141469 10796.977476103 10975.7017030918 10500.2859821596 11044.6183686609 ..

Then we need to filter to get the year we want. We’ll do this in a new dataframe. Note that two dots (..) are a string that indicates a missing variable. Unfortunately, if we have a mix of strings and other data types (like double), pandas isn’t sure how to treat the column. So we need to replace the code for missing values in a string, .., with ‘proper’ numeric missing values (NaN), which will allow us to do operations like arithmetic on the column. To do this, we use the pd.to_numeric function—once it has been applied using the errors="coerce" option, you’ll see that any occurrences of .. have been converted to NaN, which means not a number. (Remember to change the year number from 2021 to whatever year is in your HDI data from Questions 1–4).

hdr_2021 = all_hdr.loc[
    :, ["Country Name", "Country Code", "Series Name", "Series Code", "2021 [YR2021]"]
]
hdr_2021["2021 [YR2021]"] = pd.to_numeric(
    hdr_2021["2021 [YR2021]"], errors="coerce"
).fillna(np.nan)
hdr_2021.sample(5)
Country Name Country Code Series Name Series Code 2021 [YR2021]
269 Chad TCD Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS 31.1
1786 Pacific island small states PSS School enrollment, tertiary (% gross) SE.TER.ENRR NaN
36 Angola AGO School enrollment, tertiary (% gross) SE.TER.ENRR NaN
182 Brazil BRA Literacy rate, adult total (% of people ages 1... SE.ADT.LITR.ZS NaN
679 Jamaica JAM Literacy rate, adult total (% of people ages 1... SE.ADT.LITR.ZS NaN

For ease of coding, we’ll introduce some short name versions of the series names and we will make some additional changes to the datatypes:

dict_to_shorter_hdi_names = {
    "SE.TER.ENRR": "enrollment",
    "SE.PRM.TCAQ.ZS": "trained_teachers",
    "SE.ADT.LITR.ZS": "literacy",
    "SP.DYN.AMRT.MA": "mortality_m",
    "SP.DYN.AMRT.FE": "mortality_f",
    "SH.STA.STNT.ZS": "stunting",
    "NY.GNP.PCAP.KD": "income",
}


dict_of_types = {
    "Country Name": "category",
    "Country Code": "category",
    "Series Code": "category",
    "indicator": "category",
    "Series Name": "string",
}

hdr_2021["indicator"] = hdr_2021["Series Code"].map(dict_to_shorter_hdi_names)
hdr_2021 = hdr_2021.astype(dict_of_types)
hdr_2021.head()
Country Name Country Code Series Name Series Code 2021 [YR2021] indicator
0 Afghanistan AFG Literacy rate, adult total (% of people ages 1... SE.ADT.LITR.ZS 37.266041 literacy
1 Afghanistan AFG School enrollment, tertiary (% gross) SE.TER.ENRR NaN enrollment
2 Afghanistan AFG Trained teachers in primary education (% of to... SE.PRM.TCAQ.ZS NaN trained_teachers
3 Afghanistan AFG Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS NaN stunting
4 Afghanistan AFG Mortality rate, adult, female (per 1,000 femal... SP.DYN.AMRT.FE 214.241000 mortality_f

Now we can look at some summary statistics; we’ll use the skimpy package.

You will need to install skimpy on the command line of your computer by typing

pip install skimpy

before running the code below.

skim(hdr_2021)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 1862   │ │ category    │ 4     │ │ Country Name          │                                │
│ │ Number of columns │ 6      │ │ string      │ 1     │ │ Country Code          │                                │
│ └───────────────────┴────────┘ │ float64     │ 1     │ │ Series Code           │                                │
│                                └─────────────┴───────┘ │ indicator             │                                │
│                                                        └───────────────────────┘                                │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name         NA     NA %      mean    sd      p0     p25    p50   p75   p100      hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩  │
│ │ 2021 [YR2021]       866   46.51  2600  9400  0.5   76 110 280  110000 │  │
│ └────────────────────┴───────┴──────────┴────────┴────────┴───────┴───────┴──────┴──────┴──────────┴─────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA         NA %           ordered                unique            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩  │
│ │ Country Name                              0            0False                              266 │  │
│ │ Country Code                              0            0False                              266 │  │
│ │ Series Code                               0            0False                                7 │  │
│ │ indicator                                 0            0False                                7 │  │
│ └────────────────────────────────────┴───────────┴───────────────┴───────────────────────┴───────────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                NA       NA %        words per row                 total words              ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ Series Name                    0         0                         8.3                   15428 │  │
│ └───────────────────────────┴─────────┴────────────┴──────────────────────────────┴──────────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

Before we can calculate indices, we need to set minimum and maximum values, which we base on the minimum and maximum values in the data. Let’s look at the max and min by indicator:

min_max = hdr_2021.groupby(["indicator"])["2021 [YR2021]"].agg(["min", "max"])
min_max
min max
indicator
enrollment 2.411387 150.201767
income 265.277629 107243.417579
literacy 37.266041 100.000000
mortality_f 22.098000 432.957000
mortality_m 42.594000 552.668000
stunting 0.500000 46.000000
trained_teachers 22.573050 100.000000

We want observations to be within the interval of min to max. Let’s do this systematically for all of the indicators:

hdr_2021["index_indicator"] = hdr_2021.groupby(["indicator"])[
    "2021 [YR2021]"
].transform(lambda x: (x - x.min()) / (x.max() - x.min()))
hdr_2021.head(7)
Country Name Country Code Series Name Series Code 2021 [YR2021] indicator index_indicator
0 Afghanistan AFG Literacy rate, adult total (% of people ages 1... SE.ADT.LITR.ZS 37.266041 literacy 0.000000
1 Afghanistan AFG School enrollment, tertiary (% gross) SE.TER.ENRR NaN enrollment NaN
2 Afghanistan AFG Trained teachers in primary education (% of to... SE.PRM.TCAQ.ZS NaN trained_teachers NaN
3 Afghanistan AFG Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS NaN stunting NaN
4 Afghanistan AFG Mortality rate, adult, female (per 1,000 femal... SP.DYN.AMRT.FE 214.241000 mortality_f 0.467662
5 Afghanistan AFG Mortality rate, adult, male (per 1,000 male ad... SP.DYN.AMRT.MA 342.158000 mortality_m 0.587295
6 Afghanistan AFG GNI per capita (constant 2015 US$) NY.GNP.PCAP.KD NaN income NaN

You should be able to see that the column index_indicator is the literacy indicator for Afghanistan, converted to index format by subtracting the minimum value (in the whole dataset) and dividing by the difference between the maximum and minimum for literacy.

Now we’ll build the new health and education indicators. To do this, we’ll aggregate over existing indicators—so we’re going to end up with a smaller dataframe that only has each country in it once. For neatness, we can also pack these into a single dataframe.

# create educ index
educ_index_indicators = ["literacy", "enrollment", "trained_teachers"]
hdr_2021_educ = (
    hdr_2021.loc[hdr_2021["indicator"].isin(educ_index_indicators), :]
    .groupby(["Country Code"])["index_indicator"]
    .mean()
)


# create health index. These indicators are less good when the number is high
# so we invert this measure
health_index_indicators = ["stunting", "mortality_m", "mortality_f"]
hdr_2021_health = (
    1.0
    - hdr_2021.loc[hdr_2021["indicator"].isin(health_index_indicators), :]
    .groupby(["Country Code"])["index_indicator"]
    .mean()
)


# create income indicator
hdr_2021_income = (
    hdr_2021.loc[hdr_2021["indicator"].isin(["income"]), :]
    .groupby(["Country Code"])["index_indicator"]
    .mean()
)


# combine them all
hdr_2021_long = pd.concat([hdr_2021_educ, hdr_2021_health, hdr_2021_income], axis=1)
# give each sensible names (they both inherit column name of "index_indicator" by default)
hdr_2021_long.columns = ["educ_index", "health_index", "income_index"]
skim(hdr_2021_long.reset_index())
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 266    │ │ float64     │ 3     │ │ Country Code          │                                │
│ │ Number of columns │ 4      │ │ category    │ 1     │ └───────────────────────┘                                │
│ └───────────────────┴────────┘ └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name        NA    NA %     mean    sd      p0    p25      p50      p75     p100   hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩  │
│ │ educ_index         64  24.06  0.59  0.23   0   0.47   0.61  0.74    1▁▂▃▇▆▃  │  │
│ │ health_index       42  15.79  0.68   0.2   0   0.55    0.7  0.84    1 ▁▅▅▇▆  │  │
│ │ income_index       85  31.95  0.12  0.17   0  0.015  0.048  0.14    1  ▇▁▁   │  │
│ └───────────────────┴──────┴─────────┴────────┴────────┴──────┴─────────┴─────────┴────────┴───────┴─────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                         NA         NA %           ordered                unique            ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩  │
│ │ Country Code                              0            0False                              266 │  │
│ └────────────────────────────────────┴───────────┴───────────────┴───────────────────────┴───────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

We’d like to merge our two new indicators back into the main dataset on the same basis as the existing ones. Our new indices are in a different shape, so to do this we have to either reshape hdr_2021 to be in wide format or reshape hdr_2021_long to be in wide format. Let’s try the latter, adding in the ‘missing’ columns as we go:

new_indices_wide = pd.melt(
    hdr_2021_long.reset_index(),
    id_vars="Country Code",
    var_name="indicator",
    value_name="index_indicator",
)
country_code_to_name = dict(
    zip(
        hdr_2021["Country Code"].drop_duplicates(),
        hdr_2021["Country Name"].drop_duplicates(),
    )
)
new_indices_wide["Series Name"] = new_indices_wide["indicator"]
new_indices_wide["Series Code"] = new_indices_wide["indicator"]
new_indices_wide["Country Name"] = new_indices_wide["Country Code"].map(
    country_code_to_name
)
hdr_2021 = pd.concat([new_indices_wide, hdr_2021], axis=0)
hdr_2021
Country Code indicator index_indicator Series Name Series Code Country Name 2021 [YR2021]
0 ABW educ_index NaN educ_index educ_index Aruba NaN
1 AFE educ_index 0.303252 educ_index educ_index Africa Eastern and Southern NaN
2 AFG educ_index 0.000000 educ_index educ_index Afghanistan NaN
3 AFW educ_index 0.479030 educ_index educ_index Africa Western and Central NaN
4 AGO educ_index NaN educ_index educ_index Angola NaN
... ... ... ... ... ... ... ...
1857 WLD trained_teachers 0.827012 Trained teachers in primary education (% of to... SE.PRM.TCAQ.ZS World 86.606087
1858 WLD stunting NaN Prevalence of stunting, height for age (% of c... SH.STA.STNT.ZS World NaN
1859 WLD mortality_f 0.283229 Mortality rate, adult, female (per 1,000 femal... SP.DYN.AMRT.FE World 138.465382
1860 WLD mortality_m 0.321034 Mortality rate, adult, male (per 1,000 male ad... SP.DYN.AMRT.MA World 206.345017
1861 WLD income 0.100762 GNI per capita (constant 2015 US$) NY.GNP.PCAP.KD World 11044.618369

Now we have added both of the alternative indicators to our original HDI data, we can now compute our own HDI indicator. First, we’ll flip (pivot) the data to wide format, then repeat the same calculation as in Python walk-through 4.7.

hdr_2021_w = hdr_2021.pivot(
    index=["Country Code", "Country Name"],
    columns="indicator",
    values="index_indicator",
)
hdr_2021_w = hdr_2021_w.assign(
    hdi_own=lambda x: np.power(
        x["educ_index"] * x["health_index"] * x["income_index"], 1 / 3
    )
)
skim(hdr_2021_w)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types                                                                 │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                                                          │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃                                                          │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                                                          │
│ │ Number of rows    │ 266    │ │ float64     │ 11    │                                                          │
│ │ Number of columns │ 11     │ └─────────────┴───────┘                                                          │
│ └───────────────────┴────────┘                                                                                  │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓  │
│ ┃ column_name            NA     NA %     mean    sd     p0   p25      p50     p75    p100   hist   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩  │
│ │ educ_index              64  24.06  0.59 0.23  0   0.47  0.61 0.74    1▁▂▃▇▆▃ │  │
│ │ enrollment             110  41.35  0.33  0.2  0   0.14  0.35 0.49    1▇▆▇▆▁  │  │
│ │ health_index            42  15.79  0.68  0.2  0   0.55   0.7 0.84    1 ▁▅▅▇▆ │  │
│ │ income                  85  31.95  0.12 0.17  0  0.015 0.048 0.14    1 ▇▁▁   │  │
│ │ income_index            85  31.95  0.12 0.17  0  0.015 0.048 0.14    1 ▇▁▁   │  │
│ │ literacy               184  69.17  0.79 0.23  0   0.62  0.91 0.98    1  ▂▂▂▇ │  │
│ │ mortality_f             44  16.54   0.3  0.2  0   0.14  0.24 0.42    1▇▇▅▃▁  │  │
│ │ mortality_m             44  16.54  0.35  0.2  0   0.19  0.34 0.47    1▅▇▇▅▁  │  │
│ │ stunting               249  93.61  0.43 0.31  0   0.15  0.47 0.57    1▇▃▆▃▂▅ │  │
│ │ trained_teachers       150  56.39  0.83 0.19  0   0.74  0.87    1    1   ▂▅▇ │  │
│ │ hdi_own                139  52.26  0.28 0.16  0   0.14  0.26 0.37 0.71▅▇▇▃▂▂ │  │
│ └───────────────────────┴───────┴─────────┴────────┴───────┴─────┴─────────┴────────┴───────┴───────┴────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

We can see that our alternative HDI has 139 missing values (52% of all entries) because of the high number of missing data for at least one of the health and education indicators that we used to construct this HDI.

Calculate ranks

To compare the ranks of the two indices (the original HDI and our alternative HDI), we should only rank the countries that have observations for both indices. We will create a dataframe called HDR2021_sub that contains this subset of countries.

# drop columns which don’t have an alternative HDI
hdr_2021_w_sub = hdr_2021_w.dropna(subset=["hdi_own"]).drop(
    health_index_indicators + educ_index_indicators + ["income"], axis=1
)
hdr_2021_w_sub
indicator educ_index health_index income_index hdi_own
Country Code Country Name
AFE Africa Eastern and Southern 0.303252 0.453241 0.010844 0.114228
AFW Africa Western and Central 0.479030 0.357431 0.013574 0.132462
ALB Albania 0.449018 0.873329 0.042324 0.255080
ARB Arab World 0.542176 0.758086 0.053874 0.280810
ARG Argentina 0.708566 0.813830 0.111018 0.400039
... ... ... ... ... ...
VUT Vanuatu 0.913125 0.727765 0.025482 0.256794
WLD World 0.626291 0.697868 0.100762 0.353142
WSM Samoa 0.544117 0.770036 0.032910 0.239798
ZAF South Africa 0.497539 0.318312 0.051993 0.201934
ZWE Zimbabwe 0.969264 0.224262 0.009212 0.126042

There are 127 countries with sufficient information to produce our alternative HDI.

We’ll now merge hdr_2021_w_sub, which has our HDI indicator in it, with the dataframe containing the original HDI indicator. The only column in common is one that contains country names—though these are called something different in each dataset—so we must rename the relevant column in at least one of the two dataframes before merging. The merge will create a new dataframe called hdr_merged.

hdr_merged = pd.merge(
    hdr_2021_w_sub.reset_index().rename(columns={"Country Name": "country"}),
    df_hdr.drop(
        ["mean_yrs_school", "life_exp", "exp_yrs_school", "gni_capita"], axis=1
    ),
    on=["country"],
    how="inner",
)
hdr_merged
Country Code country educ_index health_index income_index hdi_own hdi_rank hdi gni_hdi_rank hdi_rank_2020 i_health i_education i_income hdi_calc
0 ALB Albania 0.449018 0.873329 0.042324 0.255080 67 0.796 17 68 0.868655 0.777548 0.747871 0.796405
1 ARG Argentina 0.708566 0.813830 0.111018 0.400039 47 0.842 17 47 0.852152 0.868100 0.807173 0.842076
2 ARM Armenia 0.521781 0.711140 0.038687 0.243035 85 0.759 4 87 0.800663 0.742031 0.737094 0.759390
3 BDI Burundi 0.026222 0.460971 0.000000 0.000000 187 0.426 4 187 0.640965 0.402162 0.300649 0.426349
4 BEL Belgium 0.543180 0.932112 0.402899 0.588667 13 0.937 7 16 0.951980 0.912523 0.945527 0.936516
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
77 VNM Viet Nam 0.246713 0.793902 0.027691 0.175697 115 0.703 6 113 0.824894 0.638785 0.659405 0.703021
78 VUT Vanuatu 0.913125 0.727765 0.025482 0.256794 140 0.607 23 142 0.776138 0.555920 0.518011 0.606872
79 WSM Samoa 0.544117 0.770036 0.032910 0.239798 111 0.707 24 112 0.811808 0.725095 0.599962 0.706845
80 ZAF South Africa 0.497539 0.318312 0.051993 0.201934 109 0.713 -17 102 0.651400 0.758097 0.734668 0.713216
81 ZWE Zimbabwe 0.969264 0.224262 0.009212 0.126042 146 0.593 9 145 0.603894 0.626779 0.549871 0.592623

So we have a successful inner merge for 82 countries. Finally, we’re going to compute a ranking of countries based on our alternative HDI. We use the ascending=False option in the .rank method so the ‘best’ value will have a rank of 1. We’ll also have to re-rank these 82 countries according to their original HDI (so the ranks will range from 1 to 82, not 1 to 191).

hdr_merged["hdi_own_rank"] = hdr_merged["hdi_own"].rank(ascending=False).astype("int")
hdr_merged["hdi_original_rank"] = hdr_merged["hdi_rank"].rank(ascending=True).astype("int")

Now we can make a scatter plot of the ranks compared with one another:

(ggplot(hdr_merged, aes(x="hdi_original_rank", y="hdi_own_rank")) + geom_point(size=3))
Scatterplot of ranks for HDI and alternative HDI index.
Fullscreen

Figure 4.9 Scatterplot of ranks for HDI and alternative HDI index.

We can improve this chart quite a bit. One particularly useful addition would be some text showing some key values. Using lets_plot’s system for mapping variables onto layers of a chart, we can do this fairly easily using long format data by adding a column called text that has an entry only every ten values (otherwise it would overwhelm the chart). We first sort by the 2021 original HDI rank, then create an empty text column, and then add the real country name into the text column only for every tenth entry.

hdr_merged = hdr_merged.sort_values("hdi_rank_2021")
hdr_merged["text"] = ""
hdr_merged.iloc[::10, -1] = hdr_merged.iloc[::10, 1]
hdr_merged.head()
Country Code country educ_index health_index income_index hdi_own hdi_rank hdi gni_hdi_rank hdi_rank_2020 i_health i_education i_income hdi_calc hdi_own_rank hdi_original_rank text
15 CHE Switzerland 0.470143 0.972723 0.796181 0.714074 1 0.962 5 3 0.984418 0.920330 0.982810 0.962050 1 1 Switzerland
23 DNK Denmark 0.551937 0.947001 0.562121 0.664798 6 0.948 6 5 0.944235 0.932016 0.967208 0.947708 2 2
70 SWE Sweden 0.565201 0.965971 0.526429 0.659937 7 0.947 9 9 0.968974 0.920324 0.951740 0.946797 3 3
27 FIN Finland 0.666191 0.931229 0.436748 0.647086 11 0.940 11 12 0.954432 0.929121 0.937088 0.940154 4 4
59 NZL New Zealand 0.521018 0.932627 0.374390 0.566624 13 0.937 16 13 0.960789 0.931490 0.919639 0.937147 8 5
(
    ggplot(hdr_merged, aes(x="hdi_original_rank", y="hdi_own_rank"))
    + geom_point(size=3)
    + geom_text(aes(label="text"), size=5)
    + labs(
        x="HDI rank (2021)",
        y="Alternative HDI rank (2021)",
        title="Comparing two human development indices",
    )
)
Scatterplot of ranks for HDI and alternative HDI index with text.
Fullscreen

Figure 4.10 Scatterplot of ranks for HDI and alternative HDI index with text.

You can see that in general the rankings are similar. If they were identical, the points in the scatterplot would form a straight upward-sloping line. They do not form a straight line, but there is a very strong positive correlation. There are, however, a few countries where the alternative definitions have caused a change in ranking, so let’s look at where the alternative definitions have caused a big change in ranking. We’ll use the .diff method, then the head and tail methods to find the largest positive and negative differences in rank (our HDI vs the original HDI).

hdr_merged = hdr_merged.assign(
    rank_diff=hdr_merged[["hdi_own_rank", "hdi_original_rank"]].diff(axis=1)[
        "hdi_original_rank"
    ]
).sort_values("rank_diff", ascending=False)
hdr_merged.head(6)
Country Code country educ_index health_index income_index hdi_own hdi_rank hdi gni_hdi_rank hdi_rank_2020 i_health i_education i_income hdi_calc hdi_own_rank hdi_original_rank text rank_diff
78 VUT Vanuatu 0.913125 0.727765 0.025482 0.256794 140 0.607 23 142 0.776138 0.555920 0.518011 0.606872 43 66 23
46 MDV Maldives 0.965888 0.968041 0.080761 0.422675 90 0.747 -14 97 0.921818 0.595056 0.761332 0.747468 21 43 22
12 BRN Brunei Darussalam 0.961584 0.814414 0.277262 0.601046 51 0.829 -42 49 0.840652 0.693549 0.977193 0.829007 5 21 Brunei Darussalam 16
35 JAM Jamaica 1.000000 0.684760 0.042672 0.308008 110 0.709 4 110 0.776928 0.677513 0.676918 0.708943 37 50 13
44 MAR Morocco 0.636719 0.840264 0.027859 0.246098 123 0.683 1 122 0.831414 0.590309 0.648157 0.682641 45 57 12
26 ECU Ecuador 0.883752 0.745624 0.048020 0.316293 95 0.740 11 99 0.825692 0.700133 0.700273 0.739756 36 46 10
hdr_merged.tail(6)
Country Code country educ_index health_index income_index hdi_own hdi_rank hdi gni_hdi_rank hdi_rank_2020 i_health i_education i_income hdi_calc hdi_own_rank hdi_original_rank text rank_diff
0 ALB Albania 0.449018 0.873329 0.042324 0.255080 67 0.796 17 68 0.868655 0.777548 0.747871 0.796405 44 31 Albania -13
14 BWA Botswana 0.138664 0.324398 0.054252 0.134634 117 0.693 -43 110 0.632937 0.685365 0.768495 0.693385 71 56 -15
9 BIH Bosnia and Herzegovina 0.278545 0.889152 0.052261 0.234792 74 0.780 4 73 0.850774 0.734544 0.759302 0.779978 50 35 -15
28 GEO Georgia 0.503483 0.718592 0.040599 0.244904 63 0.802 17 64 0.795292 0.859866 0.753465 0.801691 46 29 -17
49 MKD North Macedonia 0.258053 0.852949 0.045352 0.215316 78 0.770 -3 79 0.828331 0.719395 0.765856 0.769909 53 36 -17
57 NIC Nicaragua 0.118331 0.764462 0.015667 0.112326 126 0.667 6 129 0.828258 0.589282 0.608720 0.667271 76 59 -17

From this, we can see that countries such as Vanuatu, Maldives, and Brunei Darussalam (now called Brunei) do much better on the alternative ranking. Meanwhile, Georgia, North Macedonia and Nicaragua have fared less well on this alternative ranking.

  1. Compare your alternative index to the HDI:
  • The UN classifies countries into four groups depending on their HDI, as shown in Figure 4.11. Would the classification of any country change under your alternative HDI?
  • Based on your answers to Questions 5(d) and 6(a), do you think that the HDI is a robust measure of non-material wellbeing? (In other words, does changing the indicators used in the HDI give similar conclusions about the non-material wellbeing of countries?)
Classification HDI
Very high human development 0.800 and above
High human development 0.700–0.799
Medium human development 0.550–0.699
Low human development Below 0.550

Figure 4.11 Classification of countries according to their HDI value.

United Nations Development Programme. 2016. ‘Technical notes’ in Human Development Report 2016: p.3.

We will now investigate whether HDI and GDP per capita give similar information about overall wellbeing, by comparing a country’s rank in both measures. To answer Question 7, first add (merge) this data to the dataframe that contains your HDI calculations, making sure to match the data to the correct country. (Python walk-through 4.8 shows you how to extract the indicator(s) you need and merge them with another dataset.)

  1. Evaluate GDP per capita and the HDI as measures of overall wellbeing:
  • Create a new column showing each country’s rank according to GDP per capita, where 1 is assigned to the country with the highest value. Show this rank on a scatterplot, with GDP per capita rank on the vertical axis and HDI rank on the horizontal axis. (See Python walk-through 4.8 for a step-by-step explanation of how to create scatterplots in R.)
  • Does there appear to be a strong correlation between these two variables? Give an explanation for the observed pattern in your chart.
  • Create a table similar to Figure 4.12 below. Using your answers to Question 7(a), fill each box with three countries. You can use the UN’s definition of ‘high’ for the HDI, as in Figure 4.11, and choose a reasonable definition of ‘high’ for GDP. Based on this table, which country or countries would you prefer to grow up in, and why?
  • Explain the differences between HDI and GDP as measures of wellbeing. You may want to consider the way each measure includes income in its calculation (the actual value or a transformation), and the inclusion of other aspects of wellbeing.
HDI
Low High
GDP Low
High

Figure 4.12 Classification of countries according to their HDI and GDP values.

  1. The HDI is one way to measure wellbeing, but there are many other ways to measure wellbeing.
  • What are the strengths and limitations of the HDI as a measure of wellbeing?
  • Find some alternative measures of non-material wellbeing that we could use alongside the HDI to provide a more comprehensive picture of wellbeing. For each measure, evaluate the elements used to construct the measure, and discuss the additional information we can learn from it. (You may find it helpful to read Our World in Data’s page on happiness and life satisfaction.)