5. Measuring inequality: Lorenz curves and Gini coefficients

Working in Python

Download the code

To download the code chunks used in this project, right-click on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in Python.

Don’t forget to also download the data into your working directory by following the steps in this project.

Python-specific learning objectives

In addition to the learning objectives for this project, in Part 5.1 you will learn how to use loops to repeat specified tasks for a list of values (Note: this is an extension task, and so may not apply to all users).

Getting started in Python

Head to the ‘Getting Started in Python’ page for help and advice on setting up a Python session to work with. Remember, you can run any page from this book as a notebook by downloading the relevant file from this repository and running it on your own computer. Alternatively, you can run pages online in your browser over at Binder.

Preliminary settings

Let’s import the packages we’ll need and also configure the settings we want:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from skimpy import skim
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html(no_js=True)

### You don't need to use these settings yourself
### — they are just here to make the book look nicer!
# Set the plot style for prettier charts:
plt.style.use(
"https://raw.githubusercontent.com/aeturrell/core_python/main/plot_style.txt"
)

Part 5.1 Measuring income inequality

Learning objectives for this part

  • Draw Lorenz curves.
  • Calculate and interpret the Gini coefficient.
  • Interpret alternative measures of income inequality.

One way to visualize the income distribution in a population is to draw a Lorenz curve. This curve shows the entire population lined up along the horizontal axis from the poorest to the richest. The height of the curve at any point on the vertical axis indicates the fraction of total income received by the fraction of the population given by that point on the horizontal axis.

We will start by using income decile data from the Global Consumption and Income Project to draw Lorenz curves and compare changes in the income distribution of a country over time. Note that income here refers to market income, which does not take into account taxes or government transfers (see Section 5.10 of Economy, Society, and Public Policy for further details).

To answer the question below:

  • Go to the Globalinc website and download the Excel file containing the data by clicking ‘xlsx’.
  • Save the Excel file in the ‘data’ subfolder of the directory you are coding in, so that the relative filepath is data/GCIPrawdata.xlsx.
  • Import the data into Python as explained in Python walk-through 5.1.

Python walk-through 5.1 Importing the Excel file (.xlsx or .xls) into Python

As we are importing an Excel file, we use the pd.read_excel function from the pandas package. The file is called ‘GCIPrawdata.xlsx’. Before you import the file into Python, open the datafile in Excel to understand its structure. You will see that the data is all in one worksheet (which is convenient), and that the headings for the variables are in the third row. Hence, we will use the skiprows=2 option in the pd.read_excel function to skip the first two rows.
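To see what skiprows does without opening Excel, here is a minimal sketch that uses a small made-up text file held in memory. It assumes that pd.read_csv treats skiprows in the same way as pd.read_excel; the two title lines are invented for illustration, and raw_text and toy_df are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical file contents: two title rows, then the header row, then data
raw_text = """Some title line
Another note about the data
Country,Year,Decile 1 Income
Afghanistan,1980,206
Afghanistan,1981,212
"""

# skiprows=2 drops the first two lines, so the third line becomes the header
toy_df = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(toy_df)
```

Without skiprows=2, the first title line would be treated as the column headings instead.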

Now let’s import the data using the Path module to create the path to the data, and look at the first few rows with head():

df = pd.read_excel(Path("data/GCIPrawdata.xlsx"), skiprows=2)
df.head()
Country Year Decile 1 Income Decile 2 Income Decile 3 Income Decile 4 Income Decile 5 Income Decile 6 Income Decile 7 Income Decile 8 Income Decile 9 Income Decile 10 Income Mean Income Population
0 Afghanistan 1980 206 350 455 556 665 793 955 1,187 1,594 3,542 1,030 13,211,412
1 Afghanistan 1981 212 361 469 574 686 818 986 1,225 1,645 3,655 1,063 12,996,923
2 Afghanistan 1982 221 377 490 599 716 854 1,029 1,278 1,717 3,814 1,109 12,667,001
3 Afghanistan 1983 238 405 527 644 771 919 1,107 1,376 1,848 4,105 1,194 12,279,095
4 Afghanistan 1984 249 424 551 674 806 961 1,157 1,438 1,932 4,291 1,248 11,912,510

The data is now in a pandas dataframe, which is the primary object for data analysis in Python. You can always tell the type of object you are dealing with in Python by running type on it:

type(df)
pandas.core.frame.DataFrame

In the data, each row represents a different country–year combination. The first row is for Afghanistan in 1980, and the first value (in the third column) is 206, for the variable Decile 1 Income. This value indicates that the mean annual income of the poorest 10% in Afghanistan was the equivalent of 206 USD (in 1980, adjusted using purchasing power parity). Looking at the next column, you can see that the mean income of the next richest 10% (those in the 11th to 20th percentiles for income) was 350.

To see the list of variables, we use the df.info() method.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4799 entries, 0 to 4798
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Country           4799 non-null   object
 1   Year              4799 non-null   int64 
 2   Decile 1 Income   4799 non-null   int64 
 3   Decile 2 Income   4799 non-null   int64 
 4   Decile 3 Income   4799 non-null   int64 
 5   Decile 4 Income   4799 non-null   int64 
 6   Decile 5 Income   4799 non-null   int64 
 7   Decile 6 Income   4799 non-null   int64 
 8   Decile 7 Income   4799 non-null   int64 
 9   Decile 8 Income   4799 non-null   int64 
 10  Decile 9 Income   4799 non-null   int64 
 11  Decile 10 Income  4799 non-null   int64 
 12  Mean Income       4799 non-null   int64 
 13  Population        4799 non-null   int64 
dtypes: int64(13), object(1)
memory usage: 525.0+ KB

In addition to the country, year, and the ten income deciles, we have mean income and the population.

To draw Lorenz curves, we need to calculate the cumulative share of total income owned by each decile (these will be the vertical axis values). The cumulative income share of a particular decile is the proportion of total income held by that decile and all the deciles below it. For example, if Decile 1 has 1/10 of total income and Decile 2 has 2/10 of total income, the cumulative income share of Decile 2 is 3/10 (or 0.3).
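The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration using the two hypothetical shares from the example above (1/10 and 2/10), with exact fractions from the standard library rather than pandas.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical income shares for the two poorest deciles (from the example above)
shares = [Fraction(1, 10), Fraction(2, 10)]

# Running total: each entry adds the current share to all the shares below it
cumulative_shares = list(accumulate(shares))
print([float(s) for s in cumulative_shares])
```

The second entry is the cumulative income share of Decile 2, which is 3/10.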

  1. Choose two countries. You will be using their data, for 1980 and 2014, as the basis for your Lorenz curves. Use the country data you have selected to calculate the cumulative income share of each decile. (Remember that each decile represents 10% of the population.)

Python walk-through 5.2 Calculating cumulative shares using the cumsum function

Before we calculate cumulative income shares, we need to calculate the total income for each country–year combination using the mean income and the population size. We’ll save this variable as total_income.

df["total_income"] = df["Mean Income"] * df["Population"]

Here we have chosen China (a country that recently underwent enormous economic changes) and the US (a developed country). We use the .loc indexer to create a new dataframe (called xf) containing only the countries and years we need.

To select the rows, we’re going to use some lists (defined using listname = ['first property', 'second property', ...]) and then the df[columnname].isin(columnproperties) syntax. What this code does is ask, for each row, whether the value in the specified column is in the list that we pass to isin. If it is, ‘True’ is returned for that row, and we keep only the rows marked ‘True’. For example, in the code below, we only get rows that have years that are either 1980 or 2014. By using the & operator, we can get rows that have both the years of interest and the countries of interest simultaneously.

# Create lists for the years and countries we'd like
sel_year = [1980, 2014]
sel_country = ["United States", "China"]

xf = df.loc[(df["Year"].isin(sel_year)) & (df["Country"].isin(sel_country)), :]
xf
Country Year Decile 1 Income Decile 2 Income Decile 3 Income Decile 4 Income Decile 5 Income Decile 6 Income Decile 7 Income Decile 8 Income Decile 9 Income Decile 10 Income Mean Income Population total_income
893 China 1980 79 113 146 177 210 245 286 336 404 520 252 981,200,000 247,262,400,000
927 China 2014 448 927 1,440 2,008 2,659 3,445 4,457 5,911 8,473 18,689 4,846 1,364,000,000 6,609,944,000,000
4554 United States 1980 3,392 5,820 7,855 9,724 11,574 13,549 15,843 18,839 23,622 37,949 14,817 227,200,000 3,366,422,400,000
4588 United States 2014 3,778 6,534 9,069 11,552 14,132 16,993 20,429 25,061 32,763 60,418 20,073 318,900,000 6,401,279,700,000
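To see how the two isin conditions combine, here is a minimal sketch on a made-up four-row dataframe (toy and the mask names are hypothetical, chosen only for illustration). Each condition produces a column of True/False values, and & keeps a row only when both are True.

```python
import pandas as pd

# Toy data (made up) with the same 'Country' and 'Year' column names
toy = pd.DataFrame(
    {
        "Country": ["China", "China", "India", "United States"],
        "Year": [1980, 1990, 1980, 2014],
    }
)

year_mask = toy["Year"].isin([1980, 2014])  # True where the year is in the list
country_mask = toy["Country"].isin(["United States", "China"])
both = year_mask & country_mask  # True only where both conditions hold

print(pd.DataFrame({"year_ok": year_mask, "country_ok": country_mask, "both": both}))
print(toy.loc[both, :])
```

Only the first and last rows survive: the second row fails the year condition and the third fails the country condition.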

These numbers are very large, so for our purposes it is easier to assume that there is only one person in each decile; in other words, that the total income is 10 times the mean income. This simplification works because, by definition, each decile has exactly the same number of people (10% of the population).
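As a quick check that this simplification does not change the income shares, the sketch below computes them both ways for China in 1980, using the figures from the table above. The variable names are hypothetical.

```python
import numpy as np

# China, 1980: decile mean incomes, mean income, and population (from the table above)
deciles = np.array([79, 113, 146, 177, 210, 245, 286, 336, 404, 520])
mean_income = 252
population = 981_200_000

# Full calculation: each decile contains population / 10 people
people_per_decile = population / 10
shares_full = (deciles * people_per_decile) / (mean_income * population)

# Simplified calculation: one person per decile, so total income is 10 * mean
shares_simple = deciles / (10 * mean_income)

print(np.allclose(shares_full, shares_simple))
print(shares_simple.cumsum().round(6))
```

The population cancels out, so both approaches give the same shares. Note that the last cumulative share is slightly below 1 (2,516 / 2,520), because the ten decile incomes do not add up to exactly 10 times the reported mean income, presumably owing to rounding in the published figures.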

We will be using the very useful cumsum function (short for ‘cumulative sum’) to calculate the cumulative income. To understand what this function does, look at this simple example:

test_series = pd.Series([2, 4, 10, 22])
test_series.cumsum()
0     2
1     6
2    16
3    38
dtype: int64

You can see that each number in the sequence is the sum of all the preceding numbers (including itself); for example, we got the third number, 16, by adding 2, 4, and 10. We now apply this function to calculate the cumulative income shares for China (1980) and save them as cum_inc_share_c80.

rows_query = (xf["Year"] == 1980) & (
    xf["Country"] == "China"
)  # create a boolean that is true only for specific year-country rows
cols_with_decile_in = xf.columns[
    xf.columns.str.contains(pat="Decile")
]  # list of columns that have the word 'Decile'

# use the .loc[rows, columns] pattern:
decs_c80 = xf.loc[
    rows_query, cols_with_decile_in
]  # this gets us China, 1980, for columns with 'Decile'
# Give the total income, assuming a population of 10
total_inc = 10 * xf.loc[rows_query, "Mean Income"]
cum_inc_share_c80 = decs_c80.cumsum(axis=1) / total_inc.values[0]
cum_inc_share_c80
Decile 1 Income Decile 2 Income Decile 3 Income Decile 4 Income Decile 5 Income Decile 6 Income Decile 7 Income Decile 8 Income Decile 9 Income Decile 10 Income
893 0.031349 0.07619 0.134127 0.204365 0.287698 0.384921 0.498413 0.631746 0.792063 0.998413
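The axis=1 argument matters here because the deciles sit in columns rather than rows. The following sketch, on a made-up dataframe with hypothetical names, contrasts the two directions in which cumsum can run.

```python
import pandas as pd

# Toy dataframe (made-up numbers): one row per country, one column per group
toy = pd.DataFrame(
    {"Group 1": [1, 10], "Group 2": [2, 20], "Group 3": [3, 30]},
    index=["A", "B"],
)

across_columns = toy.cumsum(axis=1)  # running total from left to right in each row
down_rows = toy.cumsum(axis=0)  # running total from top to bottom in each column

print(across_columns)
print(down_rows)
```

For the decile data we want the running total from Decile 1 to Decile 10 within a single country–year row, which is the axis=1 case.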

Now although this code clearly shows exactly what we did for China in 1980, what if we want to do the same task for all year–country combinations? If we define a function, which we’ll call create_cumulative_income_shares, we can perform this task on any inputs specified (dataframe, year, and country):

def create_cumulative_income_shares(data, year, country):
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    # Give the total income, assuming a population of 10
    total_inc = 10 * data.loc[query, "Mean Income"]
    cum_inc_share = decs.cumsum(axis=1) / total_inc.values[0]
    cum_inc_share.index = [country + ", " + str(year)]
    cum_inc_share.columns = range(1, len(cum_inc_share.columns) + 1)
    return cum_inc_share

Now we need to pass all combinations of countries and years into our function. (This task could be automated too, but it would only be worth it for many combinations. As we only have four combinations, we’ll just enter the different combinations manually.)

cum_inc_share_c14 = create_cumulative_income_shares(xf, 2014, "China")
cum_inc_share_us80 = create_cumulative_income_shares(xf, 1980, "United States")
cum_inc_share_us14 = create_cumulative_income_shares(xf, 2014, "United States")
cum_inc_share_c80 = create_cumulative_income_shares(xf, 1980, "China")
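If you had many combinations, one way to automate this step is sketched below. It is only an illustration: so that it runs on its own, it retypes the four rows shown earlier into a hypothetical dataframe called toy_xf and repeats the logic of create_cumulative_income_shares under the hypothetical name cumulative_shares. It then uses itertools.product to generate every country–year combination.

```python
from itertools import product

import pandas as pd

# The four rows used above (decile incomes and mean income), typed in by hand
decile_cols = [f"Decile {i} Income" for i in range(1, 11)]
rows = [
    ("China", 1980, [79, 113, 146, 177, 210, 245, 286, 336, 404, 520], 252),
    ("China", 2014, [448, 927, 1440, 2008, 2659, 3445, 4457, 5911, 8473, 18689], 4846),
    ("United States", 1980,
     [3392, 5820, 7855, 9724, 11574, 13549, 15843, 18839, 23622, 37949], 14817),
    ("United States", 2014,
     [3778, 6534, 9069, 11552, 14132, 16993, 20429, 25061, 32763, 60418], 20073),
]
toy_xf = pd.DataFrame(
    [[country, year, *decs, mean] for country, year, decs, mean in rows],
    columns=["Country", "Year", *decile_cols, "Mean Income"],
)


def cumulative_shares(data, year, country):
    """Same logic as create_cumulative_income_shares, repeated so this runs alone."""
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    shares = decs.cumsum(axis=1) / (10 * data.loc[query, "Mean Income"].values[0])
    shares.index = [f"{country}, {year}"]
    shares.columns = range(1, len(shares.columns) + 1)
    return shares


# Loop over every country-year combination and stack the results into one table
all_shares = pd.concat(
    [
        cumulative_shares(toy_xf, year, country)
        for country, year in product(["China", "United States"], [1980, 2014])
    ]
)
print(all_shares.round(3))
```

The result has one row per country–year pair, so adding more countries or years only requires extending the two lists passed to product.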
  1. Use the cumulative income shares to draw Lorenz curves for each country in order to visually compare the income distributions over time.
  • Draw a line chart with cumulative share of population on the horizontal axis and cumulative share of income on the vertical axis. Make sure to include a chart legend, and label your axes and chart appropriately.
  • Follow the steps in Python walk-through 5.3 to add a straight line representing perfect equality to each chart. (Hint: If income was shared equally across the population, the bottom 10% of people would have 10% of the total income, the bottom 20% would have 20% of the total income, and so on.)

Python walk-through 5.3 Drawing Lorenz curves

Let us plot the cumulative income shares for China (1980), which we previously stored in the variable cum_inc_share_c80. We’ll use the standard fig, ax = plt.subplots method of constructing an axis on which to plot the data, using the plotting library matplotlib. To remind you of how matplotlib works, we’ll put some comments into the code below to explain what each bit does.

# Add a column of zeros to each dataframe (so the lines will start at (0, 0))
cum_inc_share_c14.insert(0, 0, 0)
cum_inc_share_us80.insert(0, 0, 0)
cum_inc_share_us14.insert(0, 0, 0)
cum_inc_share_c80.insert(0, 0, 0)

fig, ax = plt.subplots()  # create the canvas to plot on (figure and axis)
cum_inc_share_c80.T.plot(
    ax=ax
)  # transpose cum_inc_share_c80 data and plot on axis called ax
ax.plot(
    cum_inc_share_c80.columns, [x / 10 for x in cum_inc_share_c80.columns], color="k"
)  # add 45 degree line for reference
ax.set_ylim(0, 1)  # y-limits run between 0 and 1
ax.set_xlim(0, 10)  # x-limits run between 0 and 10
ax.legend([])  # turn off legend
ax.set_title("Lorenz curve for China (1980)")
ax.set_xlabel("Income decile")
ax.set_ylabel("Cumulative income share")
plt.show() # show the final plot


Figure 5.1 Lorenz curve for China (1980).

The purple line is the Lorenz curve. The Gini coefficient is the ratio of the area between the two lines and the total area under the black line. We will calculate the Gini coefficient in Python walk-through 5.4.

Now we add the other Lorenz curves to the chart by looping over the four dataframes and plotting each one on the same axis. We use the linestyle= option to give each country–year pair a different line pattern, while matplotlib automatically cycles through a different colour for each line. The pandas plot method also adds a legend entry for each line, labelled with the dataframe’s index (for example, ‘China, 1980’).

fig, ax = plt.subplots()
for line, style in zip(
    [cum_inc_share_c80, cum_inc_share_us80, cum_inc_share_us14, cum_inc_share_c14],
    ["-", "-.", "dashed", ":"],
):
    line.T.plot(ax=ax, linestyle=style)
ax.plot(cum_inc_share_c80.columns, [x / 10 for x in cum_inc_share_c80], color="k")
ax.set_ylim(0, 1)
ax.set_xlim(0, 10)
ax.set_title("Lorenz curves for China and the US (1980 and 2014)")
ax.set_xlabel("Income decile")
ax.set_ylabel("Cumulative income share")
plt.show()

Figure 5.2 Lorenz curves, China and the US (1980 and 2014).

As the chart shows, the income distribution has changed more clearly for China (from the solid purple line to the orange dotted line) than for the US (from the red dash-dotted line to the green dashed line).

  1. Using your Lorenz curves:
  • Compare the distribution of income across time for each country.
  • Compare the distribution of income across countries for each year (1980 and 2014).
  • Suggest some explanations for any similarities and differences you observe. (You may want to research your chosen countries to see if there were any changes in government policy, political events, or other factors that may affect the income distribution.)

A rough way to compare income distributions is to use a summary measure such as the Gini coefficient. The Gini coefficient ranges from 0 (complete equality) to 1 (complete inequality). It is calculated by dividing the area between the Lorenz curve and the perfect equality line, by the total area underneath the perfect equality line. Intuitively, the further away the Lorenz curve is from the perfect equality line, the more unequal the income distribution is, and the higher the Gini coefficient will be.
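The area calculation described above can be sketched directly. This is a minimal illustration, not the method used in the walk-through that follows: it takes the China (1980) decile incomes from the data, joins the points on the Lorenz curve with straight lines, and adds up the resulting trapezoids. It assumes the shares are scaled by the sum of the ten decile incomes so that the curve ends exactly at 1, and all variable names are hypothetical.

```python
import numpy as np

# China, 1980 decile incomes (from the data used in this project)
deciles = np.array([79, 113, 146, 177, 210, 245, 286, 336, 404, 520])

# Points on the Lorenz curve, starting at (0, 0) and ending at (1, 1)
pop_share = np.linspace(0, 1, len(deciles) + 1)
income_share = np.concatenate([[0], deciles.cumsum() / deciles.sum()])

# Area under the Lorenz curve: sum of trapezoids between neighbouring points
widths = np.diff(pop_share)
heights = (income_share[1:] + income_share[:-1]) / 2
area_under_lorenz = (widths * heights).sum()

area_under_equality = 0.5  # triangle under the perfect equality line
gini_from_areas = (area_under_equality - area_under_lorenz) / area_under_equality
print(round(gini_from_areas, 2))
```

With these figures the ratio of the two areas comes out at roughly 0.29.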

To calculate the Gini coefficient you can either use a Gini coefficient calculator, or calculate it directly in Python as shown in Python walk-through 5.4.

  1. Calculate the Gini coefficient for each of your Lorenz curves. You should have four coefficients in total. Label each Lorenz curve with its corresponding Gini coefficient, and check that the coefficients are consistent with what you see in your charts.

Python walk-through 5.4 Calculating Gini coefficients

The Gini coefficient is graphically represented by dividing the area between the perfect equality line and the Lorenz curve by the total area under the perfect equality line (see Section 5.9 of Economy, Society, and Public Policy for further details). Let’s first write a function that can compute Gini coefficients on input data. We’ll call the function that calculates Gini coefficients from a vector of numbers gini_coefficient, and we apply it to the income deciles in our data (as done in Python walk-through 5.3).

def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    x = np.double(x.values)
    x = x / x.sum()
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad / np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
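To build some intuition for this formula, the sketch below tries it on two extreme cases and on the China (1980) deciles. It uses a hypothetical variant of the function, gini_from_values, which accepts plain lists so that the example runs on its own. The extreme-case incomes are made up.

```python
import numpy as np


def gini_from_values(values):
    """Gini coefficient via the mean absolute difference (accepts any array-like)."""
    x = np.asarray(values, dtype=float)
    mad = np.abs(np.subtract.outer(x, x)).mean()  # mean absolute difference
    return 0.5 * mad / x.mean()  # half the relative mean absolute difference


equal = gini_from_values([100] * 10)  # everyone has the same income
one_has_all = gini_from_values([0] * 9 + [1000])  # one decile has everything
china_1980 = gini_from_values([79, 113, 146, 177, 210, 245, 286, 336, 404, 520])

print(equal, one_has_all, round(china_1980, 2))
```

Perfect equality gives 0. With only ten groups, the most unequal case gives 0.9 rather than 1, because the richest group still makes up 10% of the population.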

Let’s now demonstrate using this on the four country–year pairs we used earlier. As before, we’ll define a function that returns only the income deciles for a given country–year pair.

def grab_deciles_for_year_country_pair(data, year, country):
    query = (data["Year"] == year) & (data["Country"] == country)
    decs = data.loc[query, [x for x in data.columns if "Decile" in x]]
    return decs

gini_c14 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 2014, "China"))
gini_us80 = gini_coefficient(
    grab_deciles_for_year_country_pair(xf, 1980, "United States")
)
gini_us14 = gini_coefficient(
    grab_deciles_for_year_country_pair(xf, 2014, "United States")
)
gini_c80 = gini_coefficient(grab_deciles_for_year_country_pair(xf, 1980, "China"))

Let’s check one of the Gini coefficients:

print(f"The Gini coefficient for China in 1980 is {gini_c80:.2f}")
The Gini coefficient for China in 1980 is 0.29

Now we make the same line chart as in Python walk-through 5.3, but use the annotate function to label curves with their respective Gini coefficients.

fig, ax = plt.subplots()
for line, style in zip(
    [cum_inc_share_c80, cum_inc_share_us80, cum_inc_share_us14, cum_inc_share_c14],
    ["-", "-.", "dashed", ":"],
):
    line.T.plot(ax=ax, linestyle=style)
ax.plot(cum_inc_share_c80.columns, [x / 10 for x in cum_inc_share_c80], color="k")
ax.set_ylim(0, 1)
ax.set_xlim(0, 10)
ax.set_title("Lorenz curves for China and the US (1980 and 2014)")
ax.set_xlabel("Income decile")
ax.set_ylabel("Cumulative income share")
# Find four points along the lines to use for labels
no_points = len(ax.lines[0].get_ydata())
points_to_label = np.rint(np.linspace(0, no_points - 2, num=4)).astype(int)
for line, name, point in zip(
    ax.lines, [gini_c80, gini_us80, gini_us14, gini_c14], points_to_label
):
    y = line.get_ydata()[point]  # y-value of the point chosen for this label
    x = line.get_xdata()[point]
    text = ax.annotate(
        f"{name:.2f}",
        xy=(x, y),
        xytext=(x + 1.5, y + 0.2 / x),
        color=line.get_color(),
        textcoords="data",
        fontweight="bold",
        backgroundcolor="white",
        arrowprops=dict(arrowstyle="->", connectionstyle="angle3"),
    )
plt.show()

Figure 5.3 Lorenz curves for China and the US (1980 and 2014), with Gini coefficients labelled.

The Gini coefficients for both countries have increased, confirming what we already saw from the Lorenz curves: in both countries, the income distribution has become more unequal.

Extension Python walk-through 5.5 Calculating Gini coefficients for all countries and all years

In this extension walk-through, we show you how to calculate the Gini coefficient for all countries and years in your dataset.

This sounds like a tedious task, and indeed if we were to use the same method as before, it would be mind-numbing. However, we have a powerful programming language at hand, and this is the time to use it.

Here we use a very useful programming tool you may not have come across yet: vectorized operations. These have some analogies with for loops, which iterate over the same code chunk while something changes.

As a reminder, this is what a for loop that prints the square of the numbers from 0 to 9 looks like:

for i in range(10):
    print(i**2)
0
1
4
9
16
25
36
49
64
81

In the for command, range(10) creates a sequence of numbers from 0 to 9 (0, 1, 2, …, 9). The command for i in range(10): defines the variable i initially as 0, then repeats the indented code for each value in the given range. Here our command prints the squared value of i for each value of i. Check that you understand the syntax above by modifying it to print only the first five square numbers, or to add 2 to the numbers from 0 to 9 (instead of squaring these numbers).

We can achieve a similar feat using pandas series and vectorized operations:

number_series = pd.Series(range(10))
number_series.apply(lambda x: x**2)
0     0
1     1
2     4
3     9
4    16
5    25
6    36
7    49
8    64
9    81
dtype: int64

apply(lambda x: x**2) tells Python to apply the operation, x**2 (squaring), to every element in the given series.

Note that, for this simple example, there is a shorter way to achieve the same effect (number_series.pow(2)), but for anything outside of a set of standard functions, you’ll need to use apply.
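To make the comparison concrete, the sketch below computes the same squares in three ways: with an explicit for loop, with apply, and with arithmetic on the whole series at once. The names squares_loop, squares_apply, and squares_vectorized are hypothetical.

```python
import pandas as pd

number_series = pd.Series(range(10))

# 1. An explicit for loop that collects the squares in a list
squares_loop = []
for i in range(10):
    squares_loop.append(i**2)

# 2. apply with a lambda function, as in the walk-through
squares_apply = number_series.apply(lambda x: x**2)

# 3. Arithmetic on the whole series at once
squares_vectorized = number_series**2

print(squares_apply.tolist() == squares_loop == squares_vectorized.tolist())
```

All three give the same numbers. The apply version is the most flexible, because the lambda function can be replaced with any function of your own.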

Let’s now move on to computing the Gini coefficient for all country–year pairs in the dataset.

The following code does many tasks at the same time. Let’s unpack each task. df["Gini"] is a new column that we create in the dataframe we’ve called df (you can tell because it’s on the left-hand side of an expression). The right-hand side tells Python how to create it and uses the syntax df.apply(..., axis=1), which means that Python applies whatever is in the ... to all rows (rows because axis=1). Finally, we need to explain what happens in the .... Here we use one of those ‘lambda’ functions. Remember that they use a dummy name; here we use ‘row’ but we could have called it anything else. The function we apply is one we already created, gini_coefficient, and we pass it the row’s values in all columns that have ‘Decile’ in the title (of which there should be ten).

Remember if you ever lose track of what something is you can always use type(object)!

cols_with_decile_in = [x for x in df.columns if "Decile" in x]

df["Gini"] = df.apply(lambda row: gini_coefficient(row[cols_with_decile_in]), axis=1)
df.head()
Country Year Decile 1 Income Decile 2 Income Decile 3 Income Decile 4 Income Decile 5 Income Decile 6 Income Decile 7 Income Decile 8 Income Decile 9 Income Decile 10 Income Mean Income Population total_income Gini
0 Afghanistan 1980 206 350 455 556 665 793 955 1,187 1,594 3,542 1,030 13,211,412 13,607,754,360 0.424313
1 Afghanistan 1981 212 361 469 574 686 818 986 1,225 1,645 3,655 1,063 12,996,923 13,815,729,149 0.424447
2 Afghanistan 1982 221 377 490 599 716 854 1,029 1,278 1,717 3,814 1,109 12,667,001 14,047,704,109 0.424380
3 Afghanistan 1983 238 405 527 644 771 919 1,107 1,376 1,848 4,105 1,194 12,279,095 14,661,239,430 0.424506
4 Afghanistan 1984 249 424 551 674 806 961 1,157 1,438 1,932 4,291 1,248 11,912,510 14,866,812,480 0.424361
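If it is unclear what the lambda function receives when axis=1, the following sketch may help. It uses a made-up two-row dataframe (toy, decile_cols, and the ‘ratio’ column are hypothetical) to show that each row arrives as a pandas Series, from which we can select columns by name.

```python
import pandas as pd

# Toy dataframe (made-up numbers) with two 'Decile' columns and one other column
toy = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "Decile 1 Income": [10, 5],
        "Decile 2 Income": [30, 45],
    }
)
decile_cols = [x for x in toy.columns if "Decile" in x]

# With axis=1, the lambda receives one row at a time as a pandas Series
row_types = toy.apply(lambda row: type(row).__name__, axis=1)

# Any function of the selected values can be applied, here the top-to-bottom ratio
toy["ratio"] = toy.apply(
    lambda row: row[decile_cols].max() / row[decile_cols].min(), axis=1
)
print(row_types.tolist())
print(toy)
```

Swapping the ratio for gini_coefficient gives the pattern used in the walk-through.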

Using this apply approach, we have 4,799 Gini coefficients in one line. We can even look at some summary statistics for the Gini column across the country–year pairs:

df["Gini"].describe().round(2)
count    4799.00
mean        0.46
std         0.13
min         0.18
25%         0.35
50%         0.48
75%         0.57
max         0.74
Name: Gini, dtype: float64

The average Gini coefficient is 0.46, the maximum is 0.74, and the minimum 0.18. Let’s look at these extreme cases.

First we will look at the extremely equal income distributions (those with a Gini coefficient smaller than 0.20):

small_gini = df.loc[df["Gini"] < 0.2, ["Country", "Year", "Gini"]]
small_gini
Country Year Gini
585 Bulgaria 1987 0.190979
1170 Czech Republic 1985 0.195307
1171 Czech Republic 1986 0.193865
1172 Czech Republic 1987 0.192435
1173 Czech Republic 1988 0.191005
1174 Czech Republic 1989 0.193618
1175 Czech Republic 1990 0.196313
1176 Czech Republic 1991 0.199130
3807 Slovak Republic 1985 0.195279
3808 Slovak Republic 1986 0.194227
3809 Slovak Republic 1987 0.193211
3810 Slovak Republic 1988 0.192201
3811 Slovak Republic 1989 0.193252
3812 Slovak Republic 1990 0.194330
3813 Slovak Republic 1991 0.195390
3814 Slovak Republic 1992 0.196473
3815 Slovak Republic 1993 0.179115

These correspond to eastern European countries in the years just before and just after the fall of communism.

Now let’s display the most unequal countries (those with a Gini coefficient larger than 0.73):

big_gini = df.loc[df["Gini"] > 0.73, ["Country", "Year", "Gini"]]
big_gini
Country Year Gini
613 Burkina Faso 1980 0.738240
614 Burkina Faso 1981 0.738498
615 Burkina Faso 1982 0.738410
616 Burkina Faso 1983 0.738265
617 Burkina Faso 1984 0.738386
618 Burkina Faso 1985 0.738234
619 Burkina Faso 1986 0.738399
620 Burkina Faso 1987 0.738330
621 Burkina Faso 1988 0.738190
622 Burkina Faso 1989 0.738642
623 Burkina Faso 1990 0.738423
624 Burkina Faso 1991 0.738383
625 Burkina Faso 1992 0.738541
626 Burkina Faso 1993 0.738425
627 Burkina Faso 1994 0.738484
628 Burkina Faso 1995 0.738253
629 Burkina Faso 1996 0.738039
630 Burkina Faso 1997 0.737580
631 Burkina Faso 1998 0.737997
2783 Mauritania 1980 0.736685
2784 Mauritania 1981 0.736710
2785 Mauritania 1982 0.736671
2786 Mauritania 1983 0.736621
2787 Mauritania 1984 0.736712
2788 Mauritania 1985 0.736715
2789 Mauritania 1986 0.736777
2790 Mauritania 1987 0.736768

Extension Python walk-through 5.6 Plotting time series of Gini coefficients

In this extension walk-through, we show you how to make time series plots (time on the horizontal axis, the variable of interest on the vertical axis) with Gini coefficients for a list of countries of your choice.

There are many ways to plot data in Python, but the imperative plotting tool matplotlib is the most widely used (and extended). It is widely used in science and academia, most famously to help create the first ever image of a black hole. Although matplotlib is the core tool and can do almost any visualization (if you know how), you may want to check out some other packages, with different strengths and weaknesses here.

First we use .loc together with isin to select a small list of countries and save their data. As an example, we have chosen four anglophone countries: the United Kingdom, the United States, Ireland, and Australia.

countries = ["United Kingdom", "United States", "Ireland", "Australia"]
plot_df = df.loc[df["Country"].isin(countries), ["Country", "Year", "Gini"]]
plot_df.head()
Country Year Gini
210 Australia 1980 0.307924
211 Australia 1981 0.307924
212 Australia 1982 0.310705
213 Australia 1983 0.313513
214 Australia 1984 0.316317

Let’s now plot these as a time series.

fig, ax = plt.subplots()
for country, style in zip(countries, ["-", "-.", "dashed", ":"]):
    plot_df_c = plot_df.loc[plot_df["Country"] == country]
    ax.plot(plot_df_c["Year"], plot_df_c["Gini"], label=country, linestyle=style)
ax.set_xlim(1970, None)
ax.set_title("Gini coefficients for anglophone countries")
ax.set_xlabel("Year")
ax.set_ylabel("Gini")
for line, country in zip(ax.lines, countries):
    y = line.get_ydata()[0]  # NB: to use the end value, change [0] to [-1] instead
    x = line.get_xdata()[0]
    text = ax.annotate(
        country,
        xy=(x, y),
        fontsize=8,
        xytext=(-5, 0),
        color=line.get_color(),
        textcoords="offset points",
        fontweight="bold",
        ha="right",
    )
plt.show();

Figure 5.4 Gini coefficients for anglophone countries.

We asked matplotlib to plot the data for each country in turn, with Year on the horizontal axis (plot_df_c["Year"]) and Gini on the vertical axis (plot_df_c["Gini"]). The linestyle= option gives each country a different line pattern to make it clear the lines are different; matplotlib automatically cycles through colours unless we tell it not to. (See what happens when you change the xytext= options.)

matplotlib is extremely powerful, and if you want to produce a variety of different charts, you may want to read more about that package and other packages for making different kinds of charts. You can find out more about matplotlib on the official documentation, and you can find a long list of commonly used plots here.

Now we will look at other measures of income inequality and see how they can be used along with the Gini coefficient to summarize a country’s income distribution. Instead of summarizing the entire income distribution like the Gini coefficient does, we can take the ratio of incomes at two points in the distribution. For example, the 90/10 ratio takes the ratio of the top 10% of incomes (Decile 10) to the lowest 10% of incomes (Decile 1). A 90/10 ratio of 5 means that the richest 10% earns five times as much as the poorest 10%. The higher the ratio, the higher the inequality between these two points in the distribution.

  1. Look at the following ratios:

    • 90/10 ratio = the ratio of Decile 10 income to Decile 1 income
    • 90/50 ratio = the ratio of Decile 10 income to Decile 5 income (the median)
    • 50/10 ratio = the ratio of Decile 5 income (the median) to Decile 1 income.
  • For each of these ratios, explain why policymakers might want to compare these two deciles in the income distribution.
  • What kinds of policies or events could affect these ratios?
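As an illustration of how these ratios are computed, the sketch below applies the three definitions above to the decile incomes for the United States in 2014 from the data used earlier. The variable names are hypothetical, and Decile 5 income stands in for the median, as in the definitions.

```python
# Decile incomes for the United States in 2014 (from the data used earlier)
us_2014 = [3778, 6534, 9069, 11552, 14132, 16993, 20429, 25061, 32763, 60418]

decile_1, decile_5, decile_10 = us_2014[0], us_2014[4], us_2014[9]

ratio_90_10 = decile_10 / decile_1  # top versus bottom
ratio_90_50 = decile_10 / decile_5  # top versus middle
ratio_50_10 = decile_5 / decile_1  # middle versus bottom

print(f"90/10: {ratio_90_10:.1f}, 90/50: {ratio_90_50:.1f}, 50/10: {ratio_50_10:.1f}")
```

The 90/10 ratio is the product of the other two, so the 90/50 and 50/10 ratios show whether the gap between top and bottom comes mainly from the upper or the lower half of the distribution.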

We will now compare these summary measures (ratios and the Gini coefficient) for a larger group of countries, using OECD data. The OECD has annual data for different ratio measures of income inequality for 42 countries around the world, and has an interactive chart function that plots them for you.

Go to the OECD website to access the data. You will see a chart similar to Figure 5.5, showing the most recent data. The countries are ranked from smallest to largest Gini coefficient on the horizontal axis, and the vertical axis gives the Gini coefficient.

  1. Compare summary measures of inequality for all available countries on the OECD website:
  • Plot the data for the ratio measures by changing the variable selected in the drop-down menu ‘Gini coefficient’. The three ratio measures we looked at previously are called ‘Interdecile P90/P10’, ‘Interdecile P90/P50’, and ‘Interdecile P50/P10’, respectively. (If you click the ‘Compare variables’ option, you can plot more than one variable (except the Gini coefficient) on the same chart.)
  • For each measure, give an intuitive explanation of how it is measured and what it tells us about income inequality. (For example: What do the larger and smaller values of this measure mean? Which parts of the income distribution does this measure use?)
  • Do countries that rank highly on the Gini coefficient also rank highly on the ratio measures, or do the rankings change depending on the measure used? Based on your answers, explain why it is important to look at more than one summary measure of a distribution.

Figure 5.5 OECD countries ranked according to their Gini coefficient (2015).

The Gini coefficient and the ratios we have used are common measures of inequality, but there are other ways to measure income inequality.

  1. Go to the Chartbook of Economic Inequality, which contains five measures of income inequality, including the Gini coefficient, for 25 countries around the world.
  • Choose two measures of income inequality that you find interesting (excluding the Gini coefficient). For each measure, give an intuitive explanation of how it is measured and what we can learn about income inequality from it. You may find the page on ‘Inequality measures’ helpful. (For example: What do larger or smaller values of this measure mean? Which parts of the income distribution does this measure use?)
  • On the Chartbook of Economic Inequality main page, charts of these measures are available for all countries shown in green on the map. For two countries of your choice, look at the charts and explain what these measures tell us about inequality in those countries.

Part 5.2 Measuring other kinds of inequality

Learning objectives for this part

  • Research other dimensions of inequality and how they are measured.

There are many ways to measure income inequality, but income inequality is only one dimension of inequality within a country. To get a more complete picture of inequality within a country, we need to look at other areas in which there may be inequality in outcomes. We will explore two particular areas, namely:

  • health inequality
  • gender inequality in education.

First, we will look at how researchers have measured inequality in health-related outcomes. Besides income, health is an important aspect of wellbeing, partly because it determines how long an individual will be alive to enjoy his or her income. If two people had the same annual income throughout their lives, but one person had a much shorter life than the other, we might say that the distribution of wellbeing is unequal, despite annual incomes being equal.

As with income, inequality in life expectancy can be measured using a Gini coefficient. In the study ‘Mortality inequality’, researcher Sam Peltzman (2009) estimated Gini coefficients for life expectancy based on the distribution of total years lived (life-years) across people born in a given year (birth cohort). If everybody born in a given year lived the same number of years, then the total years lived would be divided equally among these people (perfect equality). If a few people lived very long lives but everybody else lived very short lives, then there would be a high degree of inequality (Gini coefficient close to 1).
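
To make the two extreme cases concrete, here is a minimal sketch that computes a Gini coefficient from a list of lifespans using the generic mean-absolute-difference formula. The cohorts are made up, and the study's own estimation procedure may differ from this simple calculation.

```python
import numpy as np


def gini(values):
    """Gini coefficient from the absolute differences between all pairs."""
    x = np.asarray(values, dtype=float)
    # Absolute difference in years lived between every pair of people
    pairwise_diffs = np.abs(x[:, None] - x[None, :])
    return pairwise_diffs.sum() / (2 * len(x) ** 2 * x.mean())


# Hypothetical birth cohorts of five people (years lived)
equal_cohort = [75, 75, 75, 75, 75]
unequal_cohort = [1, 2, 5, 80, 90]

print(gini(equal_cohort))    # 0.0, perfect equality
print(gini(unequal_cohort))  # much higher, most life-years go to two people
```

With a small group, the largest value this formula can return is (n - 1)/n rather than exactly 1, which is why the text says 'close to 1'.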

We will now look at mortality inequality Gini coefficients for 10 countries around the world: Brazil, France, Germany, India, Japan, Russia, Spain, Sweden, United Kingdom, and the United States. First, download the data:

  • Go to the ‘Health Inequality’ section of the Our World in Data website. Find the chart called ‘Lifespan inequality: Gini coefficient in females’ and click the ‘Download’ icon at the bottom of the chart.
  • Select the ‘Full data (CSV)’ option to download the data in CSV format.

Import the data into Python and investigate the structure of the data as explained in Python walk-through 5.7.

Python walk-through 5.7 Importing .csv files into Python

Before importing, make sure the .csv file is saved in the data subfolder of your current working directory. After importing (using the pd.read_csv function from pandas), use the df.info() function to check that the data was imported correctly.

df_gini = pd.read_csv(
    Path("data/gini-coefficient-of-lifespan-inequality-in-females.csv")
)
df_gini.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19922 entries, 0 to 19921
Data columns (total 4 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   Entity                                                 19922 non-null  object 
 1   Code                                                   18322 non-null  object 
 2   Year                                                   19922 non-null  int64  
 3   Gini coefficient of lifespan inequality - Sex: female  19922 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 622.7+ KB

The variable "Entity" is the country and the variable "Gini coefficient of lifespan inequality - Sex: female" is the health Gini for females. Let’s change these variable names (to "Country" and "Health", respectively) to clarify what they actually refer to, which will help when writing code (and if we go back to read this code at a later date).

df_gini = df_gini.rename(
    columns={
        "Entity": "Country",
        "Gini coefficient of lifespan inequality - Sex: female": "Health",
    }
)
df_gini.head()
Country Code Year Health
0 Afghanistan AFG 1950 0.552113
1 Afghanistan AFG 1951 0.549548
2 Afghanistan AFG 1952 0.544394
3 Afghanistan AFG 1953 0.539224
4 Afghanistan AFG 1954 0.535712

We will also use the isin method to create a new dataframe called df_gini_10 that only contains the years (1952–2002) and the 10 countries we need.

sel_year = range(1952, 2003)  # 1952 to 2002 inclusive
sel_country = ["Brazil", "France", "Germany", "India", "Japan", "Russia", "Spain", "Sweden", "United Kingdom", "United States"]

# Keep rows for the selected years and countries; .copy() gives an independent dataframe
df_gini_10 = df_gini.loc[
    df_gini["Year"].isin(sel_year) & df_gini["Country"].isin(sel_country), :
].copy()
df_gini_10.head()
Country Code Year Health
2298 Brazil BRA 1952 0.339688
2299 Brazil BRA 1953 0.337049
2300 Brazil BRA 1954 0.330421
2301 Brazil BRA 1955 0.325488
2302 Brazil BRA 1956 0.320513
  1. Using the mortality inequality data for the 10 selected countries:
  • Plot all the countries on the same line chart, with Gini coefficient on the vertical axis and year (1952–2002 only) on the horizontal axis. Make sure to include a legend showing country names, and label the axes appropriately.
  • Describe any general patterns in mortality inequality over time, as well as any similarities and differences between countries.

Python walk-through 5.8 Creating line graphs with lets_plot

Most of the code below is similar to our use of lets_plot in previous walk-throughs. While this type of plot can be useful for exploratory data analysis, the sheer number of lines makes it unhelpful as a chart to share with others because it is not clear what story is being told.

(
    ggplot(df_gini_10, aes(x="Year", y="Health", color="Country", linetype="Country"))
    + geom_line(size=2)
    + labs(
        title="Mortality inequality in Gini coefficient",
        y="Gini",
        caption="Source: Our World in Data",
    )
    + ggsize(800, 500)
    + scale_x_continuous(format="d")
)

Figure 5.6 Mortality inequality Gini coefficients (1952–2002).

  2. Now compare the Gini coefficients in the first year of your line chart (1952) with the last year (2002).
  • For the year 1952, sort the countries according to their mortality inequality Gini coefficient from smallest to largest. Plot a column chart showing these Gini coefficients on the vertical axis, and country on the horizontal axis.
  • Repeat Question 2(a) for the year 2002.
  • Comparing your charts for 1952 and 2002, have the rankings between countries changed? Suggest some explanations for any observed changes. (You may want to do some additional research, for example, look at the healthcare systems of these countries.)

Python walk-through 5.9 Drawing a column chart with sorted values

Plot a column chart for 1952

First we use .loc to select the data for 1952 only and store it in a temporary dataframe called df_subset. Then we sort its rows by the health Gini.

year = 1952
df_subset = df_gini_10.loc[df_gini_10["Year"] == year]
df_subset = df_subset.sort_values(by="Health")
df_subset
Country Code Year Health
17357 Sweden SWE 1952 0.113346
18827 United Kingdom GBR 1952 0.126588
18916 United States USA 1952 0.139356
6432 Germany DEU 1952 0.143802
6000 France FRA 1952 0.148520
16870 Spain ESP 1952 0.186549
8641 Japan JPN 1952 0.192446
14789 Russia RUS 1952 0.208313
2298 Brazil BRA 1952 0.339688
7912 India IND 1952 0.432061

The rows are now ordered according to Health, in ascending order. Let’s use lets_plot again for the chart.

(
    ggplot(df_subset, aes(x="Code", y="Health"))
    + geom_bar(stat="identity")
    + labs(
        x="Country code",
        y="Mortality inequality Gini coefficient",
        title=f"Mortality Gini ({year})",
        caption="Source: Our World in Data",
    )
)

Figure 5.7 Mortality Gini coefficients (1952).

Plot a column chart for 2002

Now we’d like to do the same for 2002. Rather than re-specify everything, we can write a function that accepts a year, and our data, and does this for us for arbitrary years.

def plot_bar_chart_health_gini(data, year):
    plot = (
        ggplot(
            data.loc[data["Year"] == year].sort_values(by="Health"),
            aes(x="Code", y="Health"))
        + geom_bar(stat="identity")
        + labs(
            x="Country code",
            y="Mortality inequality Gini coefficient",
            title=f"Mortality Gini ({year})",
            caption="Source: Our World in Data",
        )
    )
    plot.show()

Now let’s use it on 2002:

plot_bar_chart_health_gini(df_gini_10, 2002)

Figure 5.8 Mortality Gini coefficients (2002).

Let’s now plot both years in a split bar chart design. To ensure we get both years in the same order, we’ll use the 1952 order, declare that the country column is an ordered categorical variable, and then sort the values by that order.

countries_in_order = df_gini_10.loc[df_gini_10["Year"] == 1952, :].sort_values(by="Health")[
    "Country"
]
df_gini_10["Country"] = df_gini_10["Country"].astype("category")
df_gini_10 = df_gini_10.sort_values(by="Country")
df_gini_10.head()
Country Code Year Health
2298 Brazil BRA 1952 0.339688
2326 Brazil BRA 1980 0.210762
2327 Brazil BRA 1981 0.205652
2328 Brazil BRA 1982 0.200597
2329 Brazil BRA 1983 0.196008
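
Note that astype("category") on its own orders the categories alphabetically, which is why Brazil appears first in the output above. One way to impose the 1952 ranking explicitly is to pass the ordered list of countries to pd.Categorical. The sketch below uses a small made-up dataframe (hypothetical country names and values) rather than df_gini_10.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for df_gini_10
toy = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "A", "B", "C"],
        "Year": [1952, 1952, 1952, 2002, 2002, 2002],
        "Health": [0.30, 0.10, 0.20, 0.12, 0.08, 0.15],
    }
)

# Countries ranked by their 1952 value, smallest first
order_1952 = toy.loc[toy["Year"] == 1952].sort_values(by="Health")["Country"].tolist()

# Make Country an ordered categorical that follows the 1952 ranking
toy["Country"] = pd.Categorical(toy["Country"], categories=order_1952, ordered=True)

# Sorting now follows the 1952 ranking rather than the alphabet
toy_sorted = toy.sort_values(by=["Country", "Year"])
print(toy_sorted)
```

Sorting by both Country and Year keeps the two years for each country next to each other, which is convenient for a dodged bar chart.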

Now we can plot both years:

year1, year2 = 1952, 2002
(
    ggplot(
        df_gini_10.loc[df_gini_10["Year"].isin([year1, year2]), :],
        aes(x="Code", y="Health", fill=as_discrete("Year")),
    )
    + geom_bar(stat="identity", position=position_dodge())
)

Figure 5.9 Mortality Gini coefficients (1952 and 2002).

Note: Questions 3 and 4 can be done independently of each other.

Other measures of health inequality, such as those used by the World Health Organization (WHO), are based on access to healthcare, affordability of healthcare, and quality of living conditions. Choose one of the following measures of health inequality to answer Question 3:

  • access to essential medicines
  • type of healthcare accessed (public vs private)
  • composite coverage index.

The composite coverage index is a weighted score of coverage for eight different types of healthcare.

To download the data for your chosen measure:

  • If you choose to look at access to essential medicines, go to the WHO’s data repository to view this data as a table, by country and type of healthcare (public vs private). Above the table, under the heading ‘Download complete data set as:’, select ‘CSV table’ (or right-click and click ‘Download linked file’) to download the full data in CSV format.
  • If you choose to look at type of healthcare accessed, go to the WHO’s data repository to view this data as a table, by country and wealth quintile (you may need to select this option if it is not pre-selected). Above the table, under the heading ‘Download complete data set as:’, select ‘CSV table’ (or right-click and click ‘Download linked file’) to download the full data in CSV format.
  • If you choose to look at the composite coverage index, go to the WHO’s global health observatory. You can choose which population subgroup to organise the data with (wealth, education level, or type of residence area). Click on the population subgroup to view the table. Above the table, under the heading ‘Download complete data set as:’, select ‘CSV table’ (or right-click and click ‘Download linked file’) to download the full data in CSV format.
  3. For your chosen measure:
  • Explain how it is constructed and what outcomes it assesses.
  • Create an appropriate chart to summarize the data for all available countries. (You can replicate a chart shown on the website or draw a similar chart.)
  • Explain what your chart shows about health inequality within and between countries, and discuss the limitations of using this measure (for example, measurement issues or other aspects of inequality that this measure ignores).

Python walk-through 5.10 Drawing a column chart with sorted values

For this walk-through, we downloaded the ‘Median availability of selected generic medicines’ data, which you can find here. We saved it as the default name ‘MDG_0000000010,WHS6_101.csv’, in the data subdirectory of our working directory. Looking at the spreadsheet in Excel, Numbers, OpenOffice, or LibreOffice, you can see that the actual data starts in the third row, meaning that there are two header rows. So let’s skip the first row when opening it.

df_med = pd.read_csv(Path("data/MDG_0000000010,WHS6_101.csv"), skiprows=1)
df_med.head()
Country 2007–2013 2007–2013.1
0 Afghanistan 94.0 81.1
1 Bahamas 42.9 43.2
2 Bolivia (Plurinational State of) 86.7 31.9
3 Brazil 76.7 0.0
4 Burkina Faso 72.1 87.1

Having inspected the dataset in a spreadsheet program and opened it with pandas, we know that the second and third columns don’t have particularly informative column names. From the spreadsheet, you know that they should be ‘Private access %’ and ‘Public access %’, respectively. So, let’s rename the columns to give them the right labels.

The columns of a dataframe, df.columns, are immutable, meaning we cannot change individual entries with an assignment statement (using =), but we can either use the .rename method or replace all the column names. Here, it’s more convenient to replace all the column names:

df_med.columns = ["Country", "Private access", "Public access"]
df_med["Country"] = df_med["Country"].astype("category")
df_med.head(2)
Country Private access Public access
0 Afghanistan 94.0 81.1
1 Bahamas 42.9 43.2
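
To see what 'immutable' means in practice, the following sketch tries each approach on a toy dataframe. The column names col1 and col2 and the values are placeholders chosen for illustration.

```python
import pandas as pd

# Toy dataframe with uninformative column names (hypothetical values)
toy = pd.DataFrame({"Country": ["A", "B"], "col1": [94.0, 42.9], "col2": [81.1, 43.2]})

# Assigning to a single entry of df.columns fails because an Index is immutable
try:
    toy.columns[1] = "Private access"
    item_assignment_worked = True
except TypeError as err:
    item_assignment_worked = False
    print(f"TypeError: {err}")

# Option 1: rename selected columns with a mapping (returns a new dataframe)
renamed = toy.rename(columns={"col1": "Private access", "col2": "Public access"})

# Option 2: replace the full list of column names in one assignment
toy.columns = ["Country", "Private access", "Public access"]

print(renamed.columns.tolist())
print(toy.columns.tolist())
```

The .rename method is safer when only a few columns change, because it matches on the old names rather than on column positions.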

To find details about these variables, click the column headers of the table shown on the WHO website. For example, ‘Median availability of selected generic medicines (%)’ is measured using the following method:

A standard methodology has been developed by WHO and Health Action International (HAI). Data on the availability of a specific list of medicines are collected in at least four geographic or administrative areas in a sample of medicine dispensing points. Availability is reported as the percentage of medicine outlets where a medicine was found on the day of the survey.
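
To make this definition concrete, the sketch below computes availability for three medicines from made-up survey results and then takes the median across medicines. The outlet and medicine names are hypothetical, and the sampling details of the WHO/HAI methodology are not reproduced here.

```python
import pandas as pd

# Hypothetical survey: True if the medicine was found at the outlet on the survey day
survey = pd.DataFrame(
    {
        "medicine_a": [True, True, False, True],
        "medicine_b": [False, False, True, False],
        "medicine_c": [True, True, True, True],
    },
    index=["outlet_1", "outlet_2", "outlet_3", "outlet_4"],
)

# Availability of each medicine: percentage of outlets where it was found
availability = survey.mean() * 100

# Median availability across the list of medicines
median_availability = availability.median()

print(availability)
print(median_availability)
```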

Before we produce charts of the data, let’s look at some summary measures of the variables using the skimpy package. You may need to install this package to use it. (You can do this by running pip install skimpy on your computer’s command line.)

If you have trouble installing skimpy, remember to check out the ‘Getting started’ page, which has some tips on troubleshooting.

skim(df_med)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ dataframe          Values ┃ ┃ Column Type  Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 38     │ │ float64     │ 2     │ │ Country               │                                │
│ │ Number of columns │ 3      │ │ category    │ 1     │ └───────────────────────┘                                │
│ └───────────────────┴────────┘ └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓  │
│ ┃ column_name            NA    NA %     mean     sd    p0     p25    p50    p75    p100    hist    ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩  │
│ │ Private access          0      0     66  25  2.8   55   70   87   100  ▁▃▂▆▇▇  │  │
│ │ Public access           2   5.26     58  27    0   40   56   82   100  ▂▃▇▅▃▇  │  │
│ └───────────────────────┴──────┴─────────┴─────────┴──────┴───────┴───────┴───────┴───────┴────────┴─────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name                       NA         NA %            ordered                unique             ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ Country                                 0             0  False                                38 │  │
│ └──────────────────────────────────┴───────────┴────────────────┴───────────────────────┴────────────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯

On average, private-sector patients have better access to generic medication.

From the summary statistics for the "Public access" variable, you can see that there are two missing observations (NA). Here, we will keep these observations because leaving them in doesn’t affect the following analysis.
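
One reason missing values are often harmless in summaries like these is that pandas methods such as mean skip them by default. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical access percentages with one missing observation
public_access = pd.Series([80.0, 40.0, np.nan, 60.0])

# By default, missing values are ignored when computing summary statistics
mean_default = public_access.mean()

# Dropping the missing values first gives the same answer
mean_after_drop = public_access.dropna().mean()

# With skipna=False the missing value propagates to the result
mean_with_na = public_access.mean(skipna=False)

print(mean_default, mean_after_drop, mean_with_na)
```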

There are a number of interesting aspects to look at. We will produce a bar chart comparing the private and public access in countries, ordered according to values of private access (largest to smallest). First, we need to reformat the data into ‘long’ format (so there is a single variable containing all the values we want to plot), then use lets_plot to make the chart.

df_med = df_med.sort_values(by="Private access")
melt_df = pd.melt(df_med, id_vars="Country", value_name="Percentage", var_name="Access")
melt_df.head()
Country Access Percentage
0 India Private access 2.8
1 China Private access 13.3
2 Philippines Private access 21.7
3 Sao Tome and Principe Private access 22.2
4 Congo Private access 31.3

(
    ggplot(melt_df, aes(x="Percentage", y="Country", fill="Access"))
    + geom_bar(stat="identity", position=position_dodge(), orientation="y")
    + labs(title="Access to generic medication")
    + ggsize(600, 600)
)

Figure 5.10 Access to essential medication.

Let’s find the extreme values, starting with the two countries where public-sector patients have access to all (100%) of essential medications (which you can also see in the chart).

df_med.loc[df_med["Public access"] == 100, :]
Country Private access Public access
9 Cook Islands 33.3 100.0
29 Russian Federation 100.0 100.0

Let’s see which countries provide 0% access to essential medication for people in the public sector.

df_med.loc[df_med["Public access"] == 0, :]
Country Private access Public access
3 Brazil 76.7 0.0

Since an individual’s income and available options in later life partly depend on their level of education, inequality in educational access or attainment can lead to inequality in income and other outcomes. Gender inequality can be measured by the share of women at different levels of attainment. We will focus on the aspect of gender inequality in educational attainment, using data from the Our World in Data website, to make our own comparisons between countries and over time. Choose one of the following measures to answer Question 4:

  • gender parity in primary school life expectancy
  • share of women, between 15 and 64 years old, with no formal education

To download the data for your chosen measure:

  • Go to the ‘Educational Mobility and Inequality’ section of the Our World in Data website, and find the chart for your chosen measure. Under the heading ‘Interactive Charts on Global Education’, use the scrollable menu on the left to select the chart for your chosen measure (charts are listed in alphabetical order).
  • Click the ‘Download’ icon at the bottom-right of the chart, then click ‘Full data (CSV)’ to download the data as a CSV file.
  4. For your chosen measure:
  • Choose ten countries that have data from 1980 to 2010. Plot your chosen countries on the same line chart, with year on the horizontal axis and share on the vertical axis. Make sure to include a legend showing country names and label the axes appropriately.
  • Describe any general patterns in gender inequality in education over time, as well as any similarities and differences between countries.
  • Calculate the change in the value of this measure between 1980 and 2010 for each country chosen. Sort these countries according to this value, from the smallest change to largest change. Now plot a column chart showing the change (1980 to 2010) on the vertical axis, and country on the horizontal axis. Add data labels to display the value for each country.
  • Which country had the largest change? Which country had the smallest change?
  • Suggest some explanations for your observations in Questions 4(b) and (d). (You may want to do some background research on your chosen countries.)
  • Discuss the limitations of using this measure to assess the degree of gender inequality in educational attainment and propose some alternative measures.

Python walk-through 5.11 Using line and bar charts to illustrate changes in time

Import data and plot a line chart

First we download data on gender parity in primary school life expectancy from Our World in Data and save it in a subdirectory of our working directory called ‘data/’. (To find the data on the Our World in Data website, click on the download button under the chart.) Now let’s import it into our Python session and check its structure:

# Open the CSV file from the data directory

df_gap = pd.read_csv(Path("data/school-life-expectancy-primary-gender-parity-index-gpi.csv"))
df_gap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7226 entries, 0 to 7225
Data columns (total 4 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   Entity                                                      7226 non-null   object 
 1   Code                                                        7226 non-null   object 
 2   Year                                                        7226 non-null   int64  
 3   School life expectancy, primary, gender parity index (GPI)  7226 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 225.9+ KB

The data is now in the dataframe df_gap. The variable of interest, "School life expectancy, primary, gender parity index (GPI)", has a very long name so we will shorten it to "Gender Parity Index".

df_gap = df_gap.rename(columns={"School life expectancy, primary, gender parity index (GPI)": "Gender Parity Index"})

As usual, ensure that you understand the definition of the variables you are using. On the Our World in Data website, click the option underneath the graph called ‘Learn more about this data’ for a definition:

Ratio of female school life expectancy to the male school life expectancy. It is calculated by dividing the female value for the indicator by the male value for the indicator. A GPI equal to 1 indicates parity between females and males. In general, a value less than 1 indicates disparity in favor of males and a value greater than 1 indicates disparity in favor of females.
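
As a quick check of how to read this index, the sketch below applies the definition to three made-up countries (the values are hypothetical, not taken from the dataset).

```python
# Hypothetical school life expectancy (years) by sex for three made-up countries
school_life_expectancy = {
    "Country A": {"female": 5.4, "male": 6.0},
    "Country B": {"female": 6.0, "male": 6.0},
    "Country C": {"female": 6.3, "male": 6.0},
}


def gender_parity_index(female, male):
    """Female value of the indicator divided by the male value."""
    return female / male


gpi = {
    country: gender_parity_index(values["female"], values["male"])
    for country, values in school_life_expectancy.items()
}

for country, value in gpi.items():
    if value < 1:
        reading = "disparity in favor of males"
    elif value > 1:
        reading = "disparity in favor of females"
    else:
        reading = "parity"
    print(f"{country}: GPI = {value:.2f} ({reading})")
```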

Before choosing ten countries, we will use isin and groupby to figure out which countries have complete data for the years 1980 to 2010. We will then choose the first 10 countries (in alphabetical order) with complete data.

# Filter the data (keeping the years 1980-2010)
sel_year = range(1980,2011,1)
cols_to_keep = ["Entity", "Code", "Year", "Gender Parity Index"]

df_gap_subset = df_gap.loc[(df_gap["Year"].isin(sel_year)), cols_to_keep]

# Keep countries that have complete data for the period 1980-2010
year_count = df_gap_subset.groupby("Entity").count()
year_count
sum(year_count["Gender Parity Index"] == year_count["Gender Parity Index"].max())
# 32 countries have complete data for 1980-2010 (31 years)

complete_data = year_count[year_count["Year"] == 31]
complete_data.head(10) # list the first 10 countries
Code Year Gender Parity Index
Entity
Albania 31 31 31
Algeria 31 31 31
Australia 31 31 31
Austria 31 31 31
Bulgaria 31 31 31
Congo 31 31 31
Cuba 31 31 31
Ethiopia 31 31 31
Finland 31 31 31
Ireland 31 31 31

You could also choose 10 countries from complete_data at random, using the sample method.

complete_data.sample(10)
Code Year Gender Parity Index
Entity
Ethiopia 31 31 31
Tunisia 31 31 31
Senegal 31 31 31
Paraguay 31 31 31
Austria 31 31 31
Sweden 31 31 31
Mauritius 31 31 31
Bulgaria 31 31 31
Mexico 31 31 31
Albania 31 31 31
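
Because sample draws rows at random, running it again will usually give a different selection. If you want a selection you can reproduce later, one option is to pass a random_state. The sketch below uses a toy stand-in for complete_data; the entity names and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for complete_data (hypothetical entity names)
toy = pd.DataFrame({"Year": [31] * 6}, index=list("ABCDEF"))
toy.index.name = "Entity"

# Fixing random_state makes the random selection repeatable
first_draw = toy.sample(3, random_state=42)
second_draw = toy.sample(3, random_state=42)

print(first_draw.index.tolist())
print(first_draw.index.equals(second_draw.index))
```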

Plot a line chart for a selection of countries

We now save our selection of ten countries as the dataframe df_gap_10. (You can of course make a different selection, but ensure that you get the spelling right!).

countries = ["Albania", "Algeria", "Australia", "Austria", "Bulgaria", "Congo", "Cuba", "Ethiopia", "Finland", "Ireland"]
df_gap_10 = df_gap_subset.loc[df_gap_subset["Entity"].isin(countries)]

Now we plot the data, following similar steps to Python walk-through 5.8.

(
    ggplot(df_gap_10, aes(x="Year", y="Gender Parity Index", color="Entity", linetype="Entity"))
    + geom_line(size=1)
    + labs(
        x="Year",
        y="Gender Parity Index",
        title="Gender parity in primary school life expectancy",
    )
    + scale_x_continuous(format="d")
)

Figure 5.11 Gender parity in primary school life expectancy (1980–2010) for selected countries.

Plot a column chart with sorted values

To calculate the change in the value of this measure between 1980 and 2010 for each country chosen, we have to manipulate the data so that we have one entry (row) for each entity (or country), but two different variables for the Gender Parity Index (one for each year).

We’ll do this using the pd.pivot function to pivot years to columns; then we can subtract one year from another and then filter to just the columns we want.

df_sub_piv = pd.pivot(df_gap_10, index=["Entity", "Code"], columns=["Year"], values="Gender Parity Index")
# Note that existing column titles are integers
df_sub_piv["2010—1980"] = df_sub_piv[2010] - df_sub_piv[1980]
# Filter to our new column and re-number index
df_sub_piv = df_sub_piv["2010—1980"].reset_index()
# Sort rows by size of gap
df_sub_piv = df_sub_piv.sort_values(by="2010—1980")
df_sub_piv.head()
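
If the pivot step is hard to picture, the sketch below runs the same kind of reshaping on a tiny made-up dataset with two hypothetical countries, using a column called "change" instead of the name used above.

```python
import pandas as pd

# Toy long-format data (hypothetical countries and values)
long_df = pd.DataFrame(
    {
        "Entity": ["A", "A", "B", "B"],
        "Code": ["AAA", "AAA", "BBB", "BBB"],
        "Year": [1980, 2010, 1980, 2010],
        "Gender Parity Index": [0.50, 0.90, 0.98, 1.00],
    }
)

# One row per country, one column per year
wide_df = pd.pivot(
    long_df, index=["Entity", "Code"], columns="Year", values="Gender Parity Index"
)

# Change between the two years, then turn the index back into ordinary columns
change = (wide_df[2010] - wide_df[1980]).rename("change").reset_index()

# Sort from the smallest change to the largest
change = change.sort_values(by="change")
print(change)
```

In this toy example, the country that starts close to parity shows the smallest change.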

Now we can plot this as a bar chart by country.

(
    ggplot(df_sub_piv, aes(x="Code", y="2010—1980"))
    + geom_bar(stat="identity")
    + labs(
        title="Change in gender parity in primary school life expectancy",
        x="Country",
        y="Change in Gender Parity Index"
    )
)

Figure 5.12 Change in the Gender Parity Index from 1980 to 2010.

Some countries saw very little or no change; these are the countries that already had very high gender parity at the start of the period. The countries with initially low female participation in primary education improved substantially.