Empirical Project 4 Working in R

R-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to convert (‘reshape’) data from wide to long format and vice versa.

Getting started in R

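For this project you will need the following packages:

  • readxl, to import Excel spreadsheets
  • tidyverse, to help with data manipulation
  • reshape2, to reshape (rearrange) dataframes.

You will also use the ggplot2 package to produce graphs; it is installed and loaded as part of the tidyverse package.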

If you need to install these packages, run the following code:

install.packages(c("readxl","tidyverse","reshape2"))

You can import these libraries now, or when they are used in the R walk-throughs below.

library(readxl)
library(tidyverse)
library(reshape2)

Part 4.1 GDP and its components as a measure of material wellbeing

The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries over the period 1970 to present. We will look at how GDP and its components have changed over time, and investigate the usefulness of GDP per capita as a measure of wellbeing.

To answer the questions below, download the data and make sure you understand how the measure of total GDP is constructed.

R walk-through 4.1 Importing the Excel file (.xlsx or .xls format) into R

First use setwd to tell R which folder you are working from. Keep all the files you need in that folder, including the Excel sheet you just downloaded.

setwd("C:/YOUR_DIRECTORY")

Then use the read_excel function from the readxl package, which belongs to the tidyverse suite of packages. Before importing the file into R, see how the data is organized in the spreadsheet by opening the file in Excel, and note that:

  • There is a heading that we don’t need, followed by a blank row.
  • The data we need starts on row three.
library(tidyverse) # Load the library
library(readxl)

UN = read_excel("Download-GDPconstant-USD-countries.xls", # Excel filename
  sheet="Download-GDPconstant-USD-countr", # Sheet name
  skip=2) # Number of rows to skip

head(UN)
## # A tibble: 6 x 50
##   CountryID Country  IndicatorName      `1970` `1971` `1972` `1973` `1974`
##       <dbl> <chr>    <chr>               <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1         4 Afghani~ Final consumption~ 5.56e9 5.33e9 5.20e9 5.75e9 6.15e9
## 2         4 Afghani~ Household consump~ 5.07e9 4.84e9 4.70e9 5.21e9 5.59e9
## 3         4 Afghani~ General governmen~ 3.72e8 3.82e8 4.02e8 4.21e8 4.31e8
## 4         4 Afghani~ Gross capital for~ 9.85e8 1.05e9 9.19e8 9.19e8 1.18e9
## 5         4 Afghani~ Gross fixed capit~ 9.85e8 1.05e9 9.19e8 9.19e8 1.18e9
## 6         4 Afghani~ Exports of goods ~ 1.12e8 1.45e8 1.73e8 2.18e8 3.00e8
## # ... with 42 more variables: `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
## #   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
## #   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
## #   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
## #   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>,
## #   `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>,
## #   `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>
  1. You can see from the tab ‘Download-GDPconstant-USD-countr’ that some countries have missing data for some of the years. The missing data could be due to political reasons (for example, countries formed after 1970) or data availability issues.

Country | Number of years of GDP data

Figure 4.1 Number of years of GDP data available for each country.

R walk-through 4.2 Making a frequency table

We want to create a table showing how many years of Final consumption expenditure data are available for each country.

You can see that countries and indicators (for example, Afghanistan and Final consumption expenditure) are the row variables, while years are the column variables. This data is organized in ‘wide’ format.

For many data operations it is more convenient to have indicators as column variables, so we would like Final consumption expenditure to be a column variable, and years would be the row variables. Each observation would represent the value of an indicator for a particular country and year. This data is organized in ‘long’ format.

To change data from wide to long format, we use the reshape2 package. To learn more about organizing data in R, see the R for Data Science website.

library(reshape2)

wide_UN <- UN
wide_UN = wide_UN[,-1] # Drop the first column (CountryID); we will use the country name instead.
long_UN = melt(wide_UN,id.vars=c("Country","IndicatorName"), # Columns that identify each row (kept as they are)
  measure.vars=3:ncol(wide_UN)) # Year columns to stack (column 3 onwards of wide_UN)

head(long_UN)
##       Country
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
##   IndicatorName
## 1 Final consumption expenditure
## 2 Household consumption expenditure (including Non-profit institutions serving households)
## 3 General government final consumption expenditure
## 4 Gross capital formation
## 5 Gross fixed capital formation (including Acquisitions less disposals of valuables)
## 6 Exports of goods and services
##   variable      value
## 1     1970 5559066266
## 2     1970 5065088737
## 3     1970  372478456
## 4     1970  984580895
## 5     1970  984580895
## 6     1970  112390156

The melt command is very powerful and useful, as you will find many large datasets are in wide format. In this case, it takes the year columns (the third column of wide_UN to the last) and uses them to create two new columns: one column (variable) contains the name of the original column (the year) and the other column (value) contains the associated value. Compare long_UN to wide_UN to understand how the melt command works.
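
To see the reshaping on a small scale, here is a minimal sketch using a made-up dataframe (wide_toy and its values are hypothetical, not taken from the UN data). It assumes the reshape2 package is installed.

```r
library(reshape2)

# Hypothetical wide data: one row per country, one column per year
wide_toy <- data.frame(
  Country = c("A", "B"),
  `1970` = c(10, 20),
  `1971` = c(11, NA),
  check.names = FALSE  # keep the year column names exactly as written
)

# Stack the year columns into (variable, value) pairs;
# every column not listed in id.vars is treated as a measured column
long_toy <- melt(wide_toy, id.vars = "Country")
print(long_toy)

# Each (country, year) pair now has its own row
nrow(long_toy)  # 2 countries x 2 years = 4 rows
```

The column names of the wide table end up as the values of the variable column, which is why the years appear there in long_UN.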

Rename the column called variable as ‘Year’.

names(long_UN)[names(long_UN) == "variable"] <- "Year"

To create the required table, we only need Final consumption expenditure of each country, which we extract using the subset function.

cons = subset(long_UN,IndicatorName=="Final consumption expenditure")

Now we create the table showing the number of available (non-missing) years by country:

# Here we use the pipe operator (%>%) from the tidyverse package.
# This means: use the result of the current line
# as the first argument in the next line's function.

missing_by_country = cons %>%
  group_by(Country) %>%
  summarize(available_years=sum(!is.na(value))) %>%
  print()
## # A tibble: 220 x 2
##    Country             available_years
##    <chr>                         <int>
##  1 Afghanistan                      47
##  2 Albania                          47
##  3 Algeria                          47
##  4 Andorra                          47
##  5 Angola                           47
##  6 Anguilla                         47
##  7 Antigua and Barbuda              47
##  8 Argentina                        47
##  9 Armenia                          27
## 10 Aruba                            47
## # ... with 210 more rows

Translating the code into words: Take cons (cons %>%) and group the observations by country (group_by(Country)), then take this result (%>%) and produce a table (summarize(...)) that shows a calculated variable (available_years) which is the sum (sum(...)) of the variable !is.na(value).

To understand what !is.na(value) means, recall that value contains the numerical values for the variable of interest. When an observation is missing, it is recorded as NA. The function is.na(value) returns TRUE if the value is missing and FALSE otherwise. We add a ! in front (for R, ! means ‘not’), so !is.na(value) is TRUE if the observation exists and FALSE otherwise. When these values are summed, R counts each TRUE as 1 and each FALSE as 0, so the sum is the number of observations that are not missing.
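
The following sketch applies the same steps to a small made-up dataset (the toy dataframe and its values are hypothetical), so that you can check each intermediate result by eye. It assumes the dplyr package, which is loaded as part of the tidyverse.

```r
library(dplyr)

# Hypothetical long-format data with one missing value for country B
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Year    = c(1970, 1971, 1972, 1970, 1971, 1972),
  value   = c(5, 6, 7, NA, 3, 4)
)

is.na(toy$value)        # TRUE only in the position of the missing value
sum(!is.na(toy$value))  # number of non-missing values: 5

# Count the non-missing values separately for each country
available <- toy %>%
  group_by(Country) %>%
  summarize(available_years = sum(!is.na(value)))
print(available)
```

Country A has three available years and country B has two, because one of B's values is NA.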

Now we can establish how many of the 220 countries in the dataset have complete information. A country’s data is complete if it has the maximum number of available observations (max(missing_by_country$available_years)).

sum(missing_by_country$available_years==max(missing_by_country$available_years))
## [1] 179

There are three different ways in which countries calculate GDP for their national accounts, but we will focus on the expenditure approach, which calculates Gross Domestic Product (GDP) as:

$$\text{GDP} = \text{Household consumption expenditure} + \text{General government final consumption expenditure} + \text{Gross capital formation} + (\text{Exports of goods and services} - \text{Imports of goods and services})$$

If you add up the data on the right-hand side of this equation, you may find that it does not add up to the reported GDP value. The UN notes this discrepancy in Section J, item 17 of the ‘Methodology for the national accounts’: ‘The sums of components in the tables may not necessarily add up to totals shown because of rounding’.

Final consumption expenditure is the sum of Household consumption expenditure (including Non-profit institutions serving households), and General government final consumption expenditure.
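
As a quick illustration of how to check such an identity, the sketch below compares a reported total with the sum of its components. It uses the 1970 values for Afghanistan printed in the head(long_UN) output in R walk-through 4.2; the variable names are our own.

```r
# 1970 values for Afghanistan, copied from the head(long_UN) output
final_consumption <- 5559066266
household         <- 5065088737
government        <- 372478456

# Compare the reported total with the sum of its two components
component_sum <- household + government
discrepancy   <- final_consumption - component_sum

component_sum                    # sum of the two components
discrepancy                      # reported total minus component sum
discrepancy / final_consumption  # discrepancy as a share of the reported total
```

For this observation the components do not add up exactly to the reported total (the gap is roughly 2% of the total), so it is worth checking identities like this before relying on them.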

  1. Rather than looking at Exports and Imports separately, we usually look at the difference between them (Exports minus Imports), also known as Net Exports. Choose three countries that have GDP data over the entire period (1970 to the latest year available). For each country, create a variable that shows the values of Net Exports in each year.

R walk-through 4.3 Creating new variables

We will use Brazil, the US, and China as examples.

Before we select these three countries, we will calculate the net exports for all countries, as we need that information in R walk-through 4.4.

# Shorten the names of the variables we need

long_UN$IndicatorName[long_UN$IndicatorName=="Household consumption expenditure (including Non-profit institutions serving households)"] <- "HH.Expenditure"
long_UN$IndicatorName[long_UN$IndicatorName=="General government final consumption expenditure"] <- "Gov.Expenditure"
long_UN$IndicatorName[long_UN$IndicatorName=="Final consumption expenditure"] <- "Final.Expenditure"
long_UN$IndicatorName[long_UN$IndicatorName=="Gross capital formation"] <- "Capital"
long_UN$IndicatorName[long_UN$IndicatorName=="Imports of goods and services"] <- "Imports"
long_UN$IndicatorName[long_UN$IndicatorName=="Exports of goods and services"] <- "Exports"

long_UN still has several rows for a particular country and year, due to multiple indicators. We will reshape this data to ensure that we have only one row per country and per year.

# We need to cast (reshape) the long_UN data.
# Given that the new shape is a data.frame, we use the function dcast.
table_UN = dcast(long_UN,Country+Year ~ IndicatorName,value.var="value") # value.var names the column holding the values

# Finally, we add a new column for net exports = exports – imports.
table_UN$Net.Exports = table_UN[,"Exports"]-table_UN[,"Imports"]
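
The same two steps (cast, then add a column) can be tried on a small made-up dataset first. In this sketch, long_toy and its values are hypothetical; it assumes the reshape2 package is installed.

```r
library(reshape2)

# Hypothetical long data: two indicators for one country over two years
long_toy <- data.frame(
  Country       = "A",
  Year          = c(1970, 1970, 1971, 1971),
  IndicatorName = c("Exports", "Imports", "Exports", "Imports"),
  value         = c(10, 12, 15, 11)
)

# Cast to one row per country and year, with one column per indicator
wide_toy <- dcast(long_toy, Country + Year ~ IndicatorName, value.var = "value")

# Net exports = exports - imports
wide_toy$Net.Exports <- wide_toy$Exports - wide_toy$Imports
print(wide_toy)
```

The result has two rows (one per year), with net exports of -2 in 1970 and 4 in 1971.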

Let us select our three chosen countries to check that we calculated net exports correctly.

sel_countries = c("Brazil", "United States", "China")

# Using table_UN (one row per country and year), we get exports, imports, and net exports for these countries.
sel_UN1 = subset(table_UN,subset = (Country %in% sel_countries), select = c("Country","Year","Exports","Imports","Net.Exports"))

head(sel_UN1)
##      Country Year     Exports     Imports  Net.Exports
## 1223  Brazil 1970 12337240060 20187929130  -7850689070
## 1224  Brazil 1971 13016975734 24162976191 -11146000457
## 1225  Brazil 1972 16162334455 29025977711 -12863643256
## 1226  Brazil 1973 18466228366 34950588132 -16484359766
## 1227  Brazil 1974 18897150611 44821362355 -25924211744
## 1228  Brazil 1975 21084064715 42838470795 -21754406080

Now we will create charts to show the GDP components in order to look for general patterns over time and make comparisons between countries.

  1. Evaluate the value of the components over time, for two countries of your choice.

R walk-through 4.4 Plotting and annotating time series data

Extract the relevant data

We will work with the long_UN dataset, as the long format is well suited to producing charts with the ggplot2 package. In this example, we use the US and China.

# Select our chosen countries
comp = subset(long_UN,Country %in% c("United States","China"))

# Convert values to billions of USD
comp$value = comp$value /1e9
comp = subset(comp,select = c("Country","Year","IndicatorName","value"),subset=IndicatorName %in% c("Gov.Expenditure","HH.Expenditure","Capital","Imports","Exports"))

Plot a line chart

We can now plot this data using the ggplot2 package.

library(ggplot2)

# ggplot allows us to build a chart step-by-step.
pl = ggplot(subset(comp,Country=="United States"),aes(x = Year,y=value)) # Base chart, defining the x (horizontal) and y (vertical) variables
pl = pl + geom_line(aes(group=IndicatorName,color=IndicatorName),size=1) # Specify a line chart, with a different colour for each indicator name and line size=1
pl # Display the chart

Figure 4.2 US’s GDP components (expenditure approach).

There are plenty of problems with this chart:

  • we cannot read the horizontal axis, because it labels every year
  • the vertical axis label is uninformative
  • there is no chart title
  • the grey (default) background makes the chart difficult to read
  • the legend is uninformative.

To improve this chart, we merely add ‘features’ to the already existing figure pl.

pl = pl + scale_x_discrete(breaks=seq(1970,2016,by=10))
pl = pl + scale_y_continuous(name="Billion US$")
pl = pl + ggtitle("GDP components over time")
pl = pl + scale_colour_discrete(name  ="Components of GDP",   # Change the legend title
                                labels = c("Gross capital formation",  # Change the legend labels
                                           "Exports",
                                           "Government expenditure",
                                           "Household expenditure",
                                           "Imports")) 
pl = pl + theme_bw()
pl = pl + annotate("text", x = 37, y = 850, label = "2008 global financial crisis")
pl

Figure 4.3 US’s GDP components (expenditure approach), amended chart.
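
The value x = 37 in the annotate call is a position on the axis rather than a year: because Year is a factor, the horizontal axis is discrete and each year occupies positions 1, 2, 3, and so on. The sketch below shows how to convert between years and positions, assuming Year has one level per year from 1970 to 2016 (the vector years is a stand-in for that column).

```r
# Assumption: Year is a factor with one level per year from 1970 to 2016,
# as produced by melt() in R walk-through 4.2
years <- factor(1970:2016)

# Position of a given year on the discrete horizontal axis
which(levels(years) == "2008")  # 39

# Year located at position 37
levels(years)[37]  # "2006"
```

So the label text is centred on 2006, which places it slightly to the left of the 2008 dip in the chart.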

We can make a chart for more than one country simultaneously:

# Repeat all steps without subsetting the data

pl = ggplot(comp,aes(x = Year,y=value,color=IndicatorName)) # Base line chart
pl = pl + geom_line(aes(group=IndicatorName), size=1)
pl = pl + scale_x_discrete(breaks=seq(1970,2016,by=10))
pl = pl + scale_y_continuous(name="Billion US$")
pl = pl + ggtitle("GDP components over time")
pl = pl + theme_bw()

# Make a separate chart for each country
pl = pl + facet_wrap(~Country)
pl = pl + scale_colour_discrete(name  ="Components of GDP",   
                                labels = c("Gross capital formation",  
                                           "Exports",
                                           "Government expenditure",
                                           "Household expenditure",
                                           "Imports"))
pl

Figure 4.4 GDP components over time, United States and China.

  1. Another way to visualize the GDP data is to look at each component as a proportion of total GDP. Use the same countries that you chose for Question 3.

R walk-through 4.5 Calculating new variables and plotting time series data

Calculate proportion of total GDP

We will use the comp dataset created in R walk-through 4.4. First we will calculate net exports, as that contributes to GDP. As the data is currently in long format, we will reshape the data into wide format (using the dcast function, as in R walk-through 4.3), calculate net exports, then transform the data back into long format using the melt function.

# Reshape the data to wide format (with indicators in columns)
comp_wide = dcast(comp,Country+Year ~ IndicatorName)
head(comp_wide)
##   Country Year  Capital   Exports Gov.Expenditure HH.Expenditure   Imports
## 1   China 1970 67.58221  5.305242        19.30034       107.8411  6.119649
## 2   China 1971 73.79977  6.318662        22.63929       112.3586  6.094361
## 3   China 1972 70.62638  7.740001        23.77126       118.3906  7.510857
## 4   China 1973 80.79658 10.810644        24.34177       126.7546 11.540975
## 5   China 1974 83.38207 12.856408        26.14306       129.3434 16.041397
## 6   China 1975 93.47130 13.309628        27.29335       134.5575 15.859461
# Add the new column for net exports = exports – imports
comp_wide$Net.Exports = comp_wide[,"Exports"]-comp_wide[,"Imports"]
head(comp_wide)
##   Country Year  Capital   Exports Gov.Expenditure HH.Expenditure   Imports
## 1   China 1970 67.58221  5.305242        19.30034       107.8411  6.119649
## 2   China 1971 73.79977  6.318662        22.63929       112.3586  6.094361
## 3   China 1972 70.62638  7.740001        23.77126       118.3906  7.510857
## 4   China 1973 80.79658 10.810644        24.34177       126.7546 11.540975
## 5   China 1974 83.38207 12.856408        26.14306       129.3434 16.041397
## 6   China 1975 93.47130 13.309628        27.29335       134.5575 15.859461
##   Net.Exports
## 1  -0.8144069
## 2   0.2243011
## 3   0.2291447
## 4  -0.7303314
## 5  -3.1849891
## 6  -2.5498330
# Return to long format, keeping Capital, Gov.Expenditure, HH.Expenditure, and Net.Exports
comp2_wide <- subset(comp_wide,select = -c(Exports,Imports))
comp2 <- melt(comp2_wide, id.vars = c("Year","Country"))

Now we will add a new variable with the proportions for each GDP component.

props = comp2 %>%
  group_by(Country,Year) %>%
  mutate(proportion = value / sum(value))

In words, we did the following: Take the comp2 dataframe and create groups by country and year (for example, all indicators for China in 1970). Then create a new variable (mutate) called proportion, which divides the value of an indicator by the sum of value over all indicators in that group. The result is then saved in props. Look at the props dataframe to confirm that the above command has achieved the desired result.
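
One way to confirm that a grouped calculation like this behaves as intended is to run it on a small made-up dataset and check that the proportions in each group sum to 1. In this sketch, toy and its values are hypothetical; it assumes the dplyr package (version 1.0 or later, for the .groups argument).

```r
library(dplyr)

# Hypothetical components (in billions) for one country over two years
toy <- data.frame(
  Country  = "A",
  Year     = c(1970, 1970, 1970, 1971, 1971, 1971),
  variable = rep(c("Capital", "HH.Expenditure", "Net.Exports"), 2),
  value    = c(20, 70, 10, 30, 80, -10)
)

# Divide each value by the total of its own country-year group
props_toy <- toy %>%
  group_by(Country, Year) %>%
  mutate(proportion = value / sum(value)) %>%
  ungroup()
print(props_toy)

# Check: within each country-year group the proportions sum to 1
check <- props_toy %>%
  group_by(Country, Year) %>%
  summarize(total = sum(proportion), .groups = "drop")
print(check)
```

Note that mutate keeps one row per indicator, whereas summarize collapses each group to a single row.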

Plot a line chart

Redo the line chart from R walk-through 4.4 using these proportions.

pl = ggplot(props,aes(x = Year,y=proportion,color=variable)) # Base line chart
pl = pl + geom_line(aes(group=variable),size=1)
pl = pl + scale_x_discrete(breaks=seq(1970,2016,by=10))
pl = pl + ggtitle("GDP component proportions over time")
pl = pl + theme_bw()

# Make a separate chart for each country

pl = pl + facet_wrap(~Country)
pl = pl + scale_colour_discrete(name  ="Components of GDP",   
                                labels = c("Gross capital formation",  
                                           "Government expenditure",
                                           "Household expenditure",
                                           "Net Exports"))
pl

Figure 4.5 GDP component proportions over time.

time series data
A time series is a set of time-ordered observations of a variable taken at successive, in most cases regular, periods or points of time. Example: The population of a particular country in the years 1990, 1991, 1992, … , 2015 is time series data.
cross-sectional data
Data that is collected from participants at one point in time or within a relatively short time frame. In contrast, time series data refers to data collected by following an individual (or firm, country, etc.) over a course of time. Example: Data on degree courses taken by all the students in a particular university in 2016 is considered cross-sectional data. In contrast, data on degree courses taken by all students in a particular university from 1990 to 2016 is considered time series data.

So far, we have done comparisons of time series data, which is a collection of values for the same variables and subjects, taken at different points in time (for example, GDP of a particular country, measured each year). We will now make some charts using cross-sectional data, which is a collection of values for the same variables for different subjects, usually taken at the same time.
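
The distinction can be made concrete with a small made-up panel of data (panel and its values are hypothetical): a time series is a slice for one subject across all years, and a cross-section is a slice for one year across all subjects.

```r
# Hypothetical panel: a GDP measure for two countries over three years
panel <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(2014:2016, times = 2),
  GDP     = c(100, 103, 105, 50, 52, 55)
)

# Time series: one subject followed over time
time_series <- subset(panel, Country == "A")

# Cross-section: all subjects at a single point in time
cross_section <- subset(panel, Year == 2015)

print(time_series)
print(cross_section)
```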

  1. Choose three developed countries, three countries in economic transition, and three developing countries (for a list of these countries, see Tables A–C in the UN country classification document).

R walk-through 4.6 Creating stacked bar charts

Calculate proportion of total GDP

This walk-through uses the following countries:

  • developed countries: Germany, Japan, United States
  • transition countries: Albania, Russian Federation, Ukraine
  • developing countries: Brazil, China, India.

The relevant data are still in the table_UN dataframe. Before we select these countries we first calculate the required proportions for all countries.

# Calculate proportions

table_UN$p_Capital <- table_UN$Capital/(table_UN$Capital+table_UN$Final.Expenditure+table_UN$Net.Exports)
table_UN$p_FinalExp <- table_UN$Final.Expenditure/(table_UN$Capital+table_UN$Final.Expenditure+table_UN$Net.Exports)
table_UN$p_NetExports <- table_UN$Net.Exports/(table_UN$Capital+table_UN$Final.Expenditure+table_UN$Net.Exports)
sel_countries = c("Germany", "Japan", "United States", "Albania", "Russian Federation", "Ukraine", "Brazil", "China", "India")

# Using table_UN (one row per country and year), we select
# the proportions for our chosen countries in 2015.

# Select the columns we need
sel_2015 = subset(table_UN,subset = (Country %in% sel_countries) & (Year == 2015),select = c("Country","Year","p_FinalExp","p_Capital","p_NetExports"))

Plot a stacked bar chart

Now let’s create the bar chart.

# Reshape the table into long format, then use ggplot
sel_2015_m <- melt(sel_2015, id.vars = c("Year","Country"))

g <- ggplot(sel_2015_m, aes(x = Country, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("GDP component proportions in 2015") +
  scale_fill_discrete(name  ="Components of GDP",
                      labels = c("Final expenditure",
                                 "Gross capital formation",
                                 "Net Exports")) +
  theme_bw()

plot(g)

Figure 4.6 GDP component proportions in 2015.

Note that even when a country has a trade deficit (net export proportion < 0), the proportions will add up to 1, but the proportions of final expenditure and capital will add up to more than 1.
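
A short numerical check of this point, using made-up values for a country with a trade deficit (all numbers and names here are hypothetical):

```r
# Hypothetical values for a country with a trade deficit
capital     <- 25
final_exp   <- 85
net_exports <- -10

gdp <- capital + final_exp + net_exports  # 100

# Proportion of GDP accounted for by each component
p <- c(FinalExp = final_exp, Capital = capital, NetExports = net_exports) / gdp
p

sum(p)                            # the three proportions add up to 1
p[["FinalExp"]] + p[["Capital"]]  # 1.1, which is more than 1
```

This is why, in the stacked bar chart, the bars for countries with negative net exports extend beyond 1 on one side and below 0 on the other.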

We have not yet ordered the countries so that they form the pre-specified groups. To achieve this, we need to explicitly impose an ordering on the Country variable.

# Impose the order in the sel_countries object, then use ggplot
sel_2015_m$Country <- factor(sel_2015_m$Country,levels = sel_countries)

g <- ggplot(sel_2015_m, aes(x = Country, y = value, fill = variable)) + 
    geom_bar(stat = "identity") +
    coord_flip() +
    ggtitle("GDP component proportions in 2015 (ordered)") +
    scale_fill_discrete(name  ="Components of GDP",
                      labels = c("Final expenditure",
                                 "Gross capital formation",
                                 "Net Exports")) +
    theme_bw()

plot(g)

Figure 4.7 GDP component proportions in 2015 (ordered).
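
The reordering works because ggplot2 places the categories of a factor in the order of its levels. The sketch below, using a short made-up vector of country names, shows the default (alphabetical) levels and the effect of supplying the levels explicitly.

```r
countries <- c("Brazil", "Germany", "Albania")

# Default: levels are sorted alphabetically
levels(factor(countries))  # "Albania" "Brazil" "Germany"

# Explicit ordering: levels follow the order we supply
group_order <- c("Germany", "Albania", "Brazil")
ordered_countries <- factor(countries, levels = group_order)
levels(ordered_countries)  # "Germany" "Albania" "Brazil"
```

Only the order of the levels changes; the data values themselves stay in their original positions.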

  1. GDP per capita is often used to indicate material wellbeing instead of GDP, because it accounts for differences in population across countries. Refer to the following articles to help you to answer the questions:

Part 4.2 The HDI as a measure of wellbeing

In Part 4.1 we looked at GDP per capita as a measure of material wellbeing. While income has a major influence on wellbeing because it allows us to buy the goods and services we need or enjoy, it is not the only determinant of wellbeing. Many aspects of our wellbeing cannot be bought, for example, good health or having more time to spend with friends and family.

We are now going to look at the Human Development Index (HDI), a measure of wellbeing that includes non-material aspects, and make comparisons with GDP per capita (a measure of material wellbeing). GDP per capita is a simple index calculated as the sum of its elements, whereas the HDI is more complex. Instead of using different types of expenditure or output to measure wellbeing or living standards, the HDI consists of three dimensions associated with wellbeing:

  • a long and healthy life (health)
  • knowledge (education)
  • a decent standard of living (income).

We will first learn about how the HDI is constructed, and then use this method to construct indices of wellbeing according to criteria of our choice.

The HDI data we will look at is from the Human Development Report 2016 by the United Nations Development Programme (UNDP). To answer the questions below, download the data and technical notes from the report:

  1. Refer to the technical notes and Table 1 in the spreadsheet. For each indicator, explain whether you think it is a good measure of the dimension, and suggest alternative indicators, if any. (For example, is GNI per capita a good measure of the dimension ‘a decent standard of living’?)
  1. Figure 4.8 shows the minimum and maximum values for each indicator. Discuss whether you think these are reasonable. (You can read the justification for these values in the technical notes.)

Dimension          | Indicator                                     | Minimum | Maximum
Health             | Life expectancy (years)                       | 20      | 85
Education          | Expected years of schooling (years)           | 0       | 18
Education          | Mean years of schooling (years)               | 0       | 15
Standard of living | Gross national income per capita (2011 PPP $) | 100     | 75,000

Figure 4.8 Maximum and minimum values for each indicator in the HDI.

United Nations Development Programme. 2016. ‘Technical notes’ in Human Development Report 2016: p. 2.

We are now going to apply the method for constructing the HDI, by recalculating the HDI from its indicators. We will use the formula below, and the minimum and maximum values in the table in Figure 4.8. These are taken from page 2 of the technical notes, which you can refer to for additional details.

The HDI indicators are measured in different units and have different ranges, so in order to put them together into a meaningful index, we need to normalize the indicators using the following formula:

$$\text{Dimension index} = \frac{\text{actual value} - \text{minimum value}}{\text{maximum value} - \text{minimum value}}$$

Doing so will give a value between 0 and 1 (inclusive), which allows comparison between different indicators.
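
As a minimal sketch of this normalization, (actual value − minimum) / (maximum − minimum), the function below applies it to Norway's life expectancy (81.711 years, as shown in the Table 1 data in R walk-through 4.7) with the minimum and maximum from Figure 4.8. The function name dimension_index is our own choice.

```r
# Normalize an indicator so that it lies between 0 and 1
dimension_index <- function(actual, minimum, maximum) {
  (actual - minimum) / (maximum - minimum)
}

# Life expectancy for Norway, with the bounds for the health dimension
norway_health <- dimension_index(81.711, minimum = 20, maximum = 85)
norway_health  # about 0.949

# The bounds themselves map to 0 and 1
dimension_index(20, minimum = 20, maximum = 85)  # 0
dimension_index(85, minimum = 20, maximum = 85)  # 1
```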

  1. Refer to Figure 4.8 and calculate the dimension index for each of the dimensions in a separate column in the Table 1 tab of the spreadsheet:

Find out more The natural log: What it means, and how to calculate it in R

The natural log turns a linear variable into a concave variable, as shown in Figure 4.9. For any value of income on the horizontal axis, the natural log of that value on the vertical axis is smaller. At first, the difference between income and log income is not that big (for example, an income of 2 corresponds to a natural log of 0.7), but the difference becomes bigger as we move rightwards along the horizontal axis (for example, when income is 100,000, the natural log is only 11.5).

Figure 4.9 Comparing income with the natural logarithm of income.

Natural logs are useful in economics because they can represent variables that have diminishing marginal returns: an additional unit of input results in a smaller increase in the total output than did the previous unit. (If you have studied production functions, then the shape of the natural log function might look familiar.)

When applied to the concept of wellbeing, the ‘input’ is income, and the ‘output’ is material wellbeing. It makes intuitive sense that a $100 increase in per capita income will have a much greater effect on wellbeing for a poor country compared to a rich country. Using the natural log of income incorporates this notion into the index we create. Conversely, the notion of diminishing marginal returns is not captured by GDP per capita, which uses actual income and not its natural log. Doing so makes the assumption that a $100 increase in per capita income has the same effect on wellbeing for rich and poor countries.

The log function in R calculates the natural log of a value for you. To calculate the natural log of a value, x, type log(x). If you have a scientific calculator, you can check that the calculation is correct by using the ln key (on most calculators, the log key gives the base-10 logarithm instead).
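
For example, the two values mentioned in the text can be reproduced as follows. The last two lines are included only to show the difference between the natural log and the base-10 log in R.

```r
log(2)       # about 0.69
log(100000)  # about 11.5

log(exp(1))  # 1, because log() is the natural logarithm
log10(100)   # 2, because log10() is the base-10 logarithm
```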

Now that you know about the natural log, you might want to go back to Question 2(c) in Part 4.1, and create a new chart using the natural log scale. Using the natural log scale, you will be able to ‘read off’ the relative growth rates from the slopes of the different series you have plotted. For example, a 0.01 change in the vertical axis value corresponds to approximately a 1% change in that variable. This will allow you to compare the growth rates of the different components of GDP.
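
The link between changes in logs and growth rates can be checked with a made-up series that grows by exactly 1% per period (the series x is hypothetical):

```r
# Hypothetical series growing by 1% per period
x <- 100 * 1.01^(0:4)

# Exact growth rates between consecutive periods
growth <- diff(x) / head(x, -1)

# Differences in logs approximate the growth rates
log_diff <- diff(log(x))

round(growth, 4)    # 0.01 in every period
round(log_diff, 4)  # also 0.01 (before rounding, log(1.01) is about 0.00995)
```

The approximation is close for small growth rates and becomes less accurate as the growth rate gets larger.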

geometric mean
A summary measure calculated by multiplying N numbers together and then taking the Nth root of this product. The geometric mean is useful when the items being averaged have different scoring indices or scales, because it is not sensitive to these differences, unlike the arithmetic mean. For example, if education ranged from 0 to 20 years and life expectancy ranged from 0 to 85 years, life expectancy would have a bigger influence on the HDI than education if we used the arithmetic mean rather than the geometric mean. Conversely, the geometric mean treats each criterion equally. Example: Suppose we use life expectancy and mean years of schooling to construct an index of wellbeing. Country A has life expectancy of 40 years and a mean of 6 years of schooling. If we used the arithmetic mean to make an index, we would get (40 + 6)/2 = 23. If we used the geometric mean, we would get (40 × 6)^(1/2) = 15.5. Now suppose life expectancy doubled to 80 years. The arithmetic mean would be (80 + 6)/2 = 43, and the geometric mean would be (80 × 6)^(1/2) = 21.9. If, instead, mean years of schooling doubled to 12 years, the arithmetic mean would be (40 + 12)/2 = 26, and the geometric mean would be (40 × 12)^(1/2) = 21.9. This example shows that the arithmetic mean can be ‘unfair’ because proportional changes in one variable (life expectancy) have a larger influence over the index than changes in the other variable (years of schooling). The geometric mean gives each variable the same influence over the value of the index, so doubling the value of one variable would have the same effect on the index as doubling the value of another variable.
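
The arithmetic in this example can be reproduced in R. Base R has no built-in geometric mean function, so the sketch below defines a small helper (geometric_mean is our own name).

```r
# Geometric mean: the Nth root of the product of N numbers
geometric_mean <- function(x) prod(x)^(1 / length(x))

# Baseline: life expectancy of 40 years and 6 mean years of schooling
mean(c(40, 6))            # 23
geometric_mean(c(40, 6))  # about 15.5

# Doubling life expectancy
mean(c(80, 6))            # 43
geometric_mean(c(80, 6))  # about 21.9

# Doubling mean years of schooling instead
mean(c(40, 12))            # 26
geometric_mean(c(40, 12))  # about 21.9
```

Both doublings raise the geometric mean to the same value, whereas the arithmetic mean rises much more when the variable with the larger scale is doubled.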

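Now we can combine these dimension indices to give the HDI. The HDI is the geometric mean of the three dimension indices ($I_{\text{Health}}$ = Life expectancy index, $I_{\text{Education}}$ = Education index, and $I_{\text{Income}}$ = GNI index):

$$\text{HDI} = (I_{\text{Health}} \times I_{\text{Education}} \times I_{\text{Income}})^{1/3}$$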

  1. Use the formula above and the data in Table 1 of the spreadsheet to calculate the HDI for all the countries excluding those in the ‘Other countries or territories’ category. You should get the same values as those in Column C, rounded to three decimal places.
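
Before working with the full dataset, it can help to do the calculation by hand for one country. The sketch below does this for Norway, using the indicator values shown in the Table 1 data and the bounds in Figure 4.8. It relies on two details from the UNDP technical notes that you should verify there: the education index is taken as the simple average of the two schooling indices, and the income index uses the natural log of GNI per capita and of its bounds. The helper name normalize is our own. The cap on expected years of schooling at the maximum of 18 is not handled here, because Norway's value is below it.

```r
# Indicator values for Norway, copied from Table 1
life_exp    <- 81.711
exp_school  <- 17.67187
mean_school <- 12.74642
gni_pc      <- 67614.35348

# Normalize an indicator using its minimum and maximum (Figure 4.8)
normalize <- function(actual, minimum, maximum) {
  (actual - minimum) / (maximum - minimum)
}

i_health <- normalize(life_exp, 20, 85)

# Assumption (UNDP technical notes): average of the two schooling indices
i_education <- mean(c(normalize(exp_school, 0, 18),
                      normalize(mean_school, 0, 15)))

# Assumption (UNDP technical notes): income enters in natural logs
i_income <- normalize(log(gni_pc), log(100), log(75000))

# HDI: geometric mean of the three dimension indices
hdi <- (i_health * i_education * i_income)^(1 / 3)
round(hdi, 3)  # 0.949
```

The result agrees with the HDI value reported for Norway in Table 1 to three decimal places.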

R walk-through 4.7 Calculating the HDI

We will import the data file that we saved as ‘HDR_data.xlsx’ in the working directory. Look at the Excel file (the ‘Table 1’ worksheet) so that you understand its structure.

HDR2015 <- read_excel("HDR_data.xlsx", # File path
                sheet="Table 1", # Worksheet to import
                skip = 2) # Number of rows to skip
head(HDR2015)
## # A tibble: 6 x 15
##   X__1   X__2         `Human Development ~ X__3  `Life expectancy a~ X__4 
##   <chr>  <chr>        <chr>                <lgl> <chr>               <chr>
## 1 HDI r~ Country      Value                NA    (years)             <NA> 
## 2 <NA>   <NA>         2015                 NA    2015                <NA> 
## 3 <NA>   VERY HIGH H~ <NA>                 NA    <NA>                <NA> 
## 4 1      Norway       0.94942283449106446  NA    81.710999999999999  <NA> 
## 5 2      Australia    0.93867953564660933  NA    82.537000000000006  <NA> 
## 6 2      Switzerland  0.93913086905938037  NA    83.132999999999996  <NA> 
## # ... with 9 more variables: `Expected years of schooling` <chr>,
## #   X__5 <chr>, `Mean years of schooling` <chr>, X__6 <chr>, `Gross
## #   national income (GNI) per capita` <chr>, X__7 <chr>, `GNI per capita
## #   rank minus HDI rank` <chr>, X__8 <lgl>, `HDI rank` <chr>
str(HDR2015)
## Classes 'tbl_df', 'tbl' and 'data.frame':  264 obs. of 15 variables:
##  $ X__1                                  : chr  "HDI rank" NA NA "1" ...
##  $ X__2                                  : chr  "Country" NA "VERY HIGH HUMAN DEVELOPMENT" "Norway" ...
##  $ Human Development Index (HDI)         : chr  "Value" "2015" NA "0.94942283449106446" ...
##  $ X__3                                  : logi  NA NA NA NA NA NA ...
##  $ Life expectancy at birth              : chr  "(years)" "2015" NA "81.710999999999999" ...
##  $ X__4                                  : chr  NA NA NA NA ...
##  $ Expected years of schooling           : chr  "(years)" "2015" NA "17.671869999999998" ...
##  $ X__5                                  : chr  NA "a" NA NA ...
##  $ Mean years of schooling               : chr  "(years)" "2015" NA "12.746420000000001" ...
##  $ X__6                                  : chr  NA "a" NA NA ...
##  $ Gross national income (GNI) per capita: chr  "(2011 PPP $)" "2015" NA "67614.353480000005" ...
##  $ X__7                                  : chr  NA NA NA NA ...
##  $ GNI per capita rank minus HDI rank    : chr  NA "2015" NA "5" ...
##  $ X__8                                  : logi  NA NA NA NA NA NA ...
##  $ HDI rank                              : chr  NA "2014" NA "1" ...

Looking at the HDR2015 dataframe, some rows contain information that isn’t data (for example, all the rows with an ‘NA’ in the first column), and some variables/columns do not contain data (for example, most columns whose names begin with ‘X_’, although the columns labelled X__1 and X__2 contain the HDI rank and the country names respectively).

Cleaning up the dataframe can be easier to do in Excel by deleting irrelevant rows and columns, but one advantage of doing it in R is replicability. Suppose in a year’s time you carried out the analysis again with an updated spreadsheet containing new information. If you had done the cleaning in Excel, you would have to redo it from scratch, but if you had done it in R, you would only need to run the code below again.

Firstly, we eliminate rows that do not have any numbers in the HDI rank column (the X__1 column).

names(HDR2015)[1] <- "HDI.rank" # Rename the first column, currently named X__1
names(HDR2015)[2] <- "Country" # Rename the second column, currently named X__2
names(HDR2015)[names(HDR2015)=="HDI rank"] <- "HDI.rank.2014" # Rename the last column, which contains the 2014 rank
HDR2015 <- subset(HDR2015,!is.na(HDI.rank) & HDI.rank != "HDI rank" ) # Drop rows with no HDI rank, and the row that contains the column title

Then we eliminate columns that contain notes in the original spreadsheet (names starting with ‘X_’).

sel_columns <- !startsWith(names(HDR2015),"X_") # Check which variables do NOT (!) start with X_
HDR2015 <- subset(HDR2015,select = sel_columns) # Select the columns that do not start with X_
str(HDR2015)
## Classes 'tbl_df', 'tbl' and 'data.frame':  188 obs. of 9 variables:
##  $ HDI.rank                              : chr  "1" "2" "2" "4" ...
##  $ Country                               : chr  "Norway" "Australia" "Switzerland" "Germany" ...
##  $ Human Development Index (HDI)         : chr  "0.94942283449106446" "0.93867953564660933" "0.93913086905938037" "0.9256689410716622" ...
##  $ Life expectancy at birth              : chr  "81.710999999999999" "82.537000000000006" "83.132999999999996" "81.091999999999999" ...
##  $ Expected years of schooling           : chr  "17.671869999999998" "20.43272" "16.040410000000001" "17.095939999999999" ...
##  $ Mean years of schooling               : chr  "12.746420000000001" "13.1751" "13.37" "13.18762553" ...
##  $ Gross national income (GNI) per capita: chr  "67614.353480000005" "42822.19627" "56363.957799999996" "44999.647140000001" ...
##  $ GNI per capita rank minus HDI rank    : chr  "5" "19" "7" "13" ...
##  $ HDI.rank.2014                         : chr  "1" "3" "2" "4" ...

Let’s change some of the long variable names to shorter ones.

names(HDR2015)[3] <- "HDI"
names(HDR2015)[4] <- "LifeExp"
names(HDR2015)[5] <- "ExpSchool"
names(HDR2015)[6] <- "MeanSchool"
names(HDR2015)[7] <- "GNI.capita"
names(HDR2015)[8] <- "GNI.HDI.rank"

Looking at the structure of the data, we see that R has stored all the variables as chr (character or text variables). This is because, when the data was imported, there were non-numerical entries in rows that have now been deleted. Apart from the Country variable, which we want to be a factor variable, all variables should be numeric.

HDR2015$HDI.rank <- as.numeric(HDR2015$HDI.rank)
HDR2015$Country <- as.factor(HDR2015$Country)
HDR2015$HDI <- as.numeric(HDR2015$HDI)
HDR2015$LifeExp <- as.numeric(HDR2015$LifeExp)
HDR2015$ExpSchool <- as.numeric(HDR2015$ExpSchool)
HDR2015$MeanSchool <- as.numeric(HDR2015$MeanSchool)
HDR2015$GNI.capita <- as.numeric(HDR2015$GNI.capita)
HDR2015$GNI.HDI.rank <- as.numeric(HDR2015$GNI.HDI.rank)
HDR2015$HDI.rank.2014 <- as.numeric(HDR2015$HDI.rank.2014)
str(HDR2015)
## Classes 'tbl_df', 'tbl' and 'data.frame':  188 obs. of 9 variables:
##  $ HDI.rank     : num  1 2 2 4 5 5 7 8 9 10 ...
##  $ Country      : Factor w/ 188 levels "Afghanistan",..: 125 9 163 64 47 151 120 81 76 32 ...
##  $ HDI          : num  0.949 0.939 0.939 0.926 0.925 ...
##  $ LifeExp      : num  81.7 82.5 83.1 81.1 80.4 ...
##  $ ExpSchool    : num  17.7 20.4 16 17.1 19.2 ...
##  $ MeanSchool   : num  12.7 13.2 13.4 13.2 12.7 ...
##  $ GNI.capita   : num  67614 42822 56364 45000 44519 ...
##  $ GNI.HDI.rank : num  5 19 7 13 13 -3 8 11 20 12 ...
##  $ HDI.rank.2014: num  1 3 2 4 6 4 6 8 9 9 ...
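The column-by-column conversion above can also be written more compactly. The following is a minimal sketch on a small stand-in dataframe (toy is a hypothetical name, with values copied from the first two rows shown earlier), converting every column except Country in one step.

```r
# Stand-in dataframe with everything stored as text, as after the Excel import
toy <- data.frame(
  HDI.rank = c("1", "2"),
  Country  = c("Norway", "Australia"),
  HDI      = c("0.94942283449106446", "0.93867953564660933"),
  stringsAsFactors = FALSE
)

# All columns except Country should be numeric
num_cols <- setdiff(names(toy), "Country")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Country becomes a factor, as in the walk-through
toy$Country <- as.factor(toy$Country)

str(toy)
```

Applied to HDR2015 instead of toy, the same two lines would replace the eight separate as.numeric calls, at the cost of being slightly less explicit about which variables are converted.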

Now we have a nice clean dataset that we can work with.

We start by calculating the three dimension indices, using the information given. For the education index, we calculate the indices for expected and mean years of schooling separately, then take the arithmetic mean to get I.Education. As some expected schooling observations exceed the ‘maximum’ value of 18, their calculated index values would be larger than 1. To avoid this, we use pmin to replace these observations with 18, which gives an index value of 1.

HDR2015$I.Health <- (HDR2015$LifeExp-20)/(85-20)
HDR2015$I.Education <- ((pmin(HDR2015$ExpSchool,18)-0)/(18-0) + (HDR2015$MeanSchool-0)/(15-0))/2
HDR2015$I.Income <- (log(HDR2015$GNI.capita)-log(100))/(log(75000)-log(100))
HDR2015$HDI.calc <- (HDR2015$I.Health * HDR2015$I.Education * HDR2015$I.Income)^(1/3)
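As a quick sanity check, the same formulas can be applied by hand to a single country. The sketch below plugs in Norway's values as they appear in the imported data above; the lower-case variable names are chosen here purely for illustration.

```r
# Norway's 2015 values, as shown in the imported data
life_exp    <- 81.711
exp_school  <- 17.67187
mean_school <- 12.74642
gni_capita  <- 67614.35348

# Dimension indices, using the same minimum and maximum values as above
i_health    <- (life_exp - 20) / (85 - 20)
i_education <- (pmin(exp_school, 18) / 18 + mean_school / 15) / 2
i_income    <- (log(gni_capita) - log(100)) / (log(75000) - log(100))

# The HDI is the geometric mean of the three indices
hdi_norway <- (i_health * i_education * i_income)^(1/3)
round(hdi_norway, 3)  # 0.949, matching the published value
```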

Now we can compare the HDI given in the table and our calculated HDI.

HDR2015[,c("HDI","HDI.calc")]
## # A tibble: 188 x 2
##   HDI   HDI.calc
##   <dbl> <dbl>
## 1 0.949 0.949
## 2 0.939 0.939
## 3 0.939 0.939
## 4 0.926 0.926
## 5 0.925 0.925
## 6 0.925 0.927
## 7 0.924 0.924
## 8 0.923 0.923
## 9 0.921 0.921
## 10 0.920 0.920
## # ... with 178 more rows
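With 188 rows, comparing the two columns by eye is impractical. One possible check, sketched here on a hypothetical three-row dataframe (check_df, holding rows 4 to 6 of the rounded output above), is to flag the rows where the two values differ after rounding to three decimal places.

```r
check_df <- data.frame(
  HDI      = c(0.926, 0.925, 0.925),
  HDI.calc = c(0.926, 0.925, 0.927)
)

# TRUE where the published and calculated values agree to three decimal places
check_df$match <- round(check_df$HDI, 3) == round(check_df$HDI.calc, 3)

sum(!check_df$match)         # Number of mismatches
check_df[!check_df$match, ]  # Rows that need a closer look
```

Run on HDR2015 itself, any flagged rows are worth inspecting; one possible cause is an indicator value that lies outside the [min, max] range used for its index.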

The HDI is one way to measure wellbeing, but you may think that it does not use the most appropriate measures for the non-material aspects of wellbeing (health and education).

Now we will use the same method to create our own index of non-material wellbeing (an ‘alternative HDI’), using different indicators. Tables 8, 9, 11, 12, and 13 in the spreadsheet contain the indicators that you can use to measure health and education instead of those used in Questions 2 to 4.

  1. Create an alternative index of wellbeing. In particular, propose alternative Education and Health indices in (a) and (b), then combine these with the existing Income index in (c) to calculate an alternative HDI. Examine whether the changes caused substantial changes in country rankings in (d).

R walk-through 4.8 Creating your own HDI

Merge data and calculate alternative indices

In the ‘HDR_data.xlsx’ spreadsheet, educational indicators are in worksheet ‘Table 9’ and health indicators are in ‘Table 8’.

This example uses the following indicators:

  • Education (Table 9): Adult literacy rate (Column C), Tertiary Enrolment (Column Q), Primary school teachers trained to teach (Column U)
  • Health (Table 8): Child malnutrition (Column I), Female mortality rate (Column O), Male mortality rate (Column Q).

HDR2015.Edu <- read_excel("HDR_data.xlsx", # Filename
                sheet="Table 9", # Sheet to import
                skip = 6,  # Number of rows to skip
                na = "..")  # This indicates how missing values are coded.
head(HDR2015.Edu)
## # A tibble: 6 x 25
##    X__1 `Very high human ~  X__2 X__3   X__4 X__5   X__6 X__7   X__8 X__9 
##   <dbl> <chr>              <dbl> <lgl> <dbl> <lgl> <dbl> <lgl> <dbl> <chr>
## 1     1 Norway              NA   NA     NA   NA     NA   NA     95.3 <NA> 
## 2     2 Australia           NA   NA     NA   NA     NA   NA     91.5 <NA> 
## 3     2 Switzerland         NA   NA     NA   NA     NA   NA     96.7 <NA> 
## 4     4 Germany             NA   NA     NA   NA     NA   NA     96.7 <NA> 
## 5     5 Denmark             NA   NA     NA   NA     NA   NA     89.5 <NA> 
## 6     5 Singapore           96.8 NA     99.9 NA     99.9 NA     78.6 <NA> 
## # ... with 15 more variables: X__10 <dbl>, X__11 <lgl>, X__12 <dbl>,
## #   X__13 <lgl>, X__14 <dbl>, X__15 <lgl>, X__16 <dbl>, X__17 <lgl>,
## #   X__18 <dbl>, X__19 <lgl>, X__20 <dbl>, X__21 <lgl>, X__22 <dbl>,
## #   X__23 <lgl>, X__24 <dbl>
str(HDR2015.Edu)
## Classes 'tbl_df', 'tbl' and 'data.frame':  240 obs. of 25 variables:
##  $ X__1                       : num  1 2 2 4 5 5 7 8 9 10 ...
##  $ Very high human development: chr  "Norway" "Australia" "Switzerland" "Germany" ...
##  $ X__2                       : num  NA NA NA NA NA ...
##  $ X__3                       : logi  NA NA NA NA NA NA ...
##  $ X__4                       : num  NA NA NA NA NA ...
##  $ X__5                       : logi  NA NA NA NA NA NA ...
##  $ X__6                       : num  NA NA NA NA NA ...
##  $ X__7                       : logi  NA NA NA NA NA NA ...
##  $ X__8                       : num  95.3 91.5 96.7 96.7 89.5 ...
##  $ X__9                       : chr  NA NA NA NA ...
##  $ X__10                      : num  98.2 109.2 105.1 110.9 96.4 ...
##  $ X__11                      : logi  NA NA NA NA NA NA ...
##  $ X__12                      : num  100 107 103 103 101 ...
##  $ X__13                      : logi  NA NA NA NA NA NA ...
##  $ X__14                      : num  112.6 137.6 99.8 102.4 129.9 ...
##  $ X__15                      : logi  NA NA NA NA NA NA ...
##  $ X__16                      : num  76.8 86.6 57.2 65.5 81.5 ...
##  $ X__17                      : logi  NA NA NA NA NA NA ...
##  $ X__18                      : num  0.425 NA NA 3.518 0.52 ...
##  $ X__19                      : logi  NA NA NA NA NA NA ...
##  $ X__20                      : num  NA NA NA NA NA ...
##  $ X__21                      : logi  NA NA NA NA NA NA ...
##  $ X__22                      : num  8.85 NA 10.1 12.3 NA ...
##  $ X__23                      : logi  NA NA NA NA NA NA ...
##  $ X__24                      : num  7.37 5.27 5.05 4.94 8.55 ...

The actual data starts in Row 8, so we skip 6 rows (leaving the 7th row for variable names). Also note that missing values are indicated by ‘..’ in the spreadsheet, which we specified using the na option when importing the data.

The variable names in the Excel file stretch over several rows, making them difficult to import. So we need to do a little manual manipulation. Check the spreadsheet to confirm that the three educational variables we want are labelled X__2, X__16 and X__20. We only need these variables and the one named ‘Very high human development’ where the country names are stored. We need the country names later for data merging.

We now have to repeat some of the steps we followed when we imported Table 1 in R walk-through 4.7.

names(HDR2015.Edu)[1] <- "HDI.rank"
names(HDR2015.Edu)[2] <- "Country"    # Rename the second column, currently named 'Very high …'
names(HDR2015.Edu)[names(HDR2015.Edu)=="X__2"] <- "Adult.Lit"
names(HDR2015.Edu)[names(HDR2015.Edu)=="X__16"] <- "Tert.Enrol"
names(HDR2015.Edu)[names(HDR2015.Edu)=="X__20"] <- "Prim.Teacher"

HDR2015.Edu <- subset(HDR2015.Edu,!is.na(HDI.rank), select = c("Country","Adult.Lit","Tert.Enrol","Prim.Teacher")  )  # The select argument indicates which variables to keep.

str(HDR2015.Edu)
## Classes 'tbl_df', 'tbl' and 'data.frame':  188 obs. of 4 variables:
##  $ Country     : chr  "Norway" "Australia" "Switzerland" "Germany" ...
##  $ Adult.Lit   : num  NA NA NA NA NA ...
##  $ Tert.Enrol  : num  76.8 86.6 57.2 65.5 81.5 ...
##  $ Prim.Teacher: num  NA NA NA NA NA ...

Looking at the structure (str( )), we can see that all three indicators are correctly in numerical (num) format.

Before we can calculate indices, we need to set minimum and maximum values, which we base on the minimum and maximum values in the sample.

summary(HDR2015.Edu)
##    Country            Adult.Lit       Tert.Enrol        Prim.Teacher    
##  Length:188         Min.   :19.13   Min.   :  0.7977   Min.   :  5.864  
##  Class :character   1st Qu.:74.95   1st Qu.: 13.4234   1st Qu.: 72.339  
##  Mode  :character   Median :93.09   Median : 35.0808   Median : 89.957  
##                     Mean   :83.54   Mean   : 38.8333   Mean   : 82.448  
##                     3rd Qu.:97.96   3rd Qu.: 62.9203   3rd Qu.: 98.256  
##                     Max.   :99.89   Max.   :110.1627   Max.   :100.000  
##                     NA's   :37      NA's   :33         NA's   :73

As we want the observations to be inside the [min, max] interval, we choose the following [min, max] pairs: Adult.Lit: [19, 100], Tert.Enrol: [0.79, 110.2], and Prim.Teacher: [5.86, 100]. You may want to research why there can be countries with a tertiary enrolment ratio larger than 100%.

Let’s calculate the alternative Education index.

HDR2015.Edu$I.Adult.Lit <- (HDR2015.Edu$Adult.Lit-(19))/(100-(19))
HDR2015.Edu$I.Tert.Enrol <- (HDR2015.Edu$Tert.Enrol-0.79)/(110.2-0.79) 
HDR2015.Edu$I.Prim.Teacher <- (HDR2015.Edu$Prim.Teacher-5.86)/(100-5.86)
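Since every dimension index uses the same (value - min)/(max - min) rescaling, the calculation can be wrapped in a small function. This is a sketch only: minmax_index is a hypothetical helper name, and the literacy rates below are made up to show the behaviour at the bounds.

```r
# Hypothetical helper: rescale x so that lo maps to 0 and hi maps to 1
minmax_index <- function(x, lo, hi) (x - lo) / (hi - lo)

# Made-up literacy rates, including a missing value
adult_lit <- c(19, 59.5, 100, NA)
res <- minmax_index(adult_lit, lo = 19, hi = 100)
res  # 0.0 0.5 1.0 NA
```

Missing values stay missing, which is the same behaviour as in the three lines above.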

Now we calculate an arithmetic average just as we did for I.Education.

HDR2015.Edu$I.Education.alt <- (HDR2015.Edu$I.Adult.Lit + HDR2015.Edu$I.Tert.Enrol + HDR2015.Edu$I.Prim.Teacher)/3
summary(HDR2015.Edu$I.Education.alt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1599  0.4850  0.6254  0.5983  0.7236  0.9319     101

You can see that we could not calculate this index for 101 countries, as at least one of the three values was missing.
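The reason for the missing values is that any sum involving NA is itself NA in R. The sketch below illustrates this with two hypothetical countries (A and B, with made-up index values), and shows one possible alternative that averages whichever indicators are available. That alternative is an assumption for illustration and is not used in this walk-through.

```r
# Three index values for two hypothetical countries; B lacks one indicator
idx <- data.frame(
  I.Adult.Lit    = c(0.9, 0.6),
  I.Tert.Enrol   = c(0.6, NA),
  I.Prim.Teacher = c(0.9, 0.8),
  row.names = c("A", "B")
)

# Adding an NA gives NA, so country B has no index (as in the walk-through)
strict <- (idx$I.Adult.Lit + idx$I.Tert.Enrol + idx$I.Prim.Teacher) / 3

# Alternative: average only the indicators that are available
lenient <- rowMeans(idx, na.rm = TRUE)

strict   # 0.8 for A, NA for B
lenient  # 0.8 for A, 0.7 for B
```

The second approach keeps more countries but compares them on different sets of indicators, so it changes what the index measures.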

We repeat this procedure to calculate an alternative health index. Remember that, in ‘Health’ (Table 8), we are using:

  • Child malnutrition (Column I)
  • Female mortality rate (Column O)
  • Male mortality rate (Column Q).

HDR2015.Health <- read_excel("HDR_data.xlsx", # Filename
                sheet="Table 8", # Sheet to import
                skip = 6,  # Number of rows to skip
                na = "..")  # This indicates how missing values are coded.

names(HDR2015.Health)[1] <- "HDI.rank"
names(HDR2015.Health)[2] <- "Country"    # Rename the second column, currently named 'Very high …'
names(HDR2015.Health)[names(HDR2015.Health)=="X__8"] <- "Child.MalNu"
names(HDR2015.Health)[names(HDR2015.Health)=="X__14"] <- "Mortality.Female"
names(HDR2015.Health)[names(HDR2015.Health)=="X__16"] <- "Mortality.Male"

HDR2015.Health <- subset(HDR2015.Health,!is.na(HDI.rank), 
  select = c("Country","Child.MalNu","Mortality.Female","Mortality.Male"))  # The select argument indicates which variables to keep.

summary(HDR2015.Health)
##    Country           Child.MalNu    Mortality.Female Mortality.Male 
##  Length:188         Min.   : 1.30   Min.   : 32.24   Min.   : 63.7  
##  Class :character   1st Qu.: 9.65   1st Qu.: 77.07   1st Qu.:137.1  
##  Mode  :character   Median :20.55   Median :113.05   Median :200.7  
##                     Mean   :22.16   Mean   :149.88   Mean   :212.3  
##                     3rd Qu.:32.88   3rd Qu.:203.76   3rd Qu.:271.8  
##                     Max.   :57.50   Max.   :612.37   Max.   :580.5  
##                     NA's   :50      NA's   :27       NA's   :27

As we want the observations to be inside the [min, max] interval we choose the following [min, max] pairs: Child.MalNu: [1.3, 57.5], Mortality.Female: [32.2, 612.4], and Mortality.Male: [63.7, 580.5].

HDR2015.Health$I.Child.MalNu <- (HDR2015.Health$Child.MalNu-(1.3))/(57.5-(1.3))
HDR2015.Health$I.Mortality.Female <- (HDR2015.Health$Mortality.Female-32.2)/(612.4-32.2) 
HDR2015.Health$I.Mortality.Male <- (HDR2015.Health$Mortality.Male-63.7)/(580.5-63.7)

HDR2015.Health$I.Health.alt <- (HDR2015.Health$I.Child.MalNu + HDR2015.Health$I.Mortality.Female + HDR2015.Health$I.Mortality.Male)/3

# Note that these are all 'bad' indicators, in the sense that higher numbers indicate worse outcomes.
# For all other indicators, larger numbers mean better outcomes, so we need to 'flip' the values of this index.
HDR2015.Health$I.Health.alt <- (1-HDR2015.Health$I.Health.alt)
summary(HDR2015.Health$I.Health.alt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1520  0.5616  0.7292  0.6909  0.8463  0.9731      54
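The order of the two steps above (average first, then flip) does not affect the result, because the arithmetic mean is linear. The short sketch below checks this using made-up index values for a single hypothetical country.

```r
# Made-up index values for one country on the three 'bad' indicators
bad <- c(malnutrition = 0.2, mort_female = 0.1, mort_male = 0.3)

flip_after  <- 1 - mean(bad)  # Average first, then flip (as above)
flip_before <- mean(1 - bad)  # Flip each indicator, then average

all.equal(flip_after, flip_before)  # TRUE
```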

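Now we use the merge function to add these variables to our existing HDR2015 dataframe. The dataframes are matched on the Country variable, which is the only variable they have in common.

HDR2015 <- merge(HDR2015,HDR2015.Edu, by = "Country")
HDR2015 <- merge(HDR2015,HDR2015.Health, by = "Country")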
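To see what merge does, it can help to try it on two very small dataframes. The sketch below uses made-up values and hypothetical names (left, right); it also shows the all.x option, which is not needed above because each table has a row for every country.

```r
# Two small made-up dataframes sharing a 'Country' column
left  <- data.frame(Country = c("Norway", "Chad", "Peru"),
                    HDI = c(0.95, 0.40, 0.74))
right <- data.frame(Country = c("Peru", "Norway"),
                    I.Education.alt = c(0.70, 0.90))

# Default: keep only the countries present in both dataframes
inner <- merge(left, right, by = "Country")

# all.x = TRUE keeps every row of 'left', filling gaps with NA
outer <- merge(left, right, by = "Country", all.x = TRUE)

inner  # Norway and Peru
outer  # Chad, Norway and Peru (Chad has NA for I.Education.alt)
```

Note that merge sorts the result by the matching variable, so the rows of the merged dataframe are in alphabetical order of Country rather than in the original order.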

Calculate an alternative HDI

Looking at HDR2015, you will see that the variables from HDR2015.Health and HDR2015.Edu have been added. Finally we are in a position to calculate our own HDI.

HDR2015$HDI.own <- (HDR2015$I.Health.alt * HDR2015$I.Education.alt * HDR2015$I.Income)^(1/3) 
summary(HDR2015$HDI.own)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.2865  0.4739  0.6276  0.6082  0.7554  0.8909     108

We have a substantial number of missing observations, leaving us with only 80 countries for which we could calculate the index.

Calculate ranks

To compare the ranks of the two indices (the original HDI and our alternative HDI), we should only rank the countries that have observations for both indices.

HDR2015_sub <- subset(HDR2015,!is.na(HDI) & !is.na(HDI.own)) 

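Let’s calculate the ranks for both indices within this subsample. Note that this overwrites the original HDI.rank variable, so that both ranks refer to the same set of countries.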

HDR2015_sub$HDI.own.rank <- rank(-HDR2015_sub$HDI.own, na.last = "keep") 
HDR2015_sub$HDI.rank <- rank(-HDR2015_sub$HDI, na.last = "keep") 

The rank function will assign rank 1 to the smallest index value, but we want the largest (best) index value to have the rank 1. We add - in front of the variable name to obtain the desired effect.
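The behaviour of rank is easiest to see on a short vector. The values below are made up for illustration only.

```r
index_values <- c(0.9, 0.5, NA, 0.7)  # Made-up index values

rank(index_values, na.last = "keep")   # 3  1 NA  2: smallest value gets rank 1
rank(-index_values, na.last = "keep")  # 1  3 NA  2: largest value gets rank 1

# Tied values share the average of the ranks they would occupy
rank(-c(0.9, 0.7, 0.7))  # 1.0 2.5 2.5
```

With na.last = "keep", missing values are left as NA instead of being placed at the end of the ranking.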

Now we will make a scatterplot in which we compare the rank of the HDI with that of our own index.

ggplot(HDR2015_sub, aes(x=HDI.rank, y=HDI.own.rank)) +
  geom_point(shape=16) + # Use solid circles
  labs(y = "Alternative HDI rank", x = "HDI rank") +
  ggtitle("Comparing ranks between HDI and HDI.own") +
  theme_bw()

Figure 4.10 Scatterplot of ranks for HDI and alternative HDI index.

You can see that in general the rankings are similar. If they were identical, all the points in the scatterplot would lie on the 45-degree line, where the two ranks are equal. They do not all lie on this line, but there is a very strong positive correlation. There are a few countries where the alternative definitions have caused a noticeable change in ranking, so let’s find out which countries these are.

temp <- HDR2015_sub[order(HDR2015_sub$HDI.rank-HDR2015_sub$HDI.own.rank),
                c("Country","HDI.rank","HDI.own.rank")]
    # Order countries according to rank difference and show a selection of variables
head(temp,5)  # Show the countries with the largest fall in rank
##        Country HDI.rank HDI.own.rank
## 102 Madagascar       57           77
## 161  Swaziland       51           70
## 96     Lesotho       60           78
## 158  Sri Lanka       13           29
## 18      Belize       27           40
tail(temp,5)  # Show the countries with the largest increase in rank
##                 Country HDI.rank HDI.own.rank
## 55              Eritrea       73           64
## 167            Thailand       20           11
## 129 Palestine, State of       33           23
## 159               Sudan       63           51
## 37             Colombia       24           10
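The strength of the relationship between two rankings can also be summarized in a single number, Spearman's rank correlation, which equals 1 when the rankings are identical. The sketch below uses made-up ranks for five hypothetical countries; on the real data, the same cor call could be applied to HDR2015_sub$HDI.rank and HDR2015_sub$HDI.own.rank.

```r
# Made-up ranks for five countries under two indices
ranks <- data.frame(
  HDI.rank     = c(1, 2, 3, 4, 5),
  HDI.own.rank = c(2, 1, 3, 5, 4)
)

# Spearman's rank correlation: 1 means identical rankings
rho <- cor(ranks$HDI.rank, ranks$HDI.own.rank, method = "spearman")
rho  # 0.8

# Rank difference: negative values mean the country places worse
# under the alternative index
ranks$diff <- ranks$HDI.rank - ranks$HDI.own.rank
ranks
```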

  1. Compare your alternative index to the HDI:

Classification                 HDI
Very high human development    0.800 and above
High human development         0.700–0.799
Medium human development       0.550–0.699
Low human development          Below 0.550

Figure 4.11 Classification of countries according to their HDI value.

United Nations Development Programme. 2016. ‘Technical notes’ in Human Development Report 2016: p.3.
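The classification in Figure 4.11 can be applied in R with the cut function. The sketch below uses made-up HDI values chosen to include the category boundaries; the names hdi and hdi_class are hypothetical.

```r
# Made-up HDI values, including the category boundaries
hdi <- c(0.949, 0.800, 0.799, 0.550, 0.352)

# right = FALSE makes each interval include its lower bound, e.g. [0.700, 0.800)
hdi_class <- cut(hdi,
                 breaks = c(-Inf, 0.550, 0.700, 0.800, Inf),
                 labels = c("Low", "Medium", "High", "Very high"),
                 right = FALSE)

data.frame(hdi, hdi_class)  # Each value next to its category
table(hdi_class)            # Number of countries in each category
```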

We will now investigate whether HDI and GDP per capita give similar information about overall wellbeing, by comparing a country’s rank in both measures. Refer to the data in Table 10 of the spreadsheet. Column E contains the GDP per capita in 2015, measured in 2011 constant prices in US dollars. To answer Question 7, first copy and paste this data into a new column in the Table 1 tab of the spreadsheet, making sure to match the data to the correct country.

  1. Evaluate GDP per capita and the HDI as measures of overall wellbeing:

                  HDI
              Low       High
GDP   Low
      High

Figure 4.12 Classification of countries according to their HDI and GDP values.
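One way to fill in a table like Figure 4.12 is to split each measure into ‘Low’ and ‘High’ groups and cross-tabulate them. The sketch below uses made-up values for six hypothetical countries and assumes that ‘High’ means above the sample median. That cut-off is an assumption made here for illustration, and other definitions are possible.

```r
# Made-up values for six hypothetical countries
wb <- data.frame(
  HDI = c(0.45, 0.52, 0.68, 0.74, 0.88, 0.93),
  GDP = c(1500, 12000, 4000, 9000, 30000, 45000)
)

# Assumption: 'High' means above the sample median
wb$HDI.group <- factor(ifelse(wb$HDI > median(wb$HDI), "High", "Low"),
                       levels = c("Low", "High"))
wb$GDP.group <- factor(ifelse(wb$GDP > median(wb$GDP), "High", "Low"),
                       levels = c("Low", "High"))

# Rows show the GDP group, columns show the HDI group
tab <- table(GDP = wb$GDP.group, HDI = wb$HDI.group)
tab
```

Countries in the off-diagonal cells (low on one measure but high on the other) are the ones for which the two measures give different information about wellbeing.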

  1. The HDI is one way to measure wellbeing, but there are many other ways to measure wellbeing.