A direction from cause to effect, establishing that a change in one variable produces a change in another. While a correlation gives an indication of whether two variables move together (either in the same or opposite directions), causation means that there is a mechanism that explains this association. Example: We know that higher levels of CO2 in the atmosphere lead to a greenhouse effect, which warms the Earth’s surface. Therefore we can say that higher CO2 levels are the cause of higher surface temperatures.
conditional mean
An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations.
correlation coefficient
numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positively correlated), to −1, indicating that the variables take dissimilar variables (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.
A measure of how closely related two variables are. Two variables are correlated if knowing the value of one variable provides information on the likely value of the other, for example high values of one variable being commonly observed along with high values of the other variable. Correlation can be positive or negative. It is negative when high values of one variable are observed with low values of the other. Correlation does not mean that there is a causal relationship between the variables. Example: When the weather is hotter, purchases of ice cream are higher. Temperature and ice cream sales are positively correlated. On the other hand, if purchases of hot beverages decrease when the weather is hotter, we say that temperature and hot beverage sales are negatively correlated.
cross-sectional data
Data that is collected from participants at one point in time or within a relatively short time frame. In contrast, time series data refers to data collected by following an individual (or firm, country, etc.) over a course of time. Example: Data on degree courses taken by all the students in a particular university in 2016 is considered cross-sectional data. In contrast, data on degree courses taken by all students in a particular university from 1990 to 2016 is considered time series data.
A method that applies an experimental research design to outcomes observed in a natural experiment. It involves comparing the difference in the average outcomes of two groups, a treatment and control group, both before and after the treatment took place.
frequency table
A record of how many observations in a dataset have a particular value, range of values, or belong to a particular category.
geometric mean
A summary measure calculated by multiplying N numbers together and then taking the Nth root of this product. The geometric mean is useful when the items being averaged have different scoring indices or scales, because it is not sensitive to these differences, unlike the arithmetic mean. For example, if education ranged from 0 to 20 years and life expectancy ranged from 0 to 85 years, life expectancy would have a bigger influence on the HDI than education if we used the arithmetic mean rather than the geometric mean. Conversely, the geometric mean treats each criteria equally. Example: Suppose we use life expectancy and mean years of schooling to construct an index of wellbeing. Country A has life expectancy of 40 years and a mean of 6 years of schooling. If we used the arithmetic mean to make an index, we would get (40 + 6)/2 = 23. If we used the geometric mean, we would get (40 × 6)1/2 = 15.5. Now suppose life expectancy doubled to 80 years. The arithmetic mean would be (80 + 6)/2 = 43, and the geometric mean would be (80 × 6)1/2 = 21.9. If, instead, mean years of schooling doubled to 12 years, the arithmetic mean would be (40 + 12)/2 = 26, and the geometric mean would be (40 × 12)1/2 = 21.9. This example shows that the arithmetic mean can be ‘unfair’ because proportional changes in one variable (life expectancy) have a larger influence over the index than changes in the other variable (years of schooling). The geometric mean gives each variable the same influence over the value of the index, so doubling the value of one variable would have the same effect on the index as doubling the value of another variable.
Gini coefficient
A measure of inequality of any quantity such as income or wealth, varying from a value of zero (if there is no inequality) to one (if a single individual receives all of it).
An index is formed by aggregating the values of multiple items into a single value, and is used as a summary measure of an item of interest. Example: The HDI is a summary measure of wellbeing, and is calculated by aggregating the values for life expectancy, expected years of schooling, mean years of schooling, and gross national income per capita.
Lorenz curve
A graphical representation of inequality of some quantity such as wealth or income. Individuals are arranged in ascending order by how much of this quantity they have, and the cumulative share of the total is then plotted against the cumulative share of the population. For complete equality of income, for example, it would be a straight line with a slope of one. The extent to which the curve falls below this perfect equality line is a measure of inequality. See also: Gini coefficient.
natural experiment
An empirical study exploiting naturally occurring statistical controls in which researchers do not have the ability to assign participants to treatment and control groups, as is the case in conventional experiments. Instead, differences in law, policy, weather, or other events can offer the opportunity to analyse populations as if they had been part of an experiment. The validity of such studies depends on the premise that the assignment of subjects to the naturally occurring treatment and control groups can be plausibly argued to be random.
The probability of observing the data collected, assuming that the two groups have the same mean. The p-value ranges from 0 to 1, where lower values indicate a higher probability that the underlying assumption (same means) is false. The lower the probability (the lower the p-value), the less likely it is to observe the given data, and therefore the more likely it is that the assumption is false (the means of both distributions is not the same).
The interval formed by the smallest (minimum) and the largest (maximum) value of a particular variable. The range shows the two most extreme values in the distribution, and can be used to check whether there are any outliers in the data. (Outliers are a few observations in the data that are very different from the rest of the observations.)
significance level
A cut-off probability that determines whether a p-value is considered statistically significant. If a p-value is smaller than the significance level, it is considered unlikely that the differences observed are due to chance, given the assumptions made about the variables (for example, having the same mean). Common significance levels are 1% (p-value of 0.01), 5% (p-value of 0.05), and 10% (p-value of 0.1). See also: statistically significant, p-value.
spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.
standard deviation
A measure of dispersion in a frequency distribution, equal to the square root of the variance. The standard deviation has a similar interpretation to the variance. A larger standard deviation means that the data is more spread out. Example: The set of numbers 1, 1, 1 has a standard deviation of zero (no variation or spread), while the set of numbers 1, 1, 999 has a standard deviation of 46.7 (large spread).
statistically significant
When a relationship between two or more variables is unlikely to be due to chance, given the assumptions made about the variables (for example, having the same mean). Statistical significance does not tell us whether there is a causal link between the variables.
time series data
A time series is a set of time-ordered observations of a variable taken at successive, in most cases regular, periods or points of time. Example: The population of a particular country in the years 1990, 1991, 1992, … , 2015 is time series data.
A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 999 has a high variance of 2178 (large spread).