# Glossary

box and whisker plot
A graphic display of the range and quartiles of a distribution, where the first and third quartiles form the ‘box’ and the maximum and minimum values form the ‘whiskers’.
causation
A direction from cause to effect, establishing that a change in one variable produces a change in another. While a correlation gives an indication of whether two variables move together (either in the same or opposite directions), causation means that there is a mechanism that explains this association. Example: We know that higher levels of CO2 in the atmosphere lead to a greenhouse effect, which warms the Earth’s surface. Therefore we can say that higher CO2 levels are the cause of higher surface temperatures.
conditional mean
An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations.
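
As a minimal sketch of the idea, the snippet below compares an overall mean with a mean taken only over observations that satisfy a condition. The groups and wage figures are hypothetical, invented purely for illustration.

```python
from statistics import mean

# Hypothetical (group, wage) observations, invented for illustration.
observations = [("A", 10.0), ("B", 20.0), ("A", 14.0), ("B", 30.0), ("A", 12.0)]

# Unconditional mean: uses every observation.
overall_mean = mean(wage for _, wage in observations)

# Conditional mean: uses only observations where the group is "A".
mean_given_a = mean(wage for group, wage in observations if group == "A")

print(overall_mean, mean_given_a)
```
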
confidence interval
A range of values that is centred around the sample value, and is defined so that there is a specified probability (usually 95%) that it contains the ‘true value’ of interest.
contingent valuation
A survey-based technique used to assess the value of non-market resources. Also known as: stated-preference model.
correlation coefficient
A numerical measure of how closely associated two variables are and whether they tend to take similar or dissimilar values, ranging from a value of 1, indicating that the variables take similar values (positive correlation), to −1, indicating that the variables take dissimilar values (negative or inverse correlation). A value of 1 or −1 indicates that knowing the value of one of the variables would allow you to perfectly predict the value of the other. A value of 0 indicates that knowing one of the variables provides no information about the value of the other.
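
The two extreme values can be illustrated with a short sketch. It assumes the Pearson correlation coefficient, which is one common choice; the function name `pearson` and the data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means.
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_moment / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
moves_together = [2 * v + 1 for v in x]   # rises exactly in step with x
moves_opposite = [10 - v for v in x]      # falls exactly as x rises

r_together = pearson(x, moves_together)   # approximately 1
r_opposite = pearson(x, moves_opposite)   # approximately -1
```
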
correlation
A measure of how closely related two variables are. Two variables are correlated if knowing the value of one variable provides information on the likely value of the other, for example high values of one variable being commonly observed along with high values of the other variable. Correlation can be positive or negative. It is negative when high values of one variable are observed with low values of the other. Correlation does not mean that there is a causal relationship between the variables. Example: When the weather is hotter, purchases of ice cream are higher. Temperature and ice cream sales are positively correlated. On the other hand, if purchases of hot beverages decrease when the weather is hotter, we say that temperature and hot beverage sales are negatively correlated.
Cronbach’s alpha
A measure used to assess the extent to which a set of items is a reliable or consistent measure of a concept. This measure ranges from 0–1, with 0 meaning that all of the items are independent of one another, and 1 meaning that all of the items are perfectly correlated with each other.
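
A minimal sketch of one standard formula for this measure, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the scores are hypothetical, and the sample (n − 1) variance is assumed.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all lists the same length."""
    k = len(items)
    sum_item_variances = sum(variance(scores) for scores in items)
    # Total score per respondent, summing across items.
    totals = [sum(person_scores) for person_scores in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))

# Three identical items are perfectly correlated with each other,
# so alpha should be (approximately) 1.
identical_items = [[1, 2, 3, 4, 5]] * 3
alpha_identical = cronbach_alpha(identical_items)
```
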
cross-sectional data
Data that is collected from participants at one point in time or within a relatively short time frame. In contrast, time series data refers to data collected by following an individual (or firm, country, etc.) over a course of time. Example: Data on degree courses taken by all the students in a particular university in 2016 is considered cross-sectional data. In contrast, data on degree courses taken by all students in a particular university from 1990 to 2016 is considered time series data.
decile
A subset of observations, formed by ordering the full set of observations according to the values of a particular variable and then splitting the set into ten equally-sized groups. For example, the 1st decile refers to the smallest 10% of values in a set of observations. See also: percentile.
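
The ordering-and-splitting step can be sketched as follows, assuming for simplicity that the number of observations is divisible by ten. The data (the integers 1 to 100) are hypothetical.

```python
# Hypothetical observations: the integers 1 to 100 in arbitrary order.
values = list(range(100, 0, -1))

ordered = sorted(values)        # order by the variable of interest
group_size = len(ordered) // 10  # assumes the count is divisible by ten

# Split the ordered observations into ten equally sized groups.
deciles = [ordered[i * group_size:(i + 1) * group_size] for i in range(10)]

first_decile = deciles[0]   # the smallest 10% of values
tenth_decile = deciles[-1]  # the largest 10% of values
```
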
deflation
A decrease in the general price level. See also: inflation.
differences-in-differences
A method that applies an experimental research design to outcomes observed in a natural experiment. It involves comparing the difference in the average outcomes of two groups, a treatment and control group, both before and after the treatment took place.
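
The comparison described above reduces to simple arithmetic on four group averages. The numbers below are hypothetical, chosen only to show the calculation.

```python
# Hypothetical average outcomes, invented for illustration.
treatment_before, treatment_after = 10.0, 18.0
control_before, control_after = 9.0, 12.0

# First difference: change over time within each group.
change_treatment = treatment_after - treatment_before
change_control = control_after - control_before

# Second difference: how much more the treatment group changed
# than the control group did over the same period.
did_estimate = change_treatment - change_control
```
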
disinflation
A decrease in the rate of inflation. See also: inflation, deflation.
dummy variable (indicator variable)
A variable that takes the value 1 if a certain condition is met, and 0 otherwise.
endogenous
Produced by the workings of a model rather than coming from outside the model. See also: exogenous.
exogenous
Coming from outside the model rather than being produced by the workings of the model itself. See also: endogenous.
frequency table
A record of how many observations in a dataset have a particular value, range of values, or belong to a particular category.
geometric mean
A summary measure calculated by multiplying N numbers together and then taking the Nth root of this product. The geometric mean is useful when the items being averaged have different scoring indices or scales, because it is not sensitive to these differences, unlike the arithmetic mean. For example, if education ranged from 0 to 20 years and life expectancy ranged from 0 to 85 years, life expectancy would have a bigger influence on the HDI than education if we used the arithmetic mean rather than the geometric mean. By contrast, the geometric mean treats each criterion equally. Example: Suppose we use life expectancy and mean years of schooling to construct an index of wellbeing. Country A has life expectancy of 40 years and a mean of 6 years of schooling. If we used the arithmetic mean to make an index, we would get (40 + 6)/2 = 23. If we used the geometric mean, we would get (40 × 6)^(1/2) = 15.5. Now suppose life expectancy doubled to 80 years. The arithmetic mean would be (80 + 6)/2 = 43, and the geometric mean would be (80 × 6)^(1/2) = 21.9. If, instead, mean years of schooling doubled to 12 years, the arithmetic mean would be (40 + 12)/2 = 26, and the geometric mean would be (40 × 12)^(1/2) = 21.9. This example shows that the arithmetic mean can be ‘unfair’ because proportional changes in one variable (life expectancy) have a larger influence over the index than changes in the other variable (years of schooling). The geometric mean gives each variable the same influence over the value of the index, so doubling the value of one variable would have the same effect on the index as doubling the value of another variable.
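
The worked example can be checked with a short sketch. The helper names `arithmetic_mean` and `geometric_mean` are hypothetical, not taken from any particular library.

```python
from math import prod

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Multiply the N numbers together, then take the Nth root.
    return prod(values) ** (1 / len(values))

baseline = [40, 6]            # life expectancy, mean years of schooling
life_doubled = [80, 6]
schooling_doubled = [40, 12]

for label, pair in [("baseline", baseline),
                    ("life expectancy doubled", life_doubled),
                    ("schooling doubled", schooling_doubled)]:
    print(label, arithmetic_mean(pair), round(geometric_mean(pair), 1))
```

Doubling either variable moves the geometric mean to the same value, while the arithmetic mean responds much more to the variable measured on the larger scale.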
Gini coefficient
A measure of inequality of any quantity such as income or wealth, varying from a value of zero (if there is no inequality) to one (if a single individual receives all of it).
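
A minimal sketch, assuming one common formula for this measure: the sum of absolute differences between all pairs of individuals, divided by 2 × n² × the mean. The function name and incomes are hypothetical. Under this particular formula, a population of n people in which one person receives everything gives (n − 1)/n, which approaches one as the population grows.

```python
def gini(values):
    """Gini coefficient from the mean absolute difference between all pairs."""
    n = len(values)
    mean_value = sum(values) / n
    total_abs_diff = sum(abs(a - b) for a in values for b in values)
    return total_abs_diff / (2 * n * n * mean_value)

equal_incomes = [25, 25, 25, 25]    # no inequality
one_has_all = [0, 0, 0, 100]        # a single individual receives everything

gini_equal = gini(equal_incomes)
gini_one_has_all = gini(one_has_all)  # (n - 1) / n with n = 4
```
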
hypothesis test
A test in which a null (default) and an alternative hypothesis are posed about some characteristic of the population. Sample data are then used to assess how likely it is that such data would be seen if the null hypothesis were true.
incomplete contract
A contract that does not specify, in an enforceable way, every aspect of the exchange that affects the interests of parties to the exchange (or of others).
index
An index is formed by aggregating the values of multiple items into a single value, and is used as a summary measure of an item of interest. Example: The HDI is a summary measure of wellbeing, and is calculated by aggregating the values for life expectancy, expected years of schooling, mean years of schooling, and gross national income per capita.
inflation
An increase in the general price level in the economy. Usually measured over a year. See also: deflation, disinflation.
leverage ratio (for banks or households)
The value of assets divided by the equity stake (capital contributed by owners and shareholders) in those assets.
leverage ratio (for non-bank companies)
The value of total liabilities divided by total assets.
Likert scale
A numerical scale (usually ranging from 1–5 or 1–7) used to measure attitudes or opinions, with each number representing the individual’s level of agreement or disagreement with a particular statement.
logarithmic scale
A way of measuring a quantity based on the logarithm function, f(x) = log(x). The logarithm function converts a ratio to a difference: log (a/b) = log a – log b. This is very useful for working with growth rates. For instance, if national income doubles from 50 to 100 in a poor country and from 1,000 to 2,000 in a rich country, the absolute difference in the first case is 50 and in the second 1,000, but log(100) – log(50) = 0.693, and log(2,000) – log(1,000) = 0.693. The ratio in each case is 2 and log(2) = 0.693.
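
The doubling example can be reproduced directly, assuming the natural logarithm (which is what yields 0.693):

```python
from math import log  # natural logarithm

# National income doubles in both countries.
poor_log_difference = log(100) - log(50)
rich_log_difference = log(2000) - log(1000)

# The absolute differences are very different ...
poor_abs_difference = 100 - 50      # 50
rich_abs_difference = 2000 - 1000   # 1000

# ... but the log differences both equal log(2), the log of the ratio.
print(round(poor_log_difference, 3), round(rich_log_difference, 3), round(log(2), 3))
```
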
Lorenz curve
A graphical representation of inequality of some quantity such as wealth or income. Individuals are arranged in ascending order by how much of this quantity they have, and the cumulative share of the total is then plotted against the cumulative share of the population. For complete equality of income, for example, it would be a straight line with a slope of one. The extent to which the curve falls below this perfect equality line is a measure of inequality. See also: Gini coefficient.
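
The points of the curve can be computed as in the sketch below. The four incomes are hypothetical, and each individual is treated as an equal share of the population.

```python
from itertools import accumulate

# Hypothetical incomes, invented for illustration.
incomes = [40, 10, 30, 20]

ordered = sorted(incomes)   # arrange individuals in ascending order
total = sum(ordered)
n = len(ordered)

# Cumulative share of the population (x-axis) and of income (y-axis).
cumulative_population = [(i + 1) / n for i in range(n)]
cumulative_income = [running / total for running in accumulate(ordered)]

lorenz_points = list(zip(cumulative_population, cumulative_income))
```

With complete equality, each point would lie on the straight line where the two shares are equal; here the poorest 50% of the population hold only 30% of total income.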
mean
A summary statistic for a set of observations, calculated by adding all values in the set and dividing by the number of observations.
median
The middle number in a set of values, such that half of the numbers are larger than the median and half are smaller. Also known as: 50th percentile.
natural experiment
An empirical study exploiting naturally occurring statistical controls in which researchers do not have the ability to assign participants to treatment and control groups, as is the case in conventional experiments. Instead, differences in law, policy, weather, or other events can offer the opportunity to analyse populations as if they had been part of an experiment. The validity of such studies depends on the premise that the assignment of subjects to the naturally occurring treatment and control groups can be plausibly argued to be random.
natural logarithm
See: logarithmic scale.
nominal wage
The actual amount received in payment for work, in a particular currency. See also: real wage.
p-value
The probability of observing data at least as extreme as the data collected, assuming that the two groups have the same mean. The p-value ranges from 0 to 1, where lower values indicate stronger evidence that the underlying assumption (same means) is false. The lower the p-value, the less likely it would be to observe the given data if the assumption were true, and therefore the stronger the evidence that the assumption is false (the means of the two distributions are not the same).
percentile
A subset of observations, formed by ordering the full set of observations according to the values of a particular variable and then splitting the set into one hundred equally-sized groups. For example, the 1st percentile refers to the smallest 1% of values in a set of observations. See also: decile.
principal–agent relationship
This is an asymmetrical relationship in which one party (the principal) benefits from some action or attribute of the other party (the agent) about which the principal’s information is not sufficient to enforce in a complete contract. See also: incomplete contract. Also known as: principal–agent problem.
range
The interval formed by the smallest (minimum) and the largest (maximum) value of a particular variable. The range shows the two most extreme values in the distribution, and can be used to check whether there are any outliers in the data. (Outliers are a few observations in the data that are very different from the rest of the observations.)
real wage
The nominal wage, adjusted to take account of changes in prices between different time periods. It measures the amount of goods and services the worker can buy. See also: nominal wage.
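
A minimal sketch of the adjustment, assuming one common approach: dividing the nominal wage by a price index whose base period equals 100. The wages and index values are hypothetical.

```python
def real_wage(nominal_wage, price_index, base=100.0):
    """Nominal wage expressed in base-period prices (assumed index base of 100)."""
    return nominal_wage / price_index * base

# Hypothetical figures: the nominal wage rises by 10%,
# but prices also rise by 10%, so purchasing power is unchanged.
real_wage_year_1 = real_wage(nominal_wage=20.0, price_index=100.0)
real_wage_year_2 = real_wage(nominal_wage=22.0, price_index=110.0)
```
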
selection bias
An issue that occurs when the sample or data observed is not representative of the population of interest. For example, individuals with certain characteristics may be more likely to be part of the sample observed (such as students being more likely than CEOs to participate in computer lab experiments).
significance level
A cut-off probability that determines whether a p-value is considered statistically significant. If a p-value is smaller than the significance level, it is considered unlikely that the differences observed are due to chance, given the assumptions made about the variables (for example, having the same mean). Common significance levels are 1% (p-value of 0.01), 5% (p-value of 0.05), and 10% (p-value of 0.1). See also: statistically significant, p-value.
simultaneity
When the right-hand and left-hand variables in a model equation affect each other at the same time, so that the direction of causality runs both ways. For example, in supply and demand models, the market price affects the quantity supplied and demanded, but quantity supplied and demanded can in turn affect the market price.
spurious correlation
A strong linear association between two variables that does not result from any direct relationship, but instead may be due to coincidence or to another unseen factor.
standard deviation
A measure of dispersion in a frequency distribution, equal to the square root of the variance. The standard deviation has a similar interpretation to the variance. A larger standard deviation means that the data is more spread out. Example: The set of numbers 1, 1, 1 has a standard deviation of zero (no variation or spread), while the set of numbers 1, 1, 100 has a standard deviation of 46.7 (large spread).
standard error
A measure of the degree to which the sample mean deviates from the population mean. It is calculated by dividing the standard deviation of the sample by the square root of the number of observations.
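
The calculation can be sketched as follows, assuming the sample standard deviation uses the n − 1 divisor (as `statistics.stdev` does). The sample values are hypothetical.

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample, invented for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = stdev(sample)                      # sample standard deviation (n - 1 divisor)
standard_error = sample_sd / sqrt(len(sample))  # divide by the square root of n

print(round(standard_error, 3))
```
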
statistically significant
When a relationship between two or more variables is unlikely to be due to chance, given the assumptions made about the variables (for example, having the same mean). Statistical significance does not tell us whether there is a causal link between the variables.
time series data
A time series is a set of time-ordered observations of a variable taken at successive, in most cases regular, periods or points of time. Example: The population of a particular country in the years 1990, 1991, 1992, … , 2015 is time series data.
variance
A measure of dispersion in a frequency distribution, equal to the mean of the squares of the deviations from the arithmetic mean of the distribution. The variance is used to indicate how ‘spread out’ the data is. A higher variance means that the data is more spread out. Example: The set of numbers 1, 1, 1 has zero variance (no variation), while the set of numbers 1, 1, 100 has a high variance of 2178 (large spread).
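
As a sketch of the definition, the snippet below computes the variance as the mean of squared deviations (dividing by the number of observations) and the standard deviation as its square root, for two illustrative sets of numbers chosen here: 1, 1, 1 and 1, 1, 100.

```python
from math import sqrt

def variance(values):
    """Mean of the squared deviations from the arithmetic mean."""
    n = len(values)
    mean_value = sum(values) / n
    return sum((v - mean_value) ** 2 for v in values) / n

def standard_deviation(values):
    return sqrt(variance(values))

no_spread = [1, 1, 1]
large_spread = [1, 1, 100]

print(variance(no_spread), standard_deviation(no_spread))
print(variance(large_spread), round(standard_deviation(large_spread), 1))
```
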
weighted average
A type of average that assigns greater importance (weight) to some components than to others, in contrast with a simple average, which weights each component equally. Components with a larger weight can have a larger influence on the average.
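
A minimal sketch contrasting the two kinds of average. The scores and weights are hypothetical (for instance, coursework counting for 25% and an exam for 75%).

```python
def weighted_average(values, weights):
    """Sum of weight × value, divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical scores and weights, invented for illustration.
scores = [70.0, 90.0]
weights = [0.25, 0.75]

simple = sum(scores) / len(scores)            # each component weighted equally
weighted = weighted_average(scores, weights)  # pulled towards the heavier component
```
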