Empirical Project 6 Working in R

Getting started in R

For this project you will need the following packages:

  • tidyverse
  • ggthemes

We will also use the ggplot2 package to produce graphs, but it does not need to be installed separately because it is part of the tidyverse package.

If you need to install either of these packages, run the following code:

install.packages(c("tidyverse","ggthemes"))

You can load the libraries now, or when they are used in the R walk-throughs below.

library(tidyverse)  
library(ggthemes)  

Part 6.1 Looking for patterns in the survey data

First download the data used in the paper to understand how this information was collected. The data is publicly available and free of charge, but you will need to create a user account in order to access it.

  1. To learn about how Bloom et al. (2012) conducted their survey, read the sections ‘How Can Management Practices Be Measured?’ and ‘Validating the Management Data’ (pages 5–9) of their paper.

Now we will create some charts to summarize the data and make comparisons across countries, industries (manufacturing, healthcare, retail, and education), and firm characteristics.

  2. In ‘Manufacturing: 2004–2010 combined survey data (AMP)’, open the file ‘AMP_graph_manufacturing.csv’. Use this data on manufacturing firms to do the following:

Country | Overall management (mean) | Monitoring management (mean) | Targets management (mean) | Incentives management (mean)

Figure 6.2a Mean of management scores.

Country | Overall management (rank) | Monitoring management (rank) | Targets management (rank) | Incentives management (rank)

Figure 6.2b Rank according to management scores.

R walk-through 6.1 Importing data into R and creating tables and charts

Before importing an Excel or csv file into R, first open the file in a spreadsheet program (like Excel) to understand how the file is structured. From looking at the file we learn that:

  • the variable names are in the first row (no need to use the skip option)
  • missing values are represented by empty cells (hence we will use na.strings = "")
  • the last variable is in column S, with short variable descriptions in column U: it is easier to import everything first and remove the unnecessary data afterwards.
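# stringsAsFactors = TRUE reproduces the factor columns shown below in R 4.0.0 or later,
# where read.csv no longer converts text columns to factors by default
man_data <- read.csv("AMP_graph_manufacturing.csv", na.strings = "", stringsAsFactors = TRUE)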
str(man_data)
## 'data.frame':    9207 obs. of  21 variables:
##  $ management                : num  3.5 3.17 3 2.41 4.44 ...
##  $ monitor                   : num  3.6 3.8 2.8 2.75 4.6 4.8 4.6 4.8 4.8 3.8 ...
##  $ target                    : num  3.6 2.6 3.6 2.4 4.4 4.4 4.6 4.2 4.8 3 ...
##  $ people                    : num  3.5 2.5 3 2.67 4.33 ...
##  $ lemp_firm                 : num  5.99 6.4 7.6 8.04 5.24 ...
##  $ export                    : num  NA NA 70 NA NA NA NA NA NA NA ...
##  $ competition               : int  NA 2 4 NA NA NA NA NA NA NA ...
##  $ ownership                 : Factor w/ 9 levels "Dispersed Shareholders",..: NA 1 1 NA NA NA NA NA NA NA ...
##  $ mne_country               : Factor w/ 77 levels "Argentina","Australia",..: NA NA 72 NA NA NA NA NA NA NA ...
##  $ mne_f                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mne_d                     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ degree_m                  : int  NA 100 100 NA NA NA NA NA NA NA ...
##  $ degree_nm                 : int  NA 75 5 NA NA NA NA NA NA NA ...
##  $ country                   : Factor w/ 20 levels "Argentina","Australia",..: 20 20 20 20 20 20 20 20 20 20 ...
##  $ competition2004           : int  3 NA NA NA 3 3 3 3 3 3 ...
##  $ year                      : int  2004 2006 2006 2002 2004 2004 2004 2004 2004 2004 ...
##  $ sic                       : int  382 382 382 308 281 281 366 366 357 382 ...
##  $ lb_employindex            : int  0 0 0 NA 0 0 0 0 0 0 ...
##  $ pppgdp                    : num  11868 13399 14119 10642 11868 ...
##  $ X                         : logi  NA NA NA NA NA NA ...
##  $ storage..display.....value: Factor w/ 22 levels "----------------------------------------------------------------------------------------------------------------------",..: 21 1 11 15 20 17 10 8 3 16 ...
man_varinfo <- unlist(man_data$storage..display.....value[1:23]) # Keep the variable information
man_data <- man_data[,!(names(man_data) %in% c("X","storage..display.....value"))] # Delete last two variables

Let’s look at the variables.

man_varinfo
##  [1] variable name   type   format      label      variable label                                                          
##  [2] ----------------------------------------------------------------------------------------------------------------------
##  [3] management      float  %9.0g                * Average of all management questions                                     
##  [4] monitor         float  %9.0g                  Average of perf1 to perf5                                               
##  [5] target          float  %9.0g                  Average of perf6 to perf10                                              
##  [6] people          float  %9.0g                  Average of talent1 to talent6                                           
##  [7] lemp_firm       float  %9.0g                  Log of 'No. of firm employees as declared in interview'                 
##  [8] export          double %10.0g               * % of production exported                                                
##  [9] competition     byte   %12.0g               * No. of competitors                                                      
## [10] ownership       str33  %33s                 * Who owns the firm?                                                      
## [11] mne_country     str19  %19s                 * Country of origin of multinational (best guess)                         
## [12] mne_f           byte   %9.0g                  = 1 if foreign MNE                                                      
## [13] mne_d           byte   %9.0g                  = 1 if domestic MNE                                                     
## [14] degree_m        byte   %8.0g                * % of managers with a college degree                                     
## [15] degree_nm       float  %8.0g                * % of non-managers with a college degree                                 
## [16] country         str19  %19s                   Country in which plant is located                                       
## [17] competition2004 byte   %9.0g                  1=No competitors, 2=A few competitors, 3=Many competitors               
## [18] year            int    %9.0g                * SENSITIVE: Accts: Year of Accounts Data                                 
## [19] sic             int    %8.0g                * Three digit US SIC 1987 code (999 is missing)                           
## [20] lb_employindex  byte   %10.0g               * WB: Rigidity of employment index (0-100)                                
## [21] pppgdp          float  %9.0g                * IMF: GDP based on PPP valuation of cty GDP (Current international $ -   
## [22]                                                 Billions)                                                             
## [23] <NA>                                                                                                                  
## 22 Levels: ---------------------------------------------------------------------------------------------------------------------- ...

A few of the variables that have been imported as numbers are actually categorical (‘factor’) variables (mne_f, mne_d, and competition2004). We use the factor function to tell R how to treat these variables.

man_data$mne_f <- factor(man_data$mne_f, labels=c("no MNE_f","MNE_f")) # Indicates what to call 0 and 1 entries
man_data$mne_d <- factor(man_data$mne_d, labels=c("no MNE_d","MNE_d"))
man_data$competition2004 <- factor(man_data$competition2004, labels=c("No competitors","A few competitors","Many competitors")) # Indicates what to call 1, 2, and 3 entries

When you create new labels, check the labels have been attached to the correct entries (the labels should be ordered from lowest to highest entry).
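
A minimal sketch of such a check, using a small made-up vector (codes is hypothetical and not taken from the survey data). The factor function attaches labels to the sorted unique values, so cross-tabulating the original codes against the new labels shows whether each label landed on the intended entry.

```r
# Hypothetical codes standing in for a variable such as competition2004
codes <- c(3, 1, 2, 3, 3, NA, 2)

# Labels are matched to the sorted unique values: 1, 2, 3
labelled <- factor(codes,
                   labels = c("No competitors", "A few competitors", "Many competitors"))

# Cross-tabulate the original codes against the labels; each code should
# have non-zero counts in exactly one label column
check <- table(codes, labelled)
print(check)
```

If a count appears in an unexpected column, the labels were supplied in the wrong order.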

To create the tables we use the tidyverse package, specifically piping operators (%>%). Refer to a short introduction on using piping operators (see note 1). In addition to the mean values for the different categories and the overall score, we add a variable recording how many observations we have for each country (obs). Finally, we order the countries according to their overall score (highest to lowest).

library(tidyverse)
table_mean <- man_data %>% 
    group_by(country) %>% 
    summarize(obs = length(management), 
              m_overall = mean(management), 
              m_monitor = mean(monitor), 
              m_target = mean(target), 
              m_incentives = mean(people)) %>% 
    arrange(desc(m_overall))
table_mean
## # A tibble: 20 x 6
##    country               obs m_overall m_monitor m_target m_incentives
##    <fct>               <int>     <dbl>     <dbl>    <dbl>        <dbl>
##  1 United States        1225      3.35      3.58     3.26         3.25
##  2 Germany               646      3.23      3.57     3.22         2.98
##  3 Japan                 176      3.23      3.50     3.34         2.92
##  4 Sweden                388      3.21      3.64     3.19         2.83
##  5 Canada                385      3.17      3.55     3.07         2.94
##  6 UK                   1242      3.03     NA        2.98         2.86
##  7 France                613      3.03      3.43     2.97         2.74
##  8 Italy                 289      3.03      3.26     3.10         2.76
##  9 Australia             392      3.02      3.29     3.02         2.74
## 10 New Zealand           106      2.93      3.18     2.96         2.63
## 11 Mexico                189      2.92      3.29     2.88         2.71
## 12 Poland                351      2.90      3.12     2.94         2.83
## 13 Republic of Ireland   106      2.89      3.14     2.81         2.79
## 14 Portugal              247      2.87      3.27     2.83         2.59
## 15 Chile                 317      2.83      3.14     2.72         2.67
## 16 Argentina             249      2.76      3.08     2.68         2.56
## 17 Greece                251      2.73      2.97     2.66         2.58
## 18 China                 746      2.71      2.90     2.63         2.69
## 19 Brazil                569      2.71      3.06     2.69         2.55
## 20 India                 720      2.67      2.91     2.66         2.63

You will see that m_monitor for the UK is recorded as NA, because there is an NA entry for the monitor variable. By default, the mean function returns NA if any observations are missing, which prompts you to investigate whether there is a data issue. Here, the missing observation isn’t really an issue for our analysis, so we use the option na.rm = TRUE in the mean function to calculate the mean while ignoring the missing observation(s).

table_mean <- man_data %>% 
    group_by(country) %>% 
    summarize(obs = length(management), 
              m_overall = mean(management), 
              m_monitor = mean(monitor, na.rm = TRUE), 
              m_target = mean(target), 
              m_incentives = mean(people)) %>% 
    arrange(desc(m_overall))
table_mean
## # A tibble: 20 x 6
##    country               obs m_overall m_monitor m_target m_incentives
##    <fct>               <int>     <dbl>     <dbl>    <dbl>        <dbl>
##  1 United States        1225      3.35      3.58     3.26         3.25
##  2 Germany               646      3.23      3.57     3.22         2.98
##  3 Japan                 176      3.23      3.50     3.34         2.92
##  4 Sweden                388      3.21      3.64     3.19         2.83
##  5 Canada                385      3.17      3.55     3.07         2.94
##  6 UK                   1242      3.03      3.34     2.98         2.86
##  7 France                613      3.03      3.43     2.97         2.74
##  8 Italy                 289      3.03      3.26     3.10         2.76
##  9 Australia             392      3.02      3.29     3.02         2.74
## 10 New Zealand           106      2.93      3.18     2.96         2.63
## 11 Mexico                189      2.92      3.29     2.88         2.71
## 12 Poland                351      2.90      3.12     2.94         2.83
## 13 Republic of Ireland   106      2.89      3.14     2.81         2.79
## 14 Portugal              247      2.87      3.27     2.83         2.59
## 15 Chile                 317      2.83      3.14     2.72         2.67
## 16 Argentina             249      2.76      3.08     2.68         2.56
## 17 Greece                251      2.73      2.97     2.66         2.58
## 18 China                 746      2.71      2.90     2.63         2.69
## 19 Brazil                569      2.71      3.06     2.69         2.55
## 20 India                 720      2.67      2.91     2.66         2.63
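
Before discarding missing values, it can be worth counting how many there are in each group. The following is a small sketch on made-up data (the data frame toy and its values are hypothetical), using base R's tapply rather than the piping approach above.

```r
# Hypothetical data: one missing monitor score for the UK
toy <- data.frame(
  country = c("UK", "UK", "UK", "France", "France"),
  monitor = c(3.2, NA, 3.6, 3.4, 3.5)
)

# Number of missing monitor scores per country
n_missing <- tapply(is.na(toy$monitor), toy$country, sum)
print(n_missing)

# Country means with and without removing the missing value
mean_default <- tapply(toy$monitor, toy$country, mean)
mean_na_rm   <- tapply(toy$monitor, toy$country, mean, na.rm = TRUE)
print(mean_default)
print(mean_na_rm)
```

A single missing entry out of many observations, as for the UK here, is usually safe to ignore. A large share of missing entries would call for a closer look at the data.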

Let’s make the table showing the ranks. We use the mutate function, which adds variables calculated from existing variables.

table_rank <- table_mean %>% mutate(r_overall = rank(desc(m_overall)), r_monitor = rank(desc(m_monitor)), 
                          r_target = rank(desc(m_target)), r_incentives = rank(desc(m_incentives)))
                        
table_rank[c(1,7:10)]  # Select the country variable (Column 1) and the columns with rank information (7 to 10)
## # A tibble: 20 x 5
##    country             r_overall r_monitor r_target r_incentives
##    <fct>                   <dbl>     <dbl>    <dbl>        <dbl>
##  1 United States               1         2        2            1
##  2 Germany                     2         3        3            2
##  3 Japan                       3         5        1            4
##  4 Sweden                      4         1        4            6
##  5 Canada                      5         4        6            3
##  6 UK                          6         7        8            5
##  7 France                      7         6        9           11
##  8 Italy                       8        11        5            9
##  9 Australia                   9         8        7           10
## 10 New Zealand                10        12       10           16
## 11 Mexico                     11         9       12           12
## 12 Poland                     12        15       11            7
## 13 Republic of Ireland        13        14       14            8
## 14 Portugal                   14        10       13           17
## 15 Chile                      15        13       15           14
## 16 Argentina                  16        16       17           19
## 17 Greece                     17        18       19           18
## 18 China                      18        20       20           13
## 19 Brazil                     19        17       16           20
## 20 India                      20        19       18           15
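
The rank columns are shown as <dbl> rather than <int> because the rank function, by default, gives tied values the average of their ranks, which need not be a whole number. A short sketch with made-up scores (the vector scores is hypothetical) shows the difference between the default and the ties.method = "min" option:

```r
# Hypothetical scores with a tie between A and C
scores <- c(A = 3.2, B = 2.9, C = 3.2, D = 2.5)

# Rank from highest to lowest by negating the scores
# (for numeric input, dplyr's desc() has the same effect)
r_average <- rank(-scores)                       # tied values share the average rank
r_min     <- rank(-scores, ties.method = "min")  # tied values share the lowest rank
print(r_average)
print(r_min)
```

There are no ties in the table above, so both options would give the same ranking there.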

Now we use the ggplot set of functions (part of the tidyverse package loaded earlier) to create a bar chart using the m_overall value in table_mean.

ggplot(table_mean, aes(x=reorder(country,m_overall,mean),y=m_overall)) +
  geom_bar(stat = "identity",position="identity") + 
  xlab("") +
  ylab("Average management practice score") +
  coord_flip() +
  theme_bw()
# Note that x=reorder(country,m_overall,mean), presents the countries in order of the score.
# Using x = country would have ordered the countries alphabetically which is the default option.
# coord_flip() flips the x and y axis.

Figure 6.3 Management practices in manufacturing firms around the world.

If you want to make the order the same as in Figure 6.1, use rev(table_mean$m_overall) and rev(table_mean$country) to reverse the order of the values.
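
An alternative sketch for reversing the order (our own suggestion, not the walk-through's method) is to negate the score inside reorder. The example below uses three of the country means from table_mean and inspects the resulting factor levels, which determine the order in which ggplot draws the bars.

```r
country   <- factor(c("India", "Japan", "United States"))
m_overall <- c(2.67, 3.23, 3.35)  # values taken from table_mean above

# Levels ordered from lowest to highest score (as in the chart code above)
low_to_high <- reorder(country, m_overall)
print(levels(low_to_high))

# Negating the score reverses the order of the levels
high_to_low <- reorder(country, -m_overall)
print(levels(high_to_low))
```

In the chart code, this would mean writing x = reorder(country, -m_overall) in place of x = reorder(country, m_overall, mean).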

To look at how management quality varies within countries, instead of just looking at the mean we can use column charts to visualize the entire distribution of scores (as in Empirical Project 1). To compare distributions, we have to use the same horizontal axis, so we will first need to make a frequency table for each distribution to be used. Also, since each country has a different number of observations, we will use percentages instead of frequencies as the vertical axis variable.

  3. For three countries of your choice and for the US, carry out the following:

Range of management score | Frequency | Percentage of firms (%)
1.00 | |
1.20 | |
4.80 | |
5.00 | |

Figure 6.4 Frequency table for overall management score.

R walk-through 6.2 Obtaining frequency counts and plotting overlapping histograms

To get frequency counts, use the cut function.

temp_counts <- cut(man_data$management[man_data$country == "Chile"], breaks=seq(0,5,0.2))  
table(temp_counts)  
## temp_counts
##   (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1]   (1,1.2] (1.2,1.4] 
##         0         0         0         0         0         0         1 
## (1.4,1.6] (1.6,1.8]   (1.8,2]   (2,2.2] (2.2,2.4] (2.4,2.6] (2.6,2.8] 
##         3         6        24        15        30        25        49 
##   (2.8,3]   (3,3.2] (3.2,3.4] (3.4,3.6] (3.6,3.8]   (3.8,4]   (4,4.2] 
##        52        28        27        21        20         9         6 
## (4.2,4.4] (4.4,4.6] (4.6,4.8]   (4.8,5] 
##         1         0         0         0
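
Because each country has a different number of observations, the counts need to be converted into percentages before distributions are compared. A minimal sketch, assuming a made-up vector of scores (scores is hypothetical), uses prop.table for the conversion:

```r
# Hypothetical management scores standing in for one country's data
scores <- c(1.5, 2.1, 2.3, 2.9, 3.1, 3.1, 3.5, 3.9, 4.3, 4.7)

bins   <- cut(scores, breaks = seq(0, 5, 0.2))
counts <- table(bins)

# Percentage of firms in each bin; the percentages sum to 100
pct <- 100 * prop.table(counts)

freq_table <- data.frame(range = names(counts),
                         frequency = as.vector(counts),
                         percentage = as.vector(pct))

# Show only the bins that contain at least one firm
print(freq_table[freq_table$frequency > 0, ])
```

The same steps applied to man_data$management for a given country would fill in the frequency and percentage columns of the table.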

To plot the distribution as a chart, we use geom_histogram, a member of the ggplot set of functions.

Let’s start with one pair of countries, Chile and the US. To produce a histogram of the overall (management) score for Chile, use the following code.

g1 <- ggplot(subset(man_data, country == "Chile"),aes(management)) +
        geom_histogram(breaks=seq(0,5,0.2)) +
        xlab("Management score") +
        ylab("Frequency Count") 
print(g1)

Figure 6.5 Distribution of management scores, Chile.

Tip: Using ggplot, it is straightforward to add a second country to the chart. A good way to find out how to do something like this is to search the Internet (here we searched for ‘r ggplot multiple histograms’).

g1 <- ggplot(subset(man_data, country %in% c("Chile","United States")),aes(x=management, y = 0.2*..density..,fill = country)) +
    geom_histogram(breaks=seq(0,5,0.2),alpha=.5,position = "identity") +
    xlab("Management score") +
    ylab("Density") + 
    ggtitle("Histogram for management score") + 
    scale_fill_discrete(name  ="Country") +
    theme_bw()
# y = 0.2*..density.. - ..density.. scales each histogram so that its total area is 1, which puts both countries on the same scale. Multiplying by the bin width (here 0.2) converts the densities into proportions.
# In ggplot2 3.4.0 and later the ..density.. notation is deprecated; after_stat(density) is the equivalent.
# fill = country - ensures that R knows to print different histograms for different countries
# breaks=seq(0,5,0.2) - sets the breakpoints
# alpha=.5 - makes the bars semi-transparent
# position = "identity" - ensures that the two histograms are overlaid and not stacked
# scale_fill_discrete(name  ="Country") gives a title to the legend

print(g1)
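
To see why multiplying the density by the bin width gives proportions, the calculation can be checked outside ggplot. This is a sketch using base R's hist function on made-up scores (scores is hypothetical), not the code that ggplot runs internally.

```r
# Hypothetical scores; any numeric vector within the breaks would do
scores <- c(1.5, 2.1, 2.3, 2.9, 3.1, 3.1, 3.5, 3.9, 4.3, 4.7)
bin_width <- 0.2

# Compute the histogram without drawing it
h <- hist(scores, breaks = seq(0, 5, bin_width), plot = FALSE)

# Density times bin width gives the proportion of observations in each bin
proportions <- h$density * bin_width
same_as_share <- all.equal(proportions, h$counts / sum(h$counts))
print(same_as_share)

# The proportions add up to 1
print(sum(proportions))
```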

Figure 6.6 Comparing the distribution of management scores for the US and Chile.

box and whisker plot
A graphic display of the range and quartiles of a distribution, where the first and third quartile form the ‘box’ and the maximum and minimum values form the ‘whiskers’.

Another way to visualize distributions is a box and whisker plot, which shows some parts of a distribution rather than the whole distribution. We can use box and whisker plots to compare particular aspects of distributions more easily than when looking at the entire distribution.

As shown in Figure 6.7, the ‘box’ consists of the first quartile (value corresponding to the bottom 25 per cent, or 25th percentile, of all values), the median, and the third quartile (75th percentile). The ‘whiskers’ are the minimum and maximum values. (In R, the ‘whiskers’ may not be the actual maximum or minimum, since any values that lie more than 1.5 times the height of the box (the interquartile range) beyond the edge of the box are considered outliers and are shown as separate points.)

Figure 6.7 Example of a box and whisker plot.
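
The outlier rule can be worked through by hand. The sketch below uses a made-up vector (x is hypothetical) and computes the quartiles, the cut-off points, and the whisker ends. It is meant to illustrate the rule and roughly mirrors what a boxplot function does, without claiming to reproduce any particular implementation exactly.

```r
# Hypothetical data with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q   <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]                         # height of the box

# Values beyond these cut-offs are treated as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme values that are not outliers
whisker_min <- min(x[x >= lower_fence])
whisker_max <- max(x[x <= upper_fence])

print(outliers)
print(c(whisker_min, whisker_max))
```

Here the value 30 is shown as a separate point, and the upper whisker stops at 9 rather than at the maximum.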

  4. Using the same countries you chose in Question 3:

R walk-through 6.3 Creating box and whisker plots

We use exactly the same structure as for the overlapping histograms. A useful feature of ggplot is that using more or less the same structure, you can create a variety of graphs. In this example, we include a few more countries, as this can be done without overcrowding the figure.

library(ggthemes) # Change the look of charts
g2 <- ggplot(subset(man_data, country %in% c("Chile","United States","Brazil","Germany","UK")),aes(x = country,y = management)) +
    geom_boxplot() +
    ylab("Management score") +
    ggtitle("Box and whisker plots for management score") + 
    theme_solarized()
# x = country – different countries go on the horizontal axis
# y = management – management scores on the vertical axis
# theme_solarized() – gives the plot a different look
print(g2)

Figure 6.8 Box and whisker plots for a selection of countries.

From the manufacturing data, firms in the US seem to be managed better (on average) than firms in other countries. To investigate whether this is the case in other sectors, we will use data gathered on hospitals and schools.

  5. Using the data for hospitals and schools (AMP_graph_public.csv):

Part 6.2 Are differences in management practices statistically significant?

Using the management survey data collected by Bloom et al. (2012), we can compare average management scores across countries and industries. Rather than simply identifying differences between groups, we are also interested in whether these differences are statistically significant.

confidence interval
A range of values that is centred around the sample value, and is defined so that there is a specified probability (usually 95%) that it contains the ‘true value’ of interest.

In Empirical Project 2, we assessed statistical significance using p-values. Now we will assess statistical significance using another method called confidence intervals. These two methods are equivalent, meaning that we would get the same conclusions about statistical significance whichever method we use.

A 95% confidence interval is calculated from the data we observed and is designed so that the true value (for example, the mean of a population) will fall into the interval 95% of the time. Other common confidence intervals used in research studies are 90% and 99% confidence intervals, which are similarly defined. We will use 95% confidence intervals throughout this project.

What do we mean by the ‘true value’? Remember that we usually work with data that is a small sample from the entire population of interest. For example, the World Management Survey collects information from a selection of all the firms in a particular country. Since we don’t have data on all the firms in every country, we cannot say with certainty that the average management score across all firms in Country A (the ‘true value’) is higher than that of Country B. However, based on the sample of firms we have from Country A and Country B, we can say whether any observed difference in means is likely to be due to chance, and assess how precisely we have estimated the ‘true value’ of the difference and individual means.

To understand the principle behind confidence intervals, think about playing ring toss while blindfolded. You try to throw a ring so that it lands around a peg. The peg (the true value) is fixed in place, but depending on how wide your ring is (how spread out your sample is) and where you throw it (the mean of your sample), the ring may not land around the peg. Since you are blindfolded, you will never know if the ring actually landed around the peg. However, you know where to stand (at the sample mean) and how wide to make the ring so that it lands around the peg 95% of the time.

As the name suggests, confidence intervals tell us how much confidence we can place in our estimates, in other words how precisely the sample mean is estimated. Wider confidence intervals suggest that our sample mean is estimated less precisely (the data is more spread out). Using the ring toss analogy, we are less sure that we are standing in the right place, so we need a wider ring to have the same chance (95%) of landing it around the peg.

To sum up: A confidence interval is a range of values centred around the sample value and is defined so that there is a specified probability (usually 95%) that it contains the true value of interest.
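
The ‘95% of the time’ idea can be illustrated with a small simulation. This is a sketch under made-up assumptions: the population mean, standard deviation, sample size, and number of repeated samples below are all hypothetical, and the scores are drawn from a normal distribution purely for convenience.

```r
set.seed(1)  # fix the random numbers so the run is repeatable

true_mean <- 3     # the 'peg': a made-up population mean
n_firms   <- 200   # firms in each sample
n_surveys <- 2000  # number of repeated samples

# For each sample, record whether its 95% confidence interval
# contains the true mean
covered <- replicate(n_surveys, {
  s  <- rnorm(n_firms, mean = true_mean, sd = 0.6)
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covered)
print(coverage)  # share of intervals that contain the true mean
```

The share should come out close to 0.95: most rings land around the peg, but for any single sample we cannot tell whether it is one of those that missed.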

Rule of thumb for statistical significance

When comparing two distributions, if neither mean is in the confidence interval for the other mean, the difference in means is statistically significant.

This rule of thumb is handy when looking at charts. For a more definite conclusion, we can calculate the p-value (see Empirical Project 2) or construct a confidence interval for the difference in means. (This method involves more mathematics so we will discuss that in Empirical Project 8.)

We will now build on the results from the Bloom et al. (2012) paper by using 95% confidence intervals to make comparisons between the mean overall management score for different countries and types of firms. The confidence interval for the population mean (mean management score for that country) is centred around the sample mean. To determine the width of the interval, we use the standard deviation and number of firms.

  1. First look at manufacturing firms in different countries. Using the manufacturing data (AMP_graph_manufacturing.csv) for three countries of your choice and for the US:

Country | Mean | Standard deviation | Number of firms

Figure 6.9 Summary table for manufacturing firms.

R walk-through 6.4 Calculating confidence intervals and adding them to a chart

As in R walk-through 6.1, we use piping operators from the tidyverse package.

table_stats <- man_data %>% 
    filter(country %in% c("Chile","United States","Brazil","Germany","UK")) %>% 
    group_by(country) %>% 
    summarize(obs = length(management), mean_m = mean(management), sd_m = sd(management, na.rm = TRUE)) %>%
    arrange(desc(mean_m)) # Sort from highest to lowest mean score
table_stats
## # A tibble: 5 x 4
##   country         obs mean_m  sd_m
##   <fct>         <int>  <dbl> <dbl>
## 1 United States  1225   3.35 0.643
## 2 Germany         646   3.23 0.569
## 3 UK             1242   3.03 0.679
## 4 Chile           317   2.83 0.599
## 5 Brazil          569   2.71 0.685

To get the confidence intervals, we use the t.test function.

tUS <- t.test(subset(man_data,country == "United States",select = management))
tUS$conf.int[1:2] # tUS contains a lot of information; $conf.int[1:2] is the confidence interval.
## [1] 3.312379 3.384448
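
To see where these two numbers come from, the interval can be rebuilt by hand from the sample mean, the standard deviation, and the number of observations. The sketch below uses a short made-up vector (scores is hypothetical) and compares the manual result with the output of t.test.

```r
# Hypothetical scores, not taken from the survey data
scores <- c(2.8, 3.1, 3.4, 3.0, 3.6, 2.9, 3.3, 3.5, 3.2, 3.1)

n  <- length(scores)
se <- sd(scores) / sqrt(n)  # standard error of the sample mean

# t-based interval: mean plus/minus the 97.5% quantile of the
# t distribution (n - 1 degrees of freedom) times the standard error
ci_manual <- mean(scores) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci_ttest  <- as.numeric(t.test(scores)$conf.int)

print(ci_manual)
print(ci_ttest)
```

With large samples, qt(0.975, df = n - 1) is very close to 1.96, the value from the normal distribution.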

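We want to add these interval values to table_stats. The easiest way is to calculate the standard error of the sample mean and multiply it by 1.96 to get m_err, where 1.96 is the factor required for a 95% confidence interval (assuming a normal distribution). The confidence interval is then [mean_m − m_err, mean_m + m_err]. (The code below divides the variance by obs − 1 rather than obs; with samples this large, the result is almost identical to the conventional standard error sd_m/sqrt(obs).)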

table_stats <- man_data %>% 
    filter(country %in% c("Chile","United States","Brazil","Germany","UK")) %>% 
    group_by(country) %>% 
    summarize(obs = length(management), mean_m = mean(management), sd_m = sd(management, na.rm = TRUE) , m_err = 1.96*sqrt(sd_m^2/(obs-1))) %>% 
    arrange(desc(mean_m)) # Sort from highest to lowest mean score
table_stats
## # A tibble: 5 x 5
##   country         obs mean_m  sd_m  m_err
##   <fct>         <int>  <dbl> <dbl>  <dbl>
## 1 United States  1225   3.35 0.643 0.0360
## 2 Germany         646   3.23 0.569 0.0439
## 3 UK             1242   3.03 0.679 0.0378
## 4 Chile           317   2.83 0.599 0.0660
## 5 Brazil          569   2.71 0.685 0.0563

Now we can use this information to make a bar chart:

ggplot(table_stats, aes(y=mean_m, x=country)) + 
    geom_bar(position=position_dodge(), stat="identity",
             colour="black", # Use black outlines
             size=.3)  +    # Add thinner lines for bars
    geom_errorbar(aes(ymin=mean_m-m_err, ymax=mean_m+m_err),
                  size=.6,    # Add thinner lines for confidence intervals
                  width=.5,
                  position=position_dodge(.9)) +
    coord_cartesian(ylim=c(2,4)) +
    theme_bw() + 
     theme(axis.text.x=element_text(size=rel(1.5)),axis.text.y=element_text(size=rel(1.3))) 

Figure 6.10 Bar chart of mean management score in manufacturing firms for a selection of countries, with 95% confidence intervals.
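
The rule of thumb for statistical significance can also be applied directly to the numbers in table_stats. The helper function means_differ below is hypothetical (it is not part of any package) and simply checks whether each mean lies outside the other mean's confidence interval.

```r
# Rule of thumb: the difference is significant if neither mean lies inside
# the other mean's confidence interval. Arguments are the sample means and
# the half-widths (m_err) of their 95% confidence intervals.
means_differ <- function(mean_a, err_a, mean_b, err_b) {
  b_in_a <- abs(mean_b - mean_a) <= err_a  # is mean_b inside A's interval?
  a_in_b <- abs(mean_a - mean_b) <= err_b  # is mean_a inside B's interval?
  !b_in_a && !a_in_b
}

# United States vs Germany, using the rounded values printed in table_stats
us_vs_germany <- means_differ(3.35, 0.0360, 3.23, 0.0439)
print(us_vs_germany)

# Two made-up groups whose means are close together
close_groups <- means_differ(3.00, 0.10, 3.05, 0.12)
print(close_groups)
```

As noted earlier, this is only a rule of thumb; a p-value or a confidence interval for the difference in means gives a more definite conclusion.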

  2. Using the data for hospitals or schools (AMP_graph_public.csv), using all available countries:
  3. Look at the width of your confidence intervals and relate this to the standard deviation and number of observations. Are confidence intervals generally wider/narrower if the standard deviation is larger? How about if the number of observations is larger? With reference to the ring toss example, explain why we would expect there to be a relationship between the confidence interval width, standard deviation, and number of observations.

Part 6.3 What factors affect the quality of management?

Besides documenting and comparing management practices across industries and countries, another purpose of the World Management Survey was to investigate factors that affect management quality.

One possible factor affecting differences in management is firm ownership. To look at the data for this factor in the healthcare and education sectors, we will focus on broad groups (public vs privately-owned firms), and for manufacturing firms we will focus on different kinds of private ownership.

  1. Using the data for hospitals and schools (AMP_graph_public.csv):

Besides ownership type, management practices may vary depending on firm size, though it is difficult to predict what the relationship between these variables might be. Larger firms have more employees and could be more difficult to manage well, but may also attract more experienced managers. We will look at the conditional means for manufacturing firms, depending on whether they are above or below the median number of employees (calculated from the data), and see if there is a clear relationship.

  2. Using the data for manufacturing firms (AMP_graph_manufacturing.csv):

R walk-through 6.5 Calculating and adding conditional summary statistics and confidence intervals to a chart

We will use many techniques encountered previously, but first we have to create a new variable that indicates whether a firm is large or small (size). A firm with lemp_firm > 5.8 is considered large.

man_data$size <- factor(man_data$lemp_firm > 5.8,labels=c("small","large"))
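
The cut-off can also be calculated from the data instead of being typed in. The sketch below uses a made-up vector (lemp is hypothetical, standing in for lemp_firm) to show the median calculation and how factor attaches the two labels to a logical comparison.

```r
# Hypothetical log-employment values standing in for lemp_firm
lemp <- c(4.9, 5.2, 5.6, 5.9, 6.3, 7.1, NA)

cutoff <- median(lemp, na.rm = TRUE)  # ignore the missing value

# factor() orders the logical values FALSE, TRUE, so the first label
# ("small") is attached to FALSE and the second ("large") to TRUE
size <- factor(lemp > cutoff, labels = c("small", "large"))

print(cutoff)
print(table(size, useNA = "ifany"))
```

Firms with a missing number of employees stay missing in the new variable rather than being assigned to either group.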

We choose Canada, Brazil, and the United States. Initially we add the new size variable and ownership as a grouping variable in the group_by command (as we did in R walk-through 6.1).

table_stats2 <- man_data %>% 
    filter(country %in% c("Canada","United States","Brazil")) %>% 
    group_by(country,ownership,size) %>% 
    summarize(obs = length(management), mean_m = mean(management,na.rm = TRUE), sd_m = sd(management, na.rm = TRUE))
table_stats2
## # A tibble: 53 x 6
## # Groups:   country, ownership [?]
##    country ownership                  size    obs mean_m    sd_m
##    <fct>   <fct>                      <fct> <int>  <dbl>   <dbl>
##  1 Brazil  Dispersed Shareholders     small    28   3.06   0.667
##  2 Brazil  Dispersed Shareholders     large    45   3.48   0.731
##  3 Brazil  Family owned, external CEO small     8   2.82   0.725
##  4 Brazil  Family owned, external CEO large    10   2.99   0.688
##  5 Brazil  Family owned, family CEO   small    80   2.50   0.668
##  6 Brazil  Family owned, family CEO   large    41   2.70   0.645
##  7 Brazil  Founder                    small   124   2.35   0.524
##  8 Brazil  Founder                    large    72   2.66   0.591
##  9 Brazil  Government                 small     1   4    NaN    
## 10 Brazil  Government                 large     2   2.44   1.18 
## # ... with 43 more rows

Now we use the variable size as a column variable, so that we can see the summary statistics in two blocks of columns (separately for large and small firms). This is not a standard or straightforward procedure, but an Internet search (for ‘tidyverse spread multiple columns’) gives the following solution.

table_stats2_mc <- table_stats2 %>% 
    gather(variable, value, -(country:size)) %>% 
    unite(temp, size, variable) %>% 
    spread(temp,value)
print(table_stats2_mc)
## # A tibble: 28 x 8
## # Groups:   country, ownership [28]
##    country ownership        large_mean_m large_obs large_sd_m small_mean_m
##    <fct>   <fct>                   <dbl>     <dbl>      <dbl>        <dbl>
##  1 Brazil  Dispersed Share~         3.48        45      0.731         3.06
##  2 Brazil  Family owned, e~         2.99        10      0.688         2.82
##  3 Brazil  Family owned, f~         2.70        41      0.645         2.50
##  4 Brazil  Founder                  2.66        72      0.591         2.35
##  5 Brazil  Government               2.44         2      1.18          4   
##  6 Brazil  Managers                 2.51         7      0.631         2.64
##  7 Brazil  Other                    3.01        29      0.541         2.57
##  8 Brazil  Private Equity          NA           NA     NA             3.23
##  9 Brazil  Private Individ~         2.94        42      0.523         2.69
## 10 Canada  Dispersed Share~         3.52        53      0.582         3.43
## # ... with 18 more rows, and 2 more variables: small_obs <dbl>,
## #   small_sd_m <dbl>

To understand the logic of this command, go through it step by step: first apply gather and look at the result, then add the unite command and look at the result, and then add spread.
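
In more recent versions of the tidyr package, gather and spread are marked as superseded, and pivot_wider can do the same reshaping in a single step. The following is a sketch on made-up data, assuming a reasonably recent tidyr installation; the data frame toy and its values are hypothetical, and this is an alternative to the walk-through's method, not a replacement for it.

```r
library(tidyr)

# Hypothetical summary statistics: one row per country and firm size
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  size    = c("small", "large", "small", "large"),
  obs     = c(10, 20, 30, 40),
  mean_m  = c(2.5, 3.0, 2.8, 3.2)
)

# One row per country, with a block of columns for each firm size.
# names_glue builds column names such as small_obs and large_mean_m.
wide <- pivot_wider(toy,
                    names_from  = size,
                    values_from = c(obs, mean_m),
                    names_glue  = "{size}_{.value}")
print(wide)
```

For table_stats2, the equivalent call would use values_from = c(obs, mean_m, sd_m), with country and ownership left as the identifying columns.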

So far we have looked at correlations between firm characteristics and management practices, but have not made any causal statements. We will now discuss the difficulties with making causal statements using this data and examine how we might determine the direction of causation.

  3. For each of the following variables, explain how it could affect management practices, and then explain how management practices could affect it:
  4. One way to establish the direction of causation is through a randomized field experiment. Read the discussion on pages 22–23 of the Bloom et al. paper (the section ‘Experimental Evidence on Management Quality and Firm Performance’) about one such experiment that was conducted in Indian textile factories.

  1. University of Manchester’s Econometric Computing Learning Resource (ECLR). 2018. ‘R AnalysisTidy’. Updated 9 January 2018.