Empirical Project 3 Working in R
Download the code
To download the code chunks used in this project, rightclick on the download link and select ‘Save Link As…’. You’ll need to save the code download to your working directory, and open it in RStudio.
Don’t forget to also download the data into your working directory by following the steps in this project.
Rspecific learning objectives
In addition to the learning objectives for this project, in this section you will learn how to use the piping technique to run a sequence of functions.
Getting started in R
For this project you will need the following packages:
tidyverse
, to help with data manipulationreadxl
, to import an Excel spreadsheetmosaic
, to help create frequency tablesreadstata13
, to read in a Stata datafile.
You will also use the ggplot2
package to produce accurate graphs, but that comes as part of the tidyverse
package.
If you need to install any of these packages, run the following code:
install.packages(c("readxl", "tidyverse",
"mosaic", "readstata13"))
You can import these libraries now, or when they are used in the R walkthroughs below.
library(readxl)
library(tidyverse)
library(mosaic)
library(readstata13)
Part 3.1 Beforeandafter comparisons of retail prices
We will first look at price data from the treatment group (stores in Berkeley) to see what happened to the price of sugary and nonsugary beverages after the tax.
 Download the data from the Global Food Research Program’s website, and select the ‘Berkeley Store Price Survey’ Excel dataset. Then upload the dataset into R.
 The first tab of the Excel file contains the data dictionary. Make sure you read the data description column carefully, and check that each variable is in the Data tab.
R walkthrough 3.1 Importing the datafile into R
The data is in
.xlsx
format, so we use thereadxl
package to import it. We also load thetidyverse
library as this includes packages that we will use later for drawing charts and making code easier to read. If you open the data in Excel, you will see that there are two tabs: ‘Data Dictionary’, which contains some information about the variables, and ‘Data’, which contains the actual data. Let’s import both into R, so that do not need to refer back to Excel for the additional information about variables.library(readxl) library(tidyverse) # Set your working directory to the correct folder. # Insert your file path for 'YOURFILEPATH'. setwd("YOURFILEPATH") var_info < read_excel("sps_public.xlsx", sheet = "Data Dictionary") dat < read_excel("sps_public.xlsx", sheet = "Data")
Let’s use the
str
function to check that the variables were classified correctly.str(dat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2175 obs. of 12 variables: ## $ store_id : num 16 16 16 16 16 16 16 16 16 16 ... ## $ type : chr "WATER" "TEA" "TEA" "WATER" ... ## $ store_type : num 2 2 2 2 2 2 2 2 2 2 ... ## $ type2 : chr NA NA NA NA ... ## $ size : num 33.8 23 23 33.8 128 64 128 64 63.9 144 ... ## $ price : num 1.69 0.99 0.99 1.69 3.79 2.79 3.79 2.79 4.59 5.99 ... ## $ price_per_oz : num 0.05 0.043 0.043 0.05 0.0296 ... ## $ price_per_oz_c: num 5 4.3 4.3 5 2.96 ... ## $ taxed : num 0 1 1 0 0 0 0 0 0 1 ... ## $ supp : num 0 0 0 0 0 0 0 0 0 1 ... ## $ time : chr "DEC2014" "DEC2014" "DEC2014" "DEC2014" ... ## $ product_id : num 29 32 33 38 40 41 42 43 44 50 ...
R classified all the variables containing numbers as numerical (
num
). However, for some of these variables (specifically,type
,taxed
,supp
,store_id
,store_type
,type2
andproduct_id
), the numbers actually represent categories (known as ‘factors’ in R). So let’s use thefactor
function to convert the variables to factor variables.dat$type < factor(dat$type) dat$taxed < factor(dat$taxed, labels = c("not taxed", "taxed")) dat$supp < factor(dat$supp, labels = c("Standard", "Supplemental")) dat$store_id < factor(dat$store_id) dat$store_type < factor(dat$store_type, labels = c("Large Supermarket", "Small Supermarket", "Pharmacy", "Gas Station")) dat$type2 < factor(dat$type2) dat$product_id < factor(dat$product_id)
You can see that we used the
labels
option to specify the names of different categories (where they are clearly defined).There is another variable,
time
, which is classified as achr
variable (chr
stands for ‘characters’, meaning letters and numbers), but should be a factor variable. Before we do this, we use theunique
command to check the categories of this variable.unique(dat$time)
## [1] "DEC2014" "JUN2015" "MAR2015"
If you look at the timeline in the Silver et al. (2017) paper, you will notice that the third price survey was in March 2016, not in March 2015, so the data has been labelled incorrectly. We shall therefore change all the values
MAR2015
toMAR2016
.# Selects all observations with “time” equal to "MAR2015". dat$time[dat$time == "MAR2015"] < "MAR2016"
We can now change
time
into a factor variable.dat$time < factor(dat$time)
 Read ‘S1 Text’, from the journal paper’s supporting information, which explains how the Store Price Survey data was collected.
 In your own words, explain how the product information was recorded, and the measures that researchers took to ensure that the data was accurate and representative of the treatment group. What were some of the data collection issues that they encountered?
 Instead of using the name of the store, each store was given a unique ID number (recorded as store_id on the spreadsheet). Verify that the number of stores in the dataset is the same as that stated in the ‘S1 Text’ (26). Similarly, each product was given a unique ID number (product_id). How many different products are in the dataset?
R walkthrough 3.2 Counting the number of unique elements in a variable
We use two functions here:
unique
to obtain a list of the unique elements of the variables of interest (dat$store_id
anddat$product_id
), thenlength
to count how long the list is. We will create two variables,no_stores
andno_products
, that contain the number of stores and products respectively.no_stores < length(unique(dat$store_id)) no_products < length(unique(dat$product_id)) paste("Stores:", no_stores)
## [1] "Stores: 26"
paste("Products:", no_products)
## [1] "Products: 247"
Following the approach described in ‘S1 Text’, we will compare the variable price per ounce in US$ cents (price_per_oz_c
). We will look at what happened to prices in the two treatment groups before the tax (time = DEC2014
) and after the tax (time = JUN2015
):
 treatment group one: large supermarkets (
store_type = 1
)  treatment group two: pharmacies (
store_type = 3
).
Before doing this analysis, we will use summary measures to see how many observations are in the treatment and control group, and how the two groups differ across some variables of interest. For example, if there are very few observations in a group, we might be concerned about the precision of our estimates and will need to interpret our results in light of this fact.
We will now create frequency tables containing the summary measures that we are interested in.
 Create the following tables:
 A frequency table showing the number (count) of store observations (store type) in December 2014 and June 2015, with ‘store type’ as the row variable and ‘time period’ as the column variable. For each store type, is the number of observations similar in each time period?
 A frequency table showing the number of taxed and nontaxed beverages in December 2014 and June 2015, with ‘store type’ as the row variable and ‘taxed’ as the column variable. (‘Taxed’ equals 1 if the sugar tax applied to that product, and 0 if the tax did not apply). For each store type, is the number of taxed and nontaxed beverages similar?
 A frequency table showing the number of each product type (type), with ‘product type’ as the row variable and ‘time period’ as the column variable. Which product types have the highest number of observations and which have the lowest number of observations? Why might some products have more observations than others?
R walkthrough 3.3 Creating frequency tables
Frequency table for store type and time period
We use the
tally
function, which allows us to produce frequency tables using theformat = "count"
option. We start with the frequency table that shows the number of stores of different types in each time period.library(mosaic) tally(~store_type + time, data = dat, margins = TRUE, format = "count")
## time ## store_type DEC2014 JUN2015 MAR2016 Total ## Large Supermarket 177 209 158 544 ## Small Supermarket 407 391 327 1125 ## Pharmacy 87 102 73 262 ## Gas Station 73 96 75 244 ## Total 744 798 633 2175
There are fewer observations taken from gas stations and pharmacies and more from small supermarkets.
Frequency table for store type and taxed
Now we repeat the steps above to make a frequency table with
store_type
as the row variable andtaxed
as the column variable. We add+time
to the code because we also want separate values for each time period.tally(~store_type + taxed + time, data = dat, margins = TRUE, format = "count")
## , , time = DEC2014 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 92 85 177 ## Small Supermarket 196 211 407 ## Pharmacy 44 43 87 ## Gas Station 34 39 73 ## Total 366 378 744 ## ## , , time = JUN2015 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 111 98 209 ## Small Supermarket 192 199 391 ## Pharmacy 52 50 102 ## Gas Station 44 52 96 ## Total 399 399 798 ## ## , , time = MAR2016 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 88 70 158 ## Small Supermarket 154 173 327 ## Pharmacy 36 37 73 ## Gas Station 31 44 75 ## Total 309 324 633 ## ## , , time = Total ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 291 253 544 ## Small Supermarket 542 583 1125 ## Pharmacy 132 130 262 ## Gas Station 109 135 244 ## Total 1074 1101 2175
Frequency table for product type and time period
Now we make a frequency table with product type (
type
) as the row variable and time period (time
) as the column variable.tally(~type + time, data = dat, margins = TRUE, format = "count")
## time ## type DEC2014 JUN2015 MAR2016 Total ## ENERGY 56 58 49 163 ## ENERGYDIET 49 54 35 138 ## JUICE 70 64 52 186 ## JUICE DRINK 19 17 6 42 ## MILK 63 61 53 177 ## SODA 239 262 215 716 ## SODADIET 128 174 127 429 ## SPORT 11 16 12 39 ## SPORTDIET 2 2 0 4 ## TEA 52 45 41 138 ## TEADIET 6 6 8 20 ## WATER 48 38 34 120 ## WATERSWEET 1 1 1 3 ## Total 744 798 633 2175
This table shows that there were no observations for the category
Sportdiet
in March 2016. As this is a drink which even in the other months has very few observations, it may be a product that is offered only in one shop, and it is possible that this shop was not visited in March 2016. The product may also be a seasonal product that is not available in March. It is also likely thatWatersweet
is offered in only one shop.
 conditional mean
 An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations.
Besides counting the number of observations in a particular group, we can also calculate the mean by only using observations that satisfy certain conditions (known as the conditional mean). In this case, we are interested in comparing the mean price of taxed and untaxed beverages, before and after the tax.
 Calculate and compare conditional means:
 Create a table similar to Figure 3.1, showing the average price per ounce (in cents) for taxed and untaxed beverages separately, with ‘store type’ as the row variable, and ‘taxed’ and ‘time’ as the column variables. To follow the methodology used in the journal paper, make sure to only include products that are present in all time periods, and nonsupplementary products (
supp = 0
).
 Without doing any calculations, summarize any differences or general patterns between December 2014 and June 2015 that you find in the table.
 Would we be able to assess the effect of sugar taxes on product prices by comparing the average price of untaxed goods with that of taxed goods in any given period? Why or why not?
Nontaxed  Taxed  

Store type  Dec 2014  Jun 2015  Dec 2014  Jun 2015 
1  
3 
R walkthrough 3.4 Calculating conditional means
Calculating conditional means is not a straightforward task (in R or in any other statistical program), but is, however, a common data cleaning operation you will encounter. Here is one way to do this.
In order to identify products (
product_id
) that have observations for all three periods (DEC2014
,JUN2015
andMAR2016
), we will first create a new variable calledperiod_test
, which takes the value 1 (orTRUE
) if we have observations in all periods for a product in a particular store, and 0 (FALSE
) otherwise. These true/false variables are called ‘boolean variables’.The easiest way to create this variable is through a loop. For each store and product ID, we will check whether there are observations in all periods (i.e. the variable
time
has the time periodsDEC2014
,JUN2015
andMAR2016
), and temporarily store this information in the variabletest
. We then transfer this information to the rows of the newperiod_test
variable that correspond to that store and product ID.dat$period_test < NA # List of all store IDs sid_list = unique(dat$store_id) # List of all product IDs pid_list = unique(dat$product_id) for (s in sid_list) { for (p in pid_list) { temp < subset(dat, product_id == p & store_id == s) temp_time < temp$time test < ( any(temp_time == "DEC2014") & any(temp_time == "JUN2015") & any(temp_time == "MAR2016")) dat$period_test[dat$product_id == p & dat$store_id == s] < test } }
Now we can use the
period_test
variable to remove all products that have not been observed in all three periods. We will store the remaining products in a new dataframe,dat_c
.dat_c < subset(dat, (period_test == TRUE & supp == "Standard"))
Now we can calculate the means of
price_per_oz
by grouping the data according tostore_type
,taxed
, andtime
. One way to achieve this is to use a technique called piping:table_res < dat_c %>% group_by(taxed, store_type, time) %>% summarize(n = length(price_per_oz), avg.price = mean(price_per_oz)) %>% spread(time, avg.price) %>% print()
## # A tibble: 8 x 6 ## # Groups: taxed, store_type [8] ## taxed store_type n DEC2014 JUN2015 MAR2016 ## <fct> <fct> <int> <dbl> <dbl> <dbl> ## 1 not taxed Large Supermarket 36 0.112 0.115 0.117 ## 2 not taxed Small Supermarket 70 0.137 0.138 0.134 ## 3 not taxed Pharmacy 18 0.152 0.161 0.154 ## 4 not taxed Gas Station 12 0.169 0.170 0.170 ## 5 taxed Large Supermarket 36 0.156 0.169 0.167 ## 6 taxed Small Supermarket 101 0.159 0.160 0.155 ## 7 taxed Pharmacy 18 0.182 0.191 0.186 ## 8 taxed Gas Station 22 0.194 0.203 0.192
Piping is a very useful technique in R for data analysis. If we have to perform a sequence of commands on the same object, then we can simplify the code by writing the sequence as a single command. We use the punctuation
%>%
to link the commands together.In this code above, we 1) took the data in
dat_c
and grouped them according to the variablestaxed
,store_type
, andtime
, 2) summarized the data by calculating the mean ofprice_per_oz
and the number of observations in each group, 3) rearranged the results table so thattime
is the column variable and the values inside the table are the means ofprice_per_oz
.The University of Manchester’s Econometric Computing Learning Resource provides a more detailed introduction to piping.
Use the S3 Table in the journal paper to check how closely your summary data match those in the paper.^{1} You should find that your results for Large Supermarkets and Pharmacies match, but the other store types have discrepancies. In R walkthrough 3.5 we will discuss these differences in more detail.
In order to make a beforeandafter comparison, we will make a chart similar to Figure 2 in the journal paper, to show the change in prices for each store type.
 Using your table from Question 3:
 Calculate the change in the mean price after the tax (price in June 2015 minus price in December 2014) for taxed and untaxed beverages, by store type.
 Using the values you calculated in Question 4(a), plot a column chart to show this information (as done in Figure 2 of the journal paper) with store type on the horizontal axis and price change on the vertical axis. Label each axis and data series appropriately. You should get the same values as shown in Figure 2.
R walkthrough 3.5 Making a column chart to compare two groups
Calculate price differences by store type
Let’s calculate the two price differences (June 2015 minus December 2014 and March 2016 minus December 2014), and store them as
D1
andD2
respectively:table_res$D1 < table_res$JUN2015  table_res$DEC2014 table_res$D2 < table_res$MAR2016  table_res$DEC2014 print("Group Means")
## [1] "Group Means"
table_res
## # A tibble: 8 x 8 ## # Groups: taxed, store_type [8] ## taxed store_type n DEC2014 JUN2015 MAR2016 D1 D2 ## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 not taxed Large Supermar~ 36 0.112 0.115 0.117 2.88e3 0.00510 ## 2 not taxed Small Supermar~ 70 0.137 0.138 0.134 1.46e3 0.00304 ## 3 not taxed Pharmacy 18 0.152 0.161 0.154 8.80e3 0.00240 ## 4 not taxed Gas Station 12 0.169 0.170 0.170 2.90e4 0.00106 ## 5 taxed Large Supermar~ 36 0.156 0.169 0.167 1.31e2 0.0107 ## 6 taxed Small Supermar~ 101 0.159 0.160 0.155 1.44e3 0.00359 ## 7 taxed Pharmacy 18 0.182 0.191 0.186 8.97e3 0.00448 ## 8 taxed Gas Station 22 0.194 0.203 0.192 9.25e3 0.00174
Plot a column chart for average price changes
To display
D1
andD2
in a column chart, we will use theggplot2
package (which is part of thetidyverse
package we loaded for R walkthrough 3.1). Let’s start with displaying the average price change from December 2014 to June 2015 (which is stored inD1
):ggplot(table_res, aes(fill = taxed, y = D1, x = store_type)) + geom_bar(position = "dodge", stat = "identity") + # Add the axes labels labs(y = "Price change (US$/oz)", x = "Store type") + # Add the title and legend labels scale_fill_discrete(name = "Beverages", labels = c("Nontaxed", "Taxed")) + ggtitle("Average price change from Dec 2014 to Jun 2015")
Now we do the same for the price change from Dec 2014 to Mar 2016:
ggplot(table_res, aes(fill = taxed, y = D2, x = store_type)) + geom_bar(position = "dodge", stat = "identity") + # Add the axes labels labs(y = "Price change (US$/oz)", x = "Store type") + # Add the title and legend labels scale_fill_discrete(name = "Beverages", labels = c("Nontaxed", "Taxed")) + ggtitle("Average price change from Dec 2014 to Mar 2016")
 statistically significant
 When a relationship between two or more variables is unlikely to be due to chance, given the assumptions made about the variables (for example, having the same mean). Statistical significance does not tell us whether there is a causal link between the variables.
To assess whether the difference in mean prices before and after the tax could have happened by chance due to the samples chosen (and there are no differences in the population means), we could calculate the pvalue. (Here, ‘population means’ refer to the mean prices before/after the tax that we would calculate if we had all prices for all stores in Berkeley.) The authors of the journal article calculate pvalues, and use the idea of statistical significance to interpret them. Whenever they get a pvalue of less than 5%, they conclude that the assumption of no differences in the population is unlikely to be true: they say that the price difference is statistically significant. If they get a pvalue higher than 5%, they say that the difference is not statistically significant, meaning that they think it could be due to chance variation in prices.
Using a particular cutoff level for the pvalue, and concluding that a result is only statistically significant if the pvalue is below the cutoff, is common in statistical studies, and 5% is often used as the cutoff level. But this approach has been criticized recently by statisticians and social scientists. The main criticisms raised are that any cutoffs are arbitrary. Instead of using a cutoff, we prefer to calculate pvalues and use them to assess the strength of the evidence against our assumption that there are no differences in the population means. Whether the statistical evidence is strong enough for us to draw a conclusion about a policy, such as a sugar tax, will always be a matter of judgement.
According to the journal paper, the pvalue is 0.02 for large supermarkets, and 0.99 for pharmacies.
 Based on these pvalues and your chart from Question 4, what can you conclude about the difference in means? (Hint: You may find the discussion in Part 2.3 helpful.)
Extension R walkthrough 3.6 Calculating the pvalue for price changes
In this walkthrough, we show the calculations required to obtain the p values in Table S3 of the Silver et al. (2017) paper. Since the pvalues are already provided, this walkthrough is only for those who want to see how these pvalues were calculated. For the categories of Large Supermarkets and Pharmacies, we conduct a hypothesis test, which tests the null hypothesis that the price difference between June 2015 and December 2014 (and March 2016 and December 2014) for the taxed and untaxed beverages in the two store types could be due to chance.
We are interested in whether the difference in average price between
JUN2015
andDEC2014
(orMAR2016
andDEC2014
) for one group (say, Large Supermarket and taxed) is zero (i.e. there is no difference in the means of the two populations). Note that we are dealing with paired observations (the same product in both time periods).Let’s use the price difference between June 2015 and December 2014 in Large Supermarkets for taxed beverages as an example. First, we extract the prices for both periods (the vectors
p1
andp2
) and then calculate the difference, element by element (stored asd_t
).p1 < dat_c$price_per_oz[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "taxed" & dat_c$time == "DEC2014"] p2 < dat_c$price_per_oz[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "taxed" & dat_c$time == "JUN2015"] # Price difference for taxed products d_t < p2  p1
All three new variables are vectors with 36 elements. For
d_t
to correctly represent the price difference for a particular product in a particular store, we need to be certain that each element in both vectors corresponds to the same product in the same store. To check that the elements match, we will extract the store and product IDs along with the prices, and compare the ordering inp1_alt
andp2_alt
.p1_alt < dat_c[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "taxed" & dat_c$time == "DEC2014", c("product_id", "store_id", "price_per_oz")] p2_alt < dat_c[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "taxed" & dat_c$time == "JUN2015", c("product_id", "store_id", "price_per_oz")]
You can see that the ordering matches, since the original datafile was ordered in a way (first according to
time
, thenstore_id
and thenproduct_id
) that in this instance guarantees identical ordering.The average value of the price difference is 0.0131222, and our task is to evaluate whether this is likely to be due to sampling variation (given the assumption that there is no difference between the populations) or not. To do this, we can use the
t.test
function, which provides the associated pvalue.
 standard error
 A measure of the degree to which the sample mean deviates from the population mean. It is calculated by dividing the standard deviation of the sample by the square root of the number of observations.
Alternatively, we can calculate the respective test statistic manually, which also requires us to calculate the standard error of this value:
t < mean(d_t) / sqrt(var(d_t) / (length(d_t)))
Alternatively, we can use the available
t.test
function in R, which gives exactly the same results but has the advantage of directly obtaining pvalues and confidence intervals.# Recognize that the differences come from paired samples. t.test(p2, p1, paired = TRUE)
## ## Paired ttest ## ## data: p2 and p1 ## t = 4.7681, df = 35, pvalue = 3.226e05 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.007535221 0.018709202 ## sample estimates: ## mean of the differences ## 0.01312221
# Or get the same result using the function directly on d # t.test(d)
To compare this result to the journal paper, look at the extract from Table S3 (the section on Large Supermarkets) shown in Figure 3.4 below. The cell with ‘**’ shows the mean price difference of 1.31 cents ($0.0131).
Large supermarkets
(n = 6)Taxed beverage price
(36 sets)Untaxed beverage price
(36 sets)Taxed – untaxed difference cents/oz cents/oz cents/oz Round 1: December 2014 15.62 11.19 Round 2: June 2015 16.93 11.48 Round 3: March 2016 16.68 11.70 Mean change
(March 2016–Dec 2014)1.07, (p=0.01) 0.51, (p=0.01) 0.56, (p=0.22) Mean change
(June 2015–Dec 2014)1.31, (p<0.001)^{**} 0.29, (p=0.08) 1.02, (p=0.002) n = number of stores of each type.
In our test output we get a very small pvalue (0.0000323) which in the table is indicated by the double asterisk.
Tests for other store types are calculated similarly, by changing the data extracted to
p1
andp2
. Let’s do that for one more example: the price difference between June 2015 and December 2014 in Large Supermarkets for untaxed beverages.p1 < dat_c$price_per_oz[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "not taxed" & dat_c$time == "DEC2014"] p2 < dat_c$price_per_oz[ dat_c$store_type == "Large Supermarket" & dat_c$taxed == "not taxed" & dat_c$time == "JUN2015"] d_nt < p2  p1 t.test(p2, p1, paired = TRUE)
## ## Paired ttest ## ## data: p2 and p1 ## t = 1.8179, df = 35, pvalue = 0.07765 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.0003367942 0.0061057804 ## sample estimates: ## mean of the differences ## 0.002884493
You should be able to recognize the mean difference, the pvalue, and the confidence interval in the excerpt of Table S3 provided in Figure 3.4.
Let’s also replicate the last section of Table S3, which shows the difference between the price changes in taxed and untaxed products, that is, we want to know whether
d_t
ANDd_nt
have different means. We will apply the two sample hypothesis tests, but this time for unpaired data, as the products differ across samples.t.test(d_t,d_nt)
## ## Welch two sample ttest ## ## data: d_t and d_nt ## t = 3.2227, df = 55.955, pvalue = 0.002119 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.003873834 0.016601603 ## sample estimates: ## mean of x mean of y ## 0.013122212 0.002884493
Again you should be able to identify the corresponding entries in Table S3 shown in Figure 3.4. The main entry in the table is 1.02, indicating that the means of the two groups differ by 1.02 cents. This is confirmed in our calculations, as $0.01312 − $0.00288 is about $0.0102 or 1.02 cents. The pvalue of 0.002 is also the same as the one in Table S3.
Part 3.2 Beforeandafter comparisons with prices in other areas
When looking for any price patterns, it is possible that the observed changes in Berkeley were not solely due to the tax, but instead were also influenced by other events that happened in Berkeley and in neighbouring areas. To investigate whether this is the case, the researchers conducted another differencesindifferences analysis, using a different treatment and control group:
 The treatment group: Beverages in Berkeley
 The control group: Beverages in surrounding areas.
The researchers collected price data from stores in the surrounding areas and compared them with prices in Berkeley. If prices changed in a similar way in nearby areas (which were not subject to the tax), then what we observed in Berkeley may not be primarily a result of the tax. We will be using the data the researchers collected to make our own comparisons.
Download the following files:
 The Berkeley PointofSale Stata file on the Global Food Research Program’s website, containing the price data they collected, including information on the date (year and month), location (Berkeley or NonBerkeley), beverage group (soda, fruit drinks, milk substitutes, milk and water), and the average price for that month. Stata is another popular statistical software package, and the data is provided as a
.dta
file.  ‘S5 Table’ comparing the neighbourhood characteristics of the Berkeley and nonBerkeley stores.
 Based on ‘S5 Table’, do you think the researchers chose suitable comparison stores? Why or why not?
We will now plot a line chart similar to Figure 3 in the journal paper, to compare prices of similar goods in different locations and see how they have changed over time. To do this, we will need to summarize the data so that there is one value (the mean price) for each location and type of good in each month.
 Assess the effects of a tax on prices:
 Create a table similar to the one provided in Figure 3.5 to show the average price in each month for taxed and nontaxed beverages, according to location. Use ‘year and month’ as the row variables, and ‘tax’ and ‘location’ as the column variables. (Hint: You may find R walkthrough 3.4 helpful.)
 Plot the four columns of your table on the same line chart, with average price on the vertical axis and time (months) on the horizontal axis. Describe any differences you see between the prices of nontaxed goods in Berkeley and those outside Berkeley, both before the tax (January 2013 to December 2014) and after the tax (March 2015 onwards). Do the same for prices of taxed goods.
 Based on your chart, is it reasonable to conclude that the sugar tax had an effect on prices?
Nontaxed  Taxed  

Year/Month  Berkeley  NonBerkeley  Berkeley  NonBerkeley 
January 2013  
February 2013  
March 2013  
…  
December 2013  
January 2014  
…  
February 2016 
R walkthrough 3.7 Importing data from a Stata file and plotting a line chart
Import data and create a table of average monthly prices
To import data from a Stata file (
.dta
format) we need thereadstata13
package.library(readstata13) PoSd < read.dta13("public_use_weighted_prices2.dta")
Before proceeding, use the command
str(PoSd)
to look at the structure of this dataset (output not shown here). You will see that for each month and location (Berkeley or NonBerkeley), there are prices for a variety of beverage categories, and we know whether the category is taxed or not. For any particular timelocationtax status combination we want the average price of all products.To make the summary table, we use the piping technique again:
table_test < PoSd %>% group_by(year, month, location, tax) %>% summarize(avg.price = mean(price)) %>% spread(location, avg.price) %>% print()
## # A tibble: 78 x 5 ## # Groups: year, month [39] ## year month tax Berkeley `NonBerkeley` ## <dbl> <dbl> <chr> <dbl> <dbl> ## 1 2013 1 Nontaxed 5.72 5.35 ## 2 2013 1 Taxed 8.69 7.99 ## 3 2013 2 Nontaxed 5.81 5.36 ## 4 2013 2 Taxed 8.65 8.18 ## 5 2013 3 Nontaxed 5.86 5.42 ## 6 2013 3 Taxed 8.82 8.19 ## 7 2013 4 Nontaxed 5.86 5.64 ## 8 2013 4 Taxed 9.02 8.25 ## 9 2013 5 Nontaxed 5.79 5.18 ## 10 2013 5 Taxed 8.68 7.76 ## # ... with 68 more rows
We have the necessary information, but the data is not quite in the right shape yet, because we need to separate the
Taxed
from theNontaxed
data (which we do using thesubset
function).tax_table < subset(table_test, tax == "Taxed") ntax_table < subset(table_test, tax == "Nontaxed")
Plot a line chart
Now we can create a line chart. Before we do this, we will convert the
Berkeley
andNonBerkeley
columns of both datasets to time series data using thets
function, which will make plotting a little easier. We can then plottax_table$Berkeley
and add lines for the three other variables.# Use inverted commas ('') to refer to the 'NonBerkeley' # variable, since R interprets the hyphen as a minus sign. tax_table$Berkeley < ts(tax_table$Berkeley, start = c(2013, 1), end = c(2016, 3), frequency = 12) tax_table$'NonBerkeley' < ts(tax_table$'NonBerkeley', start = c(2013, 1), end = c(2016, 3), frequency = 12) ntax_table$Berkeley < ts(ntax_table$Berkeley, start = c(2013, 1), end = c(2016, 3), frequency = 12) ntax_table$'NonBerkeley' < ts(ntax_table$'NonBerkeley', start = c(2013, 1), end = c(2016, 3), frequency = 12) plot(tax_table$Berkeley, col = "deepskyblue4", lwd = 2, ylab = "Average price", xlab = "Time", ylim = c(4, 12)) # \n creates a line break. title("Average price of taxed and nontaxed beverages \n in Berkeley and nonBerkeley areas") lines(tax_table$'NonBerkeley', col = "deeppink", lwd = 2) lines(ntax_table$Berkeley, col = "darkgreen", lwd = 2) lines(ntax_table$'NonBerkeley', col = "darkorange", lwd = 2) # Add vertical lines abline(v = 2015.1, col = "grey") abline(v = 2015.3, col = "grey") # Add labels text(2014.6, 4, "Pretax") text(2015.8, 4, "Posttax") legend(2013.1, 12, lwd = 2, lty = 1, cex = 0.8, legend = c("Taxed (Berkeley)", "Taxed (nonBerkeley)", "Nontaxed (Berkeley)", "Nontaxed (nonBerkeley)"), col = c("deepskyblue4", "deeppink", "darkgreen", "darkorange"))
How strong is the evidence that the sugar tax affected prices? According to the journal paper, when comparing the mean Berkeley and nonBerkeley price of sugary beverages after the tax, the pvalue is smaller than 0.00001, and it is 0.63 for nonsugary beverages after the tax.
 What do the pvalues tell us about the difference in means and the effect of the sugar tax on the price of sugary beverages? (Hint: You may find the discussion in Part 2.3 helpful.)
The aim of the sugar tax was to decrease consumption of sugary beverages. Figure 3.7 shows the mean number of calories consumed and the mean volume consumed before and after the tax. The researchers reported the pvalues for the difference in means before and after the tax in the last column.
Usual intake  Pretax (Nov–Dec 2014), n = 623 
Posttax (Nov–Dec 2015), n = 613 
Pretax–posttax difference  

Caloric intake (kilocalories/capita/day)  
Taxed beverages  45.1  38.7  −6.4, p = 0.56  
Nontaxed beverages  115.7  147.6  31.9, p = 0.04  
Volume of intake (grams/capita/day)  
Taxed beverages  121.0  97.0  −24.0, p = 0.24  
Nontaxed beverages  1,839.4  1,896.5  57.1, p = 0.22  
Models account for age, gender, race/ethnicity, income level, and educational attainment. n is the sample size at each round of the survey after excluding participants with missing values on selfreported race/ethnicity, age, education, income, or monthly intake of sugarsweetened beverages. 
Lynn D. Silver, Shu Wen Ng, Suzanne RyanIbarra, Lindsey Smith Taillie, Marta Induni, Donna R. Miles, Jennifer M. Poti, and Barry M. Popkin. 2017. Table 1 in ‘Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugarsweetened beverages in Berkeley, California, US: A beforeandafter study’. PLoS Med 14 (4): e1002283.
 Based on Figure 3.7, what can you say about consumption behaviour in Berkeley after the tax? Suggest some explanations for the evidence.
 Read the ‘Limitations’ in the ‘Discussions’ section of the paper and discuss the strengths and limitations of this study. How could future studies on the sugar tax in Berkeley address these problems? (Some issues you may want to discuss are: the number of stores observed, number of people surveyed, and the reliability of the price data collected.)
 Suppose that you have the authority to conduct your own sugar tax natural experiment in two neighbouring towns, Town A and Town B. Outline how you would conduct the experiment to ensure that any changes in outcomes (prices, consumption of sugary beverages) are due to the tax and not due to other factors. (Hint: think about what factors you need to hold constant.)

Lynn D. Silver, Shu Wen Ng, Suzanne RyanIbarra, Lindsey Smith Taillie, Marta Induni, Donna R. Miles, Jennifer M. Poti, and Barry M. Popkin. 2017. S3 Table in ‘Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugarsweetened beverages in Berkeley, California, US: A beforeandafter study’. PLoS Med 14 (4): e1002283. ↩