# Empirical Project 3 Working in R

Don’t forget to also download the data into your working directory by following the steps in this project.

## R-specific learning objectives

In addition to the learning objectives for this project, in this section you will learn how to use the piping technique to run a sequence of functions.

## Getting started in R

For this project you will need the following packages:

• tidyverse, to help with data manipulation
• readxl, to import an Excel spreadsheet
• mosaic, to help create frequency tables
• readstata13, to read in a Stata datafile.

You will also use the ggplot2 package to produce accurate graphs, but that comes as part of the tidyverse package.

If you need to install any of these packages, run the following code:

install.packages(c("readxl","tidyverse",


You can import these libraries now, or when they are used in the R walk-throughs below.

library(readxl)
library(tidyverse)
library(mosaic)


## Part 3.1 The treatment group: before-and-after comparisons of retail prices

We will first look at price data from the treatment group (stores in Berkeley) to see what happened to the price of sugary and non-sugary beverages after the tax.

• Download the data from the Global Food Research Program’s website, and select the ‘Berkeley Store Price Survey’ Excel dataset. Then upload the dataset into R.
• The first tab of the Excel file contains the data dictionary. Make sure you read the data description column carefully, and check that each variable is in the Data tab.

### R walk-through 3.1 Importing the datafile into R

The data is in .xlsx format, so we use the readxl package to import it. We also load the tidyverse library as this includes packages that we will use later for piping and graphing. There are two worksheets: one containing some information on the variables (Data Dictionary), and the other containing the data (Data). It is useful to import both into R, so that you have the additional data information available while you are working without having to refer to Excel.

library(readxl)
library(tidyverse)



Let’s look at the datatypes this import process has assigned to the respective variables.

str(dat)

## Classes 'tbl_df', 'tbl' and 'data.frame': 2175 obs. of 12 variables:
## $store_id : num 16 16 16 16 16 16 16 16 16 16 ... ##$ type          : chr "WATER" "TEA" "TEA" "WATER" ...
## $store_type : num 2 2 2 2 2 2 2 2 2 2 ... ##$ type2         : chr NA NA NA NA ...
## $size : num 33.8 23 23 33.8 128 64 128 64 63.9 144 ... ##$ price         : num 1.69 0.99 0.99 1.69 3.79 2.79 3.79 2.79 4.59 5.99 ...
## $price_per_oz : num 0.05 0.043 0.043 0.05 0.0296 ... ##$ price_per_oz_c: num 5 4.3 4.3 5 2.96 ...
## $taxed : num 0 1 1 0 0 0 0 0 0 1 ... ##$ supp          : num 0 0 0 0 0 0 0 0 0 1 ...
## $time : chr "DEC2014" "DEC2014" "DEC2014" "DEC2014" ... ##$ product_id    : num 29 32 33 38 40 41 42 43 44 50 ...


All numerical variables are correctly classified as numerical, but there are some variables that may be useful to have as categorical variables (factor variables in R terminology). Specifically, type, taxed, supp, store_id, store_type, type2 and product_id have been expressed as numbers, but are actually categorical (factor) variables. So let’s use the factor function to convert them to factor variables.

dat$type <- factor(dat$type)
dat$taxed <- factor(dat$taxed,
labels=c("not taxed","taxed"))
dat$supp <- factor(dat$supp,
labels=c("Standard","Supplemental"))
dat$store_id <- factor(dat$store_id)
dat$store_type <- factor(dat$store_type,
labels=c("Large Supermarket",
"Small Supermarket",
"Pharmacy",
"Gas Station"))
dat$type2 <- factor(dat$type2)
dat$product_id <- factor(dat$product_id)


We use the labels option to specify the names of different categories (where they are clearly defined).

There is another variable, time, which is characterized as a chr variable (chr stands for characters, meaning letters and numbers). It may be useful to change time into a factor variable. Before we do this, we use the unique command to check the values this variable takes.

unique(dat$time)  ## [1] "DEC2014" "JUN2015" "MAR2015"  If you look at the timeline in the Silver et al. (2017) paper, you will notice that the third survey was not in March 2015 but in March 2016, so the data has been labelled incorrectly. We shall therefore change all the values MAR2015 to MAR2016. dat$time[dat$time == "MAR2015"] <- "MAR2016" # [dat$time == "MAR2015"] selects all observations for which the value of “time” is equal to "MAR2015".


We can now change time into a factor variable.

dat$time <- factor(dat$time)

1. Read ‘S1 Text’, from the journal paper’s supporting information, which explains how the Store Price Survey data was collected.
• In your own words, explain how the product information was recorded, and the measures that researchers took to ensure that the data was accurate and representative of the treatment group. What were some of the data collection issues that they encountered?
• Instead of using the name of the store, each store was given a unique ID number (recorded as store_id on the spreadsheet). Verify that the number of stores in the dataset is the same as that stated in the ‘S1 Text’ (26). Similarly, each product was given a unique ID number (product_id). How many different products are in the dataset?

### R walk-through 3.2 Counting the number of unique elements in a variable

We use two functions here: unique to reduce a list (dat$store_id and dat$product_id) to its unique elements, then length to count how many unique elements we have. We store the number of stores and products in the two variables no_stores and no_products respectively.

no_stores <- length(unique(dat$store_id)) no_products <- length(unique(dat$product_id))
paste("Stores:", no_stores)

## [1] "Stores: 26"

paste("Products:", no_products)

## [1] "Products: 247"


Following the approach described in ‘S1 Text’, we will compare the variable price per ounce in US$cents (price_per_oz_c). We will look at what happened to prices in the two treatment groups before the tax (time = DEC2014) and after the tax (time = JUN2015): • treatment group one: large supermarkets (store_type = 1) • treatment group two: pharmacies (store_type = 3). We will create frequency tables containing the summary measures that we are interested in. 1. Create the following tables: • A frequency table showing the number (count) of store observations (store type) in December 2014 and June 2015, with ‘store type’ as the row variable and ‘time period’ as the column variable. For each store type, is the number of observations similar in each time period? • A frequency table showing the number of taxed and non-taxed beverages in December 2014 and June 2015, with ‘store type’ as the row variable and ‘taxed’ as the column variable. (‘Taxed’ equals 1 if the sugar tax applied to that product, and 0 if the tax did not apply). For each store type, is the number of taxed and non-taxed beverages similar? • A frequency table showing the number of each product type (type), with ‘product type’ as the row variable and ‘time period’ as the column variable. Which product types have the highest number of observations and which have the lowest number of observations? Why might some products have more observations than others? ### R walk-through 3.3 Creating frequency tables #### Frequency table for store type and time period We use the tally function, which allows us to produce frequency tables using the format = "count" option. We start with the frequency table that shows the number of stores of different types in each time period. library(mosaic) tally(~store_type+time, data=dat, margins = TRUE, format = "count")  ## time ## store_type DEC2014 JUN2015 MAR2016 Total ## Large Supermarket 177 209 158 544 ## Small Supermarket 407 391 327 1125 ## Pharmacy 87 102 73 262 ## Gas Station 73 96 75 244 ## Total 744 798 633 2175  There are fewer observations taken from gas stations and pharmacies and more from small supermarkets. #### Frequency table for store type and taxed Now we repeat the steps above to make the frequency table with store_type as the row variable and taxed as the column variable. Since we also want separate values for each time period, we add +time to the specification. tally(~store_type+taxed+time, data=dat, margins = TRUE, format = "count")  ## , , time = DEC2014 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 92 85 177 ## Small Supermarket 196 211 407 ## Pharmacy 44 43 87 ## Gas Station 34 39 73 ## Total 366 378 744 ## ## , , time = JUN2015 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 111 98 209 ## Small Supermarket 192 199 391 ## Pharmacy 52 50 102 ## Gas Station 44 52 96 ## Total 399 399 798 ## ## , , time = MAR2016 ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 88 70 158 ## Small Supermarket 154 173 327 ## Pharmacy 36 37 73 ## Gas Station 31 44 75 ## Total 309 324 633 ## ## , , time = Total ## ## taxed ## store_type not taxed taxed Total ## Large Supermarket 291 253 544 ## Small Supermarket 542 583 1125 ## Pharmacy 132 130 262 ## Gas Station 109 135 244 ## Total 1074 1101 2175  #### Frequency table for product type and time period Now we make a frequency table showing the number of each product type (type), with product type (type) as the row variable and time period (time) as the column variable. tally(~type+time, data=dat, margins = TRUE, format = "count")  ## time ## type DEC2014 JUN2015 MAR2016 Total ## ENERGY 56 58 49 163 ## ENERGY-DIET 49 54 35 138 ## JUICE 70 64 52 186 ## JUICE DRINK 19 17 6 42 ## MILK 63 61 53 177 ## SODA 239 262 215 716 ## SODA-DIET 128 174 127 429 ## SPORT 11 16 12 39 ## SPORT-DIET 2 2 0 4 ## TEA 52 45 41 138 ## TEA-DIET 6 6 8 20 ## WATER 48 38 34 120 ## WATER-SWEET 1 1 1 3 ## Total 744 798 633 2175  This table shows that there were no observations for the category Sport-diet in March 2016. As this is a drink which even in the other months has very few observations, it may be a product that is offered only in one shop, and it is possible that this shop was not visited in March 2016. Or the product may be a seasonal product that is not available in March. It is also likely that Water-sweet is offered in only one shop. conditional mean An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations. Besides counting the number of observations in a particular group, we can also calculate the mean by only using observations that satisfy certain conditions (known as the conditional mean). In this case, we are interested in comparing the mean price of taxed and untaxed beverages, before and after the tax. 1. Calculate and compare conditional means: • Create a table similar to Figure 3.1, showing the average price per ounce (in cents) for taxed and untaxed beverages separately, with ‘store type’ as the row variable, and ‘taxed’ and ‘time’ as the column variables. To follow the methodology used in the journal paper, make sure to only include products that are present in all time periods, and non-supplementary products (supp = 0). • Without doing any calculations, summarize any differences or general patterns between December 2014 and June 2015 that you find in the table. • Would we be able to assess the effect of sugar taxes on product prices by comparing the average price of untaxed goods with that of taxed goods in any given period? Why or why not? Non-taxed Taxed Store type Dec 2014 Jun 2015 Dec 2014 Jun 2015 1 3 The average price of taxed and non-taxed beverages, according to time period and store type. Figure 3.1 The average price of taxed and non-taxed beverages, according to time period and store type. ### R walk-through 3.4 Calculating conditional means Calculating conditional means is not a straightforward task to achieve (in R or in any other statistical program). It is, however, a common data cleaning operation you will encounter. Here is one way to do this. In order to identify products (identified by product_id) that have observations for all three periods (DEC2014, JUN2015 and MAR2016), we will first create a new variable called period_test, which takes the value 1 (or TRUE) if we have observations for all periods for a product in a particular store, and 0 (FALSE) otherwise. We call true/false variables like this ‘boolean variables’. Creating this variable is not straightforward, but the easiest way to achieve this is with a loop. dat$period_test <- NA

sid_list = unique(dat$store_id) # List of all store IDs pid_list = unique(dat$product_id) # List of all product IDs

for (s in sid_list) {
for (p in pid_list) {
temp <- subset(dat, product_id == p & store_id == s)
temp_time <- temp$time test <- (any(temp_time == "DEC2014") & any(temp_time == "JUN2015") & any(temp_time == "MAR2016")) dat$period_test[dat$product_id == p & dat$store_id == s] <- test
}
}


Now we can use the period_test variable to remove all products that have not been observed in all three periods. We define a new dataframe, dat_c, containing only products that are observed in all three periods in a particular store.

dat_c <- subset(dat,(period_test == TRUE & supp == "Standard"))


Now we can calculate the price averages (for price_per_oz) by grouping the data according to store_type, taxed, and time. One way to achieve this is to use a technique called piping:

table_res <- dat_c %>%
group_by(taxed,store_type,time) %>%
summarize(n = length(price_per_oz),avg.price = mean(price_per_oz)) %>%
print()

## # A tibble: 8 x 6
## # Groups:   taxed, store_type [8]
##   taxed     store_type            n DEC2014 JUN2015 MAR2016
##   <fct>     <fct>             <int>   <dbl>   <dbl>   <dbl>
## 1 not taxed Large Supermarket    36   0.112   0.115   0.117
## 2 not taxed Small Supermarket    70   0.137   0.138   0.134
## 3 not taxed Pharmacy             18   0.152   0.161   0.154
## 4 not taxed Gas Station          12   0.169   0.170   0.170
## 5 taxed     Large Supermarket    36   0.156   0.169   0.167
## 6 taxed     Small Supermarket   101   0.159   0.160   0.155
## 7 taxed     Pharmacy             18   0.182   0.191   0.186
## 8 taxed     Gas Station          22   0.194   0.203   0.192


Piping is a very useful technique in R for data analysis. In words, what we did in the line above is: Take the data in dat_c, group them according to the variables taxed, store_type, and time, and then summarize the data by calculating the mean of price_per_oz (and calculate the number of observations in each group). Then, rearrange the results table so that the time variable shows across the columns and the avg.price variable gives the values inside the table.

The University of Manchester’s Econometric Computing Learning Resource provides a more detailed introduction to piping.

Use the S3 Table in the journal paper to check how closely your summary data match those in the paper.1 You should find that your results for Large Supermarkets and Pharmacies match, but the other store types have discrepancies. In R walk-through 3.5 we will discuss these differences in more detail.

In order to make a before-and-after comparison, we will make a chart similar to Figure 2 in the journal paper, to show the change in prices for each store type.

1. Using your table from Question 3:
• Calculate the change in the mean price after the tax (price in June 2015 minus price in December 2014) for taxed and untaxed beverages, by store type.
• Using the values you calculated in Question 4(a), plot a column chart to show this information (as done in Figure 2 of the journal paper) with store type on the horizontal axis and price change on the vertical axis. Label each axis and data series appropriately. You should get the same values as shown in Figure 2.

### R walk-through 3.5 Making a column chart to compare two groups

#### Calculate price differences by store type

Let’s calculate the price differences for June 2015 minus December 2014 and March 2016 minus December 2014, and store them as D1 and D2 respectively:

table_res$D1 <- table_res$JUN2015 - table_res$DEC2014 table_res$D2 <- table_res$MAR2016 - table_res$DEC2014
print("Group Means")

## [1] "Group Means"

table_res

## # A tibble: 8 x 8
## # Groups:   taxed, store_type [8]
##   taxed     store_type          n DEC2014 JUN2015 MAR2016      D1       D2
##   <fct>     <fct>           <int>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
## 1 not taxed Large Supermar~    36   0.112   0.115   0.117 2.88e-3  0.00510
## 2 not taxed Small Supermar~    70   0.137   0.138   0.134 1.46e-3 -0.00304
## 3 not taxed Pharmacy           18   0.152   0.161   0.154 8.80e-3  0.00240
## 4 not taxed Gas Station        12   0.169   0.170   0.170 2.90e-4  0.00106
## 5 taxed     Large Supermar~    36   0.156   0.169   0.167 1.31e-2  0.0107
## 6 taxed     Small Supermar~   101   0.159   0.160   0.155 1.44e-3 -0.00359
## 7 taxed     Pharmacy           18   0.182   0.191   0.186 8.97e-3  0.00448
## 8 taxed     Gas Station        22   0.194   0.203   0.192 9.25e-3 -0.00174


#### Plot a column chart for average price changes

It is these two new variables we want to display in a column chart. We will use the ggplot2 package for graphs (which is part of the tidyverse package we loaded for R walk-through 3.1). Let’s start with displaying the average price change from December 2014 to June 2015 (which is stored in D1):

ggplot(table_res, aes(fill=taxed, y=D1, x=store_type)) +
geom_bar(position="dodge", stat="identity") +
labs(y = "Price change (US$/oz)", x = "Store type") + # Add the axes labels scale_fill_discrete(name="Beverages", # Add the title and legend labels labels=c("Non-taxed", "Taxed")) + ggtitle("Average price change from Dec 2014 to Jun 2015")  Average price change from December 2014 to June 2015. Figure 3.2 Average price change from December 2014 to June 2015. Now we do the same for the price change from Dec 2014 to Mar 2016: ggplot(table_res, aes(fill=taxed, y=D2, x=store_type)) + geom_bar(position="dodge", stat="identity") + labs(y = "Price change (US$/oz)", x = "Store type") +  # Add the axes labels
scale_fill_discrete(name="Beverages",      # Add the title and legend labels
labels=c("Non-taxed", "Taxed")) +
ggtitle("Average price change from Dec 2014 to Mar 2016")


Average price change from December 2014 to March 2016.

Figure 3.3 Average price change from December 2014 to March 2016.

It is important to know whether the difference in means is statistically significant or not. According to the journal paper, the p-value is less than 0.05 for large supermarkets, but greater than 0.05 for pharmacies.

1. Based on a cut-off (significance level) of 5%, what can we conclude about the difference in means? (Hint: You may find the discussion in Part 2.3 helpful.)

### Extension R walk-through 3.6 Testing for significant differences in price changes

In this walk-through, we show the calculations required to obtain the p-values above. As the p-values are already provided, this walk-through is only for those who want to replicate where these p-values come from. For the categories of Large Supermarkets and Pharmacies, we conduct a hypothesis test which tests the null hypothesis that the price difference between June 2015 and Dec 2014 (and Mar 2016 and Dec 2014) for the taxed and untaxed beverages in the two store types are actually zero.

We are interested in whether for one group (say, Large Supermarket and taxed), the difference in average price between JUN2015 and DEC2014 (or MAR2016 and DEC2014) is statistically significant. Note that we are dealing with paired observations (recall that we only allowed observations for products that were observed in all periods in the same store type).

Let’s use the price difference between June 2015 and December 2014 in Large Supermarkets for taxed beverages as an example. First, we extract price vectors for both periods (p1 and p2) and then calculate their difference (d_t).

p1 <-
dat_c$price_per_oz[dat_c$store_type=="Large Supermarket"
& dat_c$taxed == "taxed" & dat_c$time=="DEC2014"]
p2 <-
dat_c$price_per_oz[dat_c$store_type=="Large Supermarket"
& dat_c$taxed == "taxed" & dat_c$time=="JUN2015"]
d_t <- p2-p1 # Price difference for taxed products


All three new variables are vectors with 36 elements. For d_t to correctly represent the price difference for a particular product in a particular store, we need to be certain that every element in both vectors corresponds to the same product in the same store. To check that the elements match, we will extract the store and product IDs along with the prices, and compare the ordering in p1_alt and p2_alt.

p1_alt <-
dat_c[dat_c$store_type=="Large Supermarket" & dat_c$taxed == "taxed"
& dat_c$time=="DEC2014", c("product_id", "store_id", "price_per_oz")] p2_alt <- dat_c[dat_c$store_type=="Large Supermarket"
& dat_c$taxed == "taxed" & dat_c$time=="JUN2015",
c("product_id",
"store_id",
"price_per_oz")]


You can see that the ordering matches, since the original datafile was ordered in a way (first according to time, then store_id and then product_id) that in this instance guarantees identical ordering.

standard error
A measure of the degree to which the sample mean deviates from the population mean. It is calculated by dividing the standard deviation of the sample by the square root of the number of observations.

The average value of the price difference is 0.0131222, and our task is to evaluate whether this is statistically significantly different from 0. To calculate the respective test statistic ($t = \bar d/se_{\bar{d}}$), we will need this value’s standard error ($se_{\bar{d}}$). Manually the hypothesis test is calculated as follows:

t <- mean(d_t)/sqrt(var(d_t)/(length(d_t)))


Alternatively, we can use the available t.test function in R, which gives exactly the same results but has the advantage of directly obtaining p-values and confidence intervals.

t.test(p2,p1,paired=TRUE) # Recognize that the differences come from paired samples.

##

## Paired t-test
##
## data: p2 and p1
## t = 4.7681, df = 35, p-value = 3.226e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.007535221 0.018709202
## sample estimates:
## mean of the differences
## 0.01312221

# t.test(d) # Or use the function directly on d, which delivers identical results.


To compare this result to the journal paper, look at the extract from Table S3 (the section on Large Supermarkets) shown in Figure 3.4 below. The cell with ‘**’ shows the mean price difference of 1.31 cents ($0.0131). Table S3 in Silver et al. (2017), showing means and confidence intervals. Figure 3.4 Table S3 in Silver et al. (2017), showing means and confidence intervals. n = number of stores of each type; ** denotes statistically significant difference between prices in March 2016 compared to earlier round (December 2014 or June 2015) at p < 0.01 using paired t-tests. * denote statistically significant difference between prices in March 2016 compared to earlier round (December 2014 or June 2015) at p < 0.05 using paired-t-tests. ‡ denotes statistically significant difference of price of taxed beverages compared to untaxed beverages at p < 0.05 (unpaired t-tests since taxed and untaxed beverage items are different). In our test output we get a very small p-value (0.0000323) which in the table is indicated by the double asterisk. Our test output also automatically delivers the confidence interval for the mean of price differences: [0.0075, 0.0187]. Tests for other store types are calculated similarly, by changing the data extracted to p1 and p2. Let’s do that for one more example: the price difference between June 2015 and December 2014 in Large Supermarkets for untaxed beverages. p1 <- dat_c$price_per_oz[dat_c$store_type=="Large Supermarket" & dat_c$taxed == "not taxed" & dat_c$time=="DEC2014"] p2 <- dat_c$price_per_oz[dat_c$store_type=="Large Supermarket" & dat_c$taxed == "not taxed" & dat_c$time=="JUN2015"] d_nt <- p2-p1 t.test(p2,p1,paired=TRUE)  ## ## Paired t-test ## ## data: p2 and p1 ## t = 1.8179, df = 35, p-value = 0.07765 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.0003367942 0.0061057804 ## sample estimates: ## mean of the differences ## 0.002884493  You should be able to recognize the mean difference, the p-value, and the confidence interval in the excerpt of Table S3 provided in Figure 3.4. Let’s also replicate the last section of Table S3, which shows the difference between the price changes in taxed and untaxed products, that is, we want to know whether d_tAND d_nt have different means. We will apply the two sample hypothesis tests, but this time for unpaired data, as the products differ across samples. t.test(d_t,d_nt)  ## ## Welch two sample t-test ## ## data: d_t and d_nt ## t = 3.2227, df = 55.955, p-value = 0.002119 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.003873834 0.016601603 ## sample estimates: ## mean of x mean of y ## 0.013122212 0.002884493  Again you should be able to identify the corresponding entries in Table S3 shown in Figure 3.4. The main entry in the table is 1.02, indicating that the means of the two groups differ by 1.02 cents. This is confirmed in our calculations, as$0.01312 − $0.00288 is about$0.0102 or 1.02 cents. The p-value of 0.002 is also the same as the one in Table S3.

## Part 3.2 The control group: before-and-after comparisons with prices in other areas

When looking for any price patterns, it is possible that the observed changes were not due to the tax, but instead were due to other events that happened in Berkeley and the neighbouring areas. If prices changed in a similar way in nearby areas, then what we observed in Berkeley may not be a result of the tax. To investigate whether this was the case, the researchers collected price data from stores in the surrounding areas and compared them with prices in Berkeley.

• The Berkeley Point-of-Sale Stata file on the Global Food Research Program’s website, containing the price data they collected, including information on the date (year and month), location (Berkeley or Non-Berkeley), beverage group (soda, fruit drinks, milk substitutes, milk and water), average price, and the consumer price index (CPI) for that month. Stata is another popular statistical software package, and the data is provided as a .dta file.
• ‘S5 Table’ comparing the neighbourhood characteristics of the Berkeley and non-Berkeley stores.
1. Based on ‘S5 Table’, do you think the researchers chose suitable comparison stores? Why or why not?

We will now plot a line chart similar to Figure 3 in the journal paper, to compare prices of similar goods in different locations and see how they have changed over time. To do this, we will need to summarize the data so that there is one value (the mean price) for each location and type of good in each month.

1. Assess the effects of a tax on prices:
• Create a table similar to the one provided in Figure 3.5 to show the average price in each month for taxed and non-taxed beverages, according to location. Use ‘year and month’ as the row variables, and ‘tax’ and ‘location’ as the column variables. (Hint: You may find R walk-through 3.4 helpful.)
• Plot the four columns of your table on the same line chart, with average price on the vertical axis and time (months) on the horizontal axis. Describe any differences you see between the prices of non-taxed goods in Berkeley and those outside Berkeley, both before the tax (January 2013 to December 2014) and after the tax (March 2015 onwards). Do the same for prices of taxed goods.
• Based on your chart, is it reasonable to conclude that the sugar tax had an effect on prices?
Non-taxed Taxed
Year/Month Berkeley Non-Berkeley Berkeley Non-Berkeley
January 2013
February 2013
March 2013
December 2013
January 2014
February 2016

The average price of taxed and non-taxed beverages, according to location and month.

Figure 3.5 The average price of taxed and non-taxed beverages, according to location and month.

### R walk-through 3.7 Importing data from a Stata file and plotting a line chart

#### Import data and create a table of average monthly prices

To import data from a .dta file we need the readstata13 package.

library(readstata13)



Before proceeding, use the command str(PoSd) to look at the structure of this dataset (output not shown here). You will see that for each month and location (Berkeley or Non-Berkeley), there are a number of prices for a variety of beverage categories, and we know whether the category is taxed or not. For any particular time-location-tax status combination we want the average price of all products.

To make the summary table, we will use the tidyverse package again:

table_test <- PoSd %>%
group_by(year,month,location,tax) %>%
summarize(avg.price = mean(price)) %>% spread(location,avg.price) %>%
print()

## # A tibble: 78 x 5
## # Groups:   year, month [39]
##     year month tax       Berkeley Non-Berkeley
##    <dbl> <dbl> <chr>        <dbl>          <dbl>
##  1  2013     1 Non-taxed     5.72           5.35
##  2  2013     1 Taxed         8.69           7.99
##  3  2013     2 Non-taxed     5.81           5.36
##  4  2013     2 Taxed         8.65           8.18
##  5  2013     3 Non-taxed     5.86           5.42
##  6  2013     3 Taxed         8.82           8.19
##  7  2013     4 Non-taxed     5.86           5.64
##  8  2013     4 Taxed         9.02           8.25
##  9  2013     5 Non-taxed     5.79           5.18
## 10  2013     5 Taxed         8.68           7.76
## # ... with 68 more rows


You can see that the data is not quite in the right shape yet, because we need to separate the Taxed from the Non-taxed data.

tax_table <- subset(table_test,tax == "Taxed")
ntax_table <- subset(table_test,tax == "Non-taxed")


#### Plot a line chart

Now we can create a line chart. Before we do this, we will convert the Berkeley and Non-Berkeley columns of both datasets to time series data using the ts function, which will make plotting a little easier. We then start by plotting tax_table$Berkeley and subsequently add lines for the three other variables. # Note for below: Use inverted commas ('') to refer to the 'Non-Berkeley' variable, since R interprets the hyphen as a minus sign. tax_table$Berkeley <- ts(tax_table$Berkeley, start=c(2013,1), end=c(2016,3), frequency=12) tax_table$'Non-Berkeley' <- ts(tax_table$'Non-Berkeley', start=c(2013,1), end=c(2016,3), frequency=12) ntax_table$Berkeley <- ts(ntax_table$Berkeley, start=c(2013,1), end=c(2016,3), frequency=12) ntax_table$'Non-Berkeley' <- ts(ntax_table$'Non-Berkeley', start=c(2013,1), end=c(2016,3), frequency=12) plot(tax_table$Berkeley, col = "deepskyblue4", lwd=2, ylab = "Average price", xlab = "Time", ylim=c(4, 12) )
title("Average price of taxed and non-taxed beverages \n in Berkeley and non-Berkeley areas") # \n creates a line break.
lines(tax_table$'Non-Berkeley', col="deeppink",lwd=2) lines(ntax_table$Berkeley,
col="darkgreen",lwd=2)
lines(ntax_table\$'Non-Berkeley',
col="darkorange",lwd=2)
abline(v = 2015.1, col = "grey") # Add vertical lines
abline(v = 2015.3, col = "grey")
text(2015.8,4,"Post-tax")
legend(2013.1, 12, legend=c("Taxed (Berkeley)", "Taxed (non-Berkeley)", "Non-taxed (Berkeley)", "Non-taxed (non-Berkeley)"),col=c("deepskyblue4", "deeppink", "darkgreen", "darkorange"), lwd=2,lty=1, cex=0.8)


Average price of taxed and non-taxed beverages in Berkeley and non-Berkeley areas.

Figure 3.6 Average price of taxed and non-taxed beverages in Berkeley and non-Berkeley areas.

It is important to know whether the observed differences between Berkeley and non-Berkeley prices are statistically significant or not. According to the journal paper, when comparing the mean Berkeley and non-Berkeley price of sugary beverages after the tax, the p-value is less than 0.01, but it is greater than 0.05 for non-sugary beverages after the tax.

1. Based on a cut-off (significance level) of 5%, what can we conclude about the difference in means and the effect of the sugar tax on the price of sugary beverages? (Hint: You may find the discussion in Part 2.3 helpful.)

The aim of the sugar tax was to decrease consumption of sugary beverages. Figure 3.7 shows the mean number of calories consumed and the mean volume consumed before and after the tax. The researchers tested whether the difference in means before and after the tax were statistically significant or not, and reported the p-values in the last column.

Usual intake Pre-tax (Nov–Dec 2014),
n = 623
Post-tax (Nov–Dec 2015),
n = 613
Pre-tax–post-tax difference
Mean 95% CI Mean 95% CI
Caloric intake (kilocalories/per capita/day)
Taxed beverages 45.1 29.4, 60.7 38.7 23.0, 54.4 −6.4, p = 0.56
Non-taxed beverages 115.7 87.6, 142.5 116.3, 178.9 31.9*, p = 0.04
Volume of intake (grams/capita/day)
Taxed beverages 121.0 78.7, 163.3 97.0 56.5, 137.4 −24.0, p = 0.24
Non-taxed beverages 1,839.4 1,692.7, 1,986.1 1,896.5 1,742.3, 2,050.8 57.1, p = 0.22
Models account for age, gender, race/ethnicity, income level, and educational attainment. n is the sample size at each round of the survey after excluding participants with missing values on self-reported race/ethnicity, age, education, income, or monthly intake of sugar-sweetened beverages.

* Statistically significant difference in mean per capita intake between pre-tax and post-tax values, p < 0.05.

Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugar-sweetened beverages in Berkeley, California, US: A before-and-after study.

Figure 3.7 Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugar-sweetened beverages in Berkeley, California, US: A before-and-after study.

Lynn D. Silver, Shu Wen Ng, Suzanne Ryan-Ibarra, Lindsey Smith Taillie, Marta Induni, Donna R. Miles Jennifer M. Poti, and Barry M. Popkin. 2017. Table 1 in ‘Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugar-sweetened beverages in Berkeley, California, US: A before-and-after study’. PLoS Med 14 (4): e1002283.

1. Based on Figure 3.7, was there a statistically significant change in consumption behaviour in Berkeley after the tax (at a 5% significance level)? Suggest some reasons why or why not.
1. Read the ‘Limitations’ in the ‘Discussions’ section of the paper and discuss the strengths and limitations of this study. How could future studies on the sugar tax in Berkeley address these problems? (Some issues you may want to discuss are: the number of stores observed, number of people surveyed, and the reliability of the price data collected.)
1. Suppose that you have the authority to conduct your own sugar tax natural experiment in two neighbouring towns, Town A and Town B. Outline how you would conduct the experiment to ensure that any changes in outcomes (prices, consumption of sugary beverages) are due to the tax and not due to other factors. (Hint: think about what factors you need to hold constant.)
1. Lynn D. Silver, Shu Wen Ng, Suzanne Ryan-Ibarra, Lindsey Smith Taillie, Marta Induni, Donna R. Miles, Jennifer M. Poti, and Barry M. Popkin. 2017. S3 Table in ‘Changes in prices, sales, consumer spending, and beverage consumption one year after a tax on sugar-sweetened beverages in Berkeley, California, US: A before-and-after study’. PLoS Med 14 (4): e1002283.