# How to calculate correlation and draw a scatterplot
00:00:00,980 --> 00:00:14,728
Correlation is a measure of how closely related two variables are. So we say two variables are correlated if knowing the value of one variable provides information on the likely value of the other.
00:00:14,728 --> 00:00:40,159
If this is the case in our data a plot of one value against the other will have a definite pattern and we'll see what that pattern is. In this exercise this is what our monthly CO2 data looks like: column A has the year, column B the months which are numbered 1 to 12 - that means 1 is January and 12 is December, column C to E are different measures of CO2 emissions.
00:00:40,159 --> 00:00:50,790
So to do this we're going to use column E which is trend data to make a scatterplot that compares CO2 emissions and temperature anomalies for month 1 - that's January.
00:00:50,790 --> 00:01:06,180
To do this exercise you're going to need to have two spreadsheets open at the same time. The first is the CO2 data that we're looking at, and the other one is the temperature anomaly spreadsheet. Note that some of the data in your project might have been updated so it won't match what you see on the screen here.
00:01:06,180 --> 00:01:18,270
Don't worry all the steps are exactly the same even if your result might be slightly different. Now go back to the CO2 data. First we need to filter it so it shows only January.
00:01:18,270 --> 00:01:38,520
In Windows that means you click 'Sort and Filter' in the top right, then 'Filter'. Click on the arrow for the month and you will see the values 1 to 12. Unclick the top box to deselect all of the months and then just click 1, hit Enter and you can see that not all the rows are visible anymore.
00:01:38,520 --> 00:01:41,796
We are just looking at the data now for January each year.
00:01:44,014 --> 00:01:51,780
if you're working on a Mac the Box looks a little bit different, so here it is but as you can see the process is basically the same.
00:01:56,520 --> 00:02:01,697
Remember the data for the other months is still there, the filters have just hidden it.
00:02:01,697 --> 00:02:22,320
Column E has the trend data that we need. Select all the cells in Column E, click and drag on the numbers only to highlight them all the way to the bottom of the column. In Windows Control+C copies that data to the clipboard and for Mac users that's Command+C. Now switch to the temperature anomaly data.
00:02:22,320 --> 00:02:46,469
We need to filter this as well so that the range of years matches. The CO2 data begins in 1959, so we need to filter between 1959 in the top box and 2017 in the bottom box. Here's what it looks like in Windows and again for our Mac users here's what it's going to look like for you.
00:02:57,485 --> 00:03:46,130
The data just copied should have the same number of rows as the data on the worksheet. Click on the first empty column, level with the first row of numbers, type Control+V in Windows - that's Command+V on a Mac of course, and paste the data. Type a header for the column so that other people can see what your data is and scroll down just to check that the number of rows matches. I'm going to minimize the CO2 data so that we can concentrate on this spreadsheet because it's the only one we're going to use from now on. The next thing we're going to do is to make a chart called a scatterplot from the data. First we need to select the January temperature data and then hold down the Control key - Mac users that's the Command key, and select the CO2 data as well Both should be highlighted.
00:03:46,130 --> 00:04:26,180
If you're not used to doing this it can be a bit tricky and take a couple of goes at first. Click on the 'Insert' menu on the top left because we're going to insert a chart in the middle of the toolbar that pops up when you click insert there's a gallery of chart types pick the scatterplot icon - it has little dots on it, and select the first chart. The basic scatter plot should pop up in the middle of your data. If you are making a report you'd move it. Let's leave it there for now and at quick look it, it looks like the data is correlated because when CO2 is higher so is the temperature.
00:04:26,180 --> 00:05:03,370
But it's not very clear at the moment and we can make this easier to look at by adjusting the scale of the vertical axis. To do this, position the cursor over the vertical axis and Windows users click the right button on the mouse - Mac users you have to hold down the Control key this time while you click the mouse button. A menu should pop up hovering over the chart. Select 'Format axis' and click on the axis options. The minimum bound is Excel's name for the point at which the vertical axis crosses the horizontal axis. None of our data points are lower than 250 so we can type that into the box.
00:05:03,370 --> 00:05:40,712
Close the access options box and you will see that the scatterplot is much clearer now. But at the moment no one else in looking at this chart would know what it was - even you are going to struggle to remember it yourself when you open it up next week. So always, always, always label your chart and both of the axes. To do this click on the chart, click 'Design' and you'll see placeholders for the axes in the title. We're going to type short titles for speed but it's always better to be precise - for example use years, locations, or data sources when you label your charts.
00:05:44,199 --> 00:05:56,628
Our last step is to calculate the correlation. To do this we're going to use a statistic called the correlation coefficient. It tells us how close the data is to a straight line on a scatterplot.
00:05:56,628 --> 00:06:02,010
The correlation coefficient ranges in value from -1 to 1. If the coefficient is 1, then the data is a straight line sloping upwards and if it's -1 it is a straight line sloping downwards, and if it's 0 then the variables are probably not related to each other.
00:06:12,720 --> 00:06:41,659
Looking at our scatterplot we can see something that's a bit like a straight line going upwards and so we should not expect the correlation to be 0 and we definitely wouldn't expect it to be negative but Excel offers a formula to calculate the correlation exactly. Select an empty column, label the cell at the top so that we know that the number is the correlation and in the first blank row type this formula: =CORREL and an open bracket and now we need the arguments.
00:06:41,659 --> 00:06:54,170
First select the temperature anomaly data we using which is column B. Add a comma, select the CO2 data and close the bracket.
00:06:56,990 --> 00:07:07,070
This is what the formula should look like. You can see how it represents the two data series. You can always type these cell references directly into the formula.
00:07:09,240 --> 00:07:29,400
So the last thing to do is to see what it's calculated for our correlation and as you can see the number is positive and it's close to 1, as we expected. Remember when calculating statistics like this always check that the outcome seems to match what you see on the chart. It's a simple step to make sure that you haven't accidentally selected the wrong data.