Welcome to this video about correlation. We will be talking about what a correlation is, that is, the correlation coefficient, and about a particular graph that visualizes the relationship between two variables, known as a scatterplot or scattergram. Then we'll be looking at different types of correlation by interpreting the correlation coefficient and the scatterplot. After that there is a part that is slightly more theoretical, because we will be looking at how you can derive the correlation coefficient from something called covariance. After that we will move on to hypothesis testing and the coefficient of determination, and we'll finish today's lecture by looking at why correlation does not imply causation. This is a mantra that I'll keep repeating over these next few weeks: just because two variables are associated doesn't mean that a change in one variable causes the other variable to change. Now, in a separate video I will talk you through how to conduct a correlation analysis in R and also how to report the results following APA guidelines.

OK, what is a correlation? It really is a measure of the relationship between two continuous or numeric variables. By numeric or continuous we mean variables that are measured on the interval or the ratio scale. Examples are height and weight. You can put a number on those: we can measure somebody's weight, and that number means something in relation to other numbers on that scale. So, does height relate to weight? Is seminar absence related to WBA score? Or does the number of cats that you own relate to how violent you are? Some of these examples may sound a little silly, but all of these questions can be tested using correlation analysis. In other words: if something happens to X, what happens to Y? That is what you get an answer to if you do a correlation analysis.

OK, the scatterplot. We assess the relationship between two variables visually via a scatterplot, or scattergram. You can see what it looks like on this slide here. Each single point in this graph indicates the two values for one individual on the two variables of interest. So this participant here, for instance, at the end of this arrow, scores about 23 on the variable on the X axis and about 60 on the variable on the Y axis. Now, the first step in a correlation analysis is constructing and interpreting a scatterplot. Two aspects of the relationship between our two variables of interest are reflected in the scatterplot: the first one is the strength of the relationship and the second one is the direction of the relationship. Here we have six different scatterplots. The strength of the correlation can be visually interpreted from a scatterplot by looking at the proximity of the scores to the line of best fit, that is, how close these points are to this line. And the direction of the correlation can be visually interpreted from the scatterplot by looking at the direction of the line. Does it go up, as in here, or does it go down, as in here? So let's maybe start with the example here at the bottom right of the slide. As discussed, the scores are very close to the line of best fit, which suggests it is a strong correlation, and the line starts in the top left and goes down to the bottom right, which suggests that the correlation is negative. Now another example: the graph on the top right shows scores that fall further away from the line of best fit.
So there's more scatter around this line, but overall you can still see that there is a positive relationship, because it starts at the bottom left and goes to the top right. If X increases, so does Y; that's why we call it a positive relationship. Now the graph in the top row in the middle shows scores that fall even further away from the line of best fit, which is indicative of a weaker relationship, and here the line goes down again, indicative of a negative relationship. So a weak negative relationship would look like this. Finally, the graph in the top row here on the left shows a situation where there is a null correlation. It's basically just a cloud of dots without a clear direction. The use of scatterplots is a really good example of the importance of visualizing your data. Graphs are not just some end product or a pretty addition to your report. They allow us to familiarize ourselves with the data and identify the distribution of the data and the initial relationships. They also allow us to identify outliers, and we'll talk a bit more about that next week.

OK, the statistic that we use to quantify a linear relationship between two variables is called Pearson's product moment correlation coefficient, or Pearson's r. This is a number that ranges from minus one to one, so it falls anywhere between minus one and one, and the same two aspects that are visible in the scatterplot are also reflected in the correlation coefficient. The value of the correlation coefficient (ignoring the plus or minus sign) reflects the strength of the relationship: the closer to one or minus one, the stronger the correlation. The sign reflects the direction of the relationship: a positive coefficient reflects a positive correlation and a negative coefficient reflects a negative correlation. A coefficient close to 0 reflects a null correlation. Now let's look at some extreme examples of different types of correlation. Here we have a positive correlation: when X increases, Y also increases. The line starts at the bottom left of the plot and goes up to the top right corner, so this suggests that a higher score on X (in this case height) is associated with a higher score on weight. This is actually a perfect positive correlation, because all the dots fall exactly on the line of best fit, and that's why r = 1. This doesn't usually happen in real life, and we'll look at some more realistic scatterplots later on and in the lab. Another type of correlation is the negative correlation. In a negative correlation, as X increases, Y decreases. The line starts in the top left corner of the plot and goes down to the bottom right. This suggests that a higher score on the variable on the X axis is associated with a lower score on the variable on the Y axis. So we would say: as seminar absence increases, WBA score decreases, so seminar absence is negatively correlated with WBA score. And here again we have a perfect negative correlation, with an r of minus one. The perfect positive and negative correlations are really the two extremes. Now, when the correlation is null, so r is zero or close to zero, an increase in the variable on the X axis isn't really associated with a consistent change in the variable on the Y axis. So as the number of cats owned increases, the level of violence doesn't change in any systematic way.
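As a minimal sketch of what constructing such a scatterplot might look like in R (the height and weight data here are simulated for illustration, not real figures from the lecture):

```r
# Simulated example: height (cm) and weight (kg) for 50 people
set.seed(42)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 0.8 * height - 70 + rnorm(50, sd = 8)  # weight loosely tracks height

# Scatterplot with a line of best fit
plot(height, weight,
     xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "Scatterplot of height and weight")
abline(lm(weight ~ height))  # add the line of best fit
```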
Now here we have those scatterplots again, and this time you can also see the correlation coefficients that characterize these relationships. The strong negative correlation at the bottom right here has an r of minus .99. The data in the graph in the top row on the right here have an r of .5, and in this one r is minus .3, so note the negative sign, which indicates that the correlation is negative. Now, you might have heard of the term effect size. An effect size indicates how relevant an effect is, and Pearson's r is an effect size in itself. There is a rule of thumb which is useful when describing the results of your correlation analysis. The rule of thumb suggests that correlations bigger than .5 are referred to as large or strong, correlations between .3 and .5 are referred to as medium or moderate, and correlations between .1 and .3 are referred to as small or weak. Now, as is often the case with rules of thumb, there are different opinions about how useful they are, but I think when you're starting out in psychology, as you are, it is a useful way of describing correlations. These rules of thumb obviously apply to both negative and positive correlation coefficients.

Now, we've looked at the scatterplots and at the statistic Pearson's r. Let's now have a look at how you actually derive this correlation coefficient. If you think back to last term's statistics (PSYC121), you were taught how to derive a measure of variance: how much variation in scores there is around the mean in a sample. You do so using this formula. You have the score of a participant on X. From that you subtract the mean score on that variable across the sample, and you square that number. You do that for every participant in the sample, you add up all those totals, and you divide by the number of participants in the sample minus one. That is how you compute the variance of the sample. And related to variance is something called covariance, which we need for calculating the correlation coefficient. The covariance is the extent to which two variables vary together. So instead of multiplying the score by itself, as you do here by squaring it, you multiply it by the score on the Y variable. We multiply it by the other variable, which gives us this formula. So let's have a look at that. To compute the covariance, you take a participant's score on variable X and subtract the mean of X across the sample, and you do the same for Y: you take the participant's score for Y and subtract the mean for Y across the sample. You then multiply these two numbers, you do that for each participant, and then you add up all those numbers (that's the sigma sign here) and divide the outcome by N minus 1 (N being the number of participants). OK, let's look at an example. Here we have data for 15 participants, visible here in the rows, on four different variables, here in the columns. In the first column we have job performance scores, in the second column we have IQ scores, in the third column we have motivation scores, and in the fourth column we have scores for social support. So participant one scores 85 on job performance, 109 on IQ, 89 on motivation and 73 on social support. Down here we have the means and standard deviations we calculated earlier for two of these variables. So if you want to compute the covariance between job performance and motivation, you would do the following. We have the formula here at the top.
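Written out (reconstructed here from the verbal descriptions above, with N the sample size and X-bar, Y-bar the sample means), the sample variance and the covariance formulas are:

$$ s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} $$

$$ \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1} $$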
So for participant one: participant one scores 85 on job performance, our first variable. From that we subtract the mean of job performance across the sample, which is 78. We do the same for Y: participant one scores 89 on motivation, and from that we subtract the mean on motivation across the sample, which is 67. If we do the sums, we end up with a number of 154 for participant one. You then do this for each participant: participant two, etc., etc. Then you add up all of those totals and you divide by N minus 1. Now, this is how you compute covariance by hand. Of course, we wouldn't usually do it by hand, but it is helpful to understand the principle. Just keep in mind that, unlike variance, covariance may have a positive or a negative value, because we're not squaring anything, basically; it can be positive or negative, indicating the direction of the relationship. That maps onto the positive and negative correlation coefficients that we looked at earlier. So for our example of the covariance between job performance and motivation, we get a number of 69.24.

From this covariance we'll derive Pearson's r. Now you might ask: why do we need Pearson's r? Isn't the covariance enough? The thing is that the size of the covariance is affected by the size of the variances of the variables in the sample. Depending on how big those are, the covariance could be really big or really small, and that makes it very difficult to compare covariances across different measures or across different samples. The correlation coefficient Pearson's r improves that situation by dividing by the standard deviations. Standard deviation is something that you might remember from last term. So you could say that Pearson's r is a standardized version of the covariance. Down here you can see how it works. To calculate Pearson's r, we divide the covariance by the standard deviation of X multiplied by the standard deviation of Y. If we do that (here we have our covariance and here we have the standard deviations that we calculated earlier), we end up with an r of .63. Now, as I mentioned previously, Pearson's r can range from minus one to one, making it really easy to interpret. As I said, the covariance could have any value depending on the variances in the sample, but r is always between minus one and one. So the important bits to remember from this part are that the covariance tells you to what extent one variable changes when the other variable changes, and that the correlation coefficient is derived from the covariance by standardizing it, so you end up with a number that falls between minus one and one.
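Here is a minimal R sketch of the calculation just described: the covariance, then Pearson's r as the covariance divided by the product of the two standard deviations. The data vectors are made up for illustration (only five participants rather than fifteen), so the numbers won't match the slide.

```r
# Made-up scores for five participants (illustrative only)
jobperf    <- c(85, 72, 80, 91, 66)  # job performance (X)
motivation <- c(89, 51, 70, 83, 42)  # motivation (Y)

n <- length(jobperf)
covariance <- sum((jobperf - mean(jobperf)) * (motivation - mean(motivation))) / (n - 1)
r <- covariance / (sd(jobperf) * sd(motivation))  # standardize the covariance

# Check against R's built-in functions
covariance; cov(jobperf, motivation)
r;          cor(jobperf, motivation)
```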
OK. After that bit of theory, we'll move on to hypothesis testing. You learned about hypothesis testing when comparing group means during last term's statistics. When doing correlation analysis, it works as follows. You might remember that when you do hypothesis testing you have a null hypothesis and an alternative hypothesis. Here, the null hypothesis is that there is no correlation between the two variables of interest. And for us to conclude that the correlation is significant, it needs to be bigger (in absolute terms) than the so-called critical value. So let's have a look at this. If you're doing a correlation analysis by hand and you've calculated r, you can then look at a table of critical values, such as the one displayed here on this slide, to check whether that correlation coefficient is significant given the size of the sample it was calculated from. So let's have a closer look at that table. The first column here lists the different sample sizes, so you'd go down to the value closest to the size of your sample. For example, for our correlation between job performance and motivation, let's say that our sample size was 25. We go down to this row here. The correlation coefficient you calculated is significant if it falls between these values: the two leftmost columns are for negative coefficients and the two rightmost columns are for positive coefficients. Our r was positive, so let's have a look at this one. If it falls between .4 and 1 for a sample of 25 participants, you can conclude it is significant. And ours did: it was .63, and we had a sample size of 25, so it falls between these numbers, meaning it is deemed significant. That means such a correlation is sufficiently unlikely under the null hypothesis for us to reject the null hypothesis and conclude that job performance and motivation are positively correlated.

Now, as we saw in the previous slide, whether or not a correlation coefficient is deemed significant depends on the sample size, and it is important to keep that in mind. In this graph here we can see sample size on the X axis, from zero to 250, and the correlation coefficient on the Y axis, and this line indicates the relationship between these two things. If you only have 10 participants, so somewhere down here, a correlation coefficient needs to be bigger than .6 to be significant at the p < .05 level. If you measure the same two variables but have 100 participants, the correlation would only have to be bigger than .2. That holds for both positive and negative coefficients, so ignore the plus or minus sign for these examples. It means that if you have a very big sample, a small correlation like .2 or .1 will be deemed significant. The thing to keep in mind is that even though it is significant, which leads us to conclude that there is a relationship between these two variables, it is important to incorporate our interpretation of the coefficient in our conclusion, because the value of the coefficient will tell us how relevant the correlation is, right? If that number is very small, it might be significant, but we will have to think about whether it actually is relevant. So in addition to the correlation coefficient, the coefficient of determination is helpful in determining the relevance of a significant correlation. The coefficient of determination is derived from the correlation coefficient and tells us the proportion of variance in one variable that can be accounted for by the other variable. To calculate the coefficient of determination, or R-squared as it's also called, you simply square the correlation coefficient. Because it's squared, R-squared is always positive: it doesn't matter whether your coefficient is positive or negative, if you square it, it always becomes positive. So R-squared, or the coefficient of determination, is always a positive number. In our example, we had an r of .63. If we square that, we get an R-squared, or coefficient of determination, of .4. This tells us that 40% of the variance in job performance is accounted for by variance in motivation, giving us more information to interpret that relationship.
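In practice you wouldn't use the table of critical values by hand; the R analysis (covered in the separate video) might look something like the sketch below, reusing the same made-up vectors as the earlier sketch. cor.test() reports r and a p value, and squaring the estimate gives the coefficient of determination.

```r
# Made-up scores for five participants (same illustrative vectors as above)
jobperf    <- c(85, 72, 80, 91, 66)
motivation <- c(89, 51, 70, 83, 42)

# Significance test for the correlation
test <- cor.test(jobperf, motivation)
print(test)  # shows r, the t statistic, degrees of freedom, and the p value

# Coefficient of determination: square the correlation coefficient
r_squared <- unname(test$estimate)^2
r_squared  # proportion of variance in one variable accounted for by the other
```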
OK, we're almost there. Why does correlation not imply causation? Correlation tells you whether there is a relationship between two variables. It doesn't tell you anything about the direction of that relationship, in the sense that you can't conclude that if X goes up, it causes Y to go up. It might be associated with Y going up, in the case of a positive correlation, but we don't know whether one thing causes the other or the other way around. The only way to infer causation is to do an experiment in which you manipulate an independent variable and measure what that does to your dependent variable. Only in that case can you say that the manipulation caused any variation observed in the dependent variable. For example, let's say we want to understand how children learn to read (something that I'm interested in). In a group of children we observed a significant positive correlation between how much time they spend reading every day and their reading proficiency. From that correlation we can tell that there is a relationship, but we can't tell whether spending more time reading causes reading proficiency to go up, or whether children with higher reading proficiency choose to spend more time reading. It could be either. If we do an experiment, we can figure this out, right? Imagine we have a measure of how accurately children read a list of words. This is our measure of reading proficiency; that is our dependent variable. We then divide the sample into two groups. One group continues with normal classroom reading instruction, and the other group receives an intervention in which they read a book at their reading level together with an older child who reads fluently. After six weeks we measure the children's reading proficiency again, and if the children in the group who received the intervention improved more than the group of children who received the standard classroom instruction, then we can conclude that spending more time reading causes reading proficiency to go up.

OK, so let's look at another example of a significant positive correlation and why we cannot infer causation. Here we have a scatterplot with the number of ice creams sold on the X axis here and the number of shark attacks on the Y axis here. This would be a relatively strong positive correlation, as the line goes up and the individual dots are relatively close to that line of best fit. That suggests that shark attacks are related to ice cream sales. Now, does that mean that eating ice cream causes you to be attacked by a shark? No. In this scenario it is much more likely that there is a third factor, or third variable, that affects both measures. Hot weather results in more people swimming in the sea, which makes it more likely that people are attacked by a shark. Hot weather also leads to people eating more ice cream. So this third factor, hot weather, influences both shark attacks and ice cream sales. The bottom line is that you need to be careful in how you interpret a significant correlation and really think about what a significant correlation means.
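To make that third-variable point concrete, here is a small simulated sketch (entirely made-up numbers, not data from the lecture): temperature drives both ice cream sales and shark attacks, so the two end up correlated even though neither causes the other.

```r
# Simulate a third variable (hot weather) driving two others
set.seed(1)
temperature   <- runif(100, 10, 35)                          # daily temperature in C
ice_creams    <- 20 + 5 * temperature + rnorm(100, sd = 15)  # sales rise with heat
shark_attacks <- 0.1 * temperature + rpois(100, 1)           # more swimmers, more attacks

# Ice cream sales and shark attacks correlate...
cor(ice_creams, shark_attacks)
# ...but only because temperature influences both
```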
OK, to summarize: we looked at correlation as a measure of the relationship between two numeric variables. We looked at the scatterplot and how it is useful to construct one before the correlation analysis, to interpret the relationship and check certain assumptions, which we will be talking about next week. We talked about Pearson's correlation coefficient. We looked at hypothesis testing and at how the coefficient of determination helps you interpret the relationship, and we looked at why we shouldn't confuse correlation with causation: you always need to think about what else could be influencing the correlation. OK, thank you very much for your attention, and that's it for now.