Hi, it's Margriet Groen here. In this video I will talk you through what the linear model is and how we can use linear regression to build such a model. More specifically, we will be looking at what a regression line, or line of best fit, is, and how we can specify such a line by looking at its intercept and its slope. We'll also be looking at the formula that specifies the regression line and its different components. Then we'll be talking about residuals: what they are, what they tell you about the model, and how you can use them. We'll briefly review different types of regression. We'll be looking at the assumptions that are relevant in the context of regression, and finally we will be looking at how you can measure how well your model fits, by using R-squared. So, let's start with an example. Hundreds of studies have found that frequent words are comprehended faster than infrequent words. In this figure you see the response durations from a linguistic study conducted as part of the English Lexicon Project by Balota et al. (2007). This study uses something called a lexical decision task, in which the participant sees a word and is asked to decide whether it is an English word or not. The Y axis extends from about 400 milliseconds (that's 2/5 of a second) to 1000 milliseconds (which is 1 second), and longer response durations mean that participants responded more slowly; shorter response durations mean that they responded faster. Now, the word frequencies on the X axis are taken from a large corpus study, and the frequency of a word is a measure that tells you how often that word occurs in the language. Word frequency is represented on a logarithmic scale here, but don't worry about that; for understanding the basics of regression it really doesn't matter what scale it is. So the relationship between response duration and word frequency is neatly summarized by a line. This is the regression line, and it represents the average response duration for different frequency values. Now, generally you specify regression in the direction of assumed causality. That is, we expect word frequency to affect response durations rather than the other way around. But it is important to remember the mantra that says: correlation does not imply causation, because a regression model cannot tell you whether there actually is a causal relationship. OK, now some terminology. The variable plotted on the Y axis is referred to as the response variable or the outcome variable; other people talk about the dependent variable. On the X axis we plot our predictor, which is also referred to as the independent variable, and sometimes the explanatory variable or the regressor. They all refer to the same thing, but different people use different words. OK. Mathematically, lines are represented in terms of intercepts and slopes. Let's talk about slopes first. In the case of the word frequency effect on the previous slide, you saw that the slope of the line is negative: as word frequency values increase, response durations decrease. In contrast, a positive slope goes up, so as X increases, Y increases as well. The figure on the left of this slide shows two slopes that differ in sign. One line has a slope of plus 1/2, and the other one, the one that goes down, has a slope of minus 1/2. The slope is defined as the change in Y (delta Y) over the change in X (delta X). Sometimes the mnemonic 'rise over run' is used to remember this formula: how much do you have to rise on the Y axis for a specified run, say from unit 1 to unit 2, along the X axis?
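If it helps to see that definition in code, here is a minimal sketch in Python; the two points are hypothetical, chosen only to illustrate 'rise over run'.

```python
# Slope as 'rise over run': the change in Y divided by the change in X.
# The two points below are hypothetical, purely to illustrate the formula.
x1, y1 = 1.0, 810.0   # (word frequency, response duration in ms)
x2, y2 = 2.0, 740.0

slope = (y2 - y1) / (x2 - x1)   # delta Y over delta X
print(slope)                    # -70.0: negative, so Y decreases as X increases
```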
Now, in our word frequency example, the slope turns out to be minus 70. So for each increase in word frequency by 1 unit, the predicted response duration decreases by 70 milliseconds. Now, to specify a line, you need a second piece of information, and that is its intercept. The figure on the right of the slide shows two lines with the same slope but different intercepts. You can think of the intercept informally as the point where the line starts on the Y axis. So this first line has an intercept close to 1, and this second line has an intercept of about 3. The intercept is the predicted Y value when X is 0. For the word frequency effect data, this happens to be 880 milliseconds, and that was represented with the white square in the previous slide. Now, once you know the intercept and the slope of a line, there can only be one line; it is completely specified. This is summarized in the following formula: Y = β0 + β1 · X. Here we have the dependent variable Y, and that equals the intercept (beta zero) plus the predictor multiplied by its weight (beta one). So this bit, β1, is the slope, and this bit, β0, is the intercept. Let's have a look at our word frequency effect example; you can see that same graph here on the right, with the slope and the intercept. The dependent variable or response variable, response duration, equals the intercept beta zero, which is 880 milliseconds, plus the slope of minus 70 milliseconds per unit increase in frequency, multiplied by the word frequency. Now you might ask: why do I need to know that, or what use is that? Well, you can make predictions using this formula. If we want to know how long a participant needs to decide whether something is a word or not, we can predict that using this model, provided we know the word frequency of that particular word. So let's take the word 'script'. That word does not occur in the original data set, so it would be a prediction about a new word. But if we know the word frequency of 'script', the equation will tell us what it predicts the response duration to be. In the case of 'script', its word frequency is 3, so we put that into the formula alongside the intercept and the slope. If we do the sums, we end up with a prediction of 880 − 70 × 3 = 670 milliseconds, so that is the time we think it will take a participant to decide whether or not 'script' is an English word. This prediction is called a fitted value, as it results from fitting a regression model to the data set. In fact, all the points along the regression line are called fitted values.
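For those following along in Python, here is a minimal sketch of that prediction, using the intercept and slope from the example; the function name is just for illustration.

```python
# Prediction from the regression formula: Y = beta0 + beta1 * X,
# with the intercept and slope from the word frequency example.
beta0 = 880.0   # intercept: predicted response duration (ms) at frequency 0
beta1 = -70.0   # slope: change in response duration (ms) per frequency unit

def predict_duration(word_frequency):
    """Fitted value: the predicted response duration for a given frequency."""
    return beta0 + beta1 * word_frequency

print(predict_duration(3))   # 670.0 ms: the prediction for 'script'
```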
Now let's look at residuals. Usually, the regression model doesn't fit any of the data points perfectly, and the extent to which the model is wrong for any specific data point is quantified by the residuals. The residuals are the vertical differences of the observed values from the regression line; you can see the residuals for the different data points in the figure. The observed values here, above the regression line, have positive residuals, and the observed values here, below the regression line, have negative residuals. The actual numerical values represent how much the predictions would have to be adjusted upwards or downwards to reach each observed value. So the relationship between fitted values, observed values and residuals can be summarized in the following way: the residuals equal the observed values minus the fitted values. Now that you know about residuals, the general form of a regression line can be completed by adding what is known as an error term, 'e': Y = β0 + β1 · X + e. Essentially, you can think of the regression equation as being composed of two parts. One part is the deterministic part, which allows you to make predictions for the mean of Y given a certain value of X; that is this part, β0 + β1 · X. We saw that in the example with word frequency, where we put in the word frequency for the word 'script' and got a value for the response duration. It is deterministic in the sense that for a particular value of X it will always give you the same value for Y. The second part is the error term 'e'. This is what is called the stochastic part of the model, and it messes with those predictions: your predictions are basically never going to be perfect. OK, now what is linear regression? Basically, it is a statistical method that is used to create a linear model. And there are different types. There is simple linear regression, for models with only one predictor. Then there is multiple linear regression, where the model has multiple predictors. There is also something called logistic regression, which models a categorical response variable, for instance whether you passed an exam or not. And then there is multivariate linear regression, where you model multiple response variables at once. In the remainder of this video we will be looking at simple linear regression. So, statistical models rely on assumptions, and regression is no exception in that regard. All claims made on the basis of a model are contingent on satisfying its assumptions to some reasonable degree. We'll talk about assumptions in more detail elsewhere, but here I'd like to focus on this: for regression, the assumptions discussed are actually about the error. That is, they relate to the residuals of the model. If the model satisfies the normality assumption, its residuals are approximately normally distributed. If the model satisfies the constant variance assumption, the spread of the residuals should be about equal while moving along the regression line. That is known as homoscedasticity. Sorry, really difficult word. Anyway, in the figures here on the slide you can see examples of what residuals look like if they do not meet these assumptions. On the right here, we see a clear violation of the normality assumption. If we look at this histogram, it reveals a positive skew: there are very few extreme values here, and most of the values are really bunched up here. So if we have a regression line here, and here we have all our observed values, you can see that the scatter around this line is not nicely, randomly distributed like a cloud; there is a clear skew. It is important to emphasize that the normality assumption in the case of regression is about the residuals, not about the response or dependent variable. It is possible that a model of a skewed response variable has normally distributed residuals. You always need to look at the residuals, not at the distribution of the response measure.
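If you want to check the normality assumption on your own model, a minimal sketch in Python might look like this; the data here are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data, for illustration only: a linear trend plus normal noise.
x = rng.uniform(0, 10, size=200)
y = 880 - 70 * x + rng.normal(0, 25, size=200)

beta1, beta0 = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
fitted = beta0 + beta1 * x
residuals = y - fitted                   # residuals = observed - fitted

# Inspect the residual distribution: bin counts roughly symmetric around
# zero are consistent with the normality assumption; a long tail on one
# side would suggest the kind of skew shown on the slide.
counts, _ = np.histogram(residuals, bins=10)
print(counts)
```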
Now, in the figure on the left, you see a clear violation of the constant variance assumption. In this case the residuals are larger for larger X values: you can see that there is more spread here, on this side of the regression line, than there is here. It's kind of fanning out. These residuals are therefore not homoscedastic but heteroscedastic, and that is a problem. Here in this figure you can see some more examples of data where the residuals meet or do not meet the constant variance assumption. On the left: this is what constant variance looks like. There is no clear pattern, and that is what we want; it indicates that the residuals meet the assumption of constant variance, or in other words, they are homoscedastic. In the middle, here in blue, we have a situation where the residuals show a kind of bow-tie-shaped pattern: less spread, so smaller residuals, here in the middle, and larger residuals, more spread, here on both sides. Again, this is not constant variance, so they are heteroscedastic. And here we have the fan-shaped pattern that we saw before, in green, again violating the constant variance assumption. Now, the residuals are also useful for creating a measure of the goodness of fit of a model. Once you've fitted your regression model, it is useful to know how well that model actually fits the observations. So what do you think: will a well-fitting model have large residuals or small residuals? Maybe think about that for a second. A well-fitting model has small residuals. Let's get back to our word frequency model, here on the right of the slide. The closer all these different observations are to the line, the smaller the residuals, because you might remember that the residuals equal the observed values minus the fitted values. Now, to assess the goodness of fit of our model with word frequency as a predictor of response duration, we compare it to a model that does not include that predictor. That is the model over here. That model includes only the intercept and is therefore also referred to as the null model. The slope of the regression line is zero in this case, because it does not include a predictor, and a line with zero slope is horizontal. OK. So where our original regression line formula has the intercept, the slope and the error term, the null model only has an intercept and the error term. You can see that the observed values in the null model are a lot further away from the regression line; here we have the regression line, and you can see that these residuals are larger. So it is worse than the model with the word frequency predictor. To get an overall measure of fit, or 'misfit', the residuals can be squared and summed, and that gives you something called the sum of squared errors, or SSE. We can do that for the model with the word frequency predictor, and in that case, if you do the sums, you end up with an SSE of 42,609. Now, without context, that number is pretty meaningless. It is unstandardized, and it changes depending on the metric of the response measure. But the null model can be used to put the SSE of the main model into perspective: it can be used to compute a standardized measure of model fit. And that standardized measure is R-squared. This is what it looks like: take the SSE of the main model, divide it by the SSE of the null model, and subtract that from 1. So R-squared = 1 − SSE(main) / SSE(null). If you do the sums, we get an R-squared of .72.
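Here is a minimal sketch of that computation in Python, continuing with simulated data; the numbers are for illustration, not the values from the lexical decision study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data again, for illustration only.
x = rng.uniform(0, 10, size=200)
y = 880 - 70 * x + rng.normal(0, 25, size=200)

# Main model: intercept plus slope, fitted by least squares.
beta1, beta0 = np.polyfit(x, y, deg=1)
sse_main = np.sum((y - (beta0 + beta1 * x)) ** 2)

# Null model: intercept only; by least squares that intercept is the mean
# of y, so the null model's regression line is horizontal.
sse_null = np.sum((y - y.mean()) ** 2)

# R-squared: 1 minus the ratio of the two sums of squared errors.
r_squared = 1 - sse_main / sse_null
print(r_squared)   # close to 1 here, because the simulated noise is small
```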
So what does that number tell you? It can be conceptualized as how much variance is described by the model. In this case, 72% of the variation in response durations in the lexical decision task can be accounted for by including word frequency in the model. On the other hand, 28% of the variation is due to chance, or due to factors that you've omitted, that is, factors not included in the model. R-squared is actually a measure of effect size. It ranges from 0 to 1; you can have any value in between, and values closer to 1 indicate that the model fits better, which also shows you that there is a stronger effect. That is illustrated in this figure. A regression line, or regression model, that looks like this would have an R-squared of approximately .3, so whatever the predictor is, it would account for about 30% of the variation in Y. As the model fit becomes better, the distances between the observed values and the fitted line become smaller here, and even smaller here, as we get closer to 1. OK, so just to summarize: we talked about the mathematical specification of the regression line using the intercept and slope of the line. We looked at the regression formula. Then we talked about residuals and how residuals are basically the observed values minus the fitted values. We looked at different types of regression. I also discussed the assumptions: residuals need to be normally distributed and they need to show constant variance, that is, they need to be homoscedastic. And finally, we looked at R-squared, which uses the residuals of the null model to standardize the residuals of the main model, and that provides an effect size measure which tells you what proportion of variance in the dependent variable can be accounted for by the predictor in the main model. Thank you very much for your attention.