Better understanding the linear model

Rob Davies

Department of Psychology, Lancaster University

2024-02-26

PSYC122: Classes in weeks 16-20

  • My name is Dr Rob Davies, I am an expert in communication, individual differences, and methods

Tip

Ask me anything:

  • questions during class in person or anonymously through slido;
  • all other questions on discussion forum

Week 17: Better understanding the linear model

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The trend is indicated by a thick red line. In the background, a fan of light pink lines indicate possible ways to capture the trend of the association.

Objectives: 2. Strengthen your practice and build your independence

  • In PSYC121 and PSYC122, you have learned about working with data
  • In PSYC122, so far: you have been introduced to correlations and linear models
  • Our job now is to deepen and broaden your skills

Picture shows a group of climbers on a snow field, standing near some rocks. In the background, there is a mountain peak and blue cloudless skies.

flickr, Magryciak ‘Great weekend’

Targets for weeks 16-19: Concepts

We are working together to develop concepts:

  1. Week 16 — Hypotheses, measurement and associations
  2. Week 17 — Predicting people using linear models
  3. Week 18 — Everything is some kind of linear model
  4. Week 19 — The real challenge in psychological science

Targets for weeks 16-19: Skills

We are working together to develop skills:

  1. Week 16 — Visualizing, estimating, and reporting associations
  2. Week 17 — Using data to predict people
  3. Week 18 — Going deeper on linear models
  4. Week 19 — Evaluating evidence across multiple studies

Learning targets for this week

  • Skills – Understand how to code: lm(mean.acc ~ SHIPLEY)
  • Concepts – To answer questions like: Is comprehension success influenced by vocabulary knowledge?

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The trend is indicated by a thick red line.

Figure 1: Scatterplot showing the potential association between accuracy of comprehension and vocabulary scores

Learning targets for this week

  • We will learn how to do analysis in the context of a live research project: the health comprehension project
  1. code linear models
  2. identify and interpret model statistics
  3. critically evaluate the results
  4. communicate the results

Thinking about relationships in psychological science

We often want to know about relationships

  • Does variation in X predict variation in Y?
  • What are the factors that influence outcome Y?
  • Is a theoretical model consistent with observed behaviour?

Now consider our research aims in the context of the health comprehension project

  1. Because public health impacts depend on giving people information they can understand
  2. We want to know: What makes it easy or difficult to understand written health information?

flickr: Sasin Tipchair ‘Senior woman in wheelchair talking to a nurse in a hospital’

Health comprehension project: questions and analyses

  1. We want to know: What makes it easy or difficult to understand written health information?
  2. So our research questions are:
  • What person attributes predict success in understanding?
  • Can people accurately evaluate whether they correctly understand written health information?
  3. These kinds of research questions can be answered using methods like linear models

Context: Individual differences theory of comprehension success

  • Understanding text depends on (1.) language experience and (2.) reasoning ability (Freed et al., 2017)
The figure presents a diagram: two boxes, labelled language experience and reasoning capacity, each with an arrow pointing to a box labelled comprehension outcome.
Figure 2: Hypothesized predictors of comprehension

The measurement context: Where the data come from

  • We measure reading comprehension: asking people to read text and then answer multiple choice questions
  • We measure background knowledge: vocabulary knowledge (Shipley); health literacy (HLVA)

Reflect: The kinds of critical evaluation questions you can ask yourself

  • Are multiple choice questions good ways to probe understanding? What alternatives are there?
  • Are tests like the Shipley good measures of language knowledge? What do we miss?

Reflect: As we move into thinking about the data analysis, we need to identify our assumptions

  1. validity: that differences in knowledge or ability cause differences in test scores
  2. measurement: that this is equally true across the different kinds of people we tested
  3. generalizability: that the sample of people we recruited looks like the population

We need to think about the derivation chain

The figure presents a diagram of the derivation chain: concept formation (shown alongside the causal model) leads to measurement; measurement and auxiliary assumptions both feed into statistical predictions; statistical predictions lead to testing hypotheses.
Figure 3: The derivation chain, introduced in week 16

Questions, assumptions, predictions

Link: concepts, questions \(\rightarrow\) assumptions \(\rightarrow\) testable predictions

  1. concepts, questions: Can people accurately understand health guidance? \(\rightarrow\)
  2. assumptions: People who know more about language should also present more accurate understanding \(\rightarrow\)
  3. testable predictions: Higher levels of vocabulary should be associated with higher levels of comprehension accuracy: we expect to estimate a positive coefficient

One way of thinking about the association is to visualize it

  • For each value of the predictor vocabulary
  • Does the value of the outcome accuracy
  • Increase or decrease?

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The trend is indicated by a thick red line.

Figure 4: The association between comprehension accuracy and vocabulary

Let’s take a break

  • End of part 1

Predicted association as expected change in average outcome

  • Figure 5 shows the distribution curve of mean (comprehension) accuracy scores observed at each value of vocabulary
  • You can see that the middle – the average – of each distribution increases
  • as we go from left (low scores) to right (high scores) on vocabulary

The figure presents a plot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. In the plot, the points are shown in dark red and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. Ridges are superimposed on the points, shown in dark grey and red. We show the distribution curve of mean (comprehension) accuracy scores observed at each value of vocabulary. You can see that the middle -- the average -- of each distribution increases as we go from left (low scores) to right (high scores).

Figure 5: Association between accuracy and vocabulary

How do we estimate the association between two variables?

model <- lm(mean.acc ~ SHIPLEY, 
            data = clearly.one.subjects)
summary(model)
  1. Specify the lm function and the model mean.acc ~ SHIPLEY
  2. Specify what data we use data = clearly.one.subjects
  3. Get the results summary(model)
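If you want to try the same three steps without the project data, they work on simulated data too. A minimal sketch (the variable names mirror the project's, but the values here are invented for illustration):

```r
# Simulate data shaped like the project's (values are made up)
set.seed(122)
sim.subjects <- data.frame(SHIPLEY = sample(25:40, 100, replace = TRUE))
sim.subjects$mean.acc <- 0.45 + 0.01 * sim.subjects$SHIPLEY +
  rnorm(100, mean = 0, sd = 0.1)

model <- lm(mean.acc ~ SHIPLEY,      # 1. the lm function and the model formula
            data = sim.subjects)     # 2. the data we use
summary(model)                       # 3. the results
```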

The sentence structure of models in R

Take a good look:

lm(mean.acc ~ SHIPLEY ...)

You will see this sentence structure in coding for many different analysis types

  • method(outcome ~ predictors)
  • method could be aov, brm, lm, glm, glmm, lmer, t.test, cor.test
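To see the shared sentence structure in action, here is a hedged sketch using R's built-in mtcars data; each function below takes the same outcome ~ predictors formula:

```r
lm(mpg ~ wt, data = mtcars)            # linear model
aov(mpg ~ factor(cyl), data = mtcars)  # analysis of variance
t.test(mpg ~ am, data = mtcars)        # two-group t-test
cor.test(~ mpg + wt, data = mtcars)    # correlation test (both variables on the right)
```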

Results: How does the outcome vary in relation to the predictor?


Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06
  • A model summary gives us estimates of:
  • The coefficient \(= 0.44914\) for the intercept
  • The coefficient \(= 0.01050\) for the slope of the SHIPLEY ‘effect’

These coefficients build a line

  • The line represents:
  • our prediction for how the outcome varies on average
  • given change in the predictor

So now we need to think about straight lines

  1. You may remember from school that to draw a straight line you need two numbers, the intercept and the slope:

\[y = a + bx\]

  2. We calculate the height \(y\) by adding
  • \(a\) the intercept, the value of y when \(x = 0\)
  • to the product of \(b\) the coefficient for the slope of the line
  • multiplied by \(x\) the value of the predictor variable
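The calculation above can be written out directly in R; the numbers here are chosen only for illustration:

```r
a <- 0.5    # intercept: the value of y when x = 0
b <- 0.01   # slope: the change in y for a one-unit change in x
x <- 20     # a value of the predictor
y <- a + b * x
y           # 0.7
```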

Let’s draw it

  • Look at what we get if we draw the line using the linear model coefficients:
  • \(= 0.449\) for the intercept, \(a\)
  • \(= 0.011\) for the slope, \(b\)
  • In the formula: \(y = 0.449 + 0.011x\)
  • (I round the numbers to three decimal places.)

The figure presents a line indicating the predicted association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. In the plot, higher vocabulary scores are predicted to be associated with higher accuracy scores. Model prediction of change in outcome is shown with a red line.

Figure 6: The predicted association between comprehension accuracy and vocabulary

We can understand the line as representing a set of predictions

  • To see how — we use the coefficients to predict just one potential outcome:
  • the expected accuracy for someone with a vocabulary score of 20
  • We do this using the formula:

\(\text{predicted y} = 0.449 + \text{0.011 } \times \text{ Shipley score of } 20\)

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. We predict the value of comprehension accuracy given a Shipley vocabulary score of 20: the point is shown in red at about mean accuracy of 0.66.

Figure 7: Predicted outcome in red

We can understand the line as representing a set of predictions

  • Let’s expand our predictions
  1. Predict accuracy given a Shipley score of 20
  • \(y = 0.449 + 0.011 \times 20\)
  2. Predict accuracy given a Shipley score of 30
  • \(y = 0.449 + 0.011 \times 30\)
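Because R arithmetic is vectorized, both predictions can be computed in one line; with a fitted model object, predict() gives the equivalent answers:

```r
shipley <- c(20, 30)
0.449 + 0.011 * shipley   # 0.669 0.779
# With a fitted model, the equivalent is:
# predict(model, newdata = data.frame(SHIPLEY = shipley))
```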

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. We predict the value of comprehension accuracy given Shipley vocabulary scores of 20 and 30.

Figure 8: Predicted outcome in red

The linear model allows us to predict the average outcome we can expect given any value of the predictor

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. We predict the value of comprehension accuracy given Shipley vocabulary scores of 20 and 30. A blue line is drawn through the linear model predicted trend.

Figure 9: The predicted change in mean comprehension accuracy, given variation in vocabulary scores

Let’s take a break

  • End of part 2

We could draw a variety of different model prediction lines: how do we pick the right one?

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The trend is indicated by a thick red line. In the background, a fan of light pink lines indicate possible ways to capture the trend of the association.

We could draw a variety of different model prediction lines: how do we pick the right one?

We need to go back to the prediction model

  • We calculated a predicted outcome value as: \(\text{predicted y} = 0.449 + 0.011 \times \text{ Shipley score } 20\)
  • Assuming the linear model \(\text{predicted y} = intercept + \text{slope } \times \text{ vocabulary}\)
  • But we missed a bit: error

Linear models are typically estimated given sample data

  • Maybe you noticed that I talked about how the model allows us to predict
  • how the outcome varies on average given different values of the predictor
  • When we use a linear model to estimate the intercept and slope – to build the predictions – we fit a model to the sample data
  • And no model will fit sample data perfectly

Linear models are typically estimated given sample data

Usually, this means there are differences between the expected outcomes that the model predicts and the observed outcomes

  • So we often write the linear model like this: \(y = a + bx + \epsilon\)
  1. The observed outcome \(y\) equals
  2. the intercept \(a\)
  3. plus the difference associated with a specific predictor value \(bx\)
  4. plus some amount of mismatch or error \(\epsilon\), the difference between the observed outcome and the predicted outcome
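The four parts can be checked on any fitted model: the residuals are exactly the observed outcomes minus the fitted (predicted) values. A sketch using R's built-in mtcars data:

```r
model <- lm(mpg ~ wt, data = mtcars)  # any simple linear model
eps <- residuals(model)               # the epsilon term, one value per observation
all.equal(unname(eps),
          mtcars$mpg - unname(fitted(model)))  # TRUE: observed minus predicted
```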

We can derive the formulas used to calculate the estimates using calculus

  • But we won’t
  • Because the linear model calculations are done using matrix solution algorithms in R so we don’t have to

What do the prediction errors – the residuals – look like?

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 10: The predicted change in mean comprehension accuracy, given variation in vocabulary scores. Observed values are shown in orange-red. Predicted values are shown in blue

What the prediction errors look like

  • The model expectation: higher vocabulary predicts higher mean comprehension accuracy
  • The predicted points are shown by the blue line
  • The prediction line increases in height for higher values of vocabulary
  • Look at the differences in height between the observed points (in orange-red) and predicted points

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 11: The predicted change in mean comprehension accuracy, given variation in vocabulary scores. Observed values are shown in orange-red. Predicted values are shown in blue

What the prediction errors look like

  • If the regression model were perfect then all the observed points would lie on the prediction line
  • They do not

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 12: The predicted change in mean comprehension accuracy, given variation in vocabulary scores. Observed values are shown in orange-red. Predicted values are shown in blue

What the prediction errors look like

  • Differences between observed and predicted outcomes are shown by the vertical lines
  • Better models should show smaller differences between observed and predicted outcome values
  • Notice: some participants had the same vocabulary scores but different outcomes

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 13: The predicted change in mean comprehension accuracy, given variation in vocabulary scores. Observed values are shown in orange-red. Predicted values are shown in blue

We typically assume that the residuals are normally distributed

  • Some are positive: observed outcome larger than predicted outcome
  • Some are negative: observed outcome smaller than predicted outcome
  • The average of the residuals will be zero overall
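You can verify the zero-average property on any model fitted with lm(); as long as the model includes an intercept, the residuals average to (numerically) zero:

```r
model <- lm(mpg ~ wt, data = mtcars)  # example fit on built-in data
round(mean(residuals(model)), 10)     # 0
```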

The figure presents a histogram of the residuals, the prediction errors, for the linear model of the association between mean comprehension accuracy and vocabulary. The histogram is shown in grey, and the peak is centered at residuals = 0. A dashed red line is drawn at residuals = 0. A red density curve is superimposed on the histogram to indicate the theoretical normal distribution of residuals.

Figure 14: Plot showing the distribution of prediction errors – residuals – for the linear model of comprehension accuracy

So: We pick the line that minimizes the residuals – the mismatch between predicted and observed outcomes

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in grey, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The trend is indicated by a thick red line. In the background, a fan of light pink lines indicate possible ways to capture the trend of the association.
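One way to see what "minimizing the residuals" means is to score candidate slopes by their sum of squared residuals; lm() returns the coefficients that make this sum as small as possible. A sketch on built-in data, holding the intercept fixed at its least-squares value for simplicity:

```r
fit <- lm(mpg ~ wt, data = mtcars)
# Sum of squared residuals for a candidate slope b
sse <- function(b) {
  sum((mtcars$mpg - (coef(fit)["(Intercept)"] + b * mtcars$wt))^2)
}
# The middle candidate is the least-squares slope: its sum is the smallest
sapply(c(-7, coef(fit)["wt"], -4), sse)
```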

Let’s take a break

  • End of part 3

Identifying the key information in the linear model results

  • The summary() of the linear model shows …
  • The Estimate of the Coefficient of the effect of individual differences in vocabulary (SHIPLEY)
  • how much the outcome mean.acc value changes, given differences in SHIPLEY score
  • Associated t value and Pr(> |t|) statistics for the coefficient t-test
  • Model fit statistics: R-squared and F-statistic

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

For the effect of vocabulary (SHIPLEY), we have:

  • The coefficient for the slope of the effect of variation in vocabulary scores: 0.01050
  • The Std. Error (standard error) 0.00229 for that estimate
  • The t value 4.585 and associated Pr(>|t|) p-value 8.85e-06 for the null hypothesis test of the coefficient

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

Identifying the key information in the results

  • Pay attention to the sign and the size of the coefficient estimate:
  • Is the coefficient (e.g., SHIPLEY 0.01050) a positive or a negative number?
  • Is it relatively large or small?
  • We come back to this, shortly, in the context of interpretation and reporting

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

The t-tests in the linear model

\[t = \frac{\beta_j}{s_{\beta_j}}\]

  • For each coefficient, the t-test is used to evaluate if the coefficient \(\beta_j\) is significantly different from zero
  • We assume the null hypothesis that the coefficient \(\beta_j\) is zero
  • We do the test by comparing the estimated coefficient \(\beta_j\) with the standard error of the estimate
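The ratio can be reproduced from the model summary shown earlier; dividing the reported slope estimate by its standard error recovers the reported t value (to rounding):

```r
b  <- 0.01050     # slope estimate for SHIPLEY
se <- 0.00229     # standard error of that estimate
round(b / se, 3)  # 4.585
```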

The t-tests in the linear model

\[t = \frac{\beta_j}{s_{\beta_j}}\]

  • The standard error \(s_{\beta_j}\) indicates our uncertainty about the estimate
  • Larger standard errors represent greater uncertainty

The t-tests in the linear model

\[t = \frac{\beta_j}{s_{\beta_j}}\]

  • Standard errors can be calculated using information about:
  • Error in the model — think of the distribution of residuals
  • Variation of values in the predictor — how widely they range
  • The sample size
  • Standard errors will be smaller in models that describe outcomes better (smaller residuals), where predictor values vary more widely, and in larger samples

Identifying the key information in the results

  • Pay attention to R-squared:
  • The model summary gives us the Multiple R-squared and Adjusted R-squared
  • These numbers represent how much of the variation in the outcome can be predicted by the model
  • We usually report Adjusted R-squared because it tends to be more accurate

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

R-squared – what is it? – an indicator of the proportion of outcome variation we can predict

  • Better models should show smaller differences between observed and predicted outcomes
  • R-squared (\(R^2\)) gives the proportion of outcome variance
  • we can predict given information about differences in vocabulary

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 15: The difference between predicted and observed outcomes, given variation in vocabulary

R-squared as an indicator of the proportion of the outcome variation we can predict

  • To understand what this means, look at the scatterplot
  • On average, values in outcome (accuracy) increase with increasing values in the predictor (vocabulary)
  • But different people got different outcomes even with the same vocabulary scores

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 16: The difference between predicted and observed outcomes, given variation in vocabulary

R-squared as an indicator of the proportion of the outcome variation we can predict

  • So: we have variation in the outcome that is related to variation in the predictor
  • And: we have variation in the outcome that seems unrelated to the predictor
  • \(R^2\) tells us how much variation in the outcome is explained by the model
  • \(R^2\) gives us a proportion where \(R^2 = \frac{\text{predicted outcome variation}}{\text{total outcome variation}}\)
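The proportion can be computed by hand and checked against summary(): with an intercept in the model, the variance of the fitted values divided by the variance of the outcome equals Multiple R-squared. A sketch on built-in data:

```r
model <- lm(mpg ~ wt, data = mtcars)
r2 <- var(fitted(model)) / var(mtcars$mpg)  # predicted / total outcome variation
all.equal(r2, summary(model)$r.squared)     # TRUE
```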

The figure presents a scatterplot indicating the association between variables mean accuracy (on y-axis) and vocabulary (x-axis) scores. The points are shown in different shades of orange to red, and clustered such that higher vocabulary scores tend to be associated with higher accuracy scores. The predicted trend is indicated by a thick blue line. Predicted outcomes, given different sample values of vocabulary are circled in black along the blue line. Light grey lines indicate the difference between predicted and observed outcomes. The observed points are darker red the further they are from the prediction.

Figure 17: The difference between predicted and observed outcomes, given variation in vocabulary

Identifying the key information in the results

  • Pay attention to F:
  • The model summary gives us the F-statistic:
  • This is the test statistic for the test of the null hypothesis that the model does not predict the outcome

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

Reporting the results of a linear model

  • You will need to report three bits of information:
  1. \(R^2\) how much outcome variation is explained by the model
  2. \(F\) test for the null hypothesis that none of the predictors actually predict the outcome
  3. Coefficient estimates with the t-tests for the null hypothesis for each coefficient

Call:
lm(formula = mean.acc ~ SHIPLEY, data = clearly.one.subjects)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42871 -0.04921  0.02079  0.07480  0.18430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44914    0.08053   5.577 9.67e-08 ***
SHIPLEY      0.01050    0.00229   4.585 8.85e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1115 on 167 degrees of freedom
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1065 
F-statistic: 21.03 on 1 and 167 DF,  p-value: 8.846e-06

The language and style of reporting linear model results

Here is an example of results reporting text that is conventional:

We fitted a linear model with mean comprehension accuracy as the outcome and vocabulary (Shipley) as the predictor. Our analysis indicated a significant effect of vocabulary knowledge. The model is significant overall, with \(F(1, 167) = 21.03, p < .001\), and explains 11% of variance (\(\text{adjusted } R^2 = 0.11\)). The model estimates showed that the accuracy of comprehension increased with increasing levels of participant vocabulary knowledge (\(\beta = .011, t = 4.59, p <.001\)).

Look at what we do with the text

  1. Explain what I did, specifying the method (linear model), the outcome variable (accuracy) and the predictor variable (vocabulary)
  2. Report the model fit statistics overall (\(F, R^2\))
  3. Report the significant effects (\(\beta, t, p\)) and describe the nature of the effects (does the outcome increase or decrease?)

We fitted a linear model with mean comprehension accuracy as the outcome and vocabulary (Shipley) as the predictor. Our analysis indicated a significant effect of vocabulary knowledge. The model is significant overall, with \(F(1, 167) = 21.03, p < .001\), and explains 11% of variance (\(\text{adjusted } R^2 = 0.11\)). The model estimates showed that the accuracy of comprehension increased with increasing levels of participant vocabulary knowledge (\(\beta = .011, t = 4.59, p <.001\)).

Summary

  • In psychological science, we often ask questions like:
  1. Does variation in X predict variation in Y?
  2. What are the factors that influence outcome Y?
  3. Is a theoretical model consistent with observed behaviour?
  • We can answer these questions using the linear model
  • Given sample data, we can predict the average difference in outcome values, for different levels of a predictor variable
  • We (or the math engine R uses) calculate the predictions so that they minimize the residuals, the errors of prediction or the mismatch between predicted and observed outcomes
  • Our results report tells the reader about the model and the estimated effects

End of week 17

References

Freed, E. M., Hamilton, S. T., & Long, D. L. (2017). Comprehension in proficient readers: The nature of individual variation. Journal of Memory and Language, 97, 135–153. https://doi.org/10.1016/j.jml.2017.07.008