PSYC122-w19-how-to
Introduction
In Week 19, we aim to further develop skills in working with the linear model.
We do this to learn how to answer research questions like:
- What person attributes predict success in understanding?
- Can people accurately evaluate whether they correctly understand written health information?
In this class, what is new is our focus on critically evaluating the ways in which people, effects or results can vary using interaction analyses.
- This work simulates the kinds of interaction analyses that psychologists must often do in professional research.
Naming things
I will format dataset names like this:
all.data.csv
I will also format variable (data column) names like this: variable
I will also format value or other data object (e.g. cell value) names like this: all.data
I will format functions and library names like this: e.g. function ggplot() or e.g. library {tidyverse}.
The data we will be using
In this how-to guide, we use data from:
(1.) two 2020 studies of the response of adults from a UK national sample to written health information
study-one-general-participants.csv
study-two-general-participants.csv
(2.) along with data recently collected from PSYC122 students.
These data have been joined together to create a dataset of over 400 observations. This is because it often requires a lot of evidence to be able to draw precise or accurate conclusions about potential interaction effects.
Answers
Step 1: Set-up
To begin, we set up our environment in R.
Task 1 – Run code to empty the R environment
rm(list = ls())
Task 2 – Run code to load relevant libraries
library("ggeffects")
library("sjPlot")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::set_theme() masks sjPlot::set_theme()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Step 2: Load the data
Task 3 – Read in the data file we will be using
The data file is called:
all.data.csv
Use the read_csv() function to read the data file into R:
all.data <- read_csv("all.data.csv")
Rows: 478 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): participant_ID, study, GENDER, EDUCATION, ETHNICITY
dbl (6): mean.acc, mean.self, AGE, SHIPLEY, HLVA, FACTOR3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Task 4 – Inspect the data file
Use the summary() function to take a look at the dataset.
summary(all.data)
participant_ID mean.acc mean.self study
Length:478 Min. :0.3600 Min. :2.250 Length:478
Class :character 1st Qu.:0.7500 1st Qu.:6.200 Class :character
Mode :character Median :0.8300 Median :7.200 Mode :character
Mean :0.8066 Mean :6.967
3rd Qu.:0.9000 3rd Qu.:7.910
Max. :1.0000 Max. :9.000
AGE SHIPLEY HLVA FACTOR3
Min. :18.00 Min. :19.00 Min. : 2.000 Min. : 9.00
1st Qu.:21.00 1st Qu.:31.00 1st Qu.: 7.000 1st Qu.:46.00
Median :29.00 Median :35.00 Median : 9.000 Median :50.00
Mean :32.85 Mean :34.19 Mean : 8.929 Mean :49.84
3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:10.000 3rd Qu.:55.00
Max. :76.00 Max. :40.00 Max. :14.000 Max. :63.00
GENDER EDUCATION ETHNICITY
Length:478 Length:478 Length:478
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Notice that we conducted a series of studies using the same online survey methods.
- Our data include the responses made by PSYC122 students.
Step 3: Work with different kinds of variables
It is an important skill in psychological data analysis to be able to:
(1.) identify the different kinds of data variables we are working with; (2.) adapt our methods depending on differences in kind (variable Class or type).
We learn about these skills, next, in exercises that combine some revision with some new moves.
Revise: consolidate what you know
Task 5 – Use summaries or plots to examine variables
- Q.1. What is the mean of the `AGE` variable in the dataset?
- A.1. You can use the `summary()` function to calculate that the mean `AGE` is 32.85.
- Q.2. Draw a histogram of the age distribution.
- A.2. You can write the code as you have been shown to do previously (e.g., in week 16):
ggplot(data = all.data, aes(x = AGE)) +
geom_histogram(binwidth = 1) +
theme_bw() +
labs(x = "Age (years)", y = "frequency count")
- Q.3. Examine the age distribution: what does it tell you about the ages of the participants in the sample?
- A.3. The histogram shows that most participants are around 20 years in age but there is a “long tail” of small numbers of older participants.
Extend: make some new moves
We can tell from the summary and the distribution that the AGE variable consists of numbers (participant ages in years).
Another way to identify what kind of variable we have is by using a check question:
is.factor(all.data$AGE)
[1] FALSE
is.numeric(all.data$AGE)
[1] TRUE
The is._() family of functions acts like a set of identity checks: what kind of variable is this? Each function asks a question with a TRUE or FALSE answer.
- Is this variable a factor? [TRUE or FALSE]
- Is this variable numeric? [TRUE or FALSE]
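As a minimal sketch of these checks (using toy vectors invented for illustration, not the course dataset):

```r
# Toy example vectors (assumed for illustration; not the course data)
age <- c(18, 21, 29, 43)
education <- c("Further", "Higher", "Higher", "Further")

is.numeric(age)          # TRUE: age holds numbers
is.factor(age)           # FALSE: age is not a factor
is.character(education)  # TRUE: education holds text

# as.factor() changes the character variable into a factor
education <- as.factor(education)
is.factor(education)     # now TRUE
levels(education)        # "Further" "Higher"
```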
Revise: consolidate what you know
Task 6 – Identify (and change) what kind of variable – numeric or factor – a variable is
Hint: Task 6 – The is._() functions are available to ask what kind a variable is
Hint: Task 6 – The as._() functions can be used to change variable types
- Q.4. What kind of variable is `EDUCATION`?
- A.4. Use the same functions but change the variable name:
is.factor(all.data$EDUCATION)
[1] FALSE
is.numeric(all.data$EDUCATION)
[1] FALSE
- Q.5. What do the `is._()` functions tell us?
- A.5. The function calls tell us that `EDUCATION` is not a factor and is not numeric.
- Q.6. If you look at the `summary()` results, what does it tell you the `Class` of `EDUCATION` is?
- A.6. The summary shows that `EDUCATION` is identified as `Class: character`.
You can check this by doing the `is._()` test again.
is.character(all.data$EDUCATION)
[1] TRUE
- Q.7. How do you change the class or type of the variable `EDUCATION`? Change the variable’s type (`Class`) into a factor.
- A.7. The `as._()` functions can be used to change variable type, as you have done previously (e.g., in week 16).
all.data$EDUCATION <- as.factor(all.data$EDUCATION)
- Q.8. What does a `summary()` tell us about `EDUCATION`?
summary(all.data)
participant_ID mean.acc mean.self study
Length:478 Min. :0.3600 Min. :2.250 Length:478
Class :character 1st Qu.:0.7500 1st Qu.:6.200 Class :character
Mode :character Median :0.8300 Median :7.200 Mode :character
Mean :0.8066 Mean :6.967
3rd Qu.:0.9000 3rd Qu.:7.910
Max. :1.0000 Max. :9.000
AGE SHIPLEY HLVA FACTOR3
Min. :18.00 Min. :19.00 Min. : 2.000 Min. : 9.00
1st Qu.:21.00 1st Qu.:31.00 1st Qu.: 7.000 1st Qu.:46.00
Median :29.00 Median :35.00 Median : 9.000 Median :50.00
Mean :32.85 Mean :34.19 Mean : 8.929 Mean :49.84
3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:10.000 3rd Qu.:55.00
Max. :76.00 Max. :40.00 Max. :14.000 Max. :63.00
GENDER EDUCATION ETHNICITY
Length:478 Further:224 Length:478
Class :character Higher :254 Class :character
Mode :character Mode :character
- A.8. The summary shows how many participants reported themselves as belonging to each of the `EDUCATION` level groups in the corresponding survey question.
You should see counts of the numbers of people in different EDUCATION level groups in the total survey sample:
EDUCATION Further:224 Higher :254
We can see that 224 participants reported in the survey that they had attained a Further level of education (corresponding to A-level or Sixth Form).
254 participants reported a Higher level of education (corresponding to university level).
Step 4: Now draw plots to examine associations between variables and to visualize interactions
Here, we are going to build on previous work to use plots to:
(1.) Examine the associations between pairs of variables, an outcome and a predictor (2.) Examine how the association between two variables can be different for different levels of a third variable
Scenario (2.) comes up if we are thinking about or working with interaction effects.
Consolidation: practice to strengthen skills
Task 7 – Create a scatterplot to examine the association between two numeric variables
Hint: Task 7 – We are working with geom_point() and you need x and y aesthetic mappings.
Run a chunk of code to make the plot.
ggplot(data = all.data, aes(x = HLVA, y = mean.acc)) +
geom_point() +
theme_bw() +
labs(y = "Accuracy of understanding (mean.acc)",
x = "Health literacy (HLVA)") +
xlim(0, 16) + ylim(0, 1)
This plot shows:
- the possible association between x-axis variable `HLVA` and y-axis variable `mean.acc`.
The plot code moves through the following steps:
- `ggplot(...)` make a plot;
- `ggplot(data = all.data, ...)` with the `all.data` dataset;
- `ggplot(...aes(x = HLVA, y = mean.acc))` using two aesthetic mappings:
- `x = HLVA` map `HLVA` values to x-axis (horizontal, left to right) positions;
- `y = mean.acc` map `mean.acc` values to y-axis (vertical, bottom to top) positions;
- `geom_point()` show the mappings as points;
- `theme_bw()` changes the theme to a white background;
- `labs(y = "Accuracy of understanding (mean.acc)", x = "Health literacy (HLVA)")` changes axis labels;
- `xlim(0, 16) + ylim(0, 1)` changes axis limits.
Questions: Task 7
- Q.9. What do you notice about the distribution of `mean.acc` scores at increasing values of health literacy (`HLVA`) scores?
- A.9. The scatterplot shows that, for increasing health literacy (`HLVA`) scores, there is an associated rise in accuracy of understanding (`mean.acc`) scores.
Introduce: make some new moves
In the next sequence of exercises, we are going to take the all.data dataset and split it into parts (sub-sets) in different ways.
- We split the dataset into different sub-sets of rows, that is, into different sub-sets of observations.
Why are we learning how to do this?
The capacity to do this kind of split-analyse operation is extremely useful.
We are going to do this so that we can examine how the effects of interest to us (here, associations) can vary between different groups of people.
- Remember, observations about different people are on different rows in `all.data`.
- We are going to focus on an association between two column variables (`mean.acc`, `HLVA`).
- We are going to identify different groups (sub-sets) of observations using values on a third variable.
Task 8 – Create a scatterplot to examine the association between two numeric variables, showing how the association is different for different values of a third variable
Hint: Task 8 – We are working with geom_point() again.
Hint: Task 8 – We can show how the association between two variables (mean.acc, HLVA) is different for different values of a third, categorical (factor), variable (EDUCATION) by splitting the plot so that we see the association for different levels of EDUCATION.
Here, we are splitting the dataset into:
- observation data from people with just `EDUCATION` level `Further`
- observation data from people with just `EDUCATION` level `Higher`
and we are then producing a different scatterplot for each sub-set of data produced by the split.
Run the following chunk of code to make the plot.
ggplot(data = all.data, aes(x = HLVA, y = mean.acc)) +
geom_point() +
geom_smooth(method = 'lm') +
theme_bw() +
labs(y = "Accuracy of understanding (mean.acc)",
x = "Health literacy (HLVA)") +
xlim(0, 16) + ylim(0, 1) +
facet_wrap(~ EDUCATION)
`geom_smooth()` using formula = 'y ~ x'

This plot shows:
- the possible association between x-axis variable `HLVA` and y-axis variable `mean.acc`
- for different groups of people – people with `Further` or people with `Higher` levels of `EDUCATION`.
The plot code moves through steps we have seen before. What is new here is this bit:
facet_wrap(~ EDUCATION)
The function facet_wrap() splits the dataset into sub-sets (different parts): different sets of rows, where different sets are defined according to different values of the variable identified inside the brackets (~ EDUCATION).
Here, we are asking for different scatterplots for the dataset split by whether participants are coded as people with Further or people with Higher levels of EDUCATION.
You can see a general guide to the function here:
https://ggplot2.tidyverse.org/reference/facet_wrap.html
Questions: Task 8
- Q.10. What do you notice about the distribution of `mean.acc` scores at increasing values of health literacy (`HLVA`) scores, for different sub-sets of the data: for people with `Further` compared to people with `Higher` levels of `EDUCATION`?
- A.10. You can see that the association between `mean.acc` and `HLVA` scores appears to be similar for different levels of education: the distribution of points is different but the shape is similar overall, and the trend line representing the association between `mean.acc` and `HLVA` looks similar too.
Hint: Task 8 – We can show how the association between two variables (mean.acc, HLVA) is different for different values of a third numeric variable (SHIPLEY) by splitting the dataset
Here, we are first creating a new variable to code for (distinguish between) different sub-sets of observations (different rows in the dataset) based on SHIPLEY scores, and then second drawing different plots based on different subsets.
We do this in two steps, as follows.
First, run a chunk of code to divide (cut) the dataset into parts:
all.data$SHIPLEY_splits <- cut_number(all.data$SHIPLEY, 3)
The code works bit-by-bit like this:
- `all.data$SHIPLEY_splits <-` create a new variable, `SHIPLEY_splits`, and add it to the dataset `all.data`, given the work done by the bit of code on the right of the arrow `<-`.
On the right of the arrow `<-` …
- `cut_number(all.data$SHIPLEY, 3)` uses the `cut_number(...)` function to divide the dataset observations into three sets based on the values in the `all.data$SHIPLEY` variable named inside the brackets.
The function works by ordering all the observations (the rows) in the dataset according to the values in the named variable all.data$SHIPLEY, then splitting the dataset (here, into three sub-sets) based on values in that named variable.
- Given this code, we are going to get three sub-sets of the data, observations corresponding to (1.) low, (2.) mid or (3.) high `SHIPLEY` values.
- We split the data into data about people with:
(1.) low `SHIPLEY` values, scores between 19-33 on the SHIPLEY test; (2.) mid `SHIPLEY` values, scores between 33-37 on the SHIPLEY test; (3.) high `SHIPLEY` values, scores between 37-40 on the SHIPLEY test.
You can see a guide to the function here:
https://ggplot2.tidyverse.org/reference/cut_interval.html
Note that where we split the data (what ranges of scores we use) will depend on the sample.
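To see why the cut points depend on the sample, here is a base R sketch of what `cut_number()` does, using simulated scores (the variable and values are invented for illustration, not taken from the course data):

```r
set.seed(1)
# Simulated test scores (not the course data)
shipley <- sample(19:40, 100, replace = TRUE)

# cut_number(shipley, 3) splits observations into three groups of roughly
# equal size; in base R, that means cutting at the sample tertiles:
breaks <- quantile(shipley, probs = c(0, 1/3, 2/3, 1))
splits <- cut(shipley, breaks = breaks, include.lowest = TRUE)

table(splits)  # counts per group: roughly a third of the observations each
breaks         # the cut points: these change if the sample changes
```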
Second, we draw a plot to examine if the association between two variables (mean.acc, HLVA) is different for different values of a third variable (SHIPLEY) by splitting the data:
ggplot(data = all.data, aes(x = HLVA, y = mean.acc)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(~ SHIPLEY_splits, nrow = 1, labeller = label_both) +
labs(y = "Accuracy of understanding (mean.acc)",
x = "Health literacy (HLVA)") +
theme_bw() `geom_smooth()` using formula = 'y ~ x'

The plot code moves through the scatterplot production steps we have seen before.
The bit that is new is here:
facet_wrap(~ SHIPLEY_splits, nrow = 1, labeller = label_both)
which is included to use the data split constructed earlier.
This addition allows us to show:
- the nature of the association between two variables (`mean.acc`, `HLVA`)
- for different values of the third variable (`SHIPLEY`)
- by splitting the data into three sub-sets according to `SHIPLEY` score (19-33 low, 33-37 mid, 37-40 high).
Questions: Task 8
- Q.11. What do you notice about the distribution of `mean.acc` scores at increasing values of health literacy (`HLVA`) scores, for different sub-sets of the data: for (1.) low, (2.) mid, (3.) high `SHIPLEY` values?
- A.11. You can see that the association between `mean.acc` and `HLVA` scores appears to vary (the slope differs) given different values of the third variable (`SHIPLEY`).
- Q.12. What do you think the values in the grey labels at the top of the facets (the plot panels) tell us?
- A.12. The values tell us what values (e.g., `19`, `33`) are used to identify the range of `SHIPLEY` scores that define the sub-set of observations for which the plot below shows the association between `mean.acc` and `HLVA` scores.
Step 5: Use linear models with multiple predictors, to estimate the effects of factors as well as numeric variables
Consolidation: practice to strengthen skills
As we saw in weeks 17 and 18, we can use linear models to predict outcome variables.
- Linear models can include numeric variables (like `HLVA`) as predictors.
- Linear models can also include categorical variables or factors (like `EDUCATION`) as predictors.
Task 9 – First, fit a linear model including just numeric variables as predictors
Hint: Task 9 – We use the lm() function to fit a model with mean.acc as the outcome and HLVA and SHIPLEY as predictors.
Note that in this model:
- `HLVA` is a measure of health literacy – numeric scores correspond to a test of knowledge of health-related words
- `SHIPLEY` is a measure of vocabulary knowledge – numeric scores correspond to a test of knowledge of vocabulary
model <- lm(mean.acc ~ HLVA + SHIPLEY,
data = all.data)
summary(model)
Call:
lm(formula = mean.acc ~ HLVA + SHIPLEY, data = all.data)
Residuals:
Min 1Q Median 3Q Max
-0.38048 -0.06818 0.01317 0.07672 0.30910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.448687 0.040197 11.162 < 2e-16 ***
HLVA 0.019132 0.002555 7.488 3.43e-13 ***
SHIPLEY 0.005471 0.001252 4.370 1.53e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1127 on 475 degrees of freedom
Multiple R-squared: 0.2003, Adjusted R-squared: 0.1969
F-statistic: 59.48 on 2 and 475 DF, p-value: < 2.2e-16
If you look at the model summary you can answer the following questions.
- Q.13. What is the estimate for the coefficient of the effect of the predictor, `SHIPLEY` (vocabulary knowledge)?
- A.13. 0.005471
- Q.14. Is the effect significant?
- A.14. It is significant: `Pr(>|t|)` equals `1.53e-05` for the `Estimate` of the effect of `SHIPLEY` (vocabulary knowledge); this means that p < .05, and so the effect is significant.
Hint: Task 9 – Do you know how to interpret numbers like 1.53e-05? If you are unsure you can find out about scientific notation here:
https://www.calculatorsoup.com/calculators/math/scientific-notation-converter.php
- Q.15. What are the values for t and p for the significance test for the estimated coefficient of the effect of `HLVA` (health literacy)?
- A.15. t = 7.488, p = 3.43e-13
- Q.16. What do you conclude are the effects of the `SHIPLEY` (vocabulary knowledge) and `HLVA` (health literacy) variables, as predictors of the outcome `mean.acc` (accuracy of understanding of health information)?
- A.16. The model slope estimates suggest that both predictors are significant, and that as both `SHIPLEY` (vocabulary knowledge) and `HLVA` (health literacy) scores increase, so `mean.acc` scores are predicted to increase also.
Task 10 – Second, fit a linear model including a numeric variable and a factor as predictors
Hint: Task 10 – We use the lm() function to fit a model with mean.acc as the outcome and FACTOR3 and EDUCATION as predictors.
Note that in this model:
- `FACTOR3` is a measure of reading strategy – numeric scores correspond to a test of strategy awareness
- `EDUCATION` is a measure of education level – factor levels correspond to self-reported education
model <- lm(mean.acc ~ FACTOR3 + EDUCATION,
data = all.data)
summary(model)
Call:
lm(formula = mean.acc ~ FACTOR3 + EDUCATION, data = all.data)
Residuals:
Min 1Q Median 3Q Max
-0.42828 -0.06774 0.02177 0.09198 0.31055
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6032460 0.0415482 14.519 < 2e-16 ***
FACTOR3 0.0040226 0.0008266 4.866 1.55e-06 ***
EDUCATIONHigher 0.0053666 0.0113014 0.475 0.635
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1229 on 475 degrees of freedom
Multiple R-squared: 0.04889, Adjusted R-squared: 0.04489
F-statistic: 12.21 on 2 and 475 DF, p-value: 6.75e-06
If you look at the model summary you can answer the following questions.
- Q.17. What is the estimate for the coefficient of the effect of the predictor, `FACTOR3` (strategy)?
- A.17. 0.0040226
- Q.18. Is the effect significant?
- A.18. It is significant: `Pr(>|t|)` equals `1.55e-06` for the `Estimate` of the effect of `FACTOR3` (strategy); this means that p < .05, and so the effect is significant.
- Q.19. What are the values for t and p for the significance test for the estimated coefficient of the effect of `EDUCATION` (level, `Further` compared to `Higher`)?
- A.19. t = 0.475, p = .635
- Q.20. Is the effect of `EDUCATION` significant?
- A.20. It is not significant: `Pr(>|t|)` equals `0.635` for the `Estimate` of the effect of `EDUCATION` level; this means that p > .05, and so the effect is not significant.
Hint: Task 10 – How do we know how to interpret estimates for the effects of factors? We discussed how R deals with factors in week 18, and discuss factors, and the effects of factors, in the week 19 lecture.
- Q.21. How should we interpret the coefficient estimate for the effect of `EDUCATION`?
- A.21. Though the effect is not significant, we can say that, on average, the outcome `mean.acc` is 0.0053666 (slightly) higher for people with `Higher` (compared to `Further`) level of education.
- Q.22. What do you conclude are the effects of the `FACTOR3` (strategy) and `EDUCATION` (level, `Further` compared to `Higher`) variables, as predictors of the outcome `mean.acc` (accuracy of understanding of health information)?
- A.22. The model slope estimates suggest that education level does not have a significant effect but that as `FACTOR3` (strategy) scores increase, so `mean.acc` scores are predicted to increase.
Step 6: Use linear models with multiple predictors, including interaction effects
Introduce: make some new moves
As we saw through Task 8, it is possible that:
- the association between two variables (`mean.acc`, `HLVA`) can be different for different values of a third categorical (factor) variable (`EDUCATION`);
- the association between two variables (`mean.acc`, `HLVA`) can be different for different values of a third numeric variable (`SHIPLEY`).
While we can examine these possibilities using visualizations, we often need to conduct statistical analyses to test interaction effects, or the ways in which the association between two variables can be different for different levels of a third variable.
We can extend linear models to include terms that allow us to test or to estimate interaction effects.
Task 11 – Center numeric variables before using them as predictors
Hint: Task 11
When we include numeric variables (e.g., HLVA, SHIPLEY) in linear models, it can cause problems of interpretation or of estimation if we include the variables as they first come to us in datasets (as raw variables).
- It often helps our analyses to center variables on their means.
- Centering a variable means calculating its mean, and then subtracting that mean from every value in the variable column.
For example, to center values of the FACTOR3 variable, we create a new column data$FACTOR3_centered by first calculating the mean of the data$FACTOR3 variable, and then subtracting that mean from every row value in data$FACTOR3.
all.data$FACTOR3_centered <- all.data$FACTOR3 - mean(all.data$FACTOR3, na.rm = TRUE)
- Q.23. Can you check what is going on when we center variables?
- Q.23. Hint: You can see what is happening when you run this code by first calculating the mean for `FACTOR3`:
mean(all.data$FACTOR3, na.rm = TRUE)
[1] 49.841
then looking at column values in the new FACTOR3_centered column, compared to the original FACTOR3 column:
all.data %>%
select(FACTOR3_centered, FACTOR3) %>%
print(n = 20)
# A tibble: 478 × 2
FACTOR3_centered FACTOR3
<dbl> <dbl>
1 10.2 60
2 -3.84 46
3 1.16 51
4 -4.84 45
5 13.2 63
6 6.16 56
7 3.16 53
8 -15.8 34
9 5.16 55
10 10.2 60
11 0.159 50
12 9.16 59
13 11.2 61
14 -9.84 40
15 -6.84 43
16 5.16 55
17 8.16 58
18 -7.84 42
19 -3.84 46
20 -2.84 47
# ℹ 458 more rows
What is the difference between the FACTOR3 and FACTOR3_centered columns?
- A.23. You can see that every value in the `FACTOR3_centered` column equals the value, on the same row, in the `FACTOR3` column, minus the mean for `FACTOR3`.
- Q.24. Can you center the variable `HLVA`?
- A.24. You can use the same code to center `HLVA`:
all.data$HLVA_centered <- all.data$HLVA - mean(all.data$HLVA, na.rm = TRUE)
- Q.25. Can you center the variable `SHIPLEY`?
- A.25. You can use the same code to center `SHIPLEY`:
all.data$SHIPLEY_centered <- all.data$SHIPLEY - mean(all.data$SHIPLEY, na.rm = TRUE)
Why are we learning how to do this?
It is common in data analysis to want to transform variables before using them in analyses.
- Here, we are transforming predictor variables by centering them on their mean values.
The practical reason for doing this transformation now is that if we use uncentered (raw) numeric variables as predictors in linear models that include interactions, the model can find it difficult to distinguish between the effects of those predictors and the effect of their interaction.
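A quick simulation illustrates the problem (the numbers are invented to mimic positive-valued scores like `HLVA` and `SHIPLEY`, not taken from the course data): a raw predictor is strongly correlated with its own product term, and mean-centering largely removes that correlation:

```r
set.seed(122)
# Simulated predictors on positive scales (toy stand-ins for HLVA and SHIPLEY)
x <- rnorm(478, mean = 9, sd = 2)
z <- rnorm(478, mean = 34, sd = 4)

# The raw predictor is strongly correlated with the raw product term...
cor(x, x * z)

# ...but after mean-centering, that correlation largely disappears
xc <- x - mean(x)
zc <- z - mean(z)
cor(xc, xc * zc)
```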
Task 12 – Specify models to include interaction effects: interactions between two numeric predictor variables
Hint: Task 12
We use lm() to specify models, as we have been doing.
- What changes is that we now use the `*` operator in our model code to specify interaction effects.
For example, let’s suppose that we want to examine the possibility that the effect of health literacy (HLVA) on accuracy of understanding (mean.acc) is different for different values of vocabulary knowledge (SHIPLEY).
- Here, we are considering the possibility that the association between health literacy (`HLVA`) and accuracy of understanding (`mean.acc`) varies.
- We are considering the possibility that the association between health literacy (`HLVA`) and accuracy of understanding (`mean.acc`) is different for different groups of people.
- We are considering the possibility that the effect of health literacy (`HLVA`) on accuracy of understanding (`mean.acc`) is different for people who differ, also, in vocabulary knowledge (`SHIPLEY`).
In examining the possibility that the effect of health literacy (HLVA) on accuracy of understanding (mean.acc) is different for different values of vocabulary knowledge (SHIPLEY), we are examining a possible interaction between the effects of health literacy (HLVA) and vocabulary knowledge (SHIPLEY).
Let’s go through this step-by-step.
First, consider: if we ignored the possibility of an interaction, we would fit a model with just the predictors: health literacy (HLVA) and vocabulary knowledge (SHIPLEY).
We can fit the same model that we fit before (for Task 9).
- We vary the code a little, by using the centered variables we have just created:
model <- lm(mean.acc ~ HLVA_centered + SHIPLEY_centered,
data = all.data)
summary(model)
Call:
lm(formula = mean.acc ~ HLVA_centered + SHIPLEY_centered, data = all.data)
Residuals:
Min 1Q Median 3Q Max
-0.38048 -0.06818 0.01317 0.07672 0.30910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.806587 0.005155 156.453 < 2e-16 ***
HLVA_centered 0.019132 0.002555 7.488 3.43e-13 ***
SHIPLEY_centered 0.005471 0.001252 4.370 1.53e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1127 on 475 degrees of freedom
Multiple R-squared: 0.2003, Adjusted R-squared: 0.1969
F-statistic: 59.48 on 2 and 475 DF, p-value: < 2.2e-16
Q.26. Can you compare the summary of results for this model with the model, using the same (but not centered) predictors, that you fitted for Task 9 – compare the estimates, what do you see?
- A.26. If we compare the results of the models we can see that the estimates, t-test values and probability values are the same for `HLVA` compared to `HLVA_centered`, and for `SHIPLEY` compared to `SHIPLEY_centered`.
This shows that centering predictors only changes the scaling of the predictor variables so that their centers are 0.
Second, now consider: we aim to test or estimate the interactions between the effects of health literacy (HLVA_centered) and vocabulary knowledge (SHIPLEY_centered).
- We vary the code a little, again, this time by separating the predictors by `*`, not by `+`:
model <- lm(mean.acc ~ HLVA_centered*SHIPLEY_centered,
data = all.data)
summary(model)
Call:
lm(formula = mean.acc ~ HLVA_centered * SHIPLEY_centered, data = all.data)
Residuals:
Min 1Q Median 3Q Max
-0.37986 -0.06840 0.01342 0.07652 0.29872
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8061527 0.0054951 146.703 < 2e-16 ***
HLVA_centered 0.0191754 0.0025645 7.477 3.70e-13 ***
SHIPLEY_centered 0.0055241 0.0012742 4.335 1.78e-05 ***
HLVA_centered:SHIPLEY_centered 0.0001132 0.0004922 0.230 0.818
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1128 on 474 degrees of freedom
Multiple R-squared: 0.2004, Adjusted R-squared: 0.1953
F-statistic: 39.59 on 3 and 474 DF, p-value: < 2.2e-16
Q.27. Can you compare the summary of results for this model with the results for the previous model (which uses the same predictors but without an interaction)?
What do you see?
A.27. If we compare the results of the models we can see that now the summary gives us estimates for the effects of:
- `HLVA_centered` – this is the effect of `HLVA_centered` when `SHIPLEY_centered` is 0;
- `SHIPLEY_centered` – this is the effect of `SHIPLEY_centered` when `HLVA_centered` is 0;
- `HLVA_centered:SHIPLEY_centered` – this is the interaction, and reflects the change in the effect of `HLVA_centered` on the outcome `mean.acc`, given different values of `SHIPLEY_centered`.
Notice that the effect of the interaction is not significant, suggesting, for this sample, that the effects of these predictor variables are at least approximately independent of each other.
The `*` symbol (the `*` operator) is used here to code for a model including the effects of the two variables it separates, as well as the effect of the interaction between those two variables.
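You can verify this expansion on toy data (variables invented for illustration): `y ~ a * b` fits exactly the same model as `y ~ a + b + a:b`, where `:` codes the interaction term alone:

```r
set.seed(7)
# Toy variables (invented for illustration)
a <- rnorm(50)
b <- rnorm(50)
y <- 1 + a + b + 0.5 * a * b + rnorm(50)

m1 <- lm(y ~ a * b)        # shorthand: main effects plus interaction
m2 <- lm(y ~ a + b + a:b)  # the same model written out in full

coef(m1)  # both formulas return identical coefficients
coef(m2)
```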
Task 13 – Specify models to include interaction effects: interactions between a numeric predictor variable and a factor predictor variable
Hint: Task 13 – We follow the same approach that we followed through Task 12.
Now consider: we aim to test or estimate the interactions between the effects of reading strategy (FACTOR3) and education level (EDUCATION).
- We vary the code to change the predictor variables but the structure otherwise stays the same:
model <- lm(mean.acc ~ FACTOR3_centered*EDUCATION,
data = all.data)
summary(model)
Call:
lm(formula = mean.acc ~ FACTOR3_centered * EDUCATION, data = all.data)
Residuals:
Min 1Q Median 3Q Max
-0.43422 -0.06111 0.01700 0.09208 0.23754
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.802704 0.008193 97.977 <2e-16 ***
FACTOR3_centered 0.002210 0.001100 2.008 0.0452 *
EDUCATIONHigher 0.005249 0.011241 0.467 0.6408
FACTOR3_centered:EDUCATIONHigher 0.004104 0.001656 2.479 0.0135 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1223 on 474 degrees of freedom
Multiple R-squared: 0.06106, Adjusted R-squared: 0.05512
F-statistic: 10.28 on 3 and 474 DF, p-value: 1.45e-06
Q.28. Can you identify what effects are shown in the model summary?
A.28. We can see that now the summary gives us estimates for the effects of:
- `FACTOR3_centered` – this is the effect of reading strategy (`FACTOR3`) when `EDUCATION` level is `Further`;
- `EDUCATIONHigher` – this is the effect of `EDUCATION` when `FACTOR3_centered` is 0.
Note that the effect of `EDUCATION` is the estimated difference in outcome when `EDUCATION` level is `Higher` compared to when `EDUCATION` level is `Further` (not shown, treated as the baseline).
- `FACTOR3_centered:EDUCATIONHigher` – this is the interaction, and reflects the change in the effect of `FACTOR3_centered` on the outcome `mean.acc`, given different values of `EDUCATION`; here, when `EDUCATION` is `Higher` instead of `Further`.
What are we learning here?
Remember from week 18:
Categorical variables or factors and reference levels.
- If you have a categorical variable like `EDUCATION`, then when you use it in an analysis, R will look at the different categories (called levels), e.g., here, `Higher` and `Further`, and it will pick one level to be the reference or baseline level.
- The reference is the level against which other levels are compared.
- Here, the reference level is `Further` (education) simply because, unless you tell R otherwise, it picks the level with a category name that begins earliest in the alphabet as the reference level.
How do we interpret interactions?
- It helps to plot the effects estimates, given the model.
We can do this in a two-step sequence, similar to the sequence we trialled through weeks 17 and 18:
- First fit the model
- Second, use the model information to plot the predictions, given the effects
# -- first fit the model
model <- lm(mean.acc ~ FACTOR3_centered*EDUCATION,
data = all.data)
# -- make plot
plot_model(model, type = "pred",
terms = c("FACTOR3_centered", "EDUCATION")) +
theme_bw() +
ylim(0, 1)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

- Q.29. Can you identify how the effects of reading strategy (`FACTOR3`) and education level (`EDUCATION`) interact, given the model results summary and this plot?
- A.29. The plot, and the estimate for the interaction `FACTOR3_centered:EDUCATIONHigher`, together show that the coefficient for the effect of reading strategy (`FACTOR3_centered`) is larger (the slope is steeper) when `EDUCATION` is `Higher` instead of `Further`.
Notice that, for this sample, the interaction is significant (p = .0135), suggesting that the effect of reading strategy on accuracy of understanding does differ between education levels.