download.file("https://github.com/lu-psy-r/MSc/blob/main/PSYC412/part1/Week12-files/data_exp_72623-v17_questionnaire-i5jp.csv?raw=true", destfile = "data_exp_72623-v17_questionnaire-i5jp.csv")
Week 12. Categorical predictors
Written by Margriet Groen (partly adapted from materials developed by the PsyTeachR team at the University of Glasgow)
So far, all the predictors in the models we’ve looked at were continuous variables. What if you wanted to know whether a response differed between two or more discrete groups? Hang on, you might say, that sounds like doing an ANOVA. True, you might have used ANOVA to assess whether group means differed in previous stats courses. ANOVAs—to some degree—are just a special type of regression where you have categorical predictors. This week we’ll look at how to model responses as a function of categorical predictors and we’ll combine categorical predictors to model how a predictor might affect the outcome variable differently across two different groups. For example, we might be interested in whether the amount of time adolescents use digital devices (screen-time) predicts their well-being. Additionally, we might want to know whether well-being is different for adolescent boys and girls and whether the relationship between screen-time and well-being differs for these two groups. By fitting a regression model in which we combine a continuous (screen-time) and a categorical (sex) predictor, we can do exactly that. We’ll be working on that in the lab.
Lectures
The lecture material for this week follows the recommended chapters in Winter (2020) – see under ‘Reading’ below – and is presented in two parts. The videos have captions, in case you find that helpful.
You’ll find links to the slides and transcripts underneath each video. You can also download them all at once here.
- Categorical predictors (~17 min)
- Interactions (~18 min)
If you are (relatively) new to using R and RStudio and not yet confident in using various functions from the dplyr package, the video below will be useful:
- Data wrangling with dplyr (~10 minutes) Watch this part before you complete the pre-lab activities and before you attend the lab session.
Reading
Blogpost by Professor Dorothy Bishop
In this very short blogpost Professor Dorothy Bishop explains the links between ANOVA and Regression.
Winter (2020)
Chapter 7 provides an excellent overview of using categorical predictors in regression models and explains how this is implemented in R.
Chapter 8 explains what interactions are and how to model and interpret them.
Pre-lab activities
After having watched the lectures and read the textbook chapters you’ll be in a good position to try these activities. Completing them before you attend your lab session will help you to consolidate your learning and move through the lab activities more smoothly.
Pre-lab activity 1: More data-wrangling in R
We’ll continue to practise data-wrangling in R. This week, we’ll work with some questionnaire data. You might have collected questionnaire data using Qualtrics, Pavlovia or Gorilla. The data we’ll be working with in this pre-lab activity were collected using the Gorilla platform. The questionnaire collects some demographic information. A data collection platform typically provides you with an output file that is quite complex. For instance, in addition to responses to the questions of interest, it will collect information on task name, task version, various time stamps and other information on the data collection session. To be able to do any statistics on the data, you’ll have to extract the relevant information from this larger data file. We can do this by using various functions from the dplyr package. This cheatsheet provides a useful overview.
Step 1: Background and set up
Make sure you have set your working directory to the correct folder by right-clicking on it and selecting ‘Set as working directory’.
Before you do anything else, when starting a new analysis, it is a good idea to empty the R environment. This prevents objects and variables from previous analyses interfering with the current one. To do this, you can click on the little broom icon in the top right of the Environment pane, or you can use rm(list=ls()).
Before we can get started we need to tell R which libraries to use. For this analysis we’ll need tidyverse.
TASK: Load the relevant libraries. If you are unsure how to do that, you can look at the ‘Hint’ below for a clue by expanding it. After that, if you are still unsure, you can view the code by expanding the ‘Code’ section below.
Use the library() function.
The code to do this is below.
library(tidyverse)
If you experienced difficulties with uploading a folder or a file to the server, you can use the code below to directly download the file you need in this lab activity to the server (instead of first downloading it to your computer and then uploading it to the server). Remember that you can copy the code to your clipboard by clicking on the ‘clipboard’ in the top right corner.
download.file("https://github.com/lu-psy-r/MSc/blob/main/PSYC412/part1/Week12-files/data_exp_72623-v17_questionnaire-i5jp.csv?raw=true", destfile = "data_exp_72623-v17_questionnaire-i5jp.csv")
TASK: Read in the data file.
Use the read_csv() function.
The code to do this is below.
demo <- read_csv("data_exp_72623-v17_questionnaire-i5jp.csv")
Step 2: Have a look what is in this file
The first thing to do when developing an R-script to clean up data prior to data analysis is to have a look at what is in the file and identify the rows and columns that you need.
TASK: View the data file. See whether you can identify which columns we need to extract the questionnaire responses. If you are unsure how to do that, you can look at the ‘Hint’ below for a clue by expanding it. After that, if you are still unsure, you can view the code by expanding the ‘Code’ section below.
You can use View() to open the file in a separate tab. You can also inspect the file by using the head() function to look at the first few rows or the glimpse() function to get an overview of all rows and columns. Type ?head or ?dplyr::glimpse in the Console to get more information about those functions.
The code to do this is below:
View(demo) # Look at the data frames
head(demo)
glimpse(demo)
You’ll find that the data file has 31 columns and 157 rows. The questionnaire responses are in the last column (Response). Additionally, we need information in the column Question Key to make sure we extract the labels associated with the responses. Finally, we need participant numbers (stored in Participant Private ID) to be able to link responses to a particular participant.
Step 3: Extract relevant columns and rows
TASK: Extract the columns Participant Private ID, Question Key and Response, and assign the result to a new data frame.
Use the select() function.
The code to do this is below:
select(`Participant Private ID`, `Question Key`, `Response`)
You’ll see that the new data frame still has 157 observations or rows, but now we only have a much more manageable 3 variables or columns. If you scroll through the data file, you’ll see that the responses from different participants are in separate rows (meaning that the data are in ‘long’ format). You’ll also see that it separates different participants by rows with the text “BEGIN QUESTIONNAIRE” and “END QUESTIONNAIRE”. We need to get rid of those before we transform the data frame to a ‘wide’ format.
TASK: Extract all rows, except ones that have “BEGIN QUESTIONNAIRE” or “END QUESTIONNAIRE” in the column Question Key.
Use the filter() function, in combination with the logical operator for ‘not’. See here for more info on logical operators.
The code to do this is below:
filter(!`Question Key` == "BEGIN QUESTIONNAIRE")
filter(!`Question Key` == "END QUESTIONNAIRE")
Step 4: Change the format from ‘long’ to ‘wide’
Now that we’ve cleaned up the columns and rows, we want to change the format to ‘wide’, to have a separate column for each question and one row per participant.
TASK: Change the format to wide.
Use the pivot_wider() function.
The code to do this is below:
pivot_wider(names_from = `Question Key`, values_from = Response)
Step 5: Rename variables
Have another look at the data frame. We now have separate columns for different questions such as participant number, age, gender etc. However, some of the columns have awkward names, consisting of multiple words (such as ‘Participant Private ID’). Let’s rename those to make it easier to work with. Also note that we have data for 10 participants. For the purpose of this exercise, we’ll focus on the variables age, gender, language background and handedness. So don’t worry about renaming the other ones.
TASK: Rename the variables that have multi-word names. Focus on age, gender, language background and handedness.
Use the rename() function.
The code to do this is below:
rename(origID = `Participant Private ID`,
age = Age,
gender = Gender,
gender_text = `Gender-text`,
bilingual_text = `bilingual-text`,
handedness = categorical_hand) %>%
Step 6: Extract variables
TASK: Extract the following variables: age, gender (as well the associated text variable), language background (as well as the associated text variable) and handedness.
Use the select() function.
The code to do this is below:
select(origID, age, gender, gender_text, bilingual, bilingual_text, handedness)
Step 7: Recode gender and bilingual
If you look at the gender variable, you’ll see that it has 3 levels: ‘Male’, ‘Female’ and ‘Other (please specify)’. There are 2 issues with this: 1) Male and female are labels associated with biological sex, rather than gender. These labels should be recoded to ‘Man’ and ‘Woman’; 2) The third label reads ‘Other (please specify)’. While a necessary part of the question, we don’t really want the ‘(please specify)’ part to show in tables and figures. We will want to recode that to just ‘Other’. Similarly, for the bilingual variable, we have ‘Yes. My strongest language is:’ and ‘No’. We will want to recode that to a simpler ‘Yes’ and ‘No’.
TASK: Recode the levels of the gender variable to ‘Man’, ‘Woman’ and ‘Other’, rather than ‘Male’, ‘Female’ and ‘Other (please specify)’.
Use the recode() function, in combination with the mutate() function.
Try to fill in the template below:
mutate (new variable = recode(old variable,
`Old label1` = "New label1",
`Old label2` = "New label2",
`Old label3` = "New label3"))
The code to do this is below:
mutate (gender = dplyr::recode(gender,
`Male` = "Man",
`Female` = "Woman",
`Other (please specify)` = "Other")) %>%
mutate(bilingual = dplyr::recode(bilingual,
`Yes. My strongest language is:` = "Yes"))
Step 8: Change variable types
Something else to check at this stage is whether the variable types that R has automatically assigned to your variables are correct. An easy way to check the variable type of each variable is to click on the white triangle in the blue circle next to the object name in the Environment. Using the glimpse() function will also show them. You will see that each variable is a ‘num’ (for ‘numerical’) or ‘chr’ (for ‘character’). Specifically, currently R thinks that the values stored in ‘origID’ are numerical values and that the values stored in all other variables are ‘character’ values (basically reading them as strings). This is not correct. Participant numbers, gender and bilingual are really factors with different levels (or categories) and the values in ‘age’ are numerical values. This is important, because certain functions expect variables of a certain type. For instance, you can’t compute the mean of a variable that consists of ‘characters’. So if you don’t change the variable types and try to compute the mean of ‘age’ it will not work. You can read more about variable types in R here.
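As a quick illustration of why this matters (a minimal sketch with made-up values, not taken from the questionnaire data):
age_chr <- c("21", "34", "19") # made-up ages stored as character strings
mean(age_chr) # returns NA with a warning: R cannot average text
mean(as.numeric(age_chr)) # works once the values are converted to numeric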
TASK: Change the variable types for participant number (factor), age (numerical), gender (factor), and bilingual (factor). Note that the variables ‘gender_text’ and ‘bilingual_text’ indeed contain string information, so are correctly identified as ‘character’ variables.
Use the as.factor() and as.numeric() functions, in combination with the mutate() function.
Try to fill in the template below:
mutate(variable1 = as.factor(variable1),
variable2 = as.numeric(variable2),
variable3 = as.factor(variable3),
variable4 = as.factor(variable4))
The code to do this is below:
mutate(origID = as.factor(origID),
age = as.numeric(age),
gender = as.factor(gender),
bilingual = as.factor(bilingual),
handedness = as.factor(handedness))
Step 9: Save processed file
Now the data is in a much better shape to work with for data visualisation, combination with other data files and analysis. In my own work, I’d write a separate ‘cleaning’ script like this for each measure (questionnaire or experimental task) and store the cleaned data frame as a new data file. I can then easily read in those ‘cleaned’ data files when doing further visualisation or analysis, without having to touch the raw data files again.
TASK: Store the data frame that contains the ‘cleaned’ or ‘processed’ data.
Use the write_csv() function.
The code to do this is below:
write_csv(demo_proc, ("Week12-files/demo_proc.csv"))
You can download the R-script that combines all these bits of code into one pipe here. It is always a good idea to build a pipe like this one line at a time. Make sure to check that each line actually does what you expect it to do before you add the next line.
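For reference, here is a sketch of what the combined pipe could look like when the chunks above are strung together (the downloadable script may differ slightly in layout or naming):
library(tidyverse)

demo <- read_csv("data_exp_72623-v17_questionnaire-i5jp.csv")

demo_proc <- demo %>%
  select(`Participant Private ID`, `Question Key`, `Response`) %>% # keep only the columns we need
  filter(!`Question Key` == "BEGIN QUESTIONNAIRE") %>% # drop the marker rows
  filter(!`Question Key` == "END QUESTIONNAIRE") %>%
  pivot_wider(names_from = `Question Key`, values_from = Response) %>% # long to wide: one row per participant
  rename(origID = `Participant Private ID`,
         age = Age,
         gender = Gender,
         gender_text = `Gender-text`,
         bilingual_text = `bilingual-text`,
         handedness = categorical_hand) %>%
  select(origID, age, gender, gender_text, bilingual, bilingual_text, handedness) %>%
  mutate(gender = dplyr::recode(gender,
                                `Male` = "Man",
                                `Female` = "Woman",
                                `Other (please specify)` = "Other")) %>%
  mutate(bilingual = dplyr::recode(bilingual,
                                   `Yes. My strongest language is:` = "Yes")) %>%
  mutate(origID = as.factor(origID),
         age = as.numeric(age),
         gender = as.factor(gender),
         bilingual = as.factor(bilingual),
         handedness = as.factor(handedness))

write_csv(demo_proc, "Week12-files/demo_proc.csv")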
Well done! The more you practise data wrangling, the easier it will become.
Pre-lab activity 2: Getting ready for the lab class
Get your files ready
Download the 412_week12_lab.zip file.
Remind yourself of how to access and work with the RStudio Server.
- Sign in to the RStudio Server. Note that when you are not on campus you need to log into the VPN first (look on the portal if you need more information about that).
- Create a new folder for this week’s work.
- Upload the zip-file to the folder you have created on the RStudio server. Note you can either upload a single file or a zip-file.
If you get error messages when attempting to upload a file or a folder with files to the server, you can try the following steps:
- Close the RStudio server, close your browser and start afresh.
- Open the RStudio server in a different browser.
- Follow a workaround where you use code to directly download the file to the server. The code to do that will be available at the start of the lab activity where you need that particular file. The code to download the file you need to complete the quiz is below.
Lab activities
In this lab, you’ll gain understanding of and practice with:
- when and why to apply multiple regression to answer questions in psychological science
- conducting multiple regression in R when combining continuous and categorical predictors
- interpreting the R output of multiple linear regression (when combining continuous and categorical predictors)
- reporting results for multiple linear regression (when combining continuous and categorical predictors), following APA guidelines
Lab activity 1: Combining a continuous and a categorical predictor in a regression model
Background: Smartphone screen-time and well-being
There is currently much debate (and hype) surrounding smartphones and their effects on well-being, especially with regard to children and teenagers. We’ll be looking at data from this recent study of English adolescents: Przybylski, A. & Weinstein, N. (2017). A Large-Scale Test of the Goldilocks Hypothesis. Psychological Science, 28, 204–215.
This was a large-scale study that found support for the “Goldilocks” hypothesis among adolescents: that there is a “just right” amount of screen-time, such that any amount more or less than this amount is associated with lower well-being. This was a huge survey study with data containing responses from over 120,000 participants! Fortunately, the authors made the data from this study openly available, which allows us to dig deeper into their results. And the question we want to expand on in this lab is whether the relationship between screen-time and well-being depends on the participant’s (self-reported) sex. In other words, our research question is: Does screen-time have a bigger impact on boys or girls, or is it the same for both?
The dependent measure used in the study was the Warwick-Edinburgh Mental Well-Being Scale (WEMWBS). This is a 14-item scale with 5 response categories, summed together to form a single score ranging from 14-70.
On Przybylski & Weinstein’s page for this study on the Open Science Framework, you can find the participant survey, which asks a large number of additional questions (see page 14 for the WEMWBS questions and pages 4-5 for the questions about screen-time). Within the same page you can also find the raw data, which some of you might want to consider using for your research report.
However, for the purpose of this lab, you will be using local pre-processed copies of the data (participant_info.csv, screen_time.csv and wellbeing.csv), which you downloaded as part of the ‘Pre-lab activities’.
Przybylski and Weinstein looked at multiple measures of screen-time, but again, in the interests of this lab, we will be focusing on smartphone use; do feel free to expand your skills afterwards by looking at different definitions of screen-time. Overall, Przybylski and Weinstein suggested that decrements in well-being started to appear when respondents reported more than one hour of daily smartphone use. So, bringing it back to our additional variable of sex, our research question is now: Does the negative association between hours of smartphone use and well-being (beyond the one-hour point) differ for boys and girls?
Let’s think about this in terms of the variables. We have:
- a continuous outcome variable: well-being;
- a continuous∗ predictor variable: screen-time;
- a categorical predictor variable: sex.
Please note that well-being and screen-time are technically only quasi-continuous inasmuch as that only discrete values are possible. However, there are a sufficient number of discrete categories in our data that we can treat the data as effectively continuous.
Now, in terms of analysis, what we are effectively trying to do is to estimate two slopes relating screen-time to well-being, one for adolescent girls and one for adolescent boys, and then statistically compare these slopes. It is a bit like running a correlation for boys, a correlation for girls, and comparing the two; or like needing both a regression (to estimate the slopes) and a t-test (to compare two groups). But the expressive power of regression allows us to do this all within a single model. Again, as we have seen building up to this lab, an independent groups t-test is just a special case of ordinary regression with a single categorical predictor; ANOVA is just a special case of regression where all predictors are categorical. But remember, although we can express any ANOVA design using regression, the converse is not true: we cannot express every regression design in ANOVA. This is why people like regression, and the general linear model, as it allows us to have any combination of continuous and categorical predictors in the model. The only inconvenience with running ANOVA models as regression models is that you have to take care in how you numerically code the categorical predictors. We will use an approach called deviation coding, which we will look at later in this lab.
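If you want to convince yourself of the t-test point, here is a minimal sketch with made-up data (not the screen-time data): an independent groups t-test (assuming equal variances) and a regression with a single two-level categorical predictor give the same t-statistic (up to its sign) and the same p-value.
set.seed(1)
toy <- data.frame(group = rep(c("A", "B"), each = 20), # two made-up groups of 20
                  score = c(rnorm(20, 50, 5), rnorm(20, 53, 5)))
t.test(score ~ group, data = toy, var.equal = TRUE) # independent groups t-test
summary(lm(score ~ group, data = toy)) # the 'groupB' coefficient has the same t and p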
To complete this lab activity, you can use the R-script (402_wk12_labAct1_template.R
) that you downloaded as part of the ‘Pre-lab activities’ as a template. Work through the activity below, adding relevant bits of code to your script as you go along.
Step 1: Background and set up
The folder you were asked to download under ‘Pre-lab activity 2: Getting ready for the lab class’ contains the data files we’ll need. Make sure you have set your working directory to this folder by right-clicking on it and selecting ‘Set as working directory’.
Before you do anything else, when starting a new analysis, it is a good idea to empty the R environment. This prevents objects and variables from previous analyses interfering with the current one. To do this, you can click on the little broom icon in the top right of the Environment pane, or you can use rm(list=ls()).
Before we can get started we need to tell R which libraries to use. For this analysis we’ll need broom, car and tidyverse.
TASK: Load the relevant libraries. If you are unsure how to do that, you can look at the ‘Hint’ below for a clue by expanding it. After that, if you are still unsure, you can view the code by expanding the ‘Code’ section below.
Use the library() function.
The code to do this is below.
library(broom)
library(car)
library(tidyverse)
If you experienced difficulties with uploading a folder or a file to the server, you can use the code below to directly download the file you need in this lab activity to the server (instead of first downloading it to your computer and then uploading it to the server). Remember that you can copy the code to your clipboard by clicking on the ‘clipboard’ in the top right corner.
download.file("https://github.com/mg78/2324_PSYC402/blob/main/data/week12/participant_info.csv?raw=true", destfile = "participant_info.csv")
download.file("https://github.com/mg78/2324_PSYC402/blob/main/data/week12/screen_time.csv?raw=true", destfile = "screen_time.csv")
download.file("https://github.com/mg78/2324_PSYC402/blob/main/data/week12/wellbeing.csv?raw=true", destfile = "wellbeing.csv")
TASK: Finally, read in the three data files; call the participant info pinfo, the screen_time data screen and the well-being data wellbeing.
Use the read_csv() function.
The code to do this is below.
pinfo <- read_csv("participant_info.csv")
screen <- read_csv("screen_time.csv")
wellbeing <- read_csv("wellbeing.csv")
Step 2: Checking the formatting
Given our research question and the information you have about the scores, provided above under ‘Background’ and from the OSF-webpage, is the data ready for use?
TASK: Add code to look at the first few lines of each data frame.
Use the head() function (or the tail() function).
The code to do this is below:
head(pinfo) # Look at the data frames
head(screen)
head(wellbeing)
QUESTION 2a: In which table is the variable corresponding to sex located and what is this variable called?
The ‘source and analysis code.sps’ file in the ‘Data and Code’ section on the OSF-webpage tells us how they coded the sex variable: 0 = female and 1 = male. It is worth exploring the OSF-webpage to get used to foraging through other files for this kind of information, as it is not always clearly explained in a codebook or README file.
TASK: For ease, let’s recode the sex variable to reflect word labels of ‘female’ and ‘male’. This doesn’t change the order: R will still see female as 0 and male as 1, because ‘female’ occurs before ‘male’ in the alphabet. Make sure to check your code does what you expect it to do.
Use the recode() function from the dplyr package to recode the labels. Use the head() and tail() functions to check the new data frame.
The code to do this is below:
pinfo <- pinfo %>%
mutate (sex = dplyr::recode(sex,
`1` = "male",
`0` = "female"))
# Let's check it has done it:
head(pinfo)
tail(pinfo)
QUESTION 2b: In what format is the well-being data (long or wide)? On how many participants does it include observations? And on how many items for each participant?
QUESTION 2c: What is the name of the variable that identifies individual participants in this dataset? It is important to work this out as this variable will allow us to link information across the three data files.
Step 3: Data preparation - Aggregating the total well-being scores
We need to sum the 14 items of the well-being scale for each participant to create one well-being score per participant.
TASK: To create one well-being score per participant, add code to the script to do the following: first, transform the wellbeing data frame from wide to long (using pivot_longer()); then, use group_by() to get scores for each participant; and finally, use summarise() to calculate a total well-being score, calling the new variable tot_wellbeing. Save all of this to an object called wb_tot.
Below is a code template. Make sure to add the relevant sections.
wb_tot <- DATA %>%
pivot_longer(-Serial, names_to = "", values_to = "") %>%
group_by(?) %>%
summarise(tot_wellbeing = sum(?))
wb_tot <- wellbeing %>%
pivot_longer(-Serial, names_to = "question", values_to = "score") %>%
group_by(Serial) %>%
summarise(tot_wellbeing = sum(score))
It is useful to calculate some descriptive statistics for the new variable tot_wellbeing.
TASK: Calculate some descriptive statistics for tot_wellbeing.
Use summarise() to calculate the mean, standard deviation, minimum and maximum values.
wb_tot %>% summarise(mean = mean(tot_wellbeing),
sd = sd(tot_wellbeing),
min = min(tot_wellbeing),
max = max(tot_wellbeing))
Finally, let’s get an idea of the distribution of the new variable tot_wellbeing.
TASK: Visualise the distribution in a histogram.
Use ggplot() and geom_histogram().
ggplot(wb_tot, aes(tot_wellbeing)) +
geom_histogram()
QUESTION 3a: Is the distribution of well-being scores symmetrical, negatively skewed or positively skewed?
Step 4: Data preparation - Transforming screen time data
Great, so we have the well-being scores sorted out, we now need to think about the screen-time usage data and whether it is being used on a weekday or a weekend. As always, to get an idea of the data, it is often very useful to visualise the variables before proceeding with the analysis.
Before we can do this, we’ll need to tidy these data up. Have another look at the screen data by using the head() function. You’ll see that we have Serial in the first column (this is good), but in the following eight columns, we have columns for each type of activity (Comph, Comp, Smart, Watch) and the part of the week it took place (we and wk) combined. Instead, to be able to work with the data, we need two columns: one for the type of activity (we’ll call it variable) and one for the part of the week (we’ll call it day).
A second issue is that we need to alter the abbreviations Comph, Comp, Smart and Watch to reflect more descriptive text for each in plots.
Below are two chunks of code that represent these steps. In the next tasks you’ll practise with taking a set of piped commands apart. The purpose of this is to get you used to “parsing” the code at the right places so that when you see piped commands in other people’s code, you know how to break it down and find the relevant parts that you can use.
TASK: In the code chunk below we use the separate() function to split the character strings already in the dataset. You know that with piped commands, there are chunks of code. Run the code first in its entirety and then pull each line apart to see how each function works on the data. Write a descriptive sentence for each function’s role in the command. Don’t forget to copy the chunk to your script and run it.
screen_long <- screen %>%
  pivot_longer(-Serial, names_to = "var", values_to = "hours") %>%
  separate(var, c("variable", "day"), "_")
TASK: In the next code chunk we use the dplyr::recode() function with mutate() to relabel the separated names into understandable names that will be clear in plots. Again, run the code first in its entirety and then pull each line apart to see how each function works on the data. Write a descriptive sentence for each function’s role in the command. Don’t forget to copy the chunk to your script and run it.
screen2 <- screen_long %>%
  mutate(variable = dplyr::recode(variable,
                                  "Watch" = "Watching TV",
                                  "Comp" = "Playing Video Games",
                                  "Comph" = "Using Computers",
                                  "Smart" = "Using Smartphone"),
         day = dplyr::recode(day,
                             "wk" = "Weekday",
                             "we" = "Weekend"))
The code above has a new feature: the dplyr::recode part. This syntax – using the double colon – is used when there are several functions with the same name in different packages. You can imagine that a function called ‘recode’ is immensely useful at the data-wrangling stage of analysis. By using the name of the package, a double set of colons, followed by a function name, you are ensuring that R uses that particular package’s version of the function, at that point only. This avoids problems when two or more packages loaded in your environment do not play nicely together!
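As a tiny illustration of the double-colon idea (the values here are just made up for the example):
# Calling recode via its package name guarantees the dplyr version is used,
# even if another loaded package (e.g. car) also defines a recode() function.
dplyr::recode(c("wk", "we", "wk"), "wk" = "Weekday", "we" = "Weekend")
#> [1] "Weekday" "Weekend" "Weekday"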
To be able to monitor that your code is performing as you want it to, you need to have in your mind an idea of how the data should look at the end of a code chunk. So stop a moment and be clear, discuss with your lab-mates if you feel like it and answer the following question.
QUESTION 4a: What are the variables and the levels or conditions within each variable of screen2?
TASK: Now join wb_tot and screen2 by participant, then group by the variables ‘variable’, ‘day’ and ‘hours’, and then calculate a ‘mean_wellbeing’ variable for each of the grouped levels. Save it in an object called ‘dat_means’.
Write separate lines of code for each action and then, when you know each of them works, reformat them as a piped command. You’ll need inner_join(), group_by() and summarise().
Below are templates for separate lines of code for each action. You’ll need to replace the names of the relevant data frames and variables.
joined <- inner_join(data1, data2, by=)
grouped <- group_by(data, var1, var2, var3)
means <- summarise(data, mean = mean(variable))
Below is the code for a piped command.
dat_means <- inner_join(wb_tot, screen2, "Serial") %>%
group_by(variable, day, hours) %>%
summarise(mean_wellbeing = mean(tot_wellbeing))
TASK: Now check that you have an object that is 72 observations of 4 variables. You should have a mean well-being score for every level of hours, over weekdays and weekends, for each of the four types of screen time (4 x 2 x 9).
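One quick way to check this (a minimal sketch):
dim(dat_means) # should return 72 4: 72 rows and 4 columns
glimpse(dat_means) # or inspect the variables and their types directly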
Next, it is a good idea to visualise the mean well-being data as function of hours of screen-time for the different days (weekday vs. weekend) and types of screen (playing video games, using computers, using smartphone and watching tv). This is quite a complex graph. We’ll go through creating it step-by-step, but let’s first look at the end result:
Ok, that’s what we are working towards.
TASK: Below, a chunk of code is presented. It is your task to fill in the x and y variables.
Go back to the research question - which variable is for the x axis and for the y axis?
ggplot(dat_means, aes(x = , y = )) +
geom_line() +
geom_point() +
theme_bw()
ggplot(dat_means, aes(x = hours, y = mean_wellbeing)) +
geom_line() +
geom_point() +
theme_bw()
QUESTION 4b: What research question does this plot describe? Is it appropriate for the levels within the data?
TASK: Now, let’s add a different linetype for each day (weekday vs. weekend). Fill in the blanks in the code below.
ggplot(dat_means, aes(x = , y = , linetype = )) +
geom_line() +
geom_point() +
theme_bw()
ggplot(dat_means, aes(x = hours, y = mean_wellbeing, linetype = day)) +
geom_line() +
geom_point() +
theme_bw()
QUESTION 4c: What research question does this plot describe? Is it appropriate for the levels within the data?
Still not quite there.
TASK: Fill in the blanks (for x, y and linetype) as before. Now have a good look at the code below. What has changed? Copy the code to your script and run it. Then, for each line write a sentence as a comment to describe its effect on the plot.
ggplot(dat_means, aes(x = , y = , linetype = )) +
geom_line() +
geom_point() +
facet_wrap(~variable, nrow = 2) +
theme_bw()
ggplot(dat_means, aes(hours, mean_wellbeing, linetype = day)) + # plot 'hours' on the x-axis, 'mean_wellbeing' on the y-axis and use a different type of line for the levels of 'day'
geom_line() + # add a line
geom_point() + # add point data
facet_wrap(~variable, nrow = 2) + # plot separate plots for each level of 'variable'
theme_bw() # use the 'black and white' theme
We add the facet_wrap() function here. You can check the ?facet_wrap() help page for more information.
QUESTION 4d: What does the facet_wrap() function do? Is this plot appropriate for the levels in the data?
Step 5: Calculating mean hours per day for smartphone use, for each participant
As mentioned at the beginning, in today’s lab we’ll focus on smartphone use. Looking at the bottom left of the figure, we could suggest that smartphone use of more than 1 hour per day is associated with increasingly lower well-being the more screen-time people report. The effect looks similar for weekdays and weekends, though perhaps overall well-being on weekdays is marginally lower than on weekends (the line for Weekday sits lower on the y-axis than the line for Weekend). This makes some sense, as people tend to be happier on weekends!
TASK: Below is a set of comments that describe what the chunk of code that you need to write next does:
- #use ‘screen2’
- #and then filter out the observations for ‘Using Smartphone’,
- #and then group together each participant,
- #and then summarise the mean hours calling it ‘hours_per_day’,
- #save it in an object called ‘smarttot’
Below is a template for the code you need. Try to fill in the relevant bits:
NEW_OBJECT <- DATA %>%
filter(variable == "?") %>%
group_by(VARIABLE) %>%
summarise(new_measure = mean(VARIABLE))
The code to do this is below:
smarttot <- screen2 %>% # use 'screen2'
filter(variable == "Using Smartphone") %>% # filter out the observations for 'Using Smartphone'
group_by(Serial) %>% # group together each participant
summarise(hours_per_day = mean(hours)) #summarise the mean hours calling it 'hours_per_day'
TASK: Now let’s do it the other way around. Run the code below. Have a look at the structure of ‘smart_wb’.
You can use the str() function.
smart_wb <- smarttot %>%
  filter(hours_per_day > 1) %>%
  inner_join(wb_tot, "Serial") %>%
  inner_join(pinfo, "Serial") %>%
  mutate(sex = as.factor(sex))
QUESTION 5a: What does the code do? Write a short paragraph, using the phrase “and then” to represent the pipes.
Step 6: More visualisation
We are now using only one small part of the data - smartphone use and its relationship with well-being over different durations of time. Before formally testing our research question, we can visualise the data and enquire about sex differences on the same plot - run each chunk of code below:
TASK: To further group the data, copy the code below to your script and run it. Look at the ‘smart_wb_gen’ dataframe. What has the code done? Write a couple of sentences of description.
smart_wb_gen <- smart_wb %>%
  group_by(hours_per_day, sex) %>%
  summarise(mean_wellbeing = mean(tot_wellbeing))
TASK: Let’s visualise these data.
ggplot(smart_wb_gen, aes(hours_per_day, mean_wellbeing, color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_discrete(name = "Sex", labels = c("Girls", "Boys"))+
scale_x_continuous(name = "Total hours smartphone use") +
scale_y_continuous(name = "Mean well-being score") +
theme_bw()
QUESTION 6a: Write an interpretation of the above plot in plain English.
Step 7: The regression model
In steps 2 to 6 we’ve covered some pretty heavy-lifting data-wrangling. As something like this is so often needed when working with real data, it is really important to practise it. However, to ensure you also spend time on fitting the regression model and interpreting the output, you can choose to use the data-file smart_wb.csv to get started with that. It contains the data in a format that is the result of all the data-wrangling we did in steps 2 to 6. So, download the smart_wb.csv data-file, put it in the folder that is your working directory and you’re all set for running the regression model.
If you experienced difficulties with uploading a folder or a file to the server, you can use the code below to directly download the file you need in this lab activity to the server (instead of first downloading it to your computer and then uploading it to the server). Remember that you can copy the code to your clipboard by clicking on the ‘clipboard’ in the top right corner.
download.file("https://github.com/mg78/2324_PSYC402/blob/main/data/week12/smart_wb.csv?raw=true", destfile = "smart_wb.csv")
TASK: Let’s run the regression. Write code in your script in which you call your output ‘mod’, and use the data ‘smart_wb’ using the following formula,
lm(y ~ x1 + x2 + x1:x2, data)
to construct your regression model. Go back to the research question for your outcome and two predictor variables.
Use the following template:
mod <- lm(y ~ x1 + x2 + x1:x2, data)
mod <- lm(tot_wellbeing ~ hours_per_day + sex + hours_per_day:sex, smart_wb)
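As an aside, R’s formula syntax has a shorthand for this: x1 * x2 expands to x1 + x2 + x1:x2, so the model above could equivalently be written as:
mod <- lm(tot_wellbeing ~ hours_per_day * sex, smart_wb) # same model, using the * shorthand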
TASK: Call and save the summary of your model as ‘mod_summary’; then have a look at it.
mod_summary <- summary(mod)
mod_summary
Let’s first look at the model as a whole:
QUESTION 7a: What is the p-value for the overall model? Is it significant? What does this mean?
QUESTION 7b: To two decimal places, what percentage of the variance in well-being scores does the overall model explain?
Now, lets look at the coefficients of our predictors:
QUESTION 7c: Are the main effects of smartphone use and sex significant?
QUESTION 7d: Which variable indicates the interaction between smartphone use and sex?
QUESTION 7e: And is the interaction significant?
QUESTION 7f: What is the most reasonable interpretation of these results?
The above model uses treatment coding (sometimes called dummy coding) for the sex variable. In a categorical variable with only two levels, this means that one level is coded as 0 and the other level is coded as 1. In categorical variables with more than two levels, it works slightly differently.
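To see what treatment coding looks like for a factor with more than two levels, you can ask R for the contrast matrix directly (a small illustration, not needed for this lab):
contr.treatment(3) # for a 3-level factor: two dummy columns, each comparing one level to the reference (level 1)
#>   2 3
#> 1 0 0
#> 2 1 0
#> 3 0 1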
TASK: We can check that the sex variable uses treatment coding with the following code:
contrasts(smart_wb$sex)
Because we have not explicitly told R otherwise, it has used the first level (female, coded 0) as the reference level, hidden within the intercept term, and the coefficient for sex describes the difference (or slope) between the two levels, i.e. from female to male in this dataset.
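As a sketch of how the coefficients combine under treatment coding (generic coefficient names b0 to b3, not the values from your output):
\[ \widehat{\text{wellbeing}} = b_0 + b_1 \cdot \text{hours} + b_2 \cdot \text{sex} + b_3 \cdot (\text{hours} \times \text{sex}) \]
For the reference group (sex = 0, here female) this reduces to b0 + b1 x hours; for sex = 1 (male) it becomes (b0 + b2) + (b1 + b3) x hours, so b2 shifts the intercept and b3 changes the slope of the screen-time effect.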
QUESTION 7g: Is being Male better for a person’s well-being in the context of smartphone use than being Female?
Now let’s look at deviation coding. There are other ways to code your categorical variables; one of them is deviation coding (also sometimes called sum coding). This effectively divides the difference between your categorical levels by the number of levels, so that each level is compared to one intercept that is central to them all, rather than comparing each level to a single reference level. It is like centering, but for a categorical variable.
TASK: Use the code chunk below to: 1) Add a variable to the smart_wb data that is a deviation coding of sex; 2) Set the deviation coding (we’ll label it ‘Sum’ here for easy variable naming); and 3) Look at the output for the sum-coded sex variable.
smart_wb <- mutate(smart_wb, sexSum = sex) # add a variable to the smart_wb data that is a deviation coding of sex
contrasts(smart_wb$sexSum) <- contr.sum(2) # set the deviation coding
contrasts(smart_wb$sexSum) # look at the output for the sum-coded sex variable
Next, we’ll run the regression again, using the sum-coded sex variable and we’ll compare the outputs.
# Run the regression model again, using the sum-coded sex variable, and compare the outputs
mod_S <- lm(tot_wellbeing ~ hours_per_day + sexSum + hours_per_day:sexSum, smart_wb)
mod_S_summary <- summary(mod_S)

# Compare the two model summary outputs
mod_summary
mod_S_summary
‘sexSum1’ is now the coefficient for sex and represents the change from the intercept value, which now lies between the values for being Female and Male. Note how this coefficient is negative.
The earlier model had a positive coefficient because the intercept described the reference group of the Girls, who on average begin at a lower well-being level than Boys (refer back to the scatterplot to verify this). Because the sum-coding has moved the intercept to a point that is the centre of the difference between Boys and Girls, sexSum1 now describes the distance between the centre and a level of Sex.
Values for well-being in Girls are thus: \[ \text{Intercept} + \text{sexSum1} \times (+1) = 49.74 + (-1.61) \times (+1) = 48.13 \]
Values for well-being in Boys are thus: \[ \text{Intercept} + \text{sexSum1} \times (-1) = 49.74 + (-1.61) \times (-1) = 51.35 \]
with the Boys being higher in well-being (remember: a negative number multiplied by a negative number produces a positive number, and a negative number multiplied by a positive number produces a negative number).
The interpretation of both model effects is the same, and if you look at the summary statistics, they are identical. Deviation coding effectively centers your categorical variables and helps with interpretation of interaction terms.
Step 8: Checking assumptions
Now that we’ve fit a model, let’s check whether it meets the assumptions of linearity, normality and homoscedasticity. With regression models, you do this after you’ve actually fit the model.
Linearity: Unlike when we did simple regression, we can’t use crPlots() to test for linearity when there is an interaction, but we know from looking at the grouped scatterplot that this assumption has been met.
Normality: Normally we would test for normality with a QQ-plot and a Shapiro-Wilk test. However, because this dataset is so large, the Shapiro-Wilk is not appropriate (if you try to run the test it will produce a warning telling you that the sample size must be between 3 and 5000). This is because with extremely large sample sizes the Shapiro-Wilk test will find that any deviation from normality is significant. Therefore we should judge normality based upon the QQ-plot.
TASK: Create a QQ-plot to check the residuals are normally distributed.
You can use the qqPlot() function. The residuals are stored in the ‘mod’ object you created earlier.
qqPlot(mod$residuals)
QUESTION 8a: What do you conclude from the QQ-plot?
Homoscedasticity: Here we have the same problem as with testing for normality: with such a large sample the ncvTest() will produce a significant result for any deviation from homoscedasticity. So we need to rely on plots again. To check for homoscedasticity we can use plot() from Base R, which will produce a set of helpful diagnostic plots (more information here).
TASK: Copy the code chunk below to your script and run it.
par(mfrow=c(2,2)) # 4 charts in 1 panel
plot(mod) # this may take a few seconds to run
The residuals vs leverage plot shows a flat red line so, whilst it isn’t perfect, we can assume that with such a large sample size regression is still an appropriate analysis.
Multi-collinearity: Finally, let’s check for multicollinearity using the vif() function. Essentially, this function estimates how much the variance of a coefficient is “inflated” because of linear dependence with other predictors, i.e., that a predictor isn’t actually adding any unique variance to the model, it’s just really strongly related to other predictors. Thankfully, vif is not affected by large samples like the other tests. There are various rules of thumb, but most converge on a VIF above 2 to 2.5 for any one predictor being problematic.
TASK: Copy the code chunk below to your script and run it.
vif(mod) # Check for multi-collinearity
QUESTION 8b: Do any of the predictors show evidence of multicollinearity?
Step 9: Write up
QUESTION 9a: How would you write up the results following APA guidance? You can choose whether you do so for the model using treatment coding or for the model using deviation coding.
Answers
When you have completed all of the lab content, you may want to check your answers with our completed version of the script for this week. Remember, looking at this script (studying/revising it) does not replace the process of working through the lab activities, trying them out for yourself, getting stuck, asking questions, finding solutions, adding your own comments, etc. Actively engaging with the material is the way to learn these analysis skills, not by looking at someone else’s completed code…
The answers to the questions and the script containing the code will be available after the lab session has taken place.