This week, there are three mini lectures and a practical workbook working with RStudio.
Before the practical on Tuesday, please try to work through the practical workbook in your group.
Bring your questions (and/or answers) to the practical.
4.2 Learning Goals
Understand the value of conducting statistical tests and interpreting p-values
Understand null effects and null hypotheses
Understand the difference between parametric and non-parametric data
Understand when to apply the Chi-squared test
Understand the relation between Cramer’s V test and the Chi-squared test
Be able to apply the Chi-squared test to data and interpret the result
Be able to apply Cramer’s V test to data
4.3 Lectures and slides
4.3.1 Lectures
Watch Lecture week 4 part 1:
Watch Lecture week 4 part 2:
Watch Lecture week 4 part 3:
Note on typo in week 4 part 3
Please note that for the part 3 recording, the screen shows PSYC401 instead of PSYC411 in places. I am reluctant to go in and correct this typo as Panopto recordings aren’t working very well for me at the moment. So, please excuse the error! It should be PSYC411.
Take the quiz (not assessed) on the lecture materials.
In your group, work through this workbook, note any problems and questions you have, and come prepared to the online practical class to go through the tasks and ask your questions.
4.4.1.1 Part 1: Revision for last week
Task 1: Your graphs on the vocabulary scores data
One of the tasks from last week was to produce some graphs of the data set PSYC411-shipley-scores-anonymous-17_18.csv. Go back to your Week 3 R script to remind yourself of the relations you looked at in your graphs, and discuss in your group what they might mean.
OK, revision is over.
4.4.1.2 Part 2: Load in the New Vocabulary Scores Data and Produce Graphs
Task 2: Load in the Data
We will now look at the vocabulary scores data updated to include your own data. Make a new script for this week, called psyc411_week4.r or another name that makes sense to you.
Remember to clear out R first and load the tidyverse library:
rm(list = ls())
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data set on the Shipley and Gent vocabulary scores is now updated with the data from your group, so it now contains nine years of PSYC411 students’ data. There are three datasets; I have separated the data in order to preserve anonymity and changed the subject_ID numbers.
The first contains information about vocabulary scores and gender - this is called PSYC411-shipley-scores-anonymous-17_25_gender.csv. Download this file to your computer, then upload it to the R server and read it into an object in RStudio called vocabgenderdat.
The second contains information about vocabulary scores, language background, and year of academic study - this is called PSYC411-shipley-scores-anonymous-17_25_year_language.csv. Read the data into an object in RStudio called vocabyearlangdat (for vocabulary data).
The third contains information about gender and language background - this is called PSYC411-shipley-scores-anonymous-17_25_gender_language.csv. Read the data into an object in RStudio called genderlangdat (for gender and language data).
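If you are reading the files in with read_csv(), the commands will look something like this (adjust the file paths to match wherever you uploaded the files on the server):
vocabgenderdat <- read_csv("PSYC411-shipley-scores-anonymous-17_25_gender.csv")
vocabyearlangdat <- read_csv("PSYC411-shipley-scores-anonymous-17_25_year_language.csv")
genderlangdat <- read_csv("PSYC411-shipley-scores-anonymous-17_25_gender_language.csv")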
Rows: 293 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Gender
dbl (4): subject_ID, Shipley_Voc_Score, Gent_1_score, Gent_2_score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 293 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): english_status
dbl (6): subject_ID, Age, Shipley_Voc_Score, Gent_1_score, Gent_2_score, aca...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 293 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): english_status, Gender
dbl (1): subject_ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As a reminder, when we want to look at a particular variable (a column) in an object in RStudio, we refer to it using the $ notation. So, for the object vocabyearlangdat and the variable academic_year, you would refer to it as vocabyearlangdat$academic_year. For this data set, we need to change academic year to be a nominal (factor) variable. Why are we setting academic year to be nominal and not interval/ratio?
vocabyearlangdat$academic_year <- as.factor(vocabyearlangdat$academic_year)
# View(vocabyearlangdat) # view the data
Answer
It has to be nominal unless we have a hypothesis that the order of the year groups matters to the data. But that is very unlikely to have an effect (though you could look up the “Flynn effect” in case you wanted to generate one…). For now, we want each year to be treated independently, rather than in order, in our analysis.
Make sure the tidyverse library is loaded. Select all the variables apart from Age and save the result as a new object called “summaryvdat”. We will omit this variable because it is not complete for the dataset.
Hint
# There are two ways to do this. Either list the variables you are omitting
# with a -c(variable, variable, ...) command:
summaryvdat <- select(vocabyearlangdat, -c(Age))
# or list the variables we do want to include:
summaryvdat <- select(vocabyearlangdat, subject_ID, english_status, Shipley_Voc_Score,
                      Gent_1_score, Gent_2_score, academic_year)
Arrange the data according to Gent_2_score, from highest to lowest. Save this as a new object called “summaryvdat_sort”
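One way to do this is with arrange() and desc(), which sorts from highest to lowest:
summaryvdat_sort <- arrange(summaryvdat, desc(Gent_2_score))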
Draw graphs of the following relations using the summaryvdat object:
English status and academic year
Vocabulary score and academic year (choose any of the vocabulary measures)
Using the vocabgenderdat object, draw a graph of the relations between:
Vocabulary score and gender
And using the genderlangdat object, draw a graph of the relations between:
Gender and english status (language)
Answer
ggplot(summaryvdat, aes(x = academic_year, fill = english_status)) +
  geom_bar(position = "dodge")
ggplot(summaryvdat, aes(x = academic_year, y = Gent_1_score)) +
  geom_boxplot()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).
ggplot(vocabgenderdat, aes(x = Gender, y = Gent_1_score)) +
  geom_boxplot()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).
ggplot(genderlangdat, aes(x = Gender, fill = english_status)) +
  geom_bar(position = "dodge")
Save your script file.
4.4.1.3 Part 3: Grouping data in R studio
Task 4: Loading and joining data in R studio
Now, let’s clear out RStudio before we get started again, using rm(list=ls()).
Go to the data files from week 2 and load them into RStudio again (“ahicesd.csv” and “participantinfo.csv”). If you need to download these again, you can get them here: ahicesd.csv, participantinfo.csv.
Remember these data come from this study: Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1).
Remind yourself of the aim of the study and the variables that are in the data set (see the end of this workbook for a repeated description of the study).
Next, load and join the ahicesd.csv and participantinfo.csv data in RStudio. Call the joined data set “all_dat” (see the week 2 workbook for reminders about this).
dat <- read_csv("data/wk2/ahicesd.csv")
Rows: 992 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (50): id, occasion, elapsed.days, intervention, ahi01, ahi02, ahi03, ahi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pinfo <- read_csv("data/wk2/participantinfo.csv")
Rows: 295 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, intervention, sex, age, educ, income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all_dat <- inner_join(x = dat, y = pinfo, by = c("id", "intervention"))
Task 5: Selecting and manipulating data
We’re not interested in the individual questionnaire items. So, let’s select all the variables we want to keep (omitting the individual questionnaire items), and save this to an object called summary_all_dat (again see week 2 workbook for reminder)
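One way to do this (a sketch: the non-item columns here are taken from the column listings above, and ahiTotal and cesdTotal are the questionnaire totals we use below):
summary_all_dat <- select(all_dat, id, occasion, elapsed.days, intervention,
                          ahiTotal, cesdTotal, sex, age, educ, income)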
Next, we will add another variable to the data. We use the function mutate() for this. Let’s scale the ahiTotal and cesdTotal values and add them to the summary_all_dat set.
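The command would look something like this - the object name summary_all_dat_scale and the variable name ahiTotalscale follow the names used below, and cesdTotalscale is named by analogy:
summary_all_dat_scale <- mutate(summary_all_dat,
                                ahiTotalscale = scale(ahiTotal),
                                cesdTotalscale = scale(cesdTotal))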
What are the minimum and maximum values of the new variable ahiTotalscale?
What do these scale values mean? (reminder: they are Z scores).
Hint
Use the arrange() function, or the min() and max() functions.
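For instance, either of these approaches will work:
# sort by the new variable (smallest first; use desc() to see the largest):
arrange(summary_all_dat_scale, ahiTotalscale)
# or compute the minimum and maximum directly:
min(summary_all_dat_scale$ahiTotalscale)
max(summary_all_dat_scale$ahiTotalscale)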
The values are z-scores, so the numbers indicate how many standard deviations the values are from the mean.
The next way we will work with the data is to organise the observations into different groups. We will use the function summarise(). So, instead of mean(summary_all_dat_scale$ahiTotal) you can use this, which turns out to be a much more powerful way of looking at the data:
summarise(summary_all_dat_scale, mean(ahiTotal))
They should give the same results - check that they do. This function summarise() is more powerful because you can look at several values at the same time, e.g., with something like this (the result names mean_ahi, sd_ahi, and so on are just illustrative):
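summarise(summary_all_dat_scale,
          mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal),
          mean_cesd = mean(cesdTotal), sd_cesd = sd(cesdTotal))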
But now let’s think about what kind of patterns we’d like to investigate in the data. There are four interventions conducted in this study. Let’s look at each of these interventions and their effect on ahiTotal and cesdTotal.
We can look at subgroups of data either by using the filter() function, or by using the function group_by(). The advantage of group_by() is that we can look at several groups at the same time, rather than dividing up the data file into pieces. Let’s organise by the different interventions.
summary_all_dat_scale_intervention <- group_by(summary_all_dat_scale, intervention)
This command takes the data summary_all_dat_scale, and then groups it according to the four interventions in the data. We can’t yet see any difference in summary_all_dat_scale_intervention but it’s in there, lurking, just waiting.
Now, we can look at the means for each intervention using the summarise function again. Run the summarise function on summary_all_dat_scale_intervention. What happens?
It separates out the means and SDs by the group_by condition - i.e., by intervention.
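For example, a command along these lines (the result names are illustrative):
summarise(summary_all_dat_scale_intervention,
          mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal))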
You can also group by several factors at the same time. We can group by intervention and get means and standard deviations, but that is not going to give us a huge amount of insight into how the interventions affect the happiness measure because we are combining the mean of ahiTotal across all occasions of testing, including testing before the intervention has been applied.
So, let’s group by intervention and occasion of testing. The commands would look something like this (the object name is just illustrative):
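summary_all_dat_scale_int_occ <- group_by(summary_all_dat_scale, intervention, occasion)
summarise(summary_all_dat_scale_int_occ,
          mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal))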
This doesn’t print all the lines out, so you can make a new object (e.g., called sum_output) and view that, or you can filter out some of the lines so we only look at the first and second occasions of testing.
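A sketch of that, reusing the grouped object from above (check how occasion is coded in your data and adjust the filter values accordingly):
sum_output <- summarise(summary_all_dat_scale_int_occ,
                        mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal))
filter(sum_output, occasion %in% c(1, 2))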
Draw a scatter plot of ahiTotal and cesdTotal values for the whole data set.
Hint
Use the ggplot() function with geom_point()
And make it a bit more beautiful using the labs() addition.
Answer
ggplot(summary_all_dat, aes(x = ahiTotal, y = cesdTotal)) +
  geom_point() +
  labs(title = "Happiness and Depression Scores",
       x = "Happiness Score", y = "Depression Score")
Now redraw the plot, but colour the points according to whether they are the first, second, third, etc. occasion of testing. Add col = occasion into the aes() call, so that this part looks like this: aes(x = ahiTotal, y = cesdTotal, col = occasion)
Answer
ggplot(summary_all_dat, aes(x = ahiTotal, y = cesdTotal, col = occasion)) +
  geom_point() +
  labs(title = "Happiness and Depression Scores",
       x = "Happiness Score", y = "Depression Score")
4.4.1.5 Part 5: Working out whether nominal data are random or structured: Repeating the analyses from Lecture week 4 part 3
Task 7: Chi-squared and Cramer’s V
Let’s now have a look at running Chi-squared and Cramer’s V tests in R. Download the titanic data.
Read the titanic.csv into an object called “titanic”.
View the data. It should correspond to the data in the overhead slides.
Make a bar graph to count the numbers of survived and died by class.
Answer
titanic <- read_csv("data/wk4/titanic.csv")
New names:
• `` -> `...1`
Rows: 1309 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): class, survival
dbl (1): ...1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(titanic, aes(x = class, fill = survival)) +
  geom_bar(position = "dodge")
Now let’s see if there is a significant relation between class and survival using Chi-squared:
chisq.test(x = titanic$class, y = titanic$survival)
Pearson's Chi-squared test
data: titanic$class and titanic$survival
X-squared = 127.86, df = 2, p-value < 2.2e-16
The results give the chi-squared value, the number of degrees of freedom, and the p-value. p-value < 2.2e-16 means p is smaller than .00000000000000022. That’s highly significant. It means the observations are divided across the categories in a way that is very unlikely to be due to chance: if survival had been unrelated to class, a division of counts this extreme would come up by chance less than about 2 times in 10 quadrillion.
In a report, you would write:
X-squared(2, N = 1309) = 127.86, p < .001.
To understand where the significant effect comes from, we need to look at where in our table of counts there is a big discrepancy between the expected frequency and the actual frequency. We can do this by analysing the “standardised residuals” of the chi-squared test.
Repeat the chi-squared test on the titanic data set, and save the result of the test into a new object called “titanic_chisq_result”.
Then, look at the standardised residuals that are saved in the test results - the standardised residuals are saved in a variable called stdres.
Hint
Here is how you do it
titanic_chisq_result <- chisq.test(x = titanic$class, y = titanic$survival)
titanic_chisq_result$stdres
               titanic$survival
titanic$class        died  survived
       first   -10.110480 10.110480
       second   -1.837589  1.837589
       third    10.254442 -10.254442
Negative values indicate that actual counts are lower than expected; positive values indicate that actual counts are higher than expected. The standardised residuals indicate that fewer first class passengers and more third class passengers died than expected, while second class passengers died at a level close to that expected from the overall numbers of deaths.
Now, let’s compute Cramer’s V. First, we need to make sure we have the library lsr loaded in.
library(lsr)
Then run the test:
cramersV(x = titanic$class, y = titanic$survival)
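You can also see the relation between Cramer’s V and the Chi-squared test directly: V = sqrt(X-squared / (N * min(r - 1, c - 1))), where r and c are the numbers of rows and columns in the table of counts. As a check, computing this from the chi-squared result above should match what cramersV() returns:
# V = sqrt(X-squared / (N * min(r - 1, c - 1)))
# for the 3 (class) x 2 (survival) titanic table:
sqrt(127.86 / (1309 * min(3 - 1, 2 - 1))) # approximately 0.31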
Your next task is to run some Chi-squared and Cramer’s V tests on some of the other nominal data. Open the data “PSYC411-shipley-scores-anonymous-17_25_gender_language.csv” again. Investigate the association between gender and english status (are there different distributions of males and females for those with English as a first or additional language) using Chi-squared and Cramer’s V. Is it significant?
Rows: 293 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): english_status, Gender
dbl (1): subject_ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chisq.test(x = genderlangdat$Gender, y = genderlangdat$english_status)
Warning in chisq.test(x = genderlangdat$Gender, genderlangdat$english_status):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: genderlangdat$Gender and genderlangdat$english_status
X-squared = 0.57694, df = 2, p-value = 0.7494
cramersV(x = genderlangdat$Gender, y = genderlangdat$english_status)
Warning in stats::chisq.test(...): Chi-squared approximation may be incorrect
[1] 0.0450721
It is not significant, X-squared(2, N = 284) = 0.58, p = .749, Cramer’s V = 0.05, but note that there are fewer than 5 observations in the “Other” gender category, which gives us the warning message “Chi-squared approximation may be incorrect”.
We can focus just on the Male/Female data if, for instance, we are interested for the purposes of our investigation only in these larger categories:
genderlangdatmf <- filter(genderlangdat, Gender == "Male" | Gender == "Female")
chisq.test(x = genderlangdatmf$Gender, y = genderlangdatmf$english_status)
Pearson's Chi-squared test with Yates' continuity correction
data: genderlangdatmf$Gender and genderlangdatmf$english_status
X-squared = 0.36297, df = 1, p-value = 0.5469
cramersV(x = genderlangdatmf$Gender, y = genderlangdatmf$english_status)
[1] 0.03594057
It is still not significant, X-squared(1, N = 281) = 0.36, p = .547, Cramer’s V = 0.04.
What about the association between english_status and year?
Rows: 293 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): english_status
dbl (6): subject_ID, Age, Shipley_Voc_Score, Gent_1_score, Gent_2_score, aca...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chisq.test(x = vocabyearlangdat$academic_year, y = vocabyearlangdat$english_status)
Pearson's Chi-squared test
data: vocabyearlangdat$academic_year and vocabyearlangdat$english_status
X-squared = 4.746, df = 8, p-value = 0.7843
cramersV(x = vocabyearlangdat$academic_year, y = vocabyearlangdat$english_status)
[1] 0.1290454
It is not significant, X-squared(8, N = 285) = 4.75, p = .784, Cramer’s V = 0.13.
4.4.1.6 Part 6: More practice using Chi-squared and Cramer’s V test
Task 8: More Chi-squared and Cramer’s V tests
Look at the “ahicesd.csv” and “participantinfo.csv” data sets from week 2 again. Which nominal measures could you look for an association between? Report the Chi-squared test and Cramer’s V results for these associations. Are these associations significant? How do you interpret any significant associations?
Answer
For example, you could have a go at occasion and intervention. This would tell you whether the participants are unevenly distributed across the occasions and interventions:
a <- group_by(all_dat, occasion, intervention)
chisq.test(x = all_dat$occasion, y = all_dat$intervention)
Pearson's Chi-squared test
data: all_dat$occasion and all_dat$intervention
X-squared = 9.7806, df = 15, p-value = 0.8333
cramersV(x = all_dat$occasion, y = all_dat$intervention)
[1] 0.05732779
It is not significant: X-squared(15, N = 992) = 9.78, p = .833, Cramer’s V = 0.06. We do not have evidence that there are different distributions of participants across the occasions and interventions.
4.4.1.7 Part 7: Extra practice downloading and analysing data
Here is another dataset for you to investigate:
Papoutsi, C., Zimianiti, E., Bosker, H. R., & Frost, R. L. (2024). Statistical learning at a virtual cocktail party. Psychonomic Bulletin & Review, 31(2), 849-861. https://doi.org/10.3758/s13423-023-02384-1
The data are available on OSF, and a cleaned version of the dataset is also available here. If this doesn’t work when you try to upload it into psy-rstudio.lancaster.ac.uk, you can always get the data from the OSF site using this command:
dat <- read_csv('https://osf.io/download/ky4u6/')
There is also an R Markdown file on the OSF site - Data_analysis_script.rmd.
This is a special kind of R script which stores the results alongside the commands.
You should be able to scroll through it and see some RStudio commands that might be familiar.
Our challenge to you is to make a figure that looks a bit like their Figure 2, e.g., construct a boxplot of some of these data (though the figure they use is called a pirate plot). If you’re keen to learn, there is more information on pirate plots here: pirateplots
Rows: 10236 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): participant, test, test_speaker, word1, word2, test_type, testpair...
dbl (11): trialNo, tablerow, soundrow, correct_answer, rt, status, trialaccu...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
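As a starting point, here is a minimal boxplot sketch using two of the columns listed above (test_type and rt); swap in whichever variables correspond to the measure you want to explore:
ggplot(dat, aes(x = test_type, y = rt)) +
  geom_boxplot()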
Have a further browse of Psychological Science for data sets that you can download and begin to explore. Practise applying the data manipulation and graphing functions to these data sets. Or here is another one you might find interesting:
Woodworth, R. J., O’Brien‐Malone, A., Diamond, M. R., & Schüz, B. (2017). Web‐based positive psychology interventions: A reexamination of effectiveness. Journal of Clinical Psychology, 73(3), 218-232. Data available here
Description of their study:
In our study we attempted a partial replication of the study of Seligman, Steen, Park, and Peterson (2005) which had suggested that the web-based delivery of positive psychology exercises could, as the consequence of containing specific, powerful therapeutic ingredients, effect greater increases in happiness and greater reductions in depression than could a placebo control. Participants (n=295) were randomly allocated to one of four intervention groups referred to, in accordance with the terminology in Seligman et al. (2005), as 1: Using Signature Strengths; 2: Three Good Things; 3: Gratitude Visit; 4: Early Memories (placebo control). At the commencement of the study, participants provided basic demographic information (age, sex, education, income) in addition to completing a pretest on the Authentic Happiness Inventory (AHI) and the Center for Epidemiologic Studies-Depression (CES-D) scale. Participants were asked to complete intervention-related activities during the week following the pretest. Further measurements were then made on the AHI and CESD immediately after the intervention period (‘posttest’) and then 1 month after the posttest (day 38), 3 months after the posttest (day 98), and 6 months after the posttest (day 189). Participants were not able to complete a follow-up questionnaire prior to the time that it was due, but might have completed it either at the time that it was due or later. We recorded the date and time at which follow-up questionnaires were completed.