Week 4. Testing nominal data
Written by Padraic Monaghan
4.1 Overview
4.2 Learning Goals
Understand the value of conducting statistical tests and interpreting p-values
Understand null effects and null hypotheses
Understand the difference between parametric and non-parametric data
Understand when to apply the Chi-squared test
Understand the relation between Cramer’s V test and the Chi-squared test
Be able to apply the Chi-squared test to data and interpret the result
Be able to apply Cramer’s V test to data
4.3 Lectures and slides
4.3.1 Lectures
4.3.2 Slides
4.4 Practical Materials
4.4.1 Workbook
The materials in this workbook share some material with the Glasgow University Psychology Department Teaching in R website.
In your group, work through this workbook, note any problems and questions you have, and come prepared to the online practical class to go through the tasks and ask your questions.
Part 1: Revision for last week
Task 1: Your data from the paper in Psychological Science
- Your take-home task was to produce some graphs of the data set downloaded from a paper in Psychological Science. Show your graphs and R script to the rest of your group.
Part 2: Load in the Vocabulary Scores Data and Produce Graphs
Task 2: Load in the Data
- Remember to clear out R first and load the tidyverse library:
rm(list=ls())
library(tidyverse)
The data set on the Shipley and Gent vocabulary scores is now updated with the data from your group, so it now contains eight years of PSYC411 students’ data. I’ve omitted Age as this might impact anonymity of the data. Download the data from here: PSYC411-shipley-scores-anonymous-17_24.csv and read the data into an object in R studio called vdat (for vocabulary data).
- As a reminder, when we want to look at a particular variable (a column) in an object in R studio, we refer to it using the $ notation. So, for the object vdat and the variable academic_year you would refer to it as vdat$academic_year. For this data set, we need to change academic year to be a nominal (factor) variable. Why are we setting academic year to be nominal and not interval/ratio?
vdat$academic_year <- as.factor(vdat$academic_year)
view(vdat) #view the data
- Make sure the tidyverse library is loaded. Select all the variables apart from Age and save as a new object called “summaryvdat”. We will omit these variables because they are not complete for the dataset.
- Arrange the data according to Gent_2_score, from highest to lowest. Save this as a new object called “summaryvdat_sort”
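The two steps above (dropping a column, then sorting) can be sketched on a small made-up data set. This is only an illustration under stated assumptions: the object names ending in _toy and all of the values are invented here, and only the column names follow the workbook.

```r
library(dplyr)

# Hypothetical toy data: column names follow the workbook, values are invented
vdat_toy <- tibble(
  academic_year = c(2017, 2018, 2018, 2019),
  Gender        = c("F", "M", "F", "M"),
  Age           = c(23, NA, 31, NA),
  Gent_2_score  = c(14, 19, 11, 17)
)

# Keep every column except Age
summaryvdat_toy <- select(vdat_toy, -Age)

# Sort from the highest to the lowest Gent_2_score
summaryvdat_sort_toy <- arrange(summaryvdat_toy, desc(Gent_2_score))
summaryvdat_sort_toy
```

The minus sign inside select() means "everything except this column", and desc() reverses the default lowest-to-highest order of arrange().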
Task 3: Draw Graphs of the Vocabulary Data
- Draw graphs of the following relations:
- English status and academic year
- Gender and academic year
- Vocabulary score and academic year
- Save your script file.
Part 3: Grouping data in R studio
Task 4: Loading and joining data in R studio
Now, let’s clear out R studio before we get started again using rm(list=ls()). Go to the data files from week 2 and load them into R studio again (“ahicesd.csv”, and “participantinfo.csv”). If you need to download these again, you can get them here: ahicesd.csv, participantinfo.csv.
- Remember these data come from this study: Woodworth, R.J., O’Brien-Malone, A., Diamond, M.R. and Schuz, B. (2018). Data from, “Web-based Positive Psychology Interventions: A Reexamination of Effectiveness”. Journal of Open Psychology Data, 6(1).
- Remind yourself of the aim of the study and the variables that are in the data set (see end of this script file for repeat description on the study).
- Next, load and join the ahicesd.csv and participantinfo.csv data in R studio. Call the joined data set “all_dat” (see week 2 workbook for reminders about this)
Task 5: Selecting and manipulating data
We’re not interested in the individual questionnaire items. So, let’s select all the variables we want to keep (omitting the individual questionnaire items), and save this to an object called summary_all_dat (again see week 2 workbook for reminder)
Next, we will add another variable to the data. We use the function mutate() for this. Let’s scale the ahiTotal and cesdTotal values and add them to the summary_all_dat set.
What are the minimum and maximum values of the new variable ahiTotalscale?
What do these scale values mean? (reminder: they are Z scores).
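To see what the scaled values mean, here is a minimal sketch on invented numbers. It assumes the new column is created with mutate() and scale(); the object names toy and toy_scale and the scores themselves are hypothetical. A Z score is each value minus the mean, divided by the standard deviation, so the code also computes it by hand for comparison.

```r
library(dplyr)

# Hypothetical happiness totals, invented for illustration
toy <- tibble(ahiTotal = c(60, 70, 80, 90, 100))

# scale() subtracts the mean and divides by the standard deviation;
# as.numeric() turns the one-column matrix it returns into a plain vector
toy_scale <- mutate(toy, ahiTotalscale = as.numeric(scale(ahiTotal)))

# The same values computed by hand
by_hand <- (toy$ahiTotal - mean(toy$ahiTotal)) / sd(toy$ahiTotal)

toy_scale
all.equal(toy_scale$ahiTotalscale, by_hand)
```

A scaled value of 0 sits exactly at the mean, and a value of 1 sits one standard deviation above it, so the minimum and maximum tell you how far the most extreme participants are from the average.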
- The next way we will work with the data is to organise the observations into different groups. We will use the function summarise(). So, instead of
mean(summary_all_dat_scale$ahiTotal)
you can use this, which turns out to be a much more powerful way of looking at the data:
summarise(summary_all_dat_scale, mean(ahiTotal))
- They should give the same results - check that they do. The summarise() function is more powerful because you can look at several values at the same time, e.g.:
summarise(summary_all_dat_scale, mean(ahiTotal), sd(ahiTotal), mean(cesdTotal), sd(cesdTotal))
- What is the result of this command?
- But now let’s think about what kind of patterns we’d like to investigate in the data. There are four interventions conducted in this study. Let’s look at each of these interventions and their effect on ahiTotal and cesdTotal.
We can look at subgroups of data either by using the filter() function, or by using the function group_by(). The advantage of group_by() is that we can look at several groups at the same time, rather than dividing up the data file into pieces. Let’s organise by the different interventions.
summary_all_dat_scale_intervention <- group_by(summary_all_dat_scale, intervention)
- This command takes the data summary_all_dat_scale, and then groups it according to the four interventions in the data. We can’t yet see any difference in summary_all_dat_scale_intervention but it’s in there, lurking, just waiting.
Now, we can look at the means for each intervention using the summarise function again. Run the summarise function on summary_all_dat_scale_intervention. What happens?
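The difference between summarising ungrouped and grouped data is easier to see on a tiny example. The sketch below uses invented scores for two hypothetical interventions; the object names toy, overall and by_group are assumptions made for illustration, not objects from the workbook.

```r
library(dplyr)

# Hypothetical scores for two interventions, invented for illustration
toy <- tibble(
  intervention = c(1, 1, 1, 2, 2, 2),
  ahiTotal     = c(60, 70, 80, 65, 75, 85)
)

# Ungrouped: a single row summarising the whole data set
overall <- summarise(toy, mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal))

# Grouped: the same call now returns one row per intervention
toy_intervention <- group_by(toy, intervention)
by_group <- summarise(toy_intervention, mean_ahi = mean(ahiTotal), sd_ahi = sd(ahiTotal))

overall
by_group
```

Naming the summaries (mean_ahi = ..., sd_ahi = ...) is optional, but it gives the output columns readable headings.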
- You can also group by several factors at the same time. We can group by intervention and get means and standard deviations, but that is not going to give us a huge amount of insight into how the interventions affect the happiness measure because we are combining the mean of ahiTotal across all occasions of testing, including testing before the intervention has been applied.
So, let’s group by intervention and occasion of testing:
summary_all_dat_intocc <- group_by(summary_all_dat_scale, intervention, occasion)
- Now produce the means and standard deviations of the happiness score (ahiTotal) for each intervention at each testing occasion.
- This doesn’t print all the lines out, so you can make a new object (e.g., called sum_output) and view that, or you can filter out some of the lines so we only look at the first and second occasion of testing.
Part 4: Graphing groups
Task 6: Graph some groups
- Draw a scatter plot of ahiTotal and cesdTotal values for the whole data set.
- Now redraw the plot, but colour the points according to whether they are first, second, third, etc. occasion of testing. Add col = occasion (without quotation marks) into the aes() part of the geom_point function, so that this part looks like this: aes(x = ahiTotal, y = cesdTotal, col = occasion)
Part 5: Working out whether nominal data is random or structured: Repeating the analyses from Lecture week4 part3
Task 7: Chi-squared and Cramer’s V
- Let’s now have a look at running Chi-squared and Cramer’s V tests in R. Download the titanic data.
Read the titanic.csv into an object called “titanic”.
View the data. It should correspond to the data in the overhead slides.
- Make a bar graph to count the numbers of survived and died by class.
- Now let’s see if there is a significant relation between class and survival using Chi-squared:
chisq.test(x = titanic$class, y = titanic$survival)
- The results give the chi-squared value, the number of degrees of freedom, and the p-value. R reports p < 2.2e-16, which means p is smaller than .00000000000000022. That’s highly significant. That means the observations are divided across the categories in a way that is very unlikely to be due to chance: if titanic survival were not related to class, a pattern at least this uneven would turn up by chance less than about 2 times in 10 quadrillion.
In a report, you would write:
Chi-squared(2, N = 1309) = 127.86, p < .001.
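As a rough sketch of where the reported numbers come from, the example below runs chisq.test() on a small table of invented counts (not the titanic data) and pulls out the pieces needed for the write-up. The table, the group and outcome labels, and the object names are all hypothetical. The expected counts are what you would see if the two variables were unrelated: row total times column total, divided by the grand total.

```r
# Hypothetical counts, invented for illustration (not the titanic data)
counts <- matrix(
  c(30, 10,
    20, 20,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("A", "B", "C"), outcome = c("yes", "no"))
)

result <- chisq.test(counts)

# Expected counts if group and outcome were unrelated:
# (row total * column total) / grand total
result$expected

# Pull out the pieces needed for the write-up
chi_sq <- unname(result$statistic)   # chi-squared value
df     <- unname(result$parameter)   # degrees of freedom
n      <- sum(counts)                # total number of observations
p      <- result$p.value

# Very small p-values are reported as "p < .001" rather than in full
p_text <- if (p < .001) "p < .001" else sprintf("p = %.3f", p)
report <- sprintf("Chi-squared(%d, N = %d) = %.2f, %s", df, n, chi_sq, p_text)
report
```

The degrees of freedom are (number of rows - 1) multiplied by (number of columns - 1), which is why a table with three classes and two outcomes has 2 degrees of freedom.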
- To understand where the significant effect comes from, we need to look at where in our table of counts there is a big discrepancy between the expected frequency and the actual frequency. We can do this by analysing the “standardised residuals” of the chi-squared test.
Repeat the chi-squared test on the titanic data set, and save the result of the test into a new object called “titanic_chisq_result”.
Then, look at the standardised residuals that are saved in the test results - the standardised residuals are saved in a variable called stdres.
Negative values indicate that actual counts are lower than expected, positive values indicate that actual counts are higher than expected. The standardised residuals indicate that fewer first class passengers and more third class passengers died than expected, while second class passengers died at a level close to that expected from the overall numbers of deaths.
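Here is a minimal sketch of reading standardised residuals, again on invented counts rather than the titanic data. The table and the object name toy_chisq_result are hypothetical. The rule of thumb in the comments (absolute values above about 2 are worth a closer look) is a common convention added here, not something stated in the workbook.

```r
# Hypothetical counts, invented for illustration (not the titanic data)
counts <- matrix(
  c(30, 10,
    20, 20,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("A", "B", "C"), outcome = c("yes", "no"))
)

# Save the whole test result so its components can be inspected
toy_chisq_result <- chisq.test(counts)

toy_chisq_result$observed   # the actual counts
toy_chisq_result$expected   # the counts expected if there were no association

# Standardised residuals: positive = more than expected, negative = fewer.
# Common rule of thumb (an assumption here): |value| above about 2 is notable.
toy_chisq_result$stdres
```

In this toy table group A has more "yes" outcomes than expected, group C has fewer, and group B sits exactly at the expected level, which mirrors the kind of pattern described for the three passenger classes.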
- Now, let’s compute Cramer’s V. First, we need to make sure we have the library lsr loaded in.
- Then run the test:
cramersV(x = titanic$class, y = titanic$survival)
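To show how Cramer’s V relates to the Chi-squared value, the sketch below computes it by hand in base R on invented counts, using the usual formula V = sqrt(chi-squared / (N * (k - 1))), where k is the smaller of the number of rows and columns. The table and object names are hypothetical. For a table of this shape the result should match what cramersV() gives; for 2 x 2 tables chisq.test() applies a continuity correction by default, so hand calculations can differ slightly.

```r
# Hypothetical counts, invented for illustration (not the titanic data)
counts <- matrix(
  c(30, 10,
    20, 20,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("A", "B", "C"), outcome = c("yes", "no"))
)

chi_sq <- unname(chisq.test(counts)$statistic)  # chi-squared value
n      <- sum(counts)                           # total number of observations
k      <- min(dim(counts))                      # smaller of rows and columns

# Cramer's V rescales chi-squared to a 0-1 measure of association strength
cramers_v <- sqrt(chi_sq / (n * (k - 1)))
cramers_v
```

Because V divides out the sample size, it tells you how strong the association is, whereas the Chi-squared test only tells you whether an association is likely to be there at all.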
- Your next task is to run some Chi-squared and Cramer’s V tests on some of the other nominal data. Open the data “PSYC411-shipley-scores-anonymous-17_24.csv” again. Investigate the association between gender and year (are there different distributions of males and females in each of our masters’ year cohorts) using Chi-squared and Cramer’s V. Is it significant?
- What about the association between english_status and Gender?
- What about the association between english_status and academic year?
Part 6: More practice using Chi-squared and Cramer’s V test
Task 8: More Chi-squared and Cramer’s V tests
- Look at the “ahicesd.csv” and “participantinfo.csv” data sets from week 2 again. Which pairs of nominal measures could you test for an association? Report the Chi-squared test and Cramer’s V results for these associations. Are these associations significant? How do you interpret the significant associations?
Part 7: Extra practice downloading and analysing data
- Here is another dataset for you to investigate:
Papoutsi, C., Zimianiti, E., Bosker, H. R., & Frost, R. L. (2024). Statistical learning at a virtual cocktail party. Psychonomic Bulletin & Review, 31(2), 849-861. https://doi.org/10.3758/s13423-023-02384-1
The data are available on OSF, but also a cleaned version of the dataset is available here. If this doesn’t work when you try to upload into psy-rstudio.lancaster.ac.uk, then you can always get the data from the osf site using this command:
dat <- read_csv('https://osf.io/download/ky4u6/')
There is also a file on the OSF site called Data_analysis_script.rmd, which is an R Markdown file.
This is a special kind of R script that also stores the results alongside the commands.
You should be able to scroll through it and see some of the R studio commands that might be familiar.
For more information on R-markdown, you can see here: https://r4ds.had.co.nz/r-markdown.html
- Our challenge to you is to make a Figure that looks a bit like their Figure 2, e.g., construct a boxplot of some of these data (though the Figure they use is called a pirate plot). If you’re keen to learn, there is more information on pirate plots here: pirateplots
- Have a further browse of Psychological Science for data sets that you can download and begin to explore. Practise applying the data manipulation and graphing functions to these data sets. Or here is another one you might find interesting:
Woodworth, R. J., O’Brien‐Malone, A., Diamond, M. R., & Schüz, B. (2017). Web‐based positive psychology interventions: A reexamination of effectiveness. Journal of Clinical Psychology, 73(3), 218-232. Data available here
Description of their study:
In our study we attempted a partial replication of the study of Seligman, Steen, Park, and Peterson (2005) which had suggested that the web-based delivery of positive psychology exercises could, as the consequence of containing specific, powerful therapeutic ingredients, effect greater increases in happiness and greater reductions in depression than could a placebo control. Participants (n=295) were randomly allocated to one of four intervention groups referred to, in accordance with the terminology in Seligman et al. (2005) as 1: Using Signature Strengths; 2: Three Good Things; 3: Gratitude Visit; 4: Early Memories (placebo control). At the commencement of the study, participants provided basic demographic information (age, sex, education, income) in addition to completing a pretest on the Authentic Happiness Inventory (AHI) and the Center for Epidemiologic Studies-Depression (CES-D) scale. Participants were asked to complete intervention-related activities during the week following the pretest. Further measurements were then made on the AHI and CES-D immediately after the intervention period (‘posttest’) and then 1 month after the posttest (day 38), 3 months after the posttest (day 98), and 6 months after the posttest (day 189). Participants were not able to complete a follow-up questionnaire prior to the time that it was due but might have completed it either at the time that it was due, or later. We recorded the date and time at which follow-up questionnaires were completed.
4.4.2 Data
Data referred to in this workbook:
4.4.3 Answers
Answers now appear in this page, above.
4.5 Extras
See the guides to reporting numbers and statistical tests in American Psychological Association format (the format that we use in Psychology for all reports).
- Read the Wikipedia article on the p-value. Wikipedia explains this topic well, including common misinterpretations of p-values.