Week 3. Drawing graphs from data

Written by Padraic Monaghan

3.1 Overview

This week, there are three mini lectures, the practical workbook working with RStudio, and the first assignment (due by Friday of week 3).

Before the practical on Tuesday, please try to work through the practical workbook in your group.

Bring your questions (and/or answers) to the practical.

The assignment is on moodle here.

3.2 Learning Goals

  • See how data gathering for vocabulary knowledge can be conducted

  • Understand different types of reproducibility

  • Understand the relations between observed score, true score, and error score

  • Understand multiple types of graphs and how to construct them in RStudio

  • Manipulate data, including filtering, grouping and summarizing data in RStudio

3.3 Lectures and slides

3.3.1 Lectures

  • Watch Lecture week 3 part 1

  • Watch Lecture week 3 part 2

  • Watch Lecture week 3 part 3

Do the quick quiz (not assessed).

3.3.2 Slides

Download the lecture slides for:

I haven’t put slides for part 1 here because they just said to do the different vocabulary tests.

3.4 Practical Materials

3.4.1 Workbook

Part 1 is revision from last week.

Part 2 is to reproduce the Lecture week 3 part 3 analyses.

Part 3 is to practise manipulating data.

Part 4 starts making visualisations of data.

Part 5 is some more advanced manipulation of data.

Parts 6 and 7 are extra: extending your knowledge.

If you’ve followed a course in RStudio before, then Parts 1, 6 and 7 will extend your knowledge.

3.4.1.1 Part 1: Revision from last week

If you had a look at the data from Stamkou et al.’s study last week, then that’s great: follow Task 1. Otherwise, go straight to Task 2.

Task 1: Describe Stamkou et al.’s study from last week
  1. Talk to your group about what you did.

  2. From the Stamkou et al. data, can you look at whether there was a change in happiness ratings from the baseline (PracticeHappy) to after watching the video (VideoHappy)?

Remember first to load in the libraries Hmisc and tidyverse, and then use spss.get to load in the data. There are two ways to load in data that comes from SPSS (when the data file ends with .sav). The way we did this in the week 1 workbook used the library Hmisc and the function spss.get. The other way is to use the library haven and the function read_sav.

library(Hmisc)
dat <- spss.get("stamkou_Study 1_edited data.sav")

or:

library(haven)
dat <- read_sav("stamkou_Study 1_edited data.sav")
  1. If you used spss.get, the data, as you may have seen, are rather messy and need some tidying up. If you used read_sav then these issues don’t apply. With spss.get, the ratings of happiness in particular are a bit all over the place (this is because spss.get keeps information about the responses but also about what those responses mean, and sometimes mixes them up together). We want these to just be numbers (1, 2, 3, 4), but instead they have the coding for what the numbers mean as well. If you’ve loaded your data set into an object called “dat”, then the following lines will tidy things up:
dat$PracticeHappy <- gsub("\n1=not happy at all\n", "1", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n2=a bit happy\n", "2", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n3=quite happy\n", "3", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n4=very happy\n", "4", dat$PracticeHappy)
dat$PracticeHappy <- as.numeric(dat$PracticeHappy)
dat$VideoHappy <- gsub("\n1=not happy at all\n", "1", dat$VideoHappy)
dat$VideoHappy <- gsub("\n2=a bit happy\n", "2", dat$VideoHappy)
dat$VideoHappy <- gsub("\n3=quite happy\n", "3", dat$VideoHappy)
dat$VideoHappy <- gsub("\n4=very happy\n", "4", dat$VideoHappy)
dat$VideoHappy <- as.numeric(dat$VideoHappy)
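The repeated gsub() lines above can be condensed when every label contains exactly one number. The following is only a minimal sketch under that assumption; the vector raw is made-up toy data standing in for a column imported by spss.get, not the real responses:

```r
# Toy stand-in for dat$PracticeHappy as imported by spss.get (made-up values)
raw <- c("\n1=not happy at all\n", "\n3=quite happy\n", "\n4=very happy\n", NA)

# Remove every character that is not a digit, then convert the result to numbers
clean <- as.numeric(gsub("[^0-9]", "", raw))
print(clean)
```

This shortcut would give wrong answers if a label contained more than one number, so the explicit version above is the safer choice when you are unsure what the labels look like.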
  1. Now, you can look at plotting the happiness ratings at baseline against happiness ratings after the children watch the video. You can also look at the means and standard deviations of the ratings to see if there is a numerical difference. What are the means and SDs?

Remember: You will need to load in the tidyverse library at this stage.

library(tidyverse)
plot(dat$PracticeHappy, dat$VideoHappy)
summarise(dat, mean(PracticeHappy), sd(PracticeHappy), mean(VideoHappy, na.rm=T), sd(VideoHappy, na.rm=T))

mean(PracticeHappy) sd(PracticeHappy) mean(VideoHappy, na.rm = T) sd(VideoHappy, na.rm = T)

1 2.638889 0.7822321 1.983051 0.828982

or you could use:

mean(dat$PracticeHappy); sd(dat$PracticeHappy); mean(dat$VideoHappy, na.rm=T); sd(dat$VideoHappy, na.rm=T)
  1. Later in the workbook, we use the functions group_by and summarise, so you could use these to look at the means separately for the emotion conditions (EmotionCondition) - how would you do that? What are the means and SDs? What effect does there seem to be of the emotion conditions on happiness mood change? If you don’t yet know how to do this, then come back to it after you’ve completed the rest of the workbook.
dat2 <- group_by(dat, EmotionCondition)
summarise(dat2, mean(PracticeHappy), sd(PracticeHappy), mean(VideoHappy, na.rm=T), sd(VideoHappy, na.rm=T))

EmotionCondition mean(PracticeHappy) sd(PracticeHappy) mean(VideoHappy, na.rm = T) sd(VideoHappy, na.rm = T)

1 Awe 2.68 0.800 2.08 0.747

2 Joy 2.5 0.701 2.22 0.956

3 Control 2.74 0.835 1.62 0.648

Happiness decreases in all conditions: most in the Control condition, less in the Awe condition, and least in the Joy condition.
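One way to look at mood change directly is to compute a change score for each child and then average it within each condition. This is a sketch only: the data frame toy and its values are made up for illustration, and the column name HappyChange is an assumption, not a variable in the real data set.

```r
library(dplyr)

# Made-up toy ratings, not the real study data
toy <- data.frame(
  EmotionCondition = c("Awe", "Awe", "Joy", "Joy", "Control", "Control"),
  PracticeHappy    = c(3, 2, 3, 2, 3, 3),
  VideoHappy       = c(2, 2, 3, NA, 1, 2)
)

# Change score: negative means less happy after the video than at baseline
toy <- mutate(toy, HappyChange = VideoHappy - PracticeHappy)

# Mean change per condition, ignoring missing values
change_by_condition <- summarise(
  group_by(toy, EmotionCondition),
  mean_change = mean(HappyChange, na.rm = TRUE)
)
print(change_by_condition)
```

With the real data you would replace toy with dat; the more negative the mean change, the larger the drop in happiness for that condition.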

3.4.1.2 Part 2: Reproduce the Lecture week 3 part 3 analyses

Task 2: Load in the data, draw a histogram, find means and standard deviations
  1. Create a new R script, called psyc411_week3.r, and clear out RStudio ready for a new script using rm(list=ls()).

  2. Download the data files from last week once again on the vocabulary tests here: PSYC411-shipley-scores-anonymous-17_18.csv.

  3. Load the data into an object called “dat” using read_csv(). What command line do you use? (Remember to set the working directory, and to have the tidyverse library loaded, because read_csv() comes from it.)

dat <- read_csv("PSYC411-shipley-scores-anonymous-17_18.csv")
  1. View the data. What command do you use?
View(dat)
  1. We can make a histogram of the first time people took the Gent vocabulary test:
hist(dat$Gent_1_score)
  1. And a histogram of the second time people took the Gent test, what command line do you use?
hist(dat$Gent_2_score)
  1. We will use a function called summarise. This is more powerful than just using mean(dat$Gent_1_score) and sd(dat$Gent_1_score), etc. because it can give us summary statistics on subgroups of data, and can give us much more than just means and SDs. Have a look at the documentation for the function:
?summarise

We use it like this: first put the object that the variables are in, then tell summarise what to do with each variable - eg compute the mean or sd or number of times it’s measured (that’s the n() part), or whatever:

summarise(dat, mean(Gent_1_score), sd(Gent_1_score), mean(Gent_2_score), sd(Gent_2_score), n())

However, this won’t have worked: it will just say NA for the means and SDs. This is because mean() and sd() return NA if any of the values they are given are missing (missing values are called NA in R). So, we have to tell these functions to ignore the NAs by adding na.rm=T.

summarise(dat, mean(Gent_1_score, na.rm=T), sd(Gent_1_score, na.rm=T), mean(Gent_2_score, na.rm=T), sd(Gent_2_score, na.rm=T), n())
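The effect of na.rm can be seen on a tiny example outside of summarise. This is a minimal sketch with made-up numbers, not the class data:

```r
# Made-up toy scores with one missing value
scores <- c(50, 60, NA)

mean_with_na    <- mean(scores)                # NA: the missing value propagates
mean_without_na <- mean(scores, na.rm = TRUE)  # 55: the NA is dropped first
n_missing       <- sum(is.na(scores))          # 1: how many values are missing

print(c(mean_with_na, mean_without_na, n_missing))
```

Counting the missing values, as in the last line, is a useful habit: n() inside summarise counts rows, including those where the score is NA.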

What are the mean and standard deviation of Gent_1_score and Gent_2_score?

summarise(dat, mean(Gent_1_score, na.rm=T), sd(Gent_1_score, na.rm=T), mean(Gent_2_score, na.rm=T), sd(Gent_2_score, na.rm=T), n())

mean(Gent_1_score, na.rm = T) sd(Gent_1_score, na.rm = T) mean(Gent_2_score, na.rm = T) sd(Gent_2_score, na.rm = T) n()

57.4 16.0 57.1 18.7 81

Task 3: Use ggplot to draw some histograms:
  1. Now we are going to use another way of making graphs. This is more flexible than the hist function. Here is how to make a histogram of the Gent vocabulary scores:
ggplot(dat, aes(x = Gent_1_score) ) + geom_histogram(fill="blue") + labs(title="Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency")
  1. And for the second Gent test:
ggplot(dat, aes(x = Gent_2_score) ) + geom_histogram(fill="red") + labs(title="Gent Vocabulary Test 2", x = "Vocabulary Score", y = "Frequency")
Note

Breaking it down:

ggplot(dat, aes(x = Gent_1_score)): this calls the plotting function ggplot.

dat: this specifies the data set we will use.

aes(x = Gent_1_score): this sets the data for the plot. In this case we say that the x value (so that’s what will be along the x-axis in the graph) is Gent_1_score. We put this inside aes(), which stands for “aesthetic”.

+ geom_histogram(fill="blue"): this adds a graph of type histogram and colours it blue

+ labs(title="Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency"): this adds labels to the graph: the title, the x-axis label and the y-axis label.
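By default geom_histogram() splits the data into 30 bins and prints a message suggesting you pick a better value. The sketch below sets the bin width by hand. The data frame toy is made up for illustration, and the choices of binwidth = 5 and a white outline are assumptions to experiment with, not recommended settings:

```r
library(ggplot2)

# Made-up toy scores, not the real class data
toy <- data.frame(Gent_1_score = c(35, 42, 48, 51, 55, 58, 60, 63, 67, 72, 80))

# binwidth = 5 means each bar covers 5 points of vocabulary score;
# colour = "white" outlines the bars so neighbouring bars are easy to tell apart
p <- ggplot(toy, aes(x = Gent_1_score)) +
  geom_histogram(binwidth = 5, fill = "blue", colour = "white") +
  labs(title = "Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency")
print(p)
```

Try a few different values of binwidth with the real data: very narrow bins make the histogram look spiky, very wide bins hide the shape of the distribution.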

3.4.1.3 Part 3: Practise manipulating data

Task 4: Practise manipulating data
  1. Let’s keep only some of the variables from the dataset dat: we will remove Gender_code and Dyslexia_diagnosis. Keep the other variables using select() and store the result in a new object called summarydata.
summarydata <- select(dat, subject_ID, Age, english_status, Gender, Shipley_Voc_Score, Gent_1_score, Gent_2_score, academic_year)
  1. Next we will have a bit more of a wander around the data to get a feel for it. We will first use the function arrange(), which changes the order of observations (rows):
arrange(summarydata, Shipley_Voc_Score)

What is the lowest score of a participant on the Shipley Vocabulary questionnaire? (You may like to make a new object, which is the result of the arrange function, then look at it in View).

10

  1. If you want to order from highest to lowest, you have to use the desc() function:
arrange(summarydata, desc(Shipley_Voc_Score))

What is the highest value on the Shipley Vocabulary Test? How many participants have this highest score?

38, 2 participants

  1. Next we will use the filter() function. This includes or excludes certain observations (rows). Let’s just include the participants with English as a first language and put this into a new object, called summarydata_enl.
summarydata_enl <- filter(summarydata, english_status == 'native')

What are the mean and SD values of the Shipley Vocabulary test for the native speakers? (use the summarise function)

summarise(summarydata_enl, mean(Shipley_Voc_Score), sd(Shipley_Voc_Score))

mean = 32.1, sd = 3.25

  1. Make another variable with the z-scores of the Shipley Vocabulary test (see week 1 workbook, 1.4.1.1 Part 1, Task 4). What are the maximum and minimum z-scores? (you can use min() and max() in summarise to compute these)?
summarydata$zshipley <- scale(summarydata$Shipley_Voc_Score)
summarise(summarydata, min(zshipley), max(zshipley))

Min: -2.46, max = 1.43
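A z-score is each value minus the mean, divided by the standard deviation, and that is what scale() computes by default. The following sketch checks this by hand on made-up numbers (scores is toy data, not the Shipley scores):

```r
# Made-up toy scores
scores <- c(10, 20, 30, 40)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a one-column matrix, so convert it to a plain vector to compare
z_scale <- as.numeric(scale(scores))

print(z_manual)
print(all.equal(z_manual, z_scale))
```

After standardising, the z-scores have a mean of 0 and a standard deviation of 1, so the minimum and maximum tell you how many standard deviations the most extreme participants are from the average.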

  1. Making subgroups of data using group_by function. Another handy function in the tidyverse library is group_by. This makes subgroups of data which can then be looked at separately in other functions, such as summarise. Above, we filtered just the “native” English speakers, but we can specify subgroups of speakers instead. We make a new object of the grouped data (it looks just the same, but the groups are behind the scenes):

summarydata2 <- group_by(summarydata, english_status)

  1. Then we summarise summarydata2. What are the mean, SD and number of participants for the Shipley_Voc_Score for the ESL and native English speakers groups?
summarise(summarydata2, mean(Shipley_Voc_Score), sd(Shipley_Voc_Score), n())

english_status mean(Shipley_Voc_Score) sd(Shipley_Voc_Score) n()

1 ESL 21.6 6.76 33

2 native 32.1 3.25 42

3 NA 30.3 4.93 6
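The grouping and summarising steps can also be chained together with the pipe operator %>%, and the summary columns can be given short names so the output is easier to read. This is a sketch on made-up data; the names toy, by_status, mean_shipley, sd_shipley and n are illustrative choices, not part of the workbook data:

```r
library(dplyr)

# Made-up toy data, not the real class data
toy <- data.frame(
  english_status    = c("ESL", "ESL", "native", "native", "native"),
  Shipley_Voc_Score = c(20, 24, 30, 32, 34)
)

# Read the pipe as "and then": take toy, and then group it, and then summarise it
by_status <- toy %>%
  group_by(english_status) %>%
  summarise(
    mean_shipley = mean(Shipley_Voc_Score, na.rm = TRUE),
    sd_shipley   = sd(Shipley_Voc_Score, na.rm = TRUE),
    n            = n()
  )
print(by_status)
```

Naming the columns also makes it possible to use them afterwards, for example by_status$mean_shipley.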

  1. Remember to save your script file as you go along.

3.4.1.4 Part 4: Graphing data

Task 5: graphing data using histograms
  1. Previously we used plot to draw a scatter plot, and hist to draw a histogram. Now, we’re going to use ggplot which can draw all kinds of graphs, with a great deal more flexibility. We are going to represent the data to reflect the following relations:
  • English status and gender
  • Age and vocabulary score
  • Gender and vocabulary score
  • Academic year and vocabulary score
  • Academic year and age
  • English status and vocabulary score
  • English status and age

But first, let’s once more reproduce the histogram from the lecture slides to look at the distribution of variables:

ggplot(summarydata, aes(x = Gent_1_score)) +
  geom_histogram(fill="blue") + 
  labs(title="Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency")
  1. Now draw a histogram with Shipley_Voc_Score as the variable and colour it orange. Remember to change the title to something appropriate.
ggplot(summarydata, aes(x = Shipley_Voc_Score)) + geom_histogram(fill="orange") + labs(title="Shipley Vocabulary Test", x = "Vocabulary Score", y = "Frequency")

or how about this for a version that separates out the gender in the same figure:

ggplot(summarydata, aes(x = Shipley_Voc_Score, fill=Gender )) + geom_histogram(bins = 10) + labs(title="Shipley Vocabulary Test", x = "Vocabulary Score", y = "Frequency")
Task 6: graphing data using bar graphs
  1. Next let’s look at English status and gender. What types of variable are these? Nominal? Ordinal? Interval/ratio?

Both are nominal, so we can only count them.

  1. We will draw a bar graph of the counts. We use geom_bar() for this:
  • First try this:
ggplot(summarydata, aes(x = Gender)) + 
  geom_bar()
  • This just draws counts of Gender
  • Now let’s draw Gender and English Status together:
ggplot(summarydata, aes(x = Gender, fill = english_status)) + 
  geom_bar(position = "dodge")

Note 1: We use position = "dodge" so that it puts the bars next to each other (what happens if you leave out position = "dodge"?)

Note 2: We use fill = english_status so that it fills the different bars with different colours according to different english statuses.

What is the general pattern of counts? Are there proportional differences by English status according to gender?

Proportionally more of the males are native English speakers. But we don’t (yet) know if that’s significant…
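Proportions are hard to judge by eye from a bar graph of counts, so it can help to compute them. The sketch below uses base R functions table() and prop.table() on made-up data (the vectors gender and status are toy values, not the class data):

```r
# Made-up toy data, not the real class data
gender <- c("Female", "Female", "Female", "Female", "Male", "Male")
status <- c("ESL", "ESL", "native", "native", "native", "native")

counts <- table(gender, status)           # the counts that geom_bar() draws
props  <- prop.table(counts, margin = 1)  # proportions within each gender (rows sum to 1)

print(counts)
print(round(props, 2))
```

With the real data you could use table(summarydata$Gender, summarydata$english_status) in the same way and compare the rows.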

Task 7: graphing data using scatterplot
  1. Next we’ll look at Age and Shipley Vocabulary Score. What types of data are these?

Both are interval/ratio, so we can draw a point plot (scatterplot).

  1. We will draw a point plot of these values:
ggplot(summarydata, aes(x= Age, y = Shipley_Voc_Score)) + geom_point()

We can add + labs(title = "Age by Shipley Vocabulary Score", x = "Age", y = "Shipley Vocabulary Score") to tidy up presentation a bit.

ggplot(summarydata, aes(x= Age, y = Shipley_Voc_Score)) + 
  geom_point() + 
  labs(title = "Age by Shipley Vocabulary Score", x = "Age", y = "Shipley Vocabulary Score")
  • What is the relation between age and Shipley Vocabulary score?

It looks positive: older participants tend to have higher vocabulary scores.

Task 8: Draw and interpret a box plot
  1. Next on the list of relations to check is gender and vocabulary score. Let’s look at Gent_1_score against Gender. What type of variables are these?

One is ratio and one is nominal, so we need a box plot.

  1. We will draw a box plot (you could draw a bar graph, but box plots tend to be preferred for these combinations of variables - use a bar graph for counts):
ggplot(summarydata, aes(x= Gender, y = Gent_1_score)) + 
  geom_boxplot() 
  • Again we can tidy this up by adding labels:
ggplot(summarydata, aes(x= Gender, y = Gent_1_score)) + geom_boxplot() + labs(title = "Vocabulary Score by Gender", x = "Gender", y = "Gent Vocabulary Score Test 1")
  1. Interpreting box plots: The horizontal line inside the box indicates the median. The box indicates where the middle 50% of the data lie (from the first quartile to the third quartile). The lines (whiskers) extend to the most extreme values that are not classed as outliers, which gives an estimate of the range of the data. The dots indicate outliers. A taller box indicates that the data are more spread out. If the boxes don’t overlap much then this indicates there may be a difference between the groups.

Are there differences in Vocabulary according to gender?

Maybe females score slightly lower than males.
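The numbers behind a box plot can be computed directly, which can make the picture easier to interpret. This sketch uses the base R function boxplot.stats() on made-up scores; note that ggplot2 may compute the edges of the box slightly differently, so treat it as an illustration of the idea rather than an exact copy of what geom_boxplot() draws:

```r
# Made-up toy scores with one unusually high value
scores <- c(40, 45, 50, 55, 60, 65, 70, 120)

bs <- boxplot.stats(scores)

# bs$stats holds five numbers: lower whisker, lower edge of the box,
# median, upper edge of the box, upper whisker
print(bs$stats)

# bs$out holds the values that would be drawn as separate dots (outliers)
print(bs$out)
```

Here the value 120 lies more than 1.5 box-lengths above the top of the box, so it is treated as an outlier and the upper whisker stops at 70.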

  1. Now for the other relations:
  • Academic year and vocabulary score
  • Academic year and age
  • English status and vocabulary score
  • English status and age

At the moment, R is interpreting Academic year as a number. We need to turn it into a nominal variable (called a “factor” in R):

summarydata$academic_year <- as.factor(summarydata$academic_year)
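To see what as.factor() changes, it can help to inspect a small vector before and after the conversion. A minimal sketch, where academic_year is a made-up toy vector standing in for the column in summarydata:

```r
# Made-up toy values standing in for summarydata$academic_year
academic_year <- c(1, 2, 2, 3, 1)

is.numeric(academic_year)   # TRUE: R treats the years as numbers

year_factor <- as.factor(academic_year)
is.factor(year_factor)      # TRUE: now a categorical variable
levels(year_factor)         # the distinct categories: "1" "2" "3"
table(year_factor)          # how many observations fall in each category
```

Once the variable is a factor, ggplot draws one box per category instead of treating the x-axis as a continuous scale.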

Draw a graph for each of these relations.

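These are categorical against ratio variables, so box plots suit all of them (the last line shows a histogram of age split by English status as an alternative view):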

ggplot(summarydata, aes(x = academic_year, y = Gent_1_score) ) + geom_boxplot()
ggplot(summarydata, aes(x = academic_year, y = Age) ) + geom_boxplot()
ggplot(summarydata, aes(x = english_status, y = Gent_1_score) ) + geom_boxplot()
ggplot(summarydata, aes(x = english_status, y = Age) ) + geom_boxplot()
ggplot(summarydata, aes(x = Age, fill = english_status) ) + geom_histogram()
  1. Save your R script.

3.4.1.5 Part 5: More practice manipulating data

  1. What are the minimum and maximum values for the ESL and English native speakers?
summarise(summarydata2, min(Shipley_Voc_Score), max(Shipley_Voc_Score))

english_status min(Shipley_Voc_Score) max(Shipley_Voc_Score)

1 ESL 10 38

2 native 24 38

3 NA 24 35

3.4.1.6 Part 6: Extending your knowledge

  1. An important part of using RStudio is that people share their code and their tutorials, and help one another out in troubleshooting. Searching for commands and how to use them on search engines is absolutely fine, and supplementing your learning with others’ tutorials is also what the community does. Glasgow University psychology department has made some of their course materials available, so for more practice in using the different data manipulation functions, you could have a look at https://psyteachr.github.io/msc-conv/data-wrangling-1.html

Ignore the part about making a new R markdown document, use instead an R script.

Just go straight to clearing out the environment with rm(list=ls()), then load the libraries tidyverse and babynames. The babynames library contains the database of babies’ names that you can explore in the example (you’ll need to run install.packages("babynames") first).

And then begin at Section “4.4: Activity 2: Look at the data”.

  1. For the data that you downloaded from the papers for your extra exercises in weeks 1 and 2, construct some graphs to show relations between variables.
Note

  • Use geom_bar() for bar graphs of counts
  • Use geom_boxplot() for box plots showing medians and spread
  • Use geom_point() to show relations between two or more variables
  • Use geom_histogram() to show distributions of one or more variables

3.4.1.7 Part 7: Extending everyone’s knowledge (Optional extra)

If you want a real challenge, where we can get to the boundary of doing some original research, then here is another optional extra:

This is a recently published paper that draws on lots of different sources of secondary data. It is a really impressive example of what you can do with already existing data.

This is from the study de Zubicaray, G. I., Kearney, E., Guenther, F., McMahon, K. L., & Arciuli, J. (2024). Statistical relationships between surface form and sensory meanings of English words influence lexical processing. Journal of Experimental Psychology: Human Perception and Performance, in press.

My challenge to you is to get all the datasets together that this paper refers to, and see if you can repeat the first analysis in their paper. They actually have the R scripts (in this case they are R Markdown (.Rmd) files) as well as the data available.

If you can do that, then I will be amazed and impressed, and then I’ve got another task for you, which you’ll have to ask me for in the practical.

3.4.2 Data

Here is a link to the data referred to in this practical:

3.4.3 Answers

Now that the practical has finished, the answers appear below each question in the workbook above, so you can check your work.

It’s really important for your learning that you have a go first of all at the workbook before looking at the answers.
