library(Hmisc)
<- spss.get("stamkou_Study 1_edited data.sav") dat
Week 3. Drawing graphs from data
Written by Padraic Monaghan
3.1 Overview
This week, there are three mini lectures, the practical workbook working with R-studio, and the first assignment (due by Friday of week 3).
Before the practical on Tuesday, please try to work through the practical workbook in your group.
Bring your questions (and/or answers) to the practical.
The assignment is on moodle here.
3.2 Learning Goals
See how data gathering for vocabulary knowledge can be conducted
Understand different types of reproducibility
Understand the relations between observed score, true score, and error score
Understand multiple types of graphs and how to construct them in Rstudio
Manipulate data, including filtering, grouping and summarizing data in Rstudio
3.3 Lectures and slides
3.3.1 Lectures
- Watch Lecture 3 part 1
Have a go at the Gent Vocabulary test, and record your score.
Have another go at the Gent Vocabulary test, and record your score again.
Have a go at the Shipley Vocabulary test, and record your score.
Fill in your vocabulary scores into our course database: What is your vocabulary?
Watch Lecture week3 part2
Watch Lecture week3 part3
Do the quick quiz (not assessed).
3.3.2 Slides
Download the lecture slides for:
I haven’t put slides for part 1 here because they just said to do the different vocabulary tests.
3.4 Practical Materials
3.4.1 Workbook
Part 1 is revision from last week.
Part 2 is to reproduce the Lecture week 3 part 3 analyses.
Part 3 is to practise manipulating data.
Part 4 is starting making visualisations of data.
Part 5 is some more advanced manipulation of data.
Parts 6 and 7 are extra: extending your knowledge.
If you’ve followed a course in R studio before then Parts 1, 6 and 7 will extend your knowledge.
3.4.1.1 Part 1: Revision from last week
If you had a look at the data from Stamkou et al.’s study from last week, then that’s great, in which case follow Task 1, otherwise just go straight to Task 2.
Task 1: Describe Stamkou et al.’s study from last week
Talk to your group about what you did.
From the Stamkou et al. data, can you look at whether there was a change in happiness ratings from the baseline (PracticeHappy) to after watching the video (VideoHappy)?
- If you used get.spss, though, the data, as you may have seen, is rather messy, and needs some tidying up. If you used read_sav then these issues don’t apply. But for get.spss, in particular, the ratings of happiness are a bit all over the place. (this is because get.spss keeps information about responses but also what those responses mean and sometimes mixes them up together). We want these to just be numbers, 1,2,3,4, but instead they have the coding for what the numbers mean as well. If you’ve loaded in your data set into an object called “dat”: then the following lines will work to tidy things up:
$PracticeHappy <- gsub("\n1=not happy at all\n", "1", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n2=a bit happy\n", "2", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n3=quite happy\n", "3", dat$PracticeHappy)
dat$PracticeHappy <- gsub("\n4=very happy\n", "4", dat$PracticeHappy)
dat$PracticeHappy <- as.numeric(dat$PracticeHappy)
dat$VideoHappy <- gsub("\n1=not happy at all\n", "1", dat$VideoHappy)
dat$VideoHappy <- gsub("\n2=a bit happy\n", "2", dat$VideoHappy)
dat$VideoHappy <- gsub("\n3=quite happy\n", "3", dat$VideoHappy)
dat$VideoHappy <- gsub("\n4=very happy\n", "4", dat$VideoHappy)
dat$VideoHappy <- as.numeric(dat$VideoHappy) dat
- Now, you can look at plotting the happiness ratings at baseline against happiness ratings after the children watch the video. You can also look at the means and standard deviations of the ratings to see if there is a numerical difference. What are the means and SDs?
- Later in the workbook, we use the functions group_by and summarise, so you could use these to look at the means separately for the emotion conditions (EmotionCondition) - how would you do that? What are the means and SDs? What effect does there seem to be of the emotion conditions on happiness mood change? If you don’t yet know how to do this, then come back to it after you’ve completed the rest of the workbook.
Part 2: Reproduce the Lecture week3 part3 analyses
Task 2: Load in the data, draw a histogram, find means and standard deviations
Create a new r script, called psyc411_week3.r, and clear out R studio ready for a new script using
rm(list=ls())
.Download the data files from last week once again on the vocabulary tests here: PSYC411-shipley-scores-anonymous-17_18.csv.
Load the data into an object called “dat” using
read_csv()
, what command line do you use? (remember to set the working directory)
- View the data. What command do you use?
- We can make a histogram of the first time people took the Gent vocabulary test:
hist(dat$Gent_1_score)
- And a histogram of the second time people took the Gent test, what command line do you use?
- We will use a function called summarise. This is more powerful than just using mean(dat$Gent_1_score) and sd(dat$Gent_1_score), etc. because it can give us summary statistics on subgroups of data, and can give us much more than just means and SDs. Have a look at the documentation for the function:
?summarise
We use it like this: first put the object that the variables are in, then tell summarise what to do with each variable - eg compute the mean or sd or number of times it’s measured (that’s the n() part), or whatever:
summarise(dat, mean(Gent_1_score), sd(Gent_1_score), mean(Gent_2_score), sd(Gent_2_score), n())
However, this won’t have worked - it will just say NA for means and SDs. This is because summarise doesn’t work if there are missing values (called NA). So, we have to tell summarise to ignore the NAs.
summarise(dat, mean(Gent_1_score, na.rm=T), sd(Gent_1_score, na.rm=T), mean(Gent_2_score, na.rm=T), sd(Gent_2_score, na.rm=T), n())
What is the mean and standard deviation of Gent_1_score and Gent_2_score?
Task 3: Use ggplot to draw some histograms:
- Now we are going to use another way of making graphs. This is more flexible than the
hist
function. Here is how to make a histogram of the Gent vocabulary scores:
ggplot(dat, aes(x = Gent_1_score) ) + geom_histogram(fill="blue") + labs(title="Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency")
- And for the second Gent test:
ggplot(dat, aes(x = Gent_2_score) ) + geom_histogram(fill="red") + labs(title="Gent Vocabulary Test 2", x = "Vocabulary Score", y = "Frequency")
4.1.3.3 Part 3: Practise manipulating data
Task 4: Practise manipulating data
- Let’s keep only some of the variables from the dataset dat - let’s remove Gender_code, and Dyslexia_diagnosis. Keep the other variables using
select()
and load this into summarydata
<- select(dat, subject_ID, Age, english_status, Gender, Shipley_Voc_Score, Gent_1_score, Gent_2_score, academic_year) summarydata
- Next we will have a bit more of a wander around the data to get a feel for it. We will first use the function
arrange()
, which changes the order of observations (rows):
arrange(summarydata, Shipley_Voc_Score)
What is the lowest score of a participant on the Shipley Vocabulary questionnaire? (You may like to make a new object, which is the result of the arrange
function, then look at it in View).
- If you want to order from highest to lowest, you have to use the
desc()
function:
arrange(summarydata, desc(Shipley_Voc_Score))
What is the highest value on the Shipley Vocabulary Test? How many participants have this highest score?
- Next we will use the
filter()
function. This includes or excludes certain observations (rows). Let’s just include the participants with English as a first language and put this into a new object, called summarydata_enl.
<- filter(summarydata, english_status == 'native') summarydata_enl
What are the mean and SD values of the Shipley Vocabulary test for the native speakers? (use the summarise function)
- Make another variable with the z-scores of the Shipley Vocabulary test (see week 1 workbook, 1.4.1.1 Part 1, Task 4). What are the maximum and minimum z-scores? (you can use min() and max() in summarise to compute these)?
- Making subgroups of data using group_by function. Another handy function in the tidyverse library is group_by. This makes subgroups of data which can then be looked at separately in other functions, such as summarise. Above, we filtered just the “native” English speakers, but we can specify subgroups of speakers instead. We make a new object of the grouped data (it looks just the same, but the groups are behind the scenes):
summarydata2 <- group_by(summarydata, english_status)
- Then we summarise dat2. What is the mean, SD and number of participants for the Shipley_Voc_Score for the ESL and native English speakers groups?
- Remember to save your script file as you go along.
3.1.4.4 Part 4: Graphing data
Task 5: graphing data using histograms
- Previously we used plot to draw a scatter plot, and hist to draw a histogram. Now, we’re going to use ggplot which can draw all kinds of graphs, with a great deal more flexibility. We are going to represent the data to reflect the following relations:
- English status and gender
- Age and vocabulary score
- Gender and vocabulary score
- Academic year and vocabulary score
- Academic year and age
- English status and vocabulary score
- English status and age
But first, let’s repeat reproducing the histogram from the overhead slides to look at the distribution of variables:
ggplot(summarydata, aes(x = Gent_1_score)) +
geom_histogram(fill="blue") +
labs(title="Gent Vocabulary Test 1", x = "Vocabulary Score", y = "Frequency")
- Now draw a histogram with Shipley_Voc_Score as the variable and colour it orange. Remember to change the title to something appropriate.
Task 6: graphing data using bar graphs
- Next let’s look at English status and gender. What types of variable are these? Nominal? Ordinal? Interval/ratio?
- We will draw a bar graph of the counts. We use
geom_bar()
for this:
- First try this:
ggplot(summarydata, aes(x = Gender)) +
geom_bar()
- This just draws counts of Gender
- Now let’s draw Gender and English Status together:
ggplot(summarydata, aes(x = Gender, fill = english_status)) +
geom_bar(position = "dodge")
Note 1: We use position dodge so that it puts the bars next to each other (what happens if you leave out position = dodge
?)
Note 2: We use fill = english_status
so that it fills the different bars with different colours according to different english statuses.
What is the general pattern of counts? Are there proportional differences by English status according to gender?
Task 7: graphing data using scatterplot
- Next we’ll look at Age and Shipley Vocabulary Score. What types of data are these?
- We will draw a point plot of these values:
ggplot(summarydata, aes(x= Age, y = Shipley_Voc_Score)) + geom_point()
We can add + labs(title = "Age by Shipley Vocabulary Score", x = "Age", y = "Shipley Vocabulary Score")
to tidy up presentation a bit.
ggplot(summarydata, aes(x= Age, y = Shipley_Voc_Score)) +
geom_point() +
labs(title = "Age by Shipley Vocabulary Score", x = "Age", y = "Shipley Vocabulary Score")
- What is the relation between age and Shipley Vocabulary score?
Task 8: Draw and interpret a box plot
- Next on the list of relations to check is gender and vocabulary score. Let’s look at Gent_1_score against Gender. What type of variables are these?
- We will draw a box plot (you could draw a bar graph, but box plots tend to be preferred for these combinations of variables - use a bar graph for counts):
ggplot(summarydata, aes(x= Gender, y = Gent_1_score)) +
geom_boxplot()
- Again we can tidy this up by adding labels:
ggplot(summarydata, aes(x= Gender, y = Gent_1_score)) + geom_boxplot() + labs(title = "Vocabulary Score by Gender", x = "Gender", y = "Gent Vocabulary Score Test 1")
- Interpreting box plots: The horizontal line indicates the median. The box indicates where 50% of the data lie. The lines indicate an estimate of the range of the data (minimum and maximum values). The dots indicate outliers. A large box indicates larger standard deviation. If the boxes don’t overlap much then this indicates there may be a difference between the groups.
Are there differences in Vocabulary according to gender?
- Now for the other relations:
- Academic year and vocabulary score
- Academic year and age
- English status and vocabulary score
- English status and age
At the moment, R is interpreting Academic year as a number. We need to turn it into a nominal variable (called a “factor” in R studio):
$academic_year <- as.factor(summarydata$academic_year) summarydata
Draw a graph for each of these relations.
- Save your R script.
3.4.1.5 Part 5: More practise manipulating data
- What are the minimum and maximum values for the ESL and English native speakers?
3.4.1.6 Part 6: Extending your knowledge
- An important part of using R-studio is that people share their code, their tutorials, and help one another out in troubleshooting. Searching for commands and how to use them on search engines is absolutely fine, and supplementing your learning with others’ tutorials is also what the community does. Glasgow University psychology department has made some of their course materials available and so for more practise in using the different data manipulation functions, you could have a look at https://psyteachr.github.io/msc-conv/data-wrangling-1.html
Ignore the part about making a new R markdown document, use instead an R script.
Just go straight to clear out the environment rm(list=ls()) then load the libraries tidyverse() and babynames() where the babynames() library contains the database on babies’ names that you can explore in the example (you’ll need to install.packages(“babynames”) first).
And then begin at Section “4.4: Activity 2: Look at the data”.
- For the data that you downloaded from the papers for your extra exercises in weeks 1 and 2, construct some graphs to show relations between variables.
Use geom_bar()
for bar graphs of counts Use geom_box()
for box plots showing means Use geom_point()
to show relations between two or more variables Use geom_histogram()
to show distributions of one or more variables
3.4.1.7 Part 7: Extending everyone’s knowledge (Optional extra)
If you want a real challenge, where we can get to the boundary of doing some original research, then here is another optional extra:
This is a recently published paper that draws on lots of different sources of secondary data. It is a really impressive example of what you can do with already existing data.
My challenge to you is to get all the datasets together that this paper refers to, and see if you can repeat the first analysis in their paper. They actually have the r scripts (in this case they are r-markdown (.rmd) files) as well as the data available.
If you can do that, then I will be amazed and impressed, and then I’ve got another task for you, which you’ll have to ask me for in the practical.
3.4.2 Data
Here is a link to the data referred to in this practical:
3.4.3 Answers
The answers to the workbook now appear below each question in the workbook, above, after the practical has finished, so you can check your work.
It’s really important for your learning that you have a go first of all at the workbook before looking at the answers.
<!—- ## 3.5 Extras