3. DVs and IVs in RStudio

Written by John Towse & Tom Beesley

Lectures

Note: due to to Uni-based technical issues these are last year’s videos and cannot be edited. They will be updated once system fixes are in place)

Watch Part 1

Watch Part 2

Watch Lecture Part 3

Watch Lecture Part 4

Download the lecture slides here

Pre-lab work

Last week we asked you to

  • Use a script to run instructions in RStudio
  • Put data into RStudio form a data file and explore how to run descriptive stats

This week - again, there’s a learnr tutorial to follow and help prep for this week’s activities. You can find it here

R Studio tasks

For the weight of a cow data last week, we provided you with estimates data, and you were able to generate descriptive statistics for the estimates (if not, please go back and work through that part of the week 2 activity again). You also found the weight estimates for the female and non-female guesses, right?

However, in order to find the estimates separately for gender identity, we needed to have a column for each gender category. Whilst that worked, it could get cumbersome over time always to work with data created like that.

There is a better way…

Task 1 - More with the penelope22 data

Step 1. Create a project and a folder, and set the working directory. This is covered in the learnr tutorial so head over there if you need reminding.

Step 2. Bring the week3.zip file into R Studio server. Like last week, upload the zip file, and launch the R script. You can get the file here

Step 3. This week, we again want to explore commands from the tidyverse library (toolkit) which can help us do more powerful things more elegantly. So let’s get R to work again with the tidyverse library by running the code line

library(tidyverse)

Step 4. Explore help() commands. R can give you more information about how it works.

Step 5. Read or load the penelope data into R. That is what line 16 of the code is designed to do

data_object_name <- read_csv("fill in") # use your own dataobject_name and specify the file you want to work with

but note that you will need to edit this line -and ensure you are in the correct working directory- for this to be successful. Then have a look at it using the View() command in the script

This time, let’s ask for the estimate data arranged by identity:

aggregate(x = *MISSING*$estimate, by = list(*MISSING*$identity), FUN = mean)

First, let’s try this (you will need to use your dataframe/variable name in place of MISSING). What do you get? Does this match what we did last week when we calculated the mean for the female and for the other (i.e., non-female) group?

Second, let’s look at what is happening here:

aggregate This is a command to call for descriptive statistics

x= This defines what column we are analyzing

by=list Now we tell R how to group the estimate data, and which column does that

FUN=mean Specifies which descriptive function is being asked for Can you explore whether you can call on alternate measures?)

group_by()

There’s another way that also allows us to group scores by a (nominal) variable. This is explored in the learnr tutorial, which should help you create the command the get weight estimates broken down by gender identity. You need to define the data frame for the estimates data, and the gender IV and the estimates DV

*MISSING* %>% group_by(*MISSING*) %>% summarise(mean_estimate = mean(*MISSING*))

First, try this command and see what you get. If you run this command as entered, it won’t work. So now use your experience at skills from the above and the learnr tutorial to work out what is required.

Note

%>% This is called a “pipe operator”, basically take the output from the left and feed it into the requests on the right. Summarise Provide summary statistics information for the specified variable as specified (whether mean, median etc)

The assignment operator

As well as learning about the pipe operator, we want to remind you /draw attention explicitly to another important element of the R command line syntax: the assignment operator. Uing a command such as

cows <- read_csv("penelope22.csv")

looks for the csv datafile called ‘penelope22’, and assign it to an object or variable called ‘cows’

We could create any object name we wanted (within limits of names already known to RStudio). It isn’t just for reading in data files, we can perform a whole range of functions and assign them to an object.

Task 2 - New salary data

Using aggregate and summarise may not seem like much progress, because they are just replicating what we had already done with mean() is week 2. However (a) this emphasizes that there are often several ways to get at the same thing in R (b) now we know about grouping, about working with 2-dimensional data, we can start to do more efficient and informative things.

Now, let’s turn to the guesses made about median salary in the UK. We can get the data from the file wages2023.csv (you will need to adapt the code we used above for the weight estimation file so that it will load in the wages data, and in what follows the assumption is your new variable name is called wages)

Let’s take a peek at the dataset with

glimpse(wages)

Glimpse pretty much does what you might think from the meaning of the word – it just gives us a data sample (handy because this is a much bigger dataset) and shows that we have 3 columns; uk_region (where someone lives, note ‘other’ probably equals Northern Ireland, Europe, China, etc), family_position (age relationship with siblings), and salary (estimate).

By the way, the govt statistics say the actual median income in 2022 was approx. £32,300 see this link

  1. Writing into your script, use the working “aggregate” commands from task 1 with the penelope weight data to find out the salary guesses as a function of where someone lives? That is, can you adapt that code for this problem? First, make sure you read in the data file into an R object.

  2. Can you use the aggregate command to find out salary guess as a function of family relationships? (if you are the youngest child maybe you have older siblings earning money that changes your evaluation?)

  3. Can you get a breakdown of guess as a function of BOTH UK region AND family relationship together?

  4. Can you use the group by() command to display salary guesses as a function of where someone lives? Check this gives you the same answer.

Task 3 - New phone use data

  1. Dataset 3: Use the dataset of phone screen time usage, screentime2023.csv to further explore and consolidate the group_by() command (ie we’ll drop the aggregate command for this task to focus on group_by()). Use copy and paste to adapt the existing script lines from the above two tasks so that this time you read in and calculate screen time usage as a function of the type of phone. In other words, add line (and comments) to the scripts for this new task.

  2. Use RStudio to figure out the overall mean screen time usage estimate and the standard deviation. Calculate by hand what usage estimate would have a z score of z=-1.5?

Task 4 - Final challenge: visualisation

Can you find a way to visualise the screentime usage data that you have been working with above? The script provides two ways to consider doing this - boxplots (which we have looked at in script commands already) and ggplot, which we have spent less time with but is an extremely powerful engine for creating plots. We’ve provided the start of the code in each case, leaving you to work out the specifics. Remember to annotate your creations!

Now you’re finished …

Remember to complete your notes in the script, save the script file, and end the server session.

Post - lab recap: The slides

Want to see the introduction slides used in the Levy lab? They are available here

Back to top