Hi, in this video I would like to talk you through how to check your assumptions for correlation analysis in R and how to evaluate whether there are any issues with outliers or range restriction. Then we'll talk about how to conduct Spearman's rho correlation analysis in R and, finally, how to conduct multiple correlations, or intercorrelation, in R.
Now, first the assumptions. On the slide here, you see the questions that you need to answer to make sure that you meet the assumptions and pick the right type of analysis for your data. The first question that you have to address is: is the data at interval, ratio or ordinal level? That refers to the type of variable that you've got, so it has nothing to do with R. You need to think about at what measurement level your variables were measured. If they are measured at the interval or ratio level, then you would use Pearson's r. If they are measured at the ordinal level, then you would use Spearman's rho.
OK, so the second point is that you need to check whether there is a data point for each participant for both variables. Now that you can look at in R, so I will navigate here to my RStudio window. And here you see my RStudio window, I hope. I've already set my working directory, loaded the packages that we need, and read in the data file, the Miller Haden data that we looked at last time. You can see that the table is here. OK, so that's where we are now.
We want to check whether there are any missing data in that file. Because that file isn't very big, it would be easy; you could do it by hand. But most data files are much, much bigger and you cannot do this by hand. That would be awful and very prone to errors, so we can use some R functions to do that. The thing to know is that if there is a missing data point, R writes that down as capital N, capital A: NA, 'not available', so there is no data point in that cell.
So if a participant doesn't have a score for reading ability, then there would be a big NA instead of a number. Now, the way to check whether you have any missing values is to use the is.na function. You tell it: look in the mh data frame at this variable and tell me whether any of the cells have an NA. If we do that, it tells you FALSE, FALSE, FALSE and so on for each of the 25 participants, and FALSE is good in this case because it tells you: I can't find an NA for that participant on that variable. If there are no NAs, that means there are numbers in those cells. That is excellent: we have a data point for each of the participants on this variable.
Now, I'm using this function here in a slightly different way, in that I'm combining it with filter, because I want to use filter to only keep the people who don't have any NAs. I do that by using this exclamation mark, which means 'not': keep people who don't have an NA. Anybody for whom is.na would be TRUE wouldn't be kept in the next object. I've put that in a pipe, so what we're essentially doing is: take the table, mh, and then send that to the filter function, and within that filter function only keep the people who have no NA for reading ability. Then, because correlation analysis is with two variables, you also want them to have a data point on the other variable, so we use the same trick to check whether they also have a value for IQ. That's what happens in this line of code: we send it to the filter function again and say, only keep the people who don't have an NA for IQ. Again, the 'don't' is the exclamation mark. So that is a neat way of making sure that your participants have data points on both variables.
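The missing-data check described above can be sketched as follows. This is only a minimal illustration: the table is a made-up stand-in for mh with two NAs added on purpose, and the column names Abil and IQ and the object name mh_complete are assumptions, not necessarily the names used in the video.

```r
library(dplyr)

# Made-up stand-in for the mh table (values and column names are assumptions)
mh <- data.frame(
  Participant = 1:6,
  Abil = c(61, 56, 45, NA, 58, 60),
  IQ   = c(107, 109, 81, 100, NA, 111)
)

# TRUE marks a missing cell, FALSE means a value is present
is.na(mh$Abil)

# Keep only participants who have a value on both variables;
# the exclamation mark turns "is NA" into "is not NA"
mh_complete <- mh %>%
  filter(!is.na(Abil)) %>%
  filter(!is.na(IQ))

nrow(mh)           # 6 rows before filtering
nrow(mh_complete)  # 4 rows after filtering
```

Comparing the two row counts shows how many participants were dropped for not having a data point on both variables.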
If you run that, we've got a new data frame where that is the case, where we've only kept those people. Because there were no NAs in this particular toy demo dataset, it has the same number of observations as before. In a real dataset you might see here that there are fewer observations than in the original data frame, because it has thrown out a few people for not having data points on both variables. OK, so much about the missing data part.
The third assumption that you need to check is whether data on both variables are normally distributed. Again, we can look at that in R by making a histogram. A histogram plots how many observations in the sample have each value on the scale of a particular variable. We use ggplot for that here. We tell it which data frame to use, the one we've checked for missing values, and we say we want it to plot reading ability. Histograms don't need a y variable, because the y-axis is the count. Then we say use geom_histogram to do that, and there we go, let's run that. So here we have one observation with a score of 45, for instance, and there are four observations where participants have a score of 60. Here you are looking for something that resembles a normal distribution: more observations in the middle, fewer on both sides, and hopefully roughly symmetrical.
Now this probably looks about OK, but it's not always that easy to see, and that is where the QQ plot comes in really handy. For the QQ plot we use this function, qqPlot: that's q, q, then Plot with a capital P. Again we just say, please use the mh dataset and that variable to make the QQ plot. So let's do that. OK, so here we have our QQ plot, and remember, all the observations need to be around this line, the solid line. So that is good, as long as they fall in between the blue dashed lines. (I just noticed that you can't actually see my pointer here.) The plot is in the bottom right corner.
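The histogram and QQ plot steps can be sketched like this. It is a rough illustration only: the scores are simulated rather than the real data, the column name Abil is an assumption, and qqPlot is taken to be the function of that name from the car package.

```r
library(ggplot2)
library(car)  # assumed source of qqPlot()

# Simulated stand-in for 25 reading ability scores (not the real data)
set.seed(1)
mh <- data.frame(Abil = round(rnorm(25, mean = 55, sd = 6)))

# Histogram: only an x variable is mapped, the y-axis is the count
p_hist <- ggplot(mh, aes(x = Abil)) +
  geom_histogram()
p_hist

# QQ plot: points should sit near the solid line and inside the dashed envelope
qqPlot(mh$Abil)
```

With the default settings, geom_histogram prints a message about the number of bins; that is a hint about display choices rather than an error.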
You can see the plot, so observations need to fall between these two blue dashed lines and ideally as close as possible to the solid blue line in the middle. If that's the case, then you can conclude from the QQ plot that the variable is normally distributed. Now you would do the same thing for IQ, so let's just do that. We should use the data frame we checked for missing values; it doesn't really matter in our case because we didn't have any missing values, but it is a bit neater. Here on the right we have the histogram for IQ. You can see that it resembles a normal distribution to an extent. I will make a QQ plot as well. There we go. In the top right corner and in the bottom left corner of the QQ plot you can see some observations that are just on the outside of the blue dashed line. I would probably say these data are still normally distributed, but these observations are something to watch, and if they were even further away from that dashed line I would have to reconsider. If the data were clearly not normally distributed, you would use Spearman's rho.
OK, then we go on to the fourth assumption, which is about linearity: does the relationship between the variables appear linear? For that we can use a scatterplot. You already know scatterplots from the last session, so I'm not going to talk much about it; this is basically just using that again. This will make a scatterplot. Here we are: a scatterplot of reading ability and IQ. This looks like a linear relationship; we don't see a pattern that has curves in it. If we had decided that it is nonlinear, then you would have to do different things. Spearman's rho doesn't necessarily help there, so more about that will be covered next year.
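A scatterplot for the linearity check might look like the sketch below. The data are simulated, the column names Abil and IQ are assumptions, and adding the line of best fit with geom_smooth(method = "lm") is one common choice rather than necessarily the code used in the video.

```r
library(ggplot2)

# Simulated stand-in data with a roughly linear relationship (not the real data)
set.seed(1)
iq_scores <- round(rnorm(25, mean = 100, sd = 10))
mh <- data.frame(
  IQ   = iq_scores,
  Abil = round(20 + 0.35 * iq_scores + rnorm(25, mean = 0, sd = 4))
)

# Points show the raw relationship; the straight line is the line of best fit
p_scatter <- ggplot(mh, aes(x = IQ, y = Abil)) +
  geom_point() +
  geom_smooth(method = "lm")
p_scatter
```

In a plot like this you would look for a curved pattern in the points, which would suggest a nonlinear relationship, and for whether the vertical spread around the line is roughly the same along its length.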
The other thing to check in the scatterplot is the spread around the line of best fit. Is the spread roughly equal along the line? Is there as much spread over here as there is over here? It looks like that is the case. OK, so that is it about the assumptions. Let me just go back here and see what comes next.
Yes, of course: the issues with outliers and range restriction. Again, you would look at your scatterplot to see whether there are any issues with that. You would look at whether the correlation is driven by one data point, in the case of outliers, or whether there is very restricted variation. So maybe if we only had reading scores from 45 to 65, then the correlation would probably have been much smaller, and that would have been restricted range. OK, let's go back to the presentation.
Intercorrelation. Hold on, before we go to intercorrelation I just want to show you how to run Spearman's rho correlation analysis in R. So we go back to the other window, and here we've got a code example to do that. Remember, you would do this if you've got ordinal data or if the data breach another assumption, for instance they're not normally distributed. A correlation analysis using Spearman's rho is very similar to the one using Pearson's r. What you change is what's called the method. Previously it said pearson here and now it says spearman, and that is the only difference. Other than that, we use the cor.test function as we did for Pearson's r: I'm telling it which data frame and which variables to use, and then we just change the method, so that's cool. It is complaining about something here, but don't worry about that. So if we look at the results, it gives you quite a similar correlation, and you can read this in the same way. This is where you have your Spearman's rho.
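As a minimal sketch of the Spearman analysis just described: the values below are made up, and the column names Abil and IQ and the object name spearman_results are assumptions. Only base R is needed.

```r
# Made-up stand-in data (values and column names are assumptions)
mh <- data.frame(
  Abil = c(61, 56, 45, 66, 49, 62, 61, 55, 62, 61),
  IQ   = c(107, 109, 81, 100, 92, 105, 92, 101, 118, 99)
)

# Same call as for Pearson's r; only the method argument changes
spearman_results <- cor.test(mh$Abil, mh$IQ, method = "spearman")
spearman_results

# The pieces you would read off the printed output
spearman_results$estimate  # Spearman's rho
spearman_results$p.value   # p-value
```

Because these made-up scores contain tied values, R prints a warning that it cannot compute an exact p-value with ties and reports an approximate one instead.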
This is where you have your p-value, and here it reminds you that it is Spearman's rho correlation rather than Pearson's r.
OK, now to intercorrelation. What we want to do is construct a matrix of scatterplots, because we want to look at multiple variables within a dataset and the relationships between them, and a matrix of scatterplots is a really good way to get an overview. Then we want to conduct multiple correlations at once and correct for multiple comparisons using a Bonferroni adjustment, to make sure we don't have a problem with Type I errors. So how do we do that? Let's go back to R.
Again, we will use the mh data and I will use this function called pairs, which creates a matrix of scatterplots. Technically, we should really be using the data frame where we checked for missing values. We have four variables and the participant number. Now, we don't want the correlation between the participant number and reading ability, because that would be meaningless. So before we use the pairs function to create a matrix, we use select to get rid of that column. What I'm doing is taking the table we checked for missing values and piping that to the select function, and I'm using this minus sign to say: keep everything except the participant variable. I'm sticking that in a new object called mh_matrix. So there we go. Here you can see that this one has five variables, because participant is still in there, and this one now has four variables, because participant is not there anymore.
OK, so then we can run the pairs function to create this matrix. Now, this is really tiny, because my R window has to be quite small to fit into the window that I'm using to make the video recording.
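The select and pairs steps can be sketched as below. The data values are made up and the column names (Participant, Abil, IQ, Home, TV) are assumptions; mh_matrix is the object name mentioned in the video.

```r
library(dplyr)

# Made-up stand-in for the mh table (values and column names are assumptions)
mh <- data.frame(
  Participant = 1:8,
  Abil = c(61, 56, 45, 66, 49, 62, 61, 55),
  IQ   = c(107, 109, 81, 100, 92, 105, 92, 101),
  Home = c(144, 123, 108, 155, 103, 161, 138, 119),
  TV   = c(487, 608, 640, 493, 636, 407, 463, 717)
)

# The minus sign keeps every column except Participant
mh_matrix <- mh %>%
  select(-Participant)

ncol(mh)         # 5 columns, Participant included
ncol(mh_matrix)  # 4 columns, Participant removed

# One scatterplot for every pair of the remaining variables
pairs(mh_matrix)
```

The variable names appear along the diagonal of the resulting plot, and each off-diagonal panel plots the variable named in its row against the variable named in its column.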
So I would say try this code on your own screen and then you can actually see that it is quite helpful, but I'll talk you through it. Oh, hold on, I think participant is still there. That is wrong, but that's because I'm using the wrong data frame. There we go. So here we are: we've got ability, IQ, home and TV. This shows you the relationships between all the variables. This is the scatterplot for ability on one axis and IQ on the other axis, and this one... I'm sorry, I keep forgetting that you can't see my pointer, so let me try and talk you through that without using my pointer.
You can see the different variables along the diagonal in this graph: ability, IQ, home and TV. If you go down from ability, so that you are to the left of IQ, that is the scatterplot of those two variables. If you go down one more, you have reading ability on one axis and home on the other axis. To the left of home and below IQ sits the scatterplot that plots IQ on one axis and home on the other, and so on. It is a bit small on this screen, but that's because of the recording, so try it on your own screen. Then you can see that you get a lot of scatterplots all in one go. That's the main thing I'm trying to say.
So that's the scatterplots, but we want the correlation coefficients as well, of course. How do you run that? You first have to tell R that you're treating this table as a data frame, and in this case we're just overwriting the object. Don't worry too much about it, but I can say something about it so that you know what you're doing. Previously these functions stored this as what is called a tibble, which is a specific type of table in R. The function that we will be using in a minute needs it as a data frame, which is also a table but with slightly different properties. So that's what you're doing here.
In anticipation of this function, we need the input as a data frame. So we say: take my table here and make it into a data frame, and we are just overwriting it. So we've done that. I think I'll just do it again.
Then we use the function correlate to create the correlation matrix. We have to tell it what data frame to use; I've been changing that during the video. So here is our data, here we say yes, please compute p-values, and here is the method. It is quite similar to the cor.test function that we used previously. We're using Pearson's here; you could change that to Spearman if you need to. And here we have the adjustment method that we're using, so here we're using Bonferroni. There are other ones that you can learn about at some point. So we're using that to create a correlation matrix and storing that in an object called interior results. So let's do that.
So here we've got it. We can look at it that way, but we can also just print it here, so now in the console, I mean the console window at the bottom, you can have a look at the correlation matrix. Here we have this matrix setup again. Along the diagonal you have each variable's correlation with itself; those are not interesting, which is why these cells are empty. Then we have the correlations between reading ability and IQ, reading ability and home, and reading ability and TV, and in the next column you have IQ and home, and IQ and TV, and so on. In the top half, above the diagonal, you have the same values again, which is why you only need to report one half. So that's all the correlation coefficients.
Then here we have the p-values. If you scroll down, you'll see the p-values for all those combinations, and these are adjusted for multiple comparisons using the Bonferroni correction.
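The intercorrelation step might be sketched as follows. Several things here are assumptions: the function is taken to be correlate from the lsr package, whose arguments and three-part output match the description above, the object name intercor_results is a guess at the name used in the video, and the data values and column names are made up.

```r
library(lsr)  # assumed source of correlate()

# Made-up stand-in for mh_matrix (values and column names are assumptions)
mh_matrix <- data.frame(
  Abil = c(61, 56, 45, 66, 49, 62, 61, 55),
  IQ   = c(107, 109, 81, 100, 92, 105, 92, 101),
  Home = c(144, 123, 108, 155, 103, 161, 138, 119),
  TV   = c(487, 608, 640, 493, 636, 407, 463, 717)
)

# Make sure the table is a plain data frame (needed if it arrived as a tibble)
mh_matrix <- as.data.frame(mh_matrix)

# All pairwise Pearson correlations with Bonferroni-adjusted p-values
intercor_results <- correlate(
  mh_matrix,
  test = TRUE,                     # yes, compute p-values
  corr.method = "pearson",         # could be "spearman" instead
  p.adjust.method = "bonferroni"
)
intercor_results

# What a Bonferroni adjustment does in general: multiply each p-value
# by the number of tests, capped at 1
raw_p <- c(0.01, 0.04, 0.20)
adjusted_p <- p.adjust(raw_p, method = "bonferroni")
adjusted_p  # 0.03 0.12 0.60
```

Printing the result should show the correlations first, then the adjusted p-values, then the sample sizes, in the order described in the video.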
So here you can see which ones are significant, and they have already been corrected. And finally, there is a table with the sample sizes, where you can just double-check how many participants there are for each correlation. OK, and that is it for now. Thank you very much.