OK hi again

So in this video I will talk you
through the example script of

how to do Chi Square in R.

So I've set my working directory.
I have checked that the file I

need fighting_TV.csv
is in that directory so I'm

all good to go.

Um?

So we are working with the same
example that we looked at in the

slides, in the slides in the
lecture and so want to know

where there is a relationship
between children's record for

fighting in school.

and a preference for a TV program.

So first of all, just always
good idea to start by cleaning

the environment. So we
do that with this bit of code

Then we need to load two
particular libraries, tidyverse

which you're familiar with
and we also need the library

called LSR. So let's load those.

Um?

And then we need
to read in data.

now we've done that before its not
very new, we're using again the read_csv

function to do this.

We are assigning.

whatever is in that file to an
object in studio called fighting_data

So there we go. Now we have an
object here in our

environment in top right
panel. So we know that it's there.

Now by running the next bit
of code, you can just see the first few

lines of that table so let's do that.

And we can see that there
it has three variables so ID

which is participant

Number TV and fight.

Now in these TV and fight
variables, we only have the

numbers 1 and zero because if the
child has been involved in fighting in

school we have recorded
a 1 for that participant and

if not, we have recorded
a 0 for that participant

Similarly for TV program, if
they chose, if they have a preference for

Violent TV program they get a 1

In that uhm

in in that variable or
if they have a preference

for 0 they get a 0 for that

variable. So in the bottom left
pane of the console you can see

for instance, for participant 5 we have
one and one. So suggesting that

participant 5 choose, has a preference
for Violent TV program

and is also involved in fights in school.

OK, so far so good.

The first thing that we want
to do though is to re code

these numbers to more
informative text labels. So

instead of having oops, instead of having
one, we are going to say please

record the text label violent for zero

or for non violent.
Similarly, we want yes and no

for the ones and the
zeros in the fight variable

To do that, we will be using
mutate that you have come

across before and we say OK,
use within mutate we want to use the

recode function and we want you to take.

the variable TV and whenever you
encounter a 1, we want you to

replace that with word violent.

If whenever you encounter a zero
we want you to replace that with

the word label non-violent, etc. etc.

Now Please note that these are
all in double quotation marks.

OK, so we've.

Got our data table there and we're
saying OK, take the data table.

And then, that's our pipe, use the mutate
function to change the coding of these

values for TV variable
and for the fight variable.

We run that.

We can again have a look
and we see that instead of

of 1s and 0s we have words
violent and Yes makes the

output later and also
the graphs easier to read.

Now the next thing is to do
calculate some descriptive

statistics and.

Some we actually need to
create a contingency table,

so let's do that first,
the contingency table.

And that is our table that
we sort of get an inlection with

one variable across the top
and the other one across the side

So we are taking our
table with recoded.

Values and then we say group by
our two variables.

And then we ask, it, please count
how many observations we have in

each of the cells.

Let's run that..

and lets have a look so we

erm just to remind you we took
this table with the recoded

values. We did all that and we
assigned it, that's this arrow, to

a new object in R studio called

fighting counts. You can see
that object in the top right pane in

the environment pane.

Now, if you just run it, it
appears in the environment

pane in the top right, But you
don't see, you can't see it in

the console. You have to, what's called

you have to call it, Well, I did
and that's why in the bottom left

pane we now have our contingency
table. So TV program on the

preference on this side
and here we have fight

So we have 70 children
who have.

erm

who have not been in a fight in school and
prefer a non-violent TV program, etc. etc.

You will recognize this from the er

lecture slides

Now, as I mentioned
in the lecture, it is useful to

Report the percentages of these

to give the reader an idea about how much
is actually what this actually means. Even

though we need these numbers
to calculate the chi square

the percentages are informative.

You can use this bit of code to add
them to this table. So if I run this

And then call this new table.
Then here you see we've got. We've

got raw numbers. But now we've also
got our percentages.

You might wonder why is the
same, but that's because there were

were exactly 100 children who had a

preference for a nonviolent tv program,
so here we've got the percentages

in the console

in the bottom left pane
is the right-most column

OK then.

Often a good visualisation is really

helpful in understanding your data,
so that is a no different for chi square

you can use a bar plot to graph the data.

and this is how you do it.

We're using the ggplot function
that you've used before to create

scatterplots for instance,
And that needs a few pieces of

information, right? So usually
we tell it what the data is and say

OK, please use the
data recode table

and on the x axis, please
put fight, the variable fight

And in this case we don't have a Y
axis, but we said OK. Please use

the color to er indicate what

that tv program preference was.

So that is all the information that
ggplots needs about the data and then

here we tell it to use the geom_bar
function to create the bar plot

Let's run that.

And here we are. I will make that bigger.

So here you can see the
count data graph in the bar plot

so fight, not been involved in a

fight. Yes, has been involved in a
fight and the colors indicate whether

they have a preference
for violent or non violent

TV program

So as you can see, the plot makes
it much easier to visualize the data

Children with preference for violent
tv programs were more likely to get into

fights in school.

OK, so then we are ready to
run the chi square test and for that

function called chisq.test

And that function needs to, we need
to tell it a few different things.

We need to tell it what our
first grouping variable is, TV.

We need to tell it what our
second grouping variable is fight

and notice that again we're telling it
to use the data table fighting_recode.

and then here we tell,
we say want to use the

continuity correction.
Don't worry about that for now.

So we asked it to do all that
and save the results in a

new object that we've called results.

So let's do that.

So now I have this
result object.

In my top right pane, the environment
pane. If I want to actually see them on the

screen, I need to print it. So I need
to call that object, so thats what I do

Here

so we're running just
this bit that's called results.

You get some output in the console

so in the bottom left pane you
can now see what the results are so

Pearsons chi squared test, this is the
data it's using, and the chi squared is

26.157. You might remember that
we calculate by hand and it was

26.16, so it's pretty accurate.

we have one degree of
freedom and this is our exact

P value. It is scientific notation so you

have to move this decimal
point seven places to the left.

So this is well below 0.01

So, it is significant.

Now we need to check a
few assumptions right so

data in the cells
should be frequencies of

counts, not something
else, we know that that

is the case. The levels
and categories need to

be mutually exclusive.

So a participant can only contribute
to the count in one particular cell

We know that that's also the case

No child was both involved in
the fight and not involved in the fight

it wouldn't be logically possible.

There was also only, they
could only choose either

a violent TV program or
a non-violent TV program

so it's mutually exclusive

Groups are independent all
fine and there are two variables.

both categorical. That's all those
assumptions you can check from

the design of the study

The final one, number 6 here is
something that you need to check

in R so the expected cell frequencies should
be greater than 5 or at least 20% of them

So

We had in our results object

that we created by running
the chi squared test. We actually

have a lot more information
than it spits out in the first instance,

so if it has actually nine
different things in it and we we

need to look at one of those things and
that is the expected cell count. So if you

say. OK, take this table and then
look at that location with the dollar sign.

It'll give you the
expected cell count so let's

run that. Here you can
see what the expected

cell count for this
particular sample were.

and you can see in the bottom
pane, so in the console that

they are way bigger than five.
Actually in all cases.

So, the assumption in
this case has been met

just come back to this
continuity correction that is

basically, so. We
specified that here as false.

That is, becomes relevant when
your counts are smaller than 10.

You can use whats called
Yates's correction and say true.

OK, but we don't need to
worry about that right now

That was the expected frequencies, we've
checked that. All good, assumption met

Now over to the effect size so Cramer's V

there is a function for Cramer's V in R

and that what we use.
The code very closely

mirrors what you put
into the original chi square

Again, you have our data table
finding_recode and the first

variable is TV and the second variable

is Fight. And so that's exactly what you
put into Cramer's V and we're storing that

into new object called eff_size

OK, so let's run that code
again and call that object to

actually see what the numbers is at the
bottom pane in the console you can now see

that Cramer's V is .41.

Now we can.

Square that effect size to get the

percentage variance accounted for

or multiply it by itself, thats what
I'm doing here, say run that.

I get the percentage of
variance accounted for in.

Number of fights, number of children that
were involved in a fight accounted for,

by number of children that
preferred the violent TV program.

You can see that again in the console
that is .17, multiply that by 100.

and then you have your
percentage of variance accounted for

Now it is useful to look
at the standard residuals to

help interpret the relationship
basically asking OK which of

these if we go to the cells
here. Which of these numbers

contributes most or contributes
significantly to the overall

significance association

You know if we look at the

graph in the bottom right pane

Where which difference is
actually significant within that

within the contingency
table? And to figure that out

we can look at the standardised
residuals and they are again

stored within that results
object. So we say OK, look

into the results and give me
whatever is in.

in the residuals location so

if I run that code.

I get this so if you
now look in the console

So you can see that these.

these are the standardised
residuals, these numbers here

so standardised residuals for the
case where a child has been in a fight

and preferred a violent TV program is

3.04. And so that indicates gives
you an indication of the number

of standard deviation that is above
or below the expected frequencies.

so 3 you might remember
is very much significant

It's bigger than point than 2.48 sorry 2.58.

but it is smaller than 3.29 so we
would say that that standardised residual

is significant at

p is smaller than .01

And then we've got all of the
information we need for the

write up. So here you have an
example of the write up its the same

as was on the lecture slide.

OK, that's it for that.
Thank you very much for

your attention.