Data visualization: practices

PSYC411: Data visualization – practices

My name is Dr Rob Davies. I am an expert in communication, individual differences, and methods

Tip

Ask me anything:

questions during class in person or anonymously through slido;
all other questions on discussion forum

Introduction: Data visualization – practices

In this PSYC411 class, we look at practices or how we get visualization done
In the linked PSYC413 class, we focus on perspectives or how to think about visualization

Our lesson plan

Identify your goals
Think about your audience
Develop reflectively
Implement good practice

Our learning objectives: — what are we learning about?

We are working together to help you:

Goals — Formulate questions you can ask yourself to help you to work effectively
Audience — Understand the psychological factors that affect your impact
Development — Work reflectively through a development process
Implement — Produce visualizations in line with best practice

Our assessment targets: — how do you know if you have learned?

We are working together so you can:

Goals — Identify a set of targets for a development process in your professional teams
Audience — Explain what you need to do to make a visualization effective
Development — Locate yourself within the stages of the development process
Implement — Produce visualizations that look good and are useful

What are our goals – Questions to help you to work effectively

Tip

We begin by thinking about the questions you will ask yourself when you need to decide what you will do
We build, here, on the insights developed by A. Gelman & Unwin (2013).

What are our goals?

Why don’t we just use the good-enough, easy-to-produce plots in Excel? \(\rightarrow\) Why bother?
Why don’t we just produce a summary table? \(\rightarrow\) Why bother?
Are we engaged in making beautiful graphics or informative displays or both? \(\rightarrow\) What are we doing?
In PSYC413, we look at \(\rightarrow\) Perspectives: the context and history of thinking about visualization

Visualize to enable comparison

Scatterplot showing the relation between reaction time and days in the `sleepstudy` data. Points are ordered on x-axis from 0 to 9 days, on y-axis from 200 to 500 ms reaction time. The plot indicates that reaction time increases with increasing number of days. — Figure 1: Scatterplot of the relation between reaction time and days in the `sleepstudy` data

The figure presents a grid of scatterplots showing the relation between reaction time and days in the `sleepstudy` data separately for each participant. Points are ordered on x-axis from 0 to 9 days, on y-axis from 200 to 500 ms reaction time. Most plots indicate that reaction time increases with increasing number of days. However, different participants show this trend to differing extents. — Figure 2: The relation between reaction time and days: here, we plot the data for each participant separately
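Plots like Figures 1 and 2 could be drawn along the following lines. This is a minimal sketch, not necessarily the code used for the slides: it assumes the `{ggplot2}` and `{lme4}` packages are installed, and the object names `p_all` and `p_by_person` are ours.

```r
library(ggplot2)

# sleepstudy ships with the {lme4} package (assumed to be installed);
# it has the columns Reaction, Days and Subject
data("sleepstudy", package = "lme4")

# All participants together: one scatterplot of reaction time by days
p_all <- ggplot(sleepstudy, aes(x = Days, y = Reaction)) +
  geom_point()

# Same data and mappings, split into one panel per participant
p_by_person <- p_all + facet_wrap(~ Subject)

p_all
p_by_person
```

Splitting the same plot into panels is what makes the comparison between participants possible: the mappings do not change, only the grouping.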

What are our goals? — What are our jobs?

Data visualization workers: we may aim to get and keep the attention of our audience, to tell a story, to persuade our viewers
Data analysis workers: we may aim to enable our audience to understand our data, our findings, and to discover more for themselves

What are our goals? — Where or when are we in our process?

Sometimes in a workflow, we are quickly sketching draft visualizations: exploring, for ourselves, or with others, what we can see in our data
Sometimes, we are ready to present our visualization to a wider audience: we aim to share a polished visual object

What are our goals? — Discovery

Discovery goals

Do we need an overview? – To get a sense of what is in the data, and to check our assumptions
Are we looking for the unexpected? – Comparing groups to check for variability, exploring data open to surprises

What are our goals? — Communication

Communication goals

What do we need our audience to understand?
What story are we telling?
Do we need to attract attention or stimulate interest?

Think about your audience – An evidence-based account of what works

Tip

We will produce more effective visualizations if we think about how our audience sees, and what they expect (Franconeri et al., 2021)
Check out the PSYC413 Perspectives lecture for a more in-depth explanation; here, I present a selective summary

Your audience can look at your visualization
And quickly and easily extract statistical information from what you show
You look at a scatterplot and see the minimum, maximum and mean heights of the points

We show a schematic grid of plots, from top to bottom: dot plot; stacked bar plot; area or bubble plot; line plot; area plot, rectangles varying in intensity. Each plot schematic is marked to show what statistics can be extracted (Franconeri et al., 2021, Fig. 2)

Communicating uncertainty is critical

As scientists, we think about uncertainty all the time
We quantify and typically show uncertainty over estimates e.g. average differences
We should also show and think about outcome variability

We show a grid of four plots. In a column of two plots on the left, there are error bars indicating average outcomes given smaller (top) and larger (bottom) samples. The error bars are more narrow for the larger sample. On the right, the same estimates are shown but with raw individual level outcomes. The variability in outcomes is very wide. — Zhang et al. (2023): The difference between uncertainty over estimates and uncertainty over the predictability of outcomes

Consider accessibility from the start

The first row shows a scatterplot encoded with two colors, green and orange
People with typical vision can see that the green dots have a steep positive correlation and the orange dots make a flat line
We use colour blindness friendly colour palettes

We show a grid of six plots. The plots indicate how, for some types of colour blindness, the difference between points will not be apparent (Franconeri et al., 2021, Fig. 5)
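One way to act on this advice in `{ggplot2}` is to use a viridis colour scale, which is designed to remain distinguishable under common forms of colour blindness. The sketch below is illustrative only: it uses the built-in `iris` data as a stand-in, and mapping species to shape as well as colour is our own suggestion for a second, redundant cue.

```r
library(ggplot2)

# iris is a built-in R dataset, used here only as a stand-in example
p_accessible <- ggplot(iris,
                       aes(x = Sepal.Length, y = Sepal.Width,
                           colour = Species, shape = Species)) +
  geom_point(size = 3) +
  scale_colour_viridis_d()  # discrete viridis palette from {ggplot2}

p_accessible
```

Because group membership is carried by both colour and shape, a viewer who cannot tell the colours apart can still tell the groups apart.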

Development – Work reflectively through a development process

Tip

Your first question is always going to be: (why) do we need to make a plot?
Your answer will evolve through a development process that will gradually reveal the characteristics of your data

The benefits of investing in the development process

Identifying your goals enables you to understand what you are doing and why
Through the development process, you may create different versions — iterations — of a plot
This iterative work benefits both you and your audience (A. Gelman et al., 2002; Kastellec & Leoni, 2007)

The benefits of investing in the development process

Tip

As you iterate, reflect on what your goals are, what your audience needs and expects, and how each plot version moves you closer to effective discovery or communication
This reflection uncovers what is interesting, useful and beautiful about your data

Scientific thinking and data visualization

We can use text and tables to communicate specific values but visualizations help us to:

stimulate thinking
discover what is unexpected
communicate scale and complexity
make comparisons to show how results vary
display uncertainty about estimates

Anscombe (1973): visualizations show data features quickly and vividly

data columns
x-variables
y-variables

x1	x2	x3	x4	y1	y2	y3	y4
10	10	10	8	8.04	9.14	7.46	6.58
8	8	8	8	6.95	8.14	6.77	5.76
13	13	13	8	7.58	8.74	12.74	7.71
9	9	9	8	8.81	8.77	7.11	8.84
11	11	11	8	8.33	9.26	7.81	8.47
14	14	14	8	9.96	8.10	8.84	7.04
6	6	6	8	7.24	6.13	6.08	5.25
4	4	4	19	4.26	3.10	5.39	12.50
12	12	12	8	10.84	9.13	8.15	5.56
7	7	7	8	4.82	7.26	6.42	7.91
5	5	5	8	5.68	4.74	5.73	6.89

Figure 3: Data table view of Anscombe’s Quartet dataset

x1	x2	x3	x4
Min. : 4.0	Min. : 4.0	Min. : 4.0	Min. : 8
1st Qu.: 6.5	1st Qu.: 6.5	1st Qu.: 6.5	1st Qu.: 8
Median : 9.0	Median : 9.0	Median : 9.0	Median : 8
Mean : 9.0	Mean : 9.0	Mean : 9.0	Mean : 9
3rd Qu.:11.5	3rd Qu.:11.5	3rd Qu.:11.5	3rd Qu.: 8
Max. :14.0	Max. :14.0	Max. :14.0	Max. :19

Figure 4: Summary table view of descriptive statistics for x variables

y1	y2	y3	y4
Min. : 4.260	Min. :3.100	Min. : 5.39	Min. : 5.250
1st Qu.: 6.315	1st Qu.:6.695	1st Qu.: 6.25	1st Qu.: 6.170
Median : 7.580	Median :8.140	Median : 7.11	Median : 7.040
Mean : 7.501	Mean :7.501	Mean : 7.50	Mean : 7.501
3rd Qu.: 8.570	3rd Qu.:8.950	3rd Qu.: 7.98	3rd Qu.: 8.190
Max. :10.840	Max. :9.260	Max. :12.74	Max. :12.500

Figure 5: Summary table view of descriptive statistics for y variables

Anscombe (1973): visualizations show data features quickly and vividly

Grid of scatterplots showing the relation between x,y variables in the Anscombe 1973 dataset. The plots show: (top left) a typical scatterplot indicating a positive correlation; (top right) a curvilinear association; (bottom left) a more or less coherent positive trend with a marked outlier; and (bottom right) a scatter showing all data grouped at one value of x, with only one point at a second value of x

Figure 6: All 4 of the Anscombe (1973) x,y datasets have nearly identical means, variances and correlations, but we see how they vary when we use scatterplots to visualize them
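We can check this for ourselves, because the `anscombe` data ship with R. The following is a minimal sketch, assuming `{ggplot2}` is installed; the object names `quartet_stats` and `quartet_long` are ours, and this is not necessarily how the figure above was produced.

```r
library(ggplot2)

# anscombe ships with R (datasets package): columns x1..x4 and y1..y4
sets <- 1:4

# Summary statistics for each of the four x,y pairs
quartet_stats <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(sets, function(i) sd(anscombe[[paste0("y", i)]])),
  r_xy   = sapply(sets, function(i) cor(anscombe[[paste0("x", i)]],
                                        anscombe[[paste0("y", i)]]))
)
print(round(quartet_stats, 2))

# Stack the four x,y pairs into long format so we can facet by set
quartet_long <- do.call(rbind, lapply(sets, function(i) {
  data.frame(set = paste("Set", i),
             x   = anscombe[[paste0("x", i)]],
             y   = anscombe[[paste0("y", i)]])
}))

# One scatterplot per set: the differences are obvious at a glance
ggplot(quartet_long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)
```

The table of statistics looks almost the same in every row, while the four panels look nothing alike.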

Matejka & Fitzmaurice (2017) give us the `Datasaurus dozen`

Grid of scatterplots showing the relation between x,y variables in the Datasaurus dozen datasets. The plots show different shapes, made by the points, even though the summary statistics for the underlying data are the same

Figure 7: All 12 Matejka & Fitzmaurice (2017) x,y datasets (via jumpingrivers (n.d.)) have the same mean and standard deviation summary statistics but we only understand how the data are structured when we plot them and can look at the structure

Develop visualizations to discover and communicate variability in outcomes

Figure 8: In this plot we show data on the impact of sleep deprivation on reaction time, from Belenky et al. (2003; via Bates et al., 2015). We can see how reaction time slows with increasing deprivation on average (grey line) but that the rate of slowing varies between individuals

Reflect on kinds of uncertainty

Scientists are often faced with the challenge of conveying uncertainty to their audiences (Hofman et al., 2020):

Inferential uncertainty — the degree to which a particular summary statistic (e.g., a population mean) is known to the scientist
Outcome uncertainty — how much individual outcomes vary (e.g., around the mean, regardless of how well it has been estimated)

Inferential uncertainty can be reduced by collecting and analyzing more data, whereas outcome uncertainty cannot
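A small simulation can make the distinction concrete. This is a sketch using made-up data: the population mean of 300 ms, the standard deviation of 50 ms, the sample sizes and the seed are all arbitrary assumptions chosen for illustration.

```r
set.seed(411)  # arbitrary seed so the sketch is reproducible

# Simulated reaction times (ms): same population, two sample sizes
small_sample <- rnorm(n = 25,   mean = 300, sd = 50)
large_sample <- rnorm(n = 2500, mean = 300, sd = 50)

# Standard error of the mean: how well we know the average
standard_error <- function(x) sd(x) / sqrt(length(x))

uncertainty <- data.frame(
  n  = c(length(small_sample), length(large_sample)),
  # inferential uncertainty: shrinks as n grows
  se = c(standard_error(small_sample), standard_error(large_sample)),
  # outcome uncertainty: stays near the population value of 50
  sd = c(sd(small_sample), sd(large_sample))
)
print(uncertainty)
```

Collecting 100 times more data makes the standard error much smaller, but the spread of individual outcomes stays about the same.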

As we work, reflect on the challenges of visualizing uncertainty

The process through which we understand the world is characterized by assumptions, limitations, extrapolations, and generalizations, and this brings uncertainty (Van Der Bles et al., 2019)
We often face the challenge of communicating this

The challenges of uncertainty

Non-expert people will tend to overstate the impact of interventions and understate the variability of outcomes when they see visualizations that focus on inferential uncertainty, like error bars that show mean and standard error values (Hofman et al., 2020)

The challenges of uncertainty

Expert scientists also overestimate the impact of interventions when they see standard visualizations that focus on inferential uncertainty: the illusion of predictability
We can stimulate more accurate understanding if we show outcome variability (Zhang et al., 2023)

Variation and uncertainty — the importance, the challenges

Vasishth & Gelman (2021):

The most difficult idea to digest in data analysis is that conclusions based on data are almost always uncertain, regardless of whether the outcome of the statistical test is statistically significant or not

Variation and uncertainty — the importance, the challenges

A. Gelman (2015):

We must move beyond the idea that effects are ‘there’ or not and the idea that the goal of a study is to reject a null hypothesis. As many observers have noted, these attitudes lead to trouble because they deny the variation inherent in real social phenomena, and they deny the uncertainty inherent in statistical inference

We use visualizations to help us to see and understand the variation and the uncertainty in our data

Results will vary: we should expect changes over time, or differences between individuals or between groups
Knowledge is uncertain: outcomes will vary even when the average effect is precisely estimated
We have the responsibility to accept and to express this uncertainty

Implement – Produce visualizations in line with best practice

Tip

We combine our creative thinking with the flexibility of the Grammar of Graphics to produce effective plots

`{ggplot2}` means: the Grammar of Graphics Plot 2

When we use {ggplot2} to draw plots, we are using tools developed with a philosophy of visualization in mind (Wickham, 2010; Wilkinson, 2013): The Grammar of Graphics
A grammar is a system of rules that allows people to collaborate and individuals to create
We do not need to think about the grammar when we produce visualizations
But it will help you to know that when we puzzle over how we do things, there are always reasons why we do things

A simple plot has many elements

data and aesthetic mappings
statistical transformations
geometric objects
scales

A scatterplot: points are shown in grey, a smoother line is shown in red. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores — Figure 9: A scatterplot showing the potential association between health literacy and vocabulary

We begin with data: here, from the Health Comprehension project

participant_ID	mean.acc	mean.self	study	AGE	SHIPLEY	HLVA	FACTOR3	QRITOTAL	GENDER	EDUCATION	ETHNICITY
studyone.1	0.49	7.96	studyone	34	33	7	53	11	Non-binary	Higher	White
studyone.10	0.85	7.28	studyone	25	33	7	60	11	Female	Higher	White
studyone.100	0.82	7.36	studyone	43	40	8	46	12	Male	Further	White
studyone.101	0.94	7.88	studyone	46	33	11	51	15	Male	Higher	White
studyone.102	0.58	6.96	studyone	18	32	3	51	12	Male	Secondary	Mixed
studyone.103	0.84	7.88	studyone	19	37	13	45	19	Female	Further	Asian
studyone.104	0.64	8.96	studyone	21	23	9	63	10	Female	Further	White
studyone.105	0.86	7.12	studyone	34	36	6	56	14	Male	Higher	Asian
studyone.106	0.82	7.52	studyone	60	37	7	53	14	Male	Higher	White
studyone.107	0.92	8.76	studyone	29	34	8	34	18	Female	Further	White
studyone.108	0.82	8.68	studyone	21	31	10	55	13	Male	Higher	Black
studyone.109	0.93	8.04	studyone	62	38	13	60	16	Female	Higher	White
studyone.11	0.89	7.76	studyone	69	40	10	50	15	Female	Higher	White
studyone.110	0.88	7.48	studyone	19	40	11	59	17	Female	Further	Asian
studyone.111	0.78	7.92	studyone	21	38	8	61	14	Female	Further	Asian
studyone.112	0.76	6.20	studyone	43	34	6	43	13	Female	Secondary	White
studyone.113	0.57	3.48	studyone	21	37	10	40	13	Male	Higher	White
studyone.114	0.66	4.20	studyone	24	25	10	43	10	Female	Higher	Mixed
studyone.115	0.94	7.52	studyone	66	37	10	55	17	Female	Higher	White
studyone.116	0.89	8.24	studyone	33	40	12	58	15	Male	Further	White
studyone.117	0.82	5.28	studyone	57	39	10	42	8	Male	Higher	White
studyone.118	0.82	7.96	studyone	24	34	7	46	15	Male	Higher	Black
studyone.119	0.80	6.64	studyone	30	27	6	47	12	Female	Higher	White
studyone.12	0.95	7.76	studyone	23	37	9	51	16	Female	Higher	White
studyone.120	0.51	3.68	studyone	25	38	12	58	7	Female	Higher	White
studyone.121	0.43	3.44	studyone	30	39	6	36	13	Female	Higher	White
studyone.122	0.59	5.04	studyone	35	26	6	37	10	Female	Secondary	Asian
studyone.123	0.79	5.48	studyone	37	31	5	41	13	Male	Further	White
studyone.124	0.95	7.04	studyone	24	40	10	40	14	Male	Higher	White
studyone.125	0.72	5.92	studyone	27	36	7	46	14	Female	Higher	White
studyone.126	0.89	9.00	studyone	47	39	11	63	12	Female	Higher	White
studyone.127	0.58	5.00	studyone	37	35	8	36	8	Male	Secondary	White
studyone.128	0.86	6.36	studyone	28	38	9	47	17	Female	Higher	White
studyone.129	0.84	8.00	studyone	37	36	9	44	11	Female	Higher	White
studyone.13	0.83	7.00	studyone	26	33	6	51	10	Male	Secondary	Mixed
studyone.130	0.80	7.36	studyone	34	39	12	55	12	Female	Higher	White
studyone.131	0.85	6.48	studyone	27	32	8	55	14	Female	Further	White
studyone.132	0.76	8.48	studyone	52	34	9	52	8	Female	Higher	White
studyone.133	0.75	5.04	studyone	30	38	10	38	10	Male	Higher	White
studyone.134	0.90	7.64	studyone	20	34	8	56	15	Non-binary	Further	White
studyone.135	0.96	8.84	studyone	23	40	12	53	15	Male	Higher	White
studyone.136	0.85	7.56	studyone	21	31	10	54	10	Female	Higher	White
studyone.137	0.89	5.96	studyone	45	39	10	47	13	Female	Higher	Asian
studyone.138	0.75	5.60	studyone	31	37	8	41	10	Male	Higher	White
studyone.139	0.80	6.32	studyone	60	36	10	55	12	Female	Higher	White
studyone.14	0.94	8.56	studyone	30	34	11	55	17	Female	Higher	White
studyone.140	0.76	4.52	studyone	19	40	9	47	16	Male	Secondary	White
studyone.141	0.92	7.80	studyone	52	38	10	50	15	Female	Higher	White
studyone.142	0.94	8.52	studyone	55	37	11	47	13	Male	Further	White
studyone.143	0.89	6.32	studyone	74	36	8	41	16	Male	Higher	White
studyone.144	0.92	6.96	studyone	40	34	10	41	16	Female	Higher	White
studyone.145	0.83	5.56	studyone	32	30	7	45	12	Male	Higher	Asian
studyone.146	0.80	7.20	studyone	42	33	8	54	10	Male	Further	White
studyone.147	0.89	6.92	studyone	26	34	8	46	13	Female	Further	White
studyone.148	0.84	7.12	studyone	22	37	8	49	16	Male	Higher	White
studyone.149	0.86	5.80	studyone	34	32	10	47	15	Female	Higher	White
studyone.15	0.89	7.48	studyone	45	40	12	50	15	Female	Secondary	White
studyone.150	0.83	8.20	studyone	43	32	8	58	17	Female	Further	White
studyone.151	0.74	5.24	studyone	20	32	9	37	13	Female	Further	White
studyone.152	0.64	7.96	studyone	50	38	5	59	10	Female	Further	White
studyone.153	0.86	7.96	studyone	56	35	10	57	14	Male	Further	Mixed
studyone.154	0.80	5.40	studyone	39	36	9	51	16	Male	Higher	White
studyone.155	0.76	6.16	studyone	28	28	3	51	13	Female	Secondary	White
studyone.156	0.88	7.60	studyone	34	38	10	47	14	Female	Higher	Asian
studyone.157	0.78	8.92	studyone	38	39	8	63	17	Female	Further	White
studyone.158	0.93	7.28	studyone	31	36	9	57	11	Female	Higher	White
studyone.159	0.77	7.12	studyone	59	31	9	51	10	Female	Secondary	White
studyone.16	0.90	6.72	studyone	28	35	6	45	15	Male	Higher	White
studyone.160	0.99	8.12	studyone	44	38	11	56	14	Female	Higher	White
studyone.161	0.92	6.28	studyone	46	40	10	48	17	Male	Higher	White
studyone.162	0.86	5.24	studyone	41	40	10	41	17	Male	Higher	White
studyone.163	0.75	7.96	studyone	36	37	6	54	12	Female	Higher	White
studyone.164	0.87	8.44	studyone	62	40	10	51	16	Female	Higher	Black
studyone.165	0.93	8.52	studyone	59	39	9	58	16	Male	Further	White
studyone.166	0.80	6.52	studyone	76	35	8	57	11	Male	Secondary	White
studyone.167	0.77	5.92	studyone	50	38	12	46	18	Male	Secondary	White
studyone.168	0.87	7.52	studyone	25	24	6	48	12	Female	Higher	White
studyone.169	0.70	6.68	studyone	47	26	7	56	9	Male	Secondary	White
studyone.17	0.94	8.52	studyone	34	40	11	55	14	Male	Higher	White
studyone.18	0.76	5.88	studyone	30	32	8	42	15	Female	Higher	White
studyone.19	0.84	8.80	studyone	41	37	9	52	12	Female	Higher	Asian
studyone.2	0.92	8.76	studyone	20	36	11	44	11	Female	Higher	Other
studyone.20	0.95	6.32	studyone	29	40	6	52	15	Female	Higher	White
studyone.21	0.91	6.76	studyone	26	31	6	45	16	Female	Higher	White
studyone.22	0.86	8.80	studyone	20	36	7	53	16	Female	Further	White
studyone.23	0.76	6.12	studyone	24	35	6	48	11	Female	Secondary	White
studyone.24	0.86	8.32	studyone	32	35	11	61	14	Female	Higher	White
studyone.25	0.98	8.04	studyone	32	33	10	58	15	Female	Further	White
studyone.26	0.86	6.40	studyone	40	33	9	45	15	Male	Secondary	White
studyone.27	0.80	9.00	studyone	34	29	6	46	10	Female	Further	White
studyone.28	0.94	7.04	studyone	23	34	10	56	18	Female	Further	Mixed
studyone.29	0.88	7.00	studyone	22	38	9	51	18	Male	Further	White
studyone.3	0.76	6.32	studyone	40	33	12	60	13	Female	Higher	White
studyone.30	0.92	6.88	studyone	42	36	14	46	15	Female	Higher	White
studyone.31	0.84	8.36	studyone	46	40	8	59	16	Female	Further	White
studyone.32	0.89	7.68	studyone	34	35	7	50	15	Female	Higher	White
studyone.33	0.85	7.84	studyone	51	36	9	45	13	Female	Higher	White
studyone.34	0.88	5.48	studyone	32	38	12	55	11	Female	Higher	White
studyone.35	0.92	7.72	studyone	24	36	14	56	15	Female	Higher	White
studyone.36	0.74	5.64	studyone	18	37	9	54	15	Male	Secondary	White
studyone.37	0.96	7.88	studyone	24	35	7	49	15	Female	Higher	Asian
studyone.38	0.86	4.40	studyone	32	38	10	50	16	Female	Higher	White
studyone.39	0.90	6.88	studyone	22	33	10	44	12	Female	Further	White
studyone.4	0.87	7.08	studyone	37	39	10	49	13	Female	Higher	White
studyone.40	0.85	8.16	studyone	31	38	7	56	18	Male	Secondary	White
studyone.41	0.69	4.52	studyone	27	35	7	45	16	Female	Higher	White
studyone.42	0.92	8.60	studyone	40	35	13	53	17	Female	Higher	White
studyone.43	0.92	6.60	studyone	40	37	9	57	12	Female	Higher	White
studyone.44	0.74	6.56	studyone	20	31	9	41	14	Male	Further	White
studyone.45	0.97	7.96	studyone	29	37	11	54	14	Female	Secondary	White
studyone.46	0.90	6.28	studyone	23	32	8	54	15	Male	Further	White
studyone.47	0.64	7.64	studyone	19	28	5	63	14	Female	Further	Mixed
studyone.48	0.65	4.52	studyone	29	28	5	49	13	Female	Further	White
studyone.49	0.81	5.52	studyone	31	36	10	54	14	Female	Secondary	White
studyone.5	0.88	6.40	studyone	26	34	10	43	11	Female	Further	White
studyone.50	0.67	5.80	studyone	22	32	6	49	13	Male	Further	White
studyone.51	0.90	7.48	studyone	40	37	13	46	10	Male	Higher	White
studyone.52	0.71	8.64	studyone	23	38	7	60	17	Female	Higher	Asian
studyone.53	0.95	8.48	studyone	26	33	11	56	18	Female	Higher	White
studyone.54	0.92	7.60	studyone	30	35	10	58	14	Female	Higher	White
studyone.55	0.90	7.88	studyone	24	36	12	62	15	Female	Higher	Asian
studyone.56	0.67	5.72	studyone	36	38	7	53	14	Female	Higher	White
studyone.57	0.88	3.48	studyone	18	29	7	44	14	Female	Further	Asian
studyone.58	0.86	7.76	studyone	44	32	7	55	13	Female	Higher	White
studyone.59	0.84	7.20	studyone	18	34	6	49	13	Male	Secondary	White
studyone.6	0.86	7.52	studyone	41	37	11	51	11	Female	Higher	White
studyone.60	0.81	6.56	studyone	30	32	5	55	13	Female	Higher	White
studyone.61	0.65	7.72	studyone	31	29	11	40	13	Male	Further	White
studyone.62	0.82	5.44	studyone	46	35	9	50	11	Female	Further	White
studyone.63	0.91	6.08	studyone	40	33	14	52	13	Female	Higher	White
studyone.64	0.85	4.60	studyone	39	38	13	48	6	Female	Higher	White
studyone.65	0.90	9.00	studyone	28	40	10	58	18	Female	Higher	White
studyone.66	0.60	5.28	studyone	27	40	7	45	16	Male	Higher	White
studyone.67	0.92	8.88	studyone	29	33	10	54	11	Female	Further	White
studyone.68	0.81	6.16	studyone	22	25	6	46	12	Female	Further	White
studyone.69	0.63	7.44	studyone	39	33	10	46	13	Female	Further	Asian
studyone.7	0.58	4.76	studyone	22	29	8	51	12	Female	Higher	White
studyone.70	0.41	5.28	studyone	22	31	9	36	8	Male	Further	White
studyone.71	0.85	5.60	studyone	26	37	10	52	14	Male	Further	White
studyone.72	0.98	8.28	studyone	46	39	12	58	15	Female	Higher	White
studyone.73	0.93	8.32	studyone	47	39	11	56	15	Female	Higher	White
studyone.74	0.91	7.88	studyone	18	38	10	49	14	Male	Secondary	Mixed
studyone.75	0.89	7.16	studyone	28	36	11	51	14	Male	Higher	White
studyone.76	0.96	7.20	studyone	36	40	11	51	16	Male	Higher	White
studyone.77	0.66	7.68	studyone	18	27	7	52	11	Male	Secondary	White
studyone.78	0.96	8.36	studyone	32	37	8	55	15	Female	Higher	White
studyone.79	0.66	4.52	studyone	28	33	6	44	11	Female	Higher	White
studyone.8	0.75	6.16	studyone	44	35	7	44	12	Female	Higher	White
studyone.80	0.87	5.88	studyone	30	31	9	51	15	Male	Higher	White
studyone.81	0.88	8.16	studyone	34	34	9	58	11	Male	Higher	White
studyone.82	0.80	7.88	studyone	51	33	5	48	12	Female	Secondary	White
studyone.83	0.81	6.12	studyone	43	37	12	47	15	Female	Higher	White
studyone.84	0.36	4.40	studyone	22	32	4	46	10	Female	Further	White
studyone.85	0.77	7.24	studyone	49	39	9	52	12	Male	Higher	White
studyone.86	0.82	7.20	studyone	39	35	7	49	15	Male	Further	White
studyone.87	0.82	6.52	studyone	55	36	10	45	13	Male	Further	White
studyone.88	0.80	6.84	studyone	67	40	9	56	10	Female	Higher	White
studyone.89	0.62	5.40	studyone	65	34	10	43	10	Male	Secondary	White
studyone.9	0.76	6.32	studyone	30	32	8	41	10	Female	Higher	White
studyone.90	0.82	4.60	studyone	52	40	9	58	13	Male	Secondary	White
studyone.91	0.86	5.80	studyone	24	33	8	40	13	Female	Higher	White
studyone.92	0.83	8.80	studyone	41	32	10	57	14	Male	Further	White
studyone.93	0.67	5.48	studyone	25	32	6	57	6	Male	Further	White
studyone.94	0.77	6.96	studyone	18	32	10	44	12	Male	Further	Asian
studyone.95	0.96	6.88	studyone	20	37	14	51	16	Female	Further	White
studyone.96	0.67	7.92	studyone	73	33	9	54	11	Male	Secondary	White
studyone.97	0.76	6.52	studyone	32	29	7	51	12	Female	Higher	White
studyone.98	0.98	6.60	studyone	39	39	10	50	12	Female	Higher	White
studyone.99	0.71	6.44	studyone	19	35	9	48	9	Male	Further	White

Figure 10: Data table view of Health Comprehension project Study One dataset
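If you do not have the data file to hand, you can still try out the plotting code in this section on a small stand-in. The sketch below types in the first six rows of the table above, keeping only the columns the plots use. The name `clearly.demo` is ours (hypothetical), and six rows are far too few to draw conclusions from.

```r
# Illustrative subset only: the first six rows of the Study One table,
# keeping four of the twelve columns
clearly.demo <- data.frame(
  participant_ID = c("studyone.1", "studyone.10", "studyone.100",
                     "studyone.101", "studyone.102", "studyone.103"),
  SHIPLEY   = c(33, 33, 40, 33, 32, 37),   # vocabulary score
  HLVA      = c(7, 7, 8, 11, 3, 13),       # health literacy score
  EDUCATION = c("Higher", "Higher", "Further",
                "Higher", "Secondary", "Further")
)

# Check the structure and get a quick overview before plotting
str(clearly.demo)
summary(clearly.demo[, c("SHIPLEY", "HLVA")])
```

Checking the structure first tells us which variables are numeric and which are categorical, and so which can be mapped to position and which to colour or facets.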

When we code a plot, we tell R we want:
to use ggplot() to create a plot
using the data-set clearly.one.subjects
and the variables SHIPLEY, HLVA

An empty scatterplot: axes show that vocabulary scores are on the x-axis and health literacy scores are on the y-axis — Figure 11: Scatterplot showing association between health literacy and vocabulary

  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA))

We bring the data-set and the variables
We declare the aesthetic mappings:

SHIPLEY score \(\rightarrow\) x-axis (horizontal: left-to-right position)
HLVA score \(\rightarrow\) y-axis (vertical: bottom-to-top position)

A simple plot has many elements

Plot with only objects
Code for the plot

When we code a plot, we tell R we want:
to use a geometric object, like geom_point
to display the data aesthetic mappings

A scatterplot showing only points: vocabulary scores are on the x-axis and health literacy scores are on the y-axis; but only the points are shown — Figure 12: Scatterplot showing association between health literacy and vocabulary

  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point()

We add the geom_point() to tell R to draw the information about the SHIPLEY and HLVA scores as points
Each point represents information about one participant in the clearly.one.subjects data-set

SHIPLEY score \(\rightarrow\) x-axis (horizontal: left-to-right position)
HLVA score \(\rightarrow\) y-axis (vertical: bottom-to-top position)

When we use `{ggplot2}` we work in layers

The grammar of graphics defines the components of a plot: the data, the mappings, and the geometric object
Together, the data, mappings, and geometric object form a layer
A plot may have multiple layers
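The idea of layers is easier to see if we save the plot at each step. This is a minimal sketch using the built-in `mtcars` data as a stand-in; the object names are ours.

```r
library(ggplot2)

# mtcars is a built-in R dataset, used here only as a stand-in

# Data and aesthetic mappings only: axes are drawn, but no layer yet
base_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Layer 1: points, using the data and mappings declared above
with_points <- base_plot + geom_point()

# Layer 2: a straight-line smoother drawn over the same data
with_trend <- with_points + geom_smooth(method = "lm")

base_plot
with_points
with_trend
```

Each `+` adds one layer to the plot that came before it, so we can change or remove a single layer without rewriting the rest.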

When we use `{ggplot2}` we are in control and we can be creative

Tip

Having a system of graphics: with components, layers and rules
Releases us to be creative: changing a single feature at a time

Plot with layers: add a smoother

Plot with smoother
Code for the plot

Build a plot layer by layer
We can begin by using points to display the vocabulary and health literacy scores for each person
We add a layer using a smoother to show the average association between vocabulary and literacy

A scatterplot showing points and a smoother. Vocabulary scores are on the x-axis and health literacy scores are on the y-axis. Each point represents the pairing of vocabulary and health literacy scores for one person. A smoother is added indicating the average association between vocabulary and health literacy scores across all people in our sample. The points and the smoother suggest a trend so that higher vocabulary scores are associated with higher health literacy scores — Figure 13: Scatterplot showing association between health literacy and vocabulary

  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point() +
  geom_smooth()

We add the geom_smooth() to tell R to represent the average trend for the association between SHIPLEY and HLVA scores
The line is drawn by {ggplot2} which calculates a statistical transformation
Here, the transformation summarizes the association for different ranges of SHIPLEY vocabulary scores

Defaults and arguments

clearly.one.subjects %>%
  ggplot(aes(SHIPLEY, HLVA)) +
  geom_smooth() +
  geom_point()

The {ggplot2} library supplies default values
So we do not need to tell R how to do everything
We do not need to tell R that the points in a scatterplot:
should represent the data aesthetic mappings in Cartesian (x-horizontal, y-vertical) 2-dimensional space
and should be black in colour

Defaults and arguments

clearly.one.subjects %>%
  ggplot(aes(SHIPLEY, HLVA)) +
  geom_smooth() +
  geom_point(colour = "darkgrey", size = 3)

We can override the defaults by supplying arguments, entering values inside the brackets in the function calls
geom_point(colour = "darkgrey", size = 3) tells R we want:

dark grey points when the default is black
points of size 3, which is twice the default point size of 1.5

When we use `{ggplot2}` we are in control and we can be creative

Tip

We can add layers, control the appearance of each component
To construct more effective plots
The plots can be more effective because we develop them in an iterative process
in which we reflect on our goals and the needs of our audience

We can use colour

Using colour
Code for the plot

When we code a plot, we tell R we want:
to display data about people with different education levels
distinguishing education level by colour

A scatterplot: points and smoothers are shown in red, green or blue. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. The trend appears to be steeper for people with secondary education — Figure 14: Using colour

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3)

group = EDUCATION, colour = EDUCATION tells R to:

group the data by EDUCATION level
colour the points for people with different levels of education in different colours

Method, size, transparency

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3)

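method = "lm" tells R to draw the smoother as a straight line fitted by a linear model; se = FALSE hides the confidence band around the line
linewidth = 2 makes the smoother line twice as wide as the default
alpha = .75 requests 75% opacity (i.e. a bit more transparent); in geom_smooth() this setting is applied to the confidence band, so it may have little visible effect on the line itself when se = FALSE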
Learn to edit: shape, size, transparency and colour

We facet plots to enable comparisons

Using facets
Code for the plot

It is often easier to compare trends
By presenting a separate plot for each condition or group
Showing the separate plots in a grid side-by-side

A grid of scatterplots: points and smoothers are shown in red, green or blue. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. The trend appears to be steeper for people with secondary education. Plots are split into separate facets by education — Figure 15: The association between health literacy and vocabulary *varies by education level*

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3) +
  facet_wrap(~ EDUCATION)

facet_wrap(~ EDUCATION) tells R to split the data by EDUCATION level
And show a separate plot for each EDUCATION level group side-by-side for easy comparison

We can guide our audience

Labelled plot
Code for the plot

We do not present visualizations in isolation
We present plots embedded in the context of labels and titles
We use the text to guide the viewer

A scatterplot: points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. Plot labels for title, and each axis have been edited to be more informative — Figure 16: A labelled plot

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)",
       title = "Scatterplot showing how higher vocabulary\npredicts higher health literacy on average")

We use the labs() function to add: the plot title and the labels for the x-axis and y-axis
We edit the title so that the viewer can see what we want them to see
We use \n to make the title fit on two lines

We annotate plots to direct attention

Annotated plot
Code for the plot

We can direct the attention of our audience to key features of our data
By adding annotations like text and lines

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)") +
  geom_hline(yintercept = mean(clearly.one.subjects$HLVA),
             linetype = "dashed",
             linewidth = 2,
             colour = "grey",
             alpha = .85) +
  annotate("text", x = 27, y = 9.3, label = "Mean HLVA", colour = "grey") +
  theme_bw()

geom_hline() adds a line to show mean health literacy
annotate("text" ...) adds a text label
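
One possible variation, sketched here under assumptions: compute the mean once, then reuse it to place both the line and the label, so the label does not need hand-picked coordinates. The data frame toy.subjects and the name mean.hlva are hypothetical, used only for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data: NOT the real clearly.one.subjects dataset
set.seed(1)
toy.subjects <- data.frame(SHIPLEY = round(runif(60, min = 20, max = 40)))
toy.subjects$HLVA <- round(0.25 * toy.subjects$SHIPLEY + rnorm(60, sd = 1.5))

# Compute the mean once so the line and the label always agree
mean.hlva <- mean(toy.subjects$HLVA)

p.annotated <- ggplot(toy.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(size = 3, colour = "lightgreen") +
  geom_hline(yintercept = mean.hlva, linetype = "dashed", colour = "grey") +
  annotate("text",
           x = min(toy.subjects$SHIPLEY),   # start the label at the left edge of the data
           y = mean.hlva + 0.3,             # sit just above the dashed line
           label = paste("Mean HLVA =", round(mean.hlva, 1)),
           hjust = 0,                       # left-align the text at x
           colour = "grey")
```

Because the label position is derived from the data, the annotation stays next to the line if the data change.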

Extensions free our creativity

Complex plot
Code for the plot

The power of the Grammar of Graphics lies in the rules
Developers can use the rules to expand our capacity to visualize data
We add marginal histograms to our scatterplot to visualize associations and distributions

library(ggExtra)  # ggMarginal() comes from the {ggExtra} package

plot <- clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)")

ggMarginal(plot, type = "histogram", fill = "lightgreen", 
           xparams = list(binwidth=2), yparams = list(binwidth=1))

ggMarginal(plot, type = "histogram") enables us to show the distribution of scores on each variable
This helps our viewer to process the association and information about each variable (Franconeri et al., 2021)

Choose your plot theme

We can choose a theme to adapt the look of the whole plot to suit our needs or the needs of our audience
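
A minimal sketch of switching themes, assuming a made-up data frame (toy.subjects) in place of clearly.one.subjects: build the plot once, then add a different complete theme to the same object. The three themes shown are built into {ggplot2}; which one suits your audience is a judgment call.

```r
library(ggplot2)

# Hypothetical stand-in data: NOT the real clearly.one.subjects dataset
set.seed(1)
toy.subjects <- data.frame(SHIPLEY = round(runif(60, min = 20, max = 40)))
toy.subjects$HLVA <- round(0.25 * toy.subjects$SHIPLEY + rnorm(60, sd = 1.5))

# Build the plot once, without choosing a theme
p.base <- ggplot(toy.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)")

# Add a complete theme as a final layer to change the look of the whole plot
p.bw      <- p.base + theme_bw()                       # white background, grey grid, black border
p.minimal <- p.base + theme_minimal(base_size = 14)    # no border; base_size enlarges all text
p.classic <- p.base + theme_classic()                  # axis lines only, no grid
```

Print each object in turn to compare them; the data and geoms are identical, only the non-data elements change.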

Summary

You start your work with these questions:

What are our goals?
What does our audience need or expect?

Summary

You develop your visualization in a reflective process:

Begin with a quick draft to show the distributions or make the comparisons you think about first
Then reflect, and edit: does this enable me to discover sources of variability in my data?
Then reflect, and edit: does this enable me to effectively communicate what I want to communicate?
Then reflect, and edit: does this look good? – do my viewers tell me this works well?

Summary

Tip

I can only show you the potential for creative and effective visualization

experiment and find what looks good and is useful to you
seek out information – good places to start are:

https://ggplot2.tidyverse.org/index.html

https://r-graph-gallery.com

References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.2307/2682899

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Belenky, G., Wesensten, N. J., Thorne, D. R., Thomas, M. L., Sing, H. C., Redmond, D. P., Russo, M. B., & Balkin, T. J. (2003). Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study. Journal of Sleep Research, 12(1), 1–12. https://doi.org/10.1046/j.1365-2869.2003.00337.x

Franconeri, S. L., Padilla, L. M., Shah, P., Zacks, J. M., & Hullman, J. (2021). The science of visual data communication: What works. Psychological Science in the Public Interest, 22(3), 110–161. https://doi.org/10.1177/15291006211051956

Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41(2), 632–643. https://doi.org/10.1177/0149206314525208

Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach. The American Statistician, 56(2), 121–130. https://doi.org/10.1198/000313002317572790

Gelman, A., & Unwin, A. (2013). Infovis and statistical graphics: Different goals, different looks. Journal of Computational and Graphical Statistics, 22(1), 2–28. https://doi.org/10.1080/10618600.2012.761137

Hofman, J. M., Goldstein, D. G., & Hullman, J. (2020). How visualizing inferential uncertainty can mislead readers about treatment effects in scientific results. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3313831.3376454

jumpingrivers. (n.d.). Datasets from the Datasaurus Dozen. https://jumpingrivers.github.io/datasauRus/

Kastellec, J. P., & Leoni, E. L. (2007). Using graphs instead of tables in political science. Perspectives on Politics, 5(4), 755–771. https://doi.org/10.1017/S1537592707072209

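Matejka, J., & Fitzmaurice, G. (2017). Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–1294. https://doi.org/10.1145/3025453.3025912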

Van Der Bles, A. M., Van Der Linden, S., Freeman, A. L. J., Mitchell, J., Galvao, A. B., Zaval, L., & Spiegelhalter, D. J. (2019). Communicating uncertainty about facts, numbers and science. Royal Society Open Science, 6(5), 181870. https://doi.org/10.1098/rsos.181870

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098

Wilkinson, L. (2013). The Grammar of Graphics. Springer Science & Business Media.

Zhang, S., Heck, P. R., Meyer, M. N., Chabris, C. F., Goldstein, D. G., & Hofman, J. M. (2023). An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability. Proceedings of the National Academy of Sciences, 120(33), e2302491120. https://doi.org/10.1073/pnas.2302491120
