Data visualization: practices

Rob Davies

Department of Psychology, Lancaster University

PSYC411: Data visualization – practices

  • My name is Dr Rob Davies; I am an expert in communication, individual differences, and methods

Tip

Ask me anything:

  • questions during class in person or anonymously through slido;
  • all other questions on the discussion forum

Introduction: Data visualization – practices

  • In this PSYC411 class, we look at practices or how we get visualization done
  • In the linked PSYC413 class, we focus on perspectives or how to think about visualization

Our lesson plan

  1. Identify your goals
  2. Think about your audience
  3. Develop reflectively
  4. Implement good practice

Our learning objectives: what are we learning about?

We are working together to help you:

  1. Goals — Formulate questions you can ask yourself to help you to work effectively
  2. Audience — Understand the psychological factors that affect your impact
  3. Development — Work reflectively through a development process
  4. Implement — Produce visualizations in line with best practice

Our assessment targets: how do you know if you have learned?

We are working together so you can:

  1. Goals — Identify a set of targets for a development process in your professional teams
  2. Audience — Explain what you need to do to make a visualization effective
  3. Development — Locate yourself within the stages of the development process
  4. Implement — Produce visualizations that look good and are useful

What are our goals – Questions to help you to work effectively

Tip

  • We begin by thinking about the questions you will ask yourself when you need to decide what you will do
  • We build, here, on the insights developed by A. Gelman & Unwin (2013).

What are our goals?

  • Why don’t we just use the good-enough, easy-to-produce plots in Excel? \(\rightarrow\) Why bother?
  • Why don’t we just produce a summary table? \(\rightarrow\) Why bother?
  • Are we engaged in making beautiful graphics or informative displays or both? \(\rightarrow\) What are we doing?
  • In PSYC413, we look at \(\rightarrow\) Perspectives: the context and history of thinking about visualization

Visualize to enable comparison

Scatterplot showing the relation between reaction time and days in the `sleepstudy` data. Points are ordered on x-axis from 0 to 9 days, on y-axis from 200 to 500 ms reaction time. The plot indicates that reaction time increases with increasing number of days.
Figure 1: Scatterplot of the relation between reaction time and days in the sleepstudy data
The figure presents a grid of scatterplots showing the relation between reaction time and days in the `sleepstudy` data separately for each participant. Points are ordered on x-axis from 0 to 9 days, on y-axis from 200 to 500 ms reaction time. Most plots indicate that reaction time increases with increasing number of days. However, different participants show this trend to differing extents.
Figure 2: The relation between reaction time and days: here, we plot the data for each participant separately
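A minimal sketch of how plots like Figures 1 and 2 could be produced. We assume that {ggplot2} and {lme4} are installed, and that the sleepstudy data are the copy that ships with {lme4}. The object names p.all and p.each are illustrative; the code behind the original figures is not shown in these slides.

```r
library(ggplot2)

# sleepstudy ships with {lme4}: Reaction (ms), Days (0-9), Subject (participant ID)
data("sleepstudy", package = "lme4")

# Figure 1 style: all observations pooled in one scatterplot
p.all <- ggplot(data = sleepstudy, aes(x = Days, y = Reaction)) +
  geom_point()

# Figure 2 style: the same data and mappings, one panel per participant
p.each <- p.all +
  facet_wrap(~ Subject)

print(p.all)
print(p.each)
```

Splitting the pooled plot into one panel per participant is what lets us compare the trend between people, which the single scatterplot hides.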

What are our goals? — What are our jobs?

  • Data visualization workers: we may aim to get and keep the attention of our audience, to tell a story, to persuade our viewers
  • Data analysis workers: we may aim to enable our audience to understand our data, our findings, and to discover more for themselves

What are our goals? — Where or when are we in our process?

  • Sometimes in a workflow, we are quickly sketching draft visualizations: exploring, for ourselves, or with others, what we can see in our data
  • Sometimes, we are ready to present our visualization to a wider audience: we aim to share a polished visual object

What are our goals? — Discovery

Discovery goals

  1. Do we need an overview? – To get a sense of what is in the data, and to check our assumptions
  2. Are we looking for the unexpected? – Comparing groups to check for variability, exploring data open to surprises

What are our goals? — Communication

Communication goals

  1. What do we need our audience to understand?
  2. What story are we telling?
  3. Do we need to attract attention or stimulate interest?

Think about your audience – An evidence based account of what works

Tip

  • We will produce more effective visualizations if we think about how our audience sees, and what they expect (Franconeri et al., 2021)
  • Check out the PSYC413 Perspectives lecture for a more in-depth explanation; here, I present a selective summary

  • Your audience can look at your visualization and quickly and easily extract statistical information from what you show
  • For example, you look at a scatterplot and see the minimum, maximum and mean heights of the points

We show a schematic grid of plots, from top to bottom: dot plot; stacked bar plot; area or bubble plot; line plot; area plot; rectangles varying in intensity. Each plot schematic is marked to show what statistics can be extracted

Franconeri et al. (2021) Fig. 2

Communicating uncertainty is critical

  • As scientists, we think about uncertainty all the time
  • We quantify and typically show uncertainty over estimates e.g. average differences
  • We should also show and think about outcome variability

We show a grid of four plots. In a column of two plots on the left, there are error bars indicating average outcomes given smaller (top) and larger (bottom) samples. The error bars are more narrow for the larger sample. On the right, the same estimates are shown but with raw individual level outcomes. The variability in outcomes is very wide.

Zhang et al. (2023): The difference between uncertainty over estimates and uncertainty over the predictability of outcomes

Consider accessibility from the start

  • The first row shows a scatterplot encoded with two colours, green and orange
  • People with typical vision can see that the green dots have a steep positive correlation and the orange dots make a flat line
  • We use colour-blindness-friendly colour palettes

We show a grid of six plots. The plots indicate how, for some types of colour blindness, the difference between points will not be apparent.

Franconeri et al. (2021) Fig. 5
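One way we might act on this in {ggplot2}, sketched with simulated data. The viridis scales that ship with {ggplot2} (here, scale_colour_viridis_d()) are designed to stay distinguishable under common forms of colour blindness. The data frame demo and its variables are invented for illustration. Mapping group to shape as well as colour is an extra safeguard we add here, not something prescribed by the source.

```r
library(ggplot2)

set.seed(411)

# Invented data: two groups, one with a steep trend and one flat
demo <- data.frame(
  x     = rep(1:20, times = 2),
  group = rep(c("steep", "flat"), each = 20)
)
demo$y <- ifelse(demo$group == "steep", 2 * demo$x, 20) + rnorm(40, sd = 3)

# Encode group with colour AND shape, so that telling the groups apart
# does not depend on hue alone
p <- ggplot(demo, aes(x = x, y = y, colour = group, shape = group)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.8)  # discrete viridis palette

print(p)
```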

Development – Work reflectively through a development process

Tip

  • Your first question is always going to be: (why) do we need to make a plot?
  • Your answer will evolve through a development process that will gradually reveal the characteristics of your data

The benefits of investing in the development process

  • Identifying your goals enables you to understand what you are doing and why
  • Through the development process, you may create different versions — iterations — of a plot
  • This iterative work benefits both you and your audience (A. Gelman et al., 2002; Kastellec & Leoni, 2007)

The benefits of investing in the development process

Tip

  • As you iterate, reflect on what your goals are, what your audience needs and expects, and how each plot version moves you closer to effective discovery or communication
  • This reflection uncovers what is interesting, useful and beautiful about your data

Scientific thinking and data visualization

We can use text and tables to communicate specific values but visualizations help us to:

  • stimulate thinking
  • discover what is unexpected
  • communicate scale and complexity
  • make comparisons to show how results vary
  • display uncertainty about estimates

Anscombe (1973): visualizations show data features quickly and vividly

x1  x2  x3  x4     y1    y2     y3     y4
10  10  10   8   8.04  9.14   7.46   6.58
 8   8   8   8   6.95  8.14   6.77   5.76
13  13  13   8   7.58  8.74  12.74   7.71
 9   9   9   8   8.81  8.77   7.11   8.84
11  11  11   8   8.33  9.26   7.81   8.47
14  14  14   8   9.96  8.10   8.84   7.04
 6   6   6   8   7.24  6.13   6.08   5.25
 4   4   4  19   4.26  3.10   5.39  12.50
12  12  12   8  10.84  9.13   8.15   5.56
 7   7   7   8   4.82  7.26   6.42   7.91
 5   5   5   8   5.68  4.74   5.73   6.89
Figure 3: Data table view of Anscombe’s Quartet dataset
              x1     x2     x3     x4
Min.         4.0    4.0    4.0      8
1st Qu.      6.5    6.5    6.5      8
Median       9.0    9.0    9.0      8
Mean         9.0    9.0    9.0      9
3rd Qu.     11.5   11.5   11.5      8
Max.        14.0   14.0   14.0     19
Figure 4: Summary table view of descriptive statistics for x variables
               y1      y2      y3      y4
Min.        4.260   3.100    5.39   5.250
1st Qu.     6.315   6.695    6.25   6.170
Median      7.580   8.140    7.11   7.040
Mean        7.501   7.501    7.50   7.501
3rd Qu.     8.570   8.950    7.98   8.190
Max.       10.840   9.260   12.74  12.500
Figure 5: Summary table view of descriptive statistics for y variables

Anscombe (1973): visualizations show data features quickly and vividly

Grid of scatterplots showing the relation between x,y variables in the Anscombe 1973 dataset. The plots show: (top left) a typical scatterplot indicating a positive correlation; (top right) a curvilinear association; (bottom left) a more or less coherent positive trend with a marked outlier; and (bottom right) a scatter showing all data grouped at one value of x, with only one point at a second value of x

Figure 6: All 4 of the Anscombe (1973) x,y datasets look nearly identical when examined using summary statistics such as the mean, variance, and x,y correlation, but we see how they differ when we use scatterplots to visualize them
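We can check this for ourselves, because the quartet is built into R as the data frame anscombe. The sketch below first computes a few summary statistics for each x,y pair and then plots the four pairs in a grid. The helper objects (sets, stats, long) are our own names, and the plot is an approximation of Figure 6, not the code used to draw it.

```r
library(ggplot2)

# `anscombe` is built into R: columns x1..x4 and y1..y4
sets <- 1:4
stats <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(sets, function(i) sd(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(sets, function(i) cor(anscombe[[paste0("x", i)]],
                                         anscombe[[paste0("y", i)]]))
)
print(round(stats, 2))  # the four rows are almost indistinguishable

# Stack the four x,y pairs into long format: one row per point
long <- do.call(rbind, lapply(sets, function(i) {
  data.frame(set = paste("Dataset", i),
             x   = anscombe[[paste0("x", i)]],
             y   = anscombe[[paste0("y", i)]])
}))

# One panel per dataset, each with the same kind of straight-line fit
p <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)

print(p)
```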

Matejka & Fitzmaurice (2017) give us the Datasaurus dozen

Grid of scatterplots showing the relation between x,y variables in the Datasaurus dozen datasets. The plots show different shapes, made by the points, even though the summary statistics for the underlying data are the same

Figure 7: All 12 Matejka & Fitzmaurice (2017) x,y datasets (via jumpingrivers (n.d.)) have nearly the same mean and standard deviation summary statistics, but we only understand how the data are structured when we plot them and can look at the structure

Develop visualizations to discover and communicate variability in outcomes

The figure presents a grid of scatterplots showing the relation between reaction time and days in the `sleepstudy` data separately for each participant. Points are ordered on x-axis from 0 to 9 days, on y-axis from 200 to 500 ms reaction time. Most plots indicate that reaction time increases with increasing number of days. However, different participants show this trend to differing extents.

Figure 8: In this plot we show data on the impact of sleep deprivation on reaction time, from Belenky et al. (2003; via Bates et al., 2015). We can see how reaction time slows with increasing deprivation on average (grey line) but that the rate of slowing varies between individuals

Reflect on kinds of uncertainty

  • Scientists are often faced with the challenge of conveying uncertainty to their audiences (Hofman et al., 2020):
  1. Inferential uncertainty — the degree to which a particular summary statistic (e.g., a population mean) is known to the scientist
  2. Outcome uncertainty — how much individual outcomes vary (e.g., around the mean, regardless of how well it has been estimated)
  • Inferential uncertainty can be reduced by collecting and analyzing more data, whereas outcome uncertainty cannot
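A small simulation can make the distinction concrete. For a sample mean, inferential uncertainty is commonly summarized by the standard error, SE = SD / sqrt(n), which shrinks as n grows, while the standard deviation (SD) of individual outcomes does not. The population values (mean 100, SD 15), the sample sizes, and the function name summarise_sample() are all invented for illustration.

```r
set.seed(2020)

# Invented population: outcomes vary around a mean of 100 with SD 15
summarise_sample <- function(n) {
  y <- rnorm(n, mean = 100, sd = 15)
  c(n  = n,
    sd = sd(y),            # outcome uncertainty: spread of individual outcomes
    se = sd(y) / sqrt(n))  # inferential uncertainty: standard error of the mean
}

small <- summarise_sample(25)
large <- summarise_sample(2500)

# Compare the two samples side by side
print(round(rbind(small, large), 2))
```

In output like this we would expect the se column to shrink markedly with the larger sample while the sd column stays close to the population value: more data sharpens the estimate of the mean but does not make individual outcomes any more predictable.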

As we work, reflect on the challenges of visualizing uncertainty

  • The process through which we understand the world is characterized by assumptions, limitations, extrapolations, and generalizations, and this brings uncertainty (Van Der Bles et al., 2019)
  • We often face the challenge of communicating this

We show a grid of four plots. In a column of two plots on the left, there are error bars indicating average outcomes given smaller (top) and larger (bottom) samples. The error bars are more narrow for the larger sample. On the right, the same estimates are shown but with raw individual level outcomes. The variability in outcomes is very wide.

Zhang et al. (2023): The difference between uncertainty over estimates and uncertainty over the predictability of outcomes

The challenges of uncertainty

  • Non-expert people will tend to overstate the impact of interventions and understate the variability of outcomes when they see visualizations, like error bars showing mean and standard error values, that focus on inferential uncertainty (Hofman et al., 2020)

The challenges of uncertainty

  • Expert scientists also overestimate the impact of interventions when they see standard visualizations that focus on inferential uncertainty: the illusion of predictability
  • We can stimulate more accurate understanding if we show outcome variability (Zhang et al., 2023)

Variation and uncertainty — the importance, the challenges

Vasishth & Gelman (2021):

The most difficult idea to digest in data analysis is that conclusions based on data are almost always uncertain, regardless of whether the outcome of the statistical test is statistically significant or not

Variation and uncertainty — the importance, the challenges

A. Gelman (2015):

We must move beyond the idea that effects are ‘there’ or not and the idea that the goal of a study is to reject a null hypothesis. As many observers have noted, these attitudes lead to trouble because they deny the variation inherent in real social phenomena, and they deny the uncertainty inherent in statistical inference

We use visualizations to help us to see and understand the variation and the uncertainty in our data

  • Results will vary: we should expect changes over time, or differences between individuals or between groups
  • Knowledge is uncertain: outcomes will vary even when the average effect is precisely estimated
  • We have the responsibility to accept and to express this uncertainty

Implement – Produce visualizations in line with best practice

Tip

  • We combine our creative thinking with the flexibility of the Grammar of Graphics to produce effective plots

{ggplot2} means: the Grammar of Graphics Plot 2

  • When we use {ggplot2} to draw plots, we are using tools developed with a philosophy of visualization in mind (Wickham, 2010; Wilkinson, 2013): The Grammar of Graphics
  • A grammar is a system of rules that allows people to collaborate and individuals to create
  • We do not need to think about the grammar when we produce visualizations
  • But it will help you to know that when we puzzle over how we do things, there are always reasons why we do things

A simple plot has many elements

  • data and aesthetic mappings
  • statistical transformations
  • geometric objects
  • scales
A scatterplot: points are shown in grey, a smoother line is shown in red. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores
Figure 9: A scatterplot showing the potential association between health literacy and vocabulary

We begin with data: here, from the Health Comprehension project

participant_ID mean.acc mean.self study AGE SHIPLEY HLVA FACTOR3 QRITOTAL GENDER EDUCATION ETHNICITY
studyone.1 0.49 7.96 studyone 34 33 7 53 11 Non-binary Higher White
studyone.10 0.85 7.28 studyone 25 33 7 60 11 Female Higher White
studyone.100 0.82 7.36 studyone 43 40 8 46 12 Male Further White
studyone.101 0.94 7.88 studyone 46 33 11 51 15 Male Higher White
studyone.102 0.58 6.96 studyone 18 32 3 51 12 Male Secondary Mixed
studyone.103 0.84 7.88 studyone 19 37 13 45 19 Female Further Asian
studyone.104 0.64 8.96 studyone 21 23 9 63 10 Female Further White
studyone.105 0.86 7.12 studyone 34 36 6 56 14 Male Higher Asian
studyone.106 0.82 7.52 studyone 60 37 7 53 14 Male Higher White
studyone.107 0.92 8.76 studyone 29 34 8 34 18 Female Further White
studyone.108 0.82 8.68 studyone 21 31 10 55 13 Male Higher Black
studyone.109 0.93 8.04 studyone 62 38 13 60 16 Female Higher White
studyone.11 0.89 7.76 studyone 69 40 10 50 15 Female Higher White
studyone.110 0.88 7.48 studyone 19 40 11 59 17 Female Further Asian
studyone.111 0.78 7.92 studyone 21 38 8 61 14 Female Further Asian
studyone.112 0.76 6.20 studyone 43 34 6 43 13 Female Secondary White
studyone.113 0.57 3.48 studyone 21 37 10 40 13 Male Higher White
studyone.114 0.66 4.20 studyone 24 25 10 43 10 Female Higher Mixed
studyone.115 0.94 7.52 studyone 66 37 10 55 17 Female Higher White
studyone.116 0.89 8.24 studyone 33 40 12 58 15 Male Further White
studyone.117 0.82 5.28 studyone 57 39 10 42 8 Male Higher White
studyone.118 0.82 7.96 studyone 24 34 7 46 15 Male Higher Black
studyone.119 0.80 6.64 studyone 30 27 6 47 12 Female Higher White
studyone.12 0.95 7.76 studyone 23 37 9 51 16 Female Higher White
studyone.120 0.51 3.68 studyone 25 38 12 58 7 Female Higher White
studyone.121 0.43 3.44 studyone 30 39 6 36 13 Female Higher White
studyone.122 0.59 5.04 studyone 35 26 6 37 10 Female Secondary Asian
studyone.123 0.79 5.48 studyone 37 31 5 41 13 Male Further White
studyone.124 0.95 7.04 studyone 24 40 10 40 14 Male Higher White
studyone.125 0.72 5.92 studyone 27 36 7 46 14 Female Higher White
studyone.126 0.89 9.00 studyone 47 39 11 63 12 Female Higher White
studyone.127 0.58 5.00 studyone 37 35 8 36 8 Male Secondary White
studyone.128 0.86 6.36 studyone 28 38 9 47 17 Female Higher White
studyone.129 0.84 8.00 studyone 37 36 9 44 11 Female Higher White
studyone.13 0.83 7.00 studyone 26 33 6 51 10 Male Secondary Mixed
studyone.130 0.80 7.36 studyone 34 39 12 55 12 Female Higher White
studyone.131 0.85 6.48 studyone 27 32 8 55 14 Female Further White
studyone.132 0.76 8.48 studyone 52 34 9 52 8 Female Higher White
studyone.133 0.75 5.04 studyone 30 38 10 38 10 Male Higher White
studyone.134 0.90 7.64 studyone 20 34 8 56 15 Non-binary Further White
studyone.135 0.96 8.84 studyone 23 40 12 53 15 Male Higher White
studyone.136 0.85 7.56 studyone 21 31 10 54 10 Female Higher White
studyone.137 0.89 5.96 studyone 45 39 10 47 13 Female Higher Asian
studyone.138 0.75 5.60 studyone 31 37 8 41 10 Male Higher White
studyone.139 0.80 6.32 studyone 60 36 10 55 12 Female Higher White
studyone.14 0.94 8.56 studyone 30 34 11 55 17 Female Higher White
studyone.140 0.76 4.52 studyone 19 40 9 47 16 Male Secondary White
studyone.141 0.92 7.80 studyone 52 38 10 50 15 Female Higher White
studyone.142 0.94 8.52 studyone 55 37 11 47 13 Male Further White
studyone.143 0.89 6.32 studyone 74 36 8 41 16 Male Higher White
studyone.144 0.92 6.96 studyone 40 34 10 41 16 Female Higher White
studyone.145 0.83 5.56 studyone 32 30 7 45 12 Male Higher Asian
studyone.146 0.80 7.20 studyone 42 33 8 54 10 Male Further White
studyone.147 0.89 6.92 studyone 26 34 8 46 13 Female Further White
studyone.148 0.84 7.12 studyone 22 37 8 49 16 Male Higher White
studyone.149 0.86 5.80 studyone 34 32 10 47 15 Female Higher White
studyone.15 0.89 7.48 studyone 45 40 12 50 15 Female Secondary White
studyone.150 0.83 8.20 studyone 43 32 8 58 17 Female Further White
studyone.151 0.74 5.24 studyone 20 32 9 37 13 Female Further White
studyone.152 0.64 7.96 studyone 50 38 5 59 10 Female Further White
studyone.153 0.86 7.96 studyone 56 35 10 57 14 Male Further Mixed
studyone.154 0.80 5.40 studyone 39 36 9 51 16 Male Higher White
studyone.155 0.76 6.16 studyone 28 28 3 51 13 Female Secondary White
studyone.156 0.88 7.60 studyone 34 38 10 47 14 Female Higher Asian
studyone.157 0.78 8.92 studyone 38 39 8 63 17 Female Further White
studyone.158 0.93 7.28 studyone 31 36 9 57 11 Female Higher White
studyone.159 0.77 7.12 studyone 59 31 9 51 10 Female Secondary White
studyone.16 0.90 6.72 studyone 28 35 6 45 15 Male Higher White
studyone.160 0.99 8.12 studyone 44 38 11 56 14 Female Higher White
studyone.161 0.92 6.28 studyone 46 40 10 48 17 Male Higher White
studyone.162 0.86 5.24 studyone 41 40 10 41 17 Male Higher White
studyone.163 0.75 7.96 studyone 36 37 6 54 12 Female Higher White
studyone.164 0.87 8.44 studyone 62 40 10 51 16 Female Higher Black
studyone.165 0.93 8.52 studyone 59 39 9 58 16 Male Further White
studyone.166 0.80 6.52 studyone 76 35 8 57 11 Male Secondary White
studyone.167 0.77 5.92 studyone 50 38 12 46 18 Male Secondary White
studyone.168 0.87 7.52 studyone 25 24 6 48 12 Female Higher White
studyone.169 0.70 6.68 studyone 47 26 7 56 9 Male Secondary White
studyone.17 0.94 8.52 studyone 34 40 11 55 14 Male Higher White
studyone.18 0.76 5.88 studyone 30 32 8 42 15 Female Higher White
studyone.19 0.84 8.80 studyone 41 37 9 52 12 Female Higher Asian
studyone.2 0.92 8.76 studyone 20 36 11 44 11 Female Higher Other
studyone.20 0.95 6.32 studyone 29 40 6 52 15 Female Higher White
studyone.21 0.91 6.76 studyone 26 31 6 45 16 Female Higher White
studyone.22 0.86 8.80 studyone 20 36 7 53 16 Female Further White
studyone.23 0.76 6.12 studyone 24 35 6 48 11 Female Secondary White
studyone.24 0.86 8.32 studyone 32 35 11 61 14 Female Higher White
studyone.25 0.98 8.04 studyone 32 33 10 58 15 Female Further White
studyone.26 0.86 6.40 studyone 40 33 9 45 15 Male Secondary White
studyone.27 0.80 9.00 studyone 34 29 6 46 10 Female Further White
studyone.28 0.94 7.04 studyone 23 34 10 56 18 Female Further Mixed
studyone.29 0.88 7.00 studyone 22 38 9 51 18 Male Further White
studyone.3 0.76 6.32 studyone 40 33 12 60 13 Female Higher White
studyone.30 0.92 6.88 studyone 42 36 14 46 15 Female Higher White
studyone.31 0.84 8.36 studyone 46 40 8 59 16 Female Further White
studyone.32 0.89 7.68 studyone 34 35 7 50 15 Female Higher White
studyone.33 0.85 7.84 studyone 51 36 9 45 13 Female Higher White
studyone.34 0.88 5.48 studyone 32 38 12 55 11 Female Higher White
studyone.35 0.92 7.72 studyone 24 36 14 56 15 Female Higher White
studyone.36 0.74 5.64 studyone 18 37 9 54 15 Male Secondary White
studyone.37 0.96 7.88 studyone 24 35 7 49 15 Female Higher Asian
studyone.38 0.86 4.40 studyone 32 38 10 50 16 Female Higher White
studyone.39 0.90 6.88 studyone 22 33 10 44 12 Female Further White
studyone.4 0.87 7.08 studyone 37 39 10 49 13 Female Higher White
studyone.40 0.85 8.16 studyone 31 38 7 56 18 Male Secondary White
studyone.41 0.69 4.52 studyone 27 35 7 45 16 Female Higher White
studyone.42 0.92 8.60 studyone 40 35 13 53 17 Female Higher White
studyone.43 0.92 6.60 studyone 40 37 9 57 12 Female Higher White
studyone.44 0.74 6.56 studyone 20 31 9 41 14 Male Further White
studyone.45 0.97 7.96 studyone 29 37 11 54 14 Female Secondary White
studyone.46 0.90 6.28 studyone 23 32 8 54 15 Male Further White
studyone.47 0.64 7.64 studyone 19 28 5 63 14 Female Further Mixed
studyone.48 0.65 4.52 studyone 29 28 5 49 13 Female Further White
studyone.49 0.81 5.52 studyone 31 36 10 54 14 Female Secondary White
studyone.5 0.88 6.40 studyone 26 34 10 43 11 Female Further White
studyone.50 0.67 5.80 studyone 22 32 6 49 13 Male Further White
studyone.51 0.90 7.48 studyone 40 37 13 46 10 Male Higher White
studyone.52 0.71 8.64 studyone 23 38 7 60 17 Female Higher Asian
studyone.53 0.95 8.48 studyone 26 33 11 56 18 Female Higher White
studyone.54 0.92 7.60 studyone 30 35 10 58 14 Female Higher White
studyone.55 0.90 7.88 studyone 24 36 12 62 15 Female Higher Asian
studyone.56 0.67 5.72 studyone 36 38 7 53 14 Female Higher White
studyone.57 0.88 3.48 studyone 18 29 7 44 14 Female Further Asian
studyone.58 0.86 7.76 studyone 44 32 7 55 13 Female Higher White
studyone.59 0.84 7.20 studyone 18 34 6 49 13 Male Secondary White
studyone.6 0.86 7.52 studyone 41 37 11 51 11 Female Higher White
studyone.60 0.81 6.56 studyone 30 32 5 55 13 Female Higher White
studyone.61 0.65 7.72 studyone 31 29 11 40 13 Male Further White
studyone.62 0.82 5.44 studyone 46 35 9 50 11 Female Further White
studyone.63 0.91 6.08 studyone 40 33 14 52 13 Female Higher White
studyone.64 0.85 4.60 studyone 39 38 13 48 6 Female Higher White
studyone.65 0.90 9.00 studyone 28 40 10 58 18 Female Higher White
studyone.66 0.60 5.28 studyone 27 40 7 45 16 Male Higher White
studyone.67 0.92 8.88 studyone 29 33 10 54 11 Female Further White
studyone.68 0.81 6.16 studyone 22 25 6 46 12 Female Further White
studyone.69 0.63 7.44 studyone 39 33 10 46 13 Female Further Asian
studyone.7 0.58 4.76 studyone 22 29 8 51 12 Female Higher White
studyone.70 0.41 5.28 studyone 22 31 9 36 8 Male Further White
studyone.71 0.85 5.60 studyone 26 37 10 52 14 Male Further White
studyone.72 0.98 8.28 studyone 46 39 12 58 15 Female Higher White
studyone.73 0.93 8.32 studyone 47 39 11 56 15 Female Higher White
studyone.74 0.91 7.88 studyone 18 38 10 49 14 Male Secondary Mixed
studyone.75 0.89 7.16 studyone 28 36 11 51 14 Male Higher White
studyone.76 0.96 7.20 studyone 36 40 11 51 16 Male Higher White
studyone.77 0.66 7.68 studyone 18 27 7 52 11 Male Secondary White
studyone.78 0.96 8.36 studyone 32 37 8 55 15 Female Higher White
studyone.79 0.66 4.52 studyone 28 33 6 44 11 Female Higher White
studyone.8 0.75 6.16 studyone 44 35 7 44 12 Female Higher White
studyone.80 0.87 5.88 studyone 30 31 9 51 15 Male Higher White
studyone.81 0.88 8.16 studyone 34 34 9 58 11 Male Higher White
studyone.82 0.80 7.88 studyone 51 33 5 48 12 Female Secondary White
studyone.83 0.81 6.12 studyone 43 37 12 47 15 Female Higher White
studyone.84 0.36 4.40 studyone 22 32 4 46 10 Female Further White
studyone.85 0.77 7.24 studyone 49 39 9 52 12 Male Higher White
studyone.86 0.82 7.20 studyone 39 35 7 49 15 Male Further White
studyone.87 0.82 6.52 studyone 55 36 10 45 13 Male Further White
studyone.88 0.80 6.84 studyone 67 40 9 56 10 Female Higher White
studyone.89 0.62 5.40 studyone 65 34 10 43 10 Male Secondary White
studyone.9 0.76 6.32 studyone 30 32 8 41 10 Female Higher White
studyone.90 0.82 4.60 studyone 52 40 9 58 13 Male Secondary White
studyone.91 0.86 5.80 studyone 24 33 8 40 13 Female Higher White
studyone.92 0.83 8.80 studyone 41 32 10 57 14 Male Further White
studyone.93 0.67 5.48 studyone 25 32 6 57 6 Male Further White
studyone.94 0.77 6.96 studyone 18 32 10 44 12 Male Further Asian
studyone.95 0.96 6.88 studyone 20 37 14 51 16 Female Further White
studyone.96 0.67 7.92 studyone 73 33 9 54 11 Male Secondary White
studyone.97 0.76 6.52 studyone 32 29 7 51 12 Female Higher White
studyone.98 0.98 6.60 studyone 39 39 10 50 12 Female Higher White
studyone.99 0.71 6.44 studyone 19 35 9 48 9 Male Further White
Figure 10: Data table view of Health Comprehension project Study One dataset

A simple plot has many elements

  • When we code a plot, we tell R we want:
  • to use ggplot() to create a plot
  • using the data-set clearly.one.subjects
  • and the variables SHIPLEY, HLVA
An empty scatterplot: axes show that vocabulary scores are on the x-axis and health literacy scores are on the y-axis
Figure 11: Scatterplot showing association between health literacy and vocabulary
  library(ggplot2)  # load {ggplot2} before calling ggplot()
  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA))
  • We bring the data-set and the variables
  • We declare the aesthetic mappings:
  1. SHIPLEY score \(\rightarrow\) x-axis (horizontal: left-to-right position)
  2. HLVA score \(\rightarrow\) y-axis (vertical: bottom-to-top position)

A simple plot has many elements

  • When we code a plot, we tell R we want:
  • to use a geometric object, like geom_point
  • to display the data according to the aesthetic mappings
A scatterplot showing only points: vocabulary scores are on the x-axis and health literacy scores are on the y-axis; but only the points are shown
Figure 12: Scatterplot showing association between health literacy and vocabulary
  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA)) +
    geom_point()
  • We add the geom_point() to tell R to draw the information about the SHIPLEY and HLVA scores as points
  • Each point represents information about one participant in the clearly.one.subjects data-set
  1. SHIPLEY score \(\rightarrow\) x-axis (horizontal: left-to-right position)
  2. HLVA score \(\rightarrow\) y-axis (vertical: bottom-to-top position)

When we use {ggplot2} we work in layers

  • The grammar of graphics defines the components of a plot: the data, the mappings, and the geometric object
  • Together, the data, mappings, and geometric object form a layer
  • A plot may have multiple layers
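A sketch of how layers stack up in code. Because the clearly.one.subjects data are not bundled with these slides, we simulate a stand-in data frame, demo.subjects, with the same two variable names; its values are invented and are NOT the Health Comprehension project data. The object names p.base, p.points and p.both are also our own.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for clearly.one.subjects (invented values)
demo.subjects <- data.frame(SHIPLEY = sample(20:40, 100, replace = TRUE))
demo.subjects$HLVA <- round(0.3 * demo.subjects$SHIPLEY + rnorm(100, sd = 2))

# Data + aesthetic mappings only: axes are drawn but nothing is plotted yet
p.base <- ggplot(data = demo.subjects, aes(x = SHIPLEY, y = HLVA))

# Layer 1: points, one per simulated participant
p.points <- p.base + geom_point()

# Layer 2: a smoother, inheriting the same data and mappings
p.both <- p.points + geom_smooth()

print(p.both)
```

Storing each stage in its own object lets us print p.base, p.points and p.both in turn and see exactly what each added layer contributes.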

When we use {ggplot2} we are in control and we can be creative

Tip

  • Having a system of graphics, with components, layers and rules, releases us to be creative: we can change a single feature at a time

Plot with layers: add a smoother

  • Build a plot layer by layer
  • We can begin by using points to display the vocabulary and health literacy scores for each person
  • We add a layer using a smoother to show the average association between vocabulary and literacy
A scatterplot showing points and a smoother. Vocabulary scores are on the x-axis and health literacy scores are on the y-axis. Each point represents the pairing of vocabulary and health literacy scores for one person. A smoother is added indicating the average association between vocabulary and health literacy scores across all people in our sample. The points and the smoother suggest a trend so that higher vocabulary scores are associated with higher health literacy scores
Figure 13: Scatterplot showing association between health literacy and vocabulary
  ggplot(data = clearly.one.subjects, aes(x = SHIPLEY, y = HLVA)) +
    geom_point() +
    geom_smooth()
  • We add the geom_smooth() to tell R to represent the average trend for the association between SHIPLEY and HLVA scores
  • The line is drawn by {ggplot2} which calculates a statistical transformation
  • Here, the transformation summarizes the association for different ranges of SHIPLEY vocabulary scores
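To make "statistical transformation" concrete: for samples of fewer than 1,000 observations, geom_smooth() defaults to a loess (local regression) fit, and we can approximate that calculation ourselves with base R's loess(). This is a sketch on simulated stand-in data (demo.subjects is invented, not the project data), and the match to what {ggplot2} draws is only approximate, because {ggplot2} chooses its own grid of x values.

```r
set.seed(1)

# Simulated stand-in for clearly.one.subjects (invented values)
demo.subjects <- data.frame(SHIPLEY = sample(20:40, 100, replace = TRUE))
demo.subjects$HLVA <- round(0.3 * demo.subjects$SHIPLEY + rnorm(100, sd = 2))

# Fit a local regression of HLVA on SHIPLEY
fit <- loess(HLVA ~ SHIPLEY, data = demo.subjects)

# Evaluate the fitted curve at evenly spaced points across the observed range
grid <- data.frame(
  SHIPLEY = seq(min(demo.subjects$SHIPLEY),
                max(demo.subjects$SHIPLEY),
                length.out = 50)
)
grid$HLVA.smooth <- predict(fit, newdata = grid)

# Each row is one point on the smoothed line
print(head(grid))
```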

Defaults and arguments

library(magrittr)  # provides the %>% pipe (also available via {dplyr} or {tidyverse})
clearly.one.subjects %>%
  ggplot(aes(SHIPLEY, HLVA)) +
  geom_smooth() +
  geom_point()
  • The {ggplot2} library supplies default values
  • So we do not need to tell R how to do everything
  • We do not need to tell R that the points in a scatterplot:
  • should represent the data aesthetic mappings in Cartesian (x-horizontal, y-vertical) 2-dimensional space
  • and should be black in colour

Defaults and arguments

clearly.one.subjects %>%
  ggplot(aes(SHIPLEY, HLVA)) +
  geom_smooth() +
  geom_point(colour = "darkgrey", size = 3)
  • We can over-ride the defaults by supplying arguments, entering values inside the brackets in the function calls
  • geom_point(colour = "darkgrey", size = 3) tells R we want:
  1. dark grey points when the default is black
  2. points of size 3, which is twice the default size of 1.5
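We can ask {ggplot2} which values it will actually use, with and without our arguments, through layer_data(). The sketch below uses a simulated stand-in data frame (demo.subjects, invented values) in place of clearly.one.subjects. The object names are our own.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for clearly.one.subjects (invented values)
demo.subjects <- data.frame(SHIPLEY = sample(20:40, 100, replace = TRUE))
demo.subjects$HLVA <- round(0.3 * demo.subjects$SHIPLEY + rnorm(100, sd = 2))

# Same plot twice: once with defaults, once with our own arguments
p.default <- ggplot(demo.subjects, aes(SHIPLEY, HLVA)) +
  geom_point()
p.custom <- ggplot(demo.subjects, aes(SHIPLEY, HLVA)) +
  geom_point(colour = "darkgrey", size = 3)

# layer_data() returns the values {ggplot2} computes for a layer;
# we look at the colour and size of the first point in each plot
default.aes <- layer_data(p.default, 1)[1, c("colour", "size")]
custom.aes  <- layer_data(p.custom, 1)[1, c("colour", "size")]

print(default.aes)
print(custom.aes)
```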

When we use {ggplot2} we are in control and we can be creative

Tip

  • We can add layers, and control the appearance of each component, to construct more effective plots
  • The plots can be more effective because we develop them in an iterative process in which we reflect on our goals and the needs of our audience

We can use colour

  • When we code a plot, we tell R we want:
  • to display data about people with different education levels
  • distinguishing education level by colour
A scatterplot: points and smoothers are shown in red, green or blue. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. The trend appears to be steeper for people with secondary education
Figure 14: Using colour
clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3)
  • group = EDUCATION, colour = EDUCATION tells R to:
  1. group the data by EDUCATION level
  2. colour the points and smoother lines for people with different levels of education in different colours
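It helps to separate two uses of colour: mapping colour to a variable inside aes(), and setting colour to a fixed value outside aes(). The sketch below assumes simulated stand-in data; the name sim.subjects and the three EDUCATION labels are invented for illustration and may not match the real dataset.

```r
library(ggplot2)

# Simulated stand-in data (an assumption): the EDUCATION labels are hypothetical
set.seed(411)
sim.subjects <- data.frame(
  EDUCATION = rep(c("Further", "Higher", "Secondary"), each = 50),
  SHIPLEY = sample(20:40, 150, replace = TRUE)
)
sim.subjects$HLVA <- 2 + 0.22 * sim.subjects$SHIPLEY + rnorm(150, sd = 1.5)

# Mapping: colour inside aes() varies with the values of EDUCATION
p.mapped <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA, colour = EDUCATION)) +
  geom_point(size = 3)

# Setting: colour outside aes() gives every point the same fixed colour
p.set <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(size = 3, colour = "darkgrey")

# Count the distinct colours actually used in each plot
n.colours.mapped <- length(unique(layer_data(p.mapped)$colour))
n.colours.set <- length(unique(layer_data(p.set)$colour))
c(mapped = n.colours.mapped, set = n.colours.set)
```

Mapping gives one colour per education level and adds a legend automatically; setting gives a single colour and no legend.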

Method, size, transparency

clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3)
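  • method = "lm" tells R to draw the smoother as a straight line fitted by a linear model; se = FALSE hides the shaded band that shows uncertainty around the line
  • linewidth = 2 makes the smoother line twice as wide as the default
  • alpha = .75 asks for 75% opacity (i.e. a bit more transparent); in geom_smooth() this setting is applied to the shaded band, so with se = FALSE the line itself may look unchanged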
  • Learn to edit: shape, size, transparency and colour
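With method = "lm" and a grouping variable, each smoother corresponds to a straight line fitted separately within each group. We can sketch that calculation with lm() to put numbers on how steep each line is. The data here are simulated under assumptions: sim.subjects, the EDUCATION labels, and the slopes in true.slopes are all invented for illustration and say nothing about the real sample.

```r
# Simulated stand-in data (an assumption): labels and slopes are hypothetical
set.seed(411)
sim.subjects <- data.frame(
  EDUCATION = rep(c("Further", "Higher", "Secondary"), each = 50),
  SHIPLEY = sample(20:40, 150, replace = TRUE)
)
true.slopes <- c(Further = 0.15, Higher = 0.15, Secondary = 0.35)
sim.subjects$HLVA <- 3 +
  unname(true.slopes[sim.subjects$EDUCATION]) * sim.subjects$SHIPLEY +
  rnorm(nrow(sim.subjects), sd = 1)

# Fit HLVA ~ SHIPLEY separately within each EDUCATION group
# and keep the estimated slope for SHIPLEY from each fit
slopes <- sapply(
  split(sim.subjects, sim.subjects$EDUCATION),
  function(group.data) coef(lm(HLVA ~ SHIPLEY, data = group.data))[["SHIPLEY"]]
)

round(slopes, 2)
```

A larger slope means a steeper line in the plot, so the numbers let us check an impression like "the trend appears to be steeper" for one group.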

We facet plots to enable comparisons

  • It is often easier to compare trends
  • By presenting a separate plot for each condition or group
  • Showing the separate plots in a grid side-by-side
A grid of scatterplots: points and smoothers are shown in red, green or blue. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. The trend appears to be steeper for people with secondary education. Plots are split into separate facets by education
Figure 15: The association between health literacy and vocabulary varies by education level
clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA, 
             group = EDUCATION, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, 
              linewidth = 2, alpha = .75) +
  geom_point(size = 3) +
  facet_wrap(~ EDUCATION)
  • facet_wrap(~ EDUCATION) tells R to split the data by EDUCATION level
  • And show a separate plot for each EDUCATION level group side-by-side for easy comparison
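We can confirm how the data are split across facets by counting the points drawn in each panel. This is a sketch under assumptions: the data are simulated stand-ins (sim.subjects and its EDUCATION labels are invented), and nrow = 1 is added here only to force the panels into a single row.

```r
library(ggplot2)

# Simulated stand-in data (an assumption): the EDUCATION labels are hypothetical
set.seed(411)
sim.subjects <- data.frame(
  EDUCATION = rep(c("Further", "Higher", "Secondary"), each = 50),
  SHIPLEY = sample(20:40, 150, replace = TRUE)
)
sim.subjects$HLVA <- 2 + 0.22 * sim.subjects$SHIPLEY + rnorm(150, sd = 1.5)

p <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA, colour = EDUCATION)) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 2) +
  geom_point(size = 3) +
  facet_wrap(~ EDUCATION, nrow = 1)  # one row, so panels sit side-by-side

# Layer 2 is geom_point(): count how many points land in each panel
panel.counts <- table(layer_data(p, 2)$PANEL)
panel.counts
```

By default facet_wrap() gives every panel the same axis limits, which is what makes it fair to compare the slopes across panels by eye.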

We can guide our audience

  • We do not present visualizations in isolation
  • We present plots embedded in the context of labels and titles
  • We use the text to guide the viewer
A scatterplot: points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. Plot labels for title, and each axis have been edited to be more informative
Figure 16: A labelled plot
clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)",
       title = "Scatterplot showing how higher vocabulary\npredicts higher health literacy on average")
  • We use the labs() function to add: the plot title and the labels for the x-axis and y-axis
  • We edit the title so that the viewer can see what we want them to see
  • We use \n to make the title fit on two lines
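If we would rather not place \n by hand, one option is to let R choose the line breaks. The sketch below uses the base R function strwrap() to wrap a long title; the width of 45 characters is an arbitrary choice for illustration, and the data are simulated stand-ins (sim.subjects is an invented name).

```r
library(ggplot2)

# Simulated stand-in data (an assumption): not the real clearly.one.subjects
set.seed(411)
sim.subjects <- data.frame(SHIPLEY = sample(20:40, 150, replace = TRUE))
sim.subjects$HLVA <- 2 + 0.22 * sim.subjects$SHIPLEY + rnorm(150, sd = 1.5)

long.title <- "Scatterplot showing how higher vocabulary predicts higher health literacy on average"

# strwrap() splits the text into lines shorter than the chosen width;
# paste() joins the lines back together with \n between them
wrapped.title <- paste(strwrap(long.title, width = 45), collapse = "\n")
cat(wrapped.title)

p <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)",
       title = wrapped.title)
p
```

This is useful when titles are generated from data or edited often, because the breaks adjust themselves when the text changes.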

We annotate plots to direct attention

  • We can direct the attention of our audience to key features of our data
  • By adding annotations like text and lines
A scatterplot: points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. We have added a horizontal grey line to indicate the average health literacy (HLVA) score for this sample
Figure 17: An annotated plot
clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)") +
  geom_hline(yintercept = mean(clearly.one.subjects$HLVA),
             linetype = "dashed",
             linewidth = 2,
             colour = "grey",
             alpha = .85) +
  annotate("text", x = 27, y = 9.3, label = "Mean HLVA", colour = "grey") +
  theme_bw()
  • geom_hline() adds a line to show mean health literacy
  • annotate("text" ...) adds a text label
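Annotation positions can be calculated instead of typed in, so that the label follows the line if the data change. The sketch below is one possible approach, assuming simulated stand-in data (sim.subjects is an invented name) with one missing score added on purpose to show why na.rm = TRUE can matter.

```r
library(ggplot2)

# Simulated stand-in data (an assumption): not the real clearly.one.subjects
set.seed(411)
sim.subjects <- data.frame(SHIPLEY = sample(20:40, 150, replace = TRUE))
sim.subjects$HLVA <- 2 + 0.22 * sim.subjects$SHIPLEY + rnorm(150, sd = 1.5)
sim.subjects$HLVA[5] <- NA  # one missing score, added to show the problem

mean(sim.subjects$HLVA)  # NA: a single missing value makes the mean missing
mean.hlva <- mean(sim.subjects$HLVA, na.rm = TRUE)

p <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(size = 3, colour = "lightgreen") +
  geom_hline(yintercept = mean.hlva, linetype = "dashed", colour = "grey") +
  # Place the label at the left edge of the data, just above the mean line
  annotate("text",
           x = min(sim.subjects$SHIPLEY), y = mean.hlva,
           label = paste("Mean HLVA =", round(mean.hlva, 1)),
           hjust = 0, vjust = -0.5, colour = "grey") +
  theme_bw()
p
```

Here hjust = 0 left-aligns the text at the chosen x position and vjust = -0.5 lifts it just above the line. R will warn that the row with the missing score is left out of the points layer.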

Extensions free our creativity

  • The power of the Grammar of Graphics lies in the rules
  • Developers can use the rules to expand our capacity to visualize data
  • We add marginal histograms to our scatterplot to visualize associations and distributions
A scatterplot with marginal histograms: points are shown in light green and a smoother line in dark green. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores. Histograms in the plot margins show the distribution of scores on each variable
Figure 18: A scatterplot showing the potential association between health literacy and vocabulary
plot <- clearly.one.subjects %>%
  ggplot(aes(x = SHIPLEY, y = HLVA)) +
  geom_smooth(method = "lm", se = FALSE,
              colour = "darkgreen", linewidth = 2, alpha = .75) +
  geom_point(size = 3, colour = "lightgreen") +
  labs(x = "Vocabulary (Shipley)", y = "Health literacy (HLVA)")

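# ggMarginal() is supplied by the {ggExtra} package, which must be loaded first
library(ggExtra)
ggMarginal(plot, type = "histogram", fill = "lightgreen",
           xparams = list(binwidth = 2), yparams = list(binwidth = 1))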
  • ggMarginal(plot, type = "histogram") enables us to show the distribution of scores on each variable
  • This helps our viewer to process the association and information about each variable (Franconeri et al., 2021)

Choose your plot theme

  • We can choose a theme to adapt the look of the whole plot to suit our needs or the needs of our audience
A scatterplot: points are shown in grey, a smoother line is shown in red. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores
Figure 19: theme_dark()
A scatterplot: points are shown in grey, a smoother line is shown in red. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores
Figure 20: theme_bw()
A scatterplot: points are shown in grey, a smoother line is shown in red. Points represent the pairing of health literacy and vocabulary scores for each participant. An upward trend is apparent, such that higher vocabulary scores (on the x-axis) are associated with higher health literacy scores
Figure 21: theme_classic()
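Themes are added to a plot like any other layer. The sketch below shows one way plots like those in Figures 19 to 21 could be produced; it is an assumption about the code, not the code used for the slides, and it relies on simulated stand-in data (sim.subjects is an invented name).

```r
library(ggplot2)

# Simulated stand-in data (an assumption): not the real clearly.one.subjects
set.seed(411)
sim.subjects <- data.frame(SHIPLEY = sample(20:40, 150, replace = TRUE))
sim.subjects$HLVA <- 2 + 0.22 * sim.subjects$SHIPLEY + rnorm(150, sd = 1.5)

# Build the plot once, without choosing a theme
base.plot <- ggplot(sim.subjects, aes(x = SHIPLEY, y = HLVA)) +
  geom_point(colour = "grey", size = 3) +
  geom_smooth(method = "lm", se = FALSE, colour = "red", linewidth = 2)

# Add a different theme to the same plot each time
themed.plots <- list(
  dark = base.plot + theme_dark(),
  bw = base.plot + theme_bw(),
  classic = base.plot + theme_classic()
)

# Print one version; change the name to view the others
themed.plots$bw
```

Because the theme changes only the non-data parts of the plot (background, grid lines, fonts), we can swap themes without touching the rest of the code.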

Summary

You start your work with these questions:

  1. What are our goals?
  2. What does our audience need or expect?

Summary

You develop your visualization in a reflective process:

  1. Begin with a quick draft to show the distributions or make the comparisons you think about first
  2. Then reflect, and edit: does this enable me to discover sources of variability in my data?
  3. Then reflect, and edit: does this enable me to effectively communicate what I want to communicate?
  4. Then reflect, and edit: does this look good? – do my viewers tell me this works well?

Summary

Tip

I can only show you the potential for creative and effective visualization

  • experiment and find what looks good and is useful to you
  • seek out information – good places to start are:

https://ggplot2.tidyverse.org/index.html

https://r-graph-gallery.com

References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.2307/2682899
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Belenky, G., Wesensten, N. J., Thorne, D. R., Thomas, M. L., Sing, H. C., Redmond, D. P., Russo, M. B., & Balkin, T. J. (2003). Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research, 12(1), 1–12. https://doi.org/10.1046/j.1365-2869.2003.00337.x
Franconeri, S. L., Padilla, L. M., Shah, P., Zacks, J. M., & Hullman, J. (2021). The Science of Visual Data Communication: What Works. Psychological Science in the Public Interest, 22(3), 110–161. https://doi.org/10.1177/15291006211051956
Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41(2), 632–643. https://doi.org/10.1177/0149206314525208
Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach. The American Statistician, 56(2), 121–130. https://doi.org/10.1198/000313002317572790
Gelman, A., & Unwin, A. (2013). Infovis and statistical graphics: Different goals, different looks. Journal of Computational and Graphical Statistics, 22(1), 2–28. https://doi.org/10.1080/10618600.2012.761137
Hofman, J. M., Goldstein, D. G., & Hullman, J. (2020). How visualizing inferential uncertainty can mislead readers about treatment effects in scientific results. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–12. https://doi.org/10.1145/3313831.3376454
jumpingrivers. (n.d.). Datasets from the Datasaurus Dozen. https://jumpingrivers.github.io/datasauRus/
Kastellec, J. P., & Leoni, E. L. (2007). Using Graphs Instead of Tables in Political Science. Perspectives on Politics, 5(4), 755–771. https://doi.org/10.1017/S1537592707072209
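Matejka, J., & Fitzmaurice, G. (2017). Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–1294. https://doi.org/10.1145/3025453.3025912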
Van Der Bles, A. M., Van Der Linden, S., Freeman, A. L. J., Mitchell, J., Galvao, A. B., Zaval, L., & Spiegelhalter, D. J. (2019). Communicating uncertainty about facts, numbers and science. Royal Society Open Science, 6(5), 181870.
Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051
Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Wilkinson, L. (2013). The Grammar of Graphics. Springer Science & Business Media.
Zhang, S., Heck, P. R., Meyer, M. N., Chabris, C. F., Goldstein, D. G., & Hofman, J. M. (2023). An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability. Proceedings of the National Academy of Sciences, 120(33), e2302491120. https://doi.org/10.1073/pnas.2302491120