Chapter 2:
Data Visualisation

For the R For Data Science 2nd Edition Book Club Cohort 9

Jeremy Selva

Introduction

This chapter shows how to visualise data using ggplot2. Ensure these libraries are available.

# Needed in the book
library(tidyverse)
library(palmerpenguins)
library(ggthemes)

# For the extra slides
library(scales)
library(glue)
library(ggfittext)
library(ggtext)
library(downloadthis)

penguins data

The penguins data has 344 observations (one row per penguin) and 8 columns.

print(penguins, width = Inf)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
   sex     year
   <fct>  <int>
 1 male    2007
 2 female  2007
 3 female  2007
 4 <NA>    2007
 5 female  2007
 6 male    2007
 7 female  2007
 8 male    2007
 9 <NA>    2007
10 <NA>    2007
# ℹ 334 more rows

penguins data

Among the variables in penguins are:

  1. species: a penguin’s species (Adelie, Chinstrap, or Gentoo).

  2. flipper_length_mm: length of a penguin’s flipper, in millimeters.

  3. body_mass_g: body mass of a penguin, in grams.

penguins data

We can take a closer look using dplyr::glimpse().

dplyr::glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

We want to find the relationship between flipper length and body mass for each penguin species.

Creating a ggplot

The first argument, data, tells ggplot which dataset to use in the graph. On its own, it gives an empty graph.

ggplot2::ggplot(
  data = penguins
)

An empty graph.

Creating a ggplot

The next argument, mapping, tells ggplot which variables in penguins are placed on the x and y axes. The mapping argument must be the output of the ggplot2::aes function.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g
      )
)

A graph with only x and y axis labels.

Creating a ggplot

We can see that flipper length and body mass are now displayed on the x and y axes. However, we cannot yet see the penguin data in the plot.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g
      )
)

A graph with only x and y axis labels.

Creating a ggplot

We need to specify, using a geom_ function, how the data should be represented:

  • bars using geom_bar()
  • lines using geom_line()
  • boxplots using geom_boxplot()
  • scatterplots using geom_point()
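As a minimal sketch of how the choice of geom changes the representation, the code below stores a plot and then adds a geom to it. The object names (base_plot, scatter_plot, box_plot) are illustrative and do not come from the book.

```r
# Use the palmerpenguins data explicitly to avoid clashing with other
# datasets that are also called penguins.
penguins <- palmerpenguins::penguins

# Base plot: data and axis mappings only, no geom yet.
base_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
)

# Adding a geom decides how the observations are drawn.
scatter_plot <- base_plot + ggplot2::geom_point(na.rm = TRUE)

# A boxplot needs a categorical x, so a different mapping is used here.
box_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = species,
    y = body_mass_g
  )
) +
  ggplot2::geom_boxplot(na.rm = TRUE)
```

Printing scatter_plot or box_plot draws the corresponding graph.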

Creating a ggplot

Adding geom_point() gives us the points in the plot. We can see that penguins with longer flippers generally have a larger body mass.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g
      )
) + 
ggplot2::geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm.

Creating a ggplot

The warning indicates that two of the penguins have missing body mass and/or flipper length values.

penguins |> 
  dplyr::filter(
    is.na(flipper_length_mm) | is.na(body_mass_g)
  ) |>
  print(width = Inf)
# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen             NA            NA                NA          NA
2 Gentoo  Biscoe                NA            NA                NA          NA
  sex    year
  <fct> <int>
1 <NA>   2007
2 <NA>   2009
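The number of dropped rows can also be checked without plotting. This is a small sketch in base R; the variable names missing_per_column and rows_dropped are illustrative.

```r
penguins <- palmerpenguins::penguins

# Number of missing values in each column.
missing_per_column <- colSums(is.na(penguins))
missing_per_column

# Rows where either plotted variable is missing; these are the rows
# that geom_point() removes.
rows_dropped <- sum(
  is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)
)
rows_dropped
```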

Creating a ggplot

We can see that penguins with longer flippers generally have a larger body mass.

Does this apply to all species (Adelie, Chinstrap and Gentoo)?

We can check this by representing each species with a different point colour.

Creating a ggplot

To do this, we modify the aes function by adding color = species.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species
      )
) + 
ggplot2::geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with colours.

Creating a ggplot

Adding color = species allows ggplot to assign a unique colour for each species and provides a legend.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species
      )
) + 
ggplot2::geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with colours.
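The per-species trend seen in the coloured scatterplot can also be checked numerically. The sketch below is an addition to the slides: it computes the Pearson correlation between flipper length and body mass within each species, and cor_by_species is an illustrative name.

```r
penguins <- palmerpenguins::penguins

# Split the data by species, then compute the correlation between
# flipper length and body mass within each group, ignoring missing values.
cor_by_species <- sapply(
  split(penguins, penguins$species),
  function(d) {
    stats::cor(
      d$flipper_length_mm,
      d$body_mass_g,
      use = "complete.obs"
    )
  }
)
cor_by_species
```

A positive value for every species would agree with what the plot suggests.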

Creating a ggplot

We now wish to add a smooth line for each species. This is done by adding geom_smooth(method = "lm"), which fits a straight line (linear model) to each group.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species
      )
) + 
ggplot2::geom_point() +
ggplot2::geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with smooth curves for each species.

Creating a ggplot

Instead of one line per species, how do we add a single smooth line that describes the relationship across all species?

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species
      )
) + 
ggplot2::geom_point() +
ggplot2::geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with smooth curves for each species.

Creating a ggplot

First, note that the following two pieces of code are equivalent.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species)
) + 
ggplot2::geom_point() +
ggplot2::geom_smooth(method = "lm")

ggplot2::ggplot() + 
ggplot2::geom_point(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species)
) +
ggplot2::geom_smooth(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species),
  method = "lm")

Creating a ggplot

Because we want to draw a single smooth line across all species, we need to remove color = species from the mapping argument of geom_smooth.

ggplot2::ggplot() + 
ggplot2::geom_point(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species)
) +
ggplot2::geom_smooth(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g),
  method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with a single smooth line across all species.
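An alternative way to get a single line, shown here as a sketch and not as the book's approach, is to keep color = species in ggplot() and switch it off for the smooth layer only with aes(color = NULL). The name single_line_plot is illustrative.

```r
penguins <- palmerpenguins::penguins

single_line_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species
  )
) +
  ggplot2::geom_point(na.rm = TRUE) +
  # color = NULL removes the inherited colour mapping for this layer only,
  # so the points stay coloured but one line is fitted to all penguins.
  ggplot2::geom_smooth(
    mapping = ggplot2::aes(color = NULL),
    method = "lm",
    formula = y ~ x,
    na.rm = TRUE
  )
```

Moving color = species into geom_point(), as the following slides do, gives the same picture with less repetition.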

Creating a ggplot

We move the mapping arguments that are common to geom_point and geom_smooth back into ggplot.

ggplot2::ggplot() + 
ggplot2::geom_point(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g,
      color = species)
) +
ggplot2::geom_smooth(
  data = penguins,
  mapping = ggplot2::aes(
      x = flipper_length_mm, 
      y = body_mass_g),
  method = "lm")

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
      color = species)
) +
ggplot2::geom_smooth(
  method = "lm")

Creating a ggplot

Here is the simplified code.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
      color = species)
) +
ggplot2::geom_smooth(
  method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with a smooth curve.

Creating a ggplot

Besides colour, we can create a different shape for each species.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
      color = species,
      shape = species)
) +
ggplot2::geom_smooth(
  method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with a smooth curve. Different species are indicated with different colour and shape.

Creating a ggplot

Next, we use labs to label the plot.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
      color = species,
      shape = species)
) +
ggplot2::geom_smooth(
  method = "lm") +
ggplot2::labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", 
    y = "Body mass (g)",
    color = "Species", 
    shape = "Species"
  ) 
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with a smooth curve. A title and subtitle have been added.

Creating a ggplot

Lastly, we add scale_color_colorblind to switch to a colour-blind-safe palette.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
      color = species,
      shape = species)
) +
ggplot2::geom_smooth(
  method = "lm") +
ggplot2::labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", 
    y = "Body mass (g)",
    color = "Species", 
    shape = "Species"
  ) +
ggthemes::scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm with a smooth curve. A colour blind friendly palette is used.
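To see which colours the scale uses, the palette can be inspected directly. This sketch assumes ggthemes::colorblind_pal() is the palette generator behind scale_color_colorblind; palette_for_species is an illustrative name.

```r
# colorblind_pal() returns a palette function; calling that function
# with n gives the first n colours as hex codes.
palette_for_species <- ggthemes::colorblind_pal()(3)
palette_for_species
```

Three colours are requested because there are three penguin species.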

Creating a ggplot (Extra)

In Google Chrome, we can simulate colour vision deficiencies: open Developer Tools (press F12), click the three-dot menu, choose More tools, then Rendering, and scroll down to Emulate vision deficiencies.

More info in Anna Monus’s Blog

A scatterplot of penguin's body mass in grams vs flipper length in mm with a smooth curve. A colour blind friendly palette is used.

ggplot2 calls

There is a more concise way of writing ggplot2 code that saves typing. In general, we want to reduce the amount of extra text so that it is easier to compare similar pieces of code.

More information will be provided in Chapter 4 when we’ll learn about the pipe, |> and Chapter 26 where we’ll look into coding style.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm, 
    y = body_mass_g)) +
  ggplot2::geom_point()

penguins |> 
  ggplot2::ggplot(
    ggplot2::aes(
      x = flipper_length_mm,
      y = body_mass_g)) + 
  ggplot2::geom_point()

ggplot2 calls (Extra)

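When do flipper_length_mm and body_mass_g refer to a variable in the environment, and when to a column name? Inside aes(), ggplot2 looks for a column in the data first, so both versions below draw the same plot; the .data pronoun makes the column reference explicit.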

flipper_length_mm = "Flipper Length in mm"
body_mass_g = "Body Mass in g"

penguins |> 
  ggplot2::ggplot(
    ggplot2::aes(
      x = flipper_length_mm,
      y = body_mass_g)) + 
  ggplot2::geom_point() +
  ggplot2::labs(
    x = flipper_length_mm,
    y = body_mass_g
  )

A scatterplot of body mass in grams vs flipper length in mm.

flipper_length_mm = "Flipper Length in mm"
body_mass_g = "Body Mass in g"

penguins |> 
  ggplot2::ggplot(
    ggplot2::aes(
      x = .data[["flipper_length_mm"]],
      y = .data[["body_mass_g"]])) + 
  ggplot2::geom_point() +
  ggplot2::labs(
    x = flipper_length_mm,
    y = body_mass_g
  )

A scatterplot of body mass in grams vs flipper length in mm.
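The .data pronoun is also handy when column names arrive as strings, for example inside a function. The helper below, scatter_by_name, is hypothetical and only sketches the idea.

```r
penguins <- palmerpenguins::penguins

# Hypothetical helper (not from the book): draws a scatterplot from
# column names supplied as character strings.
scatter_by_name <- function(data, x_col, y_col) {
  ggplot2::ggplot(
    data = data,
    mapping = ggplot2::aes(
      x = .data[[x_col]],
      y = .data[[y_col]]
    )
  ) +
    ggplot2::geom_point(na.rm = TRUE)
}

flipper_vs_mass <- scatter_by_name(
  data = penguins,
  x_col = "flipper_length_mm",
  y_col = "body_mass_g"
)
```

Calling the helper with other column names, such as "bill_length_mm" and "bill_depth_mm", reuses the same code for a different plot.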

ggplot2 calls (Extra)

Set axis.title.y = ggplot2::element_text(angle = 0) in ggplot2::theme to rotate the y axis title.

penguins |> 
  ggplot2::ggplot(
    ggplot2::aes(
      x = .data[["flipper_length_mm"]],
      y = .data[["body_mass_g"]])) + 
  ggplot2::geom_point() +
  ggplot2::labs(
    x = "Flipper Length in mm",
    y = "Body Mass in g"
  )

A scatterplot of body mass in grams vs flipper length in mm.

penguins |> 
  ggplot2::ggplot(
    ggplot2::aes(
      x = .data[["flipper_length_mm"]],
      y = .data[["body_mass_g"]])) + 
  ggplot2::geom_point() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    x = "Flipper Length in mm",
    y = "Body Mass in g"
  )

A scatterplot of body mass in grams vs flipper length in mm with the y axis title rotated.

Exercise 2.2.5

How many rows are in penguins? How many columns?

glue::glue("
  There are {nrow(penguins)} rows and {ncol(penguins)} columns in the penguin data set
  "
)
There are 344 rows and 8 columns in the penguin data set

A little preview of glue from Chapter 15.

Exercise 2.2.5

What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

?penguins


Exercise 2.2.5

Make a scatterplot of bill_depth_mm vs. bill_length_mm. Describe the relationship between these two variables.

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]])
) + 
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)"
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's bill depth in mm vs bill length in mm.

Exercise 2.2.5

What happens if you make a scatterplot of species vs. bill_depth_mm?

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["bill_depth_mm"]], 
    y = .data[["species"]])
) + 
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Species vs Bill Depth",
    subtitle = "We get a dot plot but it is hard to see the distribution.",
    x = "Bill Depth (mm)", 
    y = "Species"
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of species vs bill depth in mm.

Exercise 2.2.5

What might be a better choice of geom?

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["bill_depth_mm"]], 
    y = .data[["species"]])
) + 
  ggplot2::geom_boxplot() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Species vs Bill Depth",
    subtitle = "A box plot may be better.",
    x = "Bill Depth (mm)", 
    y = "Species"
  ) 
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

A boxplot of bill depth in mm for each species.

Exercise 2.2.5

Why does the following give an error and how would you fix it?

ggplot2::ggplot(data = penguins) + 
  ggplot2::geom_point()
Error in `ggplot2::geom_point()`:
! Problem while setting up geom.
ℹ Error occurred in the 1st layer.
Caused by error in `compute_geom_1()`:
! `geom_point()` requires the following missing aesthetics: x and y

Exercise 2.2.5

We need to set the mapping argument using the aes function to tell ggplot which columns in penguins are used for the x and y axes.

ggplot2::ggplot(data = penguins) + 
  ggplot2::geom_point(
    mapping = ggplot2::aes(
      x = .data[["bill_length_mm"]], 
      y = .data[["bill_depth_mm"]]
    )
  )
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of bill depth in mm vs bill length in mm.

Exercise 2.2.5

What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

?ggplot2::geom_point()


Exercise 2.2.5

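With na.rm = TRUE, missing values are removed silently, so the warning message no longer appears. The default is FALSE, which removes missing values with a warning.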

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]])
) + 
  ggplot2::geom_point(
    na.rm = TRUE
) +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)"
  ) 

A scatterplot of bill depth in mm vs bill length in mm.

Exercise 2.2.5

Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]])
) + 
  ggplot2::geom_point(
    na.rm = TRUE
) +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    caption = "Data come from the palmerpenguins package.",
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)"
  ) 

A scatterplot of bill depth in mm vs bill length in mm with caption.

Exercise 2.2.5

Image replication

ggplot2::ggplot(
  data = penguins, 
  mapping = ggplot2::aes(
    x = .data[["flipper_length_mm"]], 
    y = .data[["body_mass_g"]])
  ) + 
  ggplot2::geom_point(
    mapping = ggplot2::aes(
      color = .data[["bill_depth_mm"]]
    )
  ) +
  ggplot2::geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of body mass in grams vs flipper length in mm, coloured by bill depth with a smooth curve.

Exercise 2.2.5

What does se = FALSE in geom_smooth do? It draws the smooth line without the confidence interval band.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["flipper_length_mm"]], 
    y = .data[["body_mass_g"]], 
    color = .data[["island"]])
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of body mass in grams vs flipper length in mm, coloured by each island, with a smooth curve without confidence interval plotted for each island.

Exercise 2.2.5

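What is the difference in output between these two pieces of code? There is none: both layers use the same data and mapping, whether these are given once in ggplot() or repeated in each geom.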

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["flipper_length_mm"]], 
    y = .data[["body_mass_g"]])
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth()

A scatterplot of body mass in grams vs flipper length in mm with a smooth curve.

ggplot2::ggplot() + 
ggplot2::geom_point(
  data = penguins,
  mapping = ggplot2::aes(
      x = .data[["flipper_length_mm"]], 
      y = .data[["body_mass_g"]])
) +
ggplot2::geom_smooth(
  data = penguins,
  mapping = ggplot2::aes(
      x = .data[["flipper_length_mm"]], 
      y = .data[["body_mass_g"]])
  )

A scatterplot of body mass in grams vs flipper length in mm with a smooth curve.

Visualising Distributions

For categorical variables, we can view the distribution using bar charts.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = species
    )
) + 
ggplot2::geom_bar()

A barchart showing the count of each penguin species.

Visualising Distributions

To arrange the bars by frequency, we can use fct_infreq to reorder the levels of the factor from most to least frequent. More in Chapter 17.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = forcats::fct_infreq(species)
    )
) + 
ggplot2::geom_bar()

A barchart showing the count of each penguin species. This time, the species are sorted by their frequencies.
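The reordering can be checked without drawing the plot. This is a small sketch; species_counts and reordered_levels are illustrative names.

```r
penguins <- palmerpenguins::penguins

# Raw counts per species, in the factor's original level order.
species_counts <- table(penguins$species)
species_counts

# fct_infreq() keeps the values but reorders the levels so that the
# most frequent level comes first.
reordered_levels <- levels(forcats::fct_infreq(penguins$species))
reordered_levels
```

The bars in the plot follow the order of reordered_levels from left to right.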

Visualising Distributions

For numerical variables, we can view the distribution using histograms.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g
    )
) + 
ggplot2::geom_histogram(
  binwidth = 200)
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

A histogram showing the distribution of the penguin's body mass in grams.

Visualising Distributions

We may need to adjust the binwidth argument to reveal the shape of the distribution.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g
    )
) + 
ggplot2::geom_histogram(
  binwidth = 20)
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

A histogram showing the distribution of the penguin's body mass in grams with binwidth set to 20.

Visualising Distributions

You can specify a function for calculating binwidth. More info about functions in Chapter 26.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g
    )
) + 
ggplot2::geom_histogram(
  binwidth = function(x) 2 * stats::IQR(x) / (length(x)^(1/3)))
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

A histogram showing the distribution of the penguin's body mass in grams with binwidth set by a function.
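The function passed to binwidth has the same form as the Freedman-Diaconis rule, 2 × IQR / n^(1/3). As a sketch, the width it produces can be computed directly. The names binwidth_rule and body_mass are illustrative, and the missing values are removed by hand here because stats::IQR() stops on NA by default.

```r
penguins <- palmerpenguins::penguins

# Same rule as in the histogram call, written as a named function.
binwidth_rule <- function(x) 2 * stats::IQR(x) / (length(x)^(1 / 3))

# Keep only the observed body masses.
body_mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

# Bin width in grams suggested by the rule.
suggested_binwidth <- binwidth_rule(body_mass)
suggested_binwidth
```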

Visualising Distributions

Alternatively, we can view the distribution using a density plot.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g
    )
) + 
ggplot2::geom_density()
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

A density plot showing the distribution of the penguin's body mass in grams.

Exercise 2.4.3

Make a bar plot of species of penguins, where you assign species to the y aesthetic (instead of x). How is this plot different?

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["species"]])
) +
  ggplot2::geom_bar()

A vertical barchart showing the count of each penguin species.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    y = .data[["species"]])
) +
  ggplot2::geom_bar()

A horizontal barchart showing the count of each penguin species.

Exercise 2.4.3

Which aesthetic, color or fill, is more useful for changing the color of bars? fill is more useful: it colours the interior of the bars, while color only changes their border.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["species"]])
) +
  ggplot2::geom_bar(
    color = "green"
  )

A vertical barchart showing the count of each penguin species. The borders of the bars are green.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["species"]])
) +
  ggplot2::geom_bar(
    fill = "green"
  )

A vertical barchart showing the count of each penguin species. The interiors of the bars are green.

Exercise 2.4.3

What does the bins argument in geom_histogram do? It sets the number of bins used to divide the range of the data.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["body_mass_g"]])
) + 
ggplot2::geom_histogram(
  color = "green",
  bins = 5)

A histogram showing the penguin's body mass in grams using 5 bins.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["body_mass_g"]])
) + 
ggplot2::geom_histogram(
  color = "green",
  bins = 10)

A histogram showing the penguin's body mass in grams using 10 bins.

Exercise 2.4.3

Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

ggplot2::ggplot(
  data = diamonds,
  mapping = ggplot2::aes(
    x = .data[["carat"]])
) + 
ggplot2::geom_histogram(
  color = "green",
  binwidth = 1)

A histogram showing the diamond's carat using binwidth of 1.

ggplot2::ggplot(
  data = diamonds,
  mapping = ggplot2::aes(
    x = .data[["carat"]])
) + 
ggplot2::geom_histogram(
  color = "green",
  binwidth = 0.01)

A histogram showing the diamond's carat using binwidth of 0.01.

Visualising Relationships

We can use a side-by-side boxplot to view the relationship between a numerical and a categorical variable.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = species, 
     y = body_mass_g
    )
) + 
ggplot2::geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Boxplots of the penguin's body mass in grams for each species.
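The thick line inside each box is the median, which can also be computed directly. This sketch uses base R, and median_mass is an illustrative name.

```r
penguins <- palmerpenguins::penguins

# Median body mass in grams for each species, ignoring missing values.
median_mass <- tapply(
  penguins$body_mass_g,
  penguins$species,
  stats::median,
  na.rm = TRUE
)
median_mass

# Species with the largest median body mass.
heaviest_species <- names(which.max(median_mass))
heaviest_species
```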

Visualising Relationships

Alternatively, we can compare the distributions using density plots. We use linewidth = 0.75 to make the lines stand out a bit more against the background.

ggplot2::ggplot(
    data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g, 
     color = species
    )
) + 
ggplot2::geom_density(
  linewidth = 0.75
)
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plots of the penguin's body mass in grams for each species.

Visualising Relationships

Additionally, we can set fill = species and use alpha to add transparency to the filled density curve.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = body_mass_g, 
     color = species,
     fill = species
    )
) + 
ggplot2::geom_density(
  linewidth = 0.75,
  alpha = 0.5
)
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plots of the penguin's body mass in grams for each species with filled colours.

Visualising Relationships

Stacked bar plots are useful to visualize the relationship between two categorical variables.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = island, 
     fill = species
    )
) + 
ggplot2::geom_bar()

A stacked bar plot showing the count of each penguin species on each island.

Visualising Relationships

By setting position = "fill", we create a relative frequency plot.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = island, 
     fill = species
    )
) + 
ggplot2::geom_bar(
  position = "fill"
)

A relative frequency plot showing the proportion of each penguin species living on each island.
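The numbers behind the stacked and the relative frequency plots can be tabulated as a check. This is a sketch in base R; island_counts and island_proportions are illustrative names.

```r
penguins <- palmerpenguins::penguins

# Counts of each species on each island: the heights in the stacked bar plot.
island_counts <- table(penguins$island, penguins$species)
island_counts

# Row-wise proportions: what position = "fill" displays, so that the
# values for each island sum to 1.
island_proportions <- prop.table(island_counts, margin = 1)
island_proportions
```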

Visualising Relationships

Scatterplots are useful to visualize the relationship between two numerical variables.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of the penguin's body mass in grams vs flipper length in mm.

Visualising Relationships

Visualising the relationship between three or more variables gets a bit tricky. For example, we can let the colours of points represent species and the shapes of points represent islands, but it is hard to see anything meaningful.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
     color = species, 
     shape = island
    )
)
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of the penguin's body mass in grams vs flipper length in mm. The colours represent different species and the shapes represent different islands.

Visualising Relationships

To counter this issue, we can use facet_wrap to split the plot into facets: subplots that each display one subset of the data.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point(
  mapping = ggplot2::aes(
     color = species, 
     shape = species
    )
) +
ggplot2::facet_wrap(
  ggplot2::vars(island)
)
Warning: Removed 2 rows containing missing values (`geom_point()`).

A facet of scatterplots showing the penguin's body mass in grams vs flipper length in mm. The colours and shapes represent different species. The facets are split by island.
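facet_wrap also accepts a one-sided formula in place of ggplot2::vars(). The sketch below uses that form for the same plot; facet_plot is an illustrative name.

```r
penguins <- palmerpenguins::penguins

facet_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(
      color = species,
      shape = species
    ),
    na.rm = TRUE
  ) +
  # ~island is the formula form of ggplot2::vars(island).
  ggplot2::facet_wrap(~island)
```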

Exercise 2.5.5

Which variables in mpg are categorical? Which variables are numerical? How can you see this information when you run mpg? Unfortunately ?mpg is not so clear.

?mpg


Exercise 2.5.5

We can use dplyr::glimpse to identify which variables are categorical and numeric.

dplyr::glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
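The column types shown by glimpse can also be collected programmatically. In this sketch, character columns are treated as categorical and integer or double columns as numerical. That rule is an assumption made for illustration: variables such as cyl or year are stored as integers but could reasonably be treated as categorical.

```r
mpg <- ggplot2::mpg

# First class of each column, e.g. "character", "integer" or "numeric".
column_types <- vapply(mpg, function(col) class(col)[1], character(1))

# Character columns are taken as categorical.
categorical_cols <- names(column_types)[column_types == "character"]
categorical_cols

# Integer and double columns are taken as numerical.
numerical_cols <- names(column_types)[column_types %in% c("integer", "numeric")]
numerical_cols
```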

Exercise 2.5.5

Make a scatterplot of hwy vs. displ using the mpg data frame

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres.

Exercise 2.5.5

Next, map a third, numerical/categorical variable to color, then size, then both color and size, then shape.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    color = .data[["cty"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    colour = "City miles\nper gallon"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres coloured by city miles per gallon.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    color = .data[["class"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    color = "Type of Car"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres coloured by the type of cars.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    size = .data[["cty"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    size = "City miles\nper gallon"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The size is based on the city miles per gallon.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    size = .data[["class"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    size = "Type of Car"
  ) 
Warning: Using size for a discrete variable is not advised.

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The size is based on the type of car.

Exercise 2.5.5

Next, map a third, numerical/categorical variable to color, then size, then both color and size, then shape.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    color = .data[["cty"]],
    size = .data[["cty"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    colour = "City miles\nper gallon",
    size = "City miles\nper gallon"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres coloured by the city miles per gallon. The size is also based on the city miles per gallon.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    color = .data[["class"]],
    size = .data[["class"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    color = "Type of Car",
    size = "Type of Car"
  ) 
Warning: Using size for a discrete variable is not advised.

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres coloured by the type of car. The size is also based on the type of car.

Exercise 2.5.5

Next, map a third, numerical/categorical variable to color, then size, then both color and size, then shape

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    shape = .data[["cty"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    shape = "City miles\nper gallon"
  ) 
Error in `ggplot2::geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic
ℹ choose a different aesthetic or use `scale_shape_binned()`
ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    shape = .data[["cty"]])
) +
  ggplot2::geom_point() +
  ggplot2::scale_shape_binned() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    shape = "City miles\nper gallon"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The shape is based on the city miles per gallon.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    shape = .data[["class"]])
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    shape = "Type of Car"
  ) 
Warning: The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 7. Consider
specifying shapes manually if you must have them.
Warning: Removed 62 rows containing missing values (`geom_point()`).

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The shape is based on the type of car. However, the default palette has only six shapes, so the suv class gets no shape and its 62 points are not drawn.

See stackoverflow link for more info.
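To see where the 62 removed rows come from, it can help to count the cars in each class. This is a small sketch rather than part of the exercise; the names class_counts, n_classes and last_class are invented for illustration.

```r
library(ggplot2) # provides the mpg data frame

# Number of cars in each class, sorted alphabetically by class name
class_counts <- table(mpg$class)
class_counts

# Seven classes but only six default shapes: the last class gets no shape
n_classes <- length(class_counts)
last_class <- names(class_counts)[n_classes]

# Number of cars in that last class
class_counts[[last_class]]
```

The count for the last class matches the number of rows reported as removed in the warning.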

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    shape = .data[["class"]])
) +
  ggplot2::geom_point() +
  ggplot2::scale_shape_manual(values = 0:6) + # one shape per class (7 classes)
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Engine displacement in litres", 
    y = "Highway miles\nper gallon",
    shape = "Type of Car"
  ) 

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The shape is based on the type of car. This time, the shape for the suv is not missing.

Exercise 2.5.5

In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    linewidth = .data[["year"]])
) +
  ggplot2::geom_point()

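A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. Mapping year to linewidth has no visible effect because geom_point() does not use the linewidth aesthetic.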

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    linewidth = .data[["year"]])
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
linewidth
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

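A scatterplot showing the car's highway miles per gallon vs engine displacement in litres with a smooth curve. As the warning notes, the linewidth aesthetic is dropped during the statistical transformation, so the width of the smooth curve does not vary with the year of manufacture.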

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["displ"]], 
    y = .data[["hwy"]],
    linewidth = .data[["class"]])
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth()
Warning: Using linewidth for a discrete variable is not advised.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small.  fewer data values than degrees of freedom.
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 5.6935
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 0.5065
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 0.65044
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : span too small.  fewer
data values than degrees of freedom.
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
5.6935
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
0.5065
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : There are other near
singularities as well. 0.65044
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 4.008
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 0.708
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 0.25
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
4.008
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
0.708
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : There are other near
singularities as well. 0.25

A scatterplot showing the car's highway miles per gallon vs engine displacement in litres. The smooth curve line width is based on the type of car.

Exercise 2.5.5

What happens if you map the same variable to multiple aesthetics?

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["class"]], 
    colour = .data[["class"]],
    fill = .data[["class"]])
  
) +
  ggplot2::geom_bar() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "Type of Car", 
    y = "Count",
    color = "Type of Car",
    fill = "Type of Car"
  ) 

A barchart showing the frequency of each type of car in the mpg dataset.

ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
    x = .data[["cty"]], 
    y = .data[["cty"]],
    colour = .data[["cty"]])
  
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    x = "City miles per gallon", 
    y = "City miles\nper gallon",
    color = "City miles\nper gallon",
  ) 

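A scatterplot showing city miles per gallon plotted against itself, with the colour scale also mapped to city miles per gallon. Mapping the same variable to several aesthetics works but adds no new information.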

Exercise 2.5.5

Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]],
    colour = .data[["species"]])
  
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill Depth(mm) vs Bill Length (mm)",
    subtitle = glue::glue("
        \U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
        \U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
    "),
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)",
    color = "Species",
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of the penguins' bill depth in mm vs bill length in mm. The colours represent different species.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]],
    colour = .data[["species"]])
  
) +
  ggplot2::geom_point() +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[["species"]])
) +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill Depth(mm) vs Bill Length (mm)",
    subtitle = glue::glue("
        \U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
        \U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
    "),
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)",
    color = "Species",
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A faceted set of scatterplots of the penguins' bill depth in mm vs bill length in mm. The colours represent different species, and the facets are also split by species.
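One way to put numbers on what the colouring and faceting reveal is to compare the correlation over all penguins with the correlation within each species. The sketch below is an illustration under that assumption. It is not the book's solution, and the names bills, overall_cor and species_cor are invented here.

```r
library(palmerpenguins) # provides the penguins data frame

# Keep only rows where both bill measurements are present
has_bills <- complete.cases(penguins[, c("bill_length_mm", "bill_depth_mm")])
bills <- penguins[has_bills, ]

# Correlation between bill length and bill depth, ignoring species
overall_cor <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# The same correlation computed separately within each species
species_cor <- sapply(
  split(bills, bills$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

overall_cor
species_cor
```

The pooled correlation is negative while the correlation within every species is positive. Without colour or facets, the plot hides this.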

Exercise 2.5.5

Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]],
    colour = .data[["species"]],
    shape = .data[["species"]])
  
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill Depth(mm) vs Bill Length (mm)",
    subtitle = glue::glue("
        \U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
        \U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
    "),
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)",
    color = "Species",
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of the penguins' bill depth in mm vs bill length in mm. The colours and shapes represent different species. However, there are two legends in the plot, one for colour and one for shape, because only the colour legend was given the title "Species".

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["bill_length_mm"]], 
    y = .data[["bill_depth_mm"]],
    colour = .data[["species"]],
    shape = .data[["species"]])
  
) +
  ggplot2::geom_point() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Bill Depth(mm) vs Bill Length (mm)",
    subtitle = glue::glue("
        \U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
        \U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
    "),
    x = "Bill Length (mm)", 
    y = "Bill Depth\n(mm)",
    color = "Species",
    shape = "Species"
  ) 
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of the penguins' bill depth in mm vs bill length in mm. The colours and shapes represent different species. This time there is only one legend, because the colour and shape legends share the title "Species" and are merged.

Exercise 2.5.5

Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["island"]],
    fill = .data[["species"]])
) +
  ggplot2::geom_bar(
    position = "fill"
  ) +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Proportion Of Species In A Given Island",
    subtitle = glue::glue("
        \U2022 Torgersen only has Adelie.
        \U2022 Dream has similar proportions of Adelie and Chinstrap living there.
        \U2022 Biscoe has 1/4 Adelie and 3/4 Gentoo living there.
    "),
    x = "Island", 
    y = "Proportion",
    fill = "Species"
  ) 

A stacked bar plot showing the proportion of penguin species in a given island.

ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = .data[["species"]],
    fill = .data[["island"]])
) +
  ggplot2::geom_bar(
    position = "fill"
  ) +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
    title = "Island habitat distribution for each species",
    subtitle = glue::glue("
        \U2022 Adelie lives in all three islands at similar proportions.
        \U2022 Chinstrap only lives in Dream.
        \U2022 Gentoo only lives in Biscoe.
    "),
    x = "Species", 
    y = "Proportion",
    fill = "Island"
  ) 

A stacked bar plot showing the island habitat distribution for each species.
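The proportions behind the two stacked bar plots can be checked with a contingency table. This is a minimal sketch using base R. The names counts, species_within_island and island_within_species are made up for this example.

```r
library(palmerpenguins) # provides the penguins data frame

# Counts of penguins by island (rows) and species (columns)
counts <- table(penguins$island, penguins$species)

# First plot: share of each species within an island (each row sums to 1)
species_within_island <- prop.table(counts, margin = 1)

# Second plot: share of each island within a species (each column sums to 1)
island_within_species <- prop.table(counts, margin = 2)

round(species_within_island, 2)
round(island_within_species, 2)
```

The first table answers "which species live on a given island?", and the second answers "on which islands does a given species live?".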

Save Plots

Use ggsave to save a ggplot plot.

penguin_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = flipper_length_mm, 
     y = body_mass_g
    )
) + 
ggplot2::geom_point()

ggplot2::ggsave(
  filename = "penguin-plot.png",
  plot = penguin_plot,
  width = 6, 
  height = 7)

A scatterplot of penguin's body mass in grams vs flipper length in mm

Exercise 2.6.1

Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

car_bar_plot <- ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
     x = .data[["class"]]
    )
) + 
ggplot2::geom_bar()

car_scatter_plot <- ggplot2::ggplot(
  data = mpg,
  mapping = ggplot2::aes(
     x = .data[["cty"]],
     y = .data[["hwy"]]
    )
) + 
ggplot2::geom_point()

ggplot2::ggsave(
  filename = "mpg-plot.png"
)

A bar chart of the type of cars

A scatterplot of the car's city miles per gallon vs highway miles per gallon.

Exercise 2.6.1

The second plot (the scatterplot) is saved. The plot argument of ggsave() defaults to ggplot2::last_plot(), so when it is omitted, ggsave() saves the last plot displayed.

?ggplot2::ggsave
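To avoid relying on the default, the plot to save can be passed explicitly. The sketch below assumes a temporary folder is an acceptable place to write the file. The file name mpg-bar-plot.png and the variable bar_file are hypothetical.

```r
library(ggplot2)

car_bar_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()

car_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Hypothetical output path in a temporary folder, used only for this sketch
bar_file <- file.path(tempdir(), "mpg-bar-plot.png")

# Passing plot explicitly saves the bar chart even though it is not the last plot
ggsave(filename = bar_file, plot = car_bar_plot, width = 6, height = 4)

file.exists(bar_file)
```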


Exercise 2.6.1

What do you need to change in the code above to save the plot as a PDF instead of a PNG?

ggplot2::ggsave(
  filename = "mpg-plot.pdf"
)

ggplot2::ggsave(
  filename = "no_file_extension",
  device = "pdf"
)


Exercise 2.6.1

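How could you find out what types of image files would work in ggsave()?

Read the description of the device argument in ?ggplot2::ggsave, which lists the supported formats. The device argument can also be set explicitly when the file name has no extension.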

ggplot2::ggsave(
  filename = "no_file_extension",
  device = "pdf"
)
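When device is left unset, ggsave() chooses the device from the file extension. The sketch below saves one plot in two formats to illustrate this. The two formats, the temporary folder and the names mpg_plot, formats and out_files are assumptions made for this example.

```r
library(ggplot2)

mpg_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Formats chosen for illustration; the full list is in ?ggplot2::ggsave
formats <- c("png", "pdf")

# Hypothetical file names in a temporary folder
out_files <- file.path(tempdir(), paste0("mpg-plot.", formats))

for (out_file in out_files) {
  # With device left unset, ggsave picks the device from the file extension
  ggsave(filename = out_file, plot = mpg_plot, width = 6, height = 4)
}

file.exists(out_files)
```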


Common Problems

Spelling mistakes (Pics from Allison Horst)

A comic showing a crocodile programmer getting angry because he does not realise the error was caused by a spelling mistake, until a flamingo programmer steps in to help.

Common Problems

Plus in wrong place

ggplot2::ggplot(data = mpg) 

An empty plot.

+ ggplot2::geom_point(
  mapping = ggplot2::aes(x = displ, y = hwy))
Error:
! Cannot use `+` with a single argument
ℹ Did you accidentally put `+` on a new line?
ggplot2::ggplot(data = mpg) + 
  ggplot2::geom_point(
    mapping = ggplot2::aes(x = displ, y = hwy))

A scatterplot of the car's highway miles per gallon vs engine displacement in litres.

Summary

Chapter 2 covers the basics of ggplot2. More ggplot2 techniques will be covered in Chapters 10 to 12.

The rest of my slides are some thorny “issues” that I have faced when using ggplot2 and how I handle them (after hours of searching).

Extra

Not enough discrete colours. Check out ggthemes::tableau_color_pal

scales::show_col(
  colours = ggthemes::tableau_color_pal("Classic 20")(20), 
  ncol = 6)

Extra

Colour-blind friendly discrete colours. Check out ggthemes::colorblind_pal for Okabe Ito colour palette. But 8 may not always be sufficient …

scales::show_col(
  colours = ggthemes::colorblind_pal()(8), 
  ncol = 4)
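The same palette can be applied directly to a plot. This is a minimal sketch that assumes ggthemes::scale_colour_colorblind() is a suitable way to do so. The names okabe_ito and colourblind_plot are made up for this example.

```r
library(ggplot2)
library(ggthemes)
library(palmerpenguins)

# The eight colours returned by the palette function
okabe_ito <- colorblind_pal()(8)
okabe_ito

# scale_colour_colorblind() applies the same palette to a discrete colour scale
colourblind_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
) +
  geom_point(na.rm = TRUE) +
  scale_colour_colorblind()

colourblind_plot
```

With only three species, the eight colours are more than enough. The limit matters once a variable has more than eight levels.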

Extra

Need both discrete and continuous colour-blind friendly colours? Check out microshades.

Extra

Not enough shapes. Check out ggstar

Extra

Facing long axis labels. See Andrew Heiss’s blog.

Extra

Need to label your bar charts quickly. Consider ggfittext

island_count <- penguins |>
  dplyr::reframe(count = dplyr::n(), .by = c("species"))

ggplot2::ggplot(
  data = island_count,
  mapping = ggplot2::aes(
    x = .data[["species"]],
    y = .data[["count"]],
    fill = .data[["species"]]),
) +
  ggplot2::geom_col() +
  ggfittext::geom_bar_text() +
ggplot2::theme(
  axis.title.y = ggplot2::element_text(angle = 0)
)

A barchart plot showing the frequency of each penguin's species with labels.

Extra

Need to add text annotation. Consider ggtext.

Extra

Here is my example.

island_count <- penguins |>
  dplyr::reframe(count = dplyr::n(), .by = c("species"))

species_colours <- list("Adelie" = "#D55E00",
                        "Chinstrap" = "#009E73",
                        "Gentoo" = "#0072B2")

ggplot2::ggplot(
  data = island_count,
  mapping = ggplot2::aes(
    x = .data[["species"]],
    y = .data[["count"]],
    fill = .data[["species"]]),
) +
  ggplot2::geom_col() +
  ggfittext::geom_bar_text() +
  ggplot2::scale_fill_manual(values = species_colours) +
ggplot2::theme(
  plot.subtitle = ggtext::element_markdown(),
  axis.title.y = ggtext::element_markdown(angle = 0)
) +
ggplot2::labs(
    x = "Species",
    y = "Count",
    title = "Penguin Species",
    subtitle = glue::glue("
    The penguin data has a total of {island_count$count[island_count$species == \"Adelie\"]} <span style=\"color:{species_colours$Adelie}\">**Adelie**</span>, {island_count$count[island_count$species == \"Chinstrap\"]} <span style=\"color:{species_colours$Chinstrap}\">**Chinstrap**</span> and {island_count$count[island_count$species == \"Gentoo\"]} <span style=\"color:{species_colours$Gentoo}\">**Gentoo**</span>
    ")
  )

A barchart plot showing the frequency of each penguin's species with labels.

Extra

See the Royal Statistical Society's Best Practices for Data Visualisation and Cara Thompson's NHS-R 2022 talk for more ggtext examples.

Extra

Don’t know which plot your client wants. Create buttons using download_this in your html document.

penguin_plot <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
     x = .data[["flipper_length_mm"]], 
     y = .data[["body_mass_g"]]
    )
) + 
ggplot2::geom_point()
downloadthis::download_this(
  penguin_plot,
  output_name = "penguin-plot",
  ggsave_args = list(width = 6, height = 7))

Extra

Learn from ggplot2 mishaps. See Kara Woo’s RStudio 2021 Conference talk.