Chapter 2:
Data Visualisation
This chapter shows how to visualise data using ggplot2. Ensure these libraries are available.
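The list of libraries is not shown here, so the following is only a sketch of how one might check availability. The package names are an assumption inferred from the pkg::function calls used later in this chapter, with palmerpenguins assumed to be the source of the penguins data.

```r
# Assumed package list, inferred from the pkg::function calls in this chapter.
pkgs <- c("ggplot2", "dplyr", "palmerpenguins", "ggthemes", "glue")

# requireNamespace() returns TRUE when a package is installed and loadable.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages().
missing_pkgs <- pkgs[!installed]
missing_pkgs
```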
penguins data
The penguins data has 344 observations (one per penguin) and 8 columns.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
sex year
<fct> <int>
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
7 female 2007
8 male 2007
9 <NA> 2007
10 <NA> 2007
# ℹ 334 more rows
Among the variables in penguins are:
species: a penguin’s species (Adelie, Chinstrap, or Gentoo).
flipper_length_mm: length of a penguin’s flipper, in millimeters.
body_mass_g: body mass of a penguin, in grams.
We can take a closer look using dplyr::glimpse().
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
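The output above would come from a call along these lines. This is a sketch that assumes penguins is taken from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# One line per column: name, type, and the first few values.
dplyr::glimpse(penguins)
```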
We want to explore the relationship between flipper length and body mass for each penguin species.
ggplot
The first argument, data, tells ggplot which dataset to use in the graph. On its own, it produces an empty graph.
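The code for this step is not shown, so here is a minimal sketch of it, assuming penguins comes from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Only the data is supplied, so the result is an empty canvas.
p <- ggplot2::ggplot(data = penguins)
p
```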
The next argument, mapping, tells ggplot which variables in penguins are mapped to the x and y axes. The mapping argument must be the output of the ggplot2::aes function.
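A minimal sketch of this step, assuming penguins comes from the palmerpenguins package and using the flipper length and body mass columns discussed in this chapter:

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# aes() declares which columns go on which axis; no geom is added yet,
# so the axes appear but no data is drawn.
p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
)
p
```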
We can see that flipper length and body mass are displayed on the x and y axes. However, we cannot yet see the penguin data in the plot.
We need to use a geom_ function to specify how the data should be represented, for example:
geom_bar()
geom_line()
geom_boxplot()
geom_point()
Adding geom_point() gives us the points in the plot. We can see that penguins with longer flippers generally have a larger body mass.
Warning: Removed 2 rows containing missing values (`geom_point()`).
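A sketch of the call that would produce such a scatterplot, assuming penguins comes from the palmerpenguins package:

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# geom_point() adds a layer that draws one point per row.
p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  ggplot2::geom_point()
p
```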
The warning indicates that two of the penguins have missing body mass and/or flipper length values.
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen NA NA NA NA
2 Gentoo Biscoe NA NA NA NA
sex year
<fct> <int>
1 <NA> 2007
2 <NA> 2009
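One way to list these rows is sketched below. The use of dplyr::filter here is an assumption, since the original call is not shown.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Keep rows where either plotted variable is missing.
missing_rows <- dplyr::filter(
  penguins,
  is.na(body_mass_g) | is.na(flipper_length_mm)
)
missing_rows
```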
In general, penguins with longer flippers have a larger body mass.
Does this apply to all species (Adelie, Chinstrap and Gentoo)?
We can check by representing each species with differently coloured points.
To do this, we modify the aes function by adding color = species.
Warning: Removed 2 rows containing missing values (`geom_point()`).
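A sketch of the modified call, assuming penguins comes from the palmerpenguins package:

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# color = species inside aes() maps each species to its own colour.
p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species
  )
) +
  ggplot2::geom_point()
p
```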
Adding color = species allows ggplot to assign a unique colour to each species and to provide a legend.
Warning: Removed 2 rows containing missing values (`geom_point()`).
We now wish to add a smooth curve for each species. This is done by adding geom_smooth(method = "lm").
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
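A sketch of a call that would give one fitted line per species, assuming penguins comes from the palmerpenguins package. Because color = species sits in the global mapping, geom_smooth inherits it and fits each species separately.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species
  )
) +
  ggplot2::geom_point() +
  # method = "lm" fits a straight line; colour grouping gives one per species.
  ggplot2::geom_smooth(method = "lm")
p
```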
Instead of one line per species, how do we add a single smooth curve that describes the relationship across all species?
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
We first need to agree that the following are equivalent.
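The two forms being compared are not shown here. One plausible reading, assumed in this sketch, is that a mapping given once in ggplot() is equivalent to the same mapping repeated in every layer.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Form 1: the mapping is declared once and inherited by both layers.
p_global <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    x = flipper_length_mm, y = body_mass_g, color = species
  )
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method = "lm")

# Form 2: the same mapping is written out inside each layer.
p_local <- ggplot2::ggplot(data = penguins) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(
      x = flipper_length_mm, y = body_mass_g, color = species
    )
  ) +
  ggplot2::geom_smooth(
    mapping = ggplot2::aes(
      x = flipper_length_mm, y = body_mass_g, color = species
    ),
    method = "lm"
  )
```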
Because we want to draw a single smooth curve across all species, we need to remove color = species from the mapping argument of geom_smooth.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
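A sketch of the layer-specific version, assuming penguins comes from the palmerpenguins package. Only geom_point keeps the colour mapping, so geom_smooth fits a single line through all the data.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

p <- ggplot2::ggplot(data = penguins) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(
      x = flipper_length_mm, y = body_mass_g, color = species
    )
  ) +
  # No colour mapping here, so there is no per-species grouping.
  ggplot2::geom_smooth(
    mapping = ggplot2::aes(x = flipper_length_mm, y = body_mass_g),
    method = "lm"
  )
p
```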
We move the mapping arguments that are common to geom_point and geom_smooth back into ggplot.
Here is the simplified code.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
Besides colour, we can also use a different shape for each species.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
Next, we use labs to label the plot.
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(
color = species,
shape = species)
) +
ggplot2::geom_smooth(
method = "lm") +
ggplot2::labs(
title = "Body mass and flipper length",
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)",
y = "Body mass (g)",
color = "Species",
shape = "Species"
)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
Lastly, we add scale_color_colorblind to switch to a colour-blind-safe palette.
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = flipper_length_mm,
y = body_mass_g
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(
color = species,
shape = species)
) +
ggplot2::geom_smooth(
method = "lm") +
ggplot2::labs(
title = "Body mass and flipper length",
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)",
y = "Body mass (g)",
color = "Species",
shape = "Species"
) +
ggthemes::scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
(Extra) In Google Chrome, we can open the Developer Tools (press F12), click the three dots, choose More tools, then Rendering, scroll down to Emulate vision deficiencies and pick an option.
More info in Anna Monus’s blog.
ggplot2 calls
There is a more concise way of writing ggplot2 code that saves typing. In general, we want to reduce the amount of extra text so that similar pieces of code are easier to compare.
More information will be provided in Chapter 4, where we’ll learn about the pipe, |>, and in Chapter 26, where we’ll look into coding style.
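As a sketch of what the concise form might look like (assuming penguins comes from the palmerpenguins package), the argument names data and mapping can be dropped because they are the first two arguments of ggplot():

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Explicit form, with argument names spelled out.
p_explicit <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = flipper_length_mm, y = body_mass_g)
) +
  ggplot2::geom_point()

# Concise form, relying on argument position.
p_concise <- ggplot2::ggplot(
  penguins,
  ggplot2::aes(x = flipper_length_mm, y = body_mass_g)
) +
  ggplot2::geom_point()
```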
ggplot2 calls (Extra)
Are flipper_length_mm and body_mass_g variables in the environment or column names in the data? We can use the .data pronoun to make it explicit that they are columns.
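A small sketch of the .data pronoun, assuming penguins comes from the palmerpenguins package. The variable x_col is hypothetical and only illustrates that the column name can be held in a string.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Hypothetical variable holding a column name as a string.
x_col <- "flipper_length_mm"

p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # .data[[...]] always refers to a column of the plotted data.
    x = .data[[x_col]],
    y = .data[["body_mass_g"]]
  )
) +
  ggplot2::geom_point()
p
```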
Set axis.title.y = ggplot2::element_text(angle = 0) in ggplot2::theme to rotate the y axis title.
How many rows are in penguins? How many columns?
glue::glue("
There are {nrow(penguins)} rows and {ncol(penguins)} columns in the penguin data set
"
)
There are 344 rows and 8 columns in the penguin data set
A little preview of glue from Chapter 15.
What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
Make a scatterplot of bill_depth_mm vs. bill_length_mm. Describe the relationship between these two variables.
ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # Bill length on x and bill depth on y, matching the axis labels below.
    x = .data[["bill_length_mm"]],
    y = .data[["bill_depth_mm"]])
) +
  ggplot2::geom_point() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    title = "Bill Depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    x = "Bill Length (mm)",
    y = "Bill Depth\n(mm)"
  )
Warning: Removed 2 rows containing missing values (`geom_point()`).
What happens if you make a scatterplot of species vs. bill_depth_mm?
ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # The exercise asks for bill depth, which is also what the x label says.
    x = .data[["bill_depth_mm"]],
    y = .data[["species"]])
) +
  ggplot2::geom_point() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    title = "Species vs Bill Depth",
    subtitle = "We get a dot plot but it is hard to see the distribution.",
    x = "Bill Depth (mm)",
    y = "Species"
  )
Warning: Removed 2 rows containing missing values (`geom_point()`).
What might be a better choice of geom?
ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # Same variables as the dot plot above, drawn as box plots.
    x = .data[["bill_depth_mm"]],
    y = .data[["species"]])
) +
  ggplot2::geom_boxplot() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    title = "Species vs Bill Depth",
    subtitle = "A box plot may be better.",
    x = "Bill Depth (mm)",
    y = "Species"
  )
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Why does the following give an error and how would you fix it?
Error in `ggplot2::geom_point()`:
! Problem while setting up geom.
ℹ Error occurred in the 1st layer.
Caused by error in `compute_geom_1()`:
! `geom_point()` requires the following missing aesthetics: x and y
We need to set the mapping argument using the aes function to tell ggplot which columns in penguins are used for the x and y axes respectively.
Warning: Removed 2 rows containing missing values (`geom_point()`).
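The failing code is not shown, so the sketch below assumes it was a ggplot() call with data but no aes() mapping, which is consistent with the error message. The columns used in the fix are also an assumption, and penguins is assumed to come from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Assumed failing version: geom_point() has no x or y to work with.
p_bad <- ggplot2::ggplot(data = penguins) +
  ggplot2::geom_point()

# The error only appears when the plot is built, so capture it here.
build_result <- tryCatch(
  ggplot2::ggplot_build(p_bad),
  error = function(e) e
)

# Fixed version: aes() supplies the required x and y aesthetics.
p_fixed <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = flipper_length_mm, y = body_mass_g)
) +
  ggplot2::geom_point()
p_fixed
```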
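What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
The default is na.rm = FALSE, which removes missing values with a warning. With na.rm = TRUE, missing values are removed silently, so the warning message no longer appears.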
ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # Bill length on x and bill depth on y, matching the axis labels below.
    x = .data[["bill_length_mm"]],
    y = .data[["bill_depth_mm"]])
) +
  ggplot2::geom_point(
    # Drop rows with missing values silently.
    na.rm = TRUE
  ) +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    title = "Bill Depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    x = "Bill Length (mm)",
    y = "Bill Depth\n(mm)"
  )
Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(
    # Bill length on x and bill depth on y, matching the axis labels below.
    x = .data[["bill_length_mm"]],
    y = .data[["bill_depth_mm"]])
) +
  ggplot2::geom_point(
    # Drop rows with missing values silently.
    na.rm = TRUE
  ) +
  ggplot2::theme(
    axis.title.y = ggplot2::element_text(angle = 0)
  ) +
  ggplot2::labs(
    title = "Bill Depth vs Bill Length",
    subtitle = "Weak negative correlation overall between bill depth and length.",
    caption = "Data come from the palmerpenguins package.",
    x = "Bill Length (mm)",
    y = "Bill Depth\n(mm)"
  )
Image replication
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
What does se = FALSE in geom_smooth do? It plots the smooth line without the confidence interval.
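A sketch of se = FALSE in use. The columns plotted are an assumption, since the exercise code is not shown, and penguins is assumed to come from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = flipper_length_mm, y = body_mass_g)
) +
  ggplot2::geom_point() +
  # se = FALSE draws the smooth line only, without the shaded band.
  ggplot2::geom_smooth(se = FALSE)
p
```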
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
Is there any difference in the output of these two pieces of code? There is no difference.
For categorical variables, we can view the distribution using bar charts.
To arrange the bars by frequency, we can use fct_infreq to convert the categorical variable to a factor and reorder its levels by frequency. More in Chapter 17.
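A sketch of both bar charts, assuming penguins comes from the palmerpenguins package and that fct_infreq refers to forcats::fct_infreq:

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Bars in the default (alphabetical) level order.
p_default <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = species)
) +
  ggplot2::geom_bar()

# Bars ordered from the most to the least frequent species.
p_ordered <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = forcats::fct_infreq(species))
) +
  ggplot2::geom_bar()
p_ordered
```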
For numerical variables, we can view the distribution using histograms.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
We may need to adjust the binwidth argument.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
You can also specify a function for calculating binwidth. More info about functions in Chapter 26.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
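A sketch of both ways of setting binwidth. The choice of body_mass_g, the value 200 and the Freedman-Diaconis rule used in the function are assumptions for illustration, and penguins is assumed to come from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Fixed bin width of 200 grams.
p_fixed <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = body_mass_g)
) +
  ggplot2::geom_histogram(binwidth = 200)

# Bin width computed from the data with the Freedman-Diaconis rule.
fd_binwidth <- function(x) {
  2 * stats::IQR(x, na.rm = TRUE) / length(x)^(1 / 3)
}

p_fun <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = body_mass_g)
) +
  ggplot2::geom_histogram(binwidth = fd_binwidth)
p_fun
```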
Alternatively, we can view the distribution using a density plot.
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
Make a bar plot of species of penguins, where you assign species to the y aesthetic (instead of x). How is this plot different?
Which aesthetic, color or fill, is more useful for changing the color of bars?
What does the bins argument in geom_histogram do?
Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
We can use a side-by-side boxplot to view the relationship between a numerical and a categorical variable.
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Alternatively, we can compare the distributions using density plots. We use linewidth = 0.75 to make the lines stand out a bit more against the background.
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
Additionally, we can set fill = species and use alpha to add transparency to the filled density curves.
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
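A sketch of the density plots described above. The variable body_mass_g and the alpha value of 0.5 are assumptions, and penguins is assumed to come from the palmerpenguins package.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# One density line per species, drawn slightly thicker than the default.
p_lines <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = body_mass_g, color = species)
) +
  ggplot2::geom_density(linewidth = 0.75)

# Filled curves; alpha makes overlapping regions visible.
p_filled <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = body_mass_g, color = species, fill = species)
) +
  ggplot2::geom_density(alpha = 0.5)
p_filled
```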
Stacked bar plots are useful to visualize the relationship between two categorical variables.
By setting position = "fill", we create a relative frequency plot.
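A sketch contrasting the two bar positions, assuming penguins comes from the palmerpenguins package and using island and species as the two categorical variables:

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

# Stacked counts: bar heights show how many penguins live on each island.
p_stacked <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = island, fill = species)
) +
  ggplot2::geom_bar()

# Relative frequencies: every bar is rescaled to a height of 1.
p_fill <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = island, fill = species)
) +
  ggplot2::geom_bar(position = "fill")
p_fill
```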
Scatterplots are useful to visualize the relationship between two numerical variables.
Warning: Removed 2 rows containing missing values (`geom_point()`).
Visualizing the relationship between three or more variables gets a bit tricky. For example, we can let the colors of points represent species and the shapes of points represent islands, but it is hard to see anything meaningful.
Warning: Removed 2 rows containing missing values (`geom_point()`).
To counter this issue, we can split the plot into facets, subplots that each display one subset of the data, using facet_wrap.
Warning: Removed 2 rows containing missing values (`geom_point()`).
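A sketch of faceting, assuming penguins comes from the palmerpenguins package. Faceting by island while colouring by species is an assumption for illustration.

```r
# Assumption: the penguins data comes from the palmerpenguins package.
penguins <- palmerpenguins::penguins

p <- ggplot2::ggplot(
  data = penguins,
  mapping = ggplot2::aes(x = flipper_length_mm, y = body_mass_g)
) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(color = species, shape = species)
  ) +
  # One panel per island, each showing only that island's penguins.
  ggplot2::facet_wrap(facets = ggplot2::vars(island))
p
```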
Which variables in mpg are categorical? Which variables are numerical? How can you see this information when you run mpg? Unfortunately, ?mpg is not so clear.
We can use dplyr::glimpse to identify which variables are categorical and which are numerical.
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Make a scatterplot of hwy vs. displ using the mpg data frame.
Next, map a third, numerical/categorical variable to color, then size, then both color and size, then shape.
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
color = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
colour = "City miles\nper gallon"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
color = .data[["class"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
color = "Type of Car"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
size = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
size = "City miles\nper gallon"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
size = .data[["class"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
size = "Type of Car"
)
Warning: Using size for a discrete variable is not advised.
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
color = .data[["cty"]],
size = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
colour = "City miles\nper gallon",
size = "City miles\nper gallon"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
color = .data[["class"]],
size = .data[["class"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
color = "Type of Car",
size = "Type of Car"
)
Warning: Using size for a discrete variable is not advised.
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
shape = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
shape = "City miles\nper gallon"
)
Error in `ggplot2::geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic
ℹ choose a different aesthetic or use `scale_shape_binned()`
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
shape = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::scale_shape_binned() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
shape = "City miles\nper gallon"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
shape = .data[["class"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
shape = "Type of Car"
)
Warning: The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 7. Consider
specifying shapes manually if you must have them.
Warning: Removed 62 rows containing missing values (`geom_point()`).
See stackoverflow link for more info.
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["displ"]],
y = .data[["hwy"]],
shape = .data[["class"]])
) +
ggplot2::geom_point() +
ggplot2::scale_shape_manual(values= c(0:7)) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Engine displacement in litres",
y = "Highway miles\nper gallon",
shape = "Type of Car"
)
In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
linewidth
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: Using linewidth for a discrete variable is not advised.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small. fewer data values than degrees of freedom.
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 5.6935
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 0.5065
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 0.65044
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : span too small. fewer
data values than degrees of freedom.
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
5.6935
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
0.5065
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : There are other near
singularities as well. 0.65044
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 4.008
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 0.708
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 0.25
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
4.008
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
0.708
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
else if (is.data.frame(newdata))
as.matrix(model.frame(delete.response(terms(object)), : There are other near
singularities as well. 0.25
What happens if you map the same variable to multiple aesthetics?
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["class"]],
colour = .data[["class"]],
fill = .data[["class"]])
) +
ggplot2::geom_bar() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "Type of Car",
y = "Count",
color = "Type of Car",
fill = "Type of Car"
)
ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["cty"]],
y = .data[["cty"]],
colour = .data[["cty"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
x = "City miles per gallon",
y = "City miles\nper gallon",
color = "City miles\nper gallon",
)
Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["bill_length_mm"]],
y = .data[["bill_depth_mm"]],
colour = .data[["species"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Bill Depth (mm) vs Bill Length (mm)",
subtitle = glue::glue("
\U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
\U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
"),
x = "Bill Length (mm)",
y = "Bill Depth\n(mm)",
color = "Species",
)
Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["bill_length_mm"]],
y = .data[["bill_depth_mm"]],
colour = .data[["species"]])
) +
ggplot2::geom_point() +
ggplot2::facet_wrap(
facets = ggplot2::vars(.data[["species"]])
) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Bill Depth (mm) vs Bill Length (mm)",
subtitle = glue::glue("
\U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
\U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
"),
x = "Bill Length (mm)",
y = "Bill Depth\n(mm)",
color = "Species",
)
Warning: Removed 2 rows containing missing values (`geom_point()`).
Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["bill_length_mm"]],
y = .data[["bill_depth_mm"]],
colour = .data[["species"]],
shape = .data[["species"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Bill Depth (mm) vs Bill Length (mm)",
subtitle = glue::glue("
\U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
\U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
"),
x = "Bill Length (mm)",
y = "Bill Depth\n(mm)",
color = "Species",
)
Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["bill_length_mm"]],
y = .data[["bill_depth_mm"]],
colour = .data[["species"]],
shape = .data[["species"]])
) +
ggplot2::geom_point() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Bill Depth (mm) vs Bill Length (mm)",
subtitle = glue::glue("
\U2022 Adelie has longer bill depth but shorter bill length than Gentoo.
\U2022 Chinstrap has longer bill depth and length than Adelie and Gentoo.
"),
x = "Bill Length (mm)",
y = "Bill Depth\n(mm)",
color = "Species",
shape = "Species"
)
Warning: Removed 2 rows containing missing values (`geom_point()`).
Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["island"]],
fill = .data[["species"]])
) +
ggplot2::geom_bar(
position = "fill"
) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Proportion Of Species In A Given Island",
subtitle = glue::glue("
\U2022 Torgersen only has Adelie.
\U2022 Dream has similar proportions of Adelie and Chinstrap living there.
\U2022 Biscoe has 1/4 Adelie and 3/4 Gentoo living there.
"),
x = "Island",
y = "Proportion",
fill = "Species"
)
ggplot2::ggplot(
data = penguins,
mapping = ggplot2::aes(
x = .data[["species"]],
fill = .data[["island"]])
) +
ggplot2::geom_bar(
position = "fill"
) +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
) +
ggplot2::labs(
title = "Island habitat distribution for each species",
subtitle = glue::glue("
\U2022 Adelie lives in all three islands at similar proportions.
\U2022 Chinstrap only lives in Dream.
\U2022 Gentoo only lives in Biscoe.
"),
x = "Species",
y = "Proportion",
fill = "Island"
)
Use ggsave to save a ggplot plot.
Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
car_bar_plot <- ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["class"]]
)
) +
ggplot2::geom_bar()
car_scatter_plot <- ggplot2::ggplot(
data = mpg,
mapping = ggplot2::aes(
x = .data[["cty"]],
y = .data[["hwy"]]
)
) +
ggplot2::geom_point()
ggplot2::ggsave(
filename = "mpg-plot.png"
)
It is the second plot because, by default, the plot argument of ggsave is the last plot displayed.
What do you need to change in the code above to save the plot as a PDF instead of a PNG?
How could you find out what types of image files would work in ggsave()? Use the device argument, whose documentation lists the supported file types.
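A sketch of saving as a PDF. When device is not set, ggsave guesses it from the filename extension, so changing .png to .pdf is enough. Writing to tempdir() and passing plot explicitly are choices made for this sketch only.

```r
# mpg ships with ggplot2, so no other data package is needed here.
car_scatter_plot <- ggplot2::ggplot(
  data = ggplot2::mpg,
  mapping = ggplot2::aes(x = cty, y = hwy)
) +
  ggplot2::geom_point()

# Sketch-only choice: write to a temporary directory.
out_file <- file.path(tempdir(), "mpg-plot.pdf")

# The .pdf extension selects the PDF device; naming the plot avoids
# relying on whichever plot happened to be displayed last.
ggplot2::ggsave(
  filename = out_file,
  plot = car_scatter_plot
)
```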
Spelling mistakes (Pics from Allison Horst)
Plus in wrong place
Chapter 2 helps you learn the basics of ggplot2. More ggplot2-related techniques will be covered in Chapters 10 to 12.
The rest of my slides cover some thorny “issues” that I have faced when using ggplot2 and how I handled them (after hours of searching).
Not enough discrete colours. Check out ggthemes::tableau_color_pal.
Colour-blind friendly discrete colours. Check out ggthemes::colorblind_pal for the Okabe-Ito colour palette. But 8 colours may not always be sufficient …
Both colour-blind friendly discrete and continuous colours. Check out microshades.
Not enough shapes. Check out ggstar.
Facing long axis labels. See Andrew Heiss’s blog.
Need to label your bar charts quickly. Consider ggfittext.
island_count <- penguins |>
dplyr::reframe(count = dplyr::n(), .by = c("species"))
ggplot2::ggplot(
data = island_count,
mapping = ggplot2::aes(
x = .data[["species"]],
y = .data[["count"]],
fill = .data[["species"]]),
) +
ggplot2::geom_col() +
ggfittext::geom_bar_text() +
ggplot2::theme(
axis.title.y = ggplot2::element_text(angle = 0)
)
Need to add text annotation. Consider ggtext.
Here is my example.
island_count <- penguins |>
dplyr::reframe(count = dplyr::n(), .by = c("species"))
species_colours <- list("Adelie" = "#D55E00",
"Chinstrap" = "#009E73",
"Gentoo" = "#0072B2")
ggplot2::ggplot(
data = island_count,
mapping = ggplot2::aes(
x = .data[["species"]],
y = .data[["count"]],
fill = .data[["species"]]),
) +
ggplot2::geom_col() +
ggfittext::geom_bar_text() +
ggplot2::scale_fill_manual(values = species_colours) +
ggplot2::theme(
plot.subtitle = ggtext::element_markdown(),
axis.title.y = ggtext::element_markdown(angle = 0)
) +
ggplot2::labs(
x = "Species",
y = "Count",
title = "Penguin Species",
subtitle = glue::glue("
The penguin data has a total of {island_count$count[island_count$species == \"Adelie\"]} <span style=\"color:{species_colours$Adelie}\">**Adelie**</span>, {island_count$count[island_count$species == \"Chinstrap\"]} <span style=\"color:{species_colours$Chinstrap}\">**Chinstrap**</span> and {island_count$count[island_count$species == \"Gentoo\"]} <span style=\"color:{species_colours$Gentoo}\">**Gentoo**</span>
")
)
See Royal Statistical Society Best Practices for Data Visualisation and Cara Thompson’s NHRS 2022 Talk for more ggtext examples.
Don’t know which plot your client wants. Create buttons using download_this in your html document.
Learn from ggplot2 mishaps. See Kara Woo’s RStudio 2021 Conference talk.