Chapter 18: Missing Values

For the R For Data Science 2nd Edition Book Club Cohort 9

Jeremy Selva
@JauntyJJS
https://jeremy-selva.netlify.app

Introduction

We encountered missing values in previous chapters.

ggplot2::ggplot(
  data = palmerpenguins::penguins,
  mapping = ggplot2::aes(
      x = .data[["flipper_length_mm"]], 
      y = .data[["body_mass_g"]]
      )
) + 
ggplot2::geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

A scatterplot of penguin's body mass in grams vs flipper length in mm.

palmerpenguins::penguins |> 
  dplyr::filter(
    is.na(flipper_length_mm) | is.na(body_mass_g)
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark()
  )

Introduction

We encountered missing values in previous chapters.

nycflights13::flights |> 
  dplyr::group_by(.data[["month"]]) |> 
  dplyr::summarize(
    avg_delay = mean(.data[["dep_delay"]])
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )
nycflights13::flights |> 
  dplyr::group_by(.data[["month"]]) |> 
  dplyr::summarize(
    avg_delay = mean(.data[["dep_delay"]], 
                     na.rm = FALSE),
    avg_delay_corrected = mean(.data[["dep_delay"]], 
                     na.rm = TRUE)
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

Introduction

We encountered missing values in previous chapters.

NA > 5
[1] NA
10 == NA
[1] NA
NA == NA
[1] NA
is.na(NA)
[1] TRUE

We learn more of the details in this chapter, covering additional tools (besides is.na and na.rm argument) for working with missing values

  • Explicit missing values
  • Implicit missing values
  • Empty groups

Explicit missing values

When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward). We can fill down in these missing values with tidyr::fill()

treatment <- tibble::tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

print(treatment)
# A tibble: 4 × 3
  person           treatment response
  <chr>                <dbl>    <dbl>
1 Derrick Whitmore         1        7
2 <NA>                     2       10
3 <NA>                     3       NA
4 Katherine Burke          1        4
treatment |>
  tidyr::fill(
    dplyr::everything(),
    .direction = "down"
)
# A tibble: 4 × 3
  person           treatment response
  <chr>                <dbl>    <dbl>
1 Derrick Whitmore         1        7
2 Derrick Whitmore         2       10
3 Derrick Whitmore         3       10
4 Katherine Burke          1        4

Explicit missing values

Missing values may need to be represented with some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them

x <- c(1, 4, 5, 7, NA)
dplyr::coalesce(x, 0)
[1] 1 4 5 7 0
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
dplyr::coalesce(y, z)
[1] 1 2 3 4 5

Explicit missing values

If we need to replace na for multiple columns, tidyr::replace_na is more useful.

df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

df
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 
3    NA b    
df |> tidyr::replace_na(list(x = 0, y = "unknown"))
# A tibble: 3 × 2
      x y      
  <dbl> <chr>  
1     1 a      
2     2 unknown
3     0 b      

Explicit missing values

On the other hand, some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

If possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), e.g., read_csv(path, na = "99")

If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():

x <- c(1, 4, 5, 7, -99)
dplyr::na_if(x, -99)
[1]  1  4  5  7 NA

Explicit missing values

R has one special type of missing value called NaN (pronounced “nan”), or not a number. NaN occurs when a mathematical operation that has an indeterminate result:

0 / 0
[1] NaN
0 * Inf
[1] NaN
Inf - Inf
[1] NaN
sqrt(-1)
[1] NaN

Explicit missing values

NaN generally behaves just like NA.

x <- c(NA, NaN)

x * 10
[1]  NA NaN
x == 1
[1] NA NA
is.na(x)
[1] TRUE TRUE

In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).

is.nan(x)
[1] FALSE  TRUE

Implicit missing values

Consider a simple dataset that records the price of some stock each quarter:

stocks <- tibble::tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

This dataset has two missing observations:

  • The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.

  • The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

Implicit missing values

If there is a need to make implicit missing values explicit, we can pivot the data using tidyr::pivot_wider.

wide_stocks <-  stocks |>
  tidyr::pivot_wider(
    names_from = "qtr", 
    values_from = "price"
  )

wide_stocks
# A tibble: 2 × 5
   year   `1`   `2`   `3`   `4`
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  2020  1.88  0.59  0.35 NA   
2  2021 NA     0.92  0.17  2.66

Implicit missing values

By default, making data longer using tidyr::pivot_longer preserves explicit missing values. We can drop them (make them implicit) by setting values_drop_na = TRUE.

wide_stocks |>
  tidyr::pivot_longer(
    cols = -c("year"),
    names_to = "qtr", 
    values_to = "price"
  )
# A tibble: 8 × 3
   year qtr   price
  <dbl> <chr> <dbl>
1  2020 1      1.88
2  2020 2      0.59
3  2020 3      0.35
4  2020 4     NA   
5  2021 1     NA   
6  2021 2      0.92
7  2021 3      0.17
8  2021 4      2.66
wide_stocks |>
  tidyr::pivot_longer(
    cols = -c("year"),
    names_to = "qtr", 
    values_to = "price",
    values_drop_na = TRUE
  )
# A tibble: 6 × 3
   year qtr   price
  <dbl> <chr> <dbl>
1  2020 1      1.88
2  2020 2      0.59
3  2020 3      0.35
4  2021 2      0.92
5  2021 3      0.17
6  2021 4      2.66

Implicit missing values

tidyr::complete() turns implicit missing values into explicit missing values based on combination values from its input columns.

stocks |>
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 7
  )
stocks |>
  tidyr::complete(
    .data[["year"]], 
    .data[["qtr"]]) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 4
  )

Implicit missing values

Sometimes the individual variables are themselves incomplete and they is a need to provide your own data. For example, if we know that the stocks dataset is supposed to run from 2019 to 2021, we could explicitly supply those values for year.

stocks |>
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 7
  )
stocks |>
  tidyr::complete(
    `year` = 2019:2021, 
    .data[["qtr"]]) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 4
  )

Implicit missing values

Another way to reveal implicitly missing observations is by using dplyr::anti_join. Here, four of the destinations do not have any airport metadata information.

# Get unique destination and rename to faa

dest_flights <- nycflights13::flights |> 
  dplyr::distinct(faa = .data[["dest"]])

dest_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )
dest_flights |> 
  dplyr::anti_join(
    y = nycflights13::airports,
    by = dplyr::join_by("faa")
 ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

Implicit missing values

Here, 722 planes do not have any planes metadata information.

# Get unique tail numbers

tailnum_flights <- nycflights13::flights |> 
  dplyr::distinct(.data[["tailnum"]])

tailnum_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )
tailnum_flights |> 
  dplyr::anti_join(
    y = nycflights13::planes,
    by = dplyr::join_by("tailnum")
 ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

Extra

Use dplyr::anti_join to isolate rows causing dplyr::inner_join error.

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  0,         1500,
  1,         3220,
  2,         4730,
  3,         4000,
  4,         1000,
  5,         1100
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `y` must be matched by `x`.
ℹ Row 1 of `y` was not matched.
weight_extra |> 
  dplyr::anti_join(
    y = three_penguins,
    by = dplyr::join_by("samp_id")
 ) 
# A tibble: 3 × 2
  samp_id body_mass_g
    <dbl>       <dbl>
1       0        1500
2       4        1000
3       5        1100

Extra

Use dplyr::anti_join to isolate dplyr::inner_join error.

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_no_3 <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_no_3,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 3 of `x` does not have a match.
three_penguins |> 
  dplyr::anti_join(
    y = weight_no_3,
    by = dplyr::join_by("samp_id")
 ) 
# A tibble: 1 × 3
  samp_id species   island
    <dbl> <chr>     <chr> 
1       3 Chinstrap Dream 

Extra

Unfortunately cannot resolve multiple matches. Use argument both relationship = "one-to-one" and unmatched = "error" to ensure one row from x matches with exactly one row of y.

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra_2 <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra_2,
    by = dplyr::join_by("samp_id"),
    relationship = "one-to-one",
    unmatched = "error"
 ) 
Error in `dplyr::inner_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.

Exercises

Can you find any relationship between the carrier and the rows that appear to be missing from planes ?

tailnum_carrier_flights <- nycflights13::flights |> 
  dplyr::distinct(.data[["tailnum"]], .data[["carrier"]]) |> 
  dplyr::arrange(.data[["carrier"]]) |> 
  dplyr::left_join(
    nycflights13::airlines,
    by = dplyr::join_by("carrier")
  )

tailnum_carrier_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    filterable = TRUE,
    defaultPageSize = 5
  )

Exercises

Can you find any relationship between the carrier and the rows that appear to be missing from planes ?

missing_tailnum_carrier_flights <- tailnum_carrier_flights |> 
  dplyr::anti_join(
    y = nycflights13::planes,
    by = dplyr::join_by("tailnum")
 ) 

missing_tailnum_carrier_flights[["carrier"]] |> 
  table()

 9E  AA  B6  DL  F9  FL  MQ  UA  US  WN 
  1 430   3  10   3  12 234  23   9   3 
missing_tailnum_carrier_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    filterable = TRUE,
    defaultPageSize = 5
  )

Factors and empty groups

A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors.

Here is a dataset that contains some health information about people.

health <- tibble::tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

Factors and empty groups

We want to count the number of smokers and non-smokers with dplyr::count() but it only gives us the amount of smokers because the group of smokers is empty

health |> dplyr::count(smoker)
# A tibble: 1 × 2
  smoker     n
  <fct>  <int>
1 no         5

We can request count() to keep all the groups, even those not seen in the data by using .drop = FALSE:

health |> dplyr::count(smoker,
                       .drop = FALSE)
# A tibble: 2 × 2
  smoker     n
  <fct>  <int>
1 yes        0
2 no         5

Factors and empty groups

The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis

ggplot2::ggplot(
  data = health, 
  mapping = ggplot2::aes(
    x = .data[["smoker"]])
  ) +
  ggplot2::geom_bar() +
  ggplot2::scale_x_discrete()

A barchart of the number of smoker and non-smokers. Thanks to the argument drop=FALSE in the scale_x_discrete function, the number of smoker is presented even though there are no smoker in the dataset.

ggplot2::ggplot(
  data = health, 
  mapping = ggplot2::aes(
    x = .data[["smoker"]])
  ) +
  ggplot2::geom_bar() +
  ggplot2::scale_x_discrete(drop = FALSE)

A barchart of the number of smoker and non-smokers. The number of smoker is presented even though there are no smoker in the dataset.

Factors and empty groups

The same problem comes up more generally with dplyr::group_by(). And again you can use .drop = FALSE to preserve all factor levels:

health |> 
  dplyr::group_by(
    .data[["smoker"]]
  ) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )
# A tibble: 1 × 6
  smoker     n mean_age min_age max_age sd_age
  <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
1 no         5       60      34      88   21.6
health |> 
  dplyr::group_by(
    .data[["smoker"]], 
    .drop = FALSE) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )
# A tibble: 2 × 6
  smoker     n mean_age min_age max_age sd_age
  <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
1 yes        0      NaN     Inf    -Inf   NA  
2 no         5       60      34      88   21.6

We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors

Factors and empty groups

Here we see mean({zero_vec}) returning NaN because

mean({zero_vec}) = sum({zero_vec})/length({zero_vec})

which is 0/0.

max() and min() return -Inf and Inf for empty vectors.

health |> 
  dplyr::group_by(
    .data[["smoker"]], 
    .drop = FALSE) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )
# A tibble: 2 × 6
  smoker     n mean_age min_age max_age sd_age
  <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
1 yes        0      NaN     Inf    -Inf   NA  
2 no         5       60      34      88   21.6

Factors and empty groups

Instead of .drop = FALSE, we can use tidyr::complete() to the implicit missing values explicit. The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.

health |> 
  dplyr::group_by(
    .data[["smoker"]]
  ) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  ) |> 
  tidyr::complete(.data[["smoker"]])
# A tibble: 2 × 6
  smoker     n mean_age min_age max_age sd_age
  <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
1 yes       NA       NA      NA      NA   NA  
2 no         5       60      34      88   21.6

Extra

Adapted from forcats 1.0.0 blog

There are two ways to represent a missing value in a factor:

NA as values

f1 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NA)

levels(f1)
[1] "x" "y"

NA as factors

f2 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NULL)

levels(f2)
[1] "x" "y" NA 

Extra

They provide different behaviour when is.na and as.integer are applied

NA as values

f1 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NA)

is.na(f1)
[1] FALSE FALSE  TRUE  TRUE FALSE
as.integer(f1)
[1]  1  2 NA NA  1

NAs in the values tend to be best for data analysis.

NA as factors

f2 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NULL)

is.na(f2)
[1] FALSE FALSE FALSE FALSE FALSE
as.integer(f2)
[1] 1 2 3 3 1

NAs in the levels are useful if you need to control where missing values are shown in a table or a plot

To make it easier to switch between these forms, forcats now comes fct_na_value_to_level() and fct_na_level_to_value().

Extra

In the plot below, we use fct_infreq() to reorder the levels of the factor so that the highest frequency levels are at the top of the bar chart. However, because the NAs are stored in the values, fct_infreq() has no ability to affect them, so they appear in their “default” position.

example <- data.frame(
  hair_color = c(dplyr::starwars$hair_color, 
                 rep("missing", 10), 
                 rep("don't know", 5))
 ) |> 
  dplyr::mutate(
    hair_color = .data[["hair_color"]] |> 
      # Reorder factor by frequency
      forcats::fct_infreq() |> 
      # Group hair colours with less than 2 observations as Other
      forcats::fct_lump_min(2, other_level = "(Other)") |>
      forcats::fct_rev()
  ) 

example |> 
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      y = .data[["hair_color"]]
    )
  ) + 
  ggplot2::geom_bar() + 
  ggplot2::labs(y = "Hair color")

A barchart showing the number of hair colour type in the modified starwars hair colour dataset. We can see that the missing group types are not consolidated together.

Extra

To consolidate all missing values,

example <- data.frame(
  hair_color = c(dplyr::starwars$hair_color, 
                 rep("missing", 10), 
                 rep("don't know", 5))
 ) |> 
  dplyr::mutate(
    hair_color = .data[["hair_color"]] |> 
      # Reorder factor by frequency
      forcats::fct_infreq() |> 
      forcats::fct_recode(
        missing = "don't know") |> 
      forcats::fct_na_level_to_value(
        extra_levels = "missing") |>
      forcats::fct_na_value_to_level(
        level = "(Missing)") |>
      # Group hair colours with less than 2 observations as Other
      forcats::fct_lump_min(2, other_level = "(Other)") |>
      forcats::fct_rev()
  )

example |> 
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      y = .data[["hair_color"]]
    )
  ) + 
  ggplot2::geom_bar() + 
  ggplot2::labs(y = "Hair color")

A barchart showing the number of hair colour type in the modified starwars hair colour dataset. We can see that the missing group types are consolidated together.