We encountered missing values in previous chapters.
NA>5
[1] NA
10==NA
[1] NA
NA==NA
[1] NA
is.na(NA)
[1] TRUE
This chapter covers the details and introduces additional tools (besides is.na() and the na.rm argument) for working with missing values:
Explicit missing values
Implicit missing values
Empty groups
Explicit missing values
When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward). We can fill in these missing values with tidyr::fill().
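As a minimal sketch of filling down, assuming a small hypothetical treatment tibble (the column names and values are made up for illustration):

```r
# Hypothetical hand-entered data: a missing person means "same as the row above".
treatment <- tibble::tibble(
  person   = c("Person A", NA, NA, "Person B"),
  visit    = c(1, 2, 3, 1),
  response = c(7, 10, NA, 4)
)

# fill() carries the last non-missing value downward by default.
filled <- treatment |>
  tidyr::fill(person)

filled
```

Only the person column is filled here; response keeps its NA because it was not named in fill().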
If we need to replace NA values in multiple columns, tidyr::replace_na() is more useful.
df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 <NA>
3 NA b
df |> tidyr::replace_na(list(x = 0, y = "unknown"))
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 unknown
3 0 b
Sometimes the opposite problem arises: a concrete value actually represents a missing value. This typically happens in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.
If possible, handle this when reading in the data, for example by using the na argument to readr::read_csv(), e.g., read_csv(path, na = "99").
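A small sketch of handling a sentinel value on read, assuming readr 2.0 or later (for literal data via I()) and a made-up two-column CSV:

```r
# Hypothetical CSV text in which 99 stands for "missing".
csv_text <- "id,score\n1,12\n2,99\n3,15\n"

# I() marks the string as literal data rather than a file path.
# Note that na = "99" replaces the default na strings instead of adding to them.
scores <- readr::read_csv(I(csv_text), na = "99", show_col_types = FALSE)

scores
```

The sentinel is converted to NA in every column where it appears, so this approach only suits files where 99 is never a legitimate value.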
If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():
x <- c(1, 4, 5, 7, -99)
dplyr::na_if(x, -99)
[1] 1 4 5 7 NA
R has one special type of missing value called NaN (pronounced “nan”), or not a number. NaN occurs when a mathematical operation has an indeterminate result:
0/0
[1] NaN
0*Inf
[1] NaN
Inf-Inf
[1] NaN
sqrt(-1)
[1] NaN
NaN generally behaves just like NA.
x <- c(NA, NaN)
x * 10
[1] NA NaN
x ==1
[1] NA NA
is.na(x)
[1] TRUE TRUE
In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).
is.nan(x)
[1] FALSE TRUE
Implicit missing values
Consider a simple dataset that records the price of some stock each quarter:
# A tibble: 2 × 5
year `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2020 1.88 0.59 0.35 NA
2 2021 NA 0.92 0.17 2.66
By default, making data longer using tidyr::pivot_longer preserves explicit missing values. We can drop them (make them implicit) by setting values_drop_na = TRUE.
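To illustrate, here is a sketch that rebuilds the wide quarterly table shown earlier (the object name stocks_wide is an assumption) and pivots it both ways:

```r
# Wide table with one column per quarter, matching the values printed above.
stocks_wide <- tibble::tibble(
  year = c(2020, 2021),
  `1`  = c(1.88, NA),
  `2`  = c(0.59, 0.92),
  `3`  = c(0.35, 0.17),
  `4`  = c(NA, 2.66)
)

# Default behaviour: explicit NAs are kept as rows.
long_all <- stocks_wide |>
  tidyr::pivot_longer(cols = -year, names_to = "qtr", values_to = "price")

# values_drop_na = TRUE removes those rows, making the missing values implicit.
long_dropped <- stocks_wide |>
  tidyr::pivot_longer(
    cols = -year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )

nrow(long_all)
nrow(long_dropped)
```

The first result has one row per year and quarter; the second omits the two quarters with no recorded price.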
Sometimes the individual variables are themselves incomplete and there is a need to provide your own data. For example, if we know that the stocks dataset is supposed to run from 2019 to 2021, we could explicitly supply those values for year.
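One way to do this, sketched here under the assumption that tidyr::complete() is the intended tool and that the data is held in long form (the stocks tibble below is reconstructed for illustration):

```r
# Long-form stock prices; 2021 Q1 is implicitly missing (no row at all).
stocks <- tibble::tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(1, 2, 3, 4, 2, 3, 4),
  price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# Supply the full range of years ourselves; qtr uses the values seen in the data.
completed <- stocks |>
  tidyr::complete(year = 2019:2021, qtr)

completed
```

Every combination of year and qtr now has a row, with price set to NA for combinations that were absent from the original data.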
Another way to reveal implicitly missing observations is by using dplyr::anti_join. Here, four of the destinations do not have any airport metadata information.
# Get unique destinations and rename to faa
dest_flights <- nycflights13::flights |>
  dplyr::distinct(faa = .data[["dest"]])
dest_flights |>
  reactable::reactable(theme = reactablefmtr::dark(), defaultPageSize = 5)
# A tibble: 1 × 3
samp_id species island
<dbl> <chr> <chr>
1 3 Chinstrap Dream
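The anti_join() idea can be sketched on a tiny scale. The two tibbles below are hypothetical stand-ins, not the real nycflights13 tables:

```r
# Hypothetical destinations seen in the flight data.
dest_seen <- tibble::tibble(faa = c("ATL", "BQN", "ORD", "SJU"))

# Hypothetical metadata table that lacks some of those destinations.
airport_meta <- tibble::tibble(
  faa  = c("ATL", "ORD"),
  name = c("Airport one", "Airport two")
)

# anti_join() keeps the rows of x that have no match in y.
missing_meta <- dplyr::anti_join(dest_seen, airport_meta, by = "faa")

missing_meta
```

The rows returned are exactly the observations that are implicitly missing from the metadata table.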
Extra
Unfortunately, this cannot resolve multiple matches. Use both the relationship = "one-to-one" and unmatched = "error" arguments to ensure that each row of x matches exactly one row of y.
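A sketch of how these two arguments behave, assuming dplyr 1.1.0 or later and dplyr::inner_join() as the join in question (the text does not say which join is meant). The helper strict_join() and all data are hypothetical:

```r
x          <- tibble::tibble(key = c(1, 2, 3), val_x = c("a", "b", "c"))
y_complete <- tibble::tibble(key = c(1, 2, 3), val_y = c("A", "B", "C"))
y_partial  <- tibble::tibble(key = c(1, 2, 4), val_y = c("A", "B", "D"))
y_dup      <- tibble::tibble(key = c(1, 1, 2, 3), val_y = c("A", "A2", "B", "C"))

# Hypothetical helper: a join that errors unless every row pairs up exactly once.
strict_join <- function(x, y) {
  dplyr::inner_join(
    x, y,
    by = "key",
    relationship = "one-to-one",
    unmatched = "error"
  )
}

# Every key pairs up exactly once, so the join succeeds.
ok <- strict_join(x, y_complete)

# Unmatched keys (3 in x, 4 in y) trigger an error.
unmatched_failed <- tryCatch(
  { strict_join(x, y_partial); FALSE },
  error = function(e) TRUE
)

# A key duplicated in y violates the one-to-one relationship.
dup_failed <- tryCatch(
  { strict_join(x, y_dup); FALSE },
  error = function(e) TRUE
)
```

Failing loudly in this way surfaces implicit missing values and accidental duplicates at the join, rather than in a later analysis step.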
We want to count the number of smokers and non-smokers with dplyr::count(), but it only gives us the number of non-smokers because the group of smokers is empty:
health |> dplyr::count(smoker)
# A tibble: 1 × 2
smoker n
<fct> <int>
1 no 5
We can request count() to keep all the groups, even those not seen in the data by using .drop = FALSE:
health |> dplyr::count(smoker, .drop = FALSE)
# A tibble: 2 × 2
smoker n
<fct> <int>
1 yes 0
2 no 5
Factors and empty groups
The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis.
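A minimal sketch of keeping an empty level on a discrete axis, assuming a health tibble in which nobody smokes (the data here is reconstructed for illustration):

```r
# smoker has two levels, but only "no" occurs in the data.
health_plot_data <- tibble::tibble(
  smoker = factor(rep("no", 5), levels = c("yes", "no"))
)

# drop = FALSE keeps the unused "yes" level on the x axis.
p <- ggplot2::ggplot(health_plot_data, ggplot2::aes(x = smoker)) +
  ggplot2::geom_bar() +
  ggplot2::scale_x_discrete(drop = FALSE)

p
```

Without the scale_x_discrete(drop = FALSE) line, the axis would show only the "no" bar.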
# A tibble: 2 × 6
smoker n mean_age min_age max_age sd_age
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 yes 0 NaN Inf -Inf NA
2 no 5 60 34 88 21.6
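The summary table above is consistent with grouping by smoker with .drop = FALSE and then summarising age. The sketch below assumes a health tibble with these five ages; the code that produced the table is not shown in the source, so treat this as a reconstruction:

```r
# Assumed data: five non-smokers, with "yes" as an unused factor level.
health <- tibble::tibble(
  smoker = factor(rep("no", 5), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56)
)

# suppressWarnings() hides the warnings that min() and max() emit on the
# empty "yes" group; they return Inf and -Inf respectively.
summary_tbl <- suppressWarnings(
  health |>
    dplyr::group_by(smoker, .drop = FALSE) |>
    dplyr::summarise(
      n        = dplyr::n(),
      mean_age = mean(age),
      min_age  = min(age),
      max_age  = max(age),
      sd_age   = sd(age)
    )
)

summary_tbl
```

The empty group yields different results for each summary function because each one handles a zero-length vector in its own way.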
Instead of .drop = FALSE, we can use tidyr::complete() to make the implicit missing values explicit. The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.
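A sketch of the complete() approach, again assuming a reconstructed health tibble. The fill argument of complete() is one possible way around the NA count:

```r
# Assumed data: five non-smokers, with "yes" as an unused factor level.
health_counts_data <- tibble::tibble(
  smoker = factor(rep("no", 5), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56)
)

# Default grouping drops the empty "yes" level; complete() adds it back,
# but the count for the new row is NA rather than 0.
counts_na <- health_counts_data |>
  dplyr::group_by(smoker) |>
  dplyr::summarise(n = dplyr::n()) |>
  tidyr::complete(smoker)

# Supplying fill replaces the NA in the newly created row with zero.
counts_zero <- health_counts_data |>
  dplyr::group_by(smoker) |>
  dplyr::summarise(n = dplyr::n()) |>
  tidyr::complete(smoker, fill = list(n = 0L))

counts_na
counts_zero
```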
In the plot below, we use fct_infreq() to reorder the levels of the factor so that the highest frequency levels are at the top of the bar chart. However, because the NAs are stored in the values, fct_infreq() has no ability to affect them, so they appear in their “default” position.
example <- data.frame(
  hair_color = c(
    dplyr::starwars$hair_color,
    rep("missing", 10),
    rep("don't know", 5)
  )
) |>
  dplyr::mutate(
    hair_color = .data[["hair_color"]] |>
      # Reorder factor by frequency
      forcats::fct_infreq() |>
      # Group hair colours with less than 2 observations as Other
      forcats::fct_lump_min(2, other_level = "(Other)") |>
      forcats::fct_rev()
  )

example |>
  ggplot2::ggplot(mapping = ggplot2::aes(y = .data[["hair_color"]])) +
  ggplot2::geom_bar() +
  ggplot2::labs(y = "Hair color")
Extra
To consolidate all missing values, use fct_recode() to convert “don’t know” to the value “missing”.
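A minimal sketch of that recoding, using a small hypothetical factor rather than the starwars data:

```r
# Hypothetical factor with two different labels for missing information.
hair <- factor(c("brown", "missing", "don't know", "blond", "don't know"))

# fct_recode() takes new_level = "old_level" pairs; recoding onto an
# existing level merges the two.
consolidated <- forcats::fct_recode(hair, "missing" = "don't know")

levels(consolidated)
table(consolidated)
```

After recoding, a single "missing" level holds every unknown value, so functions such as fct_infreq() treat them as one group.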