Chapter 13: Numbers
This chapter shows what R provides for working with numeric vectors.
Here are the R packages needed.
During data import, numeric columns usually arrive as integers or doubles. In some cases, however, they arrive as strings. The readr
package has useful functions for parsing such strings back to numbers.
Use readr::parse_double()
when you have numbers that have been written as strings:
Use readr::parse_number()
to ignore non-numeric text surrounding the number:
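A quick sketch of both (the input strings are illustrative):

```r
library(readr)

# numbers written as strings
parse_double(c("1.2", "5.6"))

# drop surrounding non-numeric text (currency symbols, grouping marks, %)
parse_number(c("$1,234", "USD 3,513", "59%"))
```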
The dplyr package provides dplyr::count(), which counts the number of observations in each group. If you want to see the most common values first, add sort = TRUE:
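For example, with the nycflights13 data (assuming the package is installed):

```r
library(nycflights13)
library(dplyr)

# most common destinations first
flights |> count(dest, sort = TRUE)
```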
The same computation can be done using dplyr::group_by(), dplyr::summarize(), and dplyr::n(). dplyr::n() is a special function that can only be used inside certain dplyr verbs:
Error in `dplyr::n()`:
! Must only be used inside data-masking verbs like `mutate()`,
`filter()`, and `group_by()`.
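The group_by()/summarize() equivalent of count(), assuming nycflights13 is loaded:

```r
library(nycflights13)
library(dplyr)

# equivalent to flights |> count(dest)
flights |>
  group_by(dest) |>
  summarize(n = n())
</imports>
```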
dplyr::n_distinct()
counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:
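A sketch with nycflights13:

```r
library(nycflights13)
library(dplyr)

# destinations served by the most carriers
flights |>
  group_by(dest) |>
  summarize(carriers = n_distinct(carrier)) |>
  arrange(desc(carriers))
```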
A weighted count is a sum. For example, you could "count" the number of miles each plane flew:
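With the wt argument, count() sums the weight variable instead of counting rows:

```r
library(nycflights13)
library(dplyr)

# total miles flown per plane; equivalent to
# group_by(tailnum) |> summarize(miles = sum(distance))
flights |> count(tailnum, wt = distance)
```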
We can count missing values by combining sum() and is.na(). The example below computes, for each destination, the number of cancelled flights (flights with no departure time from the origin):
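A sketch, assuming nycflights13:

```r
library(nycflights13)
library(dplyr)

# flights with no dep_time never left the origin
flights |>
  group_by(dest) |>
  summarize(n_cancelled = sum(is.na(dep_time)))
```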
How can you use count() to count the number of rows with a missing value for a given variable?
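One possibility: count() accepts expressions, not just column names:

```r
library(nycflights13)
library(dplyr)

flights |> count(is.na(dep_time))
```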
Alternatively
flights |>
dplyr::summarise(
dplyr::across(
.cols = dplyr::everything(),
.fns = ~ sum(is.na(.))
)
) |>
tidyr::pivot_longer(
cols = dplyr::everything(),
names_to = "column names",
values_to = "na count"
) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
Expand the following calls to count()
to instead use group_by()
, summarize()
, and arrange()
:
R handles mismatched vector lengths by recycling, that is, repeating the shorter vector. A warning is given when the longer length is not a multiple of the shorter one.
[1] 0.2 0.4 2.0 4.0
[1] 0.2 0.4 2.0 4.0
[1] 1 4 10 40
Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
object length
[1] 1 4 30 20
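The outputs above are consistent with calls along these lines (the input vector is an assumption):

```r
x <- c(1, 2, 10, 20)

x / 5              # the scalar 5 is recycled to match x
x / c(5, 5, 5, 5)  # same result, recycling made explicit

x * c(1, 2)        # c(1, 2) is recycled to c(1, 2, 1, 2)
x * c(1, 2, 3)     # warns: 4 is not a multiple of 3
```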
Recycling rules also apply to logical comparisons. Below is an example where ==
is used instead of %in%
. Because of the recycling rules, dplyr::filter()
will find flights in odd-numbered rows that departed in January and flights in even-numbered rows that departed in February:
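A sketch of the pitfall: == recycles c(1, 2) across the rows, while %in% tests membership:

```r
library(nycflights13)
library(dplyr)

# recycled comparison: January flights in odd rows, February in even rows
flights |> filter(month == c(1, 2))

# what was almost certainly intended: all January and February flights
flights |> filter(month %in% c(1, 2))
```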
The functions pmin()
and pmax()
return the smallest or largest value in each row when given several vectors or columns. This is different from the summary functions min()
and max()
, which take multiple observations and return a single value.
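A small illustration (the data frame is made up):

```r
df <- dplyr::tibble(x = c(1, 5, 7), y = c(3, 2, NA))

df |>
  dplyr::mutate(
    min = pmin(x, y, na.rm = TRUE),  # rowwise smallest
    max = pmax(x, y, na.rm = TRUE)   # rowwise largest
  )
```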
In R, %/%
does integer division and %%
computes the remainder:
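For example:

```r
1:10 %/% 3   # integer division
1:10 %% 3    # remainder
```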
We can use modular arithmetic to unpack the sched_dep_time
variable into hour and minute:
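A sketch, assuming nycflights13 (sched_dep_time stores, e.g., 5:17am as 517):

```r
library(nycflights13)
library(dplyr)

flights |>
  mutate(
    hour = sched_dep_time %/% 100,   # 517 %/% 100 -> 5
    minute = sched_dep_time %% 100,  # 517 %%  100 -> 17
    .keep = "used"
  )
```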
The proportion of cancelled flights varies over the course of the day: cancellations accumulate until about 8pm, and very late flights are much less likely to be cancelled.
cancelled_flights <- flights |>
dplyr::group_by(hour = .data[["sched_dep_time"]] %/% 100) |>
dplyr::summarize(
prop_cancelled = mean(is.na(.data[["dep_time"]])),
total_cancelled = sum(is.na(.data[["dep_time"]])),
n = dplyr::n())
cancelled_flights |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 2,
theme = reactablefmtr::dark()
)
In R, you have a choice of three logarithms: log()
(the natural log, base e), log2()
(base 2), and log10()
(base 10).
The inverse of log()
is exp()
; to compute the inverse of log2() or log10() you'll need to use 2^ or 10^.
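For example:

```r
log2(8)         # 3
2^log2(8)       # back to 8
log10(1000)     # 3
10^log10(1000)  # back to 1000
exp(log(5))     # 5
```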
Use round(x)
to round a number to the nearest integer. The second argument, digits,
rounds to the nearest 10^-digits.
[1] 123
[1] 123.46
[1] 123.5
[1] 120
[1] 100
However, note that round()
uses banker's rounding: if a number is exactly halfway between two integers, it is rounded to the even integer.
This keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.
See Markus Savli's LinkedIn post on the difference between SAS and R rounding methods.
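For example:

```r
round(c(0.5, 1.5, 2.5, 3.5))  # ties go to the even integer: 0 2 2 4
```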
floor()
always rounds down and ceiling() always rounds up.
However, they do not have a digits
argument like round()
, so to round to a given number of digits you need to scale down, round, and then scale back up.
[1] 123.45
[1] 200
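The two results above can be reproduced by scaling, rounding, and scaling back (the input value is an assumption):

```r
x <- 123.456

floor(x * 100) / 100    # round down to 2 digits
ceiling(x / 100) * 100  # round up to the nearest hundred
```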
Use cut()
to break up (aka bin) a numeric vector into discrete buckets or categories:
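The output below matches a call along these lines (the exact input vector is an assumption):

```r
x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))
```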
[1] (0,5] (0,5] (0,5] (5,10] (10,15] (15,20]
Levels: (0,5] (5,10] (10,15] (15,20]
The breaks don’t need to be evenly spaced:
[1] (0,5] (0,5] (0,5] (5,10] (10,100] (10,100]
Levels: (0,5] (5,10] (10,100]
You can optionally supply your own labels. Note that there should be one fewer label than breaks.
[1] small small small medium large XL
Levels: small medium large XL
Any values outside the range of the breaks (such as 30, which is outside 0-20) will become NA:
[1] <NA> <NA> (0,5] (5,10] <NA>
Levels: (0,5] (5,10] (10,15] (15,20]
Base R provides cumsum()
, cumprod()
, cummin()
, cummax()
for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean()
for cumulative means. Cumulative sums tend to come up the most in practice:
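For example:

```r
cumsum(1:10)              # running totals
cumprod(1:5)              # running products
cummax(c(1, 3, 2, 5, 4))  # running maximum
```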
If you need more complex rolling or sliding aggregates, try the slider
package.
What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?
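Base R provides sin(), cos(), tan(), asin(), acos(), atan(), and atan2(); they all work in radians, not degrees:

```r
sin(pi / 2)         # 1
cos(pi)             # -1
# to work in degrees, convert first
sin(90 * pi / 180)  # 1
```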
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem by running the code below: there’s a gap between each hour.
Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).
calculated_flights <- flights |>
dplyr::mutate(
sched_dep_time_hour = .data[["sched_dep_time"]] %/% 100,
sched_dep_time_minute = .data[["sched_dep_time"]] %% 100,
# fractional hours since midnight
sched_dep_time_fractions = sched_dep_time_hour + sched_dep_time_minute / 60
)
calculated_flights |>
dplyr::filter(.data[["month"]] == 1,
.data[["day"]] == 1) |>
ggplot2::ggplot(
mapping = ggplot2::aes(
x = .data[["sched_dep_time_fractions"]],
# dep_delay is already in minutes, so no conversion is needed
y = .data[["dep_delay"]])
) +
ggplot2::geom_point()
Round dep_time
and arr_time
to the nearest five minutes.
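A sketch: convert the HMM-encoded times to minutes since midnight, round to a multiple of 5, then convert back (the helper name round_to_5min is made up):

```r
library(nycflights13)
library(dplyr)

round_to_5min <- function(hmm) {
  mins <- (hmm %/% 100) * 60 + hmm %% 100  # minutes since midnight
  rounded <- round(mins / 5) * 5           # nearest multiple of 5
  (rounded %/% 60) * 100 + rounded %% 60   # back to HMM
}

flights |>
  mutate(
    dep_time_5 = round_to_5min(dep_time),
    arr_time_5 = round_to_5min(arr_time),
    .keep = "used"
  )
```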
dplyr provides a number of ranking functions inspired by SQL. They differ primarily in how they handle ties.
dplyr::min_rank()
gives every tie the same (smallest) value. It’s the way that ranks are usually computed in sports. This is similar to rank(ties.method = "min")
[1] 1 2 2 4 5 NA
[1] 1 2 2 4 5 NA
Use dplyr::desc(x)
to give the largest values the smallest ranks:
[1] 5 3 3 2 1 NA
[1] 5 3 3 2 1 NA
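Both outputs above are consistent with (the input vector is an assumption):

```r
x <- c(1, 2, 2, 3, 4, NA)

dplyr::min_rank(x)               # ties share the smallest rank
dplyr::min_rank(dplyr::desc(x))  # largest values get the smallest ranks
```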
dplyr::row_number()
gives every input a unique rank.
dplyr::dense_rank()
is like dplyr::min_rank() but does not leave gaps.
dplyr::percent_rank()
counts the number of values less than the observed value xi, and divides it by the number of observations minus 1. Missing values are ignored when counting the number of observations.
dplyr::cume_dist()
counts the number of values less than or equal to the observed value xi, and divides it by the number of observations.
row_number()
can also be used without any arguments inside a dplyr verb. In this case, it gives the number of the "current" row. When combined with %%
(computes the remainder) or %/%
(integer division), this can be a useful tool for dividing data into similarly sized groups:
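A sketch (the data frame is made up):

```r
library(dplyr)

df <- tibble(x = runif(10))

df |>
  mutate(
    row = row_number(),
    three_groups = row %% 3,   # cycles 1 2 0 1 2 0 ...
    three_in_each = row %/% 3  # blocks of (roughly) three rows
  )
```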
dplyr::lead()
and dplyr::lag()
allow you to refer to the values just before or just after the "current" value. They return a vector of the same length as the input, padded with NAs at the start or end:
x - lag(x)
gives you the difference between the current and previous value. x != lag(x)
tells you when the current value changes. Sometimes you want to start a new group every time some event occurs.
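A quick illustration:

```r
x <- c(2, 5, 11, 11, 19)

dplyr::lag(x)       # NA  2  5 11 11
dplyr::lead(x)      #  5 11 11 19 NA
x - dplyr::lag(x)   # difference from the previous value
x != dplyr::lag(x)  # TRUE where the value changed
```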
# Imagine you have the times (in minutes) when someone visited a website.
# We want to identify new sessions (a gap of 5 or more minutes since the last activity).
events <- dplyr::tibble(
time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
) |>
dplyr::mutate(
diff = .data[["time"]] -
dplyr::lag(.data[["time"]], default = dplyr::first(.data[["time"]])),
has_gap = .data[["diff"]] >= 5
)
events |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 6,
theme = reactablefmtr::dark()
)
To classify the rows into sessions, we assign each row a group id: start with group = 0
and use cumsum()
to increment the id on every row where has_gap
is TRUE.
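A self-contained sketch, repeating the events setup from above:

```r
library(dplyr)

events <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
) |>
  mutate(
    diff = time - lag(time, default = first(time)),
    has_gap = diff >= 5
  )

# cumsum() treats TRUE as 1, so the group id increments at every gap
events |> mutate(group = cumsum(has_gap))
```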
consecutive_id()
generates a unique identifier that increments every time a variable (or combination of variables) changes.
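For example:

```r
x <- c("a", "a", "b", "b", "a", "a")
dplyr::consecutive_id(x)  # increments at every change: 1 1 2 2 3 3
```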
Find the 10 most delayed flights (by arrival delay) using a ranking function.
flights |>
dplyr::mutate(
arr_delay_rank =
dplyr::min_rank(dplyr::desc(.data[["arr_delay"]]))
) |>
dplyr::arrange(.data[["arr_delay_rank"]]) |>
dplyr::select(c("arr_delay_rank", "arr_delay",
"day", "flight", "tailnum")) |>
dplyr::slice_head(n = 10) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
Which plane (tailnum) has the worst on-time record?
flights |>
dplyr::group_by(.data[["tailnum"]]) |>
dplyr::summarise(
number_of_delay =
sum(.data[["arr_delay"]] > 0, na.rm = TRUE),
) |>
dplyr::mutate(
number_of_delay_rank =
dplyr::min_rank(dplyr::desc(.data[["number_of_delay"]]))
) |>
dplyr::arrange(.data[["number_of_delay_rank"]]) |>
dplyr::slice_head(n = 10) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
What time of day should you fly if you want to avoid delays as much as possible?
flights |>
dplyr::group_by(dep_hour = .data[["sched_dep_time"]] %/% 100) |>
dplyr::summarize(
prop_of_arrival_delay = mean(.data[["arr_delay"]] > 0, na.rm = TRUE),
n = sum(.data[["arr_delay"]] > 0, na.rm = TRUE)
) |>
dplyr::filter(.data[["dep_hour"]] > 1) |>
ggplot2::ggplot(
mapping = ggplot2::aes(
x = .data[["dep_hour"]],
y = .data[["prop_of_arrival_delay"]])
) +
ggplot2::geom_line(color = "grey50") +
ggplot2::geom_point(ggplot2::aes(size = .data[["n"]]), color = "black")
What does the code below do?
flights |>
dplyr::group_by(.data[["dest"]]) |>
dplyr::filter(dplyr::row_number() < 4) |>
dplyr::relocate(c("dest")) |>
dplyr::arrange(.data[["dest"]]) |>
dplyr::select(c("dest", "year", "month",
"day", "arr_time", "tailnum")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
Assuming the data is sorted by time, this filters to the first three flights for each destination.
flights |>
dplyr::group_by(.data[["dest"]]) |>
dplyr::mutate(
group_id = dplyr::row_number()
) |>
dplyr::relocate(c("dest","group_id")) |>
dplyr::filter(.data[["group_id"]] < 4) |>
dplyr::arrange(.data[["dest"]]) |>
dplyr::select(c("dest", "group_id" ,"year",
"month", "day", "arr_time", "tailnum")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
What does the code below do?
flights |>
dplyr::group_by(.data[["dest"]]) |>
dplyr::filter(dplyr::row_number(.data[["dep_delay"]]) < 4) |>
dplyr::relocate(c("dest")) |>
dplyr::arrange(.data[["dest"]], .data[["dep_delay"]]) |>
dplyr::select(c("dest", "dep_delay", "year",
"month", "day", "tailnum")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
This filters to the three flights with the lowest dep_delay
for each destination.
flights |>
dplyr::group_by(.data[["dest"]]) |>
dplyr::mutate(
group_id = dplyr::row_number(.data[["dep_delay"]])
) |>
dplyr::relocate(c("dest","group_id")) |>
dplyr::filter(.data[["group_id"]] < 4) |>
dplyr::arrange(.data[["dest"]], .data[["dep_delay"]]) |>
dplyr::select(c("dest", "dep_delay", "group_id" ,
"year", "month", "day", "tailnum")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
For each destination, compute the total minutes of delay.
For each flight, compute the proportion of the total delay for its destination.
flights |>
dplyr::group_by(.data[["dest"]]) |>
dplyr::mutate(
total_delay = sum(.data[["arr_delay"]], na.rm = TRUE),
prop_delay = .data[["arr_delay"]] / .data[["total_delay"]]
) |>
dplyr::ungroup() |>
dplyr::select(c("dest", "arr_delay",
"total_delay", "prop_delay")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
Delays are typically temporally correlated. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.
filtered_flights <- flights |>
dplyr::mutate(hour = .data[["dep_time"]] %/% 100) |>
dplyr::group_by(.data[["year"]], .data[["month"]],
.data[["day"]], .data[["hour"]]) |>
dplyr::summarize(
dep_delay = mean(.data[["dep_delay"]], na.rm = TRUE),
n = dplyr::n(),
.groups = "drop"
) |>
dplyr::filter(.data[["n"]] > 5) |>
dplyr::mutate(prev_dep_delay = dplyr::lag(.data[["dep_delay"]]))
filtered_flights |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)
Delays are typically temporally correlated: Later flights are delayed to allow earlier flights to leave
filtered_flights |>
dplyr::filter(.data[["dep_delay"]] > 0,
!is.na(.data[["dep_delay"]]),
!is.na(.data[["prev_dep_delay"]])) |>
dplyr::select(c("dep_delay",
"prev_dep_delay")) |>
ggplot2::ggplot(
mapping = ggplot2::aes(
x = .data[["prev_dep_delay"]],
y = .data[["dep_delay"]])
) +
ggplot2::geom_point(alpha = 1/10) +
ggplot2::geom_smooth(method = "lm") +
ggplot2::geom_abline(slope = 1, intercept = 0)
Look at each destination. Can you find flights that are suspiciously fast, above 575 mph (i.e., flights that represent a potential data entry error)?
flights |>
dplyr::mutate(
air_time_hours = .data[["air_time"]] / 60,
air_speed_mph = .data[["distance"]] / .data[["air_time_hours"]]
) |>
dplyr::filter(.data[["air_speed_mph"]] > 575) |>
dplyr::arrange(dplyr::desc(.data[["air_speed_mph"]])) |>
dplyr::select(c("tailnum", "dest", "air_speed_mph",
"distance", "air_time_hours")) |>
reactable::reactable(
filterable = TRUE,
defaultPageSize = 5,
theme = reactablefmtr::dark()
)