Validating BMI
Dealing with missing data and exceptions
Consider the data set bmi_dataset, which contains each patient’s BMI and is displayed using reactable.
Check if positive again
Again, we can start by validating that the column BMI is positive using the function col_vals_gt().
However, this time, we have an error.
When we isolate the rows that fail the validation, we can see that the error is caused by missing values.
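One possible way to isolate the failing rows is sketched below. It assumes a small hypothetical table, toy_bmi, standing in for bmi_dataset (whose contents are not shown here), and uses a pointblank agent with get_sundered_data() to pull out the rows that fail the check. This is only one approach and not necessarily the one used to produce the output described above.

```r
# Hypothetical stand-in for bmi_dataset; the values are made up for illustration.
toy_bmi <- data.frame(
  `Patient ID` = c("Patient 1", "Patient 2", "Patient 3"),
  BMI = c(22.5, NA, 31.0),
  check.names = FALSE
)

# Build an agent, add the "BMI > 0" step and run the validation.
agent <- pointblank::create_agent(tbl = toy_bmi) |>
  pointblank::col_vals_gt(
    columns = c("BMI"),
    value = 0
  ) |>
  pointblank::interrogate()

# Keep only the rows that failed at least one validation step.
# With the default na_pass = FALSE, the row with a missing BMI fails.
failed_rows <- pointblank::get_sundered_data(agent, type = "fail")
failed_rows
```

Using an agent here, rather than calling col_vals_gt() directly on the table, avoids stopping at the first failure and lets us inspect the offending rows afterwards.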
Sometimes, we want to keep the data despite having missing values. In that case, we can set the parameter na_pass = TRUE so that the validation workflow treats rows with missing values as passing.
Here is how it is done.
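As a minimal sketch, assuming a hypothetical table toy_bmi in place of bmi_dataset, the effect of na_pass can be compared using test_col_vals_gt(), which returns a single TRUE or FALSE instead of raising an error. The final call shows the pipeline form with na_pass = TRUE.

```r
# Hypothetical stand-in for bmi_dataset; the values are made up for illustration.
toy_bmi <- data.frame(
  `Patient ID` = c("Patient 1", "Patient 2", "Patient 3"),
  BMI = c(22.5, NA, 31.0),
  check.names = FALSE
)

# Default behaviour (na_pass = FALSE): the missing BMI counts as a failure.
strict_ok <- pointblank::test_col_vals_gt(
  toy_bmi,
  columns = c("BMI"),
  value = 0
)

# With na_pass = TRUE, rows with a missing BMI are treated as passing.
lenient_ok <- pointblank::test_col_vals_gt(
  toy_bmi,
  columns = c("BMI"),
  value = 0,
  na_pass = TRUE
)

# Pipeline form: the table is passed through when the check succeeds.
checked <- toy_bmi |>
  pointblank::col_vals_gt(
    columns = c("BMI"),
    value = 0,
    na_pass = TRUE
  )
```

Note that na_pass = TRUE does not remove the rows with missing values. They remain in the data and are simply not counted as failures.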
Check if between two specific values
In reality, we will want to verify that the patient’s BMI lies between two values (say 15 to 50, for example). We will isolate patients who do not meet this criterion so that we can check whether their values are valid.
In this case, we can use the function col_vals_between() instead. In place of the parameter value, it takes three parameters (left, right and inclusive) that define the range. The parameter inclusive is a two-element logical vector (each element TRUE or FALSE) that indicates whether the left and right bounds, respectively, are inclusive. In our example below, we set inclusive = c(TRUE, TRUE) as we want to include both the lower and upper bounds.
Remember to set the parameter na_pass = TRUE to ignore the missing values.
Here we can see that Patient 14 has a BMI of over 50 and Patient 16 has a BMI of less than 15.
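To make the role of left, right and inclusive concrete, here is a small sketch on a hypothetical table toy_bmi (not the real bmi_dataset). It uses test_col_vals_between(), which returns TRUE or FALSE, so the effect of each setting can be compared side by side.

```r
# Hypothetical data: two values exactly on the bounds, one missing value
# and one value above the upper bound.
toy_bmi <- data.frame(
  `Patient ID` = c("Patient A", "Patient B", "Patient C", "Patient D"),
  BMI = c(15, 50, NA, 52.3),
  check.names = FALSE
)

# The first three rows only: values on the bounds plus a missing value.
on_bounds <- toy_bmi[1:3, ]

# Both bounds inclusive: 15 and 50 pass, and NA passes due to na_pass = TRUE.
inclusive_ok <- pointblank::test_col_vals_between(
  on_bounds,
  columns = c("BMI"),
  left = 15,
  right = 50,
  inclusive = c(TRUE, TRUE),
  na_pass = TRUE
)

# Both bounds exclusive: 15 and 50 now fail.
exclusive_ok <- pointblank::test_col_vals_between(
  on_bounds,
  columns = c("BMI"),
  left = 15,
  right = 50,
  inclusive = c(FALSE, FALSE),
  na_pass = TRUE
)

# Full table: the value 52.3 lies outside the range, so the check fails.
full_ok <- pointblank::test_col_vals_between(
  toy_bmi,
  columns = c("BMI"),
  left = 15,
  right = 50,
  inclusive = c(TRUE, TRUE),
  na_pass = TRUE
)
```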
Creating exceptions using preconditions
Suppose that we have verified Patient 14 and Patient 16 and it turns out that their values are correct. This means that the range check in our validation steps has to omit these two patients as well.
Such exceptions can be made using the preconditions parameter. This special parameter accepts an expression that changes the input table before a particular validation step is run. Changes can include filtering rows, adding new columns by joining with another data set, or creating a new calculated column from other columns.
For this example, we set the preconditions parameter to keep only patients who are in neither the bmi_50_and_above_id nor the bmi_15_and_below_id group.
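A minimal sketch of this idea follows. The table toy_bmi and its BMI values are hypothetical and only mimic the situation described above (one verified patient above 50 and one below 15). The same range check is run with and without the precondition to show its effect.

```r
# IDs of patients whose out-of-range values have been verified as correct.
bmi_50_and_above_id <- c("Patient 14")
bmi_15_and_below_id <- c("Patient 16")

# Hypothetical data; the BMI values are made up for illustration.
toy_bmi <- data.frame(
  `Patient ID` = c("Patient 1", "Patient 14", "Patient 16"),
  BMI = c(24.1, 55.0, 13.2),
  check.names = FALSE
)

# Without a precondition, the two verified patients make the check fail.
without_exceptions <- pointblank::test_col_vals_between(
  toy_bmi,
  columns = c("BMI"),
  left = 15,
  right = 50,
  inclusive = c(TRUE, TRUE),
  na_pass = TRUE
)

# With a precondition, the verified patients are filtered out
# before the range check is applied to the remaining rows.
with_exceptions <- pointblank::test_col_vals_between(
  toy_bmi,
  columns = c("BMI"),
  preconditions = \(x) {
    x |>
      dplyr::filter(
        !.data[["Patient ID"]] %in%
          c(bmi_15_and_below_id, bmi_50_and_above_id)
      )
  },
  left = 15,
  right = 50,
  inclusive = c(TRUE, TRUE),
  na_pass = TRUE
)
```

The precondition only affects the table seen by that one validation step. The input data itself is left unchanged.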
A more thorough validation
A more thorough approach (and one that is more robust against changes in updated versions of the data set) is to also verify that patients in the bmi_50_and_above_id group have a BMI of 50 or above, and that patients in the bmi_15_and_below_id group have a BMI of 15 or below.
We can do this using col_vals_gte() and col_vals_lte().
Try out this exercise below.
Consider using the following preconditions arguments:
preconditions = \(x) {x |> dplyr::filter(.data[["Patient ID"]] %in% c(bmi_50_and_above_id))} for col_vals_gte
preconditions = \(x) {x |> dplyr::filter(.data[["Patient ID"]] %in% c(bmi_15_and_below_id))} for col_vals_lte
bmi_50_and_above_id <- c("Patient 14")
bmi_15_and_below_id <- c("Patient 16")
bmi_dataset |>
  pointblank::col_vals_gte(
    columns = c("BMI"),
    preconditions = \(x) {
      x |>
        dplyr::filter(
          .data[["Patient ID"]] %in% c(bmi_50_and_above_id)
        )
    },
    value = 50,
    na_pass = TRUE
  ) |>
  pointblank::col_vals_lte(
    columns = c("BMI"),
    preconditions = \(x) {
      x |>
        dplyr::filter(
          .data[["Patient ID"]] %in% c(bmi_15_and_below_id)
        )
    },
    value = 15,
    na_pass = TRUE
  ) |>
  pointblank::col_vals_between(
    columns = c("BMI"),
    preconditions = \(x) {
      x |>
        dplyr::filter(
          !.data[["Patient ID"]] %in%
            c(bmi_15_and_below_id, bmi_50_and_above_id)
        )
    },
    left = 15,
    right = 50,
    inclusive = c(TRUE, TRUE),
    na_pass = TRUE
  )