Chapter 12 Lab
This document was prepared on 2022-08-23.
Code
# Rmarkdown stuff
library(rmarkdown, quietly = TRUE)
library(knitr, quietly = TRUE)
library(htmltools, quietly = TRUE)
library(styler, quietly = TRUE)
# Dataset
library(ISLR2, quietly = TRUE)
# Session info and package reporting
library(report, quietly = TRUE)
# Data wrangling
library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(purrr, quietly = TRUE)
library(tibble, quietly = TRUE)
library(stringr, quietly = TRUE)
# View PCA
library(langevitour, quietly = TRUE)
# Matrix Completion
library(softImpute, quietly = TRUE)
# Create interactive tables
library(reactable, quietly = TRUE)
library(reactablefmtr, quietly = TRUE)
# For plotting
library(tidytext, quietly = TRUE)
library(scales, quietly = TRUE)
library(forcats, quietly = TRUE)
library(grid, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gridExtra, quietly = TRUE)
library(grDevices, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggforce, quietly = TRUE)
# For applying tidymodels
library(recipes, quietly = TRUE)
# From GitHub remotes::install_github("tidymodels/parsnip")
library(parsnip, quietly = TRUE)
library(broom, quietly = TRUE)
library(yardstick, quietly = TRUE)
# For clustering
library(proxy, quietly = TRUE)
library(factoextra, quietly = TRUE)
library(cluster, quietly = TRUE)
# For tidy clustering
# From GitHub remotes::install_github("tidymodels/workflows@celery")
library(workflows, quietly = TRUE)
library(rsample, quietly = TRUE)
library(tune, quietly = TRUE)
library(dials, quietly = TRUE)
library(tidyclust, quietly = TRUE)
library(Rfast, quietly = TRUE)
library(flexclust, quietly = TRUE)
summary(report::report(sessionInfo()))The analysis was done using the R Statistical language (v4.2.1; R Core Team, 2022) on Windows 10 x64, using the packages Rcpp (v1.0.9), ggplot2 (v3.3.6), rmarkdown (v2.14), cluster (v2.1.3), report (v0.5.1), lattice (v0.20.45), tidytext (v0.3.3), flexclust (v1.4.1), knitr (v1.39), broom (v1.0.0), cowplot (v1.1.1), dials (v1.0.0), dplyr (v1.0.9), factoextra (v1.0.7), forcats (v0.5.1), ggforce (v0.3.3), gridExtra (v2.3), htmltools (v0.5.2), ISLR2 (v1.3.1), langevitour (v0.5), magrittr (v2.0.3), Matrix (v1.4.1), modeltools (v0.2.23), parsnip (v1.0.0.9000), proxy (v0.4.27), purrr (v0.3.4), RcppZiggurat (v0.1.6), reactable (v0.3.0), reactablefmtr (v2.0.0), recipes (v1.0.1), Rfast (v2.0.6), rsample (v1.1.0), scales (v1.2.0), softImpute (v1.4.1), stringr (v1.4.0), styler (v1.7.0), tibble (v3.1.8), tidyclust (v0.0.0.9000), tidyr (v1.2.0), tune (v1.0.0), workflows (v1.0.0.9000) and yardstick (v1.0.0).
Principal Components Analysis
The USArrests data set from the datasets package contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
| Column Name | Column Description | 
|---|---|
| Murder | Number of arrests for murder (per 100,000) | 
| Assault | Number of arrests for assault (per 100,000) | 
| UrbanPop | Percent of the population living in urban areas | 
| Rape | Number of arrests for rape (per 100,000) | 
Turn USArrests into a tibble and move the rownames into a column.
We create a recipe with recipes::recipe, recipes::all_numeric_predictors(), recipes::step_normalize and recipes::step_pca
By default, recipes::step_pca uses stats::prcomp with the argument defaults set to retx = FALSE, center = FALSE, scale. = FALSE, and tol = NULL.
Code
pca_recipe <- recipes::recipe(
  formula = ~.,
  data = USArrests
) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::step_pca(recipes::all_numeric_predictors(),
    id = "pca",
    num_comp = 4
  )
pca_recipe> Recipe
> 
> Inputs:
> 
>       role #variables
>  predictor          5
> 
> Operations:
> 
> Centering and scaling for recipes::all_numeric_predictors()
> PCA extraction with recipes::all_numeric_predictors()Apply recipes::prep and recipes::bake to compute the scores of the principal component analysis.
Code
Code
We can explore the results of the PCA using the recipes::prep and parsnip::tidy. We can see that pca is done at step number 2
As such, we extract the pca loadings of the pca step.
Code
Here are the pca loadings in wide table form.
Code
We make a visualization to see how the loadings of the four components look like
Code
tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["value"]],
    y = .data[["terms"]],
    fill = .data[["terms"]]
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)
Let us take a closer look at the four variables that contribute to the four components
Code
tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(4, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(
      x = .data[["terms"]],
      by = abs(.data[["value"]]),
      within = .data[["component"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = abs(.data[["value"]]),
    y = .data[["terms"]],
    fill = .data[["value"]] > 0
  )) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[["component"]]),
    scales = "free_y"
  ) +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PCA component",
    y = NULL,
    fill = "Positive?"
  )
How much variation are we capturing for each component?
Here is how we can see the variance statistics but it is in a long form
Code
Here is how we can see the variance statistics in its wide form
Code
Here is a simple plot to show the variance captured for each component
Code
# Get the variance
percent_variation <- tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>%
  dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
# I use [1:4] to select the first four components
dplyr::tibble(
  component = unique(tidied_pca_loadings$component)[1:4],
  percent_var = percent_variation[1:4]
) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["percent_var"]]
  )) +
  ggplot2::geom_col() +
  ggplot2::geom_text(
    mapping = ggplot2::aes(
      label = paste(round(.data[["percent_var"]] * 100, 0), "%")
    ),
    vjust = -0.2
  ) +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Percent variance \nexplained by each \nPCA component"
  )
Here is a PCA Biplot of PC1 and PC2
Code
filtered_loadings <- tidied_pca_loadings %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]]
  )
ggplot2::ggplot() +
  ggplot2::geom_point(
    data = pca_score,
    mapping = ggplot2::aes(
      x = .data[["PC1"]],
      y = .data[["PC2"]]
    ),
    size = 2,
    alpha = 0.6
  ) +
  ggplot2::geom_text(
    data = pca_score,
    mapping = ggplot2::aes(
      x = .data[["PC1"]],
      y = .data[["PC2"]],
      label = .data[["state"]]
    ),
    check_overlap = TRUE,
    size = 2.5,
    hjust = "inward"
  ) +
  ggplot2::geom_segment(
    data = filtered_loadings,
    mapping = ggplot2::aes(
      x = 0,
      y = 0,
      xend = .data[["PC1"]] * 2.5,
      yend = .data[["PC2"]] * 2.5
    ),
    arrow = ggplot2::arrow(length = grid::unit(
      x = 0.5,
      units = "picas"
    )),
    color = "blue"
  ) +
  ggplot2::annotate(
    geom = "text",
    x = filtered_loadings[["PC1"]] * 2.7,
    y = filtered_loadings[["PC2"]] * 2.7,
    label = filtered_loadings[["terms"]]
  ) +
  ggplot2::coord_fixed(clip = "off") +
  ggplot2::theme_minimal()
Matrix Completion
Initialisation
Code
missing_data <- data.frame(
  SBP = c(-3, NA, -1, 1, 1, NA),
  DBP = c(NA, -2, 0, NA, 2, 4)
)
colour_missing_cells <- function(value, index, name) {
  missing_indices <- which(is.na(missing_data[[name]]))
  if (index %in% missing_indices) {
    list(background = "#aaf0f8")
  }
}
xbar <- colMeans(missing_data, na.rm = TRUE)
initialisation_data <- data.frame(
  SBP = c(-3, -0.5, -1, 1, 1, -0.5),
  DBP = c(1, -2, 0, 1, 2, 4)
)
thresh <- 1e-2
ismiss <- is.na(missing_data)Functions
These are the implementations for calculating step 2a and 2b for Algorithm 12.1
Code
fit_pca <- function(input_data, num_comp = 1) {
  data_pca <- prcomp(input_data, center = FALSE)
  pca_loadings <- data_pca$rotation
  pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
  pca_scores <- data_pca$x[, 1:num_comp, drop = FALSE]
  estimated_data <- pca_scores %*% pca_loadings_inverse
  return(estimated_data)
}Code
fit_svd <- function(input_data, num_comp = 1) {
  svd_object <- svd(input_data)
  pca_loadings <- svd_object$v[, 1:num_comp, drop = FALSE]
  pca_loadings_inverse <- t(pca_loadings)
  pca_scores <- (svd_object$u %*% diag(svd_object$d))[, 1:num_comp, drop = FALSE]
  estimated_data <- pca_scores %*% pca_loadings_inverse
  return(estimated_data)
}Code
fit_pca_tidy <- function(input_data, num_comp = 1) {
  pca_recipe <- recipes::recipe(
    formula = ~.,
    data = input_data
  ) %>%
    recipes::step_pca(recipes::all_numeric_predictors(),
      id = "pca"
    )
  pca_estimates <- recipes::prep(x = pca_recipe, training = input_data)
  pca_scores <- pca_estimates %>%
    recipes::bake(new_data = input_data) %>%
    as.matrix()
  pca_scores <- pca_scores[, 1:num_comp, drop = FALSE]
  pca_loadings <- pca_estimates %>%
    parsnip::tidy(id = "pca", type = "coef") %>%
    tidyr::pivot_wider(
      names_from = .data[["component"]],
      values_from = .data[["value"]]
    ) %>%
    dplyr::select(-c("terms", "id")) %>%
    as.matrix()
  pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
  estimated_data <- pca_scores %*% pca_loadings_inverse
  return(estimated_data)
}Iteration 1
Code
initialisation_data_pca <- initialisation_data
# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_pca %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )Code
mss_old_pca> [1] 1.48974Code
initialisation_data_svd <- initialisation_data
# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_svd %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )Code
mss_old_svd> [1] 1.48974Code
initialisation_data_tidymodels <- initialisation_data
# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)
# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_tidymodels %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )Code
mss_old_tidymodels> [1] 1.48974The rest of the iteration
Code
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1
  # Step 2(a)
  Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
  # Step 2(b)
  initialisation_data_pca[ismiss] <- Xapp[ismiss]
  # Step 2(c)
  mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_pca - mss_pca
  mss_old_pca <- mss_pca
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_pca, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_pca)
}> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>              SBP        DBP
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>              SBP        DBP
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>              SBP         DBP
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>             SBP       DBP
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000Code
iter <- 1
rel_err <- mss_old_svd
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1
  # Step 2(a)
  Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
  # Step 2(b)
  initialisation_data_svd[ismiss] <- Xapp[ismiss]
  # Step 2(c)
  mss_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_svd - mss_svd
  mss_old_svd <- mss_svd
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_svd, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_svd)
}> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>             [,1]        [,2]
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>            [,1]      [,2]
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000Code
iter <- 1
rel_err <- mss_old_tidymodels
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1
  # Step 2(a)
  Xapp <- fit_svd(initialisation_data_tidymodels, num_comp = 1)
  # Step 2(b)
  initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
  # Step 2(c)
  mss_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_tidymodels - mss_tidymodels
  mss_old_tidymodels <- mss_tidymodels
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_tidymodels, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_tidymodels)
}> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>             [,1]        [,2]
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>            [,1]      [,2]
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000SoftImpute
Code
initialisation_data_tidymodels %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  ) %>%
  reactablefmtr::add_title(
    title = "Algorithm 12.1"
  )
fits <- softImpute::softImpute(missing_data, type = "svd")
softImpute::complete(missing_data, fits) %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  ) %>%
  reactablefmtr::add_title(
    title = "Soft Impute"
  )Algorithm 12.1
Soft Impute
K-Means Clustering
Introduction
We begin with a simple simulated example in which there truly are two clusters in the
Code
set.seed(2)
# CLuster centre at (0,0) and (3,-4)
centers <- tibble::tibble(
  cluster = factor(1:2),
  num_points = c(25, 25), # number points in each cluster
  V1 = c(0, 3), # V1 coordinate of cluster center
  V2 = c(0, -4) # V2 coordinate of cluster center
)
labelled_points <- centers %>%
  dplyr::mutate(
    V1 = purrr::map2(.data[["num_points"]], .data[["V1"]], rnorm),
    V2 = purrr::map2(.data[["num_points"]], .data[["V2"]], rnorm)
  ) %>%
  dplyr::select(-c("num_points")) %>%
  tidyr::unnest(cols = c("V1", "V2"))
labelled_points %>%
  reactable::reactable(
    defaultPageSize = 10,
    columns = list(
      V1 = reactable::colDef(format = reactable::colFormat(digits = 2)),
      V2 = reactable::colDef(format = reactable::colFormat(digits = 2))
    )
  )Code
ggplot2::ggplot(
  data = labelled_points,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]],
    color = .data[["cluster"]]
  )
) +
  ggplot2::geom_point(alpha = 1)
We now perform K-means clustering with \(K = 2\).
We’ll use the built-in stats::kmeans function, which accepts a data frame with all numeric columns as it’s primary argument
Code
> K-means clustering with 2 clusters of sizes 26, 24
> 
> Cluster means:
>         V1        V2
> 1 0.361163 -0.162447
> 2 2.877777 -4.262629
> 
> Clustering vector:
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
> 
> Within cluster sum of squares by cluster:
> [1] 68.52253 58.53435
>  (between_SS / total_SS =  69.5 %)
> 
> Available components:
> 
> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
> [6] "betweenss"    "size"         "iter"         "ifault"The broom::tidy function returns information for each cluster, including their mean position (centroid), size and within-cluster sum-of-squares.
The broom::glance function returns model wise metrics. One of these is tot.withinss which is the total within-cluster sum-of-squares that we seek to minimize when we perform K-means clustering.
We can see what cluster each observation belongs to by using broom::augment which “predict” which cluster a given observation belongs to.
Code
Code
ggplot2::ggplot(
  data = predicted_kmeans,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]],
    color = .data[[".cluster"]]
  )
) +
  ggplot2::geom_point(alpha = 1)
Exploration
In reality, we will not know how many clusters is required for a given dataset. We need to try out a number of different values of \(K\) and then find the best one.
Let us try \(K\) from 1 to 9.
Code
kclusts <- tibble::tibble(k = 1:9) %>%
  dplyr::mutate(
    kclust = purrr::map(
      .x = .data[["k"]],
      .f = ~ kmeans(x = points, centers = .x)
    )
  ) %>%
  dplyr::mutate(
    tidied = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::tidy
    ),
    glanced = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::glance
    ),
    augmented = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::augment, data = points
    )
  )
kclusts> # A tibble: 9 × 5
>       k kclust   tidied           glanced          augmented        
>   <int> <list>   <list>           <list>           <list>           
> 1     1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 2     2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 3     3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 4     4 <kmeans> <tibble [4 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 5     5 <kmeans> <tibble [5 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 6     6 <kmeans> <tibble [6 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 7     7 <kmeans> <tibble [7 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 8     8 <kmeans> <tibble [8 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 9     9 <kmeans> <tibble [9 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>Let assignments be a dataset of which cluster each points goes to for \(K\) from 1 to 9
Code
Now we can plot the original points using the data from assignments, with each point coloured according to the predicted cluster.
Code
ggplot2::ggplot(
  data = assignments,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]]
  )
) +
  geom_point(
    mapping = ggplot2::aes(color = .data[[".cluster"]]),
    alpha = 0.8
  ) +
  facet_wrap(ggplot2::vars(.data[["k"]]))
Let clusters be a dataset of cluster mean position (centroids) for \(K\) from 1 to 9
Code
We can then add the centers of the cluster using the data from clusters
Code
ggplot2::ggplot(
  data = assignments,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]]
  )
) +
  geom_point(
    mapping = ggplot2::aes(color = .data[[".cluster"]]),
    alpha = 0.8
  ) +
  facet_wrap(ggplot2::vars(.data[["k"]])) +
  geom_point(
    data = clusters,
    size = 10,
    shape = "x"
  )
Let clusterings be a dataset of cluster metrics for \(K\) from 1 to 9
Code
Now that we have the total within-cluster sum-of-squares for \(K\) from 1 to 9, we can plot them against k so we can use the elbow method to find the optimal number of clusters.
Code
ggplot2::ggplot(
  data = clusterings,
  mapping = ggplot2::aes(
    x = .data[["k"]],
    y = .data[["tot.withinss"]]
  )
) +
  ggplot2::geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
  ggplot2::geom_point(size = 2, color = "midnightblue")
We see an elbow at k = 2 which makes us happy since the data set is specifically created to have 2 clusters. We can now extract the model where k = 2 from
Hierarchical Clustering
The stats::dist function is used to calculate the Euclidean distance dissimilarity measure.
Code
The stats::hclust function implements hierarchical clustering in R. We use the data used for K-means clustering to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.
We use factoextra::fviz_dend to visualize the clustering created using hclust.
To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the stats::cutree function
We analyze the nature of the clusters mathematically by evaluating the c, which measures the clustering structure of the dataset - with values closer to 1 suggest a more balanced clustering structure and values closer to 0 suggest less well-formed clusters. However, the agglomerative coefficient tends to become larger as the number of samples increases, so it should not be used to compare across data sets of very different sizes.
cluster::agnes allows us to compute this metric on the results from hierarchical clustering.
It looks like the complete linkage did the best.
Code
# Compute agglomerative coefficient values
ac_metric <- list(
  complete_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "complete"
  )$ac,
  average_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "average"
  )$ac,
  single_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "single"
  )$ac
)
ac_metric> $complete_ac
> [1] 0.9359608
> 
> $average_ac
> [1] 0.8887208
> 
> $single_ac
> [1] 0.6700538Although hierarchical clustering provides a fully connected dendrogram representing the cluster relationships, there may still need to choose the preferred number of clusters to extract. Fortunately we can execute approaches similar to those discussed for k-means clustering with factoextra::fviz_nbclust
Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "wss",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "silhouette",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "gap_stat",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
Another way of calculating distances is based on pearson correlation using proxy::dist. However, this only makes sense if we have data with 3 or more variables.
Code

Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "wss",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "silhouette",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "gap_stat",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic data.
We illustrate these techniques on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.
labels represents the type of cancer. V_1 to V_6830 represent a gene.
PCA on the NCI60 Data
We first perform PCA on the data after scaling the variables (genes) to have standard deviation one and plot the first few principal component score vectors, in order to visualize the data.
Code
nci60_preproc_rec <- recipes::recipe(formula = label ~ ., data = nci60) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::step_pca(recipes::all_numeric_predictors(),
    id = "pca",
    threshold = 0.99
  )Code
Code
We visualise the projections of the NCI60 cancer cell lines onto the first three principal components
Code
colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))
nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC02")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["PC01"]],
    y = .data[["PC02"]],
    color = .data[["label"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::theme_bw()
Code
nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC03")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["PC01"]],
    y = .data[["PC03"]],
    color = .data[["label"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::theme_bw()
Here is a scatter plot matrix
Code
nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC02", "PC03")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .panel_x,
    y = .panel_y,
    color = .data[["label"]],
    fill = .data[["label"]]
  )) +
  ggplot2::geom_point(alpha = 0.4, size = 0.5) +
  ggforce::geom_autodensity(alpha = .3) +
  ggforce::facet_matrix(rows = ggplot2::vars(-c("label")), layer.diag = 2) +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::scale_fill_manual(values = colors) +
  cowplot::theme_minimal_grid() +
  cowplot::panel_border(color = "black") +
  ggplot2::theme(legend.position = "bottom")
To view them interactively, consider using langevitour::langevitour from the R package langevitour
Code
nci60_pca_score_filtered <- nci60_pca_score %>%
  dplyr::select(c(
    "label",
    "PC01", "PC02", "PC03",
    "PC04", "PC05", "PC06"
  ))
langevitour::langevitour(
  nci60_pca_score_filtered[, -1], nci60_pca_score_filtered$label,
  pointSize = 2
)How much variation are we capturing for each component?
We extract the pca loadings of the pca step.
Code
Here is how we can see the variance statistics but it is in a long form
Here is how we can see the variance statistics in its wide form
Code
Here is a simple plot to show the variance captured for each component.
Code
# Get the variance
percent_variation <- nci60_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>%
  dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
  stringr::str_remove(pattern = "PC") %>%
  as.integer()
# I use [1:4] to select the first four components
dplyr::tibble(
  component = component,
  percent_var = percent_variation
) %>%
  # dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["percent_var"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Percent variance \nexplained by each \nPCA component"
  )
We see that together, the first seven principal components explain around 40% of the variance in the data.
Code
# Get the variance
cumulative_percent_variation <- nci60_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
  dplyr::pull(.data[["value"]])
cumulative_percent_variation <- cumulative_percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
  stringr::str_remove(pattern = "PC") %>%
  as.integer()
# I use [1:4] to select the first four components
dplyr::tibble(
  component = component,
  cumulative_percent_var = cumulative_percent_variation
) %>%
  # dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["cumulative_percent_var"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Cumulative percent \nvariance explained \nby each PCA \ncomponent"
  )
Clustering on NCI60 dataset (Base R)
We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
We first preprocess the data suited for hierarchical clustering
Code
nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)We choose the preferred number of clusters to extract.
Code
nci60_num_cluster <- nci60_scaled %>%
  dplyr::select(-c("label"))
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "wss",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "silhouette",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "gap_stat",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.
Code
nci60_complete <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")
nci60_average <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "average")
nci60_single <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "single")We then visualize them to see if any of them have some good natural separations.
Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations.
Complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage.
Code
nci60_complete$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_complete, main = "complete", k = 4)
Code
stats::cutree(tree = nci60_complete, k = 4)>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           2           2 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           2           2           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           2           2           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           3           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           3           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           4           1           4           4           4           4 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           4           4           4           4           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1Code
nci60_average$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_average, main = "average", k = 4)
Code
stats::cutree(tree = nci60_average, k = 4)>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           1 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           2           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           3           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           3           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           1           1           1           1           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           4           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1Code
nci60_single$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_single, main = "single", k = 4)
Code
stats::cutree(tree = nci60_single, k = 4)>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           3           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           1           1           1 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           1           1           1           1           4           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           1           1           1           1           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1We will use complete linkage hierarchical clustering for the analysis that follows. We calculate which Label is the most common within each cluster.
Code
tibble::tibble(
  label = nci60$label,
  cluster_id = stats::cutree(nci60_complete, k = 4)
) %>%
  dplyr::count(.data[["label"]], .data[["cluster_id"]]) %>%
  dplyr::group_by(.data[["cluster_id"]]) %>%
  dplyr::mutate(
    total = sum(.data[["n"]]),
    prop = .data[["n"]] / sum(.data[["n"]])
  ) %>%
  dplyr::slice_max(n = 1, order_by = .data[["prop"]]) %>%
  dplyr::ungroup() %>%
  reactable::reactable()How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?
Code
nci60_kmeans_results <- kmeans(nci60_num_cluster, centers = 4, nstart = 50)Code
We see that the four clusters obtained using hierarchical clustering and Kmeans clustering are somewhat different.
Code
>               cluster_kmeans
> cluster_hclust  1  2  3  4
>              1  0  9 20 11
>              2  0  0  7  0
>              3  8  0  0  0
>              4  0  0  0  9Code

Cluster 2 in K-means clustering is identical to cluster 4 in hierarchical clustering. However, the other clusters defer. For the twenty seven samples in K-means Cluster 3, twenty goes to hierarchical clustering Cluster 1 and seven goes to hierarchical clustering Cluster 2.
Rather than performing hierarchical clustering on the entire data matrix, we can also perform hierarchical clustering on the first few principal component score vectors, as follows:
Code
# nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
#   recipes::step_normalize(recipes::all_numeric_predictors()) %>%
#   recipes::prep() %>%
#   recipes::bake(new_data = NULL)
nci60_pca <- recipes::recipe(~., nci60_scaled) %>%
  step_pca(recipes::all_numeric_predictors(), num_comp = 5) %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)Code

Code
stats::cutree(tree = nci60_pca_complete, k = 4)>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           2           1           2           2           2 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           2           2           2           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           2           2           2           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           1           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           1           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           4           4           4           4           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           2           2           2           2           2 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           2           2           2           2Clustering on NCI60 dataset (tidyclust)
Recall that we scale the data before clustering and remove the labels.
Here is how we can create the clustering elbow plot using tidyclust
Code
kmeans_spec <- tidyclust::k_means(
  num_clusters = tune::tune()
)
nci60_cv <- rsample::vfold_cv(nci60_num_cluster, v = 5)
nci60_rec <- recipes::recipe(formula = ~., data = nci60_num_cluster)
kmeans_wflow <- workflows::workflow(nci60_rec, kmeans_spec)
clust_num_grid <- dials::grid_regular(
  tidyclust::num_clusters(range = c(1, 15)),
  levels = 15
)
clust_num_grid %>%
  reactable::reactable(defaultPageSize = 5)Code
res <- tidyclust::tune_cluster(
  kmeans_wflow,
  resamples = nci60_cv,
  grid = clust_num_grid,
  control = tune::control_grid(save_pred = TRUE, extract = identity),
  metrics = tidyclust::cluster_metric_set(
    tidyclust::tot_wss,
    tidyclust::tot_sse,
    tidyclust::sse_ratio
  )
)
res_metrics <- res %>% tune::collect_metrics()
res_metrics %>%
  reactable::reactable(defaultPageSize = 5)Code
res_metrics %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      x = .data[["num_clusters"]],
      y = .data[["mean"]],
      colour = .data[[".metric"]]
    )
  ) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.8
  ) +
  ggplot2::geom_line(size = 1) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 3
  ) +
  ylab("mean over 5 folds") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:15) +
  ggplot2::theme(legend.position = "none")
tidyclust::hier_clust is used to do the hierarchical clustering. Here, we set the number of clusters to three and use the complete linkage method. It actually uses stats::hclust() and Euclidean distance under the hood.
Code
hc_spec <- tidyclust::hier_clust(
  num_clusters = 3,
  linkage_method = "complete"
)
hc_fit <- hc_spec %>%
  tidyclust::fit(label ~ ., data = nci60_scaled)We use tidyclust::within_cluster_sse to calculate the Sum of Squared Error in each cluster
Code
hc_fit %>%
  tidyclust::within_cluster_sse() %>%
  reactable::reactable(defaultPageSize = 5)We use tidyclust::tot_sse to calculate the total sum of squares.
We use tidyclust::sse_ratio to calculate the ratio of the WSS to the total SSE.
We use tidyclust::silhouettes to measure silhouettes between clusters for all observations. Ensure that the labels are removed.
Code
hc_fit %>%
  tidyclust::silhouettes(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)We use tidyclust::avg_silhouette to measure average silhouettes across all observations. Ensure that the labels are removed. (This is also the mean of the column sil_width)
Code
hc_fit %>%
  tidyclust::avg_silhouette(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)Use tidyclust::cluster_metric_set to combine the metrics together.
Code
many_metrics <- tidyclust::cluster_metric_set(
  tidyclust::tot_wss,
  tidyclust::tot_sse,
  tidyclust::sse_ratio,
  tidyclust::avg_silhouette
)
# new_data required for tidyclust::avg_silhouette to work
hc_fit %>%
  many_metrics(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)We use tidyclust::extract_cluster_assignment to see the cluster assignments. However, the labels need to be updated.
Code
Centroid can be extracted using extract_centroids. Here is a preview of the first few columns. For hierarchical clustering, the geometric mean over the predictors for each cluster is used.
Code
hc_fit_centroids <- hc_fit %>%
  tidyclust::extract_centroids()
hc_fit_centroids[, 1:10] %>%
  reactable::reactable(defaultPageSize = 5)Here is how to see the plots. hc_fit$fit gives back the hclust object. We can plot the hierarchical clustering results.
Code
hc_fit$fit$labels <- nci60_scaled$label
factoextra::fviz_dend(hc_fit$fit, main = "complete", k = 3)
Code
stats::cutree(tree = hc_fit$fit, k = 3)>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           1 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           1           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           2           2           2 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           2           2           2           2           2           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           3           1           3           3           3           3 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           3           3           3           3           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1Alternatively, we can use tidyclust::extract_fit_summary as well.
Code
hc_summary <- hc_fit %>%
  tidyclust::extract_fit_summary()Here is how you can do k-means clustering using tidyclust::k_means.
Code
km_spec <- tidyclust::k_means(num_clusters = 5) %>%
  parsnip::set_engine("stats", algorithm = "Lloyd")
km_fit <- km_spec %>%
  tidyclust::fit(label ~ ., data = nci60_scaled)
km_summary <- km_fit %>% extract_fit_summary()stats::predict returns the cluster a new observation belongs to
However, stats::predict on the “training data” will not necessarily produce the same results as tidyclust::extract_cluster_assignment
Code
cbind(
  hc_preds,
  tidyclust::extract_cluster_assignment(hc_fit),
  nci60_scaled$label
) %>%
  reactable::reactable(defaultPageSize = 5)We notice that the three-cluster assignments from hier_clust do not line up perfectly with the five-cluster assignments from k_means.
Code
tidyclust also have a useful function called tidyclust::reconcile_clusterings to match clusters from k-means with the clusters from hierarchical clustering.
Code
cluster_table <- tidyclust::reconcile_clusterings(
  primary_cluster_assignment = hc_preds$.pred_cluster,
  alt_cluster_assignment = km_preds$.pred_cluster,
  one_to_one = FALSE
)
cluster_table %>%
  dplyr::mutate(labels = nci60_scaled$label) %>%
  reactable::reactable(
    defaultPageSize = 5,
    filterable = TRUE
  )We can see that KM_1 and KM_2 is matched to HC_1, KM_3 is matched to HC_2, KM_4 and KM_5 is matched to HC_3.
Blog References
- Tengku Hanis’s blog titled “Explore data using PCA” 
- Julia Silge’s blog titled “PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes” 
- Julia Silge’s blog titled “Getting started with k-means and #TidyTuesday employment status” 
- Tidymodels’ tutorial titled “K-means clustering with tidy data principles” 
- Emil Hvitfeldt’s ISLR tidymodels Labs Chapter 12 
- Eric (R_ic) RPubs’s post titled “Train and Evaluate Clustering Models using Tidymodels and friends” 
- Chapter 21 Hierarchical Clustering from Hands-On Machine Learning with R 
Package References
Code
get_citation <- function(package_name) {
  transform_name <- package_name %>%
    citation() %>%
    format(style = "text")
  return(transform_name[1])
}
packages <- c(
  "base", "ggplot2", "rmarkdown", "cluster",
  "report", "tidytext", "knitr", "broom",
  "cowplot", "dplyr", "factoextra", "forcats",
  "ggforce", "gridExtra", "htmltools", "ISLR2",
  "magrittr", "Matrix", "parsnip", "proxy",
  "purrr", "reactable", "reactablefmtr", "recipes",
  "scales", "softImpute", "stringr", "styler",
  "tibble", "tidyr", "yardstick", "tidyclust",
  "rsample", "tune", "dials", "workflows",
  "Rfast", "flexclust", "langevitour"
)
table <- tibble::tibble(Packages = packages)
table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(
      .data[["Packages"]],
      get_citation
    )
  ) %>%
  dplyr::pull(.data[["transform_name"]]) %>%
  report::as.report_parameters()- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
- Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2022). rmarkdown: Dynamic Documents for R. R package version 2.14, https://github.com/rstudio/rmarkdown.
- Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2022). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.3 - For new features, see the ‘Changelog’ file (in the package source), https://CRAN.R-project.org/package=cluster.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
- Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi:10.21105/joss.00037 https://doi.org/10.21105/joss.00037, http://dx.doi.org/10.21105/joss.00037.
- Xie Y (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.39, https://yihui.org/knitr/.
- Robinson D, Hayes A, Couch S (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0, https://CRAN.R-project.org/package=broom.
- Wilke C (2020). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 1.1.1, https://CRAN.R-project.org/package=cowplot.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Kassambara A, Mundt F (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7, https://CRAN.R-project.org/package=factoextra.
- Wickham H (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1, https://CRAN.R-project.org/package=forcats.
- Pedersen T (2021). ggforce: Accelerating ‘ggplot2’. R package version 0.3.3, https://CRAN.R-project.org/package=ggforce.
- Auguie B (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3, https://CRAN.R-project.org/package=gridExtra.
- Cheng J, Sievert C, Schloerke B, Chang W, Xie Y, Allen J (2021). htmltools: Tools for HTML. R package version 0.5.2, https://CRAN.R-project.org/package=htmltools.
- James G, Witten D, Hastie T, Tibshirani R (2022). ISLR2: Introduction to Statistical Learning, Second Edition. R package version 1.3-1, https://CRAN.R-project.org/package=ISLR2.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Bates D, Maechler M, Jagan M (2022). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.4-1, https://CRAN.R-project.org/package=Matrix.
- Kuhn M, Vaughan D (2022). parsnip: A Common API to Modeling and Analysis Functions. https://github.com/tidymodels/parsnip, https://parsnip.tidymodels.org/.
- Meyer D, Buchta C (2022). proxy: Distance and Similarity Measures. R package version 0.4-27, https://CRAN.R-project.org/package=proxy.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
- Lin G (2022). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.3.0, https://CRAN.R-project.org/package=reactable.
- Cuilla K (2022). reactablefmtr: Streamlined Table Styling and Formatting for Reactable. R package version 2.0.0, https://CRAN.R-project.org/package=reactablefmtr.
- Kuhn M, Wickham H (2022). recipes: Preprocessing and Feature Engineering Steps for Modeling. R package version 1.0.1, https://CRAN.R-project.org/package=recipes.
- Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.0, https://CRAN.R-project.org/package=scales.
- Hastie T, Mazumder R (2021). softImpute: Matrix Completion via Iterative Soft-Thresholded SVD. R package version 1.4-1, https://CRAN.R-project.org/package=softImpute.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Müller K, Walthert L (2022). styler: Non-Invasive Pretty Printing of R Code. R package version 1.7.0, https://CRAN.R-project.org/package=styler.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.8, https://CRAN.R-project.org/package=tibble.
- Wickham H, Girlich M (2022). tidyr: Tidy Messy Data. R package version 1.2.0, https://CRAN.R-project.org/package=tidyr.
- Kuhn M, Vaughan D (2022). yardstick: Tidy Characterizations of Model Performance. R package version 1.0.0, https://CRAN.R-project.org/package=yardstick.
- Hvitfeldt E, Bodwin K (2022). tidyclust: A Common API to Clustering. R package version 0.0.0.9000, https://github.com/EmilHvitfeldt/tidyclust.
- Silge J, Chow F, Kuhn M, Wickham H (2022). rsample: General Resampling Infrastructure. R package version 1.1.0, https://CRAN.R-project.org/package=rsample.
- Kuhn M (2022). tune: Tidy Tuning Tools. R package version 1.0.0, https://CRAN.R-project.org/package=tune.
- Kuhn M, Frick H (2022). dials: Tools for Creating Tuning Parameter Values. R package version 1.0.0, https://CRAN.R-project.org/package=dials.
- Vaughan D (2022). workflows: Modeling Workflows. https://github.com/tidymodels/workflows, https://workflows.tidymodels.org.
- Papadakis M, Tsagris M, Dimitriadis M, Fafalios S, Tsamardinos I, Fasiolo M, Borboudakis G, Burkardt J, Zou C, Lakiotaki K, Chatzipantsiou. C (2022). Rfast: A Collection of Efficient and Extremely Fast R Functions. R package version 2.0.6, https://CRAN.R-project.org/package=Rfast.
- Leisch F (2006). “A Toolbox for K-Centroids Cluster Analysis.” Computational Statistics and Data Analysis, 51(2), 526-544.
- Harrison P (2022). langevitour: Langevin Tour. R package version 0.5, https://CRAN.R-project.org/package=langevitour.


