This document was prepared on 2022-08-23.

Author

Jeremy Selva

Code
# Rmarkdown stuff
library(rmarkdown, quietly = TRUE)
library(knitr, quietly = TRUE)
library(htmltools, quietly = TRUE)
library(styler, quietly = TRUE)

# Dataset
library(ISLR2, quietly = TRUE)

# Session info and package reporting
library(report, quietly = TRUE)

# Data wrangling
library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(purrr, quietly = TRUE)
library(tibble, quietly = TRUE)
library(stringr, quietly = TRUE)

# View PCA
library(langevitour, quietly = TRUE)

# Matrix Completion
library(softImpute, quietly = TRUE)

# Create interactive tables
library(reactable, quietly = TRUE)
library(reactablefmtr, quietly = TRUE)

# For plotting
library(tidytext, quietly = TRUE)
library(scales, quietly = TRUE)
library(forcats, quietly = TRUE)
library(grid, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gridExtra, quietly = TRUE)
library(grDevices, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggforce, quietly = TRUE)

# For applying tidymodels
library(recipes, quietly = TRUE)
# From GitHub remotes::install_github("tidymodels/parsnip")
library(parsnip, quietly = TRUE)
library(broom, quietly = TRUE)
library(yardstick, quietly = TRUE)

# For clustering
library(proxy, quietly = TRUE)
library(factoextra, quietly = TRUE)
library(cluster, quietly = TRUE)

# For tidy clustering
# From GitHub remotes::install_github("tidymodels/workflows@celery")
library(workflows, quietly = TRUE)
library(rsample, quietly = TRUE)
library(tune, quietly = TRUE)
library(dials, quietly = TRUE)
library(tidyclust, quietly = TRUE)
library(Rfast, quietly = TRUE)
library(flexclust, quietly = TRUE)

summary(report::report(sessionInfo()))

The analysis was done using the R Statistical language (v4.2.1; R Core Team, 2022) on Windows 10 x64, using the packages Rcpp (v1.0.9), ggplot2 (v3.3.6), rmarkdown (v2.14), cluster (v2.1.3), report (v0.5.1), lattice (v0.20.45), tidytext (v0.3.3), flexclust (v1.4.1), knitr (v1.39), broom (v1.0.0), cowplot (v1.1.1), dials (v1.0.0), dplyr (v1.0.9), factoextra (v1.0.7), forcats (v0.5.1), ggforce (v0.3.3), gridExtra (v2.3), htmltools (v0.5.2), ISLR2 (v1.3.1), langevitour (v0.5), magrittr (v2.0.3), Matrix (v1.4.1), modeltools (v0.2.23), parsnip (v1.0.0.9000), proxy (v0.4.27), purrr (v0.3.4), RcppZiggurat (v0.1.6), reactable (v0.3.0), reactablefmtr (v2.0.0), recipes (v1.0.1), Rfast (v2.0.6), rsample (v1.1.0), scales (v1.2.0), softImpute (v1.4.1), stringr (v1.4.0), styler (v1.7.0), tibble (v3.1.8), tidyclust (v0.0.0.9000), tidyr (v1.2.0), tune (v1.0.0), workflows (v1.0.0.9000) and yardstick (v1.0.0).

Principal Components Analysis

The USArrests data set from the datasets package contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

Column Name Column Description
Murder Number of arrests for murder (per 100,000)
Assault Number of arrests for assault (per 100,000)
UrbanPop Percent of the population living in urban areas
Rape Number of arrests for rape (per 100,000)

Turn USArrests into a tibble and move the rownames into a column.

Code
USArrests <- dplyr::as_tibble(datasets::USArrests,
  rownames = "state"
)

USArrests %>%
  reactable::reactable(
    defaultPageSize = 5,
    filterable = TRUE
  )

We create a recipe with recipes::recipe, recipes::all_numeric_predictors(), recipes::step_normalize, and recipes::step_pca.

By default, recipes::step_pca uses stats::prcomp with the argument defaults set to retx = FALSE, center = FALSE, scale. = FALSE, and tol = NULL.
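
For intuition, here is a rough base R equivalent (a sketch; the recipe normalises via recipes::step_normalize rather than through stats::prcomp's own arguments):

Code
# Standardise the four numeric columns, then run PCA. The scores should
# match the PC1-PC4 columns produced by bake() below, up to sign.
base_pca <- stats::prcomp(datasets::USArrests, center = TRUE, scale. = TRUE)
head(base_pca$x)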

Code
pca_recipe <- recipes::recipe(
  formula = ~.,
  data = USArrests
) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::step_pca(recipes::all_numeric_predictors(),
    id = "pca",
    num_comp = 4
  )

pca_recipe
> Recipe
> 
> Inputs:
> 
>       role #variables
>  predictor          5
> 
> Operations:
> 
> Centering and scaling for recipes::all_numeric_predictors()
> PCA extraction with recipes::all_numeric_predictors()

Apply recipes::prep and recipes::bake to compute the scores of the principal component analysis.

Code
pca_estimates <- recipes::prep(x = pca_recipe, training = USArrests)

pca_estimates %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
Code
pca_score <- pca_estimates %>%
  recipes::bake(new_data = USArrests)


pca_score %>%
  reactable::reactable(defaultPageSize = 5)

We can explore the results of the PCA using recipes::prep and parsnip::tidy. We can see that PCA is done at step number 2.

Code
pca_estimates %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)

As such, we extract the pca loadings of the pca step.

Code
tidied_pca_loadings <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")


tidied_pca_loadings %>%
  reactable::reactable(defaultPageSize = 5)

Here are the pca loadings in wide table form.

Code
tidied_pca_loadings %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]]
  ) %>%
  dplyr::select(-dplyr::all_of("id")) %>%
  reactable::reactable(defaultPageSize = 5)

We make a visualization to see what the loadings of the four components look like.

Code
tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["value"]],
    y = .data[["terms"]],
    fill = .data[["terms"]]
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)

Let us take a closer look at the four variables that contribute to each of the four components.

Code
tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(4, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(
      x = .data[["terms"]],
      by = abs(.data[["value"]]),
      within = .data[["component"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = abs(.data[["value"]]),
    y = .data[["terms"]],
    fill = .data[["value"]] > 0
  )) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[["component"]]),
    scales = "free_y"
  ) +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PCA component",
    y = NULL,
    fill = "Positive?"
  )

How much variation are we capturing for each component?

Here are the variance statistics in long form.

Code
tidied_pca_variance <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")

tidied_pca_variance %>%
  reactable::reactable(defaultPageSize = 5)

Here are the variance statistics in wide form.

Code
tidied_pca_variance %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]],
    names_prefix = "PC"
  ) %>%
  dplyr::select(-dplyr::all_of("id")) %>%
  reactable::reactable(defaultPageSize = 5)

Here is a simple plot showing the variance captured by each component.

Code
# Get the variance
percent_variation <- tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>%
  dplyr::pull(.data[["value"]])

percent_variation <- percent_variation / 100

# I use [1:4] to select the first four components
dplyr::tibble(
  component = unique(tidied_pca_loadings$component)[1:4],
  percent_var = percent_variation[1:4]
) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["percent_var"]]
  )) +
  ggplot2::geom_col() +
  ggplot2::geom_text(
    mapping = ggplot2::aes(
      label = paste(round(.data[["percent_var"]] * 100, 0), "%")
    ),
    vjust = -0.2
  ) +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Percent variance \nexplained by each \nPCA component"
  )

Here is a PCA biplot of PC1 and PC2.

Code
filtered_loadings <- tidied_pca_loadings %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]]
  )

ggplot2::ggplot() +
  ggplot2::geom_point(
    data = pca_score,
    mapping = ggplot2::aes(
      x = .data[["PC1"]],
      y = .data[["PC2"]]
    ),
    size = 2,
    alpha = 0.6
  ) +
  ggplot2::geom_text(
    data = pca_score,
    mapping = ggplot2::aes(
      x = .data[["PC1"]],
      y = .data[["PC2"]],
      label = .data[["state"]]
    ),
    check_overlap = TRUE,
    size = 2.5,
    hjust = "inward"
  ) +
  ggplot2::geom_segment(
    data = filtered_loadings,
    mapping = ggplot2::aes(
      x = 0,
      y = 0,
      xend = .data[["PC1"]] * 2.5,
      yend = .data[["PC2"]] * 2.5
    ),
    arrow = ggplot2::arrow(length = grid::unit(
      x = 0.5,
      units = "picas"
    )),
    color = "blue"
  ) +
  ggplot2::annotate(
    geom = "text",
    x = filtered_loadings[["PC1"]] * 2.7,
    y = filtered_loadings[["PC2"]] * 2.7,
    label = filtered_loadings[["terms"]]
  ) +
  ggplot2::coord_fixed(clip = "off") +
  ggplot2::theme_minimal()

Matrix Completion

Initialisation

Code
missing_data <- data.frame(
  SBP = c(-3, NA, -1, 1, 1, NA),
  DBP = c(NA, -2, 0, NA, 2, 4)
)

colour_missing_cells <- function(value, index, name) {
  missing_indices <- which(is.na(missing_data[[name]]))
  if (index %in% missing_indices) {
    list(background = "#aaf0f8")
  }
}

# Column means of the observed values (step 1 of Algorithm 12.1)
xbar <- colMeans(missing_data, na.rm = TRUE)

# Missing entries are replaced by the corresponding column means
initialisation_data <- data.frame(
  SBP = c(-3, -0.5, -1, 1, 1, -0.5),
  DBP = c(1, -2, 0, 1, 2, 4)
)

# Convergence threshold and the logical mask of missing entries
thresh <- 1e-2
ismiss <- is.na(missing_data)
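
As a quick check (a sketch; this just re-derives the hard-coded initialisation from xbar), filling each missing entry with its column mean reproduces initialisation_data:

Code
init_check <- missing_data
for (j in seq_along(init_check)) {
  # Replace NAs in column j with that column's observed mean
  init_check[[j]][is.na(init_check[[j]])] <- xbar[[j]]
}
all.equal(init_check, initialisation_data)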

Functions

These are the implementations for calculating steps 2(a) and 2(b) of Algorithm 12.1.
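
Concretely, step 2(a) solves

\[ \min_{\mathbf{A}, \mathbf{B}} \sum_{(i,j) \in \mathcal{O}} \left( x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm} \right)^2, \]

where \(\mathcal{O}\) is the set of observed entries and \(M\) is num_comp, and step 2(b) replaces each missing \(x_{ij}\) with \(\hat{x}_{ij} = \sum_{m=1}^{M} \hat{a}_{im} \hat{b}_{jm}\).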

Code
fit_pca <- function(input_data, num_comp = 1) {
  # PCA without centering (the data is already initialised/centred)
  data_pca <- prcomp(input_data, center = FALSE)

  # Keep the first num_comp loadings; the loadings matrix is orthogonal,
  # so solve() here is equivalent to taking the transpose
  pca_loadings <- data_pca$rotation
  pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
  pca_scores <- data_pca$x[, 1:num_comp, drop = FALSE]

  # Rank-num_comp approximation of the input data
  estimated_data <- pca_scores %*% pca_loadings_inverse
  return(estimated_data)
}
Code
fit_svd <- function(input_data, num_comp = 1) {
  # Singular value decomposition: X = U D V'
  svd_object <- svd(input_data)

  # The first num_comp right singular vectors act as the PCA loadings
  pca_loadings <- svd_object$v[, 1:num_comp, drop = FALSE]
  pca_loadings_inverse <- t(pca_loadings)

  # Scores are U D, restricted to the first num_comp columns
  pca_scores <- (svd_object$u %*% diag(svd_object$d))[, 1:num_comp, drop = FALSE]

  # Rank-num_comp approximation of the input data
  estimated_data <- pca_scores %*% pca_loadings_inverse

  return(estimated_data)
}
Code
fit_pca_tidy <- function(input_data, num_comp = 1) {
  # PCA via recipes; there is no normalisation step, so this matches fit_pca()
  pca_recipe <- recipes::recipe(
    formula = ~.,
    data = input_data
  ) %>%
    recipes::step_pca(recipes::all_numeric_predictors(),
      id = "pca"
    )

  pca_estimates <- recipes::prep(x = pca_recipe, training = input_data)

  # Scores of the first num_comp components
  pca_scores <- pca_estimates %>%
    recipes::bake(new_data = input_data) %>%
    as.matrix()
  pca_scores <- pca_scores[, 1:num_comp, drop = FALSE]

  # Loadings matrix recovered from the tidied pca step
  pca_loadings <- pca_estimates %>%
    parsnip::tidy(id = "pca", type = "coef") %>%
    tidyr::pivot_wider(
      names_from = .data[["component"]],
      values_from = .data[["value"]]
    ) %>%
    dplyr::select(-c("terms", "id")) %>%
    as.matrix()

  # The loadings matrix is orthogonal, so its inverse is its transpose
  pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]

  # Rank-num_comp approximation of the input data
  estimated_data <- pca_scores %*% pca_loadings_inverse

  return(estimated_data)
}
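
As a sanity check (a sketch), the implementations should agree on the rank-1 approximation, since the inverse of an orthogonal loadings matrix is its transpose:

Code
# Dimnames differ between the implementations, so strip them before comparing
all.equal(
  unname(fit_pca(initialisation_data, num_comp = 1)),
  unname(fit_svd(initialisation_data, num_comp = 1))
)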

Iteration 1

Code
initialisation_data_pca <- initialisation_data

# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)

# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]

# Step 2(c)
mss_old_pca <- mean(((missing_data - Xapp)[!ismiss])^2)

initialisation_data_pca %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )
Code
mss_old_pca
> [1] 1.48974
Code
initialisation_data_svd <- initialisation_data

# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)

# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]

# Step 2(c)
mss_old_svd <- mean(((missing_data - Xapp)[!ismiss])^2)

initialisation_data_svd %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )
Code
mss_old_svd
> [1] 1.48974
Code
initialisation_data_tidymodels <- initialisation_data

# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)

# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]

# Step 2(c)
mss_old_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)

initialisation_data_tidymodels %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  )
Code
mss_old_tidymodels
> [1] 1.48974

The rest of the iterations

Code
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1

  # Step 2(a)
  Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)

  # Step 2(b)
  initialisation_data_pca[ismiss] <- Xapp[ismiss]

  # Step 2(c)
  mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_pca - mss_pca
  mss_old_pca <- mss_pca
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_pca, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_pca)
}
> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>              SBP        DBP
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>              SBP        DBP
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>              SBP         DBP
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>             SBP       DBP
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>             SBP        DBP
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000
Code
iter <- 1
rel_err <- mss_old_svd
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1

  # Step 2(a)
  Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)

  # Step 2(b)
  initialisation_data_svd[ismiss] <- Xapp[ismiss]

  # Step 2(c)
  mss_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_svd - mss_svd
  mss_old_svd <- mss_svd
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_svd, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_svd)
}
> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>             [,1]        [,2]
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>            [,1]      [,2]
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000
Code
iter <- 1
rel_err <- mss_old_tidymodels
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1

  # Step 2(a)
  Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)

  # Step 2(b)
  initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]

  # Step 2(c)
  mss_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_tidymodels - mss_tidymodels
  mss_old_tidymodels <- mss_tidymodels
  cat(
    "Iter:", iter, "\n",
    "MSS:", mss_tidymodels, "\n",
    "Rel. Err:", rel_err, "\n"
  )
  print("Xapp")
  print(Xapp)
  print("Data")
  print(initialisation_data_tidymodels)
}
> Iter: 2 
>  MSS: 1.43288 
>  Rel. Err: 0.05686009 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.24567017  1.6296426
> [2,]  0.29799874 -1.9767619
> [3,] -0.02222086  0.1474011
> [4,] -0.11358385  0.7534536
> [5,] -0.27258141  1.8081571
> [6,] -0.59617741  3.9547173
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  1.6296426
> 2  0.2979987 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.7534536
> 5  1.0000000  2.0000000
> 6 -0.5961774  4.0000000
> Iter: 3 
>  MSS: 1.222089 
>  Rel. Err: 0.2107907 
> [1] "Xapp"
>             [,1]       [,2]
> [1,] -0.73055871  2.3385975
> [2,]  0.59572858 -1.9069916
> [3,] -0.08891186  0.2846165
> [4,] -0.12553344  0.4018461
> [5,] -0.48032110  1.5375598
> [6,] -1.19147316  3.8140345
> [1] "Data"
>          SBP        DBP
> 1 -3.0000000  2.3385975
> 2  0.5957286 -2.0000000
> 3 -1.0000000  0.0000000
> 4  1.0000000  0.4018461
> 5  1.0000000  2.0000000
> 6 -1.1914732  4.0000000
> Iter: 4 
>  MSS: 0.8998268 
>  Rel. Err: 0.3222623 
> [1] "Xapp"
>             [,1]        [,2]
> [1,] -1.53549232  3.07087697
> [2,]  0.91916923 -1.83827401
> [3,] -0.20001122  0.40000841
> [4,]  0.03926938 -0.07853601
> [5,] -0.60000561  1.19996915
> [6,] -1.83834165  3.67655441
> [1] "Data"
>          SBP         DBP
> 1 -3.0000000  3.07087697
> 2  0.9191692 -2.00000000
> 3 -1.0000000  0.00000000
> 4  1.0000000 -0.07853601
> 5  1.0000000  2.00000000
> 6 -1.8383417  4.00000000
> Iter: 5 
>  MSS: 0.6967107 
>  Rel. Err: 0.2031161 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.1668365  3.5757572
> [2,]  1.1333206 -1.8702286
> [3,] -0.2685842  0.4432231
> [4,]  0.3033932 -0.5006656
> [5,] -0.6178620  1.0196084
> [6,] -2.2666421  3.7404586
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.5757572
> 2  1.133321 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.5006656
> 5  1.000000  2.0000000
> 6 -2.266642  4.0000000
> Iter: 6 
>  MSS: 0.612954 
>  Rel. Err: 0.08375675 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.5102026  3.8916848
> [2,]  1.2439901 -1.9286163
> [3,] -0.2938087  0.4555054
> [4,]  0.5218645 -0.8090711
> [5,] -0.6172021  0.9568773
> [6,] -2.4879805  3.8572330
> [1] "Data"
>         SBP        DBP
> 1 -3.000000  3.8916848
> 2  1.243990 -2.0000000
> 3 -1.000000  0.0000000
> 4  1.000000 -0.8090711
> 5  1.000000  2.0000000
> 6 -2.487981  4.0000000
> Iter: 7 
>  MSS: 0.5807517 
>  Rel. Err: 0.03220227 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.6813227  4.1000892
> [2,]  1.2887789 -1.9707096
> [3,] -0.2995594  0.4580650
> [4,]  0.6701665 -1.0247713
> [5,] -0.6165707  0.9428163
> [6,] -2.5775579  3.9414194
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.100089
> 2  1.288779 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.024771
> 5  1.000000  2.000000
> 6 -2.577558  4.000000
> Iter: 8 
>  MSS: 0.5665195 
>  Rel. Err: 0.01423225 
> [1] "Xapp"
>            [,1]      [,2]
> [1,] -2.7628696  4.254096
> [2,]  1.2959129 -1.995367
> [3,] -0.2966660  0.456788
> [4,]  0.7647692 -1.177544
> [5,] -0.6169100  0.949880
> [6,] -2.5918258  3.990734
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.254096
> 2  1.295913 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.177544
> 5  1.000000  2.000000
> 6 -2.591826  4.000000
> Iter: 9 
>  MSS: 0.558216 
>  Rel. Err: 0.00830349 
> [1] "Xapp"
>            [,1]       [,2]
> [1,] -2.7988324  4.3825676
> [2,]  1.2826615 -2.0084627
> [3,] -0.2896952  0.4536209
> [4,]  0.8238540 -1.2900364
> [5,] -0.6175465  0.9669887
> [6,] -2.5653230  4.0169255
> [1] "Data"
>         SBP       DBP
> 1 -3.000000  4.382568
> 2  1.282661 -2.000000
> 3 -1.000000  0.000000
> 4  1.000000 -1.290036
> 5  1.000000  2.000000
> 6 -2.565323  4.000000

SoftImpute
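
For comparison, the softImpute package implements a regularised variant of this idea, iteratively replacing the missing values using a soft-thresholded SVD. Below, our Algorithm 12.1 result is shown next to the softImpute::softImpute one.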

Code
initialisation_data_tidymodels %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  ) %>%
  reactablefmtr::add_title(
    title = "Algorithm 12.1"
  )
fits <- softImpute::softImpute(missing_data, type = "svd")
softImpute::complete(missing_data, fits) %>%
  reactable::reactable(
    defaultPageSize = 6,
    columns = list(
      SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
      DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
    ),
    defaultColDef = reactable::colDef(
      style = colour_missing_cells
    )
  ) %>%
  reactablefmtr::add_title(
    title = "Soft Impute"
  )


K-Means Clustering

Introduction

We begin with a simple simulated example in which there truly are two clusters in the data: the first twenty-five observations are drawn around centre (0, 0) and the next twenty-five around (3, -4).

Code
set.seed(2)

# Cluster centres at (0,0) and (3,-4)
centers <- tibble::tibble(
  cluster = factor(1:2),
  num_points = c(25, 25), # number points in each cluster
  V1 = c(0, 3), # V1 coordinate of cluster center
  V2 = c(0, -4) # V2 coordinate of cluster center
)

labelled_points <- centers %>%
  dplyr::mutate(
    V1 = purrr::map2(.data[["num_points"]], .data[["V1"]], rnorm),
    V2 = purrr::map2(.data[["num_points"]], .data[["V2"]], rnorm)
  ) %>%
  dplyr::select(-c("num_points")) %>%
  tidyr::unnest(cols = c("V1", "V2"))

labelled_points %>%
  reactable::reactable(
    defaultPageSize = 10,
    columns = list(
      V1 = reactable::colDef(format = reactable::colFormat(digits = 2)),
      V2 = reactable::colDef(format = reactable::colFormat(digits = 2))
    )
  )
Code
ggplot2::ggplot(
  data = labelled_points,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]],
    color = .data[["cluster"]]
  )
) +
  ggplot2::geom_point(alpha = 1)

We now perform K-means clustering with \(K = 2\).

We’ll use the built-in stats::kmeans function, which accepts a data frame with all numeric columns as its primary argument.

Code
points <- labelled_points %>%
  dplyr::select(-c("cluster"))

result_kmeans <- stats::kmeans(
  x = points,
  centers = 2,
  nstart = 20
)

result_kmeans
> K-means clustering with 2 clusters of sizes 26, 24
> 
> Cluster means:
>         V1        V2
> 1 0.361163 -0.162447
> 2 2.877777 -4.262629
> 
> Clustering vector:
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
> 
> Within cluster sum of squares by cluster:
> [1] 68.52253 58.53435
>  (between_SS / total_SS =  69.5 %)
> 
> Available components:
> 
> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
> [6] "betweenss"    "size"         "iter"         "ifault"

The broom::tidy function returns information for each cluster, including their mean position (centroid), size and within-cluster sum-of-squares.

Code
result_kmeans %>%
  broom::tidy() %>%
  reactable::reactable()

The broom::glance function returns model-level metrics. One of these is tot.withinss, the total within-cluster sum-of-squares that we seek to minimize when we perform K-means clustering.

Code
result_kmeans %>%
  broom::glance() %>%
  reactable::reactable()

We can see which cluster each observation belongs to by using broom::augment, which “predicts” the cluster a given observation belongs to.

Code
predicted_kmeans <- result_kmeans %>%
  broom::augment(data = points)

predicted_kmeans %>%
  reactable::reactable(defaultPageSize = 5)
Code
ggplot2::ggplot(
  data = predicted_kmeans,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]],
    color = .data[[".cluster"]]
  )
) +
  ggplot2::geom_point(alpha = 1)

Exploration

In reality, we will not know how many clusters are required for a given dataset. We need to try a number of different values of \(K\) and then find the best one.

Let us try \(K\) from 1 to 9.

Code
kclusts <- tibble::tibble(k = 1:9) %>%
  dplyr::mutate(
    kclust = purrr::map(
      .x = .data[["k"]],
      .f = ~ kmeans(x = points, centers = .x)
    )
  ) %>%
  dplyr::mutate(
    tidied = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::tidy
    ),
    glanced = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::glance
    ),
    augmented = purrr::map(
      .x = .data[["kclust"]],
      .f = broom::augment, data = points
    )
  )

kclusts
> # A tibble: 9 × 5
>       k kclust   tidied           glanced          augmented        
>   <int> <list>   <list>           <list>           <list>           
> 1     1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 2     2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 3     3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 4     4 <kmeans> <tibble [4 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 5     5 <kmeans> <tibble [5 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 6     6 <kmeans> <tibble [6 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 7     7 <kmeans> <tibble [7 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 8     8 <kmeans> <tibble [8 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 9     9 <kmeans> <tibble [9 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>

Let assignments be a dataset of which cluster each point goes to, for \(K\) from 1 to 9.

Code
assignments <- kclusts %>%
  dplyr::select(c("k", "augmented")) %>%
  tidyr::unnest(cols = c("augmented"))

assignments %>%
  reactable::reactable(defaultPageSize = 5)

Now we can plot the original points using the data from assignments, with each point coloured according to the predicted cluster.

Code
ggplot2::ggplot(
  data = assignments,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]]
  )
) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(color = .data[[".cluster"]]),
    alpha = 0.8
  ) +
  ggplot2::facet_wrap(ggplot2::vars(.data[["k"]]))

Let clusters be a dataset of cluster mean positions (centroids) for \(K\) from 1 to 9.

Code
clusters <- kclusts %>%
  dplyr::select(c("k", "tidied")) %>%
  tidyr::unnest(cols = c("tidied"))

clusters %>%
  reactable::reactable(defaultPageSize = 5)

We can then add the centers of the clusters using the data from clusters.

Code
ggplot2::ggplot(
  data = assignments,
  mapping = ggplot2::aes(
    x = .data[["V1"]],
    y = .data[["V2"]]
  )
) +
  ggplot2::geom_point(
    mapping = ggplot2::aes(color = .data[[".cluster"]]),
    alpha = 0.8
  ) +
  ggplot2::facet_wrap(ggplot2::vars(.data[["k"]])) +
  ggplot2::geom_point(
    data = clusters,
    size = 10,
    shape = "x"
  )
  )

Let clusterings be a dataset of cluster metrics for \(K\) from 1 to 9.

Code
clusterings <- kclusts %>%
  dplyr::select(c("k", "glanced")) %>%
  tidyr::unnest(cols = c("glanced"))

clusterings %>%
  reactable::reactable(defaultPageSize = 5)

Now that we have the total within-cluster sum-of-squares for \(K\) from 1 to 9, we can plot them against k so we can use the elbow method to find the optimal number of clusters.

Code
ggplot2::ggplot(
  data = clusterings,
  mapping = ggplot2::aes(
    x = .data[["k"]],
    y = .data[["tot.withinss"]]
  )
) +
  ggplot2::geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
  ggplot2::geom_point(size = 2, color = "midnightblue")

We see an elbow at k = 2, which makes us happy since the data set was specifically created to have 2 clusters. We can now extract the model where k = 2 from kclusts.

Code
final_kclusts <- kclusts %>%
  dplyr::filter(.data[["k"]] == 2) %>%
  dplyr::pull(.data[["kclust"]]) %>%
  purrr::pluck(1)
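
As a quick check (a sketch), we can cross-tabulate the recovered clusters against the true simulated labels; the cluster numbering is arbitrary, so only the grouping matters:

Code
table(
  truth = labelled_points$cluster,
  fitted = final_kclusts$cluster
)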

Hierarchical Clustering

The stats::dist function is used to calculate the Euclidean distance dissimilarity measure.

Code
dissimilarity_measure <- points %>%
  stats::dist(method = "euclidean")

dissimilarity_measure %>%
  as.matrix() %>%
  reactable::reactable()

The stats::hclust function implements hierarchical clustering in R. We use the data used for K-means clustering to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.

Code
res_hclust_complete <- dissimilarity_measure %>%
  stats::hclust(method = "complete")

res_hclust_average <- dissimilarity_measure %>%
  stats::hclust(method = "average")

res_hclust_single <- dissimilarity_measure %>%
  stats::hclust(method = "single")

We use factoextra::fviz_dend to visualize the clustering created using hclust.

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the stats::cutree function.

Code
factoextra::fviz_dend(res_hclust_complete, main = "complete", k = 2)

Code
stats::cutree(tree = res_hclust_complete, k = 2)
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
Code
results_hclust <-
  tibble::tibble(
    cluster_id = stats::cutree(tree = res_hclust_complete, k = 2)
  ) %>%
  dplyr::bind_cols(points) %>%
  dplyr::mutate(cluster_id = as.factor(.data$cluster_id))

results_hclust %>%
  reactable::reactable(defaultPageSize = 5)
Code
factoextra::fviz_dend(res_hclust_average, main = "average", k = 2)

Code
stats::cutree(tree = res_hclust_average, k = 2)
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
Code
results_hclust <-
  tibble::tibble(
    cluster_id = stats::cutree(tree = res_hclust_average, k = 2)
  ) %>%
  dplyr::bind_cols(points) %>%
  dplyr::mutate(cluster_id = as.factor(.data$cluster_id))

results_hclust %>%
  reactable::reactable(defaultPageSize = 5)
Code
factoextra::fviz_dend(res_hclust_single, main = "single", k = 4)

Code
stats::cutree(tree = res_hclust_single, k = 4)
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3
> [39] 3 3 3 1 3 4 3 4 3 3 3 3
Code
results_hclust <-
  tibble::tibble(
    cluster_id = stats::cutree(tree = res_hclust_single, k = 2)
  ) %>%
  dplyr::bind_cols(points) %>%
  dplyr::mutate(cluster_id = as.factor(.data$cluster_id))

results_hclust %>%
  reactable::reactable(defaultPageSize = 5)

We analyze the nature of the clusters mathematically by evaluating the agglomerative coefficient (AC), which measures the clustering structure of the dataset, with values closer to 1 suggesting a more balanced clustering structure and values closer to 0 suggesting less well-formed clusters. However, the agglomerative coefficient tends to become larger as the number of samples increases, so it should not be used to compare across data sets of very different sizes.

cluster::agnes allows us to compute this metric on the results from hierarchical clustering.

It looks like the complete linkage did the best.

Code
# Compute agglomerative coefficient values

ac_metric <- list(
  complete_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "complete"
  )$ac,
  average_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "average"
  )$ac,
  single_ac = cluster::agnes(
    x = dissimilarity_measure,
    metric = "euclidean",
    method = "single"
  )$ac
)

ac_metric
> $complete_ac
> [1] 0.9359608
> 
> $average_ac
> [1] 0.8887208
> 
> $single_ac
> [1] 0.6700538

Although hierarchical clustering provides a fully connected dendrogram representing the cluster relationships, we may still need to choose the preferred number of clusters to extract. Fortunately, we can apply approaches similar to those discussed for K-means clustering with factoextra::fviz_nbclust.

Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "wss",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")

p2 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "silhouette",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")

p3 <- factoextra::fviz_nbclust(
  x = points,
  FUN = factoextra::hcut,
  method = "gap_stat",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")

# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)

Another way of calculating distances is based on Pearson correlation, using proxy::dist. However, this only makes sense if we have data with three or more variables.

Code
# correlation based distance
set.seed(2)
three_var_data <- matrix(rnorm(30 * 3), ncol = 3)

three_var_data %>%
  proxy::dist(method = "correlation") %>%
  stats::hclust(method = "complete") %>%
  factoextra::fviz_dend(main = "complete", k = 2)
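
To see why two variables are degenerate (a small sketch): after centring a row across only two features, the two centred values are equal and opposite, so any pair of rows correlates to exactly \(\pm 1\) and the correlation-based distances collapse to 0 or 2.

Code
set.seed(2)
two_var_data <- matrix(rnorm(10 * 2), ncol = 2)

# Expect only values (numerically) equal to 0 (correlation +1) or 2 (correlation -1)
range(proxy::dist(two_var_data, method = "correlation"))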

Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "wss",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")

p2 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "silhouette",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")

p3 <- factoextra::fviz_nbclust(
  x = three_var_data,
  FUN = factoextra::hcut,
  method = "gap_stat",
  hc_metric = "pearson",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")

# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)

NCI60 Data Example

Unsupervised techniques are often used in the analysis of genomic data.

We illustrate these techniques on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.

The label column gives the type of cancer, while V_1 to V_6830 each represent a gene.

Code
nci60 <- ISLR2::NCI60$data %>%
  tibble::as_tibble() %>%
  magrittr::set_colnames(., paste0("V_", 1:ncol(.))) %>%
  dplyr::mutate(label = factor(ISLR2::NCI60$labs)) %>%
  dplyr::relocate(.data[["label"]])

# head(nci60) %>%
#   reactable::reactable(defaultPageSize = 10)

PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one and plot the first few principal component score vectors, in order to visualize the data.

Code
nci60_preproc_rec <- recipes::recipe(formula = label ~ ., data = nci60) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::step_pca(recipes::all_numeric_predictors(),
    id = "pca",
    threshold = 0.99
  )
Code
nci60_pca_estimates <- recipes::prep(x = nci60_preproc_rec)

nci60_pca_estimates %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
Code
nci60_pca_score <- nci60_pca_estimates %>%
  recipes::bake(new_data = NULL)


head(nci60_pca_score) %>%
  reactable::reactable(defaultPageSize = 5)

We visualise the projections of the NCI60 cancer cell lines onto the first three principal components.

Code
colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))

nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC02")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["PC01"]],
    y = .data[["PC02"]],
    color = .data[["label"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::theme_bw()

Code
nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC03")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["PC01"]],
    y = .data[["PC03"]],
    color = .data[["label"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::theme_bw()

Here is a scatter plot matrix.

Code
nci60_pca_score %>%
  dplyr::select(c("label", "PC01", "PC02", "PC03")) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .panel_x,
    y = .panel_y,
    color = .data[["label"]],
    fill = .data[["label"]]
  )) +
  ggplot2::geom_point(alpha = 0.4, size = 0.5) +
  ggforce::geom_autodensity(alpha = .3) +
  ggforce::facet_matrix(rows = ggplot2::vars(-c("label")), layer.diag = 2) +
  ggplot2::scale_color_manual(values = colors) +
  ggplot2::scale_fill_manual(values = colors) +
  cowplot::theme_minimal_grid() +
  cowplot::panel_border(color = "black") +
  ggplot2::theme(legend.position = "bottom")

To view them interactively, consider using langevitour::langevitour from the R package langevitour.

Code
nci60_pca_score_filtered <- nci60_pca_score %>%
  dplyr::select(c(
    "label",
    "PC01", "PC02", "PC03",
    "PC04", "PC05", "PC06"
  ))

langevitour::langevitour(
  nci60_pca_score_filtered[, -1], nci60_pca_score_filtered$label,
  pointSize = 2
)

How much variation are we capturing for each component?

We extract the pca loadings of the pca step.

Code
nci60_pca_loadings <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")


head(nci60_pca_loadings) %>%
  reactable::reactable(defaultPageSize = 5)

Here are the variance statistics in long form.

Code
nci60_pca_variance <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")

Here are the variance statistics in wide form.

Code
nci60_pca_variance %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]],
    names_prefix = "PC"
  ) %>%
  dplyr::select(-dplyr::all_of("id")) %>%
  reactable::reactable(defaultPageSize = 5)

Here is a simple plot to show the variance captured for each component.

Code
# Get the variance
percent_variation <- nci60_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>%
  dplyr::pull(.data[["value"]])

percent_variation <- percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
  stringr::str_remove(pattern = "PC") %>%
  as.integer()

# Use all extracted components this time (no [1:4] subsetting needed)
dplyr::tibble(
  component = component,
  percent_var = percent_variation
) %>%
  # dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["percent_var"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Percent variance \nexplained by each \nPCA component"
  )

We see that together, the first seven principal components explain around 40% of the variance in the data.

Code
# Get the variance
cumulative_percent_variation <- nci60_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
  dplyr::pull(.data[["value"]])

cumulative_percent_variation <- cumulative_percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
  stringr::str_remove(pattern = "PC") %>%
  as.integer()

# Use all extracted components this time (no [1:4] subsetting needed)
dplyr::tibble(
  component = component,
  cumulative_percent_var = cumulative_percent_variation
) %>%
  # dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["cumulative_percent_var"]]
  )) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
  ggplot2::labs(
    x = NULL,
    y = "Cumulative percent \nvariance explained \nby each PCA \ncomponent"
  )

Clustering on NCI60 dataset (Base R)

We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.

We first preprocess the data for hierarchical clustering.

Code
nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)

We choose the preferred number of clusters to extract.

Code
nci60_num_cluster <- nci60_scaled %>%
  dplyr::select(-c("label"))

# Plot cluster results
p1 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "wss",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle(label = "(A) Elbow method")

p2 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "silhouette",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle("(B) Silhouette method")

p3 <- factoextra::fviz_nbclust(
  x = nci60_num_cluster,
  FUN = factoextra::hcut,
  method = "gap_stat",
  hc_metric = "euclidean",
  k.max = 10
) +
  ggplot2::ggtitle("(C) Gap statistic")

# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)

We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.

Code
nci60_complete <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")

nci60_average <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "average")

nci60_single <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "single")

We then visualize them to see if any of them have some good natural separations.

Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one.

Complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage.

Code
nci60_complete$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_complete, main = "complete", k = 4)

Code
stats::cutree(tree = nci60_complete, k = 4)
>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           2           2 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           2           2           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           2           2           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           3           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           3           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           4           1           4           4           4           4 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           4           4           4           4           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1
Code
nci60_average$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_average, main = "average", k = 4)

Code
stats::cutree(tree = nci60_average, k = 4)
>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           1 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           2           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           3           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           3           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           1           1           1           1           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           4           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1
Code
nci60_single$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_single, main = "single", k = 4)

Code
stats::cutree(tree = nci60_single, k = 4)
>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           3           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           1           1           1 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           1           1           1           1           4           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           1           1           1           1           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1

We will use complete linkage hierarchical clustering for the analysis that follows. We calculate which label is the most common within each cluster.

Code
tibble::tibble(
  label = nci60$label,
  cluster_id = stats::cutree(nci60_complete, k = 4)
) %>%
  dplyr::count(.data[["label"]], .data[["cluster_id"]]) %>%
  dplyr::group_by(.data[["cluster_id"]]) %>%
  dplyr::mutate(
    total = sum(.data[["n"]]),
    prop = .data[["n"]] / sum(.data[["n"]])
  ) %>%
  dplyr::slice_max(n = 1, order_by = .data[["prop"]]) %>%
  dplyr::ungroup() %>%
  reactable::reactable()

How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?

Code
nci60_kmeans_results <- stats::kmeans(nci60_num_cluster, centers = 4, nstart = 50)
Code
nci60_kmeans_results %>%
  broom::tidy() %>%
  dplyr::select(cluster, size, withinss) %>%
  reactable::reactable()

We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different.

Code
cluster_kmeans <- nci60_kmeans_results$cluster
cluster_hclust <- stats::cutree(nci60_complete, k = 4)

table(cluster_hclust, cluster_kmeans)
>               cluster_kmeans
> cluster_hclust  1  2  3  4
>              1  0  9 20 11
>              2  0  0  7  0
>              3  8  0  0  0
>              4  0  0  0  9
Code
tibble::tibble(
  kmeans = factor(cluster_kmeans),
  hclust = factor(cluster_hclust)
) %>%
  yardstick::conf_mat(
    truth = kmeans,
    estimate = hclust,
    dnn = c("Hierarchical clustering", "K-means")
  ) %>%
  parsnip::autoplot(type = "heatmap")

Cluster 2 in K-means clustering is identical to cluster 4 in hierarchical clustering. However, the other clusters differ. Of the twenty-seven samples in K-means cluster 3, twenty go to hierarchical clustering cluster 1 and seven go to hierarchical clustering cluster 2.

Rather than performing hierarchical clustering on the entire data matrix, we can also perform hierarchical clustering on the first few principal component score vectors, as follows:

Code

nci60_pca <- recipes::recipe(~., nci60_scaled) %>%
  recipes::step_pca(recipes::all_numeric_predictors(), num_comp = 5) %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)
Code
nci60_pca_complete <- nci60_pca %>%
  dplyr::select(-c("label")) %>% # drop the factor column before computing distances
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")

nci60_pca_complete$labels <- nci60_pca$label
factoextra::fviz_dend(nci60_pca_complete, main = "complete", k = 4)

Code
stats::cutree(tree = nci60_pca_complete, k = 4)
>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           2           1           2           2           2 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           2           2           2           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           2 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           2           2           2           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           1           3           3 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           3           3           3           3           1           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           1           1           1           1           1           1 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           4           4           4           4           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           2           2           2           2           2 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           2           2           2           2
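
As a quick comparison (an addition to the original analysis), we can cross-tabulate these PCA-based cluster assignments against the four clusters obtained earlier on the full data matrix.

Code
# Cross-tabulate clusters from the full data matrix against clusters
# from the first five principal component score vectors
table(
  full_data = stats::cutree(nci60_complete, k = 4),
  pca_scores = stats::cutree(nci60_pca_complete, k = 4)
)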

Clustering on NCI60 dataset (tidyclust)

Recall that we scale the data before clustering and remove the labels.

Code
nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
  recipes::step_normalize(recipes::all_numeric_predictors()) %>%
  recipes::prep() %>%
  recipes::bake(new_data = NULL)

nci60_num_cluster <- nci60_scaled %>%
  dplyr::select(-c("label"))

Here is how we can create the clustering elbow plot using tidyclust.

Code
kmeans_spec <- tidyclust::k_means(
  num_clusters = tune::tune()
)

nci60_cv <- rsample::vfold_cv(nci60_num_cluster, v = 5)

nci60_rec <- recipes::recipe(formula = ~., data = nci60_num_cluster)

kmeans_wflow <- workflows::workflow(nci60_rec, kmeans_spec)

clust_num_grid <- dials::grid_regular(
  tidyclust::num_clusters(range = c(1, 15)),
  levels = 15
)

clust_num_grid %>%
  reactable::reactable(defaultPageSize = 5)
Code
res <- tidyclust::tune_cluster(
  kmeans_wflow,
  resamples = nci60_cv,
  grid = clust_num_grid,
  control = tune::control_grid(save_pred = TRUE, extract = identity),
  metrics = tidyclust::cluster_metric_set(
    tidyclust::tot_wss,
    tidyclust::tot_sse,
    tidyclust::sse_ratio
  )
)

res_metrics <- res %>% tune::collect_metrics()

res_metrics %>%
  reactable::reactable(defaultPageSize = 5)
Code
res_metrics %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      x = .data[["num_clusters"]],
      y = .data[["mean"]],
      colour = .data[[".metric"]]
    )
  ) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.8
  ) +
  ggplot2::geom_line(size = 1) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 3
  ) +
  ylab("mean over 5 folds") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:15) +
  ggplot2::theme(legend.position = "none")

tidyclust::hier_clust is used to perform the hierarchical clustering. Here, we set the number of clusters to three and use the complete linkage method. Under the hood, it uses stats::hclust() with Euclidean distance, as the check after the fit illustrates.

Code
hc_spec <- tidyclust::hier_clust(
  num_clusters = 3,
  linkage_method = "complete"
)

hc_fit <- hc_spec %>%
  tidyclust::fit(label ~ ., data = nci60_scaled)
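
To make the "under the hood" claim concrete, here is a minimal sketch (an addition, assuming the labels are excluded as before) that reproduces the same three-cluster assignments with stats::dist and stats::hclust directly.

Code
# The tidyclust fit should match a direct stats::hclust call on
# Euclidean distances with complete linkage
base_hclust <- nci60_num_cluster %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")

table(
  tidyclust = tidyclust::extract_cluster_assignment(hc_fit)$.pred_cluster,
  stats = stats::cutree(base_hclust, k = 3)
)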

We use tidyclust::within_cluster_sse to calculate the sum of squared errors (SSE) within each cluster.

Code
hc_fit %>%
  tidyclust::within_cluster_sse() %>%
  reactable::reactable(defaultPageSize = 5)
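
As a hedged cross-check (an addition, assuming each cluster's centroid is the per-cluster mean of the predictors), the same quantities can be computed by hand by summing the squared deviations from the per-cluster column means.

Code
# Manual within-cluster SSE: squared deviations from per-cluster
# column means, summed within each cluster
assignments <- tidyclust::extract_cluster_assignment(hc_fit)$.pred_cluster

nci60_num_cluster %>%
  split(assignments) %>%
  purrr::map_dbl(~ sum(scale(.x, center = TRUE, scale = FALSE)^2))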

We use tidyclust::tot_sse to calculate the total sum of squares.

Code
hc_fit %>%
  tidyclust::tot_sse() %>%
  reactable::reactable(defaultPageSize = 5)

We use tidyclust::sse_ratio to calculate the ratio of the within-cluster SSE (WSS) to the total SSE.

Code
hc_fit %>%
  tidyclust::sse_ratio() %>%
  reactable::reactable(defaultPageSize = 5)
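
Since the ratio is simply tot_wss divided by tot_sse, a quick sanity check (an addition here) is to recompute it from the two metrics.

Code
# sse_ratio should equal tot_wss / tot_sse
# (assumes both metrics return a tibble with an .estimate column,
# following the yardstick convention)
wss <- tidyclust::tot_wss(hc_fit)$.estimate
sse <- tidyclust::tot_sse(hc_fit)$.estimate
wss / sse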

We use tidyclust::silhouettes to compute the silhouette width of each observation relative to its cluster. Ensure that the labels are removed.

Code
hc_fit %>%
  tidyclust::silhouettes(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)

We use tidyclust::avg_silhouette to compute the average silhouette width across all observations. Ensure that the labels are removed. (This is also the mean of the sil_width column.)

Code
hc_fit %>%
  tidyclust::avg_silhouette(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)
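
We can verify this relationship directly (a small addition).

Code
# avg_silhouette should equal the mean of the sil_width column
hc_fit %>%
  tidyclust::silhouettes(new_data = nci60_num_cluster) %>%
  dplyr::summarise(avg_sil_width = mean(.data[["sil_width"]]))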

Use tidyclust::cluster_metric_set to combine several metrics.

Code
many_metrics <- tidyclust::cluster_metric_set(
  tidyclust::tot_wss,
  tidyclust::tot_sse,
  tidyclust::sse_ratio,
  tidyclust::avg_silhouette
)

# new_data required for tidyclust::avg_silhouette to work
hc_fit %>%
  many_metrics(new_data = nci60_num_cluster) %>%
  reactable::reactable(defaultPageSize = 5)

We use tidyclust::extract_cluster_assignment to see the cluster assignments. However, the labels need to be added back manually.

Code
hc_fit %>%
  tidyclust::extract_cluster_assignment() %>%
  dplyr::mutate(labels = nci60_scaled$label) %>%
  reactable::reactable(defaultPageSize = 5)

Centroids can be extracted using tidyclust::extract_centroids. Here is a preview of the first few columns. For hierarchical clustering, the per-cluster mean of the predictors is used as the centroid.

Code
hc_fit_centroids <- hc_fit %>%
  tidyclust::extract_centroids()

hc_fit_centroids[, 1:10] %>%
  reactable::reactable(defaultPageSize = 5)
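
As a hedged check (an addition, assuming the centroids are per-cluster means), the predictor columns previewed above can be recomputed directly from the cluster assignments.

Code
# Recompute the first few centroid columns as per-cluster means
nci60_num_cluster %>%
  dplyr::mutate(
    .cluster = tidyclust::extract_cluster_assignment(hc_fit)$.pred_cluster
  ) %>%
  dplyr::group_by(.data[[".cluster"]]) %>%
  dplyr::summarise(dplyr::across(1:9, mean)) %>%
  reactable::reactable(defaultPageSize = 5)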

Here is how to plot the results: hc_fit$fit gives back the underlying hclust object, which we can pass to factoextra::fviz_dend.

Code
hc_fit$fit$labels <- nci60_scaled$label
factoextra::fviz_dend(hc_fit$fit, main = "complete", k = 3)

Code
stats::cutree(tree = hc_fit$fit, k = 3)
>         CNS         CNS         CNS       RENAL      BREAST         CNS 
>           1           1           1           1           1           1 
>         CNS      BREAST       NSCLC       NSCLC       RENAL       RENAL 
>           1           1           1           1           1           1 
>       RENAL       RENAL       RENAL       RENAL       RENAL      BREAST 
>           1           1           1           1           1           1 
>       NSCLC       RENAL     UNKNOWN     OVARIAN    MELANOMA    PROSTATE 
>           1           1           1           1           1           1 
>     OVARIAN     OVARIAN     OVARIAN     OVARIAN     OVARIAN    PROSTATE 
>           1           1           1           1           1           1 
>       NSCLC       NSCLC       NSCLC    LEUKEMIA K562B-repro K562A-repro 
>           1           1           1           2           2           2 
>    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA    LEUKEMIA       COLON 
>           2           2           2           2           2           1 
>       COLON       COLON       COLON       COLON       COLON       COLON 
>           3           1           3           3           3           3 
> MCF7A-repro      BREAST MCF7D-repro      BREAST       NSCLC       NSCLC 
>           3           3           3           3           1           1 
>       NSCLC    MELANOMA      BREAST      BREAST    MELANOMA    MELANOMA 
>           1           1           1           1           1           1 
>    MELANOMA    MELANOMA    MELANOMA    MELANOMA 
>           1           1           1           1

Alternatively, we can use tidyclust::extract_fit_summary.

Code
hc_summary <- hc_fit %>%
  tidyclust::extract_fit_summary()
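
The result is a list; inspecting its element names (a small addition) shows what the summary stores.

Code
# Inspect the components returned by extract_fit_summary
names(hc_summary)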

Here is how you can do k-means clustering using tidyclust::k_means.

Code
km_spec <- tidyclust::k_means(num_clusters = 5) %>%
  parsnip::set_engine("stats", algorithm = "Lloyd")

km_fit <- km_spec %>%
  tidyclust::fit(label ~ ., data = nci60_scaled)

km_summary <- km_fit %>% tidyclust::extract_fit_summary()

stats::predict returns the cluster that a new observation belongs to.

Code
km_preds <- predict(km_fit, nci60_num_cluster, prefix = "KM_")
hc_preds <- predict(hc_fit, nci60_num_cluster, prefix = "HC_")
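
Both calls return a tibble with a .pred_cluster factor column; a quick tally (an addition here) gives the size of each predicted cluster.

Code
# Predictions come back as a tibble with a .pred_cluster column
km_preds %>%
  dplyr::count(.data[[".pred_cluster"]]) %>%
  reactable::reactable(defaultPageSize = 5)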

However, stats::predict on the “training data” will not necessarily produce the same results as tidyclust::extract_cluster_assignment. This is because tidyclust::extract_cluster_assignment returns the assignments produced during fitting, whereas stats::predict assigns each observation afresh based on the fitted clusters, so the two rules can disagree for observations near cluster boundaries.

Code
cbind(
  hc_preds,
  tidyclust::extract_cluster_assignment(hc_fit),
  nci60_scaled$label
) %>%
  reactable::reactable(defaultPageSize = 5)

We notice that the three-cluster assignments from hier_clust do not line up perfectly with the five-cluster assignments from k_means.

Code
tibble(
  hc = hc_preds$.pred_cluster,
  km = km_preds$.pred_cluster
) %>%
  dplyr::count(.data[["hc"]], .data[["km"]]) %>%
  reactable::reactable(defaultPageSize = 5)

tidyclust also has a useful function called tidyclust::reconcile_clusterings to match the clusters from K-means with the clusters from hierarchical clustering.

Code
cluster_table <- tidyclust::reconcile_clusterings(
  primary_cluster_assignment = hc_preds$.pred_cluster,
  alt_cluster_assignment = km_preds$.pred_cluster,
  one_to_one = FALSE
)

cluster_table %>%
  dplyr::mutate(labels = nci60_scaled$label) %>%
  reactable::reactable(
    defaultPageSize = 5,
    filterable = TRUE
  )

We can see that KM_1 and KM_2 are matched to HC_1, KM_3 is matched to HC_2, and KM_4 and KM_5 are matched to HC_3.

Blog References

  • Tengku Hanis’s blog titled “Explore data using PCA”

  • Julia Silge’s blog titled “PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes”

  • Julia Silge’s blog titled “Getting started with k-means and #TidyTuesday employment status”

  • Tidymodels’ tutorial titled “K-means clustering with tidy data principles”

  • Emil Hvitfeldt’s ISLR tidymodels Labs Chapter 12

  • Eric (R_ic) RPubs’s post titled “Train and Evaluate Clustering Models using Tidymodels and friends”

  • Chapter 21 Hierarchical Clustering from Hands-On Machine Learning with R

Package References

Code
get_citation <- function(package_name) {
  transform_name <- package_name %>%
    citation() %>%
    format(style = "text")
  return(transform_name[1])
}

packages <- c(
  "base", "ggplot2", "rmarkdown", "cluster",
  "report", "tidytext", "knitr", "broom",
  "cowplot", "dplyr", "factoextra", "forcats",
  "ggforce", "gridExtra", "htmltools", "ISLR2",
  "magrittr", "Matrix", "parsnip", "proxy",
  "purrr", "reactable", "reactablefmtr", "recipes",
  "scales", "softImpute", "stringr", "styler",
  "tibble", "tidyr", "yardstick", "tidyclust",
  "rsample", "tune", "dials", "workflows",
  "Rfast", "flexclust", "langevitour"
)

table <- tibble::tibble(Packages = packages)

table %>%
  dplyr::mutate(
    transform_name = purrr::map_chr(
      .data[["Packages"]],
      get_citation
    )
  ) %>%
  dplyr::pull(.data[["transform_name"]]) %>%
  report::as.report_parameters()