Chapter 12 Lab
This document was prepared on 2022-08-23.
Code
# Rmarkdown stuff
library(rmarkdown, quietly = TRUE)
library(knitr, quietly = TRUE)
library(htmltools, quietly = TRUE)
library(styler, quietly = TRUE)
# Dataset
library(ISLR2, quietly = TRUE)
# Session info and package reporting
library(report, quietly = TRUE)
# Data wrangling
library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(purrr, quietly = TRUE)
library(tibble, quietly = TRUE)
library(stringr, quietly = TRUE)
# View PCA
library(langevitour, quietly = TRUE)
# Matrix Completion
library(softImpute, quietly = TRUE)
# Create interactive tables
library(reactable, quietly = TRUE)
library(reactablefmtr, quietly = TRUE)
# For plotting
library(tidytext, quietly = TRUE)
library(scales, quietly = TRUE)
library(forcats, quietly = TRUE)
library(grid, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gridExtra, quietly = TRUE)
library(grDevices, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggforce, quietly = TRUE)
# For applying tidymodels
library(recipes, quietly = TRUE)
# From GitHub remotes::install_github("tidymodels/parsnip")
library(parsnip, quietly = TRUE)
library(broom, quietly = TRUE)
library(yardstick, quietly = TRUE)
# For clustering
library(proxy, quietly = TRUE)
library(factoextra, quietly = TRUE)
library(cluster, quietly = TRUE)
# For tidy clustering
# From GitHub remotes::install_github("tidymodels/workflows@celery")
library(workflows, quietly = TRUE)
library(rsample, quietly = TRUE)
library(tune, quietly = TRUE)
library(dials, quietly = TRUE)
library(tidyclust, quietly = TRUE)
library(Rfast, quietly = TRUE)
library(flexclust, quietly = TRUE)
summary(report::report(sessionInfo()))
The analysis was done using the R Statistical language (v4.2.1; R Core Team, 2022) on Windows 10 x64, using the packages Rcpp (v1.0.9), ggplot2 (v3.3.6), rmarkdown (v2.14), cluster (v2.1.3), report (v0.5.1), lattice (v0.20.45), tidytext (v0.3.3), flexclust (v1.4.1), knitr (v1.39), broom (v1.0.0), cowplot (v1.1.1), dials (v1.0.0), dplyr (v1.0.9), factoextra (v1.0.7), forcats (v0.5.1), ggforce (v0.3.3), gridExtra (v2.3), htmltools (v0.5.2), ISLR2 (v1.3.1), langevitour (v0.5), magrittr (v2.0.3), Matrix (v1.4.1), modeltools (v0.2.23), parsnip (v1.0.0.9000), proxy (v0.4.27), purrr (v0.3.4), RcppZiggurat (v0.1.6), reactable (v0.3.0), reactablefmtr (v2.0.0), recipes (v1.0.1), Rfast (v2.0.6), rsample (v1.1.0), scales (v1.2.0), softImpute (v1.4.1), stringr (v1.4.0), styler (v1.7.0), tibble (v3.1.8), tidyclust (v0.0.0.9000), tidyr (v1.2.0), tune (v1.0.0), workflows (v1.0.0.9000) and yardstick (v1.0.0).
Principal Components Analysis
The USArrests data set from the datasets package contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Column Name | Column Description
---|---
Murder | Number of arrests for murder (per 100,000)
Assault | Number of arrests for assault (per 100,000)
UrbanPop | Percent of the population living in urban areas
Rape | Number of arrests for rape (per 100,000)
Turn USArrests into a tibble and move the rownames into a column.
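The rendered recipe below reports five predictors (the four numeric columns plus the state name), so the hidden chunk presumably resembles this minimal sketch; the object name us_arrests is an assumption.
Code
# Move the state names out of the rownames into a proper column;
# the object name `us_arrests` is an assumption.
us_arrests <- USArrests %>%
  tibble::as_tibble(rownames = "state")
us_arrests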
We create a recipe with recipes::recipe, recipes::all_numeric_predictors(), recipes::step_normalize, and recipes::step_pca. By default, recipes::step_pca uses stats::prcomp with the argument defaults set to retx = FALSE, center = FALSE, scale. = FALSE, and tol = NULL.
Code
pca_recipe <- recipes::recipe(
formula = ~.,
data = us_arrests
) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca",
num_comp = 4
)
pca_recipe
> Recipe
>
> Inputs:
>
> role #variables
> predictor 5
>
> Operations:
>
> Centering and scaling for recipes::all_numeric_predictors()
> PCA extraction with recipes::all_numeric_predictors()
Apply recipes::prep and recipes::bake to compute the scores of the principal component analysis.
Code
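A minimal sketch of what these hidden chunks likely compute, assuming the object names pca_estimates and pca_score (the latter matches its later use in the biplot).
Code
# Estimate the recipe steps, then extract the principal component
# scores; new_data = NULL returns the processed training data, and
# the non-numeric `state` column is carried through untouched.
pca_estimates <- recipes::prep(pca_recipe)
pca_score <- recipes::bake(pca_estimates, new_data = NULL)
pca_score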
Code
We can explore the results of the PCA using recipes::prep and parsnip::tidy. We can see that PCA is done at step number 2. As such, we extract the PCA loadings of the pca step.
Code
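A sketch of the extraction, assuming the name tidied_pca_loadings used by the plots below.
Code
# type = "coef" returns one row per (term, component) pair,
# with the loading in the `value` column.
tidied_pca_loadings <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")
tidied_pca_loadings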
Here are the PCA loadings in wide table form.
Code
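A sketch of the pivot; the reactable display mirrors how other tables are shown in this lab, but is an assumption.
Code
tidied_pca_loadings %>%
  tidyr::pivot_wider(
    names_from = "component",
    values_from = "value"
  ) %>%
  reactable::reactable(defaultPageSize = 5)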
We make a visualization to see what the loadings of the four components look like.
Code
tidied_pca_loadings %>%
dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["value"]],
y = .data[["terms"]],
fill = .data[["terms"]]
)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
ggplot2::labs(y = NULL)
Let us take a closer look at how the four variables contribute to each of the four components.
Code
tidied_pca_loadings %>%
dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
dplyr::group_by(.data[["component"]]) %>%
dplyr::top_n(4, abs(.data[["value"]])) %>%
dplyr::ungroup() %>%
dplyr::mutate(
terms = tidytext::reorder_within(
x = .data[["terms"]],
by = abs(.data[["value"]]),
within = .data[["component"]]
)
) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = abs(.data[["value"]]),
y = .data[["terms"]],
fill = .data[["value"]] > 0
)) +
ggplot2::geom_col() +
ggplot2::facet_wrap(
facets = ggplot2::vars(.data[["component"]]),
scales = "free_y"
) +
tidytext::scale_y_reordered() +
ggplot2::labs(
x = "Absolute value of contribution to PCA component",
y = NULL,
fill = "Positive?"
)
How much variation are we capturing for each component?
Here is how we can view the variance statistics in long form.
Code
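A sketch of the extraction, assuming the name tidied_pca_variance used in the plot below.
Code
# type = "variance" returns the variance, percent variance, and
# their cumulative versions for each component, in long form.
tidied_pca_variance <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")
tidied_pca_variance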
Here is how we can view the variance statistics in wide form.
Code
Here is a simple plot to show the variance captured for each component.
Code
# Get the variance
percent_variation <- tidied_pca_variance %>%
dplyr::filter(.data[["terms"]] == "percent variance") %>%
dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
# I use [1:4] to select the first four components
dplyr::tibble(
component = unique(tidied_pca_loadings$component)[1:4],
percent_var = percent_variation[1:4]
) %>%
dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["percent_var"]]
)) +
ggplot2::geom_col() +
ggplot2::geom_text(
mapping = ggplot2::aes(
label = paste(round(.data[["percent_var"]] * 100, 0), "%")
),
vjust = -0.2
) +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Percent variance \nexplained by each \nPCA component"
)
Here is a PCA biplot of PC1 and PC2.
Code
filtered_loadings <- tidied_pca_loadings %>%
tidyr::pivot_wider(
names_from = .data[["component"]],
values_from = .data[["value"]]
)
ggplot2::ggplot() +
ggplot2::geom_point(
data = pca_score,
mapping = ggplot2::aes(
x = .data[["PC1"]],
y = .data[["PC2"]]
),
size = 2,
alpha = 0.6
) +
ggplot2::geom_text(
data = pca_score,
mapping = ggplot2::aes(
x = .data[["PC1"]],
y = .data[["PC2"]],
label = .data[["state"]]
),
check_overlap = TRUE,
size = 2.5,
hjust = "inward"
) +
ggplot2::geom_segment(
data = filtered_loadings,
mapping = ggplot2::aes(
x = 0,
y = 0,
xend = .data[["PC1"]] * 2.5,
yend = .data[["PC2"]] * 2.5
),
arrow = ggplot2::arrow(length = grid::unit(
x = 0.5,
units = "picas"
)),
color = "blue"
) +
ggplot2::annotate(
geom = "text",
x = filtered_loadings[["PC1"]] * 2.7,
y = filtered_loadings[["PC2"]] * 2.7,
label = filtered_loadings[["terms"]]
) +
ggplot2::coord_fixed(clip = "off") +
ggplot2::theme_minimal()
Matrix Completion
Initialisation
Code
missing_data <- data.frame(
SBP = c(-3, NA, -1, 1, 1, NA),
DBP = c(NA, -2, 0, NA, 2, 4)
)
colour_missing_cells <- function(value, index, name) {
missing_indices <- which(is.na(missing_data[[name]]))
if (index %in% missing_indices) {
list(background = "#aaf0f8")
}
}
xbar <- colMeans(missing_data, na.rm = TRUE)
initialisation_data <- data.frame(
SBP = c(-3, -0.5, -1, 1, 1, -0.5),
DBP = c(1, -2, 0, 1, 2, 4)
)
thresh <- 1e-2
ismiss <- is.na(missing_data)
Functions
These are the implementations of steps 2(a) and 2(b) of Algorithm 12.1.
Code
fit_pca <- function(input_data, num_comp = 1) {
data_pca <- prcomp(input_data, center = FALSE)
pca_loadings <- data_pca$rotation
pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
pca_scores <- data_pca$x[, 1:num_comp, drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Code
fit_svd <- function(input_data, num_comp = 1) {
svd_object <- svd(input_data)
pca_loadings <- svd_object$v[, 1:num_comp, drop = FALSE]
pca_loadings_inverse <- t(pca_loadings)
pca_scores <- (svd_object$u %*% diag(svd_object$d))[, 1:num_comp, drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Code
fit_pca_tidy <- function(input_data, num_comp = 1) {
pca_recipe <- recipes::recipe(
formula = ~.,
data = input_data
) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca"
)
pca_estimates <- recipes::prep(x = pca_recipe, training = input_data)
pca_scores <- pca_estimates %>%
recipes::bake(new_data = input_data) %>%
as.matrix()
pca_scores <- pca_scores[, 1:num_comp, drop = FALSE]
pca_loadings <- pca_estimates %>%
parsnip::tidy(id = "pca", type = "coef") %>%
tidyr::pivot_wider(
names_from = .data[["component"]],
values_from = .data[["value"]]
) %>%
dplyr::select(-c("terms", "id")) %>%
as.matrix()
pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Iteration 1
Code
initialisation_data_pca <- initialisation_data
# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_pca %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_pca
> [1] 1.48974
Code
initialisation_data_svd <- initialisation_data
# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_svd %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_svd
> [1] 1.48974
Code
initialisation_data_tidymodels <- initialisation_data
# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)
# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_tidymodels %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_tidymodels
> [1] 1.48974
The rest of the iterations
Code
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_pca - mss_pca
mss_old_pca <- mss_pca
cat(
"Iter:", iter, "\n",
"MSS:", mss_pca, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_pca)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> SBP DBP
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> SBP DBP
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> SBP DBP
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> SBP DBP
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> SBP DBP
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> SBP DBP
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> SBP DBP
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> SBP DBP
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
Code
iter <- 1
rel_err <- mss_old_svd
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_svd - mss_svd
mss_old_svd <- mss_svd
cat(
"Iter:", iter, "\n",
"MSS:", mss_svd, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_svd)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> [,1] [,2]
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
Code
iter <- 1
rel_err <- mss_old_tidymodels
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)
# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_tidymodels - mss_tidymodels
mss_old_tidymodels <- mss_tidymodels
cat(
"Iter:", iter, "\n",
"MSS:", mss_tidymodels, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_tidymodels)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> [,1] [,2]
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
SoftImpute
Code
initialisation_data_tidymodels %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
) %>%
reactablefmtr::add_title(
title = "Algorithm 12.1"
)
fits <- softImpute::softImpute(missing_data, type = "svd")
softImpute::complete(missing_data, fits) %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
) %>%
reactablefmtr::add_title(
title = "Soft Impute"
)
K-Means Clustering
Introduction
We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.
Code
set.seed(2)
# CLuster centre at (0,0) and (3,-4)
centers <- tibble::tibble(
cluster = factor(1:2),
num_points = c(25, 25), # number points in each cluster
V1 = c(0, 3), # V1 coordinate of cluster center
V2 = c(0, -4) # V2 coordinate of cluster center
)
labelled_points <- centers %>%
dplyr::mutate(
V1 = purrr::map2(.data[["num_points"]], .data[["V1"]], rnorm),
V2 = purrr::map2(.data[["num_points"]], .data[["V2"]], rnorm)
) %>%
dplyr::select(-c("num_points")) %>%
tidyr::unnest(cols = c("V1", "V2"))
labelled_points %>%
reactable::reactable(
defaultPageSize = 10,
columns = list(
V1 = reactable::colDef(format = reactable::colFormat(digits = 2)),
V2 = reactable::colDef(format = reactable::colFormat(digits = 2))
)
)
Code
ggplot2::ggplot(
data = labelled_points,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]],
color = .data[["cluster"]]
)
) +
ggplot2::geom_point(alpha = 1)
We now perform K-means clustering with \(K = 2\). We'll use the built-in stats::kmeans function, which accepts a data frame with all numeric columns as its primary argument.
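The hidden chunk presumably drops the true labels and fits stats::kmeans; a minimal sketch, assuming the object names points (which reappears later in this lab) and kmeans_fit.
Code
# Keep only the numeric coordinates before clustering
points <- labelled_points %>%
  dplyr::select(-c("cluster"))
kmeans_fit <- stats::kmeans(x = points, centers = 2)
kmeans_fit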
Code
> K-means clustering with 2 clusters of sizes 26, 24
>
> Cluster means:
> V1 V2
> 1 0.361163 -0.162447
> 2 2.877777 -4.262629
>
> Clustering vector:
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
>
> Within cluster sum of squares by cluster:
> [1] 68.52253 58.53435
> (between_SS / total_SS = 69.5 %)
>
> Available components:
>
> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
> [6] "betweenss" "size" "iter" "ifault"
The broom::tidy function returns information for each cluster, including the mean position (centroid), size, and within-cluster sum-of-squares. The broom::glance function returns model-wise metrics. One of these is tot.withinss, which is the total within-cluster sum-of-squares that we seek to minimize when we perform K-means clustering. We can see which cluster each observation belongs to by using broom::augment, which "predicts" the cluster a given observation belongs to.
Code
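A minimal sketch, assuming the kmeans_fit object from the sketch above; the name predicted_kmeans matches its use in the next plot.
Code
# Attach the predicted cluster (.cluster) to each observation
predicted_kmeans <- broom::augment(kmeans_fit, data = points)
predicted_kmeans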
Code
ggplot2::ggplot(
data = predicted_kmeans,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]],
color = .data[[".cluster"]]
)
) +
ggplot2::geom_point(alpha = 1)
Exploration
In reality, we will not know how many clusters are required for a given dataset. We need to try out a number of different values of \(K\) and then find the best one.
Let us try \(K\) from 1 to 9.
Code
kclusts <- tibble::tibble(k = 1:9) %>%
dplyr::mutate(
kclust = purrr::map(
.x = .data[["k"]],
.f = ~ kmeans(x = points, centers = .x)
)
) %>%
dplyr::mutate(
tidied = purrr::map(
.x = .data[["kclust"]],
.f = broom::tidy
),
glanced = purrr::map(
.x = .data[["kclust"]],
.f = broom::glance
),
augmented = purrr::map(
.x = .data[["kclust"]],
.f = broom::augment, data = points
)
)
kclusts
> # A tibble: 9 × 5
> k kclust tidied glanced augmented
> <int> <list> <list> <list> <list>
> 1 1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 2 2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 3 3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 4 4 <kmeans> <tibble [4 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 5 5 <kmeans> <tibble [5 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 6 6 <kmeans> <tibble [6 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 7 7 <kmeans> <tibble [7 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 8 8 <kmeans> <tibble [8 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 9 9 <kmeans> <tibble [9 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
Let assignments be a dataset of which cluster each point goes to, for \(K\) from 1 to 9.
Code
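A sketch of the unnesting the hidden chunk likely performs.
Code
# One row per point per value of k, with the assigned .cluster
assignments <- kclusts %>%
  dplyr::select(c("k", "augmented")) %>%
  tidyr::unnest(cols = c("augmented"))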
Now we can plot the original points using the data from assignments, with each point coloured according to the predicted cluster.
Code
ggplot2::ggplot(
data = assignments,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]]
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(color = .data[[".cluster"]]),
alpha = 0.8
) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["k"]]))
Let clusters be a dataset of cluster mean positions (centroids) for \(K\) from 1 to 9.
Code
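A sketch of the unnesting.
Code
# One row per cluster per value of k, with the centroid
# coordinates in V1 and V2
clusters <- kclusts %>%
  dplyr::select(c("k", "tidied")) %>%
  tidyr::unnest(cols = c("tidied"))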
We can then add the centers of the clusters using the data from clusters.
Code
ggplot2::ggplot(
data = assignments,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]]
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(color = .data[[".cluster"]]),
alpha = 0.8
) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["k"]])) +
ggplot2::geom_point(
data = clusters,
size = 10,
shape = "x"
)
Let clusterings be a dataset of clustering metrics for \(K\) from 1 to 9.
Code
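A sketch of the unnesting.
Code
# One row of model-level metrics (including tot.withinss) per k
clusterings <- kclusts %>%
  dplyr::select(c("k", "glanced")) %>%
  tidyr::unnest(cols = c("glanced"))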
Now that we have the total within-cluster sum-of-squares for \(K\) from 1 to 9, we can plot them against k so we can use the elbow method to find the optimal number of clusters.
Code
ggplot2::ggplot(
data = clusterings,
mapping = ggplot2::aes(
x = .data[["k"]],
y = .data[["tot.withinss"]]
)
) +
ggplot2::geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
ggplot2::geom_point(size = 2, color = "midnightblue")
We see an elbow at k = 2, which makes us happy since the data set was specifically created to have 2 clusters. We can now extract the model where k = 2 from kclusts.
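A minimal sketch of the extraction; the name final_kclust is an assumption.
Code
final_kclust <- kclusts %>%
  dplyr::filter(.data[["k"]] == 2) %>%
  dplyr::pull("kclust") %>%
  purrr::pluck(1)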
Hierarchical Clustering
The stats::dist function is used to calculate the Euclidean distance dissimilarity measure.
Code
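A sketch of the hidden chunk, assuming the name dissimilarity_measure used with cluster::agnes below.
Code
# Euclidean distances between all pairs of observations
dissimilarity_measure <- stats::dist(points, method = "euclidean")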
The stats::hclust function implements hierarchical clustering in R. We use the data used for K-means clustering to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.
We use factoextra::fviz_dend to visualize the clustering created using hclust.
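A minimal sketch for the complete linkage case; single and average linkage work the same way via the method argument, and the object name is an assumption.
Code
res_hclust_complete <- dissimilarity_measure %>%
  stats::hclust(method = "complete")
factoextra::fviz_dend(res_hclust_complete, main = "complete")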
To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the stats::cutree function.
We analyze the nature of the clusters mathematically by evaluating the agglomerative coefficient, which measures the clustering structure of the dataset: values closer to 1 suggest a more balanced clustering structure, while values closer to 0 suggest less well-formed clusters. However, the agglomerative coefficient tends to become larger as the number of samples increases, so it should not be used to compare across data sets of very different sizes.
cluster::agnes allows us to compute this metric on the results from hierarchical clustering. It looks like the complete linkage did the best.
Code
# Compute agglomerative coefficient values
ac_metric <- list(
complete_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "complete"
)$ac,
average_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "average"
)$ac,
single_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "single"
)$ac
)
ac_metric
> $complete_ac
> [1] 0.9359608
>
> $average_ac
> [1] 0.8887208
>
> $single_ac
> [1] 0.6700538
Although hierarchical clustering provides a fully connected dendrogram representing the cluster relationships, we may still need to choose the preferred number of clusters to extract. Fortunately, we can execute approaches similar to those discussed for K-means clustering with factoextra::fviz_nbclust.
Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "wss",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "silhouette",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "gap_stat",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
Another way of calculating distances is based on Pearson correlation, using proxy::dist. However, this only makes sense if we have data with 3 or more variables.
Code
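The hidden chunk presumably builds a small data set with at least three variables; this sketch, including the name three_var_data used below, is an assumption.
Code
set.seed(2)
# Hypothetical three-variable data set
three_var_data <- tibble::tibble(
  V1 = rnorm(50),
  V2 = rnorm(50),
  V3 = rnorm(50)
)
# Correlation-based dissimilarity between observations
proxy::dist(three_var_data, method = "correlation")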
Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "wss",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "silhouette",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "gap_stat",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic data.
We illustrate these techniques on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.
label represents the type of cancer, while V_1 to V_6830 each represent a gene.
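A sketch of how the nci60 tibble used below might be built from ISLR2::NCI60; the exact construction is an assumption.
Code
nci60 <- ISLR2::NCI60$data %>%
  as.data.frame() %>% # columns become V1 ... V6830
  tibble::as_tibble() %>%
  dplyr::mutate(label = factor(ISLR2::NCI60$labs)) %>%
  dplyr::relocate("label")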
PCA on the NCI60 Data
We first perform PCA on the data after scaling the variables (genes) to have standard deviation one and plot the first few principal component score vectors, in order to visualize the data.
Code
nci60_preproc_rec <- recipes::recipe(formula = label ~ ., data = nci60) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca",
threshold = 0.99
)
Code
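A minimal sketch of the prep and bake steps, assuming the names nci60_pca_estimates and nci60_pca_score (the latter matches its later use); recipes zero-pads component names (PC01, PC02, ...) when more than nine components are kept.
Code
nci60_pca_estimates <- recipes::prep(nci60_preproc_rec)
nci60_pca_score <- recipes::bake(nci60_pca_estimates, new_data = NULL)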
Code
We visualise the projections of the NCI60 cancer cell lines onto the first three principal components.
Code
colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC02")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["PC01"]],
y = .data[["PC02"]],
color = .data[["label"]]
)) +
ggplot2::geom_point() +
ggplot2::scale_color_manual(values = colors) +
ggplot2::theme_bw()
Code
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC03")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["PC01"]],
y = .data[["PC03"]],
color = .data[["label"]]
)) +
ggplot2::geom_point() +
ggplot2::scale_color_manual(values = colors) +
ggplot2::theme_bw()
Here is a scatter plot matrix
Code
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC02", "PC03")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .panel_x,
y = .panel_y,
color = .data[["label"]],
fill = .data[["label"]]
)) +
ggplot2::geom_point(alpha = 0.4, size = 0.5) +
ggforce::geom_autodensity(alpha = .3) +
ggforce::facet_matrix(rows = ggplot2::vars(-c("label")), layer.diag = 2) +
ggplot2::scale_color_manual(values = colors) +
ggplot2::scale_fill_manual(values = colors) +
cowplot::theme_minimal_grid() +
cowplot::panel_border(color = "black") +
ggplot2::theme(legend.position = "bottom")
To view them interactively, consider using langevitour::langevitour from the R package langevitour.
Code
nci60_pca_score_filtered <- nci60_pca_score %>%
dplyr::select(c(
"label",
"PC01", "PC02", "PC03",
"PC04", "PC05", "PC06"
))
langevitour::langevitour(
nci60_pca_score_filtered[, -1], nci60_pca_score_filtered$label,
pointSize = 2
)
How much variation are we capturing for each component?
We extract the PCA loadings of the pca step.
Code
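A sketch of the extraction, assuming the nci60_pca_estimates object from the sketch above; the names nci60_pca_loadings and nci60_pca_variance match their later use.
Code
nci60_pca_loadings <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")
nci60_pca_variance <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")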
Here is how we can view the variance statistics in long form.
Here is how we can view the variance statistics in wide form.
Code
Here is a simple plot to show the variance captured for each component.
Code
# Get the variance
percent_variation <- nci60_pca_variance %>%
dplyr::filter(.data[["terms"]] == "percent variance") %>%
dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
stringr::str_remove(pattern = "PC") %>%
as.integer()
# Plot the percent variance explained by each component
dplyr::tibble(
component = component,
percent_var = percent_variation
) %>%
# dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["percent_var"]]
)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Percent variance \nexplained by each \nPCA component"
)
We see that together, the first seven principal components explain around 40% of the variance in the data.
Code
# Get the variance
cumulative_percent_variation <- nci60_pca_variance %>%
dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
dplyr::pull(.data[["value"]])
cumulative_percent_variation <- cumulative_percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
stringr::str_remove(pattern = "PC") %>%
as.integer()
# Plot the cumulative percent variance explained
dplyr::tibble(
component = component,
cumulative_percent_var = cumulative_percent_variation
) %>%
# dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["cumulative_percent_var"]]
)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Cumulative percent \nvariance explained \nby each PCA \ncomponent"
)
Clustering on NCI60 dataset (Base R)
We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
We first preprocess the data to suit hierarchical clustering.
Code
nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::prep() %>%
recipes::bake(new_data = NULL)
We choose the preferred number of clusters to extract.
Code
nci60_num_cluster <- nci60_scaled %>%
dplyr::select(-c("label"))
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "wss",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "silhouette",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "gap_stat",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.
Code
nci60_complete <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "complete")
nci60_average <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "average")
nci60_single <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "single")
We then visualize them to see if any of them have good natural separations.
Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one.
Complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage.
Code
nci60_complete$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_complete, main = "complete", k = 4)
Code
stats::cutree(tree = nci60_complete, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 2 2
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 2 2 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 2 2 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 3 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 3 1
> COLON COLON COLON COLON COLON COLON
> 4 1 4 4 4 4
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 4 4 4 4 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Code
nci60_average$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_average, main = "average", k = 4)
Code
stats::cutree(tree = nci60_average, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 1
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 2 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 3 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 3 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 1 1 1 1 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 4 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Code
nci60_single$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_single, main = "single", k = 4)
Code
stats::cutree(tree = nci60_single, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 3 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 1 1 1
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 1 1 1 1 4 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 1 1 1 1 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
We will use complete linkage hierarchical clustering for the analysis that follows. We calculate which label is the most common within each cluster.
Code
tibble::tibble(
label = nci60$label,
cluster_id = stats::cutree(nci60_complete, k = 4)
) %>%
dplyr::count(.data[["label"]], .data[["cluster_id"]]) %>%
dplyr::group_by(.data[["cluster_id"]]) %>%
dplyr::mutate(
total = sum(.data[["n"]]),
prop = .data[["n"]] / sum(.data[["n"]])
) %>%
dplyr::slice_max(n = 1, order_by = .data[["prop"]]) %>%
dplyr::ungroup() %>%
reactable::reactable()
How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?
Code
nci60_kmeans_results <- kmeans(nci60_num_cluster, centers = 4, nstart = 50)
Code
We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different.
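A sketch of how the cross-tabulation below might be produced; the row and column names match the printed output.
Code
cluster_kmeans <- nci60_kmeans_results$cluster
cluster_hclust <- stats::cutree(tree = nci60_complete, k = 4)
table(cluster_hclust, cluster_kmeans)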
Code
> cluster_kmeans
> cluster_hclust 1 2 3 4
> 1 0 9 20 11
> 2 0 0 7 0
> 3 8 0 0 0
> 4 0 0 0 9
Code
Cluster 2 in K-means clustering is identical to cluster 4 in hierarchical clustering. However, the other clusters differ. Of the twenty-seven samples in K-means cluster 3, twenty go to hierarchical clustering cluster 1 and seven go to hierarchical clustering cluster 2.
Rather than performing hierarchical clustering on the entire data matrix, we can also perform hierarchical clustering on the first few principal component score vectors, as follows:
Code
# nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
# recipes::step_normalize(recipes::all_numeric_predictors()) %>%
# recipes::prep() %>%
# recipes::bake(new_data = NULL)
nci60_pca <- recipes::recipe(~., nci60_scaled) %>%
recipes::step_pca(recipes::all_numeric_predictors(), num_comp = 5) %>%
recipes::prep() %>%
recipes::bake(new_data = NULL)
Code
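A minimal sketch of the clustering on the score vectors; the construction of nci60_pca_complete is an assumption.
Code
nci60_pca_complete <- nci60_pca %>%
  dplyr::select(-c("label")) %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")
nci60_pca_complete$labels <- nci60_pca$label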
Code
stats::cutree(tree = nci60_pca_complete, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 2 1 2 2 2
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 2 2 2 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 2 2 2 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 1 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 1 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 4 4 4 4 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 2 2 2 2 2
> MELANOMA MELANOMA MELANOMA MELANOMA
> 2 2 2 2
Clustering on NCI60 dataset (tidyclust)
Recall that we scale the data before clustering and remove the labels.
Here is how we can create the clustering elbow plot using tidyclust.
Code
kmeans_spec <- tidyclust::k_means(
num_clusters = tune::tune()
)
nci60_cv <- rsample::vfold_cv(nci60_num_cluster, v = 5)
nci60_rec <- recipes::recipe(formula = ~., data = nci60_num_cluster)
kmeans_wflow <- workflows::workflow(nci60_rec, kmeans_spec)
clust_num_grid <- dials::grid_regular(
tidyclust::num_clusters(range = c(1, 15)),
levels = 15
)
clust_num_grid %>%
reactable::reactable(defaultPageSize = 5)
Code
res <- tidyclust::tune_cluster(
kmeans_wflow,
resamples = nci60_cv,
grid = clust_num_grid,
control = tune::control_grid(save_pred = TRUE, extract = identity),
metrics = tidyclust::cluster_metric_set(
tidyclust::tot_wss,
tidyclust::tot_sse,
tidyclust::sse_ratio
)
)
res_metrics <- res %>% tune::collect_metrics()
res_metrics %>%
reactable::reactable(defaultPageSize = 5)
Code
res_metrics %>%
ggplot2::ggplot(
mapping = ggplot2::aes(
x = .data[["num_clusters"]],
y = .data[["mean"]],
colour = .data[[".metric"]]
)
) +
ggplot2::geom_errorbar(
mapping = ggplot2::aes(
ymin = .data[["mean"]] - .data[["std_err"]],
ymax = .data[["mean"]] + .data[["std_err"]]
),
alpha = 0.8
) +
ggplot2::geom_line(size = 1) +
ggplot2::facet_wrap(
facets = ggplot2::vars(.data[[".metric"]]),
scales = "free",
nrow = 3
) +
ggplot2::ylab("mean over 5 folds") +
ggplot2::xlab("Number of clusters") +
ggplot2::scale_x_continuous(breaks = 1:15) +
ggplot2::theme(legend.position = "none")
tidyclust::hier_clust is used to do the hierarchical clustering. Here, we set the number of clusters to three and use the complete linkage method. It actually uses stats::hclust() and Euclidean distance under the hood.
Code
hc_spec <- tidyclust::hier_clust(
num_clusters = 3,
linkage_method = "complete"
)
hc_fit <- hc_spec %>%
tidyclust::fit(label ~ ., data = nci60_scaled)
We use tidyclust::within_cluster_sse to calculate the sum of squared errors within each cluster.
Code
hc_fit %>%
tidyclust::within_cluster_sse() %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::tot_sse to calculate the total sum of squares. We use tidyclust::sse_ratio to calculate the ratio of the WSS to the total SSE.
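A sketch of both calls, following the same pattern as the tidyclust::within_cluster_sse call above.
Code
hc_fit %>%
  tidyclust::tot_sse()
hc_fit %>%
  tidyclust::sse_ratio()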
We use tidyclust::silhouettes to measure silhouettes between clusters for all observations. Ensure that the labels are removed.
Code
hc_fit %>%
tidyclust::silhouettes(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::avg_silhouette to measure average silhouettes across all observations. Ensure that the labels are removed. (This is also the mean of the column sil_width.)
Code
hc_fit %>%
tidyclust::avg_silhouette(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
Use tidyclust::cluster_metric_set to combine the metrics together.
Code
many_metrics <- tidyclust::cluster_metric_set(
tidyclust::tot_wss,
tidyclust::tot_sse,
tidyclust::sse_ratio,
tidyclust::avg_silhouette
)
# new_data required for tidyclust::avg_silhouette to work
hc_fit %>%
many_metrics(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::extract_cluster_assignment to see the cluster assignments. However, the labels need to be updated.
Code
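A sketch of the hidden chunk, using the table display convention of this lab.
Code
hc_fit %>%
  tidyclust::extract_cluster_assignment() %>%
  reactable::reactable(defaultPageSize = 5)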
Centroids can be extracted using tidyclust::extract_centroids. Here is a preview of the first few columns. For hierarchical clustering, the geometric mean over the predictors for each cluster is used.
Code
hc_fit_centroids <- hc_fit %>%
tidyclust::extract_centroids()
hc_fit_centroids[, 1:10] %>%
reactable::reactable(defaultPageSize = 5)
Here is how to see the plots. hc_fit$fit gives back the hclust object. We can plot the hierarchical clustering results.
Code
hc_fit$fit$labels <- nci60_scaled$label
factoextra::fviz_dend(hc_fit$fit, main = "complete", k = 3)
Code
stats::cutree(tree = hc_fit$fit, k = 3)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 1
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 1 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 2 2 2
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 2 2 2 2 2 1
> COLON COLON COLON COLON COLON COLON
> 3 1 3 3 3 3
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 3 3 3 3 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Alternatively, we can use tidyclust::extract_fit_summary as well.
Code
hc_summary <- hc_fit %>%
tidyclust::extract_fit_summary()
Here is how you can do k-means clustering using tidyclust::k_means.
Code
km_spec <- tidyclust::k_means(num_clusters = 5) %>%
parsnip::set_engine("stats", algorithm = "Lloyd")
km_fit <- km_spec %>%
tidyclust::fit(label ~ ., data = nci60_scaled)
km_summary <- km_fit %>% tidyclust::extract_fit_summary()
stats::predict returns the cluster a new observation belongs to. However, stats::predict on the "training data" will not necessarily produce the same results as tidyclust::extract_cluster_assignment.
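A minimal sketch of the predictions, assuming the names hc_preds and km_preds used below.
Code
hc_preds <- stats::predict(hc_fit, new_data = nci60_num_cluster)
km_preds <- stats::predict(km_fit, new_data = nci60_num_cluster)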
Code
cbind(
hc_preds,
tidyclust::extract_cluster_assignment(hc_fit),
nci60_scaled$label
) %>%
reactable::reactable(defaultPageSize = 5)
We notice that the three-cluster assignments from hier_clust do not line up perfectly with the five-cluster assignments from k_means.
Code
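The hidden chunk likely compares the two prediction vectors; a hedged sketch:
Code
table(
  hc_preds$.pred_cluster,
  km_preds$.pred_cluster
)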
tidyclust also has a useful function called tidyclust::reconcile_clusterings to match the clusters from k-means with the clusters from hierarchical clustering.
Code
cluster_table <- tidyclust::reconcile_clusterings(
primary_cluster_assignment = hc_preds$.pred_cluster,
alt_cluster_assignment = km_preds$.pred_cluster,
one_to_one = FALSE
)
cluster_table %>%
dplyr::mutate(labels = nci60_scaled$label) %>%
reactable::reactable(
defaultPageSize = 5,
filterable = TRUE
)
We can see that KM_1 and KM_2 are matched to HC_1, KM_3 is matched to HC_2, and KM_4 and KM_5 are matched to HC_3.
Blog References
Tengku Hanis’s blog titled “Explore data using PCA”
Julia Silge’s blog titled “PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes”
Julia Silge’s blog titled “Getting started with k-means and #TidyTuesday employment status”
Tidymodels’ tutorial titled “K-means clustering with tidy data principles”
Emil Hvitfeldt’s ISLR tidymodels Labs Chapter 12
Eric (R_ic) RPubs’s post titled “Train and Evaluate Clustering Models using Tidymodels and friends”
Chapter 21 Hierarchical Clustering from Hands-On Machine Learning with R
Package References
Code
get_citation <- function(package_name) {
transform_name <- package_name %>%
citation() %>%
format(style = "text")
return(transform_name[1])
}
packages <- c(
"base", "ggplot2", "rmarkdown", "cluster",
"report", "tidytext", "knitr", "broom",
"cowplot", "dplyr", "factoextra", "forcats",
"ggforce", "gridExtra", "htmltools", "ISLR2",
"magrittr", "Matrix", "parsnip", "proxy",
"purrr", "reactable", "reactablefmtr", "recipes",
"scales", "softImpute", "stringr", "styler",
"tibble", "tidyr", "yardstick", "tidyclust",
"rsample", "tune", "dials", "workflows",
"Rfast", "flexclust", "langevitour"
)
table <- tibble::tibble(Packages = packages)
table %>%
dplyr::mutate(
transform_name = purrr::map_chr(
.data[["Packages"]],
get_citation
)
) %>%
dplyr::pull(.data[["transform_name"]]) %>%
report::as.report_parameters()
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
- Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2022). rmarkdown: Dynamic Documents for R. R package version 2.14, https://github.com/rstudio/rmarkdown.
- Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2022). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.3 - For new features, see the ‘Changelog’ file (in the package source), https://CRAN.R-project.org/package=cluster.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
- Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi:10.21105/joss.00037 https://doi.org/10.21105/joss.00037, http://dx.doi.org/10.21105/joss.00037.
- Xie Y (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.39, https://yihui.org/knitr/.
- Robinson D, Hayes A, Couch S (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0, https://CRAN.R-project.org/package=broom.
- Wilke C (2020). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 1.1.1, https://CRAN.R-project.org/package=cowplot.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Kassambara A, Mundt F (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7, https://CRAN.R-project.org/package=factoextra.
- Wickham H (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1, https://CRAN.R-project.org/package=forcats.
- Pedersen T (2021). ggforce: Accelerating ‘ggplot2’. R package version 0.3.3, https://CRAN.R-project.org/package=ggforce.
- Auguie B (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3, https://CRAN.R-project.org/package=gridExtra.
- Cheng J, Sievert C, Schloerke B, Chang W, Xie Y, Allen J (2021). htmltools: Tools for HTML. R package version 0.5.2, https://CRAN.R-project.org/package=htmltools.
- James G, Witten D, Hastie T, Tibshirani R (2022). ISLR2: Introduction to Statistical Learning, Second Edition. R package version 1.3-1, https://CRAN.R-project.org/package=ISLR2.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Bates D, Maechler M, Jagan M (2022). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.4-1, https://CRAN.R-project.org/package=Matrix.
- Kuhn M, Vaughan D (2022). parsnip: A Common API to Modeling and Analysis Functions. https://github.com/tidymodels/parsnip, https://parsnip.tidymodels.org/.
- Meyer D, Buchta C (2022). proxy: Distance and Similarity Measures. R package version 0.4-27, https://CRAN.R-project.org/package=proxy.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
- Lin G (2022). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.3.0, https://CRAN.R-project.org/package=reactable.
- Cuilla K (2022). reactablefmtr: Streamlined Table Styling and Formatting for Reactable. R package version 2.0.0, https://CRAN.R-project.org/package=reactablefmtr.
- Kuhn M, Wickham H (2022). recipes: Preprocessing and Feature Engineering Steps for Modeling. R package version 1.0.1, https://CRAN.R-project.org/package=recipes.
- Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.0, https://CRAN.R-project.org/package=scales.
- Hastie T, Mazumder R (2021). softImpute: Matrix Completion via Iterative Soft-Thresholded SVD. R package version 1.4-1, https://CRAN.R-project.org/package=softImpute.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Müller K, Walthert L (2022). styler: Non-Invasive Pretty Printing of R Code. R package version 1.7.0, https://CRAN.R-project.org/package=styler.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.8, https://CRAN.R-project.org/package=tibble.
- Wickham H, Girlich M (2022). tidyr: Tidy Messy Data. R package version 1.2.0, https://CRAN.R-project.org/package=tidyr.
- Kuhn M, Vaughan D (2022). yardstick: Tidy Characterizations of Model Performance. R package version 1.0.0, https://CRAN.R-project.org/package=yardstick.
- Hvitfeldt E, Bodwin K (2022). tidyclust: A Common API to Clustering. R package version 0.0.0.9000, https://github.com/EmilHvitfeldt/tidyclust.
- Silge J, Chow F, Kuhn M, Wickham H (2022). rsample: General Resampling Infrastructure. R package version 1.1.0, https://CRAN.R-project.org/package=rsample.
- Kuhn M (2022). tune: Tidy Tuning Tools. R package version 1.0.0, https://CRAN.R-project.org/package=tune.
- Kuhn M, Frick H (2022). dials: Tools for Creating Tuning Parameter Values. R package version 1.0.0, https://CRAN.R-project.org/package=dials.
- Vaughan D (2022). workflows: Modeling Workflows. https://github.com/tidymodels/workflows, https://workflows.tidymodels.org.
- Papadakis M, Tsagris M, Dimitriadis M, Fafalios S, Tsamardinos I, Fasiolo M, Borboudakis G, Burkardt J, Zou C, Lakiotaki K, Chatzipantsiou. C (2022). Rfast: A Collection of Efficient and Extremely Fast R Functions. R package version 2.0.6, https://CRAN.R-project.org/package=Rfast.
- Leisch F (2006). “A Toolbox for K-Centroids Cluster Analysis.” Computational Statistics and Data Analysis, 51(2), 526-544.
- Harrison P (2022). langevitour: Langevin Tour. R package version 0.5, https://CRAN.R-project.org/package=langevitour.