Chapter 12 Lab
This document was prepared on 2022-08-23.
Code
# Rmarkdown stuff
library(rmarkdown, quietly = TRUE)
library(knitr, quietly = TRUE)
library(htmltools, quietly = TRUE)
library(styler, quietly = TRUE)
# Dataset
library(ISLR2, quietly = TRUE)
# Session info and package reporting
library(report, quietly = TRUE)
# Data wrangling
library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
library(purrr, quietly = TRUE)
library(tibble, quietly = TRUE)
library(stringr, quietly = TRUE)
# View PCA
library(langevitour, quietly = TRUE)
# Matrix Completion
library(softImpute, quietly = TRUE)
# Create interactive tables
library(reactable, quietly = TRUE)
library(reactablefmtr, quietly = TRUE)
# For plotting
library(tidytext, quietly = TRUE)
library(scales, quietly = TRUE)
library(forcats, quietly = TRUE)
library(grid, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gridExtra, quietly = TRUE)
library(grDevices, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggforce, quietly = TRUE)
# For applying tidymodels
library(recipes, quietly = TRUE)
# From GitHub remotes::install_github("tidymodels/parsnip")
library(parsnip, quietly = TRUE)
library(broom, quietly = TRUE)
library(yardstick, quietly = TRUE)
# For clustering
library(proxy, quietly = TRUE)
library(factoextra, quietly = TRUE)
library(cluster, quietly = TRUE)
# For tidy clustering
# From GitHub remotes::install_github("tidymodels/workflows@celery")
library(workflows, quietly = TRUE)
library(rsample, quietly = TRUE)
library(tune, quietly = TRUE)
library(dials, quietly = TRUE)
library(tidyclust, quietly = TRUE)
library(Rfast, quietly = TRUE)
library(flexclust, quietly = TRUE)
summary(report::report(sessionInfo()))
The analysis was done using the R Statistical language (v4.2.1; R Core Team, 2022) on Windows 10 x64, using the packages Rcpp (v1.0.9), ggplot2 (v3.3.6), rmarkdown (v2.14), cluster (v2.1.3), report (v0.5.1), lattice (v0.20.45), tidytext (v0.3.3), flexclust (v1.4.1), knitr (v1.39), broom (v1.0.0), cowplot (v1.1.1), dials (v1.0.0), dplyr (v1.0.9), factoextra (v1.0.7), forcats (v0.5.1), ggforce (v0.3.3), gridExtra (v2.3), htmltools (v0.5.2), ISLR2 (v1.3.1), langevitour (v0.5), magrittr (v2.0.3), Matrix (v1.4.1), modeltools (v0.2.23), parsnip (v1.0.0.9000), proxy (v0.4.27), purrr (v0.3.4), RcppZiggurat (v0.1.6), reactable (v0.3.0), reactablefmtr (v2.0.0), recipes (v1.0.1), Rfast (v2.0.6), rsample (v1.1.0), scales (v1.2.0), softImpute (v1.4.1), stringr (v1.4.0), styler (v1.7.0), tibble (v3.1.8), tidyclust (v0.0.0.9000), tidyr (v1.2.0), tune (v1.0.0), workflows (v1.0.0.9000) and yardstick (v1.0.0).
Principal Components Analysis
The USArrests data set from the datasets package contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Column Name | Column Description
---|---
Murder | Number of arrests for murder (per 100,000)
Assault | Number of arrests for assault (per 100,000)
UrbanPop | Percent of the population living in urban areas
Rape | Number of arrests for rape (per 100,000)
Turn USArrests into a tibble and move the rownames into a column.
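The rendered recipe below reports five predictors (the four numeric columns plus the state name), so the hidden chunk presumably resembles this minimal sketch; the object name us_arrests is an assumption.
Code
# Move the state names out of the rownames into a proper column;
# the object name `us_arrests` is an assumption.
us_arrests <- USArrests %>%
  tibble::as_tibble(rownames = "state")
us_arrests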
We create a recipe with recipes::recipe, recipes::all_numeric_predictors(), recipes::step_normalize, and recipes::step_pca. By default, recipes::step_pca uses stats::prcomp with the argument defaults set to retx = FALSE, center = FALSE, scale. = FALSE, and tol = NULL.
Code
pca_recipe <- recipes::recipe(
formula = ~.,
data = us_arrests
) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca",
num_comp = 4
)
pca_recipe
> Recipe
>
> Inputs:
>
> role #variables
> predictor 5
>
> Operations:
>
> Centering and scaling for recipes::all_numeric_predictors()
> PCA extraction with recipes::all_numeric_predictors()
Apply recipes::prep and recipes::bake to compute the scores of the principal component analysis.
Code
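A minimal sketch of what these hidden chunks likely compute, assuming the object names pca_estimates and pca_score (the latter matches its later use in the biplot).
Code
# Estimate the recipe steps, then extract the principal component
# scores; new_data = NULL returns the processed training data, and
# the non-numeric `state` column is carried through untouched.
pca_estimates <- recipes::prep(pca_recipe)
pca_score <- recipes::bake(pca_estimates, new_data = NULL)
pca_score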
Code
We can explore the results of the PCA using recipes::prep and parsnip::tidy. We can see that PCA is done at step number 2. As such, we extract the PCA loadings of the pca step.
Code
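A sketch of the extraction, assuming the name tidied_pca_loadings used by the plots below.
Code
# type = "coef" returns one row per (term, component) pair,
# with the loading in the `value` column.
tidied_pca_loadings <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")
tidied_pca_loadings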
Here are the PCA loadings in wide table form.
Code
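A sketch of the pivot; the reactable display mirrors how other tables are shown in this lab, but is an assumption.
Code
tidied_pca_loadings %>%
  tidyr::pivot_wider(
    names_from = "component",
    values_from = "value"
  ) %>%
  reactable::reactable(defaultPageSize = 5)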
We make a visualization to see what the loadings of the four components look like.
Code
tidied_pca_loadings %>%
dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["value"]],
y = .data[["terms"]],
fill = .data[["terms"]]
)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
ggplot2::labs(y = NULL)
Let us take a closer look at how the four variables contribute to each of the four components.
Code
tidied_pca_loadings %>%
dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
dplyr::group_by(.data[["component"]]) %>%
dplyr::top_n(4, abs(.data[["value"]])) %>%
dplyr::ungroup() %>%
dplyr::mutate(
terms = tidytext::reorder_within(
x = .data[["terms"]],
by = abs(.data[["value"]]),
within = .data[["component"]]
)
) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = abs(.data[["value"]]),
y = .data[["terms"]],
fill = .data[["value"]] > 0
)) +
ggplot2::geom_col() +
ggplot2::facet_wrap(
facets = ggplot2::vars(.data[["component"]]),
scales = "free_y"
) +
tidytext::scale_y_reordered() +
ggplot2::labs(
x = "Absolute value of contribution to PCA component",
y = NULL,
fill = "Positive?"
)
How much variation are we capturing for each component?
Here is how we can view the variance statistics in long form.
Code
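A sketch of the extraction, assuming the name tidied_pca_variance used in the plot below.
Code
# type = "variance" returns the variance, percent variance, and
# their cumulative versions for each component, in long form.
tidied_pca_variance <- pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")
tidied_pca_variance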
Here is how we can view the variance statistics in wide form.
Code
Here is a simple plot to show the variance captured for each component.
Code
# Get the variance
percent_variation <- tidied_pca_variance %>%
dplyr::filter(.data[["terms"]] == "percent variance") %>%
dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
# I use [1:4] to select the first four components
dplyr::tibble(
component = unique(tidied_pca_loadings$component)[1:4],
percent_var = percent_variation[1:4]
) %>%
dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["percent_var"]]
)) +
ggplot2::geom_col() +
ggplot2::geom_text(
mapping = ggplot2::aes(
label = paste(round(.data[["percent_var"]] * 100, 0), "%")
),
vjust = -0.2
) +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Percent variance \nexplained by each \nPCA component"
)
Here is a PCA biplot of PC1 and PC2.
Code
filtered_loadings <- tidied_pca_loadings %>%
tidyr::pivot_wider(
names_from = .data[["component"]],
values_from = .data[["value"]]
)
ggplot2::ggplot() +
ggplot2::geom_point(
data = pca_score,
mapping = ggplot2::aes(
x = .data[["PC1"]],
y = .data[["PC2"]]
),
size = 2,
alpha = 0.6
) +
ggplot2::geom_text(
data = pca_score,
mapping = ggplot2::aes(
x = .data[["PC1"]],
y = .data[["PC2"]],
label = .data[["state"]]
),
check_overlap = TRUE,
size = 2.5,
hjust = "inward"
) +
ggplot2::geom_segment(
data = filtered_loadings,
mapping = ggplot2::aes(
x = 0,
y = 0,
xend = .data[["PC1"]] * 2.5,
yend = .data[["PC2"]] * 2.5
),
arrow = ggplot2::arrow(length = grid::unit(
x = 0.5,
units = "picas"
)),
color = "blue"
) +
ggplot2::annotate(
geom = "text",
x = filtered_loadings[["PC1"]] * 2.7,
y = filtered_loadings[["PC2"]] * 2.7,
label = filtered_loadings[["terms"]]
) +
ggplot2::coord_fixed(clip = "off") +
ggplot2::theme_minimal()
Matrix Completion
Initialisation
Code
missing_data <- data.frame(
SBP = c(-3, NA, -1, 1, 1, NA),
DBP = c(NA, -2, 0, NA, 2, 4)
)
colour_missing_cells <- function(value, index, name) {
missing_indices <- which(is.na(missing_data[[name]]))
if (index %in% missing_indices) {
list(background = "#aaf0f8")
}
}
xbar <- colMeans(missing_data, na.rm = TRUE)
initialisation_data <- data.frame(
SBP = c(-3, -0.5, -1, 1, 1, -0.5),
DBP = c(1, -2, 0, 1, 2, 4)
)
thresh <- 1e-2
ismiss <- is.na(missing_data)
Functions
These are the implementations of steps 2(a) and 2(b) of Algorithm 12.1.
Code
fit_pca <- function(input_data, num_comp = 1) {
data_pca <- prcomp(input_data, center = FALSE)
pca_loadings <- data_pca$rotation
pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
pca_scores <- data_pca$x[, 1:num_comp, drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Code
fit_svd <- function(input_data, num_comp = 1) {
svd_object <- svd(input_data)
pca_loadings <- svd_object$v[, 1:num_comp, drop = FALSE]
pca_loadings_inverse <- t(pca_loadings)
pca_scores <- (svd_object$u %*% diag(svd_object$d))[, 1:num_comp, drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Code
fit_pca_tidy <- function(input_data, num_comp = 1) {
pca_recipe <- recipes::recipe(
formula = ~.,
data = input_data
) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca"
)
pca_estimates <- recipes::prep(x = pca_recipe, training = input_data)
pca_scores <- pca_estimates %>%
recipes::bake(new_data = input_data) %>%
as.matrix()
pca_scores <- pca_scores[, 1:num_comp, drop = FALSE]
pca_loadings <- pca_estimates %>%
parsnip::tidy(id = "pca", type = "coef") %>%
tidyr::pivot_wider(
names_from = .data[["component"]],
values_from = .data[["value"]]
) %>%
dplyr::select(-c("terms", "id")) %>%
as.matrix()
pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
estimated_data <- pca_scores %*% pca_loadings_inverse
return(estimated_data)
}
Iteration 1
Code
initialisation_data_pca <- initialisation_data
# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_pca %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_pca
> [1] 1.48974
Code
initialisation_data_svd <- initialisation_data
# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_svd %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_svd
> [1] 1.48974
Code
initialisation_data_tidymodels <- initialisation_data
# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)
# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_old_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
initialisation_data_tidymodels %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
)
Code
mss_old_tidymodels
> [1] 1.48974
The rest of the iterations
Code
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)
# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_pca - mss_pca
mss_old_pca <- mss_pca
cat(
"Iter:", iter, "\n",
"MSS:", mss_pca, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_pca)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> SBP DBP
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> SBP DBP
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> SBP DBP
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> SBP DBP
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> SBP DBP
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> SBP DBP
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> SBP DBP
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> SBP DBP
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
Code
iter <- 1
rel_err <- mss_old_svd
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_svd(initialisation_data_svd, num_comp = 1)
# Step 2(b)
initialisation_data_svd[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_svd <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_svd - mss_svd
mss_old_svd <- mss_svd
cat(
"Iter:", iter, "\n",
"MSS:", mss_svd, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_svd)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> [,1] [,2]
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
Code
iter <- 1
rel_err <- mss_old_tidymodels
thresh <- 1e-2
while (rel_err > thresh) {
iter <- iter + 1
# Step 2(a)
Xapp <- fit_pca_tidy(initialisation_data_tidymodels, num_comp = 1)
# Step 2(b)
initialisation_data_tidymodels[ismiss] <- Xapp[ismiss]
# Step 2(c)
mss_tidymodels <- mean(((missing_data - Xapp)[!ismiss])^2)
rel_err <- mss_old_tidymodels - mss_tidymodels
mss_old_tidymodels <- mss_tidymodels
cat(
"Iter:", iter, "\n",
"MSS:", mss_tidymodels, "\n",
"Rel. Err:", rel_err, "\n"
)
print("Xapp")
print(Xapp)
print("Data")
print(initialisation_data_tidymodels)
}
> Iter: 2
> MSS: 1.43288
> Rel. Err: 0.05686009
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.24567017 1.6296426
> [2,] 0.29799874 -1.9767619
> [3,] -0.02222086 0.1474011
> [4,] -0.11358385 0.7534536
> [5,] -0.27258141 1.8081571
> [6,] -0.59617741 3.9547173
> [1] "Data"
> SBP DBP
> 1 -3.0000000 1.6296426
> 2 0.2979987 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.7534536
> 5 1.0000000 2.0000000
> 6 -0.5961774 4.0000000
> Iter: 3
> MSS: 1.222089
> Rel. Err: 0.2107907
> [1] "Xapp"
> [,1] [,2]
> [1,] -0.73055871 2.3385975
> [2,] 0.59572858 -1.9069916
> [3,] -0.08891186 0.2846165
> [4,] -0.12553344 0.4018461
> [5,] -0.48032110 1.5375598
> [6,] -1.19147316 3.8140345
> [1] "Data"
> SBP DBP
> 1 -3.0000000 2.3385975
> 2 0.5957286 -2.0000000
> 3 -1.0000000 0.0000000
> 4 1.0000000 0.4018461
> 5 1.0000000 2.0000000
> 6 -1.1914732 4.0000000
> Iter: 4
> MSS: 0.8998268
> Rel. Err: 0.3222623
> [1] "Xapp"
> [,1] [,2]
> [1,] -1.53549232 3.07087697
> [2,] 0.91916923 -1.83827401
> [3,] -0.20001122 0.40000841
> [4,] 0.03926938 -0.07853601
> [5,] -0.60000561 1.19996915
> [6,] -1.83834165 3.67655441
> [1] "Data"
> SBP DBP
> 1 -3.0000000 3.07087697
> 2 0.9191692 -2.00000000
> 3 -1.0000000 0.00000000
> 4 1.0000000 -0.07853601
> 5 1.0000000 2.00000000
> 6 -1.8383417 4.00000000
> Iter: 5
> MSS: 0.6967107
> Rel. Err: 0.2031161
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.1668365 3.5757572
> [2,] 1.1333206 -1.8702286
> [3,] -0.2685842 0.4432231
> [4,] 0.3033932 -0.5006656
> [5,] -0.6178620 1.0196084
> [6,] -2.2666421 3.7404586
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.5757572
> 2 1.133321 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.5006656
> 5 1.000000 2.0000000
> 6 -2.266642 4.0000000
> Iter: 6
> MSS: 0.612954
> Rel. Err: 0.08375675
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.5102026 3.8916848
> [2,] 1.2439901 -1.9286163
> [3,] -0.2938087 0.4555054
> [4,] 0.5218645 -0.8090711
> [5,] -0.6172021 0.9568773
> [6,] -2.4879805 3.8572330
> [1] "Data"
> SBP DBP
> 1 -3.000000 3.8916848
> 2 1.243990 -2.0000000
> 3 -1.000000 0.0000000
> 4 1.000000 -0.8090711
> 5 1.000000 2.0000000
> 6 -2.487981 4.0000000
> Iter: 7
> MSS: 0.5807517
> Rel. Err: 0.03220227
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.6813227 4.1000892
> [2,] 1.2887789 -1.9707096
> [3,] -0.2995594 0.4580650
> [4,] 0.6701665 -1.0247713
> [5,] -0.6165707 0.9428163
> [6,] -2.5775579 3.9414194
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.100089
> 2 1.288779 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.024771
> 5 1.000000 2.000000
> 6 -2.577558 4.000000
> Iter: 8
> MSS: 0.5665195
> Rel. Err: 0.01423225
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7628696 4.254096
> [2,] 1.2959129 -1.995367
> [3,] -0.2966660 0.456788
> [4,] 0.7647692 -1.177544
> [5,] -0.6169100 0.949880
> [6,] -2.5918258 3.990734
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.254096
> 2 1.295913 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.177544
> 5 1.000000 2.000000
> 6 -2.591826 4.000000
> Iter: 9
> MSS: 0.558216
> Rel. Err: 0.00830349
> [1] "Xapp"
> [,1] [,2]
> [1,] -2.7988324 4.3825676
> [2,] 1.2826615 -2.0084627
> [3,] -0.2896952 0.4536209
> [4,] 0.8238540 -1.2900364
> [5,] -0.6175465 0.9669887
> [6,] -2.5653230 4.0169255
> [1] "Data"
> SBP DBP
> 1 -3.000000 4.382568
> 2 1.282661 -2.000000
> 3 -1.000000 0.000000
> 4 1.000000 -1.290036
> 5 1.000000 2.000000
> 6 -2.565323 4.000000
SoftImpute
Code
initialisation_data_tidymodels %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
) %>%
reactablefmtr::add_title(
title = "Algorithm 12.1"
)
fits <- softImpute::softImpute(missing_data, type = "svd")
softImpute::complete(missing_data, fits) %>%
reactable::reactable(
defaultPageSize = 6,
columns = list(
SBP = reactable::colDef(format = reactable::colFormat(digits = 2)),
DBP = reactable::colDef(format = reactable::colFormat(digits = 2))
),
defaultColDef = reactable::colDef(
style = colour_missing_cells
)
) %>%
reactablefmtr::add_title(
title = "Soft Impute"
)
K-Means Clustering
Introduction
We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.
Code
set.seed(2)
# CLuster centre at (0,0) and (3,-4)
centers <- tibble::tibble(
cluster = factor(1:2),
num_points = c(25, 25), # number points in each cluster
V1 = c(0, 3), # V1 coordinate of cluster center
V2 = c(0, -4) # V2 coordinate of cluster center
)
labelled_points <- centers %>%
dplyr::mutate(
V1 = purrr::map2(.data[["num_points"]], .data[["V1"]], rnorm),
V2 = purrr::map2(.data[["num_points"]], .data[["V2"]], rnorm)
) %>%
dplyr::select(-c("num_points")) %>%
tidyr::unnest(cols = c("V1", "V2"))
labelled_points %>%
reactable::reactable(
defaultPageSize = 10,
columns = list(
V1 = reactable::colDef(format = reactable::colFormat(digits = 2)),
V2 = reactable::colDef(format = reactable::colFormat(digits = 2))
)
)
Code
ggplot2::ggplot(
data = labelled_points,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]],
color = .data[["cluster"]]
)
) +
ggplot2::geom_point(alpha = 1)
We now perform K-means clustering with \(K = 2\). We'll use the built-in stats::kmeans function, which accepts a data frame with all numeric columns as its primary argument.
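The hidden chunk presumably drops the true labels and fits stats::kmeans; a minimal sketch, assuming the object names points (which reappears later in this lab) and kmeans_fit.
Code
# Keep only the numeric coordinates before clustering
points <- labelled_points %>%
  dplyr::select(-c("cluster"))
kmeans_fit <- stats::kmeans(x = points, centers = 2)
kmeans_fit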
Code
> K-means clustering with 2 clusters of sizes 26, 24
>
> Cluster means:
> V1 V2
> 1 0.361163 -0.162447
> 2 2.877777 -4.262629
>
> Clustering vector:
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
> [39] 2 2 2 1 2 2 2 2 2 2 2 2
>
> Within cluster sum of squares by cluster:
> [1] 68.52253 58.53435
> (between_SS / total_SS = 69.5 %)
>
> Available components:
>
> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
> [6] "betweenss" "size" "iter" "ifault"
The broom::tidy function returns information for each cluster, including the mean position (centroid), size, and within-cluster sum-of-squares. The broom::glance function returns model-wise metrics. One of these is tot.withinss, which is the total within-cluster sum-of-squares that we seek to minimize when we perform K-means clustering. We can see which cluster each observation belongs to by using broom::augment, which "predicts" the cluster a given observation belongs to.
Code
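A minimal sketch, assuming the kmeans_fit object from the sketch above; the name predicted_kmeans matches its use in the next plot.
Code
# Attach the predicted cluster (.cluster) to each observation
predicted_kmeans <- broom::augment(kmeans_fit, data = points)
predicted_kmeans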
Code
ggplot2::ggplot(
data = predicted_kmeans,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]],
color = .data[[".cluster"]]
)
) +
ggplot2::geom_point(alpha = 1)
Exploration
In reality, we will not know how many clusters are required for a given dataset. We need to try out a number of different values of \(K\) and then find the best one.
Let us try \(K\) from 1 to 9.
Code
kclusts <- tibble::tibble(k = 1:9) %>%
dplyr::mutate(
kclust = purrr::map(
.x = .data[["k"]],
.f = ~ kmeans(x = points, centers = .x)
)
) %>%
dplyr::mutate(
tidied = purrr::map(
.x = .data[["kclust"]],
.f = broom::tidy
),
glanced = purrr::map(
.x = .data[["kclust"]],
.f = broom::glance
),
augmented = purrr::map(
.x = .data[["kclust"]],
.f = broom::augment, data = points
)
)
kclusts
> # A tibble: 9 × 5
> k kclust tidied glanced augmented
> <int> <list> <list> <list> <list>
> 1 1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 2 2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 3 3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 4 4 <kmeans> <tibble [4 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 5 5 <kmeans> <tibble [5 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 6 6 <kmeans> <tibble [6 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 7 7 <kmeans> <tibble [7 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 8 8 <kmeans> <tibble [8 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
> 9 9 <kmeans> <tibble [9 × 5]> <tibble [1 × 4]> <tibble [50 × 3]>
Let assignments be a dataset of which cluster each point goes to, for \(K\) from 1 to 9.
Code
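A sketch of the unnesting the hidden chunk likely performs.
Code
# One row per point per value of k, with the assigned .cluster
assignments <- kclusts %>%
  dplyr::select(c("k", "augmented")) %>%
  tidyr::unnest(cols = c("augmented"))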
Now we can plot the original points using the data from assignments, with each point coloured according to the predicted cluster.
Code
ggplot2::ggplot(
data = assignments,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]]
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(color = .data[[".cluster"]]),
alpha = 0.8
) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["k"]]))
Let clusters be a dataset of cluster mean positions (centroids) for \(K\) from 1 to 9.
Code
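A sketch of the unnesting.
Code
# One row per cluster per value of k, with the centroid
# coordinates in V1 and V2
clusters <- kclusts %>%
  dplyr::select(c("k", "tidied")) %>%
  tidyr::unnest(cols = c("tidied"))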
We can then add the centers of the clusters using the data from clusters.
Code
ggplot2::ggplot(
data = assignments,
mapping = ggplot2::aes(
x = .data[["V1"]],
y = .data[["V2"]]
)
) +
ggplot2::geom_point(
mapping = ggplot2::aes(color = .data[[".cluster"]]),
alpha = 0.8
) +
ggplot2::facet_wrap(facets = ggplot2::vars(.data[["k"]])) +
ggplot2::geom_point(
data = clusters,
size = 10,
shape = "x"
)
Let clusterings be a dataset of clustering metrics for \(K\) from 1 to 9.
Code
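A sketch of the unnesting.
Code
# One row of model-level metrics (including tot.withinss) per k
clusterings <- kclusts %>%
  dplyr::select(c("k", "glanced")) %>%
  tidyr::unnest(cols = c("glanced"))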
Now that we have the total within-cluster sum-of-squares for \(K\) from 1 to 9, we can plot them against k so we can use the elbow method to find the optimal number of clusters.
Code
ggplot2::ggplot(
data = clusterings,
mapping = ggplot2::aes(
x = .data[["k"]],
y = .data[["tot.withinss"]]
)
) +
ggplot2::geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
ggplot2::geom_point(size = 2, color = "midnightblue")
We see an elbow at k = 2, which makes us happy since the data set was specifically created to have 2 clusters. We can now extract the model where k = 2 from kclusts.
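A minimal sketch of the extraction; the name final_kclust is an assumption.
Code
final_kclust <- kclusts %>%
  dplyr::filter(.data[["k"]] == 2) %>%
  dplyr::pull("kclust") %>%
  purrr::pluck(1)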
Hierarchical Clustering
The stats::dist function is used to calculate the Euclidean distance dissimilarity measure.
Code
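A sketch of the hidden chunk, assuming the name dissimilarity_measure used with cluster::agnes below.
Code
# Euclidean distances between all pairs of observations
dissimilarity_measure <- stats::dist(points, method = "euclidean")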
The stats::hclust function implements hierarchical clustering in R. We use the data used for K-means clustering to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.
We use factoextra::fviz_dend to visualize the clustering created using hclust.
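A minimal sketch for the complete linkage case; single and average linkage work the same way via the method argument, and the object name is an assumption.
Code
res_hclust_complete <- dissimilarity_measure %>%
  stats::hclust(method = "complete")
factoextra::fviz_dend(res_hclust_complete, main = "complete")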
To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the stats::cutree function.
We analyze the nature of the clusters mathematically by evaluating the agglomerative coefficient, which measures the clustering structure of the dataset: values closer to 1 suggest a more balanced clustering structure, while values closer to 0 suggest less well-formed clusters. However, the agglomerative coefficient tends to become larger as the number of samples increases, so it should not be used to compare across data sets of very different sizes.
cluster::agnes allows us to compute this metric on the results from hierarchical clustering. It looks like the complete linkage did the best.
Code
# Compute agglomerative coefficient values
ac_metric <- list(
complete_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "complete"
)$ac,
average_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "average"
)$ac,
single_ac = cluster::agnes(
x = dissimilarity_measure,
metric = "euclidean",
method = "single"
)$ac
)
ac_metric
> $complete_ac
> [1] 0.9359608
>
> $average_ac
> [1] 0.8887208
>
> $single_ac
> [1] 0.6700538
Although hierarchical clustering provides a fully connected dendrogram representing the cluster relationships, we may still need to choose the preferred number of clusters to extract. Fortunately, we can execute approaches similar to those discussed for K-means clustering with factoextra::fviz_nbclust.
Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "wss",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "silhouette",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = points,
FUN = factoextra::hcut,
method = "gap_stat",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
Another way of calculating distances is based on Pearson correlation, using proxy::dist. However, this only makes sense if we have data with 3 or more variables.
Code
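The hidden chunk presumably builds a small data set with at least three variables; this sketch, including the name three_var_data used below, is an assumption.
Code
set.seed(2)
# Hypothetical three-variable data set
three_var_data <- tibble::tibble(
  V1 = rnorm(50),
  V2 = rnorm(50),
  V3 = rnorm(50)
)
# Correlation-based dissimilarity between observations
proxy::dist(three_var_data, method = "correlation")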
Code
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "wss",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "silhouette",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = three_var_data,
FUN = factoextra::hcut,
method = "gap_stat",
hc_metric = "pearson",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic data.
We illustrate these techniques on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.
label represents the type of cancer, while V_1 to V_6830 each represent a gene.
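A sketch of how the nci60 tibble used below might be built from ISLR2::NCI60; the exact construction is an assumption.
Code
nci60 <- ISLR2::NCI60$data %>%
  as.data.frame() %>% # columns become V1 ... V6830
  tibble::as_tibble() %>%
  dplyr::mutate(label = factor(ISLR2::NCI60$labs)) %>%
  dplyr::relocate("label")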
PCA on the NCI60 Data
We first perform PCA on the data after scaling the variables (genes) to have standard deviation one and plot the first few principal component score vectors, in order to visualize the data.
Code
nci60_preproc_rec <- recipes::recipe(formula = label ~ ., data = nci60) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::step_pca(recipes::all_numeric_predictors(),
id = "pca",
threshold = 0.99
)
Code
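A minimal sketch of the prep and bake steps, assuming the names nci60_pca_estimates and nci60_pca_score (the latter matches its later use); recipes zero-pads component names (PC01, PC02, ...) when more than nine components are kept.
Code
nci60_pca_estimates <- recipes::prep(nci60_preproc_rec)
nci60_pca_score <- recipes::bake(nci60_pca_estimates, new_data = NULL)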
Code
We visualise the projections of the NCI60 cancer cell lines onto the first three principal components.
Code
colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC02")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["PC01"]],
y = .data[["PC02"]],
color = .data[["label"]]
)) +
ggplot2::geom_point() +
ggplot2::scale_color_manual(values = colors) +
ggplot2::theme_bw()
Code
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC03")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["PC01"]],
y = .data[["PC03"]],
color = .data[["label"]]
)) +
ggplot2::geom_point() +
ggplot2::scale_color_manual(values = colors) +
ggplot2::theme_bw()
Here is a scatter plot matrix
Code
nci60_pca_score %>%
dplyr::select(c("label", "PC01", "PC02", "PC03")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .panel_x,
y = .panel_y,
color = .data[["label"]],
fill = .data[["label"]]
)) +
ggplot2::geom_point(alpha = 0.4, size = 0.5) +
ggforce::geom_autodensity(alpha = .3) +
ggforce::facet_matrix(rows = ggplot2::vars(-c("label")), layer.diag = 2) +
ggplot2::scale_color_manual(values = colors) +
ggplot2::scale_fill_manual(values = colors) +
cowplot::theme_minimal_grid() +
cowplot::panel_border(color = "black") +
ggplot2::theme(legend.position = "bottom")
To view them interactively, consider using langevitour::langevitour from the R package langevitour.
Code
nci60_pca_score_filtered <- nci60_pca_score %>%
dplyr::select(c(
"label",
"PC01", "PC02", "PC03",
"PC04", "PC05", "PC06"
))
langevitour::langevitour(
nci60_pca_score_filtered[, -1], nci60_pca_score_filtered$label,
pointSize = 2
)
How much variation are we capturing for each component?
We extract the PCA loadings of the pca step.
Code
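A sketch of the extraction, assuming the nci60_pca_estimates object from the sketch above; the names nci60_pca_loadings and nci60_pca_variance match their later use.
Code
nci60_pca_loadings <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "coef")
nci60_pca_variance <- nci60_pca_estimates %>%
  parsnip::tidy(id = "pca", type = "variance")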
Here is how we can view the variance statistics in long form.
Here is how we can view the variance statistics in wide form.
Code
Here is a simple plot to show the variance captured for each component.
Code
# Get the variance
percent_variation <- nci60_pca_variance %>%
dplyr::filter(.data[["terms"]] == "percent variance") %>%
dplyr::pull(.data[["value"]])
percent_variation <- percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
stringr::str_remove(pattern = "PC") %>%
as.integer()
# Plot the percent variance explained by each component
dplyr::tibble(
component = component,
percent_var = percent_variation
) %>%
# dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["percent_var"]]
)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Percent variance \nexplained by each \nPCA component"
)
We see that together, the first seven principal components explain around 40% of the variance in the data.
Code
# Get the variance
cumulative_percent_variation <- nci60_pca_variance %>%
dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
dplyr::pull(.data[["value"]])
cumulative_percent_variation <- cumulative_percent_variation / 100
component <- unique(nci60_pca_loadings$component) %>%
stringr::str_remove(pattern = "PC") %>%
as.integer()
# Plot the cumulative percent variance explained
dplyr::tibble(
component = component,
cumulative_percent_var = cumulative_percent_variation
) %>%
# dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
ggplot2::ggplot(mapping = ggplot2::aes(
x = .data[["component"]],
y = .data[["cumulative_percent_var"]]
)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::scale_y_continuous(labels = scales::percent_format()) +
ggplot2::theme(axis.title.y = ggplot2::element_text(angle = 0)) +
ggplot2::labs(
x = NULL,
y = "Cumulative percent \nvariance explained \nby each PCA \ncomponent"
)
Clustering on NCI60 dataset (Base R)
We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
We first preprocess the data to suit hierarchical clustering.
Code
nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
recipes::step_normalize(recipes::all_numeric_predictors()) %>%
recipes::prep() %>%
recipes::bake(new_data = NULL)
We choose the preferred number of clusters to extract.
Code
nci60_num_cluster <- nci60_scaled %>%
dplyr::select(-c("label"))
# Plot cluster results
p1 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "wss",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle(label = "(A) Elbow method")
p2 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "silhouette",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle("(B) Silhouette method")
p3 <- factoextra::fviz_nbclust(
x = nci60_num_cluster,
FUN = factoextra::hcut,
method = "gap_stat",
hc_metric = "euclidean",
k.max = 10
) +
ggplot2::ggtitle("(C) Gap statistic")
# Display plots side by side
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.
Code
nci60_complete <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "complete")
nci60_average <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "average")
nci60_single <- nci60_num_cluster %>%
stats::dist(method = "euclidean") %>%
stats::hclust(method = "single")
We then visualize them to see if any of them have good natural separations.
Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one.
Complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage.
Code
nci60_complete$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_complete, main = "complete", k = 4)
Code
stats::cutree(tree = nci60_complete, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 2 2
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 2 2 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 2 2 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 3 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 3 1
> COLON COLON COLON COLON COLON COLON
> 4 1 4 4 4 4
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 4 4 4 4 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Code
nci60_average$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_average, main = "average", k = 4)
Code
stats::cutree(tree = nci60_average, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 1
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 2 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 3 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 3 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 1 1 1 1 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 4 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Code
nci60_single$labels <- nci60_scaled$label
factoextra::fviz_dend(nci60_single, main = "single", k = 4)
Code
stats::cutree(tree = nci60_single, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 3 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 1 1 1
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 1 1 1 1 4 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 1 1 1 1 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
We will use complete linkage hierarchical clustering for the analysis that follows. We calculate which label is the most common within each cluster.
Code
tibble::tibble(
label = nci60$label,
cluster_id = stats::cutree(nci60_complete, k = 4)
) %>%
dplyr::count(.data[["label"]], .data[["cluster_id"]]) %>%
dplyr::group_by(.data[["cluster_id"]]) %>%
dplyr::mutate(
total = sum(.data[["n"]]),
prop = .data[["n"]] / sum(.data[["n"]])
) %>%
dplyr::slice_max(n = 1, order_by = .data[["prop"]]) %>%
dplyr::ungroup() %>%
reactable::reactable()
How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?
Code
nci60_kmeans_results <- kmeans(nci60_num_cluster, centers = 4, nstart = 50)
Code
We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different.
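A sketch of how the cross-tabulation below might be produced; the row and column names match the printed output.
Code
cluster_kmeans <- nci60_kmeans_results$cluster
cluster_hclust <- stats::cutree(tree = nci60_complete, k = 4)
table(cluster_hclust, cluster_kmeans)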
Code
> cluster_kmeans
> cluster_hclust 1 2 3 4
> 1 0 9 20 11
> 2 0 0 7 0
> 3 8 0 0 0
> 4 0 0 0 9
Code
Cluster 2 in K-means clustering is identical to cluster 4 in hierarchical clustering. However, the other clusters differ. Of the twenty-seven samples in K-means cluster 3, twenty go to hierarchical clustering cluster 1 and seven go to hierarchical clustering cluster 2.
Rather than performing hierarchical clustering on the entire data matrix, we can also perform hierarchical clustering on the first few principal component score vectors, as follows:
Code
# nci60_scaled <- recipes::recipe(formula = label ~ ., data = nci60) %>%
# recipes::step_normalize(recipes::all_numeric_predictors()) %>%
# recipes::prep() %>%
# recipes::bake(new_data = NULL)
nci60_pca <- recipes::recipe(~., nci60_scaled) %>%
recipes::step_pca(recipes::all_numeric_predictors(), num_comp = 5) %>%
recipes::prep() %>%
recipes::bake(new_data = NULL)
Code
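A minimal sketch of the clustering on the score vectors; the construction of nci60_pca_complete is an assumption.
Code
nci60_pca_complete <- nci60_pca %>%
  dplyr::select(-c("label")) %>%
  stats::dist(method = "euclidean") %>%
  stats::hclust(method = "complete")
nci60_pca_complete$labels <- nci60_pca$label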
Code
stats::cutree(tree = nci60_pca_complete, k = 4)
> CNS CNS CNS RENAL BREAST CNS
> 1 2 1 2 2 2
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 2 2 2 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 2
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 2 2 2 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 1 3 3
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 3 3 3 3 1 1
> COLON COLON COLON COLON COLON COLON
> 1 1 1 1 1 1
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 4 4 4 4 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 2 2 2 2 2
> MELANOMA MELANOMA MELANOMA MELANOMA
> 2 2 2 2
Clustering on NCI60 dataset (tidyclust)
Recall that we scale the data before clustering and remove the labels.
Here is how we can create the clustering elbow plot using tidyclust.
Code
kmeans_spec <- tidyclust::k_means(
num_clusters = tune::tune()
)
nci60_cv <- rsample::vfold_cv(nci60_num_cluster, v = 5)
nci60_rec <- recipes::recipe(formula = ~., data = nci60_num_cluster)
kmeans_wflow <- workflows::workflow(nci60_rec, kmeans_spec)
clust_num_grid <- dials::grid_regular(
tidyclust::num_clusters(range = c(1, 15)),
levels = 15
)
clust_num_grid %>%
reactable::reactable(defaultPageSize = 5)
Code
res <- tidyclust::tune_cluster(
kmeans_wflow,
resamples = nci60_cv,
grid = clust_num_grid,
control = tune::control_grid(save_pred = TRUE, extract = identity),
metrics = tidyclust::cluster_metric_set(
tidyclust::tot_wss,
tidyclust::tot_sse,
tidyclust::sse_ratio
)
)
res_metrics <- res %>% tune::collect_metrics()
res_metrics %>%
reactable::reactable(defaultPageSize = 5)
Code
res_metrics %>%
ggplot2::ggplot(
mapping = ggplot2::aes(
x = .data[["num_clusters"]],
y = .data[["mean"]],
colour = .data[[".metric"]]
)
) +
ggplot2::geom_errorbar(
mapping = ggplot2::aes(
ymin = .data[["mean"]] - .data[["std_err"]],
ymax = .data[["mean"]] + .data[["std_err"]]
),
alpha = 0.8
) +
ggplot2::geom_line(size = 1) +
ggplot2::facet_wrap(
facets = ggplot2::vars(.data[[".metric"]]),
scales = "free",
nrow = 3
) +
ggplot2::ylab("mean over 5 folds") +
ggplot2::xlab("Number of clusters") +
ggplot2::scale_x_continuous(breaks = 1:15) +
ggplot2::theme(legend.position = "none")
tidyclust::hier_clust is used to do the hierarchical clustering. Here, we set the number of clusters to three and use the complete linkage method. It actually uses stats::hclust() and Euclidean distance under the hood.
Code
hc_spec <- tidyclust::hier_clust(
num_clusters = 3,
linkage_method = "complete"
)
hc_fit <- hc_spec %>%
tidyclust::fit(label ~ ., data = nci60_scaled)
We use tidyclust::within_cluster_sse to calculate the sum of squared errors within each cluster.
Code
hc_fit %>%
tidyclust::within_cluster_sse() %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::tot_sse to calculate the total sum of squares. We use tidyclust::sse_ratio to calculate the ratio of the WSS to the total SSE.
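A sketch of both calls, following the same pattern as the tidyclust::within_cluster_sse call above.
Code
hc_fit %>%
  tidyclust::tot_sse()
hc_fit %>%
  tidyclust::sse_ratio()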
We use tidyclust::silhouettes to measure silhouettes between clusters for all observations. Ensure that the labels are removed.
Code
hc_fit %>%
tidyclust::silhouettes(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::avg_silhouette to measure average silhouettes across all observations. Ensure that the labels are removed. (This is also the mean of the column sil_width.)
Code
hc_fit %>%
tidyclust::avg_silhouette(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
Use tidyclust::cluster_metric_set to combine the metrics together.
Code
many_metrics <- tidyclust::cluster_metric_set(
tidyclust::tot_wss,
tidyclust::tot_sse,
tidyclust::sse_ratio,
tidyclust::avg_silhouette
)
# new_data required for tidyclust::avg_silhouette to work
hc_fit %>%
many_metrics(new_data = nci60_num_cluster) %>%
reactable::reactable(defaultPageSize = 5)
We use tidyclust::extract_cluster_assignment to see the cluster assignments. However, the labels need to be updated.
Code
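A sketch of the hidden chunk, using the table display convention of this lab.
Code
hc_fit %>%
  tidyclust::extract_cluster_assignment() %>%
  reactable::reactable(defaultPageSize = 5)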
Centroids can be extracted using tidyclust::extract_centroids. Here is a preview of the first few columns. For hierarchical clustering, the geometric mean over the predictors for each cluster is used.
Code
hc_fit_centroids <- hc_fit %>%
tidyclust::extract_centroids()
hc_fit_centroids[, 1:10] %>%
reactable::reactable(defaultPageSize = 5)
Here is how to see the plots. hc_fit$fit gives back the hclust object. We can plot the hierarchical clustering results.
Code
hc_fit$fit$labels <- nci60_scaled$label
factoextra::fviz_dend(hc_fit$fit, main = "complete", k = 3)
Code
stats::cutree(tree = hc_fit$fit, k = 3)
> CNS CNS CNS RENAL BREAST CNS
> 1 1 1 1 1 1
> CNS BREAST NSCLC NSCLC RENAL RENAL
> 1 1 1 1 1 1
> RENAL RENAL RENAL RENAL RENAL BREAST
> 1 1 1 1 1 1
> NSCLC RENAL UNKNOWN OVARIAN MELANOMA PROSTATE
> 1 1 1 1 1 1
> OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN PROSTATE
> 1 1 1 1 1 1
> NSCLC NSCLC NSCLC LEUKEMIA K562B-repro K562A-repro
> 1 1 1 2 2 2
> LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA COLON
> 2 2 2 2 2 1
> COLON COLON COLON COLON COLON COLON
> 3 1 3 3 3 3
> MCF7A-repro BREAST MCF7D-repro BREAST NSCLC NSCLC
> 3 3 3 3 1 1
> NSCLC MELANOMA BREAST BREAST MELANOMA MELANOMA
> 1 1 1 1 1 1
> MELANOMA MELANOMA MELANOMA MELANOMA
> 1 1 1 1
Alternatively, we can use tidyclust::extract_fit_summary as well.
Code
hc_summary <- hc_fit %>%
tidyclust::extract_fit_summary()
Here is how you can do k-means clustering using tidyclust::k_means.
Code
km_spec <- tidyclust::k_means(num_clusters = 5) %>%
parsnip::set_engine("stats", algorithm = "Lloyd")
km_fit <- km_spec %>%
tidyclust::fit(label ~ ., data = nci60_scaled)
km_summary <- km_fit %>% tidyclust::extract_fit_summary()
stats::predict returns the cluster a new observation belongs to. However, stats::predict on the "training data" will not necessarily produce the same results as tidyclust::extract_cluster_assignment.
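A minimal sketch of the predictions, assuming the names hc_preds and km_preds used below.
Code
hc_preds <- stats::predict(hc_fit, new_data = nci60_num_cluster)
km_preds <- stats::predict(km_fit, new_data = nci60_num_cluster)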
Code
cbind(
hc_preds,
tidyclust::extract_cluster_assignment(hc_fit),
nci60_scaled$label
) %>%
reactable::reactable(defaultPageSize = 5)
We notice that the three-cluster assignments from hier_clust do not line up perfectly with the five-cluster assignments from k_means.
Code
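The hidden chunk likely compares the two prediction vectors; a hedged sketch:
Code
table(
  hc_preds$.pred_cluster,
  km_preds$.pred_cluster
)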
tidyclust also has a useful function called tidyclust::reconcile_clusterings to match the clusters from k-means with the clusters from hierarchical clustering.
Code
cluster_table <- tidyclust::reconcile_clusterings(
primary_cluster_assignment = hc_preds$.pred_cluster,
alt_cluster_assignment = km_preds$.pred_cluster,
one_to_one = FALSE
)
cluster_table %>%
dplyr::mutate(labels = nci60_scaled$label) %>%
reactable::reactable(
defaultPageSize = 5,
filterable = TRUE
)
We can see that KM_1 and KM_2 are matched to HC_1, KM_3 is matched to HC_2, and KM_4 and KM_5 are matched to HC_3.
Blog References
Tengku Hanis’s blog titled “Explore data using PCA”
Julia Silge’s blog titled “PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes”
Julia Silge’s blog titled “Getting started with k-means and #TidyTuesday employment status”
Tidymodels’ tutorial titled “K-means clustering with tidy data principles”
Emil Hvitfeldt’s ISLR tidymodels Labs Chapter 12
Eric (R_ic) RPubs’s post titled “Train and Evaluate Clustering Models using Tidymodels and friends”
Chapter 21 Hierarchical Clustering from Hands-On Machine Learning with R
Package References
Code
get_citation <- function(package_name) {
transform_name <- package_name %>%
citation() %>%
format(style = "text")
return(transform_name[1])
}
packages <- c(
"base", "ggplot2", "rmarkdown", "cluster",
"report", "tidytext", "knitr", "broom",
"cowplot", "dplyr", "factoextra", "forcats",
"ggforce", "gridExtra", "htmltools", "ISLR2",
"magrittr", "Matrix", "parsnip", "proxy",
"purrr", "reactable", "reactablefmtr", "recipes",
"scales", "softImpute", "stringr", "styler",
"tibble", "tidyr", "yardstick", "tidyclust",
"rsample", "tune", "dials", "workflows",
"Rfast", "flexclust", "langevitour"
)
table <- tibble::tibble(Packages = packages)
table %>%
dplyr::mutate(
transform_name = purrr::map_chr(
.data[["Packages"]],
get_citation
)
) %>%
dplyr::pull(.data[["transform_name"]]) %>%
report::as.report_parameters()
- R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
- Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2022). rmarkdown: Dynamic Documents for R. R package version 2.14, https://github.com/rstudio/rmarkdown.
- Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2022). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.3 - For new features, see the ‘Changelog’ file (in the package source), https://CRAN.R-project.org/package=cluster.
- Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://github.com/easystats/report.
- Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi:10.21105/joss.00037 https://doi.org/10.21105/joss.00037, http://dx.doi.org/10.21105/joss.00037.
- Xie Y (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.39, https://yihui.org/knitr/.
- Robinson D, Hayes A, Couch S (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0, https://CRAN.R-project.org/package=broom.
- Wilke C (2020). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 1.1.1, https://CRAN.R-project.org/package=cowplot.
- Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.
- Kassambara A, Mundt F (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7, https://CRAN.R-project.org/package=factoextra.
- Wickham H (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1, https://CRAN.R-project.org/package=forcats.
- Pedersen T (2021). ggforce: Accelerating ‘ggplot2’. R package version 0.3.3, https://CRAN.R-project.org/package=ggforce.
- Auguie B (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3, https://CRAN.R-project.org/package=gridExtra.
- Cheng J, Sievert C, Schloerke B, Chang W, Xie Y, Allen J (2021). htmltools: Tools for HTML. R package version 0.5.2, https://CRAN.R-project.org/package=htmltools.
- James G, Witten D, Hastie T, Tibshirani R (2022). ISLR2: Introduction to Statistical Learning, Second Edition. R package version 1.3-1, https://CRAN.R-project.org/package=ISLR2.
- Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.
- Bates D, Maechler M, Jagan M (2022). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.4-1, https://CRAN.R-project.org/package=Matrix.
- Kuhn M, Vaughan D (2022). parsnip: A Common API to Modeling and Analysis Functions. https://github.com/tidymodels/parsnip, https://parsnip.tidymodels.org/.
- Meyer D, Buchta C (2022). proxy: Distance and Similarity Measures. R package version 0.4-27, https://CRAN.R-project.org/package=proxy.
- Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, https://CRAN.R-project.org/package=purrr.
- Lin G (2022). reactable: Interactive Data Tables Based on ‘React Table’. R package version 0.3.0, https://CRAN.R-project.org/package=reactable.
- Cuilla K (2022). reactablefmtr: Streamlined Table Styling and Formatting for Reactable. R package version 2.0.0, https://CRAN.R-project.org/package=reactablefmtr.
- Kuhn M, Wickham H (2022). recipes: Preprocessing and Feature Engineering Steps for Modeling. R package version 1.0.1, https://CRAN.R-project.org/package=recipes.
- Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.0, https://CRAN.R-project.org/package=scales.
- Hastie T, Mazumder R (2021). softImpute: Matrix Completion via Iterative Soft-Thresholded SVD. R package version 1.4-1, https://CRAN.R-project.org/package=softImpute.
- Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, https://CRAN.R-project.org/package=stringr.
- Müller K, Walthert L (2022). styler: Non-Invasive Pretty Printing of R Code. R package version 1.7.0, https://CRAN.R-project.org/package=styler.
- Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.8, https://CRAN.R-project.org/package=tibble.
- Wickham H, Girlich M (2022). tidyr: Tidy Messy Data. R package version 1.2.0, https://CRAN.R-project.org/package=tidyr.
- Kuhn M, Vaughan D (2022). yardstick: Tidy Characterizations of Model Performance. R package version 1.0.0, https://CRAN.R-project.org/package=yardstick.
- Hvitfeldt E, Bodwin K (2022). tidyclust: A Common API to Clustering. R package version 0.0.0.9000, https://github.com/EmilHvitfeldt/tidyclust.
- Silge J, Chow F, Kuhn M, Wickham H (2022). rsample: General Resampling Infrastructure. R package version 1.1.0, https://CRAN.R-project.org/package=rsample.
- Kuhn M (2022). tune: Tidy Tuning Tools. R package version 1.0.0, https://CRAN.R-project.org/package=tune.
- Kuhn M, Frick H (2022). dials: Tools for Creating Tuning Parameter Values. R package version 1.0.0, https://CRAN.R-project.org/package=dials.
- Vaughan D (2022). workflows: Modeling Workflows. https://github.com/tidymodels/workflows, https://workflows.tidymodels.org.
- Papadakis M, Tsagris M, Dimitriadis M, Fafalios S, Tsamardinos I, Fasiolo M, Borboudakis G, Burkardt J, Zou C, Lakiotaki K, Chatzipantsiou. C (2022). Rfast: A Collection of Efficient and Extremely Fast R Functions. R package version 2.0.6, https://CRAN.R-project.org/package=Rfast.
- Leisch F (2006). “A Toolbox for K-Centroids Cluster Analysis.” Computational Statistics and Data Analysis, 51(2), 526-544.
- Harrison P (2022). langevitour: Langevin Tour. R package version 0.5, https://CRAN.R-project.org/package=langevitour.