class: center, middle, inverse, title-slide .title[ # Chapter 12:
Unsupervised Learning ] .author[ ### Jeremy Selva ]

---

# Introduction

This chapter introduces a diverse set of unsupervised learning techniques

- Principal Components Analysis
- Clustering

---

# 12.1 The Challenge of Unsupervised Learning

.pull-left[ Unsupervised learning is often performed as part of an exploratory data analysis. Given a set of `\(p\)` features `\(X_1,X_2, . . . ,X_p\)` measured on `\(n\)` observations, the goal is to find interesting patterns in these features. Interesting patterns include

- Finding subgroups among the variables
- Finding subgroups among the observations

] .pull-right[ <img src="img/biclustering.jpg" width="100%" style="display: block; margin: auto;" /> ]

However, there is no clear criterion to determine whether a group is interesting or not, as this is subjective.

Image from https://www.sciencedirect.com/science/article/pii/S1532046415001380

---

# 12.2 Principal Components Analysis (PCA)

When we have a large set of (preferably correlated) features `\(X_1,X_2,...,X_p\)`, PCA summarises them into a smaller number of representative variables `\(Z_1,Z_2, . . . ,Z_M\)`, where `\(M < p\)`, that together explain most of the variability in the original set. `\(Z_1,Z_2,...,Z_M\)` are also called principal components.

<img src="img/figure_6_14.jpg" width="70%" style="display: block; margin: auto;" />

---

# 12.2.1 What Are Principal Components?

Given an `\(n\)` by `\(p\)` data set `\(X\)`, with all `\(p\)` variables centred to have mean `\(0\)`, a principal component `\(Z_j\)`, for `\(1 \le j \le M\)`, is expressed in terms of the features `\(X_1, X_2,..., X_p\)` and loadings `\(\phi_{1j}, \phi_{2j}, ... , \phi_{pj}\)` as the linear combination

`$$Z_j = \phi_{1j}X_1 + \phi_{2j}X_2 + ... + \phi_{pj}X_p$$`

where the squared loadings sum to one, that is `\(\sum_{k=1}^{p}(\phi_{kj})^2 = 1\)`.

<img src="img/figure_6_14.jpg" width="50%" style="display: block; margin: auto;" />

---

# 12.2.1 What Are Principal Components?

These `\(p\)` loadings make up the `\(j\)`th principal component loading vector/direction, `\(\phi_{j} = (\phi_{1j}, \phi_{2j},..., \phi_{pj})^{T}\)`

<img src="img/pca_loadings.jpg" width="70%" style="display: block; margin: auto;" />

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs)

---

# 12.2.1 What Are Principal Components?

.pull-left[ The loading vector `\(\phi_{j}\)` is used to calculate the scores `\(z_{ij}\)`, for `\(1 \le i \le n\)`, `\(1 \le j \le M\)`, of each principal component `\(Z_j\)`.

`$$Z_j = (z_{1j},z_{2j},...,z_{ij},...,z_{nj})^T$$`

where `\(z_{ij}\)` is the sum of the products of the loadings and the individual data values

`$$z_{ij} = \sum_{k=1}^{p}(\phi_{kj}x_{ik}) = \phi_{1j}x_{i1} + \phi_{2j}x_{i2} + ...+ \phi_{pj}x_{ip}$$`

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs) ] .pull-right[ <img src="img/scores.jpg" width="75%" style="display: block; margin: auto;" /> ]

---

# 12.2.1 What Are Principal Components?

Each principal component loading vector `\(\phi_{j} = (\phi_{1j}, \phi_{2j}, . . ., \phi_{pj})^{T}\)` is optimised such that

`$$\underset{\phi_{1j},...,\phi_{pj}}{\text{maximize}\space} \{ \frac{1}{n} \sum_{i=1}^{n}( \sum_{k=1}^{p}(\phi_{kj}x_{ik}))^2 \} \space \text{subject to} \space \sum_{k=1}^{p}(\phi_{kj})^2 = 1$$`

where `\(\sum_{k=1}^{p}(\phi_{kj}x_{ik}) = z_{ij}\)` are the scores in the principal component `\(Z_j\)`. As the mean of the scores `\(z_{ij}\)` is `\(0\)`, `\(\frac{1}{n} \sum_{i=1}^{n}(z_{ij} - 0)^2\)` is the sample variance of the scores in the principal component `\(Z_j\)`.

<img src="img/optimise.jpg" width="60%" style="display: block; margin: auto;" />

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs)
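In base R, these loadings and scores can be obtained with `prcomp()`. A minimal sketch (not from the text), assuming a small simulated, column-centred data matrix:

```r
# Simulated data: n = 50 observations, p = 3 centred features (illustrative only)
set.seed(1)
X <- scale(matrix(rnorm(50 * 3), ncol = 3), center = TRUE, scale = FALSE)

pca <- prcomp(X, center = FALSE)

phi <- pca$rotation   # p x p matrix whose columns are the loading vectors
Z   <- pca$x          # n x p matrix of principal component scores

colSums(phi^2)            # each loading vector satisfies sum of squared loadings = 1
max(abs(Z - X %*% phi))   # scores are the linear combinations z_ij = sum_k phi_kj * x_ik
```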
---

# 12.2.2 Another Interpretation

Principal components provide low-dimensional linear surfaces that are closest to the `\(n\)` observations.

The first principal component loading vector `\(\phi_{1} = (\phi_{11}, \phi_{21}, . . ., \phi_{p1})^{T}\)` defines the line in `\(p\)`-dimensional space that is closest to the `\(n\)` observations.

<img src="img/figure_6_15.jpg" width="60%" style="display: block; margin: auto;" />

---

# 12.2.2 Another Interpretation

The first two principal component loading vectors `\(\phi_{1}\)` and `\(\phi_{2}\)` span the 2D plane in `\(p\)`-dimensional space that best fits the `\(n\)` observations.

<img src="img/figure_12_2.jpg" width="50%" style="display: block; margin: auto;" />

---

# 12.2.2 Another Interpretation

By a special property of the loading matrix, `\(\phi^{-1} = \phi^T\)`, when `\(M = min(p,n-1)\)` it is possible to express each data value exactly in terms of the principal component scores and loadings, or `\(x_{ij} = \sum_{m=1}^{M}(z_{im}\phi_{jm})\)`

<img src="img/data_in_terms_of_PC.jpg" width="80%" style="display: block; margin: auto;" />

---

# 12.2.2 Another Interpretation

However, for `\(M < min(p,n-1)\)`, we only have the approximation `\(x_{ij} \approx \sum_{m=1}^{M}(z_{im}\phi_{jm})\)`

<img src="img/figure_12_2_revised.jpg" width="60%" style="display: block; margin: auto;" />

The approximation error `\(x_{ij} - \sum_{m=1}^{M}(z_{im}\phi_{jm})\)` gets smaller as `\(M\)` gets closer to `\(min(p,n-1)\)`
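A minimal sketch of this reconstruction, reusing the simulated matrix `X` and the `prcomp()` fit `pca` from the earlier sketch, showing the approximation error shrinking as `\(M\)` grows:

```r
# Mean squared error of the M-component approximation x_ij ~ sum_m z_im * phi_jm
reconstruction_mse <- function(X, pca, M) {
  X_approx <- pca$x[, 1:M, drop = FALSE] %*% t(pca$rotation[, 1:M, drop = FALSE])
  mean((X - X_approx)^2)
}

sapply(1:ncol(X), function(M) reconstruction_mse(X, pca, M))
# The error decreases with M and is essentially zero once M = min(p, n - 1)
```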
---

# 12.2.3 The Proportion of Variance Explained

We would like to know how much of the information in a given data set is lost by projecting the observations onto the first few principal components, that is, how much of the variance in the data is not contained in them. We first define some quantities.

The total variance present in a data set (assuming that the variables have mean `\(0\)`) is

`$$\sum_{j=1}^{p}Var(X_j) = \sum_{j=1}^{p} (\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - 0)^{2})$$`

The variance explained by the `\(m\)`th principal component is the variance of its scores (which also have mean `\(0\)`).

`$$\frac{1}{n}\sum_{i=1}^{n}z_{im}^2 = \frac{1}{n}\sum_{i=1}^{n}(\sum_{j=1}^{p}\phi_{jm}x_{ij})^{2}$$`

---

# 12.2.3 The Proportion of Variance Explained

<img src="img/variance_definition.jpg" width="80%" style="display: block; margin: auto;" />

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs)

---

# 12.2.3 The Proportion of Variance Explained

The proportion of variance explained `\((PVE)\)` by the `\(m\)`th principal component is

`$$PVE_m = \frac{Variance\space explained\space by\space the\space mth\space principal\space component\space}{Total\space variance\space present\space in\space a\space data\space set\space} = \frac{\frac{1}{n}\sum_{i=1}^{n}z_{im}^2}{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}$$`

<img src="img/proportion_of_variance.jpg" width="50%" style="display: block; margin: auto;" />

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs)

---

# 12.2.3 The Proportion of Variance Explained

The variance of the data can also be decomposed into the variance of the first `\(M\)` principal components plus the mean squared error of this `\(M\)`-dimensional approximation.

`$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}_\text{Var. of data} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}(z_{im})^{2}}_\text{Var. of first M PCs} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}_\text{MSE of M-dimensional approximation}$$`

<img src="img/variance_decomposition.jpg" width="60%" style="display: block; margin: auto;" />

Images from [TileStats youtube video](https://www.youtube.com/watch?v=S51bTyIwxFs)

---

# 12.2.3 The Proportion of Variance Explained

`$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}_\text{Var. of data} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}(z_{im})^{2}}_\text{Var. of first M PCs} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}_\text{MSE of M-dimensional approximation}$$`

Bring `\(\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}_\text{MSE of M-dimensional approximation}\)` to the left hand side

`$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}_\text{Var. of data} - \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}_\text{MSE of M-dimensional approximation} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}(z_{im})^{2}}_\text{Var. of first M PCs}$$`

---

# 12.2.3 The Proportion of Variance Explained

`$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}_\text{Var. of data} - \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}_\text{MSE of M-dimensional approximation} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}(z_{im})^{2}}_\text{Var. of first M PCs}$$`

Divide by `\(\underbrace{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}_\text{Var. of data} \neq 0\)` on both sides

`$$1 - \frac{\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}{\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}} = \frac{\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}(z_{im})^{2}}{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}} = \sum_{m=1}^{M} \frac{\frac{1}{n}\sum_{i=1}^{n}(z_{im})^{2}}{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}$$`

Recall that `\(PVE_m = \frac{Variance\space explained\space by\space the\space mth\space principal\space component\space}{Total\space variance\space present\space in\space a\space data\space set\space} = \frac{\frac{1}{n}\sum_{i=1}^{n}z_{im}^2}{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}}\)`

---

# 12.2.3 The Proportion of Variance Explained

We have

`$$1 - \frac{\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\sum_{m=1}^{M}z_{im}\phi_{jm})^{2}}{\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}} = \sum_{m=1}^{M} \frac{\frac{1}{n}\sum_{i=1}^{n}(z_{im})^{2}}{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij})^{2}} = \sum_{m=1}^{M}PVE_m$$`

Writing `\(TSS\)` for the total sum of squared elements of `\(X\)`, and `\(RSS\)` for the residual sum of squares of the `\(M\)`-dimensional approximation given by the `\(M\)` principal components,

`$$\sum_{m=1}^{M}PVE_m = 1 - \frac{RSS}{TSS} = R^2$$`

We can interpret the cumulative `\(PVE\)` as the `\(R^2\)` of the approximation for `\(X\)` given by the first `\(M\)` principal components.
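A minimal sketch of reading these quantities off a `prcomp()` fit (again assuming the simulated matrix `X` from the earlier sketches):

```r
pca <- prcomp(X, center = FALSE)

pc_var <- pca$sdev^2             # variance explained by each principal component
pve    <- pc_var / sum(pc_var)   # PVE_m (the 1/n vs 1/(n-1) divisor cancels in the ratio)

pve           # proportion of variance explained by each component
cumsum(pve)   # cumulative PVE, i.e. the R^2 of the first M components

summary(pca)$importance   # prcomp reports the same quantities
```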
<img src="img/change_of_sign_PC.jpg" width="100%" style="display: block; margin: auto;" /> --- # 12.2.4 More on PCA How many principal components is enough ? Typically decided using the scree plot though it can be subjective. <img src="img/figure_12_3.jpg" width="60%" style="display: block; margin: auto;" /> --- # 12.2.4 More on PCA <img src="img/pca_disadvantages.jpg" width="60%" style="display: block; margin: auto;" /> Image from Nature's [Points of Significance](https://www.nature.com/collections/qghhqm/pointsofsignificance) Series [Principal component analysis](https://www.nature.com/articles/nmeth.4346) --- # 12.2.4 More on PCA .pull-left[ Image from [Ten quick tips for effective dimensionality reduction](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] .pull-right[ <img src="img/pca_tips.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # 12.3 Missing Values and Matrix Completion It is possible for datasets to have missing values. Unfortunately, the statistical learning methods that we have seen in this book cannot handle missing values. We could remove data with missing values but it is wasteful. Another way is set the missing `\(x_{ij}\)` to be the mean of the `\(j\)`th column. Although this is a common and convenient strategy, often we can do better by exploiting the correlation between the variables by using principal components. --- # 12.3 Missing Values and Matrix Completion .panelset[ .panel[.panel-name[Full Data] .pull-left[ ```r data <- data.frame( SBP = c(-3,-1,-1, 1, 1, 3), DBP = c(-4,-2, 0, 0, 2, 4)) ``` ] .pull-right[
] ]

.panel[.panel-name[Missing Data] .pull-left[

```r
missing_data <- data.frame(
  SBP = c(-3, NA, -1,  1, 1, NA),
  DBP = c(NA, -2,  0, NA, 2,  4)
)

ismiss <- is.na(missing_data)
```

] .pull-right[
] ]

.panel[.panel-name[Initialisation] Fill in each missing value with the mean of the observed values in its column: the mean of SBP is `\(-0.5\)` and the mean of DBP is `\(1\)`.

.pull-left[

```r
initialisation_data <- data.frame(
  SBP = c(-3, -0.5, -1, 1, 1, -0.5),
  DBP = c( 1, -2,    0, 1, 2,  4)
)
```

] .pull-right[
] ]

.panel[.panel-name[2a function] Implementation of step 2(a) of Algorithm 12.1

.pull-left[

```r
fit_pca <- function(input_data, num_comp = 1) {
  data_pca <- prcomp(input_data, center = FALSE)
  # Loading matrix (p x p); its inverse equals its transpose
  pca_loadings <- data_pca$rotation
  pca_loadings_inverse <- solve(pca_loadings)[1:num_comp, , drop = FALSE]
  # Scores of the first num_comp principal components
  pca_scores <- data_pca$x[, 1:num_comp, drop = FALSE]
  # Low-rank approximation of the data
  estimated_data <- pca_scores %*% pca_loadings_inverse
  return(estimated_data)
}
```

] .pull-right[ <img src="img/algorithm_12_1_2a.jpg" width="100%" style="display: block; margin: auto;" /> ] ]

.panel[.panel-name[Iteration 1] .pull-left[

```r
initialisation_data_pca <- initialisation_data

# Step 2(a)
Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)

# Step 2(b)
initialisation_data_pca[ismiss] <- Xapp[ismiss]
```

] .pull-right[
] ]

.panel[.panel-name[Iteration 1 cont] .pull-left[

```r
# Step 2(c)
mss_old_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
mss_old_pca
```

```
#> [1] 1.48974
```

] .pull-right[ <img src="img/algorithm_12_1_2.jpg" width="100%" style="display: block; margin: auto;" /> <img src="img/algorithm_12_1_2_objective.jpg" width="100%" style="display: block; margin: auto;" /> ] ] ]

---

# 12.3 Missing Values and Matrix Completion

.pull-left[

```r
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2

while (rel_err > thresh) {
  iter <- iter + 1

  # Step 2(a)
  Xapp <- fit_pca(initialisation_data_pca, num_comp = 1)

  # Step 2(b)
  initialisation_data_pca[ismiss] <- Xapp[ismiss]

  # Step 2(c)
  mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_pca - mss_pca
  mss_old_pca <- mss_pca

  cat("Iter:", iter, "\n", "Rel. Err:", rel_err, "\n")
}
```

] .pull-right[

```
#> Iter: 2 
#> Rel. Err: 0.05686009 
#> Iter: 3 
#> Rel. Err: 0.2107907 
#> Iter: 4 
#> Rel. Err: 0.3222623 
#> Iter: 5 
#> Rel. Err: 0.2031161 
#> Iter: 6 
#> Rel. Err: 0.08375675 
#> Iter: 7 
#> Rel. Err: 0.03220227 
#> Iter: 8 
#> Rel. Err: 0.01423225 
#> Iter: 9 
#> Rel. Err: 0.00830349 
```

]

---

# 12.3 Missing Values and Matrix Completion

.pull-left[ Actual Data
] .pull-right[ Imputed Data
] --- # 12.3 Missing Values and Matrix Completion .pull-left[ Actual Data
] .pull-right[ Soft Imputed Data
]

---

# 12.4 Clustering Methods

Clustering splits the observations into distinct, homogeneous groups such that

* Observations within each group are quite similar to each other
* Observations in different groups are quite different from each other.

In this section we focus on perhaps the two best-known clustering approaches:

* K-means Clustering
* Hierarchical Clustering.

---

# 12.4 Clustering Methods

<img src="img/clustering_goal.jpg" width="60%" style="display: block; margin: auto;" />

Image from [Francesco Archetti's slides](https://www.slideserve.com/macy-holden/metodi-numerici-per-la-bioinformatica)

---

# 12.4.1 K-Means Clustering

.pull-left[ An approach to partition a data set into `\(K\)` distinct, non-overlapping clusters. Let

* `\(n\)` be the number of observations, indexed by `\(i\)`
* `\(x_i\)` and `\(x_{i'}\)` be two different observations
* `\(p\)` be the number of features, indexed by `\(j\)`

] .pull-right[ To measure how different two observations `\(x_i\)` and `\(x_{i'}\)` are from each other, the squared Euclidean distance is used

`$$\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2$$`

<img src="img/euclidean_distance.jpg" width="100%" style="display: block; margin: auto;" /> ]

Image from [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)

---

# 12.4.1 K-Means Clustering

Let

* `\(K\)` be the number of clusters
* `\(C_k\)` be the `\(k\)`th cluster, for `\(1 \le k \le K\)`

The sum of all pairwise squared Euclidean distances between the observations in the `\(k\)`th cluster is

`$$\sum_{i,i' \in C_k} (\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2)$$`

Let

* `\(|C_k|\)` be the number of observations in the `\(k\)`th cluster

The within-cluster variation for cluster `\(C_k\)`, denoted `\(W(C_k)\)`, is a measure of the average amount by which the observations within cluster `\(C_k\)` differ from each other.

`$$W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} (\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2)$$`

---

# 12.4.1 K-Means Clustering

K-means clustering partitions the data set into `\(K\)` distinct, non-overlapping clusters such that the total within-cluster variation is as small as possible.

`$$\underset{C_1,...,C_K}{\text{minimize}}\{\sum_{k=1}^{K}W(C_k)\}$$`

`$$\underset{C_1,...,C_K}{\text{minimize}}\{\sum_{k=1}^{K}\frac{1}{|C_k|} \sum_{i,i' \in C_k} (\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2)\}$$`

---

# 12.4.1 K-Means Clustering

Using `\(K=3\)` clusters and `\(p=2\)` features as an example, the algorithm starts by randomly assigning one of the cluster labels `\(C_1, C_2\)` or `\(C_3\)` to each of the observations.

<img src="img/algorithm_12_2_step1.jpg" width="60%" style="display: block; margin: auto;" />

---

# 12.4.1 K-Means Clustering

.pull-left[ We proceed with iteration 1. For each cluster, calculate the cluster's *centroid*. The cluster centroids are computed as the mean of the observations assigned to each cluster.

`$$\bar{x}_{k} = \frac{1}{|C_k|} \sum_{i \in C_k}{x_{i}}$$`

where `\(|C_k|\)` is the number of observations in cluster `\(C_k\)` ] .pull-right[ <img src="img/algorithm_12_2_interation1_step2a.jpg" width="60%" style="display: block; margin: auto;" /> ]

---

# 12.4.1 K-Means Clustering

.pull-left[ Next, for each observation, calculate the distance between the observation and each centroid. Reassign the observation to the cluster whose centroid is closest. Then calculate `\(\sum_{k=1}^{K}W(C_k)\)` for that iteration. ] .pull-right[ <img src="img/algorithm_12_2_interation1_step2b.jpg" width="60%" style="display: block; margin: auto;" /> ]

---

# 12.4.1 K-Means Clustering

Repeat the iteration step on the newly assigned clusters until there are minimal changes to `\(\sum_{k=1}^{K}W(C_k)\)`.

<img src="img/algorithm_12_2_interation2_step2a.jpg" width="50%" style="display: block; margin: auto;" />

---

# 12.4.1 K-Means Clustering

Caveats

* We must decide how many clusters we expect in the data.
* Results obtained will depend on the initial (random) cluster assignment of each observation, which may give inconsistent results.

For this reason, it is important to run the algorithm multiple times from different random initial configurations. One then selects the best solution, i.e. the one for which `\(\sum_{k=1}^{K}W(C_k)\)` is smallest, as in the sketch below.
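A minimal sketch (assuming a small simulated two-feature matrix `x`, not from the text) of running K-means with multiple random starts via the `nstart` argument of `kmeans()`:

```r
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3   # shift half the observations to create some structure

# nstart = 20 repeats the algorithm from 20 random initial assignments and
# keeps the solution with the smallest total within-cluster sum of squares
km_out <- kmeans(x, centers = 3, nstart = 20)

km_out$cluster        # cluster label assigned to each observation
km_out$tot.withinss   # minimised within-cluster sum of squares (half the sum of W(C_k) above)
```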
<img src="img/algorithm_12_2_interation2_step2a.jpg" width="50%" style="display: block; margin: auto;" /> --- # 12.4.1 K-Means Clustering Caveats * We must decide how many clusters we expect in the data. * Results obtained will depend on the initial (random) cluster assignment of each observation which may give inconsistent results. For this reason, it is important to run the algorithm multiple times from different random initial configurations. Then one selects the best solution, i.e. that for which `\(\sum_{k=1}^{K}W(C_k)\)` is the smallest. Hierarchical clustering is an alternative approach to overcome some of the weaknesses of K-Means Clustering --- # 12.4.2 Hierarchical Clustering Hierarchical clustering results in a tree-based representation of the observations, called a dendrogram. Each leaf of the dendrogram represents one observation. Clusters are obtained by cutting the dendrogram at a given height. <img src="img/figure_12_11.jpg" width="60%" style="display: block; margin: auto;" /> --- # 12.4.2 Hierarchical Clustering In practice, people often look at the dendrogram and select by eye a sensible number of clusters, based on the heights of the fusion and the number of clusters desired. <img src="img/figure_12_11.jpg" width="60%" style="display: block; margin: auto;" /> --- # 12.4.2 Hierarchical Clustering Each leaf of the dendrogram represents one observation. Starting out at the bottom of the dendrogram, each of the `\(n\)` observations is treated as its own cluster. <img src="img/hierarchical_clustering-step_1.jpg" width="60%" style="display: block; margin: auto;" /> Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.2 Hierarchical Clustering The algorithm then measures all possible pairwise distance within the `\(n\)` observations. The two observations with the smallest distance is then classified as one cluster, giving a remaining of `\(``\hspace{0.1cm}n-1"\)` observations <img src="img/hierarchical_clustering-step_2.jpg" width="100%" style="display: block; margin: auto;" /> Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.2 Hierarchical Clustering The algorithm continues until all observations are used. <img src="img/hierarchial_cluatering_animation.gif" width="60%" style="display: block; margin: auto;" /> Height of the blocks represents the distance between clusters. Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.2 Hierarchical Clustering How did we determine that the cluster `\(\{p5, p6\}\)` should be fused with the cluster `\(\{p4\}\)` ? <img src="img/hierarchial_cluatering_animation_2.gif" width="60%" style="display: block; margin: auto;" /> The concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. 
This extension is achieved by developing the notion of *linkage*, Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.2 Hierarchical Clustering .pull-left[ Single-linkage: the distance between two clusters is defined as the shortest distance between two points in each cluster Complete-linkage: the distance between two clusters is defined as the longest distance between two points in each cluster. Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster. Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging. ] .pull-right[ <img src="img/linkages.jpg" width="70%" style="display: block; margin: auto;" /> Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) ] --- # 12.4.2 Hierarchical Clustering .pull-left[ Mustafa Murat's [Hierarchical clustering slides](https://mmuratarat.github.io/2020-05-23/hierarchical_clustering) ] .pull-right[ <img src="img/linkages_example.jpg" width="80%" style="display: block; margin: auto;" /> ] --- # 12.4.2 Hierarchical Clustering <img src="img/figure_12_14.jpg" width="70%" style="display: block; margin: auto;" /> --- # 12.4.2 Hierarchical Clustering <img src="img/linkages_pros_and_cons.jpg" width="70%" style="display: block; margin: auto;" /> Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.3 Practical Issues in Clustering <img src="img/subjective.jpg" width="70%" style="display: block; margin: auto;" /> Image from Kavana Rudresh [GitHub page]((https://github.com/kavana-r/RLadies_Hierarchical_Clustering) for [R-Ladies Philly](https://www.youtube.com/RLadiesPhilly) --- # 12.4.3 Practical Issues in Clustering <img src="img/subjective_2.jpg" width="70%" style="display: block; margin: auto;" /> Mustafa Murat's [Hierarchical clustering slides](https://mmuratarat.github.io/2020-05-23/hierarchical_clustering) --- # 12.4.2 Hierarchical Clustering Different dissimilarity measure gives different clusters. <img src="img/figure_12_15.jpg" width="60%" style="display: block; margin: auto;" /> --- # 12.4.3 Practical Issues in Clustering Should the observations or features be standardized to have standard deviation of one ? In the case of hierarchical clustering, * What dissimilarity measure should be used ? * What type of linkage should be used ? * Where should we cut the dendrogram in order to obtain clusters ? In the case of K-means clustering, * How many clusters should we look for in the data ? --- # 12.4.3 Practical Issues in Clustering Kmeans and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster. Clustering methods generally are not very robust to perturbations to the data. We recommend clustering subsets of the data in order to get a sense of the robustness of the clusters obtained. --- # 12.4.3 Practical Issues in Clustering EduClust [webpage](https://educlust.dbvis.de/) <img src="img/EduClust.jpg" width="70%" style="display: block; margin: auto;" />