# Rmarkdown stuff
library(rmarkdown, quietly = TRUE)
library(knitr, quietly = TRUE)
library(htmltools, quietly = TRUE)
library(styler, quietly = TRUE)
library(xaringanExtra, quietly = TRUE)

# Dataset
library(ISLR2, quietly = TRUE)

# Session info and package reporting
library(report, quietly = TRUE)

# Data wrangling

library(dplyr, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(magrittr, quietly = TRUE)

# Create interactive tables
library(reactable, quietly = TRUE)

# For plotting

library(forcats, quietly = TRUE)
library(scales, quietly = TRUE)
library(tidytext, quietly = TRUE)
library(ggplot2, quietly = TRUE)

# For parallel processing for tuning
library(foreach, quietly = TRUE)
library(doParallel, quietly = TRUE)

# For applying tidymodels

library(broom, quietly = TRUE)
library(rsample, quietly = TRUE)
library(parsnip, quietly = TRUE)
library(recipes, quietly = TRUE)
library(dials, quietly = TRUE)
library(tune, quietly = TRUE)
library(workflows, quietly = TRUE)
library(yardstick, quietly = TRUE)

# For subset selection
library(leaps, quietly = TRUE)

# For ridge regression and lasso
library(glmnet, quietly = TRUE)

# For partial least squares regression
# In recipes::step_pls
# Warning message:
#  `step_pls()` failed: Error in loadNamespace(x) : there is no package called ‘mixOmics’
# Install from Bioconductor
library(mixOmics, quietly = TRUE)

# For variable importance
library(vip, quietly = TRUE)

summary(report::report(sessionInfo()))

The analysis was done using the R Statistical language (v4.1.3; R Core Team, 2022) on Windows 10 x64, using the packages vip (v0.3.2), broom (v0.7.12), workflows (v0.2.6), Matrix (v1.4.0), ISLR2 (v1.3.1), xaringanExtra (v0.5.5), reactable (v0.2.3), ggplot2 (v3.3.5), forcats (v0.5.1), scales (v1.1.1), tidyr (v1.2.0), dplyr (v1.0.8), glmnet (v4.1.3), rmarkdown (v2.13), htmltools (v0.5.2), rsample (v0.1.1), styler (v1.7.0), report (v0.5.1), tune (v0.2.0), yardstick (v0.0.9), parsnip (v0.2.1), recipes (v0.2.0), dials (v0.1.0), foreach (v1.5.2), doParallel (v1.0.17), iterators (v1.0.14), mixOmics (v6.18.1), lattice (v0.20.45), tidytext (v0.3.2), magrittr (v2.0.2), leaps (v3.1), MASS (v7.3.55) and knitr (v1.38).

Subset Selection Methods

Best Subset Selection

We will be using the Hitters data set from the ISLR2 package. We wish to predict a baseball player's Salary based on several characteristics included in the data set.

Salary is missing for some players, so we first remove all rows that contain missing values.

Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters %>%
  reactable::reactable(
    defaultPageSize = 5,
    filterable = TRUE
  )

Use the leaps::regsubsets function to perform best subset selection.

An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI.

regfit.full <- leaps::regsubsets(
  x = Salary ~ .,
  data = Hitters,
  nvmax = 8 # default
)

summary(regfit.full)
> Subset selection object
> Call: regsubsets.formula(x = Salary ~ ., data = Hitters, nvmax = 8)
> 19 Variables  (and intercept)
>            Forced in Forced out
> AtBat          FALSE      FALSE
> Hits           FALSE      FALSE
> HmRun          FALSE      FALSE
> Runs           FALSE      FALSE
> RBI            FALSE      FALSE
> Walks          FALSE      FALSE
> Years          FALSE      FALSE
> CAtBat         FALSE      FALSE
> CHits          FALSE      FALSE
> CHmRun         FALSE      FALSE
> CRuns          FALSE      FALSE
> CRBI           FALSE      FALSE
> CWalks         FALSE      FALSE
> LeagueN        FALSE      FALSE
> DivisionW      FALSE      FALSE
> PutOuts        FALSE      FALSE
> Assists        FALSE      FALSE
> Errors         FALSE      FALSE
> NewLeagueN     FALSE      FALSE
> 1 subsets of each size up to 8
> Selection Algorithm: exhaustive
>          AtBat Hits HmRun Runs RBI Walks Years CAtBat
> 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "   
> 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "   
> 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "   
> 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "   
> 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "   
> 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "   
> 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"   
> 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "   
>          CHits CHmRun CRuns CRBI CWalks LeagueN
> 1  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 2  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 3  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 4  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 5  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 6  ( 1 ) " "   " "    " "   "*"  " "    " "    
> 7  ( 1 ) "*"   "*"    " "   " "  " "    " "    
> 8  ( 1 ) " "   "*"    "*"   " "  "*"    " "    
>          DivisionW PutOuts Assists Errors NewLeagueN
> 1  ( 1 ) " "       " "     " "     " "    " "       
> 2  ( 1 ) " "       " "     " "     " "    " "       
> 3  ( 1 ) " "       "*"     " "     " "    " "       
> 4  ( 1 ) "*"       "*"     " "     " "    " "       
> 5  ( 1 ) "*"       "*"     " "     " "    " "       
> 6  ( 1 ) "*"       "*"     " "     " "    " "       
> 7  ( 1 ) "*"       "*"     " "     " "    " "       
> 8  ( 1 ) "*"       "*"     " "     " "    " "

Here we fit up to a 19-variable model.

regfit.full <- leaps::regsubsets(
  x = Salary ~ .,
  data = Hitters,
  nvmax = 19
)

The summary function also returns \(R^2\), \(RSS\), adjusted \(R^2\), \(C_p\), and \(BIC\). We can examine these to try to select the best overall model.

reg.summary <- regfit.full %>%
  summary()

names(reg.summary)
> [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"   
> [7] "outmat" "obj"

For instance, we see that the \(R^2\) statistic increases from 32 %, when only one variable is included in the model, to almost 55 %, when all variables are included. As expected, the \(R^2\) statistic increases monotonically as more variables are included.

reg.summary$rsq
>  [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036
>  [6] 0.5087146 0.5141227 0.5285569 0.5346124 0.5404950
> [11] 0.5426153 0.5436302 0.5444570 0.5452164 0.5454692
> [16] 0.5457656 0.5459518 0.5460945 0.5461159
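
As a quick numeric check (a small sketch that reuses reg.summary from above), we can ask which model size gives the largest adjusted \(R^2\) and which gives the smallest \(C_p\) and \(BIC\).

# Model size with the largest adjusted R^2
which.max(reg.summary$adjr2)
# Model sizes with the smallest Cp and BIC
which.min(reg.summary$cp)
which.min(reg.summary$bic)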

The leaps::regsubsets function has a built-in plot command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the \(BIC\), \(C_p\), adjusted \(R^2\), or \(AIC\). To find out more about this function, type ?plot.regsubsets.

The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a \(BIC\) close to \(−150\). However, the model with the lowest \(BIC\) is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts.

plot(regfit.full, scale = "bic")
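
The same plot can be drawn for the other statistics as well; for example (a quick sketch using the same plot method):

plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")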

We can use the coef function to see the coefficient estimates associated with this model.

coef(regfit.full, 6)
>  (Intercept)        AtBat         Hits        Walks 
>   91.5117981   -1.8685892    7.6043976    3.6976468 
>         CRBI    DivisionW      PutOuts 
>    0.6430169 -122.9515338    0.2643076

Forward and Backward Stepwise Selection

We can also use the leaps::regsubsets function to perform forward stepwise selection by using the argument method = "forward".

regfit.fwd <- leaps::regsubsets(
  x = Salary ~ .,
  data = Hitters,
  nvmax = 19,
  method = "forward"
)

summary(regfit.fwd)
> Subset selection object
> Call: regsubsets.formula(x = Salary ~ ., data = Hitters, nvmax = 19, 
>     method = "forward")
> 19 Variables  (and intercept)
>            Forced in Forced out
> AtBat          FALSE      FALSE
> Hits           FALSE      FALSE
> HmRun          FALSE      FALSE
> Runs           FALSE      FALSE
> RBI            FALSE      FALSE
> Walks          FALSE      FALSE
> Years          FALSE      FALSE
> CAtBat         FALSE      FALSE
> CHits          FALSE      FALSE
> CHmRun         FALSE      FALSE
> CRuns          FALSE      FALSE
> CRBI           FALSE      FALSE
> CWalks         FALSE      FALSE
> LeagueN        FALSE      FALSE
> DivisionW      FALSE      FALSE
> PutOuts        FALSE      FALSE
> Assists        FALSE      FALSE
> Errors         FALSE      FALSE
> NewLeagueN     FALSE      FALSE
> 1 subsets of each size up to 19
> Selection Algorithm: forward
>           AtBat Hits HmRun Runs RBI Walks Years CAtBat
> 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "   
> 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "   
> 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "   
> 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "   
> 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "   
> 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"   
> 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"   
> 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"   
> 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"   
> 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"   
> 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"   
> 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"   
> 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"   
>           CHits CHmRun CRuns CRBI CWalks LeagueN
> 1  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 2  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 3  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 4  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 5  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 6  ( 1 )  " "   " "    " "   "*"  " "    " "    
> 7  ( 1 )  " "   " "    " "   "*"  "*"    " "    
> 8  ( 1 )  " "   " "    "*"   "*"  "*"    " "    
> 9  ( 1 )  " "   " "    "*"   "*"  "*"    " "    
> 10  ( 1 ) " "   " "    "*"   "*"  "*"    " "    
> 11  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 12  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 13  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 14  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 15  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 16  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 17  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 18  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 19  ( 1 ) "*"   "*"    "*"   "*"  "*"    "*"    
>           DivisionW PutOuts Assists Errors NewLeagueN
> 1  ( 1 )  " "       " "     " "     " "    " "       
> 2  ( 1 )  " "       " "     " "     " "    " "       
> 3  ( 1 )  " "       "*"     " "     " "    " "       
> 4  ( 1 )  "*"       "*"     " "     " "    " "       
> 5  ( 1 )  "*"       "*"     " "     " "    " "       
> 6  ( 1 )  "*"       "*"     " "     " "    " "       
> 7  ( 1 )  "*"       "*"     " "     " "    " "       
> 8  ( 1 )  "*"       "*"     " "     " "    " "       
> 9  ( 1 )  "*"       "*"     " "     " "    " "       
> 10  ( 1 ) "*"       "*"     "*"     " "    " "       
> 11  ( 1 ) "*"       "*"     "*"     " "    " "       
> 12  ( 1 ) "*"       "*"     "*"     " "    " "       
> 13  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 14  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 15  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 16  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 17  ( 1 ) "*"       "*"     "*"     "*"    "*"       
> 18  ( 1 ) "*"       "*"     "*"     "*"    "*"       
> 19  ( 1 ) "*"       "*"     "*"     "*"    "*"

To perform backward stepwise selection, use the argument method = "backward".

regfit.bwd <- leaps::regsubsets(
  x = Salary ~ .,
  data = Hitters,
  nvmax = 19,
  method = "backward"
)

summary(regfit.bwd)
> Subset selection object
> Call: regsubsets.formula(x = Salary ~ ., data = Hitters, nvmax = 19, 
>     method = "backward")
> 19 Variables  (and intercept)
>            Forced in Forced out
> AtBat          FALSE      FALSE
> Hits           FALSE      FALSE
> HmRun          FALSE      FALSE
> Runs           FALSE      FALSE
> RBI            FALSE      FALSE
> Walks          FALSE      FALSE
> Years          FALSE      FALSE
> CAtBat         FALSE      FALSE
> CHits          FALSE      FALSE
> CHmRun         FALSE      FALSE
> CRuns          FALSE      FALSE
> CRBI           FALSE      FALSE
> CWalks         FALSE      FALSE
> LeagueN        FALSE      FALSE
> DivisionW      FALSE      FALSE
> PutOuts        FALSE      FALSE
> Assists        FALSE      FALSE
> Errors         FALSE      FALSE
> NewLeagueN     FALSE      FALSE
> 1 subsets of each size up to 19
> Selection Algorithm: backward
>           AtBat Hits HmRun Runs RBI Walks Years CAtBat
> 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "   
> 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "   
> 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "   
> 4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "   
> 5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "   
> 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"   
> 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"   
> 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"   
> 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"   
> 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"   
> 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"   
> 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"   
> 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"   
> 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"   
>           CHits CHmRun CRuns CRBI CWalks LeagueN
> 1  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 2  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 3  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 4  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 5  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 6  ( 1 )  " "   " "    "*"   " "  " "    " "    
> 7  ( 1 )  " "   " "    "*"   " "  "*"    " "    
> 8  ( 1 )  " "   " "    "*"   "*"  "*"    " "    
> 9  ( 1 )  " "   " "    "*"   "*"  "*"    " "    
> 10  ( 1 ) " "   " "    "*"   "*"  "*"    " "    
> 11  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 12  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 13  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 14  ( 1 ) " "   " "    "*"   "*"  "*"    "*"    
> 15  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 16  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 17  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 18  ( 1 ) "*"   " "    "*"   "*"  "*"    "*"    
> 19  ( 1 ) "*"   "*"    "*"   "*"  "*"    "*"    
>           DivisionW PutOuts Assists Errors NewLeagueN
> 1  ( 1 )  " "       " "     " "     " "    " "       
> 2  ( 1 )  " "       " "     " "     " "    " "       
> 3  ( 1 )  " "       "*"     " "     " "    " "       
> 4  ( 1 )  " "       "*"     " "     " "    " "       
> 5  ( 1 )  " "       "*"     " "     " "    " "       
> 6  ( 1 )  "*"       "*"     " "     " "    " "       
> 7  ( 1 )  "*"       "*"     " "     " "    " "       
> 8  ( 1 )  "*"       "*"     " "     " "    " "       
> 9  ( 1 )  "*"       "*"     " "     " "    " "       
> 10  ( 1 ) "*"       "*"     "*"     " "    " "       
> 11  ( 1 ) "*"       "*"     "*"     " "    " "       
> 12  ( 1 ) "*"       "*"     "*"     " "    " "       
> 13  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 14  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 15  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 16  ( 1 ) "*"       "*"     "*"     "*"    " "       
> 17  ( 1 ) "*"       "*"     "*"     "*"    "*"       
> 18  ( 1 ) "*"       "*"     "*"     "*"    "*"       
> 19  ( 1 ) "*"       "*"     "*"     "*"    "*"

For these data, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are all different.

coef(regfit.full, 7)
>  (Intercept)         Hits        Walks       CAtBat 
>   79.4509472    1.2833513    3.2274264   -0.3752350 
>        CHits       CHmRun    DivisionW      PutOuts 
>    1.4957073    1.4420538 -129.9866432    0.2366813
coef(regfit.fwd, 7)
>  (Intercept)        AtBat         Hits        Walks 
>  109.7873062   -1.9588851    7.4498772    4.9131401 
>         CRBI       CWalks    DivisionW      PutOuts 
>    0.8537622   -0.3053070 -127.1223928    0.2533404
coef(regfit.bwd, 7)
>  (Intercept)        AtBat         Hits        Walks 
>  105.6487488   -1.9762838    6.7574914    6.0558691 
>        CRuns       CWalks    DivisionW      PutOuts 
>    1.1293095   -0.7163346 -116.1692169    0.3028847
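
To see exactly where the three seven-variable models disagree, here is a small sketch that compares the names of the selected variables (reusing the fitted regsubsets objects from above).

# Variables chosen by each method for the seven-variable model
vars_full <- names(coef(regfit.full, 7))
vars_fwd <- names(coef(regfit.fwd, 7))
vars_bwd <- names(coef(regfit.bwd, 7))

# In best subset but not forward stepwise, and vice versa
setdiff(vars_full, vars_fwd)
setdiff(vars_fwd, vars_full)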

Choosing Among Models Using the Validation-Set Approach

We split the samples into a training set and a test set.

set.seed(1234)

Hitters_split <- Hitters %>%
  dplyr::mutate(
    isTrainingSet = sample(
      x = c(TRUE, FALSE),
      size = nrow(Hitters),
      replace = TRUE
    )
  ) %>%
  dplyr::relocate(.data[["isTrainingSet"]])

full_train <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == TRUE) %>%
  dplyr::select(-c("isTrainingSet"))

test <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == FALSE) %>%
  dplyr::select(-c("isTrainingSet"))

train_split <- full_train %>%
  dplyr::mutate(
    isValidationSet = sample(
      x = c(TRUE, FALSE),
      size = nrow(full_train),
      replace = TRUE
    )
  ) %>%
  dplyr::relocate(.data[["isValidationSet"]])

validate <- train_split %>%
  dplyr::filter(.data[["isValidationSet"]] == TRUE) %>%
  dplyr::select(-c("isValidationSet"))

train <- train_split %>%
  dplyr::filter(.data[["isValidationSet"]] == FALSE) %>%
  dplyr::select(-c("isValidationSet"))

Now, we apply leaps::regsubsets to the training set in order to perform best subset selection and compute the validation set error for the best model of each model size.

regfit.best <- leaps::regsubsets(Salary ~ ., data = train, nvmax = 19)

validate.mat <- stats::model.matrix(Salary ~ ., data = validate)

val.errors <- rep(NA, 19)

for (i in 1:19) {
  coefi <- stats::coef(regfit.best, id = i)
  pred <- validate.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((validate$Salary - pred)^2)
}

val.errors
>  [1] 152197.4 176717.6 175668.1 189872.0 188283.5
>  [6] 139431.4 140402.2 144188.3 137154.9 130012.8
> [11] 128254.4 136305.3 175777.6 171918.5 159903.7
> [16] 150711.9 156424.0 153931.9 158701.9

We find that the best model is the one that contains 11 variables.

which.min(val.errors)
> [1] 11
Finally, we perform best subset selection on the full training data (the training and validation sets combined) and extract the coefficients of the best 11-variable model.

regfit.best <- leaps::regsubsets(Salary ~ ., data = full_train, nvmax = 19)

coef(regfit.best, which.min(val.errors))
> (Intercept)       AtBat        Hits       Walks 
> 374.6296081  -3.2269892   8.7256272   9.3417318 
>       Years      CHmRun       CRuns      CWalks 
> -23.7454571   1.9340683   1.4336789  -1.2611593 
>   DivisionW     PutOuts     Assists      Errors 
> -94.2091001   0.2327136   0.8373956 -11.1969894

leaps::regsubsets does not supply a predict method, so we create our own.

# A predict method for leaps::regsubsets
predict.regsubsets <- function(object, newdata, id, ...) {
  # Recover the model formula from the original regsubsets() call
  form <- as.formula(object$call[[2]])
  # Build the design matrix for the new data
  mat <- model.matrix(form, newdata)
  # Coefficients of the best model with `id` variables
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  # Keep only the selected columns and multiply by the coefficients
  mat[, xvars] %*% coefi
}

And apply it to both the training and test data.

pred_train <- stats::predict(regfit.best, full_train, id = which.min(val.errors))
pred_test <- stats::predict(regfit.best, test, id = which.min(val.errors))

Here is the \(MSE\) for the training and test data.

mean((full_train$Salary - pred_train)^2)
> [1] 96679.82
mean((test$Salary - pred_test)^2)
> [1] 105683.6

Choosing Among Models Using Cross-Validation

We now try to choose among the models of different sizes using cross-validation. We create a vector that allocates each observation to one of \(k = 10\) folds.

set.seed(1)

Hitters_split <- Hitters %>%
  dplyr::mutate(
    isTrainingSet = sample(
      x = c(TRUE, FALSE),
      size = nrow(Hitters),
      replace = TRUE
    )
  ) %>%
  dplyr::relocate(.data[["isTrainingSet"]])

train <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == TRUE) %>%
  dplyr::select(-c("isTrainingSet"))

test <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == FALSE) %>%
  dplyr::select(-c("isTrainingSet"))

k <- 10
n <- nrow(train)
folds <- sample(rep(1:k, length = n))
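
As a sanity check (a small sketch), we can tabulate the fold assignments to confirm that every observation belongs to exactly one of the \(k\) folds and that the folds are of roughly equal size.

table(folds)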

We write a for loop that performs the cross-validation.

# A predict method for leaps::regsubsets
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

cv.errors <- matrix(NA, k, 19,
  dimnames = list(NULL, paste(1:19))
)
for (j in 1:k) {
  # For each cross validation fold data,
  # get the best i variable model using subset selection
  best.fit <- leaps::regsubsets(Salary ~ .,
    data = train[folds != j, ],
    nvmax = 19
  )

  for (i in 1:19) {
    # Compute the cross validation error for each
    # best i variable model
    pred <- stats::predict(best.fit, train[folds == j, ], id = i)
    cv.errors[j, i] <- mean((train$Salary[folds == j] - pred)^2)
  }
}

This has given us a \(10\) by \(19\) matrix, of which the \((j, i)\)th element corresponds to the cross validation \(MSE\) for the \(j\)th cross-validation fold for the best \(i\)-variable model.

We use the apply function to average over the columns of this matrix in order to obtain a vector whose \(i\)th element is the cross-validation error for the best \(i\)-variable model.

mean.cv.errors <- apply(cv.errors, 2, mean)

plot(mean.cv.errors, type = "b")
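
To locate the minimum of this curve, here is a small sketch that finds the best model size and, assuming the plot above is still the active graphics device, marks it in red.

best_size <- which.min(mean.cv.errors)
best_size
points(best_size, mean.cv.errors[best_size], col = "red", cex = 2, pch = 20)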

We see that cross-validation selects an 11-variable model, so we refit best subset selection on the training data and extract the coefficients of the 11-variable model.

regfit.best <- leaps::regsubsets(Salary ~ ., data = train, nvmax = 19)
coef(regfit.best, which.min(mean.cv.errors))
>  (Intercept)        AtBat         Hits        HmRun 
>   27.1490022   -1.4275486    5.8330210  -10.8599653 
>        Walks       CAtBat        CRuns         CRBI 
>    8.1700662   -0.1884247    1.8806606    0.7803514 
>       CWalks      LeagueN    DivisionW      PutOuts 
>   -0.9328753   76.1994969 -108.1186878    0.2263419

And apply it to both the training and test data.

pred_train <- stats::predict(regfit.best, train, id = which.min(mean.cv.errors))
pred_test <- stats::predict(regfit.best, test, id = which.min(mean.cv.errors))

Here is the \(MSE\) for the training and test data.

mean((train$Salary - pred_train)^2)
> [1] 67248.67
mean((test$Salary - pred_test)^2)
> [1] 136963.3

Ridge Regression

Introduction to ridge regression

Use parsnip::linear_reg, parsnip::set_mode and parsnip::set_engine to create the model.

Set mixture to 0 to indicate ridge regression. For now, we set the penalty/lambda to 0.

Use parsnip::translate to better understand the model created.

Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

ridge_spec <- parsnip::linear_reg(
  mixture = 0,
  penalty = 0
) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("glmnet")

ridge_spec %>%
  parsnip::translate()
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = 0
>   mixture = 0
> 
> Computational engine: glmnet 
> 
> Model fit template:
> glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
>     alpha = 0, family = "gaussian")

Once the specification is created we can fit it to our data using parsnip::fit. We will use all the predictors.

ridge_fit <- ridge_spec %>%
  parsnip::fit(
    formula = Salary ~ .,
    data = Hitters
  )

We can see the parameter estimates for different values of penalty/lambda with parsnip::tidy. Notice how the estimates shrink as the amount of penalty goes up.

parsnip::tidy(ridge_fit) %>%
  reactable::reactable(defaultPageSize = 5)
parsnip::tidy(ridge_fit, penalty = 705) %>%
  reactable::reactable(defaultPageSize = 5)
parsnip::tidy(ridge_fit, penalty = 11498) %>%
  reactable::reactable(defaultPageSize = 5)

Using parsnip::extract_fit_engine, we can visualize how the magnitudes of the coefficients are regularized towards zero as the penalty goes up.

# Arrange figure margin
par(mar = c(5, 5, 5, 1))
ridge_fit %>%
  parsnip::extract_fit_engine() %>%
  plot(xvar = "lambda")

It would be nice if we could find the “best” value of the penalty/lambda. We can do this using tune::tune_grid.

We need three things in order to use tune::tune_grid:

  • a resample object containing the resamples the workflow should be fitted within,

  • a workflow object containing the model and preprocessor, and

  • a tibble penalty_grid containing the parameter values to be evaluated.

Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with rsample::initial_split, rsample::training, rsample::testing and rsample::vfold_cv.

set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

Create the preprocessor

We use the recipes package to create the preprocessing steps. However, the order of the preprocessing steps actually matters; see the Ordering of steps vignette.

We create a basic recipe with recipes::recipe.

Apply recipes::prep and recipes::bake to compute the preprocessing step.

ridge_recipe <- recipes::recipe(
  formula = Salary ~ .,
  data = Hitters_train
)

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)

rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

Preprocess the factor variables selected with recipes::all_nominal_predictors. recipes::step_novel and recipes::step_dummy were used.

ridge_recipe <- ridge_recipe %>%
  # step_novel is important:
  # if the test set has a factor level not seen in training, it will be treated as "new" rather than NA.
  # This prevents the model from throwing an error.
  # https://blog.datascienceheroes.com/how-to-use-recipes-package-for-one-hot-encoding/
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  # Create Dummy variables
  recipes::step_dummy(recipes::all_nominal_predictors())

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)

rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

Normalise all predictors selected with recipes::all_predictors(). recipes::step_zv and recipes::step_normalize were used.

ridge_recipe <- ridge_recipe %>%
  # Remove predictors with zero variation
  recipes::step_zv(recipes::all_predictors()) %>%
  # Standardise all variables
  recipes::step_normalize(recipes::all_predictors())

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)

rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

Specify the model

The model specification will look very similar to what we have seen earlier, but we set penalty = tune::tune(). This tells tune::tune_grid that the penalty parameter should be tuned.

ridge_spec <-
  parsnip::linear_reg(penalty = tune::tune(), mixture = 0) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("glmnet")

ridge_spec %>%
  parsnip::translate()
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = tune()
>   mixture = 0
> 
> Computational engine: glmnet 
> 
> Model fit template:
> glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
>     alpha = 0, family = "gaussian")

Create the workflow

workflows::workflow, workflows::add_recipe and workflows::add_model are used.

ridge_workflow <- workflows::workflow() %>%
  workflows::add_recipe(ridge_recipe) %>%
  workflows::add_model(ridge_spec)

ridge_workflow
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 4 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = tune::tune()
>   mixture = 0
> 
> Computational engine: glmnet

Create the penalty/lambda grid

A penalty/lambda grid of \(50\) numbers from \(0.00001\) (\(10^{-5}\)) to \(100000\) (\(10^5\)) is created.

A regular grid is created using dials::grid_regular, dials::penalty and scales::log10_trans.

penalty_grid <- dials::grid_regular(
  x = dials::penalty(
    range = c(-5, 5),
    trans = scales::log10_trans()
  ),
  levels = 50
)

penalty_grid %>%
  reactable::reactable(defaultPageSize = 5)

Ridge regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the cross validated data with tune::tune_grid. Note that this process may take some time.

doParallel::registerDoParallel()
foreach::getDoParWorkers()
> [1] 3
tune_res <- tune::tune_grid(
  object = ridge_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid
)

tune_res
> # Tuning results
> # 10-fold cross-validation 
> # A tibble: 10 x 4
>    splits           id     .metrics           .notes  
>    <list>           <chr>  <list>             <list>  
>  1 <split [176/20]> Fold01 <tibble [100 x 5]> <tibble>
>  2 <split [176/20]> Fold02 <tibble [100 x 5]> <tibble>
>  3 <split [176/20]> Fold03 <tibble [100 x 5]> <tibble>
>  4 <split [176/20]> Fold04 <tibble [100 x 5]> <tibble>
>  5 <split [176/20]> Fold05 <tibble [100 x 5]> <tibble>
>  6 <split [176/20]> Fold06 <tibble [100 x 5]> <tibble>
>  7 <split [177/19]> Fold07 <tibble [100 x 5]> <tibble>
>  8 <split [177/19]> Fold08 <tibble [100 x 5]> <tibble>
>  9 <split [177/19]> Fold09 <tibble [100 x 5]> <tibble>
> 10 <split [177/19]> Fold10 <tibble [100 x 5]> <tibble>

Using tune::autoplot, we can see how the amount of regularization affects the performance metrics. Do note that using a different seed will give a different plot.

# Note that a different seed will give different plots
tune::autoplot(tune_res)

We can also see the raw metrics that created this chart by calling tune::collect_metrics().

tune::collect_metrics(tune_res) %>%
  reactable::reactable(defaultPageSize = 5)

Here is the ggplot2 way, should tune::autoplot fail.

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["penalty"]],
    y = .data[["mean"]],
    colour = .data[[".metric"]]
  )) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.5
  ) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 2
  ) +
  ggplot2::scale_x_log10() +
  ggplot2::theme(legend.position = "none")

Use tune::show_best to see the top few values for a given metric.

The “best” value can be selected using tune::select_best; this function requires you to specify the metric to select against. The best penalty/lambda value is 569 for the rsq metric, since it gives the highest value. Do note that using a different seed will give a different best penalty/lambda value.

top_penalty <- tune::show_best(tune_res, metric = "rsq", n = 5)
top_penalty %>%
  reactable::reactable(defaultPageSize = 5)
best_penalty <- tune::select_best(tune_res, metric = "rsq")
best_penalty %>%
  reactable::reactable(defaultPageSize = 5)

Ridge regression model with optimised penalty/lambda value

We create the ridge regression workflow with the best penalty score using tune::finalize_workflow.

ridge_final <- tune::finalize_workflow(
  x = ridge_workflow,
  parameters = best_penalty
)

ridge_final
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 4 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = 568.98660290183
>   mixture = 0
> 
> Computational engine: glmnet

We now train the ridge regression model with the training data using parsnip::fit.

ridge_final_fit <- parsnip::fit(object = ridge_final, data = Hitters_train)

We can see the coefficients using tune::extract_fit_parsnip and parsnip::tidy.

ridge_final_fit %>%
  tune::extract_fit_parsnip() %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)

Variable Importance

While we’re at it, let’s see what the most important variables are using the vip package and workflows::extract_fit_parsnip.

vip_table <- ridge_final_fit %>%
  workflows::extract_fit_parsnip() %>%
  vip::vi(lambda = best_penalty$penalty)

vip_table %>%
  reactable::reactable(defaultPageSize = 5)
vip_table %>%
  dplyr::mutate(
    Importance = abs(.data[["Importance"]]),
    Variable = forcats::fct_reorder(
      .f = .data[["Variable"]],
      .x = .data[["Importance"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Importance"]],
    y = .data[["Variable"]],
    fill = .data[["Sign"]]
  )) +
  ggplot2::geom_col() +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::labs(y = NULL)

Ridge regression model on test data

This ridge regression model can now be applied on our testing data set to validate its performance. For regression models, a .pred column is added when parsnip::augment is used.

test_results <- parsnip::augment(x = ridge_final_fit, new_data = Hitters_test)

test_results %>%
  reactable::reactable(defaultPageSize = 5)

We check how well the .pred column matches the Salary using yardstick::rsq.

test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
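
If several metrics are wanted at once, here is a small sketch using yardstick::metric_set; the choice of rmse, rsq and mae is just illustrative.

# Bundle several regression metrics into one function
reg_metrics <- yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::mae)

test_results %>%
  reg_metrics(truth = .data[["Salary"]], estimate = .data[[".pred"]])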

Alternatively, we can use tune::last_fit and tune::collect_metrics.

test_rs <- tune::last_fit(
  object = ridge_final_fit,
  split = Hitters_split
)

test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)

Use tune::collect_predictions to see only the actual and predicted values of the test data.

test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)

Let us take a closer look at the predicted and actual response as a scatter plot.

test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Salary"]],
    y = .data[[".pred"]]
  )) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()

The Lasso

The following procedure will be very similar to what we saw in the ridge regression section. The resampling step is the same.

Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with rsample::initial_split, rsample::training, rsample::testing and rsample::vfold_cv.

set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

Create the preprocessor

The following procedure will be very similar to what we saw in the ridge regression section. The preprocessing steps needed are the same, using recipes::recipe, recipes::all_nominal_predictors, recipes::all_predictors(), recipes::step_novel, recipes::step_dummy, recipes::step_zv and recipes::step_normalize.

lasso_recipe <-
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>%
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_zv(recipes::all_predictors()) %>%
  recipes::step_normalize(recipes::all_predictors())

Specify the model

Again we set penalty = tune::tune(). This tells tune::tune_grid that the penalty parameter should be tuned.

This time, we set mixture = 1 in parsnip::linear_reg to indicate the lasso.

lasso_spec <-
  parsnip::linear_reg(penalty = tune::tune(), mixture = 1) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("glmnet")

lasso_spec %>%
  parsnip::translate()
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = tune()
>   mixture = 1
> 
> Computational engine: glmnet 
> 
> Model fit template:
> glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
>     alpha = 1, family = "gaussian")

Create the workflow

workflows::workflow, workflows::add_recipe and workflows::add_model are used.

lasso_workflow <- workflows::workflow() %>%
  workflows::add_recipe(lasso_recipe) %>%
  workflows::add_model(lasso_spec)

lasso_workflow
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 4 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = tune::tune()
>   mixture = 1
> 
> Computational engine: glmnet

Create the penalty/lambda grid

A penalty/lambda grid of \(50\) numbers from \(0.01\) (\(10^{-2}\)) to \(100\) (\(10^2\)) is created.

A regular grid is created using dials::grid_regular, dials::penalty and scales::log10_trans.

penalty_grid <- dials::grid_regular(
  x = dials::penalty(
    range = c(-2, 2),
    trans = scales::log10_trans()
  ),
  levels = 50
)

penalty_grid %>%
  reactable::reactable(defaultPageSize = 5)

Lasso model fitting on cross validated data

Now we have everything we need and we can fit all the models on the cross validated data with tune::tune_grid. Note that this process may take some time.

doParallel::registerDoParallel()
foreach::getDoParWorkers()
> [1] 3
tune_res <- tune::tune_grid(
  object = lasso_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid
)

tune_res
> # Tuning results
> # 10-fold cross-validation 
> # A tibble: 10 x 4
>    splits           id     .metrics           .notes  
>    <list>           <chr>  <list>             <list>  
>  1 <split [176/20]> Fold01 <tibble [100 x 5]> <tibble>
>  2 <split [176/20]> Fold02 <tibble [100 x 5]> <tibble>
>  3 <split [176/20]> Fold03 <tibble [100 x 5]> <tibble>
>  4 <split [176/20]> Fold04 <tibble [100 x 5]> <tibble>
>  5 <split [176/20]> Fold05 <tibble [100 x 5]> <tibble>
>  6 <split [176/20]> Fold06 <tibble [100 x 5]> <tibble>
>  7 <split [177/19]> Fold07 <tibble [100 x 5]> <tibble>
>  8 <split [177/19]> Fold08 <tibble [100 x 5]> <tibble>
>  9 <split [177/19]> Fold09 <tibble [100 x 5]> <tibble>
> 10 <split [177/19]> Fold10 <tibble [100 x 5]> <tibble>

Using tune::autoplot, we can see how the amount of regularization affects the performance metrics. Do note that using a different seed will give a different plot.

# Note that a different seed will give different plots
tune::autoplot(tune_res)

We can also see the raw metrics that created this chart by calling tune::collect_metrics().

tune::collect_metrics(tune_res) %>%
  reactable::reactable(defaultPageSize = 5)

Here is the ggplot2 way, should tune::autoplot fail.

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["penalty"]],
    y = .data[["mean"]],
    colour = .data[[".metric"]]
  )) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.5
  ) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 2
  ) +
  ggplot2::scale_x_log10() +
  ggplot2::theme(legend.position = "none")

Use tune::show_best to see the top few values for a given metric.

The “best” value can be selected using tune::select_best; this function requires you to specify the metric to select against. The best penalty/lambda value is 22.2 for the rsq metric, since it gives the highest value. Do note that using a different seed will give a different best penalty/lambda value.

top_penalty <- tune::show_best(tune_res, metric = "rsq", n = 5)
top_penalty %>%
  reactable::reactable(defaultPageSize = 5)
best_penalty <- tune::select_best(tune_res, metric = "rsq")
best_penalty %>%
  reactable::reactable(defaultPageSize = 5)

Lasso model with optimised penalty/lambda value

We create the lasso regression workflow with the best penalty score using tune::finalize_workflow.

lasso_final <- tune::finalize_workflow(
  x = lasso_workflow,
  parameters = best_penalty
)

lasso_final
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 4 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Main Arguments:
>   penalty = 22.2299648252619
>   mixture = 1
> 
> Computational engine: glmnet

We now train the lasso regression model with the training data using parsnip::fit.

lasso_final_fit <- parsnip::fit(object = lasso_final, data = Hitters_train)

We can see the coefficients using tune::extract_fit_parsnip and parsnip::tidy.

lasso_final_fit %>%
  tune::extract_fit_parsnip() %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
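
Unlike ridge regression, the lasso can shrink coefficients exactly to zero. Here is a small sketch (reusing lasso_final_fit from above) that lists the predictors dropped at the selected penalty.

lasso_final_fit %>%
  tune::extract_fit_parsnip() %>%
  parsnip::tidy() %>%
  # Keep only the terms whose coefficient was shrunk to exactly zero
  dplyr::filter(.data[["estimate"]] == 0) %>%
  reactable::reactable(defaultPageSize = 5)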

Variable Importance

While we’re at it, let’s see what the most important variables are using the vip package and workflows::extract_fit_parsnip.

vip_table <- lasso_final_fit %>%
  workflows::extract_fit_parsnip() %>%
  vip::vi(lambda = best_penalty$penalty)

vip_table %>%
  reactable::reactable(defaultPageSize = 5)
vip_table %>%
  dplyr::mutate(
    Importance = abs(.data[["Importance"]]),
    Variable = forcats::fct_reorder(
      .f = .data[["Variable"]],
      .x = .data[["Importance"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Importance"]],
    y = .data[["Variable"]],
    fill = .data[["Sign"]]
  )) +
  ggplot2::geom_col() +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::labs(y = NULL)

Lasso regression model on test data

This lasso regression model can now be applied on our testing data set to validate its performance. For regression models, a .pred column is added when parsnip::augment is used.

test_results <- parsnip::augment(x = lasso_final_fit, new_data = Hitters_test)

test_results %>%
  reactable::reactable(defaultPageSize = 5)

We check how well the .pred column matches the Salary using yardstick::rsq.

test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)

Alternatively, we can use tune::last_fit and tune::collect_metrics.

test_rs <- tune::last_fit(
  object = lasso_final_fit,
  split = Hitters_split
)

test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)

Use tune::collect_predictions to see only the actual and predicted values of the test data.

test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)

Let us take a closer look at the predicted and actual response as a scatter plot.

test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Salary"]],
    y = .data[[".pred"]]
  )) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()

Principal Components Regression

Principal component regression is a linear model fitted on PCA-transformed data. Hence, the major changes are in the preprocessing steps.

Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with rsample::initial_split, rsample::training, rsample::testing and rsample::vfold_cv.

set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

Create the preprocessor

We create a recipe with recipes::recipe, recipes::all_nominal_predictors, recipes::all_predictors(), recipes::step_novel, recipes::step_dummy, recipes::step_zv and recipes::step_normalize.

We add recipes::step_pca this time to perform principal component analysis on all the predictors.

pca_recipe <-
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>%
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_zv(recipes::all_predictors()) %>%
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pca(recipes::all_predictors(), id = "pca")

Apply recipes::prep and recipes::bake to compute the preprocessing step.

rec <- recipes::prep(x = pca_recipe, training = Hitters_train)

rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

PCA exploration

We can explore the results of the PCA using recipes::prep and parsnip::tidy. We can see that the PCA is done at step number 5.

recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)

As such, we extract the results of step number 5, the PCA step, using its id.

tidied_pca_loadings <- recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  parsnip::tidy(id = "pca", type = "coef")


tidied_pca_loadings %>%
  reactable::reactable(defaultPageSize = 5)

We make a visualization to see what the first four components look like.

tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["value"]],
    y = .data[["terms"]],
    fill = .data[["terms"]]
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)

Let us take a closer look at the top 6 variables that contribute to the first four components.

tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(6, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(
      x = .data[["terms"]],
      by = abs(.data[["value"]]),
      within = .data[["component"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = abs(.data[["value"]]),
    y = .data[["terms"]],
    fill = .data[["value"]] > 0
  )) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[["component"]]),
    scales = "free_y"
  ) +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PCA component",
    y = NULL,
    fill = "Positive?"
  )

How much variation are we capturing for each component?

Here is how we can see the variance statistics in long form.

tidied_pca_variance <- recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  broom::tidy(id = "pca", type = "variance")

tidied_pca_variance %>%
  reactable::reactable(defaultPageSize = 5)

Here is how we can see the variance statistics in wide form.

tidied_pca_variance %>%
  tidyr::pivot_wider(
    names_from = .data[["component"]],
    values_from = .data[["value"]],
    names_prefix = "PC"
  ) %>%
  dplyr::select(-dplyr::all_of("id")) %>%
  reactable::reactable(defaultPageSize = 5)

Here is a simple plot to show the variance captured by each component.

# Get the variance
percent_variation <- tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>%
  dplyr::pull(.data[["value"]])

percent_variation <- percent_variation / 100

# I use [1:4] to select the first four components
dplyr::tibble(
  component = unique(tidied_pca_loadings$component)[1:4],
  percent_var = percent_variation[1:4]
) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["component"]],
    y = .data[["percent_var"]]
  )) +
  ggplot2::geom_col() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::labs(
    x = NULL,
    y = "Percent variance explained by each PCA component"
  )

Threshold is the fraction of the total variance that should be covered by the components. For example, threshold = .75 means that step_pca should generate enough components to capture 75 percent of the variability in the variables. Note: using this argument will override and reset any value given to num_comp.
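
To make the threshold idea concrete, here is a small sketch (reusing tidied_pca_variance from above) that finds the first component at which the cumulative percent variance reaches 75 percent.

tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
  # First component at which at least 75 percent of the variance is captured
  dplyr::filter(.data[["value"]] >= 75) %>%
  dplyr::slice(1) %>%
  reactable::reactable(defaultPageSize = 5)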

We will now try to find the best threshold value using tune::tune().

pca_recipe <-
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>%
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_zv(recipes::all_predictors()) %>%
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pca(recipes::all_predictors(), threshold = tune::tune())

Specify the model

We use an ordinary linear model.

lm_spec <-
  parsnip::linear_reg() %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("lm")

lm_spec %>%
  parsnip::translate()
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm 
> 
> Model fit template:
> stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

Create the workflow

workflows::workflow, workflows::add_recipe and workflows::add_model are used.

pca_workflow <- workflows::workflow() %>%
  workflows::add_recipe(pca_recipe) %>%
  workflows::add_model(lm_spec)

pca_workflow
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 5 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> * step_pca()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm

Create the threshold grid

A threshold grid of \(10\) numbers from \(0\) to \(1\) is created.

A threshold grid is created using dials::grid_regular and dials::threshold.

threshold_grid <- dials::grid_regular(
  x = dials::threshold(range = c(0, 1)),
  levels = 10
)

threshold_grid %>%
  reactable::reactable(defaultPageSize = 5)

Principal component regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the cross validated data with tune::tune_grid. Note that this process may take some time.

doParallel::registerDoParallel()
foreach::getDoParWorkers()
> [1] 3
tune_res <- tune::tune_grid(
  object = pca_workflow,
  resamples = Hitters_fold,
  grid = threshold_grid
)

tune_res
> # Tuning results
> # 10-fold cross-validation 
> # A tibble: 10 x 4
>    splits           id     .metrics          .notes  
>    <list>           <chr>  <list>            <list>  
>  1 <split [176/20]> Fold01 <tibble [20 x 5]> <tibble>
>  2 <split [176/20]> Fold02 <tibble [20 x 5]> <tibble>
>  3 <split [176/20]> Fold03 <tibble [20 x 5]> <tibble>
>  4 <split [176/20]> Fold04 <tibble [20 x 5]> <tibble>
>  5 <split [176/20]> Fold05 <tibble [20 x 5]> <tibble>
>  6 <split [176/20]> Fold06 <tibble [20 x 5]> <tibble>
>  7 <split [177/19]> Fold07 <tibble [20 x 5]> <tibble>
>  8 <split [177/19]> Fold08 <tibble [20 x 5]> <tibble>
>  9 <split [177/19]> Fold09 <tibble [20 x 5]> <tibble>
> 10 <split [177/19]> Fold10 <tibble [20 x 5]> <tibble>

Using tune::autoplot, we can see how the threshold value affects the performance metrics. Do note that using a different seed will give a different plot.

# Note that a different seed will give different plots
tune::autoplot(tune_res)

We can also see the raw metrics that created this chart by calling tune::collect_metrics().

tune::collect_metrics(tune_res) %>%
  reactable::reactable(defaultPageSize = 5)

Here is the ggplot2 way, should tune::autoplot fail.

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["threshold"]],
    y = .data[["mean"]],
    colour = .data[[".metric"]]
  )) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.5
  ) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 2
  ) +
  ggplot2::theme(legend.position = "none")

Use tune::show_best to see the top few values for a given metric.

The “best” value can be selected using tune::select_best; this function requires you to specify the metric to select against. The best threshold value is 0.889 for the rmse metric, since it gives the lowest value. Do note that using a different seed will give a different best threshold value.

top_threshold <- tune::show_best(tune_res, metric = "rmse", n = 5)
top_threshold %>%
  reactable::reactable(defaultPageSize = 5)
best_threshold <- tune::select_best(tune_res, metric = "rmse")
best_threshold %>%
  reactable::reactable(defaultPageSize = 5)

Principal component regression model with optimised threshold value

We create the principal component regression workflow with the best threshold using tune::finalize_workflow.

pca_final <- tune::finalize_workflow(
  x = pca_workflow,
  parameters = best_threshold
)

pca_final
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 5 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> * step_pca()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm

We now train the principal component regression model with the training data using parsnip::fit.

pca_final_fit <- parsnip::fit(object = pca_final, data = Hitters_train)

We can see the coefficients and statistics using tune::extract_fit_parsnip, broom::tidy and broom::glance, which work on lm objects.

pca_final_fit %>%
  tune::extract_fit_parsnip() %>%
  broom::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
pca_final_fit %>%
  tune::extract_fit_parsnip() %>%
  broom::glance() %>%
  reactable::reactable(defaultPageSize = 5)

Principal component regression model on test data

This principal component regression model can now be applied on our testing data set to validate its performance. For regression models, a .pred column is added when parsnip::augment is used.

test_results <- parsnip::augment(x = pca_final_fit, new_data = Hitters_test)

test_results %>%
  reactable::reactable(defaultPageSize = 5)

We check how well the .pred column matches the Salary using yardstick::rsq.

test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)

Alternatively, we can use tune::last_fit and tune::collect_metrics.

test_rs <- tune::last_fit(
  object = pca_final_fit,
  split = Hitters_split
)

test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)

Use tune::collect_predictions to see only the actual and predicted values of the test data.

test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)

Let us take a closer look at the predicted and actual response as a scatter plot.

test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Salary"]],
    y = .data[[".pred"]]
  )) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()

Partial Least Squares

Partial least squares regression is a linear model fitted to PLS-transformed data. Hence, the major changes will be on the preprocessing steps.

Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with rsample::initial_split, rsample::training, rsample::testing and rsample::vfold_cv.

set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

Create the preprocessor

We create a recipe with recipes::recipe, recipes::all_nominal_predictors, recipes::all_predictors(), recipes::step_novel, recipes::step_dummy, recipes::step_zv and recipes::step_normalize.

We add recipes::step_pls this time to perform the partial least squares calculation on all the predictors. num_comp is the number of partial least squares components to retain as new predictors. outcome is the response variable for the partial least squares calculation to use.

pls_recipe <-
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>%
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_zv(recipes::all_predictors()) %>%
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pls(recipes::all_predictors(), num_comp = 19, outcome = "Salary")

Apply recipes::prep and recipes::bake to compute the preprocessing step.

rec <- recipes::prep(x = pls_recipe, training = Hitters_train)

rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)
rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

PLS exploration

We can explore the results of the PLS step using recipes::prep and parsnip::tidy. We can see that PLS is done at step number 5.

recipes::prep(x = pls_recipe, training = Hitters_train) %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)

As such, we extract the results of step number 5, the PLS step.

tidied_pls <- recipes::prep(x = pls_recipe, training = Hitters_train) %>%
  parsnip::tidy(5)


tidied_pls %>%
  reactable::reactable(defaultPageSize = 5)

We make a visualization to see what the first four components look like

tidied_pls %>%
  dplyr::filter(.data[["component"]] %in% c("PLS1", "PLS2", "PLS3", "PLS4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["value"]],
    y = .data[["terms"]],
    fill = .data[["terms"]]
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)

Let us take a closer look at the top 6 variables that contribute to the first four components

tidied_pls %>%
  dplyr::filter(.data[["component"]] %in% c("PLS1", "PLS2", "PLS3", "PLS4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(6, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(
      x = .data[["terms"]],
      by = abs(.data[["value"]]),
      within = .data[["component"]]
    )
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = abs(.data[["value"]]),
    y = .data[["terms"]],
    fill = .data[["value"]] > 0
  )) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[["component"]]),
    scales = "free_y"
  ) +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PLS component",
    y = NULL,
    fill = "Positive?"
  )

num_comp is the number of partial least squares components to retain as new predictors. We will now try to find the best number of components using tune::tune().

pls_recipe <-
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>%
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%
  recipes::step_zv(recipes::all_predictors()) %>%
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pls(recipes::all_predictors(), num_comp = tune::tune(), outcome = "Salary")

Specify the model

We use a linear model

lm_spec <-
  parsnip::linear_reg() %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("lm")

lm_spec %>%
  parsnip::translate()
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm 
> 
> Model fit template:
> stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

Create the workflow

workflows::workflow, workflows::add_recipe and workflows::add_model are used.

pls_workflow <- workflows::workflow() %>%
  workflows::add_recipe(pls_recipe) %>%
  workflows::add_model(lm_spec)

pls_workflow
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 5 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> * step_pls()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm

Create the number of components grid

A grid of \(10\) component counts ranging from \(1\) to \(20\) is created.

The number of components grid is created using dials::grid_regular and dials::num_comp

num_comp_grid <- dials::grid_regular(
  x = dials::num_comp(range = c(1, 20)),
  levels = 10
)

num_comp_grid %>%
  reactable::reactable(defaultPageSize = 5)

Partial least square regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the cross validated data with tune::tune_grid. Note that this process may take some time.

doParallel::registerDoParallel()
foreach::getDoParWorkers()
> [1] 3
tune_res <- tune::tune_grid(
  object = pls_workflow,
  resamples = Hitters_fold,
  grid = num_comp_grid
)

tune_res
> # Tuning results
> # 10-fold cross-validation 
> # A tibble: 10 x 4
>    splits           id     .metrics          .notes  
>    <list>           <chr>  <list>            <list>  
>  1 <split [176/20]> Fold01 <tibble [20 x 5]> <tibble>
>  2 <split [176/20]> Fold02 <tibble [20 x 5]> <tibble>
>  3 <split [176/20]> Fold03 <tibble [20 x 5]> <tibble>
>  4 <split [176/20]> Fold04 <tibble [20 x 5]> <tibble>
>  5 <split [176/20]> Fold05 <tibble [20 x 5]> <tibble>
>  6 <split [176/20]> Fold06 <tibble [20 x 5]> <tibble>
>  7 <split [177/19]> Fold07 <tibble [20 x 5]> <tibble>
>  8 <split [177/19]> Fold08 <tibble [20 x 5]> <tibble>
>  9 <split [177/19]> Fold09 <tibble [20 x 5]> <tibble>
> 10 <split [177/19]> Fold10 <tibble [20 x 5]> <tibble>

Here we see how the number of retained components affects the performance metrics using tune::autoplot. Do note that using a different seed will give a different plot

# Note that a different seed will give different plots
tune::autoplot(tune_res)

We can also see the raw metrics that created this chart by calling tune::collect_metrics().

tune::collect_metrics(tune_res) %>%
  reactable::reactable(defaultPageSize = 5)

Here is the ggplot way should tune::autoplot fail

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["num_comp"]],
    y = .data[["mean"]],
    colour = .data[[".metric"]]
  )) +
  ggplot2::geom_errorbar(
    mapping = ggplot2::aes(
      ymin = .data[["mean"]] - .data[["std_err"]],
      ymax = .data[["mean"]] + .data[["std_err"]]
    ),
    alpha = 0.5
  ) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data[[".metric"]]),
    scales = "free",
    nrow = 2
  ) +
  ggplot2::theme(legend.position = "none")

Use tune::show_best to see the top few values for a given metric.

The “best” values can be selected using tune::select_best; this function requires you to specify a metric to select against. The number of components is 1 for the metric rmse since it gives the lowest value. Do note that using a different seed will give a different best number of components.

top_num_comp <- tune::show_best(tune_res, metric = "rmse", n = 5)
top_num_comp %>%
  reactable::reactable(defaultPageSize = 5)
best_num_comp <- tune::select_best(tune_res, metric = "rmse")
best_num_comp %>%
  reactable::reactable(defaultPageSize = 5)

Partial least squares model with optimised number of components

We create the partial least squares regression workflow with the best number of components using tune::finalize_workflow.

pls_final <- tune::finalize_workflow(
  x = pls_workflow,
  parameters = best_num_comp
)

pls_final
> == Workflow =============================================
> Preprocessor: Recipe
> Model: linear_reg()
> 
> -- Preprocessor -----------------------------------------
> 5 Recipe Steps
> 
> * step_novel()
> * step_dummy()
> * step_zv()
> * step_normalize()
> * step_pls()
> 
> -- Model ------------------------------------------------
> Linear Regression Model Specification (regression)
> 
> Computational engine: lm

We now train the partial least squares regression model with the training data using parsnip::fit

pls_final_fit <- parsnip::fit(object = pls_final, data = Hitters_train)

We can see the coefficients and statistics using tune::extract_fit_parsnip, broom::tidy and broom::glance for lm class objects

pls_final_fit %>%
  tune::extract_fit_parsnip() %>%
  broom::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
pls_final_fit %>%
  tune::extract_fit_parsnip() %>%
  broom::glance() %>%
  reactable::reactable(defaultPageSize = 5)

Partial least squares regression model on test data

This partial least squares regression model can now be applied on our testing data set to validate its performance. For regression models, a .pred column is added when parsnip::augment is used.

test_results <- parsnip::augment(x = pls_final_fit, new_data = Hitters_test)

test_results %>%
  reactable::reactable(defaultPageSize = 5)

We check how well the .pred column matches the Salary using yardstick::rsq.

test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)

Alternatively, we can use tune::last_fit and tune::collect_metrics.

test_rs <- tune::last_fit(
  object = pls_final_fit,
  split = Hitters_split
)

test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)

Use tune::collect_predictions to see only the actual and predicted values of the test data.

test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)

Let us take a closer look at the predicted and actual response as a scatter plot.

test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(
    x = .data[["Salary"]],
    y = .data[[".pred"]]
  )) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()

Rmarkdown Template

This Rmarkdown template was created by the Reality Bending Lab. The template can be downloaded from the lab’s github repository. For more information about the motivation behind creating this template, check out Dr. Dominique Makowski’s blog post

Blog References

  • Emil Hvitfeldt’s ISLR tidymodels Labs

  • Julia Silge’s blog titled “LASSO regression using tidymodels and #TidyTuesday data for The Office”

  • Julia Silge’s blog titled “PCA and the #TidyTuesday best hip hop songs ever”

Package References

report::cite_packages(sessionInfo())
---
title: '**Chapter 6 Lab**'
author: "Jeremy Selva"
subtitle: This document was prepared on `r format(Sys.Date())`.
output:
  html_document:
    theme: cerulean
    highlight: pygments
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: true
      smooth_scroll: true
    number_sections: no
    code_folding: show
    code_download: yes
    self_contained: false
    lib_dir: "docs/rmarkdown_libs"
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 2
editor_options:
  chunk_output_type: console
  markdown: 
    wrap: 72
---

<!-- 
!!!! IMPORTANT: run `source("utils/render.R")` to publish instead of clicking on 'Knit'
# See https://yihui.org/knitr/options/
-->

```{r setup, warning=FALSE, message=TRUE, include=FALSE}
# Set up the environment

# Options relative to figure size
# 1.618 is the golden ratio
figheight <- 4
figwidth <- 4 * 1.618 

# General options
options(knitr.kable.NA = "",
        nsmall = 3,
        tidyverse.quiet = TRUE
        )
hook_output <- knitr::knit_hooks$get('output')

knitr::knit_hooks$set(
  output = function(x, options) {
    if (!is.null(options$max.height)) {
      options$attr.output <- c(options$attr.output,
                               sprintf('style="max-height: %s;"', options$max.height))
    }
    hook_output(x, options)
    }
  )

# Chunk options (see https://yihui.org/knitr/options/#chunk_options)
knitr::opts_chunk$set(
  comment = ">",  # The prefix to be added before each line of the text output.
  dpi = 600,
  fig.path = "docs/figures/",
  fig.height = figheight,
  fig.width = figwidth,
  fig.align = "center",
  # See https://community.rstudio.com/t/centering-images-in-blogdown-post/20962
  # to learn how to center images
  # See https://bookdown.org/yihui/rmarkdown-cookbook/opts-tidy.html
  # See https://www.zotero.org/styles for citation style respository
  tidy='styler',
  tidy.opts=list(strict=TRUE)
)

htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clipboard\"></i> Copy Code",
    success_text = "<i class=\"fa fa-check\" style=\"color: #90BE6D\"></i> Copied!",
  ),
  rmarkdown::html_dependency_font_awesome()
)
```

```{r warning=FALSE, message=FALSE, results='asis', class.source = 'fold-hide'}
# Rmarkdown stuff
library(rmarkdown, quietly=TRUE)
library(knitr, quietly=TRUE)
library(htmltools, quietly=TRUE)
library(styler, quietly=TRUE)
library(xaringanExtra, quietly=TRUE)

# Dataset
library(ISLR2, quietly=TRUE)

# Session info and package reporting
library(report, quietly=TRUE)

# Data wrangling

library(dplyr, quietly=TRUE)
library(tidyr, quietly=TRUE)
library(magrittr, quietly=TRUE)

# Create interactive tables
library(reactable, quietly=TRUE)

# For plotting

library(forcats, quietly=TRUE)
library(scales, quietly=TRUE)
library(tidytext, quietly=TRUE)
library(ggplot2, quietly=TRUE)

# For parallel processing for tuning
library(foreach, quietly=TRUE)
library(doParallel, quietly=TRUE)

# For applying tidymodels

library(broom, quietly=TRUE)
library(rsample, quietly=TRUE)
library(parsnip, quietly=TRUE)
library(recipes, quietly=TRUE)
library(dials, quietly=TRUE)
library(tune, quietly=TRUE)
library(workflows, quietly=TRUE)
library(yardstick, quietly=TRUE)

# For subset selection
library(leaps, quietly=TRUE)

# For ridge regression and lasso
library(glmnet, quietly=TRUE)

# For partial least square regression
# In recipes::step_pls
# Warning message:
#  `step_pls()` failed: Error in loadNamespace(x) : there is no package called ‘mixOmics’
# Install from Bioconductor
library(mixOmics, quietly=TRUE)

# For variable importance
library(vip, quietly=TRUE)

summary(report::report(sessionInfo()))
```

# Subset Selection Methods

## Best Subset Selection

```{r, echo=FALSE}
set.seed(1234)
```

We will be using the [`Hitters`](https://rdrr.io/cran/ISLR2/man/Hitters.html) data set from the `ISLR2` package. We wish
to predict the baseball players `Salary` based on several different
characteristics which are included in the data set.

Remove all rows with missing data from that column.

```{r, message=FALSE}

Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters %>%
  reactable::reactable(defaultPageSize = 5,
                       filterable = TRUE)
```

Use the `leaps::regsubsets` function to perform best subset selection.

An asterisk indicates that a given variable is included in the
corresponding model. For instance, this output indicates that the best
two-variable model contains only Hits and CRBI.

```{r, message = FALSE, max.height = '150px'}

regfit.full <- leaps::regsubsets(
  x = Salary ~ ., 
  data = Hitters,
  nvmax = 8 #default
  )

summary(regfit.full)
```

Here we fit up to a 19-variable model.

```{r, message=FALSE}

regfit.full <- leaps::regsubsets(
  x = Salary ~ ., 
  data = Hitters,
  nvmax = 19
  )
```

The `summary` function also returns $R^2$, $RSS$, adjusted $R^2$, $C_p$,
and $BIC$. We can examine these to try to select the best overall model.

```{r, message=FALSE}

reg.summary <- regfit.full %>%
  summary()

names(reg.summary)
```

For instance, we see that the $R^2$ statistic increases from 32 %, when
only one variable is included in the model, to almost 55 %, when all
variables are included. As expected, the $R^2$ statistic increases
monotonically as more variables are included.

```{r, message=FALSE}

reg.summary$rsq
```
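
For example (a small sketch, not in the original lab), we can ask which model
size maximises the adjusted $R^2$ and which minimises $C_p$ and $BIC$:

```{r, message=FALSE}
# Model size (number of variables) picked by each criterion
which.max(reg.summary$adjr2)
which.min(reg.summary$cp)
which.min(reg.summary$bic)
```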

The `leaps::regsubsets` function has a built-in `plot` command which can
be used to display the selected variables for the best model with a
given number of predictors, ranked according to the $BIC$, $C_p$,
adjusted $R^2$, or $AIC$. To find out more about this function, type
`?plot.regsubsets`.

The top row of each plot contains a black square for each variable
selected according to the optimal model associated with that statistic.
For instance, we see that several models share a $BIC$ close to $−150$.
However, the model with the lowest $BIC$ is the six-variable model that
contains only `AtBat`, `Hits`, `Walks`, `CRBI`, `DivisionW`, and
`PutOuts`.

```{r, message=FALSE}
plot(regfit.full , scale = "bic")
```
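
The same display can be drawn for the other statistics (a quick sketch; valid
`scale` values are `"bic"`, `"Cp"`, `"adjr2"` and `"r2"`), for example the
adjusted $R^2$:

```{r, message=FALSE}
plot(regfit.full, scale = "adjr2")
```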

We can use the `coef` function to see the coefficient estimates
associated with this model.

```{r, message=FALSE}
coef(regfit.full , 6)
```

## Forward and Backward Stepwise Selection

We can also use the `leaps::regsubsets` function to perform forward
stepwise selection using the argument `method = "forward"`

```{r, message=FALSE, max.height='150px'}
regfit.fwd <- leaps::regsubsets(
  x = Salary ~ ., 
  data = Hitters ,
  nvmax = 19, 
  method = "forward"
)

summary(regfit.fwd)
```

To perform backward stepwise selection, use the argument
`method = "backward"`

```{r, message=FALSE, max.height='150px'}
regfit.bwd <- leaps::regsubsets(
  x = Salary ~ ., 
  data = Hitters ,
  nvmax = 19, 
  method = "backward"
)

summary(regfit.bwd)
```

For this data, the best seven-variable models identified by forward
stepwise selection, backward stepwise selection, and best subset
selection are different.

```{r, message=FALSE}

coef(regfit.full,7)
coef(regfit.fwd,7)
coef(regfit.bwd,7)

```

## Choosing Among Models Using the Validation-Set Approach

We split the samples into training set and a test set

```{r, message=FALSE}

set.seed(1234)

Hitters_split <- Hitters %>%
  dplyr::mutate(
    isTrainingSet = sample(x = c(TRUE , FALSE),
                           size = nrow(Hitters),
                           replace = TRUE)
  ) %>%
  dplyr::relocate(.data[["isTrainingSet"]])

full_train <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == TRUE) %>%
  dplyr::select(-c("isTrainingSet"))

test <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]]== FALSE) %>%
  dplyr::select(-c("isTrainingSet"))

train_split <- full_train %>%
  dplyr::mutate(
    isValidationSet = sample(x = c(TRUE , FALSE),
                           size = nrow(full_train),
                           replace = TRUE)
  ) %>%
  dplyr::relocate(.data[["isValidationSet"]])

validate <- train_split %>%
  dplyr::filter(.data[["isValidationSet"]] == TRUE) %>%
  dplyr::select(-c("isValidationSet"))

train <- train_split %>%
  dplyr::filter(.data[["isValidationSet"]] == FALSE) %>%
  dplyr::select(-c("isValidationSet"))


```
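
As a quick sanity check (a sketch, not part of the original lab), here are the
sizes of the resulting sets:

```{r, message=FALSE}
# Number of observations in each set
nrow(train)
nrow(validate)
nrow(test)
```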

Now, we apply `leaps::regsubsets` to the training set in order to
perform best subset selection and compute the validation set error
for the best model of each model size.

```{r, message=FALSE}

regfit.best <- leaps::regsubsets(Salary ~ ., data = train, nvmax = 19)

validate.mat <- stats::model.matrix(Salary ~ ., data = validate)

val.errors <- rep(NA, 19)

for (i in 1:19) {
  coefi <- stats::coef(regfit.best , id = i)
  pred <- validate.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean (( validate$Salary - pred)^2)
}

val.errors
```
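
A quick plot of these validation errors (a sketch mirroring the
cross-validation plot used later) makes the best model size easier to spot:

```{r, message=FALSE}
plot(val.errors, type = "b")
```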

We find that the best model is the one that contains `r which.min(val.errors)` variables.

```{r, message=FALSE}
which.min(val.errors)
regfit.best <- leaps::regsubsets(Salary ~ ., data = full_train, nvmax = 19)

coef(regfit.best , which.min(val.errors))
```

We now create a predict function

```{r, message=FALSE}

# A predict method for leaps::regsubsets
predict.regsubsets <- function(object , newdata , id, ...) {
  form <- as.formula(object$call [[2]])
  mat <- model.matrix(form , newdata)
  coefi <- coef(object , id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
```

And apply it on the test.

```{r, message=FALSE}
pred_train <- stats::predict(regfit.best, full_train, id = which.min(val.errors))
pred_test <- stats::predict(regfit.best, test, id = which.min(val.errors))
```

Here is the $MSE$ for train and test data

```{r, message=FALSE}
mean(( full_train$Salary - pred_train )^2)
mean(( test$Salary - pred_test )^2)
```

## Choosing Among Models Using Cross-Validation

We now try to choose among the models of different sizes using cross
validation. Create a vector that allocates each observation to one of k
= 10 folds.

```{r, message=FALSE}
set.seed(1)

Hitters_split <- Hitters %>%
  dplyr::mutate(
    isTrainingSet = sample(x = c(TRUE , FALSE),
                           size = nrow(Hitters),
                           replace = TRUE)
  ) %>%
  dplyr::relocate(.data[["isTrainingSet"]])

train <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]] == TRUE) %>%
  dplyr::select(-c("isTrainingSet"))

test <- Hitters_split %>%
  dplyr::filter(.data[["isTrainingSet"]]== FALSE) %>%
  dplyr::select(-c("isTrainingSet"))

k <- 10
n <- nrow(train)
folds <- sample(rep (1:k, length = n))

```
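
As a quick check (a sketch), we can count how many training observations fall
into each fold:

```{r, message=FALSE}
table(folds)
```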

We write a for loop that performs cross-validation

```{r, message=FALSE}

# A predict method for leaps::regsubsets
predict.regsubsets <- function(object , newdata , id, ...) {
  form <- as.formula(object$call [[2]])
  mat <- model.matrix(form , newdata)
  coefi <- coef(object , id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
  }

cv.errors <- matrix(NA, k, 19,
                    dimnames = list(NULL , paste (1:19)))
for (j in 1:k) {
  # For each cross validation fold data,
  # get the best i variable model using subset selection
  best.fit <- leaps::regsubsets(Salary ~ .,
                                data = train[folds != j,],
                                nvmax = 19)
  
  for (i in 1:19) {
    # Compute the cross validation error for each
    # best i variable model
    pred <- stats::predict(best.fit, train[folds == j, ], id = i)
    cv.errors[j, i] <- mean(( train$Salary[folds == j] - pred)^2)
  }
  }

```

This has given us a $10$ by $19$ matrix, of which the $(j, i)$th element
corresponds to the cross validation $MSE$ for the $j$th cross-validation fold for
the best $i$-variable model.

We use the `apply` function to average over the columns of this
matrix in order to obtain a vector for which the $i$th element is the
cross-validation error for the $i$-variable model.

```{r, message=FALSE}

mean.cv.errors <- apply(cv.errors , 2, mean)

plot(mean.cv.errors , type = "b")

```

We see that cross-validation selects a `r which.min(mean.cv.errors)`-variable model. So we retrain the model

```{r, message=FALSE}
regfit.best <- leaps::regsubsets(Salary ~ ., data = train , nvmax = 19)
coef(regfit.best , which.min(mean.cv.errors))
```

And apply it on the test.

```{r, message=FALSE}
pred_train <- stats::predict(regfit.best, train, id = which.min(mean.cv.errors))
pred_test <- stats::predict(regfit.best, test, id = which.min(mean.cv.errors))
```

Here is the $MSE$ for train and test data

```{r, message=FALSE}
mean(( train$Salary - pred_train )^2)
mean(( test$Salary - pred_test )^2)
```

# Ridge Regression

## Introduction to ridge regression

Use
[`parsnip::linear_reg`](https://parsnip.tidymodels.org/reference/linear_reg.html),
[`parsnip::set_mode`](https://parsnip.tidymodels.org/reference/set_args.html)
and
[`parsnip::set_engine`](https://parsnip.tidymodels.org/reference/set_engine.html)
to create the model.

Set mixture to 0 to indicate Ridge Regression. For now, we set
penalty/lambda as 0.

Use
[`parsnip::translate`](https://parsnip.tidymodels.org/reference/translate.html)
to better understand the model created.

```{r, message=FALSE}
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

ridge_spec <- parsnip::linear_reg(mixture = 0,
                                  penalty = 0) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("glmnet")

ridge_spec %>%
  parsnip::translate()
```

Once the specification is created we can fit it to our data using
[`parsnip::fit`](https://parsnip.tidymodels.org/reference/fit.html). We
will use all the predictors.

```{r, message=FALSE}
ridge_fit <- ridge_spec %>%
  parsnip::fit(formula = Salary ~ ., 
               data = Hitters)

```

We can see the parameter estimate for different values of penalty/lambda
with
[`parsnip::tidy`](https://parsnip.tidymodels.org/reference/tidy.model_fit.html).
Notice how the estimates are decreasing when the amount of penalty goes
up

```{r, message=FALSE, max.height='150px'}
parsnip::tidy(ridge_fit) %>%
  reactable::reactable(defaultPageSize = 5)
parsnip::tidy(ridge_fit, penalty = 705) %>%
  reactable::reactable(defaultPageSize = 5)
parsnip::tidy(ridge_fit, penalty = 11498) %>%
  reactable::reactable(defaultPageSize = 5)
```

Using
[`parsnip::extract_fit_engine`](https://parsnip.tidymodels.org/reference/extract-parsnip.html)
we can visualize how the magnitude of the coefficients are being
regularized towards zero as the penalty goes up.

```{r, message=FALSE}
# Arrange figure margin
par(mar=c(5,5,5,1))
ridge_fit %>%
  parsnip::extract_fit_engine() %>%
  plot(xvar = "lambda")
```

It would be nice if we could find the "best" value of the
penalty/lambda. We can do this using the
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html)

We need three things in order to use
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html):

-   a `resample` object containing the resamples the `workflow` should
    be fitted within,

-   a `workflow` object containing the model and preprocessor, and

-   a tibble `penalty_grid` containing the parameter values to be
    evaluated.

## Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with [`rsample::initial_split`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::training`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::testing`](https://rsample.tidymodels.org/reference/initial_split.html)
and
[`rsample::vfold_cv`](https://rsample.tidymodels.org/reference/vfold_cv.html).

```{r, message=FALSE}
set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

```

## Create the preprocessor

We use the `recipes` package to create the preprocessing steps. However,
the order of the preprocessing steps is actually important. See the
[Ordering of steps
vignette](https://cran.r-project.org/web/packages/recipes/vignettes/Ordering.html)

We create a basic recipe with
[`recipes::recipe`](https://recipes.tidymodels.org/reference/recipe.html).

Apply
[`recipes::prep`](https://recipes.tidymodels.org/reference/prep.html)
and
[`recipes::bake`](https://recipes.tidymodels.org/reference/bake.html) to
compute the preprocessing step.

```{r, message=FALSE}

ridge_recipe <- recipes::recipe(formula = Salary ~ .,
                                data = Hitters_train)

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)
  
rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)

rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

```

Do preprocessing on the factor variables selected with
[`recipes::all_nominal_predictors`](https://recipes.tidymodels.org/reference/has_role.html).
[`recipes::step_novel`](https://recipes.tidymodels.org/reference/step_novel.html)
and
[`recipes::step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html)
were used.

```{r, message=FALSE}

ridge_recipe <- ridge_recipe %>%
  # Step Novel is important.
  # If test set has a new factor not found in training, it will be treated as "new" and not NA
  # This will prevent the model from giving an error
  # https://blog.datascienceheroes.com/how-to-use-recipes-package-for-one-hot-encoding/
  recipes::step_novel(recipes::all_nominal_predictors()) %>%
  # Create Dummy variables
  recipes::step_dummy(recipes::all_nominal_predictors())

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)
  
rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)

rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

```

Do the normalisation step on all predictors selected with
[`recipes::all_predictors()`](https://recipes.tidymodels.org/reference/has_role.html).
[`recipes::step_zv`](https://recipes.tidymodels.org/reference/step_zv.html)
and
[`recipes::step_normalize`](https://recipes.tidymodels.org/reference/step_normalize.html)
were used.

```{r, message=FALSE}

ridge_recipe <- ridge_recipe %>%
  # Remove predictors with zero variation
  recipes::step_zv(recipes::all_predictors()) %>%
  # Standardise all variable
  recipes::step_normalize(recipes::all_predictors())

rec <- recipes::prep(x = ridge_recipe, training = Hitters_train)
  
rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)


```

## Specify the model

The model specification will look very similar to what we have seen
earlier, but we will set `penalty = tune::tune()`. This tells
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html)
that the penalty parameter should be tuned using
[`tune::tune`](https://hardhat.tidymodels.org/reference/tune.html).

```{r, message=FALSE}

ridge_spec <- 
  parsnip::linear_reg(penalty = tune::tune(), mixture = 0) %>% 
  parsnip::set_mode("regression") %>% 
  parsnip::set_engine("glmnet")

ridge_spec %>%
  parsnip::translate()
```

## Create the workflow

[`workflows::workflow`](https://workflows.tidymodels.org/reference/workflow.html),
[`workflows::add_recipe`](https://workflows.tidymodels.org/reference/add_recipe.html)
and
[`workflows::add_model`](https://workflows.tidymodels.org/reference/add_model.html)
are used.

```{r, message=FALSE, max.height='150px'}

ridge_workflow <-  workflows::workflow() %>% 
  workflows::add_recipe(ridge_recipe) %>% 
  workflows::add_model(ridge_spec)

ridge_workflow
```

## Create the penalty/lambda grid

A penalty/lambda grid of $50$ numbers from $0.00001$ ($10^{-5}$) to
$100000$ ($10^{5}$) is created.

Regular grid is created using
[`dials::grid_regular`](https://dials.tidymodels.org/reference/grid_regular.html),
[`dials::penalty`](https://dials.tidymodels.org/reference/penalty.html)
and
[`scales::log10_trans`](https://scales.r-lib.org/reference/log_trans.html)

```{r, message=FALSE}

penalty_grid <- dials::grid_regular(x = dials::penalty(range = c(-5, 5),
                                                       trans = scales::log10_trans()
                                                       ), 
                                    levels = 50)

penalty_grid  %>% 
  reactable::reactable(defaultPageSize = 5)
```
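
As a quick check (a sketch), the endpoints of the grid confirm the range on the
original penalty scale:

```{r, message=FALSE}
range(penalty_grid$penalty)
```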

## Ridge regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the
cross validated data with
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html).
Note that this process may take some time.

```{r, message=FALSE}
doParallel::registerDoParallel()
foreach::getDoParWorkers()
```

```{r, message=FALSE, max.height='150px'}
tune_res <- tune::tune_grid(
  object = ridge_workflow,
  resamples = Hitters_fold, 
  grid = penalty_grid
)

tune_res
```

Here we see that the amount of regularization affects the performance
metrics differently using
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html).
Do note that using a different seed will give a different plot

```{r, message=FALSE}
# Note that a different seed will give different plots
tune::autoplot(tune_res)
```

We can also see the raw metrics that created this chart by calling
[`tune::collect_metrics()`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE, max.height='150px'}

tune::collect_metrics(tune_res) %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is the `ggplot` way should
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html)
fails

```{r, message=FALSE, max.height='150px'}

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["penalty"]],
                                         y = .data[["mean"]],
                                         colour = .data[[".metric"]])) +
  ggplot2::geom_errorbar(mapping = ggplot2::aes(ymin = .data[["mean"]] - .data[["std_err"]],
                                                ymax = .data[["mean"]] + .data[["std_err"]]),
                         alpha = 0.5) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[[".metric"]]), 
                      scales = "free", 
                      nrow = 2) +
  ggplot2::scale_x_log10() +
  ggplot2::theme(legend.position = "none")
```

Use
[`tune::show_best`](https://tune.tidymodels.org/reference/show_best.html)
to see the top few values for a given metric.

The "best" values can be selected using
[`tune::select_best`](https://tune.tidymodels.org/reference/show_best.html);
this function requires you to specify a metric that it should select
against. The penalty/lambda value is 569 for metric `rsq` since it gives
the highest value. Do note that using a different seed will give a
different best penalty/lambda value.

```{r, message=FALSE, max.height='150px'}

top_penalty <- tune::show_best(tune_res, metric = "rsq", n = 5)
top_penalty %>%
  reactable::reactable(defaultPageSize = 5)

best_penalty <- tune::select_best(tune_res, metric = "rsq")
best_penalty %>%
  reactable::reactable(defaultPageSize = 5)

```

## Ridge regression model with optimised penalty/lambda value

We create the ridge regression workflow with the best penalty score
using
[`tune::finalize_workflow`](https://tune.tidymodels.org/reference/finalize_model.html).

```{r, message=FALSE, max.height='150px'}

ridge_final <- tune::finalize_workflow(x = ridge_workflow, 
                                       parameters = best_penalty)

ridge_final
```

We now train the ridge regression model with the training data using
[`parsnip::fit`](https://parsnip.tidymodels.org/reference/fit.html)

```{r, message=FALSE}

ridge_final_fit <- parsnip::fit(object = ridge_final, data = Hitters_train)
```

We can see the coefficients using
[`tune::extract_fit_parsnip`](https://tune.tidymodels.org/reference/extract-tune.html) and [`parsnip::tidy`](https://parsnip.tidymodels.org/reference/tidy.model_fit.html)

```{r, message=FALSE}

ridge_final_fit %>%
  tune::extract_fit_parsnip() %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

## Variable Importance

While we're at it, let's see what the most important variables are using
the `vip` package and
[`workflows::extract_fit_parsnip`](https://workflows.tidymodels.org/reference/extract-workflow.html).

```{r, message=FALSE}

vip_table <- ridge_final_fit %>%
  workflows::extract_fit_parsnip() %>%
  vip::vi(lambda = best_penalty$penalty) 

vip_table %>%
  reactable::reactable(defaultPageSize = 5)
```

```{r, message=FALSE}
vip_table %>% 
  dplyr::mutate(
    Importance = abs(.data[["Importance"]]),
    Variable = forcats::fct_reorder(.f = .data[["Variable"]], 
                                    .x = .data[["Importance"]])
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Importance"]], 
                                         y = .data[["Variable"]], 
                                         fill = .data[["Sign"]])) +
  ggplot2::geom_col() +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::labs(y = NULL)
```

## Ridge regression model on test data

This ridge regression model can now be applied on our testing data set
to validate its performance. For regression models, a `.pred` column is
added when
[`parsnip::augment`](https://parsnip.tidymodels.org/reference/augment.html)
is used.

```{r, message=FALSE}

test_results <- parsnip::augment(x = ridge_final_fit, new_data = Hitters_test)
  
test_results %>%
  reactable::reactable(defaultPageSize = 5)
  
```

We check how well the `.pred` column matches the Salary using
[`yardstick::rsq`](https://yardstick.tidymodels.org/reference/rsq.html).

```{r, message=FALSE}
test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```
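
If more than one metric is of interest, a metric set can be used instead (a
sketch; the helper `reg_metrics` is only for this illustration and assumes the
standard `yardstick` regression metrics are appropriate here):

```{r, message=FALSE}
# Compute RMSE, R squared and MAE on the test predictions at once
reg_metrics <- yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::mae)

test_results %>%
  reg_metrics(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```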

Alternatively, we can use
[`tune::last_fit`](https://tune.tidymodels.org/reference/last_fit.html)
and
[`tune::collect_metrics`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE}

test_rs <- tune::last_fit(object = ridge_final_fit, 
                          split = Hitters_split)
  
test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)
  
```

Use [`tune::collect_predictions`](https://tune.tidymodels.org/reference/collect_predictions.html) to see only the actual and predicted values of the test data.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)
```

Let us take a closer look at the predicted and actual response as a scatter plot.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Salary"]],
                                         y = .data[[".pred"]]
                                         )
                  ) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()
```

# The Lasso

The following procedure will be very similar to what we saw in the ridge
regression section. The resampling step is the same.

## Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with [`rsample::initial_split`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::training`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::testing`](https://rsample.tidymodels.org/reference/initial_split.html)
and
[`rsample::vfold_cv`](https://rsample.tidymodels.org/reference/vfold_cv.html).

```{r, message=FALSE}
set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

```

## Create the preprocessor

The following procedure will be very similar to what we saw in the ridge
regression section. The preprocessing steps needed are the same using
[`recipes::recipe`](https://recipes.tidymodels.org/reference/recipe.html),
[`recipes::all_nominal_predictors`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::all_predictors()`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::step_novel`](https://recipes.tidymodels.org/reference/step_novel.html),
[`recipes::step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html),
[`recipes::step_zv`](https://recipes.tidymodels.org/reference/step_zv.html)
and
[`recipes::step_normalize`](https://recipes.tidymodels.org/reference/step_normalize.html).

```{r, message=FALSE}

lasso_recipe <- 
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  recipes::step_novel(recipes::all_nominal_predictors()) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_predictors())
```

## Specify the model

Again we set `penalty = tune::tune()`. This tells
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html)
that the penalty parameter should be tuned using
[`tune::tune`](https://hardhat.tidymodels.org/reference/tune.html).

This time, it is `mixture = 1` in
[`parsnip::linear_reg`](https://parsnip.tidymodels.org/reference/linear_reg.html)

```{r, message=FALSE}

lasso_spec <- 
  parsnip::linear_reg(penalty = tune::tune(), mixture = 1) %>% 
  parsnip::set_mode("regression") %>% 
  parsnip::set_engine("glmnet")

lasso_spec %>%
  parsnip::translate()

```

## Create the workflow

[`workflows::workflow`](https://workflows.tidymodels.org/reference/workflow.html),
[`workflows::add_recipe`](https://workflows.tidymodels.org/reference/add_recipe.html)
and
[`workflows::add_model`](https://workflows.tidymodels.org/reference/add_model.html)
are used.

```{r, message=FALSE, max.height='150px'}

lasso_workflow <-  workflows::workflow() %>% 
  workflows::add_recipe(lasso_recipe) %>% 
  workflows::add_model(lasso_spec)

lasso_workflow
```

## Create the penalty/lambda grid

A penalty/lambda grid of $50$ numbers from $0.01$ ($10^{-2}$) to $100$
($10^2$) is created.

Regular grid is created using
[`dials::grid_regular`](https://dials.tidymodels.org/reference/grid_regular.html),
[`dials::penalty`](https://dials.tidymodels.org/reference/penalty.html)
and
[`scales::log10_trans`](https://scales.r-lib.org/reference/log_trans.html)

```{r, message=FALSE}

penalty_grid <- dials::grid_regular(x = dials::penalty(range = c(-2, 2),
                                                       trans = scales::log10_trans()),  
                                    levels = 50)

penalty_grid  %>% 
  reactable::reactable(defaultPageSize = 5)
```

## Lasso model fitting on cross validated data

Now we have everything we need and we can fit all the models on the
cross validated data with
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html).
Note that this process may take some time.

```{r, message=FALSE}
doParallel::registerDoParallel()
foreach::getDoParWorkers()
```

```{r, message=FALSE, max.height='150px'}
tune_res <- tune::tune_grid(
  object = lasso_workflow,
  resamples = Hitters_fold, 
  grid = penalty_grid
)

tune_res
```

Here we see that the amount of regularization affects the performance
metrics differently using
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html).
Do note that using a different seed will give a different plot

```{r, message=FALSE}
# Note that a different seed will give different plots
tune::autoplot(tune_res)
```

We can also see the raw metrics that created this chart by calling
[`tune::collect_metrics()`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE, max.height='150px'}

tune::collect_metrics(tune_res) %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is the `ggplot` way should
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html)
fails

```{r, message=FALSE, max.height='150px'}

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["penalty"]],
                                         y = .data[["mean"]],
                                         colour = .data[[".metric"]])) +
  ggplot2::geom_errorbar(mapping = ggplot2::aes(ymin = .data[["mean"]] - .data[["std_err"]],
                                                ymax = .data[["mean"]] + .data[["std_err"]]),
                         alpha = 0.5) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[[".metric"]]), 
                      scales = "free", 
                      nrow = 2) +
  ggplot2::scale_x_log10() +
  ggplot2::theme(legend.position = "none")
```

Use
[`tune::show_best`](https://tune.tidymodels.org/reference/show_best.html)
to see the top few values for a given metric.

The "best" values can be selected using
[`tune::select_best`](https://tune.tidymodels.org/reference/show_best.html);
this function requires you to specify a metric that it should select
against. The penalty/lambda value is 22.2 for metric `rsq` since it
gives the highest value. Do note that using a different seed will give a
different best penalty/lambda value.

```{r, message=FALSE, max.height='150px'}

top_penalty <- tune::show_best(tune_res, metric = "rsq", n = 5)
top_penalty %>%
  reactable::reactable(defaultPageSize = 5)

best_penalty <- tune::select_best(tune_res, metric = "rsq")
best_penalty %>%
  reactable::reactable(defaultPageSize = 5)

```

## Lasso model with optimised penalty/lambda value

We create the lasso regression workflow with the best penalty score
using
[`tune::finalize_workflow`](https://tune.tidymodels.org/reference/finalize_model.html).

```{r, message=FALSE, max.height='150px'}

lasso_final <- tune::finalize_workflow(x = lasso_workflow, 
                                       parameters = best_penalty)

lasso_final
```

We now train the lasso regression model with the training data using
[`parsnip::fit`](https://parsnip.tidymodels.org/reference/fit.html)

```{r, message=FALSE}

lasso_final_fit <- parsnip::fit(object = lasso_final, data = Hitters_train)
```

We can see the coefficients using
[`tune::extract_fit_parsnip`](https://tune.tidymodels.org/reference/extract-tune.html) and [`parsnip::tidy`](https://parsnip.tidymodels.org/reference/tidy.model_fit.html)

```{r, message=FALSE}

lasso_final_fit %>%
  tune::extract_fit_parsnip() %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

## Variable Importance

While we're at it, let's see what the most important variables are using
the `vip` package and
[`workflows::extract_fit_parsnip`](https://workflows.tidymodels.org/reference/extract-workflow.html).

```{r, message=FALSE}

vip_table <- lasso_final_fit %>%
  workflows::extract_fit_parsnip() %>%
  vip::vi(lambda = best_penalty$penalty) 

vip_table %>%
  reactable::reactable(defaultPageSize = 5)
```

```{r, message=FALSE}
vip_table %>% 
  dplyr::mutate(
    Importance = abs(.data[["Importance"]]),
    Variable = forcats::fct_reorder(.f = .data[["Variable"]], 
                                    .x = .data[["Importance"]])
  ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Importance"]], 
                                         y = .data[["Variable"]], 
                                         fill = .data[["Sign"]])) +
  ggplot2::geom_col() +
  ggplot2::scale_x_continuous(expand = c(0, 0)) +
  ggplot2::labs(y = NULL)
```

## Lasso regression model on test data

This lasso regression model can now be applied on our testing data set
to validate its performance. For regression models, a .pred column is
added when
[`parsnip::augment`](https://parsnip.tidymodels.org/reference/augment.html)
is used.

```{r, message=FALSE}

test_results <- parsnip::augment(x = lasso_final_fit, new_data = Hitters_test)
  
test_results %>%
  reactable::reactable(defaultPageSize = 5)
  
```

We check how well the `.pred` column matches the Salary using
[`yardstick::rsq`](https://yardstick.tidymodels.org/reference/rsq.html).

```{r, message=FALSE}
test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```

Alternatively, we can use
[`tune::last_fit`](https://tune.tidymodels.org/reference/last_fit.html)
and
[`tune::collect_metrics`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE}

test_rs <- tune::last_fit(object = lasso_final_fit, 
                          split = Hitters_split)
  
test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)
  
```

Use [`tune::collect_predictions`](https://tune.tidymodels.org/reference/collect_predictions.html) to see only the actual and predicted values of the test data.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)
```

Let us take a closer look at the predicted and actual response as a scatter plot.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Salary"]],
                                         y = .data[[".pred"]]
                                         )
                  ) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()
```

# Principal Components Regression

Principal component regression is a linear model fitted to
PCA-transformed data. Hence, the major changes will be on the preprocessing
steps.

## Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with [`rsample::initial_split`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::training`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::testing`](https://rsample.tidymodels.org/reference/initial_split.html)
and
[`rsample::vfold_cv`](https://rsample.tidymodels.org/reference/vfold_cv.html).

```{r, message=FALSE}
set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

```

## Create the preprocessor

We create a recipe with
[`recipes::recipe`](https://recipes.tidymodels.org/reference/recipe.html),
[`recipes::all_nominal_predictors`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::all_predictors()`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::step_novel`](https://recipes.tidymodels.org/reference/step_novel.html),
[`recipes::step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html),
[`recipes::step_zv`](https://recipes.tidymodels.org/reference/step_zv.html)
and
[`recipes::step_normalize`](https://recipes.tidymodels.org/reference/step_normalize.html).

We add
[`recipes::step_pca`](https://recipes.tidymodels.org/reference/step_pca.html)
this time to perform principal component analysis on all the predictors.

```{r, message=FALSE}

pca_recipe <- 
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  recipes::step_novel(recipes::all_nominal_predictors()) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pca(recipes::all_predictors(), id = "pca")
```

Apply
[`recipes::prep`](https://recipes.tidymodels.org/reference/prep.html)
and
[`recipes::bake`](https://recipes.tidymodels.org/reference/bake.html) to
compute the preprocessing step.

```{r, message=FALSE}

rec <- recipes::prep(x = pca_recipe, training = Hitters_train)
  
rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)

rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

```

## PCA exploration

We can explore the results of the PCA using [`recipes::prep`](https://recipes.tidymodels.org/reference/prep.html) and
[`parsnip::tidy`](https://parsnip.tidymodels.org/reference/tidy.model_fit.html). We can see that PCA is done at step number 5.

```{r, message=FALSE}

recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

As such, we extract the results of the PCA step (step number 5) using its `id`.

```{r, message=FALSE}

tidied_pca_loadings <- recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  parsnip::tidy(id = "pca", type = "coef")


tidied_pca_loadings %>%
  reactable::reactable(defaultPageSize = 5)

```

We make a visualization to see what the first four components look like

```{r, message=FALSE, fig.height = 8}

tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["value"]], 
                                         y = .data[["terms"]], 
                                         fill = .data[["terms"]])) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)

```

Let us take a closer look at the top 6 variables that contribute to the
first four components

```{r, message=FALSE}

tidied_pca_loadings %>%
  dplyr::filter(.data[["component"]] %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(6, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(x = .data[["terms"]], 
                                     by = abs(.data[["value"]]), 
                                     within = .data[["component"]])
    ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = abs(.data[["value"]]), 
                                         y = .data[["terms"]], 
                                         fill = .data[["value"]] > 0)
                  ) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]]), 
                      scales = "free_y") +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PCA component",
    y = NULL, 
    fill = "Positive?"
  )

```

How much variation are we capturing for each component?

Here is how we can see the variance statistics, in long form.

```{r, message=FALSE}
tidied_pca_variance <- recipes::prep(x = pca_recipe, training = Hitters_train) %>%
  broom::tidy(id = "pca", type = "variance")

tidied_pca_variance %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is how we can see the variance statistics in its wide form

```{r, message=FALSE}
tidied_pca_variance %>% 
  tidyr::pivot_wider(names_from = .data[["component"]], 
                     values_from = .data[["value"]], 
                     names_prefix = "PC") %>% 
  dplyr::select(-dplyr::all_of("id")) %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is a simple plot to show the variance captured for each component

```{r, message=FALSE}
# Get the variance
percent_variation <- tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "percent variance") %>% 
  dplyr::pull(.data[["value"]])

percent_variation <- percent_variation/100

# I use [1:4] to select the first four components
dplyr::tibble(component = unique(tidied_pca_loadings$component)[1:4], 
              percent_var = percent_variation[1:4]) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["component"]], 
                                         y = .data[["percent_var"]])) +
  ggplot2::geom_col() +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::labs(x = NULL, 
                y = "Percent variance explained by each PCA component")

```

The `threshold` argument is the fraction of the total variance that should be covered
by the retained components. For example, `threshold = .75` means that `step_pca`
should generate enough components to capture 75 percent of the
variability in the predictors. Note: using this argument will override
and reset any value given to `num_comp`.
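
To make the relationship between `threshold` and the number of retained components concrete, here is a minimal sketch that reuses the `tidied_pca_variance` tibble from above and counts how many components would be needed to reach 75 percent of the variance (the exact count depends on the data and the seed):

```{r, message=FALSE}
# Cumulative percent variance per component from the tidied variance table
cum_var <- tidied_pca_variance %>%
  dplyr::filter(.data[["terms"]] == "cumulative percent variance") %>%
  dplyr::pull(.data[["value"]])

# First component index at which the cumulative percent variance reaches 75;
# this is roughly the number of components step_pca keeps for threshold = .75
which(cum_var >= 75)[1]
```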

We will now try to find the best threshold value using
[`tune::tune()`](https://hardhat.tidymodels.org/reference/tune.html).

```{r, message=FALSE}

pca_recipe <- 
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  recipes::step_novel(recipes::all_nominal_predictors()) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pca(recipes::all_predictors(), threshold = tune::tune())
```

## Specify the model

We use a linear model

```{r, message=FALSE}

lm_spec <- 
  parsnip::linear_reg() %>% 
  parsnip::set_mode("regression") %>% 
  parsnip::set_engine("lm")

lm_spec %>%
  parsnip::translate()

```

## Create the workflow

[`workflows::workflow`](https://workflows.tidymodels.org/reference/workflow.html),
[`workflows::add_recipe`](https://workflows.tidymodels.org/reference/add_recipe.html)
and
[`workflows::add_model`](https://workflows.tidymodels.org/reference/add_model.html)
are used.

```{r, message=FALSE, max.height='150px'}

pca_workflow <-  workflows::workflow() %>% 
  workflows::add_recipe(pca_recipe) %>% 
  workflows::add_model(lm_spec)

pca_workflow
```

## Create the threshold grid

A threshold grid of $10$ numbers from $0$ to $1$ is created.

The threshold grid is created using
[`dials::grid_regular`](https://dials.tidymodels.org/reference/grid_regular.html)
and
[`dials::threshold`](https://dials.tidymodels.org/reference/threshold.html).

```{r, message=FALSE}
threshold_grid  <- dials::grid_regular(x = dials::threshold(range = c(0, 1)), 
                                       levels = 10)

threshold_grid %>% 
  reactable::reactable(defaultPageSize = 5)
```

## Principal component regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the
cross validated data with
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html).
Note that this process may take some time.

```{r, message=FALSE}
doParallel::registerDoParallel()
foreach::getDoParWorkers()
```

```{r, message=FALSE, max.height='150px'}
tune_res <- tune::tune_grid(
  object = pca_workflow,
  resamples = Hitters_fold, 
  grid = threshold_grid
)

tune_res
```
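
Tuning is complete at this point, so the parallel backend is no longer needed. As an optional housekeeping step (a small sketch, not something the workflow requires), we can switch back to the sequential backend:

```{r, message=FALSE}
# Optional: revert to sequential processing now that tuning is done
foreach::registerDoSEQ()
foreach::getDoParWorkers()
```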

Here we see how the threshold value affects the performance
metrics using
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html).
Do note that using a different seed will give a different plot.

```{r, message=FALSE}
# Note that a different seed will give different plots
tune::autoplot(tune_res)
```

We can also see the raw metrics that created this chart by calling
[`tune::collect_metrics()`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE, max.height='150px'}

tune::collect_metrics(tune_res) %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is the `ggplot` way, should
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html)
fail.

```{r, message=FALSE, max.height='150px'}

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["threshold"]],
                                         y = .data[["mean"]],
                                         colour = .data[[".metric"]])) +
  ggplot2::geom_errorbar(mapping = ggplot2::aes(ymin = .data[["mean"]] - .data[["std_err"]],
                                                ymax = .data[["mean"]] + .data[["std_err"]]),
                         alpha = 0.5) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[[".metric"]]), 
                      scales = "free", 
                      nrow = 2) +
  ggplot2::theme(legend.position = "none")
```

Use
[`tune::show_best`](https://tune.tidymodels.org/reference/show_best.html)
to see the top few values for a given metric.

The "best" values can be selected using
[`tune::select_best`](https://tune.tidymodels.org/reference/show_best.html),
this function requires you to specify a metric that it should select
against. The threshold value is 0.889 for metric `rsme` since it gives
the lowest value. Do note that using a different seed will give a
different best number of threshold value.

```{r, message=FALSE, max.height='150px'}

top_threshold <- tune::show_best(tune_res, metric = "rmse", n = 5)
top_threshold %>%
  reactable::reactable(defaultPageSize = 5)

best_threshold <- tune::select_best(tune_res, metric = "rmse")
best_threshold %>%
  reactable::reactable(defaultPageSize = 5)

```
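
An alternative to the single numerically best value is the most parsimonious threshold whose performance is within one standard error of the best, a common heuristic. A hedged sketch using [`tune::select_by_one_std_err`](https://tune.tidymodels.org/reference/show_best.html), where the `threshold` argument orders the candidates from simplest (fewest components) to most complex:

```{r, message=FALSE}
# Most parsimonious threshold within one standard error of the best RMSE
tune::select_by_one_std_err(tune_res, threshold, metric = "rmse") %>%
  reactable::reactable(defaultPageSize = 5)
```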

## Principal component regression model with optimised threshold value

We create the principal component regression workflow with the best
threshold using
[`tune::finalize_workflow`](https://tune.tidymodels.org/reference/finalize_model.html).

```{r, message=FALSE, max.height='150px'}

pca_final <- tune::finalize_workflow(x = pca_workflow, 
                                     parameters = best_threshold)

pca_final
```

We now train the principal component regression model with the training
data using
[`parsnip::fit`](https://parsnip.tidymodels.org/reference/fit.html)

```{r, message=FALSE}

pca_final_fit <- parsnip::fit(object = pca_final, data = Hitters_train)
```

We can see the coefficients and statistics using
[`tune::extract_fit_parsnip`](https://tune.tidymodels.org/reference/extract-tune.html), [`broom::tidy`](https://broom.tidymodels.org/reference/tidy.lm.html) and [`broom::glance`](https://broom.tidymodels.org/reference/glance.lm.html) for `lm` class objects

```{r, message=FALSE}
pca_final_fit  %>%
  tune::extract_fit_parsnip() %>%
  broom::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

```{r, message=FALSE}
pca_final_fit  %>%
  tune::extract_fit_parsnip() %>%
  broom::glance() %>%
  reactable::reactable(defaultPageSize = 5)
```
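
Since the final model is an ordinary `lm` fitted on the principal component scores, we can also take a quick, optional look at which components drive the fit. A small sketch using `vip::vip` on the underlying engine fit; note that the "variables" here are the PCA components rather than the original predictors:

```{r, message=FALSE}
# Importance of each PCA component in the final linear model
# (for lm fits, vip ranks terms by the absolute value of the t-statistic)
pca_final_fit %>%
  tune::extract_fit_engine() %>%
  vip::vip()
```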

## Principal component regression model on test data

This principal component regression model can now be applied on our
testing data set to validate its performance. For regression models, a
`.pred` column is added when
[`parsnip::augment`](https://parsnip.tidymodels.org/reference/augment.html)
is used.

```{r, message=FALSE}

test_results <- parsnip::augment(x = pca_final_fit, new_data = Hitters_test)
  
test_results %>%
  reactable::reactable(defaultPageSize = 5)
  
```

We check how well the `.pred` column matches the `Salary` column using
[`yardstick::rsq`](https://yardstick.tidymodels.org/reference/rsq.html).

```{r, message=FALSE}
test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```
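
If we want a few more regression metrics in one call, [`yardstick::metric_set`](https://yardstick.tidymodels.org/reference/metric_set.html) bundles several metric functions into one. A small sketch computing RMSE, R-squared and MAE on the same test predictions:

```{r, message=FALSE}
# Bundle RMSE, R-squared and MAE into a single metric function
reg_metrics <- yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::mae)

test_results %>%
  reg_metrics(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```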

Alternatively, we can use
[`tune::last_fit`](https://tune.tidymodels.org/reference/last_fit.html) and [`tune::collect_metrics`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE}

test_rs <- tune::last_fit(object = pca_final_fit, 
                          split = Hitters_split)

test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)
```

Use [`tune::collect_predictions`](https://tune.tidymodels.org/reference/collect_predictions.html) to see only the actual and predicted values of the test data.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)
```

Let us take a closer look at the predicted and actual response as a scatter plot.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Salary"]],
                                         y = .data[[".pred"]]
                                         )
                  ) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()
```

# Partial Least Square

Partial least square regression is a linear model fitted on PLS-transformed
data. Hence, the major changes are in the preprocessing steps.

## Create the resample object

First, we split the samples into a training set and a test set. From the training set, we create a 10-fold cross-validation data set.

This is done with [`rsample::initial_split`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::training`](https://rsample.tidymodels.org/reference/initial_split.html),
[`rsample::testing`](https://rsample.tidymodels.org/reference/initial_split.html)
and
[`rsample::vfold_cv`](https://rsample.tidymodels.org/reference/vfold_cv.html).

```{r, message=FALSE}
set.seed(1234)
Hitters <- dplyr::as_tibble(ISLR2::Hitters) %>%
  tidyr::drop_na()

Hitters_split <- rsample::initial_split(Hitters, strata = "Salary")

Hitters_train <- rsample::training(Hitters_split)
Hitters_test <- rsample::testing(Hitters_split)

Hitters_fold <- rsample::vfold_cv(Hitters_train, v = 10)

```

## Create the preprocessor

We create a recipe with
[`recipes::recipe`](https://recipes.tidymodels.org/reference/recipe.html),
[`recipes::all_nominal_predictors`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::all_predictors()`](https://recipes.tidymodels.org/reference/has_role.html),
[`recipes::step_novel`](https://recipes.tidymodels.org/reference/step_novel.html),
[`recipes::step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html),
[`recipes::step_zv`](https://recipes.tidymodels.org/reference/step_zv.html)
and
[`recipes::step_normalize`](https://recipes.tidymodels.org/reference/step_normalize.html).

We add
[`recipes::step_pls`](https://recipes.tidymodels.org/reference/step_pls.html)
this time to perform partial least square calculation on all the
predictors. `num_comp` is the number of partial least square components
to retain as new predictors. `outcome` is the response variable for
partial least square regression to use.

```{r, message=FALSE}

pls_recipe <- 
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  recipes::step_novel(recipes::all_nominal_predictors()) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pls(recipes::all_predictors(), num_comp = 19, outcome = "Salary")
```

Apply
[`recipes::prep`](https://recipes.tidymodels.org/reference/prep.html)
and
[`recipes::bake`](https://recipes.tidymodels.org/reference/bake.html) to
estimate and apply the preprocessing steps.

```{r, message=FALSE}

rec <- recipes::prep(x = pls_recipe, training = Hitters_train)
  
rec %>%
  summary() %>%
  reactable::reactable(defaultPageSize = 5)

rec %>%
  recipes::bake(new_data = Hitters_train) %>%
  reactable::reactable(defaultPageSize = 5)

```

## PLS exploration

We can explore the results of the PLS using [`recipes::prep`](https://recipes.tidymodels.org/reference/prep.html) and
[`parsnip::tidy`](https://parsnip.tidymodels.org/reference/tidy.model_fit.html). We can see that the PLS transformation is performed at step number 5.

```{r, message=FALSE}

recipes::prep(x = pls_recipe, training = Hitters_train) %>%
  parsnip::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

As such, we extract the results of step number 5, i.e. the PLS step.

```{r, message=FALSE}

tidied_pls <- recipes::prep(x = pls_recipe, training = Hitters_train) %>%
  parsnip::tidy(5)


tidied_pls %>%
  reactable::reactable(defaultPageSize = 5)

```

We make a visualization to see what the first four components look like

```{r, message=FALSE, fig.height = 8}

tidied_pls %>%
  dplyr::filter(.data[["component"]] %in% c("PLS1", "PLS2", "PLS3", "PLS4")) %>%
  dplyr::mutate(component = forcats::fct_inorder(.data[["component"]])) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["value"]], 
                                         y = .data[["terms"]], 
                                         fill = .data[["terms"]])) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]])) +
  ggplot2::labs(y = NULL)

```

Let us take a closer look at the top 6 variables that contribute to the
first four components

```{r, message=FALSE}

tidied_pls %>%
  dplyr::filter(.data[["component"]] %in% c("PLS1", "PLS2", "PLS3", "PLS4")) %>%
  dplyr::group_by(.data[["component"]]) %>%
  dplyr::top_n(6, abs(.data[["value"]])) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    terms = tidytext::reorder_within(x = .data[["terms"]], 
                                     by = abs(.data[["value"]]), 
                                     within = .data[["component"]])
    ) %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = abs(.data[["value"]]), 
                                         y = .data[["terms"]], 
                                         fill = .data[["value"]] > 0)
                  ) +
  ggplot2::geom_col() +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[["component"]]), 
                      scales = "free_y") +
  tidytext::scale_y_reordered() +
  ggplot2::labs(
    x = "Absolute value of contribution to PLS component",
    y = NULL, 
    fill = "Positive?"
  )

```

`num_comp` is the number of partial least square components to retain as
new predictors. We will now try to find the best number of components
using
[`tune::tune()`](https://hardhat.tidymodels.org/reference/tune.html).

```{r, message=FALSE}

pls_recipe <- 
  recipes::recipe(formula = Salary ~ ., data = Hitters_train) %>% 
  recipes::step_novel(recipes::all_nominal_predictors()) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_predictors()) %>%
  recipes::step_pls(recipes::all_predictors(), num_comp = tune::tune(), outcome = "Salary")
```

## Specify the model

We use a linear model

```{r, message=FALSE}

lm_spec <- 
  parsnip::linear_reg() %>% 
  parsnip::set_mode("regression") %>% 
  parsnip::set_engine("lm")

lm_spec %>%
  parsnip::translate()

```

## Create the workflow

[`workflows::workflow`](https://workflows.tidymodels.org/reference/workflow.html),
[`workflows::add_recipe`](https://workflows.tidymodels.org/reference/add_recipe.html)
and
[`workflows::add_model`](https://workflows.tidymodels.org/reference/add_model.html)
are used.

```{r, message=FALSE, max.height='150px'}

pls_workflow <-  workflows::workflow() %>% 
  workflows::add_recipe(pls_recipe) %>% 
  workflows::add_model(lm_spec)

pls_workflow
```

## Create the number of components grid

A number-of-components grid of $10$ values from $1$ to $20$ is created.

The number of components grid is created using
[`dials::grid_regular`](https://dials.tidymodels.org/reference/grid_regular.html)
and
[`dials::num_comp`](https://dials.tidymodels.org/reference/num_comp.html).

```{r, message=FALSE}
num_comp_grid  <- dials::grid_regular(x = dials::num_comp(range = c(1, 20)), 
                                      levels = 10)

num_comp_grid %>% 
  reactable::reactable(defaultPageSize = 5)
```

## Partial least square regression model fitting on cross validated data

Now we have everything we need and we can fit all the models on the
cross validated data with
[`tune::tune_grid`](https://tune.tidymodels.org/reference/tune_grid.html).
Note that this process may take some time.

```{r, message=FALSE}
doParallel::registerDoParallel()
foreach::getDoParWorkers()
```

```{r, message=FALSE, max.height='150px'}
tune_res <- tune::tune_grid(
  object = pls_workflow,
  resamples = Hitters_fold, 
  grid = num_comp_grid
)

tune_res
```

Here we see how the number of components affects the performance
metrics using
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html).
Do note that using a different seed will give a different plot.

```{r, message=FALSE}
# Note that a different seed will give different plots
tune::autoplot(tune_res)
```

We can also see the raw metrics that created this chart by calling
[`tune::collect_metrics()`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE, max.height='150px'}

tune::collect_metrics(tune_res) %>% 
  reactable::reactable(defaultPageSize = 5)
```

Here is the `ggplot` way, should
[`tune::autoplot`](https://tune.tidymodels.org/reference/autoplot.tune_results.html)
fail.

```{r, message=FALSE, max.height='150px'}

tune_res %>%
  tune::collect_metrics() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["num_comp"]],
                                         y = .data[["mean"]],
                                         colour = .data[[".metric"]])) +
  ggplot2::geom_errorbar(mapping = ggplot2::aes(ymin = .data[["mean"]] - .data[["std_err"]],
                                                ymax = .data[["mean"]] + .data[["std_err"]]),
                         alpha = 0.5) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::facet_wrap(facets = ggplot2::vars(.data[[".metric"]]), 
                      scales = "free", 
                      nrow = 2) +
  ggplot2::theme(legend.position = "none")
```

Use
[`tune::show_best`](https://tune.tidymodels.org/reference/show_best.html)
to see the top few values for a given metric.

The "best" values can be selected using
[`tune::select_best`](https://tune.tidymodels.org/reference/show_best.html),
this function requires you to specify a metric that it should select
against. The number of components value is 1 for metric `rsme` since it
gives the lowest value. Do note that using a different seed will give a
different best number of components value.

```{r, message=FALSE, max.height='150px'}

top_num_comp <- tune::show_best(tune_res, metric = "rmse", n = 5)
top_num_comp %>%
  reactable::reactable(defaultPageSize = 5)

best_num_comp <- tune::select_best(tune_res, metric = "rmse")
best_num_comp %>%
  reactable::reactable(defaultPageSize = 5)

```
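
As in the PCA section, we could optionally pick the most parsimonious setting within one standard error of the best with [`tune::select_by_one_std_err`](https://tune.tidymodels.org/reference/show_best.html), here ordering candidates by `num_comp` (fewer components means a simpler model); a hedged sketch:

```{r, message=FALSE}
# Smallest number of components within one standard error of the best RMSE
tune::select_by_one_std_err(tune_res, num_comp, metric = "rmse") %>%
  reactable::reactable(defaultPageSize = 5)
```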

## Partial least square model with optimised number of components

We create the partial least square regression workflow with the best
number of components using
[`tune::finalize_workflow`](https://tune.tidymodels.org/reference/finalize_model.html).

```{r, message=FALSE, max.height='150px'}

pls_final <- tune::finalize_workflow(x = pls_workflow, 
                                     parameters = best_num_comp)

pls_final
```

We now train the partial least square regression model with the training
data using
[`parsnip::fit`](https://parsnip.tidymodels.org/reference/fit.html)

```{r, message=FALSE}

pls_final_fit <- parsnip::fit(object = pls_final, data = Hitters_train)
```

We can see the coefficients and statistics using
[`tune::extract_fit_parsnip`](https://tune.tidymodels.org/reference/extract-tune.html), [`broom::tidy`](https://broom.tidymodels.org/reference/tidy.lm.html) and [`broom::glance`](https://broom.tidymodels.org/reference/glance.lm.html) for `lm` class objects

```{r, message=FALSE}
pls_final_fit  %>%
  tune::extract_fit_parsnip() %>%
  broom::tidy() %>%
  reactable::reactable(defaultPageSize = 5)
```

```{r, message=FALSE}
pls_final_fit  %>%
  tune::extract_fit_parsnip() %>%
  broom::glance() %>%
  reactable::reactable(defaultPageSize = 5)
```

## Partial least square regression model on test data

This partial least square regression model can now be applied on our
testing data set to validate its performance. For regression models, a
`.pred` column is added when
[`parsnip::augment`](https://parsnip.tidymodels.org/reference/augment.html)
is used.

```{r, message=FALSE}

test_results <- parsnip::augment(x = pls_final_fit, new_data = Hitters_test)
  
test_results %>%
  reactable::reactable(defaultPageSize = 5)
  
```

We check how well the `.pred` column matches the `Salary` column using
[`yardstick::rsq`](https://yardstick.tidymodels.org/reference/rsq.html).

```{r, message=FALSE}
test_results %>%
  yardstick::rsq(truth = .data[["Salary"]], estimate = .data[[".pred"]]) %>%
  reactable::reactable(defaultPageSize = 5)
```

Alternatively, we can use
[`tune::last_fit`](https://tune.tidymodels.org/reference/last_fit.html)
and
[`tune::collect_metrics`](https://tune.tidymodels.org/reference/collect_predictions.html).

```{r, message=FALSE}

test_rs <- tune::last_fit(object = pls_final_fit, 
                          split = Hitters_split)
  
test_rs %>%
  tune::collect_metrics() %>%
  reactable::reactable(defaultPageSize = 5)
  
```

Use [`tune::collect_predictions`](https://tune.tidymodels.org/reference/collect_predictions.html) to see only the actual and predicted values of the test data.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  reactable::reactable(defaultPageSize = 5)
```

Let us take a closer look at the predicted and actual response as a scatter plot.

```{r, message=FALSE}
test_rs %>%
  tune::collect_predictions() %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = .data[["Salary"]],
                                         y = .data[[".pred"]]
                                         )
                  ) +
  ggplot2::geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  ggplot2::geom_point(alpha = 0.6, color = "midnightblue") +
  ggplot2::coord_fixed()
```

# Rmarkdown Template

This Rmarkdown template is created by the [Reality Bending
Lab](https://realitybending.github.io/). The template can be downloaded
from the lab's
[GitHub](https://github.com/RealityBending/TemplateResults) repository.
For more information about the motivation behind creating this template,
check out [Dr. Dominique Makowski's blog
post](https://dominiquemakowski.github.io/post/2021-02-10-template_results/).

# Blog References

-   Emil Hvitfeldt's [ISLR tidymodels
    Labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/linear-model-selection-and-regularization.html)

-   Julia Silge's
    [blog](https://juliasilge.com/blog/lasso-the-office.html) titled
    "LASSO regression using tidymodels and #TidyTuesday data for The
    Office"

-   Julia Silge's [blog](https://juliasilge.com/blog/best-hip-hop.html)
    titled "PCA and the #TidyTuesday best hip hop songs ever"

# Package References

```{r warning=FALSE, message=TRUE, results='asis'}
report::cite_packages(sessionInfo())
```
