class: clear, center, middle background-image: url(images/unsupervised-cover.jpg) background-position: center background-size: contain <br><br><br><br><br><br><br><br><br> .font200.white[Unsupervised Learning] --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * .bold[clustering]: reduce the observation space of a data set * .bold[dimension reduction]: reduce the feature space of a data set <img src="images/clustering_vs_pca.jpeg" width="2247" style="display: block; margin: auto;" /> --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * .bold[clustering]: reduce the observation space of a data set - _k_-means clustering - hierarchical clustering - spectral clustering * .bold[dimension reduction]: reduce the feature space of a data set - principal components analysis (PCA) - factor analysis - matrix factorization - autoencoders * .bold[Generalized low rank models]: a generalization of the clustering & dimension reduction --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * clustering: reduce the observation space of a data set - .bold.blue[_k_-means clustering] - .bold.blue[hierarchical clustering] - spectral clustering * dimension reduction: reduce the feature space of a data set - .bold.blue[principal components analysis (PCA)] - factor analysis - matrix factorization - autoencoders * Generalized low rank models: a generalization of the clustering & dimension reduction <br> .center.bold.white[.content-box-blue-dark[This module's focus]] --- # Prerequisites .red[
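If any of the packages used in this module are not installed yet, a one-time install along these lines should work (a sketch; assumes a standard CRAN setup and covers the packages that appear later in the module):

```r
# one-time install of the packages used throughout this module
install.packages(c(
  "tidyverse",   # data munging & visualization
  "cluster",     # additional clustering techniques
  "factoextra",  # clustering & PCA visualizations
  "AmesHousing", # example data
  "caret",       # one-hot encoding & preProcess()
  "broom",       # tidy model output
  "gridExtra"    # arranging multiple plots
))
```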
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1] .pull-left[ .center.font120[Packages] ```r library(tidyverse) # data munging & visualization library(cluster) # additional clustering techniques library(factoextra) # clustering & PCA visualizations ``` ] .pull-right[ .center.font120[Data] ```r USArrests # primary example data AmesHousing::make_ames() # a few additional examples ``` ] --- class: center, middle, inverse .font300.white[Clustering] --- # Types of clustering Clustering is a broad set of techniques for ___finding subgroups of observations___ within a data set. .pull-left[ * .bold[Objective]: we want observations in the same group to be similar and observations in different groups to be dissimilar * .bold[Use cases:] - customer segmentation - concentration of crime activity - common patient traits - voter profiles ] .pull-right[ <br><br> <img src="images/cluster-icon.jpg" width="1089" style="display: block; margin: auto;" /> ] --- # Types of clustering Clustering is a broad set of techniques for ___finding subgroups of observations___ within a data set. .pull-left[ * .bold[Objective]: we want observations in the same group to be similar and observations in different groups to be dissimilar * .bold[Use cases:] - customer segmentation - concentration of crime activity - common patient traits - voter profiles ] .pull-right[ * .bold[Methods]: several clustering algorithms exists: - k-means - hierarchical - partitioning around mediods (PAM) - clustering large applications (CLARA) ] --- # Measuring observation distances .pull-left[ * classification of observations into groups requires methods for computing the distance of the (dis)similarity between each pair of observations * distance measures - Euclidean: `\(d_{euc}(x_a,x_b) = \sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}\)` - Manhattan: `\(d_{man}(x_a,x_b) = \sum^P_{j=1}|x_{aj} - x_{bj}|\)` ] .pull-right[ ```r (two_states <- USArrests[1:2, 1:2]) ## Murder Assault ## Alabama 13.2 236 ## Alaska 10.0 263 dist(two_states, method = "euclidean") ## Alabama ## Alaska 27.18897 dist(two_states, method = "manhattan") ## Alabama ## Alaska 30.2 ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- # Measuring observation distances .pull-left[ * classification of observations into groups requires methods for computing the distance of the (dis)similarity between each pair of observations * distance measures - .opacity20[Euclidean] - .opacity20[Manhattan] - Correlation-based (i.e. Pearson, Spearman) ] .pull-right[ <br> <img src="02-unsupervised-learning_files/figure-html/correlation-distance-example-1.png" style="display: block; margin: auto;" /> ] -- <br> .center.bold.blue.font120[.content-box-gray[There are several other distance measures but these are the most common]] --- # Measuring observation distances When to use certain distance measures * Euclidean - most sensitive to outliers - outliers can skew clusters giving false interpretation - use if you are relatively certain minimal outliers exists * Manhattan - less sensitive to outliers - use if you want to be more robust to existing outliers * Correlation-based - captures common relationships regardless of magnitude - use if you want to analyze unscaled data where observations may have larger differences in magnitudes but possibly similar behaviors - i.e. 
group customers based on purchased quantities (large volume and low volume customers can be in same group if they exhibit common preferences) <br> .center.bold.blue[.content-box-gray[Euclidean is the most commonly used.]] --- # *k*-means clustering Basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total .blue[within-cluster variation]) is .blue[minimized] `$$\underset{C_1, \dots, C_k}{\texttt{minimize}} \bigg\{ \sum^K_{k=1}(W(C_k)) \bigg \}$$` <img src="02-unsupervised-learning_files/figure-html/generated-data-1.png" style="display: block; margin: auto;" /> .bold.red.center[Tighter clusters = more well defined clusters] --- # *k*-means clustering Basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total .blue[within-cluster variation]) is .blue[minimized] `$$\underset{C_1, \dots, C_k}{\texttt{minimize}} \bigg\{ \sum^K_{k=1} (\texttt{Mean Euclidean Distance}_k )\bigg \}$$` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> .bold.red.center[We measure within-cluster variation using our distance measure of choice (i.e. Euclidean, Manhattan)] --- # *k*-means algorithm .pull-left[ * Unfortunately we cannot evaluate every possible cluster combination because there are almost `\(K^n\)` ways to partition _n_ observations into _K_ clusters * Estimate _greedy local optimum_ ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](https://www.jstor.org/stable/pdf/2346830.pdf?casa_token=DyTW0ZLNC4gAAAAA:VNX2TGwDfcs5foMa96ZxnOM2mjaQU1WuCOLL8qF6iDBWp6ClU8-i2-OSXKbtO1uHm6_1oda_2egpvgYCvaix8UxUqUryqZj-Pw3G4m771Ev5-4kL46Y)) 1. .red[randomly] assign each observation to an initial cluster 2. iterate until the cluster assignments stop changing a. compute cluster centroids b. reassign each observation to the cluster whose centroid is closest * Due to randomization - we can get slightly different results each try - use several random starts (rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 20) - the algorithm keeps the start with the lowest `\(W(C_k)\)` ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- # Prepare our data for _k_-means .red[
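Why insist on standardizing? A quick sketch of what happens to Euclidean distance when you skip it — in the raw data `Assault` has a far larger variance than the other features and ends up dominating the calculation:

```r
# feature variances on the raw scale -- Assault dwarfs the other features
apply(USArrests, 2, var)

# raw vs. standardized distance between the first two states
dist(USArrests[1:2, ])         # driven almost entirely by Assault
dist(scale(USArrests)[1:2, ])  # all four features now contribute comparably
```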
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2] .pull-left[ 1. Rows are observations (individuals) and columns are variables. 2. Any missing value in the data must be removed or estimated. 3. The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable. `\(^a\)` <br><br><br><br><br> `\(^a\)` .font60[In some rare cases you may want to leave data unstandardized.] ] .pull-right[ ```r crime <- USArrests %>% drop_na() %>% scale() %>% print() ## Murder Assault UrbanPop Rape ## Alabama 1.24256408 0.78283935 -0.52090661 -0.003416473 ## Alaska 0.50786248 1.10682252 -1.21176419 2.484202941 ## Arizona 0.07163341 1.47880321 0.99898006 1.042878388 ## Arkansas 0.23234938 0.23086801 -1.07359268 -0.184916602 ## California 0.27826823 1.26281442 1.75892340 2.067820292 ## Colorado 0.02571456 0.39885929 0.86080854 1.864967207 ## Connecticut -1.03041900 -0.72908214 0.79172279 -1.081740768 ## Delaware -0.43347395 0.80683810 0.44629400 -0.579946294 ## Florida 1.74767144 1.97077766 0.99898006 1.138966691 ## Georgia 2.20685994 0.48285493 -0.38273510 0.487701523 ## Hawaii -0.57123050 -1.49704226 1.20623733 -0.110181255 ## Idaho -1.19113497 -0.60908837 -0.79724965 -0.750769945 ## Illinois 0.59970018 0.93883125 1.20623733 0.295524916 ## Indiana -0.13500142 -0.69308401 -0.03730631 -0.024769429 ## Iowa -1.28297267 -1.37704849 -0.58999237 -1.060387812 ## Kansas -0.41051452 -0.66908525 0.03177945 -0.345063775 ## Kentucky 0.43898421 -0.74108152 -0.93542116 -0.526563903 ## Louisiana 1.74767144 0.93883125 0.03177945 0.103348309 ## Maine -1.30593210 -1.05306531 -1.00450692 -1.434064548 ## Maryland 0.80633501 1.55079947 0.10086521 0.701231086 ## Massachusetts -0.77786532 -0.26110644 1.34440885 -0.526563903 ## Michigan 0.99001041 1.01082751 0.58446551 1.480613993 ## Minnesota -1.16817555 -1.18505846 0.03177945 -0.676034598 ## Mississippi 1.90838741 1.05882502 -1.48810723 -0.441152078 ## Missouri 0.27826823 0.08687549 0.30812248 0.743936999 ## Montana -0.41051452 -0.74108152 -0.86633540 -0.515887425 ## Nebraska -0.80082475 -0.82507715 -0.24456358 -0.505210947 ## Nevada 1.01296983 0.97482938 1.06806582 2.644350114 ## New Hampshire -1.30593210 -1.36504911 -0.65907813 -1.252564419 ## New Jersey -0.08908257 -0.14111267 1.62075188 -0.259651949 ## New Mexico 0.82929443 1.37080881 0.30812248 1.160319648 ## New York 0.76041616 0.99882813 1.41349461 0.519730957 ## North Carolina 1.19664523 1.99477641 -1.41902147 -0.547916860 ## North Dakota -1.60440462 -1.50904164 -1.48810723 -1.487446939 ## Ohio -0.11204199 -0.60908837 0.65355127 0.017936483 ## Oklahoma -0.27275797 -0.23710769 0.16995096 -0.131534211 ## Oregon -0.66306820 -0.14111267 0.10086521 0.861378259 ## Pennsylvania -0.34163624 -0.77707965 0.44629400 -0.676034598 ## Rhode Island -1.00745957 0.03887798 1.48258036 -1.380682157 ## South Carolina 1.51807718 1.29881255 -1.21176419 0.135377743 ## South Dakota -0.91562187 -1.01706718 -1.41902147 -0.900240639 ## Tennessee 1.24256408 0.20686926 -0.45182086 0.605142783 ## Texas 1.12776696 0.36286116 0.99898006 0.455672088 ## Utah -1.05337842 -0.60908837 0.99898006 0.178083656 ## Vermont -1.28297267 -1.47304350 -2.31713632 -1.071064290 ## Virginia 0.16347111 -0.17711080 -0.17547783 -0.056798864 ## Washington -0.86970302 -0.30910395 0.51537975 0.530407436 ## West Virginia -0.47939280 -1.07706407 -1.83353601 -1.273917376 ## Wisconsin -1.19113497 -1.41304662 0.03177945 -1.113770203 ## Wyoming -0.22683912 -0.11711392 -0.38273510 -0.601299251 ## attr(,"scaled:center") ## Murder Assault UrbanPop Rape ## 7.788 170.760 
65.540 21.232 ## attr(,"scaled:scale") ## Murder Assault UrbanPop Rape ## 4.355510 83.337661 14.474763 9.366385 ``` ] --- # Applying _k_-means .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3] .scrollable90[ .pull-left[ 1. .opacity20[Rows are observations (individuals) and columns are variables.] 2. .opacity20[Any missing value in the data must be removed or estimated.] 3. .opacity20[The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable.] 4. Apply `kmeans()` ] .pull-right[ ```r k3 <- kmeans(crime, centers = 3, nstart = 20) # tidied output broom::tidy(k3) ## # A tibble: 3 x 7 ## x1 x2 x3 x4 size withinss cluster ## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct> ## 1 -0.962 -1.11 -0.930 -0.967 13 12.0 1 ## 2 -0.447 -0.347 0.479 -0.257 17 19.6 2 ## 3 1.00 1.01 0.198 0.847 20 46.7 3 # full model output glimpse(k3) ## List of 9 ## $ cluster : Named int [1:50] 3 3 3 2 3 3 2 2 3 3 ... ## ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ... ## $ centers : num [1:3, 1:4] -0.962 -0.447 1.005 -1.107 -0.347 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:3] "1" "2" "3" ## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape" ## $ totss : num 196 ## $ withinss : num [1:3] 12 19.6 46.7 ## $ tot.withinss: num 78.3 ## $ betweenss : num 118 ## $ size : int [1:3] 13 17 20 ## $ iter : int 2 ## $ ifault : int 0 ## - attr(*, "class")= chr "kmeans" ``` ] ] --- # Interpreting output .red[
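Before plotting the clusters, it helps to confirm what `tot.withinss` actually measures: the summed squared Euclidean distances from each observation to its assigned centroid. A quick sanity check (a sketch; assumes `crime` and `k3` from the previous chunk are still in the workspace):

```r
# centroid assigned to each state (a 50 x 4 matrix)
assigned_centers <- k3$centers[k3$cluster, ]

# summed squared distances to the assigned centroids
sum(rowSums((crime - assigned_centers)^2))  # should match k3$tot.withinss (~78.3)
```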
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4] ```r as_tibble(crime) %>% mutate( cluster = k3$cluster, label = paste0(row.names(USArrests), " (", cluster, ")") ) %>% gather(Crime, Rate, Murder, Assault, Rape) %>% ggplot(aes(UrbanPop, Rate, color = factor(cluster), label = label)) + geom_text(show.legend = FALSE) + facet_wrap(~ Crime) + ylab("arrests per 100,000 residents (standardized)") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- class: yourturn # How many clusters
<img src="https://emojis.slackmojis.com/emojis/images/1542340471/4979/thinking.gif?1542340471" style="height:1.25em; width:auto; "/>
.font150.center[But how do we know we specified the right value for _k_?] <img src="https://media.giphy.com/media/xT5LMXWSmQ5xjWyTpC/giphy.gif" width="95%" height="95%" style="display: block; margin: auto;" /> --- class: yourturn # How many clusters
<img src="https://emojis.slackmojis.com/emojis/images/1542340471/4979/thinking.gif?1542340471" style="height:1.25em; width:auto; "/>
.font150.center[But how do we know we specified the right value for _k_?] <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> .center.bold[.content-box-gray[What do you notice?]] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) -- .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) ] .pull-right[ ```r fviz_nbclust(df, kmeans, method = "wss", k.max = 20) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) - Gap stat ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](http://web.stanford.edu/~hastie/Papers/gap.pdf)): - compares `\(\texttt{log}(W(C_k))\)` with their expected values under null reference distribution of the data - automated way to id "kink" - can ID when optimal _k_ is one ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) - Gap stat ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](http://web.stanford.edu/~hastie/Papers/gap.pdf)): - compares `\(\texttt{log}(W(C_k))\)` with their expected values under null reference distribution of the data - automated way to id "kink" - can ID when optimal _k_ is one ] .pull-right[ ```r fviz_nbclust(df, kmeans, method = "gap_stat", k.max = 20, verbose = FALSE) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] --- class: yourturn # Determine optimal clusters .red[
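If you are curious what `fviz_nbclust()` computes for the elbow method, it is essentially this loop (a sketch; assumes the standardized `crime` matrix from code chunk 2):

```r
# total within-cluster variation for k = 1, ..., 20
wss <- purrr::map_dbl(1:20, function(k) {
  kmeans(crime, centers = k, nstart = 20)$tot.withinss
})

plot(1:20, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```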
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 5 & 6] .pull-left[ Fill in code chunks 5 & 6 to identify the optimal number of clusters for our crime data. ``` # use the elbow heuristic (hint: "wss") fviz_nbclust(____, kmeans, method = "___", k.max = 20) ``` ``` # use the gab stat (hint: "gap_stat") fviz_nbclust(____, kmeans, method = "___", k.max = 20) ``` ] -- .pull-right[ ```r p1 <- fviz_nbclust(crime, kmeans, method = "wss", k.max = 20) p2 <- fviz_nbclust(crime, kmeans, method = "gap_stat", k.max = 20, verbose = FALSE) gridExtra::grid.arrange(p1, p2) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] --- # Hierarchical clustering .pull-left[ - .bold.font120[_k_-means] - Requires _k_ to be predetermined - clusters are not nested - .bold.font120[Hierarchical] - Do not require _k_ to be predetermined - clusters can be nested - user must specify measure of _dissimilarity between groups_ - results in a .bold[dendrogram] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/dendrogram-1.png" style="display: block; margin: auto;" /> ] --- # Hierarchical strategies .pull-left[ - .bold.font120[_k_-means] - Requires _k_ to be predetermined - clusters are not nested - .bold.font120[Hierarchical] - Do not require _k_ to be predetermined - clusters can be nested - user must specify measure of _dissimilarity between groups_ - results in a .bold[dendrogram] - two main strategies 1. Agglomerative nesting (AGNES): bottom up 2. Divisive analysis (DIANA): top down ] .pull-right[ <img src="images/dendrogram2.png" width="95%" style="display: block; margin: auto;" /> ] --- # Group clustering strategies .pull-left[ - .bold[AGNES] - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - .bold[DIANA] - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see [Kaufman & Rousseeuw, 1990](https://www.amazon.com/Finding-Groups-Data-Introduction-Analysis/dp/0471735787/ref=sr_1_1?ie=UTF8&qid=1548179235&sr=8-1&keywords=Finding+Groups+in+Data%3A+An+Introduction+to+Cluster+Analysis)) - .blue[Pro tip:] used if focused on relatively small number of clusters ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - __Single linkage__ - Complete linkage - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/single-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the closest two observations have the smallest distance] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - __Complete linkage__ - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - 
iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/complete-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the most distant two observations have the smallest distance] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - __Group average linkage__ - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/group-average-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the average distance between all observations are smallest] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - __Centroid linkage__ - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/centroid-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters with the smallest distance between their centroids] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - Centroid linkage - __Ward's linkage__ - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/wards-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters with the smallest between cluster variation] ] --- # Comparing AGNES clustering .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage: produces many "loose" clusters - Complete linkage: produces "compact" clusters - Centroid linkage: depends on variability - Group average linkage: balances clusters - Ward's linkage: balances clusters - repeat until all clusters have been merged <br> .center.bold.red[I tend to look for the method that produces more compact clusters] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] --- # Applying hierarchical clustering .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 7 & 8] .pull-left[ ```r # you can use hclust option1 <- hclust(dist(crime), method = "average") plot(option1) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r # or agnes option2 <- cluster::agnes(crime, method = "average") plot(option2, which = 2) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] --- # Interpreting the dendrogram - the vertical axis represents the distance at which clusters fuse - obs 9 & 2 appear close on the dendrogram but, in fact, the height at which they fuse implies they are approximately the same distance from obs 5, 7, & 8 .pull-left[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-25-1.png" width="80%" height="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-26-1.png" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # Which method to use? .red[
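Before comparing AGNES linkage methods below, note that the divisive counterpart (DIANA), mentioned earlier but never shown, can be fit in much the same way (a sketch using the same `crime` matrix):

```r
# divisive (top-down) hierarchical clustering
hc_diana <- cluster::diana(crime)

hc_diana$dc               # divisive coefficient, read like the agglomerative coefficient
plot(hc_diana, which = 2) # dendrogram
```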
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9] The `agnes()` function provides an ___agglomerative coefficient___, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure). .code70[ ```r # agglomerative coefficient from earlier agnes() model option2$ac ## [1] 0.7379371 ``` ] -- This allows us to compare linkage methods and identify which one finds the strongest clustering structure. .code70[ ```r # methods to assess m <- c("average", "single", "complete", "ward") names(m) <- m # function to compute coefficient ac <- function(x) { cluster::agnes(crime, method = x)$ac } # get agglomerative coefficient for each linkage method map_dbl(m, ac) ## average single complete ward ## 0.7379371 0.6276128 0.8531583 0.9346210 ``` ] --- # Which clusters to use? .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 10 & 11] .pull-left[ .center.bold[Visual assessment] ```r # re-run hc with Ward method hc_ward <- cluster::agnes(crime, method = "ward") # highlight 4 clusters plot(hc_ward, which = 2) rect.hclust(hc_ward, k = 4, border = 2:5) ``` <img src="images/hclust-visual-assessment.png" width="1513" style="display: block; margin: auto;" /> ] .pull-right[ .center.bold[Elbow method or Gap stat] ```r # fviz_nbclust can use different FUNctions # see ?fviz_nbclust & ?hcut fviz_nbclust(crime, FUN = hcut, method = "gap_stat", verbose = FALSE) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] --- # Extract results and visualize .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12] ```r as_tibble(crime) %>% mutate( * cluster = cutree(hc_ward, k = 3), # use cutree to get clusters label = paste0(row.names(USArrests), " (", cluster, ")") ) %>% gather(Crime, Rate, Murder, Assault, Rape) %>% ggplot(aes(UrbanPop, Rate, color = factor(cluster), label = label)) + geom_text(show.legend = FALSE) + facet_wrap(~ Crime) + ylab("arrests per 100,000 residents (standardized)") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- # Ohio's cluster .red[
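The lists below come from code chunk 13; one way to produce a comparison along these lines is sketched here (assumes `k3` and `hc_ward` from the earlier chunks; the object names are illustrative):

```r
# cluster memberships from both methods, in the same row order as USArrests
hc_clusters <- cutree(as.hclust(hc_ward), k = 3)
states      <- rownames(USArrests)

# states that share a cluster with Ohio under each method
km_group <- states[k3$cluster == k3$cluster[states == "Ohio"]]
hc_group <- states[hc_clusters == hc_clusters[states == "Ohio"]]

intersect(km_group, hc_group)  # states both methods place with Ohio
setdiff(km_group, hc_group)    # states only k-means places with Ohio
```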
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 13] .bold[Common states between _k_-means and Hierarchical clustering] ``` ## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii" ## [5] "Indiana" "Kansas" "Massachusetts" "New Jersey" ## [9] "Ohio" "Oklahoma" "Oregon" "Pennsylvania" ## [13] "Rhode Island" "Utah" "Virginia" "Washington" ## [17] "Wyoming" ``` .bold[Different states between _k_-means and Hierarchical clustering] ``` ## [1] "Kentucky" "Missouri" ``` --- # Final Thoughts .pull-left[ .center.bold[What to remember] - We measure differences between obs with - Euclidean distance - Manhattan distance - Correlation-based distances - Data prep - clean & standardize data - _k_-means - minimizes within cluster variance - find optimal clusters w/elbow & Gap stat - Hierarchical - AGNES vs DIANA - Many ways to merge clusters - Use method that produces compact clusters ] .pull-right[ .center.bold[What to learn] - Many more methods exist - `daisy()`: compute distance measures for mixed data - `pam()`: medoid clustering - `clara()`: big data clustering - `diana()`: Devisive hierarchical clustering - Great resources - [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) - [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/) - [Practical Guide to Cluster Analysis in R](https://www.amazon.com/Practical-Guide-Cluster-Analysis-Unsupervised/dp/1542462703) - [Finding Groups in Data](https://www.amazon.com/Finding-Groups-Data-Introduction-Analysis/dp/0471735787) ] --- class: center, middle, inverse # Dimension Reduction via PCA --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .bold[Objective]: explain the variability of our feature set using fewer variables than the original data set. * .bold[Use cases:] - descriptive: explaining common feature attributes - customer profiling - student profiling - recidivism profiling - _________ profiling - predictive: feature engineering for downstream modeling - reducing noise - minimizing multicollinearity ] .pull-right[ <br> <img src="images/pca-icon.jpg" width="964" style="display: block; margin: auto;" /> ] --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .opacity20[Objective: explain the variability of our feature set using fewer variables than the original data set.] * .opacity20[Use cases:] - descriptive: explaining common feature attributes - customer profiling - student profiling - recidivism profiling - _________ profiling - .opacity20[predictive: feature engineering for downstream modeling] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/example-pca-1.png" style="display: block; margin: auto;" /> ] --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .opacity20[Objective: explain the variability of our feature set using fewer variables than the original data set.] * .opacity20[Use cases:] - .opacity20[descriptive: explaining common feature attributes] <br><br><br><br><br> - predictive: feature engineering for downstream modeling - reducing noise - minimizing multicollinearity ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/example-pca2-1.png" style="display: block; margin: auto;" /> ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. 
We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] -- .pull-right[ <img src="https://gifimage.net/wp-content/uploads/2017/10/how-do-you-say-gif-9.gif" style="display: block; margin: auto;" /> .center.bold.red[Do you prefer pictures?] ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/create-pca-image-1.png" style="display: block; margin: auto;" /> ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] .pull-right[ <img src="images/3D-PCA.png" width="672" style="display: block; margin: auto;" /> <br> .right.font50[Made with pca3d package] ] --- # Prepare our data for PCA .pull-left[ 1. Rows are observations (individuals) and columns are variables. 2. Any missing value in the data must be removed or estimated. 3. The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable. ] .pull-right[ ```r # exact same procedure you ran for clustering crime <- USArrests %>% drop_na() %>% scale() %>% print() ``` ] --- # Applying PCA .red[
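Two quick notes before fitting the PCA. First, plugging the crime data into the loading equation makes `\(Z_1\)` concrete — using the PC1 loadings reported in code chunk 15 below, the first component is roughly `\(Z_1 \approx 0.54 \cdot \texttt{Murder} + 0.58 \cdot \texttt{Assault} + 0.28 \cdot \texttt{UrbanPop} + 0.54 \cdot \texttt{Rape}\)` (in standardized units). Second, because `crime` is already centered and scaled, running `prcomp()` on it directly is equivalent to running PCA on the raw data with centering and scaling turned on — a sketch:

```r
# two equivalent ways to run the same PCA
# (standard deviations match; loadings match up to sign)
prcomp(crime, scale = FALSE)$sdev
prcomp(USArrests, center = TRUE, scale. = TRUE)$sdev
```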
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 14] .pull-left[ 1. .opacity20[Rows are observations (individuals) and columns are variables.] 2. .opacity20[Any missing value in the data must be removed or estimated.] 3. .opacity20[The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable.] 4. Apply `prcomp()` ] .pull-right[ ```r # perform PCA pca_result <- prcomp(crime, scale = FALSE) # PCA model output names(pca_result) ## [1] "sdev" "rotation" "center" "scale" "x" ``` .font120[`rotation`] provides the principal component loadings ( `\(\phi_{ij}\)` ) ] --- # Understanding PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 15] .pull-left[ * .font110[`rotation`] provides the principal component loadings ( `\(\phi_{ij}\)` ) * There will be the same number of PCs as variables * Positive vs negative direction: - The sign of a loading (aka eigenvector) is arbitrary; here the default `prcomp()` loadings point in the negative direction. - Positive-pointing eigenvectors are more intuitive to interpret - To use the positive-pointing vectors, we multiply the default loadings by -1. ] .pull-right[ ```r # convert loadings to positive pca_result$rotation <- -pca_result$rotation # there will be the same number of PCs as variables pca_result$rotation ## PC1 PC2 PC3 PC4 ## Murder 0.5358995 -0.4181809 0.3412327 -0.64922780 ## Assault 0.5831836 -0.1879856 0.2681484 0.74340748 ## UrbanPop 0.2781909 0.8728062 0.3780158 -0.13387773 ## Rape 0.5434321 0.1673186 -0.8177779 -0.08902432 ``` ] <br> .center.bold[__Loadings are the coefficients that indicate each variable's influence on the principal component__]
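The contribution bars below are essentially the squared loadings rescaled to percentages. A related property worth a quick check: the loading vectors are unit length and mutually orthogonal (a sketch; assumes `pca_result` from code chunk 15):

```r
# t(rotation) %*% rotation should be (numerically) the 4 x 4 identity matrix
round(t(pca_result$rotation) %*% pca_result$rotation, 10)
```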
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 16] We can visualize these contributions .pull-left[ ```r fviz_contrib(pca_result, choice = "var", axes = 1) ``` <img src="02-unsupervised-learning_files/figure-html/pca-pc1-contributions-1.png" style="display: block; margin: auto;" /> .center[PC1 appears to represent violent crime] ] .pull-right[ ```r fviz_contrib(pca_result, choice = "var", axes = 2) ``` <img src="02-unsupervised-learning_files/figure-html/pca-pc2-contributions-1.png" style="display: block; margin: auto;" /> .center[PC2 appears to represent urban density] ] --- # Understanding PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 17] .pull-left[ * .font110[`x`] provides the principal component scores * The principal components scores simply places a standardized score for each observation for each principal component. * Interpretation example: Alaska - PC1: 1.93 standard deviations above average value for PC1 (high violent crime) - PC2: 1 standard deviation below average value for PC2 (low urbanization) ] .pull-right[ ```r # convert scores to positive pca_result$x <- -pca_result$x pca_result$x ## PC1 PC2 PC3 PC4 ## Alabama 0.97566045 -1.12200121 0.43980366 -0.154696581 *## Alaska 1.93053788 -1.06242692 -2.01950027 0.434175454 ## Arizona 1.74544285 0.73845954 -0.05423025 0.826264240 ## Arkansas -0.13999894 -1.10854226 -0.11342217 0.180973554 ## California 2.49861285 1.52742672 -0.59254100 0.338559240 ## Colorado 1.49934074 0.97762966 -1.08400162 -0.001450164 ## Connecticut -1.34499236 1.07798362 0.63679250 0.117278736 ## Delaware 0.04722981 0.32208890 0.71141032 0.873113315 ## Florida 2.98275967 -0.03883425 0.57103206 0.095317042 ## Georgia 1.62280742 -1.26608838 0.33901818 -1.065974459 ## Hawaii -0.90348448 1.55467609 -0.05027151 -0.893733198 ## Idaho -1.62331903 -0.20885253 -0.25719021 0.494087852 ## Illinois 1.36505197 0.67498834 0.67068647 0.120794916 ## Indiana -0.50038122 0.15003926 -0.22576277 -0.420397595 ## Iowa -2.23099579 0.10300828 -0.16291036 -0.017379470 ## Kansas -0.78887206 0.26744941 -0.02529648 -0.204421034 ## Kentucky -0.74331256 -0.94880748 0.02808429 -0.663817237 ## Louisiana 1.54909076 -0.86230011 0.77560598 -0.450157791 ## Maine -2.37274014 -0.37260865 0.06502225 ``` ] --- # Selecting the number of principal components .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. scree plot 2. proportion of variance explained ] --- # Selecting the number of PCs .red[
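Before turning to how many PCs to keep, one last sanity check ties the loadings and scores together: the scores are just the standardized data multiplied by the loadings (a sketch; assumes the sign-adjusted `pca_result` from code chunks 15 and 17):

```r
# manual scores vs. prcomp() scores -- the largest difference should be ~0
max(abs(crime %*% pca_result$rotation - pca_result$x))
```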
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 18] .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. scree plot a. Eigenvalue: retain PCs `\(\geq\)` 1 (an eigenvalue of 1 means that the principal component would explain about one variable’s worth of the variability) b. Variance explained: look for elbow ] .pull-right[ ```r fviz_screeplot(pca_result, choice = "eigenvalue") ``` <img src="02-unsupervised-learning_files/figure-html/scree-plots-1.png" style="display: block; margin: auto;" /> ```r fviz_screeplot(pca_result, choice = "variance") ``` <img src="02-unsupervised-learning_files/figure-html/scree-plots-2.png" style="display: block; margin: auto;" /> ] --- # Selecting the number of PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 18] .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. .opacity20[scree plot] 2. proportion of variance explained ] .pull-right[ ```r # compute eigenvalues eigen <- pca_result$sdev^2 # compute the PVE of each principal component PVE <- eigen / sum(eigen) # how many PCs required to explain at least 90% of total variability min(which(cumsum(PVE) >= .90)) ## [1] 3 ``` ] <br> .center[_PVE provides us a technical way to identify the optimal number of principal components to keep based on the total variability that we would like to account for. In feature engineering, it is common for the default to be set at 95%._] --- # Insight interpretation .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 19] .pull-left[ ```r fviz_pca_var(pca_result, alpha.var = "contrib") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r fviz_pca(pca_result, alpha.ind = .5, labelsize = 3, repel = TRUE) ``` <img src="02-unsupervised-learning_files/figure-html/pca-2-dimensional-loadings-scatterplot-1.png" style="display: block; margin: auto;" /> ] --- # PCA with mixed data .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends ] .pull-right[ <img src="https://media.giphy.com/media/3dWbqRU3Q4YAo/giphy.gif" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # PCA with mixed data .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 20] .scrollable90[ .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends * .bold[PCA for inference] - option 1: one hot encode
<img src="https://emojis.slackmojis.com/emojis/images/1542340469/4974/notinterested.gif?1542340469" style="height:2em; width:auto; "/>
- option 2: use GLRMs
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1.5em; width:auto; "/>
] .pull-right[ .center.bold[One-hot encoding example] ```r # full ames data set --> recode ordinal variables to numeric ames_full <- AmesHousing::make_ames() %>% mutate_if(str_detect(names(.), "Qual|Cond|QC|Qu"), as.numeric) # one-hot encode --> retain only the features and not sale price full_rank <- caret::dummyVars(Sale_Price ~ ., data = ames_full, fullRank = TRUE) ames_1hot <- predict(full_rank, ames_full) # new dimensions dim(ames_1hot) ## [1] 2930 240 # apply PCA to one-hot encoded data pca_one_hot <- prcomp(ames_1hot, scale = TRUE) # sign adjustment to loadings and scores pca_one_hot$rotation <- -pca_one_hot$rotation pca_one_hot$x <- -pca_one_hot$x # scree plot fviz_screeplot(pca_one_hot, ncp = 20, choice = "eigenvalue") ``` <img src="02-unsupervised-learning_files/figure-html/one-hot-1.png" style="display: block; margin: auto;" /> ```r # variable contributions fviz_pca_var(pca_one_hot, alpha.var = "contrib", geom = "text", labelsize = 3, arrowsize = NULL) ``` <img src="02-unsupervised-learning_files/figure-html/one-hot2-1.png" style="display: block; margin: auto;" /> ] ] --- # PCA with mixed data .red[
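For the downstream-modeling route shown below, keep in mind that the `caret::preProcess()` object is only a recipe — the principal-component features themselves are generated with `predict()`. A sketch of that final step (assumes the `preprocess` and `features` objects created in code chunk 21):

```r
# apply the recipe to obtain the PCA-transformed feature set
features_pca <- predict(preprocess, newdata = features)
dim(features_pca)
```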
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 21] .scrollable90[ .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends * PCA for inference - option 1: one hot encode - option 2: use GLRMs * .bold[PCA for downstream modeling] ] .pull-right[ ```r # get feature set ames_full <- AmesHousing::make_ames() features <- subset(ames_full, select = -Sale_Price) # preprocess data preprocess <- caret::preProcess( x = features, method = c("center", "scale", "pca"), thresh = 0.95 ) preprocess ## Created from 2930 samples and 80 variables ## ## Pre-processing: ## - centered (34) ## - ignored (46) ## - principal component signal extraction (34) ## - scaled (34) ## ## PCA needed 26 components to capture 95 percent of the variance ``` ] ] --- # .red[A lot of information in a little amount of time!] <img src="https://i.gifer.com/CdDA.gif" width="80%" height="80%" style="display: block; margin: auto;" /> --- # Final Thoughts .pull-left[ .center.bold[What to remember] - PCA reduces the dimensionality of a feature set - PCA explains the variability of our feature set using fewer variables than the original data set. - Data prep - clean & standardize data - convert `prcomp()` eigenvalues & scores to positive - Eigenvalues (loadings) represent a variables influence on the principal component - PC scores represent standardized value for observation for a given PC - Select number of PCs with - scree plot or - PVE - Got mixed data - one-hot encode for inference - PCA on numeric for downstream modeling ] .pull-right[ .center.bold[What to learn] - Many more methods exist; definitely start learning - Non-negative matrix factorization - [Generalized low rank models](https://stanford.edu/~boyd/papers/pdf/glrm.pdf) - Great resources - [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) - [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/) - [Practical Guide to Principal Component Methods in R](https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Practical+Guide+to+Principal+Component+Methods+in+R) ] --- # Questions? <img src="https://media3.giphy.com/media/DUrdT2xEmJWbS/giphy.gif" width="60%" height="60%" style="display: block; margin: auto;" /> --- # Back home <br><br><br><br> [.center[
<i class="fas fa-home fa-10x"></i>
]](https://github.com/uc-r/Advanced-R) .center[https://github.com/uc-r/Advanced-R]