class: clear, center, middle background-image: url(images/unsupervised-cover.jpg) background-position: center background-size: contain <br><br><br><br><br><br><br><br><br> .font200.white[Unsupervised Learning] --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * .bold[clustering]: reduce the observation space of a data set * .bold[dimension reduction]: reduce the feature space of a data set <img src="images/clustering_vs_pca.jpeg" width="2247" style="display: block; margin: auto;" /> --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * .bold[clustering]: reduce the observation space of a data set - _k_-means clustering - hierarchical clustering - spectral clustering * .bold[dimension reduction]: reduce the feature space of a data set - principal components analysis (PCA) - factor analysis - matrix factorization - autoencoders * .bold[Generalized low rank models]: a generalization of the clustering & dimension reduction --- # Concept ___Unsupervised learning___: a set of statistical tools to better understand *n* observations that contain a set of features ( `\(x_1, x_2, \dots, x_p\)` ) without being guided by a response variable (*Y*). In essence, unsupervised learning is concerned with identifying groups in a data set * clustering: reduce the observation space of a data set - .bold.blue[_k_-means clustering] - .bold.blue[hierarchical clustering] - spectral clustering * dimension reduction: reduce the feature space of a data set - .bold.blue[principal components analysis (PCA)] - factor analysis - matrix factorization - autoencoders * Generalized low rank models: a generalization of the clustering & dimension reduction <br> .center.bold.white[.content-box-blue-dark[This module's focus]] --- # Prerequisites .red[
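If any of the packages used in this module are not installed yet, a one-time install along these lines should work (a sketch; assumes a standard CRAN setup and covers the packages that appear later in the module):

```r
# one-time install of the packages used throughout this module
install.packages(c(
  "tidyverse",   # data munging & visualization
  "cluster",     # additional clustering techniques
  "factoextra",  # clustering & PCA visualizations
  "AmesHousing", # example data
  "caret",       # one-hot encoding & preProcess()
  "broom",       # tidy model output
  "gridExtra"    # arranging multiple plots
))
```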
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1] .pull-left[ .center.font120[Packages] ```r library(tidyverse) # data munging & visualization library(cluster) # additional clustering techniques library(factoextra) # clustering & PCA visualizations ``` ] .pull-right[ .center.font120[Data] ```r USArrests # primary example data AmesHousing::make_ames() # a few additional examples ``` ] --- class: center, middle, inverse .font300.white[Clustering] --- # Types of clustering Clustering is a broad set of techniques for ___finding subgroups of observations___ within a data set. .pull-left[ * .bold[Objective]: we want observations in the same group to be similar and observations in different groups to be dissimilar * .bold[Use cases:] - customer segmentation - concentration of crime activity - common patient traits - voter profiles ] .pull-right[ <br><br> <img src="images/cluster-icon.jpg" width="1089" style="display: block; margin: auto;" /> ] --- # Types of clustering Clustering is a broad set of techniques for ___finding subgroups of observations___ within a data set. .pull-left[ * .bold[Objective]: we want observations in the same group to be similar and observations in different groups to be dissimilar * .bold[Use cases:] - customer segmentation - concentration of crime activity - common patient traits - voter profiles ] .pull-right[ * .bold[Methods]: several clustering algorithms exists: - k-means - hierarchical - partitioning around mediods (PAM) - clustering large applications (CLARA) ] --- # Measuring observation distances .pull-left[ * classification of observations into groups requires methods for computing the distance of the (dis)similarity between each pair of observations * distance measures - Euclidean: `\(d_{euc}(x_a,x_b) = \sqrt{\sum^P_{j=1}(x_{aj} - x_{bj})^2}\)` - Manhattan: `\(d_{man}(x_a,x_b) = \sum^P_{j=1}|x_{aj} - x_{bj}|\)` ] .pull-right[ ```r (two_states <- USArrests[1:2, 1:2]) ## Murder Assault ## Alabama 13.2 236 ## Alaska 10.0 263 dist(two_states, method = "euclidean") ## Alabama ## Alaska 27.18897 dist(two_states, method = "manhattan") ## Alabama ## Alaska 30.2 ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- # Measuring observation distances .pull-left[ * classification of observations into groups requires methods for computing the distance of the (dis)similarity between each pair of observations * distance measures - .opacity20[Euclidean] - .opacity20[Manhattan] - Correlation-based (i.e. Pearson, Spearman) ] .pull-right[ <br> <img src="02-unsupervised-learning_files/figure-html/correlation-distance-example-1.png" style="display: block; margin: auto;" /> ] -- <br> .center.bold.blue.font120[.content-box-gray[There are several other distance measures but these are the most common]] --- # Measuring observation distances When to use certain distance measures * Euclidean - most sensitive to outliers - outliers can skew clusters giving false interpretation - use if you are relatively certain minimal outliers exists * Manhattan - less sensitive to outliers - use if you want to be more robust to existing outliers * Correlation-based - captures common relationships regardless of magnitude - use if you want to analyze unscaled data where observations may have larger differences in magnitudes but possibly similar behaviors - i.e. 
group customers based on purchased quantities (large volume and low volume customers can be in same group if they exhibit common preferences) <br> .center.bold.blue[.content-box-gray[Euclidean is the most commonly used.]] --- # *k*-means clustering Basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total .blue[within-cluster variation]) is .blue[minimized] `$$\underset{C_1, \dots, C_k}{\texttt{minimize}} \bigg\{ \sum^K_{k=1}(W(C_k)) \bigg \}$$` <img src="02-unsupervised-learning_files/figure-html/generated-data-1.png" style="display: block; margin: auto;" /> .bold.red.center[Tighter clusters = more well defined clusters] --- # *k*-means clustering Basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total .blue[within-cluster variation]) is .blue[minimized] `$$\underset{C_1, \dots, C_k}{\texttt{minimize}} \bigg\{ \sum^K_{k=1} (\texttt{Mean Euclidean Distance}_k )\bigg \}$$` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> .bold.red.center[We measure within-cluster variation using our distance measure of choice (i.e. Euclidean, Manhattan)] --- # *k*-means algorithm .pull-left[ * Unfortunately we cannot evaluate every possible cluster combination because there are almost `\(K^n\)` ways to partition _n_ observations into _K_ clusters * Estimate _greedy local optimum_ ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](https://www.jstor.org/stable/pdf/2346830.pdf?casa_token=DyTW0ZLNC4gAAAAA:VNX2TGwDfcs5foMa96ZxnOM2mjaQU1WuCOLL8qF6iDBWp6ClU8-i2-OSXKbtO1uHm6_1oda_2egpvgYCvaix8UxUqUryqZj-Pw3G4m771Ev5-4kL46Y)) 1. .red[randomly] assign each observation to an initial cluster 2. iterate until the cluster assignments stop changing a. compute cluster centroids b. reassign each observation to the cluster whose centroid is closest * Due to randomization - we can get slightly different results each try - use several random starts (rule of
<img src="https://emojis.slackmojis.com/emojis/images/1511903783/3230/wiggle_thumbs_up.gif?1511903783" style="height:1em; width:auto; "/>
: 20) - the algorithm keeps the start with the lowest `\(W(C_k)\)` ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- # Prepare our data for _k_-means .red[
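Why insist on standardizing? A quick sketch of what happens to Euclidean distance when you skip it — in the raw data `Assault` has a far larger variance than the other features and ends up dominating the calculation:

```r
# feature variances on the raw scale -- Assault dwarfs the other features
apply(USArrests, 2, var)

# raw vs. standardized distance between the first two states
dist(USArrests[1:2, ])         # driven almost entirely by Assault
dist(scale(USArrests)[1:2, ])  # all four features now contribute comparably
```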
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2] .pull-left[ 1. Rows are observations (individuals) and columns are variables. 2. Any missing value in the data must be removed or estimated. 3. The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable. `\(^a\)` <br><br><br><br><br> `\(^a\)` .font60[In some rare cases you may want to leave data unstandardized.] ] .pull-right[ ```r crime <- USArrests %>% drop_na() %>% scale() %>% print() ## Murder Assault UrbanPop Rape ## Alabama 1.24256408 0.78283935 -0.52090661 -0.003416473 ## Alaska 0.50786248 1.10682252 -1.21176419 2.484202941 ## Arizona 0.07163341 1.47880321 0.99898006 1.042878388 ## Arkansas 0.23234938 0.23086801 -1.07359268 -0.184916602 ## California 0.27826823 1.26281442 1.75892340 2.067820292 ## Colorado 0.02571456 0.39885929 0.86080854 1.864967207 ## Connecticut -1.03041900 -0.72908214 0.79172279 -1.081740768 ## Delaware -0.43347395 0.80683810 0.44629400 -0.579946294 ## Florida 1.74767144 1.97077766 0.99898006 1.138966691 ## Georgia 2.20685994 0.48285493 -0.38273510 0.487701523 ## Hawaii -0.57123050 -1.49704226 1.20623733 -0.110181255 ## Idaho -1.19113497 -0.60908837 -0.79724965 -0.750769945 ## Illinois 0.59970018 0.93883125 1.20623733 0.295524916 ## Indiana -0.13500142 -0.69308401 -0.03730631 -0.024769429 ## Iowa -1.28297267 -1.37704849 -0.58999237 -1.060387812 ## Kansas -0.41051452 -0.66908525 0.03177945 -0.345063775 ## Kentucky 0.43898421 -0.74108152 -0.93542116 -0.526563903 ## Louisiana 1.74767144 0.93883125 0.03177945 0.103348309 ## Maine -1.30593210 -1.05306531 -1.00450692 -1.434064548 ## Maryland 0.80633501 1.55079947 0.10086521 0.701231086 ## Massachusetts -0.77786532 -0.26110644 1.34440885 -0.526563903 ## Michigan 0.99001041 1.01082751 0.58446551 1.480613993 ## Minnesota -1.16817555 -1.18505846 0.03177945 -0.676034598 ## Mississippi 1.90838741 1.05882502 -1.48810723 -0.441152078 ## Missouri 0.27826823 0.08687549 0.30812248 0.743936999 ## Montana -0.41051452 -0.74108152 -0.86633540 -0.515887425 ## Nebraska -0.80082475 -0.82507715 -0.24456358 -0.505210947 ## Nevada 1.01296983 0.97482938 1.06806582 2.644350114 ## New Hampshire -1.30593210 -1.36504911 -0.65907813 -1.252564419 ## New Jersey -0.08908257 -0.14111267 1.62075188 -0.259651949 ## New Mexico 0.82929443 1.37080881 0.30812248 1.160319648 ## New York 0.76041616 0.99882813 1.41349461 0.519730957 ## North Carolina 1.19664523 1.99477641 -1.41902147 -0.547916860 ## North Dakota -1.60440462 -1.50904164 -1.48810723 -1.487446939 ## Ohio -0.11204199 -0.60908837 0.65355127 0.017936483 ## Oklahoma -0.27275797 -0.23710769 0.16995096 -0.131534211 ## Oregon -0.66306820 -0.14111267 0.10086521 0.861378259 ## Pennsylvania -0.34163624 -0.77707965 0.44629400 -0.676034598 ## Rhode Island -1.00745957 0.03887798 1.48258036 -1.380682157 ## South Carolina 1.51807718 1.29881255 -1.21176419 0.135377743 ## South Dakota -0.91562187 -1.01706718 -1.41902147 -0.900240639 ## Tennessee 1.24256408 0.20686926 -0.45182086 0.605142783 ## Texas 1.12776696 0.36286116 0.99898006 0.455672088 ## Utah -1.05337842 -0.60908837 0.99898006 0.178083656 ## Vermont -1.28297267 -1.47304350 -2.31713632 -1.071064290 ## Virginia 0.16347111 -0.17711080 -0.17547783 -0.056798864 ## Washington -0.86970302 -0.30910395 0.51537975 0.530407436 ## West Virginia -0.47939280 -1.07706407 -1.83353601 -1.273917376 ## Wisconsin -1.19113497 -1.41304662 0.03177945 -1.113770203 ## Wyoming -0.22683912 -0.11711392 -0.38273510 -0.601299251 ## attr(,"scaled:center") ## Murder Assault UrbanPop Rape ## 7.788 170.760 
65.540 21.232 ## attr(,"scaled:scale") ## Murder Assault UrbanPop Rape ## 4.355510 83.337661 14.474763 9.366385 ``` ] --- # Applying _k_-means .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3] .scrollable90[ .pull-left[ 1. .opacity20[Rows are observations (individuals) and columns are variables.] 2. .opacity20[Any missing value in the data must be removed or estimated.] 3. .opacity20[The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable.] 4. Apply `kmeans()` ] .pull-right[ ```r k3 <- kmeans(crime, centers = 3, nstart = 20) # tidied output broom::tidy(k3) ## # A tibble: 3 x 7 ## x1 x2 x3 x4 size withinss cluster ## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <fct> ## 1 -0.962 -1.11 -0.930 -0.967 13 12.0 1 ## 2 -0.447 -0.347 0.479 -0.257 17 19.6 2 ## 3 1.00 1.01 0.198 0.847 20 46.7 3 # full model output glimpse(k3) ## List of 9 ## $ cluster : Named int [1:50] 3 3 3 2 3 3 2 2 3 3 ... ## ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ... ## $ centers : num [1:3, 1:4] -0.962 -0.447 1.005 -1.107 -0.347 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:3] "1" "2" "3" ## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape" ## $ totss : num 196 ## $ withinss : num [1:3] 12 19.6 46.7 ## $ tot.withinss: num 78.3 ## $ betweenss : num 118 ## $ size : int [1:3] 13 17 20 ## $ iter : int 2 ## $ ifault : int 0 ## - attr(*, "class")= chr "kmeans" ``` ] ] --- # Interpreting output .red[
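Before plotting the clusters, it helps to confirm what `tot.withinss` actually measures: the summed squared Euclidean distances from each observation to its assigned centroid. A quick sanity check (a sketch; assumes `crime` and `k3` from the previous chunk are still in the workspace):

```r
# centroid assigned to each state (a 50 x 4 matrix)
assigned_centers <- k3$centers[k3$cluster, ]

# summed squared distances to the assigned centroids
sum(rowSums((crime - assigned_centers)^2))  # should match k3$tot.withinss (~78.3)
```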
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4] ```r as_tibble(crime) %>% mutate( cluster = k3$cluster, label = paste0(row.names(USArrests), " (", cluster, ")") ) %>% gather(Crime, Rate, Murder, Assault, Rape) %>% ggplot(aes(UrbanPop, Rate, color = factor(cluster), label = label)) + geom_text(show.legend = FALSE) + facet_wrap(~ Crime) + ylab("arrests per 100,000 residents (standardized)") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- class: yourturn # How many clusters
<img src="https://emojis.slackmojis.com/emojis/images/1542340471/4979/thinking.gif?1542340471" style="height:1.25em; width:auto; "/>
.font150.center[But how do we know we specified the right value for _k_?] <img src="https://media.giphy.com/media/xT5LMXWSmQ5xjWyTpC/giphy.gif" width="95%" height="95%" style="display: block; margin: auto;" /> --- class: yourturn # How many clusters
<img src="https://emojis.slackmojis.com/emojis/images/1542340471/4979/thinking.gif?1542340471" style="height:1.25em; width:auto; "/>
.font150.center[But how do we know we specified the right value for _k_?] <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> .center.bold[.content-box-gray[What do you notice?]] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) -- .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) ] .pull-right[ ```r fviz_nbclust(df, kmeans, method = "wss", k.max = 20) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) - Gap stat ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](http://web.stanford.edu/~hastie/Papers/gap.pdf)): - compares `\(\texttt{log}(W(C_k))\)` with their expected values under null reference distribution of the data - automated way to id "kink" - can ID when optimal _k_ is one ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] --- # Determining optimal clusters Choice for the number of _K_ clusters depends on the goal: - .bold.font120[Deterministic resource allocation]: a company may employ *k* sales people, and the goal is to partition customers into one of the _k_ segments. (_k_ is predetermined:
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1em; width:auto; "/>
) - .bold.font120[Descriptive understanding]: *k* is unknown and the goal is to ascertain what natural distinct groupings exist in the data (you need to determine _k_:
<img src="https://emojis.slackmojis.com/emojis/images/1471045885/967/wtf.gif?1471045885" style="height:1.5em; width:auto; "/>
) .pull-left[ - Elbow heuristic: - monotonic decreasing - Most common approach (looks for the kink in the `\(W(C_k)\)`) - Gap stat ([
<i class="ai ai-google-scholar faa-tada animated-hover "></i>
](http://web.stanford.edu/~hastie/Papers/gap.pdf)): - compares `\(\texttt{log}(W(C_k))\)` with their expected values under null reference distribution of the data - automated way to id "kink" - can ID when optimal _k_ is one ] .pull-right[ ```r fviz_nbclust(df, kmeans, method = "gap_stat", k.max = 20, verbose = FALSE) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] --- class: yourturn # Determine optimal clusters .red[
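If you are curious what `fviz_nbclust()` computes for the elbow method, it is essentially this loop (a sketch; assumes the standardized `crime` matrix from code chunk 2):

```r
# total within-cluster variation for k = 1, ..., 20
wss <- purrr::map_dbl(1:20, function(k) {
  kmeans(crime, centers = k, nstart = 20)$tot.withinss
})

plot(1:20, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```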
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 5 & 6] .pull-left[ Fill in code chunks 5 & 6 to identify the optimal number of clusters for our crime data. ``` # use the elbow heuristic (hint: "wss") fviz_nbclust(____, kmeans, method = "___", k.max = 20) ``` ``` # use the gab stat (hint: "gap_stat") fviz_nbclust(____, kmeans, method = "___", k.max = 20) ``` ] -- .pull-right[ ```r p1 <- fviz_nbclust(crime, kmeans, method = "wss", k.max = 20) p2 <- fviz_nbclust(crime, kmeans, method = "gap_stat", k.max = 20, verbose = FALSE) gridExtra::grid.arrange(p1, p2) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] --- # Hierarchical clustering .pull-left[ - .bold.font120[_k_-means] - Requires _k_ to be predetermined - clusters are not nested - .bold.font120[Hierarchical] - Do not require _k_ to be predetermined - clusters can be nested - user must specify measure of _dissimilarity between groups_ - results in a .bold[dendrogram] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/dendrogram-1.png" style="display: block; margin: auto;" /> ] --- # Hierarchical strategies .pull-left[ - .bold.font120[_k_-means] - Requires _k_ to be predetermined - clusters are not nested - .bold.font120[Hierarchical] - Do not require _k_ to be predetermined - clusters can be nested - user must specify measure of _dissimilarity between groups_ - results in a .bold[dendrogram] - two main strategies 1. Agglomerative nesting (AGNES): bottom up 2. Divisive analysis (DIANA): top down ] .pull-right[ <img src="images/dendrogram2.png" width="95%" style="display: block; margin: auto;" /> ] --- # Group clustering strategies .pull-left[ - .bold[AGNES] - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - .bold[DIANA] - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see [Kaufman & Rousseeuw, 1990](https://www.amazon.com/Finding-Groups-Data-Introduction-Analysis/dp/0471735787/ref=sr_1_1?ie=UTF8&qid=1548179235&sr=8-1&keywords=Finding+Groups+in+Data%3A+An+Introduction+to+Cluster+Analysis)) - .blue[Pro tip:] used if focused on relatively small number of clusters ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - __Single linkage__ - Complete linkage - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/single-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the closest two observations have the smallest distance] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - __Complete linkage__ - Group average linkage - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - 
iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/complete-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the most distant two observations have the smallest distance] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - __Group average linkage__ - Centroid linkage - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/group-average-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters where the average distance between all observations are smallest] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - __Centroid linkage__ - Ward's linkage - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/centroid-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters with the smallest distance between their centroids] ] --- # Group clustering strategies .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage - Complete linkage - Group average linkage - Centroid linkage - __Ward's linkage__ - repeat until all clusters have been merged - DIANA - start with all obs in one cluster - iteratively divide one cluster into two daughter clusters - uses a dissimiliarity matrix (see Kaufman & Rousseeuw, 1990) - Pro tip: used if focused on relatively small number of clusters ] .pull-right[ <img src="images/wards-linkage.png" width="82%" height="82%" style="display: block; margin: auto;" /> .center.bold.red[Merge clusters with the smallest between cluster variation] ] --- # Comparing AGNES clustering .pull-left[ - AGNES - start with individual observation clusters - merge closest two clusters based on: - Single linkage: produces many "loose" clusters - Complete linkage: produces "compact" clusters - Centroid linkage: depends on variability - Group average linkage: balances clusters - Ward's linkage: balances clusters - repeat until all clusters have been merged <br> .center.bold.red[I tend to look for the method that produces more compact clusters] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] --- # Applying hierarchical clustering .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 7 & 8] .pull-left[ ```r # you can use hclust option1 <- hclust(dist(crime), method = "average") plot(option1) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r # or agnes option2 <- cluster::agnes(crime, method = "average") plot(option2, which = 2) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] --- # Interpreting the dendrogram - the vertical axis represents the distance at which clusters fuse - obs 9 & 2 appear close on the dendrogram but, in fact, the height at which they fuse implies they are approximately the same distance from obs 5, 7, & 8 .pull-left[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-25-1.png" width="80%" height="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-26-1.png" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # Which method to use? .red[
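Before comparing AGNES linkage methods below, note that the divisive counterpart (DIANA), mentioned earlier but never shown, can be fit in much the same way (a sketch using the same `crime` matrix):

```r
# divisive (top-down) hierarchical clustering
hc_diana <- cluster::diana(crime)

hc_diana$dc               # divisive coefficient, read like the agglomerative coefficient
plot(hc_diana, which = 2) # dendrogram
```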
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9] The `agnes()` function provides an ___agglomerative coefficient___, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure). .code70[ ```r # agglomerative coefficient from earlier agnes() model option2$ac ## [1] 0.7379371 ``` ] -- This allows us to compare linkage methods and identify which one finds the strongest clustering structure. .code70[ ```r # methods to assess m <- c("average", "single", "complete", "ward") names(m) <- m # function to compute coefficient ac <- function(x) { cluster::agnes(crime, method = x)$ac } # get agglomerative coefficient for each linkage method map_dbl(m, ac) ## average single complete ward ## 0.7379371 0.6276128 0.8531583 0.9346210 ``` ] --- # Which clusters to use? .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 10 & 11] .pull-left[ .center.bold[Visual assessment] ```r # re-run hc with Ward method hc_ward <- cluster::agnes(crime, method = "ward") # highlight 4 clusters plot(hc_ward, which = 2) rect.hclust(hc_ward, k = 4, border = 2:5) ``` <img src="images/hclust-visual-assessment.png" width="1513" style="display: block; margin: auto;" /> ] .pull-right[ .center.bold[Elbow method or Gap stat] ```r # fviz_nbclust can use different FUNctions # see ?fviz_nbclust & ?hcut fviz_nbclust(crime, FUN = hcut, method = "gap_stat", verbose = FALSE) ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] --- # Extract results and visualize .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12] ```r as_tibble(crime) %>% mutate( * cluster = cutree(hc_ward, k = 3), # use cutree to get clusters label = paste0(row.names(USArrests), " (", cluster, ")") ) %>% gather(Crime, Rate, Murder, Assault, Rape) %>% ggplot(aes(UrbanPop, Rate, color = factor(cluster), label = label)) + geom_text(show.legend = FALSE) + facet_wrap(~ Crime) + ylab("arrests per 100,000 residents (standardized)") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- # Ohio's cluster .red[
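The lists below come from code chunk 13; one way to produce a comparison along these lines is sketched here (assumes `k3` and `hc_ward` from the earlier chunks; the object names are illustrative):

```r
# cluster memberships from both methods, in the same row order as USArrests
hc_clusters <- cutree(as.hclust(hc_ward), k = 3)
states      <- rownames(USArrests)

# states that share a cluster with Ohio under each method
km_group <- states[k3$cluster == k3$cluster[states == "Ohio"]]
hc_group <- states[hc_clusters == hc_clusters[states == "Ohio"]]

intersect(km_group, hc_group)  # states both methods place with Ohio
setdiff(km_group, hc_group)    # states only k-means places with Ohio
```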
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunks 13] .bold[Common states between _k_-means and Hierarchical clustering] ``` ## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii" ## [5] "Indiana" "Kansas" "Massachusetts" "New Jersey" ## [9] "Ohio" "Oklahoma" "Oregon" "Pennsylvania" ## [13] "Rhode Island" "Utah" "Virginia" "Washington" ## [17] "Wyoming" ``` .bold[Different states between _k_-means and Hierarchical clustering] ``` ## [1] "Kentucky" "Missouri" ``` --- # Final Thoughts .pull-left[ .center.bold[What to remember] - We measure differences between obs with - Euclidean distance - Manhattan distance - Correlation-based distances - Data prep - clean & standardize data - _k_-means - minimizes within cluster variance - find optimal clusters w/elbow & Gap stat - Hierarchical - AGNES vs DIANA - Many ways to merge clusters - Use method that produces compact clusters ] .pull-right[ .center.bold[What to learn] - Many more methods exist - `daisy()`: compute distance measures for mixed data - `pam()`: medoid clustering - `clara()`: big data clustering - `diana()`: Devisive hierarchical clustering - Great resources - [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) - [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/) - [Practical Guide to Cluster Analysis in R](https://www.amazon.com/Practical-Guide-Cluster-Analysis-Unsupervised/dp/1542462703) - [Finding Groups in Data](https://www.amazon.com/Finding-Groups-Data-Introduction-Analysis/dp/0471735787) ] --- class: center, middle, inverse # Dimension Reduction via PCA --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .bold[Objective]: explain the variability of our feature set using fewer variables than the original data set. * .bold[Use cases:] - descriptive: explaining common feature attributes - customer profiling - student profiling - recidivism profiling - _________ profiling - predictive: feature engineering for downstream modeling - reducing noise - minimizing multicollinearity ] .pull-right[ <br> <img src="images/pca-icon.jpg" width="964" style="display: block; margin: auto;" /> ] --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .opacity20[Objective: explain the variability of our feature set using fewer variables than the original data set.] * .opacity20[Use cases:] - descriptive: explaining common feature attributes - customer profiling - student profiling - recidivism profiling - _________ profiling - .opacity20[predictive: feature engineering for downstream modeling] ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/example-pca-1.png" style="display: block; margin: auto;" /> ] --- # Motivation PCA ___reduces the dimensionality of a feature set___. .pull-left[ * .opacity20[Objective: explain the variability of our feature set using fewer variables than the original data set.] * .opacity20[Use cases:] - .opacity20[descriptive: explaining common feature attributes] <br><br><br><br><br> - predictive: feature engineering for downstream modeling - reducing noise - minimizing multicollinearity ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/example-pca2-1.png" style="display: block; margin: auto;" /> ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. 
We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] -- .pull-right[ <img src="https://gifimage.net/wp-content/uploads/2017/10/how-do-you-say-gif-9.gif" style="display: block; margin: auto;" /> .center.bold.red[Do you prefer pictures?] ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] .pull-right[ <img src="02-unsupervised-learning_files/figure-html/create-pca-image-1.png" style="display: block; margin: auto;" /> ] --- # Finding principal components .pull-left[ The *first principal component* of a data set `\(X_1\)`, `\(X_2\)`, ..., `\(X_p\)` is the linear combination of the features `$$Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + ... + \phi_{p1}X_{p}$$` that has the largest variance. We refer to `\(\phi_{11}, \dots, \phi_{p1}\)` as the _loadings_ of the first principal component. We can find the second principal component `\(Z_2\)` by identifying the linear combination of `\(X_1,\dots , X_p\)` that has maximal variance out of all linear combinations that are __*uncorrelated*__ with `\(Z_1\)`. ] .pull-right[ <img src="images/3D-PCA.png" width="672" style="display: block; margin: auto;" /> <br> .right.font50[Made with pca3d package] ] --- # Prepare our data for PCA .pull-left[ 1. Rows are observations (individuals) and columns are variables. 2. Any missing value in the data must be removed or estimated. 3. The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable. ] .pull-right[ ```r # exact same procedure you ran for clustering crime <- USArrests %>% drop_na() %>% scale() %>% print() ``` ] --- # Applying PCA .red[
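Two quick notes before fitting the PCA. First, plugging the crime data into the loading equation makes `\(Z_1\)` concrete — using the PC1 loadings reported in code chunk 15 below, the first component is roughly `\(Z_1 \approx 0.54 \cdot \texttt{Murder} + 0.58 \cdot \texttt{Assault} + 0.28 \cdot \texttt{UrbanPop} + 0.54 \cdot \texttt{Rape}\)` (in standardized units). Second, because `crime` is already centered and scaled, running `prcomp()` on it directly is equivalent to running PCA on the raw data with centering and scaling turned on — a sketch:

```r
# two equivalent ways to run the same PCA
# (standard deviations match; loadings match up to sign)
prcomp(crime, scale = FALSE)$sdev
prcomp(USArrests, center = TRUE, scale. = TRUE)$sdev
```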
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 14] .pull-left[ 1. .opacity20[Rows are observations (individuals) and columns are variables.] 2. .opacity20[Any missing value in the data must be removed or estimated.] 3. .opacity20[The data must be standardized (centered at mean zero and scaled to one standard deviation) to make variables comparable.] 4. Apply `prcomp()` ] .pull-right[ ```r # perform PCA pca_result <- prcomp(crime, scale = FALSE) # PCA model output names(pca_result) ## [1] "sdev" "rotation" "center" "scale" "x" ``` .font120[`rotation`] provides the principal component loadings ( `\(\phi_{ij}\)` ) ] --- # Understanding PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 15] .pull-left[ * .font110[`rotation`] provides the principal component loadings ( `\(\phi_{ij}\)` ) * There will be the same number of PCs as variables * Positive vs negative direction: - The sign of a loading (aka eigenvector) is arbitrary; here the default `prcomp()` loadings point in the negative direction. - Positive-pointing eigenvectors are more intuitive to interpret - To use the positive-pointing vectors, we multiply the default loadings by -1. ] .pull-right[ ```r # convert loadings to positive pca_result$rotation <- -pca_result$rotation # there will be the same number of PCs as variables pca_result$rotation ## PC1 PC2 PC3 PC4 ## Murder 0.5358995 -0.4181809 0.3412327 -0.64922780 ## Assault 0.5831836 -0.1879856 0.2681484 0.74340748 ## UrbanPop 0.2781909 0.8728062 0.3780158 -0.13387773 ## Rape 0.5434321 0.1673186 -0.8177779 -0.08902432 ``` ] <br> .center.bold[__Loadings are the coefficients that indicate each variable's influence on the principal component__]
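The contribution bars below are essentially the squared loadings rescaled to percentages. A related property worth a quick check: the loading vectors are unit length and mutually orthogonal (a sketch; assumes `pca_result` from code chunk 15):

```r
# t(rotation) %*% rotation should be (numerically) the 4 x 4 identity matrix
round(t(pca_result$rotation) %*% pca_result$rotation, 10)
```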
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 16] We can visualize these contributions .pull-left[ ```r fviz_contrib(pca_result, choice = "var", axes = 1) ``` <img src="02-unsupervised-learning_files/figure-html/pca-pc1-contributions-1.png" style="display: block; margin: auto;" /> .center[PC1 appears to represent violent crime] ] .pull-right[ ```r fviz_contrib(pca_result, choice = "var", axes = 2) ``` <img src="02-unsupervised-learning_files/figure-html/pca-pc2-contributions-1.png" style="display: block; margin: auto;" /> .center[PC2 appears to represent urban density] ] --- # Understanding PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 17] .pull-left[ * .font110[`x`] provides the principal component scores * The principal components scores simply places a standardized score for each observation for each principal component. * Interpretation example: Alaska - PC1: 1.93 standard deviations above average value for PC1 (high violent crime) - PC2: 1 standard deviation below average value for PC2 (low urbanization) ] .pull-right[ ```r # convert scores to positive pca_result$x <- -pca_result$x pca_result$x ## PC1 PC2 PC3 PC4 ## Alabama 0.97566045 -1.12200121 0.43980366 -0.154696581 *## Alaska 1.93053788 -1.06242692 -2.01950027 0.434175454 ## Arizona 1.74544285 0.73845954 -0.05423025 0.826264240 ## Arkansas -0.13999894 -1.10854226 -0.11342217 0.180973554 ## California 2.49861285 1.52742672 -0.59254100 0.338559240 ## Colorado 1.49934074 0.97762966 -1.08400162 -0.001450164 ## Connecticut -1.34499236 1.07798362 0.63679250 0.117278736 ## Delaware 0.04722981 0.32208890 0.71141032 0.873113315 ## Florida 2.98275967 -0.03883425 0.57103206 0.095317042 ## Georgia 1.62280742 -1.26608838 0.33901818 -1.065974459 ## Hawaii -0.90348448 1.55467609 -0.05027151 -0.893733198 ## Idaho -1.62331903 -0.20885253 -0.25719021 0.494087852 ## Illinois 1.36505197 0.67498834 0.67068647 0.120794916 ## Indiana -0.50038122 0.15003926 -0.22576277 -0.420397595 ## Iowa -2.23099579 0.10300828 -0.16291036 -0.017379470 ## Kansas -0.78887206 0.26744941 -0.02529648 -0.204421034 ## Kentucky -0.74331256 -0.94880748 0.02808429 -0.663817237 ## Louisiana 1.54909076 -0.86230011 0.77560598 -0.450157791 ## Maine -2.37274014 -0.37260865 0.06502225 ``` ] --- # Selecting the number of principal components .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. scree plot 2. proportion of variance explained ] --- # Selecting the number of PCs .red[
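Before turning to how many PCs to keep, one last sanity check ties the loadings and scores together: the scores are just the standardized data multiplied by the loadings (a sketch; assumes the sign-adjusted `pca_result` from code chunks 15 and 17):

```r
# manual scores vs. prcomp() scores -- the largest difference should be ~0
max(abs(crime %*% pca_result$rotation - pca_result$x))
```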
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 18] .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. scree plot a. Eigenvalue: retain PCs `\(\geq\)` 1 (an eigenvalue of 1 means that the principal component would explain about one variable’s worth of the variability) b. Variance explained: look for elbow ] .pull-right[ ```r fviz_screeplot(pca_result, choice = "eigenvalue") ``` <img src="02-unsupervised-learning_files/figure-html/scree-plots-1.png" style="display: block; margin: auto;" /> ```r fviz_screeplot(pca_result, choice = "variance") ``` <img src="02-unsupervised-learning_files/figure-html/scree-plots-2.png" style="display: block; margin: auto;" /> ] --- # Selecting the number of PCs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 18] .pull-left[ * The goal of PCA is dimension reduction but how many PCs do we keep? * Two common approaches: 1. .opacity20[scree plot] 2. proportion of variance explained ] .pull-right[ ```r # compute eigenvalues eigen <- pca_result$sdev^2 # compute the PVE of each principal component PVE <- eigen / sum(eigen) # how many PCs required to explain at least 90% of total variability min(which(cumsum(PVE) >= .90)) ## [1] 3 ``` ] <br> .center[_PVE provides us a technical way to identify the optimal number of principal components to keep based on the total variability that we would like to account for. In feature engineering, it is common for the default to be set at 95%._] --- # Insight interpretation .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 19] .pull-left[ ```r fviz_pca_var(pca_result, alpha.var = "contrib") ``` <img src="02-unsupervised-learning_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r fviz_pca(pca_result, alpha.ind = .5, labelsize = 3, repel = TRUE) ``` <img src="02-unsupervised-learning_files/figure-html/pca-2-dimensional-loadings-scatterplot-1.png" style="display: block; margin: auto;" /> ] --- # PCA with mixed data .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends ] .pull-right[ <img src="https://media.giphy.com/media/3dWbqRU3Q4YAo/giphy.gif" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # PCA with mixed data .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 20] .scrollable90[ .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends * .bold[PCA for inference] - option 1: one hot encode
<img src="https://emojis.slackmojis.com/emojis/images/1542340469/4974/notinterested.gif?1542340469" style="height:2em; width:auto; "/>
- option 2: use GLRMs
<img src="https://emojis.slackmojis.com/emojis/images/1471045870/910/rock.gif?1471045870" style="height:1.5em; width:auto; "/>
] .pull-right[ .center.bold[One-hot encoding example] ```r # full ames data set --> recode ordinal variables to numeric ames_full <- AmesHousing::make_ames() %>% mutate_if(str_detect(names(.), "Qual|Cond|QC|Qu"), as.numeric) # one-hot encode --> retain only the features and not sale price full_rank <- caret::dummyVars(Sale_Price ~ ., data = ames_full, fullRank = TRUE) ames_1hot <- predict(full_rank, ames_full) # new dimensions dim(ames_1hot) ## [1] 2930 240 # apply PCA to one-hot encoded data pca_one_hot <- prcomp(ames_1hot, scale = TRUE) # sign adjustment to loadings and scores pca_one_hot$rotation <- -pca_one_hot$rotation pca_one_hot$x <- -pca_one_hot$x # scree plot fviz_screeplot(pca_one_hot, ncp = 20, choice = "eigenvalue") ``` <img src="02-unsupervised-learning_files/figure-html/one-hot-1.png" style="display: block; margin: auto;" /> ```r # variable contributions fviz_pca_var(pca_one_hot, alpha.var = "contrib", geom = "text", labelsize = 3, arrowsize = NULL) ``` <img src="02-unsupervised-learning_files/figure-html/one-hot2-1.png" style="display: block; margin: auto;" /> ] ] --- # PCA with mixed data .red[
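For the downstream-modeling route shown below, keep in mind that the `caret::preProcess()` object is only a recipe — the principal-component features themselves are generated with `predict()`. A sketch of that final step (assumes the `preprocess` and `features` objects created in code chunk 21):

```r
# apply the recipe to obtain the PCA-transformed feature set
features_pca <- predict(preprocess, newdata = features)
dim(features_pca)
```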
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 21] .scrollable90[ .pull-left[ * Rarely do we have only numeric data * But what do we do with categorical features? * The answer is...it depends * PCA for inference - option 1: one hot encode - option 2: use GLRMs * .bold[PCA for downstream modeling] ] .pull-right[ ```r # get feature set ames_full <- AmesHousing::make_ames() features <- subset(ames_full, select = -Sale_Price) # preprocess data preprocess <- caret::preProcess( x = features, method = c("center", "scale", "pca"), thresh = 0.95 ) preprocess ## Created from 2930 samples and 80 variables ## ## Pre-processing: ## - centered (34) ## - ignored (46) ## - principal component signal extraction (34) ## - scaled (34) ## ## PCA needed 26 components to capture 95 percent of the variance ``` ] ] --- # .red[A lot of information in a little amount of time!] <img src="https://i.gifer.com/CdDA.gif" width="80%" height="80%" style="display: block; margin: auto;" /> --- # Final Thoughts .pull-left[ .center.bold[What to remember] - PCA reduces the dimensionality of a feature set - PCA explains the variability of our feature set using fewer variables than the original data set. - Data prep - clean & standardize data - convert `prcomp()` eigenvalues & scores to positive - Eigenvalues (loadings) represent a variables influence on the principal component - PC scores represent standardized value for observation for a given PC - Select number of PCs with - scree plot or - PVE - Got mixed data - one-hot encode for inference - PCA on numeric for downstream modeling ] .pull-right[ .center.bold[What to learn] - Many more methods exist; definitely start learning - Non-negative matrix factorization - [Generalized low rank models](https://stanford.edu/~boyd/papers/pdf/glrm.pdf) - Great resources - [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) - [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/) - [Practical Guide to Principal Component Methods in R](https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Practical+Guide+to+Principal+Component+Methods+in+R) ] --- # Questions? <img src="https://media3.giphy.com/media/DUrdT2xEmJWbS/giphy.gif" width="60%" height="60%" style="display: block; margin: auto;" /> --- # Back home <br><br><br><br> [.center[
<i class="fas fa-home fa-10x"></i>
]](https://github.com/uc-r/Advanced-R) .center[https://github.com/uc-r/Advanced-R]