class: clear, center, middle

background-image: url(images/engineering-icon.jpg)
background-position: center
background-size: cover

<br><br><br><br><br><br><br><br><br><br><br><br><br>
.font200.bold.white[Feature & Target Engineering]

---
# Introduction

Data pre-processing and engineering techniques generally refer to the .blue[___addition, deletion, or transformation of data___].

.pull-left[

.center.bold.font120[Thoughts]

- Substantial time commitment
- A 1-hour module doesn't do it justice
- Not a "sexy" area to study, but well worth your time
- Additional resources to start with:
   - [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)
   - [Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)

]

--

.pull-right[

.center.bold.font120[Overview]

- Target engineering
- Missingness
- Feature filtering
- Numeric feature engineering
- Categorical feature engineering
- Dimension reduction
- Proper implementation

]

---
# Prereqs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1]

.pull-left[

.center.bold.font120[Packages]

```r
library(dplyr)
library(ggplot2)
library(rsample)
library(recipes)
library(caret)    # used for nearZeroVar() and the grid search later on
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
ames_test  <- testing(split)   # needed later when we bake the test set
```

]

---
class: center, middle, inverse

.font300.white[Target Engineering]

---
# Normality correction

.pull-left[

Not a requirement but...

- can improve predictive accuracy for parametric & distance-based models
- can correct for residual assumption violations
- minimizes effects of outliers

plus...

- sometimes used to shape the business problem as well

.center[_“taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.”_]

]

.pull-right[

<br><br>
<center>
`\(\texttt{Sale_Price} = \beta_0 + \beta_1\texttt{Year_Built} + \epsilon\)`
</center>

<img src="04-engineering_files/figure-html/skewed-residuals-1.png" style="display: block; margin: auto;" />

]

---
# Transformation options

.pull-left[

- log (or log with offset)

- Box-Cox: automates process of finding proper transformation

$$
\begin{equation}
y(\lambda) =
\begin{cases}
 \frac{y^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\
 \log y, & \text{if}\ \lambda = 0.
\end{cases}
\end{equation}
$$

- Yeo-Johnson: modified Box-Cox for non-strictly positive values

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
```

]

<img src="04-engineering_files/figure-html/distribution-comparison-1.png" style="display: block; margin: auto;" />
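---
# Transformation options

To make the piecewise definition concrete, here is a minimal sketch of the Box-Cox formula in base R, assuming `\(\lambda\)` is already known (`step_BoxCox()` estimates it from the training data; the `box_cox()` helper and the `\(\lambda = 0.2\)` value are purely illustrative):

```r
# direct implementation of the piecewise formula above
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # Box-Cox is only defined for strictly positive values
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# illustrative transform of the response with a hypothetical lambda
head(box_cox(ames_train$Sale_Price, lambda = 0.2))
```

---
class: center, middle, inverse

.font300.white[Missingness]

.white[_Many models cannot cope with missing data so imputation strategies may be necessary._]

---
# Visualizing .red[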
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
AmesHousing::ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
  geom_raster() +
  coord_flip() +
  scale_y_continuous(NULL, expand = c(0, 0)) +
  scale_fill_grey(name = "", labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))
```

]

.pull-right[

<img src="04-engineering_files/figure-html/missing-distribution-plot-1.png" style="display: block; margin: auto;" />

]
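---
# Visualizing

The same missingness raster can be produced more directly with the __visdat__ package (a sketch, assuming __visdat__ is installed — it is not part of this module's prereqs):

```r
# one-line present/missing raster; cluster = TRUE groups rows
# with similar missingness patterns together
visdat::vis_miss(AmesHousing::ames_raw, cluster = TRUE)
```

---
# Visualizing .red[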
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3] An uncleaned version of Ames housing data: ```r sum(is.na(AmesHousing::ames_raw)) ## [1] 13997 ``` .pull-left[ ```r extracat::visna(AmesHousing::ames_raw, sort = "b") ``` ] .pull-right[ <img src="04-engineering_files/figure-html/missing-distribution-plot2-1.png" style="display: block; margin: auto;" /> ] --- # Structural vs random .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4]

.pull-left[

Missing values can occur for many different reasons; however, they are usually lumped into two categories:

* informative missingness
* missingness at random

]

.pull-right[

```r
AmesHousing::ames_raw %>%
  filter(is.na(`Garage Type`)) %>%
  select(`Garage Type`, `Garage Cars`, `Garage Area`)
## # A tibble: 157 x 3
##    `Garage Type` `Garage Cars` `Garage Area`
##    <chr>                 <int>         <int>
##  1 <NA>                      0             0
##  2 <NA>                      0             0
##  3 <NA>                      0             0
##  4 <NA>                      0             0
##  5 <NA>                      0             0
##  6 <NA>                      0             0
##  7 <NA>                      0             0
##  8 <NA>                      0             0
##  9 <NA>                      0             0
## 10 <NA>                      0             0
## # … with 147 more rows
```

]

<br>
.center.bold[Determines how you will, and if you can/should, impute.]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g., mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

.center.font80[.red[Actual values] vs .blue[imputed values]]

<img src="04-engineering_files/figure-html/imputation-examples-1.png" style="display: block; margin: auto;" />

]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g., mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

We'll put these pieces together later

```r
step_meanimpute()
step_medianimpute()
step_modeimpute()
step_knnimpute()
step_bagimpute()
```

]

---
class: center, middle, inverse

.font300.white[Feature Filtering]

---
# More is not always better!

Excessive noisy variables can...

.font120.bold[reduce accuracy]

<img src="images/accuracy-comparison-1.png" width="2560" style="display: block; margin: auto;" />

---
# More is not always better!

Excessive noisy variables can...

.font120.bold[increase computation time]

<img src="images/impact-on-time-1.png" width="2560" style="display: block; margin: auto;" />

---
# Options for filtering .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 5]

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)
- Feature selection
   - beyond scope of module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

```r
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  filter(nzv)
##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  218.90000    0.09095043   FALSE TRUE
## 2               Alley   22.31522    0.13642565   FALSE TRUE
## 3        Land_Contour   23.05814    0.18190086   FALSE TRUE
## 4           Utilities 2197.00000    0.13642565   FALSE TRUE
## 5          Land_Slope   22.76087    0.13642565   FALSE TRUE
## 6         Condition_2  242.00000    0.31832651   FALSE TRUE
## 7           Roof_Matl  127.47059    0.36380173   FALSE TRUE
## 8           Bsmt_Cond   19.71717    0.27285130   FALSE TRUE
## 9      BsmtFin_Type_2   24.20513    0.31832651   FALSE TRUE
## 10       BsmtFin_SF_2  486.00000    9.54979536   FALSE TRUE
## 11            Heating  103.09524    0.27285130   FALSE TRUE
## 12    Low_Qual_Fin_SF  723.33333    1.22783083   FALSE TRUE
## 13      Kitchen_AbvGr   22.60215    0.18190086   FALSE TRUE
## 14         Functional   40.90000    0.36380173   FALSE TRUE
## 15     Enclosed_Porch  102.72222    7.41246021   FALSE TRUE
## 16 Three_season_porch  723.33333    1.18235562   FALSE TRUE
## 17       Screen_Porch  183.18182    4.77489768   FALSE TRUE
## 18          Pool_Area 2190.00000    0.45475216   FALSE TRUE
## 19            Pool_QC  730.00000    0.22737608   FALSE TRUE
## 20       Misc_Feature   31.22059    0.27285130   FALSE TRUE
## 21           Misc_Val  151.85714    1.40973170   FALSE TRUE
```

]

---
# Options for filtering

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)
- Feature selection
   - beyond scope of module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

We'll put these pieces together later

```r
step_zv()
step_nzv()
step_corr()
```

]

---
class: center, middle, inverse

.font300.white[Numeric Feature Engineering]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g., GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g., KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

<img src="04-engineering_files/figure-html/standardizing-1.png" style="display: block; margin: auto;" />

]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g., GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g., KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
step_center()
step_scale()
```

]
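---
# Transformations

A quick sanity check that centering and scaling behave as advertised (a sketch; `Gr_Liv_Area` is an arbitrarily chosen example feature):

```r
std <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(Gr_Liv_Area) %>%   # subtract the training-set mean
  step_scale(Gr_Liv_Area) %>%    # divide by the training-set standard deviation
  prep(training = ames_train) %>%
  bake(new_data = ames_train)

# the standardized feature should now have mean 0 and sd 1
c(mean = mean(std$Gr_Liv_Area), sd = sd(std$Gr_Liv_Area))
```

---
class: center, middle, inverse

.font300.white[Categorical Feature Engineering]

---
# One-hot & Dummy encoding

.pull-left[

Many models require all predictor variables to be numeric (e.g., 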
GLMs, SVMs, NNets) <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> a </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> b </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> c </td> </tr> </tbody> </table> Two most common approaches include... ] .pull-right[ .bold.center[Dummy encoding] <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> x.b </th> <th style="text-align:right;"> x.c </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> .bold.center[One-hot encoding] <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> x.a </th> <th style="text-align:right;"> x.b </th> <th style="text-align:right;"> x.c </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 6] .pull-left[ * One-hot and dummy encoding are not good when: - you have a lot of categorical features - with high cardinality - or you have ordinal features * Label encoding: - pure numeric conversion of the levels of a categorical variable - most common: ordinal encoding ] .pull-right[ .center.bold[Quality variables with natural ordering] ```r ames_train %>% select(matches("Qual|QC|Qu")) ## # A tibble: 2,199 x 9 ## Overall_Qual Exter_Qual Bsmt_Qual Heating_QC Low_Qual_Fin_SF ## <fct> <fct> <fct> <fct> <int> ## 1 Above_Avera… Typical Typical Typical 0 ## 2 Good Good Typical Excellent 0 ## 3 Average Typical Good Good 0 ## 4 Above_Avera… Typical Typical Excellent 0 ## 5 Very_Good Good Good Excellent 0 ## 6 Very_Good Good Good Excellent 0 ## 7 Good Typical Typical Good 0 ## 8 Above_Avera… Typical Good Good 0 ## 9 Above_Avera… Typical Good Excellent 0 ## 10 Good Typical Good Good 0 ## # … with 2,189 more rows, and 4 more variables: Kitchen_Qual <fct>, ## # Fireplace_Qu <fct>, Garage_Qual <fct>, Pool_QC <fct> ``` ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 7] .pull-left[ * One-hot and dummy encoding are not good when: - you have a lot of categorical features - with high cardinality - or you have ordinal features * Label encoding: - pure numeric conversion of the levels of a categorical variable - most common: ordinal encoding ] .pull-right[ .center.bold[Original encoding for `Overall_Qual`] ```r count(ames_train, Overall_Qual) ## # A tibble: 10 x 2 ## Overall_Qual n ## <fct> <int> ## 1 Very_Poor 3 ## 2 Poor 12 ## 3 Fair 29 ## 4 Below_Average 166 ## 5 Average 607 ## 6 Above_Average 553 ## 7 Good 458 ## 8 Very_Good 266 ## 9 Excellent 81 ## 10 Very_Excellent 24 ``` ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 8]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Label/ordinal encoding for `Overall_Qual`]

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(Overall_Qual) %>%
  prep(ames_train) %>%
  bake(ames_train) %>%
  count(Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual     n
##           <dbl> <int>
##  1            1     3
##  2            2    12
##  3            3    29
##  4            4   166
##  5            5   607
##  6            6   553
##  7            7   458
##  8            8   266
##  9            9    81
## 10           10    24
```

]

---
# Common categorical encodings

We'll put these pieces together later

```r
step_dummy()
step_dummy(one_hot = TRUE)
step_integer()
step_ordinalscore()
```

---
class: center, middle, inverse

.font300.white[Dimension Reduction]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

<img src="04-engineering_files/figure-html/pca-1.png" style="display: block; margin: auto;" />

]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

We'll put these pieces together later

```r
step_pca()
step_kpca()
step_pls()
step_spatialsign()
```

]

---
class: center, middle, inverse

.font300.white[Blueprints]

---
# Sequential steps

.pull-left[

.bold.center.font120[Some thoughts to consider]

- If using a log or Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive.
- Standardize your numeric features prior to one-hot/dummy encoding.
- If you are lumping infrequent categories together, do so before one-hot/dummy encoding.
- Although you can perform dimension reduction on categorical features, for feature engineering purposes it is most common to apply it to numeric features.

]

--

.pull-right[

.bold.center.font120[Suggested ordering]

1. Filter out zero or near-zero variance features
2. Perform imputation if required
3. Normalize to resolve numeric feature skewness
4. Standardize (center and scale) numeric features
5. Perform dimension reduction (e.g., PCA) on numeric features
6. Create one-hot or dummy encoded features

.center[_(see the recipe sketch on the next slide)_]

]
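---
# Sequential steps

To make the ordering concrete, here is a minimal sketch of steps 1–6 expressed as a single recipe (the specific step choices are illustrative assumptions, not a prescription — e.g., `ames_train` is complete, so the imputation step is a no-op here):

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%                         # 1. filter near-zero variance features
  step_medianimpute(all_numeric(), -all_outcomes()) %>%  # 2. impute (no-op on complete data)
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%    # 3. resolve skewness
  step_center(all_numeric(), -all_outcomes()) %>%        # 4. center and...
  step_scale(all_numeric(), -all_outcomes()) %>%         #    ...scale numeric features
  step_pca(all_numeric(), -all_outcomes(), threshold = .95) %>%  # 5. dimension reduction
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)     # 6. encode remaining categoricals
```

---
# Data leakage

___Data leakage___ is when information from outside the training dataset is used to create the model. 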
- Often occurs when doing feature engineering
- Feature engineering should be done within each resampling iteration, isolated from the assessment (holdout) data

<img src="images/data-leakage.png" width="80%" height="80%" style="display: block; margin: auto;" />

---
# Putting the process together

.pull-left[

.font120[

* __recipes__ provides a convenient way to create feature engineering blueprints

]

]

.pull-right[

<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png" width="70%" height="70%" style="display: block; margin: auto;" />

]

.center.bold.font120[https://tidymodels.github.io/recipes/index.html]

---
# Putting the process together

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9] .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. .bold[recipe: define your pre-processing blueprint] 2. prepare: estimate parameters based on training data 3. bake/juice: apply blueprint to new data <br> .center.blue[Check out all the available `step_xxx()` functions at http://bit.ly/step_functions] ] .pull-right[ ```r blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>% step_nzv(all_nominal()) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) %>% step_integer(matches("Qual|Cond|QC|Qu")) blueprint ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Operations: ## ## Sparse, unbalanced variable filter on all_nominal() ## Centering for all_numeric(), -all_outcomes() ## Scaling for all_numeric(), -all_outcomes() ## Integer encoding for matches("Qual|Cond|QC|Qu") ``` ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 10] .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. .bold[prepare: estimate parameters based on training data] 3. bake/juice: apply blueprint to new data ] .pull-right[ ```r prepare <- prep(blueprint, training = ames_train) prepare ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Training data contained 2199 data points and no missing data. ## ## Operations: ## ## Sparse, unbalanced variable filter removed Street, Alley, ... [trained] ## Centering for Lot_Frontage, Lot_Area, ... [trained] ## Scaling for Lot_Frontage, Lot_Area, ... [trained] ## Integer encoding for Condition_1, Overall_Qual, Overall_Cond, ... [trained] ``` ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 11] .scrollable90[ .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. prepare: estimate parameters based on training data 3. .bold[bake: apply blueprint to new data] ] .pull-right[ ```r baked_train <- bake(prepare, new_data = ames_train) baked_test <- bake(prepare, new_data = ames_test) baked_train ## # A tibble: 2,199 x 68 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Lot_Shape Lot_Config ## <fct> <fct> <dbl> <dbl> <fct> <fct> ## 1 One_Story_… Resident… 0.692 0.515 Slightly… Corner ## 2 One_Story_… Resident… 1.05 0.125 Regular Corner ## 3 Two_Story_… Resident… 0.484 0.460 Slightly… Inside ## 4 Two_Story_… Resident… 0.603 -0.0227 Slightly… Inside ## 5 One_Story_… Resident… -0.496 -0.656 Regular Inside ## 6 One_Story_… Resident… -0.436 -0.646 Slightly… Inside ## 7 Two_Story_… Resident… 0.0682 -0.333 Regular Inside ## 8 Two_Story_… Resident… 0.513 -0.0199 Slightly… Corner ## 9 One_Story_… Resident… -1.71 -0.273 Slightly… Inside ## 10 One_Story_… Resident… 0.810 0.00212 Regular Inside ## # … with 2,189 more rows, and 62 more variables: Neighborhood <fct>, ## # Condition_1 <dbl>, Bldg_Type <fct>, House_Style <fct>, ## # Overall_Qual <dbl>, Overall_Cond <dbl>, Year_Built <dbl>, ## # Year_Remod_Add <dbl>, Roof_Style <fct>, Exterior_1st <fct>, ## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, ## # Exter_Qual <dbl>, Exter_Cond <dbl>, Foundation <fct>, Bsmt_Qual <dbl>, ## # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, ## # BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, ## # Heating_QC <dbl>, Central_Air <fct>, Electrical <fct>, ## # First_Flr_SF <dbl>, Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, ## # Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, ## # Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, ## # Kitchen_AbvGr <dbl>, Kitchen_Qual <dbl>, TotRms_AbvGrd <dbl>, ## # Fireplaces <dbl>, Fireplace_Qu <dbl>, Garage_Type <fct>, ## # Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>, ## # Garage_Qual <dbl>, Garage_Cond <dbl>, Paved_Drive <fct>, ## # Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>, ## # Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>, ## # Fence <fct>, Misc_Val <dbl>, Mo_Sold <dbl>, Year_Sold <dbl>, ## # Sale_Type <fct>, Sale_Condition <dbl>, Sale_Price <int>, ## # Longitude <dbl>, Latitude <dbl> ``` ] ] --- # Simplifying with __caret__ .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. prepare: estimate parameters based on training data 3. bake: apply blueprint to new data * Luckily, __caret__ simplifies this process for us. 1. We supply __caret__ a recipe 2. __caret__ will prepare & bake within each resample ] .pull-right[ <br> <img src="https://media1.tenor.com/images/6358cb41e076a3c517e5a9988b1dc888/tenor.gif?itemid=5711499" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12] .scrollable90[ .pull-left[ Let's add a blueprint to our modeling process for analyzing the Ames housing data: 1. Split into training vs testing data 2. .blue[Create feature engineering blueprint] 3. Specify a resampling procedure 4. Create our hyperparameter grid 5. Execute grid search 6. Evaluate performance ] .pull-right[ .center.bold[
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
This grid search takes ~8 min
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
] ```r # 1. stratified sampling with the rsample package set.seed(123) split <- initial_split(ames, prop = 0.7, strata = "Sale_Price") ames_train <- training(split) ames_test <- testing(split) # 2. Feature engineering blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>% step_nzv(all_nominal()) %>% step_integer(matches("Qual|Cond|QC|Qu")) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) # 3. create a resampling method cv <- trainControl( method = "repeatedcv", number = 10, repeats = 5 ) # 4. create a hyperparameter grid search hyper_grid <- expand.grid(k = seq(2, 25, by = 1)) # 5. execute grid search with knn model # use RMSE as preferred metric knn_fit <- train( blueprint, data = ames_train, method = "knn", trControl = cv, tuneGrid = hyper_grid, metric = "RMSE" ) # 6. evaluate results # print model results knn_fit ## k-Nearest Neighbors ## ## 2054 samples ## 80 predictor ## ## Recipe steps: nzv, integer, center, scale, dummy ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1849, 1849, 1849, 1848, 1850, 1848, ... ## Resampling results across tuning parameters: ## ## k RMSE Rsquared MAE ## 2 36404.68 0.7944128 22482.91 ## 3 35319.96 0.8073602 21554.09 ## 4 35424.81 0.8054775 21361.14 ## 5 35214.10 0.8100148 21163.72 ## 6 34645.72 0.8183326 20894.01 ## 7 34409.24 0.8220020 20832.48 ## 8 34023.75 0.8275806 20669.88 ## 9 33818.08 0.8312492 20596.20 ## 10 33744.59 0.8326048 20624.06 ## 11 33734.82 0.8337820 20623.66 ## 12 33723.32 0.8348085 20606.53 ## 13 33794.99 0.8347543 20671.04 ## 14 33972.82 0.8341336 20765.06 ## 15 34075.31 0.8336435 20809.05 ## 16 34150.00 0.8339415 20853.42 ## 17 34203.56 0.8341864 20940.34 ## 18 34284.83 0.8337899 21012.43 ## 19 34325.85 0.8337150 21063.53 ## 20 34381.42 0.8333470 21140.70 ## 21 34424.06 0.8332503 21173.41 ## 22 34443.72 0.8334388 21195.66 ## 23 34489.65 0.8335386 21225.77 ## 24 34509.98 0.8335924 21241.38 ## 25 34532.88 0.8338991 21275.36 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was k = 12. # plot cross validation results ggplot(knn_fit$results, aes(k, RMSE)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::dollar) ``` <img src="04-engineering_files/figure-html/example-blue-print-application-1.png" style="display: block; margin: auto;" /> ] ] --- # Putting the process together .center.bold.font120[Feature engineering alone reduced our error by $10,000!] <img src="https://media1.tenor.com/images/2b6d0826f02a9ba7c9d4384a740013e9/tenor.gif?itemid=5531028" width="90%" height="90%" style="display: block; margin: auto;" /> --- # Questions? <img src="http://www.whitehouse51.com/thumbnail/a/any-questions-meme-100-images-thanks-for-listening-any-1.jpeg" width="50%" height="50%" style="display: block; margin: auto;" /> --- # Back home <br><br><br><br> [.center[
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/uc-r/Advanced-R) .center[https://github.com/uc-r/Advanced-R]