class: clear, center, middle

background-image: url(images/engineering-icon.jpg)
background-position: center
background-size: cover

<br><br><br><br><br><br><br><br><br><br><br><br><br>
.font200.bold.white[Feature & Target Engineering]

---
# Introduction

Data pre-processing and engineering techniques generally refer to the .blue[___addition, deletion, or transformation of data___].

.pull-left[

.center.bold.font120[Thoughts]

- Substantial time commitment
- A 1-hour module doesn't do it justice
- Not a "sexy" area to study, but well worth your time
- Additional resources to start with:
   - [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)
   - [Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)

]

--

.pull-right[

.center.bold.font120[Overview]

- Target engineering
- Missingness
- Feature filtering
- Numeric feature engineering
- Categorical feature engineering
- Dimension reduction
- Proper implementation

]

---
# Prereqs .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 1]

.pull-left[

.center.bold.font120[Packages]

```r
library(dplyr)
library(ggplot2)
library(rsample)
library(recipes)
library(caret)    # used for nearZeroVar() and the grid search later on
```

]

.pull-right[

.center.bold.font120[Data]

```r
# ames data
ames <- AmesHousing::make_ames()

# split data
set.seed(123)
split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(split)
ames_test  <- testing(split)   # needed later when we bake the test set
```

]

---
class: center, middle, inverse

.font300.white[Target Engineering]

---
# Normality correction

.pull-left[

Not a requirement but...

- can improve predictive accuracy for parametric & distance-based models
- can correct for residual assumption violations
- minimizes effects of outliers

plus...

- sometimes used to shape the business problem as well

.center[_“taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.”_]

]

.pull-right[

<br><br>
<center>
`\(\texttt{Sale_Price} = \beta_0 + \beta_1\texttt{Year_Built} + \epsilon\)`
</center>

<img src="04-engineering_files/figure-html/skewed-residuals-1.png" style="display: block; margin: auto;" />

]

---
# Transformation options

.pull-left[

- log (or log with offset)

- Box-Cox: automates process of finding proper transformation

$$
\begin{equation}
y(\lambda) =
\begin{cases}
 \frac{y^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\
 \log y, & \text{if}\ \lambda = 0.
\end{cases}
\end{equation}
$$

- Yeo-Johnson: modified Box-Cox for non-strictly positive values

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
```

]

<img src="04-engineering_files/figure-html/distribution-comparison-1.png" style="display: block; margin: auto;" />
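---
# Transformation options

To make the piecewise definition concrete, here is a minimal sketch of the Box-Cox formula in base R, assuming `\(\lambda\)` is already known (`step_BoxCox()` estimates it from the training data; the `box_cox()` helper and the `\(\lambda = 0.2\)` value are purely illustrative):

```r
# direct implementation of the piecewise formula above
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # Box-Cox is only defined for strictly positive values
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# illustrative transform of the response with a hypothetical lambda
head(box_cox(ames_train$Sale_Price, lambda = 0.2))
```

---
class: center, middle, inverse

.font300.white[Missingness]

.white[_Many models cannot cope with missing data so imputation strategies may be necessary._]

---
# Visualizing .red[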
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 2]

An uncleaned version of Ames housing data:

```r
sum(is.na(AmesHousing::ames_raw))
## [1] 13997
```

.pull-left[

```r
AmesHousing::ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
  geom_raster() +
  coord_flip() +
  scale_y_continuous(NULL, expand = c(0, 0)) +
  scale_fill_grey(name = "", labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))
```

]

.pull-right[

<img src="04-engineering_files/figure-html/missing-distribution-plot-1.png" style="display: block; margin: auto;" />

]
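---
# Visualizing

The same missingness raster can be produced more directly with the __visdat__ package (a sketch, assuming __visdat__ is installed — it is not part of this module's prereqs):

```r
# one-line present/missing raster; cluster = TRUE groups rows
# with similar missingness patterns together
visdat::vis_miss(AmesHousing::ames_raw, cluster = TRUE)
```

---
# Visualizing .red[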
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 3] An uncleaned version of Ames housing data: ```r sum(is.na(AmesHousing::ames_raw)) ## [1] 13997 ``` .pull-left[ ```r extracat::visna(AmesHousing::ames_raw, sort = "b") ``` ] .pull-right[ <img src="04-engineering_files/figure-html/missing-distribution-plot2-1.png" style="display: block; margin: auto;" /> ] --- # Structural vs random .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 4]

.pull-left[

Missing values can occur for many different reasons; however, they are usually lumped into two categories:

* informative missingness
* missingness at random

]

.pull-right[

```r
AmesHousing::ames_raw %>%
  filter(is.na(`Garage Type`)) %>%
  select(`Garage Type`, `Garage Cars`, `Garage Area`)
## # A tibble: 157 x 3
##    `Garage Type` `Garage Cars` `Garage Area`
##    <chr>                 <int>         <int>
##  1 <NA>                      0             0
##  2 <NA>                      0             0
##  3 <NA>                      0             0
##  4 <NA>                      0             0
##  5 <NA>                      0             0
##  6 <NA>                      0             0
##  7 <NA>                      0             0
##  8 <NA>                      0             0
##  9 <NA>                      0             0
## 10 <NA>                      0             0
## # … with 147 more rows
```

]

<br>
.center.bold[Determines how you will, and if you can/should, impute.]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g., mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

.center.font80[.red[Actual values] vs .blue[imputed values]]

<img src="04-engineering_files/figure-html/imputation-examples-1.png" style="display: block; margin: auto;" />

]

---
# Imputation

.pull-left[

Primary methods:

- Estimated statistic (e.g., mean, median, mode)
- K-nearest neighbor
- Tree-based (bagged trees)

]

.pull-right[

We'll put these pieces together later

```r
step_meanimpute()
step_medianimpute()
step_modeimpute()
step_knnimpute()
step_bagimpute()
```

]

---
class: center, middle, inverse

.font300.white[Feature Filtering]

---
# More is not always better!

Excessive noisy variables can...

.font120.bold[reduce accuracy]

<img src="images/accuracy-comparison-1.png" width="2560" style="display: block; margin: auto;" />

---
# More is not always better!

Excessive noisy variables can...

.font120.bold[increase computation time]

<img src="images/impact-on-time-1.png" width="2560" style="display: block; margin: auto;" />

---
# Options for filtering .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 5]

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)
- Feature selection
   - beyond scope of module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

```r
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  filter(nzv)
##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  218.90000    0.09095043   FALSE TRUE
## 2               Alley   22.31522    0.13642565   FALSE TRUE
## 3        Land_Contour   23.05814    0.18190086   FALSE TRUE
## 4           Utilities 2197.00000    0.13642565   FALSE TRUE
## 5          Land_Slope   22.76087    0.13642565   FALSE TRUE
## 6         Condition_2  242.00000    0.31832651   FALSE TRUE
## 7           Roof_Matl  127.47059    0.36380173   FALSE TRUE
## 8           Bsmt_Cond   19.71717    0.27285130   FALSE TRUE
## 9      BsmtFin_Type_2   24.20513    0.31832651   FALSE TRUE
## 10       BsmtFin_SF_2  486.00000    9.54979536   FALSE TRUE
## 11            Heating  103.09524    0.27285130   FALSE TRUE
## 12    Low_Qual_Fin_SF  723.33333    1.22783083   FALSE TRUE
## 13      Kitchen_AbvGr   22.60215    0.18190086   FALSE TRUE
## 14         Functional   40.90000    0.36380173   FALSE TRUE
## 15     Enclosed_Porch  102.72222    7.41246021   FALSE TRUE
## 16 Three_season_porch  723.33333    1.18235562   FALSE TRUE
## 17       Screen_Porch  183.18182    4.77489768   FALSE TRUE
## 18          Pool_Area 2190.00000    0.45475216   FALSE TRUE
## 19            Pool_QC  730.00000    0.22737608   FALSE TRUE
## 20       Misc_Feature   31.22059    0.27285130   FALSE TRUE
## 21           Misc_Val  151.85714    1.40973170   FALSE TRUE
```

]

---
# Options for filtering

.pull-left[

Filtering options include:

- removing
   - zero variance features
   - near-zero variance features
   - highly correlated features (better to do dimension reduction)
- Feature selection
   - beyond scope of module
   - see [Applied Predictive Modeling, ch. 19](http://appliedpredictivemodeling.com/)

]

.pull-right[

We'll put these pieces together later

```r
step_zv()
step_nzv()
step_corr()
```

]

---
class: center, middle, inverse

.font300.white[Numeric Feature Engineering]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g., GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g., KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

<img src="04-engineering_files/figure-html/standardizing-1.png" style="display: block; margin: auto;" />

]

---
# Transformations

.pull-left[

* skewness
   - parametric models that have distributional assumptions (e.g., GLMs, regularized models)
   - log
   - Box-Cox or Yeo-Johnson

* standardization
   - Models that incorporate linear functions (GLM, NN) and distance functions (e.g., KNN, clustering) of input features are sensitive to the scale of the inputs
   - centering _and_ scaling so that numeric variables have `\(\mu = 0; \sigma = 1\)`

]

.pull-right[

We'll put these pieces together later

```r
step_log()
step_BoxCox()
step_YeoJohnson()
step_center()
step_scale()
```

]
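---
# Transformations

A quick sanity check that centering and scaling behave as advertised (a sketch; `Gr_Liv_Area` is an arbitrarily chosen example feature):

```r
std <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(Gr_Liv_Area) %>%   # subtract the training-set mean
  step_scale(Gr_Liv_Area) %>%    # divide by the training-set standard deviation
  prep(training = ames_train) %>%
  bake(new_data = ames_train)

# the standardized feature should now have mean 0 and sd 1
c(mean = mean(std$Gr_Liv_Area), sd = sd(std$Gr_Liv_Area))
```

---
class: center, middle, inverse

.font300.white[Categorical Feature Engineering]

---
# One-hot & Dummy encoding

.pull-left[

Many models require all predictor variables to be numeric (e.g., 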
GLMs, SVMs, NNets) <table class="table table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> a </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> b </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> c </td> </tr> </tbody> </table> Two most common approaches include... ] .pull-right[ .bold.center[Dummy encoding] <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> x.b </th> <th style="text-align:right;"> x.c </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> .bold.center[One-hot encoding] <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> x.a </th> <th style="text-align:right;"> x.b </th> <th style="text-align:right;"> x.c </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 6] .pull-left[ * One-hot and dummy encoding are not good when: - you have a lot of categorical features - with high cardinality - or you have ordinal features * Label encoding: - pure numeric conversion of the levels of a categorical variable - most common: ordinal encoding ] .pull-right[ .center.bold[Quality variables with natural ordering] ```r ames_train %>% select(matches("Qual|QC|Qu")) ## # A tibble: 2,199 x 9 ## Overall_Qual Exter_Qual Bsmt_Qual Heating_QC Low_Qual_Fin_SF ## <fct> <fct> <fct> <fct> <int> ## 1 Above_Avera… Typical Typical Typical 0 ## 2 Good Good Typical Excellent 0 ## 3 Average Typical Good Good 0 ## 4 Above_Avera… Typical Typical Excellent 0 ## 5 Very_Good Good Good Excellent 0 ## 6 Very_Good Good Good Excellent 0 ## 7 Good Typical Typical Good 0 ## 8 Above_Avera… Typical Good Good 0 ## 9 Above_Avera… Typical Good Excellent 0 ## 10 Good Typical Good Good 0 ## # … with 2,189 more rows, and 4 more variables: Kitchen_Qual <fct>, ## # Fireplace_Qu <fct>, Garage_Qual <fct>, Pool_QC <fct> ``` ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 7] .pull-left[ * One-hot and dummy encoding are not good when: - you have a lot of categorical features - with high cardinality - or you have ordinal features * Label encoding: - pure numeric conversion of the levels of a categorical variable - most common: ordinal encoding ] .pull-right[ .center.bold[Original encoding for `Overall_Qual`] ```r count(ames_train, Overall_Qual) ## # A tibble: 10 x 2 ## Overall_Qual n ## <fct> <int> ## 1 Very_Poor 3 ## 2 Poor 12 ## 3 Fair 29 ## 4 Below_Average 166 ## 5 Average 607 ## 6 Above_Average 553 ## 7 Good 458 ## 8 Very_Good 266 ## 9 Excellent 81 ## 10 Very_Excellent 24 ``` ] --- # Label encoding .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 8]

.pull-left[

* One-hot and dummy encoding are not good when:
   - you have a lot of categorical features
   - with high cardinality
   - or you have ordinal features

* Label encoding:
   - pure numeric conversion of the levels of a categorical variable
   - most common: ordinal encoding

]

.pull-right[

.center.bold[Label/ordinal encoding for `Overall_Qual`]

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(Overall_Qual) %>%
  prep(ames_train) %>%
  bake(ames_train) %>%
  count(Overall_Qual)
## # A tibble: 10 x 2
##    Overall_Qual     n
##           <dbl> <int>
##  1            1     3
##  2            2    12
##  3            3    29
##  4            4   166
##  5            5   607
##  6            6   553
##  7            7   458
##  8            8   266
##  9            9    81
## 10           10    24
```

]

---
# Common categorical encodings

We'll put these pieces together later

```r
step_dummy()
step_dummy(one_hot = TRUE)
step_integer()
step_ordinalscore()
```

---
class: center, middle, inverse

.font300.white[Dimension Reduction]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

<img src="04-engineering_files/figure-html/pca-1.png" style="display: block; margin: auto;" />

]

---
# PCA

.pull-left[

* We can use PCA for downstream modeling

* In the Ames data, there are potential clusters of highly correlated variables:
   - proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, etc.
   - quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc.

* It would be nice if we could combine the variables in these clusters into a single variable that represents them.

* In fact, we can explain 95% of the variance in our numeric features with 38 PCs

]

.pull-right[

We'll put these pieces together later

```r
step_pca()
step_kpca()
step_pls()
step_spatialsign()
```

]

---
class: center, middle, inverse

.font300.white[Blueprints]

---
# Sequential steps

.pull-left[

.bold.center.font120[Some thoughts to consider]

- If using a log or Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive.
- Standardize your numeric features prior to one-hot/dummy encoding.
- If you are lumping infrequent categories together, do so before one-hot/dummy encoding.
- Although you can perform dimension reduction on categorical features, for feature engineering purposes it is most common to apply it to numeric features.

]

--

.pull-right[

.bold.center.font120[Suggested ordering]

1. Filter out zero or near-zero variance features
2. Perform imputation if required
3. Normalize to resolve numeric feature skewness
4. Standardize (center and scale) numeric features
5. Perform dimension reduction (e.g., PCA) on numeric features
6. Create one-hot or dummy encoded features

.center[_(see the recipe sketch on the next slide)_]

]
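---
# Sequential steps

To make the ordering concrete, here is a minimal sketch of steps 1–6 expressed as a single recipe (the specific step choices are illustrative assumptions, not a prescription — e.g., `ames_train` is complete, so the imputation step is a no-op here):

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%                         # 1. filter near-zero variance features
  step_medianimpute(all_numeric(), -all_outcomes()) %>%  # 2. impute (no-op on complete data)
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%    # 3. resolve skewness
  step_center(all_numeric(), -all_outcomes()) %>%        # 4. center and...
  step_scale(all_numeric(), -all_outcomes()) %>%         #    ...scale numeric features
  step_pca(all_numeric(), -all_outcomes(), threshold = .95) %>%  # 5. dimension reduction
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)     # 6. encode remaining categoricals
```

---
# Data leakage

___Data leakage___ is when information from outside the training dataset is used to create the model. 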
- Often occurs when doing feature engineering
- Feature engineering should be done within each resampling iteration, isolated from the assessment (holdout) data

<img src="images/data-leakage.png" width="80%" height="80%" style="display: block; margin: auto;" />

---
# Putting the process together

.pull-left[

.font120[

* __recipes__ provides a convenient way to create feature engineering blueprints

]

]

.pull-right[

<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png" width="70%" height="70%" style="display: block; margin: auto;" />

]

.center.bold.font120[https://tidymodels.github.io/recipes/index.html]

---
# Putting the process together

.pull-left[

* __recipes__ provides a convenient way to create feature engineering blueprints

* 3 main components to consider
   1. recipe: define your pre-processing blueprint
   2. prepare: estimate parameters based on training data
   3. bake/juice: apply blueprint to new data

]

---
# Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 9] .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. .bold[recipe: define your pre-processing blueprint] 2. prepare: estimate parameters based on training data 3. bake/juice: apply blueprint to new data <br> .center.blue[Check out all the available `step_xxx()` functions at http://bit.ly/step_functions] ] .pull-right[ ```r blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>% step_nzv(all_nominal()) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) %>% step_integer(matches("Qual|Cond|QC|Qu")) blueprint ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Operations: ## ## Sparse, unbalanced variable filter on all_nominal() ## Centering for all_numeric(), -all_outcomes() ## Scaling for all_numeric(), -all_outcomes() ## Integer encoding for matches("Qual|Cond|QC|Qu") ``` ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 10] .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. .bold[prepare: estimate parameters based on training data] 3. bake/juice: apply blueprint to new data ] .pull-right[ ```r prepare <- prep(blueprint, training = ames_train) prepare ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Training data contained 2199 data points and no missing data. ## ## Operations: ## ## Sparse, unbalanced variable filter removed Street, Alley, ... [trained] ## Centering for Lot_Frontage, Lot_Area, ... [trained] ## Scaling for Lot_Frontage, Lot_Area, ... [trained] ## Integer encoding for Condition_1, Overall_Qual, Overall_Cond, ... [trained] ``` ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 11] .scrollable90[ .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. prepare: estimate parameters based on training data 3. .bold[bake: apply blueprint to new data] ] .pull-right[ ```r baked_train <- bake(prepare, new_data = ames_train) baked_test <- bake(prepare, new_data = ames_test) baked_train ## # A tibble: 2,199 x 68 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Lot_Shape Lot_Config ## <fct> <fct> <dbl> <dbl> <fct> <fct> ## 1 One_Story_… Resident… 0.692 0.515 Slightly… Corner ## 2 One_Story_… Resident… 1.05 0.125 Regular Corner ## 3 Two_Story_… Resident… 0.484 0.460 Slightly… Inside ## 4 Two_Story_… Resident… 0.603 -0.0227 Slightly… Inside ## 5 One_Story_… Resident… -0.496 -0.656 Regular Inside ## 6 One_Story_… Resident… -0.436 -0.646 Slightly… Inside ## 7 Two_Story_… Resident… 0.0682 -0.333 Regular Inside ## 8 Two_Story_… Resident… 0.513 -0.0199 Slightly… Corner ## 9 One_Story_… Resident… -1.71 -0.273 Slightly… Inside ## 10 One_Story_… Resident… 0.810 0.00212 Regular Inside ## # … with 2,189 more rows, and 62 more variables: Neighborhood <fct>, ## # Condition_1 <dbl>, Bldg_Type <fct>, House_Style <fct>, ## # Overall_Qual <dbl>, Overall_Cond <dbl>, Year_Built <dbl>, ## # Year_Remod_Add <dbl>, Roof_Style <fct>, Exterior_1st <fct>, ## # Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, ## # Exter_Qual <dbl>, Exter_Cond <dbl>, Foundation <fct>, Bsmt_Qual <dbl>, ## # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, ## # BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, ## # Heating_QC <dbl>, Central_Air <fct>, Electrical <fct>, ## # First_Flr_SF <dbl>, Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, ## # Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, ## # Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, ## # Kitchen_AbvGr <dbl>, Kitchen_Qual <dbl>, TotRms_AbvGrd <dbl>, ## # Fireplaces <dbl>, Fireplace_Qu <dbl>, Garage_Type <fct>, ## # Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>, ## # Garage_Qual <dbl>, Garage_Cond <dbl>, Paved_Drive <fct>, ## # Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>, ## # Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>, ## # Fence <fct>, Misc_Val <dbl>, Mo_Sold <dbl>, Year_Sold <dbl>, ## # Sale_Type <fct>, Sale_Condition <dbl>, Sale_Price <int>, ## # Longitude <dbl>, Latitude <dbl> ``` ] ] --- # Simplifying with __caret__ .pull-left[ * __recipes__ provides a convenient way to create feature engineering blueprints * 3 main components to consider 1. recipe: define your pre-processing blueprint 2. prepare: estimate parameters based on training data 3. bake: apply blueprint to new data * Luckily, __caret__ simplifies this process for us. 1. We supply __caret__ a recipe 2. __caret__ will prepare & bake within each resample ] .pull-right[ <br> <img src="https://media1.tenor.com/images/6358cb41e076a3c517e5a9988b1dc888/tenor.gif?itemid=5711499" width="90%" height="90%" style="display: block; margin: auto;" /> ] --- # Putting the process together .red[
<i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"></i>
code chunk 12] .scrollable90[ .pull-left[ Let's add a blueprint to our modeling process for analyzing the Ames housing data: 1. Split into training vs testing data 2. .blue[Create feature engineering blueprint] 3. Specify a resampling procedure 4. Create our hyperparameter grid 5. Execute grid search 6. Evaluate performance ] .pull-right[ .center.bold[
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
This grid search takes ~8 min
<i class="fas fa-exclamation-triangle faa-FALSE animated " style=" color:red;"></i>
] ```r # 1. stratified sampling with the rsample package set.seed(123) split <- initial_split(ames, prop = 0.7, strata = "Sale_Price") ames_train <- training(split) ames_test <- testing(split) # 2. Feature engineering blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>% step_nzv(all_nominal()) %>% step_integer(matches("Qual|Cond|QC|Qu")) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) # 3. create a resampling method cv <- trainControl( method = "repeatedcv", number = 10, repeats = 5 ) # 4. create a hyperparameter grid search hyper_grid <- expand.grid(k = seq(2, 25, by = 1)) # 5. execute grid search with knn model # use RMSE as preferred metric knn_fit <- train( blueprint, data = ames_train, method = "knn", trControl = cv, tuneGrid = hyper_grid, metric = "RMSE" ) # 6. evaluate results # print model results knn_fit ## k-Nearest Neighbors ## ## 2054 samples ## 80 predictor ## ## Recipe steps: nzv, integer, center, scale, dummy ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1849, 1849, 1849, 1848, 1850, 1848, ... ## Resampling results across tuning parameters: ## ## k RMSE Rsquared MAE ## 2 36404.68 0.7944128 22482.91 ## 3 35319.96 0.8073602 21554.09 ## 4 35424.81 0.8054775 21361.14 ## 5 35214.10 0.8100148 21163.72 ## 6 34645.72 0.8183326 20894.01 ## 7 34409.24 0.8220020 20832.48 ## 8 34023.75 0.8275806 20669.88 ## 9 33818.08 0.8312492 20596.20 ## 10 33744.59 0.8326048 20624.06 ## 11 33734.82 0.8337820 20623.66 ## 12 33723.32 0.8348085 20606.53 ## 13 33794.99 0.8347543 20671.04 ## 14 33972.82 0.8341336 20765.06 ## 15 34075.31 0.8336435 20809.05 ## 16 34150.00 0.8339415 20853.42 ## 17 34203.56 0.8341864 20940.34 ## 18 34284.83 0.8337899 21012.43 ## 19 34325.85 0.8337150 21063.53 ## 20 34381.42 0.8333470 21140.70 ## 21 34424.06 0.8332503 21173.41 ## 22 34443.72 0.8334388 21195.66 ## 23 34489.65 0.8335386 21225.77 ## 24 34509.98 0.8335924 21241.38 ## 25 34532.88 0.8338991 21275.36 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was k = 12. # plot cross validation results ggplot(knn_fit$results, aes(k, RMSE)) + geom_line() + geom_point() + scale_y_continuous(labels = scales::dollar) ``` <img src="04-engineering_files/figure-html/example-blue-print-application-1.png" style="display: block; margin: auto;" /> ] ] --- # Putting the process together .center.bold.font120[Feature engineering alone reduced our error by $10,000!] <img src="https://media1.tenor.com/images/2b6d0826f02a9ba7c9d4384a740013e9/tenor.gif?itemid=5531028" width="90%" height="90%" style="display: block; margin: auto;" /> --- # Questions? <img src="http://www.whitehouse51.com/thumbnail/a/any-questions-meme-100-images-thanks-for-listening-any-1.jpeg" width="50%" height="50%" style="display: block; margin: auto;" /> --- # Back home <br><br><br><br> [.center[
<i class="fas fa-home fa-10x faa-FALSE animated "></i>
]](https://github.com/uc-r/Advanced-R) .center[https://github.com/uc-r/Advanced-R]