class: clear, center, middle background-image: url(images/data-types-icon.png) <br><br><br><br><br><br><br><br><br><br><br><br> .font200.bold[Data Types] --- # Data types .font120.center[Data types are involved in nearly every task in the data science flow; however, most relevant to you today will be in the visualization and transformation tasks.] <img src="images/visualize-transform-task.png" width="2560" style="display: block; margin: auto;" /> --- class: yourturn # Your Turn! .center.font130.bold.blue[What types of data are in this data set?] <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> household_id </th> <th style="text-align:left;"> basket_id </th> <th style="text-align:left;"> brand </th> <th style="text-align:left;"> product_category </th> <th style="text-align:left;"> transaction_timestamp </th> <th style="text-align:right;"> quantity </th> <th style="text-align:left;"> multi_items </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2030 </td> <td style="text-align:left;"> 41338197126 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> VEGETABLES - ALL OTHERS </td> <td style="text-align:left;"> 2017-12-21 15:49:15 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 222 </td> <td style="text-align:left;"> 33041273501 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> GREETING CARDS/WRAP/PARTY SPLY </td> <td style="text-align:left;"> 2017-05-05 21:38:00 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1333 </td> <td style="text-align:left;"> 31490396845 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> CANNED JUICES </td> <td style="text-align:left;"> 2017-01-19 17:47:27 </td> <td style="text-align:right;"> 1 </td> <td 
style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 2312 </td> <td style="text-align:left;"> 33316120536 </td> <td style="text-align:left;"> Private </td> <td style="text-align:left;"> EGGS </td> <td style="text-align:left;"> 2017-05-25 11:26:21 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 2007 </td> <td style="text-align:left;"> 41297606588 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> SOUP </td> <td style="text-align:left;"> 2017-12-18 13:57:52 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 400 </td> <td style="text-align:left;"> 40436037487 </td> <td style="text-align:left;"> Private </td> <td style="text-align:left;"> FRZN POTATOES </td> <td style="text-align:left;"> 2017-10-22 16:29:52 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 2322 </td> <td style="text-align:left;"> 40545066388 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> VEGETABLES - ALL OTHERS </td> <td style="text-align:left;"> 2017-10-30 15:21:52 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1001 </td> <td style="text-align:left;"> 32957555802 </td> <td style="text-align:left;"> Private </td> <td style="text-align:left;"> TOMATOES </td> <td style="text-align:left;"> 2017-04-30 12:51:01 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 2194 </td> <td style="text-align:left;"> 34141497504 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> SOFT DRINKS </td> <td style="text-align:left;"> 2017-07-17 20:38:02 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 
FALSE </td> </tr> <tr> <td style="text-align:left;"> 1399 </td> <td style="text-align:left;"> 32053257851 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> SALD DRSNG/SNDWCH SPRD </td> <td style="text-align:left;"> 2017-02-28 19:05:30 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1767 </td> <td style="text-align:left;"> 31390731305 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> CANDY - CHECKLANE </td> <td style="text-align:left;"> 2017-01-14 18:11:03 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1064 </td> <td style="text-align:left;"> 31344121808 </td> <td style="text-align:left;"> Private </td> <td style="text-align:left;"> FROZEN PIE/DESSERTS </td> <td style="text-align:left;"> 2017-01-09 16:02:04 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1621 </td> <td style="text-align:left;"> 40424144445 </td> <td style="text-align:left;"> Private </td> <td style="text-align:left;"> YOGURT </td> <td style="text-align:left;"> 2017-10-21 18:23:27 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;"> 1368 </td> <td style="text-align:left;"> 31281057009 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> DRIED FRUIT </td> <td style="text-align:left;"> 2017-01-06 20:19:05 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:left;"> 768 </td> <td style="text-align:left;"> 41310832228 </td> <td style="text-align:left;"> National </td> <td style="text-align:left;"> CAT FOOD </td> <td style="text-align:left;"> 2017-12-19 15:53:16 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> TRUE 
</td> </tr> </tbody> </table>

---
# Packages to work with these data types

.pull-left[

```r
install.packages("tidyverse")
```

.bold[does the equivalent of...]

```r
install.packages("ggplot2")
install.packages("tibble")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("dplyr")
*install.packages("stringr")
*install.packages("forcats")
*install.packages("lubridate")
*install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
```

]

.pull-right[

```r
library(tidyverse)
```

.bold[does the equivalent of...]

```r
library(ggplot2)
library(tibble)
library(tidyr)
library(readr)
library(purrr)
library(dplyr)
*library(stringr)
*library(forcats)
```

]

---
# Requirements

.pull-left[

### Packages

```r
library(tidyverse)
library(forcats)   # to work with factors
library(lubridate) # to work with dates
library(hms)       # to work with times
```

]

.pull-right[

### Data sets

```r
library(completejourney)

# complete journey data
transactions <- transactions_sample
products

# imported data
households <- data.table::fread("data/households.csv", data.table = FALSE) %>%
  as_tibble()
```

]

---
class: clear, center, middle
background-image: url(images/logical-icon.jpg)
background-size: cover

.font300.bold[Logicals]

---
# Logicals

R's data type for .blue[boolean values] (i.e.
TRUE and FALSE) .pull-left[ ```r typeof(TRUE) ## [1] "logical" typeof(FALSE) ## [1] "logical" typeof(c(TRUE, TRUE, FALSE)) ## [1] "logical" ``` ] .pull-right[ ```r transactions %>% select(basket_id, coupon_disc) %>% * mutate(used_coupon = coupon_disc > 0) ## # A tibble: 1,469,307 x 3 ## basket_id coupon_disc used_coupon ## <chr> <dbl> <lgl> ## 1 31198570044 0 FALSE ## 2 31198570047 0 FALSE ## 3 31198655051 0 FALSE ## 4 31198705046 0 FALSE ## 5 31198705046 0 FALSE ## 6 31198705046 0 FALSE ## 7 31198705046 0 FALSE ## 8 31198676055 0 FALSE ## 9 31198676055 0 FALSE ## 10 31198676055 0 FALSE ## # ... with 1,469,297 more rows ``` ] --- # Most useful skill... .red[math with logicals] When you do math with logicals, .blue[TRUE becomes 1] and .blue[FALSE becomes 0.] -- .pull-left[ The .bold[sum] of a logical vector is the .bold[count of TRUEs] <br><br> ```r # logical comparison x <- c(1, 2, 3, 4) < 4 x ## [1] TRUE TRUE TRUE FALSE # sum of elements that meet that condition sum(x) ## [1] 3 ``` ] .pull-right[ The .bold[mean] of a logical vector is the .bold[proportion of TRUEs] ```r # proportion of elements that meet that condition mean(x) ## [1] 0.75 ``` ] --- class: yourturn # Your Turn! .pull-left[ ### Challenge Using the __transactions__ data and the `coupon_disc` variable 1. How many transactions used a coupon (coupon_disc > 0)? 2. What proportion of transactions used a coupon? 
]

--

.pull-right[

### Solution

```r
transactions %>%
  mutate(coupon_used = coupon_disc > 0) %>%
  summarise(
    count = sum(coupon_used),
    prop  = mean(coupon_used)
  )
## # A tibble: 1 x 2
##   count   prop
##   <int>  <dbl>
## 1 21889 0.0149
```

]

---
class: clear, center, middle
background-image: url(images/character_string.png)
background-size: cover

.font300.bold.grey[Character Strings]

---
# Working with character strings <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

.pull-left[

<br>

* Often, we have character strings in our data that are long (e.g. description fields), messy (e.g. manual user input), and/or inconsistent

* Working with strings in base R can be a little frustrating, primarily because of syntax inconsistencies

* The [__stringr__](https://stringr.tidyverse.org/index.html) package allows you to work with strings easily

]

.pull-right[

<img src="images/stringr-large.png" width="80%" height="80%" style="display: block; margin: auto;" />

]

---
# stringr functions <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

.center[

.font200[All __stringr__ functions start with .bold.grey[`str_`]]

.font130[.grey[`str_`].blue[`sub()`]]

.font130[.grey[`str_`].blue[`count()`]]

.font130[.grey[`str_`].blue[`replace()`]]

.font130[.grey[`str_`].blue[`detect()`]]

.font130[.grey[`str_`].blue[`remove()`]]

.font130[...]
.content-box-grey[.bold[Check out all the options with `stringr::str_ + tab`]]

]

---
# stringr functions <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

.scrollable90[

.pull-left[

Let's look at the variety of meat products:

<table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> product_category </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> FROZEN MEAT </td> </tr> <tr> <td style="text-align:left;"> FRZN MEAT/MEAT DINNERS </td> </tr> <tr> <td style="text-align:left;"> MEAT - SHELF STABLE </td> </tr> <tr> <td style="text-align:left;"> MEAT - MISC </td> </tr> <tr> <td style="text-align:left;"> LUNCHMEAT </td> </tr> <tr> <td style="text-align:left;"> DELI MEATS </td> </tr> <tr> <td style="text-align:left;"> SMOKED MEATS </td> </tr> <tr> <td style="text-align:left;"> RW FRESH PROCESSED MEAT </td> </tr> <tr> <td style="text-align:left;"> MEAT SUPPLIES </td> </tr> <tr> <td style="text-align:left;"> FROZEN PACKAGE MEAT </td> </tr> </tbody> </table>

]

.pull-right[

```r
# character string vector
x <- c("FROZEN MEAT", "FRZN MEAT/MEAT DINNERS", "MEAT - MISC", "CEREAL")

# force to lower case
str_to_lower(x)
## [1] "frozen meat"            "frzn meat/meat dinners"
## [3] "meat - misc"            "cereal"

# extract first 4 characters
str_sub(x, start = 1, end = 4)
## [1] "FROZ" "FRZN" "MEAT" "CERE"

# detect if "MEAT" is in each element
str_detect(x, pattern = "MEAT")
## [1]  TRUE  TRUE  TRUE FALSE

# replace first "MEAT" in each element with "NON-VEGGIE"
str_replace(x, pattern = "MEAT", replacement = "NON-VEGGIE")
## [1] "FROZEN NON-VEGGIE"            "FRZN NON-VEGGIE/MEAT DINNERS"
## [3] "NON-VEGGIE - MISC"            "CEREAL"

# replace all "MEAT" in each element with "NON-VEGGIE"
str_replace_all(x, pattern = "MEAT", replacement = "NON-VEGGIE")
## [1] "FROZEN NON-VEGGIE"
## [2] "FRZN NON-VEGGIE/NON-VEGGIE DINNERS"
## [3] "NON-VEGGIE - MISC"
## [4] "CEREAL"
```

]

]

---
# Most useful skills <a
href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

<br>

.font200[

1. How to find matches for patterns
2. How to extract / replace substrings
3. Regular expressions

]

<br>

.center[.content-box-grey[.bold[We will only hit the basics here; I cover this more thoroughly in Intermediate R]]]

---
# Find matches for patterns <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

What if we wanted to analyze transactions for all .bold[meat] products?

.pull-left[

- `str_detect()` returns TRUE/FALSE
- use it with `filter()` to keep only the rows where it is TRUE

```r
products %>%
  select(product_id, product_category) %>%
*  filter(str_detect(product_category, "MEAT"))
## # A tibble: 3,607 x 2
##    product_id product_category
##    <chr>      <chr>
##  1 30003      FROZEN MEAT
##  2 31493      FRZN MEAT/MEAT DINNERS
##  3 34997      FRZN MEAT/MEAT DINNERS
##  4 36406      FRZN MEAT/MEAT DINNERS
##  5 36561      FRZN MEAT/MEAT DINNERS
##  6 36618      MEAT - SHELF STABLE
##  7 36722      MEAT - MISC
##  8 37220      LUNCHMEAT
##  9 38412      LUNCHMEAT
## 10 39376      FRZN MEAT/MEAT DINNERS
## # ... with 3,597 more rows
```

]

--

.pull-right[

How many products are meat products?

```r
products %>%
  distinct(product_id, product_category) %>%
*  mutate(meat_product = str_detect(product_category, "MEAT")) %>%
  summarize(
    count = sum(meat_product, na.rm = TRUE),
    prop  = mean(meat_product, na.rm = TRUE)
  )
## # A tibble: 1 x 2
##   count   prop
##   <int>  <dbl>
## 1  3607 0.0393
```

]

---
# Extract / replace substrings <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

What if we wanted to analyze transactions for all .bold[frozen food] products?
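A quick sketch of why this needs care, using a small made-up vector of category names (an assumption for illustration, not the real `products` data): a pattern for "FROZEN" alone misses the "FRZN" abbreviation, while the alternation `"FROZEN|FRZN"` catches both spellings.

```r
library(stringr)

# hypothetical toy vector, not the real products data
categories <- c("FRZN ICE", "FROZEN PIZZA", "SEAFOOD - FROZEN", "CEREAL")

# matching only the full spelling misses "FRZN ICE"
only_frozen <- str_detect(categories, "FROZEN")
only_frozen
## [1] FALSE  TRUE  TRUE FALSE

# "|" means "or", so this matches either spelling
either_spelling <- str_detect(categories, "FROZEN|FRZN")
either_spelling
## [1]  TRUE  TRUE  TRUE FALSE
```

Detecting both spellings is enough for filtering; standardizing the spelling is the better choice when the category names will later be grouped or displayed.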
.pull-left[

Notice that we have products categorized as .bold.blue["FROZEN"] and .bold.blue["FRZN"]

<table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> product_category </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> FRZN ICE </td> </tr> <tr> <td style="text-align:left;"> FRZN VEGETABLE/VEG DSH </td> </tr> <tr> <td style="text-align:left;"> FRZN FRUITS </td> </tr> <tr> <td style="text-align:left;"> SEAFOOD - FROZEN </td> </tr> <tr> <td style="text-align:left;"> FROZEN PIZZA </td> </tr> <tr> <td style="text-align:left;"> FROZEN MEAT </td> </tr> <tr> <td style="text-align:left;"> FRZN MEAT/MEAT DINNERS </td> </tr> <tr> <td style="text-align:left;"> FRZN BREAKFAST FOODS </td> </tr> <tr> <td style="text-align:left;"> FRZN JCE CONC/DRNKS </td> </tr> <tr> <td style="text-align:left;"> FROZEN PIE/DESSERTS </td> </tr> <tr> <td style="text-align:left;"> FROZEN BREAD/DOUGH </td> </tr> <tr> <td style="text-align:left;"> FRZN NOVELTIES/WTR ICE </td> </tr> <tr> <td style="text-align:left;"> FRZN POTATOES </td> </tr> <tr> <td style="text-align:left;"> FROZEN </td> </tr> <tr> <td style="text-align:left;"> FROZEN CHICKEN </td> </tr> <tr> <td style="text-align:left;"> FROZEN - BOXED(GROCERY) </td> </tr> <tr> <td style="text-align:left;"> FRZN SEAFOOD </td> </tr> <tr> <td style="text-align:left;"> FROZEN PACKAGE MEAT </td> </tr> </tbody> </table>

]

.pull-right[

- We can .blue[replace] "FRZN" with "FROZEN" using .blue[`str_replace()`]
- Often, we want to replace ___all___ instances, not just the first, so we use .blue[`str_replace_all()`]

```r
products %>%
*  mutate(product_category = str_replace_all(product_category, pattern = "FRZN", replacement = "FROZEN")) %>%
  filter(str_detect(product_category, "FROZEN")) %>%
  distinct(product_category)
## # A tibble: 18 x 1
##    product_category
##    <chr>
##  1 FROZEN ICE
##  2 FROZEN VEGETABLE/VEG DSH
##  3 FROZEN FRUITS
##  4 SEAFOOD - FROZEN
##  5 FROZEN PIZZA
##  6 FROZEN MEAT
##  7 FROZEN MEAT/MEAT DINNERS
##  8
FROZEN BREAKFAST FOODS
##  9 FROZEN JCE CONC/DRNKS
## 10 FROZEN PIE/DESSERTS
## 11 FROZEN BREAD/DOUGH
## 12 FROZEN NOVELTIES/WTR ICE
## 13 FROZEN POTATOES
## 14 FROZEN
## 15 FROZEN CHICKEN
## 16 FROZEN - BOXED(GROCERY)
## 17 FROZEN SEAFOOD
## 18 FROZEN PACKAGE MEAT
```

]

---
# Regular expressions <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

- What we have been doing so far is matching very simple .blue[___regular expressions___]
- Regular expressions (regex) provide a concise way to identify patterns within character strings
- We cover this in more depth in the Intermediate R course

```r
# all products that start with "FROZEN" or "FRZN", or end with the word "ICE"
# (in an R string, the word-boundary anchor \b is written "\\b")
products %>%
  filter(str_detect(product_category, regex("^(FROZEN|FRZN)|\\bICE$"))) %>%
  distinct(product_category)
## # A tibble: 17 x 1
##    product_category
##    <chr>
##  1 FRZN ICE
##  2 FRZN VEGETABLE/VEG DSH
##  3 FRZN FRUITS
##  4 FROZEN PIZZA
##  5 FROZEN MEAT
##  6 FRZN MEAT/MEAT DINNERS
##  7 FRZN BREAKFAST FOODS
##  8 FRZN JCE CONC/DRNKS
##  9 FROZEN PIE/DESSERTS
## 10 FROZEN BREAD/DOUGH
## 11 FRZN NOVELTIES/WTR ICE
## 12 FRZN POTATOES
## 13 FROZEN
## 14 FROZEN CHICKEN
## 15 FROZEN - BOXED(GROCERY)
## 16 FRZN SEAFOOD
## 17 FROZEN PACKAGE MEAT
```

---
class: yourturn
# Your Turn!

.pull-left[

### Challenge

Using the __products__ data and `product_category` variable

1. How many products contain "SEAFOOD"?
2. 
What is the proportion of products that contain "SEAFOOD" __Hint:__ ```r products %>% distinct(product_category) %>% mutate(seafood = _____) %>% summarise( count = _____, prop = _____ ) ``` ] -- .pull-right[ ### Solution ```r products %>% distinct(product_category) %>% mutate(seafood = str_detect(product_category, pattern = "SEAFOOD")) %>% summarise( count = sum(seafood, na.rm = TRUE), prop = mean(seafood, na.rm = TRUE) ) ## # A tibble: 1 x 2 ## count prop ## <int> <dbl> ## 1 6 0.0195 ``` ] --- class: clear, center, middle background-image: url(images/factor-icon.jpg) background-size: cover <br><br><br> .font300.bold.grey[Factors] --- # Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> .pull-left[ <br> * Factors are a useful data structure; particularly for modeling and visualizations because they control the order of levels * Working with factors in base R can be a little frustrating because of syntax inconsistencies and a handful of missing tools * The [__forcats__](https://forcats.tidyverse.org/index.html) package allows you to modify factors with minimal pain ] .pull-right[ <img src="images/forcats-large.png" width="80%" height="80%" style="display: block; margin: auto;" /> ] --- # Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> R’s representation of categorical data. 
Consists of:

- A set of discrete values
- An ordered set of valid levels

```r
eyes <- factor(x = c("blue", "green", "green"), levels = c("blue", "brown", "green"))
eyes
## [1] blue  green green
## Levels: blue brown green
```

--

Stored as an integer vector with a levels attribute

```r
unclass(eyes)
## [1] 1 3 3
## attr(,"levels")
## [1] "blue"  "brown" "green"
```

---
# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Categorical variables can have levels that are ordered, unordered, collapsible, etc. Consider:

.pull-left[

```r
households %>% distinct(marital)
## # A tibble: 4 x 1
##   marital
##   <chr>
## 1 Unknown
## 2 Married
## 3 null
## 4 Single
```

]

.pull-right[

```r
households %>% distinct(income_range)
## # A tibble: 7 x 1
##   income_range
##   <chr>
## 1 35-49K
## 2 50-74K
## 3 75-99K
## 4 UNDER 35K
## 5 150K+
## 6 100-150K
## 7 null
```

]

<br><br>

.center[.content-box-gray[.bold[Why would we care about changing these levels? 🤔]]]

---
# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Often, we want to adjust the categories or the ordering of categories for a categorical variable. Consider:

.pull-left[

```r
ggplot(households, aes(marital)) +
  geom_bar()
```

<img src="day-2b-data-types_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

]

.pull-right[

```r
ggplot(households, aes(income_range)) +
  geom_bar()
```

<img src="day-2b-data-types_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

]

---
# Most useful skills... 
<a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> .pull-left[ .font150[ <br> - Reorder the levels - Recode the levels - Collapse levels ] ] .pull-right[ .font150[ <br> All __forcats__ functions start with .grey[`fct_`] - .grey[`fct_`].blue[`relevel()`] - .grey[`fct_`].blue[`recode()`] - .grey[`fct_`].blue[`collapse()`] - .grey[`fct_`].blue[`unique()`] ] ] --- # Reorder the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> We can .blue[reorder] factor levels with .blue[`fct_relevel()`] .pull-left[ ```r households <- households %>% * mutate(income_range = fct_relevel(income_range, "UNDER 35K", "35-49K", "50-74K", "75-99K", "100-150K", "150K+", "null")) households %>% count(income_range) ## # A tibble: 7 x 2 ## income_range n ## <fct> <int> ## 1 UNDER 35K 790 ## 2 35-49K 876 ## 3 50-74K 947 ## 4 75-99K 594 ## 5 100-150K 581 ## 6 150K+ 347 ## 7 null 865 ``` ] .pull-right[ ```r ggplot(households, aes(income_range)) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/plot-releveled-income-1.png" style="display: block; margin: auto;" /> ] --- # Recode the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> We can .blue[recode] factor levels with .blue[`fct_recode()`] .pull-left[ ```r households <- households %>% * mutate(income_range = fct_recode(income_range, Unknown = "null")) households %>% count(income_range) ## # A tibble: 7 x 2 ## income_range n ## <fct> <int> ## 1 UNDER 35K 790 ## 2 35-49K 876 ## 3 50-74K 947 ## 4 75-99K 594 ## 5 100-150K 581 ## 6 150K+ 347 ## 7 Unknown 865 ``` ] .pull-right[ ```r ggplot(households, aes(income_range)) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/recode-income-plot-1.png" style="display: block; margin: auto;" /> ] --- # Collapse the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" 
class="forcats-hex", align="right"></a> We can .blue[collapse] factor levels with .blue[`fct_collapse()`] .pull-left[ ```r households <- households %>% mutate( * marital = fct_collapse(marital, Unknown = c("null", "Unknown")), marital = fct_relevel(marital, "Unknown", after = Inf) ) households %>% count(marital) ## # A tibble: 3 x 2 ## marital n ## <fct> <int> ## 1 Married 2405 ## 2 Single 1388 ## 3 Unknown 1207 ``` ] .pull-right[ ```r ggplot(households, aes(marital)) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/collapse-marital-status-plot-1.png" style="display: block; margin: auto;" /> ] --- # Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> Sometimes you just want to .blue[reorder a factor for plotting purposes] rather than to permanently change the factor. .blue[`fct_infreq()`], .blue[`fct_rev()`], and .blue[`fct_reorder()`] can be helpful. -- .pull-left[ ```r households %>% mutate(homeowner = fct_collapse(homeowner, Unknown = c("Unknown", "null"))) %>% * ggplot(aes(fct_infreq(homeowner))) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r households %>% mutate(homeowner = fct_collapse(homeowner, Unknown = c("Unknown", "null"))) %>% * ggplot(aes(fct_rev(homeowner))) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] --- # Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a> Sometimes you just want to .blue[reorder a factor for plotting purposes] rather than to permanently change the factor. .blue[`fct_infreq()`], .blue[`fct_rev()`], and .blue[`fct_reorder()`] can be helpful. 
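As a rough sketch of what each helper does, here is a small made-up factor (hypothetical data, not from `households`, with `avg_price` invented purely for illustration). Only the level order changes, never the underlying values.

```r
library(forcats)

# hypothetical toy factor; default levels are alphabetical
size <- factor(c("small", "large", "large", "medium", "large", "small"))
levels(size)
## [1] "large"  "medium" "small"

# fct_infreq(): most frequent level first
by_freq <- fct_infreq(size)
levels(by_freq)
## [1] "large"  "small"  "medium"

# fct_rev(): reverse the current level order
reversed <- fct_rev(size)
levels(reversed)
## [1] "small"  "medium" "large"

# fct_reorder(): order levels by another variable (median per level by default)
avg_price <- c(5, 20, 22, 12, 18, 6)  # made-up values, one per element of size
by_price <- fct_reorder(size, avg_price)
levels(by_price)
## [1] "small"  "medium" "large"
```

Because these calls return a new factor, using them inside `aes()` changes the plot order without overwriting the column in the data.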
.pull-left[ ```r prod_count <- products %>% count(department) %>% drop_na() prod_count ## # A tibble: 32 x 2 ## department n ## <chr> <int> ## 1 AUTOMOTIVE 2 ## 2 CHARITABLE CONT 2 ## 3 CHEF SHOPPE 14 ## 4 CNTRL/STORE SUP 4 ## 5 COSMETICS 3011 ## 6 COUPON 39 ## 7 DELI 2359 ## 8 DRUG GM 31540 ## 9 ELECT &PLUMBING 1 ## 10 FLORAL 938 ## # ... with 22 more rows ``` ] .pull-right[ ```r *ggplot(prod_count, aes(n, fct_reorder(department, n))) + geom_point() ``` <img src="day-2b-data-types_files/figure-html/fct-reorder-plot-1.png" style="display: block; margin: auto;" /> ] --- class: yourturn # Your Turn! .pull-left[ ### Challenge Using the __households__ data 1. Recode the `hh_size` factor so that "null" is now "Unknown" 2. Relevel the `hh_size` factor so that "Unknown" is at the end 3. Use a bar chart to illustrate the distribution of `hh_size` in our data ] .pull-right[ ### Solution ```r households %>% mutate( hh_size = fct_recode(hh_size, Unknown = "null"), hh_size = fct_relevel(hh_size, "Unknown", after = Inf) ) %>% ggplot(aes(hh_size)) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/yourturn-factors-1.png" style="display: block; margin: auto;" /> ] --- background-image: url(images/date-time-icon.jpg) background-size: cover # Dates & Times --- # Working with dates & times <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a> .pull-left[ <br> * Dates come in many different forms: - 2017/02/03 - February 3, 2017 - 03-Feb-2017 * Working with dates in R can be a bit convoluted and cumbersome * The [__lubridate__](https://lubridate.tidyverse.org/index.html) package allows us to easily handle/manipulate date-time variables ] .pull-right[ <img src="images/lubridate-large.png" width="80%" height="80%" style="display: block; margin: auto;" /> ] --- # Most useful skills... 
<a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a> .font200[ - Creating dates/times (i.e. parsing) - Access and change parts of a date - .opacity20[Deal with time zones] - .opacity20[Do math with instants and time spans] ] --- # Creating dates/times <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a> __lubridate__ has a series of parsing functions that will .blue[create dates] based on the existence and order of the date-time components .pull-left[ .font130[ - `ymd_hms()`, `ymd_hm()`, `ymd_h()`, `ymd()` - `ydm_hms()`, `ydm_hm()`, `ydm_h()`, `ydm()` - `dmy_hms()`, `dmy_hm()`, `dmy_h()`, `dmy()` - `mdy_hms()`, `mdy_hm()`, `mdy_h()`, `mdy()` - and more! ] ] .pull-right[ ```r # year, month, day ymd("2018-12-02") ## [1] "2018-12-02" # year, month, day, hour ymd_h("2018-12-02 01") ## [1] "2018-12-02 01:00:00 UTC" # year, month, day, timestamp ymd_hms("2018-12-02 01:31:27") ## [1] "2018-12-02 01:31:27 UTC" ``` and __lubridate__ does not care about format ```r ymd("2018-12-02") ## [1] "2018-12-02" ymd("2018/12/02") ## [1] "2018-12-02" mdy("February 02, 2018") ## [1] "2018-02-02" ``` ] --- # Accessing components <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a> __lubridate__ has a series of functions to .blue[extract components] of dates .pull-left[ .font130[ - `year()` - `quarter()` - `month()` - `week()` - `wday()` - `hour()` - and more! ] ] .pull-right[ ```r # get year year("2018-12-02 01:31:27") ## [1] 2018 # get quarter quarter("2018-12-02 01:31:27") ## [1] 4 # get month month("2018-12-02 01:31:27", label = TRUE) ## [1] Dec ## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec # get weekday wday("2018-12-02 01:31:27", label = TRUE, abbr = FALSE) ## [1] Sunday ## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... 
< Saturday
```

]

---
# Accessing components <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

- Just like __stringr__, we can use the __lubridate__ functions inside `filter()` and `mutate()`

.pull-left[

Use `filter()` to get all transactions that occur on a Friday or Saturday. By default `wday()` numbers the week from Sunday = 1, so `6:7` is Friday and Saturday; Saturday and Sunday would be `c(1, 7)`.

```r
transactions %>%
*  filter(wday(transaction_timestamp) %in% 6:7) %>%
  select(basket_id, transaction_timestamp)
## # A tibble: 442,370 x 2
##    basket_id   transaction_timestamp
##    <chr>       <dttm>
##  1 31269173695 2017-01-06 00:15:57
##  2 31269173695 2017-01-06 00:15:57
##  3 31269173696 2017-01-06 00:16:57
##  4 31268902157 2017-01-06 00:40:34
##  5 31268902157 2017-01-06 00:40:34
##  6 31268902157 2017-01-06 00:40:34
##  7 31268902157 2017-01-06 00:40:34
##  8 31268902157 2017-01-06 00:40:34
##  9 31268902157 2017-01-06 00:40:34
## 10 31268902157 2017-01-06 00:40:34
## # ... with 442,360 more rows
```

]

.pull-right[

Use `mutate()` to create a new variable for the day of week

```r
transactions %>%
*  mutate(weekday = wday(transaction_timestamp, label = TRUE)) %>%
  select(basket_id, transaction_timestamp, weekday)
## # A tibble: 1,469,307 x 3
##    basket_id   transaction_timestamp weekday
##    <chr>       <dttm>                <ord>
##  1 31198570044 2017-01-01 06:53:26   Sun
##  2 31198570047 2017-01-01 07:10:28   Sun
##  3 31198655051 2017-01-01 07:26:30   Sun
##  4 31198705046 2017-01-01 07:30:27   Sun
##  5 31198705046 2017-01-01 07:30:27   Sun
##  6 31198705046 2017-01-01 07:30:27   Sun
##  7 31198705046 2017-01-01 07:30:27   Sun
##  8 31198676055 2017-01-01 07:56:33   Sun
##  9 31198676055 2017-01-01 07:56:33   Sun
## 10 31198676055 2017-01-01 07:56:33   Sun
## # ... with 1,469,297 more rows
```

]

---
class: yourturn
# Your Turn!

.pull-left[

### Challenge

Using the __transactions__ data set

1. Make a bar chart showing the total number of transactions by weekday. Which weekday experiences the most traffic?
2. 
Make a line chart showing the total daily sales value (`sum(sales_value)`) for each day of the year (hint: use `yday()`). Is there any obvious trend in the daily total sales value? ] -- .pull-right[ ### Solution ```r # 1 transactions %>% mutate(weekday = wday(transaction_timestamp, label = TRUE)) %>% ggplot(aes(weekday)) + geom_bar() ``` <img src="day-2b-data-types_files/figure-html/yourturn-lubridate-1-1.png" style="display: block; margin: auto;" /> ] --- class: yourturn # Your Turn! .pull-left[ ### Challenge Using the __transactions__ data set 1. Make a bar chart showing the total number of transactions by weekday. Which weekday experiences the most traffic? 2. Make a line chart showing the total daily sales value (`sum(sales_value)`) for each day of the year (hint: use `yday()`). Is there any obvious trend in the daily total sales value? ] .pull-right[ ### Solution ```r # 2 transactions %>% mutate(day = yday(transaction_timestamp)) %>% group_by(day) %>% summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>% ggplot(aes(x = day, y = total_sales)) + geom_line() ``` <img src="day-2b-data-types_files/figure-html/yourturn-lubridate-2-1.png" style="display: block; margin: auto;" /> ] --- # Questions? .font120.center[We can do a lot with data types, and the packages reviewed here can do much more than what we covered; however, this gets you started with manipulating data types for analytic purposes.] <img src="images/questions.png" width="400" height="400" style="display: block; margin: auto;" />