Data Types

<br><br><br><br><br><br><br><br><br><br><br><br>
.font200.bold[Data Types]

---

# Data types

.font120.center[Data types are involved in nearly every task in the data science workflow; today, they are most relevant to the visualization and transformation tasks.]

---

.center.font130.bold.blue[What types of data are in this data set?]

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> household_id </th>
   <th style="text-align:left;"> basket_id </th>
   <th style="text-align:left;"> brand </th>
   <th style="text-align:left;"> product_category </th>
   <th style="text-align:left;"> transaction_timestamp </th>
   <th style="text-align:right;"> quantity </th>
   <th style="text-align:left;"> multi_items </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 2030 </td>
   <td style="text-align:left;"> 41338197126 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> VEGETABLES - ALL OTHERS </td>
   <td style="text-align:left;"> 2017-12-21 15:49:15 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 222 </td>
   <td style="text-align:left;"> 33041273501 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> GREETING CARDS/WRAP/PARTY SPLY </td>
   <td style="text-align:left;"> 2017-05-05 21:38:00 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1333 </td>
   <td style="text-align:left;"> 31490396845 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> CANNED JUICES </td>
   <td style="text-align:left;"> 2017-01-19 17:47:27 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2312 </td>
   <td style="text-align:left;"> 33316120536 </td>
   <td style="text-align:left;"> Private </td>
   <td style="text-align:left;"> EGGS </td>
   <td style="text-align:left;"> 2017-05-25 11:26:21 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2007 </td>
   <td style="text-align:left;"> 41297606588 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> SOUP </td>
   <td style="text-align:left;"> 2017-12-18 13:57:52 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 400 </td>
   <td style="text-align:left;"> 40436037487 </td>
   <td style="text-align:left;"> Private </td>
   <td style="text-align:left;"> FRZN POTATOES </td>
   <td style="text-align:left;"> 2017-10-22 16:29:52 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2322 </td>
   <td style="text-align:left;"> 40545066388 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> VEGETABLES - ALL OTHERS </td>
   <td style="text-align:left;"> 2017-10-30 15:21:52 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1001 </td>
   <td style="text-align:left;"> 32957555802 </td>
   <td style="text-align:left;"> Private </td>
   <td style="text-align:left;"> TOMATOES </td>
   <td style="text-align:left;"> 2017-04-30 12:51:01 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2194 </td>
   <td style="text-align:left;"> 34141497504 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> SOFT DRINKS </td>
   <td style="text-align:left;"> 2017-07-17 20:38:02 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1399 </td>
   <td style="text-align:left;"> 32053257851 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> SALD DRSNG/SNDWCH SPRD </td>
   <td style="text-align:left;"> 2017-02-28 19:05:30 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1767 </td>
   <td style="text-align:left;"> 31390731305 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> CANDY - CHECKLANE </td>
   <td style="text-align:left;"> 2017-01-14 18:11:03 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1064 </td>
   <td style="text-align:left;"> 31344121808 </td>
   <td style="text-align:left;"> Private </td>
   <td style="text-align:left;"> FROZEN PIE/DESSERTS </td>
   <td style="text-align:left;"> 2017-01-09 16:02:04 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1621 </td>
   <td style="text-align:left;"> 40424144445 </td>
   <td style="text-align:left;"> Private </td>
   <td style="text-align:left;"> YOGURT </td>
   <td style="text-align:left;"> 2017-10-21 18:23:27 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1368 </td>
   <td style="text-align:left;"> 31281057009 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> DRIED FRUIT </td>
   <td style="text-align:left;"> 2017-01-06 20:19:05 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 768 </td>
   <td style="text-align:left;"> 41310832228 </td>
   <td style="text-align:left;"> National </td>
   <td style="text-align:left;"> CAT FOOD </td>
   <td style="text-align:left;"> 2017-12-19 15:53:16 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
</tbody>
</table>

---

# Packages to work with these data types

```r
install.packages("tidyverse")
```

```r
install.packages("ggplot2")
install.packages("tibble")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("dplyr")
*install.packages("stringr")
*install.packages("forcats")
*install.packages("lubridate")
*install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
```


```r
library(tidyverse)
```

```r
library(ggplot2) 
library(tibble) 
library(tidyr) 
library(readr) 
library(purrr) 
library(dplyr) 
*library(stringr)
*library(forcats)
```


---

# Requirements

### Packages

```r
library(tidyverse)
library(forcats)   # to work with factors
library(lubridate) # to work with dates
library(hms)       # to work with times
```


### Data sets

```r
library(completejourney)

# complete journey data
transactions <- transactions_sample
products

# imported data
households <- data.table::fread("data/households.csv", data.table = FALSE) %>% as_tibble()
```


---

.font300.bold[Logicals]

---

# Logicals

R's data type for .blue[boolean values] (i.e. TRUE and FALSE)

```r
typeof(TRUE)
## [1] "logical"

typeof(FALSE)
## [1] "logical"

typeof(c(TRUE, TRUE, FALSE))
## [1] "logical"
```


```r
transactions %>%
  select(basket_id, coupon_disc) %>% 
* mutate(used_coupon = coupon_disc > 0)
## # A tibble: 1,469,307 x 3
##    basket_id   coupon_disc used_coupon
##    <chr>             <dbl> <lgl>      
##  1 31198570044           0 FALSE      
##  2 31198570047           0 FALSE      
##  3 31198655051           0 FALSE      
##  4 31198705046           0 FALSE      
##  5 31198705046           0 FALSE      
##  6 31198705046           0 FALSE      
##  7 31198705046           0 FALSE      
##  8 31198676055           0 FALSE      
##  9 31198676055           0 FALSE      
## 10 31198676055           0 FALSE      
## # ... with 1,469,297 more rows
```


---

# Most useful skill... .red[math with logicals]

When you do math with logicals, .blue[TRUE becomes 1] and .blue[FALSE becomes 0.]

The .bold[sum] of a logical vector is the .bold[count of TRUEs]
<br><br>

```r
# logical comparison
x <- c(1, 2, 3, 4) < 4
x
## [1]  TRUE  TRUE  TRUE FALSE

# sum of elements that meet that condition
sum(x)
## [1] 3
```


The .bold[mean] of a logical vector is the .bold[proportion of TRUEs]

```r
# proportion of elements that meet that condition
mean(x)
## [1] 0.75
```
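
Missing values change this arithmetic. The sketch below uses made-up `coupon_disc` values (not taken from the transactions data) to show that a comparison against `NA` yields `NA`, so `sum()` and `mean()` return `NA` unless `na.rm = TRUE` is set.

```r
# hypothetical discount amounts; NA marks a missing value
coupon_disc <- c(0, 0.5, NA, 1.25, 0)

# the comparison keeps the missing value
used_coupon <- coupon_disc > 0
used_coupon
## [1] FALSE  TRUE    NA  TRUE FALSE

# a single NA makes the count NA
sum(used_coupon)
## [1] NA

# drop missing values before counting or averaging
sum(used_coupon, na.rm = TRUE)
## [1] 2
mean(used_coupon, na.rm = TRUE)
## [1] 0.5
```

Note that with `na.rm = TRUE` the proportion is computed over the non-missing elements only (2 of 4 here).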


---
class: yourturn
# Your Turn!

### Challenge

Using the __transactions__ data and the `coupon_disc` variable

1. How many transactions used a coupon (coupon_disc > 0)?
2. What proportion of transactions used a coupon?


### Solution

```r
transactions %>% 
  mutate(coupon_used = coupon_disc > 0) %>%
  summarise(
    count = sum(coupon_used),
    prop  = mean(coupon_used)
    )
## # A tibble: 1 x 2
##   count   prop
##   <int>  <dbl>
## 1 21889 0.0149
```


---

.font300.bold.grey[Character Strings]

---

# Working with character strings <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

<br>

* Often, we have character strings in our data that are long (e.g. description fields), messy (e.g. manual user input), and/or inconsistent

* Working with strings in base R can be a little frustrating primarily because of syntax inconsistencies

* The [__stringr__](https://stringr.tidyverse.org/index.html) package allows you to work with strings easily
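
As a small illustration of the "messy and inconsistent" problem, the sketch below cleans three hand-typed variants of the same label. The `raw` values are hypothetical and chosen only for this example; `str_squish()` and `str_to_upper()` are standard __stringr__ functions.

```r
library(stringr)

# hypothetical manual entries for the same category
raw <- c("  frozen   pizza ", "FROZEN PIZZA", "Frozen Pizza  ")

# remove leading/trailing/repeated whitespace, then standardize the case
clean <- str_to_upper(str_squish(raw))
clean
## [1] "FROZEN PIZZA" "FROZEN PIZZA" "FROZEN PIZZA"

# the three variants now collapse to one value
unique(clean)
## [1] "FROZEN PIZZA"
```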


---

# stringr functions <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

Let's look at the variety of meat products:


```r
# character string vector
x <- c("FROZEN MEAT", "FRZN MEAT/MEAT DINNERS", "MEAT - MISC", "CEREAL")

# force to lower case
str_to_lower(x)
## [1] "frozen meat"            "frzn meat/meat dinners"
## [3] "meat - misc"            "cereal"

# extract first 4 characters
str_sub(x, start = 1, end = 4)
## [1] "FROZ" "FRZN" "MEAT" "CERE"

# detect if "MEAT" is in each element (matching is case sensitive)
str_detect(x, pattern = "MEAT")
## [1]  TRUE  TRUE  TRUE FALSE

# replace first "MEAT" in each element with "NON-VEGGIE"
str_replace(x, pattern = "MEAT", replacement = "NON-VEGGIE")
## [1] "FROZEN NON-VEGGIE"            "FRZN NON-VEGGIE/MEAT DINNERS"
## [3] "NON-VEGGIE - MISC"            "CEREAL"

# replace all "MEAT" in each element with "NON-VEGGIE"
str_replace_all(x, pattern = "MEAT", replacement = "NON-VEGGIE")
## [1] "FROZEN NON-VEGGIE"                 
## [2] "FRZN NON-VEGGIE/NON-VEGGIE DINNERS"
## [3] "NON-VEGGIE - MISC"                 
## [4] "CEREAL"
```


---

# Most useful skills <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

<br>

1. How to find matches for patterns

2. How to extract / replace substrings

3. Regular expressions


<br>
.center[.content-box-grey[.bold[We will only hit the basics here; I cover this more thoroughly in Intermediate R]]]

---

# Find matches for patterns <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

What if we wanted to analyze transactions for all .bold[meat] products?

- `str_detect()` returns TRUE/FALSE
- use with `filter()` to return only TRUEs

```r
products %>% 
  select(product_id, product_category) %>%
* filter(str_detect(product_category, "MEAT"))
## # A tibble: 3,607 x 2
##    product_id product_category      
##    <chr>      <chr>                 
##  1 30003      FROZEN MEAT           
##  2 31493      FRZN MEAT/MEAT DINNERS
##  3 34997      FRZN MEAT/MEAT DINNERS
##  4 36406      FRZN MEAT/MEAT DINNERS
##  5 36561      FRZN MEAT/MEAT DINNERS
##  6 36618      MEAT - SHELF STABLE   
##  7 36722      MEAT - MISC           
##  8 37220      LUNCHMEAT             
##  9 38412      LUNCHMEAT             
## 10 39376      FRZN MEAT/MEAT DINNERS
## # ... with 3,597 more rows
```


How many products are meat products?

```r
products %>%
  distinct(product_id, product_category) %>%
* mutate(meat_product = str_detect(product_category, "MEAT")) %>%
  summarize(
    count = sum(meat_product, na.rm = TRUE),
    prop  = mean(meat_product, na.rm = TRUE)
  )
## # A tibble: 1 x 2
##   count   prop
##   <int>  <dbl>
## 1  3607 0.0393
```
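
Pattern matching in `str_detect()` is case sensitive by default. A minimal sketch, using a hypothetical vector with mixed case (the product categories above are all upper case), shows two ways to match regardless of case:

```r
library(stringr)

# hypothetical categories with inconsistent case
categories <- c("FROZEN MEAT", "Lunchmeat", "meat - misc", "CEREAL")

# case-sensitive match misses the lower-case entries
str_detect(categories, "MEAT")
## [1]  TRUE FALSE FALSE FALSE

# option 1: ignore case in the pattern
str_detect(categories, regex("meat", ignore_case = TRUE))
## [1]  TRUE  TRUE  TRUE FALSE

# option 2: standardize the case first
str_detect(str_to_upper(categories), "MEAT")
## [1]  TRUE  TRUE  TRUE FALSE
```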


---

# Extract/replace substrings <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

What if we wanted to analyze transactions for all .bold[frozen food] products?

Notice that we have products categorized as .bold.blue["FROZEN"] and .bold.blue["FRZN"]

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> product_category </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FRZN ICE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN VEGETABLE/VEG DSH </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN FRUITS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SEAFOOD - FROZEN </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN PIZZA </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN MEAT </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN MEAT/MEAT DINNERS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN BREAKFAST FOODS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN JCE CONC/DRNKS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN PIE/DESSERTS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN BREAD/DOUGH </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN NOVELTIES/WTR ICE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN POTATOES </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN CHICKEN </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN - BOXED(GROCERY) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FRZN SEAFOOD </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FROZEN PACKAGE MEAT </td>
  </tr>
</tbody>
</table>


- We can .blue[replace] "FRZN" with "FROZEN" using .blue[`str_replace()`]
- Often, we want to replace ___all___ instances, not just the first, so we use .blue[`str_replace_all()`]

```r
products %>%
* mutate(product_category = str_replace_all(product_category, pattern = "FRZN", replacement = "FROZEN")) %>%
  filter(str_detect(product_category, "FROZEN")) %>%
  distinct(product_category)
## # A tibble: 18 x 1
##    product_category        
##    <chr>                   
##  1 FROZEN ICE              
##  2 FROZEN VEGETABLE/VEG DSH
##  3 FROZEN FRUITS           
##  4 SEAFOOD - FROZEN        
##  5 FROZEN PIZZA            
##  6 FROZEN MEAT             
##  7 FROZEN MEAT/MEAT DINNERS
##  8 FROZEN BREAKFAST FOODS  
##  9 FROZEN JCE CONC/DRNKS   
## 10 FROZEN PIE/DESSERTS     
## 11 FROZEN BREAD/DOUGH      
## 12 FROZEN NOVELTIES/WTR ICE
## 13 FROZEN POTATOES         
## 14 FROZEN                  
## 15 FROZEN CHICKEN          
## 16 FROZEN - BOXED(GROCERY) 
## 17 FROZEN SEAFOOD          
## 18 FROZEN PACKAGE MEAT
```
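
When several abbreviations need fixing at once, `str_replace_all()` also accepts a named character vector of `pattern = replacement` pairs. The sketch below is illustrative only: the expansions of "JCE" and "DRNKS" are assumptions about what the abbreviations mean, not something stated in the data.

```r
library(stringr)

x <- c("FRZN JCE CONC/DRNKS", "FRZN MEAT/MEAT DINNERS")

# names are the patterns, values are the replacements (assumed expansions)
fixes <- c("FRZN" = "FROZEN", "JCE" = "JUICE", "DRNKS" = "DRINKS")

str_replace_all(x, fixes)
## [1] "FROZEN JUICE CONC/DRINKS" "FROZEN MEAT/MEAT DINNERS"
```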


---

# Regular expressions <a href="https://stringr.tidyverse.org/"><img src="images/stringr.png" class="stringr-hex", align="right"></a>

- What we have been doing is matching very simple .blue[___regular expressions___]

- Regular expressions (regex) provide a concise way to identify patterns within character strings

- We cover this more in-depth in the Intermediate R course

```r
# all products that start with "FROZEN", contain "FRZN", or end with the word "ICE"
products %>%
  filter(str_detect(product_category, regex("(^FROZEN|FRZN|\\bICE$)"))) %>%
  distinct(product_category)
## # A tibble: 17 x 1
##    product_category       
##    <chr>                  
##  1 FRZN ICE               
##  2 FRZN VEGETABLE/VEG DSH 
##  3 FRZN FRUITS            
##  4 FROZEN PIZZA           
##  5 FROZEN MEAT            
##  6 FRZN MEAT/MEAT DINNERS 
##  7 FRZN BREAKFAST FOODS   
##  8 FRZN JCE CONC/DRNKS    
##  9 FROZEN PIE/DESSERTS    
## 10 FROZEN BREAD/DOUGH     
## 11 FRZN NOVELTIES/WTR ICE 
## 12 FRZN POTATOES          
## 13 FROZEN                 
## 14 FROZEN CHICKEN         
## 15 FROZEN - BOXED(GROCERY)
## 16 FRZN SEAFOOD           
## 17 FROZEN PACKAGE MEAT
```
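
To see what each piece of such a pattern does, here is a minimal sketch on a small hypothetical vector (the values "RICE" and "ICE CREAM" are made up to show the edge cases). `^` anchors the start of the string, `$` anchors the end, and `\\b` marks a word boundary.

```r
library(stringr)

# small hypothetical vector to test the anchors against
x <- c("FRZN ICE", "SEAFOOD - FROZEN", "FROZEN PIZZA", "RICE", "ICE CREAM")

# starts with "FROZEN"
str_detect(x, "^FROZEN")
## [1] FALSE FALSE  TRUE FALSE FALSE

# ends with "FROZEN"
str_detect(x, "FROZEN$")
## [1] FALSE  TRUE FALSE FALSE FALSE

# ends with the letters "ICE" (also matches "RICE")
str_detect(x, "ICE$")
## [1]  TRUE FALSE FALSE  TRUE FALSE

# ends with the whole word "ICE"
str_detect(x, "\\bICE$")
## [1]  TRUE FALSE FALSE FALSE FALSE
```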

---
class: yourturn
# Your Turn!

### Challenge

Using the __products__ data and `product_category` variable

1. How many product categories contain "SEAFOOD"?
2. What proportion of product categories contain "SEAFOOD"?

__Hint:__

```r
products %>%
  distinct(product_category) %>%
  mutate(seafood = _____) %>%
  summarise(
    count = _____,
    prop  = _____
    )
```


### Solution

```r
products %>%
  distinct(product_category) %>%
  mutate(seafood = str_detect(product_category, pattern = "SEAFOOD")) %>%
  summarise(
    count = sum(seafood, na.rm = TRUE),
    prop  = mean(seafood, na.rm = TRUE)
    )
## # A tibble: 1 x 2
##   count   prop
##   <int>  <dbl>
## 1     6 0.0195
```


---

<br><br><br>
.font300.bold.grey[Factors]

---

# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

<br>

* Factors are a useful data structure, particularly for modeling and visualization, because they control the order of levels

* Working with factors in base R can be a little frustrating because of syntax inconsistencies and a handful of missing tools

* The [__forcats__](https://forcats.tidyverse.org/index.html) package allows you to modify factors with minimal pain


---

# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Factors are R’s representation of categorical data. A factor consists of:

- A set of discrete values
- An ordered set of valid levels

```r
eyes <- factor(x = c("blue", "green", "green"), levels = c("blue", "brown", "green"))
eyes
## [1] blue  green green
## Levels: blue brown green
```

Stored as an integer vector with a levels attribute

```r
unclass(eyes)
## [1] 1 3 3
## attr(,"levels")
## [1] "blue"  "brown" "green"
```
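
Because a factor stores integer codes that point at its levels, operations such as sorting and tabulating follow the level order rather than alphabetical order. A minimal base R sketch, using a made-up `sizes` vector:

```r
# hypothetical categorical values with an explicit level order
sizes <- factor(
  c("medium", "small", "large", "small"),
  levels = c("small", "medium", "large")
)

# integer codes index into the levels
as.integer(sizes)
## [1] 2 1 3 1

# sorting the factor follows the level order
as.character(sort(sizes))
## [1] "small"  "small"  "medium" "large"

# sorting the plain strings is alphabetical
sort(as.character(sizes))
## [1] "large"  "medium" "small"  "small"
```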

---

# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Categorical variables can have levels that are ordered, unordered, collapsible, etc.  Consider:

```r
households %>% distinct(marital)
## # A tibble: 4 x 1
##   marital
##   <chr>  
## 1 Unknown
## 2 Married
## 3 null   
## 4 Single
```


```r
households %>% distinct(income_range)
## # A tibble: 7 x 1
##   income_range
##   <chr>       
## 1 35-49K      
## 2 50-74K      
## 3 75-99K      
## 4 UNDER 35K   
## 5 150K+       
## 6 100-150K    
## 7 null
```


<br><br>
.center[.content-box-gray[.bold[Why would we care about changing these levels? 🤔]]]

---
# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Often, we want to adjust the categories or the ordering of categories for a categorical variable.  Consider:

```r
ggplot(households, aes(marital)) +
  geom_bar()
```


```r
ggplot(households, aes(income_range)) +
  geom_bar()
```


---
# Most useful skills... <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

<br>

- Reorder the levels

- Recode the levels

- Collapse levels


<br>

All __forcats__ functions start with .grey[`fct_`]

- .grey[`fct_`].blue[`relevel()`]
- .grey[`fct_`].blue[`recode()`]
- .grey[`fct_`].blue[`collapse()`]
- .grey[`fct_`].blue[`unique()`]


---
# Reorder the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

We can .blue[reorder] factor levels with .blue[`fct_relevel()`]

```r
households <- households %>% 
* mutate(income_range = fct_relevel(income_range, "UNDER 35K", "35-49K", "50-74K", "75-99K", "100-150K", "150K+", "null"))

households %>% count(income_range)
## # A tibble: 7 x 2
##   income_range     n
##   <fct>        <int>
## 1 UNDER 35K      790
## 2 35-49K         876
## 3 50-74K         947
## 4 75-99K         594
## 5 100-150K       581
## 6 150K+          347
## 7 null           865
```


```r
ggplot(households, aes(income_range)) +
  geom_bar()
```
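
The same idea on a toy vector makes the effect on the levels easier to see. This is a sketch with a hypothetical `income` vector: `fct_relevel()` changes only the order of the levels, not the values, and `after = Inf` moves a single level to the end without listing the others.

```r
library(forcats)

# hypothetical income bands
income <- factor(c("35-49K", "UNDER 35K", "150K+", "35-49K", "null"))

# list the levels in the order we want
ordered_income <- fct_relevel(income, "UNDER 35K", "35-49K", "150K+", "null")
levels(ordered_income)
## [1] "UNDER 35K" "35-49K"    "150K+"     "null"

# the underlying values are unchanged
identical(as.character(ordered_income), as.character(income))
## [1] TRUE

# move one level to the end, leaving the rest as they are
null_last <- fct_relevel(income, "null", after = Inf)
tail(levels(null_last), 1)
## [1] "null"
```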


---
# Recode the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

We can .blue[recode] factor levels with .blue[`fct_recode()`]

```r
households <- households %>%
* mutate(income_range = fct_recode(income_range, Unknown = "null"))
```

```r
ggplot(households, aes(income_range)) +
  geom_bar()
```


---
# Collapse the levels <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

We can .blue[collapse] factor levels with .blue[`fct_collapse()`]

```r
households <- households %>%
  mutate(
*   marital = fct_collapse(marital, Unknown = c("null", "Unknown")),
    marital = fct_relevel(marital, "Unknown", after = Inf)
    )

households %>% count(marital)
## # A tibble: 3 x 2
##   marital     n
##   <fct>   <int>
## 1 Married  2405
## 2 Single   1388
## 3 Unknown  1207
```
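
A toy version of the same two steps, using a hypothetical `marital` vector rather than the households data, shows how the level count drops and how the counts for the merged levels are combined:

```r
library(forcats)

# hypothetical values with two spellings of "missing"
marital <- factor(c("Married", "null", "Single", "Unknown", "Married"))
nlevels(marital)
## [1] 4

# merge "null" and "Unknown" into one level, then move it to the end
collapsed <- fct_collapse(marital, Unknown = c("null", "Unknown"))
collapsed <- fct_relevel(collapsed, "Unknown", after = Inf)

levels(collapsed)
## [1] "Married" "Single"  "Unknown"

# the merged level now holds both original observations
table(collapsed)[["Unknown"]]
## [1] 2
```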


```r
ggplot(households, aes(marital)) +
  geom_bar()
```


---

# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

Sometimes you just want to .blue[reorder a factor for plotting purposes] rather than to permanently change the factor.  .blue[`fct_infreq()`], .blue[`fct_rev()`], and .blue[`fct_reorder()`] can be helpful.

```r
households %>%
  mutate(homeowner = fct_collapse(homeowner, Unknown = c("Unknown", "null"))) %>%
* ggplot(aes(fct_infreq(homeowner))) +
  geom_bar()
```


```r
households %>%
  mutate(homeowner = fct_collapse(homeowner, Unknown = c("Unknown", "null"))) %>%
* ggplot(aes(fct_rev(homeowner))) +
  geom_bar()
```


---

# Working with factors <a href="https://forcats.tidyverse.org/"><img src="images/forcats.png" class="forcats-hex", align="right"></a>

```r
prod_count <- products %>% 
  count(department) %>%
  drop_na()

prod_count
## # A tibble: 32 x 2
##    department          n
##    <chr>           <int>
##  1 AUTOMOTIVE          2
##  2 CHARITABLE CONT     2
##  3 CHEF SHOPPE        14
##  4 CNTRL/STORE SUP     4
##  5 COSMETICS        3011
##  6 COUPON             39
##  7 DELI             2359
##  8 DRUG GM         31540
##  9 ELECT &PLUMBING     1
## 10 FLORAL            938
## # ... with 22 more rows
```


```r
*ggplot(prod_count, aes(n, fct_reorder(department, n))) +
  geom_point()
```
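
Since the plots only show the result, a sketch that prints the levels may help clarify what each helper does. The `homeowner` vector is hypothetical; the three department counts are copied from the `prod_count` output above.

```r
library(forcats)

# hypothetical homeowner values
homeowner <- factor(c("Renter", "Homeowner", "Homeowner", "Unknown", "Homeowner", "Renter"))

# order levels by how often they occur (most frequent first)
levels(fct_infreq(homeowner))
## [1] "Homeowner" "Renter"    "Unknown"

# reverse whatever the current level order is
levels(fct_rev(fct_infreq(homeowner)))
## [1] "Unknown"   "Renter"    "Homeowner"

# order levels by the values of another variable (ascending)
department <- c("DELI", "FLORAL", "COUPON")
n          <- c(2359, 938, 39)
levels(fct_reorder(department, n))
## [1] "COUPON" "FLORAL" "DELI"
```

None of these calls modify `homeowner` or `department`; they return a reordered copy, which is why they are convenient inside `aes()`.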


---
class: yourturn
# Your Turn!

### Challenge

Using the __households__ data

1. Recode the `hh_size` factor so that "null" is now "Unknown"

2. Relevel the `hh_size` factor so that "Unknown" is at the end

3. Use a bar chart to illustrate the distribution of `hh_size` in our data


### Solution

```r
households %>% 
  mutate(
    hh_size = fct_recode(hh_size, Unknown = "null"),
    hh_size = fct_relevel(hh_size, "Unknown", after = Inf)
    ) %>%
  ggplot(aes(hh_size)) +
  geom_bar()
```


---

# Dates & Times

---

# Working with dates & times <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

<br>

* Dates come in many different forms:
   - 2017/02/03
   - February 3, 2017
   - 03-Feb-2017

* Working with dates in R can be a bit convoluted and cumbersome

* The [__lubridate__](https://lubridate.tidyverse.org/index.html) package allows us to easily handle/manipulate date-time variables


---

# Most useful skills... <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

- Creating dates/times (i.e. parsing)

- Access and change parts of a date

- .opacity20[Deal with time zones]

- .opacity20[Do math with instants and time spans]


---

# Creating dates/times <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

__lubridate__ has a series of parsing functions that will .blue[create dates] based on the existence and order of the date-time components

.font130[
- `ymd_hms()`, `ymd_hm()`, `ymd_h()`, `ymd()`
- `ydm_hms()`, `ydm_hm()`, `ydm_h()`, `ydm()`
- `dmy_hms()`, `dmy_hm()`, `dmy_h()`, `dmy()`
- `mdy_hms()`, `mdy_hm()`, `mdy_h()`, `mdy()`
- and more!

]

```r
# year, month, day
ymd("2018-12-02")
## [1] "2018-12-02"

# year, month, day, hour
ymd_h("2018-12-02 01")
## [1] "2018-12-02 01:00:00 UTC"

# year, month, day, hour, minute, second
ymd_hms("2018-12-02 01:31:27")
## [1] "2018-12-02 01:31:27 UTC"
```

and __lubridate__ handles a variety of separators and formats, as long as the components appear in the order the function name implies

```r
ymd("2018-12-02")
## [1] "2018-12-02"
ymd("2018/12/02")
## [1] "2018-12-02"
mdy("February 02, 2018")
## [1] "2018-02-02"
```
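
As a sketch of choosing the parser by component order, the three date forms listed earlier (2017/02/03, February 3, 2017, and 03-Feb-2017) all parse to the same `Date` once the matching function is used. Parsing month names assumes an English locale.

```r
library(lubridate)

# same calendar date written three ways; pick the parser by component order
d1 <- ymd("2017/02/03")        # year, month, day
d2 <- mdy("February 3, 2017")  # month, day, year
d3 <- dmy("03-Feb-2017")       # day, month, year

d1
## [1] "2017-02-03"

# all three are Date objects holding the same value
class(d1)
## [1] "Date"
d1 == d2 & d2 == d3
## [1] TRUE
```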


---
# Accessing components <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

__lubridate__ has a series of functions to .blue[extract components] of dates


```r
# get year
year("2018-12-02 01:31:27")
## [1] 2018

# get quarter
quarter("2018-12-02 01:31:27")
## [1] 4

# get month
month("2018-12-02 01:31:27", label = TRUE)
## [1] Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

# get weekday
wday("2018-12-02 01:31:27", label = TRUE, abbr = FALSE)
## [1] Sunday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
```
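
The accessors also work on parsed date-times, and most of them have a setter form for changing a component. A minimal sketch, where the variable names `ts` and `shifted` are chosen only for this example:

```r
library(lubridate)

# parse once, then pull out the pieces
ts <- ymd_hms("2018-12-02 01:31:27")

hour(ts)
## [1] 1
mday(ts)   # day of the month
## [1] 2
yday(ts)   # day of the year
## [1] 336

# change a component with the setter form of the accessor
shifted <- ts
year(shifted) <- 2019
shifted
## [1] "2019-12-02 01:31:27 UTC"
```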


---

# Accessing components <a href="https://lubridate.tidyverse.org/"><img src="images/lubridate.png" class="lubridate-hex", align="right"></a>

- Just like __stringr__, we can use the __lubridate__ functions inside `filter()` and `mutate()`

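Use `filter()` to get all transactions that occur on a Friday or Saturday. By default `wday()` numbers Sunday as 1 and Saturday as 7, so `6:7` selects Friday and Saturday; a Saturday/Sunday weekend would be `c(1, 7)`.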

```r
transactions %>%
* filter(wday(transaction_timestamp) %in% 6:7) %>%
  select(basket_id, transaction_timestamp)
## # A tibble: 442,370 x 2
##    basket_id   transaction_timestamp
##    <chr>       <dttm>               
##  1 31269173695 2017-01-06 00:15:57  
##  2 31269173695 2017-01-06 00:15:57  
##  3 31269173696 2017-01-06 00:16:57  
##  4 31268902157 2017-01-06 00:40:34  
##  5 31268902157 2017-01-06 00:40:34  
##  6 31268902157 2017-01-06 00:40:34  
##  7 31268902157 2017-01-06 00:40:34  
##  8 31268902157 2017-01-06 00:40:34  
##  9 31268902157 2017-01-06 00:40:34  
## 10 31268902157 2017-01-06 00:40:34  
## # ... with 442,360 more rows
```
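
The numbers returned by `wday()` depend on which day the week starts on. The sketch below checks three known dates (6 to 8 January 2017 fall on a Friday, Saturday, and Sunday) under the default numbering and with `week_start = 1`, which makes Monday day 1.

```r
library(lubridate)

# Friday, Saturday, Sunday
d <- ymd(c("2017-01-06", "2017-01-07", "2017-01-08"))

# default: Sunday = 1, ..., Saturday = 7
wday(d)
## [1] 6 7 1

# Monday = 1, ..., Sunday = 7
wday(d, week_start = 1)
## [1] 5 6 7

# two equivalent ways to flag Saturday and Sunday
wday(d) %in% c(1, 7)
## [1] FALSE  TRUE  TRUE
wday(d, week_start = 1) %in% 6:7
## [1] FALSE  TRUE  TRUE
```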


Use `mutate()` to create a new variable for the day of week

```r
transactions %>%
* mutate(weekday = wday(transaction_timestamp, label = TRUE)) %>%
  select(basket_id, transaction_timestamp, weekday)
## # A tibble: 1,469,307 x 3
##    basket_id   transaction_timestamp weekday
##    <chr>       <dttm>                <ord>  
##  1 31198570044 2017-01-01 06:53:26   Sun    
##  2 31198570047 2017-01-01 07:10:28   Sun    
##  3 31198655051 2017-01-01 07:26:30   Sun    
##  4 31198705046 2017-01-01 07:30:27   Sun    
##  5 31198705046 2017-01-01 07:30:27   Sun    
##  6 31198705046 2017-01-01 07:30:27   Sun    
##  7 31198705046 2017-01-01 07:30:27   Sun    
##  8 31198676055 2017-01-01 07:56:33   Sun    
##  9 31198676055 2017-01-01 07:56:33   Sun    
## 10 31198676055 2017-01-01 07:56:33   Sun    
## # ... with 1,469,297 more rows
```


---
class: yourturn
# Your Turn!

### Challenge

Using the __transactions__ data set

1. Make a bar chart showing the total number of transactions by weekday. Which weekday experiences the most traffic?

2. Make a line chart showing the total daily sales value (`sum(sales_value)`) for each day of the year (hint: use `yday()`).  Is there any obvious trend in the daily total sales value?


### Solution

```r
# 1
transactions %>%
  mutate(weekday = wday(transaction_timestamp, label = TRUE)) %>% 
  ggplot(aes(weekday)) +
  geom_bar()
```


---
class: yourturn
# Your Turn!

### Challenge

Using the __transactions__ data set

1. Make a bar chart showing the total number of transactions by weekday. Which weekday experiences the most traffic?

2. Make a line chart showing the total daily sales value (`sum(sales_value)`) for each day of the year (hint: use `yday()`).  Is there any obvious trend in the daily total sales value?


### Solution

```r
# 2
transactions %>%
  mutate(day = yday(transaction_timestamp)) %>% 
  group_by(day) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  ggplot(aes(x = day, y = total_sales)) +
  geom_line()
```


---

# Questions?

.font120.center[We can do a lot with data types, and the packages reviewed here can do much more than what we covered; however, this gets you started with manipulating data types for analytic purposes.]