Relational Data

.font300.bold[Relational Data]

---

# Joining data is part of...

---

# What is relational data

_.font120[“It’s rare that a data analysis involves only a single table of data. .blue[Typically you have many tables of data, and you must combine them] to answer the questions that you’re interested in.”_

--- Garrett Grolemund

]

---

# Types of joins

To work with relational data you need join operations that work with pairs of tables.  There are two families of verbs designed to work with relational data:

* __Mutating joins__: add new variables to one data frame by matching observations in another.

* __Filtering joins__: filter observations from one data frame based on whether or not they match an observation in the other data frame.


---

# Prerequisites

### Packages

```r
library(tidyverse) # or library(dplyr)
```


### Example data

```r
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
```


---

# Prerequisites

### Exercise data

```r
# fread() is fast for large CSVs; data.table = FALSE returns a plain data frame
transactions <- data.table::fread("data/transactions.csv", data.table = FALSE) %>% as_tibble()
products     <- data.table::fread("data/products.csv", data.table = FALSE) %>% as_tibble()
households   <- data.table::fread("data/households.csv", data.table = FALSE) %>% as_tibble()
```


### Exercise data connections


---

# Keys <span><i class="fas  fa-key faa-FALSE animated " style=" color:red;"></i></span>

- __keys__ are variables that connect pairs of tables
- A primary key uniquely identifies an observation in its own table
- A foreign key uniquely identifies an observation in another table
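
One way to check that a variable really behaves as a primary key is to count its values and look for any that appear more than once. The sketch below redefines the example table `x` so it runs on its own; `dup_keys` is only an illustrative name.

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)

# Count each key value; a primary key should never appear more than once
dup_keys <- x %>%
  count(key) %>%
  filter(n > 1)

# TRUE when `key` uniquely identifies every row of x
nrow(dup_keys) == 0
```

If `dup_keys` has any rows, the variable does not uniquely identify observations and is not a primary key for that table.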

--

.center.font120.bold[Example data keys]


.center.font120.bold[Exercise data keys]


---
class: clear, center, middle

.font300.bold[Mutating Joins]

---

# Inner join

<br>
.font120[

- Simplest type of join
- Keeps all observations where key values match
- Discards observations that don't match
- Adds variables from y to x
]

```r
x %>% inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
```


---

# Outer joins

<br>
.font120[

- Outer joins keep ___all___ observations that appear in at least one of the tables
- There are 3 types of outer joins:
]

.center.font120[.blue.bold[left join]: keeps all observations in x]

```r
x %>% left_join(y, by = "key")
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA>
```


<br>

---

# Outer joins

<br>
.font120[

- Outer joins keep ___all___ observations that appear in at least one of the tables
- There are 3 types of outer joins:
]

.center.font120[.blue.bold[right join]: keeps all observations in y]

```r
x %>% right_join(y, by = "key")
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     4 <NA>  y3
```


<br>

---

# Outer joins

<br>
.font120[

- Outer joins keep ___all___ observations that appear in at least one of the tables
- There are 3 types of outer joins:
]

.center.font120[.blue.bold[full join]: keeps all observations in x & y]

```r
x %>% full_join(y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA> 
## 4     4 <NA>  y3
```


<br>

---

# Outer joins

<br>
.font120[

- Outer joins keep ___all___ observations that appear in at least one of the tables
- There are 3 types of outer joins:
]

<br>
.font120[
- left join: keeps all observations in x
- right join: keeps all observations in y
- full join: keeps all observations in x & y
] 
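
A quick way to see how the three outer joins differ is to compare how many rows each one returns. This is a minimal sketch using the example tables from this deck, redefined here so the block is self-contained; `join_rows` is only an illustrative name.

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Number of rows each outer join returns
join_rows <- c(
  left  = nrow(left_join(x, y, by = "key")),   # every row of x
  right = nrow(right_join(x, y, by = "key")),  # every row of y
  full  = nrow(full_join(x, y, by = "key"))    # every row of x and y
)
join_rows
```

With these tables the left and right joins each return 3 rows, while the full join returns 4 because keys 3 and 4 each appear in only one table.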

---
class: yourturn

# Your Turn!

### Challenge

1. Join the transactions and products data using `inner_join()`. The join key is the `product_num` variable.

2. Join the transactions, products, and households data using two `inner_join()`s. The join key between transactions and products is the `product_num` variable and the join key between transactions and households is the `hshd_num` variable.


### Solution

```r
# 1
trans_prod <- transactions %>% inner_join(products, by = "product_num")

# 2
combined <- transactions %>% 
  inner_join(products, by = "product_num") %>%
  inner_join(households, by = "hshd_num")

head(combined)
## # A tibble: 6 x 21
##   basket_num hshd_num purchase_ product_num spend units store_r week_num
##        <int>    <int> <chr>           <int> <dbl> <int> <chr>      <int>
## 1     100369     3708 09-DEC-17       93466  3.18     2 SOUTH        101
## 2     891779      719 20-SEP-17       85201  3.49     1 CENTRAL       90
## 3     609562     4995 07-MAR-17     2507006  0.89     1 CENTRAL       62
## 4     760220       44 19-JUN-17     4819172  8.99     1 SOUTH         77
## 5     869525     3937 04-SEP-17     1055355  1        1 SOUTH         88
## 6     922989     2356 13-OCT-17     4285485  2.87     1 WEST          93
## # ... with 13 more variables: year <int>, department <chr>,
## #   commodity <chr>, brand_ty <chr>, x5 <chr>, l <chr>, age_range <chr>,
## #   marital <chr>, income_range <chr>, homeowner <chr>,
## #   hshd_composition <chr>, hh_size <chr>, children <chr>
```


---
class: clear, center, middle

.font300.bold[Filtering Joins]

---

# Filtering joins

<br>
.font120[

* Filtering joins affect the observations rather than adding variables

* Use them to filter one data set based on whether its key values match those in another data set

* There are 2 types of filtering joins:

   - .bold.font120[`semi_join()`]
   - .bold.font120[`anti_join()`]
   
]

---

# Filtering joins

```r
x %>% semi_join(y, by = "key")
## # A tibble: 2 x 2
##     key val_x
##   <dbl> <chr>
## 1     1 x1   
## 2     2 x2
```


```r
x %>% anti_join(y, by = "key")
## # A tibble: 1 x 2
##     key val_x
##   <dbl> <chr>
## 1     3 x3
```
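
With a single key column, the two filtering joins can be thought of as `filter()` calls with `%in%`. The sketch below checks that equivalence on the example tables, redefined here so it runs on its own; `semi_filter` and `anti_filter` are illustrative names.

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# semi_join(): keep rows of x whose key appears in y
semi_filter <- x %>% filter(key %in% y$key)

# anti_join(): keep rows of x whose key does not appear in y
anti_filter <- x %>% filter(!key %in% y$key)

# Both comparisons return TRUE for these tables
identical(semi_filter$key, semi_join(x, y, by = "key")$key)
identical(anti_filter$key, anti_join(x, y, by = "key")$key)
```

The `filter()` form gets awkward once the match involves more than one key column, which is where the filtering joins tend to be the more convenient choice.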


---
class: yourturn

# Your Turn!

### Challenge

1. Of the 5000 households in our __households__ data, how many do we have transaction data for?

2. Of the 151,141 products in our __products__ data, how many are not represented in our __transactions__ data?

<br>

### Solution

```r
# 1
households %>% 
* semi_join(transactions) %>%
  tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4509

# 2
products %>%
* anti_join(transactions) %>%
  tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 66247
```


---

# Quick tip: defining keys

.center.font130[What if our key names don’t match?]

.center.font120.bold[x]


.center.font120.bold[y]


<br>

```r
x %>% inner_join(y, by = c("key1" = "key2"))
## # A tibble: 2 x 3
##    key1 val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2
```
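
To try this out, you need versions of `x` and `y` whose key columns have different names. The tables below are an assumption: they are the earlier example data with `key` renamed to `key1` and `key2`. `joined` is an illustrative name.

```r
library(dplyr)
library(tibble)

# Assumed example tables: same values as before, but differently named keys
x <- tribble(
  ~key1, ~val_x,
      1, "x1",
      2, "x2",
      3, "x3"
)
y <- tribble(
  ~key2, ~val_y,
      1, "y1",
      2, "y2",
      4, "y3"
)

# The named vector reads as "key1 in x matches key2 in y"
joined <- x %>% inner_join(y, by = c("key1" = "key2"))

# The result keeps the key name from x
names(joined)
```

The key column in the result takes its name from the left-hand table, so `key2` does not appear in the result.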

---

# Things to remember

<br>

* .bold[mutating joins]: add new variables to one data frame by matching key values in another. Includes `inner_join`, `left_join`, `right_join`, `full_join`

* .bold[filtering joins]: filter observations from one data frame based on whether or not they match a key value in the other table. Includes `semi_join` and `anti_join`

---
class: clear, center, middle

.font300.bold.white[One last challenge!]

---
class: yourturn

# Your Turn!

### Challenge

Compute the total `spend` by `commodity` for household (`hshd_num`) 3708.  See if you can plot the results in rank order.

### Steps:

```r
households %>%
  filter(______) %>%               # filter for hshd_num 3708
  inner_join(______) %>%           # inner join w/transactions
  inner_join(______) %>%           # inner join w/products
  group_by(______) %>%             # group by commodity
  summarize(total = ______) %>%    # compute total spend
  ggplot(aes(______, ______)) +    # plot total spend vs. commodity
  geom_point()
```

---
class: yourturn

# Your Turn!

### Challenge

Compute the total `spend` by `commodity` for household (`hshd_num`) 3708.  See if you can plot the results in rank order.

### Solution:

```r
households %>%
  filter(hshd_num == 3708) %>%
  inner_join(transactions) %>%
  inner_join(products) %>%
  group_by(commodity) %>%
  summarize(total = sum(spend, na.rm = TRUE)) %>%
  ggplot(aes(total, reorder(commodity, total))) +
  geom_point()
```


---

# Questions?

<br>