class: clear, center, middle background-image: url(images/relational-nycflights.png) background-size: cover .font300.bold[Relational Data] --- # Joining data is part of... <br><br> <img src="images/transform-task.png" width="2552" style="display: block; margin: auto;" /> --- # What is relational data .pull-left[ <br> _.font120[“It’s rare that a data analysis involves only a single table of data. .blue[Typically you have many tables of data, and you must combine them] to answer the questions that you’re interested in.”_ --- Garrett Grolemund ] ] .pull-right[ <br> <img src="images/relational-nycflights.png" width="1056" style="display: block; margin: auto;" /> ] --- # Types of joins To work with relational data you need join operations that work with pairs of tables. There are two families of verbs designed to work with relational data: .pull-left[ * __Mutating joins__: add new variables to one data frame by matching observations in another. * __Filter joins__: filter observations from one data frame based on whether or not they match an observation in the other table. ] .pull-right[ <img src="images/dplyr.png" width="60%" height="60%" style="display: block; margin: auto;" /> ] --- # Prerequisites .pull-left[ ### Packages ```r library(tidyverse) # or library(dplyr) ``` ] .pull-right[ ### Example data ```r x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` <img src="images/join-setup.png" width="35%" style="display: block; margin: auto;" /> ] --- # Prerequisites .pull-left[ ### Exercise data ```r transactions <- data.table::fread("data/transactions.csv", data.table = F) %>% as_tibble() products <- data.table::fread("data/products.csv", data.table = F) %>% as_tibble() households <- data.table::fread("data/households.csv", data.table = F) %>% as_tibble() ``` ] .pull-right[ ### Exercise data connections <img src="images/customer-data-relations.png" width="3421" style="display: block; margin: auto;" /> ] --- # Keys
<i class="fas fa-key faa-FALSE animated " style=" color:red;"></i>
.font120[ - __keys__ are variables that connect pairs of tables - A primary key uniquely identifies an observation in its own table - A foreign key uniquely identifies an observation in another table ] -- .pull-left[ .center.font120.bold[Example data keys] <img src="images/example-data-keys.png" width="60%" style="display: block; margin: auto;" /> ] .pull-left[ .center.font120.bold[Exercise data keys] <img src="images/customer-data-relations.png" width="3421" style="display: block; margin: auto;" /> ] --- class: clear, center, middle .font300.bold[Mutating Joins] --- # Inner join <br> .font120[ - Simplest type of join - Keeps all observations where key values match - Discards observatoins that don't match - Add variables from y to x ] -- .pull-left[ ```r x %>% inner_join(y, by = "key") ## # A tibble: 2 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] .pull-right[ <img src="images/join-inner.png" width="901" style="display: block; margin: auto;" /> ] --- # Outer joins <br> .font120[ - Outer joins keep ___all___ observations that appear in at least one of the tables - There are 3 types of outer joins: ] -- .pull-left[ .center.font120[.blue.bold[left join]: keeps all observations in x] ```r x %>% left_join(y, by = "key") ## # A tibble: 3 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] .pull-right[ <img src="images/join-left.png" width="1029" style="display: block; margin: auto;" /> ] <br> .center[.content-box-grey[.bold[Note how missing values get filled in with `NA`]]] --- # Outer joins <br> .font120[ - Outer joins keep ___all___ observations that appear in at least one of the tables - There are 3 types of outer joins: ] .pull-left[ .center.font120[.blue.bold[right join]: keeps all observations in y] ```r x %>% right_join(y, by = "key") ## # A tibble: 3 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y3 ``` ] .pull-right[ <img src="images/join-right.png" width="1048" style="display: block; margin: auto;" /> ] <br> .center[.content-box-grey[.bold[Note how missing values get filled in with `NA`]]] --- # Outer joins <br> .font120[ - Outer joins keep ___all___ observations that appear in at least one of the tables - There are 3 types of outer joins: ] .pull-left[ .center.font120[.blue.bold[full join]: keeps all observations in x & y] ```r x %>% full_join(y, by = "key") ## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y3 ``` ] .pull-right[ <img src="images/join-full.png" width="1045" style="display: block; margin: auto;" /> ] <br> .center[.content-box-grey[.bold[Note how missing values get filled in with `NA`]]] --- # Outer joins <br> .font120[ - Outer joins keep ___all___ observations that appear in at least one of the tables - There are 3 types of outer joins: ] .pull-left[ <br> .font120[ - left join: keeps all observations in x - right join: keeps all observations in y - full join: keeps all observations in x & y ] ] .pull-right[ <img src="images/join-outer.png" width="50%" style="display: block; margin: auto;" /> ] --- class: yourturn # Your Turn! .pull-left[ ### Challenge 1. Join the transactions and products data using `inner_join()`. The join key is the `product_num` variable. 2. Join the transactions, products, and households data using two `inner_join()`s. The join key between transactions and products is the `product_num` variable and the join key between transactions and housholds is the `hshd_num` variable. ] -- .pull-right[ ### Solution .code70[ ```r # 1 trans_prod <- transactions %>% inner_join(products, by = "product_num") # 2 combined <- transactions %>% inner_join(products, by = "product_num") %>% inner_join(households, by = "hshd_num") head(combined) ## # A tibble: 6 x 21 ## basket_num hshd_num purchase_ product_num spend units store_r week_num ## <int> <int> <chr> <int> <dbl> <int> <chr> <int> ## 1 100369 3708 09-DEC-17 93466 3.18 2 SOUTH 101 ## 2 891779 719 20-SEP-17 85201 3.49 1 CENTRAL 90 ## 3 609562 4995 07-MAR-17 2507006 0.89 1 CENTRAL 62 ## 4 760220 44 19-JUN-17 4819172 8.99 1 SOUTH 77 ## 5 869525 3937 04-SEP-17 1055355 1 1 SOUTH 88 ## 6 922989 2356 13-OCT-17 4285485 2.87 1 WEST 93 ## # ... with 13 more variables: year <int>, department <chr>, ## # commodity <chr>, brand_ty <chr>, x5 <chr>, l <chr>, age_range <chr>, ## # marital <chr>, income_range <chr>, homeowner <chr>, ## # hshd_composition <chr>, hh_size <chr>, children <chr> ``` ] ] --- class: clear, center, middle .font300.bold[Filtering Joins] --- # Filtering joins <br> .font120[ * Filtering joins affect the observations rather than adding variables * Use when wanting to filter one data set based on foreign key variables in another data set * There are 2 types of filtering joins: - .bold.font120[`semi_join()`] - .bold.font120[`anti_join()`] ] --- # Filtering joins .pull-left[ .font110[.bold.blue[Semi joins] keep all observations in x that have a match in y] ```r x %>% semi_join(y, by = "key") ## # A tibble: 2 x 2 ## key val_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` <img src="images/join-semi.png" width="80%" style="display: block; margin: auto;" /> ] -- .pull-right[ .font110[.bold.blue[Anti joins] keep all observations that don't have a match] ```r x %>% anti_join(y, by = "key") ## # A tibble: 1 x 2 ## key val_x ## <dbl> <chr> ## 1 3 x3 ``` <img src="images/join-anti.png" width="80%" style="display: block; margin: auto;" /> ] --- class: yourturn # Your Turn! .pull-left[ ### Challenge 1. Of the 5000 households in our __households__ data, how many do we transaction data for? 2. Of the 151,141 products in our __products__ data, how many are not represented in our __transactions__ data? <br> .center[_Hint: after you use the right join functions, pipe the result into `tally()`_] ] -- .pull-right[ ### Solution ```r # 1 households %>% * semi_join(transactions) %>% tally() ## # A tibble: 1 x 1 ## n ## <int> ## 1 4509 # 2 products %>% * anti_join(transactions) %>% tally() ## # A tibble: 1 x 1 ## n ## <int> ## 1 66247 ``` ] --- # Quick tip: defining keys .center.font130[What if our key names don’t match?] .pull-left[ .center.font120.bold[x] <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> key1 </th> <th style="text-align:center;"> val_x </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;width: 1in; "> 1 </td> <td style="text-align:center;width: 1in; "> x1 </td> </tr> <tr> <td style="text-align:center;width: 1in; "> 2 </td> <td style="text-align:center;width: 1in; "> x2 </td> </tr> <tr> <td style="text-align:center;width: 1in; "> 3 </td> <td style="text-align:center;width: 1in; "> x3 </td> </tr> </tbody> </table> ] .pull-right[ .center.font120.bold[y] <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> key2 </th> <th style="text-align:center;"> val_y </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;width: 1in; "> 1 </td> <td style="text-align:center;width: 1in; "> y1 </td> </tr> <tr> <td style="text-align:center;width: 1in; "> 2 </td> <td style="text-align:center;width: 1in; "> y2 </td> </tr> <tr> <td style="text-align:center;width: 1in; "> 4 </td> <td style="text-align:center;width: 1in; "> y3 </td> </tr> </tbody> </table> ] <br> -- ```r x %>% inner_join(y, by = c("key1" = "key2")) ## # A tibble: 2 x 3 ## key1 val_x val_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` --- # Things to remember <br> * .bold[mutating joins]: add new variables to one data frame by matching key values in another. Includes `inner_join`, `left_join`, `right_join`, `full_join` * .bold[filtering joins]: filter observations from one data frame based on whether or not they match a key value in the other table. Includes `semi_join` and `anti_join` --- class: clear, center, middle background-image: url(images/your-killing-me-smalls.webp) background-size: cover .font300.bold.white[One last challenge!] --- class: yourturn # Your Turn! ### Challenge Compute the total `spend` by `commodity` for household (`hshd_num`) 3708. See if you can plot the results in rank order. ### Steps: ```r households %>% filter(______) %>% # filter for hshd_num 3708 inner_join(______) %>% # inner join w/transactions inner_join(______) %>% # inner join w/products group_by(______) %>% # group by commodity summarize(total = ______) %>% # compute total spend ggplot(aes(______, ______)) + # plot total spend vs. commodity geom_point() ``` --- class: yourturn # Your Turn! ### Challenge Compute the total `spend` by `commodity` for household (`hshd_num`) 3708. See if you can plot the results in rank order. ### Solution: .pull-left[ ```r households %>% filter(hshd_num == 3708) %>% inner_join(transactions) %>% inner_join(products) %>% group_by(commodity) %>% summarize(total = sum(spend, na.rm = TRUE)) %>% ggplot(aes(total, reorder(commodity, total))) + geom_point() ``` ] .pull-right[ <img src="day-2d-joins_files/figure-html/final-challenge-plot-1.png" style="display: block; margin: auto;" /> ] --- # Questions? <br> <img src="images/questions.png" width="450" height="450" style="display: block; margin: auto;" />