class: title-slide

<a href="https://github.com/uc-r/Intermediate-R/"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png" alt="Fork me on GitHub"></a>

<br><br><br><br>

# Day 2: Intermediate <i class="fab fa-r-project faa-FALSE animated faa-slow " style=" color:steelblue;"></i>

## .font70[.italic['Success is stumbling from failure to failure with no loss of enthusiasm'] - Winston Churchill]

### Brad Boehmke

### Jan 31 - Feb 1, 2019

---

# Today's schedule <i class="fas fa-calendar-alt faa-FALSE animated " style=" color:red;"></i>
<br>

| Topic | Time |
|:------|:------:|
| Review | 9:00-9:30 |
| Iteration with loops | 9:30-10:30 |
| Break | 10:30-10:45 |
| Iteration with functional programming | 10:45-12:00 |
| Lunch | 12:00-1:00 |
| Writing functions | 1:00-2:30 |
| Break | 2:30-2:45 |
| Case study | 2:45-4:00 |
| Q&A | 4:00-4:30 |

---
class: clear, center, middle
background-image: url(images/review-day2.gif)
background-size: cover

---

# Prereqs

```r
library(tidyverse)
library(nycflights13)
```

---

# Transforming data

What are the normal __dplyr__ functions to perform the following:

* ______: pick observations based on certain conditions
* ______: pick variables of interest
* ______: compute statistical summaries
* ______: perform operations at different levels of your data
* ______: reorder data
* ______: create new variables

---

# Transforming data

What are the normal __dplyr__ functions to perform the following:

* .blue.bold[`filter`]: pick observations based on certain conditions
* .blue.bold[`select`]: pick variables of interest
* .blue.bold[`summarize`]: compute statistical summaries
* .blue.bold[`group_by`]: perform operations at different levels of your data
* .blue.bold[`arrange`]: reorder data
* .blue.bold[`mutate`]: create new variables

---

# Scoped variable transformations

* dplyr scoped variants:
   - ______: execute function(s) on all variables
   - ______: on variables that meet a certain condition
   - ______: for pre-specified variables

* argument functions within scoped variants:
   - ______: specify the variables to be executed on
   - ______: specify the functions to be executed

* helper functions for `filter_*()`
   - ______: filter for rows where all variables meet the specified condition
   - ______: filter for rows where at least one variable meets the specified condition

<br>

.center.bold.italic.red.font120[Spend 2 minutes with your neighbor(s) and fill in the blanks.]
---

# Scoped variable transformations

* dplyr scoped variants:
   - .blue.bold[`*_all()`]: execute function(s) on all variables
   - .blue.bold[`*_if()`]: on variables that meet a certain condition
   - .blue.bold[`*_at()`]: for pre-specified variables

* argument functions within scoped variants:
   - .blue.bold[`vars()`]: specify the variables to be executed on
   - .blue.bold[`funs()`]: specify the functions to be executed

* helper functions for `filter_*()`
   - .blue.bold[`all_vars()`]: filter for rows where all variables meet the specified condition
   - .blue.bold[`any_vars()`]: filter for rows where at least one variable meets the specified condition

---
class: yourturn

# Your Turn!

.pull-left[

### Challenge #1

Use the proper scoped variant of `summarize()` to redo the following more efficiently.

```r
library(tidyverse)
library(nycflights13)

flights %>%
  group_by(month) %>%
  summarize(
    dep_actual = mean(dep_time, na.rm = TRUE),
    dep_sched  = mean(sched_dep_time, na.rm = TRUE),
    dep_delay  = mean(dep_delay, na.rm = TRUE),
    arr_actual = mean(arr_time, na.rm = TRUE),
    arr_sched  = mean(sched_arr_time, na.rm = TRUE),
    arr_delay  = mean(arr_delay, na.rm = TRUE)
  )
```

]

--

.pull-right[

### Solution

```r
flights %>%
  group_by(month) %>%
  summarize_at(vars(matches("dep_|arr_")), mean, na.rm = TRUE)
## # A tibble: 12 x 7
##    month dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
##  1     1    1347.          1341.     10.0     1523.          1548.
##  2     2    1348.          1342.     10.8     1522.          1547.
##  3     3    1359.          1354.     13.2     1510.          1546.
##  4     4    1353.          1351.     13.9     1501.          1537.
##  5     5    1351.          1345.     13.0     1503.          1533.
##  6     6    1351.          1346.     20.8     1468.          1527.
##  7     7    1353.          1347.     21.7     1456.          1521.
##  8     8    1350.          1345.     12.6     1495.          1519.
##  9     9    1334.          1335.      6.72    1504.          1534.
## 10    10    1340.          1336.      6.24    1520.          1539.
## 11    11    1344.          1342.      5.44    1523.          1545.
## 12    12    1357.          1345.     16.6     1505.          1543.
## # … with 1 more variable: arr_delay <dbl>
```

]

---
class: yourturn

# Your Turn!
.pull-left[

### Challenge #2

* Fill in the `mutate_if()` function to standardize the numeric variables.

* To standardize, use this expression: `(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)`

```r
flights %>%
  select(carrier, matches("dep_|arr_")) %>%
  mutate_if(
    .predicate = _______,
    .funs = ________
  )
```

]

--

.pull-right[

### Solution

```r
flights %>%
  select(carrier, matches("dep_|arr_")) %>%
  mutate_if(
    .predicate = is.numeric,
    .funs = funs((. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE))
  )
## # A tibble: 336,776 x 7
##    carrier dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <chr>      <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
##  1 UA         -1.70          -1.77    -0.265   -1.26           -1.44
##  2 UA         -1.67          -1.74    -0.215   -1.22           -1.42
##  3 AA         -1.65          -1.72    -0.265   -1.09           -1.38
##  4 B6         -1.65          -1.71    -0.339   -0.934          -1.03
##  5 DL         -1.63          -1.59    -0.464   -1.29           -1.41
##  6 UA         -1.63          -1.68    -0.414   -1.43           -1.63
##  7 B6         -1.63          -1.59    -0.439   -1.10           -1.37
##  8 EV         -1.62          -1.59    -0.389   -1.49           -1.64
##  9 B6         -1.62          -1.59    -0.389   -1.25           -1.39
## 10 AA         -1.62          -1.59    -0.364   -1.40           -1.59
## # … with 336,766 more rows, and 1 more variable: arr_delay <dbl>
```

]

---
class: yourturn

# Your Turn!

.pull-left[

### Challenge #3

Complete the following to filter out any observation where a "delay" variable contains an `NA`.
```r
flights %>%
  filter_at(vars(contains("delay")), _____(!is.na(.)))
```

]

--

.pull-right[

### Solution

```r
# all_vars() keeps a row only when every "delay" column is non-missing,
# which drops any observation with an NA in either delay variable
flights %>%
  filter_at(vars(contains("delay")), all_vars(!is.na(.)))
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 327,336 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

]

---

# Control statements

.center.bold.italic.red.font120[Spend 2 minutes with your neighbor(s) and fill in the blanks.]

<img src="images/control-statements-quiz.png" width="2220" style="display: block; margin: auto;" />

---

# Control statements

<br>

<img src="images/control-statement-summary.png" width="2233" style="display: block; margin: auto;" />

---
class: yourturn

# Your Turn!

.pull-left[

### Challenge #1

Fill in the following code chunk so that:

- if month has value .blue[1-9] the file name printed out will be .blue[`"data/Month-0X.csv"`]
- if month has value .blue[10-12] the file name printed out will be .blue[`"data/Month-1X.csv"`]
- if month is an .blue[invalid month number] (not 1-12), the result printed out is .blue[`"Invalid month"`]
- test it out for when month equals 6, 10, & 13

]

.pull-right[

### Solution

```r
month <- 4

if(month _____) {
  paste0("data/", "Month-0", month, ".csv")
} _____ if(month _____) {
  paste0("data/", "Month-", month, ".csv")
} else {
  print("_____")
}
```

]

---
class: yourturn

# Your Turn!
.scrollable90[
.pull-left[

### Challenge #1

Fill in the following code chunk so that:

- if month has value .blue[1-9] the file name printed out will be .blue[`"data/Month-0X.csv"`]
- if month has value .blue[10-12] the file name printed out will be .blue[`"data/Month-1X.csv"`]
- if month is an .blue[invalid month number] (not 1-12), the result printed out is .blue[`"Invalid month"`]
- test it out for when month equals 6, 10, & 13

]

.pull-right[

### Solution

```r
month <- 4

if(month %in% 1:9) {
  paste0("data/", "Month-0", month, ".csv")
} else if(month %in% 10:12) {
  paste0("data/", "Month-", month, ".csv")
} else {
  print("Invalid month")
}
## [1] "data/Month-04.csv"

month <- 13

if(month %in% 1:9) {
  paste0("data/", "Month-0", month, ".csv")
} else if(month %in% 10:12) {
  paste0("data/", "Month-", month, ".csv")
} else {
  print("Invalid month")
}
## [1] "Invalid month"
```

]
]

---
class: yourturn

# Your Turn!

.pull-left[

### Challenge #2

Use `ifelse()` or `if_else()` to print .bold[_"greater than or equal to"_] or .bold[_"less than"_] for each element of `x`. Use 0.5 as the threshold.

```r
x <- runif(10)
```

]

--

.pull-right[

### Solution

```r
ifelse(x >= .5, "greater than or equal to", "less than")
## [1] "greater than or equal to" "greater than or equal to"
## [3] "less than"                "greater than or equal to"
## [5] "greater than or equal to" "less than"
## [7] "greater than or equal to" "less than"
## [9] "greater than or equal to" "less than"
```

]

---
class: yourturn

# Your Turn!
.pull-left[

Fill in the blanks below to assign each flight to a severity rating of 1, 2, 3, or 4 based on the arrival delay (`arr_delay`) variable:

* `severity = 1`: if `arr_delay` < 25th percentile
* `severity = 2`: if `arr_delay` < 50th percentile
* `severity = 3`: if `arr_delay` < 75th percentile
* `severity = 4`: if `arr_delay` >= 75th percentile

```r
flights %>%
  filter(arr_delay > 0) %>%
  select(carrier, tailnum, arr_delay) %>%
  mutate(severity = case_when(
    ______ ~ 1,
    ______ ~ 2,
    ______ ~ 3,
    ______ ~ 4
  ))
```

.center.bold[Hint: use `quantile(x, perc_value)`]

]

--

.pull-right[

```r
flights %>%
  filter(arr_delay > 0) %>%
  select(carrier, tailnum, arr_delay) %>%
  mutate(severity = case_when(
    arr_delay < quantile(arr_delay, .25) ~ 1,
    arr_delay < quantile(arr_delay, .50) ~ 2,
    arr_delay < quantile(arr_delay, .75) ~ 3,
    TRUE ~ 4
  ))
## # A tibble: 133,004 x 4
##    carrier tailnum arr_delay severity
##    <chr>   <chr>       <dbl>    <dbl>
##  1 UA      N14228         11        2
##  2 UA      N24211         20        2
##  3 AA      N619AA         33        3
##  4 UA      N39463         12        2
##  5 B6      N516JB         19        2
##  6 AA      N3ALAA          8        2
##  7 UA      N29129          7        1
##  8 AA      N3DUAA         31        3
##  9 MQ      N542MQ         12        2
## 10 MQ      N730MQ         16        2
## # … with 132,994 more rows
```

]

---

# Workflow

<br>

.font120[
1. ______: Organizes each data analysis into its own project

2. ______: Combines text and code to create efficient and reproducible analytic deliverables (e.g., reports, presentations)
]

---

# Workflow

<br>

.font120[
1. .bold.blue[R Projects]: Organizes each data analysis into its own project

2. .bold.blue[R Markdown]: Combines text and code to create efficient and reproducible analytic deliverables (e.g., reports, presentations)
]

---

# Questions before moving on?

<br>

<img src="images/questions.png" width="450" height="450" style="display: block; margin: auto;" />
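
---

# Aside: `all_vars()` vs. `any_vars()`

The two `filter_*()` helpers reviewed earlier are easy to mix up, so here is a minimal sketch of how they differ. The `toy` tibble and its values are made up for illustration (they are not from __nycflights13__), and the example assumes a dplyr version that still provides the scoped verbs.

```r
library(dplyr)

# Hypothetical toy data: two "delay" columns with some missing values
toy <- tibble(
  id        = 1:4,
  dep_delay = c(2, NA, 5, NA),
  arr_delay = c(1, NA, NA, 7)
)

# all_vars(): keep rows where EVERY selected column is non-missing
complete_rows <- toy %>%
  filter_at(vars(contains("delay")), all_vars(!is.na(.)))

# any_vars(): keep rows where AT LEAST ONE selected column is non-missing
partial_rows <- toy %>%
  filter_at(vars(contains("delay")), any_vars(!is.na(.)))

complete_rows$id  # only row 1 has both delays recorded
partial_rows$id   # rows 1, 3, and 4 have at least one delay recorded
```

Swapping one helper for the other changes which rows survive, so it is worth checking the row count after filtering.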