Day 2 Intro

class: title-slide 
<a href="https://github.com/uc-r/Intermediate-R/"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png" alt="Fork me on GitHub"></a>

# Day 2: Intermediate

## .font70[.italic['Success is stumbling from failure to failure with no loss of enthusiasm'] - Winston Churchill]

### Brad Boehmke
### Jan 31 - Feb 1, 2019

---

# Today's schedule

| Topic | Time |
|:------|:------:|
| Review | 9:00-9:30 |
| Iteration with loops | 9:30-10:30 |
| Break | 10:30 - 10:45 |
| Iteration with functional programming | 10:45-12:00 |
| Lunch | 12:00 - 1:00 |
| Writing functions | 1:00-2:30 |
| Break | 2:30-2:45 |
| Case study | 2:45-4:00 |
| Q&A | 4:00-4:30 |

---

---
# Prereqs

```r
library(tidyverse)
library(nycflights13)
```

---
# Transforming data

What are the normal __dplyr__ functions to perform the following:

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: pick observations based on certain conditions

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: pick variables of interest

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: compute statistical summaries

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: perform operations at different levels of your data

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: reorder data

* &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: create new variables

---

# Transforming data

What are the normal __dplyr__ functions to perform the following:

* .blue.bold[`filter`]: pick observations based on certain conditions

* .blue.bold[`select`]: pick variables of interest

* .blue.bold[`summarize`]: compute statistical summaries

* .blue.bold[`group_by`]: perform operations at different levels of your data

* .blue.bold[`arrange`]: reorder data

* .blue.bold[`mutate`]: create new variables

---
# Scoped variable transformations

* dplyr scoped variants:
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: execute function(s) on all variables 
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: on variables that meet a certain condition
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: for pre-specified variables

* argument functions within scoped variants:
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: specify the variables to be executed on
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: specify the functions to be executed

* helper functions for `filter_*()`
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: filter for rows where all variables meet the specified condition
   - &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: filter for rows where at least one variable meets the specified condition

.center.bold.italic.red.font120[Spend 2 minutes with your neighbor(s) and fill in the blanks.]

---
# Scoped variable transformations

* dplyr scoped variants:
   - .blue.bold[`*_all()`]: execute function(s) on all variables 
   - .blue.bold[`*_if()`]: on variables that meet a certain condition
   - .blue.bold[`*_at()`]: for pre-specified variables

* argument functions within scoped variants:
   - .blue.bold[`vars()`]: specify the variables to be executed on
   - .blue.bold[`funs()`]: specify the functions to be executed

* helper functions for `filter_*()`
   - .blue.bold[`all_vars()`]: filter for rows where all variables meet the specified condition
   - .blue.bold[`any_vars()`]: filter for rows where at least one variable meets the specified condition

---

### Challenge #1

Use the proper scoped variant of `summarize()` to redo the following more efficiently.

```r
library(tidyverse)
library(nycflights13)

flights %>%
  group_by(month) %>%
  summarize(
    dep_actual = mean(dep_time, na.rm = TRUE),
    dep_sched  = mean(sched_dep_time, na.rm = TRUE),
    dep_delay  = mean(dep_delay, na.rm = TRUE),
    arr_actual = mean(arr_time, na.rm = TRUE),
    arr_sched  = mean(sched_arr_time, na.rm = TRUE),
    arr_delay  = mean(arr_delay, na.rm = TRUE),
  )
```

]

### Solution

```r
flights %>%
 group_by(month) %>%
 summarize_at(vars(matches("dep_|arr_")), mean, na.rm = TRUE)
## # A tibble: 12 x 7
## month dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1347. 1341. 10.0 1523. 1548.
## 2 2 1348. 1342. 10.8 1522. 1547.
## 3 3 1359. 1354. 13.2 1510. 1546.
## 4 4 1353. 1351. 13.9 1501. 1537.
## 5 5 1351. 1345. 13.0 1503. 1533.
## 6 6 1351. 1346. 20.8 1468. 1527.
## 7 7 1353. 1347. 21.7 1456. 1521.
## 8 8 1350. 1345. 12.6 1495. 1519.
## 9 9 1334. 1335. 6.72 1504. 1534.
## 10 10 1340. 1336. 6.24 1520. 1539.
## 11 11 1344. 1342. 5.44 1523. 1545.
## 12 12 1357. 1345. 16.6 1505. 1543.
## # … with 1 more variable: arr_delay <dbl>
```

]

---

### Challenge #2

* Fill in the `mutate_if()` function to standardize the numeric variables.

* To standardize, use this function `(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)`

```r
flights %>%
  select(carrier, matches("dep_|arr_")) %>%
  mutate_if(
    .predicate = _______, 
    .funs = ________,
    )
```

]

### Solution

```r
flights %>%
 select(carrier, matches("dep_|arr_")) %>%
 mutate_if(
 .predicate = is.numeric, 
 .funs = funs((. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE))
 )
## # A tibble: 336,776 x 7
## carrier dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 UA -1.70 -1.77 -0.265 -1.26 -1.44
## 2 UA -1.67 -1.74 -0.215 -1.22 -1.42
## 3 AA -1.65 -1.72 -0.265 -1.09 -1.38
## 4 B6 -1.65 -1.71 -0.339 -0.934 -1.03
## 5 DL -1.63 -1.59 -0.464 -1.29 -1.41
## 6 UA -1.63 -1.68 -0.414 -1.43 -1.63
## 7 B6 -1.63 -1.59 -0.439 -1.10 -1.37
## 8 EV -1.62 -1.59 -0.389 -1.49 -1.64
## 9 B6 -1.62 -1.59 -0.389 -1.25 -1.39
## 10 AA -1.62 -1.59 -0.364 -1.40 -1.59
## # … with 336,766 more rows, and 1 more variable: arr_delay <dbl>
```

]

---

### Challenge #3

Complete the following to filter out any observation where a "delay" variable contains an `NA`.

```r
flights %>% 
  filter_at(vars(contains("delay")), _____(!is.na(.)))
```

]

### Solution

```r
flights %>% 
 filter_at(vars(contains("delay")), any_vars(!is.na(.)))
## # A tibble: 328,521 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 328,511 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
```

]

---

# Control statements

.center.bold.italic.red.font120[Spend 2 minutes with your neighbor(s) and fill in the blanks.]

---

# Control statements

---

### Challenge #1

Fill in the following code chunk so that:

- if month has value .blue[1-9] the file name printed out will be .blue[`"data/month-0X.csv"`]
- if month has value .blue[10-12] the file name printed out will be .blue[`"data/month-1X.csv"`]
- if month is an .blue[invalid month number] (not 1-12), the result printed out is .blue[`"Invalid month"`]
- test it out for when month equals 6, 10, & 13

]

### Solution

```r
month <- 4

if(month _____) {
  paste0("data/", "Month-0", month, ".csv")
} _____ if(month _____) {
  paste0("data/", "Month-", month, ".csv")
} else {
  print("_____")
}
```

]

---
class: yourturn
# Your Turn!

### Challenge #1

Fill in the following code chunk so that:

]

### Solution

```r
month <- 4

if(month %in% 1:9) {
  paste0("data/", "Month-0", month, ".csv")
} else if(month %in% 10:12) {
  paste0("data/", "Month-", month, ".csv")
} else {
  print("Invalid month")
}
## [1] "data/Month-04.csv"

month <- 13

---
class: yourturn
# Your Turn!

### Challenge #2

Use `ifelse()` or `if_else()` to print .bold[_"greater than or equal"_] or .bold[_"less than"_] for each element of `x`.  Use 0.5 as the threshold.

```r
x <- runif(10)
```

]

### Solution

```r
ifelse(x >= .5, "greather than or equal to", "less than")
##  [1] "greather than or equal to" "greather than or equal to"
##  [3] "less than"                 "greather than or equal to"
##  [5] "greather than or equal to" "less than"                
##  [7] "greather than or equal to" "less than"                
##  [9] "greather than or equal to" "less than"
```

]

---

Fill in the blanks below to assign each flight to a severity rating of 1, 2, 3, or 4 based on the arrival delay (`arr_delay`) variable:

* `severity = 1`: if `arr_delay` < 25th percentile
* `severity = 2`: if `arr_delay` < 50th percentile
* `severity = 3`: if `arr_delay` < 75th percentile
* `severity = 4`: if `arr_delay` >= 75th percentile

```r
flights %>%
  filter(arr_delay > 0) %>%
  select(carrier, tailnum, arr_delay) %>%
  mutate(severity = case_when(
    ______ ~ 1,
    ______ ~ 2,
    ______ ~ 3,
    ______ ~ 4
  ))
```

.center.bold[Hint: use `quantile(x, perc_value)`]

]

```r
flights %>%
 filter(arr_delay > 0) %>%
 select(carrier, tailnum, arr_delay) %>%
 mutate(severity = case_when(
 arr_delay < quantile(arr_delay, .25) ~ 1,
 arr_delay < quantile(arr_delay, .50) ~ 2,
 arr_delay < quantile(arr_delay, .75) ~ 3,
 TRUE ~ 4
 ))
## # A tibble: 133,004 x 4
## carrier tailnum arr_delay severity
## <chr> <chr> <dbl> <dbl>
## 1 UA N14228 11 2
## 2 UA N24211 20 2
## 3 AA N619AA 33 3
## 4 UA N39463 12 2
## 5 B6 N516JB 19 2
## 6 AA N3ALAA 8 2
## 7 UA N29129 7 1
## 8 AA N3DUAA 31 3
## 9 MQ N542MQ 12 2
## 10 MQ N730MQ 16 2
## # … with 132,994 more rows
```

]

---
# Workflow

.font120[
1. &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: Organizes each data analysis into its own project?

2. &lowbar;&lowbar;&lowbar;&lowbar;&lowbar;&lowbar;: Combines text and code to create efficient and reproducible analytic deliverables (i.e. reports, presentations)
]

---
# Workflow

2. .bold.blue[R Markdown]: Combines text and code to create efficient and reproducible analytic deliverables (i.e. reports, presentations)
]

---

# Questions before moving on?