+ - 0:00:00
Notes for current slide
Notes for next slide














Writing Functions "Writing good functions is a lifetime journey." - Hadley Wickham

1 / 53

Overview

  • Why functions are good

  • Anatomy of a function

  • Defining functions

  • Defining your arguments

  • Checking arguments and other conditions

  • Lazy evaluation

  • Lexical scoping

  • Distributing your function(s)

  • Learning more

2 / 53

Why functions are good

  • A function centralizes a common task to a single, abstract method

  • This helps to:

    • increase reusability which...
    • reduces verbosity
    • reduces repetitiveness
    • reduces the chance of making an error
    • reduces instances requiring updates when changes are needed
    • reduces amount of code to test
3 / 53

Why functions are good

  • A function centralizes a common task to a single, abstract method

  • This helps to:

    • increase reusability which...
    • reduces verbosity
    • reduces repetitiveness
    • reduces the chance of making an error
    • reduces instances requiring updates when changes are needed
    • reduces amount of code to test

Consider the following from Hadley Wickham’s book R for Data Science. Where's the error?

df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
3 / 53

Why functions are good

  • A function centralizes a common task to a single, abstract method

  • This helps to:

    • increase reusability which...
    • reduces verbosity
    • reduces repetitiveness
    • reduces the chance of making an error
    • reduces instances requiring updates when changes are needed
    • reduces amount of code to test

Consider the following from Hadley Wickham’s book R for Data Science. Where's the error?

df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
4 / 53

Prereqs

Base R

5 / 53

Prereqs

Base R

5 / 53

Prereqs

Base R


I will use some packages throughout but these are simply for examples. You will be notified but sometimes you just need to learn to base R.

5 / 53

Key ingredients of a function


  1. Name

  2. arguments

  3. body

  4. environment


my_fun <- function(arg1, arg2) {
<< body >>
}
6 / 53

Key ingredients of a function


  1. Name

  2. arguments

  3. body

  4. environment

pv <- function(fv, r, n = 5) {
fv / (1 + r)^n
}
args(pv)
## function (fv, r, n = 5)
## NULL
body(pv)
## {
## fv/(1 + r)^n
## }
environment(pv)
## <environment: R_GlobalEnv>
7 / 53

Key ingredients of a function


  1. Name

  2. arguments

  3. body

  4. environment

pv <- function(fv, r, n = 5) {
fv / (1 + r)^n
}
args(pv)
## function (fv, r, n = 5)
## NULL
body(pv)
## {
## fv/(1 + r)^n
## }
environment(pv)
## <environment: R_GlobalEnv>


Note how the environment is our global environment

7 / 53

Your Turn!

Identify the arguments, body, and environment of:

  1. read.csv

  2. dplyr::add_count

  3. sum

8 / 53

Your Turn!

Identify the arguments, body, and environment of:

  1. read.csv

  2. dplyr::add_count

  3. sum

# #1 read.csv
args(read.csv)
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## NULL
body(read.csv)
## read.table(file = file, header = header, sep = sep, quote = quote,
## dec = dec, fill = fill, comment.char = comment.char, ...)
environment(read.csv)
## <environment: namespace:utils>
8 / 53

Your Turn!

Identify the arguments, body, and environment of:

  1. read.csv

  2. dplyr::add_count

  3. sum

# #2 dplyr::add_count
args(dplyr::add_count)
## function (x, ..., wt = NULL, sort = FALSE)
## NULL
body(dplyr::add_count)
## {
## g <- group_vars(x)
## grouped <- group_by(x, ..., add = TRUE)
## out <- add_tally(grouped, wt = !!enquo(wt), sort = sort)
## grouped_df(out, g)
## }
environment(dplyr::add_count)
## <environment: namespace:dplyr>
9 / 53

Your Turn!

Identify the arguments, body, and environment of:

  1. read.csv

  2. dplyr::add_count

  3. sum

# #3 sum
args(sum)
## function (..., na.rm = FALSE)
## NULL
body(sum) # NULL because its a primitive
## NULL
environment(sum) # NULL because its a primitive
## NULL



sum() is a primitive function

10 / 53

Defining a function

  • We define a function with <- just like we define any other R object

  • Use informative names; strive to use verbs when possible

present_value <- function() {
}
compute_pv <- function() {
}
11 / 53

Defining a function

  • We define a function with <- just like we define any other R object

  • Use informative names; strive to use verbs when possible

  • We can define a function with no arguments; however, this is rarely useful

present_value <- function() {
}
compute_pv <- function() {
}
compute_pv <- function() {
1000 / (1 + 0.05)^10
}
compute_pv()
## [1] 614


12 / 53

Defining a function

  • We define a function with <- just like we define any other R object

  • Use informative names; strive to use verbs when possible

  • We can define a function with no arguments; however, this is rarely useful

present_value <- function() {
}
compute_pv <- function() {
}
compute_pv <- function() {
1000 / (1 + 0.05)^10
}
compute_pv()
## [1] 614


We want a function that does one thing consistently well, yet provide flexibility in the inputs provided.

12 / 53

Defining a function

  • We define a function with <- just like we define any other R object

  • Use informative names; strive to use verbs when possible

  • We can define a function with no arguments; however, this is rarely useful

  • Consequently, most functions have all key ingredients
present_value <- function() {
}
compute_pv <- function() {
}
compute_pv <- function() {
1000 / (1 + 0.05)^10
}
compute_pv <- function(fv, r, n) {
fv / (1 + r)^n
}
compute_pv(fv = 1000, r = .05, n = 10)
## [1] 614
compute_pv(fv = 2000, r = .08, n = 5)
## [1] 1361
13 / 53

Your Turn!

  • Define a function titled ratio that takes arguments x and y and returns their ratio, x / y

  • Call ratio() with arguments 3 and 4

14 / 53

Your Turn!

  • Define a function titled ratio that takes arguments x and y and returns their ratio, x / y

  • Call ratio() with arguments 3 and 4

ratio <- function(x, y) {
x / y
}
ratio(3, 4)
## [1] 0.75
14 / 53

Calling arguments in different ways

Many ways to call arguments:

  • Using argument names
compute_pv(fv = 1000, r = .05, n = 10)
## [1] 613.9133
15 / 53

Calling arguments in different ways

Many ways to call arguments:

  • Using argument names

  • Positional matching

compute_pv(fv = 1000, r = .05, n = 10)
## [1] 613.9133
compute_pv(1000, .05, 10)
## [1] 613.9133
16 / 53

Calling arguments in different ways

Many ways to call arguments:

  • Using argument names

  • Positional matching

  • Must use names if you change order

  • ...otherwise error or incorrect computation will occur

compute_pv(fv = 1000, r = .05, n = 10)
## [1] 613.9133
compute_pv(1000, .05, 10)
## [1] 613.9133
compute_pv(r = .05, fv = 1000, n = 10)
## [1] 613.9133
compute_pv(.05, 1000, 10)
## [1] 4.950274e-32
17 / 53

Calling arguments in different ways

Many ways to call arguments:

  • Using argument names

  • Positional matching

  • Must use names if you change order

  • ...otherwise error or incorrect computation will occur

  • missing arguments results in error

compute_pv(fv = 1000, r = .05, n = 10)
## [1] 613.9133
compute_pv(1000, .05, 10)
## [1] 613.9133
compute_pv(r = .05, fv = 1000, n = 10)
## [1] 613.9133
compute_pv(.05, 1000, 10)
## [1] 4.950274e-32
compute_pv(1000, .05)
## Error in compute_pv(1000, 0.05): argument "n" is missing, with no default
18 / 53

Ordering your arguments

Ordering arguments in your functions is important:

  • pipe (%>%) operator

    • tidyverse pkgs consistently have data as first argument
    • makes for intuitive syntax & easy flow
  • positional matching

    • which args are most vs less important
    • helps dictate where defaults need to go (next slide)
# bad --> rounding to digits is least important parameter
compute_pv <- function(digits, r, n, fv) {
<< body >>
}
# better but could still be improved --> does a user always need to specify digits to round to?
compute_pv <- function(fv, r, n, digits) {
<< body >>
}
19 / 53

Setting (good) defaults

  • As our functions require more inputs...

  • many of them can likely do with defaults

  • Usually, the first 1-3 arguments are not preset but the rest are

  • Good defaults are largely based on

    • methodology best practices (i.e. hyperparameters, removing missing values)
    • industry best practices or current values (i.e. current interest rates)
    • organizational best practices (i.e. golden rules)
compute_pv <- function(fv, r, n, digits = 2) {
round(fv / (1 + r)^n, digits = digits)
}
# by default, will round to 2 digits
compute_pv(1000, .05, 10)
## [1] 613.91
# to adjust, specify by position
compute_pv(1000, .05, 10, 0)
## [1] 614
# or by name
compute_pv(1000, .05, 10, digits = 0)
## [1] 614
20 / 53

Your Turn - Part 1

  • Earlier in these slides you saw the following code duplicated:
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
  • Can you write a function called rescale() that takes argument x and executes this code?

  • Test it on the following vector:

set.seed(123)
test_vector <- runif(20, min = 25, max = 40)
21 / 53

Your Turn - Part 1

  • Earlier in these slides you saw the following code duplicated:
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
  • Can you write a function called rescale() that takes argument x and executes this code?

  • Test it on the following vector:

set.seed(123)
test_vector <- runif(20, min = 25, max = 40)
rescale <- function(x){
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale(test_vector)
## [1] 0.2684 0.8158 0.4011 0.9193 0.9821 0.0038 0.5313 0.9296 0.5568 0.4532
## [11] 1.0000 0.4496 0.6947 0.5800 0.0665 0.9377 0.2230 0.0000 0.3125 0.9975
21 / 53

Your Turn - Part 2

  • Now add an argument to rescale() that allows you to round the output to a specified decimal.

  • Set the default to 2.

22 / 53

Your Turn - Part 2

  • Now add an argument to rescale() that allows you to round the output to a specified decimal.

  • Set the default to 2.

rescale <- function(x, digits = 2){
rng <- range(x, na.rm = TRUE)
scaled <- (x - rng[1]) / (rng[2] - rng[1])
round(scaled, digits = digits)
}
# default
rescale(test_vector)
## [1] 0.27 0.82 0.40 0.92 0.98 0.00 0.53 0.93 0.56 0.45 1.00 0.45 0.69 0.58
## [15] 0.07 0.94 0.22 0.00 0.31 1.00
# 3 digits
rescale(test_vector, digits = 3)
## [1] 0.268 0.816 0.401 0.919 0.982 0.004 0.531 0.930 0.557 0.453 1.000
## [12] 0.450 0.695 0.580 0.067 0.938 0.223 0.000 0.312 0.997
22 / 53

Checking arguments and other conditions

  • We've created a function that allows users to specify their inputs

  • But how do we ensure they provide us with the right kind of inputs?

  • Or what if we want to be able to provide some sort of feedback to the user?

  • There are several ways to signal conditions to function users:

23 / 53

Checking arguments and other conditions

  • We've created a function that allows users to specify their inputs

  • But how do we ensure they provide us with the right kind of inputs?

  • Or what if we want to be able to provide some sort of feedback to the user?

  • There are several ways to signal conditions to function users:

    • stop() & stopifnot(): signal an error (no way for a function to continue and execution must stop)
23 / 53

Checking arguments and other conditions

  • We've created a function that allows users to specify their inputs

  • But how do we ensure they provide us with the right kind of inputs?

  • Or what if we want to be able to provide some sort of feedback to the user?

  • There are several ways to signal conditions to function users:

    • stop() & stopifnot(): signal an error (no way for a function to continue and execution must stop)

    • warning(): signal a warning (something has gone wrong but the function has been able to at least partially recover.)

23 / 53

Checking arguments and other conditions

  • We've created a function that allows users to specify their inputs

  • But how do we ensure they provide us with the right kind of inputs?

  • Or what if we want to be able to provide some sort of feedback to the user?

  • There are several ways to signal conditions to function users:

    • stop() & stopifnot(): signal an error (no way for a function to continue and execution must stop)

    • warning(): signal a warning (something has gone wrong but the function has been able to at least partially recover.)

    • message(): signal an informative message (function works fine but user should be informed of something.)

The most common you will use are stop() & message()

23 / 53

stop() & stopifnot()

  • The stop functions are most commonly used to check for proper inputs

  • but can be used to stop the function procedures because the user's environment is not properly established

  • Weigh the benefits of highly custom, informative error messages (via stop) versus short to the point (via stopifnot)

  • Execution will stop at the first violation

# stop
compute_pv <- function(fv, r, n, digits = 2) {
if(!is.numeric(fv)) {
stop("`fv` must be numeric", call. = FALSE)
}
if(!is.numeric(r)) {
stop("`r` must be numeric", call. = FALSE)
}
if(!is.numeric(n)) {
stop("`n` must be numeric", call. = FALSE)
}
if(!is.numeric(digits)) {
stop("`digits` must be numeric", call. = FALSE)
}
round(fv / (1 + r)^n, digits = digits)
}
compute_pv("1000", ".05", 10)
## Error: `fv` must be numeric
# stopifnot
compute_pv <- function(fv, r, n, digits = 2) {
stopifnot(is.numeric(fv), is.numeric(r), is.numeric(n), is.numeric(digits))
round(fv / (1 + r)^n, digits = digits)
}
compute_pv("1000", ".05", 10)
## Error in compute_pv("1000", ".05", 10): is.numeric(fv) is not TRUE
24 / 53

warning()

  • warnings() are rarely used but...

  • can be useful to signal non-terminating concerns

    • hyperparameter clashes in a grid search
    • an extremely long compute time and the user may want to re-assess
    • deprecated functions
    • NAs, NULLs, Infs exist, which result in NA output
lubridate::mdy(c("2-1-2019", "1995"))
## Warning: 1 failed to parse.
## [1] "2019-02-01" NA
25 / 53

warning()

  • warnings() are rarely used but...

  • can be useful to signal non-terminating concerns

    • hyperparameter clashes in a grid search
    • an extremely long compute time and the user may want to re-assess
    • deprecated functions
    • NAs, NULLs, Infs exist, which result in NA output
lubridate::mdy(c("2-1-2019", "1995"))
## Warning: 1 failed to parse.
## [1] "2019-02-01" NA
present_value <- function(fv, r, n, digits = 2) {
warning("`present_value()` is deprecated. Use `compute_pv()`", call. = FALSE)
stopifnot(is.numeric(fv), is.numeric(r), is.numeric(n), is.numeric(digits))
round(fv / (1 + r)^n, digits = digits)
}
present_value(1000, .05, 10)
## Warning: `present_value()` is deprecated. Use `compute_pv()`
## [1] 613.91
25 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

26 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

    • default argument that may be non-trivial
library(ggplot2)
ggplot(iris, aes(Sepal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

27 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

    • default argument that may be non-trivial
    • how information was interpreted by the function
df <- readr::read_csv("data/Month-01.csv")
## Parsed with column specification:
## cols(
## Account_ID = col_integer(),
## Transaction_Timestamp = col_datetime(format = ""),
## Factor_A = col_integer(),
## Factor_B = col_integer(),
## Factor_C = col_character(),
## Factor_D = col_integer(),
## Factor_E = col_character(),
## Response = col_integer(),
## Transaction_Status = col_character(),
## Month = col_character()
## )
28 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

    • default argument that may be non-trivial
    • how information was interpreted by the function
    • words of encouragement!


29 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

    • default argument that may be non-trivial
    • how information was interpreted by the function
    • words of encouragement!
    • or even industry/business best practices
compute_pv <- function(fv, r, n, digits = 2) {
stopifnot(is.numeric(fv), is.numeric(r), is.numeric(n), is.numeric(digits))
if(r < 0.01 || r > 0.25) {
message("Your interest rate exceeds the normal ranges of 1-25%")
}
round(fv / (1 + r)^n, digits = digits)
}
compute_pv(1000, .26, 10)
## Your interest rate exceeds the normal ranges of 1-25%
## [1] 99.15
30 / 53

message()

  • designed to be informational

  • use them to tell the user that you’ve done something on their behalf

  • may represent:

    • default argument that may be non-trivial
    • how information was interpreted by the function
    • words of encouragement!
    • or even industry/business best practices
  • Always provide an option to suppress messages

compute_pv <- function(fv, r, n, digits = 2, quiet = FALSE) {
stopifnot(is.numeric(fv), is.numeric(r), is.numeric(n), is.numeric(digits))
if(!quiet) {
if(r < 0.01 || r > 0.25) {
message("Your interest rate exceeds the normal ranges of 1-25%")
}
# other messages here...
}
round(fv / (1 + r)^n, digits = digits)
}
# default provides message
compute_pv(1000, .26, 10)
## Your interest rate exceeds the normal ranges of 1-25%
## [1] 99.15
# user can silence
compute_pv(1000, .26, 10, quiet = TRUE)
## [1] 99.15
31 / 53

Your Turn!

Let's go back to our rescale() function:

rescale <- function(x, digits = 2){
rng <- range(x, na.rm = TRUE)
scaled <- (x - rng[1]) / (rng[2] - rng[1])
round(scaled, digits = digits)
}
32 / 53

Your Turn!

Let's go back to our rescale() function:

rescale <- function(x, digits = 2){
rng <- range(x, na.rm = TRUE)
scaled <- (x - rng[1]) / (rng[2] - rng[1])
round(scaled, digits = digits)
}
  • Move the na.rm = TRUE option to the function arguments so that the user can control whether or not to remove missing values.
  • Add a warning() that tells the user NAs are present (if they use na.rm = FALSE)
  • Add a message() that tells the user how many NAs were removed
  • add a missing value to your test_vector and test out your revised function
test_vector <- c(NA, test_vector)

Your results should look something like this:

# results in an error
rescale(letters)
## Error: `x` must be a numeric atomic vector
# results in warning
rescale(test_vector, na.rm = FALSE)
## Warning: There are 1 NAs. Remove them with `na.rm = TRUE`
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# results in message
rescale(test_vector)
## 1 NAs were removed.
## [1] 0.27 0.82 0.40 0.92 0.98 0.00 0.53 0.93 0.56 0.45 1.00 0.45 0.69 0.58
## [15] 0.07 0.94 0.22 0.00 0.31 1.00
33 / 53

Your Turn!

Let's go back to our rescale() function:

rescale <- function(x, digits = 2){
rng <- range(x, na.rm = TRUE)
scaled <- (x - rng[1]) / (rng[2] - rng[1])
round(scaled, digits = digits)
}
  • Move the na.rm = TRUE option to the function arguments so that the user can control whether or not to remove missing values.
  • Add a warning() that tells the user NAs are present (if they use na.rm = FALSE)
  • Add a message() that tells the user how many NAs were removed
  • add a missing value to your test_vector and test out your revised function
test_vector <- c(NA, test_vector)
rescale <- function(x, digits = 2, na.rm = TRUE, quiet = FALSE){
# make sure x an numeric atomic vector
if(!is.atomic(x) || !is.numeric(x)) {
stop("`x` must be a numeric atomic vector", call. = FALSE)
}
# compute number of missing values
missing_n <- sum(is.na(x))
# warning if na.rm = FALSE and NAs are present
if(!na.rm) {
warning(paste(
"There are",
missing_n,
"NAs. Remove them with `na.rm = TRUE`"),
call. = FALSE)
}
# message to report how many NAs were removed
if(na.rm) {
x <- x[!is.na(x)]
if(!quiet) {
message(paste(missing_n, "NAs were removed."))
}
}
# compute scaling
rng <- range(x)
scaled <- (x - rng[1]) / (rng[2] - rng[1])
round(scaled, digits = digits)
}
# results in an error
rescale(letters)
## Error: `x` must be a numeric atomic vector
# results in warning
rescale(test_vector, na.rm = FALSE)
## Warning: There are 1 NAs. Remove them with `na.rm = TRUE`
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# results in message
rescale(test_vector)
## 1 NAs were removed.
## [1] 0.27 0.82 0.40 0.92 0.98 0.00 0.53 0.93 0.56 0.45 1.00 0.45 0.69 0.58
## [15] 0.07 0.94 0.22 0.00 0.31 1.00
34 / 53

Lazy evaluation

  • In R, function arguments are lazily evaluated: only evaluated if accessed
h01 <- function(x) {
10
}
h01(x = stop("This is an error!"))
## [1] 10
35 / 53

Lazy evaluation

  • In R, function arguments are lazily evaluated: only evaluated if accessed

  • This provides you with lots of flexibility in when and how to evaluate inputs

h02 <- function(x, y, z) {
if(missing(z)) {
sum(x, y)
} else {
sum(x, y, z)
}
}
# z does not exist
h02(x = 1, y = 2)
## [1] 3
# z does exist
h02(x = 1, y = 2, z = 3)
## [1] 6
36 / 53

Lazy evaluation

  • In R, function arguments are lazily evaluated: only evaluated if accessed

  • This provides you with lots of flexibility in when and how to evaluate inputs

  • You need to be careful with how you use this, as it even allows you to use arguments outside of your function environment

h03 <- function(x, y) {
if(exists("z")) {
x + y + z
} else {
x + y
}
}
# z does not exist
h03(x = 1, y = 2)
## [1] 3
# z exists in global environment
z <- 3
h03(x = 1, y = 2)
## [1] 6
37 / 53

Lazy evaluation

  • In R, function arguments are lazily evaluated: only evaluated if accessed

  • This provides you with lots of flexibility in when and how to evaluate inputs

  • You need to be careful with how you use this, as it even allows you to use arguments outside of your function environment

  • One of the more common applications is to use NULL to indicate that a variable is not required but can be used if supplied

compute_pv <- function(fv, r, n, multiplier = NULL) {
# compute present value
pv <- round(fv / (1 + r)^n)
# apply multiplier if its supplied
if(is.null(multiplier)) {
pv
} else {
pv * multiplier
}
}
# w/o multiplier
compute_pv(1000, .05, 10)
## [1] 614
# with multiplier
compute_pv(1000, .05, 10, multiplier = 1.25)
## [1] 767.5
38 / 53

Lexical scoping

  • R uses a nested environment structure

  • a function has:

    • its own environment
    • plus one or more parent environments
    • always a fresh start
  • a function will search:

    1. itself
    2. through hierarchical parental environments
  • a function will never:

    • use a parameter from a parent environment when it exists in the current environment
    • search within another function unless it's its parent

39 / 53

Pure functions

  • This is why h03() in the previous example worked

  • However, we now have an impure function

  • Pure function:

    • returns the same result if given the same arguments
    • does not cause any observable side effects
40 / 53

Pure functions

  • This is why h03() in the previous example worked

  • However, we now have an impure function

  • Pure function:

    • returns the same result if given the same arguments
    • does not cause any observable side effects


Constency is our friend!

# h03() changes depending on z's value
# outside of the function call
z <- 4
h03(x = 1, y = 2)
## [1] 7
z <- 5
h03(x = 1, y = 2)
## [1] 8
# if your function is dependent on randomness
# allow users to set seed
get_n_samples <- function(x, n, seed = NULL) {
if (!is.null(seed)) {
set.seed(seed)
}
sample(x, size = n)
}
x <- 1:10
# allows user to always get same results
get_n_samples(x, 5, seed = 123)
## [1] 3 8 4 7 6
get_n_samples(x, 5, seed = 123)
## [1] 3 8 4 7 6
41 / 53

Pure functions

  • This is why h03() in the previous example worked

  • However, we now have an impure function

  • Pure function:

    • returns the same result if given the same arguments
    • does not cause any observable side effects


Don't change things the user doesn't want or expect!

get_n_samples <- function(x, n, seed = NULL) {
if (!is.null(seed)) {
set.seed(seed)
}
samples <- sample(x, size = n)
# overwrite x
x <<- samples
# change options to print 8 decimals
options(digits = 2)
samples
}
x <- seq(.0001, 1, by = 0.001)
get_n_samples(x, 5, seed = 123)
## [1] 0.29 0.79 0.41 0.88 0.94
x
## [1] 0.29 0.79 0.41 0.88 0.94
42 / 53

Pure functions

Take-away:

  • functions should do one thing

  • functions should do that one thing consistently

  • functions should not do anything else

43 / 53

Distributing your function(s)

Two main ways to distribute your functions:

  1. using the source() function to read an R script
    • easy to get started
    • all functions/objects in the sourced .R script will be imported to your global environment
    • does not scale well


source("my-functions/custom-stat-functions.R")
ls()
## [1] "skewness" "std_dev" "std_error" "variance"
purrr::map_dbl(mtcars, skewness)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0.65 -0.19 0.41 0.77 0.28 0.45 0.39 0.26 0.39 0.56 1.12
44 / 53

Distributing your function(s)

Two main ways to distribute your functions:

  1. using the source() function to read an R script

    • easy to get started
    • all functions/objects in the sourced .R script will be imported to your global environment
    • does not scale well
  2. writing a package (package writing will not be covered)

    • once you have a few functions that serve a common use
    • hard to get started
    • scales beautifully


library(gitr)
lsf.str("package:gitr")
## git_access_token : function ()
## git_add : function (files = "all", verbose = TRUE)
## git_clone : function (repo, path = ".", branch = NULL)
## git_commit : function (message, verbose = TRUE)
## git_file_size : function (files = "all", sort = TRUE)
## git_log : function (limit = Inf, detailed = TRUE, author = NULL, stats = FALSE,
## show_diff = FALSE, message = NULL, file = NULL)
## git_pull : function (force = FALSE)
## git_push : function (force = FALSE)
## git_rdone : function (message, files = "all", force = FALSE, verbose = TRUE)
## git_status : function (verbose = FALSE)
## install_internal_pkg : function (package, github_acct = "Analytics-Tools", host = "github.8451.com/api/v3",
## auth_token = NULL, ...)
## launch_github : function (...)
45 / 53

Bells & whistles

We can do many other things with functions such as:

  • add progress bars
import_monthly_data <- function(files, quiet = FALSE) {
if(!quiet) {
pb <- progress_estimated(length(files))
}
special_import <- function(x) {
if (quiet) {
suppressMessages(readr::read_csv(x))
} else {
pb$tick()$print()
suppressMessages(readr::read_csv(x))
}
}
purrr::map_dfr(paste0("data/", files), special_import)
}
import_monthly_data(list.files("data"))
## |=================================================================|100% ~0 s remaining
## # A tibble: 698,159 x 10
## Account_ID Transaction_Timest… Factor_A Factor_B Factor_C Factor_D Factor_E Response
## <int> <dttm> <int> <int> <chr> <int> <chr> <int>
## 1 5 2009-01-08 00:16:41 2 6 VI 20 A 1020
## 2 16 2009-01-20 22:40:08 2 6 VI 20 H 1020
## 3 28 2009-01-19 13:24:55 2 6 VI 21 NULL 1020
## 4 40 2009-01-05 16:10:58 2 6 VI 20 H 1020
## 5 62 2009-01-21 19:13:13 2 6 VI 20 B 1020
## 6 64 2009-01-01 18:53:02 7 6 MC 20 NULL 1020
## 7 69 2009-01-08 00:15:19 2 6 VI 20 H 1020
## 8 69 2009-01-19 09:33:22 2 6 VI 20 H 1020
## 9 70 2009-01-05 12:07:47 2 6 VI 20 B 1020
## 10 79 2009-01-07 19:41:18 7 6 MC 20 NULL 1020
## # … with 698,149 more rows, and 2 more variables: Transaction_Status <chr>, Month <chr>
46 / 53

Bells & whistles

We can do many other things with functions such as:

  • add progress bars

  • take any number of additional arguments with ... to:

    • pass arguments on to internal arguments
plot_strongest_corr <- function(data, response, ...) {
# find strong correlation
vars <- setdiff(names(data), response)
correlations <- sapply(data[, vars], function(x) cor(data[, response], x))
x <- names(which.max(abs(correlations)))
# plot relationship
plot(data[, x], data[, response], xlab = x, ylab = response, ...)
}
plot_strongest_corr(mtcars, "mpg", pch = 19, col = "blue", main = "Some cool title")

47 / 53

Bells & whistles

We can do many other things with functions such as:

  • add progress bars

  • take any number of additional arguments with ... to:

    • pass arguments on to internal arguments
    • or accept unknown number of inputs
select <- function(.data, ...) {
# convert ... to a list and then to a string
arguments <- substitute(...())
vars <- paste(arguments)
# index for
.data[, vars]
}
select(mtcars, mpg, wt)
## mpg wt
## Mazda RX4 21 2.6
## Mazda RX4 Wag 21 2.9
## Datsun 710 23 2.3
## Hornet 4 Drive 21 3.2
## Hornet Sportabout 19 3.4
## Valiant 18 3.5
## Duster 360 14 3.6
## Merc 240D 24 3.2
## Merc 230 23 3.1
## Merc 280 19 3.4
## Merc 280C 18 3.4
## Merc 450SE 16 4.1
## Merc 450SL 17 3.7
## Merc 450SLC 15 3.8
## Cadillac Fleetwood 10 5.2
## Lincoln Continental 10 5.4
## Chrysler Imperial 15 5.3
## Fiat 128 32 2.2
## Honda Civic 30 1.6
## Toyota Corolla 34 1.8
## Toyota Corona 22 2.5
## Dodge Challenger 16 3.5
## AMC Javelin 15 3.4
## Camaro Z28 13 3.8
## Pontiac Firebird 19 3.8
## Fiat X1-9 27 1.9
## Porsche 914-2 26 2.1
## Lotus Europa 30 1.5
## Ford Pantera L 16 3.2
## Ferrari Dino 20 2.8
## Maserati Bora 15 3.6
## Volvo 142E 21 2.8
48 / 53

Bells & whistles

We can do many other things with functions such as:

  • add progress bars

  • take any number of additional arguments with ...

  • improve efficiency with C++

# create our own sum function
sumR <- function(x) {
total <- 0
for (i in seq_along(x)) {
total <- total + x[i]
}
total
}
# use Rcpp to write a sum function in C++
library(Rcpp)
cppFunction('double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
x <- runif(1e3)
bench::mark(
sum(x),
sumC(x),
sumR(x)
)[1:6]
## # A tibble: 3 x 6
## expression min mean median max `itr/sec`
## <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl>
## 1 sum(x) 950ns 1.14µs 968ns 21.39µs 877949.
## 2 sumC(x) 2.12µs 4.42µs 4.54µs 1.09ms 226007.
## 3 sumR(x) 35.82µs 39.34µs 37.27µs 256.51µs 25421.
49 / 53

Bells & whistles

We can do many other things with functions such as:

  • add progress bars

  • take any number of additional arguments with ...

  • improve efficiency with C++


And so much more!

# create our own sum function
sumR <- function(x) {
total <- 0
for (i in seq_along(x)) {
total <- total + x[i]
}
total
}
# use Rcpp to write a sum function in C++
library(Rcpp)
cppFunction('double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
x <- runif(1e3)
bench::mark(
sum(x),
sumC(x),
sumR(x)
)[1:6]
## # A tibble: 3 x 6
## expression min mean median max `itr/sec`
## <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl>
## 1 sum(x) 950ns 1.27µs 995ns 37.5µs 784899.
## 2 sumC(x) 2.12µs 3.67µs 2.51µs 1.24ms 272609.
## 3 sumR(x) 35.63µs 38.98µs 37.22µs 132.34µs 25655.
50 / 53

Time permitting

Practice writing the following functions and test them on the given test vector

# test vector
set.seed(123)
x <- rlnorm(100)
# create a function that computes the variance of x
(1 / (length(x) - 1)) * sum((x - mean(x))^2)
# create a function that computes the standard deviation of x
sqrt((1 / (length(x) - 1)) * sum((x - mean(x))^2))
# create a function that computes the standard error of x
var_x <- (1 / (length(x) - 1)) * sum((x - mean(x))^2)
sqrt(var_x / length(x))
# create a function that computes the skewness of x
n <- length(x)
v <- var(x)
m <- mean(x)
third.moment <- (1 / (n - 2)) * sum((x - m)^3)
third.moment / (var(x)^(3 / 2))
# test vector
set.seed(123)
x <- rlnorm(100)
# variance function output
variance(x)
## [1] 3
# stardard dev output
std_dev(x)
## [1] 1.7
# standard error output
std_error(x)
## [1] 0.17
# skewness output
skewness(x)
## [1] 2.3
51 / 53

Learning more

Advanced R: https://adv-r.hadley.nz/

52 / 53

Questions


53 / 53

Overview

  • Why functions are good

  • Anatomy of a function

  • Defining functions

  • Defining your arguments

  • Checking arguments and other conditions

  • Lazy evaluation

  • Lexical scoping

  • Distributing your function(s)

  • Learning more

2 / 53
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow