- R can work with a variety of data *types*
- As an analyst, you should become familiar with the following types of data:

Data Type | Description |
---|---|
Numbers | integer (e.g. 1, 2, 3…), double (e.g. 1.5, 3.66) |
Character strings | "r", "I attend UC", etc. |
Regular expressions | patterns within text strings |
Factors | nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75) |
Dates | calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc. |
Logical | `TRUE`, `FALSE`, `any`, `all` |

Numeric data primarily comes in two forms: integer and double (double-precision floating point)

    # create a vector of double-precision values
    dbl_var <- c(1, 2.5, 4.5)
    class(dbl_var)
    ## [1] "numeric"

    # placing an L after the values creates a vector of integers
    int_var <- c(1L, 6L, 10L)
    class(int_var)
    ## [1] "integer"

We can coerce integers to doubles and vice versa with `as.double()` and `as.integer()`

    as.integer(dbl_var)
    ## [1] 1 2 4

    int_to_dbl <- as.double(int_var)
    class(int_to_dbl)
    ## [1] "numeric"

    # combining double and integer automatically coerces to the more general type (double)
    c(dbl_var, int_var)
    ## [1]  1.0  2.5  4.5  1.0  6.0 10.0
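One detail worth checking when coercing: `as.integer()` truncates toward zero rather than rounding. The short sketch below uses illustrative values (`vals` is not part of the original example) and also calls `typeof()`, which reveals the storage type that `class()` reports as `"numeric"`.

```r
# illustrative values, not part of the original example
vals <- c(-2.7, -0.5, 0.5, 2.7)

# as.integer() drops the fractional part (truncates toward zero)
truncated <- as.integer(vals)
truncated
## [1] -2  0  0  2

# class() reports "numeric" for doubles; typeof() shows the storage type
typeof(vals)
## [1] "double"
typeof(truncated)
## [1] "integer"

# is.integer() tests the storage type, not whether the value is whole
is.integer(2)
## [1] FALSE
is.integer(2L)
## [1] TRUE
```

If rounding is what you want, apply `round()` before `as.integer()`.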

You've already seen the comparison operators `==`, `!=`, `<`, `<=`, `>`, `>=`, which return logical values

    x <- c(4, 4, 9, 12)
    y <- c(4, 4, 9, 12.00000008)
    x == y
    ## [1]  TRUE  TRUE  TRUE FALSE

We can also test for **exact equality** with `identical()` and **near equality** with `all.equal()`

    z <- c(4, 4, 9, 12)
    identical(x, y)
    ## [1] FALSE
    identical(x, z)
    ## [1] TRUE
    all.equal(x, y)
    ## [1] TRUE
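The difference between exact and near equality matters most with floating-point arithmetic. The sketch below is an illustration with made-up variable names (`sum_val`, `target`); it also shows that `all.equal()` returns a descriptive string rather than `FALSE` when values differ by more than its default tolerance (about 1.5e-8), so it is usually wrapped in `isTRUE()` when a single logical is needed.

```r
# decimal fractions such as 0.1 cannot be stored exactly in binary floating point
sum_val <- 0.1 + 0.2
target <- 0.3

sum_val == target
## [1] FALSE
identical(sum_val, target)
## [1] FALSE

# all.equal() returns TRUE or a character description, so wrap it in isTRUE()
near_equal <- isTRUE(all.equal(sum_val, target))
near_equal
## [1] TRUE

# with a larger difference all.equal() returns a message, which isTRUE() turns into FALSE
far_apart <- isTRUE(all.equal(1, 1.1))
far_apart
## [1] FALSE
```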

We can also round numbers in multiple ways:

    x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)

    # round to the nearest integer
    round(x)
    ## [1] 1 1 2 2 2 3

    # round up
    ceiling(x)
    ## [1] 1 2 2 3 3 3

    # round down
    floor(x)
    ## [1] 1 1 1 2 2 2

    # round to a specified decimal
    round(x, digits = 1)
    ## [1] 1.0 1.4 1.7 2.0 2.4 2.8
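A few related behaviours are easy to overlook. The sketch below uses illustrative values (`v` is not from the original example) to show `trunc()`, the round-half-to-even rule that `round()` applies to exact halves, and `signif()` for significant digits.

```r
# illustrative values, not from the original example
v <- c(-1.7, 0.5, 1.5, 2.5)

# trunc() drops the fraction (toward zero), unlike floor() for negative values
trunc(v)
## [1] -1  0  1  2
floor(v)
## [1] -2  0  1  2

# round() sends exact halves to the nearest even integer
round(c(0.5, 1.5, 2.5))
## [1] 0 2 2

# signif() keeps a number of significant digits rather than decimal places
signif(1234.567, digits = 2)
## [1] 1200
```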

Import the `numbers-your-turn.csv` file in the data folder

1. Are the vectors `x` & `y` equal? Exactly or approximately equal?
2. Are the vectors `y` & `z` equal? Exactly or approximately equal?
3. Round the `x` & `y` values to four decimal places
4. Are these vectors equal now?

    # import the numbers-your-turn.csv file in the data folder
    df <- read.csv("data/numbers-your-turn.csv")

    # 1. Are the vectors x & y equal? Exactly or approximately equal?
    identical(df$x, df$y)
    ## [1] FALSE
    all.equal(df$x, df$y)
    ## [1] "Mean relative difference: 1.041407e-07"

    # 2. Are the vectors y & z equal? Exactly or approximately equal?
    identical(df$y, df$z)
    ## [1] FALSE
    all.equal(df$y, df$z)
    ## [1] TRUE

    # 3. Round x & y values to four decimal places
    x <- round(df$x, digits = 4)
    y <- round(df$y, digits = 4)

    # 4. Are these vectors equal now?
    identical(x, y)
    ## [1] TRUE
    all.equal(x, y)
    ## [1] TRUE

Create character strings using `""`

    a <- "learning to create"
    b <- "character strings"

Combine character strings with `c()`, `paste()`, or `paste0()`

    # create a vector containing two elements - a and b
    c(a, b)
    ## [1] "learning to create" "character strings"

    # create a vector containing one element - a and b combined
    paste(a, b)
    ## [1] "learning to create character strings"

    # paste multiple strings
    paste("I", "love", "R")
    ## [1] "I love R"

    # change the separator
    paste("I", "love", "R", sep = "-")
    ## [1] "I-love-R"

    # paste with no separator between strings
    paste0("I", "love", "R")
    ## [1] "IloveR"
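`paste()` is also vectorized, which the single-string examples do not show. The following sketch uses a made-up vector (`langs`) to illustrate recycling across a vector and the `collapse` argument, which joins a vector's elements into one string.

```r
# illustrative vector, not from the original example
langs <- c("R", "SQL", "Python")

# paste() recycles shorter arguments across the longer vector
labelled <- paste("I use", langs)
labelled
## [1] "I use R"      "I use SQL"    "I use Python"

# collapse joins the elements of a vector into a single string
joined <- paste(langs, collapse = ", ")
joined
## [1] "R, SQL, Python"

# paste0() with a numeric vector builds sequential names
ids <- paste0("id_", 1:3)
ids
## [1] "id_1" "id_2" "id_3"
```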

Use `class()`, `mode()`, and/or `is.character()` to assess the data type

    a <- "Life of"
    b <- pi

    class(a)
    ## [1] "character"
    mode(a)
    ## [1] "character"
    is.character(pi)
    ## [1] FALSE

Use `as.character()` to convert a non-character value to a character

    as.character(pi)
    ## [1] "3.14159265358979"

Combining character and non-character values in one vector coerces all inputs to character

    c(a, b)
    ## [1] "Life of"          "3.14159265358979"
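Coercion can be reversed only when the strings look like numbers. As a small illustration (the vector `mixed` is made up for this sketch), `as.numeric()` recovers the numeric values and turns everything else into `NA`, with a warning that is silenced here.

```r
# illustrative vector mixing numeric-looking and non-numeric strings
mixed <- c("3.14", "42", "Life of")

# as.numeric() reverses the coercion where it can; other strings become NA
# (R also emits a warning, silenced here with suppressWarnings())
nums <- suppressWarnings(as.numeric(mixed))
nums
## [1]  3.14 42.00    NA

is.na(nums)
## [1] FALSE FALSE  TRUE
```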

Use `length()` to count the number of elements (individual character strings) in a vector

    length("How many elements are in this string?")
    ## [1] 1
    length(c("How", "many", "elements", "are", "in", "this", "string?"))
    ## [1] 7

Use `nchar()` to count the number of characters in each element

    nchar("How many characters are in this string?")
    ## [1] 39
    nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
    ## [1]  3  4 10  3  2  4  7

Character data is becoming a key form of data that organizations increasingly leverage for business analytics

- Character manipulation (regex and text analysis) is a rich field, far more complex than we have time for
- More opportunities to learn:
  - Data Wrangling with R class
  - Dealing with Characters tutorial
  - Dealing with Regular Expressions tutorial

Key words: **finite options** and **levels**

- nominal variables
  - male, female
  - brunette, blonde, red, black
  - Hispanic, Caucasian, Asian, African
- ordinal variables
  - slow, medium, fast
  - freshman, sophomore, junior, senior
- interval variables
  - $1-100, $101-200, $201-300
  - 0-10, 11-20, 21-30

Create nominal factors with `factor()`

    gender <- c("male", "female", "female")
    class(gender)
    ## [1] "character"

    gender2 <- factor(gender)
    class(gender2)
    ## [1] "factor"

    gender2
    ## [1] male   female female
    ## Levels: female male

Set level preferences with the `levels` argument

    factor(gender, levels = c("male", "female"))
    ## [1] male   female female
    ## Levels: male female
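It can help to see what a factor stores internally. The sketch below (names such as `gender_f` and `with_other` are illustrative) shows the integer codes behind the labels and what happens to a value that is not among the declared levels.

```r
gender <- c("male", "female", "female")
gender_f <- factor(gender, levels = c("male", "female"))

# a factor stores integer codes that index into its levels
as.integer(gender_f)
## [1] 1 2 2
nlevels(gender_f)
## [1] 2

# values outside the declared levels become NA
with_other <- factor(c("male", "other"), levels = c("male", "female"))
with_other
## [1] male <NA>
## Levels: male female
```

Because the codes follow the level order, changing `levels` changes the codes but not the labels.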

Create ordinal/interval factors with `ordered()`; set level preferences with the `levels` argument

    age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")
    class(age.range)
    ## [1] "character"

    # turn age.range into an ordered factor - levels default to sorted (alphabetical) order
    age.range2 <- ordered(age.range)
    class(age.range2)
    ## [1] "ordered" "factor"

    age.range2
    ## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
    ## [7] Under 18
    ## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18

Set level preferences with the `levels` argument

    ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
    ## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
    ## [7] Under 18
    ## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above
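The practical payoff of an ordered factor is that comparisons and sorting follow the declared level order instead of alphabetical order. A minimal sketch, using an illustrative `speed` vector that is not part of the original data:

```r
# illustrative data, not from the original example
speed <- ordered(c("fast", "slow", "medium", "slow"),
                 levels = c("slow", "medium", "fast"))

# comparison operators respect the declared level order
speed > "slow"
## [1]  TRUE FALSE  TRUE FALSE

# max() and sort() also follow the level order
as.character(max(speed))
## [1] "fast"
as.character(sort(speed))
## [1] "slow"   "slow"   "medium" "fast"
```

The same comparison on an unordered factor is not meaningful; R returns `NA` with a warning.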

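If you want to know the levels that exist in your factor variable, use `levels()`

    # stringsAsFactors = TRUE reads text columns as factors (this was the default before R 4.0)
    facebook <- read.delim("data/facebook.tsv", stringsAsFactors = TRUE)
    levels(facebook$gender)
    ## [1] "female" "male"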

We can use the `table()` function to quickly assess the counts of each level

    table(facebook$gender)
    ##
    ## female   male
    ##  40254  58574
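Counts are often easier to compare as shares. As a small sketch on a made-up factor (not the `facebook` data), `prop.table()` converts the result of `table()` into proportions:

```r
# illustrative vector standing in for a factor column
gender <- factor(c("male", "female", "male", "male"))

counts <- table(gender)
counts[["male"]]
## [1] 3

# prop.table() converts counts to proportions that sum to 1
shares <- prop.table(counts)
shares[["male"]]
## [1] 0.75
shares[["female"]]
## [1] 0.25
```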

Import the `reddit.csv` file in the data folder

1. What are the levels for the `income.range` variable?
2. Properly order the levels for `income.range`.
3. What are the counts for each level?

    # import the reddit.csv file in the data folder
    # stringsAsFactors = TRUE reads text columns as factors (this was the default before R 4.0)
    reddit <- read.csv("data/reddit.csv", stringsAsFactors = TRUE)

    # 1. What are the levels for the `income.range` variable?
    levels(reddit$income.range)
    ## [1] "$100,000 - $149,999" "$150,000 or more"    "$20,000 - $29,999"
    ## [4] "$30,000 - $39,999"   "$40,000 - $49,999"   "$50,000 - $69,999"
    ## [7] "$70,000 - $99,999"   "Under $20,000"

    # 2. Properly order the levels for income.range.
    reddit$income.range <- ordered(reddit$income.range,
                                   levels = c("Under $20,000", "$20,000 - $29,999",
                                              "$30,000 - $39,999", "$40,000 - $49,999",
                                              "$50,000 - $69,999", "$70,000 - $99,999",
                                              "$100,000 - $149,999", "$150,000 or more"))

    # 3. What are the counts for each level?
    table(reddit$income.range)
    ##
    ##       Under $20,000   $20,000 - $29,999   $30,000 - $39,999
    ##                7892                3206                2904
    ##   $40,000 - $49,999   $50,000 - $69,999   $70,000 - $99,999
    ##                2686                4133                4101
    ## $100,000 - $149,999    $150,000 or more
    ##                3522                2695

- The `lubridate` package makes working with dates extremely easy
- To create a date variable we simply need to know the order of the year, month, and day elements

Function | Order of elements in date-time |
---|---|
`ymd()` | year, month, day |
`ydm()` | year, day, month |
`mdy()` | month, day, year |
`dmy()` | day, month, year |
`hm()` | hour, minute |
`hms()` | hour, minute, second |
`ymd_hms()` | year, month, day, hour, minute, second |

    dates <- c("2015-07-01", "2015-08-01", "2015-09-01")
    class(dates)
    ## [1] "character"

Convert this character vector to date format with `lubridate`'s `ymd()` function

    # install.packages("lubridate")  # run this line if you have not yet installed lubridate
    library(lubridate)

    dates2 <- ymd(dates)
    class(dates2)
    ## [1] "Date"

    dates2
    ## [1] "2015-07-01" "2015-08-01" "2015-09-01"
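For comparison, the same idea can be expressed in base R without `lubridate`: instead of choosing a function by element order, you describe the order with a `format` string. This is only a sketch of the base R alternative, with made-up variable names, not a replacement for the `lubridate` approach used here.

```r
# base R alternative: describe the element order with a format string
us_style <- as.Date("08/06/2016", format = "%m/%d/%Y")
eu_style <- as.Date("06.08.2016", format = "%d.%m.%Y")
iso_style <- as.Date("2016-08-06")

us_style
## [1] "2016-08-06"

# all three strings describe the same calendar day
us_style == eu_style
## [1] TRUE
eu_style == iso_style
## [1] TRUE

# strings that do not match the format become NA
as.Date("2016-08-06", format = "%m/%d/%Y")
## [1] NA
```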

- Sometimes your date data are collected in separate elements
- To convert these separate data into one date object, use the `ISOdate()` function:

      yr <- c("2012", "2013", "2014", "2015")
      mo <- c("1", "5", "7", "2")
      day <- c("02", "22", "15", "28")

      # ISOdate converts to a POSIXct object
      full_date <- ISOdate(year = yr, month = mo, day = day)
      full_date
      ## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
      ## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"

We can truncate the unused time data by converting with `as.Date()`

    as.Date(full_date)
    ## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
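When date components come from separate columns, some combinations may not form a real date. A brief sketch with made-up values (`checked` is an illustrative name) shows that `ISOdate()` yields `NA` for an impossible day, which makes such rows easy to find with `is.na()`.

```r
# February 30th does not exist, so the second element becomes NA
checked <- as.Date(ISOdate(year = c(2015, 2015), month = c(2, 2), day = c(28, 30)))
checked
## [1] "2015-02-28" NA

is.na(checked)
## [1] FALSE  TRUE
```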

We can also easily extract components of dates using `lubridate`

Function | Date-time element to extract |
---|---|
`year()` | Year |
`month()` | Month |
`week()` | Week |
`yday()` | Day of year |
`mday()` | Day of month |
`wday()` | Day of week |
`hour()` | Hour |
`minute()` | Minute |
`second()` | Second |
`tz()` | Time zone |

Extract time components:

    year(full_date)
    ## [1] 2012 2013 2014 2015

    week(full_date)
    ## [1]  1 21 28  9

    wday(full_date, label = TRUE)
    ## [1] Mon  Wed  Tues Sat
    ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

Change date-time components by assigning new values to the corresponding extractor function

    as.Date(full_date)
    ## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

    year(full_date) <- c(2014, 2015, 2015, 2016)
    as.Date(full_date)
    ## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"

- We can also do regular statistical summaries of date objects
- Illustrate with the `lakers` data set that comes with the `lubridate` package

      dates <- ymd(lakers$date)

      min(dates)
      ## [1] "2008-10-28"
      max(dates)
      ## [1] "2009-04-14"
      mean(dates)
      ## [1] "2009-01-22"
      median(dates)
      ## [1] "2009-01-21"

      summary(dates)
      ##         Min.      1st Qu.       Median         Mean      3rd Qu.
      ## "2008-10-28" "2008-12-10" "2009-01-21" "2009-01-22" "2009-03-09"
      ##         Max.
      ## "2009-04-14"
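Summaries work because dates are stored as numbers of days, which also makes date arithmetic possible. The sketch below uses only base R and three illustrative dates rather than the `lakers` data.

```r
dates <- as.Date(c("2015-07-01", "2015-08-01", "2015-09-01"))

# subtracting dates gives a difftime; as.numeric() extracts the day count
span <- as.numeric(max(dates) - min(dates))
span
## [1] 62

# diff() gives the gaps between consecutive dates
as.numeric(diff(dates))
## [1] 31 31

# adding a number shifts a date by that many days
dates[1] + 30
## [1] "2015-07-31"
```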

Import the `facebook.tsv` file in the data folder

1. Create a new date variable that combines the `dob_day`, `dob_month`, & `dob_year` variables.
2. What is the `min`, `max`, `mean`, and `median` date of birth in this data frame?

**NOTE:** If you save the new variable as `facebook$dob <- _____________` it will add this new variable to the facebook data frame

    # Import the `facebook.tsv` file in the data folder
    facebook <- read.delim("data/facebook.tsv")

    # 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
    facebook$dob <- as.Date(ISOdate(year = facebook$dob_year,
                                    month = facebook$dob_month,
                                    day = facebook$dob_day))

    # 2. What is the min, max, mean, and median date of birth in this data frame?
    summary(facebook$dob)
    ##         Min.      1st Qu.       Median         Mean      3rd Qu.
    ## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01"
    ##         Max.
    ## "2000-10-27"

We already saw how we can get `TRUE`/`FALSE` responses from comparing elements

    x <- c(4, 4, 9, 12, 2, 2, 10)
    y <- c(4, 5, 9, 13, 2, 1, 10)
    x == y
    ## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This is just a vector containing logical elements

    z <- x == y
    class(z)
    ## [1] "logical"

We can assess if any or all the elements are `TRUE`

    any(z)
    ## [1] TRUE
    all(z)
    ## [1] FALSE
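Beyond `any()` and `all()`, a logical vector can be counted, averaged, located, and used for subsetting, because `TRUE` behaves as 1 and `FALSE` as 0 in arithmetic. A short sketch that repeats the `x`, `y`, and `z` definitions so it runs on its own:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(z)
## [1] 4
mean(z)
## [1] 0.5714286

# which() returns the positions of the TRUE elements
which(z)
## [1] 1 3 5 7

# a logical vector can subset another vector
x[!z]
## [1]  4 12  2
```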

Operator/Function | Description |
---|---|
`as.double()`, `as.integer()` | coerce to double floating point or integer numbers |
`identical()`, `all.equal()` | test for exact and near equality |
`round()`, `ceiling()`, `floor()` | round numbers |
`c()`, `paste()`, `paste0()` | combine character strings |
`as.character()` | coerce non-character to a character |
`nchar()` | count the number of characters in each element |
`factor()`, `ordered()` | create or coerce to factor variables |
`levels()` | assess the levels of a factor |
`table()` | get the counts of each level |

Operator/Function | Description |
---|---|
`ymd()`, `mdy()`, `hm()`, etc. | `lubridate`: create or convert to date-time variable |
`ISOdate()` | create date variable by merging separate date components |
`as.Date()` | truncate date-time variable to just date variable |
`year()`, `week()`, etc. | `lubridate`: extract individual date components |
`any()`, `all()` | assess if any or all elements are `TRUE` |
